I keep a notebook—a physical one, because digital files have a way of disappearing into the ether—titled "AI Claims That Sounded Right But Were Wrong." Near the top of the first page is a quote I heard at a conference in London last year: "AI will eliminate the need for human judgment."
That is not just wrong; it is dangerous. In my twelve years supporting legal teams and investment committees, I have learned that judgment is not a commodity to be automated; it is a process to be refined. Over the last four years, as I’ve transitioned from traditional research to AI-assisted workflows, I’ve stopped looking for the "perfect" model. Instead, I’ve started running what I call "The Triangulation of Doubt."
This approach involves putting five distinct frontier models into the same conversation thread. It is not about saving time—if you are using AI purely to "save time," you are likely producing faster errors. It is about architectural rigor. When you force five different reasoning engines to confront one another in a shared context, you stop asking the AI for an answer and start asking it for a defense.
Beyond the Single-Model Echo Chamber
Most professionals use AI like a search engine: prompt, response, copy-paste. This is a recipe for confirmation bias. If you ask a single model to review a complex M&A contract or a geopolitical risk assessment, it will naturally veer toward the path of least resistance—the probabilistic average of its training data. It will give you a "safe," buzzword-laden answer that sounds intelligent but often lacks foundational structural integrity.
By moving to a multi-model thread where frontier models—such as Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and specialized reasoning models—are forced to share the same context, you break the echo chamber. You are effectively convening an internal committee where each member has a different "wiring" for logic, nuance, and retrieval.
Why Context Sharing is the Linchpin
The magic isn't in the models themselves; it is in the context sharing. When you feed the same underlying evidence base (the source documents, the market data, the legal precedents) into a single, multi-model thread, you are establishing a common frame of reference. You aren't just getting five opinions; you are getting five divergent interpretations of the *same objective reality*.
The Anatomy of "The Triangulation of Doubt"
In practice, this workflow operates like a structured adversarial process. I structure these sessions to force the models to critique one another. I don’t ask, "Is this right?" I ask, "What specific evidence in the provided text contradicts the conclusion reached by the previous model?"
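A minimal sketch of what one adversarial round could look like in code. Everything here is an assumption for illustration: `ModelFn` is a plain callable standing in for whatever model client you use, and `SharedThread` and `run_adversarial_round` are hypothetical names, not part of any vendor's API. The point it shows is that every model receives the same evidence plus every prior turn, and every model after the first gets the critique question rather than the original one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
ModelFn = Callable[[str], str]

CRITIQUE_QUESTION = (
    "What specific evidence in the provided text contradicts "
    "the conclusion reached by the previous model?"
)


@dataclass
class Turn:
    model: str
    text: str


@dataclass
class SharedThread:
    evidence: str  # the common source material every model sees
    turns: List[Turn] = field(default_factory=list)

    def render(self, instruction: str) -> str:
        """Build one prompt holding the evidence, all prior turns, and the ask."""
        history = "\n".join(f"[{t.model}] {t.text}" for t in self.turns)
        return (
            f"EVIDENCE:\n{self.evidence}\n\n"
            f"PRIOR TURNS:\n{history or '(none)'}\n\n"
            f"INSTRUCTION:\n{instruction}"
        )


def run_adversarial_round(
    thread: SharedThread, question: str, models: Dict[str, ModelFn]
) -> SharedThread:
    """First model answers the question; every later model critiques the thread."""
    names = list(models)
    first = names[0]
    thread.turns.append(Turn(first, models[first](thread.render(question))))
    for name in names[1:]:
        reply = models[name](thread.render(CRITIQUE_QUESTION))
        thread.turns.append(Turn(name, reply))
    return thread
```

Because the model clients are ordinary callables, the round can be rehearsed with stub functions before any real model is wired in.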
The Disagreement Engine: Tracking Contradictions
If all five models agree on a complex point, I am more suspicious, not less. Models trained on overlapping data can repeat the same mistake in unison, so consensus in AI is often a sign of a shared, high-probability hallucination or a well-worn trope rather than independent confirmation. Disagreement, however, is a signal. It tells me where the ambiguity lies.
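As a rough illustration of tracking agreement and disagreement, the sketch below groups models by the position they took and flags both outcomes for follow-up. It assumes each model's answer has already been reduced, by hand, to a short label such as "capped" or "uncapped"; the function names and the triage wording are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_positions(positions: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct position to the sorted list of models holding it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for model, position in positions.items():
        groups[position.strip().lower()].append(model)
    return {pos: sorted(names) for pos, names in groups.items()}


def triage(positions: Dict[str, str]) -> str:
    """Unanimity earns extra scrutiny; a split shows where the ambiguity lies."""
    groups = group_positions(positions)
    if len(groups) == 1:
        return "unanimous: verify against the source before trusting"
    return "split: review the evidence each side cites"
```

Note the design choice: neither outcome ends the review. A unanimous result is routed back to the source documents, and a split is routed to the evidence behind each position.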

My protocol for handling this looks like this:
1. The Baseline Draft: A primary model outlines the interpretation.
2. The Adversarial Scrutiny: I prompt the other four models to identify gaps in logic.
3. The Surfacing of Constraints: Each model is forced to cite the specific page or clause of the provided context.
4. The Synthesis of Doubt: I manually reconcile the disagreement, focusing on the evidence that would change my mind.

Decision Intelligence for High-Stakes Work
High-stakes work—whether it’s preparing for a deposition or vetting a series-D investment—requires a higher threshold for "truth." We need to move away from "answer generation" and toward "decision intelligence." This means mapping out the known unknowns.
Below is a breakdown of how different frontier architectures approach a strategic decision within a shared conversation thread:
| Model Archetype | Strategic Value | Common Weakness |
| --- | --- | --- |
| The Detailist (e.g., Claude 3.5 Sonnet) | Exceptional at nuanced reading and legal clause analysis. | Can get lost in the weeds if the prompt isn't tightly scoped. |
| The Broad-Stroker (e.g., GPT-4o) | Excellent at high-level logic and pattern recognition. | Occasional over-confidence in "safe" consensus answers. |
| The Context-Heavyweight (e.g., Gemini 1.5 Pro) | Best for processing massive document sets (100k+ tokens). | Can struggle with subtle logical contradictions. |
| The Logic-Checker (Specialized Reasoning Models) | High accuracy on mathematical and structural dependencies. | Less flexible with ambiguous, non-linear reasoning. |

The Hallucination Detection Mindset
The greatest risk in using frontier models isn't that they are "wrong"—it's that they are *convincingly wrong*. Overconfident AI outputs without citations are the bane of my existence. When I hear someone praise an AI for being "seamless" or "intuitive," I immediately check for citations.
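One way to make the citation check mechanical is sketched below. It assumes you have asked each model to quote the passage it relies on verbatim, which is my assumption rather than a given; the helper names are hypothetical. The function returns any quote that does not actually appear in the source text, ignoring differences in case and whitespace.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(source: str, quotes: List[str]) -> List[str]:
    """Return the quotes that cannot be found in the source document."""
    haystack = normalize(source)
    missing = []
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote is treated as unsupported: it cites nothing.
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing
```

A check like this only confirms that the quoted words exist in the document. Whether the passage supports the conclusion drawn from it is still a human call.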
My workflow incorporates a specific "What would change my mind?" prompt for every significant strategic claim. This forces the model to define the boundary conditions of its own argument. If an AI claims, "Company X is likely to default on its debt," it must also define what data would prove that statement false.
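The boundary-condition requirement can be captured as a small data structure, sketched here under my own assumptions: `Claim`, `is_bounded`, and `change_my_mind_prompt` are hypothetical names, and the prompt wording is illustrative rather than a tested template. A claim with no stated falsifier is simply treated as not ready for review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    statement: str
    # Data that, if observed, would prove the statement false.
    falsifiers: List[str] = field(default_factory=list)

    def is_bounded(self) -> bool:
        """A claim counts as bounded only if it names at least one falsifier."""
        return any(f.strip() for f in self.falsifiers)


def change_my_mind_prompt(statement: str) -> str:
    """Build the follow-up prompt asking a model to define its own limits."""
    return (
        f'You claimed: "{statement}"\n'
        "What would change your mind? List the specific data that would "
        "prove this statement false, and cite where in the provided "
        "context each item could be checked."
    )
```

For example, `Claim("Company X is likely to default on its debt")` starts out unbounded and stays that way until the model's answer to the follow-up prompt has been recorded in `falsifiers`.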
By forcing the models to operate in the same thread, I can perform a cross-check:
- If Model A suggests a risk factor, Model B is immediately prompted to cross-reference that risk against the internal document repository.
- If Model C identifies a trend, Model D is tasked with finding the historical counter-evidence.
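The pairing of one model's claim with another model's review can be generated rather than assigned by hand. The sketch below uses a simple round-robin, which is my own choice for illustration and not the only reasonable scheme; `assign_reviewers` is a hypothetical helper. Each model's output is reviewed by the next model in the list, so no model ever checks its own work.

```python
from typing import List, Tuple


def assign_reviewers(models: List[str]) -> List[Tuple[str, str]]:
    """Pair each author with the next model in the list, wrapping at the end."""
    if len(models) < 2:
        raise ValueError("need at least two models to cross-check")
    return [
        (author, models[(i + 1) % len(models)])
        for i, author in enumerate(models)
    ]
```

Rotating the order of the list between sessions changes who reviews whom, which keeps any single pairing from becoming its own small echo chamber.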
This is not about "saving time." In many cases, it takes *longer* to run this protocol than it would to just take the first result. But in the legal and investment fields, time is a cheap commodity compared to accuracy. A single wrong strategic decision can cost far more than the extra time required to verify the logic.
The Future is Conflict, Not Consensus
If you take away one thing from this analysis, let it be this: Do not trust a model that agrees with you. When you start using multiple frontier models in the same conversation thread, look for the friction. The places where the models disagree—where they cite contradictory interpretations of the same evidence—are the places where the actual human work begins.
That is where your judgment comes in. You are the mediator. You are the one who determines which model's logic held up under the pressure of the others. You are not a prompter; you are a committee chair.

A Final Note on Workflow Hygiene
I don't name my workflows after the tools I use. Calling a process "The GPT-4 Workflow" is like calling your legal analysis "The Microsoft Word Process." It’s irrelevant. I name my workflows after the outcome: The Triangulation of Doubt, The Adversarial Clause Review, or The Contradiction Audit. When the tool eventually changes or updates, the process remains.
Stop looking for the tool that gives you the answer. Start building the environment that forces the truth to the surface. It’s harder, it’s slower, and it’s significantly more expensive in terms of cognitive load, but in high-stakes research, it’s the only way to survive the scrutiny of an investment committee that has heard it all before.
This post is part of my ongoing effort to document rigorous AI workflows. If you have a workflow for surfacing contradictions that you’d like to share, I’d love to hear how you pressure-test your models. Just spare me the "synergy" talk; I have a low tolerance for fluff.