1. The Gap Between Expert Calls and Investment Decisions
A typical private equity deal team runs 10 to 15 expert calls per transaction. By the time the investment committee deck is assembled, most of what those experts said has been reduced to a handful of paraphrased bullets — or worse, forgotten entirely. The analyst who took notes on call seven has a vague memory that something came up about channel economics in call two, but cannot recall whether that was a red flag or a confirmation of the thesis.
This is not a resource problem. It is a structural problem. Expert calls generate raw material — hours of audio, thousands of words of transcript, hundreds of discrete claims about markets, competitors, customers, unit economics, and regulatory dynamics. Without a systematic process for extracting, scoring, and synthesizing that material, the insights remain buried. The synthesis step that transforms transcripts into actionable intelligence is almost universally missing.
The consequences are predictable. Research gaps go undetected until IC, when it is too late to close them. Contradictions between experts are never surfaced — the IC hears a confident consensus view that does not actually exist. High-confidence, first-hand claims are given the same weight as speculative hearsay from someone three degrees removed from the operation. And the investment thesis is never formally tested against the evidence base; it simply accumulates supporting quotes while contradictory signals are quietly set aside.
The problem is not that deal teams lack rigorous thinking. It is that the tools available for managing expert research — spreadsheets, shared documents, call notes — are not designed for the synthesis task. They are archival tools, not intelligence tools. They can store what was said. They cannot tell you what it means when taken together.
This playbook describes a structured approach to expert intelligence — one that begins with audio and ends with a confidence-scored, synthesized intelligence layer that evolves in real time as each new call is completed. It is built around eight discrete stages, each with a defined input, process, and output. The goal is not to replace analyst judgment. It is to give analysts the structured evidence base on which judgment can actually be exercised.
2. What 'Structured Intelligence' Actually Means
The phrase 'structured intelligence' gets used loosely. For the purposes of this playbook, it has a precise meaning built on four components: claim extraction, attribution, confidence scoring, and cross-call synthesis.
Claim extraction is the process of identifying discrete, assertive statements within a transcript — statements that make a factual or predictive claim about the world. Not every sentence in an expert call is a claim. Questions, pleasantries, procedural remarks, and hedged non-answers are not claims. A claim is specific, attributable, and meaningful.
Attribution connects each claim to the expert who made it, along with the context in which it was made. This matters because the evidential weight of a claim depends entirely on who is making it and on what basis. A former head of customer acquisition speaking from direct operational experience is not the same source as a strategy consultant offering an informed outside view.
Consider the difference between these two representations of the same topic. The first: 'We spoke to an industry expert who said CAC payback was around 12–18 months.' The second is a properly attributed, confidence-scored claim:
“Our day-zero cohort CAC payback ran 15 to 18 months across our 2023 and 2024 cohorts — this was stable across regions.”
Source: EXP-041. Confidence score: 87. Synthesis status: Consensus.
The first version is a paraphrase that collapses specificity (15–18 months vs. 12–18 months), strips attribution (who said this and with what authority), removes temporal context (the 2023–2024 cohort framing), and eliminates the corroboration signal (consensus status, meaning at least two independent experts aligned on this proposition). The second version is a piece of intelligence. The first is a note.
Cross-call synthesis is the process of comparing claims across multiple experts to identify patterns — where experts agree, where they contradict each other, and where a proposition has been raised by only one source with no corroboration. This is the step that converts a collection of individual expert perspectives into a coherent intelligence picture. It requires systematic semantic comparison, not just a final read-through of all transcripts by a single analyst.
Together, these four components form the foundation of a structured expert intelligence process. The eight stages described in the following sections operationalize them at scale.
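As a rough illustration of what an attributed, scored claim might look like as a data record, the sketch below uses a Python dataclass. The field names, the claim identifier format, and the category and basis labels are assumptions made for this example, not a documented schema. The expert identifier, score, and status are the ones this playbook gives for the CAC payback claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One extracted claim plus the attribution needed to weigh it."""
    claim_id: str                           # hypothetical identifier format
    expert_id: str                          # who made the claim
    text: str                               # quoted wording, not a paraphrase
    basis: str                              # assumed labels: first_hand, second_hand, inference
    category: str                           # assumed label, e.g. unit_economics
    confidence: Optional[int] = None        # 0-100, filled in by confidence scoring
    synthesis_status: Optional[str] = None  # filled in by cross-call synthesis


claim = Claim(
    claim_id="CLM-0001",
    expert_id="EXP-041",
    text="Our day-zero cohort CAC payback ran 15 to 18 months "
         "across our 2023 and 2024 cohorts",
    basis="first_hand",
    category="unit_economics",
    confidence=87,
    synthesis_status="Consensus",
)
```

Keeping the quoted wording alongside the attribution fields is what preserves the specificity that a paraphrased note loses.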
3. Stages 1–4: From Audio to Extracted Claims
The first four stages of the pipeline convert raw audio into a structured set of attributable, discrete claims. Each stage has measurable quality benchmarks.
Stage 1: Ingestion. Audio files are ingested and pre-processed within 30 seconds of upload. This stage handles format normalization, audio quality assessment, and metadata tagging — call date, expert identifier, project code, and interview structure flags. The speed matters: deal teams should not wait for a batch processing window to begin analysis. Intelligence must be available on a rolling basis as calls are completed.
Stage 2: Diarisation. Speaker diarisation separates the audio into distinct speaker turns before transcription. This is a prerequisite for attribution — without knowing which speaker said which sentence, it is impossible to distinguish expert claims from analyst questions or from interruptions. Current benchmark accuracy is 98.4%, measured against human-annotated ground truth across a diverse set of two-party and multi-party expert calls.
Stage 3: Transcription. Each speaker-attributed audio segment is transcribed. The current word error rate for English-language calls is 4.2%, which compares favourably with the industry benchmark of 5–8% for specialized domain vocabulary (finance, operations, regulatory terminology). Higher error rates on technical terms are addressed through domain-specific vocabulary fine-tuning and post-transcription correction passes.
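For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard calculation. The function name and the sample sentences are illustrative, and nothing here describes how the 4.2% figure itself was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word ("fifteen" heard as "fifty") in a seven-word reference.
wer = word_error_rate(
    "CAC payback ran fifteen to eighteen months",
    "CAC payback ran fifty to eighteen months",
)
```

A single misheard number changes the claim entirely, which is why error rates on domain terms and figures matter more than the headline rate.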
Stage 4: Claim Extraction. This is the most technically complex stage. The system identifies and extracts discrete claims from the expert's speaker turns, tagging each with a unique claim identifier, the source turn in the transcript, and an initial categorization (market structure, unit economics, competitive dynamics, regulatory, management quality, operational, and so on). Performance benchmarks: 91% precision (of claims extracted, 91% are genuine claims rather than procedural text or hedged non-answers), 87% recall (of all claims that could be extracted, 87% are successfully identified).
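To make the two benchmarks concrete, the sketch below computes precision, recall, and their harmonic mean (F1) from raw counts. The counts are hypothetical, chosen only so that they reproduce 91% precision and 87% recall. The F1 value is derived here and is not a figure reported in this playbook.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Standard extraction metrics from raw counts."""
    precision = true_pos / (true_pos + false_pos)  # share of extracted items that are genuine
    recall = true_pos / (true_pos + false_neg)     # share of genuine claims that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotated sample: 8,700 items extracted, 9,100 genuine claims present.
precision, recall, f1 = precision_recall_f1(
    true_pos=7917,    # genuine claims that were extracted
    false_pos=783,    # procedural text or hedged non-answers extracted by mistake
    false_neg=1183,   # genuine claims the extractor missed
)
```

With these counts the harmonic mean comes out at roughly 0.89.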
The output of Stage 4 for an average 60-minute expert call is approximately 127 distinct extractable claims. This figure varies by call style and expert verbosity, but it establishes the scale of the extraction task: a 12-call research project generates on the order of 1,500 discrete claims that require scoring, categorization, and cross-call synthesis. No manual process can handle this volume systematically.
A claim must meet three criteria to be extracted. First, specificity: the statement must make a concrete, falsifiable assertion about the world — not a hedged opinion, a rhetorical question, or a generalization. 'CAC payback ran 15 to 18 months in 2023' qualifies. 'Payback is something operators think about a lot' does not. Second, attributability: the claim must be clearly traceable to a specific speaker turn. Statements that blend the expert's view with the interviewer's framing, or that arise from leading questions, are flagged for review rather than extracted as clean claims. Third, meaningfulness: the claim must be substantively relevant to the research agenda. Procedural conversation — scheduling, clarification of process, social pleasantries — is excluded regardless of how specific or attributable it may be.
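The three criteria can be read as a simple triage rule. The sketch below assumes each criterion has already been judged upstream and is available as a boolean flag. The class, field, and function names are hypothetical, and in practice each flag would come from a model or a reviewer rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    is_specific: bool              # concrete, falsifiable assertion
    is_cleanly_attributable: bool  # traceable to one expert turn, not led by the interviewer
    is_relevant: bool              # bears on the research agenda


def triage(candidate: Candidate) -> str:
    """Return 'exclude', 'review', or 'extract' for one candidate statement."""
    if not candidate.is_relevant:
        return "exclude"  # procedural talk is dropped regardless of the other flags
    if not candidate.is_specific:
        return "exclude"  # hedged opinions and generalizations are not claims
    if not candidate.is_cleanly_attributable:
        return "review"   # blended or led statements are flagged, not extracted
    return "extract"


clean = Candidate("CAC payback ran 15 to 18 months in 2023", True, True, True)
vague = Candidate("Payback is something operators think about a lot", False, True, True)
led = Candidate("Yes, I'd agree payback was about 15 months", True, False, True)
```

Here `triage(clean)` returns "extract", `triage(vague)` returns "exclude", and `triage(led)` returns "review".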
4. Stage 5: Confidence Scoring — Why All Claims Are Not Equal
Once claims are extracted, they must be evaluated for evidential weight. A five-factor confidence model assigns a score from 0 to 100 to each claim, weighted as follows: Specificity (30%), Evidence Quality (25%), Linguistic Certainty (20%), Expert Seniority (15%), and Cross-Call Corroboration (10%).
Specificity captures how precise and concrete the claim is. A claim that names a specific metric, time period, and geographic scope scores higher than one that offers a directional view without anchoring details. The 15–18 month CAC payback figure, tied to specific 2023 and 2024 cohorts, scores significantly higher on this factor than 'payback is somewhere in the range of 12 to 18 months.'
Evidence Quality assesses whether the claim is based on first-hand experience, second-hand reporting, or inference. A former operator describing their own company's metrics from direct experience scores at the top of this factor. A consultant paraphrasing what they have heard from clients scores in the middle. A generalized market view constructed from publicly available information scores low.
Linguistic Certainty evaluates the hedging language in the original statement. Claims prefaced with 'I know for a fact' or 'in our own data' score higher than those prefaced with 'I think,' 'it's probably,' or 'I've heard.' This factor is assessed using a trained linguistic certainty classifier, not rule-based keyword matching, to handle the full range of hedging patterns in natural speech.
Expert Seniority reflects the relevance of the expert's professional background to the specific claim being made. A former CFO commenting on unit economics scores higher on this factor than the same individual commenting on product roadmap decisions. Seniority here is contextual, not hierarchical — it measures subject-matter authority relative to the claim, not job title.
Cross-Call Corroboration, at 10% weighting, is the smallest individual factor — but it is the only one that captures evidence from outside a single call. A claim that has been independently corroborated by two or more other experts receives a corroboration boost. The weighting is intentionally modest at the individual claim level because corroboration is more powerfully represented at the synthesis stage, where Consensus, Contradiction, and Unique Signal statuses are applied.
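The five weights can be combined as a weighted average. The sketch below assumes each factor is first scored on a 0 to 100 scale and that the factors combine linearly. The playbook states the weights but not the per-factor scales or the combination rule, so both are assumptions, and the factor values in the example are invented for illustration.

```python
from typing import Mapping

# Weights in percentage points, as stated in this section.
WEIGHTS = {
    "specificity": 30,
    "evidence_quality": 25,
    "linguistic_certainty": 20,
    "expert_seniority": 15,
    "cross_call_corroboration": 10,
}


def confidence_score(factors: Mapping[str, float]) -> int:
    """Weighted average of five factor scores, each assumed to lie in 0-100."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= factors[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    weighted_sum = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return round(weighted_sum / 100)


# Invented factor scores for a specific but uncorroborated claim.
example = confidence_score({
    "specificity": 80,
    "evidence_quality": 60,
    "linguistic_certainty": 50,
    "expert_seniority": 60,
    "cross_call_corroboration": 0,
})
```

Because corroboration carries only 10 points of weight, a claim with no corroboration at all can still score up to 90 under this linear reading.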
The contrast between a high-confidence and low-confidence claim on the same topic illustrates why scoring matters. The CAC payback claim quoted in Section 2, made by expert EXP-041, scored 87: first-hand operational data, high specificity, low hedging. Consider the same topic from a different expert:
“My sense was that payback in this category was somewhere in the 12 to 18 month range, though operators I've spoken with varied quite a bit.”
EXP-052's claim scores 41. It is second-hand (based on conversations with operators, not the consultant's own data), low on specificity (a 6-month range with significant reported variance), high on hedging ('my sense was,' 'varied quite a bit'), and has not been corroborated by any other call in this project. Both claims are valuable in the research file. But they should not be treated as equivalent evidence. An IC deck that cites both as 'industry sources suggest 12–18 month payback' is destroying information, not synthesizing it.
The practical output of Stage 5 is a scored, ranked evidence base for each research question on the agenda. High-confidence claims anchor the IC deck. Low-confidence claims feed the research gap register — they are signals that a topic has been raised but not yet adequately evidenced, and they generate prompts to seek stronger sources.
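One way to picture the Stage 5 output is a ranked list that is then split by a cutoff. The playbook does not state where high confidence ends and low confidence begins, so the cutoff of 70 below is a placeholder, and the function and field names are hypothetical.

```python
def route_claims(claims, anchor_cutoff=70):
    """Rank claims by score, then split into IC anchors and gap-register prompts."""
    ranked = sorted(claims, key=lambda c: c["score"], reverse=True)
    anchors = [c for c in ranked if c["score"] >= anchor_cutoff]
    gap_prompts = [c for c in ranked if c["score"] < anchor_cutoff]
    return anchors, gap_prompts


# The two CAC payback claims discussed above, keyed by expert identifier.
claims = [
    {"expert": "EXP-052", "score": 41},
    {"expert": "EXP-041", "score": 87},
]
anchors, gap_prompts = route_claims(claims)
```

With this placeholder cutoff, the EXP-041 claim anchors the evidence base and the EXP-052 claim becomes a prompt to seek a stronger source.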
5. Stages 6–7: Cross-Call Synthesis and the Rolling Thesis
By Stage 6, the system has a scored set of extracted claims from every completed call. The synthesis stage clusters semantically related claims across calls and classifies the relationship between them. This is the step that turns individual data points into a coherent evidence pattern.
Semantic clustering uses a proprietary embedding space trained on investment research corpora. Claims with a cosine similarity above 0.78 are clustered as addressing the same proposition. The 0.78 threshold was established empirically to balance two types of errors: false merges (grouping claims that address related but genuinely distinct propositions) and false splits (treating claims about the same underlying fact as separate when expressed in different language). The threshold is configurable per research team, but the default calibration reflects testing across several thousand annotated claim pairs.
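The thresholding step can be sketched in plain Python. The embedding model itself is proprietary, so the three-dimensional vectors below are toy stand-ins. Grouping claims transitively (if A matches B and B matches C, all three share a cluster) is also an assumption made for this example, since the text specifies only the pairwise threshold.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_claims(embeddings, threshold=0.78):
    """Group claim ids whose pairwise similarity exceeds the threshold."""
    parent = {claim_id: claim_id for claim_id in embeddings}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for claim_id in embeddings:
        clusters.setdefault(find(claim_id), set()).add(claim_id)
    return list(clusters.values())


# Toy vectors: the first two claims point in nearly the same direction.
toy = {
    "payback_expert_1": [1.0, 0.0, 0.0],
    "payback_expert_2": [0.9, 0.1, 0.0],
    "regulation_expert_3": [0.0, 1.0, 0.0],
}
clusters = cluster_claims(toy)
```

Raising the threshold trades false merges for false splits, and lowering it does the opposite, which is the balance the default calibration is meant to strike.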
Each cluster of semantically related claims is then classified into one of three synthesis statuses. Consensus means two or more independent experts have made claims that align on the same proposition — they may use different language, come from different backgrounds, and describe the phenomenon from different vantage points, but they converge on the same underlying assertion. Contradiction means two or more experts have made claims that cannot both be true — they assert opposing things about the same proposition. Unique Signal means a proposition has appeared in only one expert's contribution, with no corroboration or contradiction from any other call.
These three statuses carry very different decision implications. Consensus findings can be cited with confidence at IC, particularly when accompanied by high average confidence scores. Contradictions require active resolution — they signal that experts with different information or different vantage points see the world differently, and the team must understand why. Unique Signals are neither confirmed nor dismissed: they are open hypotheses that warrant follow-up.
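The three statuses follow from two questions: how many independent experts spoke to the proposition, and do they agree? The sketch below assumes each claim in a cluster already carries a stance label relative to the proposition. How stance is determined is not described here, so the label values are hypothetical.

```python
def classify_cluster(claims):
    """claims: list of (expert_id, stance) pairs for one clustered proposition."""
    if not claims:
        raise ValueError("a cluster must contain at least one claim")
    experts = {expert for expert, _ in claims}
    if len(experts) < 2:
        return "Unique Signal"  # only one source, however many times they said it
    stances = {stance for _, stance in claims}
    if len(stances) > 1:
        return "Contradiction"  # independent experts assert opposing things
    return "Consensus"          # two or more experts converge


consensus = classify_cluster([("EXP-A", "supports"), ("EXP-B", "supports")])
contradiction = classify_cluster([("EXP-A", "supports"), ("EXP-B", "opposes")])
unique = classify_cluster([("EXP-A", "supports"), ("EXP-A", "supports")])
```

Counting distinct experts rather than claims is the important detail: one expert repeating a point twice is still a Unique Signal.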
“The value of expert research lies in patterns across calls, not insights within a single call.”
— Nextyn IQ Methodology Guide
Stage 7 builds the Rolling Thesis, a live intelligence layer that updates automatically after each new call is processed. The Rolling Thesis is not a document. It is not a synthesis memo produced at the end of the research phase. It is a continuously updated structured representation of the current state of the evidence base: what is known with high confidence, what is contested, what is hypothesized but unconfirmed, and what has not yet been examined.
The practical implication is significant. On a 12-call research project, the team knows after call four whether a thesis dimension is already showing consensus, contradiction, or data absence — and they can adjust the remaining call agenda accordingly. If three of the first four experts have independently confirmed a competitive dynamic, call five does not need to spend 20 minutes re-establishing that baseline. It can go deeper. This is not just efficiency. It is a materially different quality of research.
The Rolling Thesis also surfaces what is absent. If the research agenda includes six thesis dimensions and three of them show robust evidence after eight calls while three show only Unique Signal or no coverage at all, the imbalance is immediately visible. The team can seek targeted experts for the under-evidenced dimensions before the research phase closes.
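A minimal sketch of the idea follows: a structure that is updated after every call and can report, at any point, the status of each thesis dimension and which dimensions remain under-evidenced. For brevity it treats each dimension as a single proposition and takes pre-labelled stances as input. Both are simplifications, and the class and method names are hypothetical.

```python
from collections import defaultdict


class RollingThesis:
    """Tracks, per thesis dimension, which experts took which stance."""

    def __init__(self, dimensions):
        # dimension -> stance -> set of expert ids
        self._evidence = {d: defaultdict(set) for d in dimensions}
        self.calls_processed = 0

    def add_call(self, expert_id, claims):
        """claims: iterable of (dimension, stance) pairs from one call.

        Dimensions outside the agenda raise KeyError in this sketch.
        """
        for dimension, stance in claims:
            self._evidence[dimension][stance].add(expert_id)
        self.calls_processed += 1

    def status(self, dimension):
        by_stance = self._evidence[dimension]
        experts = set().union(*by_stance.values())
        if not experts:
            return "No coverage"
        if len(experts) == 1:
            return "Unique Signal"
        if len(by_stance) > 1:
            return "Contradiction"
        return "Consensus"

    def snapshot(self):
        return {d: self.status(d) for d in self._evidence}

    def under_evidenced(self):
        """Dimensions with only a Unique Signal or no coverage at all."""
        weak = ("Unique Signal", "No coverage")
        return [d for d, s in self.snapshot().items() if s in weak]


thesis = RollingThesis(["market_growth", "unit_economics", "regulatory_exposure"])
thesis.add_call("EXP-A", [("market_growth", "supports"), ("unit_economics", "supports")])
thesis.add_call("EXP-B", [("market_growth", "supports"), ("unit_economics", "opposes")])
snapshot = thesis.snapshot()
```

After two calls, the snapshot already shows where the remaining call agenda should be pointed: `thesis.under_evidenced()` lists the dimensions that still need targeted experts.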
6. Stage 8: Research Gap Detection — Knowing What You Don't Know
The most dangerous moment in an expert research process is not when you find a contradiction. It is when you reach IC without having asked the right question. Gap detection is the discipline of systematically identifying what the current evidence base does not cover — and doing so before the research phase ends, not after.
The system identifies three types of research gaps. Research Question gaps arise when a question on the agreed research agenda has not been meaningfully addressed by any expert — either because interviewers did not ask it, experts declined to answer, or the answers given were too hedged to constitute extractable claims. These gaps are tracked directly from the research agenda and are visible from the first call.
Thesis Dimension gaps arise when a logical component of the investment thesis has no supporting evidence in the claims database. These can differ from Research Question gaps because they may reflect dimensions that were never explicitly included in the question agenda — they are surfaced by mapping the claims database against the stated thesis structure and identifying missing linkages.
AI-Surfaced gaps are the most novel category. These are topics that appear in expert calls — mentioned in passing, referenced obliquely, or implied by a contradiction — that were not on the original research agenda but that the system flags as potentially relevant to the thesis. A regulatory reference that appears in two different calls from experts discussing distribution dynamics, for example, might generate an AI-Surfaced gap around regulatory exposure that the research agenda had not explicitly addressed.
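The three gap types can be approximated with set comparisons. In the sketch below, the input shapes, the function name, and the rule that an off-agenda topic must be mentioned by at least two experts before it is flagged are assumptions. The two-expert rule is drawn from the regulatory example above, not from a stated specification.

```python
def detect_gaps(agenda, thesis_dimensions, claims, topic_mentions, min_experts=2):
    """Return the three gap lists for the current evidence base.

    agenda: research questions agreed up front
    thesis_dimensions: logical components of the investment thesis
    claims: dicts with optional 'question' and 'dimension' keys
    topic_mentions: topic -> list of expert ids who mentioned it
    """
    answered = {c["question"] for c in claims if c.get("question")}
    evidenced = {c["dimension"] for c in claims if c.get("dimension")}
    known = set(thesis_dimensions)
    surfaced = sorted(
        topic
        for topic, experts in topic_mentions.items()
        if topic not in known and len(set(experts)) >= min_experts
    )
    return {
        "research_question": [q for q in agenda if q not in answered],
        "thesis_dimension": [d for d in thesis_dimensions if d not in evidenced],
        "ai_surfaced": surfaced,
    }


gaps = detect_gaps(
    agenda=["What is CAC payback?", "How defensible is the technology?"],
    thesis_dimensions=["unit_economics", "competitive_moat"],
    claims=[{"question": "What is CAC payback?", "dimension": "unit_economics"}],
    topic_mentions={"data_localisation": ["EXP-A", "EXP-B"], "pricing": ["EXP-A"]},
)
```

In this example the unanswered question, the unevidenced moat dimension, and the twice-mentioned data localisation topic each land in a different gap list.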
On average, 5.2 gaps are surfaced per research project. Of those, 78% are rated by the analyst team as 'useful or critical' on validation, meaning the system is surfacing material issues, not noise. The remaining 22% are reviewed and dismissed, typically because they reflect topics that are outside scope, already addressed in other diligence workstreams, or genuinely not relevant to the thesis.
The value of gap detection is preventive. The cost of discovering at IC that no expert was asked about regulatory exposure in a heavily regulated market is not just the embarrassment of the question from a committee member. It is the delay, the credibility damage, and, in the worst case, the risk of proceeding on an incomplete evidence base. Gap detection surfaces that omission while there is still time to close it, before the research phase ends.
7. Putting It Together: A Real 8-Call Deal Example
The following example is drawn from an anonymized engagement with a Top-10 European private equity firm evaluating a $480M consumer technology acquisition. The research phase ran 8 expert calls over 9 calendar days. It illustrates how the eight stages work as an integrated system rather than a collection of independent tools.
The research agenda was structured around five thesis dimensions: market growth trajectory, competitive moat sustainability, unit economics at scale, management execution quality, and regulatory exposure. Eight experts were engaged across three cohorts: former operators with direct P&L experience in the target company's sector, former competitors who had exited the market, and one independent regulatory specialist.
By the end of call three, the Rolling Thesis showed strong Consensus on market growth trajectory (all three experts independently characterised the category as high-growth with structural tailwinds) and on management execution quality (two of three made specific, high-confidence claims about the management team's operational track record). The unit economics dimension showed a Contradiction on CAC payback timelines between one high-confidence claim and one low-confidence claim, which warranted specific follow-up in subsequent calls.
By call six, the unit economics contradiction had been resolved: the high-confidence first-hand view prevailed, with two additional experts corroborating the 15–18 month CAC payback range from direct operational experience. The competitive moat dimension, however, showed a new and significant contradiction: two experts disagreed sharply on whether a specific technology component was proprietary or easily replicable. This was a thesis-critical issue — one of the primary moat arguments in the deal model rested on that component's defensibility.
Calls seven and eight were redirected based on the Rolling Thesis output. Rather than continuing to accumulate further evidence on the already-confirmed thesis dimensions, the team focused the final two calls on resolving the technology defensibility contradiction and closing two AI-Surfaced gaps: an emerging regulatory question around data localisation requirements (flagged when two experts independently referenced cross-border data handling in passing) and a distribution channel risk that had appeared in one expert's remarks about a competitor's exit from the market.
The outcome: an IC-ready evidence base with an average confidence score of 87 out of 100 across all claims cited in the IC deck. Two critical research gaps had been identified and actively addressed before IC, including the technology defensibility contradiction, which materially changed the proposed deal terms: the IC concluded that the moat argument required additional protection in the deal structure. The total synthesis time from first call completion to IC-ready intelligence layer was 4 days, compared with the firm's historical benchmark of 3 weeks for a comparable research volume.
The compression came not from cutting corners but from eliminating the bottleneck that normally consumes the most time in expert research synthesis: the manual task of reading through all transcripts, trying to remember what was said in which call, reconciling contradictions that were never formally identified, and assembling a coherent narrative from unstructured notes. When those tasks are automated and the synthesis is available in real time, the analyst's time is freed for the judgment work that machines cannot do.
8. Conclusion: The Standard Is Changing
IC decks that say 'industry sources suggest' are getting harder to defend. As investment processes become more competitive and the evidence bar at IC continues to rise, the ability to cite specific, attributed, confidence-scored expert claims — and to demonstrate that those claims have been systematically tested for corroboration and contradiction across a full research programme — is becoming a differentiator.
The firms that systematically structure expert intelligence will make better decisions faster. Not because their experts are better or their analysts are smarter, but because they are capturing and using a higher proportion of the information their expert research programme actually generates. Most of what experts say in a 60-minute call never makes it into a deal decision in any structured form. The playbook described here changes that ratio.
The technology exists. The methodology is proven. The question is whether your research process is designed to capture what your experts are actually telling you — or whether you are running 12 calls and making decisions on the basis of what the analyst who read transcript seven happened to remember.