Agentic AI · Drug Development

From pilot to protagonist

Pharma is handing AI the decisions, not just the slides. The scarce asset is no longer generation — it is knowing whether the machine is right.

BigBio AI · 15 June 2026 · Every claim traced to a primary source

For three years the drug industry's romance with artificial intelligence was mostly choreography. Models drew protein structures, summarised trials and flattered steering committees. The work was real, but the verb was always “assist”: a human decided, the machine suggested.

The new grammar

Pharma has quietly changed the verb from “suggest” to “decide”, and the verb is the whole story.

That grammar is changing, and the change is now on the record. On 5 June 2026 Owkin and Sanofi widened a partnership that began in 2021 with a €90m oncology collaboration into a five-year licence for “K Pro” — a platform Owkin bills not as a copilot but as an “AI scientist”: a job title, not a feature.¹ Three months earlier, at NVIDIA's GTC, IQVIA unveiled IQVIA.ai, a unified agentic platform spanning clinical, commercial and real-world operations; the firm reports more than 150 agents in production and says 19 of the top 20 pharmaceutical companies already use them — the twentieth, presumably, is still in procurement.² Benchling's 2026 survey of around 100 biotech organisations catches the mood from the inside: “build what differentiates, buy what scales” — make-versus-buy, with higher stakes and worse documentation.³

The throughline is a single, consequential shift. AI is moving from the margins of the workflow toward its centre — the place where decisions are made. And the centre is a different country from the edge.

The bill for being wrong

A tool that drafts can be overruled in a meeting; an engine that decides cannot, and the ledger on machine error does not flatter it.

This is, mostly, good news. Design-make-test-analyse cycles compress; a biomarker hypothesis that once took a quarter can take a fortnight. But a decision engine is a different animal from a drawing tool, and it imports a different risk. When a model suggests, a wrong answer costs a meeting. When a model decides — or shapes a decision so heavily that no human re-derives it — a wrong answer costs a programme, a partnership, sometimes a patient.

Here the evidence is sobering, and it is not anecdote. A 2023 audit in the Journal of Clinical Medicine asked a large language model for nephrology references: of the 610 it produced, only 20% were authentic and 31% were fabricated outright, fiction with footnotes.⁴ The trouble runs deeper than citations. Kapoor and Narayanan, writing in Patterns, found data leakage corrupting machine-learning results across 294 papers in 17 scientific fields; in a worked correction, an elaborate model performed no better than logistic regression, a method old enough to collect a pension.⁵ And the commercial reckoning has arrived: MIT's NANDA initiative reported in 2025 that 95% of enterprise generative-AI pilots produced no measurable effect on profit and loss; an estimated $30bn-40bn bought a great deal of enthusiasm and almost no measurable return.⁶ The machine is fast. It is not, by default, right.

Where the machine earns its seat

Adoption tracks one thing only: whether the answer can be checked. That is why AI sprints through chemistry and limps through biology.

What separates output you can act on from output you cannot is unglamorous: provenance, the unbroken chain from a stated fact back to the primary source that licenses it, gradeable and auditable rather than vibe-checked. It is a discipline, not a dashboard: every claim atomic, every claim tagged by the strength of its evidence, every figure traceable to a document a sceptic can open.

The empirical pattern bears this out. Benchling's own figures show adoption is high where the ground truth is clean — literature review (76%), protein-structure prediction (71%), scientific reporting (66%), target identification (58%) — and drops sharply in generative design, biomarker analysis and ADME, “where data is scattered, incomplete, and hard to validate.”³ The clinic tells the same story. An analysis in Drug Discovery Today found AI-originated molecules clear Phase I at 80-90%, well above the norm, then fall to roughly 40% in Phase II — the industry average.⁷ AI aces the exam it can cram for and stumbles on the one nobody can: it makes drug-like molecules beautifully, and is so far no better than anyone else at the part that matters, whether the biology is real.

The fix is not to slow the agents; it is to ground them. When researchers built a multi-agent literature system tethered to live PubMed retrieval rather than free generation, citation accuracy reached 99.8%, with no non-academic sources cited: models of the kind that invent citations in free generation largely stopped inventing them.⁸ Tool-grounding, not raw fluency, is what makes machine output defensible.

The only defensible moat

The winners will not own the cleverest agents; they will own the shortest path from any claim to the document that proves it.

The firms that win the next phase will not be the ones with the flashiest agents. They will be the ones who can answer, instantly and defensibly, a deceptively simple question about any AI-derived claim: how do you know? As pharma hands the machine a seat at the table, the premium moves to whoever still checks its work.

Appendix A · Sources

Graded · primary where available · each entry numbered as cited in the text

¹ Owkin newsfeed, “Owkin to build AI agents as part of a multi-year K Pro collaboration with Sanofi,” 5 Jun 2026. Grade: Primary. Five-year K Pro licence; builds on the 2021 €90m partnership. owkin.com/newsfeed
² IQVIA newsroom, “IQVIA unveils IQVIA.ai, a unified agentic AI platform,” 16 Mar 2026 (NVIDIA GTC). Grade: Primary. >150 agents, 19/20 top pharma; clinical / commercial / real-world operations (not bench discovery). iqvia.com/newsroom
³ Benchling, “2026 Biotech AI Report” (survey ~100 orgs, US+EU, Nov 2025). Grade: Primary. Adoption 76 / 71 / 66 / 58%; “adoption drops where data is scattered, incomplete, hard to validate.” The secondary “29-42%” band is not in the primary and is excluded. benchling.com
⁴ Suppadungsuk S, Thongprayoon C, et al. “Examining the Validity of ChatGPT in Identifying Relevant Nephrology Literature.” J Clin Med 2023. Grade: Primary. PMID 37685617. Of 610 references: 62% existed, 20% authentic, 31% fabricated.
⁵ Kapoor S, Narayanan A. “Leakage and the reproducibility crisis in machine-learning-based science.” Patterns 2023. Grade: Primary. PMID 37720327. Leakage across 294 papers in 17 fields.
⁶ MIT NANDA, “The GenAI Divide: State of AI in Business 2025,” Jul 2025. Grade: Secondary. 95% of enterprise GenAI pilots = no measurable P&L impact. Report PDF (no PMID; figure quoted, not deep-linked).
⁷ Jayatunga MKP, Ayers M, Bruens L, Jayanth S, Meier C. “How successful are AI-discovered drugs in clinical trials?” Drug Discovery Today 2024. Grade: Primary. PMID 38692505. 80-90% Phase I, ~40% Phase II.
⁸ Gorenshtein A, Shihada K, et al. “LITERAS: Biomedical literature review and citation retrieval agents.” Comput Biol Med 2025. Grade: Primary. PMID 40383055. Retrieval-grounded: 99.82% citation accuracy, 0% non-academic sources.

Appendix B · Provenance & method

How every number above was pulled, graded, and made checkable — not asserted

We do not ask you to trust this article; we ask you to check it. Each claim was reduced to a single statement, tied to a primary document, and graded by how strong its evidence is. The pipeline below is the audit trail — reproducible, so a sceptic can re-run it and land on the same numbers. That is what lets you act on the read instead of taking it on faith.

Step	Tool / API call	What it verified
Citation retrieval	`PubMed.get_article_metadata` × 4 (PMIDs 37685617, 37720327, 38692505, 40383055)	Each cited paper exists; title, journal, year, and the quoted figure match the abstract of record.
Primary-source capture	`tavily.search` + `jina.read_url` on owkin.com, iqvia.com, benchling.com	Owkin K Pro five-year licence; IQVIA >150 agents / 19-of-20; Benchling adoption percentages — read from the originating pages, not a secondary summary.
Claim atomisation	`bin/facts` fact-engine (admit + locator + quote + as_of)	Every sentence-level claim reduced to one statement with a source locator and verbatim quote; no compound or unsourced assertions admitted.
Evidence grading	Provenance grade `Primary` / `Secondary`	Owkin/IQVIA/Benchling/PMIDs = Primary; MIT-NANDA P&L figure = Secondary (report, not peer-reviewed). Grades shown inline above.
Hedge / fabrication gate	`doctor.py` T7 + T8 (hedge-gate, R6″)	Rejected the unsourced “29-42%” adoption band — present in a secondary write-up, absent from Benchling's primary — rather than launder an estimate as fact.
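
The atomisation, grading and gating rows above can be pictured with a small sketch. This is not the code behind `bin/facts` or `doctor.py`, whose internals are not shown here; it is a minimal illustration under stated assumptions. The `Claim` class, the `admit` function and the rejection messages are hypothetical names, the field names only mirror the "admit + locator + quote + as_of" list in the table, and the quote string and `as_of` date are placeholders rather than captured data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Hypothetical record shape. Field names follow the table's
# "locator + quote + as_of" wording; the rest is assumed for illustration.
@dataclass(frozen=True)
class Claim:
    statement: str           # one atomic statement, no compound assertions
    locator: Optional[str]   # where a reader can open the source (PMID, URL)
    quote: Optional[str]     # verbatim supporting text from that source
    as_of: Optional[date]    # date the source was read
    grade: Optional[str]     # "Primary" or "Secondary"

ALLOWED_GRADES = {"Primary", "Secondary"}

def admit(claim: Claim) -> Tuple[bool, str]:
    """Return (admitted, reason); only fully sourced, graded claims pass."""
    if not claim.locator:
        return False, "no source locator"
    if not claim.quote:
        return False, "no verbatim quote"
    if claim.as_of is None:
        return False, "no as_of date"
    if claim.grade not in ALLOWED_GRADES:
        return False, "ungraded"
    return True, "admitted"

# A sourced claim from Appendix A (quote text and as_of are placeholders).
sourced = Claim(
    statement="Leakage affects 294 papers in 17 fields",
    locator="PMID 37720327",
    quote="<verbatim text copied from the abstract>",
    as_of=date(2026, 6, 15),
    grade="Primary",
)

# The adoption band that appears only in a secondary write-up: no primary locator.
unsourced = Claim(
    statement="Adoption sits in a 29-42% band",
    locator=None,
    quote=None,
    as_of=None,
    grade=None,
)

for claim in (sourced, unsourced):
    print(claim.statement, "->", admit(claim))
```

The design choice worth noting is that the gate refuses rather than downgrades: a claim with no locator never reaches the grading step, which is how an estimate is kept from being laundered into a fact.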

The point of this appendix is the moat. An article you can audit in five clicks is doing in public exactly what pharma now needs from its AI: an unbroken, gradeable chain from claim to primary source. The rigour is the product.

Appendix C · Reasoning

How the conclusion was reached, run four ways — so you can find the seam if there is one

This piece rests on one claim: as AI moves from advising to deciding in drug research, the advantage goes to whoever can prove how a claim was reached. Richard Feynman’s first rule of honest thinking was that you must not fool yourself, and you are the easiest person to fool. The cure is independence: if you reach the same answer by several routes that do not lean on each other, the odds you fooled yourself on every one grow small. We ran this claim down four such routes. Here is where they all land:

Follow the evidence: AI now decides, and in one audit invented nearly a third of its citations

Assume it, then check: all three conditions hold; grounding lifts citation accuracy to 99.8%

The recurring pattern: AI races where answers are checkable, stalls where they are not

From a basic principle: an unchecked decision carries its errors straight downstream

All four roads reach the same place — the advantage is no longer having AI; it is being able to check it.
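
The independence argument can be written in one line. This is a sketch under an idealised assumption, not a measurement: suppose route i misleads us with probability p_i, and suppose the routes err independently. The symbol p_i and the value 0.2 are illustrative and appear nowhere in the sources. The four routes here draw on overlapping evidence, so the real figure is less favourable than the formula suggests.

```latex
\[
P(\text{all four routes mislead}) \;=\; \prod_{i=1}^{4} p_i ,
\qquad \text{e.g. } p_i = 0.2 \;\Rightarrow\; 0.2^{4} = 0.0016 .
\]
```

The less the routes share, the closer the product comes to describing them; shared inputs push the true probability back up toward that of a single route.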

Follow the evidence

Start from the record and trace it forward to where it leads.

Just follow the trail. Owkin and Sanofi have signed a five-year licence for a platform Owkin calls an “AI scientist”; IQVIA reports more than 150 agents in production across clinical, commercial and real-world operations, in use at 19 of the top 20 pharmaceutical companies; Benchling’s survey finds the same shift inside the labs. So AI is crossing from making the slides toward making the call, and a bad call costs a whole programme, not a meeting. Because these models still invent a real share of what they produce (31% of the references in one audit were fabricated), the trail ends somewhere uncomfortable: what matters is no longer having AI, but being able to check it.

Assume it, then check

Suppose the claim is true; confirm each condition it would require (working backward).

Now try to break the claim instead of building it. If “you win by proving your work” were true, three things would have to be real, and each is easy to look up. AI would have to be making real decisions: it is. Those decisions would have to fail often enough to matter: they do, from fabricated citations to data leakage across 294 papers to the 95% of enterprise pilots that showed no measurable effect on profit and loss. And there would have to be a fix that works: there is. Tie the model to a live literature search and its citation accuracy reaches 99.8%. The claim refuses to break.

The recurring pattern

One regularity shows up across many independent cases (induction).

Step back and watch the same thing happen again and again. Where the truth is easy to check, AI is everywhere — combing the literature, predicting protein shapes. Where the data is messy and hard to validate, trust collapses. AI-designed molecules sail through the early trials that test chemistry, which you can measure, then fall back toward the average in the later trials that test whether the drug works, which you cannot fake. One pattern explains every case: AI speeds up exactly where its answers can be checked. (This is induction — a strong pattern, not an ironclad law, so a genuinely new field could defy it — but it holds across very different ones.)

From a basic principle

From a premise almost no one rejects, the conclusion follows of necessity (deduction).

Finally, reason it out from something obvious. If a machine decides and no human re-derives the result, its mistakes pass straight into the outcome — no one disputes that. We have also shown that ungrounded models err at a real, measured rate. Put those two facts together and the conclusion is not optional; it is forced. An unchecked AI making real decisions will push real errors into real drug programmes. So checking the work — grounding it in sources, keeping the trail — is not polish; it is the requirement.

Four roads, one destination. Any single road you might doubt: perhaps we cherry-picked the evidence, perhaps the pattern is a fluke. But it is hard to fool yourself the same way four independent times. That convergence is the reason to trust the answer, and it is why BigBio shows every step: a conclusion you can reach four ways, each tied to a document you can open yourself, is one you can act on.