Agentic AI · Drug Development
From pilot to protagonist
Pharma is handing AI the decisions, not just the slides. The scarce asset is no longer generation — it is knowing whether the machine is right.
For three years the drug industry's romance with artificial intelligence was mostly choreography. Models drew protein structures, summarised trials and flattered steering committees. The work was real, but the verb was always “assist”: a human decided, the machine suggested.
The new grammar
Pharma has quietly changed the verb from “suggest” to “decide”, and the verb is the whole story.
That grammar is changing, and the change is now on the record. On 5 June 2026 Owkin and Sanofi widened a partnership that began in 2021 with a €90m oncology collaboration into a five-year licence for “K Pro”, a platform Owkin bills not as a copilot but as an “AI scientist”: a job title, not a feature.[1] Three months earlier, at NVIDIA's GTC, IQVIA unveiled IQVIA.ai, a unified agentic platform spanning clinical, commercial and real-world operations; the firm reports more than 150 agents in production and says 19 of the top 20 pharmaceutical companies already use them (the twentieth, presumably, is still in procurement).[2] Benchling's 2026 survey of around 100 biotech organisations catches the mood from the inside: “build what differentiates, buy what scales”, which is make-versus-buy with higher stakes and worse documentation.[3]
The throughline is a single, consequential shift. AI is moving from the margins of the workflow toward its centre — the place where decisions are made. And the centre is a different country from the edge.
The bill for being wrong
A tool that drafts can be overruled in a meeting; an engine that decides cannot, and the ledger on machine error does not flatter it.
This is, mostly, good news. Design-make-test-analyse cycles compress; a biomarker hypothesis that once took a quarter can take a fortnight. But a decision engine is a different animal from a drawing tool, and it imports a different risk. When a model suggests, a wrong answer costs a meeting. When a model decides — or shapes a decision so heavily that no human re-derives it — a wrong answer costs a programme, a partnership, sometimes a patient.
Here the evidence is sobering, and it is not anecdote. A 2023 audit in the Journal of Clinical Medicine asked a large language model for nephrology references: of the 610 it produced, only a fifth were fully authentic and nearly a third (31%) were outright fabrications, fiction with footnotes.[4] The trouble runs deeper than citations. Kapoor and Narayanan, writing in Patterns, found data leakage corrupting machine-learning results across 294 papers in 17 scientific fields; in a worked correction, an elaborate model did no better than logistic regression, a method old enough to collect a pension.[5] And the commercial reckoning has arrived: MIT's NANDA initiative reported in 2025 that 95% of enterprise generative-AI pilots produced no measurable effect on profit and loss; an estimated $30bn-40bn bought a great deal of enthusiasm and almost no profit.[6] The machine is fast. It is not, by default, right.
Where the machine earns its seat
Adoption tracks one thing only, whether the answer can be checked, which is why AI sprints through chemistry and limps through biology.
What separates a fast answer from a right one is unglamorous: provenance, the unbroken chain from a stated fact back to the primary source that licenses it, gradeable and auditable rather than vibe-checked. It is a discipline, not a dashboard: every claim atomic, every claim tagged by the strength of its evidence, every figure traceable to a document a sceptic can open.
The empirical pattern bears this out. Benchling's own figures show adoption is high where the ground truth is clean (literature review 76%, protein-structure prediction 71%, scientific reporting 66%, target identification 58%) and drops sharply in generative design, biomarker analysis and ADME, “where data is scattered, incomplete, and hard to validate.”[3] The clinic tells the same story. An analysis in Drug Discovery Today found AI-originated molecules clear Phase I at 80-90%, well above the norm, then fall to roughly 40% in Phase II, in line with the industry average.[7] AI aces the exam it can cram for and stumbles on the one nobody can: it makes drug-like molecules beautifully, and is so far no better than anyone else at the part that matters, whether the biology is real.
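The arithmetic behind the Phase I and Phase II comparison can be made explicit. What follows is a short sketch, assuming the Phase II rate is conditional on clearing Phase I so that the two rates multiply; the function name `cumulative_success` is illustrative, and the cumulative figures are derived here rather than quoted from the cited analysis.

```python
def cumulative_success(phase_rates: list[float]) -> float:
    """Probability of clearing every listed phase, assuming each rate is
    conditional on having cleared the previous one (an assumption)."""
    result = 1.0
    for rate in phase_rates:
        result *= rate
    return result


phase_two = 0.40  # "roughly 40%" as stated in the article
low = cumulative_success([0.80, phase_two])   # lower Phase I bound
high = cumulative_success([0.90, phase_two])  # upper Phase I bound
print(f"{low:.0%} to {high:.0%}")  # 32% to 36%
```

On these assumptions the Phase II step dominates the product: moving Phase I from 80% to 90% shifts the cumulative figure by only four percentage points.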
The fix is not to slow the agents; it is to ground them. When researchers built a multi-agent literature system tethered to live PubMed retrieval rather than free generation, citation accuracy reached 99.82% and no non-academic sources were cited.[8] Language models of the kind that invent citations when left to generate freely largely stop inventing them once every reference must come from a retrieved record. Tool-grounding, not raw fluency, is what makes machine output defensible.
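The grounding idea can be sketched in a few lines, assuming a retrieval step has already returned a set of PMIDs: a draft's citations are accepted only if each one appears in that retrieved set. The names `CitationReport`, `check_citations` and `retrieved` are hypothetical and are not taken from the cited system; the identifier `00000000` is a made-up stand-in for an invented reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationReport:
    grounded: list[str]    # cited PMIDs that the retrieval step returned
    ungrounded: list[str]  # cited PMIDs with no retrieved record behind them

    @property
    def accuracy(self) -> float:
        total = len(self.grounded) + len(self.ungrounded)
        return len(self.grounded) / total if total else 1.0


def check_citations(cited_pmids: list[str], retrieved_pmids: set[str]) -> CitationReport:
    """Split a draft's citations into grounded and ungrounded (hypothetical helper)."""
    grounded = [p for p in cited_pmids if p in retrieved_pmids]
    ungrounded = [p for p in cited_pmids if p not in retrieved_pmids]
    return CitationReport(grounded, ungrounded)


# Illustrative input: PMIDs listed in Appendix A plus one made-up identifier.
retrieved = {"37685617", "37720327", "38692505"}
report = check_citations(["37685617", "38692505", "00000000"], retrieved)
print(report.ungrounded)          # ['00000000']
print(round(report.accuracy, 2))  # 0.67
```

The check is deliberately strict: a citation the retrieval step never returned is flagged even if it happens to exist, which trades some recall for the guarantee that nothing unverified slips through.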
The only defensible moat
The winners will not own the cleverest agents; they will own the shortest path from any claim to the document that proves it.
The firms that win the next phase will not be the ones with the flashiest agents. They will be the ones who can answer, instantly and defensibly, a deceptively simple question about any AI-derived claim: how do you know? As pharma hands the machine a seat at the table, the premium moves to whoever still checks its work.
Appendix A · Sources
Graded · primary where available · every entry links back to where it is cited
1. Owkin newsfeed, “Owkin to build AI agents as part of a multi-year K Pro collaboration with Sanofi,” 5 Jun 2026. [Primary] Five-year K Pro licence; builds on the 2021 €90m partnership. owkin.com/newsfeed
2. IQVIA newsroom, “IQVIA unveils IQVIA.ai, a unified agentic AI platform,” 16 Mar 2026 (NVIDIA GTC). [Primary] >150 agents, 19/20 top pharma; clinical / commercial / real-world operations (not bench discovery). iqvia.com/newsroom
3. Benchling, “2026 Biotech AI Report” (survey ~100 orgs, US+EU, Nov 2025). [Primary] Adoption 76 / 71 / 66 / 58%; “adoption drops where data is scattered, incomplete, hard to validate.” The secondary “29-42%” band is not in the primary and is excluded. benchling.com
4. Suppadungsuk S, Thongprayoon C, et al. “Examining the Validity of ChatGPT in Identifying Relevant Nephrology Literature.” J Clin Med 2023. [Primary] PMID 37685617. Of 610 references: 62% existed, 20% authentic, 31% fabricated.
5. Kapoor S, Narayanan A. “Leakage and the reproducibility crisis in machine-learning-based science.” Patterns 2023. [Primary] PMID 37720327. Leakage across 294 papers in 17 fields.
6. MIT NANDA, “The GenAI Divide: State of AI in Business 2025,” Jul 2025. [Secondary] 95% of enterprise GenAI pilots = no measurable P&L impact. Report PDF (no PMID; figure quoted, not deep-linked).
7. Jayatunga MKP, Ayers M, Bruens L, Jayanth S, Meier C. “How successful are AI-discovered drugs in clinical trials?” Drug Discovery Today 2024. [Primary] PMID 38692505. 80-90% Phase I, ~40% Phase II.
8. Gorenshtein A, Shihada K, et al. “LITERAS: Biomedical literature review and citation retrieval agents.” Comput Biol Med 2025. [Primary] PMID 40383055. Retrieval-grounded: 99.82% citation accuracy, 0% non-academic sources.
Appendix B · Provenance & method
How every number above was pulled, graded, and made checkable — not asserted
We do not ask you to trust this article; we ask you to check it. Each claim was reduced to a single statement, tied to a primary document, and graded by how strong its evidence is. The pipeline below is the audit trail — reproducible, so a sceptic can re-run it and land on the same numbers. That is what lets you act on the read instead of taking it on faith.
| Step | Tool / API call | What it verified |
|---|---|---|
| Citation retrieval | PubMed.get_article_metadata × 4 (PMIDs 37685617, 37720327, 38692505, 40383055) | Each cited paper exists; title, journal, year, and the quoted figure match the abstract of record. |
| Primary-source capture | tavily.search + jina.read_url on owkin.com, iqvia.com, benchling.com | Owkin K Pro five-year licence; IQVIA >150 agents / 19-of-20; Benchling adoption percentages. All read from the originating pages, not a secondary summary. |
| Claim atomisation | bin/facts fact-engine (admit + locator + quote + as_of) | Every sentence-level claim reduced to one statement with a source locator and verbatim quote; no compound or unsourced assertions admitted. |
| Evidence grading | Provenance grade Primary / Secondary | Owkin/IQVIA/Benchling/PMIDs = Primary; MIT-NANDA P&L figure = Secondary (report, not peer-reviewed). Grades shown inline above. |
| Hedge / fabrication gate | doctor.py T7 + T8 (hedge-gate, R6″) | Rejected the unsourced “29-42%” adoption band (present in a secondary write-up, absent from Benchling's primary) rather than launder an estimate as fact. |
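
The claim-atomisation and hedge-gate rows of the table can be illustrated with a small sketch. It assumes each claim carries the fields the table names (locator, quote, as_of) plus a grade, and that a claim is admitted only when all of them are present. The `Claim` class and the rules inside `admit` are hypothetical; they are not the logic of the bin/facts or doctor.py tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    statement: str                 # one atomic assertion
    locator: Optional[str] = None  # where a reader can open the source
    quote: Optional[str] = None    # verbatim supporting text
    as_of: Optional[str] = None    # date the source was read
    grade: Optional[str] = None    # "Primary" or "Secondary"


def admit(claim: Claim) -> tuple[bool, str]:
    """Return (admitted, reason) under a hypothetical rule set."""
    for field in ("locator", "quote", "as_of"):
        if not getattr(claim, field):
            return False, f"missing {field}"
    if claim.grade not in ("Primary", "Secondary"):
        return False, "ungraded"
    return True, "ok"


sourced = Claim(
    statement="Adoption drops where data is scattered, incomplete, and hard to validate.",
    locator="benchling.com",
    quote="where data is scattered, incomplete, and hard to validate",
    as_of="YYYY-MM-DD",  # placeholder: the read date is not given in the article
    grade="Primary",
)
unsourced = Claim(statement="Adoption falls to a 29-42% band.")

print(admit(sourced))    # (True, 'ok')
print(admit(unsourced))  # (False, 'missing locator')
```

Returning a reason alongside the verdict keeps the rejection itself auditable, which is the property the table is arguing for.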
The point of this appendix is the moat. An article you can audit in five clicks is doing in public exactly what pharma now needs from its AI: an unbroken, gradeable chain from claim to primary source. The rigour is the product.
Appendix C · Reasoning
How the conclusion was reached, run four ways — so you can find the seam if there is one
This piece rests on one claim: as AI moves from advising to deciding in drug research, the advantage goes to whoever can prove how a claim was reached. Richard Feynman’s first rule of honest thinking was that you must not fool yourself, and you are the easiest person to fool. The cure is independence: if you reach the same answer by several routes that do not lean on each other, the odds that you fooled yourself on every one grow small. We ran this claim down four such routes. Here is where they all land:
All four roads reach the same place — the advantage is no longer having AI; it is being able to check it.
Follow the evidence
Start from the record and trace it forward to where it leads.
Assume it, then check
Suppose the claim is true; confirm each condition it would require (working backward).
The recurring pattern
One regularity shows up across many independent cases (induction).
From a basic principle
From a premise almost no one rejects, the conclusion follows of necessity (deduction).
Four roads, one destination. Any single road you might doubt — perhaps we picked the evidence, perhaps the pattern is a fluke. But it is hard to fool yourself the same way four independent times. That convergence is the reason to trust the answer, and it is why BigBio shows every step: a conclusion you can reach four ways, each tied to a document you can open yourself, is one you can act on.
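The convergence argument has a simple quantitative form. The sketch below assumes, as an idealisation, that the routes are fully independent and that each one misleads with the same probability; the function name and the one-in-five figure are illustrative values chosen here, not measurements.

```python
def chance_all_routes_wrong(p_wrong: float, routes: int) -> float:
    """Probability that every route misleads, assuming the routes are
    independent and each misleads with the same probability p_wrong."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability")
    return p_wrong ** routes


# Illustrative value only: a one-in-five chance of fooling yourself per route.
for n in (1, 2, 4):
    print(n, round(chance_all_routes_wrong(0.2, n), 4))
# 1 0.2
# 2 0.04
# 4 0.0016
```

The product holds only to the extent that the routes really are independent; routes that lean on the same evidence are correlated, and the chance of a shared error is then higher than the product suggests.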