Building Reliable Agentic AI Systems

The Challenge: Navigating the Preclinical Data Maze

Preclinical drug discovery is inherently complex and data-intensive. Researchers face the significant challenge of efficiently accessing and analyzing vast volumes of information generated during this critical phase. Traditional keyword-based search methods, often reliant on rigid Boolean logic, frequently fall short when confronted with the nuanced and intricate nature of preclinical research questions.

The advent of Large Language Models (LLMs) has presented a transformative opportunity. By combining the generative power of LLMs with the precision of information retrieval systems, Retrieval-Augmented Generation (RAG) has emerged as a promising technique. This approach holds the potential to revolutionize preclinical data access, enabling researchers to pose complex questions in natural language and receive accurate, context-rich answers grounded in proprietary data.

Recognizing this potential early, Bayer committed to exploring how these technologies could address longstanding challenges in preclinical research. In this post, we share that journey: how Bayer's early investment in generative AI has resulted in PRINCE, an agentic AI system built on Agentic RAG. This case study explores the technical architecture, engineering decisions, and lessons learned in transforming preclinical data retrieval from a challenging maze into an intuitive conversational experience.

Many of the engineering decisions behind PRINCE can now be understood through the lens of context engineering and harness engineering, although when the system was first designed we did not use these terms. Context engineering shaped what information each model received, what it did not receive, and how context moved between specialized steps such as research, reflection, and writing. Harness engineering shaped the scaffolding around the models: orchestration, tool boundaries, state persistence, retries, fallbacks, validation, reflection loops, observability, and human review.

While this post focuses on the technical architecture and engineering challenges, our paper published in Frontiers in Artificial Intelligence covers the product evolution and business impact in more detail.

The Solution: PRINCE - An Evolutionary Platform

To address these challenges, Bayer developed the Preclinical Information Center (PRINCE) platform. PRINCE was conceived as a unified gateway to preclinical data, initially focusing on consolidating previously siloed structured study metadata and making it searchable. In this initial phase, users could apply advanced filters and retrieve information primarily from structured study metadata.

However, a significant portion of Bayer's valuable preclinical knowledge resides within unstructured PDF study reports accumulated over decades. Due to numerous system migrations over the years, the structured metadata associated with these reports could be incomplete, missing, or even contain incorrect annotations. Crucially, the authoritative "gold standard" information was consistently present within the approved PDF study reports.

The emergence of Generative AI, particularly RAG, provided the key to unlocking this wealth of unstructured data. By integrating RAG capabilities, PRINCE began to shift the paradigm from a filter-based 'search' tool to a natural language 'ask' system, enabling researchers to query the content of these study reports directly.

This evolution reflects PRINCE's progression through three distinct phases:

  • Search: the initial phase focused on creating a unified gateway to thousands of nonclinical study reports, consolidating multiple in-house data silos from various preclinical domains into a searchable format, primarily leveraging structured metadata.
  • Ask: this phase introduced an AI-powered question-answering system using Retrieval-Augmented Generation (RAG). This enabled researchers to derive insights directly from unstructured data, including scanned PDFs from historical reports, by posing questions in natural language.
  • Do: the current phase positions PRINCE as an active research assistant capable of executing complex tasks. This is achieved through the integration of multi-agent systems, allowing the platform to handle intricate queries, orchestrate workflows, and support activities like drafting regulatory documents.

This deliberate evolution from Search to Ask to Do represents a strategic response to the industry's need for greater efficiency and innovation in preclinical development. By providing researchers with increasingly powerful tools to access, analyze, and act upon preclinical data, PRINCE aims to enable faster data-driven decision-making, reduce the need for unnecessary experiments, and ultimately accelerate the development of safer, more effective therapies.

System Architecture: Engineering a Reliable Agentic RAG System

The system functions as an interactive conversational UI, powered by a robust backend infrastructure. Its architecture, designed for handling complex queries and delivering accurate, context-rich answers, is orchestrated using LangGraph and served via a FastAPI application. Figure 1 provides the system context (UI, backend, data stores, LLM fallbacks, and observability), while Figure 2 zooms into how the system coordinates its specialized agents.

Figure 1: System context and supporting platforms.

  • User Request: the process begins when a user submits a request through the Conversational UI, which is built with React.
  • Orchestration: the user's request is routed to a LangGraph-based orchestration layer in the backend. This workflow engine coordinates a multi-stage process that progresses through clarifying user intent, thinking and planning, conducting research (using RAG and Text-to-SQL), validating data completion, and finally generating a response through the Writer agent. The workflow includes deliberate pause points and feedback loops to ensure data completeness before proceeding. (We explore the details of this agentic workflow in a dedicated section later.)
  • Data Retrieval and State Management: the Researcher agents interact with a comprehensive and distributed data ecosystem:
    • Vector representations of all study reports are stored in OpenSearch, forming the core knowledge base for information retrieval.
    • Curated structured data, resulting from various ETL and harmonization processes, is accessed via Athena.
    • The state of the agent's execution is meticulously tracked. After each logical step (a LangGraph node execution), the corresponding state is persisted in PostgreSQL using a LangGraph checkpointer.
    • Broader application-level state is managed in DynamoDB.
    • The system leverages internal GenAI platforms that host models from OpenAI, Anthropic, Google, and open-source providers. These platforms expose all models via a unified OpenAI-compatible endpoint, making it easy to swap models and choose the best tool for each task. They also manage the control plane, enforcing rate limits and other safeguards to prevent abuse.
  • Resilience and Error Handling: robustness is a critical design principle, with multiple fallback mechanisms in place:
    • If a specific LLM fails, the system automatically retries the request several times before falling back to an alternative model or platform to ensure service continuity.
    • To recover quickly from transient failures, retries are implemented at both the individual LLM call level and the logical node level (i.e., an entire step in the agent's plan).
    • Agents also receive the context of the errors, so they can chart a different trajectory or an alternative plan of action in response.
  • Observability and Evaluation: the entire system is monitored for performance and reliability:
    • General system health and metrics are tracked using CloudWatch.
    • Langfuse serves as the primary observability tool, providing detailed traces of all production traffic for in-depth debugging. Evaluation datasets are also stored and managed within Langfuse, making it easier to analyze performance scores and diagnose specific failures. Evaluation uses the RAGAS framework: live traffic is evaluated daily, while dataset evaluation runs whenever significant changes are made to the core workflow, prompts, or underlying models.
  • Final Response: once the agents have processed the request and generated a satisfactory response, it is sent back to the Conversational UI to be presented to the user.
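
The orchestration and state-tracking bullets above can be illustrated with a small sketch. This is plain Python, not LangGraph's API: the node names, the `State` shape, the two-findings completeness rule, and the in-memory `checkpoints` list (standing in for the PostgreSQL checkpointer) are all assumptions made for illustration.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

def clarify(state: State) -> str:
    state["intent"] = str(state["question"]).strip()
    return "plan"

def plan(state: State) -> str:
    state["plan"] = ["search reports"]
    return "research"

def research(state: State) -> str:
    # Hypothetical: each research pass contributes one more finding.
    evidence = list(state.get("evidence", []))
    evidence.append(f"finding {len(evidence) + 1}")
    state["evidence"] = evidence
    return "validate"

def validate(state: State) -> str:
    # Feedback loop: return to research until the data is judged complete.
    return "write" if len(state["evidence"]) >= 2 else "research"

def write(state: State) -> str:
    state["answer"] = "; ".join(state["evidence"])
    return "end"

NODES: Dict[str, Callable[[State], str]] = {
    "clarify": clarify, "plan": plan, "research": research,
    "validate": validate, "write": write,
}

def run(question: str, max_steps: int = 20) -> Tuple[State, List[Tuple[str, State]]]:
    state: State = {"question": question}
    checkpoints: List[Tuple[str, State]] = []
    node = "clarify"
    for _ in range(max_steps):
        if node == "end":
            break
        next_node = NODES[node](state)
        # Persist a snapshot after every node so a failed run can resume here.
        checkpoints.append((node, copy.deepcopy(state)))
        node = next_node
    return state, checkpoints
```

Snapshotting after each node, rather than only at the end, is what lets a later retry restart from the last completed step instead of from the user's request.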

A design principle running through this architecture is context discipline. Larger context windows did not remove the need to be selective about what each agent sees. In early iterations, putting too much information into the context made the system harder to steer and harder to evaluate. PRINCE therefore avoids treating the prompt as one large container for all available information. Instead, different stages receive different context: planning context for Think & Plan, retrieval context for the Researcher Agent, evidence context for the Reflection Agent, and synthesis context for the Writer Agent. This reduces context pollution and makes the system easier to debug, evaluate, and improve.
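
One way to picture this stage-specific context is an allow-list per stage. The sketch below is an assumption about how such filtering could look, not PRINCE's implementation; the stage keys and state field names are hypothetical.

```python
from typing import Any, Dict, Tuple

# Hypothetical allow-list: which state fields each stage is permitted to see.
STAGE_CONTEXT: Dict[str, Tuple[str, ...]] = {
    "think_and_plan": ("question", "available_tools", "steps_taken"),
    "researcher": ("question", "current_task", "selected_sources"),
    "reflection": ("question", "evidence"),
    "writer": ("question", "evidence", "output_format"),
}

def context_for(stage: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the fields the given stage is allowed to receive."""
    allowed = STAGE_CONTEXT[stage]
    return {key: state[key] for key in allowed if key in state}
```

With a state that holds the question, tool list, and evidence, `context_for("writer", state)` would omit the tool list, and `context_for("think_and_plan", state)` would omit the evidence. Keeping the rule in one table also makes it easy to audit what each agent could have seen when debugging a trace.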

These steps ensure that the system can provide reliable and contextually relevant answers to a wide range of complex queries by leveraging a sophisticated, multi-agent architecture and a diverse set of powerful tools and data sources.

The Agentic RAG System

PRINCE incorporates an agentic RAG system (Figure 2) to handle complex user requests that require multiple steps, reasoning, and interaction with different tools or data sources. This setup, implemented using LangGraph, orchestrates the overall workflow and delegates specific tasks to the Researcher Agent, Writer Agent, and Reflection Agent. It is designed to be robust, with multiple fallback mechanisms so that the system continues to function even if some components fail.

Figure 2: The research workflow.

Clarify User Intent

The Clarify User Intent step serves as the first line of defense against ambiguity. As the system scaled to include diverse domains like toxicology and pharmacology, simple user queries often became ambiguous, making it difficult to automatically select the right tools. Rather than relying on expensive trial-and-error across all data sources, the system proactively asks clarifying questions to pinpoint the specific domain or data type. This ensures the system enhances the query with the necessary constraints to target the correct tools.

We are also optimizing this by developing domain-level selection in the UI, which will allow users to pre-filter valid tools upfront. To further reduce friction, the system also provides AI-assisted source recommendations: when a user has not selected any data source, or has selected several without a clear focus, the model analyzes the intent behind the user's query and suggests the most relevant sources. The user retains full control and can accept, adjust, or override the recommendation, ensuring domain expertise always has the final say.

This "fail-fast" mechanism prevents wasted execution on vague queries, while careful tuning ensures the system remains unobtrusive when the intent is already clear. From a context engineering perspective, this step is the first assembly decision in the workflow: it constrains which tools, domains, and data sources will be in scope before any retrieval begins, ensuring subsequent agents receive a focused rather than open-ended problem.
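
The fail-fast behavior can be sketched as a function that either resolves a single domain or returns a clarifying question. In PRINCE the intent analysis is done by a model; the keyword table below is a deliberately simple stand-in, and every name in it (`DOMAIN_HINTS`, `IntentResult`, `clarify_intent`) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Set

# Hypothetical keyword hints; a real system would ask a model instead.
DOMAIN_HINTS: Dict[str, Set[str]] = {
    "toxicology": {"toxicity", "noael", "adverse"},
    "pharmacology": {"efficacy", "potency", "receptor"},
}

@dataclass
class IntentResult:
    domains: List[str]
    clarifying_question: Optional[str] = None

def clarify_intent(query: str, selected: Optional[Sequence[str]] = None) -> IntentResult:
    # An explicit single selection by the user always wins.
    if selected and len(selected) == 1:
        return IntentResult(list(selected))
    words = set(re.findall(r"[a-z]+", query.lower()))
    matches = sorted(d for d, hints in DOMAIN_HINTS.items() if words & hints)
    if len(matches) == 1:
        return IntentResult(matches)
    # Ambiguous or unknown: stop early and ask instead of searching everything.
    candidates = matches or sorted(DOMAIN_HINTS)
    question = f"Which domain should I search: {', '.join(candidates)}?"
    return IntentResult(candidates, question)
```

A query such as "What was the NOAEL in rats?" resolves to toxicology without interrupting the user, while "Summarize the study results" produces a clarifying question before any retrieval runs.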

Think & Plan: Process Reflection

The Think & Plan step is responsible for devising a strategy to fulfill the user's request. This critical component gives the system a dedicated space to reason about the next steps before taking action, a technique inspired by Anthropic's Think tool. Importantly, this step performs process reflection: evaluating whether the agent is making the right progress toward its end goal and is on the right trajectory, rather than evaluating the data itself.

In multi-step agentic workflows, particularly those involving many sequential actions, process reflection is essential. Consider a scenario where the system needs to execute 50 steps to complete a complex task. At each juncture, the system must ask: Am I taking these steps in the right manner? Am I making the progress I'm supposed to make? Is the current trajectory still aligned with the original goal? This reflective loop prevents the agent from drifting into unproductive paths and ensures efficient progress toward the desired outcome.
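
In PRINCE this check is a reasoning step performed by a model. As a rough, mechanical analogy only, the sketch below tracks how many sub-goals remain open and flags a stalled trajectory; the `TrajectoryMonitor` class, its thresholds, and the notion of counting open sub-goals are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryMonitor:
    max_steps: int = 50      # hard budget for the whole task
    check_every: int = 5     # how often to ask "am I still making progress?"
    history: List[int] = field(default_factory=list)  # open sub-goals after each step

    def record(self, open_subgoals: int) -> str:
        """Record one completed step and return the next control decision."""
        self.history.append(open_subgoals)
        steps = len(self.history)
        if open_subgoals == 0:
            return "done"
        if steps >= self.max_steps:
            return "stop"
        if steps % self.check_every == 0:
            # Compare against the start of the current window of steps.
            window_start = self.history[-self.check_every]
            if open_subgoals >= window_start:
                return "replan"  # no progress across the window
        return "continue"
```

A "replan" result would hand control back to Think & Plan rather than letting the agent continue down an unproductive path.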

Building Trust in a Production LLM System

Trust is a foundational requirement for any production AI system, especially in regulated environments like pharmaceuticals. PRINCE prioritizes trust through transparency, explainability, and human-in-the-loop integration.

  • Transparency: the system surfaces its reasoning process to users, showing which sources were consulted, how conclusions were reached, and what confidence levels apply to different parts of the answer.
  • Explainability: each response includes citations to the underlying study reports, allowing researchers to verify claims against original sources.
  • Human-in-the-loop: critical decisions and draft outputs are routed for human review before finalization, ensuring domain expertise overrides any model errors.
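
The explainability point lends itself to a simple automated check before human review. The sketch below flags sentences that carry no citation, or that cite a source that was never retrieved. The bracketed citation format (`[R1]`) and the function name are assumptions; the post does not describe how PRINCE encodes citations.

```python
import re
from typing import List, Set

# Hypothetical citation marker: a source ID in square brackets, e.g. [R1].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def sentences_needing_review(answer: str, retrieved_ids: Set[str]) -> List[str]:
    """Return sentences with no citation or with a citation to an unknown source."""
    flagged: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged
```

Flagged sentences could then be routed to a reviewer or sent back to the Writer Agent, so that unsupported claims do not reach the final answer unnoticed.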

Engineering for Resilience: Error Handling and Recovery

Robustness is engineered at multiple layers of the system:

  • LLM-level retries: if a specific model call fails, the system retries several times before falling back to an alternative model or platform.
  • Node-level retries: entire steps in the agent's workflow can be retried to recover from transient failures without restarting the entire process.
  • Error-aware agents: when errors occur, agents receive context about the failure, enabling them to adapt their strategy or choose an alternative approach.
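
The LLM-level retry and fallback behavior can be sketched as follows. This is a minimal illustration under stated assumptions: the `call_with_fallback` helper, the `(name, callable)` model list, and the exponential backoff are not taken from PRINCE, and the model callables stand in for real client calls.

```python
import time
from typing import Callable, List, Sequence, Tuple

ModelCall = Callable[[str], str]

def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, ModelCall]],
    retries_per_model: int = 3,
    base_delay: float = 0.5,
) -> Tuple[str, List[str]]:
    """Try each model in order, retrying before moving on to the next.

    Returns the answer plus the errors seen along the way, so the calling
    agent can take those failures into account when choosing its next step.
    """
    errors: List[str] = []
    for name, call in models:
        for attempt in range(1, retries_per_model + 1):
            try:
                return call(prompt), errors
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                errors.append(f"{name} attempt {attempt}: {exc}")
                time.sleep(base_delay * 2 ** (attempt - 1))  # assumed backoff policy
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the error list alongside the answer is one way to make agents error-aware: the same strings can be placed in the next prompt so the agent knows which path already failed.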

Enhancing Data Quality: Named Entity Recognition and Annotation

To address the challenge of incomplete or incorrect structured metadata, PRINCE incorporates Named Entity Recognition (NER) and annotation pipelines. These processes extract key entities from unstructured PDF reports, such as compound names, study types, species, and dosage information, and use them to enrich the structured metadata layer. This iterative improvement cycle ensures that the system's knowledge base becomes more accurate and comprehensive over time.
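
The enrichment step can be sketched as extract-then-merge. The regular expressions below are a toy stand-in for a real NER model, and the field names, species vocabulary, and the choice to flag conflicts rather than overwrite existing values are all assumptions for illustration.

```python
import re
from typing import Dict, List, Tuple

SPECIES = ("rat", "dog", "rabbit", "monkey")  # hypothetical vocabulary
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mg/kg")

def extract_entities(text: str) -> Dict[str, str]:
    """Toy entity extraction: first known species and first mg/kg dose."""
    entities: Dict[str, str] = {}
    lowered = text.lower()
    for species in SPECIES:
        if re.search(rf"\b{species}s?\b", lowered):
            entities["species"] = species
            break
    dose = DOSE_PATTERN.search(lowered)
    if dose:
        entities["dose_mg_per_kg"] = dose.group(1)
    return entities

def enrich(metadata: Dict[str, str], entities: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Fill missing metadata fields; report disagreements instead of overwriting."""
    enriched = dict(metadata)
    conflicts: List[str] = []
    for key, value in entities.items():
        current = enriched.get(key)
        if current in (None, ""):
            enriched[key] = value
        elif current != value:
            conflicts.append(key)  # leave for review; the report is the reference
    return enriched, conflicts
```

Because the approved report is treated as the authoritative source, a conflict list like this gives curators a concrete queue of annotations to check against the PDF.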

The Journey Continues: Iterative Development

PRINCE is not a finished product but an evolving platform. The team continues to refine the system based on user feedback, new model capabilities, and emerging best practices in agentic AI. Key areas of ongoing development include:

  • Expanding domain coverage to additional preclinical disciplines
  • Improving the efficiency of multi-step workflows
  • Enhancing the reflection and self-correction capabilities of agents
  • Deepening integration with downstream regulatory processes

Conclusion

PRINCE demonstrates AI's transformative potential in pharmaceuticals, significantly improving data accessibility and research efficiency while ensuring governance and compliance. By combining context engineering (carefully shaping what information each agent receives) with harness engineering (building robust orchestration, recovery, and observability around the models), the system achieves the reliability required for production use in a regulated industry.

The evolution from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents represents a paradigm shift in preclinical data access. As the system continues to mature, it promises to accelerate drug development and bring safer, more effective therapies to patients faster.
