While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with "workflow friction" from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations, which are static combinations of prompts and data that perform a fixed task, and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use cases, such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction, redefines LLM use as an institutional capability. This system, accessible after user training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations, and 1,075 users completed training to become routine users of the UI, engaging in 23,000 sessions in the first 3 months after launch. For automations, being model-agnostic and accessing multiple types of data were essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluating the UI, so new methods were required to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M in savings in the first year of use, without quantifying the benefit of the better care offered. Such a "build-from-within" strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.
In this paper, the authors describe the development and implementation of ChatEHR, an internally managed, vendor-agnostic system designed to integrate large language models (LLMs) directly into clinical workflows at Stanford Medicine. By connecting real-time clinical data to LLMs, the platform supports both automated, routine tasks (such as pre-charting or eligibility screening) and open-ended conversational support for clinicians through an embedded UI. This end-to-end approach addresses the workflow friction and compliance challenges inherent in using external commercial AI tools.
The deployment was managed through an institution-wide Responsible AI Lifecycle (RAIL) framework covering risk tiering, ethical assessment, and continuous monitoring of system integrity, performance, and impact. Over 1.5 years, the team launched seven automations and a clinical UI that has processed millions of tokens for over 1,000 routine users. The authors report clinical and operational benefits from the platform's ability to retrieve information across entire patient timelines.
Evaluation of the platform showed how LLMs perform in real-world clinical use. Summarization was the most frequently requested task, and it carried a measurable error rate: an estimated 0.73 hallucinations and 1.60 inaccuracies per generation, which prompted the authors to adapt fact-verification methodologies. Economic analysis indicated substantial potential for time and cost savings, with initial estimates of $6M in savings in the first year of use. The authors conclude that embedding interdisciplinary teams, including data scientists and clinicians, into the healthcare IT organization is essential for building sustainable, scalable AI infrastructure that balances operational efficiency with the safe, responsible, and ethical delivery of care.
1. ChatEHR successfully integrates LLMs with longitudinal patient records spanning several years.
2. The system supports both static automations for fixed tasks and an interactive chat interface for open-ended queries.
3. Over 1.5 years, 7 automations were developed and deployed.
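The idea of an automation as a static combination of a prompt and data, paired with a model chosen per task, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Automation` class, the `ModelFn` type, the record fields (`date`, `type`, `text`), and the `echo_model` stand-in are all hypothetical names introduced here for illustration, and no real LLM or EHR API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function mapping a prompt string to a completion.
# Treating models as plain callables is what keeps the sketch model-agnostic.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class Automation:
    """A fixed task: one prompt template plus the record types it reads."""
    name: str
    prompt_template: str          # must contain a {timeline} placeholder
    data_types: Tuple[str, ...]   # illustrative record types, e.g. ("notes", "labs")
    model: str                    # key into a registry of available models

    def build_prompt(self, timeline: List[dict]) -> str:
        # Keep only the record types this automation is configured to read.
        selected = [r for r in timeline if r["type"] in self.data_types]
        # ISO dates sort lexicographically, giving chronological order.
        selected.sort(key=lambda r: r["date"])
        lines = [f'{r["date"]} [{r["type"]}] {r["text"]}' for r in selected]
        return self.prompt_template.format(timeline="\n".join(lines))

    def run(self, timeline: List[dict], models: Dict[str, ModelFn]) -> str:
        # Look up the configured model and send it the assembled prompt.
        return models[self.model](self.build_prompt(timeline))


def echo_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption: no network access here).
    return f"{len(prompt.splitlines())} lines received"


# Made-up records for illustration only; not real patient data.
timeline = [
    {"date": "2024-03-02", "type": "labs", "text": "WBC 14.2"},
    {"date": "2023-11-20", "type": "notes", "text": "Post-op day 1, wound clean"},
    {"date": "2024-01-15", "type": "imaging", "text": "CT abdomen unremarkable"},
]

ssi = Automation(
    name="ssi_monitor",
    prompt_template="Review for signs of surgical site infection:\n{timeline}",
    data_types=("notes", "labs"),
    model="echo",
)

result = ssi.run(timeline, {"echo": echo_model})
```

Because the prompt, the data selection, and the model key are all fixed fields, swapping in a different model for a given task only requires changing the `model` key and registering another callable. The interactive UI, by contrast, would accept free-form prompts rather than a fixed template.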
The authors explain that ChatEHR was built as a strategic response to the risks and inefficiencies of relying on commercial, vendor-dependent AI tools. They emphasize a "build-from-within" philosophy that prioritizes institutional agency, responsible AI governance (supported by the RAIL framework), and deep clinical-technical integration. Future directions include developing automated, continuously updated truth sets for fact verification, enhancing UI guardrails to flag uncertainty, and adding automatic citations to trace generated facts back to source medical records.