Patient-level hospital data in electronic health records contain codified tabular data, such as checkboxes for the diagnoses recorded and the medications prescribed, and unstructured narrative notes, such as clinicians' visit reports, comments, and justifications. When analyzing hospital data, researchers often need to manually review the narrative notes to confirm the codified data, which may be misclassified, or to derive additional detailed variables and labels. This process is commonly known as chart review.
Because it is so time-intensive, chart review is usually performed by experienced researchers and only on a small subset of the cohort of interest. A typical task is to produce gold-standard labels on, say, 100 patients to measure the performance of algorithms, and then apply the best algorithms to the broader cohort of, say, 10,000 patients. Depending on the field of interest, this becomes a crucial bottleneck: in mental health, codified data are especially limited because many diagnoses co-occur, their definitions are blurry, and their severity varies widely. In suicide prevention, for example, researchers have long included very broad diagnosis codes in survival analyses to capture more events, even at the cost of many false positives; this process of identifying less specific codes relies extensively on chart review, is highly population-specific, and is therefore hard to reuse across studies.
The advent of LLMs enables us to tackle these pressing challenges with agents, here defined as multi-step rule-based pipelines integrating LLMs. In this talk I will present several chart review agent projects we led in collaboration with hospitals, such as:
- deriving disease activity scores in inflammatory bowel disease
- identifying rheumatology treatment contraindications such as heart failure
- extracting suicide risk factors to create inpatient admission triage models
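
To make the "multi-step rule-based pipeline" definition concrete, here is a minimal sketch of what such an agent can look like, using the heart-failure contraindication example. The pipeline shape, keywords, prompts, model name, and local endpoint are illustrative assumptions rather than the exact implementation used in these projects; the sketch assumes a locally served model behind an OpenAI-compatible API, as provided by Ollama or vLLM.

```python
# Minimal sketch of a chart review agent: a multi-step, rule-based pipeline
# wrapping an LLM. All names, keywords, and prompts are illustrative.
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. Ollama on its default port).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

KEYWORDS = ("heart failure", "reduced ejection fraction", "cardiomyopathy")

def screen_note(note_text: str) -> bool:
    """Step 1 (rules): keep only notes that mention candidate terms."""
    text = note_text.lower()
    return any(keyword in text for keyword in KEYWORDS)

def classify_note(note_text: str) -> str:
    """Step 2 (LLM): ask a narrow, structured question about one note."""
    response = client.chat.completions.create(
        model="llama3.1",  # any locally served model tag
        messages=[
            {"role": "system",
             "content": "You are reviewing one clinical note. Answer YES or NO only."},
            {"role": "user",
             "content": f"Does this note document heart failure for this patient?\n\n{note_text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

def review_patient(notes: list[str]) -> str:
    """Step 3 (rules): aggregate note-level answers into a patient-level label."""
    answers = [classify_note(n) for n in notes if screen_note(n)]
    return "contraindication" if any(a.startswith("YES") for a in answers) else "no evidence"
```

In a real project each step would typically be more elaborate (note chunking, structured output parsing, logging every prompt and answer for traceability), but this illustrates the overall shape: deterministic rules around narrow LLM calls.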
Reasoning models have greatly improved the capabilities of LLMs by reducing the occurrence of hallucinations, but we still encounter hallucinations regularly, so applications within the medical domain remain limited. By looking at what these projects have in common, we can identify the key steps required for medical AI software to be sound and traceable.
While medical applications still require substantial clinical expertise to build useful pipelines, and each case study demands extensive effort, LLM pipelines enable us to resolve pressing challenges that were unsolvable at scale with codified tabular data alone. The richness of narrative notes enables us to build more sophisticated models, especially in domains with uncertainty and blurry diagnoses. In our mature projects we usually obtain 90 to 95% accuracy compared to human reviewers, and disagreements between humans and agents lead to revising the original human annotation about 30 to 50% of the time, which shows the value of integrating both LLMs and clinicians within the review process. The use of agents for healthcare chart review has the potential to decrease the financial burden of chart review, increase its scale, and unlock new analysis capabilities for scientific discovery.
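
For context on how these two figures can be computed: accuracy is the agreement between the agent and the original human labels on the gold-standard subset, while the revision rate is measured among disagreements only, after a clinician adjudicates each one. The snippet below is an illustrative calculation on a hypothetical record format, not code from the projects.

```python
# Illustrative: compute agent-vs-human accuracy and the share of disagreements
# that end in revising the original human annotation. The "human", "agent",
# and "adjudicated" fields are a hypothetical evaluation-set format.
def evaluate(records: list[dict]) -> tuple[float, float]:
    accuracy = sum(r["agent"] == r["human"] for r in records) / len(records)
    disagreements = [r for r in records if r["agent"] != r["human"]]
    revised = sum(r["adjudicated"] != r["human"] for r in disagreements)
    revision_rate = revised / len(disagreements) if disagreements else 0.0
    return accuracy, revision_rate
```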
In this talk I will showcase how to build open-source pipelines with publicly downloadable pre-trained models and how they compare to commercial models, explain the key differences between traditional LLMs and reasoning models, and distinguish reasoning-heavy from NLP-heavy applications. I will give concrete examples with pre-trained models such as Llama, Mistral, DeepSeek, Qwen, and GPT-OSS, and compare the open-source serving software in the field, such as Ollama and vLLM.
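
Because both Ollama and vLLM expose an OpenAI-compatible HTTP API, the same client code can be pointed at either backend (or at a commercial endpoint) by changing only the base URL and model name, which is what makes these comparisons practical. The ports, model tags, and helper function below are illustrative defaults to adapt to your setup.

```python
# Sketch: one client function, two interchangeable open-source backends.
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint (default port 11434), e.g. after `ollama pull mistral`.
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# vLLM's OpenAI-compatible server (default port 8000), e.g. after
# `vllm serve Qwen/Qwen2.5-7B-Instruct`.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(client: OpenAI, model: str, question: str) -> str:
    """Send one chart review question to whichever backend is configured."""
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return out.choices[0].message.content

# Same prompt, different backends and models:
# ask(ollama, "mistral", "Does this note mention a prior suicide attempt? ...")
# ask(vllm, "Qwen/Qwen2.5-7B-Instruct", "Does this note mention a prior suicide attempt? ...")
```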



