Behind the Tech: What It Takes to Train LLMs for Real-World Oncology Use - with Johannes Hoster
We’re going Behind the Tech to spotlight the team behind our integrated approach, combining strategic consulting with purpose-built technology to help clients plan and execute projects with speed, scale and precision.
Our latest spotlight features Johannes Hoster, a data analyst and recent joiner at GIPAM, who’s been working on improving how OncoCase extracts structured data from complex, unstructured medical documents. In his first month, he’s focused on fine-tuning large language models and generating synthetic data to support safer, more scalable automation in oncology.
Hi Johannes! You’ve just joined GIPAM. How’s your first month been, and what excited you about the role?
Hi, and thanks again for the warm welcome! My first month at GIPAM has been a really rewarding experience. I’ve been diving into how we can use Large Language Models (LLMs) to extract structured data from complex medical documents, and I’ve loved the challenge. It’s quite a shift from my previous work in computer vision and image-based models, but getting to grips with how to steer LLMs toward pulling out the right information (and doing it well) has been genuinely exciting.
What I look forward to most about the role is the clear impact it can have. There’s real purpose behind the work: we’re helping reduce manual data entry, save time for medical professionals, and ultimately create more space for patient care. That combination of technical challenge and meaningful outcome is exactly what motivates me.
You’re working on OncoCase—can you tell us a bit about what it does, and maybe something people might not realise about how it works or what makes it unique?
Absolutely. OncoCase transforms unstructured medical documents into structured data, making clinical, treatment, and outcome information accessible for analysis and reporting. That alone creates real value, but what many don’t realize is that OncoCase also handles key administrative tasks.
It supports automated submissions to cancer registries, generates reports for tumor boards and certifications, and even manages non-interventional study workflows. This blend of smart data extraction and practical services makes OncoCase uniquely effective in everyday oncology practice.
You mentioned saving time for medical professionals—how does structured data help make that happen in practice?
Structured data is incredibly valuable in medicine because it enables faster decision-making, easier documentation, and better integration with other systems. Right now, a lot of patient information is entered manually, which is time-consuming and error-prone. Automating that process means clinicians can spend less time typing and more time focusing on patients.
For OncoCase users, this means faster workflows, fewer missed details, and more consistent documentation. Ultimately, that can lead to better treatment planning and better outcomes for patients, which is exactly the kind of real-world impact we’re aiming for.
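To make the idea of structured data concrete, here is a minimal sketch of what an extraction target might look like. The field names are hypothetical examples, not OncoCase’s actual schema; the point is that every value lands in a typed, queryable slot rather than in free text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TumorCaseRecord:
    """Hypothetical extraction target: every field is typed and queryable."""
    patient_id: str
    diagnosis_icd10: str             # e.g. "C50.9" (malignant neoplasm of breast)
    tnm_stage: Optional[str] = None  # e.g. "pT2 pN1 M0"
    histology: Optional[str] = None
    treatment: Optional[str] = None  # e.g. "neoadjuvant chemotherapy"

# A free-text finding like "Invasive ductal carcinoma, pT2 pN1 M0" becomes
# a record that registries, reports, and analytics can consume directly:
record = TumorCaseRecord(
    patient_id="anon-0001",
    diagnosis_icd10="C50.9",
    tnm_stage="pT2 pN1 M0",
    histology="invasive ductal carcinoma",
)
```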
You’ve been generating synthetic data to help train the model. What’s the thinking behind that, and how does it improve the outcome?
Training machine learning models requires large amounts of high-quality labeled data: data where the correct output is already known. But building such datasets from real medical documents is time-consuming and comes with risks, especially around data privacy and potential leakage.
That’s where synthetic data comes in. By generating artificial documents that closely resemble real ones, along with their correct labels, we can train models in a safe, scalable, and efficient way. It also allows us to control edge cases and create highly diverse training scenarios.
Once the model is trained on this synthetic data, we validate its performance on real-world samples to ensure the learnings transfer well. This approach significantly accelerates development while maintaining data security and performance quality.
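As a rough illustration of the principle (not GIPAM’s actual generator), a template-based sketch might pair each synthetic document with the labels it was built from, so the ground truth is known by construction:

```python
import random

# Hypothetical building blocks; a real generator would draw on far richer
# vocabularies, layouts, and noise models (OCR errors, abbreviations, ...).
DIAGNOSES = [("C18.7", "sigmoid colon carcinoma"), ("C50.9", "breast carcinoma")]
STAGES = ["pT1 pN0 M0", "pT2 pN1 M0", "pT3 pN2 M1"]

def make_sample(rng: random.Random) -> tuple[str, dict]:
    """Return (synthetic document text, ground-truth labels)."""
    icd, name = rng.choice(DIAGNOSES)
    stage = rng.choice(STAGES)
    text = (
        f"Diagnosis: {name} ({icd}).\n"
        f"Tumor staging: {stage}.\n"
        "The case was discussed in the interdisciplinary tumor board."
    )
    labels = {"icd10": icd, "tnm": stage}  # known by construction
    return text, labels

rng = random.Random(42)  # seeded, so the dataset is reproducible
dataset = [make_sample(rng) for _ in range(1000)]
```

Because the labels come from the templates themselves, no manual annotation and no real patient data are needed, and diversity or edge cases can be dialed in by widening the building blocks.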
How do you go about optimising large language models for tasks as specific as these?
Creating robust solutions for specific tasks, like extracting structured data from medical documents, always starts with clearly defining the output goals and understanding the variability in real-world inputs. We break the task into smaller, manageable components and apply prompt engineering, multiple fine-tuning methods, and systematic evaluation to guide the model toward consistent behavior.
A key part of our approach is using an ensemble pipeline: combining Optical Character Recognition (OCR) to extract text from scanned documents, traditional rule-based retrieval for structured patterns, and LLMs to interpret and contextualize the data. This hybrid setup enhances reliability and precision, especially in challenging cases with inconsistent formatting or poor scan quality. It also significantly improves the overall retrieval-augmented generation (RAG) process by letting each component play to its strengths.
We validate the system with both real and synthetic data, analyze failure modes, and refine the pipeline iteratively.
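A heavily simplified sketch of that hybrid logic, with a hypothetical llm_extract standing in for the model call and OCR assumed to have already produced the text, could look like this:

```python
import re
from typing import Optional

def extract_tnm(text: str) -> Optional[str]:
    """Rule-based pass: TNM staging follows a rigid pattern, so a regex is
    cheaper and more precise than asking a model."""
    match = re.search(r"\b[ypc]?T[0-4][a-c]?\s+[ypc]?N[0-3][a-c]?\s+M[01]\b", text)
    return match.group(0) if match else None

def extract_field(text: str, field: str, llm_extract) -> Optional[str]:
    """Ensemble logic: try the deterministic rule first, and fall back to
    the LLM only where patterns don't apply or don't match.
    llm_extract(text, field) is a stand-in for whatever model call is used."""
    if field == "tnm":
        hit = extract_tnm(text)
        if hit is not None:
            return hit
    return llm_extract(text, field)
```

The design choice is simply that deterministic rules handle what they can verify exactly, while the model covers the long tail of free-form phrasing.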
This kind of automation could really streamline data collection in oncology. What kind of potential do you see for it in the real world?
The potential goes far beyond saving time. Automating data collection in oncology enables a level of scale and consistency that manual processes simply can’t match. As more clinical and treatment data becomes available in structured form, it opens new possibilities not just for operational efficiency, but for deeper insight into patient outcomes and care patterns.
In practice, this could support smarter decision-making across entire cancer centers, make real-time analytics feasible, and reduce bottlenecks in reporting, research, and certification. It also lays the groundwork for longitudinal studies, benchmarking across institutions, and better integration with evolving digital health tools.
Ultimately, it’s about transforming fragmented information into a reliable, decision-ready resource, which benefits both clinicians and patients in the long run.
What’s been the most interesting or unexpected part of the work so far?
One of the most fascinating discoveries for me has been how differently LLMs “see” text compared to humans. They don’t actually read in the conventional sense; instead, they process everything as sequences of tokens, which are essentially chunks of words or characters. That means small changes in formatting, line breaks, or even OCR artifacts can completely alter their understanding of a document.
In the context of medical data extraction, this has a big impact. A human might easily overlook a small scan issue or layout shift, but for an LLM, that could mean the difference between extracting the right value or missing it entirely. It’s pushed me to think not just about the model, but about everything that happens before the model sees the input. That pre-processing step is often just as important as the model itself.
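You can see this effect directly with an open-source tokenizer such as tiktoken (exact token counts vary with the model’s vocabulary):

```python
import tiktoken  # pip install tiktoken; OpenAI's open-source BPE tokenizer

enc = tiktoken.get_encoding("cl100k_base")

clean = "Hemoglobin: 13.5 g/dL"
noisy = "Hemog1obin :  13 . 5 g/dL"  # typical OCR artifacts: 1-for-l, stray spaces

for text in (clean, noisy):
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```

The two strings read identically to a human, but the model receives entirely different token sequences.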
And outside of work, what else are you curious about or enjoy exploring, whether in tech or beyond?
Beyond work, I love exploring the creative side of technology. I tinker with generative tools and read about the intersection of AI and the arts. There’s currently a lot of important discourse around AI-generated art, especially in terms of authorship and fairness, and I think those concerns are valid.
At the same time, I believe there are fair and meaningful ways to incorporate AI into the creative process that respect and enhance the work of artists, rather than replace it. For me, it’s exciting to see how algorithms can be used to support new forms of expression, not just automation. It’s a great reminder that not everything has to be optimized—some things can just be playful, exploratory, or expressive.
Outside of tech entirely, I’m a big fan of science fiction and fantasy—I love stories that explore the boundaries of what's possible, especially when they challenge our assumptions about society, identity, or intelligence. I’d also say I’m loosely part of the rationalist community, and one book I often recommend is “Rationality: From AI to Zombies” by Eliezer Yudkowsky. It’s a long but thought-provoking read that changed how I think about thinking.
Thank you for your time, Johannes. It’s great to have you on the team.
Thank you!
Johannes Hoster
Data Analyst