# Healthcare Partner Case Study: Enriching Patient Cohorts with EHR Data from Loopback Health

## Summary

- **Healthcare Data for Real-World Use Cases**: A leading healthcare AI company partnered with Protege to connect its patient-level data to Loopback Health’s EHR dataset and other healthcare provider data, which unlocked richer cohorts for training the next generation of AI models.
- **Multimodal Healthcare Data for AI Training**: Protege coordinated multiple healthcare data partners to deliver both structured and unstructured EHR data, connecting modalities into a single, de-identified dataset ready for AI development.
- **Quick Delivery Timelines**: The initial cohort of thousands of matched patients was delivered in less than 90 days from initial discovery, with ongoing refreshes expanding both the matched and EHR-only cohorts and opening up follow-on data opportunities.

## Multimodal Healthcare Data To Unlock Expanded Use Cases

A leading healthcare AI company wanted to build next-generation AI models trained on multiple data modalities, building on its existing data. The company already had one modality, but it wanted to improve its model’s performance in care recommendation scenarios by layering on EHR data as well.

Protege was uniquely positioned to meet these highly targeted overlap needs thanks to its access to clinical data across its healthcare data partner network. The end result was a multimodal training dataset that connected the existing data with the new modalities, giving the AI models a more thorough understanding of the patient population.

Protege’s healthcare data partners benefited from this request as well, as it demonstrated the growing need for multimodal healthcare data that no individual provider tends to cover fully. This unlocked a new, incremental deal opportunity that partners could access with existing data assets previously used in other Protege deals, but that no single partner could fulfill on its own given the client’s extensive patient coverage needs.

In addition to the initial overlap, the healthcare AI company also wanted a fast, compliant pathway for ongoing cohort expansion as new data became available via partners. As a result, the final solution needed to:

- Reliably connect the existing modality data with newly layered-on EHR data in a single dataset
- Preserve privacy and de-identification throughout the data pipeline
- Support ongoing cohort growth, whether through expanded coverage from existing data partners or net-new partners added to the Protege platform
- Meet a quick turnaround timeline suitable for active AI development cycles

As the go-to, trusted data network for AI training data, Protege delivered on all of the key criteria, unifying disparate data sources and applying AI data expertise to deliver the required data on schedule and at scale.

## The Protege Solution: A Single Data Provider Network

Protege served as the single access point to a distributed network of healthcare data partners, delivering a unified, model-ready cohort for the healthcare company’s AI development.

### Speed and reliability

Protege ran a clear scoping and feasibility process to translate the company’s requirements into a concrete cohort definition and overlap strategy. Once data samples had been reviewed and confirmed to match the client’s needs, fast contracting and tightly managed coordination with partners supported timely data delivery and reduced operational friction for the end buyer.

### Breadth and depth of data

Along with a few other providers, Protege partnered with Loopback Health, a healthcare data company, to make structured and unstructured EHR data from a nationwide network of academic medical centers and integrated delivery networks available for AI training use. For the end AI research team, this ensured that the data would be directly useful for upcoming training runs and would support the full model-building cycle.

### Scalable overlap strategy

Rather than treat the patient cohort as a one-time, static delivery, Protege designed the data pipeline so that continuous dataset refreshes could increase matched patient counts over time. As new data relevant to the client was added to the Protege data partner network, the patient cohort overlap expanded without the need to re-architect the original approach or pipeline design.
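To illustrate the idea of a refresh growing the overlap without changing the matching step, here is a minimal Python sketch. The case study does not describe the actual matching mechanism, so everything here is an assumption made for illustration: patients are represented by hypothetical de-identified tokens (`tok_a`, `tok_x`, and so on), and the helper names `split_cohorts` and `apply_refresh` are invented for this example.

```python
def split_cohorts(client_tokens, ehr_tokens):
    """Return (matched, ehr_only) sets of hypothetical de-identified patient tokens."""
    client_tokens = set(client_tokens)
    ehr_tokens = set(ehr_tokens)
    matched = client_tokens & ehr_tokens   # patients present in both datasets
    ehr_only = ehr_tokens - client_tokens  # patients with EHR data only
    return matched, ehr_only


def apply_refresh(ehr_tokens, refresh_tokens):
    """Fold a partner refresh into the EHR token set; the matching step is unchanged."""
    return set(ehr_tokens) | set(refresh_tokens)


# Illustrative data only: tokens stand in for de-identified patient identifiers.
client = {"tok_a", "tok_b", "tok_c", "tok_d"}
ehr_v1 = {"tok_a", "tok_b", "tok_x"}
matched_v1, ehr_only_v1 = split_cohorts(client, ehr_v1)

# A later refresh adds patients; the same split is simply rerun.
ehr_v2 = apply_refresh(ehr_v1, {"tok_c", "tok_y"})
matched_v2, ehr_only_v2 = split_cohorts(client, ehr_v2)
```

In this sketch the refresh only adds tokens, so the matched cohort after the refresh always contains the matched cohort before it, and the matching function itself never needs to change.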

### Flexible commercialization model

To align with the AI company’s roadmap, Protege implemented multi-year licenses with specified options for re-licensing. This structure allowed the company to plan long-term model development and evaluation on a stable data foundation, while preserving appropriate flexibility over time.

### Partnership operating model

As part of every healthcare data scope of work, Protege manages feasibility, contracting, and delivery across the healthcare data partner network. This operating model extends reach to additional partners for AI-specific use cases, beyond what any single provider could offer on its own.

Together, this “single data provider network” approach gave the healthcare company a clear path from the initial overlapping patient dataset to continued cohort enrichment in support of its ongoing training data requirements.

## Timeline & Data Delivery

From initial data discovery to final delivery of the first linked cohort, Protege executed on a **90-day turnaround timeline**. This process included:

- Upfront feasibility and scoping across Protege’s healthcare data partner network
- Contract execution with the AI company and involved partners
- Cohort definition and overlap implementation
- Final data preparation and delivery

Within that overall window, **final data delivery was completed within 30 days of contract signature**. This speed was important for supporting the AI company’s active development cycle.

For the patient cohort, Protege delivered patient-level data that connected the existing modality with net-new EHR data for those same patients. It also opened the possibility of growing the number of matched patients with layered-on EHR data through ongoing data refreshes and licensing renewals.
