A comprehensive assessment of the benefits, challenges, and commercial viability of generating and licensing synthetic datasets from SickKids' clinical data assets.
Synthetic data is artificially generated information that replicates the statistical properties, patterns, and correlations of real patient records — without containing any actual patient information. Unlike de-identified data, where personal identifiers (name, date of birth, health card number) have been removed from real patient records but can sometimes be re-linked to real people using other data sources, synthetic data is built entirely from scratch using generative models trained on real datasets, so there is no original record to re-link. Generative models are AI/ML systems (GANs, VAEs, diffusion models, Bayesian networks) that learn the patterns and statistical structure of a dataset and then create brand-new, realistic-looking data that isn't based on any single real record.
In a pediatric tertiary care setting like SickKids, this means we could generate realistic datasets reflecting the distributions and clinical relationships in our EHR, imaging, and genomics data — and make them available for research, AI training, and commercial licensing without ever exposing a single real patient record.
Generation techniques range from classical statistical methods, such as Bayesian networks (interpretable graphical models of conditional dependencies that clinicians can inspect) and CART (classification and regression trees, which learn branching yes/no rules from the data), to deep learning approaches, including GANs, VAEs, and diffusion models, and, increasingly, large language models. The choice of technique affects the fidelity (how closely the synthetic dataset reproduces the distributions, correlations, and patterns of the source data), the privacy guarantees, and the downstream utility of the resulting dataset. If synthetic ED visit data says 40% of children present with fever when the real rate is 25%, any model trained on it will make flawed predictions; fidelity testing catches these gaps.
Real EHR, imaging, or claims data enters a secure environment
Distributions, correlations, and temporal patterns are extracted
GANs, VAEs, diffusion models, or Bayesian networks learn the data structure
New records are created that are statistically faithful but entirely artificial
Utility metrics + re-identification risk assessed before release
Dataset packaged with documentation for commercial or research use
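The six-step flow above can be sketched end to end. The sketch below is deliberately minimal and entirely illustrative: the "generative model" is just a fitted mean vector and covariance matrix (a multivariate Gaussian), the variables and numbers are invented, and the validation gate checks only one utility metric (a preserved correlation). Real pipelines use far richer generators and formal disclosure tests.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Ingest: stand-in for real records (age in years, heart rate in bpm)
real = rng.multivariate_normal([8.0, 110.0], [[16.0, -28.0], [-28.0, 225.0]], size=5000)

# 2-3. Learn structure: here, just the mean vector and covariance matrix
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)

# 4. Generate: sample entirely new records from the fitted model
synthetic = rng.multivariate_normal(mu, cov, size=5000)

# 5. Validate before release: did the age/heart-rate correlation survive?
def corr(data):
    return np.corrcoef(data, rowvar=False)[0, 1]

fidelity_gap = abs(corr(real) - corr(synthetic))
assert fidelity_gap < 0.08, "failed utility check - do not release"

# 6. Package: ship the synthetic array plus documentation, never the real data
print(f"real corr={corr(real):.2f}, synthetic corr={corr(synthetic):.2f}")
```

In practice the validation step would also include re-identification testing (step 5 in the text), not just utility metrics.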
Two neural networks compete against each other: a generator creates fake data, and a discriminator tries to tell real from fake. They train in a loop — the generator keeps improving its fakes until the discriminator can't distinguish them from real records. The result is highly realistic synthetic data.
A GAN trained on 50,000 real pediatric chest X-rays from SickKids learns the visual patterns of pneumonia, bronchiolitis, and normal findings across different age groups. It then generates thousands of new X-ray images that look clinically realistic — showing realistic lung opacities, cardiac silhouettes, and age-appropriate anatomy — but don't correspond to any real patient.
An AI company building a pediatric pneumonia detection model buys these synthetic images to train their algorithm without ever accessing real patient scans.
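To make the generator/discriminator loop concrete, here is a deliberately tiny GAN in plain NumPy. It learns a single shift parameter on one-dimensional data rather than X-ray images, and all numbers are invented; it is a sketch of the adversarial training dynamic, not a usable image generator.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real "lab values" cluster around 5.0; the generator starts at 0.0
w, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(w * x + c)
b = 0.0           # generator: G(z) = z + b (a single learnable shift)
lr, batch = 0.05, 64

for step in range(3000):
    real = rng.normal(5.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: ascend log D(fake); gradient w.r.t. b is (1 - D(fake)) * w
    d_fake = sigmoid(w * (z + b) + c)
    b += lr * np.mean(1 - d_fake) * w

print(f"learned generator shift b = {b:.2f} (real data mean is 5.0)")
```

The generator never sees the real data directly; it only learns from the discriminator's feedback, which is what makes the counterfeiter/detective analogy apt.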
VAEs work by compressing real data into a compact mathematical summary (a "latent space"), then learning to reconstruct data from that summary. To generate new synthetic records, you sample new points from this compressed space and the model expands them into full, realistic records. Unlike GANs, VAEs are more stable to train and produce smoother, more continuous outputs.
A VAE is trained on the admission lab panels of 800 pediatric DKA presentations: blood glucose, pH, bicarbonate, potassium, anion gap, and their relationships to age, weight, and severity. The model learns how these values correlate — for example, that very low pH typically co-occurs with high glucose and low bicarbonate in specific patterns.
It then generates 10,000 synthetic lab panels that preserve these clinically meaningful correlations. A researcher uses this dataset to develop a severity-scoring algorithm without needing access to real patient labs.
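The encode, sample-the-latent-space, decode loop can be mimicked with PCA acting as a linear stand-in for a VAE's encoder and decoder. The "lab values" below are invented (not real DKA numbers), and real VAEs use nonlinear neural networks; the point is only that sampling in a compressed space and decoding back preserves the correlations the text describes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented stand-in for real lab panels: glucose (mmol/L) and pH,
# with the negative correlation the text describes (high glucose, low pH)
n = 2000
severity = rng.normal(0.0, 1.0, n)                  # hidden severity driver
glucose = 30.0 + 8.0 * severity + rng.normal(0, 2.0, n)
ph = 7.1 - 0.15 * severity + rng.normal(0, 0.03, n)
real = np.column_stack([glucose, ph])

# "Encode": centre and project onto principal components (the latent space)
mu = real.mean(axis=0)
_, _, vt = np.linalg.svd(real - mu, full_matrices=False)
latent = (real - mu) @ vt.T

# "Sample the latent space": draw new latent points with the same spread
z = rng.normal(0.0, 1.0, (n, 2)) * latent.std(axis=0)

# "Decode": expand latent samples back into full synthetic lab panels
synthetic = z @ vt + mu

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"glucose/pH correlation: real={real_corr:.2f}, synthetic={synth_corr:.2f}")
```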
Diffusion models work in two phases: first, they gradually add random noise to real data until it becomes pure static. Then, they train a model to reverse this process — learning to start from noise and step-by-step remove it to produce clean, realistic output. This is the same core technology behind DALL-E and Midjourney. They currently produce the highest-fidelity outputs of any generative approach.
A diffusion model trained on 5,000 pediatric brain MRIs learns the full spectrum of normal and abnormal findings — cortical dysplasia, hippocampal sclerosis, focal lesions — across age groups from neonates to adolescents. It generates new MRI volumes that preserve subtle anatomical details and pathological patterns with higher fidelity than GANs or VAEs.
A medtech company developing an AI seizure-focus localization tool licenses these synthetic scans to train their model across diverse pathology types they couldn't assemble from a single institution alone.
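The forward (noising) half of a diffusion model is simple enough to sketch directly; the hard part, the learned reverse denoiser, is omitted here. The 1-D sine wave stands in for an image, and the linear beta schedule is a common but arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "image": a 1-D signal standing in for an MRI slice
x0 = np.sin(np.linspace(0, 4 * np.pi, 256))

# Noise schedule: alpha_bar shrinks from ~1 (clean) toward ~0 (pure noise)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noised(x0, t):
    """Forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

early, late = noised(x0, 10), noised(x0, T - 1)
corr_early = np.corrcoef(x0, early)[0, 1]
corr_late = np.corrcoef(x0, late)[0, 1]
print(f"correlation with original: t=10 -> {corr_early:.2f}, t=999 -> {corr_late:.2f}")
```

Training the reverse model means learning to predict the added noise at each step; generation then starts from pure noise and applies the learned reversal step by step.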
Bayesian networks model data as a graph of connected nodes, where each node is a variable and each connection represents a conditional dependency. For example: "if age is <2 AND temperature >39°C, then probability of UTI is X%." They learn these conditional relationships from real data, then sample from the graph to generate new records that follow the same rules. They are the most interpretable approach — clinicians can inspect and understand the relationships.
A Bayesian network trained on 15,000 pediatric asthma ED visits encodes the conditional relationships: age → severity, severity → O₂ sat, O₂ sat → disposition, prior admissions → length of stay, and so on. The resulting graph is inspectable — a clinician can verify that the model correctly links salbutamol doses to PRAM scores before any synthetic records are generated.
The synthetic dataset of 100,000 asthma visits is then used by a health system planning team to model ED flow patterns under different staffing scenarios.
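A toy version of such a network can be sampled with ordinary conditional probability tables. The three-node chain and all probabilities below are invented for illustration; a real model would learn its structure and tables from the source visits.

```python
import random

random.seed(4)

# Invented conditional probability tables: age -> severity -> disposition
p_age = {"<2": 0.3, "2-5": 0.4, "6+": 0.3}
p_severity = {            # P(severity | age): younger children skew more severe
    "<2":  {"mild": 0.4, "severe": 0.6},
    "2-5": {"mild": 0.6, "severe": 0.4},
    "6+":  {"mild": 0.8, "severe": 0.2},
}
p_admit = {"mild": 0.1, "severe": 0.7}   # P(admitted | severity)

def pick(dist):
    """Sample one outcome from a {value: probability} table."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r < total:
            return value
    return value

def sample_visit():
    age = pick(p_age)
    severity = pick(p_severity[age])     # each child node conditions on its parent
    admitted = random.random() < p_admit[severity]
    return {"age": age, "severity": severity, "admitted": admitted}

visits = [sample_visit() for _ in range(20000)]
under2 = [v for v in visits if v["age"] == "<2"]
rate = sum(v["severity"] == "severe" for v in under2) / len(under2)
print(f"severe rate among <2s in synthetic visits: {rate:.2f} (CPT says 0.60)")
```

Because the tables are explicit, a clinician can audit every relationship before a single record is generated — the interpretability advantage the text describes.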
Large language models (like GPT-4, Claude, Llama) can generate realistic free-text clinical narratives — the kinds of notes, summaries, and reports that make up a huge portion of health records but are nearly impossible to synthesize with statistical methods. By fine-tuning an LLM on real clinical text, the model learns medical terminology, documentation patterns, and clinical reasoning structure.
An LLM fine-tuned on 3,000 de-identified discharge summaries from SickKids neurology admissions learns the institutional writing style, common phrasing, medication documentation patterns, and the structure of follow-up plans. It then generates entirely new discharge summaries that read like they were written by a SickKids neurologist — but describe synthetic patients who never existed.
These synthetic notes are used to train an NLP model that extracts structured seizure type, medication, and EEG findings from free text — a task that previously required expensive manual chart review.
Synthetic data addresses some of the most persistent bottlenecks in healthcare data access — privacy barriers, cost, regulatory overhead, and data scarcity — while enabling new revenue opportunities.
Synthetic records contain no real patient information and cannot be reverse-engineered to identify individuals. Unlike de-identified data, there is no original record to re-link. This sidesteps the constraints of PHIPA (Ontario's Personal Health Information Protection Act, which is why every real-data request at SickKids requires documented authority and usually a formal data sharing agreement), PIPEDA (Canada's federal private-sector privacy law, which may govern the commercial transaction even where PHIPA governs the source health data), and HIPAA (the U.S. health privacy law; fully synthetic data is generally not considered Protected Health Information because it contains no real patient data, a major advantage for cross-border licensing) — dramatically simplifying data sharing agreements.
Real data requests at SickKids can take months, moving through REB (Research Ethics Board) approvals, data governance reviews, and DUA (data use agreement) negotiations; multi-site studies multiply the delay because each site's REB must approve independently. Because synthetic data contains no real patient information, a much simpler licensing agreement can often replace the DUA, and synthetic datasets can be provisioned in days, compressing AI development cycles from quarters to weeks. The U.S. Department of Veterans Affairs deployed synthetic data across 1,300 facilities for this exact reason.
Licensing real-world data can cost $100K–$1M+. SickKids' unique pediatric tertiary care data could be synthesized and sold at scale — to pharma, medtech, and AI companies — without ever exposing a real patient record. Pediatric data is rare and commercially valuable.
Rare pediatric conditions produce small cohorts. Synthetic data can amplify sample sizes, enrich minority-class representation, and create balanced datasets for ML training — addressing the chronic challenge of statistical power (the probability that a study detects a real effect when one exists) in pediatric research. If SickKids sees only 12 cases of a rare metabolic disorder per year, a study comparing two treatment approaches may not have enough patients to draw statistically meaningful conclusions, even if one treatment is genuinely better.
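The power problem is easy to see by simulation. The response rates below are hypothetical and the test is a simple normal-approximation two-proportion z-test, so this is a sketch of the concept rather than a formal power analysis.

```python
import numpy as np

rng = np.random.default_rng(5)

def power(n_per_arm, p_a=0.50, p_b=0.80, trials=5000, alpha_z=1.96):
    """Fraction of simulated two-arm studies whose z-test reaches significance."""
    a = rng.binomial(n_per_arm, p_a, trials) / n_per_arm
    b = rng.binomial(n_per_arm, p_b, trials) / n_per_arm
    pooled = (a + b) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = np.abs(b - a) / np.maximum(se, 1e-12)
    return float(np.mean(z > alpha_z))    # two-sided 5% threshold

# Even a large true difference (50% vs 80% response) is usually missed at n=12
print(f"power with 12 patients/arm:  {power(12):.2f}")
print(f"power with 200 patients/arm: {power(200):.2f}")
```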
Software teams building clinical decision support, EHR interfaces, and ML models need realistic data to test. Synthetic datasets provide production-grade test environments without the liability of using live data — cutting development costs by 30–40%.
Canadian health data faces strict jurisdictional controls. Synthetic data can be shared across international research collaborations and GDPR/HIPAA-governed jurisdictions without triggering cross-border data transfer restrictions. (The European Data Protection Supervisor has noted that truly synthetic data may fall outside GDPR's scope since it doesn't relate to identifiable individuals, though this remains debated, especially for high-fidelity datasets.)
Synthetic data is not a silver bullet. Serious technical, ethical, and regulatory challenges must be mitigated before SickKids pursues commercialization.
Synthetic generators can subtly distort correlations, flatten rare signals, and introduce artifacts. Rare-signal loss occurs because models learn the most common patterns well but smooth over or entirely miss low-frequency events: unusual drug reactions, atypical presentations, uncommon diagnoses. In pediatrics — where rare diseases and edge cases are clinically critical — even small fidelity losses can produce misleading downstream analyses.
SickKids generates a synthetic dataset of 2,000 pediatric sepsis encounters to license to a company building an early warning score. Here's how fidelity loss and drift compound into clinical danger:
Correlation flattening during generation. In the real data, there is a critical clinical correlation: neonates with sepsis often present with hypothermia (low temperature) rather than fever — the opposite of older children. This pattern appears in ~18% of sepsis cases under 60 days. The GAN, optimizing for overall statistical fit, learns that sepsis = fever because that's the dominant pattern. Hypothermic sepsis in neonates drops from 18% to 6% in the synthetic data.
Tail truncation on lab values. Real sepsis data includes a critical tail: ~8% of patients have lactate values above 6 mmol/L, signaling severe tissue hypoperfusion. The synthetic generator smooths the lactate distribution toward the mean, and extreme values above 5.5 mmol/L virtually disappear. The synthetic dataset looks statistically reasonable at a glance — the mean and standard deviation of lactate are close to the real data — but the clinically dangerous tail is gone.
Temporal drift after release. The synthetic dataset is generated in 2026 based on source data from 2020–2025. By 2027, SickKids' real sepsis population has shifted: a new RSV-bacterial co-infection pattern has emerged post-pandemic, antibiotic stewardship has changed empiric therapy practices, and the hospital now sees more immunocompromised oncology patients with atypical sepsis presentations. The synthetic dataset — still being sold — no longer reflects current clinical reality.
The downstream model has blind spots. A sepsis early warning tool trained on this synthetic data learns that "high temperature + elevated WBC = sepsis risk." It performs well for the textbook presentation. But it misses the hypothermic neonate (because the generator flattened that signal), fails to flag critically elevated lactate (because it never saw values that high), and doesn't recognize the new co-infection patterns (because the data is temporally stale).
The missed sepsis case. A 3-week-old presents to an ED using this tool. Temperature is 35.8°C, WBC is normal, but lactate is 7.2 mmol/L. The model scores the patient as low risk — no fever, no leukocytosis. The tool has never learned that this constellation in a neonate is a red flag. The real pattern was in SickKids' source data, but the synthetic generator erased it.
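The tail-truncation failure described above can be demonstrated in a few lines. The lactate values are simulated (log-normal, roughly matching the ~8% above 6 mmol/L in the text), and the "generator" is deliberately naive: a Gaussian fitted to the mean and SD. Note how the headline statistics look fine while the extreme tail vanishes and physiologically impossible values appear.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated "real" lactate values (mmol/L): right-skewed, with roughly
# 8% of patients above 6 and a long tail of extreme values
real = np.exp(rng.normal(0.9, 0.63, 10000))

# A naive generator that reproduces only the mean and SD: a Gaussian fit
synthetic = rng.normal(real.mean(), real.std(), 10000)

tail = lambda x, t: float(np.mean(x > t))
print(f"mean lactate:         real={real.mean():.2f}   synthetic={synthetic.mean():.2f}")
print(f"fraction > 6 mmol/L:  real={tail(real, 6):.3f}  synthetic={tail(synthetic, 6):.3f}")
print(f"fraction > 10 mmol/L: real={tail(real, 10):.4f} synthetic={tail(synthetic, 10):.4f}")
print(f"impossible negative lactates in synthetic: {float(np.mean(synthetic < 0)):.3f}")
```

The mean, SD, and even the moderate tail look plausible at a glance; the clinically dangerous extreme values are the ones that disappear.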
High-fidelity synthetic data trained on small populations (e.g., rare pediatric conditions) can still leak information about real individuals. There are no standardized, objective methods to certify that a synthetic dataset is sufficiently different from the original.
No Canadian or international regulator has definitively ruled on the legal status of synthetic data; it falls into a definitional gap between PHIPA and PIPEDA. Ethics boards are inconsistent about waiving review for synthetic-data research — a loophole that could tighten.
If real SickKids data under-represents certain demographics (e.g., Indigenous populations, rural communities), synthetic generation can amplify these biases. Models trained on biased synthetic data may perpetuate health inequities — systematic, avoidable, and unfair differences in health outcomes between population groups — at scale: a diagnostic tool trained on data that under-represents certain groups may perform less accurately for them, effectively providing worse care to populations that already face barriers to access.
Imagine SickKids creates a synthetic dataset of 50,000 pediatric asthma ED visits to sell to an AI company building an asthma severity prediction tool. Here's how bias can cascade:
Real data reflects existing disparities. SickKids' real asthma data is ~68% from Toronto's urban core. Only ~4% of records are from Indigenous children, despite Indigenous children having 2–3× higher asthma hospitalization rates nationally. Why? Families in remote communities often present to local hospitals first, not SickKids.
The generator learns "typical" = urban, non-Indigenous. A GAN trained on this data learns that the statistically dominant pattern is an urban child with certain environmental triggers, medication access patterns, and follow-up adherence. Indigenous children's distinct patterns — different environmental exposures, medication access barriers, higher severity at presentation — are treated as noise because there are too few examples.
Synthetic data shrinks the minority further. The synthetic dataset reduces Indigenous representation from 4% to 1.5%. Worse, the few Indigenous synthetic records the model does generate look statistically similar to the urban majority — losing the distinct clinical patterns that actually characterize these presentations.
The downstream AI model encodes the bias. The AI company trains their severity prediction model on this synthetic data. It performs well for urban, non-Indigenous children (88% accuracy). But for Indigenous children, accuracy drops to 61% — it consistently underestimates severity because it never learned the patterns of late presentation, limited prior medication use, and higher baseline inflammation.
The model is deployed nationally. A health system in Manitoba adopts the tool. Indigenous children presenting with severe asthma are triaged as moderate because the model has never seen their pattern. A child who should have been started on IV magnesium gets oral prednisone and a 4-hour reassessment. The bias in SickKids' source data, amplified by the synthetic generator, has now produced a measurable patient safety gap 2,000 km away.
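The shrinking-minority effect in the cascade above can be shown with a deliberately crude generator that captures only the overall mean and SD. The "severity scores" and group proportions below are invented; real generative models fail more subtly, but in the same direction.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented severity scores: 98% "majority pattern" around 2,
# 2% "minority pattern" around 12 (sicker, later-presenting children)
n = 20000
is_minority = rng.random(n) < 0.02
real = np.where(is_minority, rng.normal(12, 1, n), rng.normal(2, 1, n))

# A generator that captures only the overall mean and SD of the mixture
synthetic = rng.normal(real.mean(), real.std(), n)

frac = lambda x: float(np.mean(x > 8.0))
print(f"fraction of minority-pattern records: real={frac(real):.3f} "
      f"synthetic={frac(synthetic):.4f}")
```

The minority pattern doesn't just shrink; the few high-severity synthetic records that do appear are smooth interpolations toward the majority, not faithful examples of the distinct clinical pattern.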
Healthcare practitioners are risk-averse and often skeptical of AI-derived insights. Clinical decisions informed by models trained on "fake" data face significant adoption barriers — particularly in high-stakes pediatric contexts.
Patients consented to SickKids using their data for care and research — not necessarily for training commercial generative models. The ethical ground for monetizing derivatives of patient data requires careful institutional governance and potentially new consent frameworks.
Key metrics and tradeoffs that frame the synthetic data opportunity for a pediatric institution.
This is the fundamental tension in synthetic data: the more useful the data is for research and AI training, the more closely it must resemble real patient records — which inherently increases the risk that someone could link a synthetic record back to a real person. There is no way to fully eliminate this tradeoff; the goal is to find the right balance for each use case.
Think of it like a volume dial: turning up "utility" (data accuracy) automatically turns up "risk" (re-identification potential). A children's hospital must set this dial more conservatively than an adult institution, because pediatric populations are smaller and rare conditions are more uniquely identifying.
Heavy noise injection and differential privacy guarantees make re-identification virtually impossible. (Differential privacy adds carefully calibrated random noise so that no single individual's data can significantly influence the output; the key parameter epsilon (ε) sets the tradeoff, with lower ε meaning more privacy but noisier results.) But the data is so noisy that clinical correlations break down.
SickKids example: A synthetic dataset of bronchiolitis visits where the relationship between age, O₂ saturation, and admission has been deliberately scrambled. Safe to share publicly, but a researcher couldn't use it to answer "do younger infants with lower O₂ sats get admitted more?" — the signal has been destroyed.
Validated synthetic generation that preserves key clinical relationships, combined with formal disclosure testing (nearest-neighbor distance, k-anonymity) before release. This is the recommended target for SickKids.
SickKids example: A synthetic dataset of 20,000 asthma ED visits where the correlation between PRAM scores, medication doses, and dispositions is preserved — a pharma company can meaningfully train models on it — but every record has been tested to ensure no synthetic patient is suspiciously close to any real patient in the source data. Rare attribute combinations that could identify a child are suppressed.
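A minimal version of the nearest-neighbor release gate looks like this. The records are random stand-ins, with one synthetic record deliberately planted as a near-copy of a real one; the 0.01 distance threshold is arbitrary and would be calibrated in practice.

```python
import numpy as np

rng = np.random.default_rng(8)

# Stand-ins: real records and candidate synthetic records (age, weight, score)
real = rng.normal(0, 1, (500, 3))
synthetic = rng.normal(0, 1, (500, 3))
synthetic[0] = real[42] + 0.001    # one synthetic record nearly copies a real one

# Distance from each synthetic record to its nearest real record
diffs = synthetic[:, None, :] - real[None, :, :]
nn_dist = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Release gate: suppress any synthetic record suspiciously close to a real one
threshold = 0.01
flagged = np.where(nn_dist < threshold)[0]
released = np.delete(synthetic, flagged, axis=0)
print(f"flagged {len(flagged)} of {len(synthetic)} records; released {len(released)}")
```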
Near-replica fidelity where synthetic records are almost indistinguishable from real ones. Maximum analytical power, but unacceptably high re-identification risk — especially for rare pediatric conditions and small cohorts.
SickKids example: A synthetic record for a 6-year-old with Dravet syndrome, specific EEG findings, clobazam and stiripentol dosing, and three PICU admissions at SickKids. That combination is so specific that anyone with knowledge of the SickKids epilepsy program could plausibly identify the real patient. The data is analytically perfect but ethically indefensible for commercial release.
Why this matters for commercialization: Every buyer of SickKids synthetic data will want the dial turned as far toward "utility" as possible — that's what makes the data valuable. SickKids' obligation is to hold the line at a privacy threshold that protects children, even when it means the product is less analytically powerful. The commercial strategy must price this tradeoff honestly: buyers get data that's good enough for most AI training and research, but not so faithful that it risks exposing real patients. Buyers wanting near-replica fidelity should be directed to formal research partnerships with REB oversight instead.
"Fidelity" means how closely the synthetic data mirrors the real source data. Higher fidelity is better for research — but in pediatrics, it also increases the risk that a synthetic record could be traced back to a real child. This spectrum shows the four levels, from safest to riskiest.
Completely artificial data with no relationship to real patient records. Variables are randomly generated within plausible ranges but don't reflect actual clinical patterns or correlations.
Use case at SickKids:
Populating a test EMR environment with fake patient records so developers can build and test new Epic/Meditech interfaces without touching real data. The records don't need to be clinically realistic — they just need to fill fields.
Synthetic data that preserves the basic statistical properties of the real dataset — means, standard deviations, proportions — but may not capture complex relationships between variables or rare subgroups.
Use case at SickKids:
A researcher needs a dataset showing realistic age/sex distributions and diagnosis frequencies to prototype a dashboard before applying for REB approval to access real data. The synthetic set gets the shape right but shouldn't be used for clinical conclusions.
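This kind of low-fidelity generation amounts to sampling each variable's marginal distribution independently, which a column shuffle makes vivid: every per-variable statistic is preserved exactly while the relationship between variables disappears. The age/weight numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(9)

# Stand-in real data: age and weight, strongly correlated in children
age = rng.uniform(1, 16, 5000)
weight = 4.0 * age + 6.0 + rng.normal(0, 3, 5000)

# Low-fidelity generation: shuffle each column independently.
# Every marginal statistic (mean, SD, histogram) is exactly preserved ...
synth_age = rng.permutation(age)
synth_weight = rng.permutation(weight)

# ... but the relationship BETWEEN variables is destroyed
real_corr = np.corrcoef(age, weight)[0, 1]
synth_corr = np.corrcoef(synth_age, synth_weight)[0, 1]
print(f"age/weight correlation: real={real_corr:.2f}, synthetic={synth_corr:.2f}")
```

This is exactly why such data is fine for prototyping a dashboard but useless for clinical conclusions.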
GAN/VAE/diffusion-generated data that preserves correlations between variables, temporal patterns, and multivariate relationships. Clinically realistic enough to train AI models and run meaningful analyses.
Use case at SickKids:
A pharma company licenses a synthetic dataset of 20,000 pediatric asthma encounters to train a severity prediction model. The correlations between PRAM scores, O₂ sat, medication doses, and dispositions are clinically faithful. This is the commercial sweet spot.
Synthetic data so close to the originals that individual records may functionally mirror real patients. Maximum analytical utility but unacceptable privacy risk — especially for pediatric populations where rare conditions make individuals uniquely identifiable.
Why this is off-limits for SickKids:
A near-replica synthetic record of a child with Dravet syndrome, specific EEG patterns, and a particular medication history at a Toronto children's hospital is effectively that child's record with the name removed. No amount of downstream legal protection changes the ethical breach.
The viability of synthetic data generation depends heavily on the underlying condition — specifically, how common it is, how large the source cohort is, and how clinically complex the variable relationships are. Here's how different types of SickKids data map to synthetic data suitability.
A mathematical technique that adds carefully calibrated random noise to data or model outputs. It provides a formal, provable guarantee that no single individual's data can significantly influence the result. The key parameter epsilon (ε) controls the tradeoff: a smaller ε means stronger privacy but noisier, less useful data.
Applied to rare conditions: Before releasing a synthetic dataset of rare cardiac defects, differential privacy would add controlled noise to variables like age, weight, and procedure type — making it mathematically impossible to determine whether any specific child's record was in the training data, even if an attacker has external knowledge about that child.
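The Laplace mechanism at the heart of differential privacy is nearly a one-liner. The count and epsilon values below are illustrative; note how the typical error scales as 1/ε, which is the privacy/utility dial in numeric form.

```python
import numpy as np

rng = np.random.default_rng(10)

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: noisy count with noise scale = sensitivity / epsilon."""
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

# How many children in the cohort had a given rare procedure?
true_count = 7
for eps in (0.1, 1.0, 10.0):
    noisy = [dp_count(true_count, eps) for _ in range(1000)]
    print(f"epsilon={eps:>4}: typical error ~ {np.std(noisy):.1f} records")
```

At ε = 0.1 the answer is private but almost useless for a count of 7; at ε = 10 it is accurate but offers little formal protection.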
A privacy test that ensures every combination of identifying attributes in the dataset applies to at least k different records. If k=5, then no combination of age, diagnosis, sex, and hospital unit can appear fewer than 5 times. Any record that would be unique (k=1) is flagged as a re-identification risk.
Applied to rare conditions: After generating synthetic records for pediatric stroke patients, a k-anonymity check would flag that there's only one synthetic 3-year-old female with a posterior circulation stroke. That record would be suppressed or generalized (e.g., age broadened to "2–5 years") before the dataset is released.
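A k-anonymity check reduces to counting quasi-identifier combinations. The records and the choice of k = 5 below are illustrative:

```python
from collections import Counter

# Stand-in synthetic records: (age_band, sex, diagnosis) quasi-identifiers
records = (
    [("2-5", "M", "asthma")] * 40
    + [("2-5", "F", "asthma")] * 35
    + [("6-10", "M", "asthma")] * 22
    + [("2-5", "F", "stroke")]       # unique combination -> k=1 risk
)

def k_anonymity_check(records, k=5):
    """Return quasi-identifier combinations appearing fewer than k times."""
    counts = Counter(records)
    return {combo: n for combo, n in counts.items() if n < k}

risky = k_anonymity_check(records, k=5)
print("flagged for suppression or generalization:", risky)
# Each flagged record would be suppressed, or its age band broadened
# (e.g., "2-5" -> "2-10"), before the dataset is released.
```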
A straightforward approach where records containing unusual combinations of variables are removed from the synthetic dataset entirely. If a combination of features is so specific that it could plausibly point to a real individual, the record is deleted rather than released.
Applied to rare conditions: A synthetic record for a patient with Leigh syndrome + a specific mitochondrial DNA variant + a liver transplant at SickKids would be suppressed — that combination is so rare it could functionally identify a real child even without a name. The synthetic dataset ships with that record removed and a note that certain ultra-rare combinations were excluded for privacy.
The unique value of SickKids' pediatric data — rare conditions, longitudinal records, and a world-renowned brand — creates a compelling commercial opportunity, with important caveats.
| Tool / Vendor | Type | Approach | Healthcare Focus | Notes |
|---|---|---|---|---|
| Synthea™ | Open Source | Rules-based, statistical | FHIR-native patient records | ONC-supported; includes pediatric modules |
| MDClone | Commercial | Statistical engine | EHR-to-synthetic conversion | Used by major health systems & universities |
| Syntegra | Commercial | Deep generative (GANs) | Health data synthesis | Privacy guarantees with statistical validation |
| Mostly AI | Commercial | Deep generative | Tabular data, multi-industry | Strong privacy metrics; GDPR-focused |
| Replica Analytics | Commercial | Bayesian networks | Healthcare & government | Canadian company; PHIPA/PIPEDA aware |
| Gretel.ai | Commercial | Multiple (GAN, LLM) | Multi-industry | API-first; composable generation |
Pediatric clinical data occupies a uniquely valuable niche in the healthcare data market. Children's hospitals represent a tiny fraction of all healthcare facilities, yet their data is essential for drug dosing research, developmental biology, rare disease modeling, and pediatric AI applications.
SickKids sees some of the rarest pediatric conditions in North America. A synthetic dataset preserving the statistical signatures of SickKids' metabolic, cardiac, oncological, and neurological cohorts — without any real patient data — would be extremely attractive to pharma companies running pediatric clinical trials, medtech companies developing pediatric devices, and AI companies building pediatric diagnostic models.
However, the same rarity that makes this data valuable also makes it more vulnerable to re-identification. A synthetic record describing a patient with a one-in-a-million condition in a specific age bracket, even without identifiers, could theoretically be linked back to a real child. This is the central tension SickKids must navigate.
Could SickKids use Bittensor — a decentralized AI network — to generate synthetic data and earn revenue through cryptocurrency emissions? We've prepared a dedicated deep-dive assessing the feasibility, workflow, risks, and honest verdict on this emerging approach.
Read the Bittensor Assessment ↗

A phased approach to developing SickKids' synthetic data capability, balancing innovation speed with the governance rigor expected of a world-class pediatric institution.
Establish a synthetic data governance committee spanning clinical, legal, ethics, IT, and research. Define consent frameworks, acceptable use policies, and a disclosure risk threshold before generating any dataset.
Begin with high-volume, low-sensitivity data (e.g., ED visit demographics, triage acuity patterns) before progressing to rare-disease cohorts. Validate fidelity metrics against real data in a controlled research setting.
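One simple fidelity metric for a pilot like this is the total variation distance between the real and synthetic marginal distributions of a categorical variable (e.g., triage acuity). The function below is a generic sketch, not a SickKids-specific tool; the CTAS labels in the usage note are assumed examples.

```python
from collections import Counter

def total_variation(real, synthetic):
    """Total variation distance between two empirical categorical distributions.
    0.0 means identical marginals; 1.0 means completely disjoint support."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )
```

A pilot might require, say, TV distance below an agreed threshold on every key marginal before a synthetic dataset leaves the controlled research setting.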
Implement formal re-identification testing and establish a quantitative privacy threshold. Two standard tests: nearest-neighbor distance (a metric measuring how close each synthetic record is to the nearest real record in the original dataset; a synthetic record that sits very close to a real one may effectively be a copy that reveals real patient information, so larger minimum distances mean better privacy) and membership inference attacks (a privacy attack in which someone holding a specific person's medical record probes the synthetic data for statistical fingerprints showing that person was in the training set; even without a direct record match, confirming training-set membership is itself a privacy breach, so testing against these attacks is considered best practice). Pediatric rare conditions require stricter thresholds than adult general populations.
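The nearest-neighbor distance test can be sketched in a few lines: for every synthetic record, find its Euclidean distance to the closest real record, and report the smallest such distance across the release. This brute-force version assumes small numeric record tuples; real pipelines would normalize variables and use an indexed search.

```python
import math

def min_nn_distance(synthetic, real):
    """Smallest distance from any synthetic record to its nearest real record.
    A value near zero flags a synthetic record that is effectively a copy."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(min(dist(s, r) for r in real) for s in synthetic)
```

A release policy would then reject or regenerate any dataset whose minimum distance falls below the agreed privacy threshold.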
Evaluate commercial vendors (MDClone, Syntegra, Replica Analytics) against in-house capability. Consider a Canadian vendor (Replica Analytics) for PHIPA/PIPEDA alignment. Factor in long-term costs vs. per-dataset licensing.
Design a commercial licensing framework with tiers: academic (low-cost), pharma R&D (premium), and AI training (usage-based). Include contractual restrictions on downstream use, especially re-identification attempts and insurance applications.
Synthetic datasets degrade as real data distributions shift over time. Establish a quarterly validation cycle comparing synthetic output fidelity against updated source data, and version all released datasets with full audit trails.