IT Strategy for Streamlining Lung Cancer Registry Processes in North Carolina

Problem Statement

Lung cancer is one of the most significant chronic disease burdens in North Carolina, both in terms of incidence and mortality. According to CDC data from 2023, the state recorded 34.9 lung cancer deaths and 58.4 new lung cancer cases per 100,000 residents, both higher than the national averages.(1) These elevated rates place lung cancer among the most pressing cancer-related public health challenges in North Carolina.

Disparities are evident among racial and ethnic minorities in timeliness of diagnosis, treatment access, and five-year survival rates.(2) Risk factors also contribute to North Carolina’s elevated burden: adult smoking prevalence and youth e-cigarette use are both higher than the national average.(3)

These statistics underscore the urgency of strengthening surveillance systems to provide timely, representative data that can inform prevention, screening, and equitable treatment strategies for lung cancer in North Carolina.

Current State of Surveillance

The North Carolina Central Cancer Registry (NCCCR) is the state’s legally mandated system for tracking cancer incidence, treatment, and outcomes. As part of the CDC’s National Program of Cancer Registries (NPCR) and the North American Association of Central Cancer Registries (NAACCR), it provides a comprehensive record of cancer cases across hospitals, clinics, and laboratories. NCCCR data are essential for understanding cancer trends, guiding public health policy, and supporting national cancer control efforts.

Despite its strengths, the NCCCR faces several limitations that reduce its effectiveness in addressing urgent public health concerns such as lung cancer. The main challenges are timeliness, integration, and equity monitoring: registry data often lag real-world diagnoses by up to two years, the NCCCR is not routinely linked to behavioral data, and equity-related variables may be incomplete.

For a condition such as lung cancer, these limitations hinder North Carolina’s ability to respond quickly and equitably. Modernizing NCCCR to improve timeliness and representativeness is therefore critical to reducing the state’s disproportionate lung cancer burden.

Modernization Strategy

To address these gaps, North Carolina should modernize its cancer surveillance system by aligning the NCCCR with CSTE Objective 2.1: “Improve traditional surveillance systems to provide timely and representative chronic disease insights.” (4) The goal is to reduce reporting lag, strengthen representativeness, and create actionable insights for prevention and treatment of lung cancer.

A unique opportunity exists through the Cancer Identification and Precision Oncology Center (CIPOC) at UNC-Chapel Hill, which was recently awarded ARPA-H funding to aggregate and analyze cancer data from diverse sources, including electronic health records, pathology and radiology images, claims, and geographic information, using large language models.(5) CIPOC is designed to support real-time cancer case identification and equitable care delivery. Integrating NCCCR modernization with CIPOC’s infrastructure would allow the registry to improve timeliness, enhance data linkage, and support equity-focused initiatives.

By grounding modernization in CSTE’s national strategy while leveraging CIPOC’s cutting-edge infrastructure, North Carolina can create a best-practice model for other states. This integrated approach would demonstrate how traditional registries and advanced AI-enabled systems can work together to deliver high-quality data with the efficiency that AI makes possible.

Summary

North Carolina faces an urgent burden from lung cancer, with incidence and mortality rates above the national average and significant disparities across racial and geographic groups.

Modernizing the NCCCR to improve timeliness, completeness, and representativeness is critical to addressing this challenge. By aligning with CSTE Objective 2.1 and leveraging the AI-enabled infrastructure of CIPOC, the state can reduce delays in reporting, link surveillance data to risk factors and screening uptake, and generate equity-focused insights for targeted interventions.

This integrated approach demonstrates how traditional registries can evolve into rapid, representative systems and provides a best-practice model that other states and chronic conditions can adopt.

The model has clear implications beyond lung cancer. The same framework can be applied to other cancers, as well as non-cancer conditions like COPD or cardiovascular disease. Importantly, the CIPOC project’s use of retrieval-augmented generation and advanced prompting strategies to extract and synthesize multi-modal data provides an adaptable toolkit for modern surveillance. By applying the most effective AI methods refined within CIPOC, North Carolina can not only strengthen its lung cancer registry but also inform future AI applications in healthcare surveillance more broadly. This positions the state as a leader in operationalizing CSTE’s strategic plan while demonstrating how cutting-edge AI methods can scale across diseases and conditions.

Using Machine Learning to Understand Treatment Delays in Breast Cancer Care

Introduction

Cancer treatment delays and modifications can significantly impact patient survival and quality of life. Research has consistently shown that marginalized populations, including Black, Hispanic, Asian, and American Indian/Alaska Native (AIAN) patients, experience higher rates of late-stage cancer diagnoses and lower rates of timely treatment.

For my project, I used machine learning models to examine how race, socioeconomic status, tumor characteristics, cancer stage, grade, subtype, age, clinical trial participation, and time to treatment initiation predict treatment interruptions in breast cancer patients.

This project highlights both the potential and limitations of using machine learning for predicting healthcare disparities in cancer treatment.


Dataset:

This project utilized Simulacrum v2.1.0, a synthetic dataset derived from the National Disease Registration Service (NDRS) Cancer Analysis System at NHS England. The dataset mimics real-world cancer data while preserving patient anonymity.

After data cleaning and removing incomplete observations, the dataset contained 69,367 patients, all diagnosed with breast cancer.
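A minimal sketch of this loading and filtering step is below, assuming pandas and illustrative file and column names (the actual Simulacrum v2.1.0 tables and data dictionary may name things differently):

```python
import pandas as pd

# Paths and column names are illustrative; check the Simulacrum v2.1.0 data
# dictionary for the exact table and field names in your download.
patients = pd.read_csv("sim_av_patient.csv")
tumours = pd.read_csv("sim_av_tumour.csv")

# Link patient and tumour records, keep breast cancer diagnoses (ICD-10 C50),
# and drop rows with missing values in the core analysis columns.
df = tumours.merge(patients, on="PATIENTID", how="inner")
df = df[df["SITE_ICD10_O2_3CHAR"] == "C50"]
df = df.dropna(subset=["AGE", "ETHNICITY", "STAGE_BEST", "GRADE"])
print(f"{len(df):,} breast cancer patients after cleaning")
```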

Demographics Overview

  • Gender Distribution:
    • Women: 98.7%
    • Men: 1.3% (Male breast cancer cases were retained for analysis.)
  • Age Distribution:
    • Mean Age: 61.06 years
    • Median Age: 61 years
  • Racial Distribution:
    • White: 85.9%
    • Asian: 3.7%
    • Black: 1.9%
    • Other: 1.7%
    • Mixed Race: 0.6%
    • Unknown: 6.1% (Reweighting was applied to mitigate algorithmic bias.)

  • Neighborhood Deprivation (Socioeconomic Status):
    • Scored from 1 (most deprived) to 5 (least deprived).
    • The dataset was fairly balanced across deprivation levels.

Average Time to Treatment Initiation (by Race)

  • Overall: 61 days
  • Other Racial Groups: 71 days
  • Black Patients: 63 days
  • White Patients: 62 days
  • Asian Patients: 54 days
  • Mixed Race Patients: 45 days
  • Unknown Race: 57 days

Clinical Trial Participation

  • 84.4% of patients were enrolled in a clinical trial.

Data Processing & Feature Engineering

1. Cleaning & Standardizing Data

  • Removed inconsistent staging classifications.
  • Imputed missing values for key variables such as comorbidity scores and time to treatment initiation.
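The write-up does not specify the imputation strategy, so the sketch below uses median imputation as one reasonable choice; the column names are illustrative:

```python
from sklearn.impute import SimpleImputer

# Median imputation for skewed clinical variables (assumed column names).
impute_cols = ["CHARLSON_SCORE", "TIME_TO_TREATMENT_DAYS"]
df[impute_cols] = SimpleImputer(strategy="median").fit_transform(df[impute_cols])
```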

2. Encoding Variables

  • One-Hot Encoding:
    • Race, tumor subtype, estrogen receptor (ER), progesterone receptor (PR), and HER2 status.
  • Ordinal Encoding:
    • Tumor stage, node stage, metastasis stage, overall stage, grade, and deprivation index.
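A minimal sketch of this encoding split, using pandas and scikit-learn with illustrative column names:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Unordered categoricals: one-hot encode (column names are illustrative).
onehot_cols = ["ETHNICITY", "SUBTYPE", "ER_STATUS", "PR_STATUS", "HER2_STATUS"]
df = pd.get_dummies(df, columns=onehot_cols)

# Ordered categoricals: ordinal encode. In practice, pass explicit
# `categories=` lists so the codes follow the clinical ordering rather than
# the default alphabetical order.
ordinal_cols = ["T_STAGE", "N_STAGE", "M_STAGE", "STAGE_BEST", "GRADE", "DEPRIVATION_QUINTILE"]
df[ordinal_cols] = OrdinalEncoder().fit_transform(df[ordinal_cols])
```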

3. Feature Engineering

  • Created a new feature, ANY_REGIMEN_MOD:
    • Combined the dose reduction, time delay, and early termination variables into one binary target variable (see the sketch after this list).
  • Grouped tumor biomarkers into cancer subtypes:
    • Luminal A: ER+, PR+, HER2- (Least aggressive)
    • Luminal B: ER+, PR+, HER2+ (Slightly more aggressive)
    • HER2-Enriched: ER-, PR-, HER2+
    • Triple Negative (Basal-like): ER-, PR-, HER2- (Most aggressive)
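A sketch of both feature-engineering steps, assuming hypothetical flag columns (DOSE_REDUCTION, TIME_DELAY, EARLY_TERMINATION) and boolean receptor-status columns (ER_POS, PR_POS, HER2_POS):

```python
# Binary target: any regimen modification (assumed flag column names).
mod_flags = ["DOSE_REDUCTION", "TIME_DELAY", "EARLY_TERMINATION"]
df["ANY_REGIMEN_MOD"] = df[mod_flags].any(axis=1).astype(int)

# Simplified subtype rules from the list above; combinations the rules do not
# cover (e.g. ER+/PR-) fall into "Other".
def subtype(row):
    er, pr, her2 = row["ER_POS"], row["PR_POS"], row["HER2_POS"]
    if er and pr:
        return "Luminal B" if her2 else "Luminal A"
    if not er and not pr:
        return "HER2-Enriched" if her2 else "Triple Negative"
    return "Other"

df["SUBTYPE"] = df.apply(subtype, axis=1)
```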

Bias Mitigation Strategies

1. Gender Bias

  • While 98.7% of patients were female, male breast cancer cases were retained to preserve rare-case analysis.
  • Applied weighting techniques to balance gender representation during model training.

2. Racial Bias

  • Since 85.9% of patients in the dataset were White, inverse weighting was applied so that all racial groups contributed fairly during training (see the sketch below).
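A minimal sketch of one way to build such weights, using scikit-learn’s balanced (inverse-frequency) sample weights; the exact weighting scheme used in the project is not documented:

```python
from sklearn.utils.class_weight import compute_sample_weight

# Inverse-frequency weights so under-represented race and gender groups carry
# proportionally more weight during training. Assumes the raw demographic
# columns are kept alongside the encoded features.
race_w = compute_sample_weight(class_weight="balanced", y=df["ETHNICITY"])
gender_w = compute_sample_weight(class_weight="balanced", y=df["GENDER"])
sample_weights = race_w * gender_w
```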

Machine Learning Models & Performance

I trained Random Forest and XGBoost models to predict treatment modifications.
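A minimal training sketch, assuming the feature matrix and sample weights from the preprocessing steps above:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Stratified split plus the demographic sample weights from the previous step.
X = df.drop(columns=["ANY_REGIMEN_MOD"])
y = df["ANY_REGIMEN_MOD"]
X_train, X_test, y_train, y_test, w_train, _ = train_test_split(
    X, y, sample_weights, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train, sample_weight=w_train)

xgb = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train, sample_weight=w_train)
```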

1. Initial Model Performance (Random Forest & XGBoost)

Results were poor. The models performed only slightly better than random guessing.

2. Hyperparameter Tuning & Feature Selection

  • Used GridSearchCV to optimize hyperparameters (see the sketch after this list).
  • Dropped the least important features and retrained the models.
  • Results did not improve, and performance worsened in some cases.
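A sketch of the tuning step with an illustrative parameter grid (the grid actually searched is not listed in the write-up):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen or narrow based on compute budget.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```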

3. Model Performance Issues

  • Models failed to generalize to the test data.
  • ROC AUC scores hovered around 0.5, meaning the models performed no better than random guessing.
  • Models defaulted to the majority ‘no treatment modification’ class and rarely predicted actual treatment modifications.
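For reference, this is roughly how those failure modes show up when evaluating the fitted random forest (a sketch, not the exact evaluation code used in the project):

```python
from sklearn.metrics import roc_auc_score, classification_report

# An AUC near 0.5 and a classification report dominated by the majority
# "no modification" class are the symptoms described above.
proba = rf.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, proba), 3))
print(classification_report(y_test, rf.predict(X_test)))
```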

4. Alternative Models Attempted

To address these issues, I tested additional models:

  • Support Vector Machines (SVM)
  • Neural Networks
  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Naïve Bayes

Findings from Alternative Models:

  • SVM and logistic regression performed best (accuracy: 0.65), but logistic regression completely failed to predict treatment modifications.
  • Neural Networks, KNN, and Naïve Bayes were slightly better balanced but still had low accuracy (~0.60–0.62).
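A rough sketch of how such a comparison might be run with near-default scikit-learn configurations; the exact settings used in the project are not documented:

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# In practice, SVM, KNN, and the neural network benefit from feature scaling;
# omitted here to keep the comparison minimal.
models = {
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {accuracy_score(y_test, model.predict(X_test)):.2f}")
```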

5. Last Attempt – Oversampling with SMOTE

  • Used the Synthetic Minority Oversampling Technique (SMOTE) to balance the class distribution.
  • SVM improved slightly, but performance gains were minimal.
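A minimal sketch of the SMOTE step using imbalanced-learn, applied to the training split only:

```python
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# Oversample only the training split so synthetic examples never leak into the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

svm = SVC()
svm.fit(X_res, y_res)
print("Test accuracy:", round(svm.score(X_test, y_test), 3))
```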

Uncovering Insights Through Clustering

Since predictive modeling was unsuccessful, I conducted clustering analysis to find patterns in the data.
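The clustering algorithm is not named in the write-up; k-means with k = 3 (matching the three clusters described below) on standardized features is one straightforward choice, sketched here:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize features before k-means (scaling method is an assumption), then
# fit three clusters to mirror the three groups described below.
X_scaled = StandardScaler().fit_transform(X)
df["CLUSTER"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Profile each cluster on the variables reported below (assumed column names).
profile_cols = ["AGE", "TIME_TO_TREATMENT_DAYS", "ANY_REGIMEN_MOD"]
print(df.groupby("CLUSTER")[profile_cols].mean())
```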

Three Distinct Patient Clusters Emerged:

Cluster 0: Moderate-Stage Cancer (35.4% Treatment Modifications)

  • Average Age: 60.8
  • Mean TNM stage (encoded): T ≈ 2, N ≈ 0.7, M ≈ 0.03 (localized)
  • Time to Treatment: 32.6 days

Cluster 1: Early-Stage Cancer with Delayed Treatment (36.7% Treatment Modifications)

  • Average Age: 61.0
  • Mean TNM stage (encoded): T ≈ 1, N ≈ 0.03, M ≈ 0.0 (localized, very early-stage)
  • Time to Treatment: 46.8 days (longest)

Cluster 2: Older Patients with Shortest Time to Treatment (36.8% Treatment Modifications)

  • Average Age: 69.4 years
  • Higher Comorbidity Burden (mean Charlson score: 0.63)
  • Time to Treatment: 14.8 days (shortest)

Key Takeaways from Clustering:

  • Cluster 1 had the longest time to treatment despite being early-stage; this warrants further investigation.
  • Older patients (Cluster 2) received faster treatment but had more comorbidities.

Conclusion & Next Steps

Machine learning models failed to predict treatment modifications accurately.
Clustering analysis revealed patterns in treatment delays and patient subgroups.

Future Steps:

  1. Explore more sophisticated models (Deep Learning, Bayesian Networks).
  2. Use real-world data instead of synthetic datasets.
  3. Investigate harder-to-quantify factors, such as patient-provider interactions and healthcare policies.

Final Thoughts

This project highlighted the complexity of predicting cancer treatment interruptions and the importance of interdisciplinary approaches in health equity research.

If you’re interested in machine learning for healthcare, data-driven health equity research, or predictive modeling, let’s connect!