Empowering Rural Communities Through Culturally Competent Health Tech

Assignment Background

For my Implementing Health Informatics Initiatives for Emerging Leaders course in the Gillings School of Global Public Health at UNC–Chapel Hill, I developed an informatics-based concept proposal for the NSF America’s Seed Fund. My proposal focuses on a mobile application designed to support multiple chronic diseases in rural communities, where health literacy tends to be lower. Below is a subsection of that proposal.

HealthBridge Rural: A Culturally Competent Health Literacy App for Chronic Disease Prevention and Management 

Introduction 

Rural communities face persistent health inequities driven by limited healthcare access, transportation barriers, low health literacy, and high prevalence of preventable chronic diseases such as diabetes, hypertension, COPD, and cardiovascular disease. According to CDC data, adults in rural areas are more likely to die prematurely from five leading causes [1] and report lower rates of preventive care and health literacy [2]. Traditional health apps often fail to reach these communities because they assume consistent access to broadband internet, cultural familiarity with digital tools, and standardized health literacy levels. 

In these areas, cultural norms, economic hardship, and health mistrust create additional barriers. For example, a patient may understand their diagnosis but not how to modify their lifestyle affordably or safely. Information about exercise, nutrition, and stress reduction is often disconnected from local contexts; recipes may not utilize affordable, readily available foods, and exercise recommendations rarely account for environmental or safety constraints. 

HealthBridge Rural aims to close this gap by providing culturally competent, low-cost, and locally contextualized health information to empower residents to take small, achievable steps toward improved health. It combines behavioral science, public health data, and user-centered design to build trust, comprehension, and sustainable engagement across several chronic conditions. 

Application Overview: 

HealthBridge Rural is a mobile and web-based application designed using Dr. Mica Endsley’s Situation Awareness-Oriented Design (SAOD) principles—Perception, Comprehension, and Projection—to support informed, context-driven decision-making [3]. The app offers a personalized dashboard that integrates educational tools, behavioral nudges, and local resource directories to help users navigate their health more effectively. 

1. Perception  

 The app aggregates publicly available and user-reported data to present relevant, localized health insights: 

  • Maps of affordable grocery stores, farmers’ markets that accept EBT, community gardens, and food giveaway programs. 
  •  Maps of parks, community recreation centers, and walking trails. 
  • Listings of free or low-cost health programs, community clinics, and smoking cessation groups. 
  • Daily check-ins for sleep, stress, diet, and activity. 

2. Comprehension  

 AI translates medical information into culturally relevant, accessible language. 

 Examples: 

  • “Instead of sugary tea, try this $2 pitcher of flavored water using ingredients from Dollar General.” 
  • “Here’s how to manage stress when working long shifts—three exercises that take less than 5 minutes.” 
  • Audio and visual literacy options for users with limited reading skills. 

The AI integrates local cultural insights, such as traditional foods or family-centered values, to ensure content resonates with the user’s lived experiences. 

3. Projection – Supporting Informed Action 

 Based on users’ inputs, HealthBridge Rural generates personalized recommendations and reminders: 

  • Predictive prompts (“You’ve been logging low activity—try this free walking club nearby.”) 
  • Behavior simulation (“If you switch one meal per day to a home-cooked option, here’s the 3-month health impact and savings.”) 
  • Smart notifications tailored to disease risk factors (e.g., hypertension management reminders). 
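
The bullets above describe Projection behavior at a conceptual level. As a purely illustrative sketch (not part of the proposal itself), the rule below shows how a simple nudge could be generated from a week of Perception-layer check-ins; the fields, thresholds, and messages are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CheckIn:
    """One daily self-report from the Perception layer (fields are hypothetical)."""
    active_minutes: int
    stress_level: int  # 1 (low) to 5 (high)

def projection_prompt(history: List[CheckIn]) -> Optional[str]:
    """Return a nudge when the recent trend suggests a useful next step."""
    if not history:
        return None
    recent = history[-7:]  # look at the past week of check-ins
    avg_activity = sum(c.active_minutes for c in recent) / len(recent)
    avg_stress = sum(c.stress_level for c in recent) / len(recent)
    if avg_activity < 15:
        return "You've been logging low activity. Try this free walking club nearby."
    if avg_stress >= 4:
        return "Here are three stress-relief exercises that take less than 5 minutes."
    return None

# Example: a week of low-activity check-ins triggers the walking-club nudge.
week = [CheckIn(active_minutes=10, stress_level=2) for _ in range(7)]
print(projection_prompt(week))
```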

Logic Model (Adaptation of Endsley’s Situational Awareness Model): 

Level | Description | HealthBridge Rural Application
Perception | Awareness of environment and current state | Local maps, cost-aware recipes, nearby resources
Comprehension | Understanding significance of data | AI-curated education, culturally adapted health tips
Projection | Anticipating outcomes and next steps | Personalized goals, predictive alerts, self-monitoring tools

[1] https://www.cdc.gov/health-equity-chronic-disease/health-equity-rural-communities/index.html

[2] https://www.ruralhealthinfo.org/toolkits/health-literacy/1/barriers

[3] Endsley, M. R., & Jones, D. G. (2024). Situation Awareness Oriented Design: Review and Future Directions. International Journal of Human–Computer Interaction, 40(7), 1487–1504. https://doi.org/10.1080/10447318.2024.2318884

IT Strategization for Streamlining Lung Cancer Registry Processes In NC

Problem Statement

Lung cancer is one of the most significant chronic disease burdens in North Carolina, both in terms of incidence and mortality. According to CDC data from 2023, the state experienced 34.9 deaths per 100,000 residents from lung cancer and 58.4 new cases per 100,000 residents, both higher than the national averages. (1) These elevated rates place lung cancer among the most pressing cancer-related public health challenges in North Carolina.

Disparities are evident among racial and ethnic minorities across timeliness of diagnosis, treatment access, and five-year survival rates. (2) Risk factors also contribute to North Carolina’s elevated burden; adult smoking prevalence and youth e-cigarette use are both higher than the national averages. (3)

These statistics underscore the urgency of strengthening surveillance systems to provide timely, representative data that can inform prevention, screening, and equitable treatment strategies for lung cancer in North Carolina.

Current State of Surveillance

The North Carolina Central Cancer Registry (NCCCR) is the state’s legally mandated system for tracking cancer incidence, treatment, and outcomes. As part of the CDC’s National Program of Cancer Registries (NPCR) and the North American Association of Central Cancer Registries (NAACCR), it provides a comprehensive record of cancer cases across hospitals, clinics, and laboratories. NCCCR data are essential for understanding cancer trends, guiding public health policy, and supporting national cancer control efforts.

Despite its strengths, the NCCCR faces several limitations that reduce its effectiveness in addressing urgent public health concerns such as lung cancer. Timeliness, integration, and equity monitoring are the main challenges: registry data often lag real-world diagnoses by up to two years, NCCCR records are rarely linked to behavioral data, and equity-related variables may be incomplete.

For a condition such as lung cancer, these limitations hinder North Carolina’s ability to respond quickly and equitably. Modernizing NCCCR to improve timeliness and representativeness is therefore critical to reducing the state’s disproportionate lung cancer burden.

Modernization Strategy

To address these gaps, North Carolina should modernize its cancer surveillance system by aligning the NCCCR with CSTE Objective 2.1: “Improve traditional surveillance systems to provide timely and representative chronic disease insights.” (4) The goal is to reduce reporting lag, strengthen representativeness, and create actionable insights for prevention and treatment of lung cancer.

A unique opportunity exists through the Cancer Identification and Precision Oncology Center (CIPOC) at UNC-Chapel Hill, which was recently awarded ARPA-H funding to aggregate and analyze cancer data from diverse sources—including electronic health records, pathology and radiology images, claims, and geographic information—using large language models. (5) CIPOC is designed to support real-time cancer case identification and equitable care delivery. Integrating NCCCR modernization with CIPOC’s infrastructure would allow the registry to improve timeliness, enhance data linkage, and support equity-focused initiatives.

By grounding modernization in CSTE’s national strategy while leveraging CIPOC’s cutting-edge infrastructure, North Carolina can create a best-practice model for other states. This integrated approach would demonstrate how traditional registries and advanced AI-enabled systems can work together to provide high-quality data while leveraging the improved efficiency that AI brings.

Summary

North Carolina faces an urgent burden from lung cancer, with incidence and mortality rates above the national average and significant disparities across racial and geographic groups.

Modernizing the NCCCR to improve timeliness, completeness, and representativeness is critical to addressing this challenge. By aligning with CSTE Objective 2.1 and leveraging the AI-enabled infrastructure of CIPOC, the state can reduce delays in reporting, link surveillance data to risk factors and screening uptake, and generate equity-focused insights for targeted interventions.

This integrated approach demonstrates how traditional registries can evolve into rapid, representative systems and provides a best-practice model that other states and chronic conditions can adopt.

The model has clear implications beyond lung cancer. The same framework can be applied to other cancers, as well as non-cancer conditions like COPD or cardiovascular disease. Importantly, the CIPOC project’s use of retrieval-augmented generation and advanced prompting strategies to extract and synthesize multi-modal data provides an adaptable toolkit for modern surveillance. By applying the most effective AI methods refined within CIPOC, North Carolina can not only strengthen its lung cancer registry but also inform future AI applications in healthcare surveillance more broadly. This positions the state as a leader in operationalizing CSTE’s strategic plan while demonstrating how cutting-edge AI methods can scale across diseases and conditions.

Finding Hidden Structures in Patient Data: An SDoH Network Exploration

Understanding social determinants of health (SDoH) is becoming increasingly important in healthcare services and public health research.

Healthy People 2030, a major public health initiative, emphasizes addressing these areas to achieve real health equity.

Traditionally, research has often relied on proxy measures (like using race as a stand-in for the experience of racism) to represent structural factors. While useful, these proxies can miss the complex, relational ways that social determinants interact to shape health outcomes.

Network models offer a new approach: instead of looking at variables one by one, they allow us to model the hidden relationships between different social conditions and populations.

In this project, I used data from the Medical Expenditure Panel Survey (MEPS) to model latent similarity patterns among patients based on SDoH, including food insecurity, access to care, income, and social isolation.

I constructed a patient similarity network using cosine similarity, applied spectral embedding to project patients into a latent space, and used k-means clustering to identify subgroups within the network.
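
For readers who want to see the moving parts, here is a minimal scikit-learn sketch of that pipeline. The feature matrix is random stand-in data rather than MEPS records, and the number of embedding dimensions and clusters are illustrative rather than the values tuned for the project.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

# Stand-in for the MEPS-derived SDoH feature matrix (patients x indicators such as
# food insecurity, access to care, income, and social isolation).
rng = np.random.default_rng(0)
X = rng.random((500, 6))

# 1. Patient similarity network from pairwise cosine similarity (kept non-negative).
similarity = np.clip(cosine_similarity(X), 0, None)

# 2. Spectral embedding of the precomputed affinity matrix into a low-dimensional latent space.
latent = SpectralEmbedding(n_components=2, affinity="precomputed").fit_transform(similarity)

# 3. K-means on the latent coordinates to identify patient subgroups.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(latent)
print(np.bincount(labels))  # cluster sizes
```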

The results revealed subtle but meaningful patterns. Clusters differed in healthcare access barriers, financial instability, and compounded vulnerabilities such as low income paired with food insecurity.

These findings demonstrate how relational network modeling can uncover hidden gradients of disadvantage that traditional feature-by-feature analyses may overlook.

Future work could integrate health outcome data or longitudinal measures to better understand how these latent social structures impact health trajectories over time.

Sources:

[1] Office of Disease Prevention and Health Promotion. Healthy People 2030 Framework. U.S. Department of Health and Human Services.

[2] Qing, H. (2023). Latent class analysis by regularized spectral clustering. arXiv preprint arXiv:2310.18727. https://doi.org/10.48550/arXiv.2310.18727
[3] Agency for Healthcare Research and Quality. Medical Expenditure Panel Survey (MEPS), 2022 Full-Year Consolidated Data File.

[4] https://www.geeksforgeeks.org/spectral-embedding/

Race, Place, and Quality

A Look at Hypertension Management Among Black Women in North Carolina

This semester, in my Social Epidemiology course, we had to propose a study related to our course content. I chose to explore how race, place, and healthcare quality intersect to shape outcomes for Black women living with hypertension.


Why This Study?

Black women in the U.S. face unique and layered health challenges due to the intersecting effects of racism and sexism. The weathering hypothesis, developed by Geronimus et al., suggests that chronic exposure to social, economic, and racial stressors accelerates health deterioration—particularly among Black women.1,2

Despite advances in treatment, disparities in chronic disease outcomes persist. Black women, especially in rural areas, remain more likely to experience uncontrolled hypertension. This raises critical questions about how structural barriers, such as geographic isolation and limited access to high-quality care, interact with in-clinic factors, such as provider bias.

Layered forms of marginalization interact to shape Black women’s health experiences

My Research Question

Among Black women receiving care at Federally Qualified Health Centers (FQHCs) in North Carolina, how does the impact of geographic proximity on hypertension management differ between urban and rural communities?

My hypothesis: Geographic proximity will have a smaller impact on hypertension management among Black women in rural communities compared to those in urban areas due to compounded marginalization and systemic barriers.


Background Context

  • Rural communities tend to have a higher chronic disease burden.3
  • High-burden ZIP Code Tabulation Areas (ZCTAs) had nearly double the proportion of Black residents compared to low-burden ZCTAs.3
  • On average, people in high-burden areas live 8.7 miles from the nearest FQHC, compared to 4.6 miles in low-burden areas.3
  • Clinical quality gaps persist: Non-Hispanic Black individuals are 12% less likely to have adequately controlled blood pressure, even after adjusting for socioeconomic and healthcare access factors.11,12

A map showing FQHC distribution across North Carolina and a gradient scale representing rurality by RUCA score.


Proposed Study Design

  • Design: Cross-sectional observational study
  • Population: Black women aged 18+ with a diagnosis of hypertension and at least two FQHC visits between January 2023 and December 2024
  • Data Source: EHR data from FQHCs across North Carolina
  • Exposure: Proximity to FQHC, measured by distance from home ZIP to clinic
  • Comparison Groups: Urban vs. rural residence (based on RUCA score)
  • Outcome: Blood pressure control (e.g., SBP <140 mmHg)
  • Covariates: Socioeconomic status, insurance, comorbidities, clinic characteristics
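
The proposal does not prescribe a statistical model, but one way the urban/rural comparison could be operationalized is a logistic model of blood pressure control with a distance-by-rurality interaction. The sketch below is illustrative only; the file and column names are hypothetical placeholders.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical analytic file: one row per patient with placeholder column names.
df = pd.read_csv("fqhc_hypertension_cohort.csv")
df["rural"] = (df["ruca_score"] >= 4).astype(int)         # illustrative RUCA cutoff
df["bp_controlled"] = (df["last_sbp"] < 140).astype(int)  # outcome: SBP < 140 mmHg

# Logistic regression with a distance x rurality interaction, adjusting for covariates
# named in the design (insurance, comorbidities, clinic characteristics).
model = smf.logit(
    "bp_controlled ~ distance_miles * rural + insurance + charlson_index + C(clinic_id)",
    data=df,
).fit()
print(model.summary())
```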

A directed acyclic graph illustrating the conceptual model of how structural and clinical factors interact to influence hypertension management.


Final Thoughts

Developing this proposal deepened my understanding of how place, identity, and systemic inequity play a role in measurable health outcomes. Mapping these disparities is only the beginning. As researchers, we must also imagine ways to redesign care systems that serve Black women and their intersectional identities more equitably.

References

  1. Geronimus AT, Hicken M, Keene D, Bound J. “Weathering” and age patterns of allostatic load scores among blacks and whites in the United States. Am J Public Health. 2006;96(5):826-833. doi:10.2105/AJPH.2004.060749
  2. Chinn JJ, Martin IK, Redmond N. Health Equity Among Black Women in the United States. J Womens Health (Larchmt). 2021;30(2):212-219. doi:10.1089/jwh.2020.8868
  3. Benavidez GA, Zahnd WE, Hung P, Eberth JM. Chronic Disease Prevalence in the US: Sociodemographic and Geographic Variations by Zip Code Tabulation Area. Prev Chronic Dis 2024;21:230267. DOI: http://dx.doi.org/10.5888/pcd21.230267.
  4. Ndugga N, Hill L, Artiga S. Key data on health and health care by race and ethnicity. KFF. Published June 11, 2024. Accessed November 14, 2024. https://www.kff.org/key-data-on-health-and-health-care-by-race-and-ethnicity/?entry=health-status-and-outcomes-chronic-disease-and-cancer
  5. Agency for Healthcare Research and Quality. 2023 National Healthcare Quality and Disparities Report Appendixes. AHRQ Pub. No. 23(24)-0091-EF. December 2023.
  6. Ochieng N, Biniek JF, Cubanski J, Neuman T. Disparities in health measures by race and ethnicity among beneficiaries in Medicare Advantage: A review of the literature. KFF. Published December 13, 2023. Accessed October 15, 2024. https://www.kff.org/medicare/report/disparities-in-health-measures-by-race-and-ethnicity-among-beneficiaries-in-medicare-advantage-a-review-of-the-literature/
  7. Jha AK, Zaslavsky AM, Orav EJ, Epstein AM, Ayanian JZ. Quality of ambulatory care for privately insured and Medicare Advantage enrollees in the United States. Health Aff (Millwood).
  8. Tong M, Hill L, Artiga S. Racial disparities in cancer outcomes, screening, and treatment. KFF. Published February 3, 2022. Accessed November 14, 2024. https://www.kff.org/racial-equity-and-health-policy/issue-brief/racial-disparities-in-cancer-outcomes-screening-and-treatment/
  9. Alsheik N, Blount L, Qiong Q, et al. Outcomes by race in breast cancer screening with digital breast tomosynthesis versus digital mammography. J Am Coll Radiol. 2021;18(7):906-918. doi:10.1016/j.jacr.2020.12.033
  10. Miller-Kleinhenz JM, Collin LJ, Seidel R, Oyesanmi O. Racial disparities in diagnostic delay among women with breast cancer. J Am Coll Radiol. 2021;18(10):1384-1393. doi:10.1016/j.jacr.2021.06.019
  11. Abrahamowicz AA, Ebinger J, Whelton SP, Commodore-Mensah Y, Yang E. Racial and Ethnic Disparities in Hypertension: Barriers and Opportunities to Improve Blood Pressure Control. Curr Cardiol Rep. 2023;25(1):17-27. doi:10.1007/s11886-022-01826-x
  12. Crim MT, Yoon SS, Ortiz E, et al. National surveillance definitions for hypertension prevalence and control among adults. Circ Cardiovasc Qual Outcomes. 2012;5(3):343-351. doi:10.1161/CIRCOUTCOMES.111.963439

Using Machine Learning to Understand Treatment Delays in Breast Cancer Care

Introduction

Cancer treatment delays and modifications can significantly impact patient survival and quality of life. Research has consistently shown that marginalized populations, including Black, Hispanic, Asian, and American Indian/Alaska Native (AIAN) patients, experience higher rates of late-stage cancer diagnoses and lower rates of timely treatment.

For my project, I used machine learning models to examine how race, socioeconomic status, tumor characteristics, cancer stage, grade, subtype, age, clinical trial participation, and time to treatment initiation predict treatment interruptions in breast cancer patients.

This project highlights both the potential and limitations of using machine learning for predicting healthcare disparities in cancer treatment.


Dataset:

This project utilized Simulacrum v2.1.0, a synthetic dataset derived from the National Disease Registration Service (NDRS) Cancer Analysis System at NHS England. While this dataset mimics real-world cancer data, it ensures patient anonymity.

After data cleaning and removing incomplete observations, the dataset contained 69,367 patients, all diagnosed with breast cancer.

Demographics Overview

  • Gender Distribution:
    • Women: 98.7%
    • Men: 1.3% (Male breast cancer cases were retained for analysis.)
  • Age Distribution:
    • Mean Age: 61.06 years
    • Median Age: 61 years
  • Racial Distribution:
    • White: 85.9%
    • Asian: 3.7%
    • Black: 1.9%
    • Other: 1.7%
    • Mixed Race: 0.6%
    • Unknown: 6.1% (Reweighting was applied to mitigate algorithmic bias.)

  • Neighborhood Deprivation (Socioeconomic Status):
    • Scored from 1 (most deprived) to 5 (least deprived).
    • The dataset was fairly balanced across deprivation levels.

Average Time to Treatment Initiation (by Race)

  • Overall: 61 days
  • Other Racial Groups: 71 days
  • Black Patients: 63 days
  • White Patients: 62 days
  • Asian Patients: 54 days
  • Mixed Race Patients: 45 days
  • Unknown Race: 57 days

Clinical Trial Participation

  • 84.4% of patients were enrolled in a clinical trial.

Data Processing & Feature Engineering

1. Cleaning & Standardizing Data

  • Removed inconsistent staging classifications.
  • Imputed missing values for key variables such as comorbidity scores and time to treatment initiation.

2. Encoding Variables

  • One-Hot Encoding:
    • Race, tumor subtype, estrogen receptor (ER), progesterone receptor (PR), and HER2 status.
  • Ordinal Encoding:
    • Tumor stage, node stage, metastasis stage, overall stage, grade, and deprivation index.

3. Feature Engineering

  • Created a new feature, ANY_REGIMEN_MOD (see the sketch after this list)
    • Combined dose reduction, time delay, and early termination variables into one binary target variable.
  • Grouped tumor biomarkers into cancer subtypes:
    • Luminal A: ER+, PR+, HER2- (Least aggressive)
    • Luminal B: ER+, PR+, HER2+ (Slightly more aggressive)
    • HER2-Enriched: ER-, PR-, HER2+
    • Triple Negative (Basal-like): ER-, PR-, HER2- (Most aggressive)
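
A minimal sketch of how these steps could look in pandas and scikit-learn. The Simulacrum extract and its column names here are hypothetical placeholders, not the actual field names in the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv("simulacrum_breast_cohort.csv")  # hypothetical cleaned extract

# Binary target: any dose reduction, delay, or early termination counts as a modification.
mod_flags = ["dose_reduction", "time_delay", "early_termination"]
df["ANY_REGIMEN_MOD"] = (df[mod_flags].sum(axis=1) > 0).astype(int)

# One-hot encode nominal variables; ordinal-encode staged and graded variables.
nominal = ["race", "tumor_subtype", "er_status", "pr_status", "her2_status"]
ordinal = ["t_stage", "n_stage", "m_stage", "overall_stage", "grade", "deprivation_index"]
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal),
    ("ordinal", OrdinalEncoder(), ordinal),
])
X = preprocess.fit_transform(df)
y = df["ANY_REGIMEN_MOD"]
```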

Bias Mitigation Strategies

1. Gender Bias

  • While 98% of patients were female, male breast cancer cases were retained for rare case analysis.
  • Applied weighting techniques to balance gender representation during model training.

2. Racial Bias

  • Since 85% of patients in the dataset were White, inverse weighting was applied to ensure fair contributions across racial groups.
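
One common way to implement this kind of reweighting, continuing from the encoding sketch above, is scikit-learn's balanced sample weights computed over the race and gender combination (the gender column is another hypothetical placeholder):

```python
from sklearn.utils.class_weight import compute_sample_weight

# Weight each record inversely to the frequency of its race/gender combination,
# so the majority groups do not dominate model training.
group = df["race"].astype(str) + "_" + df["gender"].astype(str)
sample_weight = compute_sample_weight(class_weight="balanced", y=group)

# Later passed to model fitting, e.g. model.fit(X_train, y_train, sample_weight=...)
```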

Machine Learning Models & Performance

I trained Random Forest and XGBoost models to predict treatment modifications.

1. Initial Model Performance (Random Forest & XGBoost)

Results were poor. The models performed only slightly better than random guessing.

2. Hyperparameter Tuning & Feature Selection

  • Used GridSearchCV to optimize parameters.
  • Dropped the least important features and retrained the models.
  • Results did not improve, and performance worsened in some cases.
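
Continuing the same sketch, here is a hedged example of how GridSearchCV could be set up for the random forest; the parameter grid shown is illustrative rather than the grid actually searched.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # the metric that exposed the near-chance performance
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```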

3. Model Performance Issues

  • Models failed to generalize to testing data.
  • ROC AUC scores hovered around 0.5, meaning models were barely better than random guessing.
  • Models defaulted to the majority ‘no treatment modification’ class and rarely flagged actual treatment delays.

4. Alternative Models Attempted

To address these issues, I tested additional models:

  • Support Vector Machines (SVM)
  • Neural Networks
  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Naïve Bayes

Findings from Alternative Models:

  • SVM and Logistic Regression performed best (Accuracy: 0.65), but logistic regression completely failed at predicting treatment modifications.
  • Neural Networks, KNN, and Naïve Bayes were slightly better balanced but still had low accuracy (~0.60–0.62).

5. Last Attempt – Oversampling with SMOTE

  • Used the Synthetic Minority Oversampling Technique (SMOTE) to balance the classes in the training data.
  • SVM improved slightly, but performance gains were minimal.
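
A sketch of how SMOTE can be combined with the SVM so that oversampling happens only on the training data (an imbalanced-learn pipeline keeps synthetic samples out of evaluation); this again continues the earlier placeholder variables.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Oversample the minority (treatment-modification) class, scale, then fit the SVM.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("scale", StandardScaler(with_mean=False)),  # sparse-safe scaling after one-hot encoding
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```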

Uncovering Insights Through Clustering

Since predictive modeling was unsuccessful, I conducted clustering analysis to find patterns in the data.
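
A minimal sketch of the clustering step, continuing the same placeholder data; k is fixed at 3 to match the clusters described below, and the profiled columns are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the encoded feature matrix, partition patients into three clusters,
# then profile each cluster on a few clinically meaningful variables.
scaled = StandardScaler(with_mean=False).fit_transform(X)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)
profile_cols = ["age", "charlson_score", "time_to_treatment_days", "ANY_REGIMEN_MOD"]
print(df.groupby("cluster")[profile_cols].mean().round(2))
```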

Three Distinct Patient Clusters Emerged:

Cluster 0: Moderate-Stage Cancer (35.4% Treatment Modifications)

  • Average Age: 60.8
  • Tumor Stage: T2, N0.7, M0.03 (Localized)
  • Time to Treatment: 32.6 days

Cluster 1: Early-Stage Cancer with Delayed Treatment (36.7% Treatment Modifications)

  • Average Age: 61.0
  • Tumor Stage: T1, N0.03, M0.0 (Localized, very early-stage)
  • Time to Treatment: 46.8 days (longest)

Cluster 2: Older Patients with Shortest Time to Treatment (36.8% Treatment Modifications)

  • Average Age: 69.4 years
  • Higher Comorbidities (Charlson Score = 0.63)
  • Time to Treatment: 14.8 days (shortest)

Key Takeaways from Clustering:

  • Cluster 1 had the longest time to treatment despite being early-stage. Further investigation needed.
  • Older patients (Cluster 2) received faster treatment but had more comorbidities.

Conclusion & Next Steps

Machine learning models failed to predict treatment modifications accurately.
Clustering analysis revealed patterns in treatment delays and patient subgroups.

Future Steps:

  1. Explore more sophisticated models (Deep Learning, Bayesian Networks).
  2. Use real-world data instead of synthetic datasets.
  3. Investigate non-quantifiable factors, like patient-provider interactions and healthcare policies.

Final Thoughts

This project highlighted the complexity of predicting cancer treatment interruptions and the importance of interdisciplinary approaches in health equity research.

If you’re interested in machine learning for healthcare, data-driven health equity research, or predictive modeling, let’s connect!

Addressing Healthcare Disparities in North Carolina

In my Healthcare Data Visualization course, two classmates and I were tasked with creating a dashboard using a dataset and platform of our choice. Using data from the 2018 Health Professional Shortage Area (HPSA) dataset provided by the U.S. Department of Health & Human Services, our analysis reveals critical insights into the challenges and opportunities for improving healthcare access statewide.

Key Findings from the Analysis

  1. Healthcare Shortages Are Severe in Underserved Areas
    The average provider-to-population ratio in HPSA-designated areas is 2.11 clinicians per 10,000 residents, significantly below the recommended 6.67 clinicians. This stark disparity highlights the strain on underserved communities, especially in rural regions.
  2. Poverty Rates Compound Access Issues
    North Carolina’s poverty rate—measured as residents below 200% of the federal poverty line—is 27%, slightly above the national average of 26.9%. This economic disadvantage exacerbates barriers to healthcare, disproportionately affecting rural counties.
  3. Medically Underserved Areas (MUAs) Require Immediate Attention
    MUA scores, which account for clinician ratios, infant mortality, poverty levels, and the percentage of elderly populations, show an alarming average of 51.76 across NC—well below the national threshold of 62. Henderson and Transylvania counties, with MUA scores of 0, represent the most critically underserved areas in the state.
  4. Rural Hospital Closures and Policy Impacts
    Historical trends show peaks in shortages in 2002 and 2015-2018, correlating with changes in HPSA methodology and rural hospital closures. These events further stress the importance of sustained policy interventions.
  5. Prioritization of Resources by HPSA Scores
    The HPSA scoring system, used by the National Health Service Corps (NHSC), prioritizes counties for clinician assignments. Mecklenburg County has the highest HPSA score due to its large population, indicating where current resources are concentrated. However, smaller rural counties with lower scores risk being overlooked despite their critical needs.

Access to quality healthcare is a fundamental need, yet many counties across North Carolina face significant shortages in healthcare providers, particularly in rural and economically disadvantaged areas. Our findings underscore the urgent need for targeted interventions. Allocating resources to counties with the highest ratios of underserved populations, addressing the economic and geographic barriers to care, and replicating successful policies in declining shortage areas can help mitigate these disparities. For policymakers, healthcare providers, and community leaders, this analysis serves as a roadmap for reducing inequities and ensuring better access to healthcare for all North Carolinians.

Sources:

https://data.hrsa.gov/tools/shortage-area/hpsa-find

https://www.bls.gov/opub/reports/working-poor/2020/

https://healthycommunitiesnc.org/

https://ciceroinstitute.org/research/north-carolina-physician-shortage-facts/

Clustering Individuals Based on Health and Socioeconomic Indicators Using the CDC’s BRFSS Data

Project Overview: I analyzed the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset using K-means clustering to identify groups based on reported health and life satisfaction patterns. By combining health indicators with socioeconomic factors (household income and education) I aimed to understand how these social determinants relate to individual health outcomes. The initial dataset contained over 400,000 observations, which I reduced to 15,032 by cleaning out incomplete data.

Variables Used:

  • General Health (GENHLTH): Measures perceived overall health (higher values indicate poorer health).
  • Mental Health (MENTHLTH): Number of days mental health negatively impacted daily life (higher values indicate more frequent struggles).
  • Physical Health (PHYSHLTH): Number of days physical health was poor.
  • Life Satisfaction (LSATISFY): Reflects self-reported quality of life (higher values mean lower satisfaction).

Clustering Analysis and Findings:

Analysis 1: Health and Income

Using the elbow method, I determined an optimal cluster count of 6. Here’s what I found:

  • Cluster 0 (18%): Highest income group (>$75,000) with excellent health and high life satisfaction.
  • Cluster 1 (9%): Lowest income group ($10,000-$15,000) with significant health challenges but moderate life satisfaction.
  • Cluster 2 (6%): Middle-income earners ($35,000-$50,000) with the poorest health indicators but moderate satisfaction.
  • Cluster 3 (14%): Middle-income group ($35,000-$50,000) with good health and high life satisfaction.
  • Cluster 4 (35%): Largest group with highest income levels (>$75,000), showing good health and high life satisfaction.
  • Cluster 5 (18%): Upper middle-income earners ($50,000-$75,000) with similar good health and high satisfaction.

This analysis highlights that higher income is associated with better health outcomes and life satisfaction, reinforcing existing evidence on the impact of socioeconomic factors.

This image displays the relationship between income (INCOME2, y-axis) and physical health (PHYSHLTH, x-axis). Clusters 0, 3, and 5 are skewed to the left, indicating that these groups experience fewer days where physical health negatively impacts their daily life. These clusters also belong to the highest income categories, suggesting that higher income groups tend to have better physical health outcomes. In contrast, Cluster 2 is skewed to the right, showing a higher number of days of poor physical health, with income levels spread throughout the range. Clusters 1 and 2 have a denser concentration of observations on the right side, reflecting the groups with the poorest health outcomes, regardless of their income distribution.
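
The clustering itself was run in Weka, but a rough Python equivalent of the workflow (an elbow sweep, then K-means with the chosen k) looks like the sketch below; the input file name and the cleaned-variable selection are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical extract of the cleaned BRFSS 2015 file; column names follow the codebook.
cols = ["GENHLTH", "MENTHLTH", "PHYSHLTH", "LSATISFY", "INCOME2"]
df = pd.read_csv("brfss2015_clean.csv", usecols=cols).dropna()
X = StandardScaler().fit_transform(df)

# Elbow method: inspect within-cluster sum of squares (inertia) as k increases.
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(2, 12)}
print(inertia)

# Fit the chosen model (k=6 for the health-and-income analysis) and profile the clusters.
df["cluster"] = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster").mean().round(2))
```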

Analysis 2: Health and Education

For this analysis, I used 9 clusters based on the elbow method. Key findings include:

  • Cluster 0 (19%): Highly educated (college graduates) with very good health and high life satisfaction.
  • Cluster 2 (4%): Individuals with some college education but severe physical and mental health challenges.
  • Cluster 6 (8%): High school graduates with moderate health challenges yet high life satisfaction, suggesting resilience.
  • Other clusters demonstrated how different levels of educational attainment impact health outcomes and satisfaction levels.

In general, the majority of the dataset reported high or moderate life satisfaction. Clusters 0, 1, and 4 show high concentrations toward the left side of the plot, indicating high life satisfaction. These clusters also represent individuals with the highest levels of educational attainment (primarily college graduates). In contrast, Cluster 2 displays the widest spread in life satisfaction levels and consists mostly of individuals with high school education or lower.

Key Takeaways:

  • The majority of the dataset reported moderate to high life satisfaction. Clusters with the highest educational levels (college graduates) were concentrated in groups with higher satisfaction and better health outcomes.
  • Cluster 2 showed the widest spread of life satisfaction and predominantly consisted of individuals with high school education or lower, indicating the need for a more in-depth understanding of what contributes to variability in well-being among this group.

Critical Reflections and Future Directions:

  1. Dataset Limitations: The dataset is predominantly composed of white and highly educated individuals, limiting the generalizability of these findings. To make public health insights more inclusive, future analyses should use more diverse datasets.
  2. Adding More Variables: Incorporating factors like healthcare access, chronic disease indicators, and racial identity could provide a more comprehensive understanding of health disparities and social determinants.
  3. Methodological Improvements: While K-means clustering in Weka is effective for straightforward analysis, it has limitations with non-linear relationships and imbalanced datasets. Future projects will explore more advanced clustering techniques like DBSCAN or hierarchical clustering using Python for deeper insights.
  4. Actionable Steps: I plan to expand future analyses by integrating more demographic variables and advanced techniques to provide a fuller picture of factors influencing health and life satisfaction in the U.S. population.

By continually refining my approach, I aim to produce more meaningful and comprehensive public health insights. This project served as a valuable practice in understanding how socioeconomic factors impact health outcomes.

Full lab write up

Exploring Predictive Analytics with KNIME: A Comparative Model Challenge

Background

For my Information Analytics course, I took on the KNIME challenge, where I had full freedom to explore a data science project. KNIME is a data management and analytics platform similar to Alteryx. I planned and executed the entire project independently, exploring different predictive models and evaluating their accuracy in predicting hospital charges. My primary goal was to compare multiple models using a medical cost dataset to determine which one performed the best.

Dataset

I used the Medical Cost Dataset from Kaggle for this project, which includes eight variables like age, BMI, smoker status, and medical charges. My objective was to predict medical charges (target variable) using various predictors (age, BMI, smoker status, etc.).
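
The project itself was built entirely in KNIME, but a rough scikit-learn sketch of the same prediction task may help make the later metrics concrete; the file name follows the common Kaggle download (insurance.csv), which is an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Kaggle Medical Cost dataset: age, sex, bmi, children, smoker, region, charges.
df = pd.get_dummies(pd.read_csv("insurance.csv"), drop_first=True)
X, y = df.drop(columns="charges"), df["charges"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=500, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)
print("R2:  ", round(r2_score(y_test, pred), 3))
print("MAE: ", round(mean_absolute_error(y_test, pred), 1))
print("RMSE:", round(mean_squared_error(y_test, pred) ** 0.5, 1))
```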

Model Comparisons

Model 1: Random Forest (Social and Biomedical Variables)

  1. R² and Adjusted R²: Both values were high, indicating the model captured 86% of the variance in hospital charges.
  2. Mean Absolute Error (MAE): The MAE was 2794.003, representing 21% of the mean charge and 30% of the median charge, indicating potential inaccuracy for predicting lower-cost cases.
  3. Root Mean Squared Error (RMSE): The RMSE of 4,406.298, higher than the MAE of 2,794.003, indicates significant outliers in the dataset, as RMSE is more sensitive to large errors and emphasizes the impact of extreme values.
  4. Correlation: The correlation between predicted and actual charges was 0.929 (p-value of 0), showing a strong relationship. However, the model underpredicted charges in many cases.

Model 2: Random Forest (Biomedical-Only Variables)

  1. R² and Adjusted R²: Both were reported as 1, suggesting a perfect fit, which likely indicates overfitting. Further testing on separate datasets is needed to confirm generalizability.
  2. MAE: The MAE was reported as 0, which was unrealistic given discrepancies observed between predicted and actual charges. This raised concerns about the validity of the model’s metrics.
  3. Correlation: The biomedical-only model had a correlation of 0.927 (p-value of 0), slightly lower than the social-biomedical model, showing the importance of including social variables.

Model 3: Linear Regression

  1. R² and Adjusted R²: These values were lower than Random Forest, explaining 78% of the variance in medical charges.
  2. MAE: The MAE was 3770.463, which was higher than in the Random Forest model, representing 28% of the mean charge and 40% of the median charge. This indicates less accuracy in predicting costs, especially in lower-cost instances.
  3. Root Mean Squared Error (RMSE):The RMSE was 5,472.896, which is higher than both the MAE and the RMSE of the Random Forest model.
  4. Correlation: The model had a correlation of 0.886 (p-value of 0), but it also predicted some charges to be negative, which is unrealistic in a medical cost context.

Model 4: Linear Regression (Biomedical-Only Variables)

  1. R² and Adjusted R²: These values remained the same as the social-biomedical model, explaining 78% of the variance.
  2. MAE: The MAE was 3,824.172, slightly higher than the social-biomedical linear regression model, indicating less accuracy. This model also underpredicted and overpredicted charges similarly to the combined social-biomedical model.
  3. Root Mean Squared Error (RMSE): The RMSE was 5,474.757
  4. Correlation: The correlation was 0.886, identical to the social-biomedical model, further suggesting the need for social variables to improve prediction accuracy.

Model 5: K-Nearest Neighbors (KNN)

I experimented with two binning approaches for this model:

  1. 5 Bins: The model performed well, with high accuracy:
  • Recall: 98.1%
  • Precision: 98%
  • Overall Accuracy: 99.3% (likely inflated by the coarseness of using only 5 bins)
  2. 20 Bins: With more bins, accuracy decreased slightly but remained within an acceptable range:
  • Recall: 95.5%
  • Precision: 95.84%
  • Overall Accuracy: 95.5% (which is in line with recommended accuracy ranges for predictive models)

Conclusion

Through this challenge, I gained valuable experience using KNIME and comparing multiple predictive models for hospital charge predictions. I applied both regression and classification models and developed a deeper understanding of how different models perform in healthcare contexts.

Based on performance, I recommend using the Random Forest model with both biomedical and social variables for predicting hospital charges, as it demonstrated the most accurate and reliable results on this dataset without overfitting the training data.


This project has enhanced my ability to interpret and communicate model results effectively, and I look forward to applying these insights to future healthcare-related predictive analytics projects.

For additional details you can check out the lab report submitted for my course:

Exploring Social Data with Principal Component Analysis (PCA)

During my summer internship at the NIH, I was introduced to Principal Component Analysis (PCA) through a colleague. Intrigued by PCA’s potential, I wanted to apply this technique to social data, particularly from the National Longitudinal Study of Adolescent to Adult Health (Add Health), which contains a rich dataset of 10,237 variables.

Objective

My goal was to identify underlying patterns in social factors like academic performance, self-esteem, relationships with parents, and substance use. I narrowed down the vast dataset to 50 key variables to uncover trends and relationships.

Approach

I began by learning PCA through various resources, including Kaggle tutorials and DataCamp courses. I also revisited linear algebra fundamentals to ensure a solid mathematical understanding.

For the analysis:

  1. Data Cleaning: Initially, I filled missing values with -1, but realized this approach needed refinement based on the scale of responses.
  2. PCA Implementation: I used the prcomp function in R to perform PCA. Focusing on the first two principal components, which explained 27.3% of the variance, allowed me to manage the complexity.
  3. Visualization: I created a biplot to visualize the results. Due to the large number of variables, I filtered for the most influential ones, revealing that alcohol usage significantly impacts dataset variability.
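
The analysis was done with prcomp in R; for readers who prefer Python, the sketch below is a rough equivalent of the same workflow (standardize, fit PCA, check variance explained, inspect loadings, then cluster the component scores). The input file name is a placeholder.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical extract: the 50 selected Add Health variables, numerically coded.
df = pd.read_csv("addhealth_subset.csv")
X = StandardScaler().fit_transform(df)  # equivalent of prcomp(..., scale. = TRUE)

pca = PCA(n_components=10)
scores = pca.fit_transform(X)
print(pca.explained_variance_ratio_[:2].sum())  # variance captured by PC1 + PC2

# Loadings on the first two components, sorted to surface the most influential variables.
loadings = pd.DataFrame(pca.components_[:2].T, index=df.columns, columns=["PC1", "PC2"])
print(loadings.reindex(loadings["PC1"].abs().sort_values(ascending=False).index).head(10))

# Two clusters on the first two component scores, as in the write-up.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores[:, :2])
```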

Findings

  • Principal Component 1: Associated with lower self-esteem, moderate alcohol use, and less satisfaction in parent relationships.
  • Principal Component 2: Linked to positive school behavior, higher grades, less loneliness, and lower alcohol consumption.

Using K-means clustering, I identified two groups:

  • Cluster 1 (Red): Higher on PC1, indicating lower self-esteem and weaker parental bonds.
  • Cluster 2 (Blue): Higher on PC2, suggesting better academic performance and less loneliness.

The analysis highlighted how alcohol usage and social factors contribute to overall data variability. I plan to refine my approach with a smaller dataset for better interpretation.


Quantifying Health Equity in Cancer: A Comparative Analysis Using Mortality-Incidence Ratios (MIR) Across Racial Groups

TL;DR:

This article explores racial disparities in cancer outcomes in NC using the Mortality-Incidence Ratio (MIR) as a metric to quantify health equity. Comparing each group’s MIR against two reference points, the White MIR and the overall MIR, revealed significant disparities, particularly among Black and Native American populations. The analysis underscores the importance of careful benchmark selection in health equity research and highlights the complex factors contributing to these disparities. Addressing these issues requires interdisciplinary research and targeted public health interventions to ensure equitable health outcomes for all.

Key Findings:

– Black patients with melanoma had an MIR of 0.43, meaning 43% of those diagnosed in NC died, compared to just 7% of White patients.

– Native American populations faced extreme disparities in ovarian and esophageal cancers, with alarmingly high MIRs.

– Hispanic populations showed fewer disparities compared to the overall and white reference groups, but this finding may be misleading due to the tendency to view this diverse group as a monolith, which can obscure the unique disparities within subgroups.

Introduction

As I embark on my research career, my focus has increasingly centered on health equity—a concept that examines the fairness and justice of health outcomes across different populations. My PhD work aims to develop quantitative methods that better assess and address disparities in healthcare delivery and outcomes. This interest in health equity, alongside recent experiences ranging from my summer internship at the NIH focused on ovarian cancer to my current role in clinical data science within an oncology clinical trial at Novant Health, has led me to this project.

The Mortality-Incidence Ratio (MIR) offers a powerful metric for this purpose, serving as an indicator of how lethal a disease is relative to its occurrence within a population. By examining the MIR across different racial groups, I aimed to quantify health equity within the realm of cancer care. This analysis compares MIRs using two reference points: the White MIR and the overall MIR, providing insights into how racial disparities manifest in cancer outcomes.

Background

Racial disparities in health outcomes are a critical and well-documented issue in public health, manifesting in various forms across different diseases. These disparities are often driven by a complex interplay of factors, including social determinants of health (such as education, income, and access to healthcare), genetic predispositions, and environmental exposures. The Mortality-Incidence Ratio (MIR) is particularly useful for examining these disparities because it quantifies the severity of a disease by comparing the mortality rate to the incidence rate within a population.

However, it is important to recognize that the MIR is just one piece of the puzzle. Health outcomes, especially in diseases as multifaceted as cancer, are influenced by numerous factors that extend beyond the scope of a single metric. These include healthcare access, the quality of care received, health education, and broader social and environmental determinants, such as food insecurity, limited access to exercise, and pollution exposure. By understanding these interactions, we can better interpret the disparities revealed through MIR analysis and work toward more equitable health outcomes.

Methods

To investigate racial disparities in cancer outcomes, I conducted an analysis of Mortality-Incidence Ratios (MIRs) across various racial groups using data from the North Carolina Department of Health and Human Services (NC DHHS) for the years 2018-2022. The data was age-adjusted to the standard 2000 population to account for differences in age distribution across racial groups, ensuring that the comparisons were as accurate as possible.

MIRs were calculated by dividing the mortality rate by the incidence rate for each racial group. To provide a comprehensive view of disparities, I used two reference points for comparison: the MIR for the white population and the overall MIR, which represents the aggregated outcomes across all racial groups. This dual approach allowed me to assess how each racial group’s outcomes compared both to the population as a whole and to a specific racial group with historically better health outcomes.
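
To make the calculation concrete, here is a small sketch of the MIR computation and the two comparisons. The rates are placeholders chosen only so the melanoma ratios match the MIRs reported below; they are not the actual NC DHHS values.

```python
import pandas as pd

# Placeholder age-adjusted rates per 100,000 (2018-2022).
rates = pd.DataFrame({
    "group":     ["White", "Black"],
    "incidence": [30.0, 1.0],
    "mortality": [2.1, 0.43],
})
rates["MIR"] = rates["mortality"] / rates["incidence"]

# Compare each group against the two reference points used in the analysis:
# the White MIR and the overall (all-races) MIR, the latter a placeholder value here.
white_mir = rates.loc[rates["group"] == "White", "MIR"].iloc[0]
overall_mir = 0.10
rates["diff_vs_white"] = rates["MIR"] - white_mir
rates["diff_vs_overall"] = rates["MIR"] - overall_mir
print(rates.round(3))
```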

Data

Charts: difference in MIR relative to the overall reference group, and difference in MIR relative to the White reference group.

Key findings

The analysis revealed significant disparities in cancer outcomes across racial groups. For instance, from 2018 to 2022, the MIR for melanoma among Black patients in North Carolina was 0.43, indicating that 43% of Black individuals diagnosed with melanoma during this period died from the disease. In contrast, the MIR for melanoma among white patients was markedly lower, at 0.07 (7%).

This disparity likely reflects late-stage diagnosis among Black patients, which can result from several factors:

Lack of Awareness: There may be a limited understanding among Black patients regarding their risk for melanoma, contributing to delayed diagnosis and treatment.

Access to Specialized Care: Limited access to dermatological care or healthcare in general can exacerbate the severity of the disease by delaying diagnosis and treatment.

Physician Training: Many physicians may not receive adequate training on the presentation of melanoma in Black patients, leading to missed or late diagnoses.

Native American populations faced particularly severe disparities in ovarian and esophageal cancers, with MIRs of 1, meaning that all patients diagnosed with these cancers between 2018 and 2022 died. These cancers are already aggressive, but the outcomes were disproportionately worse for Native American patients, likely due to:

Healthcare Access: Persistent barriers to accessing quality healthcare.

Intergenerational Trauma: Long-term, intergenerational oppression and its impacts on health.

Interestingly, the data showed fewer disparities for Hispanics when compared to the overall or white reference groups. However, this finding warrants caution. The Hispanic population is often treated as a monolithic group, despite being genetically and culturally diverse. This homogenization can obscure the nuances of disparities within this group. Additionally, underreporting due to immigration status may further distort the data. Notably, the American Cancer Society reports a lifetime cancer mortality risk of 1 in 5 for Hispanic men and 1 in 6 for Hispanic women—figures that are not fully reflected in the North Carolina data.

Discussion

The disparities observed using different reference points underscore the importance of selecting appropriate benchmarks in health equity research. The more pronounced disparities identified using the white MIR suggest that this group benefits from factors—such as better access to care—that improve outcomes, making it a stringent reference point. In contrast, the overall MIR, which averages outcomes across all racial groups, may obscure significant disparities that are critical to understanding and addressing health equity.

Quantifying health equity through metrics like the MIR is essential for identifying where interventions are most needed. However, achieving health equity requires more than just identifying disparities—it demands concerted efforts to address the underlying causes, which are often deeply rooted in social, economic, and political contexts. While health equity may not be a focal point in current political discussions, it is a critical area that must be continuously upheld and prioritized.

Interdisciplinary research is key to advancing our understanding of health disparities. By integrating insights from epidemiology, sociology, economics, and other fields, we can work toward a future where the MIR differential approaches zero across all racial groups. As we look ahead, it will be important to monitor how policy changes, such as North Carolina’s recent Medicaid expansion, impact these disparities—potentially revealing whether the core issue lies in access to care or other systemic factors.

Sources:

Why are so many Black patients dying of skin cancer? | AAMC

Melanoma Among Non-Hispanic Black Americans.

Racial differences in time to treatment for melanoma – PMC

The ongoing racial disparities in melanoma: An analysis of the Surveillance, Epidemiology, and End Results database (1975–2016).

Cancer statistics for American Indian and Alaska Native individuals, 2022: Including increasing disparities in early onset colorectal cancer

How recognizing diversity among Hispanics could improve health outcomes | AAMC

Cancer Facts & Figures for Hispanics & Latinos 2018-2020