Clustering Individuals Based on Health and Socioeconomic Indicators Using the CDC’s BRFSS Data

Project Overview: I analyzed the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset using K-means clustering to identify groups of respondents with similar reported health and life satisfaction patterns. By combining health indicators with socioeconomic factors (household income and education), I aimed to understand how these social determinants relate to individual health outcomes. The initial dataset contained over 400,000 observations, which I reduced to 15,032 by removing records with incomplete responses.
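The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the project: the special codes (7/9 and 77/99 for "don't know" or "refused", 88 for "none") are assumptions based on the usual BRFSS coding convention and should be checked against the 2015 codebook, and the sample records are made up.

```python
# Assumed BRFSS-style special codes; verify against the 2015 codebook.
MISSING_CODES = {
    "GENHLTH": {7, 9},
    "LSATISFY": {7, 9},
    "MENTHLTH": {77, 99},
    "PHYSHLTH": {77, 99},
    "INCOME2": {77, 99},
}
DAY_COLS = ("MENTHLTH", "PHYSHLTH")
NONE_CODE = 88  # assumed to mean "zero days" for the day-count questions


def clean_record(record):
    """Return a cleaned copy of one respondent's answers, or None if incomplete."""
    cleaned = {}
    for col, bad_codes in MISSING_CODES.items():
        value = record.get(col)
        if value is None or value in bad_codes:
            return None  # blank, "don't know", or "refused": drop the record
        if col in DAY_COLS and value == NONE_CODE:
            value = 0  # recode "none" to an actual count of zero days
        cleaned[col] = value
    return cleaned


def clean_dataset(records):
    """Keep only the records that pass clean_record."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


# Made-up respondents, for illustration only.
sample = [
    {"GENHLTH": 2, "LSATISFY": 1, "MENTHLTH": 88, "PHYSHLTH": 3, "INCOME2": 8},
    {"GENHLTH": 9, "LSATISFY": 2, "MENTHLTH": 5, "PHYSHLTH": 4, "INCOME2": 5},
    {"GENHLTH": 3, "LSATISFY": None, "MENTHLTH": 10, "PHYSHLTH": 15, "INCOME2": 2},
]
kept = clean_dataset(sample)
print(f"kept {len(kept)} of {len(sample)} records")
```

Dropping every incomplete record is simple but costly, which is consistent with the sample shrinking from over 400,000 rows to 15,032.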

Variables Used:

  • General Health (GENHLTH): Measures perceived overall health (higher values indicate poorer health).
  • Mental Health (MENTHLTH): Number of days in the past 30 on which mental health was not good (higher values indicate more frequent struggles).
  • Physical Health (PHYSHLTH): Number of days in the past 30 on which physical health was not good (higher values indicate more frequent problems).
  • Life Satisfaction (LSATISFY): Reflects self-reported quality of life (higher values mean lower satisfaction).

Clustering Analysis and Findings:

Analysis 1: Health and Income

Using the elbow method, I determined an optimal cluster count of 6. Here’s what I found:

  • Cluster 0 (18%): Highest income group (>$75,000) with excellent health and high life satisfaction.
  • Cluster 1 (9%): Lowest income group ($10,000-$15,000) with significant health challenges but moderate life satisfaction.
  • Cluster 2 (6%): Middle-income earners ($35,000-$50,000) with the poorest health indicators but moderate satisfaction.
  • Cluster 3 (14%): Middle-income group ($35,000-$50,000) with good health and high life satisfaction.
  • Cluster 4 (35%): Largest group with highest income levels (>$75,000), showing good health and high life satisfaction.
  • Cluster 5 (18%): Upper middle-income earners ($50,000-$75,000) with similarly good health and high satisfaction.

This analysis highlights that higher income is associated with better health outcomes and life satisfaction, reinforcing existing evidence on the impact of socioeconomic factors.
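For readers unfamiliar with the elbow method used to pick the cluster count, here is a minimal sketch in plain Python. It is not the Weka workflow used in this project: the K-means routine is a bare-bones Lloyd's algorithm, the farthest-first initialization is swapped in only to keep the example deterministic, and the points are made-up 2-D data with three obvious groups.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization: repeatedly pick the point farthest from chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, within-cluster sum of squares)."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1..max_k."""
    return {k: kmeans(points, k)[2] for k in range(1, max_k + 1)}


# Made-up data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
curve = elbow_curve(points, 4)
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

The "elbow" is the value of k after which the within-cluster sum of squares stops dropping sharply. In this toy example that happens at k = 3; on real survey data the bend is usually less clear-cut and involves some judgment.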

This image displays the relationship between income (INCOME2, y-axis) and physical health (PHYSHLTH, x-axis). Clusters 0, 3, and 5 are skewed to the left, indicating that these groups report fewer days of poor physical health. These clusters fall in the middle to highest income categories, suggesting that higher income groups tend to have better physical health outcomes. In contrast, Clusters 1 and 2 are concentrated on the right side, reflecting the groups with the most days of poor physical health; Cluster 2 in particular is spread across the full income range.

Analysis 2: Health and Education

For this analysis, I used 9 clusters based on the elbow method. Key findings include:

  • Cluster 0 (19%): Highly educated (college graduates) with very good health and high life satisfaction.
  • Cluster 2 (4%): Individuals with some college education but severe physical and mental health challenges.
  • Cluster 6 (8%): High school graduates with moderate health challenges yet high life satisfaction, suggesting resilience.
  • Other clusters demonstrated how different levels of educational attainment impact health outcomes and satisfaction levels.

In general, the majority of the dataset reported high or moderate life satisfaction. Clusters 0, 1, and 4 show high concentrations toward the left side of the plot, indicating high life satisfaction. These clusters also represent individuals with the highest levels of educational attainment (primarily college graduates). In contrast, Cluster 2 displays the widest spread in life satisfaction levels and consists mostly of individuals with high school education or lower.
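Cluster descriptions like the ones above (a share of the sample plus a typical health and satisfaction level) come from summarizing each cluster's members. A minimal sketch of that summary step, assuming cluster labels are already available; the rows and labels here are made up and the function name is hypothetical:

```python
from collections import defaultdict


def profile_clusters(rows, labels):
    """Share of rows and per-variable mean for each cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    profile = {}
    for label, members in sorted(groups.items()):
        profile[label] = {
            "share": len(members) / len(rows),
            "means": {
                var: sum(m[var] for m in members) / len(members)
                for var in members[0]
            },
        }
    return profile


# Made-up respondents and cluster assignments, for illustration only.
rows = [
    {"GENHLTH": 1, "PHYSHLTH": 0, "LSATISFY": 1},
    {"GENHLTH": 2, "PHYSHLTH": 2, "LSATISFY": 1},
    {"GENHLTH": 4, "PHYSHLTH": 20, "LSATISFY": 3},
    {"GENHLTH": 5, "PHYSHLTH": 30, "LSATISFY": 2},
    {"GENHLTH": 2, "PHYSHLTH": 1, "LSATISFY": 2},
]
labels = [0, 0, 1, 1, 0]

profile = profile_clusters(rows, labels)
for label, summary in profile.items():
    means = {var: round(value, 2) for var, value in summary["means"].items()}
    print(f"Cluster {label} ({summary['share']:.0%}): {means}")
```

Because higher values mean poorer health and lower satisfaction on these scales, a cluster with low means reads as "good health, high satisfaction".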

Key Takeaways:

  • The majority of the dataset reported moderate to high life satisfaction. Clusters with the highest educational levels (college graduates) were concentrated in groups with higher satisfaction and better health outcomes.
  • Cluster 2 showed the widest spread of life satisfaction and predominantly consisted of individuals with high school education or lower, indicating the need for a more in-depth understanding of what contributes to variability in well-being among this group.

Critical Reflections and Future Directions:

  1. Dataset Limitations: The dataset is predominantly composed of white and highly educated individuals, limiting the generalizability of these findings. To make public health insights more inclusive, future analyses should use more diverse datasets.
  2. Adding More Variables: Incorporating factors like healthcare access, chronic disease indicators, and racial identity could provide a more comprehensive understanding of health disparities and social determinants.
  3. Methodological Improvements: While K-means clustering in Weka is effective for straightforward analysis, it assumes roughly spherical clusters of similar size, so it handles non-linear relationships and imbalanced datasets poorly. Future projects will explore more advanced clustering techniques like DBSCAN or hierarchical clustering using Python for deeper insights.
  4. Actionable Steps: I plan to expand future analyses by integrating more demographic variables and advanced techniques to provide a fuller picture of factors influencing health and life satisfaction in the U.S. population.

By continually refining my approach, I aim to produce more meaningful and comprehensive public health insights. This project served as valuable practice in understanding how socioeconomic factors relate to health outcomes.

Full lab write-up

Exploring Social Data with Principal Component Analysis (PCA)

During my summer internship at the NIH, a colleague introduced me to Principal Component Analysis (PCA). Intrigued by PCA’s potential, I wanted to apply this technique to social data, particularly from the National Longitudinal Study of Adolescent to Adult Health (Add Health), a rich dataset with 10,237 variables.

Objective

My goal was to identify underlying patterns in social factors like academic performance, self-esteem, relationships with parents, and substance use. I narrowed down the vast dataset to 50 key variables to uncover trends and relationships.

Approach

I began by learning PCA through various resources, including Kaggle tutorials and DataCamp courses. I also revisited linear algebra fundamentals to ensure a solid mathematical understanding.

For the analysis:

  1. Data Cleaning: Initially, I filled missing values with -1, but this needed refinement because PCA treats -1 as a real response, and its effect depends on the scale of each variable's responses.
  2. PCA Implementation: I used the prcomp function in R to perform PCA. Focusing on the first two principal components, which together explained 27.3% of the variance, allowed me to manage the complexity.
  3. Visualization: I created a biplot to visualize the results. Due to the large number of variables, I filtered for the most influential ones, which showed that alcohol use accounts for a large share of the variability in the dataset.
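To make the PCA step and the "share of variance explained" figure more concrete, here is a small sketch in plain Python. It is not the R code used for this analysis: it swaps in power iteration with deflation on the correlation matrix in place of the SVD that prcomp performs, it standardizes every variable (whether the original analysis scaled the variables is not stated), and the six rows of data are made up.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]  # arbitrary non-uniform starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


def pca(rows, n_components=2):
    """Return (components, explained-variance ratios, scores) for the leading components."""
    z = standardize(rows)
    cov = covariance(z)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_variance)
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    scores = [[sum(x * w for x, w in zip(row, comp)) for comp in components] for row in z]
    return components, ratios, scores


# Made-up data: columns 1 and 2 move together, column 3 mostly does not.
data = [
    (1, 2, 5), (2, 3, 1), (3, 5, 4),
    (4, 4, 2), (5, 7, 6), (6, 6, 3),
]
components, ratios, scores = pca(data)
print("explained variance ratios:", [round(r, 3) for r in ratios])
print("cumulative for two components:", round(sum(ratios), 3))
```

The cumulative ratio printed at the end is the same kind of quantity as the 27.3% reported above. The entries of each component vector are the loadings a biplot draws as arrows, so variables with large absolute loadings are the "most influential" ones. The sign of a component is arbitrary and may differ between implementations.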

Findings

  • Principal Component 1: Associated with lower self-esteem, moderate alcohol use, and less satisfaction in parent relationships.
  • Principal Component 2: Linked to positive school behavior, higher grades, less loneliness, and lower alcohol consumption.

Using K-means clustering, I identified two groups:

  • Cluster 1 (Red): Higher on PC1, indicating lower self-esteem and weaker parental bonds.
  • Cluster 2 (Blue): Higher on PC2, suggesting better academic performance and less loneliness.

The analysis highlighted how alcohol usage and social factors contribute to overall data variability. I plan to refine my approach with a smaller dataset for better interpretation.

Resources Used