Clustering Individuals Based on Health and Socioeconomic Indicators Using the CDC’s BRFSS Data

Project Overview: I analyzed the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset using K-means clustering to identify groups of respondents with similar health and life satisfaction patterns. By combining health indicators with socioeconomic factors (household income and education), I aimed to understand how these social determinants relate to individual health outcomes. The initial dataset contained over 400,000 observations, which I reduced to 15,032 by removing records with incomplete data.

Variables Used:

  • General Health (GENHLTH): Measures perceived overall health (higher values indicate poorer health).
  • Mental Health (MENTHLTH): Number of days mental health negatively impacted daily life (higher values indicate more frequent struggles).
  • Physical Health (PHYSHLTH): Number of days physical health was poor.
  • Life Satisfaction (LSATISFY): Reflects self-reported quality of life (higher values mean lower satisfaction).
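
The cleaning and clustering were actually run in Weka (more on that in the reflections below), but for readers who prefer code, a minimal Python sketch of an equivalent preparation step might look like the following. The file name, the handling of BRFSS special codes, and the exact filtering rules are my assumptions, not the original workflow:

```python
# A sketch of an equivalent cleaning step in Python (the original
# analysis used Weka). File name and special-code handling are
# assumptions, not the original workflow.
import pandas as pd

cols = ["GENHLTH", "MENTHLTH", "PHYSHLTH", "LSATISFY", "INCOME2", "EDUCA"]
df = pd.read_csv("2015.csv", usecols=cols)  # hypothetical BRFSS 2015 export

# BRFSS codes "none" as 88 and "don't know"/"refused" as 77/99 on the
# day-count items; recode 88 to 0 days and drop the non-answers.
for col in ["MENTHLTH", "PHYSHLTH"]:
    df[col] = df[col].replace({88: 0})
    df = df[~df[col].isin([77, 99])]

# Drop all remaining incomplete records, mirroring the reduction from
# over 400,000 observations to the final analysis sample.
df = df.dropna().reset_index(drop=True)
print(f"{len(df)} complete observations retained")
```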

Clustering Analysis and Findings:

Analysis 1: Health and Income

Using the elbow method, I determined an optimal cluster count of 6 (a sketch of this step follows the cluster list below). Here's what I found:

  • Cluster 0 (18%): Highest income group (>$75,000) with excellent health and high life satisfaction.
  • Cluster 1 (9%): Lowest income group ($10,000-$15,000) with significant health challenges but moderate life satisfaction.
  • Cluster 2 (6%): Middle-income earners ($35,000-$50,000) with the poorest health indicators but moderate satisfaction.
  • Cluster 3 (14%): Middle-income group ($35,000-$50,000) with good health and high life satisfaction.
  • Cluster 4 (35%): Largest group with highest income levels (>$75,000), showing good health and high life satisfaction.
  • Cluster 5 (18%): Upper middle-income earners ($50,000-$75,000) with similar good health and high satisfaction.
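
As referenced above, here is a sketch of the elbow procedure and the final k = 6 fit, continuing from the preparation sketch. The actual analysis ran in Weka, so treat the scikit-learn parameters as approximations:

```python
# Elbow method and k=6 fit with scikit-learn; continues from the
# preparation sketch above (df). An approximation of the Weka workflow.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["GENHLTH", "MENTHLTH", "PHYSHLTH", "LSATISFY", "INCOME2"]
X = StandardScaler().fit_transform(df[features])

# Within-cluster sum of squares (inertia) for k = 1..10; the "elbow"
# is where additional clusters stop reducing inertia meaningfully.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_
print(df["cluster"].value_counts(normalize=True))  # cluster shares
```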

This analysis highlights that higher income is associated with better health outcomes and life satisfaction, reinforcing existing evidence on the impact of socioeconomic factors.

This plot displays the relationship between income (INCOME2, y-axis) and physical health (PHYSHLTH, x-axis). Clusters 0, 3, and 5 are skewed to the left, indicating that these groups experience fewer days where poor physical health affects their daily life. These clusters also belong to the highest income categories, suggesting that higher-income groups tend to have better physical health outcomes. In contrast, Cluster 2 is skewed to the right, showing a higher number of days of poor physical health, with income levels spread across the range. Clusters 1 and 2 show the densest concentration of observations on the right side, marking them as the groups with the poorest health outcomes regardless of income.
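
A view along these lines can be recreated from the fitted cluster labels; here is a quick matplotlib sketch (not the original visualization):

```python
# Recreating the described view: PHYSHLTH (x) vs. INCOME2 (y), colored
# by cluster. Continues from the clustering sketch above.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
pts = ax.scatter(df["PHYSHLTH"], df["INCOME2"],
                 c=df["cluster"], cmap="tab10", s=8, alpha=0.4)
ax.set_xlabel("PHYSHLTH (days of poor physical health)")
ax.set_ylabel("INCOME2 (income category)")
ax.legend(*pts.legend_elements(), title="Cluster", loc="upper right")
plt.show()
```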

Analysis 2: Health and Education

For this analysis, I used 9 clusters based on the elbow method. Key findings include:

  • Cluster 0 (19%): Highly educated (college graduates) with very good health and high life satisfaction.
  • Cluster 2 (4%): Individuals with some college education but severe physical and mental health challenges.
  • Cluster 6 (8%): High school graduates with moderate health challenges yet high life satisfaction, suggesting resilience.
  • Other clusters demonstrated how different levels of educational attainment impact health outcomes and satisfaction levels.

In general, the majority of the dataset reported high or moderate life satisfaction. Clusters 0, 1, and 4 show high concentrations toward the left side of the life satisfaction plot, indicating high satisfaction. These clusters also represent individuals with the highest levels of educational attainment (primarily college graduates). In contrast, Cluster 2 displays the widest spread in life satisfaction levels and consists mostly of individuals with high school education or lower.
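
Profiles like these come from summarizing each cluster's average values and size. A sketch of that step, assuming the model has been refit with n_clusters=9 on a feature set that swaps INCOME2 for EDUCA:

```python
# Profiling clusters by their mean values and share of the sample;
# continues from the sketches above after refitting with n_clusters=9.
profile = df.groupby("cluster")[
    ["GENHLTH", "MENTHLTH", "PHYSHLTH", "LSATISFY", "EDUCA"]
].mean().round(2)
profile["share"] = df["cluster"].value_counts(normalize=True).round(2)
print(profile.sort_values("share", ascending=False))
```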

Key Takeaways:

  • The majority of the dataset reported moderate to high life satisfaction. Clusters with the highest educational levels (college graduates) were concentrated in groups with higher satisfaction and better health outcomes.
  • Cluster 2 showed the widest spread of life satisfaction and predominantly consisted of individuals with high school education or lower, indicating the need for a more in-depth understanding of what contributes to variability in well-being among this group.

Critical Reflections and Future Directions:

  1. Dataset Limitations: The dataset is predominantly composed of white and highly educated individuals, limiting the generalizability of these findings. To make public health insights more inclusive, future analyses should use more diverse datasets.
  2. Adding More Variables: Incorporating factors like healthcare access, chronic disease indicators, and racial identity could provide a more comprehensive understanding of health disparities and social determinants.
  3. Methodological Improvements: While K-means clustering in Weka is effective for straightforward analysis, it has limitations with non-linear relationships and imbalanced datasets. Future projects will explore more advanced clustering techniques like DBSCAN or hierarchical clustering in Python (see the sketch after this list) for deeper insights.
  4. Actionable Steps: I plan to expand future analyses by integrating more demographic variables and advanced techniques to provide a fuller picture of factors influencing health and life satisfaction in the U.S. population.
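
For item 3, here is a sketch of what those alternatives look like in scikit-learn; the parameter values are purely illustrative and would need tuning on the actual data:

```python
# Illustrative alternatives to K-means; X is the standardized feature
# matrix from the clustering sketch. Parameters would need tuning.
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# DBSCAN grows clusters from dense regions and labels sparse points as
# noise (-1), so it can capture non-spherical shapes and outliers.
db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Agglomerative clustering merges observations bottom-up and does not
# assume equally sized, spherical clusters the way K-means does.
hc_labels = AgglomerativeClustering(n_clusters=6).fit_predict(X)
```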

By continually refining my approach, I aim to produce more meaningful and comprehensive public health insights. This project served as a valuable practice in understanding how socioeconomic factors impact health outcomes.

Full lab write-up

Exploring Predictive Analytics with KNIME: A Comparative Model Challenge

Background

For my Information Analytics course, I took on the KNIME challenge, where I had full freedom to explore a data science project. KNIME is a data management and analytics platform similar to Alteryx. I planned and executed the entire project independently, exploring different predictive models and evaluating their accuracy in predicting hospital charges. My primary goal was to compare multiple models using a medical cost dataset to determine which one performed the best.

Dataset

I used the Medical Cost Dataset from Kaggle for this project, which includes eight variables such as age, BMI, smoker status, and medical charges. My objective was to predict medical charges (the target variable) using the remaining variables (age, BMI, smoker status, etc.) as predictors.

Model Comparisons

Model 1: Random Forest (Social and Biomedical Variables)

  1. R² and Adjusted R²: Both values were high, indicating the model captured 86% of the variance in hospital charges.
  2. Mean Absolute Error (MAE): The MAE was 2,794.003, representing 21% of the mean charge and 30% of the median charge, indicating potential inaccuracy when predicting lower-cost cases.
  3. Root Mean Squared Error (RMSE): The RMSE of 4,406.298, higher than the MAE of 2,794.003, indicates significant outliers in the dataset, as RMSE is more sensitive to large errors and emphasizes the impact of extreme values.
  4. Correlation: The correlation between predicted and actual charges was 0.929 (p-value of 0), showing a strong relationship. However, the model underpredicted charges in many cases.
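
All of these models were built in KNIME, but the evaluation translates naturally to code. Below is a sketch of an equivalent Python pipeline; the file and column names follow the Kaggle medical cost dataset, and the 80/20 split and hyperparameters are my assumptions rather than the KNIME workflow's settings:

```python
# A Python sketch of the Random Forest evaluation (the actual models
# were built in KNIME; split and hyperparameters are assumptions).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("insurance.csv")            # Kaggle medical cost dataset
X = pd.get_dummies(data.drop(columns="charges"), drop_first=True)
y = data["charges"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)

r2 = r2_score(y_te, pred)
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted for p predictors
mae = mean_absolute_error(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))  # penalizes large errors more
corr, _ = pearsonr(y_te, pred)
print(f"R2={r2:.3f}  adjR2={adj_r2:.3f}  MAE={mae:.1f}  "
      f"RMSE={rmse:.1f}  r={corr:.3f}")
```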

Model 2: Random Forest (Biomedical-Only Variables)

  1. R² and Adjusted R²: Both were reported as 1, suggesting a perfect fit, which likely indicates overfitting. Further testing on separate datasets is needed to confirm generalizability.
  2. MAE: The MAE was reported as 0, which was unrealistic given discrepancies observed between predicted and actual charges. This raised concerns about the validity of the model’s metrics.
  3. Correlation: The biomedical-only model had a correlation of 0.927 (p-value of 0), slightly lower than the social-biomedical model's 0.929, suggesting that the social variables add modest predictive value.
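
One way to sanity-check metrics that look too good is to compare training performance against a held-out test set. Here is a sketch continuing from the code above; which columns count as "social" versus "biomedical" is my assumption:

```python
# Sanity check for the "perfect" metrics: compare training vs. held-out
# performance. Continues from the Random Forest sketch above; treating
# sex and region as the "social" columns is an assumption.
bio_cols = [c for c in X.columns
            if not c.startswith(("sex_", "region_"))]
rf_bio = RandomForestRegressor(n_estimators=100, random_state=0)
rf_bio.fit(X_tr[bio_cols], y_tr)
print("train R2:", rf_bio.score(X_tr[bio_cols], y_tr))  # near-perfect
print("test  R2:", rf_bio.score(X_te[bio_cols], y_te))  # the honest number
```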

Model 3: Linear Regression

  1. R² and Adjusted R²: These values were lower than for the Random Forest model, explaining 78% of the variance in medical charges.
  2. MAE: The MAE was 3,770.463, higher than in the Random Forest model, representing 28% of the mean charge and 40% of the median charge. This indicates less accuracy in predicting costs, especially in lower-cost instances.
  3. Root Mean Squared Error (RMSE): The RMSE was 5,472.896, higher than both the MAE and the RMSE of the Random Forest model.
  4. Correlation: The model had a correlation of 0.886 (p-value of 0), but it also predicted some charges to be negative, which is unrealistic in a medical cost context.
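
A sketch of the linear model, including a check for the negative predictions noted above (same assumed dataset, encoding, and split as the Random Forest sketch):

```python
# Linear regression on the same split, with a check for the unrealistic
# negative charge predictions. Continues from the sketches above.
from sklearn.linear_model import LinearRegression

lin = LinearRegression().fit(X_tr, y_tr)
lin_pred = lin.predict(X_te)
print("test R2:", lin.score(X_te, y_te))
print("negative predictions:", int((lin_pred < 0).sum()))
```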

Model 4: Linear Regression (Biomedical-Only Variables)

  1. R² and Adjusted R²: These values remained the same as the social-biomedical model, explaining 78% of the variance.
  2. MAE: The MAE was 3,824.172, slightly higher than the social-biomedical linear regression model, indicating less accuracy. This model also underpredicted and overpredicted charges similarly to the combined social-biomedical model.
  3. Root Mean Squared Error (RMSE): The RMSE was 5,474.757, marginally higher than the social-biomedical model's 5,472.896.
  4. Correlation: The correlation was 0.886, identical to the social-biomedical model, further suggesting the need for social variables to improve prediction accuracy.

Model 5: K-Nearest Neighbors (KNN)

I experimented with two binning approaches for this model (a sketch of the setup follows the results):

  1. 5 Bins: The model performed well, with high accuracy:
  • Recall: 98.1%
  • Precision: 98%
  • Overall Accuracy: 99.3% (likely aided by the coarser binning, which makes the classification task easier)
  2. 20 Bins: With more bins, accuracy decreased slightly but remained within an acceptable range:
  • Recall: 95.5%
  • Precision: 95.84%
  • Overall Accuracy: 95.5% (still in line with generally accepted accuracy for predictive models)
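
As mentioned, here is a sketch of the classification setup: charges are discretized into quantile bins and KNN predicts the bin. The binning strategy, feature scaling, and k = 5 are my assumptions about the KNIME configuration:

```python
# KNN on binned charges; continues from the sketches above. Quantile
# binning, standardization, and k=5 are assumptions, not the exact
# KNIME configuration.
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

y_binned = pd.qcut(y, q=5, labels=False)        # set q=20 for the finer run
X_scaled = StandardScaler().fit_transform(X)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    X_scaled, y_binned, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(Xb_tr, yb_tr)
print(classification_report(yb_te, knn.predict(Xb_te)))
```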

Conclusion

Through this challenge, I gained valuable experience using KNIME and comparing multiple predictive models for hospital charge predictions. I applied both regression and classification models and developed a deeper understanding of how different models perform in healthcare contexts.

Based on performance, I recommend the Random Forest model with both biomedical and social variables for predicting hospital charges, as it demonstrated the most accurate and reliable results on this dataset without overfitting the training data.


This project has enhanced my ability to interpret and communicate model results effectively, and I look forward to applying these insights to future healthcare-related predictive analytics projects.

For additional details you can check out the lab report submitted for my course: