
Background
For my Information Analytics course, I took on the KNIME challenge, which gave me full freedom to explore a data science project of my choosing. KNIME is a data management and analytics platform similar to Alteryx. I planned and executed the entire project independently, with the primary goal of comparing several predictive models on a medical cost dataset to determine which one best predicted hospital charges.
Dataset
I used the Medical Cost Dataset from Kaggle for this project, which includes eight variables such as age, BMI, smoker status, and medical charges. My objective was to predict medical charges (the target variable) from the other variables as predictors (age, BMI, smoker status, etc.).
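To make the setup concrete outside of KNIME, here is a minimal Python sketch (pandas/scikit-learn) of how the data could be loaded and split. The file name, the column names, and my social/biomedical grouping are assumptions for illustration and may not match the KNIME workflow or the dataset's exact variable list.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Assumed local copy of the Kaggle file; column names are the usual ones
    # for this dataset and may differ from the report's exact variable list.
    df = pd.read_csv("insurance.csv")

    target = "charges"                       # medical charges (what we predict)
    biomedical = ["age", "bmi", "smoker"]    # my guess at the biomedical group
    social = ["sex", "children", "region"]   # my guess at the social group

    # One-hot encode the categorical columns so every model family can use them.
    X = pd.get_dummies(df[biomedical + social], drop_first=True)
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )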
Model Comparisons
Model 1: Random Forest (Social and Biomedical Variables)
- R² and Adjusted R²: Both values were high, indicating the model explained 86% of the variance in hospital charges.
- Mean Absolute Error (MAE): The MAE was 2,794.003, about 21% of the mean charge and 30% of the median charge, suggesting the model may be less accurate for lower-cost cases.
- Root Mean Squared Error (RMSE): The RMSE of 4,406.298 was noticeably higher than the MAE of 2,794.003; because RMSE penalizes large errors more heavily, this gap points to extreme, high-cost outliers in the dataset.
- Correlation: The correlation between predicted and actual charges was 0.929 (p-value reported as 0), showing a strong relationship, although the model still underpredicted charges in many cases. (A rough Python sketch of this evaluation follows the list.)
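As a rough Python counterpart to this KNIME Random Forest setup, continuing from the data-loading sketch above, the metrics discussed in this list could be computed as follows; the hyperparameters are illustrative, not the ones used in KNIME, so the numbers will differ.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)

    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    corr, p_value = pearsonr(y_test, pred)

    print(f"R2={r2:.3f}  MAE={mae:,.3f}  RMSE={rmse:,.3f}  r={corr:.3f} (p={p_value:.3g})")

    # Express MAE relative to the typical charge, as in the write-up above.
    print(f"MAE is {mae / y_test.mean():.0%} of the mean charge "
          f"and {mae / y_test.median():.0%} of the median charge")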
Model 2: Random Forest (Biomedical-Only Variables)
- R² and Adjusted R²: Both were reported as 1, suggesting a perfect fit, which likely indicates overfitting. Further testing on separate datasets is needed to confirm generalizability.
- MAE: The MAE was reported as 0, which was unrealistic given discrepancies observed between predicted and actual charges. This raised concerns about the validity of the model’s metrics.
- Correlation: The biomedical-only model had a correlation of 0.927 (p-value reported as 0), slightly lower than the social-biomedical model's 0.929, suggesting that the social variables do add some predictive value. (A quick cross-validation check on the suspicious training metrics is sketched below.)
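A reported R² of 1 and an MAE of 0 often mean a model was scored on rows it had already seen, so one quick sanity check is to evaluate the biomedical-only forest on held-out folds. This sketch continues from the earlier data-loading code and uses my assumed biomedical column list.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Biomedical-only feature matrix (assumed grouping from the Dataset sketch).
    X_bio = pd.get_dummies(df[biomedical], drop_first=True)

    rf_bio = RandomForestRegressor(n_estimators=100, random_state=42)

    # 5-fold cross-validated R2: a genuinely perfect model would stay near 1,
    # while an overfit one drops noticeably on unseen folds.
    cv_r2 = cross_val_score(rf_bio, X_bio, y, cv=5, scoring="r2")
    print(f"Cross-validated R2: mean={cv_r2.mean():.3f}, folds={np.round(cv_r2, 3)}")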
Model 3: Linear Regression (Social and Biomedical Variables)
- R² and Adjusted R²: Both values were lower than for the Random Forest model, explaining 78% of the variance in medical charges.
- MAE: The MAE was 3,770.463, higher than the Random Forest model's, representing 28% of the mean charge and 40% of the median charge. This indicates lower accuracy, especially for lower-cost cases.
- Root Mean Squared Error (RMSE): The RMSE was 5,472.896, higher than both this model's MAE and the Random Forest model's RMSE.
- Correlation: The model had a correlation of 0.886 (p-value reported as 0), but it also predicted negative charges in some cases, which is unrealistic in a medical cost context. (A linear regression version of the earlier sketch follows the list.)
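A linear regression counterpart of the earlier sketch (same assumed social + biomedical feature matrix, continuing from the code above) makes the negative-prediction problem easy to see directly:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    lin = LinearRegression()
    lin.fit(X_train, y_train)
    lin_pred = lin.predict(X_test)

    lin_mae = mean_absolute_error(y_test, lin_pred)
    lin_rmse = np.sqrt(mean_squared_error(y_test, lin_pred))
    lin_r2 = r2_score(y_test, lin_pred)

    # Unlike the Random Forest, an unconstrained linear model can produce
    # negative charges; count how often that happens on the test split.
    n_negative = (lin_pred < 0).sum()
    print(f"R2={lin_r2:.3f}  MAE={lin_mae:,.3f}  RMSE={lin_rmse:,.3f}  "
          f"negative predictions: {n_negative}")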
Model 4: Linear Regression (Biomedical-Only Variables)
- R² and Adjusted R²: These values remained the same as for the social-biomedical linear regression model, explaining 78% of the variance.
- MAE: The MAE was 3,824.172, slightly higher than for the social-biomedical linear regression model, indicating slightly lower accuracy. This model also underpredicted and overpredicted charges in much the same way as the combined model.
- Root Mean Squared Error (RMSE): The RMSE was 5,474.757, essentially unchanged from the social-biomedical linear regression model.
- Correlation: The correlation was 0.886, identical to the social-biomedical linear regression model. Dropping the social variables therefore did not help, and given the slightly higher MAE it performed marginally worse, so including the social variables remains the better choice for prediction accuracy. (A short feature-subset comparison is sketched below.)
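The with/without-social comparison for the linear model can be reproduced compactly by refitting on each feature subset, again continuing from the earlier sketches; the grouping below is my guess, not the report's exact variable split.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    feature_sets = {
        "social + biomedical": biomedical + social,
        "biomedical only": biomedical,
    }
    for name, cols in feature_sets.items():
        # Same split settings for both subsets so the MAEs are comparable.
        X_sub = pd.get_dummies(df[cols], drop_first=True)
        Xtr, Xte, ytr, yte = train_test_split(
            X_sub, y, test_size=0.3, random_state=42
        )
        model = LinearRegression().fit(Xtr, ytr)
        pred = model.predict(Xte)
        print(f"{name:>20}: MAE={mean_absolute_error(yte, pred):,.3f}")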
Model 5: K-Nearest Neighbors (KNN)
I experimented with two binning approaches for this model (a Python sketch follows the results):
- 5 Bins: The model performed well, with high accuracy:
  - Recall: 98.1%
  - Precision: 98%
  - Overall Accuracy: 99.3% (likely because fewer, broader bins make the classification task simpler)
- 20 Bins: With more bins, accuracy decreased slightly but remained acceptable:
  - Recall: 95.5%
  - Precision: 95.84%
  - Overall Accuracy: 95.5% (still in line with commonly recommended accuracy ranges for predictive models)
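Because KNN is used here as a classifier, the charges first have to be binned into classes. Below is a sketch of the 5-bin versus 20-bin comparison, continuing from the earlier data-loading code; the equal-frequency binning, k = 5, and feature scaling are my assumptions rather than the KNIME settings, so the exact percentages above will not be reproduced.

    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    for n_bins in (5, 20):
        # Turn the continuous charges into class labels (equal-frequency bins).
        y_binned = pd.qcut(y, q=n_bins, labels=False)
        Xtr, Xte, ytr, yte = train_test_split(
            X, y_binned, test_size=0.3, random_state=42, stratify=y_binned
        )
        knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
        knn.fit(Xtr, ytr)
        pred = knn.predict(Xte)
        print(f"{n_bins} bins: "
              f"accuracy={accuracy_score(yte, pred):.3f}  "
              f"recall={recall_score(yte, pred, average='macro', zero_division=0):.3f}  "
              f"precision={precision_score(yte, pred, average='macro', zero_division=0):.3f}")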
Conclusion
Through this challenge, I gained valuable experience using KNIME and comparing multiple predictive models for hospital charge predictions. I applied both regression and classification models and developed a deeper understanding of how different models perform in healthcare contexts.
Based on performance, I recommend the Random Forest model with both biomedical and social variables for predicting hospital charges, as it delivered the most accurate and reliable results on this dataset without overfitting the training data.
This project has enhanced my ability to interpret and communicate model results effectively, and I look forward to applying these insights to future healthcare-related predictive analytics projects.
For additional details, you can check out the lab report submitted for my course: