The escalating cost of healthcare calls for accurate methods of predicting medical insurance premiums. This research compares three regression models applied to polynomial features, namely Polynomial, Ridge, and Lasso regression, in estimating individual health insurance costs. The research process follows the CRISP-DM framework, covering the stages of business understanding, data preparation, modeling, and evaluation. The dataset is the Medical Cost Personal Dataset from Kaggle, containing 1,338 individual records with seven demographic and behavioral features. Six outliers in the BMI and charges features were removed using the interquartile range (IQR) method, and categorical features were encoded with One Hot Encoding. Numerical features were expanded with second-degree Polynomial Features to capture nonlinear relationships, and the data were then split into 80% training and 20% testing sets. Models were evaluated with the Mean Squared Error (MSE) and R-squared (R²) metrics. Ridge Regression performed best, with an R² of 0.857 and an MSE of 2.35×10⁷. It was also more stable and handled multicollinearity among the polynomial terms more effectively than the other two models. Nevertheless, the typical prediction error of approximately USD 4,800 (the root mean squared error, i.e. the square root of the MSE) indicates that accuracy should be improved through parameter tuning or data augmentation before the model is deployed in a real business environment.
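The workflow summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins generated by a hypothetical `make_synthetic_data` helper (the Kaggle file is not loaded), the column names follow the public dataset, the `alpha` values are arbitrary defaults, and the `StandardScaler` step is an added assumption to help Lasso converge. The transforms are fitted inside a pipeline on the training split only, which may differ from the order used in the study, so the numbers it prints are not comparable to the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def make_synthetic_data(n=1338, seed=0):
    """Hypothetical stand-in for the Kaggle data; NOT the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "bmi": rng.normal(30.0, 6.0, n),
        "children": rng.integers(0, 6, n),
        "sex": rng.choice(["female", "male"], n),
        "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
        "region": rng.choice(
            ["northeast", "northwest", "southeast", "southwest"], n),
    })
    smokes = (df["smoker"] == "yes").to_numpy()
    # Invented cost formula, used only so the sketch has something to fit.
    df["charges"] = (2000.0 + 4.0 * df["age"] ** 2 + 300.0 * df["bmi"]
                     + smokes * (15000.0 + 400.0 * (df["bmi"] - 30.0))
                     + rng.normal(0.0, 2000.0, n))
    return df


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] per column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


def build_pipeline(model):
    """Degree-2 polynomial terms for numeric columns, one-hot for categorical."""
    numeric_steps = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),  # assumption: scaling is not mentioned in the source
    )
    preprocess = ColumnTransformer([
        ("poly", numeric_steps, NUMERIC),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return make_pipeline(preprocess, model)


def evaluate(df, seed=42):
    """Fit the three models on an 80/20 split and report MSE, RMSE, and R²."""
    X, y = df.drop(columns="charges"), df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    models = {
        "Polynomial": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),                    # alpha is an assumption
        "Lasso": Lasso(alpha=1.0, max_iter=10_000),   # alpha is an assumption
    }
    results = {}
    for name, model in models.items():
        pipe = build_pipeline(model).fit(X_train, y_train)
        pred = pipe.predict(X_test)
        mse = mean_squared_error(y_test, pred)
        results[name] = {"mse": mse, "rmse": float(np.sqrt(mse)),
                         "r2": r2_score(y_test, pred)}
    return results


data = drop_iqr_outliers(make_synthetic_data(), ["bmi", "charges"])
results = evaluate(data)
for name, scores in results.items():
    print(f"{name}: MSE={scores['mse']:.3e}  "
          f"RMSE={scores['rmse']:.0f}  R2={scores['r2']:.3f}")
```

Reporting the RMSE next to the MSE puts the error back in the units of the target (USD), which is how an MSE of 2.35×10⁷ corresponds to an error of roughly USD 4,800.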
Copyrights © 2025