I wanted a simple way to predict medical charges using the R programming language and the Kaggle dataset "Medical Cost Personal Datasets", which can be found here...
This data is used to try to predict the insurance costs of people with various backgrounds and lifestyles.
After downloading the data and extracting it into a folder that I set up for this project, I opened RStudio, started an R Markdown file, and told R to read the CSV as a data frame. Then I figured that plotting the numerical variables against each other would be a good way to view the data and do some exploratory searching.
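
A minimal sketch of that loading and plotting step might look like the following. The file name `insurance.csv` and the helper name `plot_numeric_pairs` are assumptions for illustration, not taken from my original code; the sketch only reads the file if it is present in the working directory.

```r
# Assumption: the Kaggle CSV is saved as "insurance.csv" in the working
# directory. Adjust the path to wherever the file was extracted.
csv_path <- "insurance.csv"

# Hypothetical helper: draws a scatterplot matrix of every numeric column
# against every other numeric column and returns the column names it used.
plot_numeric_pairs <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  pairs(numeric_cols, main = "Numeric variables")
  invisible(names(numeric_cols))
}

if (file.exists(csv_path)) {
  # Read text columns as factors so they can be grouped and modelled later.
  insurance <- read.csv(csv_path, stringsAsFactors = TRUE)
  str(insurance)                 # column types and a few sample values
  plot_numeric_pairs(insurance)
}
```

Using `pairs()` keeps everything in base R, so no extra packages are needed for a first look at the data.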

There is a fairly obvious correlation between "Age" and "Charges" and, strangely, two areas of points that looked linked in the plot of "BMI" against "Charges".
These are obviously not the only things that affect the cost of insurance, so I looked at the other variables included in the dataset.

That led to these charts, which show some more correlations.
The facts that stood out the most to me were that the smoker chart shows a large difference in charges, and that I didn't see much of a difference between the average charges in the 4 regions of the country.
Then I fit a simple linear regression model to the data.
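
One way to check those two impressions with numbers is to compare average charges per group. This is only a sketch: the column names `charges`, `smoker`, and `region`, the file name `insurance.csv`, and the helper `mean_charges_by` are assumptions for illustration.

```r
# Hypothetical helper: mean of the charges column within each level of one
# grouping column (for example "smoker" or "region").
mean_charges_by <- function(data, group_col) {
  aggregate(data["charges"], by = data[group_col], FUN = mean)
}

# Assumption: the Kaggle CSV is saved as "insurance.csv" in the working directory.
if (file.exists("insurance.csv")) {
  insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
  print(mean_charges_by(insurance, "smoker"))
  print(mean_charges_by(insurance, "region"))

  # Box plots show the spread within each group, not just the average.
  boxplot(charges ~ smoker, data = insurance, main = "Charges by smoker")
  boxplot(charges ~ region, data = insurance, main = "Charges by region")
}
```

Group means alone do not show whether a difference is statistically significant, so this is a sanity check on the charts rather than a formal test.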
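
A regression like this could be sketched with base R's `lm()`. The choice of predictors below (all six other columns), the column names, and the helper name `fit_charges_model` are assumptions; the original model may have used a different formula.

```r
# Hypothetical helper: fits an ordinary least squares model of charges on
# the other columns. Assumption: the data has columns age, sex, bmi,
# children, smoker, region, and charges.
fit_charges_model <- function(data) {
  lm(charges ~ age + sex + bmi + children + smoker + region, data = data)
}

# Assumption: the Kaggle CSV is saved as "insurance.csv" in the working directory.
if (file.exists("insurance.csv")) {
  insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
  model <- fit_charges_model(insurance)
  print(summary(model))             # coefficients, standard errors, R-squared
  print(summary(model)$r.squared)   # just the R-squared value
}
```

`lm()` turns the categorical columns into indicator variables automatically, which is why `smoker` and `region` can go straight into the formula.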


This gives an R-squared value of approximately 0.75. That is not the best score possible, but a model that explains about 75% of the variation in charges seems close enough. More data and more refined data could give a higher R-squared value, and a more complex model might increase it further.
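
For reference, R-squared is the share of the variance in the outcome that the model explains, computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; the helper name `r_squared` is hypothetical, and R's built-in `cars` data is used only as a small stand-in so the example runs without the Kaggle file.

```r
# Hypothetical helper: R-squared = 1 - (residual sum of squares / total sum of squares).
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)       # variation the model leaves unexplained
  ss_tot <- sum((actual - mean(actual))^2)    # total variation around the mean
  1 - ss_res / ss_tot
}

# Stand-in example on R's built-in cars data (not the insurance data):
# the manual value should agree with what summary() reports.
cars_model <- lm(dist ~ speed, data = cars)
r2_manual  <- r_squared(cars$dist, fitted(cars_model))
r2_summary <- summary(cars_model)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

The same function could be applied to the insurance model by passing the actual charges and the model's fitted values.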
Here is a copy of the code...

Avery Smith talked me through this project. Please go check out his YouTube channel here...