An example of how predictive analytics can benefit an organization

Joey Stipek
Jun 25, 2023

Note: For this example, I am using a fake company name.

The Insurance Company (TIC) is an organization whose name explains what it does: it sells insurance to consumers for vehicles of all kinds. The company wants to predict which insurance policies its customers are likely to purchase in the future. The dataset it provided has limitations.

The first is the size of the dataset, which may reflect either the limited number of consumers choosing the company as their insurance provider or a data management issue. The dataset covers 5582 pre-existing consumers and will be used to describe the customer base and to train and validate any models.

Each record has 86 attributes: socioeconomic data (1–43) and product ownership (44–86). The socioeconomic data is derived from zip codes, so all customers living in the same zip code share the same socioeconomic attributes. The target variable, attribute number 86, is labeled “CARAVAN: Number of home policies.” The desired outcome is a predictive analysis of this variable.

The potential value of this analysis lies in understanding the company’s consumer base. The company plans to analyze its existing data to market policies more effectively and grow that base. The guiding question: who is interested in buying an insurance plan for a caravan, and why?

This data initiative can grow the consumer base and increase the value of the organization. The organization can leverage the data to better understand its consumers, improve its goals, and increase its standing in the marketplace. The analysis will also help the organization attract consumers in the future.

Data Set

The Insurance Company provided a dataset containing 5582 rows, one for each existing consumer. The dimension command [dim()] reports the number of rows and columns, confirming 5582 rows and 86 columns, and the summary command [summary()] produces summary statistics for every column in the dataset.
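
The article's own workflow uses R, but the same two checks can be sketched in plain Python for readers without R at hand. This is only an illustration: the three-row sample and its column names (other than CARAVAN) are made up, and the helper functions are hypothetical stand-ins for dim() and summary(), not equivalents of them.

```python
import csv
import io
import statistics

# Hypothetical stand-in for the company's file: three made-up records.
# The real dataset would have 5582 rows and 86 columns.
SAMPLE_CSV = """attribute_1,attribute_2,CARAVAN
33,1,0
37,1,0
9,2,1
"""

def load_rows(text):
    """Parse CSV text into a header list and a list of integer rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [[int(value) for value in record] for record in reader]
    return header, rows

def dimensions(header, rows):
    """Return (row count, column count), similar in spirit to dim()."""
    return len(rows), len(header)

def summarize(header, rows):
    """Return min, median, mean and max per column, loosely like summary()."""
    result = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        result[name] = {
            "min": min(column),
            "median": statistics.median(column),
            "mean": statistics.mean(column),
            "max": max(column),
        }
    return result

header, rows = load_rows(SAMPLE_CSV)
print(dimensions(header, rows))
for name, stats in summarize(header, rows).items():
    print(name, stats)
```

On the real file, the first printed tuple is the value to compare against the expected 5582 rows and 86 columns.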

Data Summary

Who is interested in buying an insurance plan for a caravan and why? For modeling, the goal is to identify the individuals looking to purchase caravan insurance policies by analyzing the dataset. For this analysis to be successful, there must be a meaningful relationship between the consumer attributes and the purchase outcome.

If the analyst can determine how the attributes relate to one another and to the target, they can determine the characteristics of those looking to purchase caravan insurance policies. This leads to understanding consumer behavior based on purchasing habits. An analyst could use a Random Forest model to help determine whether consumers are likely to purchase caravan insurance policies from the organization.

Random Forest is used here as a classification algorithm. The model is trained on one portion of the historical data and validated on a portion held back from training. Once validated, the model can predict whether a consumer will or won’t purchase a caravan insurance policy.
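
The idea of training on one part of the historical data and validating on the rest can be sketched as a simple random split. This is a minimal illustration in Python rather than the R and Rattle workflow the article describes; the 70/30 proportion, the seed, and the function name `split_records` are assumptions, not values taken from the project.

```python
import random

def split_records(records, train_share=0.7, seed=42):
    """Shuffle a copy of the records and cut it into train and validate parts.

    train_share and seed are illustrative defaults, not project settings.
    A fixed seed makes the split repeatable from run to run.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 5582 consumer records: just their row numbers.
customer_ids = list(range(5582))
train, validate = split_records(customer_ids)
print(len(train), len(validate))
```

Every record lands in exactly one of the two parts, so the model is never scored on rows it was trained on.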

Predictive Algorithms

An analyst will use the data provided by The Insurance Company to build a predictive model. The analyst will use Random Forest analysis to determine which customers are likely to inquire about purchasing caravan insurance policies. Because the outcome is a yes-or-no purchase decision, the analyst should apply the algorithm as a classifier that predicts whether or not a customer will purchase a caravan insurance policy.

Data Analytic Tools

The data analytics tools for the Random Forest analysis are RStudio and Rattle. Rattle is open source software that provides a graphical user interface for data mining in the R statistical programming language. The analyst will set the caravan attribute as the target in the dataset. Both RStudio and Rattle will help the analyst find insights for improving The Insurance Company’s targeted marketing of caravan insurance policies.

Model Optimization

Scoring the predictive model is what makes the analysis insightful and shows where the model can be optimized. An analyst should use three scoring methods alongside the confusion matrix.

The three scoring methods are: Area Under the Curve (AUC), the Error Matrix (or error rate) with Sensitivity Analysis, and the Receiver Operating Characteristic (ROC). First, the Area Under the Curve summarizes how well the model ranks purchasers above non-purchasers: an area of 1 means the model finds the true positives without any false positives, while an area of 0.5 is no better than guessing. The analyst should know the difference in viewing the first chart versus the second chart. The Area Under the Curve can confirm the prediction, and the curve is shown as a red line.

Second, the Receiver Operating Characteristic curve plots the true positive rate against the false positive rate. Comparing the two rates helps an analyst determine how well the model performs.
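
To make the relationship between the ROC curve and the Area Under the Curve concrete, here is a small Python sketch that builds the curve from predicted scores and integrates it with the trapezoidal rule. The labels and scores are invented for illustration and do not come from the company's data; Rattle computes these charts itself, so this only shows the underlying arithmetic.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate).

    The threshold sweeps from the highest score down, so the curve runs
    from (0, 0) to (1, 1). Records with tied scores form a single step.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical example: 1 = bought a caravan policy, 0 = did not.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

An area close to 1 means purchasers are consistently scored above non-purchasers, and an area near 0.5 means the scores carry little information.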

Third, a Sensitivity Analysis shows how the target variable responds to changes in the input variables. Sensitivity itself is the ratio of correctly detected positives to all actual positives, which shows the analyst how well the model recognizes the positive class. In this example, the analyst should see a sensitivity of 1.

The Error Matrix supports the scoring analysis by tabulating predicted classes against actual classes. From it, the analyst obtains the error rate, which is the proportion of incorrect predictions. The errors shown in this example indicate how reliably the classification identifies a true positive in the dataset. The overall error is 13%. The averaged class error is 21.25%. The recall (sensitivity) chart shows a value of 1, the true positive rate.
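
The figures an error matrix yields can be worked through with a short Python sketch. The ten records below are hypothetical and do not reproduce the 13% and 21.25% reported above. The sketch also assumes that "averaged class error" means the mean of the two per-class error rates.

```python
def error_matrix(actual, predicted):
    """Count outcomes for a binary target: 1 = policy, 0 = no policy."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, guess in zip(actual, predicted):
        if truth == 1:
            counts["tp" if guess == 1 else "fn"] += 1
        else:
            counts["fp" if guess == 1 else "tn"] += 1
    return counts

def score(matrix):
    """Derive error rates and sensitivity from the four counts.

    Assumes both classes occur at least once, otherwise a division
    by zero would occur.
    """
    total = sum(matrix.values())
    error_no = matrix["fp"] / (matrix["tn"] + matrix["fp"])
    error_yes = matrix["fn"] / (matrix["fn"] + matrix["tp"])
    return {
        "overall_error": (matrix["fp"] + matrix["fn"]) / total,
        # Assumption: mean of the per-class error rates.
        "averaged_class_error": (error_no + error_yes) / 2,
        "sensitivity": matrix["tp"] / (matrix["tp"] + matrix["fn"]),
    }

# Hypothetical validation records, not the company's data.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

matrix = error_matrix(actual, predicted)
result = score(matrix)
print(matrix)
print(result)
```

Because purchasers are usually a small minority, the averaged class error can be much higher than the overall error, which is why both are worth reading.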

A risk chart (or cumulative gain chart) would also be beneficial for an analyst in this undertaking. An analyst finds it by selecting the Risk option in the Evaluate tab in Rattle.

The analyst can view the historical data to detect fraud or redundancies within the dataset. An analyst can compare binary multipliers of two or more within the dataset. The analyst can adjust the risk with the parameters. The precision will decline when the curve touches (with a score of 50%).
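
The calculation behind a cumulative gain chart can be sketched in a few lines of Python: rank customers by predicted score, then track what share of actual buyers has been reached as more of the list is contacted. The labels and scores are invented, and this is a simplified illustration rather than Rattle's own risk chart implementation.

```python
def cumulative_gain(labels, scores):
    """Return (share of customers contacted, share of buyers captured) points.

    Customers are ranked from the highest predicted score to the lowest.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_buyers = sum(labels)
    points = [(0.0, 0.0)]
    captured = 0
    for count, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((count / len(ranked), captured / total_buyers))
    return points

# Hypothetical example: 1 = bought a caravan policy, 0 = did not.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

gain = cumulative_gain(labels, scores)
for contacted, reached in gain:
    print(f"{contacted:.3f} of customers -> {reached:.3f} of buyers")
```

A curve that rises steeply at the start means a marketing campaign could reach most likely buyers by contacting only the top-scored portion of the customer list.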

Reproducible Research

Reproducibility in data science projects is important for an analyst who needs to verify findings. Reproducibility is defined as the ability to take the original researcher’s data and analysis and get similar results. Its main purpose is to ensure all results can be recreated and verified independently, so that researchers can build upon them in future research.

Here’s what is reproduced: statistical evidence in support of the findings (confidence intervals, credible intervals, p-values), tables, values reported in the text, and visualizations/figures/graphs.

For this project, R is used for summary analysis of the dataset and for visualization. The aforementioned Random Forest modeling answers the predictive analytics questions and gives insight into the model. The company plans to use this research in its data governance and data mining plans in an attempt to become a more data-focused organization.

The plan is to do this by identifying goals found in historical data, developing a predictive analytic model in R and Rattle for the company’s analysts, evaluating whether the processes were effective, and refining those processes.

The Insurance Company (TIC) is an organization in need of data governance. If the organization properly leverages its data, it can better understand its consumer base, improve its goals, and increase its standing in the marketplace.

The company can set the industry standard by following these recommendations: adhering to a strict standard for ethics and data quality, putting structure and standards in place, and adopting a data-driven mindset.

Joey Stipek

Joey Stipek’s data research + writing has been featured at newsrooms including The Oklahoman, Colorado Springs Gazette, New York Times, The Frontier, and KOSU.