Data science has become harder to ignore as it has become more valuable. Many organizations are finding that the time they spend carefully extracting the “truth” from their data pays real dividends. Part of the credit goes to those data scientists who conceived of a data science methodology that would unify processes and standardize the science. Methods matter.
In Part 1 and Part 2 of our series on data science methods, we set the stage. Data science is not very different from other applied sciences in that it uses the best building blocks and information it can to form a viable solution to an issue, whatever that issue may be. So, great care is taken to make sure that those building blocks are clean and free from debris. It would be wonderful if the next step were to simply plug the data into the solution and let it run. Unfortunately, there is no one solution. Most often, the solution must be iteratively built.
This can be surprising to those who are unfamiliar with data analytics. “Doesn’t a plug-and-play solution just exist?” The answer is both yes and no. For repeat analytics, and for analytics with fairly simple parameters and simple data streams, reusable tools and models do exist. However, when an organization is looking for unique answers to unique issues, a unique solution is the best and only safe approach. Let’s consider an example.
In insurance marketing, customer retention is a vital metric of success. Insurance marketers are continually keeping tabs on aspects of customer behavior that may lead to increasing retention. They may be searching for specific behaviors that will allow them to lower rates for certain groups, or they may look for triggers that will help the undesired kind of customer to leave. Data will answer many of their questions, but knowing how to employ that data will vary with every insurer.
For example, each insurer’s data contains the secrets to its customer persistency (or lack thereof), and no two insurers are alike. Applying one set of analytically derived business rules may work well for one insurer, while it would be a big mistake to use the same criteria for another. To arrive at the correct business conclusions, insurers need to build a custom solution that accounts for their uniqueness.
Building the Solution
In data science, building the solution is also a matter of testing a variety of different techniques. Multiple models will very likely be produced in the course of finding the solution that produces the best results.
Once the data set is prepared and extensive exploratory analysis has been performed, it is time to begin to build the models. The data set will be broken into at least two parts. The first part will be used for “training” the solution. The second portion of the data will be saved for testing the solution’s validity. If the solution can be used to “predict” historical trends correctly, it will likely be viable for predicting the near future as well.
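As a rough illustration of the split described above, here is a minimal sketch in plain Python. The function name `train_test_split`, the 30% test fraction, the fixed seed and the made-up `policies` records are all assumptions chosen for the example, not details from any particular insurer's project.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical policy records: (years_as_customer, renewed)
policies = [(i % 10, i % 3 != 0) for i in range(20)]

train, test = train_test_split(policies)
print(len(train), len(test))
```

Shuffling before splitting keeps the test portion from being, say, only the newest policies. For data with a strong time order, a split by date may be a better fit than a random one.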
What is involved in training the solution?
A multitude of statistical and machine-learning techniques can be applied to the training set to see which method generates the most accurate predictions on the test data. The methods chosen are largely determined by the distribution of the target variable. The target variable is what you are trying to predict.
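To make the idea of trying several techniques concrete, the sketch below fits two deliberately simple candidates on a training set and compares their error on held-out test data. Everything here is an assumption made for illustration: the two candidates (a mean baseline and a one-feature least-squares line), the invented numbers and the choice of mean squared error as the yardstick. A real project would compare far richer models.

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate: ordinary least squares with a single feature.

    Assumes the xs are not all identical (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Invented numbers, purely for illustration.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x = [7, 8]
test_y = [14.1, 15.8]

candidates = {"mean_baseline": fit_mean, "linear": fit_line}
test_errors = {}
for name, fit in candidates.items():
    model = fit(train_x, train_y)          # train on the training set only
    test_errors[name] = mse(model, test_x, test_y)

best = min(test_errors, key=test_errors.get)
print(best, test_errors)
```

Keeping a trivial baseline in the comparison is a useful habit: a sophisticated technique that cannot beat "predict the average" is not adding value.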
A host of techniques and criteria are used to determine which technique will work best on the test data. There is a bucketful of acronyms from which a data scientist will choose (e.g. AUC, MAPE and MSE). Sometimes business metrics are more important than statistical metrics for determining the best model. Simplicity and understandability are two other factors the data scientist takes into consideration when choosing a technique.
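For readers who want to see what those acronyms measure, here is a small plain-Python sketch of the three metrics using their standard definitions. The function names and sample values are made up for the example; in practice a statistics library would normally supply these.

```python
def mean_squared_error(actual, predicted):
    """MSE: average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_percentage_error(actual, predicted):
    """MAPE as a percentage; undefined when any actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def auc(labels, scores):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive case gets the higher score; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up premium predictions (numeric target).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
mse_value = mean_squared_error(actual, predicted)
mape_value = mean_absolute_percentage_error(actual, predicted)

# Made-up retention scores (binary target: 1 = renewed, 0 = lapsed).
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
auc_value = auc(labels, scores)

print(mse_value, mape_value, auc_value)
```

MSE and MAPE suit numeric targets such as premium or claim amounts, while AUC suits yes/no targets such as "will this customer renew?", which is one reason the distribution of the target variable drives the choice of method.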
Modeling is more complex than simply picking a technique. It is an iterative process where successive rounds of testing may cause the data scientist to add or drop features based upon their predictive strengths. Not unlike underwriting and actuarial science, the final result of data modeling is often a combination of art and science.
What are data scientists looking for when they are testing the solution?
Accuracy is just one of the traits desired in an effective method. If the predictive strength of the model holds up on the test data, then it is a viable solution. If the predictive strength is drastically reduced on the test data set, then the model may be overfitted. In that case, it is time to reexamine the solution and finalize an approach that generates consistently accurate results between the training data and the test data. It is at this stage that a data scientist will often open up their findings to evaluation and scrutiny.
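The train-versus-test comparison can be sketched as follows. The "memorizer" model is a deliberately overfitted toy built for this illustration, the customer rows are invented, and the 10-point `max_drop` threshold is an arbitrary assumption rather than an industry rule.

```python
def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def fit_memorizer(rows):
    """Deliberately overfitted toy model: memorize every training row and
    fall back to the majority training label for anything unseen."""
    lookup = dict(rows)
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: lookup.get(x, majority)

def looks_overfitted(train_score, test_score, max_drop=0.10):
    """Flag a model whose score falls by more than max_drop on test data.
    The 0.10 default is an assumed threshold for this example."""
    return (train_score - test_score) > max_drop

# Invented rows: (customer_id, outcome)
train_rows = [(1, "stay"), (2, "stay"), (3, "leave"),
              (4, "stay"), (5, "leave"), (6, "stay")]
test_rows = [(7, "leave"), (8, "stay"), (9, "leave"), (10, "leave")]

model = fit_memorizer(train_rows)
train_acc = accuracy(model, train_rows)
test_acc = accuracy(model, test_rows)
overfitted = looks_overfitted(train_acc, test_acc)

print(train_acc, test_acc, overfitted)
```

A perfect score on the training rows paired with a much weaker score on the test rows is the classic signature of a model that has learned the training data rather than the underlying pattern.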
To validate the solution, the data scientist will show multiple models and their results to business analysts and other data scientists, explaining the different techniques that were used to come to the data’s “conclusions.” The broader team will take many things into consideration, and its review is often valuable for making sure that unintentional issues haven’t crept into the analysis. Are there factors that may have tainted the model? Are the results that the model seems to be generating still relevant to the business objectives they were designed to achieve? After a thorough review, the solution is approved for real testing and future use.
In our final installment, we’ll look at what it means to test and “go live” with a data project, letting the real data flow through the solution to provide real conclusions. We will also discuss how the solution can maintain its value to the organization through monitoring and updating as needed based on changing business dynamics. As a part of our last thoughts, we will also give some examples of how data projects can have a deep impact on the insurers that use them — choosing to operate from a position of analysis and understanding instead of thoughtful conjecture.