Data Science: Methods Matter (Part 4)

Putting a data science solution into production after weeks or months of hard work is undoubtedly the most fun and satisfying part. Models do not exist for their own sakes; they exist to make a positive change in the business. Models that are not in production have not realized their true value. Putting models into production involves not only testing and implementation, but also a plan for monitoring and updating the analytics as time goes on. We’ll walk through these in a moment and see how the methods we employ will allow us to get the maximum benefit from our investment of time and effort.

First, let’s review briefly where we’ve been. In Part 1 of our series on Data Science Methods, we discussed CRISP-DM, a data project methodology that is now in common use across industries. We looked at the reasons insurers pursue data science at the first step, project design. In Part 2, we looked at building a data set and exploratory data analysis. In Part 3, we covered what is involved in building a solution, including setting up the data in the right way to validate the solution.

Now, we are ready for the launch phase. Just like NASA, data scientists need green lights across the board, only launching when they are perfectly ready and when they have addressed virtually every concern.

Test and Implement

Once an analytic model has been built and shown to perform well in the lab, it’s time to deploy it into the wild: a real live production environment. Many companies are hesitant to simply flip a switch to move their business processes from one approach to a new one. They prefer to take a more cautious approach and implement a solution in steps or phases. Often, they choose to use either an A/B test and control approach or a phased geographic deployment. In an A/B test approach, the business results of the new analytic solution are compared with the solution that has been used in the past. For example, 50% of the leads in a marketing campaign are allocated to the new approach while 50% are allocated to the old approach, randomly. If the results from the new solution are superior, then it is fully implemented and the old solution removed. Or, if results in one region of the country look promising, then the solution can be rolled out nationwide.
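
The 50/50 split described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names, the fixed random seed and the conversion counts are all hypothetical, and a real test would also check whether the difference is statistically significant before retiring the old approach.

```python
import random


def assign_groups(lead_ids, treatment_share=0.5, seed=42):
    """Randomly split leads between the old approach ("A") and the new model ("B")."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(lead_ids)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * treatment_share)
    return {"B": shuffled[:cutoff], "A": shuffled[cutoff:]}


def conversion_rate(conversions, group_size):
    """Share of leads in a group that converted."""
    return conversions / group_size if group_size else 0.0


def relative_lift(rate_new, rate_old):
    """Relative improvement of the new approach over the old one."""
    if rate_old == 0:
        raise ValueError("old conversion rate must be non-zero")
    return (rate_new - rate_old) / rate_old


groups = assign_groups(range(1000))

# Hypothetical outcome counts, for illustration only.
rate_a = conversion_rate(40, len(groups["A"]))
rate_b = conversion_rate(52, len(groups["B"]))
lift = relative_lift(rate_b, rate_a)
print(f"old: {rate_a:.1%}  new: {rate_b:.1%}  lift: {lift:.0%}")
```

Shuffling once and cutting the list guarantees that every lead lands in exactly one group, which keeps the comparison between the two approaches clean.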

Depending on the computing platform, the code base of the analytic solution may be automatically dropped into existing business processes. Scores may be generated live or in batch, depending on the need. Marketing, for instance, would be a good candidate to receive batch processed results. The data project may have been designed to pre-select good candidates for insurance who are also likely respondents. The results would return an entire prospect group within the data pool.

Live results meet a completely different set of objectives. Giving a broker a real-time indication of our appetite to quote a particular piece of business would be a common use of real-time scoring.

Sometimes, moving a model to production requires additional coding. This occurs when, for example, a model is built and proven in R, but the deployed version has to be implemented in C for performance or platform reasons. The code has to be translated into the new language. Checks must then confirm that the input variables, the final scores and the values passed to end users all match the original model.
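
One way to run such checks is to score the same test cases with both versions and compare the results within a small tolerance. The sketch below is only an illustration: both scoring functions are hypothetical stand-ins written in Python (a real check would call the original model and the ported one), and the coefficients and tolerance are assumptions.

```python
import math


def reference_score(age, prior_claims):
    """Stand-in for the original model: a hypothetical logistic score."""
    z = -2.0 + 0.03 * age + 0.8 * prior_claims
    return 1.0 / (1.0 + math.exp(-z))


def ported_score(age, prior_claims):
    """Stand-in for the production port of the same model."""
    z = 0.8 * prior_claims + 0.03 * age - 2.0
    return 1.0 / (1.0 + math.exp(-z))


def find_mismatches(cases, reference, candidate, tol=1e-9):
    """Return the test cases whose two scores differ by more than tol."""
    mismatches = []
    for case in cases:
        expected = reference(**case)
        actual = candidate(**case)
        if not math.isclose(expected, actual, rel_tol=0.0, abs_tol=tol):
            mismatches.append((case, expected, actual))
    return mismatches


# Hypothetical test cases covering low, middle and high scores.
test_cases = [
    {"age": 25, "prior_claims": 0},
    {"age": 40, "prior_claims": 2},
    {"age": 67, "prior_claims": 5},
]

mismatches = find_mismatches(test_cases, reference_score, ported_score)
print("port matches reference" if not mismatches else mismatches)
```

Comparing within a tolerance rather than for exact equality allows for harmless floating-point differences between languages while still catching a mistranslated formula.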

Monitor and Update

Some data projects are “one time only.” Once the data has answered the question, the business can pursue strategies that act on that answer. Others, however, are designed for long-term use and re-use. These can be very valuable over their periods of use, but special considerations must be taken into account when the plan is to reuse the analytic components of a data project. If a model starts to change over time, you want to manage that change as it happens. Monitoring and updating will help the project hold its business value, as opposed to letting its value decrease as variables and circumstances change. Effective monitoring is insurance for data science models.

For example, a model designed to repeatedly identify “good” candidates for a particular life product may give excellent results at the outset. As the economy changes, or demographics change, credit scoring may exclude good candidates. As health data exchanges improve, new data streams may be better indicators of overall health. Algorithms or data sets may need to be adapted. Minor tweaks may be needed or a whole new project may prove to be the best option if business conditions have drastically changed. Monitoring the intended business results compared with results at the outset and results over time will allow insurers to identify analysis features that no longer provide the most valid results.

Monitoring is important enough that it goes beyond running periodic reports and having hunches that the models have not lost effectiveness. Monitoring needs its own plan. How often will report(s) run? What are the criteria we can use to validate that the model is still working? Which indicators will tell us that the model is beginning to fail? These criteria are identified by both the data scientists and the business users who are in touch with the business strategy. Depending on the project and previous experience, data scientists may even know intuitively which components within the method are likely to slide out of balance. They can create criteria to monitor those areas more closely.
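
A monitoring plan of this kind can be reduced to a simple rule: compare each agreed indicator with its value at launch and raise an alert when it falls too far. The sketch below assumes hypothetical metric names, made-up values and a 10% tolerance; the actual criteria would come from the data scientists and business users.

```python
def check_model_health(baseline, current, max_relative_drop=0.10):
    """Flag metrics that fell more than max_relative_drop below their baseline.

    Assumes every baseline value is non-zero and that higher is better.
    """
    alerts = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            alerts.append(f"{name}: missing from current report")
            continue
        drop = (base_value - value) / base_value
        if drop > max_relative_drop:
            alerts.append(f"{name}: down {drop:.0%} from baseline")
    return alerts


# Hypothetical figures, for illustration only.
baseline = {"response_rate": 0.050, "top_decile_lift": 3.0}
this_quarter = {"response_rate": 0.048, "top_decile_lift": 2.4}

alerts = check_model_health(baseline, this_quarter)
for alert in alerts:
    print(alert)
```

Running the same check on every reporting cycle turns "is the model still working?" into a question with a documented, repeatable answer.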

Updating the model breathes new life into the original work. Depending on what may be happening to the overall solution, the data scientist will know whether a small tweak to a formula is called for or an entirely new solution needs to be built based on new data models. An update saves as much of the original time investment as possible without jeopardizing the results.

Though the methodology may seem complicated, and there seem to be many steps, the results are what matter. Insurance data science continually fuels the business with answers of competitive and operational value. It captures accurate images of reality and allows users to make the best decisions. As data streams grow in availability and use, insurance data science will be poised to make the most of them.

Data Science: Methods Matter (Part 2)

What makes data science a science?

Methodology.

When data analytics relies on simple formulas, much conjecture and an arbitrary methodology, it often fails in what it was designed to do: give accurate answers to pressing questions.

So at Majesco, we pursue a proven data science methodology in an attempt to lower the risk of misapplying data and to improve predictive results. In Methods Matter, Part 1, we provided a picture of the methodology that goes into data science. We discussed CRISP-DM and the opening phase of the life cycle, project design.

In Part 2, we will be discussing the heart of the life cycle: the data itself. To do that, we’ll take an in-depth look at two central steps: building a data set, and exploratory data analysis. These two steps make up a phase that is critical for project success, and they illustrate why data analytics is more complex than many insurers realize.

Building a Data Set

Building a data set, in one way, is no different than gathering evidence to solve a mystery or a criminal case. The best case will be built with verifiable evidence. The best evidence will be gathered by paying attention to the right clues.

There will also almost never be just one piece of evidence used to build a case, but a complete set of gathered evidence — a data set. It’s the data scientist’s job to ask, “Which data holds the best evidence to prove our case right or wrong?”

Data scientists will survey the client or internal resources for available in-house data, and then discuss obtaining additional external data to complete the data set. This search for external data is more prevalent now than previously. The growth of external data sources and their value to the analytics process has ballooned with an increase in mobile data, images, telematics and sensor availability.

A typical data set might include, for example, external sources such as credit file data from credit reporting agencies, along with internal policy and claims data. This type of information is commonly used by actuaries in pricing models and is contained in state filings with insurance regulators. Choosing what features go into the data set is the result of dozens of questions and some close inspection. The task is to find the elements or features of the data set that have real value in answering the questions the insurer needs to answer.

In-house data, for example, might include premiums, number of exposures, new and renewal policies and more. The external credit data may include information such as number of public records, number of mortgage accounts and number of accounts that are 30+ days past due, among others. The goal at this point is to make sure that the data is as clean as possible. A target variable of interest might be something like frequency of claims, severity of claims or loss ratio. This step is often performed by in-house resources, insurance data analysts familiar with the organization’s available data, or external consultants such as Majesco.

At all points along the way, the data scientist is reviewing the data source’s suitability and integrity. An experienced analyst will often quickly discern the character and quality of the data by asking a series of questions. Does the number of policies look correct for the size of the book of business? Does the average number of exposures per policy look correct? Does the overall loss ratio seem correct? Does the number of new and renewal policies look correct? Is there an unusually high number of missing or unexpected values in the data fields? Is there an apparent reason for something to look out of order? If not, how can the data fields be corrected? If they can’t be corrected, are the data issues so important that these fields should be dropped from the data set? Some whole record observations may clearly contain bad data and should be dropped from the data set. Even further, is the data so problematic that the whole data set should be redesigned or the whole analytics project should be scrapped?
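
Several of these questions can be turned into quick automated checks. The following sketch computes the share of missing values in a field and the overall loss ratio for a tiny, made-up policy table; the field names and figures are hypothetical and stand in for whatever the insurer's data actually contains.

```python
def missing_rate(records, field):
    """Share of records where the field is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def loss_ratio(records):
    """Total losses divided by total premium across all records."""
    premium = sum(r["premium"] for r in records)
    losses = sum(r["losses"] for r in records)
    return losses / premium if premium else float("nan")


# Hypothetical policy records, for illustration only.
policies = [
    {"policy_id": 1, "premium": 1000.0, "losses": 400.0, "exposures": 1},
    {"policy_id": 2, "premium": 1500.0, "losses": 0.0, "exposures": 2},
    {"policy_id": 3, "premium": 500.0, "losses": 800.0, "exposures": None},
    {"policy_id": 4, "premium": 1000.0, "losses": 1200.0, "exposures": 1},
]

print(f"policies: {len(policies)}")
print(f"missing exposures: {missing_rate(policies, 'exposures'):.0%}")
print(f"overall loss ratio: {loss_ratio(policies):.2f}")
```

The numbers themselves do not decide anything; they give the analyst something concrete to compare against what is known about the book of business.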

Once the data set has been built, it is time for an in-depth analysis that steps closer toward solution development.

Exploratory Data Analysis

Exploratory data analysis takes the newly minted data set and begins to do something with it — “poking it” with measurements and variables to see how it might stand up in actual use. The data scientist runs preliminary tests on the “evidence.” The data set is subjected to a deeper look at its collective value. If the percentage of missing values is too large, the feature is probably not a good predictor variable and should be excluded from future analysis. In this phase, it may make sense to create more features, including mathematical transformations for non-linear relationships between the features and the target variable.
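
As a rough illustration of these two ideas, the sketch below drops features with too many missing values and adds a log-transformed version of a count feature. The 30% cutoff, the feature names and the choice of a log transform are all assumptions made for the example; the right threshold and transformation depend on the data and the target variable.

```python
import math


def screen_features(rows, max_missing=0.3):
    """Keep features whose share of missing (None) values is at most max_missing."""
    kept = []
    for feature in rows[0].keys():
        rate = sum(1 for row in rows if row[feature] is None) / len(rows)
        if rate <= max_missing:
            kept.append(feature)
    return kept


def add_log_feature(rows, source, target):
    """Add log(1 + x) of a non-negative feature as a new feature."""
    for row in rows:
        x = row[source]
        row[target] = None if x is None else math.log1p(x)
    return rows


# Hypothetical credit features, for illustration only.
rows = [
    {"past_due_accounts": 0, "mortgage_accounts": None},
    {"past_due_accounts": 1, "mortgage_accounts": None},
    {"past_due_accounts": 3, "mortgage_accounts": 1},
    {"past_due_accounts": None, "mortgage_accounts": None},
]

kept = screen_features(rows)
rows = add_log_feature(rows, "past_due_accounts", "log_past_due_accounts")
print(kept)
```

Here `mortgage_accounts` is missing in three of four rows and is screened out, while `past_due_accounts` survives and gains a transformed companion feature.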

For non-statisticians, marketing managers and non-analytical staff, the details of exploratory data analysis can be tedious and uninteresting. Yet they are the crux of the genius involved in data science project methodology. Exploratory Data Analysis is where data becomes useful, so it is a part of the process that can’t be left undone. No matter what one thinks of the mechanics of the process, the preliminary questions and findings can be absolutely fascinating.

Questions such as these are common at this stage:

  • Does frequency increase as the number of accounts that are 30+ days past due increases? Is there a trend?
  • Does severity decrease as the number of mortgage trades decreases? Do these trends make sense?
  • Is the number of claims per policy greater for renewal policies than for new policies? Does this finding make sense? If not, is there an error in the way the data was prepared or in the source data itself?
  • If younger drivers have lower loss ratios, should this be investigated as an error in the data or an anomaly in the business? Some trends will not make any sense, and perhaps these features should be dropped from analysis or the data set redesigned.
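
The first question in the list, whether frequency rises with the number of past-due accounts, can be checked with a simple grouped calculation. This sketch uses made-up records and hypothetical field names, and it only tests whether frequency never decreases from one group to the next; it says nothing about whether the trend is statistically meaningful.

```python
from collections import defaultdict


def claim_frequency_by(records, key):
    """Claims per exposure for each value of key, sorted by that value."""
    claims = defaultdict(float)
    exposures = defaultdict(float)
    for r in records:
        claims[r[key]] += r["claims"]
        exposures[r[key]] += r["exposures"]
    return [(k, claims[k] / exposures[k]) for k in sorted(claims) if exposures[k] > 0]


def is_non_decreasing(pairs):
    """True if the frequency never drops as the key increases."""
    values = [v for _, v in pairs]
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical records, for illustration only.
records = [
    {"past_due_accounts": 0, "claims": 2, "exposures": 100},
    {"past_due_accounts": 0, "claims": 3, "exposures": 100},
    {"past_due_accounts": 1, "claims": 4, "exposures": 100},
    {"past_due_accounts": 2, "claims": 6, "exposures": 100},
]

frequencies = claim_frequency_by(records, "past_due_accounts")
for value, freq in frequencies:
    print(f"{value} past-due accounts: {freq:.3f} claims per exposure")
print("trend holds:", is_non_decreasing(frequencies))
```

When a trend like this fails to hold, the same table is the starting point for asking whether the cause is the business or the data preparation.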

The more we look at data sets, the more we realize that the limits to what can be discovered or uncovered are few and growing fewer. Thinking of relationships between personal behavior and buying patterns or between credit patterns and claims can fuel the interest of everyone in the organization. As the details of the evidence begin to gain clarity, the case also begins to come into focus. An apparent “solution” begins to appear and the data scientist is ready to build that solution.

In Part 3, we’ll look at what is involved in building and testing a data science project solution and how pilots are crucial to confirming project findings.

Data Science: Methods Matter (Part 1)

Why should an insurer employ data science? How does data science differ from any other business analytics that might be happening within the organization? What will it look like to bring data science methodology into the organization?

In nearly every engagement, Majesco’s data science team fields questions as foundational as these, as well as questions related to the details of business needs. Business leaders are smart to do their due diligence — asking IF data science will be valuable to the organization and HOW valuable it might be.

To provide a feel for how data science operates, in this first post of the series we will touch briefly on the history of data mining methodology, then look at what an insurer can expect when first engaging in the data science process. Throughout the series, we’re going to keep our eyes on the focus of all of our efforts: answers.

Answers

The goal of most data science is to apply the proper analysis to the right sets of data to provide answers. That proper analysis is just as important as the question an insurer is attempting to answer. After all, if we are in pursuit of meaningful business insights, we certainly don’t want to come to the wrong conclusions. There is nothing worse than moving your business full-speed ahead in the wrong direction based upon faulty analysis. Today’s analysis benefits from a thoughtfully constructed data project methodology.

As data mining was on the rise in the 1990s, it became apparent there were a thousand ways a data scientist might pursue answers to business questions. Some of those methods were useful and good, and some were suspect; they couldn’t truly be called methods. To help keep data scientists and their clients from arriving at the wrong conclusions, a methodology needed to be introduced. A defined yet flexible process would not only assist in managing a specific project scope but would also work toward verifying conclusions by building in pre-test and post-project monitoring against expected results. In 1996, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced, the first step in the standardization of data mining projects. Though CRISP-DM was a general data project methodology, insurance had its hand in the development. The Dutch insurer OHRA was one of the four sponsoring organizations to co-launch the standardization initiative.

CRISP-DM has proven to be a strong foundation in the world of data science. Even though the number of available data streams has skyrocketed in the last 20 years and the tools and technology of analysis have improved, the overall methodology is still solid. Majesco uses a variant of CRISP-DM, honed over many years of experience in multiple industries.

Pursuing the right questions — Finding the business nugget in the data mine

Before data mining project methodologies were introduced, one issue companies had was a lack of substantial focus on obtainable goals. Projects didn’t always have a success definition that would help the business in the end. Research could be vague, and methods could be transient.

Research needs focus, so the key ingredient in data science methodology is business need. The insurer has a problem it wishes to solve. It has a question that has no readily apparent answer. If an insurer hasn’t used data scientists, this is a frequent point of entry. It is also one of the greatest differentiators between traditional in-house data analysis and project-based data science methodology. Instead of tracking trends, data science methodology is focused on finding clear answers to defined questions. Normally these issues are more difficult to solve and represent a greater business risk, making it easy to justify seeking outside assistance.

Project Design — First meeting and first steps

Phase 1 of a data science project life cycle is project design. This phase is about listening and learning about the business problem (or problems) that are ready to be addressed. For example, a P&C insurer might be wondering why loyalty is lowest in the three states where it has the highest claims — Florida, Georgia and Texas. Is this an anomaly, or is there a correlation between the two statistics? A predictive model could be built to predict the likelihood of attrition. The model score could then be used to determine what actions should be taken to reward and keep a good customer, or perhaps what actions could be taken to remove frequent or high-risk claimants from the books.

The insurer must unpack background and pain points. Does the customer have access to all of the data that is needed for analysis? Should the project be segmented in such a way that it provides for detailed analysis at multiple levels? For example, the insurer may need to run the same type of claims analysis across personal auto, commercial vehicle, individual home and business property. These would represent segmented claims models under the same project.

The insurer must identify assumptions, definitions, possible solutions and a picture of the risks involved for the project, sorting out areas where segmented analysis may be needed. The team must also collect some information to assist in creating a cost-benefit analysis for the project.

As a part of the project design meetings, the company must identify the analytic techniques that will be used and discuss the features the analysis can use. At the end of the project design phase, everyone knows which answers they are seeking and the questions that will be used to frame those answers. They have a clear understanding of the data that is available for their use and have an outline of the full project.

With the clarity to move forward, the insurers move into a closer examination of the data that will be used.

In Part 2, we will look at the two-step data preparation process that is essential to building an effective solution. We will also look at how the proliferation of data sources is supplying insurers with greater analytic opportunities than ever.