
Data Science: Methods Matter (Part 3)

Data science has become more of an inevitability as it has grown in value. Many organizations are finding that the time they spend carefully extracting the “truth” from their data is time that pays real dividends. Part of the credit goes to the data scientists who conceived of a data science methodology that would unify processes and standardize the science. Methods matter.

In Part 1 and Part 2 of our series on data science methods, we set the stage. Data science is not very different from other applied sciences in that it uses the best building blocks and information it can to form a viable solution to an issue, whatever that issue may be. So, great care is taken to make sure that those building blocks are clean and free from debris. It would be wonderful if the next step were to simply plug the data into the solution and let it run. Unfortunately, there is no one solution. Most often, the solution must be iteratively built.

This can be surprising to those who are unfamiliar with data analytics. “Doesn’t a plug-and-play solution just exist?” The answer is both yes and no. For repeat analytics, and for analyses with fairly simple parameters and simple data streams, reusable tools and models do exist. However, when an organization is looking for unique answers to unique issues, a unique solution is the best and only safe approach. Let’s consider an example.

See also: Forget Big Data — Focus on Small Data

In insurance marketing, customer retention is a vital metric of success. Insurance marketers continually keep tabs on aspects of customer behavior that may lead to increased retention. They may be searching for specific behaviors that will allow them to lower rates for certain groups, or they may look for triggers that will help the undesired kind of customer leave. Data will answer many of their questions, but how to employ that data will vary with every insurer.

For example, each insurer’s data contains the secrets to its customer persistency (or lack thereof), and no two insurers are alike. Applying one set of analytically derived business rules may work well for one insurer, while it would be a big mistake to use the same criteria for another. To arrive at the correct business conclusions, insurers need to build a custom solution that accounts for their uniqueness.

Building the Solution

In data science, building the solution is also a matter of testing a variety of different techniques. Multiple models will very likely be produced in the course of finding the solution that produces the best results.

Once the data set is prepared and extensive exploratory analysis has been performed, it is time to begin to build the models. The data set will be broken into at least two parts. The first part will be used for “training” the solution. The second portion of the data will be saved for testing the solution’s validity. If the solution can be used to “predict” historical trends correctly, it will likely be viable for predicting the near future as well.
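
To make the split concrete, here is a minimal sketch in Python, assuming a prepared, policy-level data set with a numeric target column; the file and column names are illustrative, not taken from any actual engagement.

```python
# Minimal sketch of a train/test split on a prepared data set.
# "prepared_policy_data.csv" and "claim_frequency" are hypothetical names.
import pandas as pd
from sklearn.model_selection import train_test_split

prepared = pd.read_csv("prepared_policy_data.csv")

features = prepared.drop(columns=["claim_frequency"])
target = prepared["claim_frequency"]

# Hold back a portion of the data for testing the solution's validity.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)
```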

What is involved in training the solution? 

A multitude of statistical and machine-learning techniques can be applied to the training set to see which method generates the most accurate predictions on the test data. The methods chosen are largely determined by the distribution of the target variable, which is the quantity you are trying to predict.

A host of criteria is used to determine which technique will work best on the test data. There is a bucketful of acronyms from which a data scientist will choose, such as AUC (area under the ROC curve), MAPE (mean absolute percentage error) and MSE (mean squared error). Sometimes business metrics are more important than statistical metrics for determining the best model. Simplicity and understandability are two other factors the data scientist takes into consideration when choosing a technique.
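
As a hedged illustration of how candidate techniques might be compared, the sketch below fits two common regression methods and scores them on the held-out test data with MSE and MAPE; it assumes the numeric target and the train/test split from the earlier sketch.

```python
# Compare candidate techniques on the held-out test data.
# Reuses X_train, X_test, y_train, y_test from the split sketch above.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Statistical metrics; business metrics and simplicity are weighed separately.
    # Note: MAPE is only meaningful when the target is not zero.
    print(name,
          "MSE:", mean_squared_error(y_test, preds),
          "MAPE:", mean_absolute_percentage_error(y_test, preds))
```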

Modeling is more complex than simply picking a technique. It is an iterative process where successive rounds of testing may cause the data scientist to add or drop features based upon their predictive strengths. Not unlike underwriting and actuarial science, the final result of data modeling is often a combination of art and science.

See also: Competing in an Age of Data Symmetry

What are data scientists looking for when they are testing the solution?

Accuracy is just one of the traits desired in an effective method. If the predictive strength of the model holds up on the test data, then it is a viable solution. If the predictive strength is drastically reduced on the test data set, then the model may be overfitted. In that case, it is time to reexamine the solution and finalize an approach that generates consistently accurate results between the training data and the test data. It is at this stage that a data scientist will often open up their findings to evaluation and scrutiny.
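
A rough sketch of that overfitting check, reusing the models fitted in the comparison sketch above; the gap threshold is an illustrative assumption, not a standard rule.

```python
# Compare training error with test error to flag possible overfitting.
from sklearn.metrics import mean_squared_error

model = candidates["random_forest"]  # from the earlier comparison sketch
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"train MSE: {train_mse:.4f}  test MSE: {test_mse:.4f}")
if test_mse > 1.5 * train_mse:  # illustrative threshold
    print("Predictive strength drops sharply on the test data: possible overfit.")
```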

To validate the solution, the data scientist will show multiple models and their results to business analysts and other data scientists, explaining the different techniques that were used to reach the data’s “conclusions.” The wider team takes many things into consideration and often adds great value by making sure that unintentional issues haven’t crept into the analysis. Are there factors that may have tainted the model? Are the results the model generates still relevant to the business objectives they were designed to achieve? After a thorough review, the solution is approved for real testing and future use.

In our final installment, we’ll look at what it means to test and “go live” with a data project, letting the real data flow through the solution to provide real conclusions. We will also discuss how the solution can maintain its value to the organization through monitoring and updating as needed based on changing business dynamics. As a part of our last thoughts, we will also give some examples of how data projects can have a deep impact on the insurers that use them — choosing to operate from a position of analysis and understanding instead of thoughtful conjecture.


Data Science: Methods Matter (Part 2)

What makes data science a science?

Methodology.

When data analytics is done with simple formulas, a good deal of conjecture and an arbitrary methodology behind it, it often fails at what it was designed to do: give accurate answers to pressing questions.

So at Majesco, we pursue a proven data science methodology in an attempt to lower the risk of misapplying data and to improve predictive results. In Methods Matter, Part 1, we provided a picture of the methodology that goes into data science. We discussed CRISP-DM and the opening phase of the life cycle, project design.

In Part 2, we will discuss the heart of the life cycle: the data itself. To do that, we’ll take an in-depth look at two central steps: building a data set and exploratory data analysis. These two steps make up a phase that is critical to project success, and they illustrate why data analytics is more complex than many insurers realize.

Building a Data Set

Building a data set, in one way, is no different than gathering evidence to solve a mystery or a criminal case. The best case will be built with verifiable evidence. The best evidence will be gathered by paying attention to the right clues.

There will also almost never be just one piece of evidence used to build a case, but a complete set of gathered evidence — a data set. It’s the data scientist’s job to ask, “Which data holds the best evidence to prove our case right or wrong?”

Data scientists will survey the client or internal resources for available in-house data, and then discuss obtaining additional external data to complete the data set. This search for external data is more prevalent now than ever. External data sources, and their value to the analytics process, have ballooned with the increase in mobile data, images, telematics and sensor availability.

See also: The Science (and Art) of Data, Part 1

A typical data set might combine, for example, external sources such as credit file data from credit reporting agencies with internal policy and claims data. This type of information is commonly used by actuaries in pricing models and is contained in state filings with insurance regulators. Choosing what goes into the data set is the result of dozens of questions and some close inspection. The task is to find the elements or features of the data set that have real value in answering the questions the insurer needs to answer.

In-house data, for example, might include premiums, number of exposures, new and renewal policies and more. The external credit data may include information such as the number of public records, the number of mortgage accounts and the number of accounts that are 30+ days past due, among others. The goal at this point is to make sure that the data is as clean as possible. A target variable of interest might be something like frequency of claims, severity of claims or loss ratio. This step is often performed by in-house resources, insurance data analysts familiar with the organization’s available data, or external consultants such as Majesco.
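
As a simple, hypothetical sketch of assembling such a data set, the Python snippet below joins internal policy and claims data to external credit attributes and derives candidate target variables; every file and column name here is an assumption for illustration.

```python
# Assemble a modeling data set from internal and external sources.
# All file and column names are hypothetical.
import pandas as pd

policies = pd.read_csv("policy_data.csv")      # premiums, exposures, new/renewal flag
claims = pd.read_csv("claims_data.csv")        # one row per claim, with incurred loss
credit = pd.read_csv("credit_attributes.csv")  # public records, mortgage accounts, 30+ day lates

claim_summary = claims.groupby("policy_id", as_index=False).agg(
    claim_count=("claim_id", "count"),
    incurred_loss=("incurred_loss", "sum"),
)

data = (
    policies
    .merge(claim_summary, on="policy_id", how="left")
    .merge(credit, on="policy_id", how="left")
    .fillna({"claim_count": 0, "incurred_loss": 0.0})
)

# Candidate target variables: claim frequency, severity and loss ratio.
data["frequency"] = data["claim_count"] / data["exposures"]
data["loss_ratio"] = data["incurred_loss"] / data["premium"]
```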

At all points along the way, the data scientist is reviewing the data source’s suitability and integrity. An experienced analyst will often quickly discern the character and quality of the data by asking questions such as: Does the number of policies look correct for the size of the book of business? Does the average number of exposures per policy look correct? Does the overall loss ratio seem correct? Does the number of new and renewal policies look correct? Are there an unusually high number of missing or unexpected values in the data fields? Is there an apparent reason for something to look out of order? If not, how can the data fields be corrected? If they can’t be corrected, are the data issues so important that these fields should be dropped from the data set? Some whole records may clearly contain bad data and should be dropped from the data set. Even further, is the data so problematic that the whole data set should be redesigned, or should the whole analytics project be scrapped?
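
A minimal sketch of those sanity checks, run against the assembled data set from the previous sketch; the specific checks and the bad-data rule are illustrative assumptions.

```python
# Quick data-quality profile of the assembled data set.
policy_count = len(data)
avg_exposures = data["exposures"].mean()
overall_loss_ratio = data["incurred_loss"].sum() / data["premium"].sum()
renewal_share = (data["new_renewal_flag"] == "renewal").mean()
missing_share = data.isna().mean().sort_values(ascending=False)

print(f"Policies: {policy_count:,}")
print(f"Average exposures per policy: {avg_exposures:.2f}")
print(f"Overall loss ratio: {overall_loss_ratio:.1%}")
print(f"Renewal share: {renewal_share:.1%}")
print("Fields with the most missing values:")
print(missing_share.head(10))

# Records that clearly contain bad data (e.g. non-positive premium) are dropped.
data = data[data["premium"] > 0]
```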

Once the data set has been built, it is time for an in-depth analysis that steps closer toward solution development.

Exploratory Data Analysis

Exploratory data analysis takes the newly minted data set and begins to do something with it — “poking it” with measurements and variables to see how it might stand up in actual use. The data scientist runs preliminary tests on the “evidence,” subjecting the data set to a deeper look at its collective value. For each candidate feature, the percentage of missing values is examined; if that percentage is too large, the feature is probably not a good predictor variable and should be excluded from further analysis. In this phase, it may also make sense to create additional features, including mathematical transformations that capture non-linear relationships between the features and the target variable.
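
As an illustration rather than a prescribed recipe, the sketch below drops candidate predictors with too many missing values and adds one transformed feature; the 30% cutoff, the column name and the log transformation are assumptions.

```python
# Exploratory steps on the assembled data set from the earlier sketches.
import numpy as np

# Exclude candidate predictors whose share of missing values is too large.
missing_share = data.isna().mean()
weak_features = missing_share[missing_share > 0.30].index.tolist()
data = data.drop(columns=weak_features)

# Create an additional feature: a transformation for a suspected non-linear
# relationship with the target (assuming this field passed the screen above).
data["log_accounts_30_plus_dpd"] = np.log1p(data["accounts_30_plus_dpd"])
```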

For non-statisticians, marketing managers and non-analytical staff, the details of exploratory data analysis can be tedious and uninteresting. Yet they are the crux of the genius involved in data science project methodology. Exploratory Data Analysis is where data becomes useful, so it is a part of the process that can’t be left undone. No matter what one thinks of the mechanics of the process, the preliminary questions and findings can be absolutely fascinating.

Questions such as these are common at this stage:

  • Does frequency increase as the number of accounts that are 30+ days past due increases? Is there a trend?
  • Does severity decrease as the number of mortgage trades decreases? Do these trends make sense?
  • Is the number of claims per policy greater for renewal policies than for new policies? Does this finding make sense? If not, is there an error in the way the data was prepared or in the source data itself?
  • If younger drivers have lower loss ratios, should this be investigated as an error in the data or an anomaly in the business? Some trends will not make any sense, and perhaps these features should be dropped from analysis or the data set redesigned.
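
As a rough illustration of the trend checks listed above, the snippet below summarizes claim frequency by the number of 30+ day past-due accounts and average claims per policy for new versus renewal business, using the column names assumed in the earlier sketches.

```python
# Simple trend checks against the exploratory data set.
freq_by_delinquency = (
    data.groupby("accounts_30_plus_dpd")["frequency"].mean().sort_index()
)
claims_by_policy_type = data.groupby("new_renewal_flag")["claim_count"].mean()

print(freq_by_delinquency)    # does frequency rise as past-due accounts increase?
print(claims_by_policy_type)  # do renewals show more claims per policy than new business?
```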

See also: The Science (and Art) of Data, Part 2

The more we look at data sets, the more we realize that the limits on what can be discovered or uncovered are few and shrinking. Thinking about relationships between personal behavior and buying patterns, or between credit patterns and claims, can fuel the interest of everyone in the organization. As the details of the evidence begin to gain clarity, the case also begins to come into focus. An apparent “solution” begins to appear, and the data scientist is ready to build that solution.

In Part 3, we’ll look at what is involved in building and testing a data science project solution and how pilots are crucial to confirming project findings.

Data Science: Methods Matter (Part 1)

Why should an insurer employ data science? How does data science differ from any other business analytics that might be happening within the organization? What will it look like to bring data science methodology into the organization?

In nearly every engagement, Majesco’s data science team fields questions as foundational as these, as well as questions related to the details of business needs. Business leaders are smart to do their due diligence — asking IF data science will be valuable to the organization and HOW valuable it might be.

To provide a feel for how data science operates, in this first of three blog posts we will touch briefly on the history of data mining methodology, then look at what an insurer can expect when first engaging in the data science process. Throughout the series, we’re going to keep our eyes on the focus of all of our efforts: answers.

Answers

The goal of most data science is to apply the proper analysis to the right sets of data to provide answers. That proper analysis is just as important as the question an insurer is attempting to answer. After all, if we are in pursuit of meaningful business insights, we certainly don’t want to come to the wrong conclusions. There is nothing worse than moving your business full-speed ahead in the wrong direction based upon faulty analysis. Today’s analysis benefits from a thoughtfully constructed data project methodology.

As data mining was on the rise in the 1990s, it became apparent there were a thousand ways a data scientist might pursue answers to business questions. Some of those methods were useful and good, and some were suspect — they couldn’t truly be called methods. To help keep data scientists and their clients from arriving at the wrong conclusions, a methodology needed to be introduced. A defined yet flexible process would not only assist in managing a specific project scope but would also work toward verifying conclusions by building in pre-test and post-project monitoring against expected results. In 1996, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced, the first step in the standardization of data mining projects. Though CRISP-DM was a general data project methodology, insurance had a hand in its development: the Dutch insurer OHRA was one of the four sponsoring organizations that co-launched the standardization initiative.

See also: Data Science: Methods Matter

CRISP-DM has proven to be a strong foundation in the world of data science. Even though the number of available data streams has skyrocketed in the last 20 years and the tools and technology of analysis have improved, the overall methodology is still solid. Majesco uses a variant of CRISP-DM, honed over many years of experience in multiple industries.

Pursuing the right questions — Finding the business nugget in the data mine

Before data mining project methodologies were introduced, one issue companies had was a lack of substantial focus on obtainable goals. Projects didn’t always have a success definition that would help the business in the end. Research could be vague, and methods could be transient.

Research needs focus, so the key ingredient in data science methodology is business need. The insurer has a problem it wishes to solve. It has a question with no readily apparent answer. If an insurer hasn’t used data scientists before, this is a frequent point of entry. It is also one of the greatest differentiators between traditional in-house data analysis and project-based data science methodology. Instead of tracking trends, data science methodology is focused on finding clear answers to defined questions. Normally these issues are more difficult to solve and represent a greater business risk, making it easy to justify seeking outside assistance.

Project Design — First meeting and first steps

Phase 1 of a data science project life cycle is project design. This phase is about listening and learning about the business problem (or problems) that are ready to be addressed. For example, a P&C insurer might be wondering why loyalty is lowest in the three states where it has the highest claims — Florida, Georgia and Texas. Is this an anomaly, or is there a correlation between the two statistics? A predictive model could be built to predict the likelihood of attrition. The model score could then be used to determine what actions should be taken to reward and keep a good customer, or perhaps what actions could be taken to remove frequent or high-risk claimants from the books.
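
As a hedged sketch of what such an attrition model might look like, the snippet below trains a simple classifier on historical customer data and scores each customer's likelihood of leaving; the data file, column names and choice of logistic regression are illustrative assumptions, not a description of any insurer's actual model.

```python
# Illustrative attrition (churn) model: score each customer's likelihood of leaving.
# "customer_history.csv", its columns and the numeric-only feature set are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

customers = pd.read_csv("customer_history.csv")  # includes a 0/1 "lapsed" column
X = customers.drop(columns=["lapsed", "customer_id"])  # assumed numeric features
y = customers["lapsed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The score can drive retention actions for valuable customers and flag
# segments with both high claims and high attrition risk.
customers["attrition_score"] = model.predict_proba(X)[:, 1]
print(customers[["customer_id", "attrition_score"]].head())
```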

The insurer must unpack background and pain points. Does the customer have access to all of the data that is needed for analysis? Should the project be segmented in such a way that it provides for detailed analysis at multiple levels? For example, the insurer may need to run the same type of claims analysis across personal auto, commercial vehicle, individual home and business property. These would represent segmented claims models under the same project.

See also: What Comes After Big Data

The insurer must identify assumptions, definitions, possible solutions and a picture of the risks involved for the project, sorting out areas where segmented analysis may be needed. The team must also collect some information to assist in creating a cost-benefit analysis for the project.

As a part of the project design meetings, the company must identify the analytic techniques that will be used and discuss the features the analysis can use. At the end of the project design phase, everyone knows which answers they are seeking and the questions that will be used to frame those answers. They have a clear understanding of the data that is available for their use and have an outline of the full project.

With the clarity to move forward, the insurer moves into a closer examination of the data that will be used.

In Part 2, we will look at the two-step data preparation process that is essential to building an effective solution. We will also look at how the proliferation of data sources is supplying insurers with greater analytic opportunities than ever.

Helping Data Scientists Through Storytelling

Good communication is always a two-way street. Insurers that employ data scientists or partner with data science consulting firms often look at those experts much like one-way suppliers. Data science supplies the analytics; the business consumes the analytics.

But as data science grows within the organization, most insurers find the relationship is less about one-sided data storytelling and more about the synergies that occur in data science and business conversations. We at Majesco don’t think it is overselling data science to say these conversations and relationships can have a monumental impact on the organization’s business direction. So, forward-thinking insurers will want to take some initiative in supporting both data scientists and business data users as they work to translate their efforts and needs for each other.

In my last two blog posts, we walked through why effective data science storytelling matters, and we looked at how data scientists can improve data science storytelling in ways that will have a meaningful impact.

In this last blog post of the series, we want to look more closely at the organization’s role in providing the personnel, tools and environment that will foster those conversations.

Hiring, supporting and partnering

Organizations should begin by attempting to hire and retain talented data scientists who are also strong communicators. They should be able to talk to their audience at different levels—very elementary levels for “newbies” and highly theoretical levels if their customers are other data scientists. Hiring a data scientist who only has a head for math or coding will not fulfill the business need for meaningful translation.

Even data scientists who are proven communicators could benefit from access to in-house designers and copywriters for presentation material. Depending on the size of the insurer, a small data communication support staff could be built to include a member of in-house marketing, a developer who understands reports and dashboards and the data scientist(s). Just creating this production support team, however, may not be enough. The team members must work together to gain their own understanding. Designers, for example, will need to work closely with the analyst to get the story right for presentation materials. This kind of scenario works well if an organization is mass-producing models of a similar type. Smooth development and effective data translation will happen with experience. The goal is to keep data scientists doing what they do best—using less time on tasks that are outside of their domain—and giving data’s story its best possibility to make an impact.

Many insurers aren’t yet large enough to employ or attract data scientists. A data science partner provides more than just added support. It supplies experience in marketing and risk modeling, experience in the details of analytic communications and a broad understanding of how many areas of the organization can be improved.

Investing in data visualization tools

Organizations will need to support their data scientists not only with advanced statistical tools but also with visualization tools. There are already many data mining tools on the market, but many are designed with outputs that serve a theoretical perspective, not necessarily a business perspective. For the business perspective, you’ll want to employ tools such as Tableau, Qlikview and YellowFin, which are all excellent data visualization tools that are key to business intelligence but are not central to advanced analytics. These tools are especially effective at showing how models can be used to improve the business using overlaid KPIs and statistical metrics. They can slice and dice the analytical populations of interest almost instantaneously.

When it comes to data science storytelling, one tool normally will not tell the whole story. Storytelling will require a variety of tools, depending on the various ideas the data scientist is trying to convey. To implement the data and model algorithms into a system the insurer already uses, a number of additional tools may be required. (These normally aren’t major investments.)

In the near future, I think data mining and advanced analytics tools will evolve to offer better data visualization capabilities than are currently available. Insurers shouldn’t wait, however, to test and use the tools that are available today. Experience today will improve tomorrow’s business outcomes.

Constructing the best environment

Telling data’s story effectively may work best if the organization can foster a team management approach to data science. This kind of strategic team (different than the production team) would manage the traffic of coming and current data projects. It could include a data liaison from each department, a project manager assigned by IT to handle project flow and a business executive whose role is to make sure priority focus remains on areas of high business impact. Some of these ideas, and others, are dealt with in John Johansen’s recent blog series, Where’s the Real Home for Analytics?

To quickly reap the rewards of the data team’s knowledge, a feedback vehicle should be in place. A communication loop will allow the business to comment on what is helpful in communication; what is not helpful; which areas are ripe for current focus; and which products, services and processes could use (or provide) data streams in the future. With the digital realm in a consistent state of fresh ideas and upheaval, an energetic data science team will have the opportunity to grow together, get more creative and brainstorm more effectively on how to connect analytics to business strategies.

Equally important in these relationships is building adequate levels of trust. When the business not only understands the stories data scientists have translated for them but also trusts the sources and the scientists themselves, a vital shift has occurred. The value loop is complete, and the organization should become highly competitive.

Above all, in discussing the needs and hurdles, do not lose the excitement of what is transpiring. An insurer’s thirst for data science and data’s increased availability is a positive thing. It means complex decisions are being made with greater clarity and better opportunities for success. As business users see results that are tied to the stories supplied by data science, its value will continue to grow. It will become a fixed pillar of organizational support.

This article was written by Jane Turnbull, vice president – analytics for Majesco.