
Data Science: Methods Matter (Part 2)

What makes data science a science?


When data analytics relies on simple formulas, heavy conjecture and an arbitrary methodology, it often fails at what it was designed to do: give accurate answers to pressing questions.

So at Majesco, we pursue a proven data science methodology in an attempt to lower the risk of misapplying data and to improve predictive results. In Methods Matter, Part 1, we provided a picture of the methodology that goes into data science. We discussed CRISP-DM and the opening phase of the life cycle, project design.

In Part 2, we will discuss the heart of the life cycle — the data itself. To do that, we’ll take an in-depth look at two central steps: building a data set, and exploratory data analysis. Together, these steps make up the phase most critical to project success, and they illustrate why data analytics is more complex than many insurers realize.

Building a Data Set

Building a data set is, in one way, no different from gathering evidence to solve a mystery or a criminal case. The best case will be built with verifiable evidence. The best evidence will be gathered by paying attention to the right clues.

There will also almost never be just one piece of evidence used to build a case, but a complete set of gathered evidence — a data set. It’s the data scientist’s job to ask, “Which data holds the best evidence to prove our case right or wrong?”

Data scientists will survey the client or internal resources for available in-house data, and then discuss obtaining additional external data to complete the data set. This search for external data is more prevalent now than it once was: the range of external data sources, and their value to the analytics process, has ballooned with the rise of mobile data, images, telematics and sensor availability.

See also: The Science (and Art) of Data, Part 1

A typical data set might include, for example, external sources such as credit file data from credit reporting agencies, alongside internal policy and claims data. This type of information is commonly used by actuaries in pricing models and is contained in state filings with insurance regulators. Choosing which features go into the data set is the result of dozens of questions and some close inspection. The task is to find the elements or features of the data set that have real value in answering the questions the insurer needs to answer.

In-house data, for example, might include premiums, number of exposures, new and renewal policies and more. The external credit data may include information such as number of public records, number of mortgage accounts and number of accounts 30+ days past due, among others. The goal at this point is to make sure that the data is as clean as possible. A target variable of interest might be something like frequency of claims, severity of claims, or loss ratio. This step is often performed by in-house resources, such as insurance data analysts familiar with the organization’s available data, or by external consultants such as Majesco.
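As a minimal sketch (not Majesco's actual method), the three target variables named above can be computed from a handful of policy-level figures; all function names and numbers here are illustrative:

```python
def frequency(claim_count, earned_exposures):
    """Claim frequency: claims per unit of earned exposure."""
    return claim_count / earned_exposures

def severity(incurred_losses, claim_count):
    """Claim severity: average incurred loss per claim."""
    return incurred_losses / claim_count

def loss_ratio(incurred_losses, earned_premium):
    """Loss ratio: incurred losses as a share of earned premium."""
    return incurred_losses / earned_premium

# Toy book: 120 claims on 2,000 earned exposures,
# $600,000 incurred losses, $1,000,000 earned premium.
print(frequency(120, 2000))            # 0.06
print(severity(600_000, 120))          # 5000.0
print(loss_ratio(600_000, 1_000_000))  # 0.6
```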

At all points along the way, the data scientist is reviewing the data source’s suitability and integrity. An experienced analyst will often quickly discern the character and quality of the data by asking: Does the number of policies look correct for the size of the book of business? Does the average number of exposures per policy look correct? Does the overall loss ratio seem correct? Does the number of new and renewal policies look correct? Are there an unusually high number of missing or unexpected values in the data fields? Is there an apparent reason for something to look out of order? If not, can the data fields be corrected? If they can’t, are the data issues so important that these fields should be dropped from the data set? Some whole-record observations may clearly contain bad data and should be dropped from the data set. Even further, is the data so problematic that the whole data set should be redesigned, or the whole analytics project scrapped?
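A couple of the sanity checks above can be sketched in a few lines. This is a toy example on a list-of-dicts data set; the field names, values and plausibility ranges are all hypothetical:

```python
# Toy policy records; one missing premium, one implausible exposure count.
records = [
    {"policy_id": 1, "exposures": 1.0, "premium": 950.0},
    {"policy_id": 2, "exposures": 2.0, "premium": None},
    {"policy_id": 3, "exposures": -5.0, "premium": 1200.0},
]

def missing_rate(records, field):
    """Share of records where the field is missing."""
    return sum(1 for r in records if r.get(field) is None) / len(records)

def suspect_records(records, field, lo, hi):
    """Records whose value falls outside a plausible range."""
    return [r for r in records
            if r.get(field) is not None and not (lo <= r[field] <= hi)]

print(missing_rate(records, "premium"))                   # 1 of 3 missing
print(suspect_records(records, "exposures", 0.0, 100.0))  # the bad row
```

In practice these checks run against aggregate figures too (total policies, overall loss ratio), but the pattern — quantify the anomaly, then decide whether to correct or drop — is the same.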


Once the data set has been built, it is time for an in-depth analysis that steps closer toward solution development.

Exploratory Data Analysis

Exploratory data analysis takes the newly minted data set and begins to do something with it — “poking it” with measurements and variables to see how it might stand up in actual use. The data scientist runs preliminary tests on the “evidence,” subjecting the data set to a deeper look at its collective value. If the percentage of missing values in a feature is too large, that feature is probably not a good predictor variable and should be excluded from further analysis. In this phase, it may also make sense to create new features, including mathematical transformations that capture non-linear relationships between the features and the target variable.

For non-statisticians, marketing managers and non-analytical staff, the details of exploratory data analysis can be tedious and uninteresting. Yet they are at the heart of sound data science methodology. Exploratory data analysis is where data becomes useful, so it is a part of the process that can’t be left undone. Whatever one thinks of the mechanics of the process, the preliminary questions and findings can be absolutely fascinating.

Questions such as these are common at this stage:

  • Does frequency increase as the number of accounts that are 30+ days past due increases? Is there a trend?
  • Does severity decrease as the number of mortgage trades decreases? Do these trends make sense?
  • Is the number of claims per policy greater for renewal policies than for new policies? Does this finding make sense? If not, is there an error in the way the data was prepared or in the source data itself?
  • If younger drivers have lower loss ratios, should this be investigated as an error in the data or an anomaly in the business? Some trends will not make any sense, and perhaps these features should be dropped from analysis or the data set redesigned.
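The third question above — are claims per policy higher for renewals than for new business? — reduces to a grouped average, sketched here on made-up data with hypothetical field names:

```python
# Toy policy-level data; "status" and "claims" are illustrative fields.
policies = [
    {"status": "new", "claims": 0}, {"status": "new", "claims": 1},
    {"status": "renewal", "claims": 1}, {"status": "renewal", "claims": 2},
    {"status": "renewal", "claims": 0},
]

def claims_per_policy(policies, status):
    """Average claim count among policies with the given status."""
    counts = [p["claims"] for p in policies if p["status"] == status]
    return sum(counts) / len(counts)

new_rate = claims_per_policy(policies, "new")          # 0.5
renewal_rate = claims_per_policy(policies, "renewal")  # 1.0
print(renewal_rate > new_rate)  # in this toy data, renewals claim more
```

If the comparison runs against expectation, the same drill applies as in the bullet above: check data preparation first, then the source data, before accepting it as a business finding.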

See also: The Science (and Art) of Data, Part 2

The more we look at data sets, the more we realize how few limits there are on what can be discovered or uncovered — and those limits keep shrinking. Thinking about relationships between personal behavior and buying patterns, or between credit patterns and claims, can fuel the interest of everyone in the organization. As the details of the evidence gain clarity, the case begins to come into focus. An apparent “solution” emerges, and the data scientist is ready to build it.

In Part 3, we’ll look at what is involved in building and testing a data science project solution and how pilots are crucial to confirming project findings.

3 Warning Signs of Adverse Selection

The top 25 insurers hold 70% of the workers’ compensation market, and, as the adoption of data and predictive analytics grows across the insurance industry, so does the divide between insurers with a competitive advantage and those without one. One of the largest consequences of this analytics revolution is the increasing threat of adverse selection, which occurs when a competitor undercuts the incumbent’s pricing on the best risks while avoiding poor-performing risks at inadequate prices.

Every commercial lines carrier faces it, whether it knows it or not. A relative few actively use adverse selection offensively to carve out new market opportunities from less sophisticated opponents. An equally small group knows it is the unwilling victim of adverse selection, with competitors steadily replacing its best long-term risks with poor-performing accounts.

It’s the much larger middle group that’s in real trouble — those that are having their lunch quietly stolen each and every day, without even realizing it.

Three Warning Signs of Adverse Selection
Adverse selection is a particularly dangerous threat because it is deadly to a portfolio yet only recognizable after the damage has been done. However, there are specific warning signs to look out for that indicate your company is vulnerable:

  1. Loss Ratios and Loss Costs Climb – When portfolio loss ratios are climbing, it is easy to blame market conditions and the competition’s “irrational pricing.” If you or your colleagues are talking about the crazy pricing from the competition, it could be a sign that your competitor has better information to assess the same risks. For example, in 2009, Travelers Insurance, known to use predictive analytics for pricing, had a combined ratio of 89%, while the P&C industry overall had a combined ratio of 101%.
  2. Rates Go Up, and Volume Declines – As loss ratios increase along with losses per earned exposure, the actuarial case emerges: Manual rates are inadequate to cover expected future costs. In this situation, tension grows among the chief decision makers. Raising rates will put policy retention and volumes at risk, but failing to raise rates will cut deeply into portfolio profitability. Often in the early stages of this warning sign, insurers opt to raise rates, which makes both acquisition and retention tougher. After another policy cycle, there is often a lurking surprise: The actuary finds that the rate increase was insufficient to cover the higher projected future losses. At this point, adversely selected insurers raise rates again (assuming their competitors are doing the same). The cycle repeats, and adverse selection has taken hold.
  3. Reserves Become Inadequate – When actuaries report signs of mild reserve inadequacy, the claims department often counters that reserving practices haven’t changed; loss frequency and severity have simply increased. This leads to major decreases in return on assets (ROA) and forces insurers to downsize and focus on a niche specialization to survive, with little hope of future growth. The fundamental problem behind all of this is that the insurer cannot identify and price risk with the accuracy its competitors can.
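The combined ratio cited in the first warning sign is a simple quotient. The sketch below uses a simplified calendar-year form with made-up inputs (these are not Travelers' actual statement figures):

```python
def combined_ratio(incurred_losses, expenses, earned_premium):
    """Simplified combined ratio: (losses + expenses) / earned premium.
    Below 100% indicates an underwriting profit; above 100%, a loss."""
    return (incurred_losses + expenses) / earned_premium

# Illustrative figures producing an 89% combined ratio.
print(combined_ratio(620_000, 270_000, 1_000_000))  # 0.89
```

Real statutory calculations split the expense ratio against written rather than earned premium, but the comparison logic — your ratio versus the industry's — is unchanged.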

Predictive Analytics Evens the Playing Field
The easiest way to prevent your business from being adversely selected is to start with the foundation of your risk management: the underwriting. Traditional insurance companies rely only on their own data to price risks, but more analytically driven companies use a diversified set of data sources to prevent sample bias.

For small to mid-sized businesses that can’t afford to build out their internal data assets, there are third-party sources and solutions that can provide underwriters with the insight to make quicker and smarter pricing decisions. Having access to large quantities of granular data allows insurers to assess risk more accurately and win the right business for the best price while avoiding bad business.

Additionally, insurers are using predictive analytics to expand their scope of influence in insurance. With market share consolidation on the rise, insurers in niche workers’ compensation markets face even more pressure, not only to protect their current business but also to gain the confidence to underwrite risks in new markets and expand their book of business. According to a recent Accenture survey, 72% of insurers are struggling to maintain underwriting and pricing discipline. The trouble will only grow as insurers expand into new territories without the wealth of data needed to price these new risks appropriately. The market will divide into companies that use predictive models to price risks more accurately and those that do not.

At the very foundation of any adversely selected insurer is the inability to price new and renewal business accurately. Overhauling your entire enterprise overnight to be data-driven and equipped for advanced data analytics is an unreasonable goal. However, beginning with a specific segment of your business is not only reasonable but will also help you fight adverse selection and lower your loss ratio.

This article first appeared on wci360.com.