Tag Archives: data quality

Healthcare Data: The Art and the Science

Medicine is often considered part science and part art. There is a huge amount of content to master, but there is an equal amount of technique regarding diagnosis and delivery of service. To succeed, care providers need to master both components. The same can be said for the problem of processing healthcare data in bulk. In spite of the many standards and protocols regarding healthcare data, translating and consolidating data across many sources of information in a reliable and repeatable way is a tremendous challenge. At the heart of this challenge is recognizing when quality has been compromised. The successful implementation of a data quality program within an organization, like medicine, combines a science with an art. Here, we will run through the basic framework that is essential to a data-quality initiative and then describe some of the lesser-understood processes that need to be in place in order to succeed.

The science of implementing a data quality program is relatively straightforward. There is field-level validation, which ensures that strings, dates, numbers and lists of valid values are in good form. There is cross-field validation and cross-record validation, which checks the integrity of the expected relationships to be found within the data. There is also profiling, which considers historical changes in the distribution and volume of data and determines significance. Establishing a framework to embed this level of quality checks and associated reporting is a major effort, but it is also clearly an essential part of any successful implementation involving healthcare data.
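To make these categories concrete, here is a minimal sketch (in Python, with invented field names and rules rather than anything drawn from a particular system) of what field-level and cross-field checks on a single claim record might look like:

from datetime import date

VALID_GENDERS = {"M", "F", "U"}   # hypothetical list of valid values

def field_level_errors(claim: dict) -> list:
    """Field-level validation: each value is well formed on its own."""
    errors = []
    if claim.get("gender") not in VALID_GENDERS:
        errors.append("gender not in list of valid values")
    if not str(claim.get("member_id", "")).strip():
        errors.append("member_id is blank")
    try:
        date.fromisoformat(str(claim.get("service_date", "")))
    except ValueError:
        errors.append("service_date is not a valid date")
    return errors

def cross_field_errors(claim: dict) -> list:
    """Cross-field validation: relationships between fields hold.
    Assumes ISO-format date strings for simplicity."""
    errors = []
    if claim.get("discharge_date", "") < claim.get("admit_date", ""):
        errors.append("discharge_date precedes admit_date")
    return errors

print(field_level_errors({"gender": "X", "member_id": "", "service_date": "2016-02-30"}))

Cross-record validation and profiling follow the same pattern, but operate over sets of records and over history rather than over one record at a time.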

Data profiling and historical trending are also essential tools in the science of data-quality management. As we go further down the path of conforming and translating our healthcare data, there are inferences to be made. There is provider and member matching based on algorithms, categorizations and mappings that are logic-based, and then there are the actual analytical results and insights generated from the data for application consumption.

See also: Big Data? How About Quality Data?  

Whether your downstream application is analytical, workflow, audit, outreach-based or something else, you will want to profile and perform historical trending of the final result of your load processes. There are so many data dependencies between and among fields and data sets that it is nearly impossible for you to anticipate them all. A small change in the relationship between, say, the place of service and the specialty of the service provider can alter your end-state results in surprising and unexpected ways.

This is the science of data-quality management. It is quite difficult to establish full coverage – nearly impossible – and that is where “art” comes into play.

If we do a good job and implement a solid framework and reporting around data quality, we immediately find that there is too much information. We are flooded with endless sets of exceptions and variations.

The imperative of all of this activity is to answer the question, “Are our results valid?” Odd as it may seem, there is some likelihood that key teams or resident SMEs will decide not to use all that exception data because it is hard to separate the relevant exceptions from the irrelevant ones. This is a more common outcome than one might think. How do we figure out which checks are the important ones?

Simple cases are easy to understand. If the system doesn’t do outbound calls, then maybe phone number validation is not very important. If there is no e-mail generation or letter generation, maybe these data components are not so critical.

In many organizations, the final quality verification is done by inspection, reviewing reports and UI screens. Inspecting the final product is not a bad thing and is prudent in most environments, but clearly, unless there is some automated validation of the overall results, such organizations are bound to learn of their data problems from their customers. This is not quite the outcome we want. The point is that many data-quality implementations are centered primarily on the data as it comes in, and less on the outcomes produced.

Back to the science. The overall intake process can be broken down into three phases: staging, model generation and insight generation. We can think of our data-quality analysis as post-processes to these three phases. Post-staging, we look at the domain (field)-level quality; post-model generation, we look at relationships, key generation, new and orphaned entities. Post-insight generation, we check our results to see if they are correct, consistent and in line with prior historical results.
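As a purely illustrative sketch, the pipeline below shows the shape of that idea: three phases, each followed by its own quality gate. The phase and check functions are hypothetical placeholders for whatever an organization actually runs.

# A minimal, hypothetical sketch: three intake phases, each followed by a quality gate.
def stage(raw_records):
    return list(raw_records)                         # load into staging

def generate_model(staged):
    return {"entities": staged}                      # conform, translate, generate keys

def generate_insights(model):
    return {"entity_count": len(model["entities"])}  # downstream results

def check_fields(staged):
    assert all(r for r in staged), "blank records after staging"        # domain/field checks

def check_model(model):
    assert model["entities"], "no entities generated"                   # keys, orphans, new entities

def check_insights(insights, prior_count=2):
    assert abs(insights["entity_count"] - prior_count) <= prior_count, \
        "results far from prior historical run"                         # consistency with history

def run_intake(raw_records):
    staged = stage(raw_records)
    check_fields(staged)                             # post-staging gate
    model = generate_model(staged)
    check_model(model)                               # post-model gate
    insights = generate_insights(model)
    check_insights(insights)                         # post-insight gate
    return insights

print(run_intake(["claim_1", "claim_2"]))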

If the ingestion process takes many hours, days or weeks, we will not want to wait until the entire process has completed to find out that results don’t look good. The cost of re-running processes is a major consideration. Missing a deadline due to the need to re-run is a major setback.

The art of data quality management is figuring out how to separate the noise from the essential information. Instead of showing all test results from all of the validations, we need to learn how to minimize the set of tests made while maximizing the chances of seeing meaningful anomalies. Just as an effective physician would not subject patients to countless tests that may or may not be relevant to a particular condition, an effective data-quality program should not present endless test results that may or may not be relevant to the critical question regarding new data introduced to the system. Is it good enough to continue, or is there a problem?

We need to construct a minimum number of views into the data that represents a critical set and is a determinant of data quality. This minimum reporting set is not static, but changes as the product changes. The key is to focus on insights, results and, generally, the outputs of your system. The critical function of your system determines the critical set.

Validation should be based on the configuration of your customer. Data that is received and processed but not actively used should not be validated along with data that is used. There is also a need for customer-specific validation in many cases. You will want controls by product and by customer. The mechanics of adding new validation checks should be easy and the framework should scale to accommodate large numbers of validations. The priority of each verification should be considered carefully. Too many critical checks and you miss clues that are buried in data. Too few and you miss clues because they don’t stand out.

See also: 4 Ways to Keep Data Quality High  

Profiling your own validation data is also key. You should know, historically, how many errors of each type you typically encounter, and flag statistically significant variation just as you would when you detect variations in essential data elements and entities. Architecture is important: wanting the ability to profile and report on anything implies building it in a common, centralized way rather than taking different approaches for each area you want to profile.
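As a rough illustration of profiling your own validation data, the sketch below flags a daily error count that sits far outside its own history; the error type, counts and threshold are invented for the example.

from statistics import mean, stdev

def flag_unusual_error_count(history, current, z_threshold=3.0):
    """Flag today's error count if it deviates sharply from its own history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Example: daily counts of "invalid NPI" errors over the last two weeks.
history = [41, 38, 44, 40, 39, 42, 45, 37, 40, 43, 41, 39, 44, 42]
print(flag_unusual_error_count(history, current=118))   # True: investigate
print(flag_unusual_error_count(history, current=43))    # False: within normal range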

Embedding critical validations as early in the ingestion process as possible is essential. It is often possible to provide validations that emulate downstream processing. The quality team should have incentives to pursue these types of checks on a continuing basis. They are not obvious and are never complete, but are part of any healthy data-quality initiative.

A continuous improvement program should be in place to monitor and tune the overall process. Unless the system is static, codes change, dependencies change and data inputs change. There will be challenges, and with every exposed gap found late in the process there is an opportunity to improve.

This post has glossed over a large amount of material, and I have oversimplified much to convey some of the not-so-obvious lessons of the craft. Quality is a big topic, and organizations should treat it as such. Getting true value is indeed an art, as it is easy to invest and not get the intended return. This is not a project with a beginning and an end but a continuing process. Just as with the practice of medicine, there is a lot to learn in terms of the science of constructing the proper machinery, but there is an art to establishing active policies and priorities that deliver successfully.

Model Errors in Disaster Planning

“All models are wrong; some are useful.” – George Box

We have spent three articles (article 1, article 2, article 3) explaining how catastrophe models provide a tool for much-needed innovation to the global insurance industry. Catastrophe models have compensated for the industry’s lack of experience with many types of losses and let insurers properly price and underwrite risks, manage portfolios, allocate capital and design risk management strategies. Yet for all the practical benefits CAT models have infused into the industry, product innovation has stalled.

The halt in progress is a function of what models are and how they work. In fairness to those who do not put as much stock in the models as a useful tool, it is important to speak of the models’ limitations and where the next wave of innovation needs to come from.

Model Design

Models are sets of simplified instructions that are used to explain phenomena and provide relevant insight into future events (for CAT models – estimating future catastrophic losses). We humans start using models at very early ages. No one would confuse a model airplane with a real one; however, if a parent wanted to simplify the laws of physics to explain to a child how planes fly, then a model airplane is a better tool than, say, a physics book or computer-aided design software. Conversely, if you are a college student studying engineering or aerodynamics, the reverse is true. In each case, we are attempting to use a tool – models of flight, in this instance – to explain how things work and to lend insight into what could happen based on historical data so that we can merge theory and practice into something useful. It is the constant iteration between theory and practice that allows an airplane manufacturer to build a new fighter jet, for instance. No manufacturer would foolishly build an airplane based on models alone, no matter how scientifically advanced those models are, but those models would be incredibly useful in guiding the designers to experimental prototypes. We build models, test them, update them with new knowledge, test them again and repeat the process until we achieve desired results.

The design and use of CAT models follow this exact pattern. The first CAT models estimated loss by first calculating total industry losses and then proportionally allocating losses to insurers based on assumptions about market share. That evolved into calculating loss estimates for specific locations at specific addresses. As technology advanced into the 1990s, model developers harnessed that computing power and were able to develop simulation programs to analyze more data, faster. The model vendors then added more models to cover more global peril regions. Today’s CAT models can even estimate construction type, height and building age if an insurer does not readily have that information.

As catastrophic events occur, modelers routinely compare the actual event losses with the models and measure how well or how poorly the models performed. Using actual incurred loss data helps calibrate the models and also enables modelers to better understand the areas in which improvements must be made to make them more resilient.

However, for all the effort and resources put into improving the models (model vendors spend millions of dollars each year on model research, development, improvement and quality assurance), there is still much work to be done to make them even more useful to the industry. In fact, virtually every model component has its limitations. A CAT model’s hazard module is a good example.

The hazard module takes into account the frequency and severity of potential disasters. Following the calamitous 2004 and 2005 U.S. hurricane seasons, the chief model vendors felt pressure to amend their base catalogs to reflect the new high-risk era we appeared to be in, that is, to take into account higher-than-average sea surface temperatures. These model changes dramatically affected reinsurance purchase decisions and account pricing. And yet, little followed. What was assumed to be the new normal actually turned into one of the quietest hurricane periods on record.

Another example was the magnitude-9.0 Great Tōhoku Earthquake of 2011 in Japan. The models had no events even close to this monster earthquake in their event catalogs. Every model clearly got it wrong, and, as a result, model vendors scrambled to fix this “error” in the model. Have the errors been corrected? Perhaps in these circumstances, but what other significant model errors exist that have yet to be corrected?

CAT model peer reviewers have also taken issue with the event catalogs used in the modeling process to quantify catastrophic loss. For example, insurers struggle to answer questions such as: What is the probability of a Category 5 hurricane making landfall in New York City? Of course, no one can provide an answer with certainty. However, while no one can doubt the level of damage an event of that intensity would bring to New York City (Superstorm Sandy was not even a hurricane at landfall in 2012 and yet caused tens of billions of dollars in insured damages), the critical question for insurers is: Is this event rare enough that it can be ignored, or do we need to prepare for an event of that magnitude?

To place this into context, the Category 3, 1938 Long Island Express event would probably cause more than $50 billion in insured losses today, and that event did not even strike New York City. If a Category 5 hurricane hitting New York City was estimated to cause $100 billion in insured losses, then knowing whether this was a 1-in-10,000-year possibility or a 1-in-100-year possibility could mean the difference between solvency and insolvency for many carriers. If that type of storm was closer to a 1-in-100-year probability, then insurers have the obligation to manage their operations around this possibility; the consequences are too grave, otherwise.
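To make the stakes concrete, here is a back-of-the-envelope comparison of the annualized expected loss under each return-period assumption, using the $100 billion figure above:

insured_loss = 100e9                      # $100 billion if the event occurs

for return_period in (100, 10_000):       # 1-in-100 vs. 1-in-10,000 years
    annual_probability = 1 / return_period
    expected_annual_loss = insured_loss * annual_probability
    print(f"1-in-{return_period:,}-year event: "
          f"${expected_annual_loss / 1e6:,.0f}M expected loss per year")

The hundredfold difference in expected annual loss ($1 billion per year versus $10 million per year) is exactly why the return-period question matters so much to solvency.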

Taking into account the various chances of a Category 5 directly striking New York City, what does that all mean? It means that adjustments in underwriting, pricing, accumulated capacity in that region and, of course, reinsurance design all need to be considered — or reconsidered, depending on an insurer’s present position relative to its risk appetite. Knowing the true probability is not possible at this time; we need more time and research to understand that. Unfortunately for insurers, rating agencies and regulators, we live in the present, and sole reliance on the models to provide “answers” is not enough.

Compounding this problem is that, regardless of the peril, errors exist in every model’s event catalog. These errors cannot be avoided entirely, and the problem escalates where the paucity of historical records and scientific experiments limits the industry’s ability to inch closer and closer to certainty.

Earthquake models still lie beyond a comfortable reach of predictability. Some of the largest and most consequential earthquakes in U.S. history have occurred near New Madrid, MO. Scientists are still wrestling with the mechanics of that fault system. Thus, managing a portfolio of properties solely dependent on CAT model output is foolhardy at best. There is too much financial consequence from phenomena that scientists still do not understand.

Modelers also need to continuously reassess property vulnerability, taking into consideration the various building stock types and current building codes. Assessing this with imperfect data and across differing building codes and regulations is difficult. That is largely the reason that so-called “vulnerability curves” are often revised after spates of significant events. Understandably, each event yields additional data points for consideration, which must be taken into account in future model versions. Damage surveys following Hurricane Ike showed that the models underestimated contents vulnerability within large high-rises because of water damage caused by wind-driven rain.

As previously described, a model is a set of simplified instructions, which can be programmed to make various assumptions based on the input provided. Models, therefore, are subject to the garbage-in, garbage-out problem. As insurers adapt to these new models, they often need to cajole their legacy IT systems to provide the required data to run the models. For many insurers, this is an expensive and resource-intensive process, often taking years.

Data Quality’s Importance

Currently, the quality of industry data to be used in such tools as CAT models is generally considered poor. Many insurers are inputting unchecked data into the models. For example, it is not uncommon that building construction type, occupancy, height and age, not to mention a property’s actual physical address, are unknown! For each property whose primary and secondary risk characteristics are missing, the models must make assumptions regarding those precious missing inputs – even regarding where the property is located. This increases model uncertainty, which can lead to inaccurate assessment of an insurer’s risk exposure.

CAT modeling results are largely ineffective without quality data collection. For insurers, the key risk is that poor data quality could lead to a misunderstanding of their exposure to potential catastrophic events. This, in turn, will have an impact on portfolio management, possibly leading to unwanted exposure distribution and unexpected losses, which will affect both insurers’ and their reinsurers’ balance sheets. If model results are skewed as a result of poor data quality, this can lead to incorrect assumptions, inadequate capitalization and the failure to purchase sufficient reinsurance for insurers. Model results based on complete and accurate data ensure greater model output certainty and credibility.

The Future

Models are designed and built based on information from the past. Using them is like trying to drive a car by only looking in the rear view mirror; nonetheless, catastrophes, whether natural or man-made, are inevitable, and having a robust means to quantify them is critical to the global insurance marketplace and lifecycle.

Or is it?

Models, and CAT models in particular, provide a credible industry tool to simulate the future based on the past, but is it possible to simulate the future based on perceived trends and worst-case scenarios? Every CAT model has its imperfections, which must be taken into account, especially when employing modeling best practices. All key stakeholders in the global insurance market, from retail and wholesale brokers to reinsurance intermediaries, from insurers to reinsurers and to the capital markets and beyond, must understand the extent of those imperfections, how error-sensitive the models can be and how those imperfections must be accounted for to gain the most accurate insight into individual risks or entire risk portfolios. Even a small difference in those assumptions can mean a lot.

The next wave of innovation in property insurance will come from going back to insurance basics: managing risk for the customer. Despite model limitations, creative and innovative entrepreneurs will use models to bundle complex packages of risks that will be both profitable to the insurer and economical to the consumer. Consumers desiring to protect themselves from earthquake risks in California, hurricane risks in Florida and flood risks on the coast and inland will have more options. Insurers looking to deploy capital and find new avenues of growth will use CAT models to simulate millions of scenarios to custom-create portfolios that optimize their capacity and to create innovative product features that distinguish their products from competitors’. Intermediaries will use the models to educate and craft effective risk management programs to maximize their clients’ profitability.

For all the benefit CAT models have provided the industry over the past 25 years, we are only driving the benefit down to the consumer in marginal ways. The successful property insurers of the future will be the ones who close the circle and use the models to create products that make the transfer of earthquake, hurricane and other catastrophic risks available and affordable.

In our next article, we will examine how we can use CAT models to solve some of the critical insurance problems we face.

4 Ways to Keep Data Quality High

“Know your customers” is the new data mantra for the 21st century. Clean, high-quality customer data gives insurers powerful marketing and service advantages and prevents expensive headaches. A well-conceived data warehouse is a good place to start, but, as core insurance systems develop problems over time, data-quality issues grow undetected in the data warehouse.

These problems usually only show up when reports are generated from the warehouse and business people question the validity of the data. By then it is too late, and correcting the problem will take much time and money.

So how do you avoid this problem? The answer lies in searching for small data problems before they get bigger. And, once they’ve been found, fix them right away.

The same principles for running a great data warehouse apply to property/casualty, life and health insurers. All have complex challenges, but health insurers, which deal with patients, providers, employers and brokers, may face the biggest data challenges.

To avoid data integrity issues, carriers should consider establishing a simple yet effective four-step program.

1. Control totals

The standard approach is to keep track of the number of records in the file and make sure the same number ends up in the warehouse. That’s a good start, but take this concept further and use it with individual fields that are important for the business. For example, while loading patient data, we can get the control counts for male/female and match them with the membership system. Another example would be to get the control counts based on age bands and make sure they match the membership system.
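As a hypothetical sketch, reconciling field-level control totals against the membership system might look something like this (the field names and toy data are invented for the example):

from collections import Counter

def control_totals(records):
    """Count records overall and by gender and age band for reconciliation."""
    def age_band(age):
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return {
        "total": len(records),
        "by_gender": Counter(r["gender"] for r in records),
        "by_age_band": Counter(age_band(r["age"]) for r in records),
    }

# Toy data standing in for the membership system and the warehouse load.
membership = [{"gender": "F", "age": 34}, {"gender": "M", "age": 52}]
warehouse  = [{"gender": "F", "age": 34}, {"gender": "M", "age": 52}]

assert control_totals(warehouse) == control_totals(membership), "control totals differ"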

2. Aggregate data and check for trends

Your system should aggregate certain data to make sure that the percentage is as expected and lies within a trend. For example, in a typical month, 18% of members may have claims. If that number is suddenly showing up as 8% or 28%, you know you probably have a data problem.

To track the change, calculate the percentage up front and store it in an aggregate table. Storing the aggregated data helps identify problems with the data quickly when trends change.
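A minimal sketch of that kind of trend check, using the 18% figure above as the expected baseline (the tolerance band and member counts are invented):

def check_claims_rate(members_with_claims, total_members,
                      expected_rate=0.18, tolerance=0.05):
    """Alert when the share of members with claims drifts outside its usual band."""
    rate = members_with_claims / total_members
    if abs(rate - expected_rate) > tolerance:
        # In practice this would raise an automated alert (see step 3 below).
        print(f"ALERT: {rate:.1%} of members have claims; expected ~{expected_rate:.0%}")
    return rate

check_claims_rate(8_000, 100_000)    # 8% -> triggers the alert
check_claims_rate(18_500, 100_000)   # 18.5% -> within the expected band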

3. Set up automatic alerts

Your system should automatically issue alerts whenever it detects a problem: control totals that do not match or a percentage that’s outside the range of expected results.

4. Build and empower data teams

Build a team whose job is to identify the data-quality issues. This team has to be knowledgeable about the business and understand trends. Data-quality team members should include representatives of various business departments and IT.

When any problems arise, the data-quality team will report them to the data steward/governance team. The latter team is empowered to take prompt corrective action.

The key to making any data warehouse successful is to continually build trust and credibility in the data. Checking for data anomalies is not a one-time thing. It needs to be done continuously as part of a healthy data program.

Having a set of strategies for automating data-quality checking helps maintain the trust in data over time. Building a support team that is vigilant about finding data-quality issues is a must for continuing data quality.

Making Sure What You Did Stays Done

The workers’ compensation industry is often accused of resisting change. Simple observation bears this out, such as how long it has taken the industry to address analytics. Having now put its toe in the water, the industry is still light years behind other industries in implementing analytics.

An opportunity can be rather simple, yet taking the necessary steps to achieve it is daunting to many. An executive was once overheard saying of a proposal, “It all makes sense, but we would need to change the way we do things to make it happen.”

What is it in the industry culture that causes resistance to change?

It may not be the amount of effort required. Analytics are totally dependent on data quality; if the data is inaccurate or incomplete, the analytics are of less value, a poor trade. But improving data quality can be intimidating because it is not an IT responsibility. Even when IT can play a part in improving data quality, it is management that must demand it.

Sometimes the problem of data quality is the source of the data. Bill review data may not contain all the data fields needed, for instance. Again, only management can address the problem with the vendor organization.

Sustaining change

A significant amount of the effort needed for change is not so much mobilizing the action but sustaining the initiatives. Change directives must become an integral part of the organization’s process. Management must continually check to see which mandates are carried out and which have slipped. Performance accountability is key.

The degree to which a change initiative is successful is positively correlated with management oversight. It is not difficult, but it can be tedious. In this regard, a definition of management is:

Good management is continually making sure that what you did stays done.

Initiate the change, then follow up to ensure continued practice. The real challenge is to keep doing it.

Accurate and complete data is the only affordable and practical resource on the horizon to advance to the next levels of medical management and measurable cost control through analytics. And only management can change data quality.

Why Poor Data Quality Is Not an IT Problem

The adage about data, “garbage in, garbage out,” has taken on magnified importance because the volume, quality and impact of data have reached unprecedented levels. The importance of quality data is paramount.

Companies must change business practices regarding data management or suffer financial disadvantages. Unfortunately, the people who have the power to change frequently think the issue belongs elsewhere.

Not an IT problem!

The misconception is that, if there is a data problem, it must be an IT problem. However, only senior management has the power to hold people and organizations accountable for data quality.

The following is an excerpt from an email I received recently from our IT team describing one client’s data:

“There is a field for NPI number in the data feeds, but it is not often populated.  When it is populated we can definitely use that information to derive the specialty and possibly to determine the individual provider rather than the practice or facility.”

This example highlights a widespread problem in workers’ compensation data. Even though a field is available to capture a specific data element, in this case the NPI (National Provider Identifier) number, it is not populated. This number is derived from medical bills, and the reason for the omission should be thoroughly investigated.

The information trail

The first place to look is upstream in the information trail, to the submitting provider or entity. Standard billing forms such as the HCFA 1500 contain a field for NPI, but it may not be filled. Second along the information hand-off line is the bill review company. Is the NPI number being captured from the bill?

If the provider is submitting the NPI, is the bill review company capturing it? Then, if the bill review company is capturing the NPI, is it included in the data set transmitted to the payer? Once the source of the problem is discovered, management must require the necessary process changes.
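One hypothetical way to find where the NPI drops out of the hand-off chain is to measure the fill rate at each hop where the data is available; the field names and toy extracts below are invented for illustration.

def npi_fill_rate(records, field="npi"):
    """Share of records in which the NPI field is actually populated."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if str(r.get(field, "")).strip())
    return filled / len(records)

# Toy extracts standing in for each hop in the information trail.
provider_bills   = [{"npi": "1234567890"}, {"npi": "9876543210"}, {"npi": ""}]
bill_review_feed = [{"npi": "1234567890"}, {"npi": ""}, {"npi": ""}]

for name, extract in [("provider bills", provider_bills),
                      ("bill review feed", bill_review_feed)]:
    print(f"{name}: {npi_fill_rate(extract):.0%} of records carry an NPI")

A sharp drop in fill rate between one hop and the next points to where management needs to intervene.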

Management intervention

If the submitting provider is not including the data needed, in this case the NPI number, the best management intervention is refusing to pay incomplete bills. Likewise, if the bill review company or system is not capturing the data or is not passing it on to the payer, management must demand the data needed.

Seemingly trivial data omissions can lead to multiple other problems.

Another common data problem is the submitting provider or entity entering a facility, group or practice name while excluding that of the individual treating physician. Management should insist upon using the individual treating physician name and NPI number rather than the entity name only. Systems should capture all three pieces of information.

Bad data comes in many forms beyond missing data in existing fields. Other kinds of bad data include erroneous data and duplicate records. Regardless of the form and source of bad data, the challenges on the horizon are significant. The simple fact is, benefits from analytics to gain cost advantages are not accessible to those with poor data quality. 

Management owns data quality

Accurate and complete data is the only affordable and practical resource on the horizon to advance to the next levels of medical management and measurable cost control. Only management can ensure data quality.