3 Phases to Produce Real IoT Value

In May, I wrote about The Three Phases Insurers Need for Real Big Data Value, assessing how insurance companies progress through levels of maturity as they invest in and innovate around big data. It turns out that there’s a similar evolution around how insurers consume and use feeds from the Internet of Things, whether talking about sensor devices, wearables, drones or any other source of complex, unstructured data. The growth of IoT in the insurance space (especially with automotive telematics) is one of the major reasons insurers have needed to think beyond traditional databases. This is no surprise, as Novarica has explained previously how these emerging technologies are intertwined in their increasing adoption.

The reality on the ground is that the adoption of the Internet of Things in the insurance industry has outpaced the adoption of big data technologies like Hadoop and other NoSQL/unstructured databases. Just because an insurer hasn’t yet built up a robust internal skill set for dealing with big data doesn’t mean that insurer won’t want to take advantage of the new information and insight available from big data sources. Despite the seeming contradiction in that statement, there are actually three different levels of IoT and big data consumption that allow insurers at various phases of technology adoption to work with these new sources.

Phase 1: Scored IoT Data Only

For certain sources of IoT/sensor data, it’s possible for insurers to bypass the bulk of the data entirely. Rather than pulling the big data into their environment, the insurer can rely on a trusted third party to do the work for it, gathering the data and then using analytics and predictive models to reduce the data to a score. One example in use now is third-party companies that gather telematics data for drivers and generate a “driver score” that assesses a driver’s behavior and ability relative to others. On the insurer’s end, only this high-level score is stored and associated with a policyholder or a risk, much like how credit scores are used.

This kind of scored use of IoT data is good for top-level decision-making, executive review across the book of business or big-picture analysis of the data set. It requires having significant trust in the third-party vendor’s ability to calculate the score. Even when the insurer does trust that score, it’s never going to be as closely correlated to the insurer’s business because it’s built with general data rather than the insurer’s claims and loss history. In some cases, especially insurers with smaller books of business, this might actually be a plus, because a third party might be basing its scores on a wider set of contributory data sets. And even large insurers that have matured to later phases of IoT data consumption might still want to leverage these third-party scores as a way to validate and accentuate the kind of scoring they do internally.

One limitation is that a third party that aggregates and scores the kind of IoT data the insurer is interested in has to already exist. While this is the case for telematics, there may be other areas where that’s not the case, leaving the insurer to move to one of the next phases on its own.

Phase 2: Cleansed/Simplified IoT Data Ingestion

Just because an insurer has access to an IoT data source (whether through its own distribution of devices or by tapping into an existing sensor network) doesn’t mean the insurer has the big data capability to consume and process all of it. The good news is it’s still possible to get value out of these data sources even if that’s the case. In fact, in an earlier survey report by Novarica, while more than 60% of insurers stated that they were using some forms of big data, less than 40% of those insurers were using anything other than traditional SQL databases. How is that possible if traditional databases are not equipped to consume the flow of big data from IoT devices?

What’s happening is that these insurers are pulling the key metrics from an IoT data stream and loading them into a traditional relational database. This isn’t a new approach; insurers have been doing this for a long time with many types of data sets. For example, when we talk about weather data we’re typically not actually pulling all temperatures and condition data throughout the day in every single area, but rather simplifying it to condition and temperature high and low at a zip code (or even county) on a per-day basis. Similarly, an insurer can install telematics devices in vehicles and only capture a slice of the data (e.g. top speed, number of hard brakes and number of hard accelerations, rather than every minor movement), or filter only a few key metrics from a wearable device (e.g. number of steps per day rather than full GPS data).
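As a rough illustration of this kind of reduction, the sketch below collapses a raw stream of speed readings into the three telematics metrics named above. The `Reading` schema, the one-sample-per-second feed and the 7 mph-per-second thresholds are all assumptions made for the example, not any vendor’s actual format or definitions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Reading:
    """One raw telematics sample (hypothetical schema)."""
    t: float          # seconds since the trip started
    speed_mph: float  # vehicle speed at time t


# Assumed thresholds in mph per second; a real program would pick its own.
HARD_BRAKE_RATE = -7.0
HARD_ACCEL_RATE = 7.0


def summarize_trip(readings: Iterable[Reading]) -> dict:
    """Reduce a raw stream to a few metrics that fit one relational row."""
    top_speed = 0.0
    hard_brakes = 0
    hard_accels = 0
    prev = None
    for r in readings:
        top_speed = max(top_speed, r.speed_mph)
        if prev is not None and r.t > prev.t:
            # Rate of speed change between consecutive samples.
            rate = (r.speed_mph - prev.speed_mph) / (r.t - prev.t)
            if rate <= HARD_BRAKE_RATE:
                hard_brakes += 1
            elif rate >= HARD_ACCEL_RATE:
                hard_accels += 1
        prev = r
    return {
        "top_speed_mph": top_speed,
        "hard_brakes": hard_brakes,
        "hard_accelerations": hard_accels,
    }


# Eight made-up samples, one per second.
speeds = [0, 5, 14, 20, 35, 36, 25, 24]
trip = [Reading(t, s) for t, s in enumerate(speeds)]
summary = summarize_trip(trip)
```

Note that this counts every sample-to-sample interval that crosses a threshold, so one long braking maneuver could be counted more than once; a production version would likely merge adjacent intervals into single events before loading the row into the warehouse.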

This kind of reduced data set limits the full set of analysis possible, but it does provide some benefits, too. It allows human querying and visualization without special tools, as well as a simpler overlay onto existing normalized records in a traditional data warehouse. Plus, and perhaps more importantly, it doesn’t require an insurer to have big data expertise inside its organization to start getting some value from the Internet of Things. In fact, in some cases the client may feel more comfortable knowing that only a subset of the personal data is being stored.

Phase 3: Full IoT Data Ingestion

Once an insurer has robust big data expertise in house, or has brought in a consultant to provide this expertise, it’s possible to capture the entire range of data being generated by IoT sensors. This means gathering the full set of sensor data, loading it into Hadoop or another unstructured database and layering it with existing loss history and policy data. This data is then available for machine-driven correlation and analysis, identifying insights that would not have been available or expected with the more limited data sets of the previous phases. In addition, this kind of data is now available for future insight as more and more data sets are layered into the big data environment. For the most part, this kind of complete sensor data set is too deep for humans to use directly, and it will require tools to do initial analysis and visualization so that what the insurer ends up working with makes sense.

As insurers embrace artificial intelligence solutions, having a lot of data to underpin machine learning and deep learning systems will be key to their success. An AI approach will be a particularly good way of getting value out of IoT data. Insurers working only in Phase 1 or Phase 2 of the IoT maturity scale will not be building the history of data in this fashion. Consuming the full set of IoT data in a big data environment now will establish a future basis for AI insight, even if there is a limited insight capability to start.

Different Phases Provide Different Value

These three IoT phases are not necessarily linear. Many insurers will choose to work with IoT data using all three approaches simultaneously, due to the different values they bring. An insurer that is fully leveraging Hadoop might still want to overlay some cleansed/simplified IoT data into its existing data warehouse, and may also want to take advantage of third-party scores as a way of validating its own complete scoring. Insurers need to not only develop the skill set to deal with IoT data, but also the use cases for how they want it to affect their business. As is the case with all data projects, if it doesn’t affect concrete decision-making and business direction, then the value will not be clear to the stakeholders.

Data Opportunities in Underwriting

For more than a decade, Americans have been trained to assess and buy insurance products as commodities. This is partly thanks to commercials by Geico, the biggest advertising spender in insurance for many years, which has pushed the concept that “Fifteen minutes can save you 15%,” portraying policies as “the same,” where the only differentiator is the price. Some have called the perception of insurance as a commodity the industry’s biggest challenge.

On top of price-centric buying behavior, most consumers who are required to purchase certain insurance products — such as medical and auto — expect to have a wide selection and may switch insurance carriers at a blink of an eye. With competition increasing, big data and associated technologies provide timely opportunities by reshaping the modern insurance landscape.

The insurance business model typically comprises four parts:

  • Underwriting — where insurance companies make money.
  • Investment — where insurance companies invest money.
  • Claims — where insurance companies pay out (the cost factor).
  • Marketing — where insurance products and services are promoted and often advertised.

Insurance companies have always used data in each part of the business model: to assess risk, to set policy prices and to win and retain consumers. Previously, insurers would formulate policies by comparing customers’ histories, yielding a simplistic and not-very-accurate assessment of risk. Today, our increasing ability to access and analyze data, as well as advancements in data science, allow insurers to feed broader historical, continuous and real-time data through complex algorithms to construct a much more sophisticated and accurate picture of risk. This enables insurance companies to offer more competitive prices that ensure profit by covering perceived risk and working within customers’ budgets. Setting such prices, or policy premiums, is the job of underwriting.

In this post, we will focus on an underwriting use case in the highly competitive auto insurance space, where accuracy of risk assessment and rate setting ultimately drive the insurers’ profitability. Future posts will address other parts of the insurance business model.

More accurate (and competitive) pricing for auto insurance underwriting

Auto insurance may be the most competitive part of the insurance marketplace. Customers shop around (often marketed to by price-comparison services) and change insurers at will. To offer competitive premiums that allow profitability, auto insurers have no choice but to assess risk as accurately as possible.

In auto insurance, insurers use both “small” and “big” data. David Cummings explains the two as:

“Traditionally, underwriters have developed auto insurance prices based on smaller data — such as the car’s make, model and manufacturer’s suggested retail price (MSRP). But ‘bigger data’ is now available, providing far more information and allowing insurers to price policies with a better understanding of the vehicle’s safety. From manufacturers and third-party vendors, insurers can learn about a car’s horsepower, weight, bumper height, crash test ratings and safety features. That big data helps insurers create sophisticated predictive models and more accurate vehicle-based rate segmentation.”

As data increasingly becomes the lifeblood for insurance companies, the combination of big data and analytics is driving a significant shift in insurance underwriting. For example, faster processing technologies such as Hadoop have allowed insurers like Allstate to dig through customer information — quotes, policies, claims, etc. — to note patterns and generate competitive premiums to win new customers.

The data and analytics movement has also made room for newcomers like Metromile to enter the market. Although the company started out with no proprietary data of its own, Metromile has quickly gained customers and collected data with a new model: auto insurance by the mile.

This entrance of Metromile into the auto insurance space has both disrupted the industry and put pressure on incumbent insurance providers to make advances with their own models.

In auto insurance underwriting, a number of ways to use new data to achieve more accurate pricing have gained attention:

  • Using usage-based insurance (UBI)
  • Leveraging external data
  • Leveraging real-time data

Usage-based insurance

UBI can be used to more closely align premium rates with driving behaviors. The UBI idea is not new: for a couple of decades, there have been attempts to align premiums with empirical risk based on how the insured actually drives. In 2011, Allstate filed a patent on a UBI cost determination system and method. Progressive, State Farm and The Hartford are just a few examples of other companies that are embracing UBI methods in underwriting.

Technological evolutions like the Internet of Things (IoT) and all its attendant sensors provide new ways to capture and analyze more data. The UBI market has flourished and is expected to reach $123 billion by 2022. The U.S., the largest auto insurance market in the world, will lead the way in UBI marketing and innovation in 2017. With UBI’s market potential, there has also been a rise in business models such as pay-per-mile insurance for low-mileage drivers using UBI methods in underwriting.

Embracing UBI methods in underwriting is no small feat, because of the huge amounts of data that must be collected and integrated. Progressive collected more than 10 billion miles of driving data with its UBI program, Snapshot, as of March 2014. For the most part, the data focuses on mileage, duration of driving and counts of braking/speeding events. These are all “exposure-related” driving variables, which are considered secondary contributors to risk. They can be bolstered with external data such as traffic patterns, road type and conditions, which are considered primary contributors to risk, to create a more accurate picture of an individual driver’s risk.
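To make the pay-per-mile idea concrete, here is a minimal sketch of how mileage, the simplest exposure variable, could feed a monthly bill. The flat-base-plus-per-mile structure and every rate below are assumptions chosen for illustration; they are not any insurer’s actual pricing.

```python
def monthly_premium_cents(miles_driven: int,
                          base_cents: int = 2900,
                          per_mile_cents: int = 6) -> int:
    """Hypothetical pay-per-mile bill: a flat base plus a per-mile charge.

    Amounts are kept in integer cents to avoid floating-point rounding.
    """
    if miles_driven < 0:
        raise ValueError("miles_driven must be non-negative")
    return base_cents + per_mile_cents * miles_driven


# A low-mileage driver versus a heavier commuter, under the assumed rates.
low_mileage_bill = monthly_premium_cents(400)
commuter_bill = monthly_premium_cents(1200)
```

Under these made-up rates, the low-mileage driver’s bill is well under the commuter’s, which is the appeal of the model for people who drive little.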

Leveraging external data

The idea of using external data is also not new. As early as the 1930s, insurance companies combined internal and external data to determine the rate for policy applicants. However, more recently, the speed of technological advancements has allowed insurers to dramatically redefine and improve their processes.

For example, customer applications for insurance today are significantly shorter than before, thanks to external data. With basics like name and address, insurers can access accurate data files that will append other necessary information — such as occupation, income and demographics. This means expedited underwriting processes and improved customer experiences. Some speculate that all insurers will purchase external data by 2019 to streamline their underwriting (among other things).

Another consideration is that the definition of external data has been evolving. Leveraging external data in an auto insurance risk assessment today may mean going beyond weather and geographic data to include data on shopping behaviors, historical quotes and purchases, telematics, social media behaviors and more. McKinsey says, “The proliferation of third-party data sources is reducing insurers’ dependence on internal data.” Auto insurers can incorporate credit scores into their underwriting analysis as empirical evidence that those who pay bills on time also tend to be safer drivers.

Better access to third-party data also allows insurers to pose new questions and gain a better understanding of different risks. With the availability of external data like social data, insurers can go beyond underwriting and pricing to really managing risks. External data doesn’t just go beyond telematics and geographic data; it may also have real-time implications.

Leveraging real-time data

Real-time data is a subset of the rich external data set, but it has some unique properties that make it worth considering as a separate category. The usage of real-time data (such as apps that engage customers with warnings of impending weather events) can cut the cost of claims. Insurers can also factor data such as weather into the overall assessment at the time of underwriting to more accurately price the risk. In the earlier example of using external data to shorten the underwriting process, accessing external information in real time and checking with multiple sources makes the information in auto insurance application forms more accurate, which, in turn, leads to more accurate rates.

Underwriters can also work with integrated sales and marketing platforms and can reference data such as social media updates, real-time news feeds and research to provide a more accurate assessment for those who seek to be insured. Real-time digital “data exhaust” — for example, from multimedia and social media, smartphones and other devices — has offered behavioral insights for insurers. For example, Allstate is considering monitoring and evaluating drivers’ heart rate, electrocardiograph signals and blood pressure through sensors embedded in the steering wheel.

Through real-time monitoring, insurers can influence the insured’s driving behaviors, significantly altering the relationship between the two. A number of insurance companies, such as Progressive and the pay-per-mile insurer Metromile, are monitoring their customers’ driving in real time and are using that data for underwriting purposes. Allstate filed a patent on a game-like system where drivers are put in groups. Those in the same group could monitor driving scores in real time and encourage better driving to improve the group’s driving score. Groups can earn rewards by achieving better scores.


There’s no doubt that the risky business of insurance is sophisticated. The above examples of leveraging UBI, external data and real-time data merely scratch the surface on data-driven opportunities in auto insurance. For example, what about fraud? Efficiency and automation? Closing the loop between risk and claims? Because only 36% of insurers are even projected to use UBI by 2020, those that embrace data-driven techniques will quickly find themselves ahead of the game.

While it’s outside the scope of this post, we should note that leveraging data and methods shouldn’t be done without careful consideration for consumers. As consumers enjoy easier insurance application processes, as well as having more products to choose from and compare prices on, increasingly they will want to understand how these data and analytics techniques affect them personally — including their data privacy and rights.

As we pause and reflect on how data and analytics have driven changes in auto insurance underwriting, we welcome questions and discussions in the comments section below. In the future, we’ll examine other ways the insurance market is becoming more data-driven, including the changes that data and analytics are driving in auto insurance claims and the rising focus of marketing.

This article first appeared on the site of Silicon Valley Data Science

Your Data Strategies: #Same or #Goals?

Goldilocks entered the house of the three bears. The first bowl she saw was full of the standard, no-frills porridge. She took a picture with her smart phone and posted it to Instagram, with the caption #same. Then she came to Papa Bear’s bowl. It was filled with organic, locally grown lettuce and kale, locally sourced quinoa, farm-fresh goat cheese and foraged mushrooms. The dressing base was olive oil, pressed and filtered from Tuscan olives. It was presented in a Williams Sonoma bowl on a farm table background. She posted a photo with the caption #goals. By the time Goldilocks went to bed, she had 147 likes. The End.

Enter the era of the exceptional, where all that seems to matter is what is new, different and better. When Twitter came out, it didn’t take me long to pick up how to use hashtags. But then hashtags took on a life of their own and spawned a new language of twisted usage. Now we have #same — usually representing what is not exciting, new or distinctive. And we have #goals — something we could aim for (think Beyoncé’s hair or Bradley Cooper’s abs).

Despite their trendy, poppy, teenage feel, #same and #goals are actually excellent portable concepts. When it comes to your IT and data strategies, are they #same or are they #goals? What do your business goals look like? Are you possibly mistaking #same for #goals? Let’s consider our alternatives.

Are our strategies aspirational enough?

If you are involved in insurance technology — whether that is in infrastructure, core insurance systems, digital, innovation or data and analytics — you are perpetually looking forward. Insurance organizations are grappling daily with their future-focused strategies. One common theme we encounter relates to goals and strategies. Many organizations think they are moving forward, but they may just be doing the work that needs to be done to remain operational. #Same. When thinking through the portfolio of projects and looking at the overall strategy, it is common to wonder, “Isn’t this just another version of what we did three months ago, even three years ago?” Is the organization looking at business, markets, products and channels and asking, “Are we ready to make a difference in this market?” No one wants the bowl of lukewarm, plain porridge — especially customers.

Are we aiming one bowl too far?

On the flip side, our goals do need to remain rooted in reality. It’s almost as common for optimistic teams to look at a really great strategy employed by Amazon, only to be reminded that the company isn’t Amazon and doesn’t need to be Amazon. It just needs to consider using Amazon-like capabilities that can enable the insurance strategy.

Data lakes can be a compelling component in modern insurance business processing architectures. But setting a goal to launch a 250-node cloud-based Hadoop cluster and declaring you’ll be entirely out of the business of running your own servers is not a strategy that’s right for everyone.

If the organization is pushed too far on risk or on reality, it creates organizational dissonance. It’s tough to recover from that. Leaders and teams may pull back and hesitate to try again. Our #goals shouldn’t become a #fail.

Finding the “just right” bowl.

Effective strategies are certainly based in reality, but do they stretch the organization to consider the future and how the strategies will help it to grow? When the balance is reached and the “just right” bowl full of aspirations is chosen, there is no better feeling. Our experience is that well-aligned organizational objectives married to positive stretch goals infuse insurers with energy.

This example of bowls, goals, balance and alignment is especially apropos to data and analytics organizations. It is easy for data teams to lay new visuals on last year’s reports and spin through cycles improving processing throughput. To avoid the #same tag, these teams also need to evaluate all the emerging sources for third-party aggregated data and big data scalable technologies. With one foot in reality and one stretching toward new questions and new solutions, data analysts will remain engaged in providing ever-improving value.

Even if an organization could be technically advanced and organizationally perfect, it would still want to reach for something new, because change is constant. Reaching unleashes the power of your teams. Reaching challenges individuals to think at the top of their capacity and to tap into their creative sides. The excitement and motivation that improve productivity will also foster a culture of excellence and pride.

We are then left to the analysis of our individual circumstances. If you could snap a photo of your organization’s three-year plans, would you caption it #same or #goals? Inventing your own scale of aspiration, how many of your goals will stretch the organization and how many will just keep the lights on?


To Go Big (Data), Try Starting Small

Just about every organization in every industry is rolling in data—and that means an abundance of opportunities to use that data to transform operations, improve performance and compete more effectively.

“Big data” has caught the attention of many—and perhaps nowhere more than in the healthcare industry, which has volumes of fragmented data ready to be converted into more efficient operations, bending of the “cost curve” and better clinical outcomes.

But, despite the big opportunities, for most healthcare organizations, big data thus far has been more of a big dilemma: What is it? And how exactly should we “do” it?

Not surprisingly, we’ve talked to many healthcare organizations that recognize a compelling opportunity, want to do something and have even budgeted accordingly. But they can’t seem to take the first step forward.

Why is it so hard to move forward?

First, most organizations lack a clear vision and direction around big data. There are several fundamental questions that healthcare firms must ask themselves, one being whether they consider data a core asset of the organization. If so, then what is the expected value of that asset, and how much will the company invest annually toward maintaining and refining that asset? Oftentimes, we see that, although the organization may believe that data is one of its core assets, in fact the company’s actions and investments do not support that theory. So first and foremost, an organization must decide whether it is a “data company.”

Second is the matter of getting everyone on the same page. Big data projects are complex efforts that require involvement from various parties across an organization. Data necessary for analysis resides in various systems owned and maintained by disparate operating divisions within the organization. Moreover, the data is often not in the form required to draw insight and take action. It has to be accessed and then “cleansed”—and that requires cooperation from different people from different departments. Likely, that requires them to do something that is not part of their day jobs—without seeing any tangible benefit from contributing to the project until much later. The “what’s in it for me” factor is practically nil for most such departments.

Finally, perception can also be an issue. Big data projects often are lumped in with business intelligence and data warehouse projects. Most organizations, and especially healthcare organizations, have seen at least one business intelligence and data warehouse project fail. People understand the inherent value but remain skeptical and too uninvested to make such a transformational initiative successful. Hence, many are reluctant to commit too deeply until it’s clear the organization is actually deriving tangible benefits from the data warehouse.

A more manageable approach

In our experience, healthcare organizations make more progress in tapping their data by starting with “small data,” that is, well-defined projects of a focused scope. Starting with a small scope and tackling a specific opportunity can be an effective way to generate quick results, demonstrate potential for an advanced analytics solution and win support for broader efforts down the road.

One area particularly ripe for opportunity is population health. In a perfect world with a perfect data warehouse, there are infinite disease conditions to identify, stratify and intervene for to improve clinical outcomes. But it might take years to build and shape that perfect data warehouse and find the right predictive solution for each disease condition and comorbidity. A small-data project could demonstrate tangible results—and do so quickly.

A small-data approach focuses on one condition—for example, behavioral health, an emerging area of concern and attention. Using a defined set of data, it allows you to study sources of cost and derive insights from which you can design and target a specific intervention for high-risk populations. Then, by measuring the return on the intervention program, you can demonstrate value of the small data solution; for example, savings of several million dollars over a one-year period. That, in turn, can help build a business case for taking action, possibly on a larger scale and gaining the support of other internal departments.
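The “return on the intervention program” can be expressed as a simple ratio. The sketch below assumes the usual definition of return on investment, net savings divided by program cost; the dollar figures are invented for illustration and do not come from any real program.

```python
def intervention_roi(savings: float, program_cost: float) -> float:
    """Return on an intervention: net savings per dollar spent on the program."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return (savings - program_cost) / program_cost


# Hypothetical one-year figures for a behavioral health intervention.
roi = intervention_roi(savings=3_000_000, program_cost=1_200_000)
```

With these assumed numbers, every dollar spent returns $1.50 in net savings, the kind of single figure that can anchor a business case for a larger effort.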

While this approach helps build internal credibility, which addresses one of the biggest roadblocks to big data, it does have some limitations. There is a risk that initiating multiple independent small-data projects can create “siloed” efforts with little consistency and little potential for fueling the organization’s ultimate journey toward using big data. Such risks can be mitigated with intelligent and adaptive data architecture and a periodic evaluation of the portfolio of small-data solutions.

Building the “sandbox” for small-data projects

To get started, you need two things: 1) a potential opportunity to test and 2) tools and an environment that enable fast analysis and experimentation.

It is important to understand quickly whether a potential solution has a promising business case, so that you can move quickly to implement it—or move on to something else without wasting further investment.

If a business case exists, proceed to find a solution. Waiting to procure servers for analysis or for permission to use an existing data warehouse will cost valuable time and money. So that leaves two primary alternatives for supporting data analysis: leveraging Software-as-a-Service solutions such as Hadoop with in-house expertise, or partnering with an organization that provides a turnkey solution for establishing analytics capabilities within a couple of days.

You’ll then need a “sandbox” in which to “play” with those tools. The “sandbox” is an experimentation environment, established outside of the organization’s production systems and operations, that facilitates analysis of an opportunity and testing of potential intervention solutions. In addition to the analysis tools, it also requires resources with the skills and availability to interpret the analysis, design solutions (e.g., a behavioral health intervention targeted to a specific group), implement the solution and measure the results.

Then building solutions

For building a small-data initiative, it is a good idea to keep a running list of potential business opportunities that may be ripe for cost-reduction or other benefits. Continuing our population health example, this might include areas as simple as finding and intervening for conditions that lead to the common flu and reduced employee productivity, to preventing pre-diabetics from becoming diabetics, to behavioral health. In particular, look at areas where there is no competing intervention solution already in the marketplace and where you believe you can be a unique solution provider.

It is important to establish clear “success criteria” up front to guide quick “go” or “no-go” decisions about potential projects. These should not be specific to the particular small-data project opportunity but rather generic enough to apply across topics—as they become the principles guiding small data as a journey to broader analytics initiatives. Examples of success criteria might include:

– Cost-reduction goals
– Degree to which the initiative changes clinical outcomes
– Ease of access to data
– Ease of cleansing data so that it is in a form needed for analysis

For example, you might have easy access to data, but it requires a lot of effort to “clean” it for analysis—so it isn’t actually easy to use.
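One way to keep such decisions quick and consistent is to score every candidate project against the same criteria and require a minimum on each. The sketch below is only one possible scheme: the 1-to-5 scale, the minimum of 3 and the criterion names are assumptions made for the example.

```python
# The four example criteria listed above, scored 1 (poor) to 5 (excellent).
CRITERIA = (
    "cost_reduction",
    "clinical_outcome_change",
    "data_access_ease",
    "data_cleansing_ease",
)
MIN_SCORE = 3  # assumed bar that every criterion must clear


def go_no_go(scores: dict) -> tuple:
    """Return ("go" or "no-go", list of criteria that fell below the bar)."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    failing = [c for c in CRITERIA if scores[c] < MIN_SCORE]
    return ("go" if not failing else "no-go", failing)


# Data is easy to get but expensive to clean, as in the example above.
decision = go_no_go({
    "cost_reduction": 4,
    "clinical_outcome_change": 4,
    "data_access_ease": 5,
    "data_cleansing_ease": 1,
})
```

Requiring a minimum on every criterion, rather than averaging, means a strong score on data access cannot mask a weak score on cleansing.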

Another important criterion is presence of operational know-how for turning insight into action that will create outcomes. For example, if you don’t have behavioral health specialists who can call on high-risk patients and deliver the solution (or a partner that can provide those services), then there is little point in analyzing the issue to start with. There must be a high correlation between data, insight and application.

Finally, you will need to consider the effort required to maintain a specific small-data solution over time. Consider, for instance, a new predictive model to help identify high-risk behavioral health patients or high-risk pregnancies. Will it require a lot of rework each year to adjust the risk model as more data becomes available? If so, that affects the solution’s ease of use. Small-data solutions need to be dynamic and able to adjust easily to market needs.
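One way to make these criteria operational is a simple scoring checklist. The sketch below is illustrative only: the 1-to-5 scale, the field names, the threshold of 3 and the rule that every criterion must clear the threshold are all assumptions made for the example, not something prescribed here.

```python
from dataclasses import dataclass, asdict


@dataclass
class Opportunity:
    """A candidate small-data project, scored 1 (weak) to 5 (strong).

    All field names and the scoring scale are hypothetical.
    """
    name: str
    cost_reduction: int
    clinical_outcome_change: int
    data_access: int
    data_cleansing_ease: int
    operational_know_how: int
    maintenance_ease: int


def go_no_go(opp, threshold=3):
    """Return (decision, weak_criteria).

    The decision is "go" (True) only if every criterion scores at least
    `threshold`; one weak criterion is enough for a "no-go".
    """
    scores = {k: v for k, v in asdict(opp).items() if k != "name"}
    weak = sorted(k for k, v in scores.items() if v < threshold)
    return (not weak, weak)


# Data is easy to get but hard to clean, so the project is a "no-go" for now.
flu = Opportunity(
    name="flu outreach",
    cost_reduction=4,
    clinical_outcome_change=3,
    data_access=5,
    data_cleansing_ease=2,
    operational_know_how=4,
    maintenance_ease=4,
)
decision, weak = go_no_go(flu)
print("go" if decision else "no-go", weak)
```

Requiring every criterion to pass, rather than averaging the scores, reflects the point that strong data access cannot compensate for missing operational know-how or costly cleansing. A weighted score would be a reasonable alternative.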

Just do it

Quick wins can accelerate progress toward realizing the benefits of big data. But realizing those quick wins requires the right focus (“small data”) and the right environment for making rapid decisions about when to move forward with a solution or when to abandon it and move on to something else. If, after a month or two, you haven’t produced a solution that is translating into tangible benefits, it is time to get out and try something else.

A small-data approach requires some care and good governance, but it can be a much more effective way to make progress toward the end goal of leveraging big data for enterprise advantage.

This article first appeared at Becker’s Hospital Review.

Frustrated on Your Data Journey?

It’s going to take how much longer?! It’s going to cost how much more?!!

If those sound like all-too-familiar expressions of frustration about your data projects, you’re in good company.

It seems most corporations these days struggle to make the progress they plan, whether in building a single customer view (SCV) or in providing the data needed by their analysts.

An article on MyCustomer.com, by Adrian Kingwell, cited a recent Experian survey that found 72% of businesses understood the advantages of an SCV, but only 16% had one in place. Following that, on CustomerThink.com, Adrian Swinscoe makes an interesting case for it being more time/cost-effective to build one directly from asking the customer.

That approach could work for some businesses (especially small and medium-sized businesses) and can be combined with visible data transparency, but it is much harder for large, established businesses to justify troubling the customer for data they should already have. So the challenge remains.

A recent survey on Customer Insight Leader suggests another reason for problems in “data project land.” In summary, you shared that:

  • 100% of you disagreed or strongly disagreed with the statement that you have a conceptual data model in place;
  • 50% of you disagreed (rest were undecided) with the statement that you have a logical data model in place;
  • Only 50% agreed (rest disagreed) with the statement that you have a physical data model in place.

These results did not surprise me, as they echo my experience of working in large corporations. Most appear to lack data models, especially conceptual ones. Given the need to be flexible in implementation and respond to the data quality or data mapping issues that always arise on such projects, this is concerning. With so much focus on technology these days, I fear the importance of a model/plan/map has been lost. Without a technology-independent view of the data entities, relationships and data items that a team needs to do its job, businesses will continue to be at the mercy of changing technology solutions.

Your later answers also point to a related problem that can plague customer insight analysts seeking to understand customer behavior:

  • All of you strongly disagreed with the statement that all three types of data models are updated when your business changes;
  • 100% of you also disagreed with the statement that you have effective meta data (e.g. up-to-date data dictionary) in place.

Without the work to keep models reflecting reality and meta data sources guiding users/analysts on the meaning of fields and which can be trusted, both can wither on the vine. Isn’t it short-sighted investment to spend perhaps millions of pounds on a technology solution but then balk at the cost of data specialists to manage these precious knowledge management elements?
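As a rough illustration of what keeping meta data current can involve, the sketch below compares the fields in a data extract with a data dictionary and flags fields that are undocumented or whose entries have not been reviewed recently. The field names, the dictionary layout and the one-year review window are hypothetical choices made for the example.

```python
from datetime import date


def audit_data_dictionary(extract_fields, dictionary, today, max_age_days=365):
    """Return (undocumented, stale) field names for an extract.

    undocumented: fields in the extract with no dictionary entry.
    stale: documented fields whose entry was last reviewed more than
           `max_age_days` before `today`.
    """
    fields = set(extract_fields)
    undocumented = sorted(fields - set(dictionary))
    stale = sorted(
        name
        for name, entry in dictionary.items()
        if name in fields
        and (today - entry["last_reviewed"]).days > max_age_days
    )
    return undocumented, stale


# Hypothetical dictionary and extract header.
dictionary = {
    "customer_id": {"meaning": "Unique customer key",
                    "last_reviewed": date(2024, 1, 10)},
    "segment": {"meaning": "Marketing segment code",
                "last_reviewed": date(2019, 6, 1)},
}
extract_fields = ["customer_id", "segment", "chan_cd"]

undocumented, stale = audit_data_dictionary(
    extract_fields, dictionary, today=date(2024, 6, 1)
)
print("Undocumented:", undocumented)
print("Stale:", stale)
```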

Perhaps those of us speaking about insight, data science, big data, etc. also carry a responsibility. If it has always been true that data tends to be viewed as a boring topic compared with analytics, it is doubly true that we tend to avoid the topics of data management and data modeling. But voices need to cry out in the wilderness for these disciplines. Despite the ways Hadoop, NoSQL or other solutions can help overcome potential technology barriers, no one gets data solutions for their business “out of the box.” It takes hard work and diligent management to ensure data is used and understood effectively.

I hope, in a very small way, these survey results act as a bit of a wake-up call. Over the coming weeks I will be attending or speaking at various events. So, I’ll also reflect on how I can speak out more effectively for this neglected but vital skill.

On that challenge of why businesses fail to build the SCVs they need, another cause has become apparent to me over the years. Too often, requirements are too ambitious in the first place. Having worked on both sides of the “IT fence,” I have often heard analytical teams say that they want all the data available (at least from the feeds they can get). Without more effective prioritization of which data feeds, or specifically which variables within those feeds, are worth the effort, projects get bogged down in excessive data mapping work.

Have you seen the value of a “data labs” approach? Finding a way to enable your analysts to manually get hold of an example data extract, so they can try analyzing data and building models, can help massively. At least 80% of the time, they will find that only a few of the variables are actually useful in practice. This enables more pragmatic requirements and a leaner IT build, which is much more likely to deliver (sometimes even within time and budget).

Here’s that article from Adrian Swinscoe, with links to Adrian Kingwell, too.

What’s your experience? If you recognize the results of this survey, how do you cope with the lack of data models or up-to-date meta data? Are you suffering data project lethargy as a result?