When Big Data Can Define Pricing (Part 2)

Algorithms have been developed and are moving from a proof-of-concept phase in academia to implementations in insurance firms.

This is the second part of a two-part series. The first part can be found here.

Abstract

In this second part of the article, we extend the discussion to a notional micro-economy and examine the impact of diversification and of insurance big data components on the potential for developing strategies for sustainable and economical insurance policy underwriting. We also review concepts of parallel and distributed algorithmic computing for big data clustering, mapping and resource-reducing algorithms.

1.0 Theoretical Expansion to a Single-Firm Micro-Economy Case

We expand the discourse from part one to a simple theoretical micro-economy and examine whether the same principles derived for the aggregate umbrella insurance product still hold at the larger scale of an insurance firm. In a notional economy with N insurance risks r1,…,rN and N corresponding policyholders, there is only one insurance firm, which at time T does not have an information data set θT on dependencies among per-risk losses. Each premium is estimated by the traditional standard deviation principle in (1.1). For the same time period T, the insurance firm collects a total premium πT[total] equal to the linear sum of all N policy premiums πT[r1],…,πT[rN] in the notional economy. There is full additivity in portfolio premiums, and because data on inter-risk dependencies are unavailable for modeling, the insurance firm cannot take advantage of competitive premium cost savings arising from market-share scale and from the geographical distribution and diversification of the risks in its book of business. For coherence we assume that all insurance risks and policies belong to the same line of business and cover the same insured natural peril, flood, so that the only diversification available comes from insurance risk independence driven by geo-spatial distances. A full premium additivity equation, similar to the aggregate umbrella product premium (3.0) but extended to the total premium of the insurance firm in our micro-economy, is given in (9.0).

In the next time period T+1, the insurance firm acquires a data set θT+1 that allows it to model geo-spatial dependencies among risks and to identify fully dependent, partially dependent and fully independent risks. The dependence structure is expressed and summarized in an [N×N] correlation matrix ρi,N. Traditionally, full independence between any two risks is modeled with a correlation factor of zero, and partial dependence with a correlation factor of less than one. With this new information we can extend the insurance product expression (7.0) to the total accumulated premium πT+1[total] of the insurance firm at time T+1, given in (10.0).

The impacts of full independence and partial dependence, which are inevitably present in a full insurance book of business, guarantee that the sub-additivity principle for premium accumulation comes into effect. In our case study, sub-additivity has two related expressions. Between the two time periods, the acquisition of the dependence data set θT+1, which is used to model and define the correlation structure ρi,N, justifies a temporal sub-additivity, or inequality, between the total premiums of the insurance firm, expressed in (10.1).
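The displayed equations referenced above are not reproduced in this text, but their forms can be sketched from the prose. The reconstruction below assumes the standard deviation premium principle from part one, with expected loss E[L], a loading factor α, and ρi,j denoting the entries of the correlation matrix; the exact loading form of the original (9.0) and (10.0) is an assumption here.

```latex
% Hypothetical reconstruction of (1.1), (9.0), (10.0) and (10.1) from the prose;
% alpha is an assumed standard-deviation loading factor.

% (1.1) Standard deviation principle for a single risk r_i at time T:
\pi_T[r_i] = E[L_{r_i}] + \alpha\,\sigma_T[r_i]

% (9.0) Full additivity at time T (no dependence data available):
\pi_T[\mathrm{total}] = \sum_{i=1}^{N} \pi_T[r_i]
                      = \sum_{i=1}^{N} E[L_{r_i}] + \alpha \sum_{i=1}^{N} \sigma_T[r_i]

% (10.0) Accumulation at time T+1 with the modeled correlation matrix \rho_{i,j}:
\pi_{T+1}[\mathrm{total}] = \sum_{i=1}^{N} E[L_{r_i}]
  + \alpha \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{N} \rho_{i,j}\,\sigma_{T+1}[r_i]\,\sigma_{T+1}[r_j]}

% (10.1) Temporal sub-additivity, since \rho_{i,j} \le 1:
\pi_{T+1}[\mathrm{total}] \le \pi_T[\mathrm{total}]
```

Under full dependence (ρi,j = 1 for all pairs) the square root collapses to the linear sum of standard deviations and (10.0) reduces to (9.0); the inequality in (10.1) is therefore a statement about the value of the acquired dependence data θT+1.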
It is not desirable for an insurance firm to seek to lower its total cumulative premium intentionally simply because it can rely on diversification. However, one implication for underwriting guidelines could be that once the total firm premium has been accumulated with a model that accounts for inter-risk dependencies, this total monetary amount can be back-allocated to individual risks and policies, providing a sustainable competitive edge in pricing. The business function of diversification, and of taking advantage of its consequent premium cost savings, is achieved through two statistical operations: accumulating pure flood premium with a correlation structure, and then back-allocating the total firm premium down to single-risk granularity. A backwardation relationship for the back-allocated single-risk, single-policy premium π'T+1[rN] can be derived with a proportional ratio of standard deviations. This per-risk back-allocation ratio, defined in (11.0), is constructed from the single risk's standard deviation of expected loss σT+1[rN] and the linear sum of all per-risk standard deviations in the insurance firm's book of business.

From the temporal sub-additivity inequality between total firm premiums in (10.1) and the back-allocation of the total premium πT+1[total] down to single-risk premiums in (11.0), it is evident that between the two time periods there are economies of scale and of cost in insurance policy underwriting for any arbitrary single risk rN. These cost savings are expressed in (12.0). In our case study of a micro-economy and one notional insurance firm's portfolio covering a single insured peril, flood, these premium cost economies are driven by geo-spatial diversification among the insured risks. We support this theoretical discourse with a numerical study.

2.0 Notional Flood Insurance Portfolio Case Study

We construct two notional business units, each containing ten risks and, respectively, ten insurance policies. The risks in both units are geo-spatially clustered in high-intensity flood zones: Jersey City in New Jersey ("Unit NJ") and Baton Rouge in Louisiana ("Unit BR"). For each business unit we perform two numerical computations of premium accumulation, under two dependence regimes. Each unit's accumulated fully dependent premium is computed by equation (9.0). Each unit's accumulated partially dependent premium, modeled with a constant correlation factor of 0.6 (60%) between any two risks, is computed by equation (10.0). The total insurance firm premium under both regimes, full dependence and partial dependence, is simply a linear sum: business unit premiums roll up to the book total.

Throughout our case studies we have focused on measuring geo-spatial dependencies and on their interpretation and usability for risk and premium diversification. For the actuarial task of premium accumulation across business units, we assume that the insurance firm will simply roll up unit total premiums and will not seek competitive pricing from diversification across business units. This practice is justified by underwriting and pricing guidelines being managed somewhat autonomously by geo-administrative business unit, with premium and financial reporting done in the same manner. In our numerical case study we verify that the theoretical inequality (10.1), which defines the temporal sub-additivity of premium with and without modeled dependence, is maintained: the total business unit premium computed without modeled correlation data, under the assumption of full dependence, always exceeds the unit's premium under partial dependence computed with the acquired and modeled correlation factors. A minimal numerical sketch of this computation follows.
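To make the case-study mechanics concrete, here is a minimal Python sketch under the same assumed forms as above: two business units of ten risks each, accumulated first under full dependence and then with a constant pairwise correlation of 0.6, with unit totals rolled up linearly and the partial-dependence unit totals back-allocated by the standard-deviation ratio. The expected losses, standard deviations and the loading factor ALPHA are hypothetical placeholders, not the article's actual Table 4 inputs.

```python
# Sketch of the two-unit accumulation and back-allocation, assuming the
# standard deviation premium principle pi = E[L] + alpha * sigma.
# All inputs are hypothetical placeholders.
import numpy as np

ALPHA = 0.5   # assumed standard deviation loading factor
RHO = 0.6     # constant pairwise correlation within a unit, as in the case study

rng = np.random.default_rng(7)

def unit_premiums(expected_loss, sigma, rho):
    """Return (fully dependent, partially dependent) unit premium totals."""
    n = len(sigma)
    # (9.0)-style: full dependence -> linear sum of per-risk premiums.
    full_dep = np.sum(expected_loss + ALPHA * sigma)
    # (10.0)-style: constant-correlation matrix with ones on the diagonal.
    corr = np.full((n, n), rho)
    np.fill_diagonal(corr, 1.0)
    part_dep = np.sum(expected_loss) + ALPHA * np.sqrt(sigma @ corr @ sigma)
    return full_dep, part_dep

def back_allocate(total_premium, sigma):
    """(11.0)-style back-allocation by the per-risk standard deviation ratio."""
    return total_premium * sigma / np.sum(sigma)

book = {}
for unit in ("Unit NJ", "Unit BR"):
    expected_loss = rng.uniform(800, 1_500, size=10)   # ten risks per unit
    sigma = rng.uniform(300, 900, size=10)
    full_dep, part_dep = unit_premiums(expected_loss, sigma, RHO)
    per_risk = back_allocate(part_dep, sigma)           # single-risk premiums
    book[unit] = (full_dep, part_dep, per_risk)
    assert part_dep <= full_dep            # (10.1)-style temporal sub-additivity
    print(f"{unit}: full dependence {full_dep:,.0f}  "
          f"partial dependence {part_dep:,.0f}")

# Business unit premiums simply roll up to the firm total in both regimes.
firm_full = sum(v[0] for v in book.values())
firm_part = sum(v[1] for v in book.values())
print(f"Firm total: full {firm_full:,.0f}  partial {firm_part:,.0f}")
```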
This justifies performing back-allocation in both business units, using procedure (11.0), of the total premium computed under partial dependence. In this way competitive cost savings can be distributed down to the single-risk premium. In Table 4 we show the results of this back-allocation procedure for all single risks in both business units. For each single risk we observe that the per-risk premium inequality (12.0) is maintained by the numerical results. Partial dependence, which can be viewed as the statistical modeling expression of imperfect insurance risk diversification, can therefore lead to opportunities for competitive premium pricing and cost savings for the insured on a per-risk and per-policy basis.

3.0 Functions and Algorithms for Insurance Data Components

3.1 Definition of Insurance Big Data Components

Large insurance data components facilitate, and practically enable, the actuarial and statistical tasks of measuring dependencies, accumulating modeled losses and back-allocating total business unit premium to single-risk policies. For this study, our definition of big insurance data components covers historical and modeled data at high geospatial granularity, structured in up to one million simulation maps. For modeling a single (re)insurance product, a single map can contain a few hundred historical, modeled and physically measured data points. For a large book-of-business or portfolio simulation, one map may contain millions of such data points. Time complexity is another feature of big data: global but structured and distributed data sets are updated asynchronously and often without a schedule, depending on scientific and business requirements and on computational resources. Such big data components therefore have a critical and indispensable role in defining competitive premium cost savings for the insureds, which otherwise may not be found sustainable by the policy underwriters and the insurance firm.

3.2 Intersections of Exposure, Physical and Modeled Simulated Data Sets

Fast-compute and big data platforms are designed to support various geospatial modeling and analysis tasks. A fundamental task is projecting an insured exposure map and computing its intersection with multiple simulated stochastic flood intensity maps and with geo-physical property maps containing coastal and river bank elevations and distances to water bodies. This algorithm performs spatial caching and indexing of all latitude and longitude geo-coded units and grid cells carrying insured risk exposure and modeled stochastic flood intensity. Geo-spatial interpolation is also employed to compute and adjust peril intensities for the distances and geo-physical elevations of the insured risks.
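As an illustration of such an intersection step, the sketch below indexes a gridded stochastic flood intensity map by cell, looks up the neighboring cells for each geocoded insured location, and applies a simple inverse-distance interpolation with an elevation adjustment. The grid resolution, the adjustment rule and all values are hypothetical; the actual spatial caching and interpolation algorithms of a production platform are not described in the article at this level of detail.

```python
# Sketch: intersect geocoded insured exposure with a gridded flood intensity map.
# Grid resolution, intensity values and the elevation adjustment are hypothetical.
from dataclasses import dataclass
import math

CELL_DEG = 0.01  # assumed grid cell size in degrees (roughly 1 km)

@dataclass
class InsuredRisk:
    risk_id: str
    lat: float
    lon: float
    elevation_m: float

def cell_key(lat, lon):
    """Spatial index: hash a coordinate to its grid cell (caching-friendly key)."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def build_intensity_index(intensity_points):
    """Map {cell -> (cell-center lat, lon, flood depth in meters)}."""
    return {cell_key(lat, lon): (lat, lon, depth) for lat, lon, depth in intensity_points}

def interpolated_intensity(risk, index, neighborhood=1):
    """Inverse-distance-weighted depth from nearby cells, reduced by elevation."""
    ci, cj = cell_key(risk.lat, risk.lon)
    num = den = 0.0
    for di in range(-neighborhood, neighborhood + 1):
        for dj in range(-neighborhood, neighborhood + 1):
            hit = index.get((ci + di, cj + dj))
            if hit is None:
                continue
            clat, clon, depth = hit
            dist = math.hypot(risk.lat - clat, risk.lon - clon) + 1e-9
            num += depth / dist
            den += 1.0 / dist
    depth = num / den if den else 0.0
    # Hypothetical elevation adjustment: 10 cm less depth per meter of elevation.
    return max(depth - 0.1 * risk.elevation_m, 0.0)

if __name__ == "__main__":
    # One simulated intensity map as (lat, lon, depth) points, and two insured risks.
    intensity_map = [(40.728, -74.078, 1.8), (40.718, -74.068, 1.2), (40.738, -74.088, 0.9)]
    index = build_intensity_index(intensity_map)
    risks = [InsuredRisk("NJ-001", 40.727, -74.077, 2.0),
             InsuredRisk("NJ-002", 40.719, -74.069, 6.0)]
    for r in risks:
        print(r.risk_id, round(interpolated_intensity(r, index), 2))
```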
3.3 Reduction and Optimization Through Mapping and Parallelism

One definition of big data relevant to our study is data sets that are too large and too complex to be processed by traditional technologies and algorithms. In principle, moving data is the most computationally expensive task in solving problems at large geo-spatial scale, such as modeling and measuring inter-risk dependencies and diversification in an insurance portfolio. The cost and expense of big geo-spatial solutions is magnified by the fact that large geo-spatial data sets are typically distributed across multiple physical computational environments as a result of their size and structure. The solution is distributed optimization, which is achieved by a sequence of algorithms.

As a first step, a mapping and splitting algorithm divides large data sets into sub-sets and performs statistical and modeling computations on the smaller sub-sets. In our computational case study, the smaller data chunks represent insurance risks and policies in geo-physically dependent zones, such as river basins and coastal segments. The smaller data sets are processed as smaller sub-problems in parallel, by appropriately assigned computational resources. In our model we first solve the chunked, smaller-scale computations for flood intensity, and then the modeling and estimation of fully simulated, probabilistic insurance loss. Once these cost-effective operations on the smaller sub-sets are complete, a second algorithm collects and maps together the results of the first-stage computation for subsequent data analytics and presentation. For single insurance products, business units and portfolios, an ordered accumulation of risks is achieved by mapping risks by the strength, or absence, of their dependencies. Data sets and tasks with identical characteristics can be grouped together, and the resources needed for their processing significantly reduced, by avoiding replication or repetition of computational tasks that have already been mapped and can now be reused. The stored post-analytics, post-processed data can also be distributed across different physical storage capacities by a secondary scheduling algorithm, which intelligently allocates chunks of modeled and post-processed data to available storage resources. This family of techniques is generally known as MapReduce.

3.4 Scheduling and Synchronization by Service Chaining

Distributed and service chaining algorithms process geo-spatial analysis tasks on data components simultaneously and automatically. For logically independent processes, such as computing intensities or losses on uncorrelated iterations of a simulation, service chaining algorithms divide and manage the tasks among separate computing resources. Dependencies and correlations among such data chunks may not exist because of large geo-spatial distances, as we saw in the modeling and pricing of our case studies; hence they do not have to be accounted for computationally, and performance improvements are gained. For such cases, both the input data and the computational tasks can be broken down into pieces and sub-tasks, respectively. For logically inter-dependent tasks, such as accumulations of inter-dependent quantities like losses in geographic proximity, chaining algorithms automatically order the start and completion of dependent sub-tasks. In our modeled scenarios, the simulated loss distributions of risks in immediate proximity, where dependencies are expected to be strongest, are accumulated first. A second tier of accumulations, for risks with partial dependence and full independence, is scheduled once the first tier of accumulations of highly dependent risks is complete; a map/reduce-style sketch of this two-tier ordering is given below. Service chaining methodologies work in collaboration with auto-scaling memory algorithms, which provide or remove computational memory resources depending on the intensity of the modeling and statistical tasks.
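The sketch below illustrates this two-tier ordering in a map/reduce style: risks are first split by a geo-zone key (river basin or coastal segment), each zone is accumulated in parallel as a fully dependent cluster, and the zone totals are then combined in a second tier under an assumed cross-zone correlation. The zone keys, standard deviations, loading and correlation values are illustrative assumptions, not the scheduling implementation of any particular platform.

```python
# Sketch of two-tier, dependency-ordered accumulation in a map/reduce style.
# Zone keys, sigmas, ALPHA and the cross-zone correlation are hypothetical.
from concurrent.futures import ProcessPoolExecutor
from collections import defaultdict
import math

ALPHA = 0.5           # assumed standard deviation loading factor
CROSS_ZONE_RHO = 0.2  # assumed correlation between zone-level losses

def split_by_zone(risks):
    """Map step: chunk risks by geo-physically dependent zone (e.g. river basin)."""
    zones = defaultdict(list)
    for risk in risks:
        zones[risk["zone"]].append(risk)
    return zones

def accumulate_zone(zone_risks):
    """Tier one: risks in immediate proximity treated as fully dependent."""
    mean = sum(r["expected_loss"] for r in zone_risks)
    sigma = sum(r["sigma"] for r in zone_risks)   # comonotonic: sigmas add
    return mean, sigma

def combine_zones(zone_results, rho):
    """Tier two (reduce): combine zone totals under partial cross-zone dependence."""
    means = [m for m, _ in zone_results]
    sigmas = [s for _, s in zone_results]
    var = sum(si * sj * (1.0 if i == j else rho)
              for i, si in enumerate(sigmas) for j, sj in enumerate(sigmas))
    return sum(means) + ALPHA * math.sqrt(var)

if __name__ == "__main__":
    risks = [
        {"zone": "passaic_basin", "expected_loss": 900,  "sigma": 400},
        {"zone": "passaic_basin", "expected_loss": 1100, "sigma": 500},
        {"zone": "amite_basin",   "expected_loss": 800,  "sigma": 350},
        {"zone": "amite_basin",   "expected_loss": 1000, "sigma": 450},
    ]
    zones = split_by_zone(risks)
    # Tier-one accumulations run in parallel; tier two starts only when all finish.
    with ProcessPoolExecutor() as pool:
        zone_results = list(pool.map(accumulate_zone, zones.values()))
    print("firm premium:", round(combine_zones(zone_results, CROSS_ZONE_RHO), 2))
```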
Significant challenges remain in processing shared data structures. One insurance risk management example, which we are currently developing for our next working paper, is pricing a complex multi-tiered product comprising many geo-spatially dependent risks, and then back-allocating a risk metric, such as tail value at risk, down to single-risk granularity. At the statistical level, this back-allocation and risk management task involves a process called de-convolution, or component convolution. A computational and optimization challenge arises when highly dependent and logically connected statistical operations are performed on chunks of data distributed across different physical storage resources. Solutions are being developed as multi-threaded implementations of map-reduce algorithms that address such computationally intensive tasks; in these procedures the mapping is done by task definition, rather than directly onto the raw and static data.

Some Conclusions and Further Work

With advances in computational methodologies for natural catastrophe and insurance portfolio modeling, practitioners are producing increasingly large data sets. At the same time, single-product and portfolio optimization techniques that take advantage of metrics of diversification and inter-risk dependency are being used in insurance premium underwriting. Such optimization techniques significantly increase the frequency with which insurance underwriting data are produced, and they require new types of algorithms that can process multiple large, distributed and frequently updated data sets. Such algorithms have been developed theoretically, and they are now moving from a proof-of-concept phase in academic environments to production implementations in the modeling and computational systems of insurance firms.

Both traditional statistical modeling methodologies, such as premium pricing, and newer advances in the definition of inter-risk variance-covariance and correlation matrices and of policy and portfolio accumulation principles, require significant data management and computational resources to account for the effects of dependencies and diversification. Accounting for these effects allows the insurance firm to support cost savings in premium value for its policyholders. Despite the many advances reviewed here, there remain open areas of research in statistical modeling, single-product pricing and portfolio accumulation, and in the optimal big insurance data structures and algorithms that support them. Algorithmic communication and synchronization between globally distributed, structured and dependent data sets is expensive. Optimizing and reducing the computational cost of data analytics is a top priority for both scientists and practitioners. Optimal partitioning and clustering of data, and particularly of geospatial images, is another active area of research.

Ivelin M. Zvezdov

Ivelin Zvezdov is a financial economist by training, with experience in quantitative analysis and risk management for (re)insurance and natural catastrophe modeling, fixed income and commodities trading. Since 2013 he has led the product development effort for AIR Worldwide's next-generation modeling platform.
