Tag Archives: ivelin m. zvezdov

When Big Data Can Define Pricing (Part 2)

This is the second part of a two-part series. The first part can be found here. 


In the second part of this article, we extend the discourse to a notional micro-economy and examine the impact of diversification and insurance big data components on the potential for developing strategies for sustainable and economical insurance policy underwriting. We review concepts of parallel and distributed algorithmic computing for big data clustering, mapping and resource reducing algorithms.


1.0 Theoretical Expansion to a Single Firm Micro-Economy Case

We expand the discourse from part one to a simple theoretical micro-economy and examine whether the principles derived for the aggregate umbrella insurance product still hold on the larger scale of an insurance firm. In a notional economy with N insurance risks r1, …, rN and their respective policy holders, there is only one insurance firm, which at time T does not have an information data set θT about dependencies among per-risk losses. Each premium is estimated by the traditional standard deviation principle in (1.1). For the same time period T, the insurance firm collects a total premium πT[total] equal to the linear sum of all N policy premiums πT[rN] in the notional economy.

There is full additivity in portfolio premiums. Because no data on inter-risk dependencies is available for modeling, the insurance firm cannot take advantage of competitive premium cost savings due to market share scale and the geographical distribution and diversification of the risks in its book of business. For coherence we assume that all insurance risks and policies belong to the same line of business and cover the same insured natural peril, flood, so that the only possible diversification is due to insurance risk independence derived from geo-spatial distances. A full premium additivity equation, similar to the aggregate umbrella product premium (3.0) and extended to the total premium of the insurance firm in our micro-economy, is composed in (9.0):

πT[total] = Σ πT[ri] = Σ E[ri] + Σ σ[ri], summed over i = 1, …, N

In the next time period T+1, the insurance firm acquires a data set θT+1 that allows it to model geo-spatial dependencies among risks and to identify fully dependent, partially dependent and fully independent risks. The dependence structure is expressed and summarized in an [NxN] correlation matrix ρi,N. Traditionally, full independence between any two risks is modeled with a zero correlation factor, and partial dependence is modeled by a correlation factor between zero and one. With this new information we can extend the insurance product expression (7.0) to the total accumulated premium πT+1[total] of the insurance firm at time T+1:

πT+1[total] = Σ E[ri] + √(Σi Σj ρi,j σ[ri] σ[rj]), with i, j = 1, …, N
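A minimal Python sketch of this accumulation, assuming the total premium at T+1 is the sum of expected losses plus the portfolio standard deviation implied by the correlation matrix. The function name `total_premium` and every input value are hypothetical and chosen only for illustration; they are not results from the study.

```python
import math

def total_premium(expected, sigma, rho):
    """Sum of expected losses plus the portfolio standard deviation.

    expected, sigma: per-risk expected loss and standard deviation of loss.
    rho: N x N correlation matrix as nested lists (1.0 on the diagonal).
    """
    n = len(sigma)
    variance = sum(rho[i][j] * sigma[i] * sigma[j]
                   for i in range(n) for j in range(n))
    return sum(expected) + math.sqrt(variance)

# Hypothetical three-risk book (illustrative values only).
expected = [1.0, 2.0, 3.0]
sigma = [3.0, 4.0, 12.0]

# All correlations set to 1.0 reproduces the linear sum used at time T.
full_dependence = [[1.0] * 3 for _ in range(3)]

# Assumed modeled structure at T+1: risks 1 and 2 partially dependent,
# risk 3 independent of both.
modeled = [[1.0, 0.5, 0.0],
           [0.5, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

premium_t = total_premium(expected, sigma, full_dependence)
premium_t1 = total_premium(expected, sigma, modeled)
print(premium_t, premium_t1)
```

With any valid correlation matrix whose off-diagonal entries are below one, the second figure cannot exceed the first, which is the temporal sub-additivity discussed next.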

The impacts of full independence and partial dependence, which are inevitably present in a full insurance book of business, guarantee that the sub-additivity principle for premium accumulation comes into effect. In our case study sub-additivity has two related expressions. Between the two time periods, the acquisition of the dependence data set θT+1, which is used for modeling and defining the correlation structure ρi,N, justifies a temporal sub-additivity, or inequality, between the total premiums of the insurance firm, stated in (10.1):

πT+1[total] ≤ πT[total]

It is undesirable for any insurance firm to seek to lower its total cumulative premium intentionally by relying on diversification. However, one implication for underwriting guidelines could be that, after the total firm premium is accumulated with a model that accounts for inter-risk dependencies, this total monetary amount can be back-allocated to individual risks and policies and thus provide a sustainable competitive edge in pricing. The business function of diversification, and of taking advantage of its consequent premium cost savings, is achieved through two statistical operations: accumulating pure flood premium with a correlation structure, and then back-allocating the firm’s total premium down to single contributing risk granularity. A back-allocation relationship for the single-risk, single-policy premium π’T+1[rN] can be derived with a proportional ratio of standard deviations. This per-risk back-allocation ratio is constructed from the single-risk standard deviation of loss σT+1[rN] and the total linear sum of all per-risk standard deviations in the insurance firm’s book of business, as in (11.0):

π’T+1[rN] = πT+1[total] × σT+1[rN] / Σ σT+1[ri], summed over i = 1, …, N
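The back-allocation step can be sketched in a few lines of Python, assuming each risk receives a share of the total premium proportional to its standard deviation of loss. The function name `back_allocate` and the numbers are hypothetical. Proportional allocation by itself only guarantees that the allocated premiums add back to the total; how each allocated premium compares with its stand-alone premium depends on the inputs.

```python
def back_allocate(total_premium, sigma):
    """Split a total premium across risks in proportion to each risk's
    standard deviation of loss."""
    sigma_sum = sum(sigma)
    return [total_premium * s / sigma_sum for s in sigma]

# Hypothetical total premium and per-risk standard deviations.
allocated = back_allocate(20.0, [2.0, 3.0, 5.0])
print(allocated)
```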

From the temporal sub-additivity inequality between total firm premiums in (10.1) and the back-allocation of the total premium πT+1[total] down to the single-risk premium in (11.0), it is evident that there are economies of scale and cost in insurance policy underwriting between the two time periods for any arbitrary single risk rN. These cost savings are expressed in (12.0):

π’T+1[rN] ≤ πT[rN]

In our case study of a micro-economy and one notional insurance firm’s portfolio of one insured peril, namely flood, these economies of premium cost are driven by geo-spatial diversification among the insured risks. We support this theoretical discourse with a numerical study.

2.0 Notional Flood Insurance Portfolio Case Study

We construct two notional business units, each containing ten risks and, correspondingly, ten insurance policies. The risks in both units are geo-spatially clustered in high-intensity flood zones: Jersey City in New Jersey (‘Unit NJ’) and Baton Rouge in Louisiana (‘Unit BR’). For each business unit we perform two numerical computations for premium accumulation under two dependence regimes. Each unit’s accumulated fully dependent premium is computed by equation (9.0). Each unit’s accumulated partially dependent premium, modeled with a constant correlation factor of 0.6 (60%) between any two risks, is computed by equation (10.0). The insurance firm’s total premium under both full dependence and partial dependence is simply a linear sum: business unit premiums roll up to the book total.
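The two regimes can be sketched as follows, assuming the standard deviation premium principle and one constant correlation factor within each unit. The per-risk loss statistics are made up for illustration and are not the study’s data; `unit_premium` and `firm_premium` are hypothetical helper names.

```python
import math

def unit_premium(expected, sigma, r):
    """Unit premium under one constant correlation factor r between any two risks.

    Uses the identity
      sum_ij rho_ij * s_i * s_j = (1 - r) * sum(s_i^2) + r * (sum s_i)^2,
    so r = 1.0 collapses to the fully additive premium.
    """
    s1 = sum(sigma)
    s2 = sum(s * s for s in sigma)
    return sum(expected) + math.sqrt((1.0 - r) * s2 + r * s1 * s1)

def firm_premium(units, r):
    """Roll unit premiums up linearly, with no diversification credit across units."""
    return sum(unit_premium(u["expected"], u["sigma"], r) for u in units)

# Hypothetical units of ten risks each (illustrative values only).
unit_nj = {"expected": [1.0] * 10, "sigma": [2.0] * 10}
unit_br = {"expected": [2.0] * 10, "sigma": [3.0] * 10}

full = firm_premium([unit_nj, unit_br], 1.0)      # full dependence
partial = firm_premium([unit_nj, unit_br], 0.6)   # constant 60% correlation
print(full, partial)
```

The closed form avoids building the full correlation matrix, which is only possible because the correlation factor is assumed constant across all pairs in a unit.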

In all of our case studies we have focused on the impact of measuring geo-spatial dependencies and on their interpretation and usability in risk and premium diversification. For the actuarial task of premium accumulation across business units, we assume that the insurance firm will simply roll up unit total premiums and will not look for competitive pricing as a result of diversification across business units. This practice is justified because underwriting and pricing guidelines are managed somewhat autonomously by each geo-administrative business unit, and premium and financial reporting is done in the same manner.

In our numerical case study we confirm that the theoretical inequality (10.1), which defines temporal sub-additivity of premium with and without modeled dependence, is maintained. The total business unit premium computed without modeled correlation data, under the assumption of full dependence, always exceeds the unit’s premium under partial dependence computed with acquired and modeled correlation factors.

This justifies performing back-allocation in both business units, using procedure (11.0), of the total premium computed under partial dependence. In this way, competitive cost savings can be distributed down to the single-risk premium. In table 4, we show the results of this back-allocation procedure for all single risks in both business units:


For each single risk we observe that the per-risk premium inequality (12.0) is maintained by the numerical results. Partial dependence, which can be viewed as the statistical modeling expression of imperfect insurance risk diversification, can lead to opportunities for competitive premium pricing and to premium cost savings for the insured on a per-risk and per-policy basis.

3.0 Functions and Algorithms for Insurance Data Components

3.1 Definition of Insurance Big Data Components

Large insurance data components facilitate, and in practice enable, the actuarial and statistical tasks of measuring dependencies, modeling loss accumulations and back-allocating total business unit premium to single-risk policies. For this study, our definition of big insurance data components covers historical and modeled data at high geospatial granularity, structured in up to one million simulation maps. For modeling a single (re)insurance product, a single map can contain a few hundred historical, modeled and physical-measure data points. For simulation of a large book of business or portfolio, one map may contain millions of such data points. Time complexity is another feature of big data. Global but structured and distributed data sets are updated asynchronously, and oftentimes without a schedule, depending on scientific and business requirements and computational resources. Such big data components therefore have a critical and indispensable role in defining competitive premium cost savings for the insureds, which otherwise may not be found sustainable by the policy underwriters and the insurance firm.

3.2 Intersections of Exposure, Physical and Modeled Simulated data sets

Fast compute and big data platforms are designed to support various geospatial modeling and analysis tasks. A fundamental task is projecting an insured exposure map and computing its intersection with multiple simulated stochastic flood intensity maps and with geo-physical property maps containing coastal and river bank elevations and distances to water bodies. This algorithm performs spatial caching and indexing of all latitude and longitude geo-coded units and grid cells with insured risk exposure and modeled stochastic flood intensity. Geo-spatial interpolation is also employed to compute and adjust peril intensities to the distances and geo-physical elevations of the insured risks.
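As a rough illustration of the indexing idea, the sketch below assumes a regular latitude/longitude grid and a dictionary keyed by grid cell as the spatial index. The helper names `cell_of` and `intersect`, the cell size, the coordinates and the intensity value are all hypothetical, and the interpolation step is left out.

```python
import math

def cell_of(lat, lon, cell_deg):
    """Map a latitude/longitude pair to an integer grid-cell key."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def intersect(exposures, intensity_by_cell, cell_deg):
    """Return {risk_id: intensity} for exposures that fall in a flooded cell."""
    hits = {}
    for risk_id, lat, lon in exposures:
        intensity = intensity_by_cell.get(cell_of(lat, lon, cell_deg))
        if intensity is not None:
            hits[risk_id] = intensity
    return hits

# Hypothetical exposure points and one simulated intensity map.
cell_deg = 0.01
exposures = [("risk_1", 40.7178, -74.0431), ("risk_2", 30.4515, -91.1871)]
intensity_by_cell = {cell_of(40.7178, -74.0431, cell_deg): 1.8}  # assumed flood depth

print(intersect(exposures, intensity_by_cell, cell_deg))
```

Keying both data sets by the same cell function turns the intersection into dictionary lookups, so one exposure index can be reused across many simulated intensity maps.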

3.3 Reduction and Optimization through Mapping and Parallelism

One definition of Big Data relevant to our study is data sets that are too large and too complex to be processed by traditional technologies and algorithms. In principle, moving data is the most computationally expensive task in solving big geo-spatial problems, such as modeling and measuring inter-risk dependencies and diversification in an insurance portfolio. The cost of big geo-spatial solutions is magnified because large geo-spatial data sets are typically distributed across multiple physical computational environments as a result of their size and structure.

The solution is distributed optimization, which is achieved by a sequence of algorithms. As a first step, a mapping and splitting algorithm divides large data sets into sub-sets and performs statistical and modeling computations on the smaller sub-sets. In our computational case study the smaller data chunks represent insurance risks and policies in geo-physically dependent zones, such as river basins and coastal segments. The smaller data sets are processed as sub-problems in parallel by appropriately assigned computational resources. In our model we solve the chunked computations first for flood intensity and then for modeling and estimating fully simulated and probabilistic insurance loss. Once the cost-effective operations on the sub-sets are complete, a second algorithm collects and maps together the results of the first-stage compute for subsequent data analytics and presentation.

For single insurance products, business units and portfolios, an ordered accumulation of risks is achieved by mapping according to the strength, or absence, of their dependencies. Data sets and tasks with identical characteristics can be grouped together, and the resources for their processing significantly reduced, by reusing computational tasks that have already been mapped instead of repeating them. The stored post-analytics, post-processed data can also be distributed across different physical storage resources by a secondary scheduling algorithm, which intelligently allocates chunks of modeled and post-processed data to available storage. This family of techniques is generally known as MapReduce.
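The split, parallel compute and collect sequence can be sketched with the Python standard library. This is a toy, single-machine illustration of the pattern and not the platform used in the study: the zone names, loss values and the helper names `split_by_zone`, `zone_loss` and `run` are hypothetical, and a thread pool stands in for distributed computational resources.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_by_zone(records):
    """Map step: group (zone, loss) records into per-zone chunks."""
    chunks = defaultdict(list)
    for zone, loss in records:
        chunks[zone].append(loss)
    return chunks

def zone_loss(item):
    """Per-chunk computation: accumulate the losses of one zone."""
    zone, losses = item
    return zone, sum(losses)

def run(records, workers=4):
    """Process zone chunks in parallel, then reduce to a book total."""
    chunks = split_by_zone(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = dict(pool.map(zone_loss, chunks.items()))
    return partials, sum(partials.values())

# Hypothetical modeled losses tagged by geo-physically dependent zone.
records = [("basin_a", 1.0), ("basin_b", 2.0), ("basin_a", 3.0)]
print(run(records))
```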

3.4 Scheduling and Synchronization by Service Chaining

Distributed and service chaining algorithms process geo-spatial analysis tasks on data components simultaneously and automatically. For logically independent processes, such as computing intensities or losses on uncorrelated iterations of a simulation, service chaining algorithms divide and manage the tasks among separate computing resources. Dependencies and correlations among such data chunks may not exist because of large geo-spatial distances, as we saw in the modeling and pricing of our case studies. Hence they do not have to be accounted for computationally, and performance improvements are gained. In such cases both the input data and the computational tasks can be broken down into pieces and sub-tasks, respectively.

For logically inter-dependent tasks, such as accumulations of inter-dependent quantities like losses in geographic proximity, chaining algorithms automatically order the start and completion of dependent sub-tasks. In our modeled scenarios, the simulated loss distributions of risks in immediate proximity, where dependencies are expected to be strongest, are accumulated first. A second tier of accumulations, for risks with partial dependence and full independence measures, is scheduled once the first tier of accumulations of highly dependent risks is complete. Service chaining methodologies work in collaboration with auto-scaling memory algorithms, which provide or remove computational memory resources depending on the intensity of modeling and statistical tasks.

Significant challenges remain in processing shared data structures. An insurance risk management example, which we are currently developing for our next working paper, would be pricing a complex multi-tiered product composed of many geo-spatially dependent risks, and then back-allocating a risk metric, such as tail value at risk, down to single-risk granularity. On the statistical level this back-allocation and risk management task involves a process called de-convolution, or component convolution. A computational and optimization challenge arises when highly dependent and logically connected statistical operations are performed on chunks of data distributed across different physical storage resources. Solutions are being developed for multi-threaded implementations of map-reduce algorithms that address such computationally intensive tasks. In such procedures the mapping is done by task definition and not directly onto the raw and static data.
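The ordering of dependent accumulation tasks can be illustrated with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) as a stand-in for a service chaining scheduler; the task names and the helper `accumulation_order` are hypothetical, and no claim is made that production systems are built this way.

```python
from graphlib import TopologicalSorter

def accumulation_order(dependencies):
    """Group tasks into tiers; each task appears only after every task it
    depends on. `dependencies` maps a task to the set of its prerequisites."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    tiers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        tiers.append(ready)                 # these could run in parallel
        sorter.done(*ready)
    return tiers

# Hypothetical chain: proximity clusters first, then the unit-level roll-up.
dependencies = {
    "cluster_a": set(),
    "cluster_b": set(),
    "unit_total": {"cluster_a", "cluster_b"},
}
print(accumulation_order(dependencies))
```

Tasks in the same tier have no dependencies on each other, so a scheduler is free to assign them to separate computing resources.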

Some Conclusions and Further Work

With advances in computational methodologies for natural catastrophe and insurance portfolio modeling, practitioners are producing increasingly large data sets. Simultaneously, single-product and portfolio optimization techniques, which take advantage of metrics of diversification and inter-risk dependencies, are used in insurance premium underwriting. Such optimization techniques significantly increase the frequency of production of insurance underwriting data and require new types of algorithms that can process multiple large, distributed and frequently updated data sets. Such algorithms have been developed theoretically and are now moving from a proof-of-concept phase in academic environments to production implementations in the modeling and computational systems of insurance firms.

Both traditional statistical modeling methodologies such as premium pricing, and new advances in definition of inter-risk variance-covariance and correlation matrices and policy and portfolio accumulation principles, require significant data management and computational resources to account for the effects of dependencies and diversification. Accounting for these effects allows the insurance firm to support cost savings in premium value for the insurance policy holders.

Despite the advances reviewed here, there are still open areas for research in statistical modeling, single-product pricing and portfolio accumulation, and in the optimal big insurance data structures and algorithms that support them. Algorithmic communication and synchronization between global but distributed, structured and dependent data sets is expensive. Optimizing and reducing the computational processing cost of data analytics is a top priority for both scientists and practitioners. Optimal partitioning and clustering of data, particularly of geospatial images, is another active area of research.

When Big Data Can Define Pricing

This is the first part of a two-part series.


The purpose of this paper is to examine the intersection of research on the effects of (re)insurance risk diversification and the availability of big insurance data components for competitive underwriting and premium pricing. We study the combination of physical diversification by geography and insured natural peril with the complexity of aggregate structured insurance products, and furthermore how big historical and modeled data components affect product underwriting decisions. Under such market conditions, the availability of big data components facilitates accurate measurement of inter-dependencies among risks, and the definition of optimal and competitive insurance premium at the level of the firm and the policy holders. In the second part of this article, we extend the discourse to a notional micro-economy and examine the impact of diversification and insurance big data components on the potential for developing strategies for sustainable and economical insurance policy underwriting. We review concepts of parallel and distributed algorithmic computing for big data clustering, mapping and resource reducing algorithms.


This working paper will examine how big data and fast compute platforms solve some complex premium pricing and portfolio structuring and accumulation problems in the context of flood insurance markets. Our second objective is to measure the effects of geo-spatial insurance risk diversification through modeling of interdependencies and to show that such measures have an impact on single-risk premium definition and its market cost. The single-product case studies examine the pricing of insurance umbrella coverage. They are selected to address scenarios relevant to current (re)insurance market conditions under intense premium competition. Then we extend the discourse to a micro-economy of multiple policy holders and aim to generalize some findings on economies of scale and diversification. The outcomes of all case studies and theoretical analysis depend on the availability of big insurance data components for modeling and pricing workflows. The quality, usability and computational cost of such data components determine their direct impact on the underwriting and pricing process and on the definition of the single-risk cost of insurance.

1.0 Pricing Aggregate Umbrella Policies

Insurers are competing actively for insureds’ premiums and looking for economies of scale to offset and balance premium competition and thus develop more sustainable long-term underwriting strategies. While writing competitive premium policies and setting up flexible contract structures, insurers are mindful of risk concentration and the lower bounds of fair technical pricing. Structuring of aggregate umbrella policies lends itself to underwriting practices of larger scales in market share and diversification. Only large insurers have the economies of scale to offer such products to their clients.

Premium pricing of umbrella and global policies relies on both market conditions and mathematical modeling arguments. On the market and operational side, the insurer relies on lower cost of umbrella products due to efficiencies of scale in brokerage, claims management, administration and even in the computational scale-up of the modeling and pricing internal functions of its actuarial departments. In our study, we will first focus on the statistical modeling argument, and then we will define big data components, which allow for solving such policy structuring and pricing problems.

We first set up the case study on a smaller scale in context of two risks — with insured limits for flood of $90 million and $110 million. These risks are priced for combined river-rain and storm surge flood coverage, first with both single limits separately and independently and then in an aggregate umbrella insurance product with a combined limit of $200 million:

Umbrella(200M) = Limit 1 (90M) + Limit 2 (110M)

The two risks are owned by a single insured and are located in a historical flood zone, less than 1 kilometer from each other.

For premium pricing, we assume a traditional approach dependent on modeled expected values of insured loss and standard deviation of loss.

To set the statistical mechanics of the case study, we have modeled flood insurance loss data samples Qt and St for the two risks, respectively, from a stochastic simulation T. Modeled insured losses have an expected value E[.] and a standard deviation σ[.], which define a standard policy premium π(.).

When both policies’ premiums are priced independently, by the standard deviation pricing principle we have:

π(St) = E[St] + σ[St]
π(Qt) = E[Qt] + σ[Qt]

With non-negative loadings, it follows that:

π(St) ≥ E[St]
π(Qt) ≥ E[Qt]

Because both risks are owned by the same insured, we aggregate the two standard premium equations, using traditional statistical accumulation principles for expected values and standard deviations of loss.

π(Qt)+π(St)= E[St]+σ[St]+E[Qt]+σ[Qt]
π(Qt)+π(St)= E[St+Qt]+σ[St]+σ[Qt]

The theoretical joint insured loss distribution function fS,Q(St,Qt) of the two risks will have an expected value of insured loss:

E[St + Qt] = E[St] + E[Qt]

And a joint theoretical standard deviation of insured loss:

σ[St + Qt] = √(σ²[St] + σ²[Qt] + 2ρσ[St]σ[Qt])

We further use these aggregation principles to express the sum of the two single-risk premiums π(Qt) and π(St), as well as to derive a combined premium π(Qt + St) for an umbrella coverage product insuring both risks with equivalency in limits as in (1.0). An expectation of full equivalency in premium definition produces the following equality:

π(Qt + St) = E[St + Qt] + √(σ²[St] + σ²[Qt] + 2ρσ[St]σ[Qt]) = π(Qt) + π(St)

The expression introduces a correlation factor ρ between modeled insured losses of the two policies. In our case study, this correlation factor specifically expresses dependencies between historical and modeled losses for the same insured peril due to geo-spatial distances. Such correlation factors are derived by algorithms that measure dependencies of historical and modeled losses on their sensitivities to geo-spatial distances among risks. In this article, we will not delve into the definition of such geo-spatial correlation algorithms. Three general cases of dependence relationships among flood risks due to their geographical situation and distances are examined in our article: full independence, full dependence and partial dependence.
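To make the role of ρ concrete, here is a minimal Python sketch of the combined premium under the standard deviation principle, evaluated for the three dependence cases. The loss statistics are hypothetical round numbers chosen for illustration, not the modeled flood losses of the case study, and `umbrella_premium` is an assumed helper name.

```python
import math

def umbrella_premium(e_s, e_q, sd_s, sd_q, rho):
    """Expected combined loss plus the standard deviation of the combined
    loss S + Q for a given correlation factor rho."""
    joint_sd = math.sqrt(sd_s ** 2 + sd_q ** 2 + 2.0 * rho * sd_s * sd_q)
    return (e_s + e_q) + joint_sd

# Hypothetical expected losses and standard deviations for the two risks.
e_s, sd_s = 4.0, 9.0
e_q, sd_q = 5.0, 12.0

sum_of_singles = (e_s + sd_s) + (e_q + sd_q)
for rho in (1.0, 0.3, 0.0):  # full dependence, partial dependence, independence
    print(rho, umbrella_premium(e_s, e_q, sd_s, sd_q, rho), sum_of_singles)
```

Only ρ = 1.0 reproduces the sum of the single premiums; lower values of ρ shrink the joint standard deviation and therefore the combined premium.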

2.0 Sub-Additivity, Dependence and Diversification

Scenario 2.1: Two Boundary Cases of Fully Dependent and Fully Independent Risks

In the first boundary case, where we study full dependence between risks, expressed with a unit correlation factor, we have from first statistical principles that the theoretical sum of the standard deviations of loss of the fully dependent risks is equivalent to the standard deviation of the joint loss distribution of the two risks combined, as defined in equation (4.1).

σ[St + Qt] = √(σ²[St] + σ²[Qt] + 2σ[St]σ[Qt]) = σ[St] + σ[Qt]

For expected values of loss, we already have a known theoretical relationship between single risks’ expected insurance loss and umbrella product expected loss in equation (4.0). The logic of summations and equalities for the two components in standard premium definition in (4.0) and (4.3) leads to deriving a relationship of proven full additivity in premiums between the single policies and the aggregate umbrella product, as described in equation (4.2), and shortened as:

π(Qt + St) = π(Qt) + π(St)

Some underwriting conclusions are evident from this analysis. When structuring a combined umbrella product for fully dependent risks, in very close to identical geographical space and with the same insured peril and line of business, the price of the aggregated umbrella product should approach the sum of the single-risk premiums priced independently. The absence of diversification in geography and insured catastrophe peril prevents any significant opportunities for cost savings or competitiveness in premium pricing. The summation of riskiness from single policies to aggregate forms of products is linear and co-monotonic. Economies of market share scale do not play a role in highly clustered and concentrated pools of risks, where diversification is not achievable and inter-risk dependencies are close to perfect. In such scenarios, the impact of big data components on underwriting and pricing practices is not as prominent, because standard premiums for single risks and aggregated products can be obtained from theoretical formulations.

In our second boundary case of full and perfect independence, when two or more risks with two separate insurance limits are priced independently and separately, the summation of their premiums is still required for portfolio accumulations by line of business and by geographic and administrative region. This premium accumulation task, or “roll-up”, of fully independent risks is accomplished by practitioners in accordance with the linear principles of equation (3.0). However, if we are to structure an aggregate umbrella cover for these same single risks with an aggregated premium of π(Qt + St), the effect of statistical independence expressed with a zero correlation factor will reduce equation (3.0) to equation (5.0).

π(Qt + St) = E[St + Qt] + √(σ²[St] + σ²[Qt])

Full independence among risks more strongly than any other cases supports the premium sub-additivity principle, which is stated in (6.0).

π(Qt + St) ≤ π(Qt) + π(St)

An expanded expression of the subadditivity principle is easily derived from the linear summation of premiums in (3.0) and the expression of the combined single insurance product premium in (5.0).
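Written out with the symbols used above, and assuming non-negative standard deviations, the expanded form follows from comparing the two loadings:

```latex
\begin{aligned}
\pi(Q_t + S_t) &= E[S_t + Q_t] + \sqrt{\sigma^2[S_t] + \sigma^2[Q_t]} \\
  &\le E[S_t] + E[Q_t] + \sigma[S_t] + \sigma[Q_t] = \pi(Q_t) + \pi(S_t), \\
\text{because}\quad \big(\sigma[S_t] + \sigma[Q_t]\big)^2
  &= \sigma^2[S_t] + \sigma^2[Q_t] + 2\,\sigma[S_t]\,\sigma[Q_t]
  \;\ge\; \sigma^2[S_t] + \sigma^2[Q_t].
\end{aligned}
```

Equality holds only when at least one of the two standard deviations is zero.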

Some policy and premium underwriting guidelines can be derived from this regime of full statistical independence. Under conditions of full independence, when two risks are priced independently and separately, the sum of their premiums will always be larger than the premium of an aggregate umbrella product covering these same two risks. The physical and geographic characteristics of full statistical independence for modeled insurance loss are large geo-spatial distances and independent insured catastrophe perils and business lines. In practice, this is generally defined as insurance risk portfolio diversification by geography, line and peril. In insurance product terms, we proved that diversification by geography, peril and line of business, which are the physical prerequisites for statistical independence, allows us to structure and price an aggregate umbrella product with a premium less than the sum of the independently priced premiums of the underlying insurance risks.

In this case, unlike the case of full dependence, big data components have a computing and accuracy function to play in the underwriting and price definition process. Once the subadditivity of the aggregate umbrella product premium as in (6.0) is established, this premium is then back-allocated to the single component risks covered by the insurance product. This is done to measure the relative riskiness of the assets under the aggregate insurance coverage and each risk’s individual contribution to the formation of the aggregate premium. The back-allocation procedure is described further in the article in the context of a notional micro-economy case.

Scenario 2.2: Less Than Fully Dependent Risks

In our case study, we have geo-spatial proximity of the two insured risks in a known flood zone with measured and available averaged historical flood intensities, which leads to a measurable statistical dependence of modeled insurance loss. We express this dependence with a computed correlation factor in the interval [0 < ρ’ < 1.0].

Partial dependence with a correlation factor 0 < ρ’ < 1.0 has immediate impact on the theoretical standard deviation of combined modeled loss, which is a basic quantity in the formulation of risk and loading factors for premium definition.

σ[St + Qt] = √(σ²[St] + σ²[Qt] + 2ρ’σ[St]σ[Qt]) ≤ σ[St] + σ[Qt]

This leads to redefining the equality in (4.3) to an expression of inequality between the premium of the aggregate umbrella product and the independent sum of the single risk premiums, as in the case of complete independence.

π(Qt + St) = E[St + Qt] + √(σ²[St] + σ²[Qt] + 2ρ’σ[St]σ[Qt]) ≤ π(Qt) + π(St)

The principle of premium sub-additivity (6.0), as in the case of full independence, again comes into force. The expression of this principle is not as strong with partial dependence as with full statistical independence, but we can clearly observe a theoretical ranking of aggregate umbrella premiums π(Qt+St) in the three cases reviewed so far.

π(Full Independence) ≤ π(Partial Dependence) ≤ π(Full Dependence)

This theoretical ranking is further confirmed in the next section with computed numerical results.

Less than full dependencies, i.e. partial dependencies among risks, could still be viewed as a statistical modeling argument for diversification in market share geography, line of business and insured peril. Partial but effective diversification still offers an opportunity for competitive premium pricing. In insurance product and portfolio terms, our study proves that partial or imperfect diversification by geography affects the sensitivity of premium accumulation and allows for cost savings in premium for aggregate umbrella products vs. the summation of multiple single-risk policy premiums.

3.0 Numerical Results of Single-Risk and Aggregate Premium Pricing Cases

In our flood risk premium study, we modeled and priced three scenarios, using classical formulas for a single risk premium in equation (1.0) and for umbrella policies in equation (7.0). In our first scenario, we price each risk separately and independently with insured limits of $90 million and $110 million. In the second and third scenarios, we price an umbrella product with a limit of $200 million, in three sub-cases with {1.0, 0.3 and 0.0} correlation factors, respectively to represent full dependence, partial dependence and full independence of modeled insured loss. We use stochastic modeled insurance flood losses computed with high geo-spatial granularity of 30 meters.

The numerical results of our experiment fully support the conclusions and guidelines that we earlier derived from theoretical statistical relationships. For fully dependent risks in close proximity, the sum of single-risk premiums approaches the price of an umbrella product priced with a 1.0 (100%) correlation factor. This is the stochastic relationship of full premium additivity. For partially dependent risks, the price of a combined product, modeled and priced with a 0.3 (30%) correlation factor, could be less than the sum of single-risk premiums. For fully independent risks, priced with a 0.0 (0%) correlation factor, the price of the combined insurance cover decreases further, below the price of an umbrella on partially dependent risks (30% correlation). Partial dependence and full independence support the stochastic ordering principle of premium sub-additivity. The premium ranking relationship in (7.1) is strongly confirmed by these numerical pricing results.

Less than full dependence among risks, which is a very likely and practical measurement in real insurance umbrella coverage products, could still be viewed as the statistical modeling argument for diversification in market share geography. Partial and incomplete dependence theoretically and numerically supports the argument that partial but effective diversification offers an opportunity for competitive premium pricing.