Metric Decomposition

Metric decomposition is the process by which DataJunction breaks down complex metric expressions into simpler, pre-aggregatable components. This enables efficient materialization to OLAP databases while preserving the mathematical correctness of metrics when queried at different dimension granularities.

Why Decomposition is Necessary

OLAP databases like Druid are optimized for rollup aggregations (e.g., SUM, COUNT, MIN, MAX) but cannot directly compute complex metrics like averages or rates from pre-aggregated data. For example:

-- This metric calculates average repair price
SELECT AVG(price) FROM repair_orders

If we pre-aggregate this as AVG(price) at the daily level, we cannot correctly compute the weekly average by averaging the daily averages. The required mathematical property is additivity: only additive aggregations (SUM, COUNT, MIN, MAX) can be correctly rolled up.

DataJunction solves this by decomposing metrics into their additive components.
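
To see the failure concretely, here is a small Python sketch (illustrative data, not DJ code) contrasting the naive average-of-averages with a rollup of additive sum and count components:

```python
# Illustrative data: order prices for two days with very different volumes.
daily_prices = {
    "Mon": [10.0, 20.0, 30.0],  # 3 orders, daily average = 20.0
    "Tue": [100.0],             # 1 order,  daily average = 100.0
}

all_prices = [p for prices in daily_prices.values() for p in prices]
true_avg = sum(all_prices) / len(all_prices)  # 160 / 4 = 40.0

# Wrong: averaging the daily averages weights both days equally.
avg_of_avgs = sum(sum(p) / len(p) for p in daily_prices.values()) / len(daily_prices)  # 60.0

# Right: roll up the additive components (sum, count), then divide.
rolled_up_avg = (
    sum(sum(p) for p in daily_prices.values())
    / sum(len(p) for p in daily_prices.values())
)  # 40.0
```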

The Three Phases of Decomposition

Each decomposable aggregation defines three operations. Let’s use AVG(price) as a running example:

Phase 1: Accumulate

This phase defines how to build the raw components from source data during initial ingestion.

For AVG(price), we need two components:

  • SUM(price) -> price_sum
  • COUNT(price) -> price_count

Each of these components is computed when data is loaded into the measures table.

Phase 2: Merge

This phase defines how to combine pre-aggregated components when rolling up to coarser granularity. This uses associative operations that produce the same result regardless of grouping.

For our AVG components:

  • price_sum merges via SUM (add the sums together)
  • price_count merges via SUM (add the counts together)

Example: Rolling up from hourly to daily measures:

| Hour | price_sum | price_count |
|------|-----------|-------------|
| 10am | 300       | 3           |
| 11am | 500       | 5           |
| 12pm | 200       | 2           |

After merge (daily level):

| Day | price_sum | price_count |
|-----|-----------|-------------|
| Mon | 1000      | 10          |

Phase 3: Combine

This phase defines how to reconstruct the final metric value from the merged components. The combiner formula is applied at query time.

For AVG(price):

SUM(price_sum) / SUM(price_count)

Using our example: 1000 / 10 = 100

This is mathematically equivalent to computing AVG(price) on the original data, but works on pre-aggregated measures.
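
The three phases for AVG(price) can be sketched in a few lines of Python (an illustration matching the hourly example above, not actual DJ code):

```python
# Phase 1: accumulate -- build additive components per hour.
hourly_prices = {
    "10am": [100.0, 100.0, 100.0],  # sum 300, count 3
    "11am": [100.0] * 5,            # sum 500, count 5
    "12pm": [100.0, 100.0],         # sum 200, count 2
}
measures = {
    hour: {"price_sum": sum(prices), "price_count": len(prices)}
    for hour, prices in hourly_prices.items()
}

# Phase 2: merge -- roll hourly rows up to the daily level via SUM.
daily = {
    "price_sum": sum(m["price_sum"] for m in measures.values()),     # 1000.0
    "price_count": sum(m["price_count"] for m in measures.values()),  # 10
}

# Phase 3: combine -- reconstruct the metric from the merged components.
avg_price = daily["price_sum"] / daily["price_count"]  # 100.0
```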

Decomposition by Aggregation Type

Simple Aggregations: SUM, COUNT, MIN, MAX

These are already additive and decompose trivially:

| Metric   | Accumulate | Merge | Combine        |
|----------|------------|-------|----------------|
| SUM(x)   | SUM(x)     | SUM   | SUM(sum_col)   |
| COUNT(x) | COUNT(x)   | SUM   | SUM(count_col) |
| MIN(x)   | MIN(x)     | MIN   | MIN(min_col)   |
| MAX(x)   | MAX(x)     | MAX   | MAX(max_col)   |

Note that COUNT merges as SUM (we sum up the counts).

AVG (Average)

Average requires two components: sum and count.

| Component | Accumulate | Merge |
|-----------|------------|-------|
| sum_col   | SUM(x)     | SUM   |
| count_col | COUNT(x)   | SUM   |

Combiner: SUM(sum_col) / SUM(count_col)

Example:

-- Original metric
SELECT AVG(price) FROM orders

-- Decomposed measures
SELECT SUM(price) AS price_sum, COUNT(price) AS price_count FROM orders

-- Combiner query (at query time)
SELECT SUM(price_sum) / SUM(price_count) FROM measures_table

APPROX_COUNT_DISTINCT (HyperLogLog)

Approximate distinct counts use HyperLogLog (HLL) sketches, which are probabilistic data structures that can be merged associatively.

| Component | Accumulate        | Merge     |
|-----------|-------------------|-----------|
| hll_col   | hll_sketch_agg(x) | hll_union |

Combiner: hll_sketch_estimate(hll_union(hll_col))

DataJunction uses Spark function names internally (hll_sketch_agg, hll_union, hll_sketch_estimate) and translates to dialect-specific functions at query time:

| DJ Internal (Spark) | Druid                        | Trino       |
|---------------------|------------------------------|-------------|
| hll_sketch_agg      | DS_HLL                       | approx_set  |
| hll_union           | DS_HLL                       | merge       |
| hll_sketch_estimate | APPROX_COUNT_DISTINCT_DS_HLL | cardinality |
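
A minimal sketch of this kind of translation table in Python. This is hypothetical illustration only: the real DataJunction transpiler rewrites the query AST rather than doing string substitution, and `DIALECT_FUNCTIONS` and `translate` are names invented here.

```python
# Hypothetical mapping from DJ's internal (Spark-style) HLL function names
# to dialect-native equivalents, mirroring the table above.
DIALECT_FUNCTIONS = {
    "druid": {
        "hll_sketch_agg": "DS_HLL",
        "hll_union": "DS_HLL",
        "hll_sketch_estimate": "APPROX_COUNT_DISTINCT_DS_HLL",
    },
    "trino": {
        "hll_sketch_agg": "approx_set",
        "hll_union": "merge",
        "hll_sketch_estimate": "cardinality",
    },
}

def translate(expr: str, dialect: str) -> str:
    """Naively replace Spark-style HLL function names with dialect equivalents."""
    for spark_name, native_name in DIALECT_FUNCTIONS[dialect].items():
        expr = expr.replace(spark_name, native_name)
    return expr
```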

Example:

-- Original metric
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events

-- Decomposed measures (stored as binary HLL sketch)
SELECT hll_sketch_agg(user_id) AS user_id_hll FROM events

-- Combiner query (Spark)
SELECT hll_sketch_estimate(hll_union(user_id_hll)) FROM measures_table

-- Combiner query (Druid, after translation)
SELECT APPROX_COUNT_DISTINCT_DS_HLL(DS_HLL(user_id_hll)) FROM measures_table

COUNT(DISTINCT x) - Exact Distinct Counts

Exact distinct counts cannot be decomposed for full pre-aggregation because there’s no way to merge distinct value sets associatively without storing all values.

DJ marks these as LIMITED aggregability, meaning they can only be pre-aggregated if the query includes all columns that affect distinctness.
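
A two-line illustration (plain Python, made-up data) of why distinct counts are not additive:

```python
# Distinct user IDs observed in two partitions; the sets overlap.
mon_users = {"alice", "bob", "carol"}
tue_users = {"bob", "carol", "dave"}

# Summing per-partition distinct counts double-counts the overlap...
sum_of_counts = len(mon_users) + len(tue_users)  # 6
# ...while the true distinct count requires the full value sets.
true_distinct = len(mon_users | tue_users)       # 4
```

HLL sketches sidestep this by storing a small mergeable summary of each value set instead of the values themselves.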

Variance & Standard Deviation

Variance uses the identity VAR = E[X²] - E[X]² to decompose into three components:

| Component  | Accumulate       | Merge |
|------------|------------------|-------|
| sum_col    | SUM(x)           | SUM   |
| sum_sq_col | SUM(POWER(x, 2)) | SUM   |
| count_col  | COUNT(x)         | SUM   |

Supported functions:

| Function       | Combiner                      |
|----------------|-------------------------------|
| VAR_POP(x)     | (sum_sq/n) - (sum/n)²         |
| VAR_SAMP(x)    | (n*sum_sq - sum²) / (n*(n-1)) |
| STDDEV_POP(x)  | SQRT(VAR_POP)                 |
| STDDEV_SAMP(x) | SQRT(VAR_SAMP)                |
| VARIANCE(x)    | Same as VAR_SAMP              |
| STDDEV(x)      | Same as STDDEV_SAMP           |

Example:

-- Original metric
SELECT STDDEV(response_time) FROM requests

-- Decomposed measures
SELECT 
  SUM(response_time) AS response_time_sum,
  SUM(POWER(response_time, 2)) AS response_time_sum_sq,
  COUNT(response_time) AS response_time_count
FROM requests
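
A quick numeric check of these combiner formulas against Python's statistics module (illustrative data, not DJ code):

```python
import math
import statistics

response_times = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Accumulate the three components.
s = sum(response_times)                       # sum
sq = sum(x * x for x in response_times)       # sum of squares
n = len(response_times)                       # count

# Combine using the formulas from the table above.
var_pop = (sq / n) - (s / n) ** 2
var_samp = (n * sq - s * s) / (n * (n - 1))
stddev_pop = math.sqrt(var_pop)
```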

Covariance (Two-Column Metrics)

Covariance measures the relationship between two variables and uses the identity COVAR = E[XY] - E[X]*E[Y]:

| Component | Accumulate | Merge |
|-----------|------------|-------|
| sum_x     | SUM(x)     | SUM   |
| sum_y     | SUM(y)     | SUM   |
| sum_xy    | SUM(x * y) | SUM   |
| count     | COUNT(x)   | SUM   |

Supported functions:

| Function         | Combiner                             |
|------------------|--------------------------------------|
| COVAR_POP(x, y)  | (sum_xy/n) - (sum_x/n)*(sum_y/n)     |
| COVAR_SAMP(x, y) | (n*sum_xy - sum_x*sum_y) / (n*(n-1)) |

Example:

-- Original metric: covariance between ad spend and revenue
SELECT COVAR_POP(ad_spend, revenue) FROM campaigns

-- Decomposed measures
SELECT 
  SUM(ad_spend) AS ad_spend_sum,
  SUM(revenue) AS revenue_sum,
  SUM(ad_spend * revenue) AS ad_spend_revenue_sum_xy,
  COUNT(ad_spend) AS count
FROM campaigns
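
The covariance combiners can be verified numerically in Python (illustrative data only):

```python
ad_spend = [1.0, 2.0, 3.0, 4.0]
revenue = [2.0, 4.0, 6.0, 8.0]  # perfectly proportional to ad_spend

# Accumulate the four components.
n = len(ad_spend)
sum_x = sum(ad_spend)                                   # 10.0
sum_y = sum(revenue)                                    # 20.0
sum_xy = sum(x * y for x, y in zip(ad_spend, revenue))  # 60.0

# Combine using COVAR = E[XY] - E[X]*E[Y].
covar_pop = (sum_xy / n) - (sum_x / n) * (sum_y / n)        # 15 - 2.5*5 = 2.5
covar_samp = (n * sum_xy - sum_x * sum_y) / (n * (n - 1))   # 40/12
```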

Correlation (Two-Column Metrics)

Correlation normalizes covariance by standard deviations: CORR = COVAR(X,Y) / (STDDEV(X) * STDDEV(Y)). This requires 6 components:

| Component | Accumulate       | Merge |
|-----------|------------------|-------|
| sum_x     | SUM(x)           | SUM   |
| sum_y     | SUM(y)           | SUM   |
| sum_x_sq  | SUM(POWER(x, 2)) | SUM   |
| sum_y_sq  | SUM(POWER(y, 2)) | SUM   |
| sum_xy    | SUM(x * y)       | SUM   |
| count     | COUNT(x)         | SUM   |

Combiner:

numerator = n * sum_xy - sum_x * sum_y
denominator = SQRT((n * sum_x_sq - sum_x²) * (n * sum_y_sq - sum_y²))
CORR = numerator / denominator

Example:

-- Original metric: correlation between price and quantity
SELECT CORR(price, quantity) FROM orders

-- Decomposed into 6 measures, combined at query time
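
The correlation combiner, evaluated on its six components in Python (illustrative data, not DJ code):

```python
import math

prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [2.0, 4.0, 5.0, 4.0, 5.0]

# Accumulate the six components.
n = len(prices)
sum_x = sum(prices)
sum_y = sum(quantities)
sum_x_sq = sum(x * x for x in prices)
sum_y_sq = sum(y * y for y in quantities)
sum_xy = sum(x * y for x, y in zip(prices, quantities))

# Combine using the formula above.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
corr = numerator / denominator
```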

Complex Metric Examples

Rate Metrics

-- Click-through rate: clicks / impressions
SELECT SUM(clicks) / SUM(impressions) FROM ad_events

Decomposes to two SUM measures that merge independently and combine via division.
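
A quick Python illustration (made-up numbers) of why the two SUMs must be merged before dividing:

```python
# Two time buckets with very different traffic volumes.
buckets = [
    {"clicks": 30, "impressions": 1000},  # bucket rate: 3%
    {"clicks": 10, "impressions": 100},   # bucket rate: 10%
]

# Correct: merge each SUM measure independently, then combine via division.
ctr = sum(b["clicks"] for b in buckets) / sum(b["impressions"] for b in buckets)

# Incorrect: averaging per-bucket rates ignores the traffic volumes.
avg_of_rates = sum(b["clicks"] / b["impressions"] for b in buckets) / len(buckets)
```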

Conditional Aggregations

-- Discounted order rate
SELECT CAST(SUM(IF(discount > 0, 1, 0)) AS DOUBLE) / COUNT(*) FROM orders

Both components (conditional sum and count) decompose normally.

Mixed Aggregations

-- Revenue per unique user
SELECT SUM(revenue) / APPROX_COUNT_DISTINCT(user_id) FROM transactions

| Component   | Type | Accumulate              | Merge     |
|-------------|------|-------------------------|-----------|
| revenue_sum | SUM  | SUM(revenue)            | SUM       |
| user_id_hll | HLL  | hll_sketch_agg(user_id) | hll_union |

Combiner: SUM(revenue_sum) / hll_sketch_estimate(hll_union(user_id_hll))

Aggregability Classification

DJ classifies each metric’s aggregability:

| Type    | Description                                     | Example                         |
|---------|-------------------------------------------------|---------------------------------|
| FULL    | Can be pre-aggregated at any dimension level    | SUM, AVG, APPROX_COUNT_DISTINCT |
| LIMITED | Can only pre-aggregate with specific dimensions | COUNT(DISTINCT x)               |
| NONE    | Cannot be pre-aggregated                        | MEDIAN, percentiles, MAX_BY     |

How Materialization Uses Decomposition

When you configure a cube for materialization:

  1. DJ analyzes each metric in the cube and extracts decomposable components
  2. A measures table is created with columns for each component
  3. Data is ingested using the accumulate expressions (Phase 1)
  4. At query time, DJ generates SQL that:
    • Reads from the measures table
    • Applies merge functions for rollup (Phase 2)
    • Applies combiner expressions (Phase 3)
    • Translates functions to the target dialect

Example: Cube with Multiple Metrics

# Cube definition
name: sales_cube
metrics:
  - total_revenue      # SUM(amount)
  - avg_order_value    # AVG(amount)  
  - unique_customers   # APPROX_COUNT_DISTINCT(customer_id)
dimensions:
  - date
  - region
  - product_category

Generated Measures Table:

| Column          | Source Metric                  | Accumulate                  |
|-----------------|--------------------------------|-----------------------------|
| amount_sum      | total_revenue, avg_order_value | SUM(amount)                 |
| amount_count    | avg_order_value                | COUNT(amount)               |
| customer_id_hll | unique_customers               | hll_sketch_agg(customer_id) |

Query for avg_order_value by region:

SELECT 
    region,
    SUM(amount_sum) / SUM(amount_count) AS avg_order_value
FROM sales_cube_measures
GROUP BY region
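
The same combiner query can be simulated end to end with an in-memory SQLite table. This is a sketch only: the table layout follows the example above, and the sample rows are invented here.

```python
import sqlite3

# Build a stand-in for the generated measures table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_cube_measures (
        date TEXT, region TEXT, amount_sum REAL, amount_count INTEGER
    )
""")
conn.executemany(
    "INSERT INTO sales_cube_measures VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "EU", 300.0, 3),
        ("2024-01-02", "EU", 500.0, 5),
        ("2024-01-01", "US", 250.0, 2),
    ],
)

# Merge (SUM over rows) and combine (division) in one query.
rows = conn.execute("""
    SELECT region,
           SUM(amount_sum) / SUM(amount_count) AS avg_order_value
    FROM sales_cube_measures
    GROUP BY region
    ORDER BY region
""").fetchall()
# rows == [('EU', 100.0), ('US', 125.0)]
```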

Non-Decomposable Aggregations

Some aggregations cannot be decomposed for pre-aggregation:

| Function          | Reason                                                           |
|-------------------|------------------------------------------------------------------|
| MEDIAN            | Requires all values to compute                                   |
| PERCENTILE        | Requires all values to compute                                   |
| MAX_BY(x, y)      | Cannot merge without full data                                   |
| MIN_BY(x, y)      | Cannot merge without full data                                   |
| COUNT(DISTINCT x) | Requires all distinct values (use APPROX_COUNT_DISTINCT instead) |

Metrics using these functions will have empty measure lists and cannot benefit from cube materialization.

SQL API: Getting Decomposed Measures

You can retrieve the decomposed measures for any metric using the API:

# Get measures for a metric
curl http://localhost:8000/nodes/default.avg_repair_price/measures/

Response:

{
  "measures": [
    {
      "name": "price_sum_abc123",
      "expression": "price",
      "aggregation": "SUM",
      "merge": "SUM"
    },
    {
      "name": "price_count_abc123", 
      "expression": "price",
      "aggregation": "COUNT",
      "merge": "SUM"
    }
  ],
  "combiner": "SUM(price_sum_abc123) / SUM(price_count_abc123)"
}

Supported Functions Reference

Quick reference for all decomposable aggregation functions:

| Function                 | # Components | Aggregability | Notes                 |
|--------------------------|--------------|---------------|-----------------------|
| SUM(x)                   | 1            | FULL          | Simple additive       |
| COUNT(x)                 | 1            | FULL          | Merges via SUM        |
| MIN(x)                   | 1            | FULL          | Simple associative    |
| MAX(x)                   | 1            | FULL          | Simple associative    |
| ANY_VALUE(x)             | 1            | FULL          | Any value from group  |
| COUNT_IF(cond)           | 1            | FULL          | Conditional count     |
| AVG(x)                   | 2            | FULL          | sum + count           |
| VAR_POP(x)               | 3            | FULL          | sum + sum_sq + count  |
| VAR_SAMP(x)              | 3            | FULL          | sum + sum_sq + count  |
| VARIANCE(x)              | 3            | FULL          | Alias for VAR_SAMP    |
| STDDEV_POP(x)            | 3            | FULL          | sqrt of VAR_POP       |
| STDDEV_SAMP(x)           | 3            | FULL          | sqrt of VAR_SAMP      |
| STDDEV(x)                | 3            | FULL          | Alias for STDDEV_SAMP |
| COVAR_POP(x, y)          | 4            | FULL          | Two-column metric     |
| COVAR_SAMP(x, y)         | 4            | FULL          | Two-column metric     |
| CORR(x, y)               | 6            | FULL          | Two-column metric     |
| APPROX_COUNT_DISTINCT(x) | 1            | FULL          | HLL sketch            |
| COUNT(DISTINCT x)        | 1            | LIMITED       | Requires full data    |
| MEDIAN(x)                | -            | NONE          | Not decomposable      |
| PERCENTILE(x, p)         | -            | NONE          | Not decomposable      |
| MAX_BY(x, y)             | -            | NONE          | Not decomposable      |
| MIN_BY(x, y)             | -            | NONE          | Not decomposable      |