BI tips: The sparse data trap


Introduction

Have you ever asked yourself whether all the existing aggregate functions always work the same way, no matter how much data you have for a given period? I would say « it depends on the need », but in a perfect world it is always simplest to fill the gaps left by missing data in SQL query results.

Be aware that I also wanted to simplify this official Oracle post based on my personal experience, and to introduce this concept as groundwork for my next post about windowed aggregation and how I avoid useless loops over the data.

Why simplest?

Because the data remains immutable and, most importantly, stays clear in your head when you develop SQL queries against it. I have seen plenty of queries with inconsistent execution plans that retrieve an incomplete data dimension. The data is in fact sparse, with no regular density, and therefore difficult to format or render.

Increasing the data volume?

Sure, the data volume may already be big, but the extra generated rows cost the database almost nothing, as they do not exist physically. To demonstrate the point, just launch this simple query, which simulates a Cartesian product generating 10 million rows with a simple « cross join » clause.

-- Schema (MySQL v8.0)

    with cartesian as (
    select 1 as number union select 2 
    union select 3 union select 4
    union select 5 union select 6
    union select 7 union select 8
    union select 9 union select 10 )
    
    select count(*) from cartesian c
    cross join cartesian c1
    cross join cartesian c2
    cross join cartesian c3
    cross join cartesian c4
    cross join cartesian c5
    cross join cartesian c6;

10 million rows generated in 1.5 seconds; you can try and check with a SQL fiddle sandbox here
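As a side note, if you are on MySQL 8.0 or later, the ten hand-written union branches can be replaced by a recursive CTE that generates the 1-to-10 series. This is only a sketch of the same counting query, not a performance claim:

```sql
-- Same 10M-row Cartesian product, but the 1..10 series is generated
-- with WITH RECURSIVE instead of ten union branches (MySQL 8.0+).
with recursive cartesian as (
    select 1 as number
    union all
    select number + 1 from cartesian where number < 10
)
select count(*) from cartesian c
cross join cartesian c1
cross join cartesian c2
cross join cartesian c3
cross join cartesian c4
cross join cartesian c5
cross join cartesian c6;
```

Both forms produce the same ten base rows; the recursive version is simply easier to extend when you need a longer series.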

A simple customer need

Let's see with a simple example why this matters. The business team comes to you for sales reporting; the need is basic: for the last year they need:

  • The average sales amount
  • The number of sales periods (even during the covid pandemic)
  • The total sales amount
  • The minimum sales figure

No data warehouse!

This need is such a piece of cake that you don't even notice there is no data warehouse available with the famous and universal time table 😉
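For the record, in a real data warehouse that time table is usually a persistent date dimension. Here is a minimal hypothetical sketch (the table and column names are my own, not a standard):

```sql
-- Hypothetical minimal date dimension a data warehouse would provide:
-- one row per (year, month), pre-populated for every period, so that
-- fact tables can left join to it and never lose an empty month.
create table dim_time (
    year  char(4) not null,
    month tinyint not null,
    primary key (year, month)
);
```

Since we have no such table here, we will generate an equivalent one on the fly inside the query itself.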

Let's see how to manage this by looking at the sales table content.

with sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021' as year, 5, 10
union select '2021' as year, 12, 10
)
select * from sales
year   month   amount
2021   1       10
2021   5       10
2021   12      10

Let’s write the query for our dear fellow business team

with sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021', 5, 10
union select '2021', 12, 10
)

select 'cpt' as AGG, count(amt) as AMT from sales group by year
union select 'avg', avg(amt) from sales group by year
union select 'sum', sum(amt) from sales group by year
union select 'min', min(amt)  from sales group by year
agg   amt
cpt   3.0000
avg   10.0000
sum   30.0000
min   10.0000
Result from the first attempt

Presenting the results

You're confident in this result but… wait… something is wrong for the business team: the figures are not what they were expecting. Indeed, after a short brainstorming session, it turns out the shop stayed open during the covid pandemic but no customers came due to quarantine, so we have to keep counting those inactive sales periods.

Solution

Let's create a temporary time table using another with clause, then use a left join to fill the gaps. I also re-query and compare the original AMT column, renamed SPARSE, with a new NEW_AMT column.

with timetable as
(
select '2021' as year, 1 as month union select '2021',2
union select '2021',3 union select '2021',4
union select '2021',5 union select '2021',6
union select '2021',7 union select '2021',8
union select '2021',9 union select '2021',10
union select '2021',11 union select '2021',12
),
sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021' as year, 5, 10
union select '2021' as year, 12, 10
),

t1 as
(
select 
timetable.year, timetable.month,
sales.amt as sparse, coalesce(sales.amt,0) as new_amt
from timetable
left join sales on sales.year = timetable.year and sales.month = timetable.month
)

select 'cpt' as AGG, count(sparse) as SPARSE, count(new_amt) as NEW_AMT from t1 group by year
union select 'avg' as agg, avg(sparse) , avg(new_amt) from t1 group by year
union select 'sum' as agg, sum(sparse), sum(new_amt)  from t1 group by year
union select 'min' as agg, min(sparse) , min(new_amt)  from t1 group by year
AGG   SPARSE    NEW_AMT
cpt   3.0000    12.0000
avg   10.0000   2.5000
sum   30.0000   30.0000
min   10.0000   0.0000
New result using a time table

Test the above example again here: we immediately see the correct figures versus the old, wrong results. Only the sum aggregation remains correct; count, avg and min all change once the empty months are taken into account (for example, the average becomes 30 / 12 = 2.5 instead of 30 / 3 = 10, because avg ignores the NULL rows of the sparse column).
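As with the Cartesian product earlier, on MySQL 8.0+ the twelve hand-written union branches of the timetable can be replaced by a recursive CTE. A sketch of that CTE alone:

```sql
-- Equivalent timetable built with WITH RECURSIVE (MySQL 8.0+):
-- starts at month 1 and adds one row per month up to 12.
with recursive timetable as (
    select '2021' as year, 1 as month
    union all
    select year, month + 1 from timetable where month < 12
)
select * from timetable;
```

Drop this in place of the timetable clause in the solution above and the rest of the query stays unchanged.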

Conclusion

Filling in missing data is a good starting point for all kinds of aggregation: it gives us an immutable data set we can query. This tip is especially useful for the windowed-aggregation BI approach I'll explain in a future post… stay tuned 😉