BI tips: The sparse data trap


Introduction

Have you ever asked yourself whether all the existing aggregate functions always work the same way, no matter how much data you have for a given period? I would say « it depends on the need », but in a perfect world it is always simplest to fill the gaps left by missing data in SQL query results.

Be aware that I also wanted to simplify this official Oracle post based on my personal experience, and to introduce this concept as groundwork for my next post about windowed aggregation and how I avoid useless loops over the data.

Why simplest?

Because the data remains immutable and, most importantly, stays clear in your head when you develop SQL queries against it. I have seen plenty of queries with inconsistent execution plans that retrieve an incomplete data dimension. The data is in fact sparse, with no regular density, and therefore difficult to format or render.

Increasing the data volume?

Sure, the data volume may already be big, but the extra generated rows cost the database almost nothing, as they do not exist physically. To demonstrate the point, just launch this simple query, which simulates a Cartesian product generating 10 million rows with a simple « cross join » clause.

-- Schema (MySQL v8.0)

    with cartesian as (
    select 1 as number union select 2 
    union select 3 union select 4
    union select 5 union select 6
    union select 7 union select 8
    union select 9 union select 10 )
    
    select count(*) from cartesian c
    cross join cartesian c1
    cross join cartesian c2
    cross join cartesian c3
    cross join cartesian c4
    cross join cartesian c5
    cross join cartesian c6;

10 million rows generated in 1.5 seconds; you can try and check with a SQL fiddle sandbox here
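As a side note, if you are on MySQL 8.0 or later, the ten hand-written union branches can be replaced by a recursive CTE that generates the 1-to-10 series. This is only a sketch of the same counting query, not a performance claim:

```sql
-- Same 10M-row Cartesian product, but the 1..10 series is generated
-- with WITH RECURSIVE instead of ten union branches (MySQL 8.0+).
with recursive cartesian as (
    select 1 as number
    union all
    select number + 1 from cartesian where number < 10
)
select count(*) from cartesian c
cross join cartesian c1
cross join cartesian c2
cross join cartesian c3
cross join cartesian c4
cross join cartesian c5
cross join cartesian c6;
```

Both forms produce the same ten base rows; the recursive version is simply easier to extend when you need a longer series.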

A simple customer need

Let's see with a simple example why this matters. The business team comes to you for sales reporting; the need is basic: for the last year they need:

  • The average sales amount
  • The number of sales periods (even during the covid pandemic)
  • The total sales amount
  • The minimum sales figure

No data warehouse!

This need is such a piece of cake that you don't even notice there is no data warehouse available with the famous and universal time table 😉
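For the record, in a real data warehouse that time table is usually a persistent date dimension. Here is a minimal hypothetical sketch (the table and column names are my own, not a standard):

```sql
-- Hypothetical minimal date dimension a data warehouse would provide:
-- one row per (year, month), pre-populated for every period, so that
-- fact tables can left join to it and never lose an empty month.
create table dim_time (
    year  char(4) not null,
    month tinyint not null,
    primary key (year, month)
);
```

Since we have no such table here, we will generate an equivalent one on the fly inside the query itself.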

Let's see how to manage this by looking at the sales table content.

with sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021' as year, 5, 10
union select '2021' as year, 12, 10
)
select * from sales
year   month   amount
2021   1       10
2021   5       10
2021   12      10

Let’s write the query for our dear fellow business team

with sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021', 5, 10
union select '2021', 12, 10
)

select 'cpt' as AGG, count(amt) as AMT from sales group by year
union select 'avg', avg(amt) from sales group by year
union select 'sum', sum(amt) from sales group by year
union select 'min', min(amt)  from sales group by year
agg   amt
cpt   3.0000
avg   10.0000
sum   30.0000
min   10.0000
Result from the first attempt

Presenting the results

You're confident in this result but… wait… something is wrong for the business team: the figures are not what they were expecting. Indeed, after a short brainstorming session, it turns out the shop stayed open during the covid pandemic but no customers came due to quarantine, so we have to keep counting those inactive sales periods.

Solution

Let's create a temporary time table using another with clause, then use a left join to fill the gaps. I also re-query and compare the original AMT column, renamed SPARSE, with a new NEW_AMT column.

with timetable as
(
select '2021' as year, 1 as month union select '2021',2
union select '2021',3 union select '2021',4
union select '2021',5 union select '2021',6
union select '2021',7 union select '2021',8
union select '2021',9 union select '2021',10
union select '2021',11 union select '2021',12
),
sales as
(
select '2021' as year, 1 as month, 10 as amt
union select '2021' as year, 5, 10
union select '2021' as year, 12, 10
),

t1 as
(
select 
timetable.year, timetable.month,
sales.amt as sparse, coalesce(sales.amt,0) as new_amt
from timetable
left join sales on sales.year = timetable.year and sales.month = timetable.month
)

select 'cpt' as AGG, count(sparse) as SPARSE, count(new_amt) as NEW_AMT from t1 group by year
union select 'avg' as agg, avg(sparse) , avg(new_amt) from t1 group by year
union select 'sum' as agg, sum(sparse), sum(new_amt)  from t1 group by year
union select 'min' as agg, min(sparse) , min(new_amt)  from t1 group by year
AGG   SPARSE    NEW_AMT
cpt   3.0000    12.0000
avg   10.0000   2.5000
sum   30.0000   30.0000
min   10.0000   0.0000
New result using a time table

Test the above example again here: we immediately see the correct figures versus the old, wrong results. Only the sum aggregation remains correct; count, avg and min all change once the empty months are taken into account (for example, the average becomes 30 / 12 = 2.5 instead of 30 / 3 = 10, because avg ignores the NULL rows of the sparse column).
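As with the Cartesian product earlier, on MySQL 8.0+ the twelve hand-written union branches of the timetable can be replaced by a recursive CTE. A sketch of that CTE alone:

```sql
-- Equivalent timetable built with WITH RECURSIVE (MySQL 8.0+):
-- starts at month 1 and adds one row per month up to 12.
with recursive timetable as (
    select '2021' as year, 1 as month
    union all
    select year, month + 1 from timetable where month < 12
)
select * from timetable;
```

Drop this in place of the timetable clause in the solution above and the rest of the query stays unchanged.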

Conclusion

Filling in missing data is a good starting point for all kinds of aggregation: it gives us an immutable data set we can query. This tip is especially useful for the windowed-aggregation BI approach I'll explain in a future post… stay tuned 😉