After investigating a problematic sql
I saw that when we use CASE in the select it can lead to a major decrease in performance .
Here is the example :
the query is
Code: Select all
explain select a12.DT_WEEK AS DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
CASE WHEN max(CASE WHEN a11.HOT_LEAD_IND = 1 THEN 1 ELSE 0 END) = 1 THEN count(distinct CASE WHEN a11.HOT_LEAD_IND = 1 THEN a11.VS_LP_SESSION_ID ELSE NULL END) ELSE NULL END AS WJXBFS2,
CASE WHEN max(CASE WHEN a11.HOT_LEAD_IND = 1 THEN 1 ELSE 0 END) = 1 THEN (count(distinct CASE WHEN a11.HOT_LEAD_IND = 1 THEN a11.VS_LP_SESSION_ID ELSE NULL END) * 1.0) ELSE NULL END AS WJXBFS3
from lp_15744040.visit_date_time_prejoin3_b0 a11
join lp_15744040.DIM_DATE_TIME a12
on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK,
a11.LP_ACCOUNT_ID
We filter by VISIT_FROM_DT_TRUNC
We join by VISIT_FROM_DT_TRUNC and DATE_TIME_ID
We select distinct by HOT_LEAD_IND
In order for vertica to use it's strength we needed to create a pre join projection
which will be sorted by DT_WEEK,LP_ACCOUNT_ID,HOT_LEAD_IND
the explain of this query is :
Access Path:
Code: Select all
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 720K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT <SVAR>), max(<SVAR>)
| Group By: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 719K, Rows: 10K] (PATH ID: 2)
| | Aggregates: max(CASE WHEN (visit_date_time_prejoin4_b0.HOT_LEAD_IND = 1) THEN 1 ELSE 0 END)
| | Group By: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID, CASE WHEN (visit_date_time_prejoin4_b0.HOT_LEAD_IND = 1) THEN visit_date_time_prejoin4_b0.VS_LP_SESSION_ID E
LSE NULL END
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 631K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin4_b0
| | | Materialize: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID, visit_date_time_prejoin4_b0.HOT_LEAD_IND, visit_date_time_prejoin4_b0.VS_LP_SESSION_ID
| | | Filter: (visit_date_time_prejoin4_b0.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((visit_date_time_prejoin4_b0.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (visit_date_time_prejoin4_b0.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | Execute on: All Nodes
The problem with this query is GROUPBY HASH (SORT OUTPUT) which cause vertica to store all rows and sort them before doing the distinct .
We want it to use GROUPBY PIPELINED so the results will flow to the next step as they come .
When we change the CASE to a simple count distinct vertica will be able to PIPELINED the results to the next step .
The explain will result in:
Code: Select all
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 209K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin4_b0.HOT_LEAD_IND)
| Group By: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 209K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.HOT_LEAD_IND, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 200K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin4_b0
| | | Materialize: visit_date_time_prejoin4_b0.DT_WEEK, visit_date_time_prejoin4_b0.LP_ACCOUNT_ID, visit_date_time_prejoin4_b0.HOT_LEAD_IND
| | | Filter: (visit_date_time_prejoin4_b0.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((visit_date_time_prejoin4_b0.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (visit_date_time_prejoin4_b0.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | Execute on: All Nodes
To conclude:
We need to be very careful with case in the select statement .
In Vertica (compare to oracle) it will cause a significant decrease in performance .
Zvika