Analytic function taking too much time
Moderator: NorbertKrupa
Analytic function taking too much time
Hello All,
I have sql query which uses a Vertica Analytic function called First_value. The query has to run on 15 Billion records. The query is working fine but taking too much of time of around 2 Hrs 20 Mins ( 8337 Seconds). Can anybody please suggest me a alternatie for this query.
Original Query:
select distinct serialno , FIRST_VALUE(pid) over(partition by serialno order by time asc) fcid, first_value(pid) over(partition by serialno order by time desc) lcid
from dwh.look_in_hist;
Please help with this.
I have sql query which uses a Vertica Analytic function called First_value. The query has to run on 15 Billion records. The query is working fine but taking too much of time of around 2 Hrs 20 Mins ( 8337 Seconds). Can anybody please suggest me a alternatie for this query.
Original Query:
select distinct serialno , FIRST_VALUE(pid) over(partition by serialno order by time asc) fcid, first_value(pid) over(partition by serialno order by time desc) lcid
from dwh.look_in_hist;
Please help with this.
Re: Analytic function taking too much time
Hi Navin!
Can you provide some info:
1. Your physical parameters (num of nodes, RAM, CPU and storage, Vertica version)
2.
3.
4.
Can you provide some info:
1. Your physical parameters (num of nodes, RAM, CPU and storage, Vertica version)
2.
Code: Select all
count distinct serialno from dwh.look_in_hist;
Code: Select all
explain select distinct serialno , FIRST_VALUE(pid) over(partition by serialno order by time asc) fcid, first_value(pid) over(partition by serialno order by time desc) lcid
from dwh.look_in_hist;
Code: Select all
select * from PROJECTIONS where projection_schema = 'dwh' and anchor_table_name = 'look_in_hist';
Re: Analytic function taking too much time
3 Nodes cluster,
RAM 128GB each
Vertica5.1.1
count distinct serialno from dwh.look_in_hist;
dbadmin=> select count (serialno) from dwh.look_in_hist;
count
-------------
15006217950
(1 row)
dbadmin=> select count (distinct serialno) from dwh.look_in_hist;
count
-----------
100000000
(1 row)
dbadmin=> explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition by serialno or der by time desc) lcid from DW.Check_In_Hist;
QUERY PLAN
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------------------------------------
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno
order by time asc) fcid, first_value(cid) over(partition by serialno order by ti
me desc) lcid from dwh.look_In_Hist;
Access Path:
+-SORT [Cost: 967M, Rows: 15B (PATH ID: 1)
| Order: look_In_Hist.serialno ASC, first_value(look_In_Hist.cid, false) ASC
, first_value(look_In_Hist.cid, false) ASC
| Execute on: All Nodes
| +---> ANALYTICAL [Cost: 393M, Rows: 15B (PATH ID: 2)
| | Analytic Group
| | Functions: first_value()
| | Group Sort: look_In_Hist.serialno ASC, look_In_Hist."time" DESC NULLS FIRST
| | Analytic Group
| | Functions: first_value()
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for look_In_Hist [Cost: 16M, Rows: 15B)] (PATH ID: 3)
| | | Projection: dwh.look_In_Hist_DBD_9_seg_T04_T04
| | | Materialize: look_In_Hist.serialno, look_In_Hist."time", look_In_Hist.cid
| | | Execute on: All Nodes
------------------------------
-----------------------------------------------
PLAN: BASE QUERY PLAN (GraphViz Format)
-----------------------------------------------
digraph G {
graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition
by serialno order by time desc) lcid from dwh.look_In_Hist;\n\nAll Nodes Vector: \n\n node[0]=v_htcpoc_node0001 (initiator) Up\n node[1]=v_htcpoc_node0002 (executor) Up\n node[2]=v_htcpo
c_node0003 (executor) Up\n", labelloc=t, labeljust=l ordering=out]
0[label = "Root \nOutBlk=[UncTuple(3)]", color = "green", shape = "house"];
1[label = "NewEENode \nOutBlk=[UncTuple(3)]", color = "green", shape = "box"];
2[label = "GroupByPipe: 3 keys\nAggs:\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
3[label = "ExprEval: \n look_In_Hist.serialno\n \<SVAR\>\n \<SVAR\>\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
4[label = "Recv(keys = A,A,A,N,N)\nRecv from: v_htcpoc_node0001,v_htcpoc_node0002,v_htcpoc_node0003\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integ
er(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
5[label = "Send\nSend to: v_htcpoc_node0001\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"
];
6[label = "Sort: (keys = A,A,A,N,N)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
7[label = "ExprEval: \n look_In_Hist.serialno\n first_value(look_In_Hist.cid, false)\n first_value(look_In_Hist.cid, false)\n look_In_Hist.\"time\"\n look_In_Hist.cid", color = "
green", shape = "box"];
8[label = "AnalyticPipe:\nPartitionBy: look_In_Hist.serialno\nOrderBy:look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Integ
er(8)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
9[label = "Sort: (keys = A,D,N,N)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
10[label = "AnalyticPipe:\nPartitionBy: Check_In_Hist.serialno\nOrderBy: look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Inte
ger(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
11[label = "StorageMergeStep: lookIn_Hist_DBD_9_seg_T04_T04; 3 sorted\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "purple", shape = "box"];
12[label = "ScanStep: look_In_Hist_DBD_9_seg_T04_T04\nserialno\ntime RLE\ncid\nUnc: Varchar(15)\nRLE: Integer(8)\nUnc: Varchar(15)", color = "brown", shape = "box"];
1->0 [label = "V[0] C=3",color = "black",style="bold", arrowtail="inv"];
2->1 [label = "0",color = "blue"];
3->2 [label = "0",color = "blue"];
4->3 [label = "0",color = "blue"];
5->4 [label = "0",color = "blue"];
6->5 [label = "0",color = "blue"];
7->6 [label = "0",color = "blue"];
8->7 [label = "0",color = "blue"];
9->8 [label = "0",color = "blue"];
10->9 [label = "0",color = "blue"];
11->10 [label = "0",color = "blue"];
12->11 [label = "0",color = "blue"];
}
(55 rows)
dbadmin=>
dbadmin=> select * from PROJECTIONS where projection_schema = 'dwh' and anchor_table_name = 'look_In_Hist'
dbadmin-> ;
projection_schema_id | projection_schema | projection_id | projection_name | owner_id | owner_name | anchor_table_id | anchor_table_name | node_id | node_name | is_prejoin | created_epoch | create_type | verified_fault_tolerance | is_up_to_date | has_statistics | is_segmented
----------------------+-------------------+-------------------+---------------------------------+-------------------+------------+-------------------+-------------------+---------+-----------+------------+---------------+-------------------+--------------------------+---------------+----------------+--------------
45035996273742294 |dwh | 45035996273813174 | look_In_Hist_DBD_3_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
45035996273742294 | dwh | 45035996273813360 | look_In_Hist_DBD_9_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
(2 rows)
RAM 128GB each
Vertica5.1.1
count distinct serialno from dwh.look_in_hist;
dbadmin=> select count (serialno) from dwh.look_in_hist;
count
-------------
15006217950
(1 row)
dbadmin=> select count (distinct serialno) from dwh.look_in_hist;
count
-----------
100000000
(1 row)
dbadmin=> explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition by serialno or der by time desc) lcid from DW.Check_In_Hist;
QUERY PLAN
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
---------------------------------------------
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno
order by time asc) fcid, first_value(cid) over(partition by serialno order by ti
me desc) lcid from dwh.look_In_Hist;
Access Path:
+-SORT [Cost: 967M, Rows: 15B (PATH ID: 1)
| Order: look_In_Hist.serialno ASC, first_value(look_In_Hist.cid, false) ASC
, first_value(look_In_Hist.cid, false) ASC
| Execute on: All Nodes
| +---> ANALYTICAL [Cost: 393M, Rows: 15B (PATH ID: 2)
| | Analytic Group
| | Functions: first_value()
| | Group Sort: look_In_Hist.serialno ASC, look_In_Hist."time" DESC NULLS FIRST
| | Analytic Group
| | Functions: first_value()
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for look_In_Hist [Cost: 16M, Rows: 15B)] (PATH ID: 3)
| | | Projection: dwh.look_In_Hist_DBD_9_seg_T04_T04
| | | Materialize: look_In_Hist.serialno, look_In_Hist."time", look_In_Hist.cid
| | | Execute on: All Nodes
------------------------------
-----------------------------------------------
PLAN: BASE QUERY PLAN (GraphViz Format)
-----------------------------------------------
digraph G {
graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition
by serialno order by time desc) lcid from dwh.look_In_Hist;\n\nAll Nodes Vector: \n\n node[0]=v_htcpoc_node0001 (initiator) Up\n node[1]=v_htcpoc_node0002 (executor) Up\n node[2]=v_htcpo
c_node0003 (executor) Up\n", labelloc=t, labeljust=l ordering=out]
0[label = "Root \nOutBlk=[UncTuple(3)]", color = "green", shape = "house"];
1[label = "NewEENode \nOutBlk=[UncTuple(3)]", color = "green", shape = "box"];
2[label = "GroupByPipe: 3 keys\nAggs:\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
3[label = "ExprEval: \n look_In_Hist.serialno\n \<SVAR\>\n \<SVAR\>\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
4[label = "Recv(keys = A,A,A,N,N)\nRecv from: v_htcpoc_node0001,v_htcpoc_node0002,v_htcpoc_node0003\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integ
er(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
5[label = "Send\nSend to: v_htcpoc_node0001\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"
];
6[label = "Sort: (keys = A,A,A,N,N)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
7[label = "ExprEval: \n look_In_Hist.serialno\n first_value(look_In_Hist.cid, false)\n first_value(look_In_Hist.cid, false)\n look_In_Hist.\"time\"\n look_In_Hist.cid", color = "
green", shape = "box"];
8[label = "AnalyticPipe:\nPartitionBy: look_In_Hist.serialno\nOrderBy:look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Integ
er(8)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
9[label = "Sort: (keys = A,D,N,N)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
10[label = "AnalyticPipe:\nPartitionBy: Check_In_Hist.serialno\nOrderBy: look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Inte
ger(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
11[label = "StorageMergeStep: lookIn_Hist_DBD_9_seg_T04_T04; 3 sorted\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "purple", shape = "box"];
12[label = "ScanStep: look_In_Hist_DBD_9_seg_T04_T04\nserialno\ntime RLE\ncid\nUnc: Varchar(15)\nRLE: Integer(8)\nUnc: Varchar(15)", color = "brown", shape = "box"];
1->0 [label = "V[0] C=3",color = "black",style="bold", arrowtail="inv"];
2->1 [label = "0",color = "blue"];
3->2 [label = "0",color = "blue"];
4->3 [label = "0",color = "blue"];
5->4 [label = "0",color = "blue"];
6->5 [label = "0",color = "blue"];
7->6 [label = "0",color = "blue"];
8->7 [label = "0",color = "blue"];
9->8 [label = "0",color = "blue"];
10->9 [label = "0",color = "blue"];
11->10 [label = "0",color = "blue"];
12->11 [label = "0",color = "blue"];
}
(55 rows)
dbadmin=>
dbadmin=> select * from PROJECTIONS where projection_schema = 'dwh' and anchor_table_name = 'look_In_Hist'
dbadmin-> ;
projection_schema_id | projection_schema | projection_id | projection_name | owner_id | owner_name | anchor_table_id | anchor_table_name | node_id | node_name | is_prejoin | created_epoch | create_type | verified_fault_tolerance | is_up_to_date | has_statistics | is_segmented
----------------------+-------------------+-------------------+---------------------------------+-------------------+------------+-------------------+-------------------+---------+-----------+------------+---------------+-------------------+--------------------------+---------------+----------------+--------------
45035996273742294 |dwh | 45035996273813174 | look_In_Hist_DBD_3_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
45035996273742294 | dwh | 45035996273813360 | look_In_Hist_DBD_9_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
(2 rows)
Re: Analytic function taking too much time
Hi Navin.
Some notes:
1. ) You have only one segmented projection. Try to create more projection with (distinct serialno, cid, time) columns sorted by (serialno, time)
2. ) its hard to understand what your query do but why do you think that is bad results? (total rows: 15006217950; execution time 8337 secs):I don't know your data but its looks as very impressive result - ~1,800K rows per second. Can you compare with other queries (witch run "fast")?
3. ) What do DISTINCT ? (as I understand it have no influence on output)
4. ) You have 2 different order windows clause of analytic function but:
equals to
(except nulls but you can play with it)
Some notes:
1. ) You have only one segmented projection. Try to create more projection with (distinct serialno, cid, time) columns sorted by (serialno, time)
2. ) its hard to understand what your query do but why do you think that is bad results? (total rows: 15006217950; execution time 8337 secs):
Code: Select all
cluster_db=> select public.floor(15006217950.0 / 8337.0) as 'ROWS PER SEC';
ROWS PER SEC
--------------
1799954
(1 row)
3. ) What do DISTINCT ? (as I understand it have no influence on output)
4. ) You have 2 different order windows clause of analytic function but:
Code: Select all
FIST_VALUE( col1 ) OVER (PARTITION BY col2 ORDER BY col3 DESC)
Code: Select all
LAST_VALUE( col1 ) OVER (PARTITION BY col2 ORDER BY col3 ASC)
Re: Analytic function taking too much time
Example for case 4 from my prev. post:
- Table:
Code: Select all
CREATE TABLE public.EventsExample ( serial_num char(1), event_time time, event_pid int );
- Data:
Code: Select all
serial_num | event_time | event_pid ------------+------------+----------- A | 10:00:00 | 10 A | 10:30:00 | 15 A | 11:00:00 | 7 B | 07:42:00 | 3 B | 09:12:00 | 8 B | 11:13:00 | 1 C | 00:00:00 | 100 C | 01:00:00 | 50 C | 02:00:00 | 75
- Query:
Code: Select all
SELECT DISTINCT serial_num, FIRST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LAST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM EventsExample WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);
- Result:
Code: Select all
serial_num | ?column? | ?column? ------------+----------+---------- A | 10 | 7 B | 3 | 1 C | 100 | 75
Re: Analytic function taking too much time
Case 1
15 bil of rows and only super projection that segmented?
Lets take example from previous post:
Query plan:
Total cost = 51 + 52 + 53 = 155
Now lets partition our BIG DATA table:
Data:
and query plan:
Total cost = 39 + 40 + 41 = 120 => improvement by cost 20%
For bigger tables it has to give little more improvement.
15 bil of rows and only super projection that segmented?
Lets take example from previous post:
Query plan:
Code: Select all
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
EXPLAIN SELECT DISTINCT serial_num,
FIRST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING),
LAST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
FROM EventsExample
WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);
Code: Select all
Access Path:
+-SORT [Cost: 53, Rows: 9] (PATH ID: 1)
| Order: EventsExample.serial_num ASC, first_value(EventsExample.event_pid, false) ASC, last_value(EventsExample.event_pid, false) ASC
| +---> ANALYTICAL [Cost: 52, Rows: 9] (PATH ID: 2)
| | Analytic Group
| | Functions: first_value(), last_value()
| | Group Sort: EventsExample.serial_num ASC, EventsExample.event_time ASC NULLS LAST
| | +---> STORAGE ACCESS for EventsExample [Cost: 51, Rows: 9] (PATH ID: 3)
| | | Projection: public.EventsExample_super
| | | Materialize: EventsExample.serial_num, EventsExample.event_time, EventsExample.event_pid
...
Now lets partition our BIG DATA table:
Code: Select all
CREATE TABLE EventsExample2
(
serial_num char(1) NOT NULL,
event_time time,
event_pid int
)
PARTITION BY (EventsExample2.serial_num);
Code: Select all
insert into EventsExample2 select * from EventsExample;select refresh();
Code: Select all
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
EXPLAIN SELECT DISTINCT serial_num,
FIRST_VALUE(event_pid) OVER (group_win RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING),
LAST_VALUE(event_pid) OVER (group_win RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
FROM EventsExample2
WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);
Code: Select all
Access Path:
+-SORT [Cost: 41, Rows: 9 (NO STATISTICS)] (PATH ID: 1)
| Order: EventsExample2.serial_num ASC, first_value(EventsExample2.event_pid, false) ASC, last_value(EventsExample2.event_pid, false) ASC
| +---> ANALYTICAL [Cost: 40, Rows: 9 (NO STATISTICS)] (PATH ID: 2)
| | Analytic Group
| | Functions: first_value(), last_value()
| | Group Sort: EventsExample2.serial_num ASC, EventsExample2.event_time ASC NULLS LAST
| | +---> STORAGE ACCESS for EventsExample2 [Cost: 39, Rows: 9 (NO STATISTICS)] (PATH ID: 3)
| | | Projection: public.EventsExample2_super
| | | Materialize: EventsExample2.serial_num, EventsExample2.event_time, EventsExample2.event_pid
For bigger tables it has to give little more improvement.
Re: Analytic function taking too much time
It's not clear from your description:
- what you'd consider a "good" query time for such (or equivalent) query returning 100M rows
- how quickly your client can receive 100M rows of the result (i.e for example how long it'll take - using the same client - to do )
- how frequently data is being added to this table and how many rows (on avg) are added at once?
- is data in dwh.look_in_hist ever updated or only purged
- did you try to put this projection containing serialno , pid and time on a separate fast storage (like RAID1 made from 2 SAS SSDs + fast controller)
- what you'd consider a "good" query time for such (or equivalent) query returning 100M rows
- how quickly your client can receive 100M rows of the result (i.e for example how long it'll take - using the same client - to do
Code: Select all
SELECT serialno , pid fcid, pid+100 lcid, time from dwh.look_in_hist LIMIT 100000000
- how frequently data is being added to this table and how many rows (on avg) are added at once?
- is data in dwh.look_in_hist ever updated or only purged
- did you try to put this projection containing serialno , pid and time on a separate fast storage (like RAID1 made from 2 SAS SSDs + fast controller)