Analytic function taking too much time


Analytic function taking too much time

Post by nnani » Sun May 20, 2012 5:29 am

Hello All,

I have a SQL query that uses the Vertica analytic function FIRST_VALUE. The query has to run on 15 billion records. The query works fine but takes too much time, around 2 hrs 20 mins (8337 seconds). Can anybody please suggest an alternative for this query?

Original Query:

select distinct serialno , FIRST_VALUE(pid) over(partition by serialno order by time asc) fcid, first_value(pid) over(partition by serialno order by time desc) lcid
from dwh.look_in_hist;

Please help with this.
nnani........
Long way to go

You can check out my blogs at vertica-howto


Re: Analytic function taking too much time

Post by id10t » Sun May 20, 2012 7:51 am

Hi Navin!

Can you provide some info:
1. Your physical setup (number of nodes, RAM, CPU, storage, Vertica version)
2.

Code: Select all

select count(distinct serialno) from dwh.look_in_hist;
3.

Code: Select all

 explain select distinct serialno , FIRST_VALUE(pid) over(partition by serialno order by time asc) fcid, first_value(pid) over(partition by serialno order by time desc) lcid
from dwh.look_in_hist;
4.

Code: Select all

select * from PROJECTIONS where projection_schema = 'dwh' and anchor_table_name = 'look_in_hist';


Re: Analytic function taking too much time

Post by nnani » Sun May 20, 2012 8:54 am

3-node cluster,
128 GB RAM each,
Vertica 5.1.1


select count(distinct serialno) from dwh.look_in_hist;

dbadmin=> select count (serialno) from dwh.look_in_hist;
count
-------------
15006217950
(1 row)

dbadmin=> select count (distinct serialno) from dwh.look_in_hist;
count
-----------
100000000
(1 row)




dbadmin=> explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition by serialno order by time desc) lcid from dwh.look_In_Hist;


QUERY PLAN
--------------------------------------------------------------------------------
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------

explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno
order by time asc) fcid, first_value(cid) over(partition by serialno order by
time desc) lcid from dwh.look_In_Hist;

Access Path:
+-SORT [Cost: 967M, Rows: 15B] (PATH ID: 1)
| Order: look_In_Hist.serialno ASC, first_value(look_In_Hist.cid, false) ASC, first_value(look_In_Hist.cid, false) ASC
| Execute on: All Nodes
| +---> ANALYTICAL [Cost: 393M, Rows: 15B] (PATH ID: 2)
| | Analytic Group
| | Functions: first_value()
| | Group Sort: look_In_Hist.serialno ASC, look_In_Hist."time" DESC NULLS FIRST
| | Analytic Group
| | Functions: first_value()
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for look_In_Hist [Cost: 16M, Rows: 15B] (PATH ID: 3)
| | | Projection: dwh.look_In_Hist_DBD_9_seg_T04_T04
| | | Materialize: look_In_Hist.serialno, look_In_Hist."time", look_In_Hist.cid
| | | Execute on: All Nodes


------------------------------
-----------------------------------------------
PLAN: BASE QUERY PLAN (GraphViz Format)
-----------------------------------------------
digraph G {
graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select distinct serialno , FIRST_VALUE(cid) over(partition by serialno order by time asc) fcid, first_value(cid) over(partition by serialno order by time desc) lcid from dwh.look_In_Hist;\n\nAll Nodes Vector: \n\n node[0]=v_htcpoc_node0001 (initiator) Up\n node[1]=v_htcpoc_node0002 (executor) Up\n node[2]=v_htcpoc_node0003 (executor) Up\n", labelloc=t, labeljust=l ordering=out]
0[label = "Root \nOutBlk=[UncTuple(3)]", color = "green", shape = "house"];
1[label = "NewEENode \nOutBlk=[UncTuple(3)]", color = "green", shape = "box"];
2[label = "GroupByPipe: 3 keys\nAggs:\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
3[label = "ExprEval: \n look_In_Hist.serialno\n \<SVAR\>\n \<SVAR\>\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
4[label = "Recv(keys = A,A,A,N,N)\nRecv from: v_htcpoc_node0001,v_htcpoc_node0002,v_htcpoc_node0003\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
5[label = "Send\nSend to: v_htcpoc_node0001\nNet id: 1000\nMerge\n\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
6[label = "Sort: (keys = A,A,A,N,N)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "green", shape = "box"];
7[label = "ExprEval: \n look_In_Hist.serialno\n first_value(look_In_Hist.cid, false)\n first_value(look_In_Hist.cid, false)\n look_In_Hist.\"time\"\n look_In_Hist.cid", color = "green", shape = "box"];
8[label = "AnalyticPipe:\nPartitionBy: look_In_Hist.serialno\nOrderBy:look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
9[label = "Sort: (keys = A,D,N,N)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
10[label = "AnalyticPipe:\nPartitionBy: look_In_Hist.serialno\nOrderBy: look_In_Hist.\"time\"\nAnalytical Functions:\n first_value(look_In_Hist.cid, false)\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)\nUnc: Varchar(15)", color = "green", shape = "box"];
11[label = "StorageMergeStep: look_In_Hist_DBD_9_seg_T04_T04; 3 sorted\nUnc: Varchar(15)\nUnc: Integer(8)\nUnc: Varchar(15)", color = "purple", shape = "box"];
12[label = "ScanStep: look_In_Hist_DBD_9_seg_T04_T04\nserialno\ntime RLE\ncid\nUnc: Varchar(15)\nRLE: Integer(8)\nUnc: Varchar(15)", color = "brown", shape = "box"];
1->0 [label = "V[0] C=3",color = "black",style="bold", arrowtail="inv"];
2->1 [label = "0",color = "blue"];
3->2 [label = "0",color = "blue"];
4->3 [label = "0",color = "blue"];
5->4 [label = "0",color = "blue"];
6->5 [label = "0",color = "blue"];
7->6 [label = "0",color = "blue"];
8->7 [label = "0",color = "blue"];
9->8 [label = "0",color = "blue"];
10->9 [label = "0",color = "blue"];
11->10 [label = "0",color = "blue"];
12->11 [label = "0",color = "blue"];
}
(55 rows)

dbadmin=>





dbadmin=> select * from PROJECTIONS where projection_schema = 'dwh' and anchor_table_name = 'look_In_Hist'
dbadmin-> ;
projection_schema_id | projection_schema | projection_id | projection_name | owner_id | owner_name | anchor_table_id | anchor_table_name | node_id | node_name | is_prejoin | created_epoch | create_type | verified_fault_tolerance | is_up_to_date | has_statistics | is_segmented
----------------------+-------------------+-------------------+---------------------------------+-------------------+------------+-------------------+-------------------+---------+-----------+------------+---------------+-------------------+--------------------------+---------------+----------------+--------------
45035996273742294 |dwh | 45035996273813174 | look_In_Hist_DBD_3_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
45035996273742294 | dwh | 45035996273813360 | look_In_Hist_DBD_9_seg_T04_T04 | 45035996273704962 | dbadmin | 45035996273813170 | look_In_Hist | 0 | | f | 454 | CREATE PROJECTION | 1 | t | t | t
(2 rows)


Re: Analytic function taking too much time

Post by id10t » Sun May 20, 2012 1:23 pm

Hi Navin.

Some notes:
1. ) You have only one segmented projection. Try creating an additional projection with just the (serialno, cid, time) columns, sorted by (serialno, time), for example:
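(A rough sketch only — the projection name is hypothetical, the column names follow the EXPLAIN output above, and you should adjust the ORDER BY / SEGMENTED BY clauses to your own data.)

Code: Select all

-- extra projection with only the columns the query touches,
-- sorted to match the analytic PARTITION BY / ORDER BY
CREATE PROJECTION dwh.look_in_hist_by_serialno_time  -- hypothetical name
(
    serialno,
    "time",
    cid
)
AS SELECT serialno, "time", cid
   FROM dwh.look_in_hist
   ORDER BY serialno, "time"
   SEGMENTED BY HASH(serialno) ALL NODES;

-- populate the new projection
SELECT START_REFRESH();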

2. ) It's hard to understand what your query does, but why do you think this is a bad result? (total rows: 15006217950; execution time: 8337 secs):

Code: Select all

cluster_db=> select public.floor(15006217950.0 / 8337.0) as 'ROWS PER SEC';

 ROWS PER SEC 
--------------
      1799954
(1 row)

I don't know your data, but it looks like a very impressive result - ~1,800K rows per second. Can you compare it with other queries (which run "fast")?

3. ) What is the DISTINCT for? (As I understand it, it has no influence on the output.)

4. ) You have two different window ORDER BY clauses for the analytic functions, but:

Code: Select all

FIRST_VALUE( col1 ) OVER (PARTITION BY col2 ORDER BY col3 DESC)
is equivalent to

Code: Select all

LAST_VALUE( col1 ) OVER (PARTITION BY col2 ORDER BY col3 ASC)
(except for NULL handling, but you can work around that)
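If NULLs in col1 matter, here is a minimal sketch of one workaround, assuming Vertica's IGNORE NULLS option for FIRST_VALUE/LAST_VALUE (col1-col3 and some_table are placeholders from the snippets above, not your real objects):

Code: Select all

SELECT
    FIRST_VALUE(col1 IGNORE NULLS) OVER (PARTITION BY col2 ORDER BY col3 ASC
                                         ROWS BETWEEN UNBOUNDED PRECEDING
                                                  AND UNBOUNDED FOLLOWING) AS first_non_null,
    LAST_VALUE(col1 IGNORE NULLS)  OVER (PARTITION BY col2 ORDER BY col3 ASC
                                         ROWS BETWEEN UNBOUNDED PRECEDING
                                                  AND UNBOUNDED FOLLOWING) AS last_non_null
FROM some_table;  -- placeholder table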


Re: Analytic function taking too much time

Post by id10t » Sun May 20, 2012 8:54 pm

Example for case 4 from my prev. post:
  • Table:

    Code: Select all

    CREATE TABLE public.EventsExample
    (
        serial_num char(1),
        event_time time,
        event_pid int
    );
    
  • Data:

    Code: Select all

     serial_num | event_time | event_pid 
    ------------+------------+-----------
     A          | 10:00:00   |        10
     A          | 10:30:00   |        15
     A          | 11:00:00   |         7
     B          | 07:42:00   |         3
     B          | 09:12:00   |         8
     B          | 11:13:00   |         1
     C          | 00:00:00   |       100
     C          | 01:00:00   |        50
     C          | 02:00:00   |        75
    
  • Query:

    Code: Select all

    SELECT DISTINCT serial_num,
        FIRST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
                                                        AND UNBOUNDED FOLLOWING),
        LAST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
                                                       AND UNBOUNDED FOLLOWING)
    FROM EventsExample
    WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);
  • Result:

    Code: Select all

     serial_num | ?column? | ?column? 
    ------------+----------+----------
     A          |       10 |        7
     B          |        3 |        1
     C          |      100 |       75
    


Re: Analytic function taking too much time

Post by id10t » Sun May 20, 2012 9:51 pm

Case 1

15 billion rows and only one segmented super projection?

Let's take the example from the previous post:
Query plan:

Code: Select all

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------
 EXPLAIN SELECT DISTINCT serial_num,
     FIRST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
                                                     AND UNBOUNDED FOLLOWING),
     LAST_VALUE(event_pid) OVER (group_win ROWS BETWEEN UNBOUNDED PRECEDING
                                                    AND UNBOUNDED FOLLOWING)
 FROM EventsExample
 WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);
 

Code: Select all

Access Path:
 +-SORT [Cost: 53, Rows: 9] (PATH ID: 1)
 |  Order: EventsExample.serial_num ASC, first_value(EventsExample.event_pid, false) ASC, last_value(EventsExample.event_pid, false) ASC
 | +---> ANALYTICAL [Cost: 52, Rows: 9] (PATH ID: 2)
 | |      Analytic Group
 | |       Functions: first_value(), last_value()
 | |       Group Sort: EventsExample.serial_num ASC, EventsExample.event_time ASC NULLS LAST
 | | +---> STORAGE ACCESS for EventsExample [Cost: 51, Rows: 9] (PATH ID: 3)
 | | |      Projection: public.EventsExample_super
 | | |      Materialize: EventsExample.serial_num, EventsExample.event_time, EventsExample.event_pid
...
Total cost = 51 + 52 + 53 = 156

Now let's partition our BIG DATA table:

Code: Select all

CREATE TABLE EventsExample2
(
    serial_num char(1) NOT NULL,
    event_time time,
    event_pid int
)
PARTITION BY (EventsExample2.serial_num);
Data:

Code: Select all

insert into EventsExample2 select * from EventsExample;
select refresh();
and query plan:

Code: Select all

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------
EXPLAIN SELECT DISTINCT serial_num,
     FIRST_VALUE(event_pid) OVER (group_win RANGE BETWEEN UNBOUNDED PRECEDING
                                                     AND UNBOUNDED FOLLOWING),
     LAST_VALUE(event_pid) OVER (group_win RANGE BETWEEN UNBOUNDED PRECEDING
                                                    AND UNBOUNDED FOLLOWING)
 FROM EventsExample2
 WINDOW group_win AS (PARTITION BY serial_num ORDER BY event_time);

Code: Select all

 Access Path:
 +-SORT [Cost: 41, Rows: 9 (NO STATISTICS)] (PATH ID: 1)
 |  Order: EventsExample2.serial_num ASC, first_value(EventsExample2.event_pid, false) ASC, last_value(EventsExample2.event_pid, false) ASC
 | +---> ANALYTICAL [Cost: 40, Rows: 9 (NO STATISTICS)] (PATH ID: 2)
 | |      Analytic Group
 | |       Functions: first_value(), last_value()
 | |       Group Sort: EventsExample2.serial_num ASC, EventsExample2.event_time ASC NULLS LAST
 | | +---> STORAGE ACCESS for EventsExample2 [Cost: 39, Rows: 9 (NO STATISTICS)] (PATH ID: 3)
 | | |      Projection: public.EventsExample2_super
 | | |      Materialize: EventsExample2.serial_num, EventsExample2.event_time, EventsExample2.event_pid
Total cost = 39 + 40 + 41 = 120 => an improvement of about 23% in cost

For bigger tables it should give a bit more improvement.


Re: Analytic function taking too much time

Post by D_B_A » Thu May 24, 2012 2:34 pm

It's not clear from your description:
- what you'd consider a "good" query time for such a query (or an equivalent one) returning 100M rows
- how quickly your client can receive the 100M rows of the result (i.e., for example, how long it takes, using the same client, to run

Code: Select all

SELECT serialno , pid fcid, pid+100 lcid, time from dwh.look_in_hist LIMIT 100000000
)
- how frequently data is added to this table and how many rows (on average) are added at once
- whether data in dwh.look_in_hist is ever updated or only purged
- whether you have tried putting the projection containing serialno, pid and time on separate fast storage (like RAID 1 made from 2 SAS SSDs plus a fast controller)
