Need Syntax for Hash And Merge Join

Moderator: NorbertKrupa

Post Reply
saidaraom
Newbie
Newbie
Posts: 3
Joined: Thu Feb 07, 2013 1:27 pm

Need Syntax for Hash And Merge Join

Post by saidaraom » Fri Feb 15, 2013 7:25 am

What is the syntax for Hash And Merge Join in Vertica Database ??

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: Need Syntax for Hash And Merge Join

Post by JimKnicely » Fri Feb 15, 2013 3:01 pm

Hi, saidaraom!

What do you mean by "syntax for Hash And Merge Join"? I do not believe that there is any specific syntax to force a join to be a HASH or MERGE, but we can influence Vertica to choose one or the other.

Here are some details about each join in Vertica:

MERGE:
  • 1. Offers the best performance
    2. Optimizer may choose if both inputs are pre-sorted on the join column
    3. Vertica can also performs a sort-merge join where the inner input is unsorted but only if the outer input is already sorted on the join keys
    4. If the incoming data isn't already sorted correctly, you can facilitate a merge join by adding a projection that is sorted correctly
HASH:
  • 1. Used for joining large data sets
    2. No sort requirement
    3. Consumes more memory because a hash table is built with the values in the inner table
    4. Cost of performing a hash join is low if the entire hash table can fit in memory
    5. The optimizer chooses a hash join when projections are not sorted on the join columns
Basically, if you want a MERGE join to occur than make sure you have projections that are sorted on the JOIN columns in a query.

For example:

Say I create the following two tables named test1 and test2:

Code: Select all

dbadmin=> CREATE TABLE test1 (col1 INT, col2 VARCHAR(100)) UNSEGMENTED ALL NODES;
CREATE TABLE
dbadmin=> CREATE TABLE test2 (col1 INT, col2 VARCHAR(100)) UNSEGMENTED ALL NODES;
CREATE TABLE
The default projections are sorted by col1 and then col2 for each table:

Code: Select all

dbadmin=> SELECT p.projection_name,
dbadmin->        pc.projection_column_name,
dbadmin->        pc.column_position
dbadmin->   FROM projections p
dbadmin->   JOIN projection_columns pc
dbadmin->     ON pc.projection_name = p.projection_name
dbadmin->  WHERE p.anchor_table_name IN ('test1', 'test2')
dbadmin->  ORDER
dbadmin->     BY p.projection_name,
dbadmin->        pc.column_position;
 projection_name | projection_column_name | column_position
-----------------+------------------------+-----------------
 test1_node0001  | col1                   |               0
 test1_node0001  | col2                   |               1
 test1_node0002  | col1                   |               0
 test1_node0002  | col2                   |               1
 test1_node0003  | col1                   |               0
 test1_node0003  | col2                   |               1
 test2_node0001  | col1                   |               0
 test2_node0001  | col2                   |               1
 test2_node0002  | col1                   |               0
 test2_node0002  | col2                   |               1
 test2_node0003  | col1                   |               0
 test2_node0003  | col2                   |               1
(12 rows)
So if I check the explain plan for a SELECT joining the two tables on col2, I'll see a HASH join:

Code: Select all

 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT t1.col2, t2.col2
           FROM test1 t1
           JOIN test2 t2
             ON t1.col2 = t2.col2;

 Access Path:
 +-JOIN HASH [Cost: 2K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
 |  Join Cond: (t1.col2 = t2.col2)
 |  Execute on: Query Initiator
 | +-- Outer -> STORAGE ACCESS for t1 [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
 | |      Projection: intersect_wh.test1_node0001
 | |      Materialize: t1.col2
 | |      Execute on: Query Initiator
 | |      Runtime Filter: (SIP1(HashJoin): t1.col2)
 | +-- Inner -> STORAGE ACCESS for t2 [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
 | |      Projection: intersect_wh.test2_node0001
 | |      Materialize: t2.col2
 | |      Execute on: Query Initiator
Now if I create projections for the tables sorted first by col2 and then col1, I can get a MERGE join!

Code: Select all

dbadmin=> CREATE PROJECTION test1_p AS SELECT * FROM test1 ORDER BY col2, col1;
CREATE PROJECTION
dbadmin=> CREATE PROJECTION test2_p AS SELECT * FROM test2 ORDER BY col2, col1;
CREATE PROJECTION

Code: Select all

 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT t1.col2, t2.col2
           FROM test1 t1
           JOIN test2 t2
             ON t1.col2 = t2.col2;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
 |  Join Cond: (t1.col2 = t2.col2)
 |  Execute on: Query Initiator
 | +-- Outer -> STORAGE ACCESS for t1 [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
 | |      Projection: intersect_wh.test1_p
 | |      Materialize: t1.col2
 | |      Execute on: Query Initiator
 | |      Runtime Filter: (SIP1(MergeJoin): t1.col2)
 | +-- Inner -> STORAGE ACCESS for t2 [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
 | |      Projection: intersect_wh.test2_p
 | |      Materialize: t2.col2
 | |      Execute on: Query Initiator
I hope this helps a little!
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Post Reply

Return to “New to Vertica”