Merge join over pre-join projections

atmc · Post by **atmc** » Fri Jan 15, 2016 9:29 am

Hello!

I have one big fact table, f1, and two relatively big dictionary tables attached to it by foreign key, d1 and d2.

Let's say they have the following structure:
f1(id, d1_id, d2_id)
d1(id, value)
d2(id, value)

Now I have to filter on both dictionary tables in my query:

select
f1.id
from
f1
join
d1 on f1.d1_id = d1.id
join
d2 on f1.d2_id = d2.id
where
d1.value = v1
and
d2.value = v2

If I build conventional projections on those dictionary tables, I can have one merge join at most, either to d1 or d2, the second join will be hash join.

Hash join is slow, as dictionary tables are relatively big and don't entirely fit to memory.

As far as I reckon, the proposed way to tackle such queries is to build a pre-join projection over f1,d1,d2.

Now what if I build two pre-join projections instead:
{f1.id, d1.id, d1.value}, sorted by f1.id
{f1.id, d2.id, d2.value}, sorted by f1.id again

Can optimizer use those pre-join projections to merge join them? Both of those projections are ordered the same way, so they meet the requirement for a merge join, and have all the columns used in my query.

I tried it on version 7.0, but it didn't work.

Post by **JimKnicely** » Fri Jan 15, 2016 3:45 pm

Here is one way to get two MERGE joins:

Code: Select all

dbadmin=> \d f1;
                                List of Fields by Tables
 Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+--------+------+------+---------+----------+-------------+-------------
 public | f1    | d1_id  | int  |    8 |         | t        | t           |
 public | f1    | d2_id  | int  |    8 |         | t        | t           |
(2 rows)

dbadmin=> \d d1;
                                List of Fields by Tables
 Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+--------+------+------+---------+----------+-------------+-------------
 public | d1    | d1_id  | int  |    8 |         | t        | t           |
(1 row)

dbadmin=> \d d2;
                                List of Fields by Tables
 Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+--------+------+------+---------+----------+-------------+-------------
 public | d2    | d2_id  | int  |    8 |         | t        | t           |
(1 row)

dbadmin=> select * from f1;
 d1_id | d2_id
-------+-------
     1 |     1
     2 |     2
(2 rows)

dbadmin=> select * from d1;
 d1_id
-------
     1
     2
(2 rows)

dbadmin=> select * from d2;
 d2_id
-------
     1
     2
(2 rows)

The following query will use a MERGE and HASH:

Code: Select all

dbadmin=> explain select * from f1 join d1 on d1.d1_id = f1.d1_id join d2 on d2.d2_id = f1.d2_id;
 QUERY PLAN
 ----------------------------------------------------------------------------------------------------------
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 explain select * from f1 join d1 on d1.d1_id = f1.d1_id join d2 on d2.d2_id = f1.d2_id;

 Access Path:
 +-JOIN HASH [Cost: 133, Rows: 2] (PATH ID: 1)
 |  Join Cond: (d2.d2_id = f1.d2_id)
 |  Materialize at Input: f1.d2_id
 |  Materialize at Output: f1.d1_id
 | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [Cost: 51, Rows: 2] (PATH ID: 2)
 | |      Join Cond: (d1.d1_id = f1.d1_id)
 | | +-- Outer -> STORAGE ACCESS for f1 [Cost: 33, Rows: 2] (PATH ID: 3)
 | | |      Projection: public.f1_super
 | | |      Materialize: f1.d1_id
 | | |      Runtime Filters: (SIP2(MergeJoin): f1.d1_id), (SIP1(HashJoin): f1.d2_id)
 | | +-- Inner -> STORAGE ACCESS for d1 [Cost: 17, Rows: 2] (PATH ID: 4)
 | | |      Projection: public.d1_super
 | | |      Materialize: d1.d1_id
 | +-- Inner -> STORAGE ACCESS for d2 [Cost: 17, Rows: 2] (PATH ID: 5)
 | |      Projection: public.d2_super
 | |      Materialize: d2.d2_id

But this query will use two MERGE joins:

Code: Select all

dbadmin=> explain select * from (select * from f1 join d1 on d1.d1_id = f1.d1_id order by f1.d2_id) foo join d2 on d2.d2_id = foo.d2_id;
 QUERY PLAN
 ----------------------------------------------------------------------------------------------------------
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 explain select * from (select * from f1 join d1 on d1.d1_id = f1.d1_id order by f1.d2_id) foo join d2 on d2.d2_id = foo.d2_id;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 102, Rows: 2] (PATH ID: 1)
 |  Join Cond: (d2.d2_id = foo.d2_id)
 | +-- Outer -> SELECT [Cost: 84, Rows: 2] (PATH ID: 2)
 | | +---> SORT [Cost: 84, Rows: 2] (PATH ID: 3)
 | | |      Order: f1.d2_id ASC
 | | |      Runtime Filter: (SIP2(MergeJoin): foo.d2_id)
 | | | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 83, Rows: 2] (PATH ID: 4)
 | | | |      Join Cond: (d1.d1_id = f1.d1_id)
 | | | |      Materialize at Output: f1.d2_id
 | | | | +-- Outer -> STORAGE ACCESS for f1 [Cost: 33, Rows: 2] (PATH ID: 5)
 | | | | |      Projection: public.f1_super
 | | | | |      Materialize: f1.d1_id
 | | | | |      Runtime Filter: (SIP1(MergeJoin): f1.d1_id)
 | | | | +-- Inner -> STORAGE ACCESS for d1 [Cost: 17, Rows: 2] (PATH ID: 6)
 | | | | |      Projection: public.d1_super
 | | | | |      Materialize: d1.d1_id
 | +-- Inner -> STORAGE ACCESS for d2 [Cost: 17, Rows: 2] (PATH ID: 7)
 | |      Projection: public.d2_super
 | |      Materialize: d2.d2_id

But on a large data set, the query with the two MERGE joins might be slower because of the extra sort operation...

atmc · Post by **atmc** » Fri Jan 15, 2016 6:36 pm

Jim, thank you for your answer!

It's a nice trick, but as you mentioned it requires an additional sort in every query run.

My idea was to make projections that are already presorted for a merge join and to reuse them in queries on main fact table and different dictionaries joined to it.

I guess, an alternative could be to flatten the structure of my tables and move all the fields that are used for filtering in queries into the fact table, but pre-join projections have one big advantage - less license cost, as only raw data is licensed.

P.S. Found an answer here:
https://my.vertica.com/docs/7.1.x/HTML/ ... icates.htm

Note: HP Vertica uses a maximum of one pre-join projection per query. More than one pre-join projection might appear in a query plan, but at most, one will have been used to replace the join that would be computed with the precomputed pre-join. Any other pre-join projections are used as regular projections to supply records from a particular table.

Merge join over pre-join projections

Merge join over pre-join projections

Re: Merge join over pre-join projections

Re: Merge join over pre-join projections