Loads into pre-join projections of 2 fact tables don't scale

JakaJancar · Post by **JakaJancar** » Wed Jan 30, 2013 1:41 pm

Hello,

We are trying to create a pre-join projection that joins two large fact tables.

The problem is that INSERTs are slow: they take O(existing_data), not O(added_data).

In practice, this means that after importing a couple of days worth of data, they become prohibitively slow.

The projections are written so that:

- INSERT uses MERGE JOINs only.
- Data for INSERT is presorted, i.e. no (SORT ON JOIN KEY).
- When running on a multi-node cluster, there is no (RESEGMENT).
- There's an extra batch_id column, that serves as a bucket, as per this help page.

Doing a SELECT .. JOIN .. for what I think Vertica has to do when populating the projections *is fast*, but INSERT .. SELECT is slow.

The following script demonstrates the issue:

https://dl.dropbox.com/u/41764/vertica/ ... ng-test.sh

The script will:

Code: Select all

 1. create a schema with 2 tables, `a` and `b`
 2. create "normal" projections for `a` and prejoin for `b` x `a`
 3. repeat 100 times
     3.1. load 1M rows into `a_temp`
     3.2. load 1M rows into `b_temp`
     3.3. insert into a select * from a_temp
     3.4. insert into b select * from b_temp

You will see that the times for #3.3 are constant, but the times for #3.4 grow with the number of *existing*, not *inserted* rows. Here is a chart:

Has anyone had success with pre-join projections over 2 large tables?

Best,
Jaka

becky · Post by **becky** » Wed Jan 30, 2013 7:30 pm

Hi Jaka,

Nice write up!

I've never had any luck using INSERT statements (> 1000) into any Vertica tables, regardless of the type of projections the table has. Have you tried using the COPY command? I had a case open with Vertica a while back and the guy told me to always use the COPY command. Another option is have you tried not having the foreien key between the two fact tables. By the way, are you using a snow flake schema?

I'm sure someone on this board can help provide better advice for you!

JakaJancar · Post by **JakaJancar** » Wed Jan 30, 2013 8:08 pm

Hi Becky,

Similarly to how it's done in the example reproduction script, we

1. insert into a temp table, using JDBC batch inserts (which under the hood should get converted to COPY), then
2. do "insert into final_table select * from temp_table".

Indeed we have not tried doing COPY directly into the final table, for other reasons, but I'm sceptical that it would perform any better...

As for removing the foreign key, you need that to create the pre-join projection.

Our schema consists of multiple fact tables that contain nested/related facts, which you want to join together (in addition to dimension tables).

Julie · Post by **Julie** » Wed Jan 30, 2013 9:18 pm

FYI ..

This is a reponse I got from Vertica support about INSERT vs. COPY command:

In the present release of vertica inserts and updates are working as designed. If an insert statement takes about x milli seconds then running it y times serially takes approximately x*y milli seconds. But if we run multiple insert statement parallel based on the resources available on the system it takes the time.

If there are any possibilities of batching up the data and doing bulk load with copy is the preferred method over individual inserts to get good performance.

Basically, insert statements suck in Vertica

If you still want to use inserts, we found that separating the inserts on different nodes so that they run in parallel helped with performance.

zvika · Post by **zvika** » Thu Jan 31, 2013 3:56 pm

Julie wrote:FYI ..

Basically, insert statements suck in Vertica

If you still want to use inserts, we found that separating the inserts on different nodes so that they run in parallel helped with performance.

This is interesting as Updates also sucks ....

JakaJancar · Post by **JakaJancar** » Fri Feb 01, 2013 12:39 pm

Based on your feedback and instructions from Vertica support, I have tried two more things

1) Single node, explicitly sorted projections, no RESEGMENT and no JOIN at all in the EXPLAIN (vertica-prejoin-scaling-test-2.sh)

Code: Select all

------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------
 
 EXPLAIN INSERT /*+ direct */ INTO b SELECT * FROM b_temp
 
 Access Path:
 +-DML INSERT [Cost: 0, Rows: 0]
 |  Target Projection: public.b_super_b0
 |  Target Prep:
 | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 4K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
 | |      Join Cond: (a.batch_id = b.batch_id) AND (a.a_id = b.a_id)
 | | +-- Outer -> SELECT [Cost: 2K, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
 | | | +---> STORAGE ACCESS for b_temp [Cost: 2K, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
 | | | |      Projection: public.b_temp_super_b0
 | | | |      Materialize: b_temp.batch_id, b_temp.a_id, b_temp.b_id, b_temp.b_data
 | | +-- Inner -> STORAGE ACCESS for a [Cost: 2K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
 | | |      Projection: public.a_super_b0
 | | |      Materialize: a.batch_id, a.a_id, a.a_data

Loads still get slower with every batch, problem persists.

2) Loads without use of temporary tables (COPY directly into destination) (vertica-prejoin-scaling-test-notemp.sh)

Loads are now HASH JOINs, even slower and have higher memory consumption. This is way worse.

JakaJancar · Post by **JakaJancar** » Fri Feb 01, 2013 2:41 pm

We have come to the bottom of this:

Inserting into pre-join projections does not recognize the transitivity of predicates based on join keys in pre-join projections.

As a consequence, some INSERTs into pre-join projections between two large tables that could be O(inserted_data) are O(existing_data), making them completely non-scalable.

Simulating a pre-join projection with a table containing the same columns shows significant performance improvements, but at a cost of polluted logical schema, more difficult maintenance and higher licensing costs.

Repro case: https://dl.dropbox.com/u/41764/vertica/ ... rejoin.sql

Loads into pre-join projections of 2 fact tables don't scale

Loads into pre-join projections of 2 fact tables don't scale

Re: Loads into pre-join projections of 2 fact tables don't s

Re: Loads into pre-join projections of 2 fact tables don't s

Re: Loads into pre-join projections of 2 fact tables don't s

Re: Loads into pre-join projections of 2 fact tables don't s

Re: Loads into pre-join projections of 2 fact tables don't s

Re: Loads into pre-join projections of 2 fact tables don't s