We are trying to create a pre-join projection that joins two large fact tables.
The problem is that INSERTs are slow: they take O(existing_data), not O(added_data).
In practice, this means that after importing a couple of days worth of data, they become prohibitively slow.
The projections are written so that:
- INSERT uses MERGE JOINs only.
- Data for INSERT is presorted, i.e. no (SORT ON JOIN KEY).
- When running on a multi-node cluster, there is no (RESEGMENT).
- There's an extra batch_id column, that serves as a bucket, as per this help page.
Doing a SELECT .. JOIN .. for what I think Vertica has to do when populating the projections *is fast*, but INSERT .. SELECT is slow.
The following script demonstrates the issue:
https://dl.dropbox.com/u/41764/vertica/ ... ng-test.sh
The script will:
- Code: Select all
1. create a schema with 2 tables, `a` and `b`
2. create "normal" projections for `a` and prejoin for `b` x `a`
3. repeat 100 times
3.1. load 1M rows into `a_temp`
3.2. load 1M rows into `b_temp`
3.3. insert into a select * from a_temp
3.4. insert into b select * from b_temp
You will see that the times for #3.3 are constant, but the times for #3.4 grow with the number of *existing*, not *inserted* rows. Here is a chart:
Has anyone had success with pre-join projections over 2 large tables?