Suppose I add a column to a table that already exists, then want to fill the column by joining the original table against other tables. Is there an efficient way to do this?
Problem origin: we have a very large table created by joining a bunch of other tables. Sometimes we want to add a field that's in the join (which previously wasn't selected to be in the big table). We could drop the large table and start all over again, but the table is large enough that that's prohibitively expensive.
Add column, then efficiently fill it
Moderator: NorbertKrupa
-
- GURU
- Posts: 527
- Joined: Tue Oct 22, 2013 9:36 pm
- Location: Chicago, IL
- Contact:
Re: Add column, then efficiently fill it
The way I see you doing this is something like:
The only thing I would suggest is to use /*+direct*/ to write directly to disk and bypass WOS.
Code: Select all
INSERT INTO tableA (col1, col2, col3)
SELECT a.col1, a.col2, b.col1
FROM tableA a
JOIN tableB b ...
Checkout vertica.tips for more Vertica resources.
Re: Add column, then efficiently fill it
That's the basic idea, though we're really doing an update, not an insert.
Of course, updates are extremely expensive. So, conceptually, we're recreating each partition of the table using a join, then replacing the old partition with the new one.
Apparently (this started before I started my job) it uses up so much memory that we've been doing the JOIN in the filesystem (probably with C or Python).
The only way I can think of to do it entirely within Vertica is to chop each partition up into smaller ROSs, then replace each one, then merge them. But I haven't tried that yet.
(The table itself is on the order of 1 TB in size.)
Of course, updates are extremely expensive. So, conceptually, we're recreating each partition of the table using a join, then replacing the old partition with the new one.
Apparently (this started before I started my job) it uses up so much memory that we've been doing the JOIN in the filesystem (probably with C or Python).
The only way I can think of to do it entirely within Vertica is to chop each partition up into smaller ROSs, then replace each one, then merge them. But I haven't tried that yet.
(The table itself is on the order of 1 TB in size.)
-
- GURU
- Posts: 527
- Joined: Tue Oct 22, 2013 9:36 pm
- Location: Chicago, IL
- Contact:
Re: Add column, then efficiently fill it
With an UPDATE, you're doing a DELETE and an INSERT. When you say that you're doing a JOIN in the filesystem, have you considered using pre-JOIN projections? This will obviously impact your load performance but you save on query cost as it does not have to optimzie for any type of JOIN.
Checkout vertica.tips for more Vertica resources.
Re: Add column, then efficiently fill it
I'm not completely privy to the background, since it happened before I joined up, but my understanding is that at the time there was something wrong with pre-join projections. I'm not sure if it was a bug, or whether it didn't mesh well with our use case. I do know that someone on my team talked to Vertica and it was suggested not to use them.
-
- GURU
- Posts: 527
- Joined: Tue Oct 22, 2013 9:36 pm
- Location: Chicago, IL
- Contact:
Re: Add column, then efficiently fill it
Hmm, take a look at this blog post discussing pre-join projections. There may be something in there that justifies not using pre-joins. The other thing that comes to mind is trying:
Code: Select all
INSERT /*+direct*/ INTO tableA (col1, col2, col3)
SELECT a.col1, a.col2, b.col1
FROM tableA a
JOIN tableB b ...
Checkout vertica.tips for more Vertica resources.
Re: Add column, then efficiently fill it
Thanks---that blog post was really informative!