Cannot Update / Slow Update

ncreekmur · Post by **ncreekmur** » Wed Jan 09, 2013 7:03 pm

During our daily ETL process we have a table we:

1. Load New records into, if they do not exist.

2. Update existing records, field by field, and only if the existing value is shorter in length that the new one presented.

This update was taking around 10-15 minutes in total for the new record load, and updating of 14 columns based on the logic above.

We recently have had to change the primary key of this table from an Integer to a Varchar(36) so it can accept guids. This is a business requirement and not an option to change. Below this column will be referred to as "ID".

After the change the insert process was slow, but we've added a direct hint to it and resolved that portion for the most part.

Now the update statements with column by column evaluations have gotten... untenable. I had to kill the process yesterday after running for over an hour. I need each column update to run in 1-2 minutes for the solution to be serviceable.

Here's an example of the Update code we're trying to tune:

UPDATE schema.table
SET col1=t1.col1
FROM schema.table2 t1
WHERE hash(t1.ID) = hash(schema.table.ID)
AND CHARACTER_LENGTH(isnull(t1.col1,'')) > CHARACTER_LENGTH(isnull(schema.table.col1,''));

We have tried:
Adding a direct hint
Hashing the varchar36 join (seen above)
Purging the tables to ensure delete vectors are not an issue.
Building a different projection to segment the data across 3 nodes, by the ID.
Rebuilding the target table completely and re-doing the projection segment.
Re-writing the conditional update portion to perform the null evaluation to empty string before performing the length evaluation. Previously this was done separately in an OR clause in the where.

The target table is 326 million records. The table we're loading from is 1.726 million records.

Running out of ideas here and would appreciate any assistance.

Thanks,
Nick

ncreekmur · Post by **ncreekmur** » Wed Jan 09, 2013 7:07 pm

Adding the explain (dont have time to mask the tables, dont care):

------------------------------
QUERY PLAN DESCRIPTION:
------------------------------

explain UPDATE BI_DAT_Production.WebUser
SET User_First_Name=t1.User_First_Name
FROM BI_DAT_Production.tempBigDataWebUser t1
WHERE hash(t1.User_ID) = hash(BI_DAT_Production.WebUser.User_ID )
AND CHARACTER_LENGTH(isnull(t1.User_First_Name,'')) > CHARACTER_LENGTH(isnull(BI_DAT_Production.WebUser.User_First_Name,''))

Access Path:
+-DML UPDATE
| Target Projection: BI_DAT_Production.WebUser01_b1
| Target Projection: BI_DAT_Production.WebUser01_b0
| Target Projection: BI_DAT_Production.WebUser_copy_b1
| Target Projection: BI_DAT_Production.WebUser_copy_b0
| Target Prep:
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 7M, Rows: 164M] (PATH ID: 1) Inner (BROADCAST)
| | Join Cond: (hash(VAL(2)) = hash(WebUser.User_ID))
| | Join Filter: ("character_length"(coalesce(VAL(2), '')) > "character_length"(coalesce(WebUser.User_First_Name, '')))
| | Materialize at Output: WebUser.User_Last_Name, WebUser.User_Phone, WebUser.User_Email, WebUser.User_Address1, WebUser.User_Address2, WebUser.User_City, WebUser.User_State, WebUser.User_Zip, WebUser.User_Country, WebUser.User_Monetary_Value, WebUser.User_Metric_Value, WebUser.User_Browser, WebUser.User_IP, WebUser.Fileid, WebUser.Lineage_ID, WebUser.All_Users, WebUser.User_Pub_ID, WebUser.epoch
| | Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for <No Alias> [Cost: 698K, Rows: 328M] (PATH ID: 2)
| | | Projection: BI_DAT_Production.WebUser_copy_b0
| | | Materialize: WebUser.User_ID, WebUser.User_First_Name
| | | Execute on: All Nodes
| | | Runtime Filter: (SIP1(HashJoin): hash(WebUser.User_ID))
| | +-- Inner -> STORAGE ACCESS for <No Alias> [Cost: 4K, Rows: 2M] (PATH ID: 3)
| | | Projection: BI_DAT_Production.tempBigDataWebUser_b0
| | | Materialize: VAL(0), VAL(0)
| | | Execute on: All Nodes

ncreekmur · Post by **ncreekmur** » Wed Jan 09, 2013 8:33 pm

A user on the My.Vertica forums told us we're doing it wrong.

As such here's our current plan we're going to try:

In order to keep my brain moving in the lines of problem solving we’re discussing this approach (and I’d love your thoughts):

1. Inner join the temp table against the existing.

2. Evaluate all fields we would previously be updating via a case statement.

3. Write these records into a new temp table.

4. Remove these records from the existing table.

5. Insert the newly case records back into the existing table.

6. Insert our new data afterwards.

No updates exactly, just evaluations in select statements as we move into temps, then a deletion and re-insert.

Thoughts?

ncreekmur · Post by **ncreekmur** » Wed Jan 09, 2013 8:49 pm

skWa PM'd me and I thought I'd update the thread (to help googling if nothing else):

I was directed here:

http://my.vertica.com/docs/6.0.1/HTML/i ... #16427.htm

Based on this it seems that Vertica is building a new projection on each node (for 326mm rows thats not trivial) before attempting the update.

As such that means I need a better segmentation on a new projection… which I’m working on.

Accepting suggestions on the new projection… the table is basically defined in the join filter on the first pass of the explain plan.

id10t · Post by **id10t** » Wed Jan 09, 2013 9:41 pm

Hi!

1. i agree with "user from my.Vertica" - manual DELETE/INSERT will works faster than UPDATE
2. data of inner join of 2M rows is re-distributed (not outer), because it's data of temp table. So try to change temp table to "permanent" with order only 1 col, not segmented. play with it
3. for cols, where you compares length define constrain DEFAULT as empty string, so you will avoid ISNULL validation.
4. check your syntax: why its semi join? I think it should be a inner join!

ncreekmur · Post by **ncreekmur** » Wed Jan 09, 2013 9:46 pm

sKwa,

1. Understood.
2. The table is a permanent table located in the same schema as the table we're targetting for update.
3. We'll definitely consider adding that. But this wasn't really an issue before we changed our ID field.
4. This is a good question. The join should be inner. The syntax above defines the join in the where clause, but that shouldn't be creating an outer join.

Sharon (Cutter I believe) recommended on my.vertica that we need a complete redesign. Do you have an opinion on that? If so I need to get it started because it's going to take a few hours.

Thanks,
Nick

scutter · Post by **scutter** » Thu Jan 10, 2013 12:11 am

Hi Nick,

Picking up this thread here since this forum is likely to be more active on this topic than my.vertica.

The inner is broadcast to the other nodes when the segmentation of the two projections isn't optimal - when the segmentation is not on the join key. If both of your tables are segmented on the user id, then you can avoid the broadcast.

For your alternative solution, for the step "Inner join the temp table against the existing" you may still want to optimize for the join. If the hash join that you are currently getting is slow and is the bottleneck, try a merge join - where the first column in the sort order is the user id, and with segmentation on the user id. The only way to confirm where the time is going is by using profiling data, which takes a bit of expertise to interpret. You'll have to decide whether the performance improvement from the merge join is worth the extra disk storage and load time for the 2nd projection when you're loading new data.

UPDATE is a delete and insert, much like the workaround you described. The efficiency of the UPDATE when a subquery is involved will also depend on the projection design -- appropriate columns in the ORDER BY and proper use of RLE. Database Designer will optimize for UPDATE/DELETE automatically and you don't need to worry about getting it optimized manually. When people create projections by hand, they often end up with delete / replay-delete performance issues because their projections aren't optimal. While Database Designer has its shortcomings, IMHO it's worth working into your workflow so that you take care of issues like replay-delete up front and don't have to deal with them later. Also, the secondary columns that you're not so focussed will have encodings chosen for them, reducing the projection footprint. That said, it wouldn't surprise me deleting using a single IN list of ids, executed in a single pass, would be faster than what UPDATE with a subquery does - I don't have personal experience with this, so I'll defer to the other participants here.

Do make sure that you don't have extra projections lying around that you've been testing with, since DELETE/UPDATE need to be applied to all projections.

It's an interesting question on why the performance degraded so much once you changed the column type/width. I presume you don't have the original table around any longer for comparing explain plans and profiling data?

--Sharon

Cannot Update / Slow Update

Cannot Update / Slow Update

Re: Cannot Update / Slow Update

Re: Cannot Update / Slow Update

Re: Cannot Update / Slow Update

Re: Cannot Update / Slow Update

Re: Cannot Update / Slow Update

Re: Cannot Update / Slow Update