De-Duplication in Vertica

Moderator: NorbertKrupa

User avatar
abeeshdreams
Newbie
Newbie
Posts: 7
Joined: Tue Jul 10, 2012 6:11 am

Re: De-Duplication in Vertica

Post by abeeshdreams » Fri Nov 16, 2012 11:19 am

@scutter
That was a good one!

But the epoch is set for every commit performed. So if you are loading a table using a ETL tool with commit interval
as 10k, all the duplicates in that 10k will have the same epoch time. So epoch wont be of much help in such a scenario.

Now consider a scenario of 10k records being loaded at one go having around 5k exact duplicates (in all the fields).
The solution which I could think of is
1) select all the records with duplicates along with a row_number aggregate and insert into a temporary table with similar structure.
2) Now delete from Main Table all the duplicate records
3) Insert into the Main table, all the records from temporary table having row_number=1
4) drop the temporary table

This might improve the performance as deleting via 'IN' operator could take more time than just deleting all the duplicates and inserting the
unique records from temporary table. I havent tested this for performance yet. Any thoughts on this?

Abeesh
Abeeshdreams

scutter
Master
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: De-Duplication in Vertica

Post by scutter » Sat Nov 17, 2012 1:42 pm

Ideally you would de-dup during the load process on the way into the database, so that those 10k batches of records don't have any duplicates.
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

NorbertKrupa
GURU
GURU
Posts: 527
Joined: Tue Oct 22, 2013 9:36 pm
Location: Chicago, IL
Contact:

Re: De-Duplication in Vertica

Post by NorbertKrupa » Thu Apr 24, 2014 2:27 am

scutter wrote:you can use the undocumented 'epoch' column.
You are just a treasure trove of undocumented goodies.
Checkout vertica.tips for more Vertica resources.

Post Reply

Return to “Vertica SQL”