De-Duplication in Vertica

Moderator: NorbertKrupa

Rick
Newbie
Newbie
Posts: 23
Joined: Thu Jul 26, 2012 2:46 pm

De-Duplication in Vertica

Post by Rick » Thu Aug 02, 2012 3:23 pm

Hey, so i currently work with a very large database that takes in around 20 million records a day at 15 minute intervals (which vertica is ideal for).

One issue that i've come across, is that sometimes (for reasons out of vertica's control), duplicate data comes in. The solution for this that i've come up with works, but it's rather crude and i'm sure there are better ideas out there. It involves locating the duplicate records, isolating one, copying it to a file, removing all the records, and then putting the record back in the table at the end of it all. The process currently looks like this:

Code: Select all

\a
\t
\o duplicate_row.tbl 

select * from 
(select z_timestamp, datetime, moid, counter1, counter2, counter3 from (select rank() over(partition by datetime, moid order by counter1 desc) 
as duprank, * from schema1.duplicate_table) 
as ordered 
where ordered.duprank = 2)
as test2;

\o


delete from schema1.duplicate_table where (z_timestamp,moid,datetime) in (select z_timestamp, moid, datetime from 
(select row_number() over(partition by datetime, moid order by datetime) 
as duprank,moid, datetime, z_timestamp from schema1.duplicate_table) 
as ordered
where ordered.duprank > 1);


\set t_pwd `pwd`
\set input_file '''':t_pwd'/duplicate_row.tbl'''
COPY schema1.duplicate_table FROM :input_file DELIMITER '|' NULL '' DIRECT;
I was wondering if anybody can think of better solutions? It would be extremely useful if there was a delete-all-but-one type of function :(

For me, performance is the most important factor due to working with such a large database.

Cheers

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: De-Duplication in Vertica

Post by JimKnicely » Thu Aug 02, 2012 3:54 pm

What is the PK on the table having the duplicates?

Does this example help where the PK on the test table is col1?

Code: Select all

dbadmin=> select * from test;
 col1 | col2
------+-------
    1 | Jim
    2 | Jim
    3 | Jane
    4 | Helen
    5 | Jim
(5 rows)

dbadmin=> delete from test
dbadmin->  where (col1, col2) NOT IN (select max(col1), col2
dbadmin(>                               from test
dbadmin(>                              group
dbadmin(>                                 by col2);
 OUTPUT
--------
      2
(1 row)

dbadmin=> select * from test;
 col1 | col2
------+-------
    3 | Jane
    4 | Helen
    5 | Jim
(3 rows)

dbadmin=>

Maybe you can use your counter1 column in place of the col1 column in the example shown above?

That is, something like this?

Code: Select all

delete from schema1.duplicate_table
 where (datetime, moid, counter) not in (select datetime, moid, max(counter)
                                           from schema1.duplicate_table
                                          group
                                             by datetime, moid);
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Rick
Newbie
Newbie
Posts: 23
Joined: Thu Jul 26, 2012 2:46 pm

Re: De-Duplication in Vertica

Post by Rick » Thu Aug 02, 2012 3:58 pm

Ah, I posted old code. The first part where it uses a rank() function is incorrect, that should also use row_number() (as with the second function), as with exact duplicates it is likely that the counter1 field will be exactly the same, so ordering by counter1 in a rank function will not be possible. The data that comes in may look something like this:

z_timestamp | datetime | MoID |C1 |C2 | C3
2012-01-01 12:00:00|2012-01-01 11:00:00|RNC-A|300|500|123
2012-01-01 14:00:00|2012-01-01 11:00:00|RNC-B|500|700|956
2012-01-01 12:00:00|2012-01-01 11:00:00|RNC-C|100|300|789
2012-01-01 13:00:00|2012-01-01 12:00:00|RNC-A|400|600|234
2012-01-01 13:00:00|2012-01-01 12:00:00|RNC-B|500|700|345
2012-01-01 13:00:00|2012-01-01 12:00:00|RNC-C|600|800|456
2012-01-01 14:00:00|2012-01-01 11:00:00|RNC-B|500|700|956
2012-01-01 14:00:00|2012-01-01 13:00:00|RNC-A|500|700|334
2012-01-01 14:00:00|2012-01-01 13:00:00|RNC-B|600|800|445
2012-01-01 14:00:00|2012-01-01 13:00:00|RNC-C|700|900|556
2012-01-01 14:00:00|2012-01-01 11:00:00|RNC-B|500|700|956

Primary keys are not used, due to the way data is loaded and the way the schema is set up. This is why the de-duplication is much more difficult than 'traditional' methods, and why I currently use the row_number() aggregate function. But there has to be a better way than writing one of the duplicates to a file every time!

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: De-Duplication in Vertica

Post by JimKnicely » Thu Aug 02, 2012 4:12 pm

How about this crazy idea? :)

Code: Select all

dbadmin=> select * from test;
 col | col2
-----+-------
   1 | Jim
   1 | Jim
   1 | Jane
   1 | Helen
   1 | Jim
(5 rows)

dbadmin=> create table test_bk as select distinct * from test;
CREATE TABLE
dbadmin=> drop table test;
DROP TABLE
dbadmin=> alter table test_bk rename to test;
ALTER TABLE
dbadmin=> select * from test;
 col | col2
-----+-------
   1 | Helen
   1 | Jane
   1 | Jim
(3 rows)
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

User avatar
becky
Intermediate
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Re: De-Duplication in Vertica

Post by becky » Thu Aug 02, 2012 4:20 pm

Too bad there isn't the rowid pseudo-column in Vertica like in Oracle! It's easy to delete duplicates using that column.
THANKS - BECKSTER

Rick
Newbie
Newbie
Posts: 23
Joined: Thu Jul 26, 2012 2:46 pm

Re: De-Duplication in Vertica

Post by Rick » Thu Aug 02, 2012 4:23 pm

That would work, however I didn't explain my situation fully (My apoligies). Copying just duplicate records to a file is far quicker than writing 20+ million rows to a table, then dropping a table with 20 million+ rows (and often 500+ columns)

Also, as the tables are constantly being loaded throughout the day it's not practical to drop the main tables. (As I say the tables are loaded every 15 minutes, often more than that, with around 1.2million records an hour on the biggest interfaces)

Finding the duplicates is not the issue so much (this is fairly simple), the issue lies in deleting them all but one.

@becky, Yeah, that's exactly what I need right now. Pity Oracle is far too slow for the ammounts of data being loaded :P

Also, must say great work on the forum, this is the first of it's kind i've seen for Vertica, and no doubt as it gets larger it will become an invaluable source.

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: De-Duplication in Vertica

Post by JimKnicely » Thu Aug 02, 2012 5:32 pm

This method should be faster than writing to a file:

Example:

Code: Select all

dbadmin=> select * from test;
 col1 | col2
------+-------
    1 | Jane
    1 | Jim
    1 | Jim
    1 | Jim
    1 | Helen
(5 rows)

dbadmin=> alter table test add column keep_col int;
ALTER TABLE
dbadmin=> insert into test (select distinct col1, col2, 1 from test);
 OUTPUT
--------
      3
(1 row)

dbadmin=> delete from test where keep_col is null;
 OUTPUT
--------
      5
(1 row)

dbadmin=> alter table test drop column keep_col;
ALTER TABLE
dbadmin=> select * from test;
 col1 | col2
------+-------
    1 | Jane
    1 | Jim
    1 | Helen
(3 rows)
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Post Reply

Return to “Vertica SQL”