De-Duplication in Vertica

Post by **JimKnicely** » Thu Aug 02, 2012 6:04 pm

Rick,

User 2 User Support! That's the goal!

How about this option to get rid of dups:

I created a table with 15,000,000 records having a lot of duplicates. I added a temporary column to the table to store an indicator that I want to keep the row. I then inserted all the non-duplicates into the same table giving the temporary column a value of 1 for these rows. Finally, I deleted all records where the keep column is null and then dropped the temporary column. My table was left with all unique rows.

Example:

Code: Select all

dbadmin=> create table test (col1 int, col2 int);
CREATE TABLE

Time: First fetch (0 rows): 16.203 ms. All rows formatted: 16.217 ms

dbadmin=> insert into test (select randomint(5), randomint(5) from tables cross join columns cross join tables t2 limit 15000000);
  OUTPUT
----------
 15000000
(1 row)

Time: First fetch (1 row): 8180.880 ms. All rows formatted: 8180.911 ms

dbadmin=> select col1, col2, count(*)
dbadmin->   from test
dbadmin->  group by col1, col2
dbadmin->  order by col1, col2;
 col1 | col2 | count
------+------+--------
    0 |    0 | 600398
    0 |    1 | 599744
    0 |    2 | 601643
    0 |    3 | 600217
    0 |    4 | 600686
    1 |    0 | 599671
    1 |    1 | 599769
    1 |    2 | 599408
    1 |    3 | 598887
    1 |    4 | 599205
    2 |    0 | 600508
    2 |    1 | 599828
    2 |    2 | 599882
    2 |    3 | 599538
    2 |    4 | 598461
    3 |    0 | 599772
    3 |    1 | 600063
    3 |    2 | 599769
    3 |    3 | 600085
    3 |    4 | 599001
    4 |    0 | 601377
    4 |    1 | 600934
    4 |    2 | 601486
    4 |    3 | 599584
    4 |    4 | 600084
(25 rows)

Time: First fetch (25 rows): 1028.894 ms. All rows formatted: 1028.966 ms

dbadmin=> alter table test add column keep_row int;
ALTER TABLE
Time: First fetch (0 rows): 3198.575 ms. All rows formatted: 3198.590 ms
dbadmin=> insert into test (select distinct col1, col2, 1 from test);
 OUTPUT
--------
     25
(1 row)

Time: First fetch (1 row): 315.839 ms. All rows formatted: 315.870 ms

dbadmin=> delete from test where keep_row is null;
  OUTPUT
----------
 15000000
(1 row)

Time: First fetch (1 row): 1214.642 ms. All rows formatted: 1214.727 ms

dbadmin=> alter table test drop column keep_row;
ALTER TABLE
Time: First fetch (0 rows): 2664.760 ms. All rows formatted: 2664.774 ms

dbadmin=> select * from test;
 col1 | col2
------+------
    0 |    4
    1 |    0
    1 |    1
    1 |    2
    2 |    1
    3 |    0
    3 |    1
    4 |    0
    4 |    2
    4 |    4
    1 |    4
    2 |    2
    3 |    3
    3 |    4
    4 |    1
    0 |    0
    0 |    1
    0 |    2
    0 |    3
    1 |    3
    2 |    0
    2 |    3
    2 |    4
    3 |    2
    4 |    3
(25 rows)

Time: First fetch (25 rows): 27.760 ms. All rows formatted: 27.851 ms

That all happens pretty quickly!

jpcavanaugh · Post by **jpcavanaugh** » Thu Aug 02, 2012 6:44 pm

Do the duplicates happen within a specified period of time? For instance, a single day? The reason I ask is you can pull all the values from a partition into a temp table, drop the partition and move the data back in from the temp table.

Rick · Post by **Rick** » Fri Aug 03, 2012 9:21 am

@ Jim, hmm interesting, could work, i'll try it out and let you know, cheers. The only issue I have is that adding a column to the table, even for a short ammount of time may effect the loading of data to the table (as I said before the tables are being constantly loaded). Also, with the tables sometimes having 1,000+ columns this may be a factor too. I'm not sure this will be quicker than deleting only the duplicates to a file, as millions of records will be copied into the table (remember were talking 1000+ column tables in some cases). But I will definately test your idea as it has the potential to work, and is exactly the sort of thing i'm looking for (just new ideas!), cheers.

@ junior, They usually come in every 15 minutes, but are often loaded in 5 minute intervals (as a large ammount of data is coming in)

jpcavanaugh · Post by **jpcavanaugh** » Mon Aug 06, 2012 3:42 pm

When you are testing try this pattern to see how it works for you.

-Partition your data by day
-Load data into base table as it comes in
-On some interval, copy all the data from a partition into a temp table (only pulling non-duplicate rows
-Drop partition
-Copy from temp table into base table

scutter · Post by **scutter** » Tue Aug 07, 2012 2:17 am

Can you de-duplicate as a separate step during the load process for duplicate records within the same batch. For removing records from different loads, and thus different epochs, you can use the undocumented 'epoch' column.

select *,epoch from duplicate_table;

Julie · Post by **Julie** » Wed Aug 08, 2012 1:36 pm

scutter,

Thanks for the tip!

you can use the undocumented 'epoch' column.

How does this column get populated? If I run the test below, the epoch column is NULL:

Code: Select all

dbadmin=> create table t (c int);
CREATE TABLE
dbadmin=> insert into t values (1);
 OUTPUT
--------
      1
(1 row)

dbadmin=> insert into t values (2);
 OUTPUT
--------
      1
(1 row)

dbadmin=> select c, epoch from t;
 c | epoch
---+-------
 1 |
 2 |
(2 rows)

scutter · Post by **scutter** » Thu Aug 16, 2012 9:50 pm

The epoch column is the epoch when the data was committed. It will be null for uncommitted data.

scutter=> create table scutter.t1(c integer);
CREATE TABLE
scutter=> insert into scutter.t1 values(1);
OUTPUT
--------
1
(1 row)

scutter=> select c,epoch from scutter.t1;
c | epoch
---+-------
1 |
(1 row)

scutter=> commit;
COMMIT
scutter=> insert into scutter.t1 values(2);
OUTPUT
--------
1
(1 row)

scutter=> select c,epoch from scutter.t1;
c | epoch
---+----------
1 | 26731459
2 |
(2 rows)

scutter=> commit;
COMMIT
scutter=> select c,epoch from scutter.t1;
c | epoch
---+----------
1 | 26731459
2 | 26731484
(2 rows)

(other commits by other sessions occurred between my two COMMITs)

De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica

Re: De-Duplication in Vertica