One issue I've come across is that sometimes (for reasons outside Vertica's control) duplicate data comes in. The solution I've come up with works, but it's rather crude, and I'm sure there are better ideas out there. It involves locating the duplicate records, isolating one copy, writing it out to a file, deleting the records from the table (the duplicates match on every key column, so a keyed DELETE removes all copies at once), and then putting the saved record back into the table at the end. The process currently looks like this:
Code: Select all
\a
\t
\o duplicate_row.tbl
select * from
(select z_timestamp, datetime, moid, counter1, counter2, counter3 from
  -- row_number() rather than rank(): if the duplicates tie exactly on
  -- counter1, rank() emits 1,1 and there is no duprank = 2 row to save,
  -- so the DELETE below would destroy every copy with nothing written out
  (select row_number() over(partition by datetime, moid order by counter1 desc)
   as duprank, * from schema1.duplicate_table)
as ordered
where ordered.duprank = 2)
as test2;
\o
delete from schema1.duplicate_table where (z_timestamp, moid, datetime) in
(select z_timestamp, moid, datetime from
  (select row_number() over(partition by datetime, moid order by datetime)
   as duprank, moid, datetime, z_timestamp from schema1.duplicate_table)
as ordered
where ordered.duprank > 1);
\set t_pwd `pwd`
\set input_file '''':t_pwd'/duplicate_row.tbl'''
COPY schema1.duplicate_table FROM :input_file DELIMITER '|' NULL '' DIRECT;
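To make the save/delete/re-insert roundtrip above easier to reason about, here is a small plain-Python simulation of the same three steps (this is only an illustration of the logic, not Vertica code; the rows, column values, and key choice are made up):

```python
# Simulate the three-step dedup: rows are (z_timestamp, datetime, moid, counter1)
# tuples, and duplicates match on every key column, so deleting by key
# removes *all* copies -- which is why one copy must be saved first.
from collections import Counter

table = [
    (100, "2013-01-01 00:00", "moid-A", 7),
    (100, "2013-01-01 00:00", "moid-A", 7),   # exact duplicate
    (200, "2013-01-01 00:05", "moid-B", 3),   # unique row
]

def key(row):
    # mirrors the DELETE's (z_timestamp, moid, datetime) key
    return (row[0], row[2], row[1])

# Step 1: save one copy of each duplicated row (the "\o duplicate_row.tbl" select).
counts = Counter(key(r) for r in table)
saved = {}
for r in table:
    if counts[key(r)] > 1:
        saved.setdefault(key(r), r)

# Step 2: delete by key -- every copy of a duplicated row is removed,
# because all copies share the same key values.
dup_keys = {k for k, n in counts.items() if n > 1}
table = [r for r in table if key(r) not in dup_keys]

# Step 3: re-insert the saved copies (the COPY ... FROM file).
table.extend(saved.values())
```

After the three steps the table holds exactly one copy of the formerly duplicated row plus the untouched unique row, which is the net effect the vsql script is after.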
For me, performance is the most important factor, since I'm working with such a large database.
Cheers