How to delete duplicate rows

vicky_tsc · Post by **vicky_tsc** » Mon Sep 23, 2013 6:32 am

How to delete the duplicate rows from he following table,

id | name | sal | location
----+------------+-------+----------
1 | Tom | 50000 | NJ
1 | Tom | 50000 | NJ
2 | Jerry | 40000 | NJ
2 | jerry | 40000 | NJ
3 | Tim | 40000 | NJ
4 | Jim | 20000 |NJ

I tried with following Query,
DELETE FROM syn.sales WHERE id NOT IN (SELECT MIN(id) _
FROM syn.sales GROUP BY name)

nnani · Post by **nnani** » Mon Sep 23, 2013 8:08 am

Hello vicky_tsc.

You can do something like this
First create a temporary table with all the data which is latest and single instance of each record, so no duplicates exist

Code: Select all

create temporary table  Loc_temp_sales on commit preserve rows as
select 
id
, name 
, sal 
, location
from 
(select 
id
, name 
, sal 
, location
ROW_NUMBER() OVER(PARTITION BY name, location, sal  ORDER BY  id ASC) as rnk
from syn.sales) sub
where sub.rnk=1

Then Truncate the existing table syn.sales

Code: Select all

TRUNCATE TABLE syn.sales;

Then Load the table syn.sales with data from Temporary table created earlier

Code: Select all

Insert into syn.sales
select * form Loc_temp_sales ;

This way you can load your target table with all unique records and delete all duplicates.

There are other alternative ways of removing duplicates from a table.

Hope this helps..

vicky_tsc · Post by **vicky_tsc** » Mon Sep 23, 2013 9:24 am

Thank you nnaani. Is there any way without creating temp table?. I want by using queries as we use in oracle and sql.

id10t · Post by **id10t** » Mon Sep 23, 2013 12:35 pm

Hi!

>> Is there any way without creating temp table?
I don't think so.
Oracle is row engine database so it easy to support concepts like ROWID, while Vertica is columnar database and "relation/row recognition" it's a big challenge for columnar RDBMS and main impact of "relation recognition" has on load/insert and update/delete. Vertica did a trade off: fast loads over deletes (as should be in DWH) - no rejects on constraint/s violation while data are loaded. As Vertica best practice is: prepare data before loads.

There are only one option: duplicated rows inserted not at same epoch/AHM/time.

Workaround with epochs:

Code: Select all

daniel=> select * from vicky_tsc ;
 id | name  |  sal  | location 
----+-------+-------+----------
  1 | Tom   | 50000 | NJ
  2 | Jerry | 40000 | NJ
  3 | Tim   | 40000 | NJ
  4 | Jim   | 20000 | NJ
(4 rows)

Code: Select all

daniel=> copy vicky_tsc from stdin direct;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Tom|50000|NJ
>> 2|Jerry|40000|NJ
>> \.
daniel=> select epoch, * from vicky_tsc ;
 epoch | id | name  |  sal  | location 
-------+----+-------+-------+----------
  1358 |  1 | Tom   | 50000 | NJ
  1358 |  2 | Jerry | 40000 | NJ
  1358 |  3 | Tim   | 40000 | NJ
  1358 |  4 | Jim   | 20000 | NJ
  1359 |  1 | Tom   | 50000 | NJ
  1359 |  2 | Jerry | 40000 | NJ
(6 rows)

Code: Select all

daniel=> delete from vicky_tsc where epoch = 1359;
 OUTPUT 
--------
      2
(1 row)
daniel=> select * from vicky_tsc ;
 id | name  |  sal  | location 
----+-------+-------+----------
  1 | Tom   | 50000 | NJ
  2 | Jerry | 40000 | NJ
  3 | Tim   | 40000 | NJ
  4 | Jim   | 20000 | NJ
(4 rows)

Of cause you can implement "epoch" column explicitly: add to table column when data is inserted and delete dups by aka Slowly Changing Dimension Type II.
http://en.wikipedia.org/wiki/Slowly_cha ... on#Type_II

vicky_tsc · Post by **vicky_tsc** » Tue Sep 24, 2013 6:44 am

This is small table and the epoch can be used, but if the table is huge and has many duplicates in the table.then how to handle it.

pborne · Post by **pborne** » Fri Jan 03, 2014 6:02 pm

When the table is huge, divide and conquer the problem, as usual.

If you can't do it in one pass, do it in two passes. If you can't do it in two passes, do it in four passes etc.

Once you have your n result set, do a union two by two, wash rinse repeat.

How to delete duplicate rows

How to delete duplicate rows

Re: How to delete duplicate rows

Re: How to delete duplicate rows

Re: How to delete duplicate rows

Re: How to delete duplicate rows

Re: How to delete duplicate rows