How to delete the duplicate rows from he following table,
id | name | sal | location
----+------------+-------+----------
1 | Tom | 50000 | NJ
1 | Tom | 50000 | NJ
2 | Jerry | 40000 | NJ
2 | jerry | 40000 | NJ
3 | Tim | 40000 | NJ
4 | Jim | 20000 |NJ
I tried with following Query,
DELETE FROM syn.sales WHERE id NOT IN (SELECT MIN(id) _
FROM syn.sales GROUP BY name)
How to delete duplicate rows
Moderator: NorbertKrupa
Re: How to delete duplicate rows
Hello vicky_tsc.
You can do something like this
First create a temporary table with all the data which is latest and single instance of each record, so no duplicates exist
Then Truncate the existing table syn.sales
Then Load the table syn.sales with data from Temporary table created earlier
This way you can load your target table with all unique records and delete all duplicates.
There are other alternative ways of removing duplicates from a table.
Hope this helps..
You can do something like this
First create a temporary table with all the data which is latest and single instance of each record, so no duplicates exist
Code: Select all
create temporary table Loc_temp_sales on commit preserve rows as
select
id
, name
, sal
, location
from
(select
id
, name
, sal
, location
ROW_NUMBER() OVER(PARTITION BY name, location, sal ORDER BY id ASC) as rnk
from syn.sales) sub
where sub.rnk=1
Code: Select all
TRUNCATE TABLE syn.sales;
Code: Select all
Insert into syn.sales
select * form Loc_temp_sales ;
There are other alternative ways of removing duplicates from a table.
Hope this helps..
Re: How to delete duplicate rows
Thank you nnaani. Is there any way without creating temp table?. I want by using queries as we use in oracle and sql.
Re: How to delete duplicate rows
Hi!
>> Is there any way without creating temp table?
I don't think so.
Oracle is row engine database so it easy to support concepts like ROWID, while Vertica is columnar database and "relation/row recognition" it's a big challenge for columnar RDBMS and main impact of "relation recognition" has on load/insert and update/delete. Vertica did a trade off: fast loads over deletes (as should be in DWH) - no rejects on constraint/s violation while data are loaded. As Vertica best practice is: prepare data before loads.
There are only one option: duplicated rows inserted not at same epoch/AHM/time.
Workaround with epochs:
Of cause you can implement "epoch" column explicitly: add to table column when data is inserted and delete dups by aka Slowly Changing Dimension Type II.
http://en.wikipedia.org/wiki/Slowly_cha ... on#Type_II
>> Is there any way without creating temp table?
I don't think so.
Oracle is row engine database so it easy to support concepts like ROWID, while Vertica is columnar database and "relation/row recognition" it's a big challenge for columnar RDBMS and main impact of "relation recognition" has on load/insert and update/delete. Vertica did a trade off: fast loads over deletes (as should be in DWH) - no rejects on constraint/s violation while data are loaded. As Vertica best practice is: prepare data before loads.
There are only one option: duplicated rows inserted not at same epoch/AHM/time.
Workaround with epochs:
Code: Select all
daniel=> select * from vicky_tsc ;
id | name | sal | location
----+-------+-------+----------
1 | Tom | 50000 | NJ
2 | Jerry | 40000 | NJ
3 | Tim | 40000 | NJ
4 | Jim | 20000 | NJ
(4 rows)
Code: Select all
daniel=> copy vicky_tsc from stdin direct;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Tom|50000|NJ
>> 2|Jerry|40000|NJ
>> \.
daniel=> select epoch, * from vicky_tsc ;
epoch | id | name | sal | location
-------+----+-------+-------+----------
1358 | 1 | Tom | 50000 | NJ
1358 | 2 | Jerry | 40000 | NJ
1358 | 3 | Tim | 40000 | NJ
1358 | 4 | Jim | 20000 | NJ
1359 | 1 | Tom | 50000 | NJ
1359 | 2 | Jerry | 40000 | NJ
(6 rows)
Code: Select all
daniel=> delete from vicky_tsc where epoch = 1359;
OUTPUT
--------
2
(1 row)
daniel=> select * from vicky_tsc ;
id | name | sal | location
----+-------+-------+----------
1 | Tom | 50000 | NJ
2 | Jerry | 40000 | NJ
3 | Tim | 40000 | NJ
4 | Jim | 20000 | NJ
(4 rows)
http://en.wikipedia.org/wiki/Slowly_cha ... on#Type_II
Re: How to delete duplicate rows
This is small table and the epoch can be used, but if the table is huge and has many duplicates in the table.then how to handle it.
Re: How to delete duplicate rows
When the table is huge, divide and conquer the problem, as usual.
If you can't do it in one pass, do it in two passes. If you can't do it in two passes, do it in four passes etc.
Once you have your n result set, do a union two by two, wash rinse repeat.
If you can't do it in one pass, do it in two passes. If you can't do it in two passes, do it in four passes etc.
Once you have your n result set, do a union two by two, wash rinse repeat.