Page 1 of 1

Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 1:28 am
by becky
Hi all,

Does anyone have any tips or best ways to handle a COUNT DISTINCT on a very large data set (> 1 million rows in a fact table). I'm seeing my queries take > 30 seconds with the COUNT DISTINCT and about 7 seconds with a simple COUNT.

Thanks!

Re: Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 1:20 pm
by NorbertKrupa
If you're on 7.0+, you may want to look at Optimizing COUNT (DISTINCT) by Calculating Approximate Counts.

Re: Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 1:42 pm
by BoMBaY
There is the good article about count distinct on the link below.

Avoiding the OLAP Cliff for Count Distinct Queries in Vertica
http://www.vertica.com/2014/03/05/avoid ... n-vertica/

Re: Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 1:53 pm
by NorbertKrupa
That article eludes to using an approximate count distinct.

Re: Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 2:13 pm
by becky
Thanks, guys.

I tried the approximate count distinct function and the query never returns! Well, I let it run for about two minutes and just canceled it. The COUNT DISTINCT is faster in my database. Is that weird?

:o :shock: :o

Re: Performance tuning a COUNT DISTINCT

Posted: Wed Sep 17, 2014 2:47 pm
by NorbertKrupa
What does your query look like?

Re: Performance tuning a COUNT DISTINCT

Posted: Tue Sep 30, 2014 3:47 pm
by becky
Anyone know what these parameters do?

Code: Select all

dbadmin=> select parameter_name, description from configuration_parameters where parameter_name ilike '%distinct%';
         parameter_name         |                                                       description
--------------------------------+-------------------------------------------------------------------------------------------------------------------------
 DBDCountDistinctSampleRowCount | Number of rows to be sampled for count distinct analysis during DBD design. Default: 0 Use DBDCountDistinctSampleRowPct
 DBDCountDistinctSampleRowPct   | Percentage of rows to be sampled for count distinct analysis during DBD design. Default: 100
(2 rows)