Running Count Distinct

Moderator: NorbertKrupa

Post Reply
PerS
Newbie
Newbie
Posts: 4
Joined: Fri Sep 16, 2016 8:15 am

Running Count Distinct

Post by PerS » Wed Oct 12, 2016 1:13 pm

How many customers where active last 365 days?

We use this as a "churn" KPI.
Say 3,456,466 customers made at least one purchase between yesterday - 365 and yesterday.
Now we run the following query every day just to see for one day at a time or populate a spreadsheet:

SELECT
COUNT(DISTINCT CUSTOMER_KEY) AS "Customers"
FROM F_DAY_SALES
WHERE DATE_KEY BETWEEN CURRENT_DATE - 366 AND CURRENT_DATE - 1

COUNT DISTINCT don't take the UNBOUNDED PRECEDING | FOLLOWING (?)

Is there a way to get something like

Date, Customers
2016-01-01 3,456,466
2016-01-02 3,456,578
2016-01-03 3,457,897
.
.
.
Where the numbers are COUNT(DISTINCT CUSTOMER_KEY) OVER (PARTITION BY DATE_KEY ORDER BY DATE_KEY ROWS BETWEEN 365 PRECEDING AND 0 FOLLOWING)?

Best regards!
P

scutter
Master
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Running Count Distinct

Post by scutter » Wed Oct 12, 2016 1:56 pm

I just submitted this exact request a few weeks ago through Product Management. The more customers who request it, the more likely it is to be implemented. Open a ticket asking for it.

—Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

PerS
Newbie
Newbie
Posts: 4
Joined: Fri Sep 16, 2016 8:15 am

Re: Running Count Distinct

Post by PerS » Wed Oct 12, 2016 2:01 pm

:D

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: Running Count Distinct

Post by JimKnicely » Wed Oct 12, 2016 2:33 pm

Hi,

Are you sure you want to use an analytic function? Looking at your expected output, it seems you just want a distinct customer count for every date. Maybe just using a simple GROUP BY is sufficient?

Example:

Code: Select all

dbadmin=> select * from F_DAY_SALES order by 1;
 date_key | customer_key
----------+--------------
 20160101 |            1
 20160101 |            1
 20160101 |            1
 20160101 |            2
 20160101 |            3
 20160101 |            4
 20160102 |            1
 20160102 |            5
 20160103 |            1
 20160103 |            2
 20160103 |            2
 20160103 |            3
 20160103 |            3
(13 rows)

dbadmin=> select to_date(date_key::varchar, 'YYYYMMDD') date, count(distinct customer_key) customers from F_DAY_SALES where to_date(date_key::varchar, 'YYYYMMDD') between add_months((now()-1)::date, -12) and (now()-1)::date group by date_key order by 1;
    date    | customers
------------+-----------
 2016-01-01 |         4
 2016-01-02 |         2
 2016-01-03 |         3
(3 rows)
Note that I used the ADD_MONTHS function to look back one year (instead of subtracting 365) to account for leap years. You may not want this, but just a thought.
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: Running Count Distinct

Post by JimKnicely » Wed Oct 12, 2016 7:39 pm

Unless you are looking to get a "running" distinct count of customer keys. Maybe you can try a query in the example below. It is using a 10 day window to keep the result set small.

Example:

Code: Select all

dbadmin=> select * from F_DAY_SALES order by 1;
 date_key | customer_key | date_key_date
----------+--------------+---------------
 20161004 |            1 | 2016-10-04
 20161004 |            1 | 2016-10-04
 20161004 |            2 | 2016-10-04
 20161004 |            3 | 2016-10-04
 20161005 |            3 | 2016-10-05
 20161006 |            3 | 2016-10-06
 20161006 |            4 | 2016-10-06
 20161008 |            1 | 2016-10-08
 20161008 |           10 | 2016-10-08
 20161009 |           10 | 2016-10-09
 20161011 |            1 | 2016-10-11
 20161011 |            2 | 2016-10-11
 20161011 |           15 | 2016-10-11
 20161012 |          100 | 2016-10-12
(14 rows)

dbadmin=> select the_date date,
dbadmin->        customer_key
dbadmin->   from (select b.ts the_date,
dbadmin(>                count(distinct a.customer_key) customer_key
dbadmin(>           from (SELECT ts
dbadmin(>                   FROM (SELECT '1900-01-01'::timestamp AS tm
dbadmin(>                          UNION ALL
dbadmin(>                         SELECT '2100-01-01'::timestamp AS tm) AS t
dbadmin(>                 TIMESERIES ts AS '1 day' OVER (ORDER BY tm)) b
dbadmin(>           left join F_DAY_SALES a
dbadmin(>             on a.date_key_date <= b.ts
dbadmin(>          group by b.ts) foo
dbadmin->  where foo.the_date between sysdate::date-10 and sysdate::date
dbadmin->  order by foo.the_date;
        date         | customer_key
---------------------+--------------
 2016-10-03 00:00:00 |            0
 2016-10-04 00:00:00 |            3
 2016-10-05 00:00:00 |            3
 2016-10-06 00:00:00 |            4
 2016-10-07 00:00:00 |            4
 2016-10-08 00:00:00 |            5
 2016-10-09 00:00:00 |            5
 2016-10-10 00:00:00 |            5
 2016-10-11 00:00:00 |            6
 2016-10-12 00:00:00 |            7
(10 rows)
I think this works and is what you are looking for? You can modify the the predicate "where foo.the_date between sysdate::date-10 and sysdate::date" to be "where foo.the_date between sysdate::date-365 and sysdate::date" or to what ever you want.

This query might not perform all that well on large data sets. I'm with Sharon in that it will be nice when the analytic functions support the distinct key word!
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Post Reply

Return to “Vertica SQL”