Running Count Distinct

PerS · Post by **PerS** » Wed Oct 12, 2016 1:13 pm

How many customers where active last 365 days?

We use this as a "churn" KPI.
Say 3,456,466 customers made at least one purchase between yesterday - 365 and yesterday.
Now we run the following query every day just to see for one day at a time or populate a spreadsheet:

SELECT
COUNT(DISTINCT CUSTOMER_KEY) AS "Customers"
FROM F_DAY_SALES
WHERE DATE_KEY BETWEEN CURRENT_DATE - 366 AND CURRENT_DATE - 1

COUNT DISTINCT don't take the UNBOUNDED PRECEDING | FOLLOWING (?)

Is there a way to get something like

Date, Customers
2016-01-01 3,456,466
2016-01-02 3,456,578
2016-01-03 3,457,897
.
.
.
Where the numbers are COUNT(DISTINCT CUSTOMER_KEY) OVER (PARTITION BY DATE_KEY ORDER BY DATE_KEY ROWS BETWEEN 365 PRECEDING AND 0 FOLLOWING)?

Best regards!
P

scutter · Post by **scutter** » Wed Oct 12, 2016 1:56 pm

I just submitted this exact request a few weeks ago through Product Management. The more customers who request it, the more likely it is to be implemented. Open a ticket asking for it.

—Sharon

PerS · Post by **PerS** » Wed Oct 12, 2016 2:01 pm

Post by **JimKnicely** » Wed Oct 12, 2016 2:33 pm

Hi,

Are you sure you want to use an analytic function? Looking at your expected output, it seems you just want a distinct customer count for every date. Maybe just using a simple GROUP BY is sufficient?

Example:

Code: Select all

dbadmin=> select * from F_DAY_SALES order by 1;
 date_key | customer_key
----------+--------------
 20160101 |            1
 20160101 |            1
 20160101 |            1
 20160101 |            2
 20160101 |            3
 20160101 |            4
 20160102 |            1
 20160102 |            5
 20160103 |            1
 20160103 |            2
 20160103 |            2
 20160103 |            3
 20160103 |            3
(13 rows)

dbadmin=> select to_date(date_key::varchar, 'YYYYMMDD') date, count(distinct customer_key) customers from F_DAY_SALES where to_date(date_key::varchar, 'YYYYMMDD') between add_months((now()-1)::date, -12) and (now()-1)::date group by date_key order by 1;
    date    | customers
------------+-----------
 2016-01-01 |         4
 2016-01-02 |         2
 2016-01-03 |         3
(3 rows)

Note that I used the ADD_MONTHS function to look back one year (instead of subtracting 365) to account for leap years. You may not want this, but just a thought.

Post by **JimKnicely** » Wed Oct 12, 2016 7:39 pm

Unless you are looking to get a "running" distinct count of customer keys. Maybe you can try a query in the example below. It is using a 10 day window to keep the result set small.

Example:

Code: Select all

dbadmin=> select * from F_DAY_SALES order by 1;
 date_key | customer_key | date_key_date
----------+--------------+---------------
 20161004 |            1 | 2016-10-04
 20161004 |            1 | 2016-10-04
 20161004 |            2 | 2016-10-04
 20161004 |            3 | 2016-10-04
 20161005 |            3 | 2016-10-05
 20161006 |            3 | 2016-10-06
 20161006 |            4 | 2016-10-06
 20161008 |            1 | 2016-10-08
 20161008 |           10 | 2016-10-08
 20161009 |           10 | 2016-10-09
 20161011 |            1 | 2016-10-11
 20161011 |            2 | 2016-10-11
 20161011 |           15 | 2016-10-11
 20161012 |          100 | 2016-10-12
(14 rows)

dbadmin=> select the_date date,
dbadmin->        customer_key
dbadmin->   from (select b.ts the_date,
dbadmin(>                count(distinct a.customer_key) customer_key
dbadmin(>           from (SELECT ts
dbadmin(>                   FROM (SELECT '1900-01-01'::timestamp AS tm
dbadmin(>                          UNION ALL
dbadmin(>                         SELECT '2100-01-01'::timestamp AS tm) AS t
dbadmin(>                 TIMESERIES ts AS '1 day' OVER (ORDER BY tm)) b
dbadmin(>           left join F_DAY_SALES a
dbadmin(>             on a.date_key_date <= b.ts
dbadmin(>          group by b.ts) foo
dbadmin->  where foo.the_date between sysdate::date-10 and sysdate::date
dbadmin->  order by foo.the_date;
        date         | customer_key
---------------------+--------------
 2016-10-03 00:00:00 |            0
 2016-10-04 00:00:00 |            3
 2016-10-05 00:00:00 |            3
 2016-10-06 00:00:00 |            4
 2016-10-07 00:00:00 |            4
 2016-10-08 00:00:00 |            5
 2016-10-09 00:00:00 |            5
 2016-10-10 00:00:00 |            5
 2016-10-11 00:00:00 |            6
 2016-10-12 00:00:00 |            7
(10 rows)

I think this works and is what you are looking for? You can modify the the predicate "where foo.the_date between sysdate::date-10 and sysdate::date" to be "where foo.the_date between sysdate::date-365 and sysdate::date" or to what ever you want.

This query might not perform all that well on large data sets. I'm with Sharon in that it will be nice when the analytic functions support the distinct key word!

Running Count Distinct

Running Count Distinct

Re: Running Count Distinct

Re: Running Count Distinct

Re: Running Count Distinct

Re: Running Count Distinct