Vertica analytics

Moderator: NorbertKrupa

Post Reply
malargopal
Newbie
Newbie
Posts: 23
Joined: Mon Feb 04, 2013 10:54 am

Vertica analytics

Post by malargopal » Thu Jul 18, 2013 8:29 am

Hi

Can any one pls tell me the crisp about

1. Gapfilling
2.Interpolation
3.Sessionization
4.Why vertica is called as in built analytics.


Thanks in advance
Malar

User avatar
BoMBaY
Beginner
Beginner
Posts: 26
Joined: Tue Jul 16, 2013 5:45 am

Re: Vertica analytics

Post by BoMBaY » Thu Jul 18, 2013 10:19 am

1 & 2 : Gap Filling and Interpolation (GFI)

To fills in missing data points and adds new (missing) data points within a range of known data points to the output.

Example
The following table, which contains two input rows five seconds apart, at 3:00:00 and 3:00:05.

Code: Select all

=> SELECT * FROM TickStore;

         ts          | symbol | bid

---------------------+--------+------

 2009-01-01 03:00:00 | XYZ    |   10

 2009-01-01 03:00:05 | XYZ    | 10.5

(2 rows)
Given those two inputs, how would you determine a bid price that fell between the two points, such as at 3:00:03 PM?

There are two Interpolation. (Constant Interpolation and Linear Interpolation)

Constant Interpolation

Image

Linear Interpolation

Image

The following is a side-by-side comparison of constant and linear interpolation schemes.

ImageImage

For more information, please follow the link https://my.vertica.com/docs/6.1.x/HTML/ ... #13585.htm

3 : Sessionization

Sessionization, a special case of event-based windows, is a feature often used to analyze click streams, such as identifying web browsing sessions from recorded web clicks.
In HP Vertica, given an input clickstream table, where each row records a Web page click made by a particular user (or IP address), the sessionization computation attempts to identify Web browsing sessions from the recorded clicks by grouping the clicks from each user based on the time-intervals between the clicks. If two clicks from the same user are made too far apart in time, as defined by a time-out threshold, the clicks are treated as though they are from two different browsing sessions.

Example

The input table WebClicks contains the following rows:

Code: Select all

=> SELECT * FROM WebClicks;

 userId |      timestamp

--------+---------------------

      1 | 2009-12-08 15:00:00

      1 | 2009-12-08 15:00:25

      1 | 2009-12-08 15:00:45

      1 | 2009-12-08 15:01:45

      2 | 2009-12-08 15:02:45

      2 | 2009-12-08 15:02:55

      2 | 2009-12-08 15:03:55

(7 rows)
In the following query, sessionization performs computation on the SELECT list columns, showing the difference between the current and previous timestamp value using LAG(). It evaluates to true and increments the window ID when the difference is greater than 30 seconds.

Code: Select all

=> SELECT userId, timestamp, 

   CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds') 

     OVER(PARTITION BY userId ORDER BY timestamp) AS session FROM WebClicks;

 userId |      timestamp      | session

--------+---------------------+---------

      1 | 2009-12-08 15:00:00 |       0

      1 | 2009-12-08 15:00:25 |       0

      1 | 2009-12-08 15:00:45 |       0

      1 | 2009-12-08 15:01:45 |       1

      2 | 2009-12-08 15:02:45 |       0

      2 | 2009-12-08 15:02:55 |       0

      2 | 2009-12-08 15:03:55 |       1

(7 rows)
In the output, the session column contains the window ID from the CONDITIONAL_TRUE_EVENT function. The window ID evaluates to true on row 4 (timestamp 15:01:45), and the ID that follows row 4 is zero because it is the start of a new partition (for user ID 2), and that row does not evaluate to true until the last line in the output.

You might want to give users different time-out thresholds. For example, one user might have a slower network connection or be multi-tasking, while another user might have a faster connection and be focused on a single Web site, doing a single task.

To compute an adaptive time-out threshold based on the last 2 clicks, use CONDITIONAL_TRUE_EVENT with LAG to return the average time between the last 2 clicks with a grace period of 3 seconds:

Code: Select all

SELECT userId, timestamp, 

CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) >

(LAG(timestamp, 1) - LAG(timestamp, 3)) / 2 + '3 seconds') 

OVER(PARTITION BY userId ORDER BY timestamp) AS session

FROM WebClicks;

 userId |      timestamp      | session

--------+---------------------+---------

      2 | 2009-12-08 15:02:45 |       0

      2 | 2009-12-08 15:02:55 |       0

      2 | 2009-12-08 15:03:55 |       0

      1 | 2009-12-08 15:00:00 |       0

      1 | 2009-12-08 15:00:25 |       0

      1 | 2009-12-08 15:00:45 |       0

      1 | 2009-12-08 15:01:45 |       1

(7 rows)
4 : Why vertica is called as in built analytics.

I'm not sure that I understand this question clearly. I will try to answer.

Because HP Vertica analytics support SQL Analytic functions based on the ANSI 99 standard and also have HP Vertica SQL extensions.

Standard:

- FIRST_VALUE(arguments): Allows the selection of the first value of a table or partition without having to use a self-join
- MEDIAN(arguments): Returns the middle value from a set of values
- NTILE(value): Equally divides the data set into a {value} number of subsets (buckets)
- RANK(): Assigns a rank to each row returned from the query with respect to the other ordered rows
- STDDEV(arguments): Computes the statistical sample standard deviation of the current row with respect to a group of rows
- AVG(arguments): Computes an average of an expression in a group of rows

HP Vertica SQL extensions:

Event Based Windows

- CONDITIONAL_CHANGE_EVENT(expr): assigns an event window number to each row, starting from 0, and increments by 1 when the result of evaluating the argument expression on the current row differs from that on the previous row. This function is similar to the analytic function ROW_NUMBER, which assigns a unique number, sequentially, starting from 1, to each row within a partition.
- CONDITIONAL_TRUE_EVENT(boolean_expr): assigns an event window number to each row, starting from 0, and increments the number by 1 when the result of the boolean argument expression evaluates true.

Timeseries

- TIME_SLICE (expression, slice_length, [time_unit], [start_or_end]): Aggregates data by different fixed-time intervals and returns a rounded-up input TIMESTAMP value to a value that corresponds with the start or end of the time slice interval.
- Gap Filling and Interpolation (GFI)

Event Series Joins
- INTERPOLATE: Used to join two event series together using some ordered attribute, event series joins let you compare values from two series directly, rather than having to normalize the series to the same measurement interval.

Pattern Matching
- MATCH Clause: A SQL extension that lets you screen large amounts of historical data in search of event patterns.

Hope this helps.
Itipong Chewinpipat (Bay)
DBA Specialist (Vertica/Oracle)

Image
ImageImage

malargopal
Newbie
Newbie
Posts: 23
Joined: Mon Feb 04, 2013 10:54 am

Re: Vertica analytics

Post by malargopal » Fri Jul 19, 2013 7:36 am

Thanks a lot for the good explanation :)

User avatar
nnani
Master
Master
Posts: 302
Joined: Fri Apr 13, 2012 6:28 am
Contact:

Re: Vertica analytics

Post by nnani » Fri Jul 19, 2013 9:28 am

Real Good explaination :)

Thanks Itipong.
nnani........
Long way to go

You can check out my blogs at vertica-howto

User avatar
becky
Intermediate
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Re: Vertica analytics

Post by becky » Fri Jul 19, 2013 6:52 pm

Wow! That was a great post, BoMBaY! Thanks for sharing!
THANKS - BECKSTER

Post Reply

Return to “Vertica Analytics”