---
{
"title": "WINDOW_FUNCTION_WINDOW_FUNNEL",
"language": "en"
}
---
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
## WINDOW FUNCTION WINDOW_FUNNEL
### description
Searches for event chains in a sliding time window and returns the maximum number of consecutive events from the chain that were matched.
- window is the length of the time window, in seconds.
- mode can be one of the following:
    - "default": Default mode.
    - "deduplication": If an event that has already been matched in the chain occurs again, the repeated event interrupts further processing. For example, if the array parameter is [event1='A', event2='B', event3='C', event4='D'] and the original event chain is "A-B-C-B-D", event B repeats, so the matched event chain can only be "A-B-C" and the maximum event level is 3.
    - "fixed": Does not allow other events to intervene. For example, if the array parameter is [event1='A', event2='B', event3='C', event4='D'] and the original event chain is A->B->D->C, the search for A->B->C stops at D and the maximum event level is 2.
    - "increase": Applies conditions only to events with strictly increasing timestamps.
- timestamp_column is a column of DATETIME type; the sliding time window operates on it.
- eventN is a boolean expression, such as eventID = 1004.
The function works according to the following algorithm:
- The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.
- If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter is not incremented.
- If the data has multiple event chains at varying points of completion, the function will only output the size of the longest chain.
```sql
window_funnel(window, mode, timestamp_column, event1, event2, ... , eventN)
```
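Since each `eventN` argument is a boolean expression, a condition does not have to be an equality test on a single column. The sketch below assumes a table `events` with `user_id`, `event_name`, `event_timestamp`, and `phone_brand` columns (the same shape as the table created in the examples that follow); the 10-minute window and the brand filter are illustrative choices, not recommendations.

```sql
-- Assumed table: events(user_id, event_name, event_timestamp, phone_brand, ...)
SELECT
    user_id,
    window_funnel(
        600,                                             -- window length in seconds (10 minutes)
        "default",                                       -- mode
        event_timestamp,                                 -- DATETIME column the window slides over
        event_name = 'login' AND phone_brand = 'HONOR',  -- event1: compound boolean expression
        event_name = 'visit'                             -- event2
    ) AS level
FROM
    events
GROUP BY
    user_id;
```

The result has one row per `user_id`, with `level` giving the longest chain matched for that user.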
### example
#### example1: default mode
Use the ```default``` mode to find the maximum number of consecutive events for each ```user_id```, with a time window of ```5``` minutes:
```sql
CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");
INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:04:00', 'HONOR', 3),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);
SELECT
user_id,
window_funnel(
300,
"default",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
ORDER BY
user_id;
+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+
```
For ```user_id=100123```, the ```payment``` event occurs outside the time window, so the matched event chain is ```login-visit-order```.
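A common follow-up is to turn the per-user levels into a funnel report by counting how many users stopped at each level. The query below is a minimal sketch that wraps the example query in a subquery; the alias `per_user` and the column name `user_count` are made up for illustration.

```sql
-- Count users by the deepest funnel level they reached within 5 minutes.
SELECT
    level,
    COUNT(*) AS user_count
FROM (
    SELECT
        user_id,
        window_funnel(
            300,
            "default",
            event_timestamp,
            event_name = 'login',
            event_name = 'visit',
            event_name = 'order',
            event_name = 'payment'
        ) AS level
    FROM
        events
    GROUP BY
        user_id
) AS per_user
GROUP BY
    level
ORDER BY
    level;
```

Given the per-user levels shown above (3, 3, 2, 2), this report would contain two users at level 2 and two users at level 3.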
#### example2: deduplication mode
Use the ```deduplication``` mode to find the maximum number of consecutive events for each ```user_id```, with a time window of ```1``` hour:
```sql
CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");
INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'login', '2022-05-14 10:03:00', 'HONOR', 3),
(100123, 'order', '2022-05-14 10:04:00', 'HONOR', 4),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);
SELECT
user_id,
window_funnel(
3600,
"deduplication",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
ORDER BY
user_id;
+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 2 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+
```
For ```user_id=100123```, the ```login``` event occurs again after the ```visit``` event has been matched, so the matched event chain is ```login-visit```.
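To see which users are affected by the mode, one option is to compute both modes side by side over the same data. This is a sketch only: it assumes that two `window_funnel` calls can appear in the same aggregate query, and the column aliases are illustrative.

```sql
-- Compare the default and deduplication modes for each user over a 1-hour window.
SELECT
    user_id,
    window_funnel(
        3600,
        "default",
        event_timestamp,
        event_name = 'login',
        event_name = 'visit',
        event_name = 'order',
        event_name = 'payment'
    ) AS level_default,
    window_funnel(
        3600,
        "deduplication",
        event_timestamp,
        event_name = 'login',
        event_name = 'visit',
        event_name = 'order',
        event_name = 'payment'
    ) AS level_dedup
FROM
    events
GROUP BY
    user_id
ORDER BY
    user_id;
```

Rows where the two columns differ point to users whose chain was cut short by a repeated event.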
#### example3: fixed mode
Use the ```fixed``` mode to find the maximum number of consecutive events for each ```user_id```, with a time window of ```1``` hour:
```sql
CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");
INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:03:00', 'HONOR', 4),
(100123, 'login2', '2022-05-14 10:04:00', 'HONOR', 3),
(100123, 'payment', '2022-05-14 10:10:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);
SELECT
user_id,
window_funnel(
3600,
"fixed",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
ORDER BY
user_id;
+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+
```
For ```user_id=100123```, the event chain is interrupted by the ```login2``` event after the ```order``` event has been matched, so the matched event chain is ```login-visit-order```.
#### example4: increase mode
Use the ```increase``` mode to find the maximum number of consecutive events for each ```user_id```, with a time window of ```1``` hour:
```sql
CREATE TABLE events(
user_id BIGINT,
event_name VARCHAR(64),
event_timestamp datetime,
phone_brand varchar(64),
tab_num int
) distributed by hash(user_id) buckets 3 properties("replication_num" = "1");
INSERT INTO
events
VALUES
(100123, 'login', '2022-05-14 10:01:00', 'HONOR', 1),
(100123, 'visit', '2022-05-14 10:02:00', 'HONOR', 2),
(100123, 'order', '2022-05-14 10:04:00', 'HONOR', 4),
(100123, 'payment', '2022-05-14 10:04:00', 'HONOR', 4),
(100125, 'login', '2022-05-15 11:00:00', 'XIAOMI', 1),
(100125, 'visit', '2022-05-15 11:01:00', 'XIAOMI', 2),
(100125, 'order', '2022-05-15 11:02:00', 'XIAOMI', 6),
(100126, 'login', '2022-05-15 12:00:00', 'IPHONE', 1),
(100126, 'visit', '2022-05-15 12:01:00', 'HONOR', 2),
(100127, 'login', '2022-05-15 11:30:00', 'VIVO', 1),
(100127, 'visit', '2022-05-15 11:31:00', 'VIVO', 5);
SELECT
user_id,
window_funnel(
3600,
"increase",
event_timestamp,
event_name = 'login',
event_name = 'visit',
event_name = 'order',
event_name = 'payment'
) AS level
FROM
events
GROUP BY
user_id
ORDER BY
user_id;
+---------+-------+
| user_id | level |
+---------+-------+
| 100123 | 3 |
| 100125 | 3 |
| 100126 | 2 |
| 100127 | 2 |
+---------+-------+
```
For ```user_id=100123```, the ```payment``` event and the ```order``` event have the same timestamp (they occur in the same second), so the timestamps are not strictly increasing and the matched event chain is ```login-visit-order```.
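Because the ```increase``` mode depends on timestamps being strictly increasing, it can help to check the data for events that share a timestamp before interpreting the result. The query below is a plain aggregation sketch over the same `events` table; the alias `events_at_same_time` is illustrative.

```sql
-- List (user, timestamp) pairs that carry more than one event.
SELECT
    user_id,
    event_timestamp,
    COUNT(*) AS events_at_same_time
FROM
    events
GROUP BY
    user_id,
    event_timestamp
HAVING
    COUNT(*) > 1
ORDER BY
    user_id,
    event_timestamp;
```

With the sample data above, the only such pair is user `100123` at `2022-05-14 10:04:00`, which carries both the `order` and the `payment` event.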
### keywords
WINDOW,FUNCTION,WINDOW_FUNNEL