| --- |
| title: Sessions - Guide - Apache DataFu Pig |
| version: 1.4.0 |
| section_name: Apache DataFu Pig - Guide |
| license: > |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| ## Sessions |
| |
| A 'session' is a useful concept when analyzing user activity on a website. We essentially |
| define a session as sustained user activity. By assigning events to sessions we can perform |
| analysis on user sessions and draw useful conclusions. |
| |
| For example, suppose that we have a stream of page views by user. Each page view can be |
| represented by a member ID, a timestamp, and a URL: |
| |
| ```pig |
| pv = LOAD 'pageviews.csv' USING PigStorage(',') |
| AS (memberId:int, time:long, url:chararray); |
| ``` |
| |
| One statistic that may be useful to know is how long users tend to stay active on the website. |
| When they visit do they tend to stick around for a long time and view many pages? Or is it |
| typically a very brief session? Apache DataFu provides UDFs that help in this sort of analysis. |
| |
| ### Sessionization |
| |
| The [Sessionize](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/sessions/Sessionize.html) |
| UDF can be used to assign unique session IDs to events within a stream. Events are passed to the |
| UDF in time-sorted order. If two consecutive events are separated by a sufficient amount of time, then |
| they are assigned to a different session. |
| |
| Let's walk through an example. Suppose we are interested in computing some statistics on session length. |
| What we need to do is sessionize the data and then compute the time difference between the first and last |
| event of each session. With the session lengths computed we can then pass the values to statistics methods. |
| |
| First we need to choose a threshold for the `Sessionize` UDF. We'll consider "10 minutes" a sufficient |
| amount of time: |
| |
| ```pig |
| DEFINE Sessionize datafu.pig.sessions.Sessionize('10m'); |
| ``` |
| |
| We'll also define functions to compute various statistics. In this example we'll compute the median, |
| 90th and 95th percentiles, and variance of the session lengths. |
| |
| ```pig |
| DEFINE Median datafu.pig.stats.StreamingMedian(); |
| DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95'); |
| DEFINE VAR datafu.pig.VAR(); |
| ``` |
| |
| Next we'll sessionize the data. We group by member and sort the events by time. `Sessionize` appends |
| the session ID to each tuple. Events for a member that are within 10 minutes of each other will be |
| assigned to the same session. |
| |
| ```pig |
| pv = FOREACH pv GENERATE time, memberId; |
| pv_sessionized = FOREACH (GROUP pv BY memberId) { |
| ordered = ORDER pv BY time; |
| GENERATE FLATTEN(Sessionize(ordered)) |
| AS (time,memberId,sessionId); |
| } |
| ``` |
| |
| Now that the data is sessionized, we can compute the session lengths: |
| |
| ```pig |
| session_times = |
| FOREACH (GROUP pv_sessionized BY (sessionId,memberId)) { |
| GENERATE group.sessionId as sessionId, |
| group.memberId as memberId, |
| (MAX(pv_sessionized.time) - MIN(pv_sessionized.time)) |
| / 1000.0 / 60.0 as session_length; |
| } |
| ``` |
| |
| Finally let's compute our statistics: |
| |
| ```pig |
| session_stats = FOREACH (GROUP session_times ALL) { |
| GENERATE |
| AVG(session_times.session_length) as avg_session, |
| SQRT(VAR(session_times.session_length)) as std_dev_session, |
| Median(session_times.session_length) as median_session, |
| Quantile(session_times.session_length) as quantile_session; |
| } |
| |
| DUMP session_stats; |
| ``` |
| |
| With the session statistics computed, we can now perform some interesting queries. For example, |
| let's get the list of users who had sessions in the upper 95th percentile. These are the users |
| who were most engaged in our website. |
| |
| ```pig |
| long_sessions = FILTER session_times BY |
| session_length > session_stats.quantiles_session.quantile_0_95; |
| |
| very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId); |
| |
| DUMP very_engaged_users; |
| ``` |
| |
| ### Counting Sessions |
| |
| [SessionCount](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/sessions/SessionCount.html) |
| can be used to count sessions. It works very similarly to |
| [Sessionize](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/datafu/pig/sessions/Sessionize.html). |
| |
| One useful application of `SessionCount` is in counting page views. This is a useful statistic |
| to track for any website. However, user's sometimes hit refresh. Or they may inadvertently take |
| another action within a short period of time that causes another page view to be recorded. These |
| additional page views are significant and we may want to filter them out when computing the count. |
| `SessionCount` can help with this. |
| |
| First we'll define the UDF and specify a 10 minute threshold: |
| |
| ```pig |
| define SessionCount datafu.pig.sessions.SessionCount('10m'); |
| ``` |
| |
| We then perform the same procedure as before, sorting the events by time and passing them into the UDF. |
| This time we get a count as output instead of a bag of sessionized events. |
| |
| ```pig |
| pv_sessionized = FOREACH (GROUP pv BY (memberId,url)) { |
| ordered = ORDER pv BY time; |
| GENERATE group.memberId as memberId, |
| group.url as url, |
| FLATTEN(SessionCount(ordered.time)) as count; |
| } |
| ``` |
| |
| We now have the page view counts grouped by member and URL. Now we can perform one more group to get the |
| total page views across all members and URLs. |
| |
| ```pig |
| pv_sum = FOREACH (GROUP pv_sessionized ALL) |
| GENERATE SUM(pv_sessionized.count) as total_pvs; |
| DUMP pv_sum; |
| ``` |