| ======================== |
| Example Usage of Distill |
| ======================== |
| |
| In this example, we run through a simulated user experiment using UserALE data generated within an instantiation of |
| Superset. This data reflects four simulated user sessions in which the user performs three tasks within the Video Game |
| Sales example dashboard: |
| |
| #. Filter the Video Game Sales dashboard by Wii, Racing, and Nintendo. |
| #. Find Mario Kart in the list of games. |
| #. Determine the difference in global sales between the 3DS game Nintendogs + cats and Wii Sports. |
| |
| A screenshot of this Superset dashboard can be seen below: |
| |
| .. image:: ./images/Superset_Dashboard.png |
| :width: 700 |
| |
| The data from these four sessions is captured in a JSON file named ``task_example.json``. In the following example, we |
| will: |
| |
| * Show how to use Distill's Segmentation package to create useful ``Segments`` of data. |
| * Visualize ``Segment`` objects using timeline/gantt and digraph visualizations. |
| * Compare each of these user sessions through the investigation of edit distances. |
| |
| **Note: The data used in this example was not collected in a user study. Rather, it was simulated through developer |
| interactions with the Superset dashboard.** |
| |
| Imports |
| ------- |
| The first step in this example is to import all of the packages that we need. We do this with the code below: |
| |
| .. code:: python |
| |
| import datetime |
| import distill |
| import json |
| import networkx as nx |
| import os |
| import pandas as pd |
| import plotly.express as px |
| import re |
| |
| Processing and Segmentation |
| --------------------------- |
| Now that we have imported all of the necessary packages, we can begin to process and segment the data. This can be done |
| by creating ``Segment``/``Segments`` objects. These objects will help us visualize the data and understand the steps |
| that the users took to perform each of the three tasks. |
| |
| Processing the JSON File |
| ************************ |
| The ``setup`` function converts a JSON file into the format required for segmentation. It also lets us specify the |
| date format that we want to use for our analysis (i.e., integer or ``datetime``). Below we define this function: |
| |
| .. code:: python |
| |
| def setup(file, date_type): |
|     with open(file) as json_file: |
|         raw_data = json.load(json_file) |
| |
|     data = {} |
|     for log in raw_data: |
|         data[distill.getUUID(log)] = log |
| |
|     # Convert clientTime to specified type |
|     for uid in data: |
|         log = data[uid] |
|         client_time = log['clientTime'] |
|         if date_type == "integer": |
|             log['clientTime'] = distill.epoch_to_datetime(client_time) |
|         elif date_type == "datetime": |
|             log['clientTime'] = pd.to_datetime(client_time, unit='ms', origin='unix') |
| |
|     # Sort |
|     sorted_data = sorted(data.items(), key=lambda kv: kv[1]['clientTime']) |
|     sorted_dict = dict(sorted_data) |
| |
|     return (sorted_data, sorted_dict) |
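
As a rough illustration of the conversion and sorting steps in ``setup``, the sketch below uses only the standard library on three made-up logs. The log IDs and the ``to_datetime`` helper are hypothetical stand-ins; Distill's ``getUUID`` and pandas' ``to_datetime`` are not reproduced here.

```python
from datetime import datetime, timezone

# Hypothetical logs keyed by made-up IDs; clientTime is epoch milliseconds.
logs = {
    "log-b": {"clientTime": 1652736358283, "sessionID": "s1"},
    "log-a": {"clientTime": 1652736357935, "sessionID": "s0"},
    "log-c": {"clientTime": 1652736359774, "sessionID": "s1"},
}


def to_datetime(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


for log in logs.values():
    log["clientTime"] = to_datetime(log["clientTime"])

# Sort (id, log) pairs chronologically, then rebuild an ordered dict.
sorted_data = sorted(logs.items(), key=lambda kv: kv[1]["clientTime"])
sorted_dict = dict(sorted_data)

print(list(sorted_dict))  # ['log-a', 'log-b', 'log-c']
```

Because dictionaries preserve insertion order, rebuilding the dictionary from the sorted pairs keeps the logs in chronological order.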
| |
| Using this function, we can process the UserALE data and create ``Segment`` objects that represent each of the four user |
| sessions. The code below does this with the ``generate_collapsing_window_segments`` function. |
| |
| .. code:: python |
| |
| data_many_session = setup("./data/task_example.json", "datetime") |
| sorted_dict = data_many_session[1] |
| |
| # Create segments based on sessionID |
| segments = distill.Segments() |
| session_ids = sorted(distill.find_meta_values('sessionID', sorted_dict)) |
| for session_id in session_ids: |
|     segments.append_segments(distill.generate_collapsing_window_segments(sorted_dict, 'sessionID', [session_id], session_id)) |
| |
| # Improve readability of Segment names |
| for index in range(len(segments)): |
|     segments[index].segment_name = "Session" + str(index) |
| |
| The listing below shows each created ``Segment`` object along with its number of logs and start time. |
| |
| .. code:: console |
| |
| Session0 Length: 427 Start Time: 2022-05-16 21:25:57.935000 |
| Session1 Length: 236 Start Time: 2022-05-16 21:27:38.283000 |
| Session2 Length: 332 Start Time: 2022-05-16 21:28:59.774000 |
| Session3 Length: 219 Start Time: 2022-05-16 21:30:25.633000 |
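
One way to produce a listing like the one above is to loop over the segments and print the attributes used elsewhere in this example (``segment_name``, ``num_logs``, ``start_end_val``). The sketch below uses a hypothetical ``FakeSegment`` stand-in rather than Distill's ``Segment`` class so that it runs on its own; the end times are made up for illustration.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in exposing the attribute names used in this example.
FakeSegment = namedtuple("FakeSegment", ["segment_name", "num_logs", "start_end_val"])

# Start times and lengths are copied from the listing; end times are made up.
demo_segments = [
    FakeSegment("Session0", 427, (datetime(2022, 5, 16, 21, 25, 57, 935000),
                                  datetime(2022, 5, 16, 21, 27, 0))),
    FakeSegment("Session1", 236, (datetime(2022, 5, 16, 21, 27, 38, 283000),
                                  datetime(2022, 5, 16, 21, 28, 30))),
]


def describe(segment):
    """Format one segment as 'name Length: n Start Time: t'."""
    return "{} Length: {} Start Time: {}".format(
        segment.segment_name, segment.num_logs, segment.start_end_val[0]
    )


lines = [describe(segment) for segment in demo_segments]
print("\n".join(lines))
```

With real data, the same loop would presumably iterate over ``segments`` instead of ``demo_segments``.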
| |
| Further Segmentation of Sessions |
| ******************************** |
| Now that there are ``Segment`` objects that represent each session, let's use ``write_segment`` to collect the logs that |
| belong to each ``Segment`` into ``segment_map``, keyed by segment name. This will allow us to further segment these |
| session segments to analyze the activity of the user during each session. This can be done with the following code: |
| |
| .. code:: python |
| |
| segment_names = [segment.segment_name for segment in segments] |
| start_end_vals = [segment.start_end_val for segment in segments] |
| segment_map = distill.write_segment(sorted_dict, segment_names, start_end_vals) |
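
Judging from how ``segment_map`` is indexed later in this example, it maps each segment name to the logs that fall inside that segment. The sketch below illustrates that idea with plain dictionaries. ``logs_in_windows`` is a hypothetical helper, and treating both window bounds as inclusive is an assumption, not a statement about how ``write_segment`` is implemented.

```python
def logs_in_windows(logs, names, windows):
    """Map each segment name to the logs whose clientTime lies in its window."""
    segment_map = {}
    for name, (start, end) in zip(names, windows):
        segment_map[name] = {
            uid: log
            for uid, log in logs.items()
            if start <= log["clientTime"] <= end  # inclusive bounds (assumed)
        }
    return segment_map


# Made-up logs with integer timestamps for readability.
demo_logs = {
    "a": {"clientTime": 10},
    "b": {"clientTime": 20},
    "c": {"clientTime": 35},
    "d": {"clientTime": 40},
}

demo_map = logs_in_windows(demo_logs, ["Session0", "Session1"], [(10, 20), (35, 40)])
print({name: list(found) for name, found in demo_map.items()})
# {'Session0': ['a', 'b'], 'Session1': ['c', 'd']}
```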
| |
| We can now generate ``Segments`` objects within each of those session segments that represent user interactions with two |
| different elements of the Superset dashboard. |
| |
| The first element involves user interactions with the filter window that filters the list of video games (shown in the |
| screenshot below). The element in the path that represents these interactions is ``div.filter-container css-ffe7is``. |
| |
| .. image:: ./images/Video_Game_Filter.png |
| :width: 500 |
| |
| The second element involves interactions with the actual list of video games (shown in the screenshot below), represented |
| by the ``div#chart-id-110.superset-chart-table`` path element. |
| |
| .. image:: ./images/Games_List.png |
| :width: 500 |
| |
| By creating ``Segment`` objects that capture user interaction with these two elements, we can get an understanding of how |
| the user is using the Superset dashboard to complete the three tasks. We create these ``Segment`` objects with the |
| following code: |
| |
| .. code:: python |
| |
| session_0_segments = distill.generate_collapsing_window_segments(segment_map['Session0'], 'path', ['div.filter-container css-ffe7is'], "Game_Filter") |
| session_1_segments = distill.generate_collapsing_window_segments(segment_map['Session1'], 'path', ['div.filter-container css-ffe7is'], "Game_Filter") |
| session_2_segments = distill.generate_collapsing_window_segments(segment_map['Session2'], 'path', ['div.filter-container css-ffe7is'], "Game_Filter") |
| session_3_segments = distill.generate_collapsing_window_segments(segment_map['Session3'], 'path', ['div.filter-container css-ffe7is'], "Game_Filter") |
| |
| session_0_segments.append_segments(distill.generate_collapsing_window_segments(segment_map['Session0'], 'path', ['div#chart-id-110.superset-chart-table'], "Games")) |
| session_1_segments.append_segments(distill.generate_collapsing_window_segments(segment_map['Session1'], 'path', ['div#chart-id-110.superset-chart-table'], "Games")) |
| session_2_segments.append_segments(distill.generate_collapsing_window_segments(segment_map['Session2'], 'path', ['div#chart-id-110.superset-chart-table'], "Games")) |
| session_3_segments.append_segments(distill.generate_collapsing_window_segments(segment_map['Session3'], 'path', ['div#chart-id-110.superset-chart-table'], "Games")) |
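
The idea behind these calls can be pictured as collapsing each run of consecutive matching logs into one window. The sketch below is a simplified stand-in written for this example, not Distill's implementation: ``collapse_windows`` is a hypothetical helper, and it assumes that a log matches when its ``path`` list contains one of the values of interest.

```python
def collapse_windows(logs, field, values_of_interest):
    """Return (start, end) clientTime pairs for each run of consecutive matching logs."""
    windows = []
    start = end = None
    for log in logs:  # logs are assumed to be sorted by clientTime
        matches = any(value in log[field] for value in values_of_interest)
        if matches:
            if start is None:
                start = log["clientTime"]  # a new window opens here
            end = log["clientTime"]  # extend the current window
        elif start is not None:
            windows.append((start, end))  # a non-matching log closes the window
            start = end = None
    if start is not None:
        windows.append((start, end))  # close a window still open at the end
    return windows


# Made-up logs; the path values are shortened placeholders.
demo_logs = [
    {"clientTime": 1, "path": ["div.filter", "body"]},
    {"clientTime": 2, "path": ["div.filter", "body"]},
    {"clientTime": 3, "path": ["div.table", "body"]},
    {"clientTime": 4, "path": ["div.filter", "body"]},
]

print(collapse_windows(demo_logs, "path", ["div.filter"]))  # [(1, 2), (4, 4)]
```

Each returned pair corresponds to one stretch of uninterrupted interaction with the element, which is why a session can contain several filter segments and several game-list segments.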
| |
| Now, we append each of those newly generated ``Segments`` objects to the overarching ``segments`` variable. This creates |
| one large ``Segments`` object that contains all ``Segment`` objects from all sessions. |
| |
| .. code:: python |
| |
| segments.append_segments(session_0_segments) |
| segments.append_segments(session_1_segments) |
| segments.append_segments(session_2_segments) |
| segments.append_segments(session_3_segments) |
| |
| Visualization of ``Segment`` Objects |
| ------------------------------------ |
| To understand these ``Segment`` objects better, we can visualize them. First, we will visualize them using Plotly's |
| timeline function, then we will analyze them by creating DiGraphs. |
| |
| Visualization with Plotly's Timeline |
| ************************************ |
| The following code can be used to define a function that will display a Plotly timeline of each of the ``Segment`` |
| objects: |
| |
| .. code:: python |
| |
| def display_segments(segments): |
|     segment_list = [] |
|     for segment in segments: |
|         if not isinstance(segment.start_end_val[0], datetime.datetime) or not isinstance(segment.start_end_val[1], datetime.datetime): |
|             new_segment = distill.Segment() |
|             new_segment.segment_name = segment.segment_name |
|             new_segment.num_logs = segment.num_logs |
|             new_segment.uids = segment.uids |
|             new_segment.generate_field_name = segment.generate_field_name |
|             new_segment.generate_matched_values = segment.generate_matched_values |
|             new_segment.segment_type = segment.segment_type |
|             new_segment.start_end_val = (pd.to_datetime(segment.start_end_val[0], unit='ms', origin='unix'), pd.to_datetime(segment.start_end_val[1], unit='ms', origin='unix')) |
|             segment_list.append(new_segment) |
|         else: |
|             segment_list.append(segment) |
|     new_segments = distill.Segments(segments=segment_list) |
|     distill.export_segments("./test.csv", new_segments) |
|     df = pd.read_csv("./test.csv") |
|     fig = px.timeline(df, x_start="Start Time", x_end="End Time", y="Segment Name", color="Number of Logs") |
|     fig.update_yaxes(autorange="reversed") |
|     os.remove("./test.csv") |
|     fig.show() |
| |
| Using this code, we can visualize the ``Segment`` objects we created. |
| |
| .. code:: python |
| |
| display_segments(segments) |
| |
| This will produce the following timeline graph: |
| |
| .. image:: ./images/Timeline_Graph.png |
| :width: 700 |
| |
| This graph shows the number of logs in each ``Segment`` while also showing the length of time each ``Segment`` |
| represents. We can also begin to understand some of the interactions that each user had with the dashboard by |
| analyzing the ``Segment`` objects that exist within each overarching session ``Segment``. |
| |
| Visualizing User Workflows with DiGraphs |
| **************************************** |
| Another way we can visualize user workflows is through the creation and analysis of DiGraphs. The function below |
| (``draw_digraph``) draws a DiGraph based on the passed-in ``Segments`` object. In these graphs, interactions with the |
| video game filter are colored green, while interactions with the list of video games are colored blue. |
| |
| .. code:: python |
| |
| def draw_digraph(segments): |
|     nodes = sorted(segments.get_segment_list(), key=lambda segment: segment.start_end_val[0]) |
|     edges = distill.pairwiseSeq(segments.get_segment_list()) |
| |
|     # Set coloring of graph based on element in Superset dashboard |
|     color_map = [] |
|     for segment in segments: |
|         # Raw string avoids an invalid-escape warning for \S |
|         if re.match(r"Game_Filter\S*", segment.segment_name): |
|             color_map.append('green') |
|         else: |
|             color_map.append('blue') |
| |
|     graph = distill.createDiGraph(nodes, edges) |
|     nx.draw(graph, node_color=color_map) |
|     return graph |
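
To make the edge construction concrete, the sketch below pairs each segment name with the one that follows it in time and assigns colors by name prefix. It is a standard-library illustration only: ``pairwise_edges`` is a hypothetical helper that assumes consecutive pairing, and the segment names and start times are made up.

```python
def pairwise_edges(items):
    """Pair each item with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(items, items[1:]))


# Made-up (segment name, start time) pairs, deliberately out of order.
demo = [("Games0", 30), ("Game_Filter0", 10), ("Game_Filter1", 50), ("Games1", 70)]

# Order the names chronologically before pairing them.
ordered = [name for name, start in sorted(demo, key=lambda pair: pair[1])]
edges = pairwise_edges(ordered)

# Green for filter segments, blue for everything else, in node order.
colors = ["green" if name.startswith("Game_Filter") else "blue" for name in ordered]

print(edges)
print(colors)  # ['green', 'blue', 'green', 'blue']
```

Building the color list from the same ordered sequence as the nodes keeps each color aligned with the node it describes.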
| |
| We can now use this function to create DiGraphs of each of the user sessions. |
| |
| **Graph 0 - Session 0** |
| |
| .. code:: python |
| |
| G0 = draw_digraph(session_0_segments) |
| |
| .. image:: ./images/Graph_0.png |
| :width: 400 |
| |
| **Graph 1 - Session 1** |
| |
| .. code:: python |
| |
| G1 = draw_digraph(session_1_segments) |
| |
| .. image:: ./images/Graph_1.png |
| :width: 400 |
| |
| **Graph 2 - Session 2** |
| |
| .. code:: python |
| |
| G2 = draw_digraph(session_2_segments) |
| |
| .. image:: ./images/Graph_2.png |
| :width: 400 |
| |
| **Graph 3 - Session 3** |
| |
| .. code:: python |
| |
| G3 = draw_digraph(session_3_segments) |
| |
| .. image:: ./images/Graph_3.png |
| :width: 400 |
| |
| By analyzing these graphs, we can understand the general interactions that users had with the two elements of the |
| Superset dashboard. For instance, in each of these graphs, the user starts by filtering the dashboard. Given the tasks |
| the user is asked to perform, this makes sense, since the filter window is the most direct way to filter the |
| dashboard. However, the graphs differ in the number of interactions that the user has with the actual list of video |
| games. While users always follow the workflow filter --> game list --> filter --> game list, in some sessions the user |
| interacts with the game list more often than in others. |
| |
| Measuring Similarity with Edit Distance |
| --------------------------------------- |
| One way to understand the differences between the previously generated DiGraphs is to look at their edit distance. Edit |
| distance measures the minimum number of edit operations (such as inserting, deleting, or substituting nodes and edges) |
| needed to turn one graph into another, so a smaller distance means more similar graphs. For instance, the edit distance |
| between a graph and itself is 0, since the graphs are identical. We can show this using NetworkX's |
| ``graph_edit_distance`` function. |
| |
| **Input** |
| |
| .. code:: python |
| |
| nx.graph_edit_distance(G0, G0) # 0.0 |
| |
| Let's now calculate the edit distance between each pair of graphs and then average the results. Computing an exact |
| graph edit distance is computationally expensive, so for larger graphs it can take a long time and require large |
| amounts of computational resources. In those cases, we can use NetworkX's ``optimize_graph_edit_distance`` function, |
| which yields successively better approximations of the graph edit distance; calling ``next`` on it returns the first |
| approximation. For the following calculations, we use ``graph_edit_distance`` when it finishes in a reasonable amount of |
| time and ``optimize_graph_edit_distance`` otherwise. |
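
To see why exact edit distance becomes expensive, consider the brute-force sketch below. It is not NetworkX's algorithm: ``tiny_graph_edit_distance`` is a hypothetical helper that assumes unlabeled nodes, directed edges, and a cost of 1 for every node or edge insertion or deletion. It tries every way of matching the nodes of the smaller graph to nodes of the larger one, so its running time grows factorially with graph size.

```python
from itertools import permutations


def tiny_graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Exact edit distance between two small, unlabeled directed graphs."""
    nodes_a, nodes_b = list(nodes_a), list(nodes_b)
    # Make graph A the smaller one; unit costs make the distance symmetric.
    if len(nodes_a) > len(nodes_b):
        nodes_a, edges_a, nodes_b, edges_b = nodes_b, edges_b, nodes_a, edges_a
    node_cost = len(nodes_b) - len(nodes_a)  # nodes that must be inserted
    target_edges = set(edges_b)
    best = None
    # Try every way of matching A's nodes to distinct nodes of B.
    for image in permutations(nodes_b, len(nodes_a)):
        mapping = dict(zip(nodes_a, image))
        mapped_edges = {(mapping[u], mapping[v]) for u, v in edges_a}
        # Edges present on only one side must be inserted or deleted.
        cost = node_cost + len(mapped_edges ^ target_edges)
        if best is None or cost < best:
            best = cost
    return best


# Two made-up path graphs: a -> b -> c and 0 -> 1 -> 2 -> 3.
path3 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path4 = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])

print(tiny_graph_edit_distance(*path3, *path3))  # 0
print(tiny_graph_edit_distance(*path3, *path4))  # 2
```

Turning the three-node path into the four-node path takes one node insertion and one edge insertion, hence a distance of 2. Graphs with dozens of nodes, like the session graphs here, are far beyond what this exhaustive search can handle, which is the motivation for approximate methods.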
| |
| **Input: G0, G1** |
| |
| .. code:: python |
| |
| next(nx.optimize_graph_edit_distance(G0, G1)) # 32.0 |
| |
| **Input: G0, G2** |
| |
| .. code:: python |
| |
| next(nx.optimize_graph_edit_distance(G0, G2)) # 34.0 |
| |
| **Input: G0, G3** |
| |
| .. code:: python |
| |
| next(nx.optimize_graph_edit_distance(G0, G3)) # 38.0 |
| |
| **Input: G1, G2** |
| |
| .. code:: python |
| |
| nx.graph_edit_distance(G1, G2) # 2.0 |
| |
| **Input: G1, G3** |
| |
| .. code:: python |
| |
| next(nx.optimize_graph_edit_distance(G1, G3)) # 18.0 |
| |
| **Input: G2, G3** |
| |
| .. code:: python |
| |
| nx.graph_edit_distance(G2, G3) # 4.0 |
| |
| Using these outputs we can now calculate the average edit distance with the following calculation: |
| |
| .. code:: python |
| |
| (32.0 + 34.0 + 38.0 + 2.0 + 18.0 + 4.0)/6 # 21.33 |
| |
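| This shows that the average edit distance across the six pairs of session DiGraphs is approximately 21.33. Because four |
| of the six values are first approximations rather than exact distances, this average is itself an estimate. |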
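
The same average can be computed programmatically, which scales better if more sessions are added. The sketch below hard-codes the six distances reported above in a hypothetical ``distances`` dictionary and uses only the standard library.

```python
from itertools import combinations
from statistics import mean

# Pairwise edit distances reported above, keyed by graph pair.
distances = {
    ("G0", "G1"): 32.0,
    ("G0", "G2"): 34.0,
    ("G0", "G3"): 38.0,
    ("G1", "G2"): 2.0,
    ("G1", "G3"): 18.0,
    ("G2", "G3"): 4.0,
}

# Four graphs give C(4, 2) = 6 unordered pairs.
pairs = list(combinations(["G0", "G1", "G2", "G3"], 2))

average = mean(distances[pair] for pair in pairs)
print(round(average, 2))  # 21.33
```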
| |
| |
| |
| |