Process mining in Python

Requirements

Python 3.x, opyenxes, pygraphviz (or graphviz).

For this class, you can use any Python environment that has the libraries listed above installed.
You can also use Google Colab: https://colab.research.google.com.

The code in this lab instruction is based on the code from the book
A Primer on Process Mining. Practical Skills with Python and Graphviz.
The code is not optimized; it is meant to show a step-by-step process mining solution.

Implementing a simple heuristic miner

Using XUniversalParser as in the following code excerpt, import the repairExample.xes file into your Python script:

from opyenxes.data_in.XUniversalParser import XUniversalParser
 
path = 'repairExample.xes'
 
with open(path) as log_file:
    # parse the log
    log = XUniversalParser().parse(log_file)[0]

Take a look at the log variable. Using log.get_features() or log.get_attributes(), you can check some information about the log. As the parsed log is a list of traces, each of which is a list of events, you can also select a single event and check its attributes:

event = log[0][0]
event.get_attributes()

For ease of further work, we will create a workflow_log consisting of names of events:

workflow_log = []
for trace in log: 
    workflow_trace = []
    for event in trace[0::2]:
        # get the event name from the event in the log
        event_name = event.get_attributes()['Activity'].get_value()
        workflow_trace.append(event_name)
    workflow_log.append(workflow_trace)

To create a simple heuristic net of tasks (a simplified process model, like the one in the Disco tool), we will create a structure in which, for each event, we gather the set of all events that directly follow it:

w_net = dict()
for w_trace in workflow_log:
    # pair every event with its direct successor in the trace
    for i in range(0, len(w_trace)-1):
        ev_i, ev_j = w_trace[i], w_trace[i+1]
        if ev_i not in w_net:
            w_net[ev_i] = set()
        w_net[ev_i].add(ev_j)
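
The same directly-follows structure can be tried out without an XES file. The sketch below is only an illustration: toy_log is hypothetical, hand-made data (not taken from repairExample.xes), and directly_follows is a helper name introduced here. It uses collections.defaultdict and zip instead of explicit index arithmetic.

```python
from collections import defaultdict

# Hypothetical toy log: each inner list is one trace of activity names.
toy_log = [
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'End'],
]

def directly_follows(workflow_log):
    """Map each activity to the set of activities that directly follow it."""
    net = defaultdict(set)
    for trace in workflow_log:
        # zip pairs every event with its immediate successor
        for ev_i, ev_j in zip(trace, trace[1:]):
            net[ev_i].add(ev_j)
    return dict(net)

toy_net = directly_follows(toy_log)
print(toy_net['Analyze Defect'])
```

Note that 'End' never becomes a key of the result, because nothing follows it in any trace.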

Take a closer look at the w_net dictionary:

{'Analyze Defect': {'Inform User', 'Repair (Complex)', 'Repair (Simple)'},
 'Archive Repair': {'End'},
 'Inform User': {'Archive Repair', 'End', ...}, 
 ...}

It represents the connections between events (rows are source events, columns are the events that directly follow them):

                 Analyze Defect   Archive Repair   Inform User   …   End
Analyze Defect                                     →
Archive Repair                                                       →
Inform User                       →                                  →
…
End
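
Such a matrix view can be derived from the dictionary itself. The following is a minimal sketch, assuming an abridged, hypothetical fragment of w_net called sample_net; to_matrix is a helper name introduced here for illustration.

```python
# Hypothetical, abridged fragment shaped like the w_net dictionary.
sample_net = {
    'Analyze Defect': {'Inform User'},
    'Archive Repair': {'End'},
    'Inform User': {'Archive Repair', 'End'},
}

def to_matrix(net):
    """Return (activities, rows), where rows[i][j] is True if i -> j."""
    # collect sources and targets so that sink activities such as 'End' appear
    activities = sorted(set(net) | set().union(*net.values()))
    rows = [[target in net.get(source, set()) for target in activities]
            for source in activities]
    return activities, rows

activities, rows = to_matrix(sample_net)
width = max(len(a) for a in activities) + 2
# header row, then one row per source activity
print(' ' * width + ''.join(a.ljust(width) for a in activities))
for source, row in zip(activities, rows):
    cells = ''.join(('→' if flag else '').ljust(width) for flag in row)
    print(source.ljust(width) + cells)
```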

Visualizing results using pygraphviz

Using pygraphviz, we can render an image depicting the process:

import pygraphviz as pgv
G = pgv.AGraph(strict=False, directed=True)
G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
for event in w_net:
    G.add_node(event, style="rounded,filled", fillcolor="#ffffcc")
    # w_net[event] holds the events that directly follow this event
    for succeeding in w_net[event]:
        G.add_edge(event, succeeding)
 
G.draw('simple_heuristic_net.png', prog='dot')

If you don't have pygraphviz, you can use graphviz instead (see the instructions at the bottom of the page).

Diagram enhancing

In Disco, we could see the frequencies of tasks. Let's count how often each task occurs:

ev_counter = dict()
for w_trace in workflow_log:
    for ev in w_trace:
        ev_counter[ev] = ev_counter.get(ev, 0) + 1
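
The same count can be obtained with the standard library. The sketch below assumes a hypothetical toy_log in place of workflow_log; collections.Counter does the counting and itertools.chain.from_iterable flattens the traces lazily, so no intermediate list is built.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy log standing in for workflow_log.
toy_log = [
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'Repair (Complex)', 'End'],
]

# chain.from_iterable yields the events of all traces one by one
toy_counter = Counter(chain.from_iterable(toy_log))
for name, count in toy_counter.items():
    print(name, count)
```

A Counter returns 0 for names it has never seen, so lookups of unknown tasks do not raise a KeyError.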

Then, in our model, we can change the node label to include the computed frequency:

text = '{} ({})'.format(event, ev_counter[event])
G.add_node(event, label=text, style="rounded,filled", fillcolor="#ffffcc") # code for pygraphviz

We can also change the transparency of the discovered tasks based on their frequencies (the code below is for pygraphviz; it needs to be adjusted for graphviz):

color_min = min(ev_counter.values())
color_max = max(ev_counter.values())
 
G = pgv.AGraph(strict=False, directed=True)
G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
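for event in w_net:
    value = ev_counter[event]
    # alpha from 0 (most frequent task) to 100 (least frequent task);
    # "or 1" avoids division by zero when all frequencies are equal
    color = int(float(color_max-value)/float((color_max-color_min) or 1)*100.00)
    # always emit two hex digits, otherwise the "#RRGGBBAA" value is malformed
    my_color = "#ff9933" + format(color, '02x')
    G.add_node(event, style="rounded,filled", fillcolor=my_color)
    for succeeding in w_net[event]:
        G.add_edge(event, succeeding)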
 
G.draw('simple_heuristic_net_with_colors.png', prog='dot')
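
Graphviz reads an eight-digit value such as "#ff9933cc" as RGB plus an alpha channel, so the alpha part must always be two hex digits. The helper below is a sketch of one possible variation, not the method used above: fill_color is a hypothetical name, it uses the full 0 to 255 alpha range, and it assumes that more frequent tasks should look more opaque (the mapping above goes the other way).

```python
def fill_color(value, v_min, v_max, base="#ff9933"):
    """Return the base color plus a two-digit hex alpha scaled by frequency."""
    if v_max == v_min:
        # all tasks are equally frequent: use a fully opaque fill
        return base + "ff"
    # 0.0 for the rarest task, 1.0 for the most frequent one
    share = (value - v_min) / (v_max - v_min)
    alpha = round(share * 255)
    # '02x' pads single-digit values with a leading zero
    return base + format(alpha, "02x")

print(fill_color(9, 1, 9))
print(fill_color(1, 1, 9))
```

The returned string can be passed as the fillcolor attribute of a node.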

We can also try to discover start and end events and correct the model:

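# events that have at least one successor
ev_source = set(w_net.keys())
# events that occur as a successor of some other event
ev_target = set().union(*w_net.values())
# start events never follow another event
ev_start_set = ev_source - ev_target
print("start set: {}".format(ev_start_set))
# end events are never followed by another event
ev_end_set = ev_target - ev_source
print("end set: {}".format(ev_end_set))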
 
for ev_end in ev_end_set:
    end = G.get_node(ev_end)
    end.attr['shape']='circle'
    end.attr['label']=''
 
G.add_node("start", shape="circle", label="")
for ev_start in ev_start_set:
    G.add_edge("start", ev_start)
 
G.draw('simple_heuristic_net_with_events.png', prog='dot')
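
The set difference above only finds activities that never occur as a target (for start events) or never as a source (for end events). If a log loops back to its first activity, that activity is missed. A possible alternative, sketched here on a hypothetical toy_log, is to read the first and last event of every trace directly.

```python
# Hypothetical toy log in which 'Register' is repeated inside a trace.
toy_log = [
    ['Register', 'Analyze Defect', 'Register', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'End'],
]

# first and last activity of every non-empty trace
start_set = {trace[0] for trace in toy_log if trace}
end_set = {trace[-1] for trace in toy_log if trace}
print("start set: {}".format(start_set))
print("end set: {}".format(end_set))

# for comparison: the set-difference approach on the same log
sources = {a for trace in toy_log for a in trace[:-1]}
targets = {b for trace in toy_log for b in trace[1:]}
print("start set by difference: {}".format(sources - targets))
```

In this toy log the difference-based start set is empty, because 'Register' also appears as a successor of 'Analyze Defect'.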

graphviz instead of pygraphviz

It is possible to use graphviz instead of pygraphviz, but its syntax is different, e.g.:

import graphviz
G = graphviz.Digraph()
for event in w_net:
    G.node(event, style="rounded,filled", fillcolor="#ffffcc")
    # w_net[event] holds the events that directly follow this event
    for succeeding in w_net[event]:
        G.edge(event, succeeding)
 
G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
G.edge_attr.update(penwidth='2')
G.node("End", shape="circle", label="")
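G.render('simple_graphviz_graph')  # saves the DOT source and renders it (PDF by default)
display(G)  # shows the graph inline; available only in Jupyter/Colab notebooks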

Exercises

Extend process discovery with additional features:

Discover the frequency of each transition (flow) and render the number of occurrences both as an edge label and as the thickness of the line.
Add a filtering option to show or hide tasks or flows according to a chosen threshold.
Optimize the code by avoiding the creation of additional lists, e.g. by using itertools, more_itertools or other Python tools.
For interested students only: try to implement the Alpha algorithm and discover relations according to it.
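
As a possible starting point for the first exercise, transitions can be counted as pairs of consecutive events. This is only a sketch under stated assumptions: toy_log is hypothetical data, pen_width is a helper name introduced here, and the width range of 1.0 to 5.0 is an arbitrary choice.

```python
from collections import Counter

# Hypothetical toy log standing in for workflow_log.
toy_log = [
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
]

# count every (event, direct successor) pair over all traces
flow_counter = Counter(
    (ev_i, ev_j)
    for trace in toy_log
    for ev_i, ev_j in zip(trace, trace[1:])
)

def pen_width(count, c_min, c_max, w_min=1.0, w_max=5.0):
    """Scale a flow count linearly to a line width between w_min and w_max."""
    if c_max == c_min:
        # all flows are equally frequent: use the thinnest line
        return w_min
    return w_min + (count - c_min) / (c_max - c_min) * (w_max - w_min)

c_min, c_max = min(flow_counter.values()), max(flow_counter.values())
for (ev_i, ev_j), count in flow_counter.items():
    print(ev_i, '->', ev_j, count, pen_width(count, c_min, c_max))
```

The count and the width can then be passed to the edge as the Graphviz attributes label and penwidth, both converted to strings.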

There is no report required after this lab. However, it is possible to submit an additional report for 5 points (for a very good score) presenting the implementation of at least two of the above exercises.