====== Process mining in Python ======

===== Requirements =====

Python 3.x, opyenxes, pygraphviz (or graphviz).

For this class, you can use any Python environment that has the above-mentioned libraries installed. \\ 
It is also possible to use: https://colab.research.google.com.

The code in this lab instruction is based on the code from the book \\
[[https://www.springer.com/gp/book/9783319564272|A Primer on Process Mining. Practical Skills with Python and Graphviz]]. \\ The code is not optimized; it is meant to show a process mining solution step by step.
===== Implementing a simple heuristic miner =====

Using [[https://opyenxes.readthedocs.io/en/latest/_modules/opyenxes/data_in/XUniversalParser.html|XUniversalParser]], as in the following code excerpt, import the {{ :pl:dydaktyka:dss:lab:repairexample.txt |repairexample.xes}} file into your Python script:

<code python>
from opyenxes.data_in.XUniversalParser import XUniversalParser

path = 'repairExample.xes'

with open(path) as log_file:
    # parse the log
    log = XUniversalParser().parse(log_file)[0]
</code>

Take a look at the ''log'' variable.
Using ''log.get_features()'' or ''log.get_attributes()'', you can check some information about the log.
As the parsed log is a list of traces, and each trace is a list of events, you can also select a single event and check its attributes:

<code python>
event = log[0][0]
event.get_attributes()
</code>

To make further work easier, we will create a ''workflow_log'' that contains only the names of the events, one list per trace:

<code python>
workflow_log = []
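for trace in log:
    workflow_trace = []
    # take every second event; this assumes that each activity is logged
    # as a pair of events (e.g. start and complete)
    for event in trace[0::2]:
        # get the event name from the event in the log
        event_name = event.get_attributes()['Activity'].get_value()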
        workflow_trace.append(event_name)
    workflow_log.append(workflow_trace)
</code>
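
To get a first impression of the workflow log, it can help to count how often each distinct sequence of activities (a trace variant) occurs. The sketch below uses a small made-up ''toy_log'' in place of the parsed file; the activity names resemble those of the example log, but the traces themselves are an assumption made for illustration.

<code python>
from collections import Counter

# Hypothetical stand-in for workflow_log (not taken from repairExample.xes)
toy_log = [
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
]

# lists are not hashable, so each trace is converted to a tuple first
variants = Counter(tuple(trace) for trace in toy_log)

# print the variants, starting with the most frequent one
for variant, count in variants.most_common():
    print(count, ' -> '.join(variant))
</code>

With the real data, ''toy_log'' would simply be replaced by ''workflow_log''.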

To create a simple heuristic net of tasks (a simplified process model like the one in the Disco tool), we will build a structure in which, for each event, we gather the set of all events that directly follow it:

<code python>
w_net = dict()
for w_trace in workflow_log:
    # walk over each pair of neighbouring events (ev_i is directly followed by ev_j)
    for i in range(0, len(w_trace)-1):
        ev_i, ev_j = w_trace[i], w_trace[i+1]
        if ev_i not in w_net:
            w_net[ev_i] = set()
        w_net[ev_i].add(ev_j)
</code>

Take a closer look at the ''w_net'' dictionary:

<code>
{'Analyze Defect': {'Inform User', 'Repair (Complex)', 'Repair (Simple)'},
 'Archive Repair': {'End'},
 'Inform User': {'Archive Repair', 'End', ...}, 
 ...} 
</code>

It represents the directly-follows connections between events (rows are source events, columns are the events that follow them):

^ ^ Analyze Defect ^ Archive Repair ^ Inform User ^ ... ^ End ^
^ Analyze Defect | | |  ->  | | |
^ Archive Repair | | | | |  ->  |
^ Inform User | |  ->  | | |  ->  |
^ ... | | | | | |
^ End | | | | | |
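
The same matrix view can be produced from the dictionary itself. The following is a minimal sketch on a hypothetical ''toy_net'' (a hand-written excerpt, not the full result for the example log); the helper name ''follows'' is also just an illustrative choice.

<code python>
# Hypothetical excerpt of w_net, used only for illustration
toy_net = {
    'Analyze Defect': {'Inform User'},
    'Inform User': {'Archive Repair', 'End'},
    'Archive Repair': {'End'},
}

# every event that appears as a source or as a target
events = sorted(set(toy_net) | set().union(*toy_net.values()))

def follows(net, source, target):
    """Return True if target directly follows source in the net."""
    return target in net.get(source, set())

# one row per source event, one column per target event
matrix = [[follows(toy_net, src, tgt) for tgt in events] for src in events]

# print each row with '->' where a connection exists and '.' elsewhere
for src, row in zip(events, matrix):
    print(src.ljust(16), ' '.join('->' if cell else ' .' for cell in row))
</code>

Note that ''End'' has no entry of its own in the dictionary, which is why ''net.get(source, set())'' is used instead of direct indexing.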

===== Visualizing results using Pygraphviz =====

Using [[https://pygraphviz.github.io/|PyGraphviz]], we can render an image depicting the process:

<code python>
import pygraphviz as pgv
G = pgv.AGraph(strict=False, directed=True)
G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
for event in w_net:
    G.add_node(event, style="rounded,filled", fillcolor="#ffffcc")
    # w_net[event] holds the events that directly follow this event
    for succeeding in w_net[event]:
        G.add_edge(event, succeeding)

G.draw('simple_heuristic_net.png', prog='dot')
</code>

{{:pl:dydaktyka:dss:lab:simple_heuristic_net.png?550|}}

If you don't have pygraphviz, you can use graphviz instead ([[#graphviz_instead_of_pygraphviz|see the instructions at the bottom of the page]]).
===== Diagram enhancing =====

In Disco, we could see the frequencies of tasks. Let's count how many times each task occurs in the log:

<code python>
ev_counter = dict()
for w_trace in workflow_log:
    for ev in w_trace:
        ev_counter[ev] = ev_counter.get(ev, 0) + 1
</code>
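
The same counting can be written more compactly with the standard library. This is only a sketch of an alternative on a made-up ''toy_log'' (the traces are an assumption); ''collections.Counter'' and ''itertools.chain'' do the work of the explicit loops.

<code python>
from collections import Counter
from itertools import chain

# Hypothetical stand-in for workflow_log
toy_log = [
    ['Register', 'Analyze Defect', 'End'],
    ['Register', 'Analyze Defect', 'Analyze Defect', 'End'],
]

# chain.from_iterable walks through all traces lazily, without building an extra list
toy_counter = Counter(chain.from_iterable(toy_log))

print(toy_counter.most_common(1))  # [('Analyze Defect', 3)]
</code>

A ''Counter'' behaves like a dictionary, so it can be indexed in the same way as ''ev_counter''.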

Then, in our model, we can change the node label so that it includes the computed count:

<code python>
# use these two lines in place of the G.add_node(...) call inside the loop over w_net
text = event + ' (' + str(ev_counter[event]) + ")"
G.add_node(event, label=text, style="rounded,filled", fillcolor="#ffffcc") # code for PyGraphviz
</code>

We can also change the transparency of the discovered tasks based on their frequencies (the code below is for PyGraphviz and has to be adjusted for graphviz):

<code python>
color_min = min(ev_counter.values())
color_max = max(ev_counter.values())

G = pgv.AGraph(strict=False, directed=True)
G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
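for event in w_net:
    value = ev_counter[event]
    # alpha channel from 0x00 (most frequent event) to 0x64 (least frequent one);
    # "or 1" avoids division by zero when all events are equally frequent
    color = int((color_max - value) / ((color_max - color_min) or 1) * 100)
    # always write the alpha channel as two hex digits (#RRGGBBAA)
    my_color = "#ff9933{:02x}".format(color)
    G.add_node(event, style="rounded,filled", fillcolor=my_color)
    for succeeding in w_net[event]:
        G.add_edge(event, succeeding)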

G.draw('simple_heuristic_net_with_colors.png', prog='dot')
</code>
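
A variant of this mapping, in which more frequent tasks are drawn more opaque, can be wrapped in a small helper. The function name ''fill_color'' and the ''min_alpha'' parameter are assumptions introduced here for illustration; the sketch only builds the ''#RRGGBBAA'' string and does not depend on any graph library.

<code python>
def fill_color(value, lowest, highest, base='#ff9933', min_alpha=40):
    """Return base colour plus a two-digit hex alpha channel (#RRGGBBAA).

    The least frequent value gets min_alpha, the most frequent one gets 255.
    """
    if highest == lowest:
        # all values are equal, so there is nothing to scale: use full opacity
        return base + 'ff'
    share = (value - lowest) / (highest - lowest)
    alpha = round(min_alpha + share * (255 - min_alpha))
    return '{}{:02x}'.format(base, alpha)

# Hypothetical frequencies, used only for illustration
toy_counter = {'Register': 50, 'Analyze Defect': 50, 'Restart Repair': 10}
low, high = min(toy_counter.values()), max(toy_counter.values())

for name, count in toy_counter.items():
    print(name, fill_color(count, low, high))
</code>

The returned string can be passed as the ''fillcolor'' attribute of a node.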

We can also try to discover start and end events and correct the model:

<code python>
# events that have at least one successor
ev_source = set(w_net.keys())
# events that follow at least one other event
ev_target = set().union(*w_net.values())
ev_start_set = ev_source - ev_target
print("start set: {}".format(ev_start_set))
ev_end_set = ev_target - ev_source
print("end set: {}".format(ev_end_set))

for ev_end in ev_end_set:
    end = G.get_node(ev_end)
    end.attr['shape']='circle'
    end.attr['label']=''

G.add_node("start", shape="circle", label="")
for ev_start in ev_start_set:
    G.add_edge("start", ev_start)

G.draw('simple_heuristic_net_with_events.png', prog='dot')
</code>
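
The set difference used above finds no start event if the first task of a process also occurs later in some trace (it then appears among the targets as well). A possible alternative, assuming the traces are still available, is to read the first and the last event of every trace. The sketch below uses a made-up ''toy_log''.

<code python>
# Hypothetical stand-in for workflow_log; 'Register' also occurs in the middle of a trace
toy_log = [
    ['Register', 'Analyze Defect', 'Register', 'Analyze Defect', 'End'],
    ['Register', 'Analyze Defect', 'End'],
]

# first and last event of every non-empty trace
start_events = {trace[0] for trace in toy_log if trace}
end_events = {trace[-1] for trace in toy_log if trace}

print("start set: {}".format(start_events))
print("end set: {}".format(end_events))
</code>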

{{:pl:dydaktyka:dss:lab:simple_heuristic_net_colors.png?570|}}

===== graphviz instead of pygraphviz =====

It is possible to use graphviz instead of pygraphviz, but it has a different syntax, e.g.:

<code python>
import graphviz
G = graphviz.Digraph()
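for event in w_net:
    G.node(event, style="rounded,filled", fillcolor="#ffffcc")
    # w_net[event] holds the events that directly follow this event
    for succeeding in w_net[event]:
        G.edge(event, succeeding)

G.graph_attr['rankdir'] = 'LR'
G.node_attr['shape'] = 'Mrecord'
G.edge_attr.update(penwidth='2')
G.node("End", shape="circle", label="")
# writes the DOT source to 'simple_graphviz_graph' and renders it (PDF by default)
G.render('simple_graphviz_graph')
# display() is available in notebooks (Jupyter, Colab); in a plain script, open the rendered file instead
display(G)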
</code>

{{:pl:dydaktyka:dss:lab:graphviz-example.png?570|}}
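
Both libraries ultimately produce text in the Graphviz DOT language, so the structure of the model can also be inspected by writing that text by hand. The sketch below is an illustration only: the function name ''to_dot'' and the ''toy_net'' data are assumptions, and the code assumes that event names contain no double quotes.

<code python>
def to_dot(net):
    """Return the DOT source of a directly-follows net as a string."""
    lines = ['digraph G {', '  rankdir=LR;', '  node [shape=Mrecord];']
    for source in sorted(net):
        for target in sorted(net[source]):
            # names are quoted because they contain spaces and parentheses
            lines.append('  "{}" -> "{}";'.format(source, target))
    lines.append('}')
    return '\n'.join(lines)

# Hypothetical excerpt of w_net, used only for illustration
toy_net = {
    'Analyze Defect': {'Inform User', 'Repair (Simple)'},
    'Archive Repair': {'End'},
}

dot_source = to_dot(toy_net)
print(dot_source)
</code>

If the text is saved to a file, the Graphviz command-line tool can render it, e.g. ''dot -Tpng model.dot -o model.png'' (the file names are placeholders).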
===== Exercises =====

Extend process discovery with additional features:
  - Try to discover the frequency of each transition (flow) and render the number of occurrences both as a label and as the thickness of the line.
  - Add a filtering option to show or hide tasks or flows according to a chosen threshold.
  - Optimize the code by avoiding the creation of additional lists, e.g. using ''itertools'', ''more_itertools'' or other Python tools.
  - 8-o Only for interested students: Try to implement and discover relations according to the Alpha algorithm.
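
As a possible starting point for the first two exercises, the sketch below counts how often each flow occurs in a made-up ''toy_log'' and keeps only the flows that reach a threshold. All names (''flow_counter'', ''threshold'', ''pen_width'') and the scaling of the line width are assumptions made for illustration, not a reference solution.

<code python>
from collections import Counter
from itertools import islice

# Hypothetical stand-in for workflow_log
toy_log = [
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Complex)', 'End'],
    ['Register', 'Analyze Defect', 'Repair (Simple)', 'End'],
]

flow_counter = Counter()
for trace in toy_log:
    # pair every event with its direct successor without copying the trace
    flow_counter.update(zip(trace, islice(trace, 1, None)))

# keep only the flows that occur at least `threshold` times
threshold = 2
kept_flows = {flow: n for flow, n in flow_counter.items() if n >= threshold}

# scale the line width between 1 and 5 according to the flow frequency
max_count = max(flow_counter.values())
pen_width = {flow: 1 + 4 * n / max_count for flow, n in kept_flows.items()}

for (source, target), n in kept_flows.items():
    print(source, '->', target, n, round(pen_width[(source, target)], 2))
</code>

The counts and widths could then be passed as the ''label'' and ''penwidth'' attributes when the edges are added to the graph.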

<fc #ff0000>There is no report required after this lab.</fc> However, it is possible to submit an additional report for 5 points (for a very good score) presenting the implementation of at least two of the above exercises.