UUM Electronic Theses and Dissertation
UUM ETD | Universiti Utara Malaysian Electronic Theses and Dissertation

Framework for mining XML format business process log data

Ang, Jin Sheng (2024) Framework for mining XML format business process log data. Doctoral thesis, Universiti Utara Malaysia.

Files:
permission to deposit-allow embargo 12 months-s904045.pdf (942kB): Restricted to Repository staff only
s904045_01.pdf (12MB): Restricted to Repository staff only until 10 January 2025
s904045_02.pdf (8MB): Restricted to Repository staff only

Abstract

With the advent of the Internet, the volume of semi-structured and unstructured data has increased dramatically. Many frequent subtree mining (FSM) algorithms and methods have therefore been developed to extract information from semi-structured data, specifically data that are hierarchical in nature. However, many existing FSM algorithms and methods neglect or fail to preserve structural information, which hinders the extraction of meaningful insights from such data. In addition, statistical analysis and data mining techniques are difficult to apply to eXtensible Markup Language (XML) documents. This study introduces an alternative approach for mining XML documents that can be modelled as trees. The Flatten Sequential Structure Model (FSSM) was developed to transform tree-structured data into a structured format while preserving its structural integrity, thus facilitating comprehensive statistical analysis and data mining. FSSM is divided into two phases. The first phase converts tree-structured data into a flat structure that retains the structural information. The second phase converts the output of the first phase into a structured format, on which statistical analysis or classification is then conducted. The effectiveness of the methods and framework was assessed by applying them to both simulation datasets and real-life event logs, namely those of the Business Process Intelligence Challenge (BPIC). After the FSSM phases were applied to the simulation and real-life event log data, descriptive statistics, t-tests, and chi-square tests were successfully executed. The association rules obtained outnumbered those from existing FSM methods. The Random Forest model outperformed the others with a classification accuracy of 0.75 on the simulation data, while the decision tree achieved the highest accuracy (0.7474) on the BPIC 2017 dataset. On the BPIC 2018 dataset, all three models performed well, exceeding 0.99 in classification accuracy. The results indicate that transforming complex hierarchical data into a format suitable for statistical analysis simplifies the analysis process and makes it more accessible to researchers in various fields.
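
The two-phase idea described in the abstract can be illustrated with a rough sketch. This is not the FSSM implementation from the thesis, whose details the abstract does not give. It only assumes that phase one is a pre-order walk that records each node's depth and parent, and that phase two pivots each sub-tree into one table row. The sample log, the function names `flatten` and `to_rows`, and the column naming scheme are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature event log; element and attribute names are
# invented for illustration and are not taken from the BPIC datasets.
SAMPLE_LOG = """
<log>
  <trace id="t1">
    <event name="submit" cost="10"/>
    <event name="approve" cost="5"/>
  </trace>
  <trace id="t2">
    <event name="submit" cost="12"/>
  </trace>
</log>
"""


def flatten(root):
    """Phase 1 (assumed): pre-order walk that emits one record per node
    and keeps its depth and parent index as structural information."""
    records = []

    def walk(node, depth, parent_idx):
        idx = len(records)
        records.append({
            "idx": idx,
            "parent": parent_idx,  # -1 marks the root
            "depth": depth,
            "tag": node.tag,
            "attrib": dict(node.attrib),
        })
        for child in node:
            walk(child, depth + 1, idx)

    walk(root, 0, -1)
    return records


def to_rows(records, row_tag="trace"):
    """Phase 2 (assumed): one row per `row_tag` node; attributes of its
    descendants become columns named <tag><position>_<attribute>."""
    rows = []
    for i, rec in enumerate(records):
        if rec["tag"] != row_tag:
            continue
        row = dict(rec["attrib"])
        counts = defaultdict(int)
        # In pre-order, descendants follow directly and have greater depth.
        for sub in records[i + 1:]:
            if sub["depth"] <= rec["depth"]:
                break
            counts[sub["tag"]] += 1
            for key, value in sub["attrib"].items():
                row[f"{sub['tag']}{counts[sub['tag']]}_{key}"] = value
        rows.append(row)
    return rows


records = flatten(ET.fromstring(SAMPLE_LOG))
rows = to_rows(records)
for row in rows:
    print(row)
```

Each resulting row is a flat dictionary, so the rows can be loaded into any tabular tool for descriptive statistics, hypothesis tests, or classification. Traces with fewer events simply have fewer columns, which a real pipeline would have to handle as missing values.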

Item Type: Thesis (Doctoral)
Supervisor: Mohd Jamil, Jastini and Mohd Shaharanee, Izwan Nizal
Item ID: 11012
Uncontrolled Keywords: Process Mining, XML, Data Mining, Statistical Analysis
Subjects: T Technology > T Technology (General) > T58.5-58.64 Information technology
Q Science > QA Mathematics > QA299.6-433 Analysis
Divisions: Awang Had Salleh Graduate School of Arts & Sciences
Date Deposited: 29 Feb 2024 00:24
Last Modified: 29 Feb 2024 00:24
Department: Awang Had Salleh Graduate School of Arts & Sciences
URI: https://etd.uum.edu.my/id/eprint/11012
