UUM ETD | Universiti Utara Malaysian Electronic Theses and Dissertation

An automatic diacritization algorithm for undiacritized Arabic text

Zayyan, Ayman Ahmad Muhammad (2017) An automatic diacritization algorithm for undiacritized Arabic text. Masters thesis, Universiti Utara Malaysia.

Abstract

Modern Standard Arabic (MSA) is used today in most written and some spoken media. It is not, however, the native dialect of any country. Recently, the volume of written dialectal Arabic text has increased dramatically. Most of this text is written in the Egyptian dialect, which is considered the most widely used and understood dialect throughout the Middle East. As in other Semitic languages, short vowels in written Arabic are not written as letters but are represented by diacritic marks. Nonetheless, these marks are omitted from most modern Arabic texts (for example, books and newspapers). Their absence creates considerable ambiguity, because an undiacritized word may correspond to more than one correct diacritization (vowelization) form. Hence, the aim of this research is to reduce the ambiguity caused by the absence of diacritic marks, using a hybrid algorithm with significantly higher accuracy than state-of-the-art systems for MSA. The research also implements the algorithm for dialectal Arabic text and evaluates its accuracy there. The design of the proposed algorithm is based on two main techniques: statistical n-grams with maximum likelihood estimation, and a morphological analyzer. The research proposes merging the word, morpheme, and letter levels, together with their sub-models, into one platform in order to improve automatic diacritization accuracy. Moreover, ignoring case-ending diacritization, that is, the diacritic mark on the last letter of the word, yields a significant reduction in error. The reason given for this improvement is that the Arabic language prohibits adding diacritic marks over some letters. The hybrid algorithm achieved a performance of 97.9% when applied to an MSA corpus (Tashkeela), 97.1% when applied to LDC’s Arabic Treebank-Part 3 v1.0, and 91.8% when applied to an Egyptian dialectal corpus (CallHome).
The main contribution of this research is the hybrid algorithm for automatic diacritization of undiacritized MSA and dialectal Arabic text. The proposed algorithm was applied to and evaluated on the Egyptian colloquial dialect, the most widely understood and used dialect throughout the Arab world; based on the literature review, this is considered the first such evaluation.
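
The statistical component described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: it assumes a word-level bigram model with a unigram fallback and greedy left-to-right decoding, and it leaves out the morphological analyzer and the morpheme and letter levels entirely. All function names (`train`, `diacritize`, `diacritic_error_rate`) are hypothetical, and the set of diacritic marks is assumed to be the Unicode range U+064B to U+0652. The scoring function shows one way to read "ignoring the diacritic mark on the last letter of the word" when measuring error.

```python
from collections import Counter, defaultdict

# Assumption: diacritics are the Arabic marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Remove diacritic marks, leaving only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def train(sentences):
    """Count diacritized forms of each bare word, alone and after the previous word."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word in sentence.split():
            bare = strip_diacritics(word)
            unigram[bare][word] += 1
            bigram[(prev, bare)][word] += 1
            prev = word
    return unigram, bigram


def diacritize(text, unigram, bigram):
    """Greedy decoding: pick the form with the highest relative frequency.

    The maximum likelihood estimate of P(form | previous word, bare word) is a
    ratio of counts, so its argmax is simply the most frequent form.
    """
    out, prev = [], "<s>"
    for bare in text.split():
        counts = bigram.get((prev, bare)) or unigram.get(bare)
        best = counts.most_common(1)[0][0] if counts else bare  # unseen: leave bare
        out.append(best)
        prev = best
    return " ".join(out)


def split_marks(word):
    """Split a word into (letter, marks) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, hypothesis, ignore_case_ending=False):
    """Fraction of letters whose marks differ; optionally skip each word's last letter.

    Assumes both strings contain the same words with the same base letters.
    """
    errors = total = 0
    for ref, hyp in zip(reference.split(), hypothesis.split()):
        r, h = split_marks(ref), split_marks(hyp)
        if ignore_case_ending:
            r, h = r[:-1], h[:-1]
        total += len(r)
        errors += sum(a != b for a, b in zip(r, h))
    return errors / total if total else 0.0
```

Greedy decoding keeps the sketch short; a sequence-level search such as Viterbi would be the more usual choice when a locally best form can be wrong in context.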

Item Type: Thesis (Masters)
Uncontrolled Keywords: Automatic diacritization, diacritic marks, morphological analyzer, maximum likelihood estimation, statistical n-gram.
Subjects: T Technology > T Technology (General) > T58.5-58.64 Information technology
Divisions: Awang Had Salleh Graduate School of Arts & Sciences
Depositing User: Mr. Badrulsaman Hamid
Date Deposited: 19 Sep 2018 04:07
Last Modified: 19 Sep 2018 04:07
URI: http://etd.uum.edu.my/id/eprint/6822
