UUM Electronic Theses and Dissertation

Text representation using canonical data model

Hadi, Hiba Jasim (2016) Text representation using canonical data model. Masters thesis, Universiti Utara Malaysia.

Abstract

The growth of digital technology and the World Wide Web has increased the number of digital documents used for purposes such as publishing, which in turn has raised awareness of the need for effective techniques to support text search and retrieval. Text representation plays a crucial role in representing text in a meaningful way, and the clarity of a representation depends closely on the method selected. Traditional text representation methods such as term frequency-inverse document frequency (TF-IDF) ignore the relationships between words and their meanings in documents. As a result, the sparsity and semantic problems that are predominant in textual documents remain unresolved. This research reduces these problems by proposing a Canonical Data Model (CDM) for text representation. CDM is constructed through an accumulation of syntactic and semantic analysis. Documents from the 20 Newsgroups dataset were used to test the validity of CDM for text representation. The text documents go through a number of pre-processing steps and syntactic parsing to identify the sentence structure, and the TF-IDF method is then used to represent the text through CDM. The findings showed that CDM represents text efficiently, based on model validation through language experts' review and the percentages obtained from the similarity measurement methods.
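
The limitation of plain TF-IDF noted in the abstract can be illustrated with a minimal sketch. This is not the thesis's CDM or its experimental code; the helper names `tf_idf` and `cosine` and the three toy sentences are assumptions made only for illustration. It shows that two sentences with the same meaning but different words receive a similarity of zero, because each term is treated as an unrelated dimension.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document (hypothetical helper)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (assumed for illustration, not taken from the thesis).
docs = [
    "the car is fast".split(),
    "the automobile is quick".split(),  # same meaning as the first, different words
    "the car is red".split(),           # different meaning, shares the word "car"
]
vectors = tf_idf(docs)

synonym_similarity = cosine(vectors[0], vectors[1])
shared_word_similarity = cosine(vectors[0], vectors[2])
print(synonym_similarity, shared_word_similarity)
```

In this sketch the paraphrase pair scores lower than the pair that merely shares a surface word, which is the kind of semantic gap a representation built on syntactic and semantic analysis is meant to reduce. Each vector also holds weights only for the few terms present in its document, which hints at the sparsity problem on a realistic vocabulary.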

Item Type: Thesis (Masters)
Supervisor: Ahmad, Azizah and Kamaruddin, Siti Sakira
Item ID: 5633
Uncontrolled Keywords: Text Representation, TF-IDF, CDM
Subjects: T Technology > T Technology (General) > T58.5-58.64 Information technology
Divisions: Awang Had Salleh Graduate School of Arts & Sciences
Date Deposited: 16 May 2016 11:53
Last Modified: 05 Apr 2021 02:29
URI: https://etd.uum.edu.my/id/eprint/5633
