Arabic Language Spell Checker for Diacritics
Abstract
This thesis describes a system that restores diacritics to Arabic text, whether
the input text is partially diacritized or not diacritized at all. The system depends
solely on statistics generated from fully diacritized text. A corpus of fully
diacritized text was set up as the training set for the system and used to
build a language model of diacritization; that is, the probability distribution over
sequences of diacritized words is learned from the corpus. Eight experiments
were conducted on eight corpora in order to track and analyze the results as the
corpus grows in size.
The results of the experiments showed that a bigram model restores diacritics
more effectively than the other models. Even without a technique for handling
words and bigrams that were not seen in the training set, the system was still
able to restore diacritics at least as effectively as previous
approaches.
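As an illustration of how a bigram model can choose among competing diacritized forms, the following is a minimal Python sketch, not the implementation used in the thesis. The class name BigramDiacritizer, the sentence-start token, the toy corpus, the greedy left-to-right decoding, and the fallback to the most frequent form of a word when a bigram is unseen are all assumptions made for this example. For simplicity it also discards any diacritics already present in the input, whereas the thesis system accepts partially diacritized text.

```python
import re
from collections import Counter, defaultdict

# Arabic combining diacritics: tanween, short vowels, shadda, sukun, dagger alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
START = "<s>"  # hypothetical sentence-start token (an assumption of this sketch)


def strip_diacritics(word):
    """Return the word with all Arabic diacritics removed."""
    return DIACRITICS.sub("", word)


class BigramDiacritizer:
    """Toy bigram model: picks the diacritized form seen most often
    after the previously chosen diacritized word."""

    def __init__(self):
        self.unigrams = defaultdict(Counter)  # bare word -> diacritized forms
        self.bigrams = defaultdict(Counter)   # (previous form, bare word) -> forms

    def train(self, sentences):
        for sentence in sentences:
            prev = START
            for word in sentence:
                bare = strip_diacritics(word)
                self.unigrams[bare][word] += 1
                self.bigrams[(prev, bare)][word] += 1
                prev = word

    def restore(self, words):
        restored = []
        prev = START
        for word in words:
            bare = strip_diacritics(word)
            # Assumption: an unseen bigram falls back to the most frequent
            # form of the word; an unseen word is left unchanged.
            candidates = self.bigrams.get((prev, bare)) or self.unigrams.get(bare)
            best = candidates.most_common(1)[0][0] if candidates else word
            restored.append(best)
            prev = best
        return restored


# Tiny illustrative corpus: the same bare word receives different
# diacritics depending on the word that precedes it.
corpus = [
    ["هَذِهِ", "كُتُبٌ"],
    ["الوَلَدُ", "كَتَبَ"],
]

model = BigramDiacritizer()
model.train(corpus)

for sentence in corpus:
    bare_sentence = [strip_diacritics(w) for w in sentence]
    print(" ".join(model.restore(bare_sentence)))
```

Greedy decoding keeps the sketch short; a search over whole-sentence sequences would be an alternative design, at the cost of more code.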
At the end of each experiment, the results were measured in terms of Word
Error Rate (WER), where a word counts as an error if it contains one or more wrongly
diacritized letters, and Diacritics Error Rate (DER), which is based on the total number
of wrongly diacritized letters in the test set. The best result was achieved with a
bigram model on the corpus that contained the greatest number of words (WER = 14.70% and DER = 5.92%).
However, if previously unseen bigrams were ignored, the results would average
6% for WER and 3.5% for DER.
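The two metrics can be sketched in Python as follows. This is an illustrative reading of the definitions above rather than the thesis's evaluation code. The function names are hypothetical, and two points are assumptions: DER is normalized by the total number of letters in the test set, and WER by the total number of words.

```python
# Arabic combining diacritics: tanween, short vowels, shadda, sukun, dagger alef.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def letter_marks(word):
    """Split a word into (base letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (WER, DER) for two aligned lists of diacritized words."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    word_errors = letter_errors = total_letters = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_pairs, hyp_pairs = letter_marks(ref), letter_marks(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("words differ in their base letters")
        # A letter is wrong if its set of marks differs; marks are sorted so
        # that the order of shadda and a vowel does not matter.
        wrong = sum(
            sorted(r) != sorted(h)
            for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        letter_errors += wrong
        total_letters += len(ref_pairs)
        word_errors += 1 if wrong else 0
    return word_errors / len(reference), letter_errors / total_letters


reference = ["كَتَبَ", "الوَلَدُ"]
hypothesis = ["كُتُبٌ", "الوَلَدُ"]
wer, der = error_rates(reference, hypothesis)
print(f"WER = {wer:.2%}, DER = {der:.2%}")
```

Because a single wrong letter makes the whole word wrong, WER is typically higher than DER, which matches the relationship between the figures reported above.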
Student(s)
Marwan Muhyiddine Itani
Supervisor(s)
Ahmed Sherif Zekri, Rashed Zantout