Arabic Language Spell Checker for Diacritics


The thesis describes a system that aims to restore diacritics to Arabic text, whether the input text is partially diacritized or not diacritized at all. The system depends solely on statistics generated from fully diacritized texts. A corpus of fully diacritized text was set up as the training set for the system. The training set is used to build a language model of diacritization: in essence, the probability distribution over sequences of diacritized words is learned from the corpus. Eight experiments were conducted on eight corpora in order to track and analyze the results as the corpus grows in size. The results of the experiments showed that a bigram model is more efficient than other models at restoring diacritics. Even without using a technique to resolve the issue of words and bigrams that were not seen in the training set, the system was still able to restore diacritics at least as effectively as previous approaches. At the end of each experiment, the results were measured in terms of Word Error Rate (WER), where a word counts as an error if it contains one or more wrongly diacritized letters, and Diacritic Error Rate (DER), which counts the wrongly diacritized letters in the test set. The best result was achieved using a bigram model on the corpus that contained the greatest number of words (WER = 14.70% and DER = 5.92%). However, if previously unseen bigrams were ignored, the results would average 6% for WER and 3.5% for DER.
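As a rough illustration of the approach summarized above, the following Python sketch trains a word-level bigram model on a tiny diacritized corpus, restores diacritics greedily, and computes WER and DER as fractions. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the `<s>` sentence-start marker, the greedy left-to-right decoding, the fallback to unigram counts on ties, and the three toy training sentences are all hypothetical. For simplicity it strips any diacritics already present in the input instead of exploiting them, and it leaves unseen words undiacritized.

```python
from collections import Counter, defaultdict

def is_diacritic(ch):
    # Arabic tanween, short vowels, shadda and sukun: U+064B..U+0652.
    return "\u064b" <= ch <= "\u0652"

def strip_diacritics(word):
    return "".join(ch for ch in word if not is_diacritic(ch))

def split_letters(word):
    """Split a word into (base letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if is_diacritic(ch) and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def train(sentences):
    """Count diacritized forms per bare word and bigrams of diacritized words."""
    forms = defaultdict(Counter)  # bare word -> Counter of diacritized forms
    bigrams = Counter()           # (previous form, current form) -> count
    for sentence in sentences:
        prev = "<s>"              # assumed sentence-start marker
        for word in sentence:
            forms[strip_diacritics(word)][word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return forms, bigrams

def restore(model, sentence):
    """Greedy decoding: pick the form with the highest bigram count given the
    previously chosen form, breaking ties by unigram count (an assumption)."""
    forms, bigrams = model
    prev, output = "<s>", []
    for token in sentence:
        bare = strip_diacritics(token)
        candidates = forms.get(bare)
        if candidates:
            best = max(candidates,
                       key=lambda f: (bigrams[(prev, f)], candidates[f]))
        else:
            best = bare           # unseen word: left undiacritized
        output.append(best)
        prev = best
    return output

def error_rates(references, hypotheses):
    """Return (WER, DER) as fractions; assumes matching base letters."""
    words = wrong_words = letters = wrong_letters = 0
    for ref_sent, hyp_sent in zip(references, hypotheses):
        for ref, hyp in zip(ref_sent, hyp_sent):
            ref_pairs = split_letters(ref)
            bad = sum(r != h for r, h in zip(ref_pairs, split_letters(hyp)))
            words += 1
            wrong_words += bad > 0
            letters += len(ref_pairs)
            wrong_letters += bad
    return wrong_words / words, wrong_letters / letters

# Toy corpus (hypothetical): the bare word كتب is ambiguous between
# كَتَبَ ("he wrote") and كُتُبٌ ("books").
TRAIN = [["كَتَبَ", "الوَلَدُ"], ["كَتَبَ", "وَلَدٌ"], ["هَذِهِ", "كُتُبٌ"]]
MODEL = train(TRAIN)

bare_input = [strip_diacritics(w) for w in TRAIN[2]]
restored = restore(MODEL, bare_input)
print(" ".join(restored))
print(error_rates([TRAIN[2]], [restored]))
```

In this toy corpus the verb reading of كتب is the more frequent one overall, so a unigram model would choose it everywhere; the bigram count with the preceding word هَذِهِ is what selects the noun reading. Multiplying the returned fractions by 100 gives percentages in the form reported in the abstract.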


Marwan Muhyiddine Itani


Ahmed Sherif Zekri, Rashed Zantout