Abstract—Sindhi is a highly homographic language, and in real-life applications its text is written without diacritics, which creates lexical and morphological ambiguity. This ambiguity is one of the most critical problems in Sindhi computational processing because it makes it difficult to assign the correct syntactic category to words in a text. Much work has been done on diacritic restoration using statistical and linguistic approaches, but the results are still not at an acceptable level. The tagging of non-diacritized words can instead be addressed using semantic knowledge. This paper describes a rule-based semantic Part of Speech (POS) tagging system that relies on a WordNet to identify the analogical relations between words in the text. The proposed approach focuses on the use of WordNet structures for the tagging task. POS tagging is the process of assigning the correct syntactic category to each word. A tagset and word disambiguation rules are fundamental parts of any POS tagger. In this research, a tagset for Sindhi POS, word disambiguation rules, and tagging and tokenization algorithms are designed and developed. Two types of lexicons are used: one for simple words and the other for disambiguated words. The corpus is collected from a comprehensive Sindhi dictionary and is based on the most recent available vocabulary used by local people. Experiments using a combination of the two lexicons show promising results, and the accuracy of the proposed approach is acceptable.
Index Terms—WordNet; Part of Speech; Morphology; Lexicon; Tagging Rules
Javed Ahmed Mahar is with the Department of Computer Science, Shah Abdul Latif University, Khairpur, Sindh, Pakistan (email: email@example.com).
Ghulam Qadir Memon is with FEST, HIIT, Hamdard University, Karachi, Sindh, Pakistan (email: firstname.lastname@example.org).
Cite: Javed Ahmed Mahar and Ghulam Qadir Memon, "Sindhi Part of Speech Tagging System Using Wordnet," International Journal of Computer Theory and Engineering, vol. 2, no. 4, pp. 538-545, 2010.