Automatic POS tagging of Arabic words using the YAMCHA machine-learning tool

  • Ahmed Abdelghany Mohammed Universiti Malaysia Kelantan
  • Ahmad Zaki Amiruddin
Keywords: YAMCHA - support vector machine - POS tagging - training corpus - testing corpus


Part-of-speech (POS) tagging of Arabic words assigns each word the part-of-speech tag that suits it. This process is a basic step in most natural language processing (NLP) applications, such as automatic summarization, information retrieval, and machine translation. The aim of this research is to present an automatic Arabic POS tagger built on a statistical system that takes advantage of machine learning. The machine learning system used in this research is YamCha (Yet Another Multipurpose CHunk Annotator), an open-source tool that performs many language processing tasks, such as morphological tagging of words, named entity recognition, and syntactic analysis of sentences. YamCha uses a machine learning algorithm called Support Vector Machines (SVMs), which classifies data accurately and efficiently after learning from a portion of the data set aside for training. YamCha also allows the range and types of linguistic information used for learning (the feature set and window size) to be changed. The proposed methodology therefore requires a sufficiently large body of text already annotated with parts of speech on which to train the system. The corpus used in the research contained 100,039 words and was divided roughly 70% for training and 30% for testing: the training corpus contained 64,608 words and the test corpus 35,431 words. The system was trained on and distinguishes 48 part-of-speech tags. It was trained on the training corpus several times, each time with a different range of linguistic information; the test corpus was then tagged and the results evaluated in order to find the configuration that gives the best results in the automatic tagging of Arabic words.
The lowest error rate, 11.4%, was obtained when the analysis considered the previous word but not its part-of-speech tag (F:-1..0:0..).
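The feature window named above can be illustrated with a minimal Python sketch. It assumes that F:-1..0:0.. selects, for each token, every input column (from column 0 onward) of the previous token and the current token, and that no previously predicted tags are used. The function name window_features, the padding symbol, the feature-string layout, and the transliterated example sentence are hypothetical and chosen only for illustration; this is not YamCha's own code or output format.

```python
from typing import List

PAD = "__PAD__"  # hypothetical placeholder for positions outside the sentence


def window_features(rows: List[List[str]], index: int,
                    start: int = -1, end: int = 0) -> List[str]:
    """Collect features for the token at `index` from a window of tokens.

    `rows` holds one list of input columns per token (the word form and any
    other columns, but not the gold tag). With start=-1 and end=0 the window
    covers the previous token and the current token, which is how this sketch
    reads the template F:-1..0:0.. (an assumption, see the lead-in).
    """
    n_columns = len(rows[0])
    features = []
    for offset in range(start, end + 1):
        position = index + offset
        if 0 <= position < len(rows):
            columns = rows[position]
        else:
            # The window reaches past the sentence boundary: pad every column.
            columns = [PAD] * n_columns
        for column, value in enumerate(columns):
            # Illustrative feature string: relative position, column, value.
            features.append(f"F:{offset}:{column}:{value}")
    return features


# Hypothetical three-word sentence with a single input column (the word form).
sentence = [["kataba"], ["al-waladu"], ["al-darsa"]]
for i in range(len(sentence)):
    print(window_features(sentence, i))
```

Because no tag features are generated, each token is described only by its own form and the form of the word before it, which corresponds to the configuration reported as giving the lowest error rate.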

How to Cite
Abdelghany Mohammed, A., & Amiruddin, A. Z. (2023). Automatic POS tagging of Arabic words using the YAMCHA machine-learning tool. International Online Journal of Language, Communication, and Humanities, 6(1), 75-86.