Training a POS tagging model for Cameroon Pidgin

Introduction

The purpose of this project is to train a Hidden Markov Model part-of-speech tagger for Cameroonian Pidgin. A pidgin is a simplified form of linguistic communication between groups of people who do not have a language in common. Pidgins are not generally considered full languages by linguists. However, pidgin is commonly spoken in various African countries, including Cameroon, Ghana, Nigeria and Senegal, and it varies across regions and countries. Cameroonian Pidgin is also known as Kamtok.

Part-of-speech (POS) tagging refers to labeling each word in a sentence with its grammatical part of speech. In natural language processing, POS tagging is crucial for tasks such as machine translation, named entity recognition, sentiment analysis, and information retrieval.

A Hidden Markov Model (HMM) is a statistical model that can be used to capture hidden information from observable sequential symbols. Given a sequence of inputs, such as words, the model computes a sequence of outputs of the same length. An HMM can be viewed as a graph whose nodes are hidden states (here, POS tags), each with a probability distribution over observable symbols (here, words), and whose edges give the probability of transitioning from one state to another. HMMs are widely used to assign the most likely label sequence to sequential data or to assess the probability of a given label and data sequence. They are useful for tasks such as speech recognition, optical character recognition, and, in this case, tagging words in text.
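To make this concrete, here is a minimal sketch of how an HMM can pick the most likely tag sequence for a sentence, using the Viterbi algorithm, a standard decoding method for HMMs. The two tags, the English toy words, and all probabilities below are made up for illustration and are not taken from the corpus; the small `unseen` probability for unknown words is also an assumption.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely state (tag) sequence for `words`."""
    if not words:
        return []
    # best[i][s] = (log-probability of the best path ending in state s
    #               at position i, previous state on that path)
    best = [{}]
    for s in states:
        emit = emit_p[s].get(words[0], unseen)
        best[0][s] = (math.log(start_p[s]) + math.log(emit), None)
    for i in range(1, len(words)):
        best.append({})
        for s in states:
            emit = math.log(emit_p[s].get(words[i], unseen))
            # Choose the predecessor state that maximizes the path score.
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans_p[p][s]) + emit, p)
                for p in states
            )
            best[i][s] = (score, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path

# Toy model: all numbers are illustrative assumptions.
STATES = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {
    "NOUN": {"dogs": 0.5, "cats": 0.4, "bark": 0.1},
    "VERB": {"bark": 0.6, "run": 0.4},
}

tags = viterbi(["dogs", "bark"], STATES, START_P, TRANS_P, EMIT_P)
print(tags)  # ['NOUN', 'VERB']
```

Working in log space avoids numerical underflow on longer sentences, where multiplying many small probabilities would otherwise round to zero.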

Libraries such as spaCy and NLTK already provide trained machine learning models that can be used to obtain the parts of speech for high-resource languages such as English. The aim of this task is to train a model that can assign POS tags to words from the Cameroon Pidgin corpus.

The dataset is a set of tagged transcriptions of conversations between two or more people, held both by telephone and in person. The dataset was obtained from Dr. Melanie Green, University of Sussex; Dr. Miriam Ayafor, University of Yaoundé I; and Dr. Gabriel Ozon, University of Sheffield, 2016, A Spoken Corpus of Cameroon Pidgin English: pilot study, Literary and Linguistic Data Service, hdl.handle.net/20.500.14106/2563 (https://llds.ling-phil.ox.ac.uk/llds/xmlui/handle/20.500.14106/2563).

Dataset Pre-processing

I am using Google Colab for this project. The dataset contains transcripts of speakers from various towns, including Bamenda, Douala, and Bertoua. Every word in each sentence has already been assigned a part of speech. For the first experiment, only the transcripts from Bamenda and Kumba are selected. The total number of sentences in these transcripts is 7903, and they contain 56 distinct POS tags. Some of the POS tags are NNP (personal noun), PUN (punctuation), FOR (foreign word), and VBO (verb).
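Counts like these could be computed along the following lines. This is only a sketch: it assumes the sentences are already loaded as lists of (token, tag) tuples, and the function name `corpus_stats` and the placeholder tokens (`tok1`, `tok2`, ...) are hypothetical stand-ins rather than real corpus entries.

```python
from collections import Counter

def corpus_stats(tagged_sentences):
    """Summarize a corpus given as lists of (token, tag) tuples."""
    tag_counts = Counter(
        tag for sentence in tagged_sentences for _, tag in sentence
    )
    return {
        "sentences": len(tagged_sentences),
        "tokens": sum(tag_counts.values()),
        "distinct_tags": len(tag_counts),
        "tag_counts": tag_counts,
    }

# Placeholder data, for illustration only.
sample = [
    [("tok1", "NNP"), ("tok2", "VBO"), (".", "PUN")],
    [("tok3", "FOR"), (".", "PUN")],
]

stats = corpus_stats(sample)
print(stats["sentences"], stats["tokens"], stats["distinct_tags"])  # 2 5 4
```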

Having every word already tagged makes the training process easier. Words without tags were omitted from the dataset, and words with more than one tag were also removed. The tokens and their POS tags were then paired as tuples for each sentence.
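The filtering and pairing steps might look roughly like this. The input layout is an assumption: each raw sentence is taken to be a list of (token, list-of-tags) entries, which may not match how the corpus files are actually structured. The function name and placeholder tokens are hypothetical, and dropping sentences that end up empty is an extra choice made for this sketch.

```python
def pair_tokens_with_tags(raw_sentences):
    """Keep only tokens with exactly one tag, as (token, tag) tuples."""
    cleaned = []
    for sentence in raw_sentences:
        pairs = [
            (token, tags[0])
            for token, tags in sentence
            if len(tags) == 1  # skips untagged and multi-tagged words
        ]
        if pairs:  # assumption: drop sentences left with no tokens
            cleaned.append(pairs)
    return cleaned

# Placeholder data, for illustration only.
raw = [
    [("tok1", ["NNP"]), ("tok2", []), ("tok3", ["VBO", "FOR"]), (".", ["PUN"])],
    [("tok4", [])],
]

tagged_sentences = pair_tokens_with_tags(raw)
print(tagged_sentences)  # [[('tok1', 'NNP'), ('.', 'PUN')]]
```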

Training and Result

The data was split into training and test sets with scikit-learn's train_test_split. The HiddenMarkovModelTagger from the NLTK library was trained on the training set, and the test set was used for evaluation.
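A minimal sketch of this step is shown below. The toy sentences use placeholder tokens, and the 80/20 split and `random_state=42` are assumptions, since the actual split settings are not stated. Accuracy is computed by hand at the token level in a hypothetical helper, `token_accuracy`, so the sketch does not depend on a particular NLTK version's evaluation method.

```python
from nltk.tag.hmm import HiddenMarkovModelTagger
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the (token, tag) sentences.
tagged_sentences = [
    [("tok1", "NNP"), ("tok2", "VBO"), (".", "PUN")],
    [("tok3", "NNP"), ("tok2", "VBO"), ("tok4", "FOR"), (".", "PUN")],
    [("tok1", "NNP"), ("tok5", "VBO"), (".", "PUN")],
    [("tok3", "NNP"), ("tok5", "VBO"), ("tok4", "FOR"), (".", "PUN")],
] * 5

# Assumed split settings; the original proportions are not given.
train_sents, test_sents = train_test_split(
    tagged_sentences, test_size=0.2, random_state=42
)

# Supervised training on the tagged training sentences.
tagger = HiddenMarkovModelTagger.train(train_sents)

def token_accuracy(tagger, gold_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(sentence, predicted):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0

accuracy = token_accuracy(tagger, test_sents)
print(f"Token-level accuracy: {accuracy:.2f}")
```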

The results show that the model predicts most of the POS tags correctly: its accuracy on the test set is 0.93.
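An overall accuracy figure does not show which tags the model gets wrong. One way to look into this is a per-tag breakdown, sketched below. The function name `per_tag_accuracy` and the placeholder gold and predicted sentences are hypothetical; they are not results from this project.

```python
from collections import defaultdict

def per_tag_accuracy(gold_sentences, predicted_sentences):
    """Accuracy for each gold tag, given aligned (token, tag) sentences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_sent, pred_sent in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total[gold_tag] += 1
            correct[gold_tag] += int(gold_tag == pred_tag)
    return {tag: correct[tag] / total[tag] for tag in total}

# Placeholder data, for illustration only.
gold = [
    [("tok1", "NNP"), ("tok2", "VBO"), (".", "PUN")],
    [("tok3", "NNP"), (".", "PUN")],
]
predicted = [
    [("tok1", "NNP"), ("tok2", "NNP"), (".", "PUN")],
    [("tok3", "VBO"), (".", "PUN")],
]

scores = per_tag_accuracy(gold, predicted)
print(scores)  # {'NNP': 0.5, 'VBO': 0.0, 'PUN': 1.0}
```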

Challenges

Some of the POS tags in this dataset do not fall under the Penn Treebank POS tags, for instance TMO (modality marker), TPE (perfective aspect marker), FOC (focus marker), and IDG (indigenous word), to name a few. If these tags were categorized as X (other), a lot of meaningful tags would be lost. As a result, the tags provided in this dataset are used as the tagset. Going through this dataset also made me realize how vast the Cameroon Pidgin vocabulary is.
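The loss of information can be illustrated with a small sketch. The mapping below is hypothetical and deliberately partial; it is not an official correspondence between the corpus tags and any coarse tagset. Any tag without an entry falls back to X.

```python
# Hypothetical, partial mapping from corpus tags to coarse tags.
TO_COARSE = {"NNP": "NOUN", "VBO": "VERB", "PUN": "PUNCT"}

def to_coarse(tag):
    """Map a corpus tag to a coarse tag, falling back to X."""
    return TO_COARSE.get(tag, "X")

corpus_tags = ["NNP", "TMO", "VBO", "TPE", "FOC", "IDG", "PUN"]
mapped = [to_coarse(tag) for tag in corpus_tags]

print(mapped)
# ['NOUN', 'X', 'VERB', 'X', 'X', 'X', 'PUNCT']
print(mapped.count("X"))  # 4
```

In this toy case, four distinct grammatical markers become indistinguishable once they are mapped to X.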

Knowledge/Conclusion

The total number of sentences is 7900.

The total number of tokens is 23632.

The accuracy of the model is 0.93.

Next steps

Use a more universal tagset for the POS tags.

Train a POS tagging model for all African pidgins.