Building a Language Assistant for Translating English to Kamtok Phrases
Despite the availability of tools like Google Translate, which supports over 100 languages, many languages and dialects remain underrepresented due to a lack of data. Kamtok, also known as Cameroon Pidgin English, is no exception, and assembling data for it is no easy task.
The goal of this experiment was to create a platform where users can look up translations of words and see how they are used in sentences in Cameroonian Pidgin English (Kamtok).
About the dataset
For this exercise, I used a small dataset of English-to-Pidgin words and phrases. The long-term goal is to expand this dataset so it can be used to build a translation model, chatbots, etc. Currently, the dataset is too small to train a neural network, so I decided to experiment with semantic search instead. I compiled these words and phrases by searching the internet for words, phrases, and their translations, and by adding some translations of my own.
Converting existing data to vectors
I used a pretrained model to convert the dataset into vectors: the all-MiniLM-L6-v2 model from Sentence Transformers. I also converted the user’s input to a vector and then computed the cosine similarity between that input and each document.
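To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. It is not the app's actual code: the function name and the toy 3-dimensional vectors are assumptions for illustration. In the real pipeline the vectors would come from the all-MiniLM-L6-v2 model, which produces 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # partially aligned with the query
]

# One similarity score per document, in dataset order.
scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
print(scores)
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they are unrelated. Vector length does not affect the result, which is why the third document still scores well despite its different magnitude.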
Result versus similar documents
I use the cosine similarity scores to compare the user’s input to the data and return the entry with the highest similarity score. You can see how it works here: https://learn-kamtok-nzf9gii8ykkjnlhvy6hzmc.streamlit.app/
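The retrieval step described above can be sketched roughly as follows. This is an illustration and not the app's code: the `best_match` helper, the entry fields, the placeholder translations, and the score values are all hypothetical.

```python
def best_match(scores, entries):
    """Return (entry, score) for the entry with the highest similarity score."""
    if not scores or len(scores) != len(entries):
        raise ValueError("scores and entries must be non-empty and aligned")
    # Index of the largest score; ties resolve to the earliest entry.
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return entries[best_index], scores[best_index]


# Hypothetical dataset rows; the Kamtok side is left as placeholders.
entries = [
    {"english": "Good morning", "kamtok": "<Kamtok translation 1>"},
    {"english": "How are you?", "kamtok": "<Kamtok translation 2>"},
    {"english": "Thank you", "kamtok": "<Kamtok translation 3>"},
]

# Hypothetical cosine similarity scores between the user's input and each row.
scores = [0.31, 0.87, 0.12]

entry, score = best_match(scores, entries)
print(entry["english"], "->", entry["kamtok"], f"(score {score:.2f})")
```

One possible refinement, not something the original app is known to do, is a minimum score threshold, so that an input unrelated to anything in the dataset returns "no match" instead of the least bad entry.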
Conclusion
This was fun to create. My hope is to build a bigger dataset for more advanced NLP tasks.
What do you think about this tool? I would like to know your thoughts in the comments!