Yorùbá Word Vector Representation with fastText

Wuraola Oyewusi
Oct 24, 2019 · 5 min read


I’m a native speaker of the Yorùbá language. One of the interesting things about the language is that words may have similar spellings but mean different things; these spellings and intonations are differentiated using diacritic marks (e.g. ogún: twenty, ògùn: medicine, ògún: deity of iron).

I’d like to know what interesting things I’ll find when Yorùbá word vectors are learned using a word embedding method like fastText. Luckily, fastText has a repository of pretrained vectors in different languages.

fastText is a library for learning of word embeddings and text classification created by Facebook’s AI Research (FAIR) lab. The model allows one to create an unsupervised or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages. fastText uses a neural network for word embedding. (Wikipedia, October 2019)
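As a quick taste of what these pretrained vectors offer, here is a minimal sketch (my own addition, not part of the original notebook) that queries the Yorùbá model directly with the fasttext library, assuming the model file cc.yo.300.bin has already been downloaded and unzipped as shown further down.

import fasttext
ft = fasttext.load_model('cc.yo.300.bin')       # load the pretrained Yorùbá vectors
print(ft.get_word_vector('ogún').shape)         # each word maps to a 300-dimensional vector
print(ft.get_nearest_neighbors('ogún', k=5))    # closest words by cosine similarity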

Data: The sample data for this article was downloaded from the Niger-Volta-LTI GitHub repository.

All the code is in this notebook.

!pip install fasttext
!wget https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Iroyin/yoglobalvoices.txt   # Download sample Yoruba text
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.yo.300.bin.gz   # Download Yoruba pretrained model
!gunzip cc.yo.300.bin.gz   # Unzip pretrained model
%%capture
import nltk
from pprint import pprint
from gensim.models import FastText
import fasttext
nltk.download('punkt')
from nltk import word_tokenize

For this article, we’ll compare the performance of two models: “trained_model” (trained on the sample data) and “pretrained_model” (trained on Common Crawl and Wikipedia using fastText).

Explore the sample Yorùbá text.

with open('yoglobalvoices.txt') as f:   # Read Yoruba text
    yoruba_text = "".join(f.readlines())
yoruba_tokens = word_tokenize(yoruba_text)
print(len(yoruba_tokens))
print(yoruba_tokens[:20])
27487
['Ìjọba', 'Tanzania', 'fi', 'Ajìjàgbara', 'Ọmọ', 'Orílẹ̀-èdèe', 'Uganda', 'sí', 'àtìmọ́lé', ',', 'wọ́n', 'sì', 'lé', 'e', 'kúrò', 'nílùú', 'Wairagala', 'Wakabi', ',', ...]

Train on the sample text with fastText and save the model. I trained with dimension 300, like the pretrained model; the default number of epochs is 5, but I trained with 10 because the data is small (only about 27,400 tokens).

import fasttext
yorumodel = fasttext.train_unsupervised('yoglobalvoices.txt', epoch=10, dim=300)   # Train model using fastText
yorumodel.save_model('yoglobal.bin')   # Save trained model

Load the trained and pretrained models using gensim

trained_model = FastText.load_fasttext_format('yoglobal.bin')
pretrained_model = FastText.load_fasttext_format('cc.yo.300.bin')

Compare and explore the two models.
It is expected that the pretrained model will perform much better, with higher confidence scores, because it was trained on a very large corpus compared to the data for the trained model. Part of the purpose of this article, though, is to show how these models are trained and some of the things that can be done with them.

We’ll attempt to use the models to pick the odd word out.
Ìjọba : Government, Ìbílẹ: Local, Abùjá: Nigeria’s Capital, ọkọ: Husband
ìyàwó: Wife, Ọmọ: Child

pprint(trained_model.wv.doesnt_match(["Ìjọba", "Ìbílẹ", "Abùjá", "ọkọ"]))
'Ìbílẹ'
pprint(pretrained_model.wv.doesnt_match(["Ìjọba", "Ìbílẹ", "Abùjá", "ọkọ"]))
'ọkọ'
pprint(trained_model.wv.doesnt_match(["Ìjọba", "ìyàwó", "Ọmọ", "ọkọ"]))
'ìyàwó'
pprint(pretrained_model.wv.doesnt_match(["Ìjọba", "ìyàwó", "Ọmọ", "ọkọ"]))
'Ìjọba'

For the two word sets, the pretrained_model made accurate predictions and the trained_model didn’t do well.

Next, I attempted some word vector arithmetic.

For this example, I was trying to see whether either of the models would predict ìyá: mother given positive=[‘ọkùnrin’: man, ‘bàbá’: father] and negative=[“obìrin”: woman].

trained_model.wv.most_similar(positive=['ọkùnrin', 'bàbá'], negative=["obìrin"])
[('lẹ́yìn', 0.11785450577735901),
('àárín', 0.10498246550559998),
('ìdá', 0.09333144873380661),
('tẹ̀', 0.08053965866565704),
('mọ̀', 0.08035025745630264),
('mọ́', 0.06920996308326721),
('pọ̀', 0.06716396659612656),
('sílẹ̀', 0.0633367970585823),
('ọwọ́', 0.06136838719248772),
('ọ̀pọ̀', 0.06086157634854317)]
pretrained_model.wv.most_similar(positive=['ọkùnrin', 'bàbá'], negative=["obìrin"])
[('Omobìnrin', 0.6283954977989197),
('kínrin', 0.6255975961685181),
('fóbìnrin', 0.6219295263290405),
('arákùnrin', 0.6108750104904175),
('aránbìnrin', 0.6077451705932617),
('Ianrin', 0.6050188541412354),
('Aganrin', 0.5930180549621582),
('yanrin', 0.5929874181747437),
('-bìnrin', 0.5906635522842407),
('tobùnrin', 0.5903921127319336)]

The trained model didn’t do a fantastic job at this, but the pretrained model made a good attempt by predicting words like ‘Omobìnrin’: young woman, ‘arákùnrin’: young man, ‘aránbìnrin’: young woman.
I later found a spelling mistake: ‘obìrin’ was written instead of ‘obìnrin’.
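If you want to repeat the experiment with the corrected spelling, a call like the one below should work (this is my own follow-up, not a run from the original notebook, so the output is not shown here).

pprint(pretrained_model.wv.most_similar(positive=['ọkùnrin', 'bàbá'], negative=['obìnrin']))   # 'obìnrin' is the corrected spelling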

Find Most Similar Words

pprint(trained_model.wv.most_similar('ọkọ'))
[('ọdún-un', 0.9934332370758057),
('du', 0.9933178424835205),
('India', 0.9933065176010132),
('Jannat', 0.9933027625083923),
('ajìjàǹgbara', 0.9932816028594971),
('agbára', 0.9932748079299927),
('àárín-in', 0.993270754814148),
('Tanzania', 0.9932376146316528),
('Pakistan', 0.9932341575622559),
('Prem', 0.9932260513305664)]
pprint(pretrained_model.wv.most_similar('ọkọ'))
[('Láìfòtápè', 0.7003475427627563),
('dojúkọ', 0.6962067484855652),
('wọnú', 0.6954782009124756),
('rìndínlọ', 0.6948665380477905),
('Ìforúkọsílẹ', 0.6943686008453369),
('ojú-omi', 0.6940218210220337),
('ṣakóso', 0.689141035079956),
('pò.lọpọ', 0.6798331141471863),
('FV', 0.6793243885040283),
('jáagbà', 0.674048662185669)]

For this example it’s funny how confident the trained model is about all the wrong predictions. ‘ọkọ’ means husband, but the characters in the word have a lot of variations in Yorùbá. I can understand why the pretrained model assumes it is related to ‘ojú-omi’, because canoe/boat is ‘ọkọ̀ ojú-omi’.
There are more examples on this in the notebook, and I hope you plug in your own values and have fun experimenting.
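One way to probe this (my own sketch, not in the original notebook) is to ask the pretrained model directly how close ‘ọkọ’ (husband) sits to ‘ọkọ̀’ (vehicle) and to ‘ojú-omi’, keeping in mind that such calls can still raise a KeyError if none of a word’s subword n-grams are in the model.

print(pretrained_model.wv.similarity('ọkọ', 'ọkọ̀'))      # husband vs. vehicle, which differ only in tone marks
print(pretrained_model.wv.similarity('ọkọ', 'ojú-omi'))   # husband vs. the 'ojú-omi' in 'ọkọ̀ ojú-omi' (canoe/boat)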

Compare Similarity

print(trained_model.wv.similarity('aṣo','ẹ̀wù'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-24-6177e0ab7bac> in <module>()
----> 1 print(trained_model.wv.similarity('aṣo','ẹ̀wù'))
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   1989                     return word_vec / max(1, ngrams_found)
   1990                 else: # No ngrams of the word are present in self.ngrams
-> 1991                     raise KeyError('all ngrams for word %s absent from model' % word)
   1992
   1993     def init_sims(self, replace=False):
KeyError: 'all ngrams for word ẹ̀wù absent from model'

print(pretrained_model.wv.similarity('aṣo','ẹ̀wù'))
0.0286972

To compare similarity, and to flex my command of the Yorùbá language, I decided to test the use of rare words.
‘aṣo’ is the contemporary word for cloth, but it can also be called ‘ẹ̀wù’.
I understand why our trained model complains that ‘all ngrams for word ẹ̀wù absent from model’. fastText works on n-grams of subwords, so even a word that isn’t in a model can be assigned a vector representation from its available subwords, but the training data here is so small that none of the subwords of ‘ẹ̀wù’ were seen.
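For a rough picture of what those subwords are, here is a small illustration (my own sketch, not fastText’s actual implementation) of the character n-grams fastText extracts: the word is wrapped in ‘<’ and ‘>’ and split into n-grams of length 3 to 6 (the default minn and maxn).

def char_ngrams(word, minn=3, maxn=6):
    wrapped = '<' + word + '>'                 # fastText adds boundary markers around each word
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('ẹ̀wù'))   # an unseen word can still share these n-grams with words in the model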

The pretrained model, which is more robust, was able to assign a similarity score, but a very low one for two words that are synonymous. This is understandable because ‘ẹ̀wù’ is not a common word in texts or everyday use, so it’s difficult to capture its relationship to ‘aṣo’.

print(trained_model.wv.similarity('bùbá','ṣòkòtò'))
0.0588443
print(pretrained_model.wv.similarity('bùbá','ṣòkòtò'))
0.499416

For the example above, ‘bùbá’: top and ‘ṣòkòtò’: trousers are words that appear together often. The trained model calculated a low similarity score because the data it was trained on is small and not diverse; the pretrained model did a better job and calculated a reasonable similarity score.
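As a quick sanity check (my own addition, not in the original notebook), wv.similarity is cosine similarity, so the score can be reproduced from the raw vectors with numpy.

import numpy as np
v1 = pretrained_model.wv['bùbá']
v2 = pretrained_model.wv['ṣòkòtò']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))   # should match wv.similarity('bùbá', 'ṣòkòtò')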

There are many other examples in the notebook; you may need to translate some of the words to English to understand them and explore more.
I agree that a word is known by the company it keeps, and I know vector representation is maths, but I’m relieved it works well for my language too.
I hope native speakers write more in Yorùbá and add the proper intonation marks as much as they can.

