Building Swahili Stop words Corpus for Computing

By Leo Mutuku
Data Science Lab
  Published 7th April 2014

With Chris Orwa,

Data Scientist, iHub Research

iHub Research commenced phase two of Umati, the online hate speech monitoring project, in January 2014 with the aim of understanding how online conversations evolve over time in an election cycle, and of detecting dangerous speech: speech with the potential to catalyse violence.

In order to improve the efficiency of our past methodology, the first step was to automate data collection from various public social sites, including Facebook, Twitter, blogs and online forums. We have since been working to build a software tool, the Umati Logger, which will collect the requisite data as well as classify and filter noise using machine learning and natural language processing algorithms. The first stage of the automation process, the Facebook collector, is complete, and we are now successfully collecting comments from public Facebook pages and groups (more on this in a future blog post).

As part of building the auto-classifier, it becomes necessary to analyse the body of text from the comments. This stage first involves removing stop words from the comments. Stop words in computing are words that don't provide context to a document and therefore only increase the computational resources required to process text files. In English these words include: and, or, not, this, that, here, there, etc. What is most interesting here is that a lot of the comments we have collected contain Swahili words, or Sheng' (a local pidgin), or a mixture of all three. However, to the best of our knowledge, there isn't a corpus of Swahili stop words easily available, so we decided to create one.

Using 248,283 comments collected in the month of December 2013, we used the following procedure to extract Swahili stop-words:

Procedure

  • Load all comments from Facebook posts.
  • Convert all comments to lowercase.
  • Break sentences into word tokens.
  • Remove all English stop-words.
  • Create a frequency table of the words.
  • Sample the top 30% of the most frequently occurring words.
  • This sample forms the set of Swahili stop-words.
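The procedure above can be sketched in a few lines of Python. This is an illustrative sketch, not the Umati Logger's actual code: the `comments` list, the `extract_stopwords` function, and the small `ENGLISH_STOPWORDS` set are all hypothetical stand-ins (in practice one would use a full English stop-word list, e.g. NLTK's).

```python
import re
from collections import Counter

# Illustrative subset only; a real run would use a complete English
# stop-word list such as NLTK's stopwords corpus.
ENGLISH_STOPWORDS = {"and", "or", "not", "this", "that", "here", "there",
                     "the", "a", "is", "to", "of", "in"}

def extract_stopwords(comments, top_fraction=0.30):
    """Follow the procedure: lowercase, tokenise, drop English stop
    words, count frequencies, and keep the top 30% of distinct words."""
    counts = Counter()
    for comment in comments:
        # Convert to lowercase, then break the sentence into word tokens.
        tokens = re.findall(r"[a-z']+", comment.lower())
        # Remove English stop words before counting.
        counts.update(t for t in tokens if t not in ENGLISH_STOPWORDS)
    # Build the frequency table and sample the most frequent words.
    ranked = [word for word, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * top_fraction))
    return ranked[:cutoff]

# Hypothetical usage with two made-up Swahili comments:
candidates = extract_stopwords(["Habari yako na karibu sana",
                                "Asante sana na habari"])
print(candidates)
```

In a real run over the 248,283 December comments, the returned candidates would still need a manual pass, since frequent content words (names of politicians, for instance) can rank as high as true stop words.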

This is a particularly useful exercise, since we aim to deploy Umati in the forthcoming Nigerian elections. Nigeria's context is quite similar to our own in Kenya: we expect a local version of pidgin to be in use, and just as with Swahili and Sheng', there are no stop words that we can feed to our software. Moving forward, we hope to crowdsource help with this activity to ensure we build a comprehensive directory for several African languages, which are not typically recognised in computing software.
