With Chris Orwa,
Data Scientist, iHub Research
iHub Research commenced phase two of Umati, the online hate speech monitoring project, in January 2014 with two aims: understanding how online conversations evolve over the course of an election cycle, and detecting dangerous speech, that is, speech with the potential to catalyse violence.
To improve on the efficiency of our past methodology, the first step was to automate data collection from various public social sites, including Facebook, Twitter, blogs and online forums. We have since been building a software tool, the Umati Logger, which will collect the requisite data and then classify it and filter out noise using Machine Learning and Natural Language Processing algorithms. The first stage of the automation process, the Facebook collector, is complete, and we are now successfully collecting comments from public Facebook pages and groups (more on this in a future blog post).
As part of building the auto-classifier, we need to analyse the body text of the comments. The first step is removing stop words. In computing, stop words are words that provide little context to a document and therefore only increase the computational resources required to process text files. In English these include: and, or, not, this, that, here, there, etc. What is most interesting here is that many of the comments we have collected contain Swahili words, or Sheng' (a local pidgin), or a mixture of these. However, to the best of our knowledge, there isn't a corpus of Swahili stop words readily available, so we decided to create one.
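To illustrate the filtering step, here is a minimal sketch of stop-word removal. The stop-word set below is illustrative only: the English entries are common examples, and the Swahili entries are a few frequent function words standing in for the corpus we set out to build, not the derived list itself.

```python
# Illustrative stop-word set: English examples plus a few common Swahili
# function words. This is NOT the Umati corpus, just a placeholder.
STOP_WORDS = {
    # English
    "and", "or", "not", "this", "that", "here", "there", "the", "is",
    # Swahili (hypothetical examples)
    "na", "ya", "wa", "kwa", "ni",
}

def remove_stop_words(comment):
    """Return the comment's tokens with stop words filtered out."""
    tokens = comment.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("Huyu ni kiongozi wa watu and that is the truth"))
# → ['huyu', 'kiongozi', 'watu', 'truth']
```

In practice, tokenisation of mixed English/Swahili/Sheng' text needs more care than a plain `split()`, but the filtering idea is the same.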
Using 248,283 comments collected in December 2013, we followed this procedure to extract Swahili stop words: