By Sidney Ochieng
The Umati Project is iHub Research's effort to monitor dangerous speech online. Currently in phase 2, we've expanded the scope of the project from merely monitoring instances of dangerous speech online to analyzing how online public conversations take place over time, and how some of them veer towards dangerous speech.
Read more about the project's evolving scope here.
However, it's not just the scope of the project that has evolved, but also the methods and technology we use to monitor and classify instances of interesting speech. We currently define interesting speech (in the Kenyan context) as any speech that mentions groups based on tribe, religion, nationality, sexual orientation, gender, disability, influential people (including politicians), cities/regions, and socio-economic class. This broad definition has been adopted to capture as much relevant data as possible, in accordance with the Umati Project's expanded scope.
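In the simplest terms, this definition can be operationalized as a lookup against lists of group terms. The sketch below shows that idea; the category names follow the article, but the example keywords are illustrative placeholders, not Umati's actual keyword lists.

```python
# Toy keyword-based filter for "interesting speech" categories.
# Keywords here are illustrative only -- not the project's real lists.
INTEREST_CATEGORIES = {
    "tribe": ["kikuyu", "luo", "kalenjin"],
    "religion": ["christian", "muslim"],
    "region": ["nairobi", "mombasa"],
}

def mentions_of_interest(text):
    """Return the set of categories whose keywords appear in the text."""
    lowered = text.lower()
    return {
        category
        for category, keywords in INTEREST_CATEGORIES.items()
        for keyword in keywords
        if keyword in lowered
    }
```

In practice a filter like this is only a first pass, since it misses spelling variants and context, which is part of why the project moved towards trained classifiers.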
During the first phase of the project we had six monitors for the major languages: Kiswahili, Luo, Kalenjin, Luhya, Kikuyu and English. Each had a list of Facebook pages, Twitter keywords, and websites and blogs that they kept an eye on, and they spent 8 hours a day searching for instances of dangerous speech across these platforms. Any instances found were coded into a Google Form and categorised. Six other people came in on weekends to substitute for the weekday monitors. Another part of the job was to be on the lookout for new sources. For more information on the manual process, see the methodology section of this Umati Report.
Every 3 months, there were refresher trainings for the monitors on the methodology and the dangerous speech classification process. As is expected of human input, there were some instances of misclassification and varying effectiveness, as well as fluctuations in productivity as the repetitive task grew dull.
Over the course of the last 8 months, we have built tools that automate data collection from Twitter and Facebook. We are currently able to track keywords on Twitter (about 200 at present) and public Facebook pages and groups (about 130) using custom-built software. Using other tools, like import.io, we are able to monitor conversations on blogs and forums. We are also building and training tools that will automatically classify and annotate the large amounts of data we're collecting, which should perform faster and more consistently than a human being.

We are constantly working to improve these tools: humans manually annotate and classify data sets, and the system "learns" from that data to classify other, similar sets. For example, take the data around a single event, such as the attacks in Mpeketoni: we go through a section of the collected data, labelling instances of interesting speech as true and everything else as false. Instances marked as interesting mention groups based on tribe, religion, nationality, sexual orientation, gender, disability, influential people (including politicians), cities/regions, or socio-economic class. We then use that annotated dataset to train an algorithm, test it on the annotated data, note its accuracy, and repeat. When we are relatively confident in the algorithm, we run it on the rest of the dataset.
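The annotate-train-test loop described above can be sketched in a few lines. The article doesn't name the algorithm Umati uses, so this toy Naive Bayes over bag-of-words features is only an illustration of the shape of the process: train on hand-labelled examples, then predict a true/false label for new text.

```python
# Toy bag-of-words Naive Bayes classifier -- an illustrative stand-in
# for whichever algorithm the project actually trains.
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label), label True = interesting speech."""
    counts = {True: Counter(), False: Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        totals[label] += 1
    return counts, totals

def predict(model, text):
    counts, totals = model
    n = sum(totals.values())
    vocab = len(set(counts[True]) | set(counts[False]))
    best, best_score = None, float("-inf")
    for label in (True, False):
        total_words = sum(counts[label].values())
        # log prior + log likelihoods with add-one smoothing
        score = math.log(totals[label] / n)
        for word in tokenize(text):
            score += math.log((counts[label][word] + 1) / (total_words + vocab))
        if score > best_score:
            best, best_score = label, score
    return best
```

In the real workflow, the "note its accuracy and repeat" step would hold out part of the annotated data for testing before running the trained model over the full, unlabelled collection.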
We hope that as we improve the algorithms and increase the number of human examples, the system will eventually be able to classify data with minimal human intervention. That said, the system will always require humans since they will have to initially train the system and also because humans will always do the final analysis; the algorithms only do the initial analysis and sort through the noise. There’s still a long way to go even on the initial front.
The benefits of using more technology for Umati are numerous: software is faster than a human being, never tires, and can run 24/7; it lets us use less manpower; and it reduces costs in the long run.
The cons, however, include higher initial costs for quality data scientists. It takes a long time to build, test and deploy these automated tools, and building them requires a level of expertise and experience that is difficult to find and acquire. Finally, humans are still needed to verify the output of these tools, and the final analysis will always be done by humans, who are able to think creatively.
All in all, I think that software does make the world a better, more efficient place, but we would all do well to remember that there's still no way for a computer to do anything useful without the help of a human. As we seek to replace manual processes with more automated ones, there will always be trade-offs.
Going forward, we want to experiment with sentiment analysis and natural language processing to improve our analysis of speech. This will involve looking at sentence structure, context and other attributes. We have also started building tools that monitor Twitter to detect when an event has happened, reducing the time before we start collecting data on it. We hope these tools will make it easier for anyone to set up an instance of Umati anywhere (e.g. in Nigeria, where the project is currently planning to scale). A series of reports will be released down the road!
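One simple way to detect that "an event has happened" is to watch tweet volume and flag an interval whose count far exceeds the recent average. The article doesn't describe Umati's actual detection method; this moving-average threshold, with hypothetical `window` and `factor` parameters, is only a sketch of the idea.

```python
# Flag spikes in per-interval tweet counts relative to a trailing baseline.
# window and factor are illustrative tuning parameters, not project values.
def detect_spike(counts, window=6, factor=3.0):
    """counts: tweets per time interval, oldest first. Return spike indices."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if counts[i] > factor * max(baseline, 1):
            spikes.append(i)
    return spikes
```

A detector like this would trigger the start of focused data collection around the spiking keywords, rather than waiting for a monitor to notice the event.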
Got any questions or comments? Get in touch via the comments section below.