We make sense of information – this is what A Data Pro exists for. What the slogan doesn’t say is that we have to make sense of it quickly, cheaply, on huge sets of data, and in tune with our clients’ priorities. In the data industry, this means keeping a dynamic balance between automation and manual processing.
With the rise of big data, multiple smaller projects started coming to A Data Pro and timeframes became more demanding. We felt the need for a smarter machine helper and decided to build an Artificial Intelligence (AI) that not only reads text but understands the meaning behind it, so it can automatically index and categorize it. The indexing is done for a variety of purposes – to spot the entities and brands mentioned, to assess and label the topic, and to understand the sentiment behind the writing. If an AI does this successfully, our analysts will spend less time ordering data and more time analyzing it and drawing conclusions. So here is our AI story – why we are doing it and how it is going to work.
The costly supply of meaning
Most of the content A Data Pro works with is text in various languages (from Croatian to Japanese) and in different areas of competence: energy, manufacturing, automotive, healthcare, financial markets, regulatory compliance, etc. We do have our little machine helpers – crawlers, APIs, an indexing and storage platform, and a content management and publishing system, where we store, index, categorize, transform, enrich and distribute everything we do – from press clippings to due diligence reports and media analyses. We also build vocabularies, interconnect meanings and put them into a hierarchy to create taxonomies* – if done right, they allow the computer to follow a clear path, that is, a set of commands telling it what to do with each word of the content we gather.
Let me illustrate that with an example. If a tweet mentions ‘golf’, the algorithm would index the tweet with the labels of Volkswagen, car, and vehicle, because it was specifically instructed that this combination of letters is associated with the entity Volkswagen Golf, which is included in the entity of Volkswagen, which in turn may be included in, or independent from, the entity of cars. The assigned labels allow analysts to work with filters and process batches of tweets instead of sifting through raw data.
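The path from a term to its chain of labels can be sketched as a simple lookup table. This is a toy illustration only – the taxonomy entries and function names are invented for the example, not our production schema:

```python
# Illustrative taxonomy: each term maps to the chain of labels it triggers.
TAXONOMY = {
    "golf": ["Volkswagen Golf", "Volkswagen", "car", "vehicle"],
    "passat": ["Volkswagen Passat", "Volkswagen", "car", "vehicle"],
}

def index_tweet(text):
    """Return the set of taxonomy labels triggered by the words in a tweet."""
    labels = set()
    for word in text.lower().split():
        labels.update(TAXONOMY.get(word.strip(".,!?"), []))
    return labels

print(index_tweet("Just bought a new Golf!"))
# The tweet gets the full label chain: Volkswagen Golf, Volkswagen, car, vehicle.
```

The weakness discussed below is already visible here: the lookup fires on the letters ‘golf’ regardless of context, so a tweet about the sport would get the same automotive labels.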
However, creating specific instructions (or a more complex project-specific taxonomy) for every possible word has its challenges. First, it is a slow and expensive process that can make you rigid and less responsive as a provider. Second, it must have a narrow scope in order to work well – if you cover both automotive topics and sport news, the Golf from the example above might not be a car, but a game. If you get the meaning wrong, the indexing would also be wrong and so would be your analysis.
So what you need is a changing set of instructions that “understands” and responds to the surrounding text, the topic, the context, etc. The more factors you include, the better you can cover all possible meanings. But the more sets of instructions you create, the more expensive the algorithm becomes – instead of indexing and analyzing the data, your analysts are busy building and maintaining the commands to be followed by the computer: definitions and taxonomies. The complexity also takes more time, so quick analysis and real-time indexing get out of reach when you build a classification system for each project.
So one inevitably ends up asking if the computer could create or adapt its own instructions. Can an algorithm learn to distinguish meanings by observing the text that it has to index?
The ability of algorithms to learn and make data predictions and decisions is an old quest in artificial intelligence and one of the main topics of the discipline. There are many approaches that achieve machine learning to varying extents (decision tree learning, association rule learning, artificial neural networks, etc.) and each has its advantages and disadvantages. After discussing it with the CTO, we decided that the best mid-term strategy for A Data Pro is to delve into the domain of Artificial Neural Networks (ANNs).
Invented more than 50 years ago, ANNs operate by creating connections between many different processing elements that form a network structure similar to the neurons and axons in the human brain. Each processing unit (neuron) is a decision node that takes in multiple inputs and produces a single output, which is then sent to another processing unit. Going back to the Golf example, imagine that a single neuron is responsible for deciding whether ‘golf’ is a car or not. If it receives signals from the neurons for ‘car’, ‘automotive’, ‘vehicles’, ‘track’, etc., it will produce an output of 1 (‘car’); if it does not get those inputs, or if there are signals from other neurons (for ‘sport’, ‘grass’, ‘hole’, ‘stick’, etc.), it will produce an output of 0 (‘not car’).
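That single decision node can be written down as a weighted sum with a threshold. In this toy sketch the weights and threshold are invented by hand for illustration – in a real ANN they would be learned:

```python
# Toy "is 'golf' a car?" neuron: a weighted sum of context signals
# compared against a threshold. Weights here are hand-picked, not learned.
def golf_is_car_neuron(inputs):
    """inputs: dict mapping a context word to its signal (0 or 1).
    Returns 1 ('car') or 0 ('not car')."""
    weights = {"car": 1.0, "automotive": 1.0, "vehicles": 0.8, "track": 0.5,
               "sport": -1.0, "grass": -1.0, "hole": -0.8, "stick": -0.8}
    activation = sum(weights.get(word, 0.0) * signal
                     for word, signal in inputs.items())
    return 1 if activation > 0 else 0

print(golf_is_car_neuron({"car": 1, "track": 1}))    # automotive context -> 1
print(golf_is_car_neuron({"grass": 1, "stick": 1}))  # sports context -> 0
```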
The beauty of the model is that the connections between the “neurons” are subject to change – they are reinforced or diminished automatically in the process of analyzing the data. In our simplified example, the ‘golf’ neuron will be strongly connected to the neurons for other words which the ANN has previously encountered close by or combined with ‘golf’ – words like ‘car’, ‘grass’, ‘stick’ and ‘vehicle’. The ‘golf’ neuron will not be connected, or will be only weakly connected, to neurons for terms that ‘golf’ is rarely encountered with – like ‘peanut butter’, ‘jam’ or ‘jelly’. This is true for each of the neurons – the ‘stick’ neuron has its group of connections, ‘vehicle’ has one as well, and ‘jelly’ has one, too. So, when an article mentions ‘vehicles’, ‘road’ and ‘golf’, the whole area of automotive-associated neurons will light up like a Christmas tree. If, instead, the article mentions ‘stick’, ‘grass’ and ‘golf’, the activated area would be the one related to the sport of golf. Depending on the activity of surrounding neurons, the computer would be able to process the correct meaning of ‘golf’ based on the context it is used in. Very neat.
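The reinforcement idea can be sketched with plain co-occurrence counts as connection weights – a deliberate simplification of how a real ANN adjusts its weights, kept here only to show the principle:

```python
from collections import defaultdict
from itertools import combinations

# Connection weights between word pairs, all starting at zero.
weights = defaultdict(float)

def observe(document_words, rate=0.1):
    """Reinforce the connection between every pair of words that
    co-occur in the same document."""
    for a, b in combinations(sorted(set(document_words)), 2):
        weights[(a, b)] += rate

observe(["golf", "car", "vehicle", "road"])   # automotive article
observe(["golf", "grass", "stick", "hole"])   # sports article
observe(["golf", "car", "engine"])            # another automotive article

# 'golf' is now more strongly connected to 'car' (seen together twice)
# than to 'peanut', which it never co-occurred with.
print(weights[("car", "golf")], weights.get(("golf", "peanut"), 0.0))
```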
Use your brain
To fully understand the power behind ANNs, one has to take into account that the algorithm indexes the entire content. In our example, the artificial brain does not stop at building an understanding of the entity ‘golf’. It repeats the process for each word in the text, effectively creating a ‘map’ of meaning that covers the document as a whole and each of its parts individually – all the activated neurons and all their connections, each with their respective intensity. The mapped meaning can be used to extract content, topic and sentiment from the data and to categorize it as desired.
Once trained, the ANN does not rely on people for definitions but builds its own understanding and updates it on the go. In a way, it is learning by doing. It is fast, can index complex, real-time and heterogeneous data (big data), and does not require much to be set up and started. Given that its neurons do not have to be on the same computer, an ANN can be distributed across a network of processors and does not require huge investments in computational power or hardware.
Training an ANN
The cons of the ANN are that it needs a lot of information to be trained (the more, the better) and that the training is somewhat of a mystery. Nobody really knows if and how the algorithm understands each concept – it simply builds up connections to things it encounters. Consequently, the resulting meaning may or may not be what was initially intended.
For example, if you want to train an ANN to recognize tanks, you may train it using photographs – some with tanks in them and some without. If the algorithm guesses correctly, you confirm it and the neuron connections are reinforced. If it is wrong, you reject the answer and the connections are diminished. With enough photographs, the ANN will learn to recognize the look of a tank (or anything else it is trained on – test it yourself here). If you have included top, down, front, back and side pictures in the training, it will also learn to recognize a tank from various viewpoints. You have to be careful to train it for all possibilities.
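The confirm/reject loop described above is essentially perceptron training: a correct guess leaves the connections unchanged, a wrong one shifts them. A minimal sketch, with made-up feature names standing in for whatever the network actually extracts from a photograph:

```python
# Perceptron-style sketch of the reinforce/diminish feedback loop.
# Feature names ("turret", "trees", ...) are illustrative placeholders.
def train(examples, epochs=20, rate=0.1):
    """examples: list of (feature_dict, label) pairs, label 1 = tank."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            score = bias + sum(weights.get(f, 0.0) * v
                               for f, v in features.items())
            guess = 1 if score > 0 else 0
            error = label - guess          # 0 when correct: nothing changes
            for f, v in features.items():  # reinforce or diminish connections
                weights[f] = weights.get(f, 0.0) + rate * error * v
            bias += rate * error
    return weights, bias

examples = [({"turret": 1, "tracks": 1}, 1),   # photo with a tank
            ({"trees": 1, "road": 1}, 0)]      # photo without one
weights, bias = train(examples)
```

After training, tank-associated features carry positive weights and the rest negative ones – which is also why the loop can latch onto the wrong feature, as the next paragraph shows.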
However, you also have to be careful about all the additional things the ANN learns. Much like a toddler, the algorithm will learn from everything you present it with – whether you like it or not. It will also make its own associations. If all the pictures in which you tell it there is a tank were shot in cloudy weather, the ANN will remember that and associate clouds with tanks. Show it a picture of a cloudy sky after the training and it will tell you there is a tank in it. To “fix” its understanding, you would have to train it on additional cloudy pictures without tanks and sunny pictures with tanks (the tank example comes from a US Army experiment back in the 1980s – read the whole story here).
Build on what you are good at
When deciding on what machine learning approach to choose, we wanted a technology that builds on A Data Pro’s core strengths – the language and industry skills of our analysts. The new AI had to be a quick learner and to work in any language. We wanted to catch up with state-of-the-art agents like the ones created by Open Calais, Connotate, and our Bulgarian colleagues at Ontotext. We also wanted to go beyond – to create something that is all ours and will give us a competitive edge. For A Data Pro, ANNs were the perfect match. First, they are an existing yet innovative solution based on available technology – algorithms and frameworks are available in open source (Neural Network toolbox and Neuroph). Second, to make them work, ANNs need to be trained on hundreds of thousands of pieces of structured content, and this is exactly what A Data Pro produces. With some additional IT expertise and machine learning know-how, we can enable the ANN to learn from the mountains of work that our indexing teams have already done. For example, to train an algorithm on textual content, we can use the 55,000 tagged articles that we process every month and also add the years of content in our archive – real data, on real events, indexed manually by expert media analysts.
In September 2014 the Business Development team started work on our CASPAR (Combined Automated Semantic Processing Array) project – an ANN-based platform for automated semantic processing of unstructured content. To get additional help for the required technological jump and the related financial investment, we decided to apply for European funding within the framework of the Eurostars program. The application was a success, and last week we received the European Commission’s confirmation that our AI project is now eligible for EUREKA funding (I will share the hard-learned truths of the Eurostars application in another blog post). What is important for now is that we are building our little robot helper, and its brain will gather the specific experience of a few hundred of our top researchers.
*Taxonomy is the science of classification. A taxonomy or taxonomic scheme is a classification of things (entities, concepts) and the principles followed in order to classify them. The term is mostly associated with biology, where all living organisms are classified by Domain, Kingdom, Phylum, Class, Order, Family, Genus, and Species.
PS: For a deeper, intelligent and fun read on Artificial Intelligence, do check Tim Urban’s post on Wait But Why – The AI Revolution: The Road to Superintelligence.