A Data Pro has completed an R&D project to develop an integrated technology for creating and publishing data-based business news stories. What is next? Will the team of researchers, editors and software developers produce a story worthy of a journalism award, e.g. a Chernorizetz Hrabar prize (the Bulgarian equivalent of the Pulitzer)? Will a computer ever write a better story than a human reporter? Kristian Hammond, co-founder of Narrative Science, believes that time is not far away: “In five years a computer program will win a Pulitzer Prize, and I’ll be damned if it’s not our technology.”
Our team’s ambitions are more modest. We want to save reporters the tedious hours spent opening spreadsheets of figures, analysing them and manually reporting the results. Writing such a story takes a reporter at least half an hour; in the same time a software program can deliver hundreds of stories.
Let’s make things simple. What will a machine need to produce a report, let’s say on Bulgaria’s GDP growth in May, as soon as the statistics office releases its figures? A program to harvest the figures from the source, software to pick the proper language from a pool of options (news templates and thesauri) and an algorithm to turn it into a story.
Easier said than done. For our researchers it meant analysing 800+ sources of information. Editors had to compose templates for corporate and economic news on 28 topics and develop comprehensive thesauri and linguistic rules for their use. A software engineer was needed to roll out data harvesting systems and story generation algorithms.
All that took 29 months. The project, called “Experimental research of types of sources of structured data and design of thesaurus and algorithms for automatic news generation”, was completed on May 7, 2015. It was carried out under the operational programme “Development of the Competitiveness of the Bulgarian Economy 2007-2013”, financed by the EU through the European Regional Development Fund and by the national budget, under the “Support for Research and Development of Bulgarian Enterprises” procedure. The value of the project was some BGN 500,000, all of which went to the remuneration of the R&D team.
Let’s see how it happened.
Information services managers met regularly to define the scope of the content (topics, industries, etc.), based on our editorial policy, readership and customers (information agencies, aggregators, etc.).
A team of 10 researchers then started browsing the web, looking for up-to-date, comprehensive and trustworthy sources of information such as statistical offices, state agencies and stock exchange operators. Every month they checked more than 25 sources to see whether each one provided data that was free and structured, as well as accurate, up to date and comprehensive.
Meanwhile, three editors were drawing up story templates on corporate earnings, share price swings, inflation, the jobless rate and other topics. For each topic they also compiled a thesaurus of terms and expressions and the rules for applying them, taking into account the specifics of the subject and the source. For example, when profits rise they can inch up, edge up, jump, surge or soar, depending on how big the rise is.
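The idea of choosing a verb by the size of the rise can be illustrated with a simple threshold lookup. What follows is only a rough sketch, not the project's actual linguistic rules: the percentage cut-offs, the ordering of the verbs by strength and the function name pick_rise_verb are all assumptions made for illustration. Only the verbs themselves come from the example above.

```python
from bisect import bisect_right

# ASSUMPTION: illustrative cut-offs, in percent; the article gives no figures.
THRESHOLDS = [1.0, 3.0, 10.0, 25.0]

# Verbs from the article; ranking them from weakest to strongest is an assumption.
RISE_VERBS = ["inch up", "edge up", "jump", "surge", "soar"]


def pick_rise_verb(pct_change: float) -> str:
    """Return a verb whose strength matches the size of a rise (hypothetical rule)."""
    if pct_change <= 0:
        raise ValueError("pick_rise_verb only handles increases")
    # bisect_right counts how many thresholds are <= pct_change,
    # which is exactly the index of the matching verb.
    return RISE_VERBS[bisect_right(THRESHOLDS, pct_change)]


for change in (0.5, 2.0, 5.0, 12.0, 40.0):
    print(f"{change}% -> profits {pick_rise_verb(change)}")
```

A real thesaurus would presumably hold separate verb lists for falls and for different subjects (share prices, inflation, jobless rate), each with its own rules.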
The software developers then took the sources that had passed the tests and wrote algorithms to harvest the data from each source and to combine the templates with the data.
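Combining a template with harvested data might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the system that was built: the template wording, the field names (company, profit, previous_profit, currency, period), the sample company and the function generate_story are all hypothetical. A hard-coded dictionary stands in for a record harvested from a real source.

```python
from string import Template

# ASSUMPTION: hypothetical templates, one per direction of change.
TEMPLATES = {
    "up": Template(
        "$company said its net profit rose $change% year-on-year "
        "to $currency $profit million in $period."
    ),
    "down": Template(
        "$company said its net profit fell $change% year-on-year "
        "to $currency $profit million in $period."
    ),
    "flat": Template(
        "$company said its net profit was unchanged year-on-year "
        "at $currency $profit million in $period."
    ),
}


def generate_story(record: dict) -> str:
    """Fill the matching template with one record of figures.

    Assumes previous_profit is positive, so a percentage change is meaningful.
    """
    current = record["profit"]
    previous = record["previous_profit"]
    # Round first, so the direction always agrees with the printed figure.
    change = round((current - previous) / previous * 100, 1)
    direction = "up" if change > 0 else "down" if change < 0 else "flat"
    return TEMPLATES[direction].substitute(
        company=record["company"],
        currency=record["currency"],
        period=record["period"],
        profit=f"{current:.1f}",
        change=f"{abs(change):.1f}",
    )


# Made-up record standing in for harvested data.
sample = {
    "company": "Example AD",
    "period": "the first quarter of 2015",
    "currency": "BGN",
    "profit": 12.4,
    "previous_profit": 10.0,
}
print(generate_story(sample))
```

Keeping the wording in templates and the figures in plain records means editors can change the language without touching the harvesting code, which matches the division of labour described above.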
What lies ahead? We will continue to examine user behaviour and will add new topics such as comparative news and analyses.
The next step will be to develop the technology further so that it can automatically create not only news stories but also industry reports. An industry report is a snapshot of an industry, including sales, trends, players, etc. Companies need it to map out their strategies.