Ingen.io: Taking text understanding to the next level

“Ingen.io will be indispensable for media monitoring agencies,” says Martin Linkov, co-founder of the eponymous Slovakia-based startup.

A bold statement, backed by years of experience in the media monitoring industry, which made the team aware of the problems companies in the field face and open to possible solutions. “Ingen.io extracts information from context,” he continues, “analysing and making assumptions about persons, entities or products which are not mentioned because they are considered common knowledge, yet are related to the information conveyed by the text and key to understanding it better. These insights help computer systems make connections between articles or identify not-so-obvious correlations between different parts of the same text.”

UniGraph-Atanas-Martin-Michal-Igor

Tell us more about the company and the team.

We’re a team of four: Atanas, Martin, Igor and Michal. We get along very well and share years of friendship and experience in Big Data projects. We’re privileged to have great advisers from the large networks of Wayra and ODINE, who, together with Neulogy, invested in the growth of the business and are very supportive.

What products are you building?

We’re developing two complementary products, UniGraph and Ingen.io:

UniGraph is becoming the largest structured open knowledge repository, mapping products, people, events, places… all entities from the world we live in, and showing how they connect with each other. Users such as monitoring agencies, information service providers and analysts can extract the information they need with their existing or purpose-built tools. They will be able, for instance, to identify relations between public figures and companies regardless of their place of registration and governing jurisdiction. UniGraph can be conveniently accessed by everyone via an API or downloaded, for free.

Ingen.io provides APIs for Natural Language Processing and Understanding that are deeply integrated with UniGraph. The augmentations range from basic tasks like Language Identification and in-document classification to more advanced ones like Named Entity Recognition. Combined, they lead to the Context API, which reconstructs the broader contextual information of any piece of information. The tight integration with the knowledge repository, UniGraph, makes Ingen.io independent, very precise and fast: up to six times faster than the industry standard set by established players like Alchemy.

In Wayra, Prague

How exactly does Ingen.io work?

Ingen.io tries to mimic the reasoning process in the human brain by making and extending or discarding predictions about the meaning of the text word by word, sentence by sentence. Along the way the software identifies and surfaces “context hubs” – nodes that facilitate connections between the mentioned entities in the text, but are not included by the author.
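To make the idea of a “context hub” concrete, here is a minimal sketch in Python. It is not Ingen.io’s algorithm; it only assumes that a hub can be approximated as an entity that is absent from the text but directly linked to several entities that are mentioned. The toy graph, the `context_hubs` function and the `min_links` threshold are all hypothetical.

```python
from collections import Counter

# Toy knowledge graph: entity -> directly related entities.
# Illustrative entries only, not UniGraph data.
GRAPH = {
    "Sprite": {"The Coca-Cola Company", "Soft drink"},
    "Fanta": {"The Coca-Cola Company", "Soft drink"},
    "Atlanta": {"The Coca-Cola Company", "Georgia (U.S. state)"},
}


def context_hubs(mentioned, graph, min_links=2):
    """Return (entity, link_count) pairs for entities that are not in the
    text but are linked to at least `min_links` mentioned entities."""
    counts = Counter()
    for entity in mentioned:
        for neighbour in graph.get(entity, ()):
            if neighbour not in mentioned:
                counts[neighbour] += 1
    # most_common() orders candidates by how many mentions they connect.
    return [(hub, n) for hub, n in counts.most_common() if n >= min_links]


hubs = context_hubs({"Sprite", "Fanta", "Atlanta"}, GRAPH)
print(hubs)
```

In this toy case a text that mentions only Sprite, Fanta and Atlanta surfaces The Coca-Cola Company as the strongest hub, even though the company is never named.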

Can you give us a use case?

Consider the following scenario: The Coca-Cola Company hires a new C-level executive for Turkey, who makes an uncoordinated political statement. The company is caught off guard because an analyst didn’t update all the Coca-Cola media queries in time.

Ingen.io can prevent all this by taking over the manual work of keeping complicated search strings up to date. With a simple interface integration, the analyst is presented with a screen on which to define just the company, the language and the market, as per client requirements. Ingen.io takes care of the rest. Always working with the latest data from UniGraph, it keeps track of and returns everything related to the company, its executives and competitors, whether it is mentioned directly or understood from the broader context.

How did you end up building a Knowledge Graph? Aren’t there several already?

At Ingen.io we try to control and own every aspect of the technology. Contrary to the popular industry approach, we don’t use third-party libraries or tools. Every process, from language identification to entity disambiguation, is developed internally. In that spirit, it was just a matter of time before we realized that the deficiencies in the data and schema of the publicly available knowledge bases (DBpedia, Wikidata, Freebase) can’t be overcome. They are a good starting point, but we need something much bigger in terms of data and more accurate in terms of metadata. The available solutions mostly ignore the source of the information and the time period the data refers to. For us this is a key aspect in providing a truly objective representation of the data from the world we live in. The end result is that we’re now building a Knowledge Graph from scratch, including our own schema and storage engine, to better represent the data source and the time dimension.

How will you map “all the data and connections” from the world we live in?

We started by aligning and cleaning the largest repositories, Freebase and Wikidata. Now we’re turning our attention to the thousands of open datasets released by governments around the world. We’ve just created an elaborate data mapping and import solution that allows anybody to define rules on how a dataset is to be represented in the UniGraph schema. We’re also working on harvesting, structuring and reconciliation tools to crawl the web and add data about the latest products, hires and scores almost at the same time as they’re reported by the media.
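As an illustration of what such a mapping rule might look like, here is a small Python sketch. The rule format, the column names and the `apply_rule` function are assumptions made for this example; the actual UniGraph rule language is not described here.

```python
# A hypothetical declarative rule: which source column feeds which property.
RULE = {
    "entity_type": "Company",
    "id_column": "reg_no",
    "properties": {
        "name": "company_name",
        "founded": "date_of_incorporation",
    },
    "source": "example-government-register",  # recorded with every statement
}


def apply_rule(row, rule):
    """Turn one row of a source dataset into
    (subject, property, value, source) statements."""
    subject = f'{rule["entity_type"]}/{row[rule["id_column"]]}'
    statements = []
    for prop, column in rule["properties"].items():
        value = row.get(column)
        if value:  # skip missing or empty cells
            statements.append((subject, prop, value, rule["source"]))
    return statements


row = {"reg_no": "12345", "company_name": "Example s.r.o.", "date_of_incorporation": ""}
statements = apply_rule(row, RULE)
print(statements)
```

Keeping the rule as plain data rather than code is one way to let non-programmers contribute mappings, and carrying the source along with each statement keeps provenance attached from the moment of import.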

Can Ingen.io suggest adding entities in UniGraph?

Yes. We are creating an environment in which the two products continuously complement and improve each other. The more UniGraph grows, the more Ingen.io will understand from the text; the better Ingen.io understands the text, the more the database will be enriched.

Atanas, showcasing UniGraph at the Open Data Summit in London

What technologies do you use?

For UniGraph we are developing a custom key-value engine on top of RocksDB. Before that we tried almost everything (Mongo, MySQL, OrientDB, Aerospike), but none could handle the massive volumes of data we’ve been inserting and querying. We code in Golang and fully utilize the language’s concurrency in order to provide the fastest service possible.
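One common way to store a graph in an ordered key-value store such as RocksDB, which keeps keys sorted and supports iteration from a given key, is to encode each edge as a composite key and read neighbours with a prefix scan. The Python sketch below illustrates that general technique with an in-memory stand-in; the `OrderedKV` class, the `edge_key` layout and the identifiers are hypothetical and do not describe UniGraph’s actual engine.

```python
import bisect

SEP = "\x00"  # separator that sorts before any printable character


class OrderedKV:
    """In-memory stand-in for an ordered key-value store."""

    def __init__(self):
        self._keys = []    # kept sorted
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with `prefix`, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


def edge_key(subject, predicate, obj):
    # Hypothetical layout: subject, predicate and object joined by SEP.
    return SEP.join((subject, predicate, obj))


store = OrderedKV()
store.put(edge_key("company:1", "employs", "person:7"), "")
store.put(edge_key("company:1", "employs", "person:9"), "")
store.put(edge_key("company:1", "located_in", "place:3"), "")
store.put(edge_key("company:10", "employs", "person:2"), "")

prefix = edge_key("company:1", "employs", "")
employees = [key.split(SEP)[2] for key, _ in store.scan_prefix(prefix)]
print(employees)
```

Because the separator follows the subject identifier, a scan for `company:1` does not accidentally pick up edges that belong to `company:10`.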

At first, we decided to build on Freebase, DBpedia and Wikidata and try to extract the best from them. It turned out that they lack connections and full, global coverage. Finally, we decided to develop our own proprietary database. At first we wondered which schema to choose (Schema.org, DBpedia, or something else), since the problem with most of those databases and schemas is that very few of them reflect the time dimension. For instance, according to DBpedia, Marissa Mayer works for both Yahoo! and Google, and it’s not clear where she currently works.
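The Marissa Mayer example shows why each fact needs a validity period and a source. Below is a minimal Python sketch of such a statement, assuming year-level granularity and half-open intervals; the `Statement` fields, the `valid_in` helper and the source labels are hypothetical, and the years are approximate public knowledge used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    valid_from: int            # first year the fact holds
    valid_to: Optional[int]    # first year it no longer holds; None = still valid
    source: str                # where the fact was taken from


FACTS = [
    Statement("Marissa Mayer", "works_for", "Google", 1999, 2012, "illustrative"),
    Statement("Marissa Mayer", "works_for", "Yahoo!", 2012, 2017, "illustrative"),
]


def valid_in(statements, subject, predicate, year):
    """Return the objects of matching statements that hold in the given year."""
    return [
        s.obj
        for s in statements
        if s.subject == subject
        and s.predicate == predicate
        and s.valid_from <= year
        and (s.valid_to is None or year < s.valid_to)
    ]


print(valid_in(FACTS, "Marissa Mayer", "works_for", 2005))  # ['Google']
print(valid_in(FACTS, "Marissa Mayer", "works_for", 2015))  # ['Yahoo!']
```

With the time dimension stored alongside the fact, “where does she work?” becomes a question about a specific point in time, and both employers can coexist in the data without contradiction.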

Now we are writing rules for automatic harvesting of structured data and we are collecting the available data. The scraping and understanding of unstructured data is the next level.

This is the purpose of Ingen.io. It will be based on UniGraph and both products will continuously complement and enrich each other.

How did you start?

It all started with “Paris”. Working on search and retrieval optimizations, we felt that we were losing the battle with false positives, and not only there. We wanted to improve document correlation, faceting, everything. The thing is that training data is scarce and predominantly in English. The solution, then, had to be independent of training data and valid across languages. It sounded impossible at the time. Atanas was resilient and started work on a prototype after hours and during weekends. One day he called me and said:

We have it! We can identify Paris as the person and not the city in:

“Paris is the daughter of Richard Hilton.”

Nobody else can, it is too short and ambiguous.

Once we had something tangible, a proof that our crazy idea was even possible, we applied for funding. We prepared well and, in very heated competition with companies from all around Europe, we won the support of Wayra, Telefonica’s start-up accelerator programme. We moved to Prague and travelled to events, meetings and competitions around Europe to validate the need, get feedback, and meet customers and investors.

First Prague, now Bratislava, where next?

We’re building a global business and the CEE region is a great place for a base. We often travel to meet clients in Vienna and Berlin, and can reach London in two hours. Slovakia in particular is a good home for startups. The government is supportive and the businesses are approachable and open to innovation. The entire country is, in fact, a startup! After all, Slovakia and Bratislava are among the youngest countries and capitals in Europe. We were impressed by the ambition and expertise of the people here and by the tight integration between education and business. Just one example: Neulogy, our investors, have their office on the ground floor of the Faculty of Informatics. The ecosystem is vibrant and there are many global companies that started in Slovakia and still have operations here.

What’s next?

We’re scaling the Ingen.io infrastructure and preparing it for activation by several big multinationals. In parallel, we’re building the UniGraph website and the functionality to allow anybody to join the community, browse the data and the schema, suggest improvements and define rules for data upload and reconciliation. These milestones are the launchpads for Ingen.io’s sustainability and growth, which will be fuelled by the next round of financing by the end of the year.