CASPAR 02 – Research contribution to semantic technologies and machine learning

On Dec 16, 2016, A Data Pro officially signed a funding contract for its latest R&D project, CASPAR 02 (Combined Automated Semantic Processing Array). And no, we promise not to contribute to the robot apocalypse. Skynet is still not science, just fiction. What we hope to contribute to is the growing body of research that deals with semantic technologies and machine learning.

The project is funded through cut-off 3 of the Eurostars 2.0 initiative, an EU-run research and development grant program. We ranked in the top 200 projects, selected against competition from the entire European Union and beyond. The competition involved 870 participants, 70% of which were SMEs, applying for over EUR 344 mln. An awarded Eurostars label means that CASPAR fulfills high innovation and business sustainability requirements, having passed rigorous examination by EU officials and domain experts. In cut-off 3, 94 projects passed the contracting phase. A Data Pro scored double, having been awarded the only two successful projects from Bulgaria in this round.

We’ll be partnering with Cyprus-based GeoImaging. A Data Pro will cover all text-related processing tasks, and GeoImaging will be responsible for handling image data tasks.

Three teams will be working on the project: 3 IT specialists, 8 media analysts and administrative personnel. We also managed to attract Dr. Tatyana Atanasova, an associate professor at the Bulgarian Academy of Sciences (BAS) with extensive experience in neural networks and data modeling. She will support the IT team in planning, developing, selecting and training neural networks efficiently, and will consult on the project overall, which includes keeping an eye out for mistakes and critical problems that might arise.

The project is divided into two main phases, each lasting a year, with different activities scheduled for each. At the very beginning, the focus will be on research: how to organize the teams, how to ensure everyone has access to all the information they need, and which technologies will be put under the microscope. The IT team will then design a software selection methodology, while the team of media analysts will begin a continuous process of content annotation that will later be used for machine-training purposes. Then comes a phase we call “calculation time”, in which several instances of neural networks are trained in several different domains. (A domain is a set of pre-determined characteristics used in the content’s taxonomy; in our case, a domain consists of language, topic and type of media.)
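To make the idea of a domain a little more concrete, here is a minimal Python sketch of how annotated content could be bucketed per domain before training. It is only an illustration under our own assumptions: the class names, field names and sample items are hypothetical and do not describe the actual CASPAR data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """A domain as described above: language, topic and type of media."""
    language: str
    topic: str
    media_type: str


@dataclass
class AnnotatedItem:
    """One piece of content labeled by a media analyst (hypothetical shape)."""
    text: str
    label: str
    domain: Domain


def group_by_domain(items):
    """Bucket annotated items so that each domain gets its own training set."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item.domain].append(item)
    return dict(buckets)


# Made-up sample data, for illustration only.
en_finance_news = Domain("en", "finance", "online news")
bg_energy_print = Domain("bg", "energy", "print")

items = [
    AnnotatedItem("Central bank raises rates", "monetary policy", en_finance_news),
    AnnotatedItem("Quarterly earnings beat forecasts", "earnings", en_finance_news),
    AnnotatedItem("New solar park announced", "renewables", bg_energy_print),
]

training_sets = group_by_domain(items)
for domain, examples in training_sets.items():
    print(domain, len(examples))
```

Each bucket would then feed one neural network instance, which is one way to read "several instances trained in several different domains".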

In the second year of the project we will compare the behavior and performance of the neural networks against pre-existing semantic solutions (the symbolic approach). The same datasets and domains will be tested on other automation software (e.g. ontologies, statistical methods, etc.). There will be a few cycles of testing and re-training until the metrics reach satisfactorily high values.
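As a rough illustration of what such a comparison could look like, the sketch below scores two systems against the same analyst-annotated labels. The choice of precision, recall and F1 is our assumption here (the project's actual metrics are not fixed in this post), and the labels and predictions are invented for the example.

```python
def precision_recall_f1(gold, predicted, positive):
    """Score one system's predictions for a single label of interest."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented data: topic labels from analysts and from two hypothetical systems.
gold = ["finance", "sports", "finance", "finance", "sports", "finance"]
neural = ["finance", "sports", "finance", "sports", "sports", "finance"]
symbolic = ["finance", "finance", "finance", "finance", "sports", "sports"]

for name, predicted in [("neural", neural), ("symbolic", symbolic)]:
    p, r, f1 = precision_recall_f1(gold, predicted, positive="finance")
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because both systems are scored on the same dataset and domain, the numbers are directly comparable, and the same function can be re-run after each re-training cycle.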

The result of our combined efforts will (hopefully) be published in a scientific paper comparing the performance of the different systems, separately and in combination, on specific content-related tasks: analysis, systematization, annotation, data enrichment, etc. We also aim to discover which systems (neural networks or semantic-based) are faster and more cost-efficient, and whether their quality and speed improve when they are combined.

In doing so, we will create a platform that allows us to build domain-relevant, pre-trained “artificial brains”, which we aim to monetize afterwards. We would like to understand how to create a system that can process different bodies of information on demand, and to offer this product to the end user.