Components Supporting the Open Data Exploitation

UnifiedViews: An ETL Framework for Sustainable RDF Data Processing

The advent of Linked Data [1] is accelerating the evolution of the Web into an exponentially growing information space, where the unprecedented volume of RDF data offers consumers a level of information integration that has not been possible until now. Consider a Linked Data/RDF consumer whose data processing task is to build a data mart integrating information from various RDF and non-RDF sources. Many tools for RDF data processing have emerged in the last few years, so the consumer may use these tools to carry out the task.

Unfortunately, the consumer cannot focus solely on configuring these tools properly. The consumer also has to, for example, write a script that executes the tools in the required order, forward the logs they produce to a single location, and decide where the tools' configurations are stored. Furthermore, the consumer has no support for debugging the intermediate RDF data that the tools create while the task is executed. Nor can the consumer reuse configurations created by other consumers; moreover, as the number of configurations grows, maintaining them can easily become a nightmare.

To address the problem of sustainable RDF data processing that a typical Linked Data/RDF consumer faces, we propose UnifiedViews, an Extract-Transform-Load (ETL) framework built around two central concepts: the data processing task, and native support for the RDF data format and ontologies. A data processing task (or simply task) consists of one or more data processing units. A data processing unit (DPU) encapsulates certain business logic needed when processing data (e.g., one DPU may extract data from an RDF database or apply a SPARQL query [2,3]). Every DPU has its inputs, outputs, business logic, and configuration.
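The relationship between a task, its DPUs, and each DPU's inputs, outputs, business logic, and configuration can be illustrated with a minimal sketch. This is not the UnifiedViews API: the class `DPU`, the functions `extract`, `keep_predicate`, and `run_task`, the `ex:` identifiers, and the representation of RDF data as plain tuples are all assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Simplified stand-in for an RDF triple: (subject, predicate, object).
Triple = Tuple[str, str, str]


@dataclass
class DPU:
    """Hypothetical data processing unit: a name, business logic, and configuration."""
    name: str
    logic: Callable[[List[Triple], Dict[str, str]], List[Triple]]
    config: Dict[str, str] = field(default_factory=dict)

    def run(self, inputs: List[Triple]) -> List[Triple]:
        # The business logic maps the inputs to outputs, guided by the configuration.
        return self.logic(inputs, self.config)


def extract(_inputs: List[Triple], _config: Dict[str, str]) -> List[Triple]:
    # Pretend extractor: returns fixed triples instead of reading a real source.
    return [
        ("ex:alice", "ex:name", "Alice"),
        ("ex:alice", "ex:age", "30"),
    ]


def keep_predicate(inputs: List[Triple], config: Dict[str, str]) -> List[Triple]:
    # Pretend transformer: keeps only triples with the configured predicate.
    return [t for t in inputs if t[1] == config["predicate"]]


def run_task(dpus: List[DPU]) -> List[Triple]:
    # A task here is a simple chain: each DPU consumes the previous DPU's output.
    data: List[Triple] = []
    for dpu in dpus:
        data = dpu.run(data)
    return data


task = [
    DPU("extractor", extract),
    DPU("filter", keep_predicate, {"predicate": "ex:name"}),
]
result = run_task(task)
```

In this sketch the same `keep_predicate` logic could be reused in another task with a different configuration, which is the kind of reuse the DPU concept is meant to enable.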

Because UnifiedViews is a framework, consumers may create custom DPUs; any tool used by the RDF/Linked Data community can be easily wrapped as a DPU. UnifiedViews allows consumers to define and adjust data processing tasks using a graphical user interface (an excerpt is depicted in Figure 1).

[Figure 1: An excerpt of the UnifiedViews graphical user interface for defining data processing tasks.]

UnifiedViews takes care of task scheduling. A consumer may configure UnifiedViews to send notifications about errors in task executions; the consumer may also receive daily summaries of the tasks being executed. UnifiedViews ensures that DPUs are executed in the proper order, so that every DPU has its required inputs when it is launched. UnifiedViews also provides consumers with debugging capabilities: a consumer may browse and query (using the SPARQL query language) the RDF inputs to and RDF outputs from any DPU.
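One common way to guarantee that every unit runs only after the units it depends on is a topological sort of the dependency graph. The sketch below illustrates that general technique with Python's standard `graphlib` module (Python 3.9+); it does not describe how UnifiedViews implements ordering internally, and the DPU names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task: each DPU name maps to the set of DPUs whose outputs it consumes.
dependencies = {
    "extract_rdf": set(),
    "extract_csv": set(),
    "merge": {"extract_rdf", "extract_csv"},
    "load": {"merge"},
}


def execution_order(deps):
    """Return an order in which every DPU comes after all DPUs it depends on."""
    # static_order() raises graphlib.CycleError if the dependencies contain a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
for name in order:
    print("run", name)
```

Both extractors appear before `merge`, and `load` comes last; the relative order of the two extractors is not significant because neither depends on the other.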

UnifiedViews allows consumers to share DPUs, DPU configurations, and tasks as needed. The code of UnifiedViews is available at https://github.com/UnifiedViews/Core under a combination of GPLv3 and LGPLv3 licenses. The documentation for the framework, including a guide for creating new DPUs, is available at https://grips.semantic-web.at/display/UDDOC/UnifiedViews+User+Documentation.

UnifiedViews is used in the COMSODE project as a core component of Open Data Node, where it handles the extraction, transformation, and publishing of (Linked) Open Data. As part of the COMSODE project, we are also preparing the new DPUs needed to process and publish 150 datasets as (Linked) Open Data. We will describe the concept of DPUs and introduce examples of new DPUs in one of the next blog posts.

References:

  1. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
  2. S. Harris, A. Seaborne, and E. Prud’hommeaux. SPARQL 1.1 Query Language. W3C Recommendation, 2013. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/, Retrieved 20/03/2014.
  3. P. Gearon, A. Passant, and A. Polleres. SPARQL 1.1 Update. Technical report, W3C, 2013. Published online on March 21st, 2013 at http://www.w3.org/TR/2013/REC-sparql11-update-20130321/, Retrieved 20/03/2014.

—————————————————————————————————————————————–

Article written by T. Knap

Tomas Knap received his Ph.D. from the Faculty of Mathematics and Physics, Charles University, Czech Republic, for his research on trustworthy Linked Data integration and consumption. In 2013, he co-founded Semantica.cz s.r.o., an SME focused on consulting in Linked Data and semantic web solutions for data integration and publishing.