Components Supporting the Open Data Exploitation

Open Data Node 1.0 released

On April 29th 2015, Open Data Node 1.0 was released. In this post I describe what this release actually does, compared to what it is supposed to do (as described almost a year ago in my initial blog post: Open Data Node – what it is, what it does, what is next).

 

Release 1.0 is our first stable release (accompanied by the launch of the new project website: opendatanode.org). This means that all the basic pieces are in place:

  • all SW modules put together, Open Source, with simple installation (UnifiedViews, CKAN, PostgreSQL, Virtuoso, midPoint and more)
  • basic use-cases implemented
  • no disruptive changes expected during the rest of the project

Open Data Node (ODN for short) is a software tool. And not just any software: ODN is Open Source software. ODN is intended for publishing Open Data in an automated, repeatable and easy-to-use fashion. ODN is the main output of the COMSODE project (see project description), and it is accompanied by a generic Methodology for Open Data publishing.

So, what can Open Data Node 1.0 do for you? This blog post gives you a quick overview. For more details, after reading this post, you can continue with the ODN website at http://opendatanode.org/, the latest release notes and the user stories.

As mentioned in my initial blog post, Open Data Node (ODN) is not a silver bullet, one solution to rule them all. One extreme is a situation where a simple shell script is sufficient to publish certain Open Data. If this is your case, ODN is most probably not for you. The other extreme is a case where a brand new information system (or a major upgrade of an existing one) is being built and Open Data publication is factored in from the beginning. Again, if this is your case, ODN is most probably not for you. But for cases where a simple shell script is “just not enough” and a “complete new information system” is not feasible either, ODN can help. How?

  • It provides powerful ETL capabilities, both for Linked Data and for tabular/relational data, allowing publishers to convert, clean, enrich and link data before publishing it as Open Data
  • To help data users actually understand and use the data, it also provides data publication and presentation functions
  • To help data publishers with the whole Open Data publication process (as described, for example, in the COMSODE Methodology; see here), it also provides cataloguing functionality
  • Data publishers also benefit from the integration capabilities with internal systems, the modular design and the Open Source nature of ODN

I explain each of these in the following sections.

Simple installation

On Debian systems, after you set up the COMSODE repository, you can simply run:

aptitude install odn-simple

and you have ODN installed and running.

Just follow the instructions in the release notes.

ETL capabilities

In short: ODN basically pumps data from the inside to the outside. To do that properly in the context of Open Data:

  • ODN supports both Linked Data (as a new paradigm for data publication and usage) and tabular/relational data (the currently prevailing technology)
  • It can create repeatable publication jobs: jobs which convert formats, clean and enrich the data, and even link the data to other data
  • Publishers can schedule such jobs to automate the publication of updates, keeping datasets up to date without repeated manual labour
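
To make the idea of a repeatable publication job more concrete, here is a minimal sketch in plain Python. It is not the UnifiedViews API: the function names (`clean`, `enrich`, `run_job`), the sample data and the added `country` column are all made up for illustration. It only shows the principle that a job is a fixed sequence of steps which gives the same output every time it runs on the same input.

```python
import csv
import io

def clean(row):
    # Strip stray whitespace from every column name and value.
    return {key.strip(): value.strip() for key, value in row.items()}

def enrich(row):
    # Add a derived column; the country code is a made-up example value.
    return {**row, "country": "SK"}

def run_job(source_csv, steps):
    """Apply each step to every row and return the result as CSV text."""
    rows = list(csv.DictReader(io.StringIO(source_csv)))
    for step in steps:
        rows = [step(row) for row in rows]
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical export from an internal system, with messy spacing.
SOURCE = "name , city\n Alice ,Bratislava \nBob, Kosice\n"
published = run_job(SOURCE, [clean, enrich])
print(published)
```

Because the job is just data plus an ordered list of steps, a scheduler can re-run it unchanged whenever the source is updated.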

[Figure: scheme of the basic Open Data Node use-case]

A very important aspect is caching of the data: Open Data intended for publication is stored inside ODN. Thanks to that, internal systems are insulated from possible overload or attacks via Open Data publishing. Even if ODN goes down, which should be rare, internal systems remain operational and the organization publishing the data can still function.

Note: ETL capabilities are supplied by the powerful UnifiedViews tool; see also our other blog post: UnifiedViews 2.0 – Whats new.

Publication and presentation functions

ODN supports both basic methods of publishing data (as described in the post Understanding Data Accessibility):

  • batch access: ODN can prepare file dumps in various formats (CSV, XML, RDF, etc.), possibly compress them (ZIP) and make them available (via HTTP).
  • access through API: ODN can publish tabular/relational data via REST API and Linked Data via SPARQL endpoint.
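
The two access methods above can be sketched roughly as follows, using only the Python standard library. Several things here are assumptions and not stated in this post: the host `odn.example.org`, the `/sparql` path and the resource id are placeholders, and the sketch assumes that tabular data is exposed through CKAN's `datastore_search` action. No network request is made; the code only builds a ZIP dump in memory and constructs the request URLs.

```python
import io
import zipfile
from urllib.parse import urlencode

def make_zip_dump(filename, csv_text):
    """Return the bytes of a ZIP archive holding one CSV dump (batch access)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, csv_text)
    return buf.getvalue()

def datastore_search_url(base_url, resource_id, limit=5):
    """Build a URL for CKAN's datastore_search action (tabular data via REST)."""
    query = urlencode({"resource_id": resource_id, "limit": limit})
    return f"{base_url}/api/3/action/datastore_search?{query}"

def sparql_query_url(endpoint, query):
    """Build a GET URL that passes a query to a SPARQL endpoint (Linked Data)."""
    return f"{endpoint}?{urlencode({'query': query})}"

BASE = "https://odn.example.org"  # hypothetical ODN instance
dump = make_zip_dump("dataset.csv", "name,city\nAlice,Bratislava\n")
api_url = datastore_search_url(BASE, "hypothetical-resource-id")
sparql_url = sparql_query_url(BASE + "/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(api_url)
print(sparql_url)
```

A data user would fetch `dump` as a file over HTTP, or send an HTTP GET to one of the two URLs and parse the response.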

And thanks to incorporating CKAN and other tools, ODN also gives data users functions to preview, analyse and visualize data. As of now, that works mainly for tabular data. Later on, we will extend it to Linked Data as well, with the visualisation tool Payola.

Cataloguing functionality

Previously we wrote that “it is not a data catalogue”. But later, while preparing the COMSODE Methodology and listening to feedback from the COMSODE User Group (see for example Second User Board meeting: Recommendations for technology and methodology), we decided to include a catalogue in ODN.

Inclusion of cataloguing functionality is motivated by the need to make it easier for data publishers to follow the COMSODE Methodology: in the phase “Development of open data publication plan (P01)”, publishers are (among other things) mapping their internal data sources (steps like “Analysis of data sources” and “Identification of datasets for opening up”). So they already need a place to put information about those internal data sources. An internal data catalogue is thus a very useful function for ODN to provide.

This functionality is provided by including customized CKAN catalogue in ODN in two roles:

  1. CKAN in the role of “internal catalogue” is the main entry point into ODN for data publishers. This catalogue is private, visible only to the data publisher and its authorized personnel.
    From this catalogue, publishers manage many aspects of their Open Data publication. Once a dataset is properly prepared for publication, it can be marked as “public” and ODN will automatically make such public data visible to the general public in …
  2. … CKAN in the role of “public catalogue”. This public catalogue is the main entry point for the general public (a.k.a. data users). In this catalogue, they see only datasets explicitly marked as “public”, and they use it to search for datasets, learn about them, look at the data and obtain it.
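
The relationship between the two roles can be pictured with a tiny model. This is only an illustration of the rule described above, not ODN or CKAN code: the `Dataset` class, the `public` flag and the dataset names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    public: bool = False  # datasets start out visible only internally

# Hypothetical content of the internal catalogue.
internal_catalogue = [
    Dataset("budget-2015", public=True),
    Dataset("draft-contracts"),
]

def public_view(catalogue):
    """Return only the datasets explicitly marked as public."""
    return [dataset for dataset in catalogue if dataset.public]

print([dataset.name for dataset in public_view(internal_catalogue)])
```

The point of the design is that publishers work in one place, and the public catalogue is derived from it instead of being maintained by hand.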

Compared to what we wrote originally, it is still true that ODN is supposed to complement data catalogues. Imagine Organization ABC having an ODN instance, and that instance also providing a catalogue of all datasets published by this organization. It is nice and useful in itself, but it is neither suitable nor desirable as a replacement for, say, a state-wide or EU-wide data catalogue. So, such an ODN instance will instead “fit” into the hierarchy of data catalogues and provide dataset metadata on behalf of Organization ABC to anyone (including the nation-wide and EU-wide data catalogues), again in an automated fashion, saving Organization ABC valuable time.

Integration functions, modular design, Open Source implementation

For the basic use-cases, the main focus is on the ability to integrate with various kinds of data sources: various formats (XLS, XML, CSV, etc.) and technologies (SQL, JDBC, SPARQL, etc.), accessed via the file system or remotely (HTTP, etc.), are supported “out of the box”.
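
As a rough illustration of what pulling data from an SQL source and turning it into an open format involves, here is a sketch that uses Python's built-in `sqlite3` module. SQLite is only a stand-in so that the example runs anywhere; it is not one of the databases this post lists as supported, and the table and function names are made up.

```python
import csv
import io
import sqlite3

def table_to_csv(conn, query):
    """Run a query and return the result set as CSV text with a header row."""
    cursor = conn.execute(query)  # sqlite3 shortcut for cursor().execute()
    header = [column[0] for column in cursor.description]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return out.getvalue()

# In-memory database standing in for an internal system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offices (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO offices VALUES (?, ?)", [(1, "Bratislava"), (2, "Kosice")])
csv_text = table_to_csv(conn, "SELECT id, city FROM offices ORDER BY id")
print(csv_text)
```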

In broader terms, thanks to the Open Source implementation of ODN and its use of open standards, ODN can be enhanced (by almost anyone) with additional modules, incorporated into bigger information systems, integrated with the existing infrastructure used by data publishers, and so on. It can even be modified.

Take for example ODN’s Single-Sign-On (SSO): thanks to midPoint, CAS and LDAP, it can be integrated with the user management, authentication and authorization systems that organizations may already be using.

Note: Some concrete formats and APIs are not there yet, because for those we need more feedback from people trying ODN in their real environment. For example, for SQL we currently support only PostgreSQL, MySQL, MS SQL and Oracle; support for other databases might be added, pending feedback from users. See section “Future” below.

Stable release

1.0 is our first stable release. This means that we are going to provide further upgrades in a way that does not disrupt your operations, i.e. backward compatible or (where that is not feasible) with an easy migration to the new release.

Future

While ODN 1.0 does provide a lot of basic functionality (and it can already help applications to be built, see Building an application on Open Data with Spinque), there is still some work to do to make ODN better. With this release we’re starting many pilots in various EU countries. Using feedback from those pilots, we will further refine ODN. Here, I also kindly ask you to give ODN a try and send us additional feedback.

Among many smaller things (more file formats such as JSON, more publication protocols such as FTP or BitTorrent, etc.), we have one bigger feature still in development: wizards. While you can do quite powerful ETL stuff in ODN, it is not truly easy to use for everybody. Thus, for common cases we will implement “wizards” which will allow even novice users to publish usable, high-quality Open Data with a minimum of skills. What are those “common cases”? Here’s a slide from one of our presentations:

[Figure: ODN - most common use-cases]

Using the rating scheme from http://5stardata.info/ , the most common case is taking 2* data from internal systems and transforming it to at least 3*. In practice, this means ODN being able to harvest data from internal systems in formats like CSV (and its many variants), XLS(X), XML (possibly with an XSD schema) and various kinds of SQL databases, and being able to publish that data in common Open Data formats (i.e. CSV, JSON or RDF, via API or file dumps).
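
One of the CSV variants mentioned above is the semicolon-delimited file that many spreadsheet tools export. The following sketch, using only the Python standard library, shows what converting such a variant to JSON can look like. The function name and the sample data are made up, and this is not how ODN implements the conversion.

```python
import csv
import io
import json

def csv_variant_to_json(text, delimiter=";"):
    """Parse delimited text with a header row and return it as a JSON array."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    return json.dumps(rows, ensure_ascii=False, indent=2)

# Hypothetical semicolon-delimited export from an internal system.
source = "id;city\n1;Košice\n2;Bratislava\n"
as_json = csv_variant_to_json(source)
print(as_json)
```

Note that all values stay strings here; deciding which columns are numbers or dates is exactly the kind of cleaning step an ETL job would add.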

Additionally, we will add features for quality assessment of the data. Those features will help both data publishers and data users: publishers can use them to get hints about what to improve in the published data, and users will be able to better assess, for example, to what extent the data is actually usable for certain purposes.
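
To give an idea of what a quality hint could look like, here is a sketch of one very simple metric: completeness, the share of cells that are not empty. This is only an example of the kind of measure such features might report. The planned ODN features are not described in this post, so the function name and the metric itself are assumptions.

```python
def completeness(rows):
    """Return the share of non-empty cells, from 0.0 to 1.0."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 0.0
    filled = sum(1 for value in cells if value not in (None, ""))
    return filled / len(cells)

# Hypothetical dataset with one missing value out of four cells.
rows = [
    {"name": "Alice", "city": "Bratislava"},
    {"name": "Bob", "city": ""},
]
score = completeness(rows)
print(score)
```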

But to get there, to get those additional features done, we first need to validate the basic ODN functions “in the field”. We also need to verify that what we see as the “common publication use case” is truly “common”, and narrow it down to a concrete (and not too long) list of specific transformations, making sure that we implement what is truly needed and do not waste time implementing what is not.

Similarly, based on user needs, we would like to expand the list of supported platforms, adding support for other Linux distributions or other operating systems.

So again, I urge you to join our User Group and try ODN. Or at least get in touch with COMSODE and describe the scenario/problem you’re facing.


Peter Hanečák is a Senior Researcher and a team leader at the EEA Company. He is also an Open Data enthusiast.

