While there are many interesting technical topics (e.g. producing data linked to other data, automating dataset transformations, data quality assessment, data enrichment), the crucial point is to make the content of published datasets easily accessible to users.
In current practice there are two distinct methods of facilitating access to the data:
- batch access, implemented by making files with the contents of the dataset available for download
- query-based access to selected parts of the data through an application programming interface (API)
The basic difference between these methods lies in the degree of interaction between the data provider and the data user. Batch access through files is a service that is very easy to set up and maintain on the provider's side; it neither requires nor enables any active interaction while the data is being selected and accessed. The API-based approach, on the other hand, usually requires substantial data processing on the provider's side for each user request, but its strength lies in minimising the volume of data transmitted to the user and in providing the most current data. There are of course hybrid approaches; for example, the ODN software supports both of these access methods.
Batch access uses classical services for making files available online. Nowadays the most common means are HTTP, FTP or, more sophisticatedly, the BitTorrent protocol. If there is a specific reason, transfer can be permitted only on request (possibly by e-mail) or performed completely offline, for example on a physical storage medium; for extremely large datasets this can indeed be the most practical method of transfer. While designing your access procedures, however, keep in mind that one of the benefits of the Open Data approach is minimising the complexity of, and the resources needed for, communication between data producer and data user.
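As a minimal sketch of batch access over HTTP, the following Python snippet builds a conditional GET request so that a repeatedly fetched dataset file is only re-downloaded when it has changed on the server. The dataset URL is invented for this illustration:

```python
import email.utils
import urllib.request

# Hypothetical dataset URL -- substitute the real file location
# published by the data provider.
DATASET_URL = "https://data.example.org/datasets/companies.csv"

def build_conditional_request(url: str, last_fetch_unix: float) -> urllib.request.Request:
    """Build an HTTP GET that asks the server to send the file only if it
    has changed since our last download (saves bandwidth on repeated fetches)."""
    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since",
                   email.utils.formatdate(last_fetch_unix, usegmt=True))
    return req

req = build_conditional_request(DATASET_URL, 1700000000.0)
print(req.get_header("If-modified-since"))
# The actual transfer would then be:
#   with urllib.request.urlopen(req) as resp:
#       data = resp.read()
# An HTTP 304 response means the local copy is still current.
```

This keeps the provider's side completely passive (a plain web server suffices), which is exactly the low-interaction property described above.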
The most important decision concerning Open Data access through files is choosing an appropriate format for the file in which the data are stored. The key criterion is whether the format enables and simplifies automated processing of the stored data. Try to look at it from the user's point of view: how difficult is it (measured by the software or algorithm needed, its computational complexity and the accuracy of the result) to identify the data in the file? To extract the data from it? To search for specific data? Proprietary data file formats are not suitable, as they impose unnecessary costs on the data user, and sometimes tools for processing them are not readily available at all.
To summarise, most data file formats fall into the following categories:
- Formats that do not enable automated data processing, or make it very costly or inaccurate – this category covers all picture and media file formats, as well as formats that store unstructured data, such as text documents, PDF and HTML – these are not suitable for Open Data publication
- Proprietary data formats that limit their usage – for example DOC and XLS – not suitable for Open Data publication
- Open formats designed to hold structured data – most commonly CSV, JSON and XML – these are the main means of accessing Open Data
- Advanced formats specifically designed for holding and processing large or semantically varied data – RDF being the main one in use – these are the standard for Linked Open Data and other high-quality needs
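To illustrate why the open structured formats are preferable, the sketch below (Python standard library only; the field names are invented for this example) parses the same small record set from both CSV and JSON in a few lines of code, something that would require fragile scraping if the data were published as PDF or HTML:

```python
import csv
import io
import json

# The same tiny dataset in two open structured formats.
# Field names ("name", "founded") are invented for this example.
csv_text = "name,founded\nAlpha,1998\nBeta,2003\n"
json_text = '[{"name": "Alpha", "founded": 1998}, {"name": "Beta", "founded": 2003}]'

# CSV: each row becomes a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed directly into native lists and dicts.
json_rows = json.loads(json_text)

# Both parses yield the same logical records (CSV values are strings,
# so the year is normalised before comparing).
assert [r["name"] for r in csv_rows] == [r["name"] for r in json_rows]
assert [int(r["founded"]) for r in csv_rows] == [r["founded"] for r in json_rows]
print(csv_rows[0])
```

Note the one practical difference: CSV carries no type information (everything arrives as a string), whereas JSON and XML can preserve numbers, nesting and other structure.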
Access to the data through an API
Access to the data through an API is more complex to set up. First, the data provider must have the data internally available in structured form (a database) and must establish several infrastructure components:
- Storage for the published data, in terms of both capacity and the software handling it – usually a dedicated database instance. It is generally not a good idea to connect users directly to internal production data stores.
- Application logic for processing user queries. In some setups it is sufficient to provide a processor for a common query language – for access to data stored in RDF form there is a dedicated query language, SPARQL, which is a derivative of SQL, the language used to access data stored in relational databases. Alternatively, the provider can define a special set of query constructs suited to the domain model of the data, which are usually translated into classical SQL.
- Publicly available services through which the user specifies a query and retrieves the results. In this brief description we only mention the most common ones: Representational State Transfer (and its implementation as RESTful APIs), SPARQL endpoints and custom-designed Web Services.
- An online connection to retrieve data from internal production data stores, maintained with their security enforced. This can rarely be accomplished by a simple DB connection, so at this point the data provider must put some ETL tools to use.
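As a minimal sketch of the application-logic component above, the following Python snippet shows one way domain-specific query constructs might be translated into classical SQL and run against the dedicated published-data database. The table, column names and the in-memory SQLite store are all invented for this illustration:

```python
import sqlite3

# Invented published-data store: an in-memory SQLite database standing in
# for the dedicated database instance mentioned above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE companies (name TEXT, region TEXT, founded INTEGER)")
db.executemany("INSERT INTO companies VALUES (?, ?, ?)",
               [("Alpha", "west", 1998), ("Beta", "east", 2003), ("Gamma", "west", 2009)])

def translate_to_sql(filters: dict) -> tuple:
    """Translate simple domain query constructs (column -> required value)
    into a parameterised SQL query. Parameter binding (the ? placeholders)
    keeps user-supplied values out of the SQL text itself."""
    sql = "SELECT name, region, founded FROM companies"
    params = []
    if filters:
        clauses = []
        for column, value in filters.items():
            # Column names come from the provider-defined constructs,
            # not from free-form user input.
            clauses.append(f"{column} = ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = translate_to_sql({"region": "west"})
rows = db.execute(sql, params).fetchall()
print(rows)  # the two companies in the "west" region
```

The same pattern generalises to a SPARQL endpoint, where the user's query is validated and executed against an RDF store instead of being translated into SQL.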
Various software packages are available that integrate some or all of the mentioned components, simplifying the building and maintenance of the whole data access infrastructure for data providers. Again, one example of such an integrated system is ODN, developed in the COMSODE project.
A data provider can publish data using their own infrastructure, or put the data in the cloud (for example using a CDN or application services such as DaaS). It is also common for some services to be provided to data producers at the central level in a country, particularly as part of a national data catalogue or Open Data portal.
For the data user, one of the critically important tasks is the ability to update the data effectively (there are of course exceptions, particularly datasets containing historical data). Here the user must address the following questions:
- When will data more recent than that already received become available?
- How can it be determined which data is new or updated, and which was deleted?
- How can the content of the dataset be reconstructed as it was at a particular time in the past?
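A simple way to answer the second question, assuming each record carries a stable identifier, is to diff two dataset snapshots by ID. The identifiers and values below are invented for illustration:

```python
# Two snapshots of a dataset, keyed by a stable record identifier.
# Identifiers and values are invented for this example.
old_snapshot = {"c1": "Alpha", "c2": "Beta", "c3": "Gamma"}
new_snapshot = {"c1": "Alpha", "c2": "Beta Ltd.", "c4": "Delta"}

def diff_snapshots(old: dict, new: dict) -> dict:
    """Classify records as added, updated, deleted or unchanged
    between two snapshots of the same dataset."""
    return {
        "added":     sorted(new.keys() - old.keys()),
        "deleted":   sorted(old.keys() - new.keys()),
        "updated":   sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "unchanged": sorted(k for k in old.keys() & new.keys() if old[k] == new[k]),
    }

print(diff_snapshots(old_snapshot, new_snapshot))
# {'added': ['c4'], 'deleted': ['c3'], 'updated': ['c2'], 'unchanged': ['c1']}
```

Answering the third question (reconstructing past content) typically requires the provider to keep dated snapshots or an append-only change log; with either in place, diffs like the one above can be replayed up to any point in time.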
Whatever method of data access you choose, it is important for users to have reliable access to the data. This can be achieved by observing several basic rules:
- Invariability – keep the point of access (for example the URL of a data file) constant, and likewise the method of access, the data structure and the identification of individual objects within the dataset.
- Capacity – the data must be accessible to the user at the time they need it and with sufficient transmission capacity.
- Rules – the user must understand what they may do with the data (see the text about licences) and must know the accuracy of the data in terms of error rate and liability (for example, whether the data are legally binding).
Ľubor Illek received a degree in Informatics at Comenius University, Faculty of Mathematics, Physics and Informatics. Since 1998 he has been active in information security and standardisation activities. Since 2003 he has been a member of the Slovak Informatics Society. In 2009 he was one of the founding members of the Society for Open Information Technologies (SOIT), a non-profit civic association of people who advocate the use of open information technologies in diverse areas of society, with a main focus on Slovakia. It brings together experts from the field of information technology to promote the idea of openness in public policy making, access to data, and the use of OSS.