This blog post is the third and final part in a series of posts on creating advanced search applications with Spinque and the Open Data Node. In it we go back to the beginning of the chain and describe the transformation and publication of the datasets themselves. Using the Open Data Node we transformed, integrated and published five datasets.
In the previous post we described the search application Linked Open Images. The application provides integrated access to several Open Data collections that contain content about World War 2: (1) historical Dutch newsreels from the Netherlands Institute for Sound and Vision, and (2) photographs and (3) books from the NIOD Institute for War, Holocaust and Genocide Studies. With the application a user can search for newsreels, watch the video and explore photographs and books related to the topic of the newsreel. The backend of the application required several search functions, such as keyword search and recommendation. These functions were all created by modeling search strategies in Spinque, and required no programming whatsoever!
In the first post of this series we described how the collections used in the application were integrated. We introduced Spinque LINK, a service to interactively align controlled vocabularies. Using LINK we integrated the two controlled vocabularies that are used to describe the collections of videos, photographs and books.
In this final post of the series we describe the transformation and publication of the datasets themselves. While most of the datasets that we used are already available as open data, they were not yet integrated. The newsreels are described with terms from the audiovisual thesaurus GTAA, and the photographs and books are described with terms from the NIOD term list. In the datasets that are currently published as open data, however, the references to terms of the controlled vocabularies are not included. Therefore we cannot benefit from the links between the controlled vocabularies, and we also lose valuable information, such as name variants.
Using Comsode’s Open Data Node (ODN) we transformed and published the datasets and the controlled vocabularies in an integrated fashion as RDF. The result is available in the Open Data Node hosted by Spinque. It contains five datasets: the three collections and the two controlled vocabularies. For each dataset we created a pipeline using UnifiedViews, the dataset transformation software included in ODN.
The collections and the GTAA thesaurus are already published through the OAI metadata harvesting protocol. Therefore the starting points of the pipelines for these datasets are the OAI endpoints. To crawl the content of these OAI endpoints, Spinque developed an OAI crawler plugin for UnifiedViews.
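To give an idea of what crawling an OAI endpoint involves, here is a minimal Python sketch of the paging loop, assuming the standard OAI-PMH ListRecords verb and resumption tokens. It is not the code of the UnifiedViews plugin. The sample pages, the identifiers and the fake_fetch function are made up for illustration; a real crawler would replace fake_fetch with an HTTP request to the endpoint.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, heavily trimmed ListRecords responses (illustration only).
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:3</identifier></header></record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI + "identifier")]
    token_el = root.find(f".//{OAI}resumptionToken")
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token or None


def harvest(fetch, metadata_prefix="oai_dc"):
    """Request pages until the endpoint stops returning a resumption token."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    identifiers = []
    while True:
        page_ids, token = parse_page(fetch(params))
        identifiers.extend(page_ids)
        if token is None:
            return identifiers
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}


def fake_fetch(params):
    """Stand-in for an HTTP GET against a real OAI endpoint."""
    pages = {None: PAGE_1, "page-2": PAGE_2}
    return pages[params.get("resumptionToken")]


if __name__ == "__main__":
    print(harvest(fake_fetch))
```

Passing the fetch function in as an argument keeps the paging logic testable without network access, which also helps when an endpoint is slow.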
The figure above shows a screenshot of UnifiedViews with the pipeline that crawls the collection of newsreels from the OAI endpoint and converts it to RDF. The pipeline starts with the OAI crawler, the red block on top. The OAI crawler outputs XML files in which the collection objects are described with metadata using the Dublin Core schema. With a simple XSLT transformation the XML is converted to RDF. Finally, the RDF is zipped and saved.
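The pipeline does this conversion with XSLT. To illustrate the mapping itself, the following Python sketch (a stand-in, not the stylesheet we used) turns one Dublin Core record into N-Triples. The sample record and the http://example.org/record/ base URI are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC_URI = "http://purl.org/dc/elements/1.1/"
DC = "{" + DC_URI + "}"

# Hypothetical record as an OAI crawler might deliver it (illustration only).
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <header><identifier>oai:example.org:1</identifier></header>
  <metadata>
    <oai_dc:dc>
      <dc:title>Newsreel 1</dc:title>
      <dc:coverage>Amsterdam</dc:coverage>
    </oai_dc:dc>
  </metadata>
</record>"""


def escape_literal(value):
    """Escape the characters that N-Triples does not allow inside a literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(record_xml, base_uri):
    """Emit one triple per non-empty Dublin Core element of the record."""
    record = ET.fromstring(record_xml)
    identifier = record.find(f"{OAI}header/{OAI}identifier").text
    subject = f"<{base_uri}{identifier}>"
    lines = []
    for element in record.iter():
        text = (element.text or "").strip()
        if element.tag.startswith(DC) and text:
            prop = DC_URI + element.tag[len(DC):]
            lines.append(f'{subject} <{prop}> "{escape_literal(text)}" .')
    return lines


if __name__ == "__main__":
    # The base URI is a made-up namespace for the example.
    for line in record_to_ntriples(SAMPLE_RECORD, "http://example.org/record/"):
        print(line)
```

At this stage every value is still a plain literal; the links to the thesaurus are added in the next pipeline.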
A second pipeline takes the saved zip file and transforms it to the final dataset. We chose to split up the pipelines because the OAI crawling can be time-consuming, and we did not want to repeat it in the debugging process. The second pipeline is shown in the screenshot above. As input it takes the crawled collection data and the GTAA audiovisual thesaurus. Using a SPARQL update query we integrate the collection objects with the thesaurus. This is done by replacing the literal terms, e.g. Amsterdam, with the identifier of the concept in the thesaurus, e.g. http://data.beeldengeluid.nl/gtaa/31586. We replace the literal values of the dc:subject field with concepts from the subject facet of the thesaurus, and the literal values of the dc:coverage field with concepts from the geographical location facet of the thesaurus. The updated data is then made available as a downloadable RDF data dump and through a SPARQL endpoint.
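In the pipeline this replacement is a SPARQL update query. The logic can be sketched in plain Python as follows. This is an illustration under assumptions, not our query: triples are modeled as simple tuples, labels are matched exactly, the record URI and the subject-facet entry are hypothetical, and we assume the Amsterdam concept mentioned above sits in the geographical facet.

```python
from dataclasses import dataclass

DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
DC_COVERAGE = "http://purl.org/dc/elements/1.1/coverage"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Uri:
    value: str


# Each field is matched against one facet of the thesaurus only.
FACET_FOR_PROPERTY = {DC_SUBJECT: "subject", DC_COVERAGE: "geographical"}

# Label-to-concept lookup per facet. The subject entry is hypothetical.
LABELS_BY_FACET = {
    "geographical": {"Amsterdam": "http://data.beeldengeluid.nl/gtaa/31586"},
    "subject": {"bevrijding": "http://example.org/gtaa/subject-1"},
}


def link_to_thesaurus(triples, labels_by_facet):
    """Replace literal objects by concept URIs where the label is known."""
    linked = []
    for subject, prop, obj in triples:
        facet = FACET_FOR_PROPERTY.get(prop)
        if facet is not None and isinstance(obj, Literal):
            concept = labels_by_facet[facet].get(obj.value)
            if concept is not None:
                obj = Uri(concept)
        # Literals without a matching concept are kept unchanged.
        linked.append((subject, prop, obj))
    return linked


if __name__ == "__main__":
    record = "http://example.org/record/1"  # hypothetical record URI
    data = [
        (record, DC_TITLE, Literal("Amsterdam")),
        (record, DC_COVERAGE, Literal("Amsterdam")),
        (record, DC_SUBJECT, Literal("bevrijding")),
    ]
    for triple in link_to_thesaurus(data, LABELS_BY_FACET):
        print(triple)
```

Restricting each field to its own facet avoids false matches, for example a place name that also happens to occur as a subject term.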
The collections of photographs and books are produced with similar pipelines. In this case the literal values of the dc:subject field are matched against the NIOD term list rather than the GTAA.
The term list from the NIOD itself required a different pipeline. The term list was not yet available as Open Data. We created a pipeline that produces a SKOS version of the term list out of a CSV file. Using a tabular block we could map the columns in the CSV directly to SKOS properties. The CSV file contained columns for skos:prefLabel, skos:related and skos:narrower.
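As a rough picture of what such a column mapping produces, here is a small Python sketch that turns CSV rows into SKOS triples in N-Triples form. It is not the UnifiedViews tabular block. The column names, the sample terms, the http://example.org/niod/ namespace and the way concept URIs are derived from labels are all assumptions for the example, as is the idea that the related and narrower columns hold the label of another term.

```python
import csv
import io
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/niod/"  # hypothetical namespace

# Hypothetical extract of a term list (illustration only).
SAMPLE_CSV = """prefLabel,related,narrower
bezetting,verzet,
verzet,bezetting,illegale pers
illegale pers,,
"""


def concept_uri(label):
    """Derive a concept URI from a label (an assumed naming scheme)."""
    return BASE + quote(label.strip().lower().replace(" ", "-"))


def escape_literal(value):
    """Escape backslashes and quotes for use inside an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def csv_to_skos(csv_text):
    """Map each CSV row to a skos:Concept with its label and relations."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = (row.get("prefLabel") or "").strip()
        if not label:
            continue
        subject = f"<{concept_uri(label)}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f'{subject} <{SKOS}prefLabel> "{escape_literal(label)}" .')
        # The column name doubles as the local name of the SKOS property.
        for column in ("related", "narrower"):
            target = (row.get(column) or "").strip()
            if target:
                lines.append(f"{subject} <{SKOS}{column}> <{concept_uri(target)}> .")
    return lines


if __name__ == "__main__":
    print("\n".join(csv_to_skos(SAMPLE_CSV)))
```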
The final step of the integration is the linking of the two controlled vocabularies, GTAA and NIOD term list. This is done with CultuurLINK and is described in the first blog post of this series. The links are included as an RDF file in the CKAN resource for the NIOD term list.
All the RDF data together is the input for the Linked Open Images application. The RDF files are indexed with Spinque and the search functionality is modeled with strategies, as described in the previous blog post.