Monthly Archives: April 2013

Datasets in CERIF

Data, or datasets, are increasingly recognised as an essential asset of the research ecosystem. More and more, funders require data to be linked with research outputs such as publications, and the very concept of a ‘Data Publication’ is extensively discussed and variously approached. To support these data-intensive movements, the CERIF task group discussed data-related model extensions at meetings in Athens and approved the proposed extensions at a more recent task group meeting in Braga.

The Data extension proposal started by investigating CKAN, DCAT and eGMS and was guided by a first draft proposal from the Jisc-funded C4D project. The UK’s Research Data Management activities have been acknowledged as “the most advanced in Europe”, and previous CERIF model extensions (e.g. Federated Identifiers, Measurements, Indicators) were strongly influenced by multiple projects funded under Jisc’s RIM Programmes.

The central CERIF entity behind the concept of data or dataset* is named cfResultProduct (cfResProd in the XML syntax); it maintains multiple relationships with, e.g., Dataset, Equipment, Facility, Service, Funding, Measurement, Indicator, Medium, Patent, Publication, Project, Person, Organisation, and Federated Identifier. The brief introduction to CERIF shows the range of the CERIF model, its entities and their relationships. The current proposal adds a link from dataset to the Geographic Bounding Box (cfGeoBBox), proposes the addition of an alternative title and of a date as a valuable attribute of a measurement (informed by the UK’s REF reporting activities), and recommends how to deal with sensitive information, lineage/provenance and comments. The proposed CERIF model extensions will be incorporated in an upcoming CERIF XML 1.6 release. A CERIF XML Schema for that release will soon be available for testing at the euroCRIS website, and the official release is planned before the summer.

The CERIF for Datasets (C4D) project extended its demonstrator to support the export of metadata in CERIF; this is a work in progress that will continue in liaison with euroCRIS. OpenAIRE extended its data model to CERIF, allowing for linkage with datasets and funding.

The following small XML example of a CERIF Dataset, which is not fully populated, is therefore valid according to CERIF 1.5, with the proposed extensions shown in XML comments. The vocabulary for the example is left empty, as it would depend very much on a particular context and needs further thought and input from real-life examples.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-04-03" sourceDatabase="Brigitte Jörg">
<!-- An example dataset record embedding federated identifiers and some example links -->
<!-- In real-world systems, the vocabulary referenced via cfClassId should be reused where it has already been defined elsewhere -->
<!-- The currently defined CERIF 1.5 vocabulary is publicly available in CERIF XML: http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xml -->

<cfResProd>
  <cfResProdId>internal-dataset-record-id-01</cfResProdId>
  <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
  <!-- <cfAltName cfLangCode="EN" cfTrans="o">The alternative name or title of the dataset</cfAltName> -->
  <cfResProd_Class>
     <cfClassId>collection</cfClassId>
     <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
  </cfResProd_Class>
  <!-- link to a part of the dataset -->
  <cfResProd_ResProd>
     <cfResProdId2>dataset-record-id-02</cfResProdId2>
     <cfClassId>part</cfClassId>
     <cfClassSchemeId>inter-product-relations</cfClassSchemeId>
  </cfResProd_ResProd>
  <!-- link to repository as a service where the dataset function is described -->
  <cfResProd_Srv>
     <cfSrvId>dataset-id-01</cfSrvId>
     <cfClassId>function</cfClassId>
     <cfClassSchemeId>dataset-location-scheme</cfClassSchemeId>
  </cfResProd_Srv>
  <!-- link to funding information, such as programme or call -->
  <cfResProd_Fund>
     <cfFundId>funding-id-01</cfFundId>
     <cfClassId>funding</cfClassId>
     <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
     <cfStartDate>2000-01-01T00:00:00</cfStartDate>
     <cfEndDate>2012-12-31T00:00:00</cfEndDate>
  </cfResProd_Fund>
  <!-- link to provenance information, such as related measurement -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id-00</cfMeasId>
     <cfClassId>provenance</cfClassId>
     <cfClassSchemeId>dataset-provenance-scheme</cfClassSchemeId>
  </cfResProd_Meas>
  <!-- link to publication built on dataset -->
  <cfResPubl_ResProd>
     <cfResPublId>publication-id-01</cfResPublId>
     <cfClassId>built-on</cfClassId>
     <cfClassSchemeId>cerif-inter-output-relations-scheme</cfClassSchemeId>
  </cfResPubl_ResProd>
  <!-- link to person in the role of a contributor -->
  <cfPers_ResProd>
     <cfPersId>person-id-02</cfPersId>
     <cfClassId>contributor</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to person in the role of a group author -->
  <cfPers_ResProd>
     <cfPersId>person-id-03</cfPersId>
     <cfClassId>group-authors</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to organisation in the role of a funder -->
  <cfOrgUnit_ResProd>
     <cfOrgUnitId>orgunit-id-01</cfOrgUnitId>
     <cfClassId>funder</cfClassId>
     <cfClassSchemeId>cerif-organisation-output-roles</cfClassSchemeId>
  </cfOrgUnit_ResProd>
  <!-- link to a comment made about the dataset -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id</cfMeasId>
     <cfClassId>comment</cfClassId>
     <cfClassSchemeId>dataset-commenting-scheme</cfClassSchemeId>
  </cfResProd_Meas>
</cfResProd>

<!-- The vocabulary used by the dataset record is defined via the CERIF Semantic Layer -->
<!-- However, for this example, we do not provide any formal terminology and leave it empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
<!-- Another possible terminology or scheme - currently empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
</CERIF>
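
To illustrate how the link entities in such a record might be read programmatically, here is a minimal Python sketch. It assumes the CERIF 1.5 namespace used in the example above. The function name list_links, the trimmed SAMPLE record, and the rule used to recognise a link (a child of cfResProd whose element name contains an underscore and which carries both a target identifier and a cfClassId) are illustrative assumptions, not part of CERIF or of any euroCRIS tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the example record; "cf" is an arbitrary local prefix.
NS = {"cf": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# Trimmed copy of the example record (placeholder identifiers, as in the post).
SAMPLE = """<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" release="1.5">
<cfResProd>
  <cfResProdId>internal-dataset-record-id-01</cfResProdId>
  <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
  <cfResProd_Class>
     <cfClassId>collection</cfClassId>
     <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
  </cfResProd_Class>
  <cfResProd_ResProd>
     <cfResProdId2>dataset-record-id-02</cfResProdId2>
     <cfClassId>part</cfClassId>
     <cfClassSchemeId>inter-product-relations</cfClassSchemeId>
  </cfResProd_ResProd>
  <cfResProd_Fund>
     <cfFundId>funding-id-01</cfFundId>
     <cfClassId>funding</cfClassId>
     <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
     <cfStartDate>2000-01-01T00:00:00</cfStartDate>
  </cfResProd_Fund>
  <cfPers_ResProd>
     <cfPersId>person-id-02</cfPersId>
     <cfClassId>contributor</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
</cfResProd>
</CERIF>"""


def list_links(cerif_xml):
    """Return (link_element, target_id, class_id) tuples for every cfResProd link."""
    root = ET.fromstring(cerif_xml)
    links = []
    for prod in root.findall("cf:cfResProd", NS):
        for child in prod:
            name = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            class_id = child.findtext("cf:cfClassId", namespaces=NS)
            # First identifier child that is not part of the classification itself.
            target = next(
                (el.text for el in child
                 if el.tag.endswith(("Id", "Id2"))
                 and not el.tag.endswith(("cfClassId", "cfClassSchemeId"))),
                None,
            )
            # Plain attributes (no underscore) and pure classifications
            # (no target identifier) are not treated as links here.
            if "_" in name and class_id is not None and target is not None:
                links.append((name, target, class_id))
    return links


if __name__ == "__main__":
    for name, target, role in list_links(SAMPLE):
        print(name, target, role)
```

For the trimmed sample this yields three links: the part-of relation to another dataset, the funding link, and the contributor. The cfResProd_Class element is skipped because it classifies the record rather than pointing at another entity.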

* The dataset discussion in the CERIF task group started from a common understanding of a data set as defined in Wikipedia: “A data set (or dataset) is a collection of data, usually presented in tabular form.”