Category Archives: Research Data Management

Outputs from CERIF for Dataset (C4D) Workshops

The C4D project aimed to develop a framework for incorporating metadata into CERIF to improve the discovery and use of research datasets. The project recently held two workshops, in Glasgow and London. The outputs from the workshops, including slides, a collection of attendees’ feedback and a summary of the discussion, are now available at the C4D blog.

The key issues raised at the workshops were:

  1. Clarifying the specification for research data management
  2. Getting Senior Management commitment
  3. Setting up suitable systems and processes
  4. Getting staff to engage with Research Data Management
  5. Quantifying costs and getting budget
  6. Integration of services across organisational boundaries

Many useful links to further related material are available from the C4D blog.

CERIF 1.6 Formal Models released for Testing

As indicated in a previous post, the next CERIF release, CERIF 1.6, is due before the summer. It follows decisions taken in a series of CERIF task group meetings (Athens (10/2012), Braga (02/2013), Rome (03/2013)). As agreed by the CERIF task group, this CERIF 1.6 release is meant for extensive testing, to gather feedback and input for the next major release, CERIF 2.0. The formal CERIF 1.6 models are now available.

Major updates in the current CERIF 1.6 release centre on the CERIF entity cfResultProduct (cfResProd). CERIF is a formal model, supplying a formal syntax and declared semantics, that allows entities and their relationships to carry different meanings in different contexts. All entities, including cfResultProduct (cfResProd), are therefore enhanced beyond their naming (syntax) with semantic (contextual) information to become more meaningful. Such enhancements are implemented through the so-called CERIF Semantic Layer and can be seen as contextual vocabularies that set the boundaries.

The formal CERIF entity cfResultProduct, represented by its short name cfResProd, is in fact a container that aggregates all potential types of result product. The history and usage of the entity have shaped its meaning over time, namely “data” or “dataset”, and the discussions within the CERIF task group leading to the current updates started from that understanding (see also “Datasets in CERIF”). Note that in CERIF a ‘product’ is not to be confused with a ‘commercial product’; it is a result ‘product’ of research activities. Formally, a type such as “dataset” is stored within the CERIF Semantic Layer, where it maintains its own identifier, namespace, examples, descriptions, source of origin, etc. (see also CERIF in Brief).

The major updates in the current CERIF release, which support the recording and thus the understanding of datasets, have been informed by the Jisc-funded C4D (CERIF for Datasets) project, following investigations* of CKAN, DCAT and eGMS. The short list below indicates the updates; more details will be published on the euroCRIS website over time:

  • addition of Alternative Title for ‘dataset’
    cfResProd.cfAltName
  • new link entity from ‘dataset’ to geographic bounding box
    cfResProd.cfResProd_GeoBBox
  • new link entity from geographic bounding box to measurement
    cfGeoBBox.cfGeoBBox_Meas
  • new DateTime attribute with measurements
    cfMeas.cfDateTime="2012-01-01T00:00:00"
  • lineage/provenance is considered a time-stamped role ‘measurement’ of a ‘dataset’
    cfResProd.cfResProd_Meas.cfClassId="provenance-uuid"
  • a comment in general is understood as a ‘measurement’
    cfMeas.cfValJudgeText="This is a comment that allows for … etc."
  • new attribute cfOrder in links to Persons and Organisations from Results (e.g.)
    cfResPubl.cfResPubl_Pers.cfOrder="1" where
    cfResPubl.cfResPubl_Pers.cfClassId="author-uuid"
  • no changes with Localisation entities
    cfLanguage, cfCountry, cfCurrency
  • deprecation of a few attributes with future releases
  • informing about handling of dates
  • incorporating and drawing inspiration from existing vocabularies and governance structures, such as CASRAI, VIVO, ISOCAT, V4OA, SKOS, RDF, etc.
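
To make two of the attributes listed above more concrete, here is a minimal Python sketch. It is not taken from the CERIF specification: it assumes the person links have already been read into plain dictionaries, and the person identifiers and the `authors_in_order` helper are hypothetical. It sorts author links by cfOrder and parses a cfDateTime value in the format shown in the list.

```python
from datetime import datetime

# Hypothetical link records, as they might be read from person links of a
# result publication. Only the attribute names cfOrder and cfClassId come
# from the list above; the identifiers are made up for illustration.
links = [
    {"cfPersId": "person-b", "cfClassId": "author-uuid", "cfOrder": "2"},
    {"cfPersId": "person-c", "cfClassId": "editor-uuid", "cfOrder": "1"},
    {"cfPersId": "person-a", "cfClassId": "author-uuid", "cfOrder": "1"},
]


def authors_in_order(person_links, role="author-uuid"):
    """Return the person ids for one role, sorted numerically by cfOrder."""
    selected = [link for link in person_links if link["cfClassId"] == role]
    selected.sort(key=lambda link: int(link["cfOrder"]))
    return [link["cfPersId"] for link in selected]


# A cfDateTime value such as the one above follows the ISO 8601 pattern
# accepted by datetime.fromisoformat.
measured_at = datetime.fromisoformat("2012-01-01T00:00:00")

print(authors_in_order(links))
print(measured_at.isoformat())
```

Comparing cfOrder as an integer rather than as text keeps "10" after "2" when a result has many contributors.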

Some more CERIF XML examples will be posted on this blog shortly. For testing and proper validation of CERIF 1.6 XML files, the following header should be added:

<CERIF
xmlns="urn:xmlns:org:eurocris:cerif-1.6-2" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.6-2 http://www.eurocris.org/Uploads/Web%20pages/CERIF1.6/CERIF_1.6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" date="2013-07-24" sourceDatabase="LabelForYourData">
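
For those generating CERIF files programmatically, the header above could be produced along the following lines. This is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `build_cerif_root` is an assumption for illustration, while the namespace, schema location and attribute values are copied from the header.

```python
import xml.etree.ElementTree as ET

# Values copied from the CERIF 1.6 header shown above.
CERIF_NS = "urn:xmlns:org:eurocris:cerif-1.6-2"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_URL = (
    "http://www.eurocris.org/Uploads/Web%20pages/CERIF1.6/CERIF_1.6_2.xsd"
)


def build_cerif_root(date, source_database):
    """Create an empty CERIF root element carrying the validation header."""
    # Serialise the CERIF namespace as the default namespace and the
    # schema-instance namespace with the conventional xsi prefix.
    ET.register_namespace("", CERIF_NS)
    ET.register_namespace("xsi", XSI_NS)
    return ET.Element(
        f"{{{CERIF_NS}}}CERIF",
        {
            f"{{{XSI_NS}}}schemaLocation": f"{CERIF_NS} {SCHEMA_URL}",
            "date": date,
            "sourceDatabase": source_database,
        },
    )


root = build_cerif_root("2013-07-24", "LabelForYourData")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The sketch only writes the header; validating a populated file against the XSD would need a schema-aware tool, which the standard library does not provide.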

* For mapping approaches of CKAN, DCAT and eGMS to CERIF, see the appendix of the paper “A multi-level metadata approach for a Public Sector Information data infrastructure” by N. Houssos, B. Joerg, B. Matthews, in CRIS 2012 Proceedings.

Datasets in CERIF

Data or datasets are increasingly recognised as an essential asset of the research ecosystem. More and more, funders require data to be linked with research output such as publications, and the very concept of a ‘Data Publication’ is extensively discussed and variously approached. To support these data-intensive movements, the CERIF task group discussed data-related model extensions in meetings in Athens and approved the then-proposed extensions in a more recent task group meeting in Braga.

The Data extension proposal started by investigating CKAN, DCAT and eGMS and was guided by a first draft proposal of the Jisc-funded C4D project. UK’s Research Data Management activities have been acknowledged as “the most advanced in Europe” and previous CERIF model extensions (e.g. Federated Identifiers, Measurements, Indicators) were strongly influenced by multiple projects funded under Jisc’s RIM Programmes.

The central CERIF entity behind the concept of data or dataset* is named cfResultProduct (short name cfResProd); it maintains multiple relationships with, e.g., Dataset, Equipment, Facility, Service, Funding, Measurement, Indicator, Medium, Patent, Publication, Project, Person, Organisation, and Federated Identifier. The brief introduction to CERIF shows the range of the CERIF model, its entities and their relationships. The current proposal adds a link from dataset to Geographic Bounding Box (cfGeoBBox), proposes an alternative title and a date as a valuable attribute of a measurement (informed by the UK’s REF reporting activities), and gives a recommendation on how to deal with sensitive information, lineage/provenance and comments. The proposed CERIF model extensions will be incorporated in an upcoming CERIF XML 1.6 release, for which a CERIF XML Schema will soon be available for testing at the euroCRIS website; the official release is planned before the summer.

The CERIF for Datasets (C4D) project extended its demonstrator to support the export of metadata in CERIF; this is a work in progress that will be continued in liaison with euroCRIS. OpenAIRE extended its data model to CERIF, allowing for linkage with datasets and funding.

The following small, not fully populated XML example of a CERIF dataset is valid according to CERIF 1.5, with the proposed extensions included as XML comments. The vocabulary for the example is left empty, as it would depend heavily on a particular context and needs further thought and input from real-life examples.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-04-03" sourceDatabase="Brigitte Jörg">
<!-- An example dataset record embedding federated identifiers and some example links -->
<!-- In real-world systems, the vocabulary referenced via cfClassId should be reused if it has already been defined elsewhere -->
<!-- The currently defined CERIF 1.5 vocabulary is publicly available in CERIF XML: http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xml -->

<cfResProd>
  <cfResProdId>internal-dataset-record-id-01</cfResProdId>
  <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
  <!-- <cfAltName cfLangCode="EN" cfTrans="o">The alternative name or title of the dataset</cfAltName> -->
  <cfResProd_Class>
     <cfClassId>collection</cfClassId>
     <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
  </cfResProd_Class>
  <!-- link to a part of the dataset -->
  <cfResProd_ResProd>
     <cfResProdId2>dataset-record-id-02</cfResProdId2>
     <cfClassId>part</cfClassId>
     <cfClassSchemeId>inter-product-relations</cfClassSchemeId>
  </cfResProd_ResProd>
  <!-- link to repository as a service where the dataset function is described -->
  <cfResProd_Srv>
     <cfSrvId>dataset-id-01</cfSrvId>
     <cfClassId>function</cfClassId>
     <cfClassSchemeId>dataset-location-scheme</cfClassSchemeId>
  </cfResProd_Srv>
  <!-- link to funding information, such as programme or call -->
  <cfResProd_Fund>
     <cfFundId>funding-id-01</cfFundId>
     <cfClassId>funding</cfClassId>
     <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
     <cfStartDate>2000-01-01T00:00:00</cfStartDate>
     <cfEndDate>2012-12-31T00:00:00</cfEndDate>
  </cfResProd_Fund>
  <!-- link to provenance information, such as related measurement -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id-00</cfMeasId>
     <cfClassId>provenance</cfClassId>
     <cfClassSchemeId>dataset-provenance-scheme</cfClassSchemeId>
  </cfResProd_Meas>
  <!-- link to publication built on dataset -->
  <cfResPubl_ResProd>
     <cfResPublId>publication-id-01</cfResPublId>
     <cfClassId>built-on</cfClassId>
     <cfClassSchemeId>cerif-inter-output-relations-scheme</cfClassSchemeId>
  </cfResPubl_ResProd>
  <!-- link to person in the role of a contributor -->
  <cfPers_ResProd>
     <cfPersId>person-id-02</cfPersId>
     <cfClassId>contributor</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to person in the role of a group author -->
  <cfPers_ResProd>
     <cfPersId>person-id-03</cfPersId>
     <cfClassId>group-authors</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to organisation in the role of a funder -->
  <cfOrgUnit_ResProd>
     <cfOrgUnitId>orgunit-id-01</cfOrgUnitId>
     <cfClassId>funder</cfClassId>
     <cfClassSchemeId>cerif-organisation-output-roles</cfClassSchemeId>
  </cfOrgUnit_ResProd>
  <!-- link to a comment made about the dataset -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id</cfMeasId>
     <cfClassId>comment</cfClassId>
     <cfClassSchemeId>dataset-commenting-scheme</cfClassSchemeId>
  </cfResProd_Meas>
</cfResProd>

<!-- The vocabulary for the dataset record is defined via the CERIF Semantic Layer -->
<!-- However, for this example, we do not provide any formal terminology and leave it empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
<!-- Another possible terminology or scheme - currently empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
</CERIF>
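
As a rough illustration of how such a record might be consumed, the following Python sketch lists the classified elements embedded in each cfResProd, together with their cfClassId and cfClassSchemeId. It uses only the standard library; the `SAMPLE` string is a shortened excerpt of the example above, and the `list_classified_children` helper is a hypothetical name, not part of any CERIF tooling.

```python
import xml.etree.ElementTree as ET

# Namespace of the CERIF 1.5 example above, bound to an arbitrary prefix.
NS = {"cf": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# Shortened excerpt of the example record (illustration only).
SAMPLE = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" release="1.5"
       date="2013-04-03" sourceDatabase="example">
  <cfResProd>
    <cfResProdId>internal-dataset-record-id-01</cfResProdId>
    <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
    <cfResProd_Fund>
      <cfFundId>funding-id-01</cfFundId>
      <cfClassId>funding</cfClassId>
      <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
    </cfResProd_Fund>
    <cfPers_ResProd>
      <cfPersId>person-id-02</cfPersId>
      <cfClassId>contributor</cfClassId>
      <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
    </cfPers_ResProd>
  </cfResProd>
</CERIF>
"""


def list_classified_children(cerif_xml):
    """Return (record id, element name, cfClassId, cfClassSchemeId) tuples."""
    root = ET.fromstring(cerif_xml)
    rows = []
    for product in root.findall("cf:cfResProd", NS):
        product_id = product.findtext("cf:cfResProdId", namespaces=NS)
        for child in product:
            class_id = child.findtext("cf:cfClassId", namespaces=NS)
            if class_id is None:
                continue  # plain attributes such as cfName carry no class
            scheme_id = child.findtext("cf:cfClassSchemeId", namespaces=NS)
            local_name = child.tag.split("}", 1)[1]  # drop the {namespace}
            rows.append((product_id, local_name, class_id, scheme_id))
    return rows


for row in list_classified_children(SAMPLE):
    print(row)
```

In a real system, each cfClassId and cfClassSchemeId pair would then be resolved against the vocabulary held in the CERIF Semantic Layer, which is left empty in this example.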

* The dataset discussion in the CERIF task group started from a common understanding of data set as defined in Wikipedia: “A data set (or dataset) is a collection of data, usually presented in tabular form.”

Data Context – Towards Pragmatic Boundaries

On March 18, the 1st Plenary of the global Research Data Alliance (RDA) in Gothenburg was launched by Neelie Kroes, Vice-President of the European Commission. The three-day event brought together more than 200 ‘data advocates’ from around the world, representing researchers, institutions, governments and industry, to discuss possible next steps for collaboration and joint work via working groups or interest groups, in the spirit of openness, sharing and re-use at the intersection of culture and science.

The procedure for setting up RDA working groups and interest groups has been introduced, but is still considered a work in progress. At the launch event’s Agora session, group proposals were presented, and discussions continued in the so-called Birds of a Feather (BOF) sessions. Besides two formal working groups on “Persistent Identifiers” and “Data Type Registries”, mature case statements were introduced for “Data Foundations and Terminology”, “Practical Policy”, “Legal Interoperability” and “Communities and Engagement”. Other groups were still refining their objectives: “Metadata”, “Contextual Metadata”, “Repository Audit and Certification”, “Preservation e-Infrastructure”, “Marine Data Management”, “UPC Code” and “Data Citation”. New ideas also emerged: “Community Capability Model”, “Big Data Analysis Query”, “Worldwide PID”, “Data Publication”, “Architectural Data Interoperability” and “Industry and Health Informatics”.

While the different proposals indicate the wide range of topics, the challenges implied and the overlap between them, this post reports on the outcome of discussions around “Contextual Metadata”. These started from four initial, rather unspecified use-cases (see also forum discussion), of which 1 to 3 were classified as managerial and 4 was meant to speak for the researcher:

  1. Output Reporting to Funders
  2. Exchange of Information on Research Activity between Universities
  3. Management of the Research Portfolio of a University by a Research Manager
  4. Discovery and Re-use of Datasets for other Purposes

The one-hour BOF discussions approached these use-cases with the aim of understanding a context by identifying its significant underlying entities and their relationships, towards delivering formal, standardised descriptions (i.e. implementations), taking into account the material available and the expertise of people willing to engage. That is, the aim was to put forward a formal Case Statement for establishing an RDA Working Group, approached via use-cases.

Discussions revealed there is a lot of interest in the proposed group, which in the spirit of RDA will be renamed to “Data Context”. Furthermore, it was recommended that the proposed use-cases should be as specific as possible to ensure feasibility and delivery.

In the end, a new set of much more specific use-cases has been agreed. These will be posted in the RDA forum for further public consultation and refinement. They have been classified according to anticipated perspectives and in their order follow the discussed priorities:

  • Researcher: Find data and supplementary information (e.g. services, reports, tools, news, photos, …) from different catalogues to support a case study around an event (e.g. Hurricane Katrina).
  • Managerial: Indicate to funders where the overlaps and gaps in currently funded research are. The aim is to learn from data centres and scientists whether programmes overlap; this means looking a lot wider, with sub-implications, and understanding, amongst others, the semantics of geographic and temporal contextual aspects.
  • Provenance: Allow segments to be taken from streamed data and workflows (e.g. a social scientist reporting on social aspects of a climatic event). For example, an agency publishes storm reports or maps, increasingly in public spaces such as Facebook or Twitter, where people wish to know where the data comes from, who produced the image and who ran the processing job.
  • Interoperability: Exchange of contextual metadata between different systems.

Close collaboration is foreseen and has started with other working groups, especially the proposed RDA “Metadata” group, where interaction facilitators have been nominated to maintain the bridge. In addition, there was an agreement to exchange group members between the ICSU/World Data System’s group “Knowledge Network” and to explore potential collaboration opportunities with CODATA working groups. Available standardisation and harmonisation approaches such as those initiated by DCC, PROV, PREMIS, MARC, CKAN, DCAT, CERIF, CASRAI, VIVO, OAIS, APA, W3C, ISO, OMG, etc. will certainly guide development and implementation processes. A report is being prepared, slides will be uploaded and discussions will be continued in the RDA forum.

The RDA initiative has been brought into existence by an initial three research funding organisations:

  • The Australian Commonwealth Government through the Australian National Data Service supported by the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative
  • The European Commission through the iCordi project funded under the 7th Framework Program
  • The United States of America through the RDA/US activity funded by the National Science Foundation

Research Data Alliance on Twitter:  http://twitter.com/resdatall