Category Archives: Harmonisation

RIOXX Application Profile in CERIF – a first draft

The RIOXX project has developed an Application Profile, which underpins the Guidelines for Open Access Repositories. It has been developed by UKOLN, Chygrove Ltd and Jisc, working closely with RCUK to provide a mechanism for institutional repositories in use in the UK Higher Education sector to comply with the RCUK Policy on Open Access. The “first release of RIOXX focuses on applying consistency to the metadata fields used to record research funder and project/grant identifiers”.

I have been working with Paul Walk of RIOXX to consider how the RIOXX application profile might be expressed in CERIF. The steps taken in transforming RIOXX into CERIF were as follows:

  1. Awareness of the use-case or purpose behind the RIOXX application profile
  2. Identification of relevant CERIF entities and their relationships underlying the profile
  3. Identification and assignment of RIOXX vocabulary terms
  4. Identification and assignment of RIOXX constraints / rules in line with the CERIF model and constructs
  5. Forwarding to the CERIF task group and the OpenAIRE community for validation and feedback

The transformation process started from an awareness of the use-cases or purpose behind the RIOXX application profile and guidelines. These have been designed “primarily with publications in mind”, re-using the well-known “resource” entity from Dublin Core. Consequently, the underlying CERIF publication entity cfResultPublication (cfResPubl for short) is likewise treated as the central entity. Further CERIF entities underlying the RIOXX profile have been identified as indicated in the image.


RIOXX Application Profile in CERIF – employed vocabulary and corresponding proposed classification schemes

The image reflects the RIOXX ‘concepts’ in CERIF, revealing a CERIF ‘ontology’. In CERIF, objects are effectively built through identifiers; a recently published CERIF Reference Document shows CERIF ‘object’ features as “Minor Classes”, and their identifier mechanism is explained in CERIF in Brief. The selection of CERIF entities is based on the mapping of the RIOXX Application Profile 1.0 elements to CERIF 1.5 elements.

RIOXX to CERIF Mapping

RIOXX                     CERIF
“resource”                cfResPubl
dc:title                  cfResPubl.cfResPublTitle.cfTitle
rioxxterms:creator        cfResPubl.cfResPubl_Pers.cfPersId
                          +cfResPubl.cfResPubl_Pers.cfClassId=“creator”
                          cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                          +cfResPubl.cfResPubl_OrgUnit.cfClassId=“creator”
                          cfResPubl.cfResPubl_Srv.cfSrvId
                          +cfResPubl.cfResPubl_Srv.cfClassId=“creator”
                          -> cfFedId.cfFedId
dc:identifier             cfResPubl.cfResPublId
                          -> cfFedId.cfFedId
dc:source                 cfResPubl.cfISSN
dc:language               cfResPubl.cfResPubl_Lang.cfClassId=“language”
rioxxterms:projectId      cfResPubl.cfProj_ResPubl.cfProjId
                          +cfResPubl.cfProj_ResPubl.cfClassId=“projectIdentifier”
                          -> cfFedId.cfFedId
rioxxterms:funder         cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                          +cfResPubl.cfResPubl_OrgUnit.cfClassId=“funder”
                          -> cfFedId.cfFedId
dcterms:issued            cfResPubl.cfPublDate
dc:format                 cfResPubl.cfResPubl_Class.cfClassId=“e.g.jpeg”
                          +cfResPubl.cfResPubl_Class.cfClassSchemeId=“dc”
dc:publisher              cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                          +cfResPubl.cfResPubl_OrgUnit.cfClassId=“publisher”
                          -> cfFedId.cfFedId
dc:description            cfResPubl.cfResPublAbstr.cfAbstr
dc:subject                cfResPubl.cfResPubl_Class.cfClassId=“e.g.physics”
dc:rights                 cfResPubl.cfResPubl_Class.cfClassId=“e.g.cc-by”
dc:coverage               Requires further elaboration as to whether and how e.g. the spatial, temporal or jurisdictional information is covered, because time, space and jurisdiction are constructs that are modelled differently in CERIF.
dc:audience               cfResPubl.cfResPubl_Pers.cfPersId
                          +cfResPubl.cfResPubl_Pers.cfClassId=“audience”
                          -> cfFedId.cfFedId
dc:type                   cfResPubl.cfResPubl_Class.cfClassId=“e.g.journal-article”
                          Requires further elaboration. In Dublin Core it is a free-text field, whereas in CERIF types are classes with their own identifiers, ideally drawn from a controlled vocabulary.
rioxxterms:contributor    cfResPubl.cfResPubl_Pers.cfPersId
                          +cfResPubl.cfResPubl_Pers.cfClassId=“contributor”
                          cfResPubl.cfResPubl_Srv.cfSrvId
                          +cfResPubl.cfResPubl_Srv.cfClassId=“contributor”
                          cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                          +cfResPubl.cfResPubl_OrgUnit.cfClassId=“contributor”
                          -> cfFedId.cfFedId
dc:relation               cfResPubl.cfResPubl_ResPubl.cfResPublId
                          +cfResPubl.cfResPubl_ResPubl.cfClassId=“relation”
                          -> cfFedId.cfFedId
dcterms:references        cfResPubl.cfResPubl_ResProd.cfResProdId
                          +cfResPubl.cfResPubl_ResProd.cfClassId=“dataset”
                          -> cfFedId.cfFedId

The mapping demonstrates the inherent conceptual differences between RIOXX and CERIF. For example, the rioxxterms:creator element could be mapped to a CERIF person, organisation or service identifier (cfPersId, cfOrgUnitId, cfSrvId), and in addition to a relationship between the “resource” (i.e. the CERIF publication) and a person, an organisation or a service.

In CERIF, “creator”, for example, is considered a role in a relationship and not an attribute of a publication or “resource” itself. Therefore, “creator” is maintained as a classification term with its own identifier, cfClassId=“creator”. The same holds for “contributor”, “publisher” and “funder”. Furthermore, in Dublin Core a “resource” conceptually underlies all Dublin Core descriptions, but “resource” is not an explicit element itself, and a direct mapping is therefore not possible. In general, a “resource” can be e.g. a CERIF publication, cfResultPublication (cfResPubl for short), or e.g. data, cfResultProduct (cfResProd for short).
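To illustrate the point about roles, here is a minimal Python sketch of how a RIOXX role element might become CERIF-style link records, with the role carried as a classification term on the link. The `Link` class, its field names and the sample identifiers are hypothetical simplifications for illustration, not the official CERIF schema.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for a row in a CERIF link entity;
# the field names are illustrative and not the official schema.
@dataclass(frozen=True)
class Link:
    publ_id: str      # cfResPublId of the "resource"
    target_kind: str  # "Pers", "OrgUnit" or "Srv"
    target_id: str    # cfPersId, cfOrgUnitId or cfSrvId
    class_id: str     # role term, e.g. "creator"


def rioxx_to_links(publ_id, element, values):
    """Turn one RIOXX role element into link records.

    `values` is a list of (target_kind, target_id) pairs, because a creator
    may be a person, an organisation or a service.
    """
    role = element.split(":", 1)[1]  # "rioxxterms:creator" -> "creator"
    return [Link(publ_id, kind, target, role) for kind, target in values]


def ids_with_role(links, role):
    """Return the target identifiers that hold the given role."""
    return [link.target_id for link in links if link.class_id == role]


# Sample identifiers are made up for the example.
links = rioxx_to_links("publ-01", "rioxxterms:creator",
                       [("Pers", "pers-01"), ("OrgUnit", "org-01")])
links += rioxx_to_links("publ-01", "rioxxterms:funder",
                        [("OrgUnit", "org-02")])
```

The role lives on the relationship rather than on the publication, so the same organisation could appear once as “creator” and once as “funder” without any change to the publication record itself.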

Note: The underlying ambiguities on the repositories’ side are recognised, and the current version of the RIOXX guidelines reflects them by taking into account the legacy of current implementations.

The table above presents a simple RIOXX to CERIF mapping. Besides noting the exceptions to direct mappings, the work identified the RIOXX vocabulary. Formally describing this vocabulary in CERIF requires its structure to follow the CERIF Semantic Layer sub-model (see the figure in CERIF in Brief), in which the namespaces applied in RIOXX (rioxxterms, dcterms, dc) correspond to CERIF Classification Schemes to which the identified terms are assigned, as indicated in the image above. The mapping also revealed that, owing to the conceptual differences between the two formats, rules will have to be developed. For example, Language is an entity in CERIF, as is Title, so no vocabulary is needed for them. Rules may be required for aspects such as RIOXX cardinality, and a formal mapping requires model construct types that reflect the entities of the two formats:

  • RIOXXTerms: Creator; Funder; Contributor; Project Identifier
  • DCTerms: Issued; Audience; Reference
  • DC: Identifier; Language; Source; Title; Description; Format; Publisher; Rights; Subject; Coverage; Relation
  • RIOXXCardinality: OneOrMore; ExactlyOne; ZeroOrMore; ZeroOrOne
  • ModelConstructTypes: Entity; Attribute; Relationship; Term; Scheme; Element

In CERIF, rules would currently be encoded as a vocabulary. The proposal is therefore to extend the CERIF vocabulary according to the RIOXX profile’s requirements, following the formal CERIF syntax and declared semantics (Semantic Layer). The rules could look as follows. Their formal application is enabled by the CERIF link entity cfClass_Class through its two identifiers, cfClassId1 and cfClassId2, on which a rule’s state (e.g. active or inactive) could be further indicated (not yet considered below).

  • Describing Cardinality “one or more” within the proposed “RioxxCardinalityScheme”:
    cfClass.cfClassId=“OneOrMore”
    cfClass.cfClassSchemeId=“rioxxCardinality”
  • Applying the “one or more” cardinality description to the “Creator” relationship:
    cfClass_Class.cfClassId1=“rioxxTerms:creator”
    cfClass_Class.cfClassId2=“rioxxCardinality:oneOrMore”

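As a rough illustration of how such cfClass_Class rules could be applied, the following Python sketch checks a record against cardinality rules stored as (cfClassId1, cfClassId2) pairs. The rule for dc:title, the `BOUNDS` table and the record layout are assumptions made for the example; they are not taken from the RIOXX profile or the CERIF specification.

```python
# Hypothetical cfClass_Class rows as (cfClassId1, cfClassId2) pairs.
# The first pair follows the example above; the second is assumed here
# purely for illustration.
RULES = [
    ("rioxxTerms:creator", "rioxxCardinality:oneOrMore"),
    ("dc:title", "rioxxCardinality:exactlyOne"),
]

# Lower and upper bounds per cardinality term; None means unbounded.
BOUNDS = {
    "oneOrMore": (1, None),
    "exactlyOne": (1, 1),
    "zeroOrMore": (0, None),
    "zeroOrOne": (0, 1),
}


def violations(record, rules=RULES):
    """Return the terms whose number of values breaks their cardinality rule.

    `record` maps a term (e.g. "dc:title") to a list of values.
    """
    problems = []
    for term, cardinality in rules:
        low, high = BOUNDS[cardinality.split(":", 1)[1]]
        count = len(record.get(term, []))
        if count < low or (high is not None and count > high):
            problems.append(term)
    return problems


ok_record = {"rioxxTerms:creator": ["pers-01"], "dc:title": ["A title"]}
bad_record = {"dc:title": ["First title", "Second title"]}
```

Keeping the rules as plain pairs mirrors the link-entity idea: adding a rule means adding a row, not changing the checking code.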
Summarising the investigations and mappings, further thought and feedback are required. A formal extension proposal document will be prepared for continued discussion within the CERIF TG and the wider community, where the identifiers of the vocabularies and terms also need discussion. The current proposal adds a federated identifier, cfFedId.cfFedId, as a placeholder reference for persistent identifiers (e.g. ORCID for persons, FundRef for funders).

The entire formal representation of the RIOXX Application Profile Version 1.0 presented above will be made available in CERIF XML for download and further investigation. This is intended to supply an unambiguous description of the current draft and proposal, not least for comparison with related ongoing activities such as OpenAIRE, where a CRIS Interoperation Profile in CERIF XML is being developed. It will be presented at the euroCRIS Membership Meeting in Bonn on May 13th, 2013.

The RIOXX team posts updates on the RIOXX blog.


Datasets in CERIF

Data or datasets are increasingly recognised as an essential asset of the research ecosystem. More and more, funders require data to be linked with research output such as publications, and the very concept of a ‘Data Publication’ is extensively discussed and variously approached. To support these data-intensive movements, the CERIF task group discussed data-related model extensions in meetings in Athens and approved the proposed extensions in a more recent task group meeting in Braga.

Datasets in CERIF


The data extension proposal started by investigating CKAN, DCAT and eGMS and was guided by a first draft proposal of the Jisc-funded C4D project. The UK’s Research Data Management activities have been acknowledged as “the most advanced in Europe”, and previous CERIF model extensions (e.g. Federated Identifiers, Measurements, Indicators) were strongly influenced by multiple projects funded under Jisc’s RIM Programmes.

The central CERIF entity behind the concept of data or dataset* is named cfResultProduct (cfResProd for short). It maintains relationships with, among others, Dataset, Equipment, Facility, Service, Funding, Measurement, Indicator, Medium, Patent, Publication, Project, Person, Organisation and Federated Identifier. The brief introduction to CERIF shows the range of the CERIF model, its entities and their relationships. The current proposal adds a link from dataset to the Geographic Bounding Box (cfGeoBBox), proposes an alternative title, adds date as an attribute of a measurement (informed by the UK’s REF reporting activities), and recommends how to deal with sensitive information, lineage/provenance and comments. The proposed CERIF model extensions will be incorporated in an upcoming CERIF XML 1.6 release; a CERIF XML Schema will soon be available for testing on the euroCRIS website, and the official release is planned before the summer.

The CERIF for Datasets (C4D) project extended its demonstrator to support the export of metadata in CERIF; this is a work in progress that will be continued in liaison with euroCRIS. OpenAIRE extended its data model towards CERIF, allowing for linkage with datasets and funding.

The following small, not fully populated XML example of a CERIF dataset is therefore valid according to CERIF 1.5, with the proposed extensions given in XML comments. The vocabulary for the example is left empty, as it would depend heavily on a particular context and needs further thought and input from real-life examples.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-04-03" sourceDatabase="Brigitte Jörg">
<!-- An example dataset record embedding federated identifiers and some example links -->
<!-- The vocabulary via cfClassId with real world systems should be reused if it has been defined somewhere else -->
<!-- The currently defined CERIF 1.5 vocabulary is publicly available in CERIF XML: http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xml -->

<cfResProd>
  <cfResProdId>internal-dataset-record-id-01</cfResProdId>
  <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
  <!-- <cfAltName cfLangCode="EN" cfTrans="o">The alternative name or title of the dataset</cfAltName> -->
  <cfResProd_Class>
     <cfClassId>collection</cfClassId>
     <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
  </cfResProd_Class>
  <!-- link to a part of the dataset -->
  <cfResProd_ResProd>
     <cfResProdId2>dataset-record-id-02</cfResProdId2>
     <cfClassId>part</cfClassId>
     <cfClassSchemeId>inter-product-relations</cfClassSchemeId>
  </cfResProd_ResProd>
  <!-- link to repository as a service where the dataset function is described -->
  <cfResProd_Srv>
     <cfSrvId>dataset-id-01</cfSrvId>
     <cfClassId>function</cfClassId>
     <cfClassSchemeId>dataset-location-scheme</cfClassSchemeId>
  </cfResProd_Srv>
  <!-- link to funding information, such as programme or call -->
  <cfResProd_Fund>
     <cfFundId>funding-id-01</cfFundId>
     <cfClassId>funding</cfClassId>
     <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
     <cfStartDate>2000-01-01T00:00:00</cfStartDate>
     <cfEndDate>2012-12-31T00:00:00</cfEndDate>
  </cfResProd_Fund>
  <!-- link to provenance information, such as related measurement -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id-00</cfMeasId>
     <cfClassId>provenance</cfClassId>
     <cfClassSchemeId>dataset-provenance-scheme</cfClassSchemeId>
  </cfResProd_Meas>
  <!-- link to publication built on dataset -->
  <cfResPubl_ResProd>
     <cfResPublId>publication-id-01</cfResPublId>
     <cfClassId>built-on</cfClassId>
     <cfClassSchemeId>cerif-inter-output-relations-scheme</cfClassSchemeId>
  </cfResPubl_ResProd>
  <!-- link to person in the role of a contributor -->
  <cfPers_ResProd>
     <cfPersId>person-id-02</cfPersId>
     <cfClassId>contributor</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to person in the role of a group author -->
  <cfPers_ResProd>
     <cfPersId>person-id-03</cfPersId>
     <cfClassId>group-authors</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to organisation in the role of a funder -->
  <cfOrgUnit_ResProd>
     <cfOrgUnitId>orgunit-id-01</cfOrgUnitId>
     <cfClassId>funder</cfClassId>
     <cfClassSchemeId>cerif-organisation-output-roles</cfClassSchemeId>
  </cfOrgUnit_ResProd>
  <!-- link to a comment made about the dataset -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id</cfMeasId>
     <cfClassId>comment</cfClassId>
     <cfClassSchemeId>dataset-commenting-scheme</cfClassSchemeId>
  </cfResProd_Meas>
</cfResProd>

<!-- The vocabulary defining the dataset record happens via CERIF Semantic Layer -->
<!-- However, for this example, we do not provide any formal terminology and leave it empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
<!-- Another possible terminology or scheme - currently empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
</CERIF>
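The link elements of a record like the one above can be read back with a few lines of code. The following Python sketch uses the standard library's ElementTree on a trimmed-down copy of the record; the `roles` helper and the shortened `SAMPLE` document are illustrative assumptions, and only the namespace URI and element names are taken from the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared on the CERIF root element in the example.
NS = {"c": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# A shortened copy of the dataset record, kept small for illustration.
SAMPLE = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1">
  <cfResProd>
    <cfResProdId>internal-dataset-record-id-01</cfResProdId>
    <cfPers_ResProd>
      <cfPersId>person-id-02</cfPersId>
      <cfClassId>contributor</cfClassId>
      <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
    </cfPers_ResProd>
    <cfOrgUnit_ResProd>
      <cfOrgUnitId>orgunit-id-01</cfOrgUnitId>
      <cfClassId>funder</cfClassId>
      <cfClassSchemeId>cerif-organisation-output-roles</cfClassSchemeId>
    </cfOrgUnit_ResProd>
  </cfResProd>
</CERIF>
"""


def roles(xml_text):
    """List (link element name, cfClassId) pairs for every cfResProd."""
    root = ET.fromstring(xml_text)
    found = []
    for product in root.findall("c:cfResProd", NS):
        for child in product:
            class_id = child.find("c:cfClassId", NS)
            if class_id is None:
                continue  # not a link element (e.g. cfResProdId)
            tag = child.tag.split("}", 1)[1]  # strip the namespace prefix
            found.append((tag, class_id.text))
    return found
```

Calling `roles(SAMPLE)` yields one pair per link element, which makes it easy to see which roles a record declares before any vocabulary has been formally defined.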


* The dataset discussion in the CERIF task group started from a common understanding of data set as defined in Wikipedia: “A data set (or dataset) is a collection of data, usually presented in tabular form.”

Data Context – Towards Pragmatic Boundaries


On March 18, the 1st Plenary of the global Research Data Alliance (RDA) in Gothenburg was launched by Neelie Kroes, Vice-President of the European Commission. The three-day event brought together more than 200 ‘data advocates’ from around the world, representing researchers, institutions, governments and industry, to facilitate discussions on possible next steps for collaboration and joint work via working groups or interest groups, in the spirit of openness, sharing and re-use at the intersection of culture and science.

The procedure for setting up RDA working groups and interest groups has been introduced, but is still considered a work in progress. At the launch event’s Agora session, group proposals were presented, and discussions continued within the so-called Birds of a Feather (BoF) sessions. Besides two formal working groups on “Persistent Identifiers” and “Data Type Registries”, mature case statements were introduced for “Data Foundations and Terminology”, “Practical Policy”, “Legal Interoperability” and “Communities and Engagement”. Other presentations were still refining their objectives: “Metadata”, “Contextual Metadata”, “Repository Audit and Certification”, “Preservation e-Infrastructure”, “Marine Data Management”, “UPC Code” and “Data Citation”. New ideas also emerged: “Community Capability Model”, “Big Data Analysis Query”, “Worldwide PID”, “Data Publication”, “Architectural Data Interoperability” and “Industry and Health Informatics”.

While the different proposals indicate the wide range, implied challenges and overlap, this post reports on the outcome of the discussions around “Contextual Metadata”. These started from four initial, rather unspecified use-cases (see also the forum discussion), of which (1-3) were classified as managerial and (4) was meant to speak for the researcher:

  1. Output Reporting to Funders
  2. Exchange of Information on Research Activity between Universities
  3. Management of the Research Portfolio of a University by a Research Manager
  4. Discovery and Re-use of Datasets for other Purposes

The one-hour BoF discussion approached these use-cases with the aim of understanding a context by identifying its significant underlying entities and their relationships, working towards formal, standardised descriptions (i.e. implementations) while taking into account the material available and the expertise of people willing to engage. In other words, the aim was to put forward a formal Case Statement for establishing an RDA Working Group, approached via use-cases.

Discussions revealed there is a lot of interest in the proposed group, which in the spirit of RDA will be renamed to “Data Context”. Furthermore, it was recommended that the proposed use-cases should be as specific as possible to ensure feasibility and delivery.

Pragmatic Boundaries


In the end, a new set of much more specific use-cases has been agreed. These will be posted in the RDA forum for further public consultation and refinement. They have been classified according to anticipated perspectives and in their order follow the discussed priorities:

  • Researcher: Find data and supplementary information (e.g. services, reports, tools, news, photos, …) from different catalogues to support a case study around an event (e.g. Hurricane Katrina).
  • Managerial: Indicate to funders the overlaps and gaps in currently funded research. Funders want to know from data centres and scientists whether programmes overlap; this means looking a lot wider and has sub-implications, including an understanding of the semantics of geographic and temporal context.
  • Provenance: Allow segments to be taken from streamed data and workflows (e.g. a social scientist reporting on social aspects of a climatic event; or an agency publishing storm reports and maps, increasingly in public spaces such as Facebook or Twitter, where people wish to know where the data came from, who produced the image and who ran the processing job).
  • Interoperability: Exchange of contextual metadata between different systems.

Close collaboration is foreseen and has started with other working groups, especially the proposed RDA “Metadata” group, where interaction facilitators have been nominated to maintain the bridge. In addition, there was an agreement to exchange group members between the ICSU/World Data System’s group “Knowledge Network” and to explore potential collaboration opportunities with CODATA working groups. Available standardisation and harmonisation approaches such as those initiated by DCC, PROV, PREMIS, MARC, CKAN, DCAT, CERIF, CASRAI, VIVO, OAIS, APA, W3C, ISO, OMG, etc. will certainly guide development and implementation processes. A report is being prepared, slides will be uploaded and discussions will be continued in the RDA forum.

The RDA initiative was brought into existence by three initial research funding organisations:

  • The Australian Commonwealth Government through the Australian National Data Service supported by the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative
  • The European Commission through the iCordi project funded under the 7th Framework Program
  • The United States of America through the RDA/US activity funded by the National Science Foundation

Research Data Alliance on Twitter:  http://twitter.com/resdatall

CERIF UK Coordination Meeting (preliminary summary)

On February 28th 2013 a CERIF UK Coordination Meeting was held at Prospero House in London, to identify the priorities for a feasible and sustainable CERIF coordination and implementation roadmap from ongoing activities in the UK Research Information Management (RIM) space. The current wider CERIF UK landscape is depicted as follows.

 

Wider CERIF UK Landscape

 

Before the meeting, an open spreadsheet was prepared listing the ongoing activities, the organisations engaged and the outputs available (reflected in the above image). The spreadsheet was meant as a starting point; it is still open for additions and should not yet be considered final: http://bit.ly/13PWJqq

Many UK HEIs now use CERIF-based systems and work continues to ensure that funders’ systems to collect information about research outputs can accept information from universities in this standard format. Emerging national infrastructures such as RCUK’s Gateway to Research are based on CERIF and a CERIF-XML interface for the REF submission system is being prepared for addition during the first half of 2013. These developments are increasingly complemented by international activities within CASRAI and VIVO.

However, consistent implementation and standards development requires coordination. The meeting provided an overview of ongoing and past activities, the organisations engaged and the outputs and assets resulting from them to highlight the need for sustainability and to determine which of them require further development, maintenance and dissemination. Coordinated action is necessary to enable their consistent re-use and implementation. The meeting was an opportunity to identify what the current priorities and issues are and the commitments that can be made for the next steps forward.

In the end, it was clear that coordination is real work that requires human resources and organisations to support the efforts; in other words, the work also needs continued funding beyond June 2013.

A report and related material will be made available shortly.