REF Reporting Profile in CERIF XML (1.6) and Examples

In previous posts we introduced the mapping work to transform the REF XML Reporting Profile into CERIF XML (and vice versa).

After quite a journey, and some months later, we are now publishing the current CERIF XML files to share them with the community for further discussion, even though they are not as polished as initially planned. Note that the files have not yet undergone final testing or evaluation. They are, however, syntactically valid CERIF 1.6 XML and have been prepared thoroughly. To avoid further delay, and the risk that the files are never published and thus remain unusable, we provide them as they are for continued improvement and further elaboration; this matters especially with respect to semantics.

We consider the files a very valuable contribution for the guidance of future CERIF activities. They do demonstrate the complexity imposed by a multitude of applicable vocabularies and show the need for contextual clarity when defining boundaries, aggregation and governance levels.

It has to be mentioned here that the “REF Reporting Profile in CERIF” was not built according to the REF Guidelines; it is a profile aimed at transforming a REF2014 XML file (which follows the REF Guidelines) into a CERIF XML file, with an awareness of the substantial structural differences at both ends. This includes the fact that the data will finally have to be validated by the REF XML mechanism according to the guidelines (e.g. the length of a string or the cardinality of values). For this reason a decision was taken to use the REF XML element names as identifiers for the CERIF vocabulary terms whenever possible, to keep the automated transformation script as simple as possible and to ensure that the corresponding elements, and hence terms, are easy to recognise (see the XML examples below). This also supports human understanding when examining the files. People familiar with CERIF will know that CERIF entities require quite a number of identifiers (often not human-readable) to enable the interlinkage or aggregation of objects, which may indeed be a challenge for the human reader (please have a look at the comment column of the Excel Sheet).

To provide for better access to the files – again for the human reader – the bulk reporting profile has been split into separate files:

Within the reporting files, the applied vocabulary terms (cfClassId) and their corresponding namespaces (cfClassSchemeId) are indicated by identifier references, whereas the controlled vocabulary (cfClassId/cfClassSchemeId) itself is maintained in the vocabulary file.
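Because the reporting files only reference terms by identifier, it can be useful to check that every referenced pair is actually defined in the vocabulary file. The following is a minimal Python sketch of such a check, not part of the published files. It assumes that the vocabulary file nests cfClass elements inside cfClassScheme elements; the two XML strings are heavily reduced, hypothetical stand-ins for the real files, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"cf": "urn:xmlns:org:eurocris:cerif-1.6-2"}

# Hypothetical, heavily reduced stand-in for the vocabulary file.
VOCABULARY_XML = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.6-2">
  <cfClassScheme>
    <cfClassSchemeId>ref-organisation-types</cfClassSchemeId>
    <cfClass>
      <cfClassId>institution</cfClassId>
    </cfClass>
  </cfClassScheme>
</CERIF>
"""

# Hypothetical, heavily reduced stand-in for a reporting file.
REPORT_XML = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.6-2">
  <cfOrgUnit>
    <cfOrgUnitId>10006840</cfOrgUnitId>
    <cfOrgUnit_Class>
      <cfClassId>institution</cfClassId>
      <cfClassSchemeId>ref-organisation-types</cfClassSchemeId>
    </cfOrgUnit_Class>
    <cfOrgUnit_Class>
      <cfClassId>A</cfClassId>
      <cfClassSchemeId>ref-multiple-submissions</cfClassSchemeId>
    </cfOrgUnit_Class>
  </cfOrgUnit>
</CERIF>
"""


def defined_terms(vocabulary_xml):
    """Collect (cfClassId, cfClassSchemeId) pairs defined in the vocabulary."""
    root = ET.fromstring(vocabulary_xml)
    terms = set()
    for scheme in root.findall("cf:cfClassScheme", NS):
        scheme_id = scheme.findtext("cf:cfClassSchemeId", namespaces=NS)
        for cls in scheme.findall("cf:cfClass", NS):
            terms.add((cls.findtext("cf:cfClassId", namespaces=NS), scheme_id))
    return terms


def referenced_terms(report_xml):
    """Collect (cfClassId, cfClassSchemeId) pairs referenced by any element."""
    root = ET.fromstring(report_xml)
    refs = set()
    for element in root.iter():
        class_id = element.findtext("cf:cfClassId", namespaces=NS)
        scheme_id = element.findtext("cf:cfClassSchemeId", namespaces=NS)
        if class_id is not None and scheme_id is not None:
            refs.add((class_id, scheme_id))
    return refs


def unresolved_terms(report_xml, vocabulary_xml):
    """Return the referenced pairs that the vocabulary does not define."""
    return sorted(referenced_terms(report_xml) - defined_terms(vocabulary_xml))


print(unresolved_terms(REPORT_XML, VOCABULARY_XML))
```

With these reduced inputs the only unresolved pair is the multiple-submission term, simply because the stand-in vocabulary omits that scheme.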

For quick reference we also provide an Excel Sheet of the profile. Its xml2xml tab covers all the involved entities and fields and indicates the structure explained above. Its vocabularies tab collects all controlled terms (and their identifiers) except for those which are expected to be provided by the submitting institutions themselves (hence an ‘institution’ prefix in the cfClassificationSchemeId column of the Excel Sheet). Examples of relevant institutional vocabulary terms are available in the vocabulary file and should be retrievable via the cfClassSchemeId field and the prefix ‘institution’ instead of ‘ref’.

If submitted in pieces and not in one bulk file, each object has to a) identify the reporting institution by provision of the UK Provider Reference Number (UKPRN) b) indicate multiple submissions and c) refer to the REF’s Units of Assessment.

The following snippets from REF XML and REF in CERIF XML provide insight into inherent structural differences. The complexity increases (not shown in the snippets) with CERIF relationships and furthermore with multiple vocabularies and definitions for possible aggregations and objects at a given time:

A REF2014 XML Snippet


<ref2014Data xmlns="http://www.ref.ac.uk/schemas/ref2014data">
  <institution>10006840</institution>
  <submissions>
    <submission>
      <unitOfAssessment>9</unitOfAssessment>
      <multipleSubmission>A</multipleSubmission>
    </submission>
  </submissions>
</ref2014Data>

The corresponding CERIF XML Snippet

<!-- REF 2014 XML in CERIF -->
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.6-2" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.6-2 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.6/CERIF_1.6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" date="2013-09-22" sourceDatabase="REF Common Fields">
<!-- -->
<cfOrgUnit>
<!-- for the identification of a submission -->
     <cfOrgUnitId>10006840</cfOrgUnitId>  <!-- UKPRN number *mandatory* -->
     <cfOrgUnit_Class>
          <cfClassId>institution</cfClassId>
          <cfClassSchemeId>ref-organisation-types</cfClassSchemeId>
     </cfOrgUnit_Class>
     <cfOrgUnit_Class>
          <cfClassId>A</cfClassId>
          <cfClassSchemeId>ref-multiple-submissions</cfClassSchemeId>
     </cfOrgUnit_Class>
     <cfOrgUnit_OrgUnit>
          <cfOrgUnitId2>9</cfOrgUnitId2>
          <cfClassId>unitOfAssessment</cfClassId>
          <cfClassSchemeId>ref-organisation-categories</cfClassSchemeId>
     </cfOrgUnit_OrgUnit>
</cfOrgUnit>
<!-- -->
</CERIF>
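The correspondence between the two snippets can be sketched in a few lines of Python. This is only an illustration of the element-to-term mapping shown above, not the XSLT script mentioned in this post; the function and helper names are hypothetical, and the sketch handles nothing beyond the three fields in the snippets.

```python
import xml.etree.ElementTree as ET

REF_NS = "http://www.ref.ac.uk/schemas/ref2014data"
CERIF_NS = "urn:xmlns:org:eurocris:cerif-1.6-2"

REF_XML = """<ref2014Data xmlns="http://www.ref.ac.uk/schemas/ref2014data">
  <institution>10006840</institution>
  <submissions>
    <submission>
      <unitOfAssessment>9</unitOfAssessment>
      <multipleSubmission>A</multipleSubmission>
    </submission>
  </submissions>
</ref2014Data>"""


def _child(parent, name, text=None):
    """Append a CERIF-namespaced child element with optional text."""
    element = ET.SubElement(parent, f"{{{CERIF_NS}}}{name}")
    element.text = text
    return element


def _classify(parent, class_id, scheme_id):
    """Append a cfOrgUnit_Class link carrying a term and its scheme."""
    link = _child(parent, "cfOrgUnit_Class")
    _child(link, "cfClassId", class_id)
    _child(link, "cfClassSchemeId", scheme_id)


def ref_to_cerif(ref_xml):
    """Build one cfOrgUnit per REF submission, as in the snippets above."""
    ns = {"ref": REF_NS}
    ref_root = ET.fromstring(ref_xml)
    ukprn = ref_root.findtext("ref:institution", namespaces=ns)
    cerif = ET.Element(f"{{{CERIF_NS}}}CERIF")
    for submission in ref_root.findall("ref:submissions/ref:submission", ns):
        org = _child(cerif, "cfOrgUnit")
        _child(org, "cfOrgUnitId", ukprn)  # UKPRN identifies the institution
        _classify(org, "institution", "ref-organisation-types")
        multiple = submission.findtext("ref:multipleSubmission", namespaces=ns)
        if multiple:
            _classify(org, multiple, "ref-multiple-submissions")
        # The unit of assessment becomes a link to a second organisation unit;
        # the REF element name is reused as the CERIF term identifier.
        link = _child(org, "cfOrgUnit_OrgUnit")
        _child(link, "cfOrgUnitId2",
               submission.findtext("ref:unitOfAssessment", namespaces=ns))
        _child(link, "cfClassId", "unitOfAssessment")
        _child(link, "cfClassSchemeId", "ref-organisation-categories")
    return cerif


cerif_root = ref_to_cerif(REF_XML)
print(ET.tostring(cerif_root, encoding="unicode"))
```

The printed result uses generated namespace prefixes rather than a default namespace, so it is equivalent to, but not textually identical with, the CERIF snippet above.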

Many thanks again to Gareth Edwards (HEFCE) who was very supportive in explaining the meaning behind fields and structures which initially were not entirely clear.

The files are now available for further testing. An XSLT transformation script to generate REF XML from CERIF XML is available upon request (it still needs some bug fixing). Depending on feedback and responses, we shall see how to proceed with it.

Outputs from CERIF for Dataset (C4D) Workshops

The C4D project aimed at developing a framework for incorporating metadata into CERIF for improved discovery and usage of research datasets in the future. The project recently held two workshops in Glasgow and London. The outputs from the workshops, including slides, a collection of attendees’ feedback and a summary of the discussion, are now available on the C4D blog.

The key issues raised at the workshops were:

  1. Clarifying the specification for research data management
  2. Getting Senior Management commitment
  3. Setting up suitable systems and processes
  4. Getting staff to engage with Research Data Management
  5. Quantifying costs and getting budget
  6. Integration of services across organisational boundaries

Many useful links to further related material are available from the C4D blog.

CERIF 1.6 Formal Models released for Testing

As indicated in a previous post, the next CERIF release – namely CERIF 1.6 – happens before the summer. It follows decisions taken in a series of CERIF task group meetings (Athens (10/2012), Braga (02/2013), Rome (03/2013)). According to agreement by the CERIF task group, this CERIF 1.6 release is meant for extensive testing, to get feedback and input with respect to the next major release – CERIF 2.0. The formal CERIF 1.6 models are now available:

Major updates in the current CERIF 1.6 release centre around the CERIF entity cfResultProduct (cfResProd). CERIF is a formal model (supplying a formal syntax and declared semantics) to allow for different meanings of entities and their relationships in contexts. Therefore, all entities, including the cfResultProduct (cfResProd) entity, in addition to their naming (syntax) are enhanced with semantic (contextual) information to become more meaningful. Such enhancements can be implemented through the so-called CERIF Semantic Layer and can be seen as contextual vocabularies to set the boundaries.

The formal CERIF entity cfResProd, referred to here by its short name, is in fact a container that aggregates all potential types. The history, or legacy, and thus the usage of the CERIF cfResultProduct entity have shaped its meaning over time, namely “data” or “dataset”, and the discussions within the CERIF task group leading to the current updates started from such an understanding (see also “Datasets in CERIF”). It must be noted that in a CERIF understanding ‘product’ is not to be confused with a ‘commercial product’ but is rather to be seen as a result ‘product’ of research activities. Formally, a type such as “dataset” is stored within the so-called CERIF Semantic Layer, where it maintains its own identifier, namespace, examples, descriptions, source of origin, etc. (see also CERIF in Brief).

The major updates in the current CERIF release to support recording and thus an understanding of datasets have been informed by the Jisc-funded C4D (CERIF for Datasets) project, following investigations* of CKAN, DCAT and eGMS. The short list below indicates the updates, where more details will be published on the euroCRIS website over time:

  • addition of Alternative Title for ‘dataset’
    cfResProd.cfAltName
  • new link entity from ‘dataset’ to geographic bounding box
    cfResProd.cfResProd_GeoBBox
  • new link entity from geographic bounding box to measurement
    cfGeoBBox.cfGeoBBox_Meas
  • new DateTime attribute with measurements
    cfMeas.cfDateTime="2012-01-01T00:00:00"
  • lineage/provenance is considered a time-stamped role ‘measurement’ of a ‘dataset’
    cfResProd.cfResProd_Meas.cfClassId="provenance-uuid"
  • a comment in general is understood as a ‘measurement’
    cfMeas.cfValJudgeText="This is a comment that allows for … etc."
  • new attribute cfOrder in links to Persons and Organisations from Results (e.g.)
    cfResPubl.cfResPubl_Pers.cfOrder="1" where
    cfResPubl.cfResPubl_Pers.cfClassId="author-uuid"
  • no changes with Localisation entities
    cfLanguage, cfCountry, cfCurrency
  • deprecation of a few attributes with future releases
  • informing about handling of dates
  • incorporating and getting inspired by existing vocabularies and governance structures, such as CASRAI, VIVO, ISOCAT, V4OA, SKOS, RDF, etc.
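To illustrate one of the updates listed above, the new cfOrder attribute makes it possible to reconstruct, for example, the author sequence of a publication from its link records. The following Python sketch works on plain in-memory records rather than on CERIF XML, because the XML serialisation of cfOrder is not shown in this post. The class, its field names and the "editor-uuid" role are hypothetical; only cfOrder, cfClassId and "author-uuid" come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResPublPersLink:
    """One cfResPubl_Pers link record, reduced to the fields needed here."""
    cf_pers_id: str
    cf_class_id: str  # role of the person, e.g. "author-uuid"
    cf_order: int     # position within that role


def ordered_person_ids(links, role):
    """Return the person identifiers holding `role`, sorted by cfOrder."""
    selected = [link for link in links if link.cf_class_id == role]
    return [link.cf_pers_id
            for link in sorted(selected, key=lambda link: link.cf_order)]


links = [
    ResPublPersLink("person-id-03", "author-uuid", 2),
    ResPublPersLink("person-id-01", "author-uuid", 1),
    ResPublPersLink("person-id-02", "editor-uuid", 1),  # hypothetical role
]

print(ordered_person_ids(links, "author-uuid"))
# ['person-id-01', 'person-id-03']
```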

Some more CERIF XML examples will be posted here with this blog, shortly. For a testing and proper validation of CERIF 1.6 XML files, the following header should be added:

<CERIF
xmlns="urn:xmlns:org:eurocris:cerif-1.6-2" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.6-2 http://www.eurocris.org/Uploads/Web%20pages/CERIF1.6/CERIF_1.6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" date="2013-07-24" sourceDatabase="LabelForYourData">
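For files generated by a script, the same header can be produced programmatically. The following Python sketch uses only the standard library; the namespace, schema location and attribute names are copied from the header above, while the function names are hypothetical. It builds the root element only and does not validate anything against the schema.

```python
import xml.etree.ElementTree as ET

CERIF_NS = "urn:xmlns:org:eurocris:cerif-1.6-2"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Schema URL copied verbatim from the header above.
SCHEMA_URL = ("http://www.eurocris.org/Uploads/Web%20pages/"
              "CERIF1.6/CERIF_1.6_2.xsd")


def cerif_root(date, source_database):
    """Create an empty CERIF 1.6 root element carrying the header attributes."""
    root = ET.Element(f"{{{CERIF_NS}}}CERIF")
    root.set(f"{{{XSI_NS}}}schemaLocation", f"{CERIF_NS} {SCHEMA_URL}")
    root.set("date", date)
    root.set("sourceDatabase", source_database)
    return root


def serialise(root):
    """Serialise with CERIF as the default namespace and the xsi prefix."""
    # Note: register_namespace changes module-wide ElementTree state.
    ET.register_namespace("", CERIF_NS)
    ET.register_namespace("xsi", XSI_NS)
    return ET.tostring(root, encoding="unicode")


header = serialise(cerif_root("2013-07-24", "LabelForYourData"))
print(header)
```

CERIF entities would then be appended to the returned root element before serialising.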

* See for mapping approaches of CKAN, DCAT and eGMS to CERIF in the paper’s appendix “A multi-level metadata approach for a Public Sector Information data infrastructure“ by N. Houssos, B. Joerg, B. Matthews, in CRIS 2012 Proceedings.

RIOXX Application Profile in CERIF – a first draft

RIOXX have developed an Application Profile, which underpins the Guidelines for Open Access Repositories. It has been developed by UKOLN, Chygrove Ltd and Jisc, working closely with RCUK to provide a mechanism for institutional repositories in use in the UK Higher Education sector to comply with the RCUK Policy on Open Access. The “first release of RIOXX focuses on applying consistency to the metadata fields used to record research funder and project/grant identifiers”.

I have been working with Paul Walk of RIOXX to consider how the RIOXX application profile might be expressed in CERIF. The steps followed while transforming RIOXX into CERIF were as follows:

  1. Awareness of the use-case or purpose behind the RIOXX application profile
  2. Identification of relevant CERIF entities and their relationships underlying the profile
  3. Identification and assignment of RIOXX vocabulary terms
  4. Identification and assignment of RIOXX constraints / rules in line with the CERIF model and constructs
  5. Forward to the CERIF task group and the OpenAIRE community for validation and feedback

The transformation process started from an awareness of the use-cases or purpose behind the RIOXX application profile and guidelines. These have been designed “primarily with publications in mind”, re-using the well-known “resource” entity from Dublin Core. Consequently, the underlying CERIF publication entity cfResultPublication (in short, cfResPubl) is likewise considered the central entity. Further CERIF entities underlying the RIOXX profile have been identified, as indicated in the image.

[Figure: RIOXX Application Profile in CERIF – employed vocabulary and corresponding proposed classification schemes]

The image reflects the RIOXX ‘concepts’ in CERIF; revealing a CERIF ‘ontology’. In CERIF, objects are effectively built through identifiers; a recently published CERIF Reference Document shows CERIF ‘object’ features as “Minor Classes”, and their identifier mechanism is explained with CERIF in Brief. The selection of CERIF entities is based on the mapping of the RIOXX Application Profile 1.0 elements to CERIF 1.5 elements.

RIOXX to CERIF Mapping

| RIOXX | CERIF |
| --- | --- |
| “resource” | cfResPubl |
| dc:title | cfResPubl.cfResPublTitle.cfTitle |
| rioxxterms:creator | cfResPubl.cfResPubl_Pers.cfPersId |
| | +cfResPubl.cfResPubl_Pers.cfClassId="creator" |
| | cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId |
| | +cfResPubl.cfResPubl_OrgUnit.cfClassId="creator" |
| | cfResPubl.cfResPubl_Srv.cfSrvId |
| | +cfResPubl.cfResPubl_Srv.cfClassId="creator" |
| | -> cfFedId.cfFedId |
| dc:identifier | cfResPubl.cfResPublId |
| | -> cfFedId.cfFedId |
| dc:source | cfResPubl.cfISSN |
| dc:language | cfResPubl.cfResPubl_Lang.cfClassId="language" |
| rioxxterms:projectId | cfResPubl.cfProj_ResPubl.cfProjId |
| | +cfResPubl.cfProj_ResPubl.cfClassId="projectIdentifier" |
| | -> cfFedId.cfFedId |
| rioxxterms:funder | cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId |
| | +cfResPubl.cfResPubl_OrgUnit.cfClassId="funder" |
| | -> cfFedId.cfFedId |
| dcterms:issued | cfResPubl.cfPublDate |
| dc:format | cfResPubl.cfResPubl_Class.cfClassId="e.g.jpeg" |
| | +cfResPubl.cfResPubl_Class.cfClassSchemeId="dc" |
| dc:publisher | cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId |
| | +cfResPubl.cfResPubl_OrgUnit.cfClassId="publisher" |
| | -> cfFedId.cfFedId |
| dc:description | cfResPubl.cfResPublAbstr.cfAbstr |
| dc:subject | cfResPubl.cfResPubl_Class.cfClassId="e.g.physics" |
| dc:rights | cfResPubl.cfResPubl_Class.cfClassId="e.g.cc-by" |
| dc:coverage | Requires further elaboration as to whether and how e.g. the spatial, temporal or jurisdictional information is covered, because time, space and jurisdiction are constructs that are modelled differently in CERIF. |
| dc:audience | cfResPubl.cfResPubl_Pers.cfPersId |
| | +cfResPubl.cfResPubl_Pers.cfClassId="audience" |
| | -> cfFedId.cfFedId |
| dc:type | cfResPubl.cfResPubl_Class.cfClassId="e.g.journal-article" |
| | Requires further elaboration. In Dublin Core it is a free text field, whereas in CERIF types are classes with their own identifiers, ideally anticipating a controlled vocabulary. |
| rioxxterms:contributor | cfResPubl.cfResPubl_Pers.cfPersId |
| | +cfResPubl.cfResPubl_Pers.cfClassId="contributor" |
| | cfResPubl.cfResPubl_Srv.cfSrvId |
| | +cfResPubl.cfResPubl_Srv.cfClassId="contributor" |
| | cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId |
| | +cfResPubl.cfResPubl_OrgUnit.cfClassId="contributor" |
| | -> cfFedId.cfFedId |
| dc:relation | cfResPubl.cfResPubl_ResPubl.cfResPublId |
| | +cfResPubl.cfResPubl_ResPubl.cfClassId="relation" |
| | -> cfFedId.cfFedId |
| dcterms:references | cfResPubl.cfResPubl_ResProd.cfResProdId |
| | +cfResPubl.cfResPubl_ResProd.cfClassId="dataset" |
| | -> cfFedId.cfFedId |

The mapping demonstrates the inherent conceptual differences between RIOXX and CERIF. For example, the rioxxterms:creator element could be mapped to a CERIF person, organisation or service identifier (cfPersId, cfOrgUnitId, cfSrvId), and in addition to a relationship between the “resource”, i.e. the CERIF publication, and either a person, an organisation or a service.
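A few rows of the mapping can be written down as a lookup table to make the difference between direct attribute mappings and role-based link mappings explicit. The Python sketch below covers only a handful of elements from the table; the data structure, the class name and the function are hypothetical and merely one possible way to hold the mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CerifTarget:
    """A CERIF path, plus the role (cfClassId) when the path is a link."""
    path: str
    role: Optional[str] = None


# Subset of the mapping table above; elements such as dc:coverage, which
# "require further elaboration", are deliberately left out.
RIOXX_TO_CERIF = {
    "dc:title": [CerifTarget("cfResPubl.cfResPublTitle.cfTitle")],
    "dc:identifier": [CerifTarget("cfResPubl.cfResPublId")],
    "dc:source": [CerifTarget("cfResPubl.cfISSN")],
    "rioxxterms:creator": [
        CerifTarget("cfResPubl.cfResPubl_Pers.cfPersId", "creator"),
        CerifTarget("cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId", "creator"),
        CerifTarget("cfResPubl.cfResPubl_Srv.cfSrvId", "creator"),
    ],
    "rioxxterms:funder": [
        CerifTarget("cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId", "funder"),
    ],
}


def targets_for(rioxx_element):
    """Return the candidate CERIF targets; KeyError if there is no mapping."""
    return RIOXX_TO_CERIF[rioxx_element]


for target in targets_for("rioxxterms:creator"):
    print(target.path, "with role", target.role)
```

A creator yields three candidate targets because the element alone does not say whether the creator is a person, an organisation or a service; a title yields exactly one target and carries no role.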

In CERIF, “creator”, for example, is considered a role in a relationship and not an attribute of a publication or “resource” itself. Therefore, “creator” is maintained as a classification term with its own identifier cfClassId="creator". The same holds for “contributor”, “publisher” and “funder”. Furthermore, in Dublin Core a “resource” is implied to underlie all Dublin Core descriptions, but “resource” is not an explicit element itself, and a direct mapping is therefore not possible. In general, a “resource” can be either a CERIF publication, cfResultPublication (in short, cfResPubl), or data, cfResultProduct (in short, cfResProd).

Note: There is awareness about the underlying ambiguities at repositories’ sides, and the RIOXX guidelines reflect these in their current version, by taking into account and therefore dealing with the legacy of current implementations.

A simple RIOXX to CERIF mapping has been presented in the table above. Besides investigating the exceptions to direct mappings, the RIOXX vocabulary has been identified. Describing this vocabulary formally in CERIF requires its structure to follow the CERIF Semantic Layer sub-model (see the figure in CERIF in Brief), where the namespaces applied in RIOXX (rioxxterms, dcterms, dc) correspond to CERIF Classification Schemes to which the identified terms are assigned, as indicated in the image above. The mapping also revealed that, due to conceptual differences between the two formats, RIOXX and CERIF, rules will have to be developed. For example, Language is an entity in CERIF, as is Title, and therefore no vocabulary is needed for them. Rules may be required, for instance for the RIOXX cardinality, and a formal mapping requires model construct types to reflect the entities of the two formats:

  • RIOXXTerms: Creator; Funder; Contributor; Project Identifier
  • DCTerms: Issued; Audience; References
  • DC: Identifier; Language; Source; Title; Description; Format; Publisher; Rights; Subject; Coverage; Relation
  • RIOXXCardinality: OneOrMore; ExactlyOne; ZeroOrMore; ZeroOrOne
  • ModelConstructTypes: Entity; Attribute; Relationship; Term; Scheme; Element

In CERIF, rules would currently be encoded as a vocabulary. The proposal is therefore to extend the CERIF vocabulary following the RIOXX Profile’s requirements, anticipating the formal CERIF syntax and declared semantics (Semantic Layer). These rules could look as follows; their formal application is enabled by the CERIF link entity cfClass_Class through its two inherent identifiers, cfClassId1 and cfClassId2, upon which a rule’s state (e.g. active; inactive) could be further indicated (not yet considered below).

  • Describing Cardinality “one or more” within the proposed “RioxxCardinalityScheme”:
    cfClass.cfClassId="OneOrMore"
    cfClass.cfClassSchemeId="rioxxCardinality"
  • Applying the “one or more” cardinality description to the “Creator” relationship:
    cfClass_Class.cfClassId1="rioxxTerms:creator"
    cfClass_Class.cfClassId2="rioxxCardinality:oneOrMore"
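How such cfClass_Class pairs might be evaluated can be sketched in Python as follows. This is a hypothetical reading of the proposal, not an existing CERIF mechanism: the bounds table interprets the four RIOXXCardinality terms listed earlier, and only the creator rule is taken from the text. Because the term is spelled both "OneOrMore" and "oneOrMore" above, the lookup ignores case.

```python
# (minimum, maximum) occurrences per cardinality term; None = no upper limit.
# Keys are lower-cased because the text spells the terms inconsistently.
CARDINALITY_BOUNDS = {
    "oneormore": (1, None),
    "exactlyone": (1, 1),
    "zeroormore": (0, None),
    "zeroorone": (0, 1),
}

# cfClass_Class-style pairs (cfClassId1, cfClassId2); only the creator rule
# appears in the text above.
RULES = [("rioxxTerms:creator", "rioxxCardinality:oneOrMore")]


def within_bounds(cardinality_id, count):
    """Check a count against a cardinality term, with or without prefix."""
    term = cardinality_id.split(":", 1)[-1].lower()
    minimum, maximum = CARDINALITY_BOUNDS[term]
    return count >= minimum and (maximum is None or count <= maximum)


def violated_rules(rules, counts):
    """Return the rules whose term occurs too rarely or too often.

    `counts` maps a term identifier to its number of occurrences in one
    record; terms missing from `counts` are treated as occurring zero times.
    """
    return [(term, cardinality) for term, cardinality in rules
            if not within_bounds(cardinality, counts.get(term, 0))]


print(violated_rules(RULES, {"rioxxTerms:creator": 2}))  # []
print(violated_rules(RULES, {}))  # the creator rule is violated
```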

Summarising the investigations and mappings, further thought and feedback is required. A formal extension proposal document will be prepared for continued discussion within the CERIF TG and the wider community, where also the vocabularies’ and the terms’ identifiers need discussion. The current proposal adds a federated identifier cfFedId.cfFedId as placeholder reference for persistent identifiers (e.g. ORCID with person; FundRef with Funders, etc.).

The entire formal representation of the RIOXX Application Profile Version 1.0 presented above will be made available in CERIF XML for download and further investigation. This is to supply an unambiguous description of the current draft and proposal, not least for comparison with ongoing related activities such as OpenAIRE, where a CRIS Interoperation Profile in CERIF XML is being developed. It will be presented at the euroCRIS Membership Meeting in Bonn on May 13th, 2013.

The RIOXX team posts updates on the RIOXX blog.

Datasets in CERIF

Data or Datasets are increasingly recognised as an essential asset of the Research ecosystem. More and more, funders require data to be linked with research output such as publications, and the very concept of a ‘Data Publication’ is extensively discussed and variously approached. To support these data-intensive movements, the CERIF task group discussed data-related model extensions in meetings in Athens and approved the then proposed extensions in a more recent task group meeting in Braga.

[Figure: Datasets in CERIF]

The Data extension proposal started by investigating CKAN, DCAT and eGMS and was guided by a first draft proposal of the Jisc-funded C4D project. UK’s Research Data Management activities have been acknowledged as “the most advanced in Europe” and previous CERIF model extensions (e.g. Federated Identifiers, Measurements, Indicators) were strongly influenced by multiple projects funded under Jisc’s RIM Programmes.

The central CERIF entity behind the concept of data or dataset* is named cfResultProduct (in short, cfResProd); it maintains multiple relationships with e.g. Dataset, Equipment, Facility, Service, Funding, Measurement, Indicator, Medium, Patent, Publication, Project, Person, Organisation, and Federated Identifier. The brief introduction to CERIF shows the range of the CERIF model, its entities and their relationships. The current proposal adds a link from dataset to Geographic Bounding Box (cfGeoBBox), proposes the addition of an alternative title and of a date as a valuable attribute of a measurement (informed by the UK’s REF Reporting activities), and recommends how to deal with sensitive information, lineage/provenance and comments. The proposed CERIF model extensions will be incorporated in the upcoming CERIF XML 1.6 release, for which a CERIF XML Schema will soon be available for testing on the euroCRIS website; the official release is planned before the summer.

The CERIF for Dataset (C4D) project extended its demonstrator to support the export of metadata in CERIF; this is a work in progress that will be continued in liaison with euroCRIS. OpenAIRE extended its data model towards CERIF, allowing for linkage with datasets and funding.

The following small, not fully populated XML example of a CERIF Dataset is therefore valid according to CERIF 1.5, with the proposed extensions indicated in XML comments. The vocabulary for the example is left empty, as it depends very much on the particular context and needs further thought and input from real-life examples.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-04-03" sourceDatabase="Brigitte Jörg">
<!-- An example dataset record embedding federated identifiers and some example links -->
<!-- The vocabulary via cfClassId with real world systems should be reused if it has been defined somewhere else -->
<!-- The currently defined CERIF 1.5 vocabulary is publicly available in CERIF XML: http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xml -->

<cfResProd>
  <cfResProdId>internal-dataset-record-id-01</cfResProdId>
  <cfName cfLangCode="EN" cfTrans="o">The name or title of the dataset.</cfName>
  <!-- <cfAltName cfLangCode="EN" cfTrans="o">The alternative name or title of the dataset</cfAltName> -->
  <cfResProd_Class>
     <cfClassId>collection</cfClassId>
     <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
  </cfResProd_Class>
  <!-- link to a part of the dataset -->
  <cfResProd_ResProd>
     <cfResProdId2>dataset-record-id-02</cfResProdId2>
     <cfClassId>part</cfClassId>
     <cfClassSchemeId>inter-product-relations</cfClassSchemeId>
  </cfResProd_ResProd>
  <!-- link to repository as a service where the dataset function is described -->
  <cfResProd_Srv>
     <cfSrvId>dataset-id-01</cfSrvId>
     <cfClassId>function</cfClassId>
     <cfClassSchemeId>dataset-location-scheme</cfClassSchemeId>
  </cfResProd_Srv>
  <!-- link to funding information, such as programme or call -->
  <cfResProd_Fund>
     <cfFundId>funding-id-01</cfFundId>
     <cfClassId>funding</cfClassId>
     <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
     <cfStartDate>2000-01-01T00:00:00</cfStartDate>
     <cfEndDate>2012-12-31T00:00:00</cfEndDate>
  </cfResProd_Fund>
  <!-- link to provenance information, such as related measurement -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id-00</cfMeasId>
     <cfClassId>provenance</cfClassId>
     <cfClassSchemeId>dataset-provenance-scheme</cfClassSchemeId>
  </cfResProd_Meas>
  <!-- link to publication built on dataset -->
  <cfResPubl_ResProd>
     <cfResPublId>publication-id-01</cfResPublId>
     <cfClassId>built-on</cfClassId>
     <cfClassSchemeId>cerif-inter-output-relations-scheme</cfClassSchemeId>
  </cfResPubl_ResProd>
  <!-- link to person in the role of a contributor -->
  <cfPers_ResProd>
     <cfPersId>person-id-02</cfPersId>
     <cfClassId>contributor</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to person in the role of a group author -->
  <cfPers_ResProd>
     <cfPersId>person-id-03</cfPersId>
     <cfClassId>group-authors</cfClassId>
     <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
  </cfPers_ResProd>
  <!-- link to organisation in the role of a funder -->
  <cfOrgUnit_ResProd>
     <cfOrgUnitId>orgunit-id-01</cfOrgUnitId>
     <cfClassId>funder</cfClassId>
     <cfClassSchemeId>cerif-organisation-output-roles</cfClassSchemeId>
  </cfOrgUnit_ResProd>
  <!-- link to a comment made about the dataset -->
  <cfResProd_Meas>
     <cfMeasId>measurement-id</cfMeasId>
     <cfClassId>comment</cfClassId>
     <cfClassSchemeId>dataset-commenting-scheme</cfClassSchemeId>
  </cfResProd_Meas>
</cfResProd>

<!-- The vocabulary defining the dataset record happens via CERIF Semantic Layer -->
<!-- However, for this example, we do not provide any formal terminology and leave it empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
<!-- Another possible terminology or scheme - currently empty -->
<cfClassScheme>
  <cfClassSchemeId>scheme-id</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Scheme Name</cfName>
  <cfClass>
    <cfClassId>term-id</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Term Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">Term Description</cfDescr>
  </cfClass>
</cfClassScheme>
</CERIF>
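To get a quick overview of what a dataset record like the one above links to, the link entities can be listed with a few lines of Python. This is a minimal sketch using only the standard library; the embedded XML is an abbreviated copy of the example above, and the function names are hypothetical. It relies on the convention visible in the example that link entities carry a cfClassId and that target identifiers are elements whose names end in "Id" (or "Id2").

```python
import xml.etree.ElementTree as ET

NS = "urn:xmlns:org:eurocris:cerif-1.5-1"

# Abbreviated copy of the dataset example above.
DATASET_XML = """<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1">
  <cfResProd>
    <cfResProdId>internal-dataset-record-id-01</cfResProdId>
    <cfResProd_Class>
      <cfClassId>collection</cfClassId>
      <cfClassSchemeId>example-scheme-resource-types</cfClassSchemeId>
    </cfResProd_Class>
    <cfResProd_Fund>
      <cfFundId>funding-id-01</cfFundId>
      <cfClassId>funding</cfClassId>
      <cfClassSchemeId>cerif-output-funding-roles</cfClassSchemeId>
      <cfStartDate>2000-01-01T00:00:00</cfStartDate>
    </cfResProd_Fund>
    <cfPers_ResProd>
      <cfPersId>person-id-02</cfPersId>
      <cfClassId>contributor</cfClassId>
      <cfClassSchemeId>cerif-person-output-contributions-scheme</cfClassSchemeId>
    </cfPers_ResProd>
  </cfResProd>
</CERIF>"""


def local_name(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def dataset_links(cerif_xml):
    """List (link entity, target id, role) for every classified child.

    Plain classifications such as cfResProd_Class have no target id and
    are reported with None in its place.
    """
    root = ET.fromstring(cerif_xml)
    links = []
    for product in root.findall(f"{{{NS}}}cfResProd"):
        for child in product:
            role = child.findtext(f"{{{NS}}}cfClassId")
            if role is None:
                continue  # a plain attribute such as cfResProdId
            target_id = None
            for element in child:
                name = local_name(element.tag)
                is_class_ref = name in ("cfClassId", "cfClassSchemeId")
                if not is_class_ref and name.rstrip("2").endswith("Id"):
                    target_id = element.text
                    break
            links.append((local_name(child.tag), target_id, role))
    return links


for link in dataset_links(DATASET_XML):
    print(link)
```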

* The dataset discussion in the CERIF task group started from a common understanding of data set as defined in Wikipedia: “A data set (or dataset) is a collection of data, usually presented in tabular form.”

Data Context – Towards Pragmatic Boundaries

On March 18, the 1st Plenary of the global Research Data Alliance (RDA) was launched in Gothenburg by Neelie Kroes, Vice-President of the European Commission. The three-day event brought together more than 200 ‘data advocates’ from around the world, representing researchers, institutions, governments and industry, to facilitate discussions on possible next steps for collaboration and joint work via working groups or interest groups, in the spirit of openness, sharing and re-use at the intersection of culture and science.

The procedure for setting up RDA working groups and interest groups has been introduced, but is still considered a work in progress. At the launch event’s Agora session, group proposals were presented, and discussions continued within the so-called Birds of a Feather sessions. Besides two formal working groups on “Persistent Identifiers” and “Data Type Registries”, mature case statements were introduced for “Data Foundations and Terminology”, “Practical Policy”, “Legal Interoperability” and “Communities and Engagement”. Other proposals were still refining their objectives (“Metadata”, “Contextual Metadata”, “Repository Audit and Certification”, “Preservation e-Infrastructure”, “Marine Data Management”, “UPC Code”, “Data Citation”), and new ideas emerged (“Community Capability Model”, “Big Data Analysis Query”, “Worldwide PID”, “Data Publication”, “Architectural Data Interoperability” and “Industry and Health Informatics”).

While the different proposals indicate the wide range, implied challenges and overlap, this post is to inform about the outcome of discussions around “Contextual Metadata”, which started from four initial, rather unspecified use-cases (see also forum discussion), of which (1-3) were classified as managerial and (4) was meant to speak for the researcher:

  1. Output Reporting to Funders
  2. Exchange of Information on Research Activity between Universities
  3. Management of the Research Portfolio of a University by a Research Manager
  4. Discovery and Re-use of Datasets for other Purposes

The one-hour BOF discussions approached these use-cases with the aim of understanding a context by identifying its significant underlying entities and their relationships, working towards formal, standardised descriptions (i.e. implementations) and taking into account the material that is available and the expertise of people willing to engage. In other words, the aim is to put forward a formal Case Statement for establishing an RDA Working Group, approached via use-cases.

Discussions revealed there is a lot of interest in the proposed group, which in the spirit of RDA will be renamed to “Data Context”. Furthermore, it was recommended that the proposed use-cases should be as specific as possible to ensure feasibility and delivery.

[Figure: Pragmatic Boundaries]

In the end, a new set of much more specific use-cases has been agreed. These will be posted in the RDA forum for further public consultation and refinement. They have been classified according to anticipated perspectives and in their order follow the discussed priorities:

  • Researcher: Find data and supplementary information (e.g. services, reports, tools, news, photos, …) from different catalogues to support a case study around an event (e.g. Hurricane Katrina).
  • Managerial: Indicate to funders the overlaps and gaps in currently funded research. Funders want to know from data centres and scientists whether there are overlaps between programmes, and to look a lot wider; this implies, among other things, an understanding of the semantics of geographic and temporal context.
  • Provenance: Allow segments to be taken from streamed data and workflows (e.g. a social scientist reporting on the social aspects of a climatic event, or an agency publishing storm reports or maps, increasingly in public spaces such as Facebook or Twitter, where people wish to know where the data comes from, who produced the image and who ran the processing job).
  • Interoperability: Exchange of contextual metadata between different systems.

Close collaboration is foreseen and has started with other working groups, especially the proposed RDA “Metadata” group, where interaction facilitators have been nominated to maintain the bridge. In addition, there was an agreement to exchange group members between the ICSU/World Data System’s group “Knowledge Network” and to explore potential collaboration opportunities with CODATA working groups. Available standardisation and harmonisation approaches such as those initiated by DCC, PROV, PREMIS, MARC, CKAN, DCAT, CERIF, CASRAI, VIVO, OAIS, APA, W3C, ISO, OMG, etc. will certainly guide development and implementation processes. A report is being prepared, slides will be uploaded and discussions will be continued in the RDA forum.

The RDA initiative has been brought into existence by an initial three research funding organisations:

  • The Australian Commonwealth Government through the Australian National Data Service supported by the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative
  • The European Commission through the iCordi project funded under the 7th Framework Program
  • The United States of America through the RDA/US activity funded by the National Science Foundation

Research Data Alliance on Twitter:  http://twitter.com/resdatall

CERIF 1.5 Reference Doc – XML API – GtR Hack Day

The Gateway to Research (GtR) project organised its first hack days to test its APIs. The two-day event was hosted by Aston University in Birmingham and brought together about 20 invited people with backgrounds in system development. Documentation was published beforehand explaining in detail the GtR Application Programming Interface V1.0, which at the time of testing provided two APIs

  • GtR API
  • CERIF API

both producing outputs in XML and JSON formats. More APIs will be added and existing APIs will change over time. Future updates will be announced on the GtR web portal. The hack event was regarded as a test drive, and further events are being considered for later this year. The GtR CERIF XML API received very positive feedback during the two-day event.

This post informs the wider community about the availability of a CERIF 1.5 Reference Document. It is the result of a few hours of collaborative work at the hack event by Chris Gutteridge and Brigitte Jörg, who automatically transformed and merged existing CERIF model files into a more readable version of the CERIF 1.5 descriptions. The aim is to serve the developer community in particular, saving them time in getting to know the CERIF model, its structure and thus its mission.

The file transformation script developed by Chris Gutteridge is provided without restrictions, which encourages re-use and adjustment for upcoming CERIF releases.

Further event results will be reported at the GtR website.

The goal of GtR is to give the public better access to information on research funded by the Research Councils, such as who, what and where the Councils fund, and the outputs and outcomes, linking to available open access repositories and/or data catalogues. More information and discussion are available from the links below:

CERIF UK Coordination Meeting (preliminary summary)

On February 28th 2013 a CERIF UK Coordination Meeting was held at Prospero House in London, to identify the priorities for a feasible and sustainable CERIF coordination and implementation roadmap from ongoing activities in the UK Research Information Management (RIM) space. The current wider CERIF UK landscape is depicted as follows.

 

Wider CERIF UK Landscape

 

Before the meeting, an open spreadsheet was prepared listing the ongoing activities, the organisations engaged and the outputs available (reflected in the above image). The spreadsheet was meant as a starting point; it remains open for extension and should not yet be considered final: http://bit.ly/13PWJqq

Many UK HEIs now use CERIF-based systems and work continues to ensure that funders’ systems to collect information about research outputs can accept information from universities in this standard format. Emerging national infrastructures such as RCUK’s Gateway to Research are based on CERIF and a CERIF-XML interface for the REF submission system is being prepared for addition during the first half of 2013. These developments are increasingly complemented by international activities within CASRAI and VIVO.

However, consistent implementation and standards development require coordination. The meeting provided an overview of ongoing and past activities, the organisations engaged and the resulting outputs and assets, in order to highlight the need for sustainability and to determine which of them require further development, maintenance and dissemination. Coordinated action is necessary to enable their consistent re-use and implementation. The meeting was an opportunity to identify the current priorities and issues and the commitments that can be made for the next steps forward.

In the end, it was clear that coordination is work and requires human resources and organisations to support the efforts, i.e. the work also needs to secure continued funding beyond June 2013.

A report and related material will be made available shortly.

A CERIF-XML Person Record + Vocab

The following XML describes a valid person record in CERIF-XML embedding two federated identifier types, namely ORCID and an example HESA Staff Identifier. Furthermore, it gives two relationships to organisations: UKOLN via the role “Employee” and euroCRIS via the role “Board Member”. The role “Employee” is a term already defined in the CERIF Vocabulary and therefore carries a UUID, whereas the role “Board Member” is not a defined term in the CERIF Vocabulary and thus uses a proprietary ID encoding.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-02-23" sourceDatabase="Brigitte Jörg">
<!-- A person record embedding federated identifiers and links to organisation records -->
<cfPers>
  <cfPersId>internal-pers-id-brigitte-joerg</cfPersId>
  <cfGender>f</cfGender>
  <cfPersName_Pers>
    <cfPersNameId>internal-persname-id-brigitte-joerg</cfPersNameId>
    <cfClassId>64f0eb00-462d-4737-8033-7efac82decf3</cfClassId> <!-- Passport Name -->
    <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId> <!-- Person Names -->
    <cfFamilyNames>Jörg</cfFamilyNames>
    <cfFirstNames>Brigitte</cfFirstNames>
  </cfPersName_Pers>
  <cfFedId>
    <cfFedIdId>internal-fed-id1-brigitte-joerg</cfFedIdId>
    <cfFedId>http://orcid.org/0000-0001-7941-8108</cfFedId> <!-- Brigitte Jörg's ORCID -->
    <cfFedId_Class>
        <cfClassId>716bcc9a-c9dd-4b8b-b4ab-6c140e578ec3</cfClassId> <!-- the "ORCID" term's uuid in the CERIF Vocabulary -->
        <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId> <!-- Identifier Types Scheme -->
    </cfFedId_Class>
  </cfFedId>
  <cfFedId>
    <cfFedIdId>internal-fed-id2-brigitte-joerg</cfFedIdId>
    <cfFedId>012345678910111213</cfFedId> <!-- Brigitte Jörg's fictitious HESA staff identifier -->
    <cfFedId_Class>
        <cfClassId>c0071785-549a-4379-a2af-d9a978ea3a1e</cfClassId> <!-- the HESA "STAFFID" term's uuid -->
        <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId>
    </cfFedId_Class>
  </cfFedId>
  <cfPers_OrgUnit>
    <cfOrgUnitId>internal-orgunit-id-ukoln</cfOrgUnitId>
    <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId> <!-- Employee -->
    <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    <cfStartDate>2012-06-01T00:00:00</cfStartDate>
  </cfPers_OrgUnit>
  <cfPers_OrgUnit>
    <cfOrgUnitId>internal-orgunit-id-euroCRIS</cfOrgUnitId>
    <cfClassId>board-member-term-id</cfClassId> <!-- not yet a released CERIF term -->
    <cfClassSchemeId>possibly-person-organisation-roles-scheme-id</cfClassSchemeId>
    <cfStartDate>2005-01-01T00:00:00</cfStartDate>
  </cfPers_OrgUnit>
</cfPers>

<!-- The vocabulary defining the person record via CERIF Semantic Layer -->
<cfClassScheme>
  <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Person Names</cfName>
  <cfClass>
    <cfClassId>64f0eb00-462d-4737-8033-7efac82decf3</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Passport Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">The name of the person as printed in the passport.</cfDescr>
  </cfClass>
</cfClassScheme>
<cfClassScheme>
  <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Identifier Types</cfName>
  <cfClass>
    <cfClassId>c0071785-549a-4379-a2af-d9a978ea3a1e</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">STAFFID</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">The Staff identifier is a unique code allocated to a staff member when they are first entered onto the staff record and, where a member of staff is contracted to work in jobs classified in SOC groups 1,2 or 3, it stays with them for the whole of their career within HE.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://www.hesa.ac.uk/component/option,com_collns/task,show_manuals/Itemid,233/r,08025/f,003/</cfDescrSrc>
  </cfClass>
  <cfClass>
    <cfClassId>716bcc9a-c9dd-4b8b-b4ab-6c140e578ec3</cfClassId> 
    <cfTerm cfLangCode="en" cfTrans="o">ORCID</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://about.orcid.org</cfDescrSrc>
  </cfClass>
</cfClassScheme>
<cfClassScheme>
  <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Person Organisation Roles</cfName>
  <cfClass>
    <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Employee</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">A worker who is hired to perform a job.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://wordnetweb.princeton.edu/perl/webwn?s=Employee</cfDescrSrc>
  </cfClass>
</cfClassScheme>
</CERIF>

For exchanging information, the vocabulary defined in CERIF Semantic Layer format does not necessarily have to be embedded in the .xml file, provided the CERIF vocabulary terms (UUIDs) can be expected to be known by the receiving or sending agent.

The file is valid according to the released CERIF 1.5 XML Schema. This validation, however, does not check the cfClassIds, i.e. the validity of the employed vocabulary. If that is a requirement, an application-specific CERIF 1.5 XML Schema can be generated via the CERIF-TG-Toolbox hosted on SourceForge to ensure that only a pre-defined controlled vocabulary is valid.
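A vocabulary check of the kind that schema validation leaves out can also be approximated with a short script. The following is a minimal Python sketch and not part of the CERIF-TG-Toolbox: it assumes the vocabulary is embedded in the file as cfClassScheme elements, as in the record above, and reports every (cfClassSchemeId, cfClassId) pair that a record uses but the embedded vocabulary does not define. The function names and the trimmed SAMPLE document are hypothetical and serve illustration only.

```python
import xml.etree.ElementTree as ET

# CERIF 1.5 namespace as declared on the root element of the record above.
NS = {"cf": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# Trimmed, hypothetical sample: one defined role and one placeholder role.
SAMPLE = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1">
  <cfPers>
    <cfPersId>internal-pers-id-example</cfPersId>
    <cfPers_OrgUnit>
      <cfOrgUnitId>internal-orgunit-id-ukoln</cfOrgUnitId>
      <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
      <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    </cfPers_OrgUnit>
    <cfPers_OrgUnit>
      <cfOrgUnitId>internal-orgunit-id-euroCRIS</cfOrgUnitId>
      <cfClassId>board-member-term-id</cfClassId>
      <cfClassSchemeId>possibly-person-organisation-roles-scheme-id</cfClassSchemeId>
    </cfPers_OrgUnit>
  </cfPers>
  <cfClassScheme>
    <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    <cfClass>
      <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
    </cfClass>
  </cfClassScheme>
</CERIF>
"""


def defined_terms(root):
    """Collect (scheme id, class id) pairs defined in embedded cfClassScheme elements."""
    terms = set()
    for scheme in root.findall("cf:cfClassScheme", NS):
        scheme_id = scheme.findtext("cf:cfClassSchemeId", default="", namespaces=NS).strip()
        for cls in scheme.findall("cf:cfClass", NS):
            class_id = cls.findtext("cf:cfClassId", default="", namespaces=NS).strip()
            terms.add((scheme_id, class_id))
    return terms


def used_terms(root):
    """Collect (scheme id, class id) pairs referenced by records (everything except cfClassScheme)."""
    used = set()
    scheme_tag = "{%s}cfClassScheme" % NS["cf"]
    for top in root:
        if top.tag == scheme_tag:
            continue  # vocabulary definitions are not usages
        for elem in top.iter():
            # A usage is any element holding both ids as direct children.
            class_id = elem.findtext("cf:cfClassId", namespaces=NS)
            scheme_id = elem.findtext("cf:cfClassSchemeId", namespaces=NS)
            if class_id is not None and scheme_id is not None:
                used.add((scheme_id.strip(), class_id.strip()))
    return used


def undefined_terms(xml_text):
    """Return the sorted list of used pairs that the embedded vocabulary does not define."""
    root = ET.fromstring(xml_text)
    return sorted(used_terms(root) - defined_terms(root))


if __name__ == "__main__":
    for scheme_id, class_id in undefined_terms(SAMPLE):
        print("undefined term:", class_id, "in scheme", scheme_id)
```

Run against the trimmed sample, the sketch flags only the placeholder “Board Member” pair, since the “Employee” term is defined in the embedded scheme. Where the vocabulary is not embedded, the set of defined pairs would have to come from another source, such as a locally held copy of the CERIF Vocabulary.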