CERIF, the Common European Research Information Format, is a conceptual model describing the Research domain. Formally, it is maintained as an Entity Relationship Model (ERM) from which SQL scripts for multiple databases (e.g. Oracle, MySQL, MS SQL Server) can be generated and which inspired the CERIF XML format. This page is structured as follows:
- Introduction and Brief History
- The Conceptual CERIF Model
- The CERIF Semantic Layer
- The CERIF XML Exchange Format
- CERIF and the Semantic Web / Linked Open Data Web
- Identifiers in CERIF
- References and Notes
CERIF is an EU Recommendation to Member States, and its history reaches back to the late 1970s. However, it was only after the European Union organised a workshop and convened a working group on Research Databases in 1987 that the first CERIF version was explicitly developed and formally released. The CERIF model has thus grown organically and has since been shaped by technological developments.
The first CERIF Manual of 1991 defined research information systems covering projects only; these were simple project recording systems. With that first release, however, a need for more contextual information was already identified, and it was recognised in CERIF 2000. The published CERIF 2000 toolkit defines a full CRIS data model, an exchange model, and a metadata data model. Furthermore, it recommends the use of the ORTELIUS Thesaurus (3) for subject headings and NUTS area codes for geographic encodings. The CERIF 2000 model was a significant step forward in describing the research domain: it was a true model rather than only a record description; it allowed for the inclusion of one single classification system or controlled vocabulary (it recommended the ORTELIUS Thesaurus); it allowed for multilingual expressions; it maintained roles, types and statuses; and it was heavily user-driven.
In 2002, the European Commission handed responsibility for CERIF and its further development to euroCRIS, a non-profit organisation registered in the Netherlands. Since then, the CERIF model has undergone major upgrades: normalisation in 2006 and the introduction of the Semantic Layer in 2008. Since the release of CERIF 1.4 in early 2012, it has maintained an embedded XML exchange format, which has become very popular.
CERIF describes entities in the Research domain, such as person, organisation, project, publication, patent, data, facility, equipment, service, funding, measurement, indicators and identifiers, together with their relationships. The image below presents an overview of the Research domain and its related entities, where the colours indicate possible contexts, such as results (orange), outcomes (red), actors (green), infrastructure (purple), etc.
The names of formally described CERIF entities are shortened to ensure consistency across databases, since some databases allow only a maximum of 38 characters per entity name. The formal, or so-called physical, description of each CERIF entity, through its ERM definition and finally through the SQL scripts, follows a certain pattern. Each entity maintains a system-internal identifier (e.g. cfPers.cfPersId, cfProj.cfProjId, cfOrgUnit.cfOrgUnitId), some basic entity-specific attributes, and relationships. Examples of entity-specific attributes are the acronym of a project, facility or service (cfProj.cfAcro, cfFacil.cfAcro, cfSrv.cfAcro), the birthdate and gender of a person (cfPers.cfBirthdate, cfPers.cfGender), and the amount of a funding (cfFund.cfAmount).
In CERIF, relationships are also entities. They maintain their own attributes, such as timestamps, and allow for high flexibility in labelling and namespace allocation. Each CERIF relationship entity, a so-called link entity such as cfProj_Pers or cfPers_OrgUnit, is named after its two linked entities. Importantly, the order of the two linked entities in the name is entirely neutral and never implies a reading direction (see image below).
The formal label, or more meaningful name, of a CERIF link entity is in fact given through a term that is defined in the Semantic Layer, such as cfClass.cfClassTerm.cfTerm="Manager". Such a term is then referenced from within link entities by its identifier, e.g. cfClass.cfClassId="79a2e340-1cfc-11e1-8bc2-0800200c9a66", and is supported by a namespace with its own identifier, e.g. cfClass.cfClassSchemeId="94fefd50-1d00-11e1-8bc2-0800200c9a66". This pattern is applied similarly throughout the CERIF model, and a better understanding will result from the remainder of this page and from the examples. The SQL script resulting from the above ERM to describe, for example, a person and their project relationship is as follows. It first creates a CERIF person table cfPers, then a CERIF project table cfProj, and finally relates the two by creating a link table cfProj_Pers.
-- Table cfPers
CREATE TABLE "cfPers"(
  "cfPersId" Char(128) NOT NULL,
  "cfBirthdate" Date,
  "cfGender" Char(1),
  "cfURI" Char(128)
);

-- Table cfProj
CREATE TABLE "cfProj"(
  "cfProjId" Char(128) NOT NULL,
  "cfStartDate" Date,
  "cfEndDate" Date,
  "cfAcro" Char(16),
  "cfURI" Char(128)
);

-- Table cfProj_Pers
CREATE TABLE "cfProj_Pers"(
  "cfProjId" Char(128) NOT NULL,
  "cfPersId" Char(128) NOT NULL,
  "cfClassId" Char(128) NOT NULL,
  "cfClassSchemeId" Char(128) NOT NULL,
  "cfStartDate" Timestamp(6) NOT NULL,
  "cfEndDate" Timestamp(6) NOT NULL,
  "cfFraction" Float
);
Each CERIF table (entity) and link (entity) is defined similarly, and descriptions have been generated in multiple database definition languages (e.g. Oracle, DB2, MySQL) for euroCRIS members. A public HTML version of the CERIF 1.5 model for navigation is available from the euroCRIS website.
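As a minimal sketch of how such a link entity is queried, the following Python snippet loads simplified versions of the three tables above into an in-memory SQLite database and joins them. SQLite, the reduced column set, the sample rows and the identifiers pers-id0 and proj-id0 are assumptions made for illustration; the two UUIDs are the "Manager" term and scheme identifiers quoted above.

```python
import sqlite3

# Simplified versions of the CERIF tables shown above.
# SQLite types and the reduced column set are assumptions for this sketch.
SCHEMA = """
CREATE TABLE cfPers (cfPersId TEXT NOT NULL PRIMARY KEY, cfBirthdate TEXT, cfGender TEXT);
CREATE TABLE cfProj (cfProjId TEXT NOT NULL PRIMARY KEY, cfAcro TEXT);
CREATE TABLE cfProj_Pers (
    cfProjId TEXT NOT NULL,
    cfPersId TEXT NOT NULL,
    cfClassId TEXT NOT NULL,
    cfClassSchemeId TEXT NOT NULL,
    cfStartDate TEXT NOT NULL,
    cfEndDate TEXT NOT NULL,
    cfFraction REAL
);
"""

# Term and scheme identifiers for "Manager", as quoted in the text above.
MANAGER_CLASS = "79a2e340-1cfc-11e1-8bc2-0800200c9a66"
MANAGER_SCHEME = "94fefd50-1d00-11e1-8bc2-0800200c9a66"

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample records.
conn.execute("INSERT INTO cfPers VALUES ('pers-id0', '1971-04-17', 'f')")
conn.execute("INSERT INTO cfProj VALUES ('proj-id0', 'DEMO')")
conn.execute(
    "INSERT INTO cfProj_Pers VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("proj-id0", "pers-id0", MANAGER_CLASS, MANAGER_SCHEME,
     "2012-06-01T00:00:00", "2014-05-31T00:00:00", 0.5),
)

# The link row carries the role (cfClassId) and the time span of the relationship.
rows = conn.execute(
    """
    SELECT p.cfPersId, j.cfAcro, l.cfClassId, l.cfStartDate, l.cfEndDate
    FROM cfProj_Pers AS l
    JOIN cfPers AS p ON p.cfPersId = l.cfPersId
    JOIN cfProj AS j ON j.cfProjId = l.cfProjId
    WHERE l.cfClassSchemeId = ?
    """,
    (MANAGER_SCHEME,),
).fetchall()
print(rows)
```

The name cfProj_Pers implies no direction here: the same link row answers both "which projects does this person manage?" and "who manages this project?".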
The Semantic Layer in CERIF is in fact a conceptual construct describing a sub-model within CERIF that allows for the efficient and meaningful management of controlled vocabularies. It thus provides declared semantics that follow the formal syntax of the CERIF model. A known mantra in the community is that Current Research Information Systems (CRISs) are built on a “formal syntax and declared semantics”, i.e. CERIF.
In CERIF, controlled vocabulary terms are applicable only from within link entities of which there are basically two kinds:
- Binary Link entities such as: cfProj_Pers, cfProj_OrgUnit, cfProj_Fund
- Unary Link entities such as: cfProj_Class, cfOrgUnit_Class, cfFund_Class
Each link entity follows the same pattern. It hosts the two identifiers of the two linked entities, a time span given by cfStartDate/cfEndDate, and a reference to the cfClassId and cfClassSchemeId, in a triple-like manner. The vocabulary term behind the cfClassId and its namespace, the cfClassSchemeId, are maintained in the Semantic Layer (see image below). The Semantic Layer allows for a very detailed specification of each single term by indicating its source, by allowing for a description, a definition and examples, and even by allowing for a formal linkage between terms through the cfClass_Class entity, i.e. formal CERIF term mappings.
A new vocabulary term is created by adding a new cfClass record with a term identifier cfClass.cfClassId within a namespace or scheme cfClass.cfClassSchemeId at a certain date and, if available, a cfClass.cfURI. The compound of cfClass.cfClassId and cfClass.cfClassSchemeId is then the so-called primary key for identifying the vocabulary term cfClassTerm.cfTerm, its description cfClassDescr.cfDescr, related examples cfClassEx.cfEx, and its definition cfClassDef.cfDef.
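The compound key described above can be sketched with a reduced cfClass and cfClassTerm pair in SQLite. The column sets, the cfLangCode column on cfClassTerm, the helper names add_term and lookup_term, and the freshly generated UUIDs are all assumptions for illustration and are not taken from the official CERIF scripts.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cfClass (
    cfClassId TEXT NOT NULL,
    cfClassSchemeId TEXT NOT NULL,
    cfStartDate TEXT,
    cfURI TEXT,
    PRIMARY KEY (cfClassId, cfClassSchemeId)   -- compound key: term id within a scheme
);
CREATE TABLE cfClassTerm (
    cfClassId TEXT NOT NULL,
    cfClassSchemeId TEXT NOT NULL,
    cfLangCode TEXT NOT NULL,
    cfTerm TEXT NOT NULL,
    PRIMARY KEY (cfClassId, cfClassSchemeId, cfLangCode)
);
""")


def add_term(conn, scheme_id, term, lang="en", start_date="2012-01-01"):
    """Create a cfClass record plus its human-readable term (hypothetical helper)."""
    class_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO cfClass (cfClassId, cfClassSchemeId, cfStartDate) VALUES (?, ?, ?)",
        (class_id, scheme_id, start_date),
    )
    conn.execute(
        "INSERT INTO cfClassTerm VALUES (?, ?, ?, ?)",
        (class_id, scheme_id, lang, term),
    )
    return class_id


def lookup_term(conn, class_id, scheme_id, lang="en"):
    """Resolve the compound key to its term, or None if it is unknown."""
    row = conn.execute(
        "SELECT cfTerm FROM cfClassTerm "
        "WHERE cfClassId = ? AND cfClassSchemeId = ? AND cfLangCode = ?",
        (class_id, scheme_id, lang),
    ).fetchone()
    return row[0] if row else None


scheme_id = str(uuid.uuid4())          # hypothetical scheme (namespace) identifier
manager_id = add_term(conn, scheme_id, "Manager")
print(lookup_term(conn, manager_id, scheme_id))
```

A link entity would store only manager_id and scheme_id; the label is resolved through the Semantic Layer when needed.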
In relational databases, information objects are aggregated through identifiers by queries or according to application requirements. In CERIF, this means that an information object such as a person (cfPers) is composed upon request, depending on system support or data supply from the employed system entities. An entire person object in CERIF would thus be aggregated from information maintained in the following person-related entities: cfPers, cfPers_Class, cfPersName, cfPersName_Pers, cfPersResInt, cfPersKeyw, cfPers_Country, cfPers_Lang, cfPers_CV, cfPers_EAddr, cfPers_PAddr, cfPers_Event, cfPers_ExpSkills, cfPers_Prize, cfPers_Qual, cfPers_Equip, cfPers_Facil, cfPers_Srv, cfPers_Fund, cfPers_ResPubl, cfPers_ResPat, cfPers_ResProd, cfProj_Pers, cfPers_OrgUnit, cfPers_Meas, cfPers_Indic, cfPers_Medium.
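The aggregation described above might be sketched as follows for just three of the listed entities. The reduced column sets, the sample values, the placeholder class identifiers (real CERIF data would use UUIDs) and the helper person_object are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cfPers (cfPersId TEXT PRIMARY KEY, cfGender TEXT);
CREATE TABLE cfPersName (cfPersNameId TEXT PRIMARY KEY, cfFamilyNames TEXT, cfFirstNames TEXT);
CREATE TABLE cfPersName_Pers (cfPersNameId TEXT, cfPersId TEXT, cfClassId TEXT);
CREATE TABLE cfPers_OrgUnit (cfPersId TEXT, cfOrgUnitId TEXT, cfClassId TEXT);

-- Hypothetical sample data; class identifiers are readable placeholders.
INSERT INTO cfPers VALUES ('pers-id0', 'f');
INSERT INTO cfPersName VALUES ('persname-id1', 'Doe', 'Jane');
INSERT INTO cfPersName_Pers VALUES ('persname-id1', 'pers-id0', 'presented-name');
INSERT INTO cfPers_OrgUnit VALUES ('pers-id0', 'orgunit-id0', 'researcher');
""")
conn.row_factory = sqlite3.Row  # rows become addressable by column name


def person_object(conn, pers_id):
    """Assemble one person object from several entities via cfPersId."""
    core = conn.execute(
        "SELECT * FROM cfPers WHERE cfPersId = ?", (pers_id,)
    ).fetchone()
    if core is None:
        return None
    names = conn.execute(
        """
        SELECT n.cfFirstNames, n.cfFamilyNames, l.cfClassId
        FROM cfPersName_Pers AS l
        JOIN cfPersName AS n ON n.cfPersNameId = l.cfPersNameId
        WHERE l.cfPersId = ?
        """,
        (pers_id,),
    ).fetchall()
    units = conn.execute(
        "SELECT cfOrgUnitId, cfClassId FROM cfPers_OrgUnit WHERE cfPersId = ?",
        (pers_id,),
    ).fetchall()
    return {
        "cfPers": dict(core),
        "cfPersName_Pers": [dict(r) for r in names],
        "cfPers_OrgUnit": [dict(r) for r in units],
    }


person = person_object(conn, "pers-id0")
print(person)
```

Which entities are included is an application decision; a full implementation would repeat the same pattern for each of the person-related entities listed above.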
The structure of the conceptual (relational) CERIF model inspired the CERIF XML exchange format. However, whereas in a relational model the components of an information object are aggregated through internal identifiers, in XML they are embedded (hierarchical). The CERIF XML exchange format replicates the names (syntax) of the relational entities. A person object in CERIF XML would thus be described as follows:
<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1"
       xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       release="1.5" date="2013-01-07" sourceDatabase="Short Person Example Record">
  <cfPers>
    <cfPersId>pers-id0</cfPersId>
    <cfBirthdate>1971-04-17</cfBirthdate>
    <cfGender>f</cfGender>
    <cfPersName_Pers/>
    <cfPers_EAddr/>
    <cfPers_PAddr/>
    <cfPers_Country/>
    <cfPers_Lang/>
    <cfPers_CV/>
    <cfPers_Event/>
    <cfPers_ExpSkills/>
    <cfPers_Prize/>
    <cfPers_Qual/>
    <cfPers_Equip/>
    <cfPers_Facil/>
    <cfPers_Srv/>
    <cfPers_Fund/>
    <cfPers_ResPubl/>
    <cfPers_ResPat/>
    <cfPers_ResProd/>
    <cfProj_Pers/>
    <cfPers_OrgUnit/>
    <cfPers_Meas/>
    <cfPers_Indic/>
    <cfPers_Medium/>
  </cfPers>
</CERIF>
A more comprehensive CERIF XML person record, filled with more data, could look as follows.
<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1"
       xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       release="1.5" date="2013-01-07" sourceDatabase="More comprehensive Person Example Record">
  <cfPers>
    <cfPersId>pers-id0</cfPersId>
    <cfGender>f</cfGender>
    <cfPersName_Pers>
      <cfPersNameId>persname-id1</cfPersNameId>
      <cfClassId>55f90543-d631-42eb-8d47-d8d9266cbb26</cfClassId> <!-- presented name -->
      <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId>
      <cfFirstNames>Jörg</cfFirstNames>
      <cfFamilyNames>Brigitte</cfFamilyNames>
    </cfPersName_Pers>
    <cfPersName_Pers>
      <cfPersNameId>persname-id2</cfPersNameId>
      <cfClassId>64f0eb00-462d-4737-8033-7efac82decf3</cfClassId> <!-- passport name -->
      <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId>
      <cfFirstNames>Joerg</cfFirstNames>
      <cfFamilyNames>Brigitte</cfFamilyNames>
    </cfPersName_Pers>
    <cfKeyw cfLangCode="EN" cfTrans="o">CERIF; CRIS; Information Systems Research Information Management.</cfKeyw>
    <cfPers_EAddr>
      <cfEAddrId>firstname.lastname@example.org</cfEAddrId>
      <cfClassId>35d43364-2160-4b6c-a487-5019458321e8</cfClassId> <!-- professional email -->
      <cfClassSchemeId>05cc5ff9-bc58-4743-ab59-46e5013e0039</cfClassSchemeId>
    </cfPers_EAddr>
    <cfPers_OrgUnit>
      <cfOrgUnitId>012345678</cfOrgUnitId>
      <cfClassId>ebd55ab0-1cfc-11e1-8bc2-0800200c9a66</cfClassId> <!-- researcher -->
      <cfClassSchemeId>e9616dbd-0d38-4b7d-a6cd-3c4df1e95462</cfClassSchemeId>
      <cfStartDate>2012-06-01T00:00:00</cfStartDate>
    </cfPers_OrgUnit>
    <cfPers_Pers>
      <cfPersId2>pers-id02</cfPersId2>
      <cfClassId>3ccd035b-bc79-477e-aa6c-0bd3606f85c8</cfClassId> <!-- supervisor -->
      <cfClassSchemeId>6b2b7d24-3491-11e1-b86c-0800200c9a66</cfClassSchemeId>
    </cfPers_Pers>
  </cfPers>
</CERIF>
The euroCRIS website provides more CERIF XML examples in its section on CERIF releases. When link entities are embedded, they can omit the identifier of the embedding entity because it is inherited; the objects are thus transparently closed (8).
Formal and valid CERIF XML can be ensured by referencing a version (e.g. CERIF 1.5) of the XML Schema in the file header. To generate valid CERIF XML, it is recommended to consult the specification documents on the euroCRIS website or to get in touch with the CERIF task group.
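To illustrate the inherited identifiers, the following sketch reads a trimmed version of the person record above with Python's standard xml.etree.ElementTree and turns each embedded cfPers_OrgUnit element back into a flat link row. The function name flatten_links and the dictionary layout are assumptions for illustration; the snippet does not validate against the CERIF XML Schema.

```python
import xml.etree.ElementTree as ET

# Default namespace declared in the CERIF 1.5 examples above.
NS = {"c": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# Trimmed version of the person record shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" release="1.5" date="2013-01-07" sourceDatabase="sketch">
  <cfPers>
    <cfPersId>pers-id0</cfPersId>
    <cfGender>f</cfGender>
    <cfPers_OrgUnit>
      <cfOrgUnitId>012345678</cfOrgUnitId>
      <cfClassId>ebd55ab0-1cfc-11e1-8bc2-0800200c9a66</cfClassId>
      <cfClassSchemeId>e9616dbd-0d38-4b7d-a6cd-3c4df1e95462</cfClassSchemeId>
      <cfStartDate>2012-06-01T00:00:00</cfStartDate>
    </cfPers_OrgUnit>
  </cfPers>
</CERIF>
"""


def flatten_links(xml_bytes):
    """Return one flat row per embedded cfPers_OrgUnit element."""
    root = ET.fromstring(xml_bytes)
    links = []
    for pers in root.findall("c:cfPers", NS):
        pers_id = pers.findtext("c:cfPersId", namespaces=NS)
        for link in pers.findall("c:cfPers_OrgUnit", NS):
            links.append({
                # Not repeated inside the link element: inherited from the embedding cfPers.
                "cfPersId": pers_id,
                "cfOrgUnitId": link.findtext("c:cfOrgUnitId", namespaces=NS),
                "cfClassId": link.findtext("c:cfClassId", namespaces=NS),
                "cfClassSchemeId": link.findtext("c:cfClassSchemeId", namespaces=NS),
                "cfStartDate": link.findtext("c:cfStartDate", namespaces=NS),
            })
    return links


links = flatten_links(SAMPLE.encode("utf-8"))
print(links)
```

Every element lookup needs the namespace mapping, because all CERIF XML elements live in the default namespace declared on the root element.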
There is no formal CERIF ontology available in OWL or RDF. First steps towards expressing CERIF in VIVO have been taken and will be continued. We cite the following statements from the Linked Open Data (LOD) task group report at the Business Session of the euroCRIS Membership Meeting in Madrid in November 2013 (7).
- The Semantics are in the relational CERIF Model.
- The LOD is not adding by itself any semantics, only exposing in a different format. So the LOD “ontologies” are no other thing than a RDF expression of CERIF.
- Improvement in semantics should go either directly to the CERIF relational model or to some “extended” ontology.
CERIF is maintained as a relational model, and therefore every single entity has its own system-internal identifier (primary key) by which its records are recognised and finally aggregated into an information object by queries or rules. This is common and works very well in so-called closed-world systems, which relational databases are. Increasingly, however, systems need to interoperate, and the re-use or integration of information across systems is becoming a common scenario in which identifiers play a crucial role. The strict rules that apply in relational CERIF database systems are somewhat looser in CERIF XML. This is justified by the differences in the underlying technologies and their application. Whereas system-internal identifiers are crucial for system-internal aggregation, they may no longer be essential for an exchange format and may thus even be defined as not mandatory in the future. Furthermore, with hierarchical embeddings in CERIF XML, the identifiers are propagated to link entities and need not be repeated (see XML examples above).
System-internal identifiers for common Research entity records are usually created in different departments (e.g. person identifiers with Human Resources, project identifiers with Project Management Systems, etc.). With its latest release, the CERIF model opened up its boundaries to external systems by introducing a so-called federated identifier entity, cfFedId, which allows for a unary-type linkage to any system-external resource (identifier).
For controlled vocabularies and in particular towards re-use and integration of existing vocabularies, the re-use of identifiers is highly recommended to ensure a meaningful integration or application of the terms. CERIF is known for its support of multiple vocabularies and the Semantic Layer to maintain them.
The CERIF task group recommends re-using existing vocabularies (e.g. CASRAI) and has started to define its own vocabularies where no re-usable vocabulary was available. For the re-use of defined vocabularies, CERIF recommends the assignment of UUIDs. That is, each defined CERIF vocabulary term comes with a UUID (9).
“Anyone can create a UUID and use it to identify something with reasonable confidence that the same identifier will never be unintentionally created by anyone to identify something else. [...] A UUID is a 16-byte (128-bit) number. In its canonical form, a UUID is represented by 32 hexadecimal digits, displayed in five groups separated by hyphens, for a total of 36 characters in the form 8-4-4-4-12, e.g.: 550e8400-e29b-41d4-a716-446655440000” (Wikipedia).
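The figures in the quotation can be checked with Python's standard uuid module. This is only a sketch: the helper is_canonical_uuid is a hypothetical name, and CERIF itself does not prescribe any particular tool for generating or checking UUIDs.

```python
import uuid

EXAMPLE = "550e8400-e29b-41d4-a716-446655440000"


def is_canonical_uuid(text):
    """True if text is a UUID written in the hyphenated 8-4-4-4-12 form."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # uuid.UUID also accepts other spellings (e.g. without hyphens),
    # so compare against the canonical string form.
    return str(parsed) == text.lower()


groups = EXAMPLE.split("-")
group_lengths = [len(g) for g in groups]        # lengths of the five groups
hex_digits = sum(group_lengths)                 # hexadecimal digits without hyphens
byte_length = len(uuid.UUID(EXAMPLE).bytes)     # size of the underlying number in bytes

# A fresh random (version 4) identifier, e.g. for a newly defined vocabulary term.
new_class_id = str(uuid.uuid4())

print(group_lengths, hex_digits, len(EXAMPLE), byte_length)
print(is_canonical_uuid(new_class_id))
```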
<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd"
       release="1.5" date="2013-01-07" sourceDatabase="CERIF1.5RMAS_Semantics.xls">
  <cfClassScheme>
    <cfClassSchemeId>b4de9a8f-3a4d-4233-9a9f-3b624e4ad74f</cfClassSchemeId>
    <cfName cfLangCode="en" cfTrans="o">Person Event Involvements</cfName>
    <cfDescr cfLangCode="en" cfTrans="o">This scheme contains CERIF vocabulary terms applicable in the cfPers_Event link entity, being thus a vocabulary of the engagement that a person has in organising, delivering, participating or reporting an event.</cfDescr>
    <cfClass>
      <!-- usage with CERIF: cfPers_Event; cfProj_Pers -->
      <!-- class schemes: Person Event Involvements; Person Project Engagements -->
      <cfClassId>ddc3dd10-1cfd-11e1-8bc2-0800200c9a66</cfClassId>
      <cfTerm cfLangCode="en" cfTrans="o">Participant</cfTerm>
      <cfTermSrc cfLangCode="en" cfTrans="o">CIA project</cfTermSrc>
      <cfDef cfLangCode="en" cfTrans="o">A participant is someone who takes part in an activity</cfDef>
      <cfDefSrc cfLangCode="en" cfTrans="o">http://wordnetweb.princeton.edu/perl/webwn?s=participant</cfDefSrc>
    </cfClass>
    <cfClass>
      <!-- usage with CERIF: cfPers_Event -->
      <!-- class schemes: Person Event Involvements -->
      <cfClassId>2b3ba8f1-5620-42c9-8549-7d34ed37f968</cfClassId>
      <cfTerm cfLangCode="en" cfTrans="o">Interviewee</cfTerm>
      <cfTermSrc cfLangCode="en" cfTrans="o">CIA project</cfTermSrc>
      <cfDef cfLangCode="en" cfTrans="o">Person that is interviewed at an event.</cfDef>
      <cfDefSrc cfLangCode="en" cfTrans="o">RMAS project</cfDefSrc>
    </cfClass>
    <cfClass>
      <!-- usage with CERIF: cfPers_Event -->
      <!-- class schemes: Person Event Involvements -->
      <cfClassId>d1ee35f1-c4c6-4651-a760-06a3828a61c1</cfClassId>
      <cfTerm cfLangCode="en" cfTrans="o">Speaker</cfTerm>
      <cfTermSrc cfLangCode="en" cfTrans="o">CIA project</cfTermSrc>
      <cfDef cfLangCode="en" cfTrans="o">Person that speaks at an event.</cfDef>
      <cfDefSrc cfLangCode="en" cfTrans="o">RMAS project</cfDefSrc>
    </cfClass>
  </cfClassScheme>
</CERIF>
The CERIF vocabulary is published on the euroCRIS website as an Excel file, as an SQL insert script and as a CERIF XML file. The above XML extract should make clearer how terms are defined via cfClassId UUIDs and aligned with a scheme via cfClassSchemeId UUIDs.
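As a sketch of how such a vocabulary file could be consumed, the snippet below builds a lookup from the (cfClassId, cfClassSchemeId) compound to the term text, using a shortened copy of the extract above. The function name load_terms and the dictionary layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"c": "urn:xmlns:org:eurocris:cerif-1.5-1"}

# Shortened copy of the vocabulary extract above (two of the three terms).
VOCAB = b"""<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" release="1.5" date="2013-01-07" sourceDatabase="sketch">
  <cfClassScheme>
    <cfClassSchemeId>b4de9a8f-3a4d-4233-9a9f-3b624e4ad74f</cfClassSchemeId>
    <cfName cfLangCode="en" cfTrans="o">Person Event Involvements</cfName>
    <cfClass>
      <cfClassId>ddc3dd10-1cfd-11e1-8bc2-0800200c9a66</cfClassId>
      <cfTerm cfLangCode="en" cfTrans="o">Participant</cfTerm>
    </cfClass>
    <cfClass>
      <cfClassId>d1ee35f1-c4c6-4651-a760-06a3828a61c1</cfClassId>
      <cfTerm cfLangCode="en" cfTrans="o">Speaker</cfTerm>
    </cfClass>
  </cfClassScheme>
</CERIF>
"""


def load_terms(xml_bytes, lang="en"):
    """Map (cfClassId, cfClassSchemeId) to the term text in the given language."""
    root = ET.fromstring(xml_bytes)
    terms = {}
    for scheme in root.findall("c:cfClassScheme", NS):
        scheme_id = scheme.findtext("c:cfClassSchemeId", namespaces=NS)
        for cls in scheme.findall("c:cfClass", NS):
            class_id = cls.findtext("c:cfClassId", namespaces=NS)
            for term in cls.findall("c:cfTerm", NS):
                # Attributes such as cfLangCode carry no namespace prefix.
                if term.get("cfLangCode") == lang:
                    terms[(class_id, scheme_id)] = term.text
    return terms


terms = load_terms(VOCAB)
print(terms)
```

With such a lookup, the cfClassId and cfClassSchemeId found in any link entity can be translated back into a readable label such as "Speaker".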
1. CERIF 1.5 Release at euroCRIS.
2. Nice drawing and CERIF XML example by Jason Marshall as to how vocabularies work within CERIF (another view).
3. The Ortelius Thesaurus was recommended with old CERIF releases and dates back to 1991. As far as is known, it was strongly biased towards Teaching rather than Research. It has not been officially maintained since then.
4. Public HTML version of the CERIF 1.5 model for navigation at the euroCRIS website.
5. CERIF XML 1.5 Schema for CERIF XML file validation.
6. Valid CERIF XML examples published at the euroCRIS website.
7. Business Session Report LOD TG in Madrid (November 2013).
8. Streamlining the CERIF XML Data Exchange Format towards 2.0.
9. Entities and Identities in Research Information Systems.
10. Towards a Sharable Research Vocabulary (SRV) – A Model-Driven Approach.