DDI Alliance, 2015-04-07
This document provides an overview of what the work products of the DDI Alliance are, what purpose they serve, and how they are maintained by the Alliance. It also addresses the coming developments in the Alliance’s work, notably the model-driven DDI-Lifecycle, currently under development.
The overarching aim of the DDI Alliance is to “create an international standard for describing data from the social, behavioral, and economic sciences”. Since its inception in 1995, the development of the DDI Alliance work products has moved in step with the evolving needs of the user communities seeking to document the data and the processes around its collection.
The work products have also evolved alongside developments in technology and in the computing infrastructure that supports them. XML1, in which the current work products are expressed, has itself changed and developed. The standards to which the work products are aligned, such as Dublin Core2, ISO/IEC 111793, and most recently GSIM4, have also evolved, and the technologies available to implement the work products have matured.
Such broad aims necessitate a degree of complexity, but there is an underlying logic to what has been produced and to how the products relate. This document clarifies the purpose of each of the work products and explains why there is sometimes overlap, but rarely duplication.
This overview will help users to determine which of the work products to implement, and to know what to expect in light of future developments.
The document is aimed at the wide range of audiences that use, or envisage using, the work products of the DDI Alliance. These include data archives, national statistical agencies, individual researchers and research networks, data managers, and intergovernmental organizations.
II. Current DDI Alliance Work Products

There are several DDI Alliance work products, each of which exists for a different purpose, and each is discussed separately below. It should be noted that, as a standards organization, the DDI Alliance does not require users to implement a particular work product or a particular version of one. It offers each work product for an intended purpose, in a fashion that is useful to some members of the DDI community of users. These work products are simply tools to be used: there is no inherent value in upgrading to a newer version of any given DDI work product or specification unless doing so benefits the implementer (unlike software, where upgrading to the latest version is often recommended).
Because of this, all existing DDI work products will be maintained over time, to support the users who have implemented that work product. Current work products include:
- DDI-Codebook – an XML structure for describing codebooks (or data dictionaries) for a single study.
- DDI-Lifecycle – also expressed in XML; expands coverage beyond a single study along the data lifecycle, including multiple waves of data collection.
- RDF Vocabularies – three vocabularies for the Semantic Web: one for describing statistical classifications, one for describing datasets for the purposes of discovery on the Web, and one for describing the structures of individual datasets.
- Controlled Vocabularies – recommended sets of terms and definitions, used to describe various types of activities and artefacts within the DDI-Codebook, DDI-Lifecycle, and DDI-RDF Discovery structures.
A. DDI-Codebook

The DDI-Codebook specification is the original work product of the Alliance, and has gone through several versions; the latest is DDI-Codebook version 2.5. DDI-Codebook was not always referred to as such – the “Codebook” part of the name was added when DDI-Lifecycle was developed, to distinguish between the two. DDI-Codebook is an XML structure for describing codebooks (or data dictionaries) for a single study (a study, in DDI parlance, is a single wave of data collection). DDI-Codebook is designed to be used after the fact: it assumes that a data file exists as the result of a collection, which is then described using the XML structure provided by the specification.
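To make the shape of a DDI-Codebook instance concrete, the following is a minimal, hand-written sketch of a variable description. The element names (codeBook, stdyDscr, dataDscr, var, catgry) follow the DDI-Codebook 2.5 schema, but the study title and variable content are invented for illustration, and the fragment has not been schema-validated.

```xml
<!-- Minimal illustrative DDI-Codebook fragment: a study title plus
     one categorical variable with two labelled codes. -->
<codeBook xmlns="ddi:codebook:2_5" version="2.5">
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Example Social Survey, Wave 1</titl>
      </titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="SEX">
      <labl>Sex of respondent</labl>
      <catgry>
        <catValu>1</catValu>
        <labl>Male</labl>
      </catgry>
      <catgry>
        <catValu>2</catValu>
        <labl>Female</labl>
      </catgry>
    </var>
  </dataDscr>
</codeBook>
```

The after-the-fact character of DDI-Codebook is visible here: the markup describes an already-existing data file, variable by variable.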
DDI-Codebook is heavily used, being the structure behind such tools as Nesstar5, which is used by data archives within CESSDA6 and among Canadian academic institutions. The largest single user-base is probably the International Household Survey Network (IHSN7), which provides tools for documenting studies conducted by statistical agencies in the developing world, some of which are based on the Nesstar tools.
DDI-Codebook is – and will be – maintained within the limited scope of its specified coverage. Any bugs will be corrected in future releases, and minor changes may be made to allow DDI-Codebook to align with the other work products. For example, when DDI-Codebook 2.5 was released, the identification system was updated to allow DDI-Lifecycle identifiers to be included, so that metadata overlapping between these two work products could be expressed using either the Codebook or the Lifecycle specification. In addition, content was added to extend coverage of GSIM descriptive objects, which were also being added to DDI-Lifecycle.
B. DDI-Lifecycle

DDI-Lifecycle is the result of a more demanding set of requirements emerging from the use of DDI-Codebook. It is likewise a complex metadata structure expressed in XML, but it is designed for different purposes. DDI-Lifecycle can describe not only the results of data collection but also the metadata produced throughout the data collection process, from initial conceptualization through to the archiving of the resulting data. It can describe several versions of the data and metadata as they change across the data lifecycle – hence the name “DDI-Lifecycle”.
DDI-Lifecycle can describe several waves of data collection, and even ad hoc collections of datasets grouped for the purposes of comparison. It is very useful when dealing with serial data collection as is often seen in data production within statistical offices and long-standing research projects.
DDI-Lifecycle was first released as DDI version 3.0, and is now in version 3.2. There will be further releases of DDI-Lifecycle, for the purposes of maintenance (bug fixes and alignment), and there will be a one-time functionality expansion to include work which pre-dated the forthcoming DDI 4, covering survey methodology. This will be included in the DDI 3.3 beta release (currently under development).
It should be noted that, because of the purpose of DDI-Lifecycle (which includes after-the-fact data description for single studies, as a necessary part of its broader scope), any metadata which can be expressed using DDI-Codebook can also be described using DDI-Lifecycle. DDI-Lifecycle is – again, because of the purpose it was designed to serve – a more complex structure than DDI-Codebook, and requires a more complex computing infrastructure.
DDI-Lifecycle can also be used to document specific sets of metadata outside of the description of a single study or set of studies. For example, areas of commonly shared metadata such as concepts, statistical classification, or geographic structures can be described and referenced by any number of studies. Processing activities can be described and used to support a metadata-driven approach to the collection, processing and publication of data.
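The reuse of shared metadata rests on DDI-Lifecycle’s reference mechanism, in which an object is identified by its maintenance agency, identifier, and version. The sketch below shows how a study might point at a concept maintained in a shared concept scheme; the agency and identifier values are invented for illustration, and the fragment has not been schema-validated against the DDI 3.2 schemas.

```xml
<!-- Illustrative DDI-Lifecycle 3.2 reference from one study's metadata
     to a concept maintained and versioned elsewhere.
     Agency, ID, and Version values here are hypothetical. -->
<r:ConceptReference xmlns:r="ddi:reusable:3_2">
  <r:Agency>int.example</r:Agency>
  <r:ID>concept-unemployment</r:ID>
  <r:Version>2</r:Version>
  <r:TypeOfObject>Concept</r:TypeOfObject>
</r:ConceptReference>
```

Because the reference carries a version, any number of studies can cite the same concept while the concept scheme itself continues to evolve.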
C. RDF Vocabularies
RDF is the Resource Description Framework, a set of Web-based technologies for describing many kinds of information in a machine-actionable way. As such, it provides an alternative (for the Semantic Web and Linked Data) to the traditional XML structures specified by the DDI Alliance, DDI-Lifecycle and DDI-Codebook. To use RDF, a “vocabulary” (or ontology) is developed for a specific domain, defining the different types of information objects together with their relationships and properties. The DDI Alliance has published three such vocabularies: one for describing statistical classifications (XKOS8), one for describing datasets for the purposes of discovery on the Web (DDI-RDF Discovery9), and one for describing the structures of individual datasets (PHDD10).
These vocabularies are based on the DDI-Codebook and DDI-Lifecycle XML structures, but are identical to neither. They are designed to be used in combination, and as such are non-duplicative. They also use other RDF vocabularies developed elsewhere where these exist and are appropriate – this is considered a best practice in the RDF community, to promote interoperability.
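Reuse of existing vocabularies can be illustrated with XKOS, which extends the W3C’s SKOS vocabulary rather than redefining its terms. The RDF/XML sketch below shows the SKOS core on which a statistical classification would be built; the URIs, labels, and classification content are invented, and XKOS itself adds classification-specific features (such as levels) on top of this pattern.

```xml
<!-- Illustrative SKOS core of a statistical classification.
     URIs and labels are hypothetical; XKOS layers classification-specific
     properties over structures like this one. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:ConceptScheme rdf:about="http://example.org/classification/industry">
    <skos:prefLabel xml:lang="en">Example Industry Classification</skos:prefLabel>
    <skos:hasTopConcept rdf:resource="http://example.org/classification/industry/A"/>
  </skos:ConceptScheme>
  <skos:Concept rdf:about="http://example.org/classification/industry/A">
    <skos:notation>A</skos:notation>
    <skos:prefLabel xml:lang="en">Agriculture, forestry and fishing</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/classification/industry"/>
  </skos:Concept>
</rdf:RDF>
```

Building on SKOS in this way means that generic SKOS tooling can consume an XKOS classification without modification, which is precisely the interoperability benefit the best practice aims for.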
RDF is intended as an outward-facing set of standards for publishing information on the Web, and its adoption extends beyond the community that has used the DDI XML specifications. Traditionally, DDI XML has been used behind closed doors to support preservation, exchange, and data management (although dissemination is certainly one aspect of data management). Thus, although systems based on DDI XML today could be partly implemented using the RDF vocabularies and technologies, the Alliance’s RDF specifications are intended to provide options at the level of the technology platform, not to duplicate the XML specifications.
In the future, the model-driven DDI-Lifecycle will provide equivalent RDF and XML representations for all DDI metadata content. Even so, the RDF Vocabulary specifications will be used as the basis for some parts of DDI-Lifecycle (MD), and will be maintained over time as needed to address bug fixes.
D. Controlled Vocabularies
The Controlled Vocabularies produced by the DDI Alliance are sets of terms and definitions used to describe various types of activities or artefacts which exist within the DDI-Codebook, DDI-Lifecycle, and DDI-RDF Discovery structures. They can be used with either XML specification (or indeed with any system that needs such terminologies). They are developed and maintained independently of the other DDI work products and are more a supporting work product than one intended for use in its own right. The Controlled Vocabularies are expressed in Genericode (XML) and in SKOS (under development).
The DDI Controlled Vocabularies are recommended sets of terms and definitions – organizations are free to use modified or entirely different vocabularies in their place if that better meets their needs.
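The Genericode expression of a controlled vocabulary is essentially a code list: an identified list of rows, each carrying a term. The following is a minimal sketch of that shape using the OASIS Genericode structure; the vocabulary name and term values approximate the DDI mode-of-collection vocabulary but are illustrative rather than a verbatim copy of a published list, and the fragment has not been schema-validated.

```xml
<!-- Illustrative Genericode code list for a DDI-style controlled
     vocabulary. Name and term values are approximations. -->
<gc:CodeList xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <Identification>
    <ShortName>ModeOfCollection</ShortName>
    <Version>1.0</Version>
  </Identification>
  <ColumnSet>
    <Column Id="code" Use="required">
      <ShortName>Code</ShortName>
      <Data Type="string"/>
    </Column>
  </ColumnSet>
  <SimpleCodeList>
    <Row>
      <Value ColumnRef="code"><SimpleValue>Interview.FaceToFace</SimpleValue></Value>
    </Row>
    <Row>
      <Value ColumnRef="code"><SimpleValue>SelfAdministered.WebBased</SimpleValue></Value>
    </Row>
  </SimpleCodeList>
</gc:CodeList>
```

Because the format is a generic code list rather than a DDI-specific structure, an organization substituting its own terms only needs to replace the rows, not the machinery around them.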
III. The Future: DDI-Lifecycle MD (Model Driven)

In the same way that experience with DDI-Codebook led to a broader set of requirements fulfilled by DDI-Lifecycle, implementations of DDI-Lifecycle and the DDI RDF vocabularies have produced new requirements, which will be met by DDI-Lifecycle MD, now under development. DDI-Lifecycle MD will be very different from other DDI work products, because it will be based on an information model of the metadata content. This information model will be available for implementation in standard RDF vocabularies and standard XML structures, which will be equivalent. This form of model-based standard is a best practice in the world of standardization, and standards in several other domains have used this approach for many years.
Because the DDI Alliance anticipates that there may be many different implementation technologies using DDI as a basis, having an explicit information model, expressed using the Unified Modeling Language (UML)11, will be of benefit. To provide an example, the Statistical Data and Metadata Exchange (SDMX)12 standard is model-based, and has standard implementations in XML, JSON, and EDIFACT (a flat-file syntax). Additionally, the model was used as the basis for an RDF vocabulary published by the W3C (the home of the base RDF standards themselves).
Thus, moving DDI to a model-based orientation is a way of protecting the standard against technological change, and a way of guaranteeing alignment across different technology implementations. The development and management of the model itself has been done in a way that enables the programmatic generation of derived formats such as the specification representations in XML Schema and OWL/RDF, the documentation, and program libraries. This represents the model-driven DDI-Lifecycle.
As a major shift in the way a DDI product is developed, the DDI-Lifecycle MD effort has unfolded over several years in a series of sprints and online meetings. As noted, the effort involves developing not only the information model, but also the infrastructure for building that model, the transformation into a set of representations (initially XML Schema and RDF/OWL), and the associated documentation. DDI-Lifecycle MD will be rolled out in a series of “releases”. The initial releases will serve both as tests of the production infrastructure and as an opportunity for the DDI community to comment on the direction of the information model. In addition to revisions driven by community feedback, each of the initial releases will also add new content.
DDI-Lifecycle MD will be another type of DDI work product: like DDI-Codebook and DDI-Lifecycle, it will be a standard structure for metadata related to the lifecycle of data in all stages, and thus will encompass much of what is available in those XML structures today. It will also include the functionality of the DDI RDF vocabularies, since these are based on the structures of DDI-Codebook and DDI-Lifecycle. Like DDI-Codebook and DDI-Lifecycle, it will be compatible with the DDI Controlled Vocabulary work products.
However, there will be an expansion of coverage in terms of the data lifecycle (as compared to DDI-Lifecycle 3.*). For example, DDI-Lifecycle MD supports types of data collection other than surveys, such as the collection of health information (including Electronic Health Record data), machine data (logs), and other forms of event and administrative data. Support for these data sources has in turn necessitated a review of the DDI data processing objects and the development of a free-standing process model that is not embedded in survey data collection. This model makes it possible to represent new types of data lifecycles, including those that facilitate the building and maintenance of registries. Registries do not capture data in waves and rounds; instead, they lend themselves to new logical and physical products, often supported by NoSQL and other types of big data stores that can disaggregate and aggregate data in a wide range of formats as registries evolve over time.
The expansion of coverage does not mean a proliferation of new study objects. Instead, the model-based approach facilitates the reuse and specialization of study objects, and a more systematic approach to their development and organization.
Likewise, while this may sound like a proliferation of different DDI work products, in fact it is not: once the information model is developed, the XML and RDF specifications will be automatically generated from the model. Thus, although there will be an increased number of work products offered by the Alliance to the user community, these will actually require less effort to develop, and will have a guaranteed alignment in terms of their metadata structure.
This supports the use of subsets of the whole model for a variety of purposes. These subsets are called functional views, and they are realized as related packages in the model and in the XML and RDF specifications. This approach enables the use of DDI by different audiences, hides the complexity of the whole model, and supports interoperability.
These functional views are aligned with specific use cases. The goal is that the model-based DDI-Lifecycle can present DDI to users in manageable chunks, so that users can learn and adopt just the parts of DDI that they need. Some users, because of their use cases, will necessarily become experts; the hope, however, is that most users will only need to learn specific views.
5 System for the dissemination of statistical information and related documentation, http://www.nesstar.com/
6 Consortium of European Social Science Data Archives, http://www.cessda.net/