About the Specification

Purpose and Goals

The Data Documentation Initiative (DDI) is an effort to establish an international XML-based standard for the content, presentation, transport, and preservation of documentation for datasets in the social and behavioral sciences. Documentation, sometimes called metadata (data about data), constitutes the information that enables the effective, efficient, and accurate use of those datasets.

Centrality of Documentation to Science

The ability of scientists to exchange data is widely recognized as key to scientific progress. That ability depends critically on the existence of documentation that enables a new analyst to gain a full understanding of the data without any consultation between the data collector and the analyst. There is widespread agreement among data archivists in the social sciences that extant forms of documentation are often inadequate for this purpose. Indeed, data archives have historically faced larger problems with the documentation of their data than with the data themselves.

Moreover, extant forms of documentation are technically obsolete, relying upon unstructured and incompatible text formats that convey information to the human mind but not to computer software. If the documentation of the data of the social and behavioral sciences is to become part of the Semantic Web -- which will enable computers and people to work "in cooperation" -- the documentation must use one of the preferred languages of the Semantic Web, and the information it contains must be given well-defined and structured meaning in that language (Berners-Lee et al., 2001).

Improving Documentation

In response to the need for better documentation, the Data Documentation Initiative (DDI) is an endeavor to provide a straightforward means for social and behavioral scientists to record clearly, and then to communicate to others, all the salient characteristics of the empirical data for which they are responsible. The DDI metadata specification originated in the Inter-university Consortium for Political and Social Research and is now the project of an Alliance of about 25 institutions in North America and Europe. Together, the member institutions include many of the largest data producers and data archives in the world. Virtually every kind of data is found in one or more of these archives.

The DDI specification is a major transformation of the once-familiar electronic "codebook," retaining all of the capabilities of that kind of document but greatly increasing the scope and rigor of the information contained in it. Indeed, the DDI metadata can be displayed as conventional paper or screen codebook-like documents, but unlike the old codebooks, the information displayed can be fully understood by computer software as well as by humans. The DDI transforms the concept of codebooks by encoding codebook information into databases that share a known structure and a specification language across many bodies of data.

Larger Goals

The DDI aims to be the foundation for collection, distribution, use, and archiving of many future data collection projects in the social and behavioral sciences, across institutions, countries, and disciplines. It also aims to be the basis for retrofitting documentation of older studies for improved ease of use and a stronger guarantee of archival preservation. The latter aim requires that this new specification be independent of any particular software or computing platform; this has been achieved by conceiving the specification as a generalized data model rather than as computer code. The data model is extensible and modular, supporting the specification of even the most complex data systems in a way that is simultaneously flexible and rigorous.

In further pursuit of the goal of wide adoption, the project is seeking cooperation from both data producers and statistical software manufacturers through their implementation of this new specification. The project's leaders hope that DDI metadata can soon be produced by standard computer-assisted interviewing software and be accessed directly by many statistical software packages for purposes of data definition. When these two aims are achieved, the DDI specification can readily become the basis for the entire research process, from generation of a data collection instrument to production of research articles.

Structure of the DDI Specification

The DDI is currently expressed as an XML Document Type Definition, or DTD. The DTD defines all of the elements and attributes of social science technical documentation and the relationships among the elements and attributes. The DTD has also been converted to an XML Schema.
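
As a rough illustration of what "elements and attributes" means in practice, the sketch below reads a tiny codebook-like XML fragment using Python's standard library. The fragment is deliberately simplified: its element names (codeBook, dataDscr, var, labl, catgry, catValu) are modeled loosely on DDI codebook markup and should be treated as assumptions for illustration, not as a valid instance of the DTD. The function name read_variables is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment; NOT validated against the DDI DTD.
SAMPLE = """
<codeBook>
  <dataDscr>
    <var name="SEX">
      <labl>Sex of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
    <var name="AGE">
      <labl>Age in years</labl>
    </var>
  </dataDscr>
</codeBook>
"""


def read_variables(xml_text):
    """Return one dict per <var> element: name, label, and category codes."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("var"):
        # Map each category value to its label, e.g. {"1": "Male"}.
        categories = {
            cat.findtext("catValu"): cat.findtext("labl")
            for cat in var.findall("catgry")
        }
        variables.append({
            "name": var.get("name"),         # attribute of <var>
            "label": var.findtext("labl"),   # direct child element of <var>
            "categories": categories,
        })
    return variables


variables = read_variables(SAMPLE)
for v in variables:
    print(v["name"], "-", v["label"], v["categories"])
```

The variable name is carried as an attribute while labels and categories are child elements, so software can recover the structure without parsing free text.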

Version 1.0 of the DTD was published March 24, 2000. Since that time, several enhancements have been made; the most recent stable version of the specification is Version 3.0.

Benefits of the DDI Approach

To summarize, the DDI facilitates:

  • Interoperability. Codebooks marked up using the DDI specification can be exchanged and transported seamlessly, and applications can be written to work with these homogeneous documents.

  • Richer content. The DDI was designed to encourage the use of a comprehensive set of elements to describe social science datasets as completely and as thoroughly as possible, thereby providing the potential data analyst with broader knowledge about a given collection.

  • Single document - multiple purposes. A DDI codebook contains all of the information necessary to produce several different types of output, including, for example, a traditional social science codebook, a bibliographic record, or SAS/SPSS/Stata data definition statements. Thus, the document may be repurposed for different needs and applications. Changes made to the core document will be passed along to any output generated.

  • On-line subsetting and analysis. Because the DDI markup extends down to the variable level and provides a standard uniform structure and content for variables, DDI documents are easily imported into on-line analysis systems, rendering datasets more readily usable for a wider audience.

  • Precision in searching. Since each of the elements in a DDI-compliant codebook is tagged in a specific way, field-specific searches across documents and studies are enabled. For example, a library of DDI codebooks could be searched to identify datasets covering protest demonstrations during the 1960s in specific states or countries.
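
Two of the benefits above, repurposing a single document and field-specific searching, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the two in-memory codebooks, their file names, and the helper names search_field and spss_variable_labels are hypothetical, and the element names (stdyDscr, keyword, nation, var, labl) are simplified stand-ins for DDI markup rather than a complete or validated document. The generated statements are SPSS-style only; a real exporter would need to handle far more detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-library of codebooks, keyed by an invented file name.
CODEBOOKS = {
    "study_a.xml": """<codeBook>
  <stdyDscr>
    <keyword>protest demonstrations</keyword>
    <nation>France</nation>
  </stdyDscr>
  <dataDscr>
    <var name="NPROT"><labl>Number of protests</labl></var>
  </dataDscr>
</codeBook>""",
    "study_b.xml": """<codeBook>
  <stdyDscr>
    <keyword>voting behavior</keyword>
    <nation>Canada</nation>
  </stdyDscr>
  <dataDscr>
    <var name="ATTPROT"><labl>Attitude toward protest</labl></var>
  </dataDscr>
</codeBook>""",
}


def search_field(codebooks, field, term):
    """Return names of codebooks where an element named `field` contains `term`."""
    term = term.lower()
    hits = []
    for filename, xml_text in codebooks.items():
        root = ET.fromstring(xml_text)
        if any(term in (el.text or "").lower() for el in root.iter(field)):
            hits.append(filename)
    return hits


def spss_variable_labels(xml_text):
    """Derive SPSS-style VARIABLE LABELS statements from the <var> elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for var in root.iter("var"):
        # Double any apostrophes so the label stays inside its quotes.
        label = (var.findtext("labl") or "").replace("'", "''")
        lines.append(f"VARIABLE LABELS {var.get('name')} '{label}'.")
    return lines


# Field-specific search: only study_a is *about* protest.
print(search_field(CODEBOOKS, "keyword", "protest"))
# Searching variable labels instead matches both studies.
print(search_field(CODEBOOKS, "labl", "protest"))
# The same document also yields data definition statements.
print(spss_variable_labels(CODEBOOKS["study_a.xml"]))
```

Because each piece of information is tagged, the keyword search can distinguish a study whose subject is protest from one that merely mentions the word in a variable label, and the same source document feeds both the search and the generated statements.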