Data Management and Sharing

Metadata: The Story of Your Dataset

Documenting your dataset through metadata ensures your research is usable and understandable weeks, months, and years in the future.

This matters not only when you revisit your own older data, but also when others discover and reuse a dataset that you have made available online.

MIT lists important things to document about your data, some of which are:

  • Title- name of the dataset and the project associated with it
  • Creator- researcher, PI, collaborators, addresses, etc.
  • Dates- key dates associated with the data, including project start and end dates, data modification dates, data release date, and the time period covered by the data
  • Data description- including key words important to the dataset
  • Methodology- how the experiments were run, instrument information and settings, software used, reagents, and any other relevant information such as what might be captured in a lab notebook or a methods section of a paper
  • Rights- any information on intellectual property rights with respect to the dataset
  • Identifier- a number or code that uniquely identifies a dataset (see below for more information)

Standardizing Metadata

Metadata standards specify which pieces of information to include in the metadata.

  • Dublin Core- a basic, widely used metadata standard that can be adapted to a variety of disciplines
  • There are many more metadata standards that are discipline-specific. Some examples are:
    • DDI (Data Documentation Initiative) for the social sciences
    • ABCD (Access to Biological Collection Data) for biological specimen data
    • SDAC (Standard for Documentation of Astronomical Catalogues) for astronomical data

Visit the Metadata Standards Catalog to search for a metadata standard by discipline or by scheme name.

Additionally, metadata must be encoded, or formatted, in a way that makes it machine readable and searchable. A robust repository will format your metadata for you. Some common formats are:

  • XML- eXtensible Markup Language
  • JSON- JavaScript Object Notation
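
As a rough illustration of what encoding looks like, the sketch below writes one metadata record in both formats using Python's standard library. The dataset details are invented placeholders, and the field names are borrowed from Dublin Core elements; a real repository would follow its own schema and would normally do this step for you.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record; every value here is a placeholder.
record = {
    "title": "Example Stream Temperature Dataset",
    "creator": "Doe, Jane",
    "date": "2023-06-01",
    "description": "Hourly water temperature readings; keywords: stream, temperature",
    "rights": "CC BY 4.0",
    "identifier": "doi:10.1234/example",
}


def to_json(rec):
    """Encode the record as a JSON string."""
    return json.dumps(rec, indent=2)


def to_xml(rec):
    """Encode the record as XML, one Dublin Core element per field."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in rec.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_xml(record))
```

Both outputs carry the same information; the choice between them usually depends on what the repository or catalog expects.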

One of the most important metadata elements for a dataset is a globally unique, persistent identifier (PID), which allows research datasets to be discovered and cited directly. DataCite is a global not-for-profit membership organization which ensures "that research outputs and resources are openly available and connected", specifically by assigning digital object identifiers (DOIs) to research datasets. DOIs are assigned through DataCite or through member institutions with repositories. For example, Dryad is a member institution and can register DOIs for your datasets.
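
To show how a DOI makes a dataset directly citable, here is a minimal sketch that turns a DOI into a link through the doi.org resolver. The helper name `doi_to_url` and the example DOI are hypothetical, and the pattern is only a loose shape check (a "10." prefix, a slash, a suffix), not full validation.

```python
import re

# Loose shape check: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d+(?:\.\d+)*/\S+$")


def doi_to_url(doi):
    """Turn a DOI, with or without a common prefix, into a resolver URL."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# Placeholder DOI, for illustration only.
print(doi_to_url("doi:10.1234/example"))
```

Storing the bare DOI and building the link when needed keeps citations stable even if the dataset's landing page moves.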

README Files, Codebooks, Data Dictionaries, and more

README files are documentation files that describe a file, folder, or dataset so that others can understand and interpret what it contains.  

  • Use README files to document your file naming system and file organization for a project.
  • Data repositories may require a README file to be submitted with a dataset.
    • README files typically contain clear information and guidance on:
      • attribution
      • permissions 
      • persistent identifiers (DOI, etc.)
      • data processing, collection and analysis, etc.
    • In the absence of a codebook or data dictionary (see below), README files can also document research methodology, define variables, provide context for files, etc.
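
One way to keep README files consistent across projects is to generate them from a small template. The sketch below is only an illustration: the section headings follow the list above, while the layout, the function name `build_readme`, and all project details are assumptions.

```python
# Section headings follow the list above; the layout itself is an assumption.
SECTIONS = [
    ("Attribution", "attribution"),
    ("Permissions", "permissions"),
    ("Persistent identifier", "identifier"),
    ("Data collection, processing, and analysis", "processing"),
    ("File naming and organization", "file_naming"),
]


def build_readme(title, details):
    """Return README text; sections with no details are flagged as TODO."""
    lines = [title, "=" * len(title), ""]
    for heading, key in SECTIONS:
        lines.append(heading)
        lines.append("-" * len(heading))
        lines.append(details.get(key, "TODO: not yet documented"))
        lines.append("")
    return "\n".join(lines)


# Hypothetical project details, for illustration only.
readme = build_readme(
    "Example Stream Temperature Dataset",
    {
        "attribution": "Cite as: Doe, J. (2023). Example Stream Temperature Dataset.",
        "permissions": "CC BY 4.0",
        "identifier": "doi:10.1234/example",
        "file_naming": "site_YYYYMMDD_sensor.csv, one folder per site",
    },
)
print(readme)
```

Flagging empty sections as TODO makes gaps visible before the dataset is submitted to a repository.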

Data dictionaries and codebooks define the elements of the dataset so that you and others can understand and use the dataset in the future. The two terms are mostly used interchangeably, though codebooks are more often associated with survey data.

Data dictionaries and codebooks include information such as:

  • the name of each variable
  • the meaning of each variable
  • the units of measurement of each variable
  • the allowed values for each variable
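
A data dictionary can also be kept in machine-readable form, which makes it possible to check data against it. The following is a minimal sketch under stated assumptions: the variables, units, and allowed values are invented, and `check_row` is a hypothetical helper rather than part of any standard tool.

```python
# Hypothetical data dictionary: one entry per variable (column).
DATA_DICTIONARY = {
    "site_id": {
        "meaning": "Code for the sampling site",
        "units": None,
        "allowed": {"A", "B", "C"},
    },
    "temp_c": {
        "meaning": "Water temperature",
        "units": "degrees Celsius",
        "allowed": (-5.0, 40.0),  # inclusive numeric range
    },
}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one row (a dict of values)."""
    problems = []
    for name, value in row.items():
        entry = dictionary.get(name)
        if entry is None:
            problems.append(f"{name}: not defined in the data dictionary")
            continue
        allowed = entry["allowed"]
        if isinstance(allowed, tuple):
            # A tuple is treated as an inclusive (low, high) range.
            low, high = allowed
            if not (low <= value <= high):
                problems.append(f"{name}: {value} outside {low} to {high}")
        elif value not in allowed:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems


print(check_row({"site_id": "A", "temp_c": 12.5}))
print(check_row({"site_id": "Z", "temp_c": 99.0}))
```

Each entry records the four items listed above (name, meaning, units, allowed values), so the same structure can serve as both documentation and a simple quality check.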

Resources