Library:How to Cite Data

From UBC Wiki

Introduction to Data

Data or Statistics?

Before plunging in, make sure you know whether you are referencing data or statistics. The terms are sometimes used as though they were interchangeable, but they are not: "statistics are the interpretation and summary of data" while "data are raw information from which statistics are derived" ("Data or Statistics?" Finding and Using Data for your Research).

Basics

The most important thing to remember is that you want your citation to include enough information so that a reader could find the same dataset again in the future, even if the link you provide no longer works. It's necessary to include a mixture of general and specific information to help them be certain that they've found the same dataset that you were referring to.

Citing data has not always been standard practice, especially for data you have collected yourself, but as data is shared more and more widely, proper attribution is increasingly important. Citing datasets helps them become part of the scholarly record and gives proper credit to the creator of the dataset.

Elements

Most of the time, the elements you need for a citation won't be included in the dataset itself, but will be located in the item record of the data repository.

Many data repositories provide information about how to cite their products - look closely to see if you can find anything. This is your best bet for relevant information, as the structure of repositories and how they display different elements varies widely.

The most important elements to include are:

Author/Creator - This could either be the personal name of the researcher, or the institution that collected the data.

Title - Include the full title as it appears in the record for the dataset, including table or catalogue numbers if they are provided. If there is more than one title and you want to cite a part within a whole (such as a series within a table, etc), you can include both titles in the same way that you would include other parts within a whole, such as an article within a journal, or a chapter within a book.

Publication date - Most datasets should include some kind of publication date, even if it is hard to find.

Identifier and/or Link - Most published datasets should have some sort of a unique identifier, most commonly a DOI or a URI. This is the most reliable way to identify a particular resource. Many dataset providers will include a permanent URL in addition to or instead of a unique identifier. Link the DOI to the data source if you are working digitally, or include the URL in print.

Other elements that may be good to include:

Edition or Version - This may help to identify your dataset if it is one that undergoes continuous changes.

Resource Type - Include this if the style you are using normally includes a resource type.

Publisher - This could be the repository where it's located, or whoever has verified the data.

Statistics Canada has a reference building tool that can help you identify which elements to include for a wide range of data and statistics products.
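
To see how the elements above fit together, here is a minimal Python sketch that arranges them in an APA-like order (who, when, what, where). The DatasetRecord class, the format_apa_like function and all of the sample values, including the DOI, are hypothetical and made up for illustration. Always check the result against the style guide you are actually using.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Citation elements copied from a repository's item record (hypothetical structure)."""
    creator: str                                # person or institution
    year: str                                   # publication date
    title: str                                  # full title as shown in the record
    publisher: Optional[str] = None             # e.g. the repository
    version: Optional[str] = None               # edition or version, if any
    resource_type: Optional[str] = "Data set"   # only if your style uses one
    identifier: Optional[str] = None            # DOI link or permanent URL


def format_apa_like(record: DatasetRecord) -> str:
    """Arrange the elements in an APA-like order; optional elements are skipped if absent."""
    title = record.title
    if record.version:
        title += f" ({record.version})"
    if record.resource_type:
        title += f" [{record.resource_type}]"
    # Avoid a doubled period when the creator already ends with one (e.g. initials).
    parts = [record.creator.rstrip(".") + ".", f"({record.year}).", f"{title}."]
    if record.publisher:
        parts.append(f"{record.publisher}.")
    if record.identifier:
        parts.append(record.identifier)
    return " ".join(parts)


# Placeholder values only; this is not a real dataset.
example = DatasetRecord(
    creator="Example Research Group",
    year="2020",
    title="Sample household survey",
    publisher="Example Data Repository",
    version="Version 2",
    identifier="https://doi.org/10.5555/example",
)
print(format_apa_like(example))
```

Because the optional elements default to being left out, the same function works for a record with only a creator, date, title and link.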

Styles

Once you've tracked down all the right elements, you'll need to put them together by using the appropriate style guidelines, consistent with the rest of your citation list.

However

Many of the major style guides do not yet include specific instructions about how to cite datasets. To create a citation for a dataset, you'll need the same basic pieces of information as you would for any other citation. As described in UBC Library's general How to Cite guide, these are found by asking Who/What/When/Where about the item you are citing.

At this point, APA is the only major style that has given specific examples of how to create a data citation.

For other styles, you will need to arrange the elements in the same way as other resources with similar elements. Think about the similarities between the elements you have and the ones in more common resources (e.g., a repository may be like a publisher, and a series within a table may be like a chapter within a book). This can help you to build out your citation even if you don't have a specific example to model.

Links to style guide resources can be found on the library's How to Cite guide.

Repositories

Dataset repositories, also known as research data repositories, provide researchers with a stable place to store and provide others with access to their research data:

Depending on the research discipline, data can often be deposited in one or more data centers (or repositories) that will provide access to the data. These repositories may have specific requirements:
  • subject/research domain
  • data re-use and access
  • metadata.

UO Libraries. "Data Repositories". Research Data Management

Some dataset repositories also have their own guidelines and suggestions for how to construct a data citation, which elements to include, and where to find those on the site. Look carefully around the repository's website to see if you can find any information about citing their data.


If you are interested in looking at some research data repositories, here is a very short list. Databib maintains a very extensive list of research data repositories if you would like to explore further.

  • Abacus: The Research Data Collection of the British Columbia Research Libraries' Data Services, a collaboration involving the Data Libraries at Simon Fraser University (SFU), the University of British Columbia (UBC), the University of Northern British Columbia (UNBC) and the University of Victoria (UVic).
  • figshare: A cloud-based storage system which "allows researchers to publish all of their data in a citable, searchable and sharable manner." NOTE: "all figures, media, poster, papers and multiple file uploads (filesets) are published under a CC-BY license... All datasets are published under CC0"
  • UK Data Service: Provides "single point of access to a wide range of secondary data including large-scale government surveys, international macrodata, business microdata, qualitative studies and census data from 1971 to 2011." Mostly UK data, but also includes some data from IGOs like the IMF, OECD and the World Bank.
  • ICPSR: "An international consortium of more than 700 academic institutions and research organizations.... ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. It hosts 16 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields."

Troubleshooting

I can't find all the elements I need:

The different elements needed for the citation may be hard to find, depending on the source of your dataset. The information usually provided about datasets is not as standardized as it is for books and articles, which can make things confusing.

  • Many data and statistics repositories collect datasets from several different agencies and providers. If you're unable to find enough information about the dataset in the repository, tracking down the dataset where it was originally published may turn up additional information.
  • Many data providers will offer their own guidelines for citing their datasets. This can help to decode some of the language used to describe particular elements. Sometimes these guidelines are general for the whole site, and sometimes they will be linked directly from the dataset record itself.
  • Do your best to include as many elements as possible and keep your data citation consistent with the rest of your list. If some key element is unavailable, try to make sure there is still enough information so that someone else can find it.
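
If you keep your citation elements in a spreadsheet or script, a quick check like the following can show which ones you still need to track down. This is only a sketch: the element names and the missing_elements function are hypothetical, chosen to mirror the list in the Elements section, and the sample record is made up.

```python
# Element names are assumptions that mirror the Elements section of this guide.
CORE_ELEMENTS = ("creator", "title", "publication_date", "identifier")
OPTIONAL_ELEMENTS = ("version", "resource_type", "publisher")


def missing_elements(record: dict) -> dict:
    """Return the core and optional elements that are absent or blank in a record."""
    def absent(names):
        # Treat missing keys, None, and whitespace-only strings as absent.
        return [name for name in names if not str(record.get(name) or "").strip()]

    return {"core": absent(CORE_ELEMENTS), "optional": absent(OPTIONAL_ELEMENTS)}


# Placeholder record: only the creator and title were found so far.
draft = {"creator": "Example Agency", "title": "Sample table", "publisher": " "}
report = missing_elements(draft)
print("Still needed:", report["core"])
print("Nice to have:", report["optional"])
```

Anything listed under the core elements is worth a second look at the original publisher's site before you settle for an incomplete citation.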

The data I'm using comes from multiple sources:

Sometimes a dataset draws on multiple sources at once, which can make it confusing to cite. This is particularly common when creating charts and tables, whether you are making them yourself or using online tools built in by the data providers.

You must cite ALL the sources of your data.

  • If you are combining data from several series from the same provider, cite all the series (e.g., combined series in CANSIM).
  • If you are combining data from several different providers, cite all the sources (e.g., a table you've made comparing trade data from Industry Canada to employment data from CANSIM).
  • If you're including a table or graph in your paper which combines data from multiple sources, include a note describing which data elements came from where, with in-text citations. Give each source an entry in your reference list. This guide from SFU shows citation examples in APA style of tables and figures drawing from multiple data sources.
  • If you're not including a table or graph, provide this same information in the text of your paper.
  • If your data is drawing from so many sources that citing each source in the traditional manner is unreasonable, see the section on "Microattribution" in the Digital Curation Centre guide.
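
As a rough illustration of the table note described in the list above, the sketch below builds a note that says which data elements came from which source, and flags any in-text citation that has no matching entry in your reference list. The function names, the "Note." wording and the providers are all hypothetical placeholders; follow your style guide for the exact format.

```python
from typing import Dict, Iterable, List


def build_source_note(sources: Dict[str, str]) -> str:
    """Build a table or figure note from {data element: in-text citation} pairs."""
    pieces = [f"{element} from {citation}" for element, citation in sources.items()]
    return "Note. " + "; ".join(pieces) + "."


def uncited_sources(sources: Dict[str, str], reference_list: Iterable[str]) -> List[str]:
    """Return in-text citations that have no matching reference-list entry."""
    known = set(reference_list)
    return [citation for citation in sources.values() if citation not in known]


# Placeholder providers and years, not real citations.
table_sources = {
    "Trade data": "Provider A (2020)",
    "employment data": "Provider B (2021)",
}
print(build_source_note(table_sources))
# Note. Trade data from Provider A (2020); employment data from Provider B (2021).

print(uncited_sources(table_sources, ["Provider A (2020)"]))
# ['Provider B (2021)']
```

The second function is a reminder that every source named in the note also needs its own entry in the reference list.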

More

Much of the information on this page has drawn on the following very thorough guide to citing data, which includes an extensive bibliography should you wish to do further reading on the subject:

Ball, A. & Duke, M. (2012). ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides

UBC Library's general How to Cite guide provides links to different style guides and more in-depth information about identifying the elements of a citation.

Statistics Canada has a reference building tool that can help you identify which elements to include for a wide range of data and statistics products.

The DOI Citation Formatter tool will generate a citation based on a DOI in the citation style of your choice. Make sure to double-check that all of your elements have ended up in the right place!
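
If you would rather script this step, DOI registration agencies such as Crossref and DataCite support "content negotiation": you request the DOI link with an Accept header asking for a formatted bibliography entry. The sketch below assumes this service behaves like the formatter tool, which may not hold for every DOI. The function names are hypothetical, the DOI shown is a placeholder, and the network call is left commented out.

```python
import urllib.request
from urllib.parse import quote


def build_citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Prepare (but do not send) a request for a formatted citation of a DOI."""
    url = "https://doi.org/" + quote(doi, safe="/")
    # The Accept header asks for a formatted citation instead of the landing page.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi: str, style: str = "apa", timeout: float = 10.0) -> str:
    """Send the request and return the citation text (requires network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Placeholder DOI; replace it with the DOI from your dataset's record.
request = build_citation_request("10.5555/example")
print(request.full_url)
print(request.get_header("Accept"))
# print(fetch_citation("10.5555/example"))  # uncomment to query the live service
```

As with the formatter tool, treat the returned text as a draft and double-check every element against the dataset record.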

This guide from SFU shows citation examples in APA style of tables and figures drawing from multiple data sources.