LibGuides: Research Data and Reproducibility: Organizing Data

What is data organization?

Data organization refers to structuring project directories to aid the storage and finding of files, naming files to enable logical grouping and/or chronological sorting within directories, and structuring the contents of files to facilitate analysis. This page outlines best practices for organizing your research data.

Good data organization considers:

All of the data and materials associated with a project and the relationships between them
How to consistently and meaningfully name files
What others need to know to understand your organization methods

Directory Structures

You'll want to plan and use a file and folder organizational structure that is informative and keeps all the research information and materials associated with a project together.

Here are some tips for organizing directories:

Organize data hierarchically, and identify ways to divide your data into categories or attributes such as:
- Project
- Time
- Location
- File Type
Files can be arranged chronologically, by classification or code, or alphabetically within folders. The most appropriate arrangement may depend on the types of files.
Folder and sub-folder names should reflect the content of the folder, not the names of researchers or staff.
Include basic information such as the project title, dates, and some kind of unique identifier, such as a grant number.
Document your directory structure in a README file, and describe the types of records that each folder should contain.
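
One way to keep a README's description of the directory structure up to date is to generate the listing from the folders themselves. The following is a minimal Python sketch under stated assumptions, not a prescribed tool: the function name describe_tree and the demo folder names are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def describe_tree(root: Path, indent: str = "  ") -> str:
    """Return an indented text listing of root, with folders before files."""
    lines = [root.name + "/"]

    def walk(folder: Path, depth: int) -> None:
        # Sort folders first, then files, each group alphabetically.
        entries = sorted(folder.iterdir(), key=lambda p: (p.is_file(), p.name.lower()))
        for entry in entries:
            lines.append(indent * depth + entry.name + ("/" if entry.is_dir() else ""))
            if entry.is_dir():
                walk(entry, depth + 1)

    walk(root, 1)
    return "\n".join(lines)


# Demo on a throwaway folder (hypothetical names, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "Dataset.A"
    (project / "Code" / "Step.1").mkdir(parents=True)
    (project / "Data" / "Raw").mkdir(parents=True)
    (project / "readme.txt").write_text("Describe each folder here.\n")
    listing = describe_tree(project)

print(listing)
```

The printed listing can be pasted into the README and annotated by hand with a short description of what belongs in each folder.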

Here are some suggestions for naming directories:

Put textual information such as documentation in a folder named docs
Put raw data and metadata in a folder named data, and make it read-only to prevent accidental changes
Files created during cleanup and analysis can go in a folder named results
Put any source for scripts or code in a folder named src or code
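
The folder suggestions above can be set up with a short script. This is a rough sketch rather than a recommended tool: the function names, the README.txt file name, and the demo project name are assumptions made for illustration. Removing write permission is only a guard against accidental edits, and how it behaves depends on the operating system.

```python
import stat
import tempfile
from pathlib import Path

# Folder names follow the suggestions above; "README.txt" is an assumed name.
SUBFOLDERS = ("docs", "data", "results", "src")


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the suggested sub-folders under root and return their paths."""
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text("Describe the directory structure and naming rules here.\n")
    return created


def make_read_only(data_dir: Path) -> None:
    """Remove write permission from every file under data_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in data_dir.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)


# Demo in a temporary folder (hypothetical project and file names).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "ProjectA"
    folders = create_project_skeleton(project)
    raw_file = project / "data" / "raw.csv"
    raw_file.write_text("sample,value\nS1,4.2\n")
    make_read_only(project / "data")
    is_writable = bool(raw_file.stat().st_mode & stat.S_IWUSR)
    # Restore write permission so the temporary folder can be cleaned up.
    raw_file.chmod(raw_file.stat().st_mode | stat.S_IWUSR)

print([f.name for f in folders], "raw file writable:", is_writable)
```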

Directory Structure examples
Organized by file type:

Dataset.A
    Code
        Step.1
        Step.2
    Data
        Processed
        Raw
    Results
        Figure.1
        Figure.2
        Models
    readme.txt

Organized by analysis:

Dataset.B
    Figure.1
        Code
        Data
        Results
    Figure.2
        Code
        Data
        Results
    Table.1
        Code
        Data
        Results
    readme.txt

File Names

Establishing a file naming convention that produces groupings of related files can help everyone in your lab easily identify the data they're looking for by name. There is no one-size-fits-all naming convention -- the conventions you use should be based on your and your team's needs.

Here are some things to consider when choosing a naming scheme:

File names should be unique. To avoid confusion or data loss if they get moved around, avoid having files with the same name, even if they're in different folders.
File names should describe their content by including its key parameters as components of the file name, such as:
- The date the data in the file was collected
- The project the data was collected for
- The experiment the data was collected in
- The sample the data was collected from
- The instrument the data was collected by
- The location the data was collected at
- The state or nature of the data: raw, transformed, final, documentation, meeting notes, etc.
- An identifier indicating the person responsible for creating or transforming the data in the file, such as their last name
- A version number to indicate the work history of the file
Dates in file names should use YYYYMMDD or YYYY-MM-DD format so that they sort chronologically in file folders.
File name components should be evident and non-cryptic:
- Use meaningful abbreviations
- Use CamelCase to make phrases more readable
Numbers in file names should use an extensible numbering scheme. Because computers sort file names character by character, numbers need to be padded with enough leading zeros to cover the total number of files, or they will not sort in numeric order in file folders. For example, instead of numbering files 1-100, use 001-100.
The components of your file names should be arranged from general to specific to group logically in file folders.
Characters in file names should be limited to numbers, letters, dashes, and underscores:
- Avoid special characters: they may not be supported on all computer systems and can lead to data loss when moving files between systems.
- Avoid spaces: file names with spaces need to be quoted or escaped when referenced at the command line, and web browsers encode spaces in URLs as %20
- Consider using dashes instead of underscores for better matching in web searches and regular expressions and better readability in URLs
- See Of Spaces, Underscores and Dashes by Jeff Atwood for more thoughts on the pitfalls of spaces in file names and the pros and cons of dashes vs underscores. Read the comments, too!
Consider keeping file names to fewer than 50 characters for readability and system compatibility.
Whatever conventions you choose, be consistent and document your decisions!
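
The effect of zero-padding and of year-first dates on sort order can be seen in a few lines of Python. This is only an illustrative sketch: the helper name padded_name and the sample file names are hypothetical.

```python
from datetime import date


def padded_name(prefix: str, index: int, total: int, ext: str) -> str:
    """Zero-pad index to the width needed for `total` files."""
    width = len(str(total))
    return f"{prefix}-{index:0{width}d}.{ext}"


numbers = (1, 2, 10, 100)
unpadded = sorted(f"sample-{n}.csv" for n in numbers)
padded = sorted(padded_name("sample", n, 100, "csv") for n in numbers)

print(unpadded)  # ['sample-1.csv', 'sample-10.csv', 'sample-100.csv', 'sample-2.csv']
print(padded)    # ['sample-001.csv', 'sample-002.csv', 'sample-010.csv', 'sample-100.csv']

# Year-first dates sort chronologically as plain text; month-first dates do not.
days = [date(2016, 1, 4), date(2015, 12, 31)]
iso_sorted = sorted(d.strftime("%Y%m%d") for d in days)
us_sorted = sorted(d.strftime("%m%d%Y") for d in days)

print(iso_sorted)  # ['20151231', '20160104']
print(us_sorted)   # ['01042016', '12312015']
```

Without padding, file 2 sorts after file 100, and with month-first dates the January 2016 file sorts before the December 2015 one.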

The following example filenames follow the guidance listed above:

20160104-ProjectA-Ex1-Test1-v01.xlsx
20160104-ProjectA-MeetingNotes-SmithE-v02.docx
ExperimentName-InstrumentName-CaptureTime-ImageID.tif

Note that the components of each file name are meaningful, arranged from general to specific, and separated with dashes. Where dates and version numbers appear, the dates use YYYYMMDD format and the version numbers are zero-padded.
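
A naming convention is easier to follow when names are generated and checked by a script instead of typed by hand. The sketch below shows one possible approach, not a standard: the function names build_file_name and check_file_name are hypothetical, and counting the extension toward the 50-character limit is an assumption.

```python
import re
from datetime import date
from pathlib import Path

# Letters, numbers, dashes, and underscores only, per the guidance above.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")
MAX_LENGTH = 50  # assumption: the limit applies to the whole name, extension included


def build_file_name(collected: date, components: list[str], version: int, ext: str) -> str:
    """Join a YYYYMMDD date, general-to-specific components, and a padded version."""
    parts = [collected.strftime("%Y%m%d"), *components, f"v{version:02d}"]
    return "-".join(parts) + "." + ext


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in name (empty if none)."""
    problems = []
    stem = Path(name).stem  # the name without its final extension
    if not ALLOWED.fullmatch(stem):
        problems.append("characters other than letters, numbers, dashes, underscores")
    if len(name) >= MAX_LENGTH:
        problems.append("50 characters or longer")
    return problems


name = build_file_name(date(2016, 1, 4), ["ProjectA", "Ex1", "Test1"], 1, "xlsx")
print(name)                                     # 20160104-ProjectA-Ex1-Test1-v01.xlsx
print(check_file_name(name))                    # []
print(check_file_name("my data (final).xlsx"))  # one problem reported
```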

File Naming Convention Worksheet
This worksheet from Caltech Library walks researchers through the process of creating a file naming convention for a group of files. This process includes: choosing metadata, encoding and ordering the metadata, adding version information, and properly formatting the file names.

File Contents

In addition to thinking about how to organize and name files on disk, you should also consider how to organize the contents of your files. These links offer some best practices.

Data Organization in Spreadsheets
A tutorial on best practices for organizing data in spreadsheets in order to make them more usable for analysis, from statistician Karl Broman.
Tidy Data
This article in the Journal of Statistical Software by data scientist Hadley Wickham describes a standard way of structuring a dataset in which each variable is a column, each observation is a row, and each type of observational unit is a table. This structure makes it easier to tidy messy datasets and to develop tools for data analysis.
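
As a small illustration of the tidy layout, the sketch below reshapes a "wide" table, where the column headers are dates (values, not variable names), into one row per observation. It uses only the Python standard library; the function name wide_to_long and the sample measurements are invented for this example and do not come from the article.

```python
def wide_to_long(rows: list[dict], id_column: str, variable_name: str, value_name: str) -> list[dict]:
    """Turn every non-id column of each row into its own observation row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({id_column: row[id_column], variable_name: column, value_name: value})
    return long_rows


# Hypothetical data: one row per sample, one column per collection date.
wide = [
    {"sample": "S1", "2016-01-04": 4.2, "2016-01-05": 4.8},
    {"sample": "S2", "2016-01-04": 3.9, "2016-01-05": 4.1},
]

tidy = wide_to_long(wide, "sample", "date", "measurement")
for row in tidy:
    print(row)
```

In the result, sample, date, and measurement are each a column, and each of the four rows holds a single observation.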