
Data Management and Sharing

Folder Organization

At the beginning of any new project, consider the types of data that you will generate and think about how you or someone else might look for a file a year from now.

 

A simple hierarchy might look like:

[Project] / [Experiment] / [Instrument or Type of file]

  • Create a README file to describe your folder organization strategy
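As a minimal sketch of this approach, the short script below scaffolds such a hierarchy and drops a README stub at the project root. The project, experiment, and instrument names are placeholders, not a prescribed layout.

from pathlib import Path

# Placeholder layout following [Project] / [Experiment] / [Instrument or Type of file];
# substitute your own project, experiment, and instrument names.
EXPERIMENTS = {
    "Experiment01": ["Microscope", "Spreadsheets"],
    "Experiment02": ["MassSpec"],
}

def scaffold(project, experiments):
    """Create the folder tree plus a README describing the organization strategy."""
    root = Path(project)
    for experiment, instruments in experiments.items():
        for instrument in instruments:
            (root / experiment / instrument).mkdir(parents=True, exist_ok=True)
    # The README tells collaborators (and future you) how the tree is organized.
    (root / "README.txt").write_text(
        "Folder organization: [Project] / [Experiment] / [Instrument or Type of file]\n"
    )

scaffold("MyProject", EXPERIMENTS)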

Below are several pitfalls to avoid, from FASEB's Structuring File Folders Effectively:

  • Too many folder levels.  Aim for 3-4 levels across all folders
  • Too many files in a folder.  Good rule of thumb? No more than 10 files per folder
  • Ambiguous or overlapping folder names
  • Copies of the same file in different folders. Duplicates complicate version control and risk leaving some copies out of date. Create a shortcut if you need a placeholder for a file in another folder
  • Having "catch-all" folders.  Avoid folders such as "current documents" or "my stuff".  Additionally, name folders according to the research, not by researcher names

Dryad's Good Data Practices gives an example of two different ways to think about file organization:

 

1) Organized by File Type

DatasetA.tar.gz
|- Data/
|  |- Processed/
|  |- Raw/
|- Results/
|  |- Figure1.tif
|  |- Figure2.tif
|  |- Models/

 

2) Organized by Analysis

DatasetB.tar.gz
|- Figure1/
|  |- Data/
|  |- Results/
|  |  |- Figure1.tif
|- Figure2/
|  |- Data/
|  |- Results/
|  |  |- Figure2.tif

File Naming

Establishing file naming conventions at the outset of a research project keeps data files organized and facilitates file retrieval and sharing.  It is easy to underestimate the vast quantity of data files a project will generate, even on a daily basis.

FASEB's File Naming Best Practices highlights important reasons to establish a file naming schema, as it will help you:

  • avoid computational mistakes when you analyze the data
  • browse your data and see what is in a file folder at a glance 
  • remember what is in each file when you return to old data

Below are guidelines for file naming best practices (a short sketch applying them follows the list):

  • Make file names descriptive and keep them to 30 characters or fewer
  • Format dates according to ISO 8601: YYYYMMDD. This format also makes files sort chronologically
  • Use underscores to separate elements, and avoid spaces and special characters such as periods or ampersands, which can confuse computer programs
  • Include a system for labeling file versions
  • Create a README file to describe your file naming scheme
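The helper below builds names like 20250115_projx_assay_v02.csv. The project code, description, and the 30-character length check are illustrative assumptions, not a fixed standard.

from datetime import date

def make_filename(project, description, version, ext):
    """Build e.g. 20250115_projx_assay_v02.csv: ISO 8601 date (YYYYMMDD),
    underscores between elements, no spaces, zero-padded version label."""
    stamp = date.today().strftime("%Y%m%d")       # ISO 8601 dates sort chronologically
    desc = description.lower().replace(" ", "-")  # no spaces inside an element
    name = f"{stamp}_{project}_{desc}_v{version:02d}.{ext}"
    if len(name) > 30:                            # guideline: 30 characters or fewer
        raise ValueError(f"File name too long ({len(name)} characters): {name}")
    return name

print(make_filename("projx", "assay", 2, "csv"))  # -> 20250115_projx_assay_v02.csv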

File Formats

As technology continues to evolve, software and hardware that exist today can become obsolete, and data files saved in proprietary formats tied to that technology risk becoming unusable. Storing your data in robust, open file formats keeps it accessible and usable to you and others in the future.

While you may need to collect data from an instrument in its default, proprietary file type, it is important to export data intended for storage and sharing to a more preservable format.
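As a minimal sketch of that export step, assuming the instrument produces an Excel workbook and that pandas (with the openpyxl reader) is installed, the data can be re-saved as CSV; the file name results.xlsx is hypothetical.

import pandas as pd

# Hypothetical instrument export; assumes pandas and openpyxl are installed.
df = pd.read_excel("results.xlsx")     # read the proprietary original
df.to_csv("results.csv", index=False)  # save an open, plain-text copy for sharing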

Below are preferred formats that are non-proprietary, common, and accessible:

  • Text: Plain text, ASCII (.txt); Portable Document Format (.pdf); Extensible Markup Language (.xml)
  • Tabular: Comma-separated values (.csv)
  • Image: Tagged Image File Format (.tif, .tiff); JPEG 2000 (.jp2); Portable Network Graphics (.png)
  • Document: Portable Document Format (.pdf, .pdf/a, .pdf/ua)
  • Video: MPEG-4 (.mp4); Material Exchange Format (.mxf)
  • Web data/data exchange: JavaScript Object Notation (.json); Extensible Markup Language (.xml)
  • Geospatial data: ESRI Shapefile (essential: .shp, .shx, .dbf; optional: .prj, .sbx, .sbn)

 

Why CSV, TXT, and JSON?

Formats like CSV, TXT, and JSON are widely used due to their simplicity, versatility, and ease of use across different platforms and programming environments.

Each format serves specific needs and has unique advantages, depending on the nature of the data, storage requirements, and the types of analyses or collaborations anticipated. 

  • CSV: a favorite for structured data, especially tables. It is easy for both people and computers to read, works with most data processing tools, and is lightweight, making it ideal for number-rich, straightforward data such as survey results in rows and columns.
  • TXT: highly flexible and supported almost everywhere. Great for storing unstructured or semi-structured data and for jotting down notes or logs in a readable format. That said, large datasets can become unwieldy in TXT files because the format has no inherent structure.
  • JSON: ideal when you are working with more complicated data that includes nested elements. JSON is easy for humans and machines to read, making it great for capturing detailed relationships in your data, like metadata or different levels of classification. Web apps and data science projects commonly use JSON where a rich data structure is needed. While JSON files can be larger and somewhat more complex than CSV files, they can handle a variety of data types, making them versatile across different research setups (see the sketch after this list).
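To make the contrast concrete, the sketch below writes the same made-up sample records both ways using only Python's standard library: the flat rows go to CSV, while the nested metadata goes to JSON. The file names, field names, and values are all invented for illustration.

import csv
import json

# Made-up records for illustration only.
rows = [
    {"sample_id": "S001", "value": 4.2},
    {"sample_id": "S002", "value": 5.1},
]

# CSV suits the flat, tabular part of the data.
with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "value"])
    writer.writeheader()
    writer.writerows(rows)

# JSON captures nested structure, such as metadata, alongside the records.
with open("measurements.json", "w") as f:
    json.dump(
        {"metadata": {"instrument": "spectrometer", "units": "mg/L"}, "records": rows},
        f,
        indent=2,
    )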