Data Management – King Group

Notebooks

While paper notebooks have stood the test of time, they have shortcomings including flammability, illegibility, difficulty in making backups, and unsuitability for the storage of electronic files. Electronic laboratory notebooks address these problems, but bring with them a host of other problems, including inconvenience, impermanence (no long-term support), and rigid formatting rules. My (BTK’s) current view is that we should use paper notebooks in conjunction with a digital repository for scanned backups and electronic files.

As of 30 March 2021, the viable (= FOSS and self-hosted) contenders for ELNs include Indigo, eLabFTW, Chemotion, openBIS, JupyterLab, and Kadi4Mat. In general, though, these products are either too limited, too computationally focused, or too new to merit adoption over paper notebooks. JupyterLab does look useful for data analysis and related programming.

Electronic Data

Our file server can be accessed here and is a nextcloud instance hosted at hetzler. If you need an account or are having problems, contact BTK.

Please migrate your data from staudinger to our new server, which can be found here.

File Names for Data

The systematic naming of files will help us find things down the road. In a nutshell, the format should be:

~/20A/016/filename.ext

In full, here are the naming guidelines:

Use no more than 32 characters.
Use only numbers, letters, and underscores.
Do not use special characters, dashes, spaces, or multiple dots or stops.
Avoid common terms (‘data’, ‘sample’, ‘final’, or ‘revision’).
Use consistent case – all lower case, or all UPPER CASE, or Lower case.
Dates should be in a standard format (YYYYMMDD), which will allow them to sort chronologically.
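
The guidelines above can be checked mechanically. The following is a minimal Python sketch, not an official group tool: the function name check_file_name is hypothetical, the 32-character limit is assumed to count the whole name including the extension, and a single dot is assumed to be allowed only before the extension. It does not check case consistency or date format.

```python
import re

MAX_LENGTH = 32  # assumption: the limit counts the whole name, extension included
DISCOURAGED_TERMS = ("data", "sample", "final", "revision")
# Letters, digits, and underscores, optionally followed by one dot and an extension.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9]+)?")


def check_file_name(name: str) -> list:
    """Return a list of guideline violations for a file name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("uses characters other than letters, numbers, "
                        "underscores, and a single dot before the extension")
    lowered = name.lower()
    for term in DISCOURAGED_TERMS:
        if term in lowered:
            problems.append(f"contains the common term '{term}'")
    return problems


if __name__ == "__main__":
    for candidate in ("nmr_20210330.fid", "final data.txt"):
        print(candidate, check_file_name(candidate))
```

An empty list means the name passes these particular checks. The substring test for common terms is deliberately crude and will also flag names such as "metadata".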

Create a folder with your name (e.g. data_btk) and subfolders for each notebook (e.g. 20A). Each experiment should then have a folder whose name matches the first page of the experiment in your paper notebook, padded with leading zeros (e.g. 016). The leading zeros allow for proper sorting in directory listings, so that pages 1 and 101 are not listed side by side on the computer. See above for the file naming guidelines.
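
As an illustration of the folder layout and of why leading zeros matter, here is a short Python sketch. The helper name experiment_dir is hypothetical, and the three-digit padding is an assumption taken from the 016 example above.

```python
from pathlib import PurePosixPath


def experiment_dir(user_folder: str, notebook: str, first_page: int) -> PurePosixPath:
    """Build an experiment folder path, zero-padding the page number to three digits."""
    return PurePosixPath(user_folder) / notebook / f"{first_page:03d}"


if __name__ == "__main__":
    print(experiment_dir("data_btk", "20A", 16))

    # Directory listings sort names as text, so unpadded pages interleave.
    print(sorted(["1", "16", "101"]))       # 1 and 101 end up adjacent
    print(sorted(["001", "016", "101"]))    # padded names sort in page order
```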

The Importance of Metadata and the README.txt file

As described above, each completed experiment should be compiled and documented in a single directory on our nextcloud instance. In addition, each experiment directory must contain a file named README.txt that holds the metadata describing the contents of the directory. This file gives the title of the experiment and provides a description of each file present, e.g. the origin of the materials in the image/nmr/etc., the instrument/software used to generate the file, etc. Chemdraw files are also useful. The README.txt file is the key to making sense of the various files and formats down the road.

The first line should include the title of the experiment, as in the example below. Here’s a sample README.txt from a computational experiment from Ben’s recent work:

Files for Ben King's notebook 20A p. 10:  Hexamer + (CoAPSO)6 Geometry Optimization (uff/avogadro)
1. fantrip-triazine-CoAPSO_6.mol   the mol file of the geometry
2. fantrip-triazine-CoAPSO_6.cml   the cml file of the geometry
3. 10.pdf                          scan of the notebook page

The file should be in plain text but can include markdown if you like. There are many useful discussions of metadata on the web; see e.g. this example from Harvard.
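
To get started on a README.txt, the file list can be generated automatically and the descriptions filled in by hand. The sketch below is one possible approach, not a group requirement: the function name readme_skeleton and the "TODO" placeholder text are hypothetical, and the layout simply imitates the sample above.

```python
from pathlib import Path


def readme_skeleton(directory: Path, title: str) -> str:
    """Return README.txt text: a title line, then one numbered line per file."""
    names = sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.name != "README.txt"
    )
    # Pad file names to a common width so the descriptions line up.
    width = max((len(name) for name in names), default=0)
    lines = [title]
    for number, name in enumerate(names, start=1):
        lines.append(f"{number}. {name.ljust(width)}   TODO: describe this file")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints a skeleton for the current directory; review it before saving.
    print(readme_skeleton(Path("."), "Files for <name>'s notebook <book> p. <page>: <title>"))
```

The function only returns text and never writes a file, so an existing README.txt cannot be overwritten by accident.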

Common formats

NMR data should be stored in two, or maybe even three, formats:

JCAMP-DX (this is an open NMR interchange format preferred by IUPAC and various NMR data repositories)
mestrec
pdf analyzed spectrum with structure, expanded regions inset, integration, etc.

Mass spectral data should be stored in a similar way:

open mzML (this is an open-format MS data interchange format preferred by various MS data repositories)
the original data files from the instrument
pdf of the full and analyzed spectrum

Images should be stored in a lossless format. TIFF is preferred, but PNG can be used when TIFF is not available. JPEG must not be used. Make sure to store the original, unadulterated image.
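
The image rule can be applied as a quick check by file extension before uploading. This is a rough Python sketch: the function name image_format_status and the status labels are hypothetical, and an extension only suggests the format, it does not prove how the image was saved.

```python
from pathlib import PurePath

PREFERRED = {".tif", ".tiff"}   # TIFF: preferred
ACCEPTABLE = {".png"}           # PNG: allowed when TIFF is not available
FORBIDDEN = {".jpg", ".jpeg"}   # JPEG: must not be used


def image_format_status(file_name: str) -> str:
    """Classify a file by extension according to the image guideline."""
    suffix = PurePath(file_name).suffix.lower()
    if suffix in PREFERRED:
        return "preferred"
    if suffix in ACCEPTABLE:
        return "acceptable"
    if suffix in FORBIDDEN:
        return "forbidden"
    return "unknown"


if __name__ == "__main__":
    for name in ("gel_20210330.tif", "tlc_plate.PNG", "photo.jpg"):
        print(name, image_format_status(name))
```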

Backups

Data on staudinger is stored as three copies:

Original data is stored on staudinger using a ZFS file system that is redundant across 2 x 3 TB drives. The file system is configured for versioning, making it possible to recover data that has been overwritten.
Staudinger is backed up nightly via a crontab job to an external USB hard drive.
Staudinger is also backed up nightly using an idrive script to a 5 TB storage block on the idrive cloud service. BTK has access to the account.

Final Repository

At the completion of a manuscript, thesis, or dissertation, the document and all of the supporting original data will be placed into a permanent digital repository. We will most likely use Zenodo, which is hosted by CERN.