I’ve been thinking recently about how best to share data from the lab, as part of our adoption of open science practices. In particular, we generate quite a lot of EEG data. Unlike the MRI community, which has a universal file format (NIfTI), there is no widely agreed standard format for EEG data. Almost all systems use a proprietary file format. For example, our main EEG systems are ANT Neuroscan systems, which use a file format called EEProbe. For long-term data accessibility this is not ideal: companies go bust, and file formats get forgotten. Reading these file formats into languages like R is also problematic, because unless someone has written code to read a given file type, the files are next to useless (see this blog post by Matt Craddock for further discussion).
So, I have decided that the raw data files generated by the EEG system should be converted to another file format so that they can be shared more easily. Previously much of our analysis was done in Matlab, and so the .mat file format was a possibility. However, this is just another proprietary file format (owned by MathWorks), so it might not still be widely readable in a few decades’ time. The very simplest thing I could think of was to use a comma separated value (csv) file: a text file format used to store data, in which each field is separated by a comma.
For this to work, the data layout needs to be logical and standardised, and adaptable to different EEG systems and electrode montages. For the main data files, the first column should show the time of each sample (in ms), and the second column should contain trigger values. Each subsequent column will contain the data from one electrode (with the column header containing the electrode name). Here is an example of some data in this format:
Most of the triggers turn out to be zeros, because the trigger events happen only rarely. However, I think it’s better to store the triggers along with the data, rather than in a separate file (which is what the EEProbe format does). The electrodes can appear in any order, as they will get matched up with the montage later. The big advantage of this format is simplicity: it’s clear what is being stored in each column of the spreadsheet and what to do with it. However, there is one big disadvantage: the raw csv files are at least 10x larger than the original files from the EEG system. This is because the EEProbe file format uses some form of compression to reduce the file size, whereas csv files are uncompressed.
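To make the layout concrete, here is a minimal sketch in Python (standard library only) that parses a few rows in this format and pulls out the rare non-zero trigger events. It is an illustration under assumptions rather than lab code: the column names "Time" and "Trigger", the electrode labels and every value below are invented.

```python
import csv
import io

# Synthetic example in the layout described above: time (ms), trigger,
# then one column per electrode. The column names, electrode labels and
# all values are made up for illustration.
example_csv = """Time,Trigger,Fz,Cz,Pz
0,0,1.2,0.8,-0.3
1,0,1.1,0.9,-0.2
2,10,1.4,0.7,-0.1
3,0,1.3,0.6,0.0
4,20,1.0,0.5,0.2
"""

def read_recording(text):
    """Parse CSV text into times, triggers and a dict of electrode data."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    electrode_names = header[2:]  # everything after the time and trigger columns
    times, triggers = [], []
    channels = {name: [] for name in electrode_names}
    for row in reader:
        times.append(float(row[0]))
        triggers.append(int(float(row[1])))
        for name, value in zip(electrode_names, row[2:]):
            channels[name].append(float(value))
    return times, triggers, channels

def trigger_events(times, triggers):
    """Return (time, code) pairs for samples where the trigger is non-zero."""
    return [(t, code) for t, code in zip(times, triggers) if code != 0]

times, triggers, channels = read_recording(example_csv)
events = trigger_events(times, triggers)
print(events)  # [(2.0, 10), (4.0, 20)]
```

Because the electrode names are read from the header row, the same code works whatever order the electrode columns appear in.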
We can also compress the csv files, using something like gzip. The compressed files are still quite a lot larger than the originals – approximately 4.2x larger – but I think that’s manageable because these days storage is cheap, and the aim here is to host the data files on a website like the OSF or Figshare, which don’t charge for hosting publicly available data anyway. Crucially, the gzip compression is transparent to software like R, which can load a csv.gz file using the same command as you would use to load an uncompressed csv file. This can be done using native R functions, and is as simple as:
data <- read.csv('file.csv.gz',header=TRUE)
Of course data can also be read into other programming languages, or even packages like Excel.
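As one example of another language, the sketch below reads a gzipped csv file in Python using only the standard library. The file name, column names and values are placeholders, and the script writes a tiny synthetic file first so that it runs on its own.

```python
import csv
import gzip
import os
import tempfile

def read_gzipped_csv(path):
    """Read a gzipped CSV file into a header list and a list of rows."""
    # gzip.open in text mode ("rt") decompresses on the fly.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows

# Write a tiny synthetic file so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "file.csv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Time", "Trigger", "Cz"])
    writer.writerow([0, 0, 0.8])
    writer.writerow([1, 10, 0.9])

header, rows = read_gzipped_csv(demo_path)
print(header)  # ['Time', 'Trigger', 'Cz']
print(rows)    # [['0', '0', '0.8'], ['1', '10', '0.9']]
```

The csv module returns every field as a string, so numeric columns need converting after loading. If pandas is available, its read_csv function also infers gzip compression from the .gz extension by default.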
I have written a Matlab script to convert cnt files to gzipped csv files, which is linked below. This function uses the EEProbe CNT reader plugin from EEGLAB, and so requires EEGLAB to be installed and visible on the Matlab path. The idea is that you give it the path to a folder as an input, and it will convert all the cnt files it finds in there into csv format. I’m not particularly intending for others to use this script as is; rather, it’s a useful template to adapt to process data from different systems.
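For anyone adapting the idea to another system, the folder-level compression step can be sketched in Python as follows. This is not the Matlab script described above and does not read cnt files: it assumes the recordings have already been exported to csv by some other means, and the function name and file names are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_csv_files(folder):
    """Compress every .csv file in `folder` to .csv.gz; return the new paths."""
    compressed = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        gz_path = csv_path.with_name(csv_path.name + ".gz")
        # Stream the file through gzip so large recordings are not
        # loaded into memory all at once.
        with open(csv_path, "rb") as source, gzip.open(gz_path, "wb") as target:
            shutil.copyfileobj(source, target)
        compressed.append(gz_path)
    return compressed

# Demonstration on a temporary folder holding one synthetic file.
demo_folder = Path(tempfile.mkdtemp())
(demo_folder / "P01_block1.csv").write_text("Time,Trigger,Cz\n0,0,0.8\n")
created = gzip_csv_files(demo_folder)
print([p.name for p in created])  # ['P01_block1.csv.gz']
```

The original csv files are left in place, so they can be checked against the compressed copies before being deleted.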
Besides the raw data, it is also often useful to have meta-data associated with the study. Whereas the raw data files will usually involve one file per recording session (block), the meta-data should be common to all participants and sessions in a study, so only needs to be generated once. I also used the csv format for this, and laid it out as follows:
The first two columns contain useful information about the study, including a description, the year it took place, and parameters of the EEG system and experimental conditions. The next two columns contain all legal trigger codes, along with a description of the conditions they indicate. Next I included a list of participant numbers for whom complete data sets exist. The next three columns give the labels and x and y positions of each electrode in the montage, and all remaining columns are used to draw a cartoon head, nose and ears. This is to permit the creation of scalp plots. I think this is more or less everything one would need to process the results of a typical EEG experiment. Again, example code to create the header file is linked below.
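Because the groups of columns in such a file have different lengths, a reader has to cope with empty cells in the shorter ones. The Python sketch below shows one way of doing that for the trigger, participant and electrode columns. The column names, trigger labels and coordinates are invented for illustration, and the study information and head outline columns are left out for brevity.

```python
import csv
import io

# Hypothetical header file: the column names and values are invented.
# Columns have different lengths, so shorter ones end in empty cells.
header_csv = """TriggerCode,TriggerLabel,Participant,Electrode,X,Y
10,Low contrast,1,Fz,0.0,0.4
20,High contrast,2,Cz,0.0,0.0
,,3,Pz,0.0,-0.4
,,,Oz,0.0,-0.8
"""

def read_column_block(text, names):
    """Collect the named columns row by row, skipping rows where any is empty."""
    reader = csv.DictReader(io.StringIO(text))
    block = []
    for row in reader:
        values = [row[name] for name in names]
        if all(value != "" for value in values):
            block.append(values)
    return block

# Map each trigger code to the description of its condition.
trigger_labels = {
    int(code): label
    for code, label in read_column_block(header_csv, ["TriggerCode", "TriggerLabel"])
}
# Map each electrode label to its (x, y) position in the montage.
electrode_xy = {
    name: (float(x), float(y))
    for name, x, y in read_column_block(header_csv, ["Electrode", "X", "Y"])
}
participants = [int(p) for (p,) in read_column_block(header_csv, ["Participant"])]

print(trigger_labels)  # {10: 'Low contrast', 20: 'High contrast'}
print(participants)    # [1, 2, 3]
```

The electrode dictionary can then be matched against the electrode names in a data file's header row, regardless of column order.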
As a first step, I thought it would be a good idea to upload one of our largest data sets in this format. It is from a steady-state EEG experiment measuring contrast response functions, in which we tested N=100 participants. The results are reported in a recent paper (Vilidaite et al., 2018). We plan to do some secondary analyses on this data set, but I think it’s an unusual resource that others might find interesting as well. It can be accessed here: https://osf.io/y4n5k/