It has long been recognized that synergies between data sets are invaluable to researchers in the environmental sciences (Carpenter et al. 2009; Peters 2010; Marx 2013; Specht et al. 2015; McKay and Emile-Geay 2016). Many have noted the opportunities for synthesis among independently collected data sets (Carpenter et al. 2009; Peters 2010; Marx 2013; Specht et al. 2015), and some have called for software tools to ease the challenges of sharing data sets (Peters 2010). Spatiotemporal data, or data combining spatial variability with temporal variability, are an important subset of environmental data (Peters 2010) and include data from long-term environmental monitoring, paleolimnological studies, dendrochronological studies, ice cores, and ocean cores. These data are often collected at high temporal resolution and at great effort, and their interpretation is facilitated by comparison with other spatially or temporally proximal data sets. Although online repositories for environmental data have grown in number (e.g., DataONE, EcoTrends, NCEI Paleo), end-user software tools specifically dedicated to data exchange among researchers remain lacking.
All data storage is a balance between flexibility (what information it can store), efficiency (how much disk space is needed to store the data), and usability (how much manipulation is needed before something can be done with the data; Anderson et al. 2009). An ideal format for environmental data must be flexible enough to store essential metadata, uncertainty information, and the data set itself in a single distributable container; efficient enough to store increasingly large data sets; and usable enough to encourage widespread use in the environmental community. Linked tables (Codd 1971, 1990) have long been used to store complex data in a single distributable container, but when they are used to distribute multiparameter spatiotemporal data, their use and format are often inconsistent. This paper describes an implementation of linked tables that provides a consistent, flexible, efficient, and usable data format to facilitate data exchange among a rapidly increasing number of multiparameter spatiotemporal data sets.
Data formats
At least initially, many multiparameter time-series data sets are stored in a flat, wide format, with a single column for each measured parameter and a row for each time or depth measured. This format resembles the “tidy data” format described by Wickham (2014) and draws from the idea of location- and time-wide representations of spatiotemporal data described by Pebesma (2012). If multiple locations were measured in a study, the location names can be stored in a column or data from separate locations can be stored in separate tables. Value uncertainty information can be stored in additional columns (often one next to each parameter column), which requires a strict format if the data need to be machine readable. Other metadata that may apply to specific values (e.g., notes, additional uncertainty information) or parameters (e.g., units, measurement method) are generally lost in this format. Missing cell values may indicate either that a value was not measured or that it was below the detection limit, and there is generally no way to distinguish the two outside the context of spreadsheet software. This data format is popular because of its simplicity and its usability. Spreadsheet software is amenable to visualizing and manipulating data in this format, which has likely aided its popularity. An example of this format is provided in Table 1.
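The strict column convention that machine-readable uncertainty requires can be illustrated with a minimal sketch. The table contents, the column names, and the "_sd" suffix convention below are assumptions made for illustration only and are not taken from Table 1; the function name read_wide is likewise hypothetical.

```python
import csv
import io

# Hypothetical wide-format table: one row per depth, one column per
# parameter, and a "<parameter>_sd" uncertainty column next to each one.
WIDE_CSV = """depth_cm,Pb,Pb_sd,Zn,Zn_sd
0.5,22.1,1.3,101.0,4.0
1.5,19.8,1.1,,
2.5,,,88.5,3.6
"""


def read_wide(text, key="depth_cm", sd_suffix="_sd"):
    """Parse a wide table into {key_value: {parameter: (value, sd)}}."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Parameter columns are all columns that are neither the key
    # nor an uncertainty column (assumed naming convention).
    params = [c for c in rows[0] if c != key and not c.endswith(sd_suffix)]
    out = {}
    for row in rows:
        cells = {}
        for p in params:
            raw, raw_sd = row[p], row.get(p + sd_suffix, "")
            # An empty cell becomes None: the table itself cannot say whether
            # the value was not measured or was below the detection limit.
            value = float(raw) if raw else None
            sd = float(raw_sd) if raw_sd else None
            cells[p] = (value, sd)
        out[float(row[key])] = cells
    return out


table = read_wide(WIDE_CSV)
print(table[1.5])
```

The parser only works because every uncertainty column follows the assumed naming rule; a file that labels one of them differently would silently treat it as a separate parameter, which is the fragility described above.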
An alternative to a flat, wide format is a flat, long format with a column that specifies which parameter was measured, a column that specifies the value of that parameter, and a row for each time–parameter combination (Pebesma 2012; Wickham 2014). As a result, the table may contain many more rows than the wide format and contains much repeated information (each time–parameter combination is included for every measured value, whereas in a wide format each time value is only included once). Uncertainty information and other metadata that apply to specific values (e.g., notes, additional uncertainty information) can be included as additional columns in this table; however, metadata for each parameter–location combination are still difficult to incorporate. This format can be edited or manipulated in spreadsheet software, is able to natively retain far more of the original information than a wide format (e.g., value uncertainty), is able to accommodate varying observation times across multiple parameters, and allows rapid plot creation via plotting libraries that include a faceting mechanism (e.g., the “ggplot2” R package; Wickham et al. 2016). Missing values are explicit and can have a defined meaning in a long format because if a location–parameter–time combination was not measured, the row will not exist in the table (Wickham 2014). Conversion to the wide format described above is available in spreadsheet software (using pivot tables) as well as Python (via the “pandas” package; McKinney 2016) and R (via the “tidyr” package; Wickham and RStudio 2017). An example of the sample data in a flat, long format is provided in Table 2.
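The relationship between the two flat formats can be sketched in a few lines. The example below uses only the Python standard library so that it stays self-contained; the values, column names, and the helper names wide_to_long and long_to_wide are assumptions for illustration. In practice, the pandas "melt" and "pivot" operations cover the same ground.

```python
# Hypothetical wide table: one row per depth, one column per parameter.
wide = [
    {"depth_cm": 0.5, "Pb": 22.1, "Zn": 101.0},
    {"depth_cm": 1.5, "Pb": 19.8, "Zn": None},  # Zn has no value here
]


def wide_to_long(rows, key):
    """Return one row per key-parameter pair that actually has a value."""
    long_rows = []
    for row in rows:
        for param, value in row.items():
            if param == key or value is None:
                continue  # an unmeasured combination simply has no row
            long_rows.append({key: row[key], "param": param, "value": value})
    return long_rows


def long_to_wide(long_rows, key):
    """Pivot back: one row per key value, one column per parameter."""
    params = sorted({r["param"] for r in long_rows})
    by_key = {}
    for r in long_rows:
        # Start each new row with every parameter set to None.
        cols = by_key.setdefault(r[key], dict.fromkeys(params))
        cols[r["param"]] = r["value"]
    ordered = sorted(by_key.items(), key=lambda kv: kv[0])
    return [{key: k, **cols} for k, cols in ordered]


long_rows = wide_to_long(wide, "depth_cm")
round_trip = long_to_wide(long_rows, "depth_cm")
print(len(long_rows), "long rows")
print(round_trip == wide)
```

The long table has three rows for four wide cells because the empty cell is dropped; pivoting back reintroduces the gap as an empty cell.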
Many spatiotemporal data are distributed in geographic information system (GIS) data formats, which generally consist of location information linked to a flat, wide data table (i.e., the attribute table). When the number of locations is large compared with the number of points in time, multiparameter spatial data are effectively stored in these formats. When the number of points in time becomes large, handling these data in GIS data formats with flat, wide attribute tables becomes unwieldy, and other structures are needed to avoid attribute tables with a large number of columns (i.e., a column for each parameter–point in time combination).
Pebesma (2012) described a number of R data structures, provided in the package “spacetime” (Pebesma et al. 2016), which combine GIS structures with time-series structures, together forming a powerful interface for the visualization and analysis of single-parameter spatiotemporal data.
There are many examples of multiparameter spatiotemporal data distribution in flat, long format. Statistics Canada (statcan.gc.ca/) distributes location information (census tracts) in a spatial data format, which is linked to actual census data that are distributed in a flat, long format. The US National Water Quality Monitoring Council (waterqualitydata.us/) distributes historic water quality data from a number of water quality monitoring organizations in a flat, long format along with location data in a separate table. The Environment Canada NAtChem data set (ec.gc.ca/natchem/) occasionally distributes data in a file that contains several linked tables that include location metadata, parameter metadata, and column metadata alongside data values, although data values are in a flat, wide format.
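The pattern shared by these examples, a long data table linked to a separate location table through a common identifier, can be sketched with an in-memory SQLite database. The table names, column names, and values below are hypothetical and do not reproduce the schema of any of the sources named above.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Two linked tables: location metadata is stored once per location,
# and each measured value is one row in a flat, long data table.
con.executescript("""
CREATE TABLE locations (
    location  TEXT PRIMARY KEY,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE data (
    location TEXT REFERENCES locations(location),
    date     TEXT,
    param    TEXT,
    value    REAL
);
""")

con.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("Lake A", 44.62, -63.59),
    ("Lake B", 45.10, -64.30),
])
con.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", [
    ("Lake A", "2015-06-01", "pH", 6.8),
    ("Lake A", "2015-06-01", "temp_C", 17.2),
    ("Lake B", "2015-06-02", "pH", 7.1),
])

# Joining on the shared identifier attaches coordinates to every value
# without repeating them in the stored data table.
joined = con.execute("""
    SELECT d.location, l.latitude, l.longitude, d.date, d.param, d.value
    FROM data AS d
    JOIN locations AS l ON l.location = d.location
    ORDER BY d.location, d.date, d.param
""").fetchall()

for row in joined:
    print(row)
```

Keeping coordinates in their own table means a correction to a location is made in one place, at the cost of requiring a join before the data can be mapped.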
The recently proposed linked paleo data (LiPD) format is a nested format for multiparameter spatiotemporal data implemented in JavaScript object notation (JSON) that is optimized for paleoclimate data (Emile-Geay and Eshleman 2013; McKay and Emile-Geay 2016). This format is highly structured and suggests required metadata that should be distributed alongside paleoclimate data. The nested nature and JSON implementation of this format inhibit editing data directly in spreadsheet software; however, the underlying data are still stored in tables with one table for each measured parameter, a column for time, a column for the value of the parameter, and a row for each time measured. This structure is similar to the flat, long format described above, except data for each parameter are split into separate tables. The LiPD format also proposes a formal encoding for the interpretation of paleoclimate parameters (Emile-Geay and Eshleman 2013; McKay and Emile-Geay 2016), similar to other formal encodings for ecological metadata (Fegraus et al. 2005).
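The similarity between a nested, one-table-per-parameter layout and the flat, long format can be shown with a small sketch. The JSON document below is a deliberately simplified, hypothetical structure; its keys are assumptions for illustration and it is not the LiPD schema.

```python
import json

# Hypothetical nested document: one small table per measured parameter,
# each with its own time column, value column, and units.
NESTED = json.loads("""
{
  "site": "Hypothetical Lake",
  "tables": {
    "d18O": {"units": "permil",
             "age_yr": [100, 200, 300],
             "value": [-5.1, -5.4, -4.9]},
    "TOC":  {"units": "percent",
             "age_yr": [100, 300],
             "value": [3.2, 2.8]}
  }
}
""")


def flatten(doc):
    """Stack the per-parameter tables into a single flat, long table."""
    rows = []
    for param, table in doc["tables"].items():
        for age, value in zip(table["age_yr"], table["value"]):
            rows.append({
                "site": doc["site"],
                "param": param,
                "age_yr": age,
                "value": value,
                "units": table["units"],  # parameter metadata repeated per row
            })
    return rows


long_rows = flatten(NESTED)
for row in long_rows:
    print(row)
```

Flattening makes the data editable in a spreadsheet but repeats the parameter-level metadata (here, units) on every row, which is the trade-off between nested and flat layouts noted above.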
An ideal data format combines the strengths of all of the above data formats: the ability to view and edit data in spreadsheet software, the ability to store uncertainty information, the ability to easily visualize the data without complicated manipulation, and the ability to encode metadata that are critical for interpreting and contextualizing environmental data. In particular, we would argue that making a data format readable in spreadsheet software is critical to its widespread adoption, as most researchers have been trained in spreadsheet software and are comfortable using it.