FACETS is introducing a Data Science section as a platform to facilitate the dissemination of high-quality research focused on data. Given the explosion of data, and new data types, across virtually all areas of research endeavour in recent years, this is the right time for such a platform to emerge. Furthermore, considering the inherently interdisciplinary nature of data science, FACETS is the right venue. The assembled team of Subject Editors for the Data Science section gives excellent coverage of both theoretical and applied aspects of data science.
The advent of a FACETS Data Science section offers an opportune moment to consider data science as a field, as well as the various types of submissions we wish to see. A little historical context is provided before moving to the very broad interpretation of data science that FACETS will take. Although the activities associated with data science have been around for many years (see, e.g., Donoho 2017), it is useful to start by considering the origins of the term data science and its evolution into what we know by that name today.
The use of the term data science goes at least as far back as the 1996 meeting of the International Federation of Classification Societies (IFCS). As Hayashi (1998, p. 40) explained:
The roundtable discussion “Perspectives in classification and the Future of IFCS” was held at the last Conference under the chairmanship of Professor H.-H. Bock. In this panel discussion, I used the phrase ‘Data Science’. There was a question, “What is ‘Data Science’?” I briefly answered it. This is the starting point of the present paper.
The “present paper” referred to above is the contribution of Hayashi (1998, p. 40), wherein data science is described as follows:
Data science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results.
Hayashi (1998, p. 40) went on to specifically contrast data analysis and mathematical statistics:
… mathematical statistics have been prone to be removed from reality. On the other hand, the method of data analysis has developed in the fields disregarded by mathematical statistics and has given useful results to solve complicated problems based on mathematico-statistical methods (which are not always based on statistical inference but rather are descriptive).
Although the foregoing excerpts from Hayashi (1998) are important for historical context, they do not serve as an accurate contextualization of modern data science. However, before moving entirely beyond this important early work, it is useful to draw parallels, both historical and modern, with the sentiments captured therein.
Arguments for viewing data analyses through a more practical lens long predate the entry of the term data science into mainstream parlance. One of the more famous examples centres on the disagreement between Fisher and Gosset (the latter being “Student” of t-test fame) over the notion of significance. Many fascinating accounts of this disagreement have been given, including interesting work by Ziliak (2008). In brief, one might say that the practical and data-context-specific approach favoured by Gosset stood in contrast to the notion of a result being either statistically significant or not. The notion of statistical significance has, to date at least, stood the test of time, as eloquently summarized by Ziliak (2008, p. 200):
Yet against “Student’s” wishes and periodic warnings, it was this same extraordinary Fisher, “Student’s” younger friend and colleague, who invented and campaigned for the 5 percent rule of statistical significance. Today, Fisher’s preferred interpretation of “Student’s” test is customary if not enforced in most sciences, journals, and even courts of law.
Interestingly, the notion of a 5% level of statistical significance is central to the current discourse around p-values (see, e.g., Wasserstein and Lazar 2016; Goodman 2019; Kmetz 2019; Startz 2019; Valentine et al. 2019). It is notable, and perhaps inevitable, that the subject of the famous disagreement between Fisher and Gosset has gained increased attention at a time when data science, along with related attitudes that emphasize practical and data-context-specific considerations, has burgeoned into a field important enough to enter public discourse. It may yet be that, around a century later, Gosset wins the argument.
The relationship between data science and statistics was taken up by others shortly after the work of Hayashi (1998). For instance, Cleveland (2001, p. 25) wrote:
A very limited view of data science is that it is practiced by statisticians. The wide view is that data science is practiced by statisticians and subject matter analysts alike, blurring exactly who is and who is not a statistician.
Although the relationship between data science and statistics is important in understanding the origins of data science, the field is now very broad and reaches far beyond these origins. In a recent monograph, McNicholas and Tait (2019, pp. 1–2) discussed the relationship between statistics and data science:
On the one extreme, some might view data science—and data analysis, in particular—as a retrogression of statistics; yet, on the other extreme, some may argue that data science is a manifestation of what statistics was always meant to be. In reality, it is probably an error to try to compare statistics and data science as if they were alternatives.
McNicholas and Tait (2019, p. 2) went on to take the view that
… statistics plays a crucial role in data analysis, or data analytics, which in turn is a crucial part of the data science mosaic.
Beyond statistics, computer science plays a major role in data science today; e.g., machine learning has become central to data analytics, and data preprocessing can be very computationally demanding. The same is true of optimization, a field practiced by people with a variety of backgrounds, including management science and mathematics. Crucially, however, there is much more to data science than data analytics. For example, data security and privacy are issues of tremendous practical importance (e.g., Abouelmehdi et al. 2018). Data ethics, and the responsible use of data in general, are fundamental for data science (see Floridi and Taddeo 2016, for example). Communication, including effective data visualization, is another key part of the data science mosaic (e.g., Perkel 2018). The foregoing is far from an exhaustive list of data science topics, but it should serve to illustrate how broad the field has become. The breadth of data science is also apparent in the extremely diverse range of its applications, including work in cyber security (e.g., Buczak and Guven 2016), health care (e.g., Spruit and Lytras 2018), finance (e.g., Giudici 2018), and public policy (e.g., Matheus et al. 2018).
The breadth of the field of data science will be reflected in the papers sought by the FACETS Data Science section. However, before further discussion of the new section, it will be helpful to consider the relationship between big data and data science. As with data science, the term big data has no single universally accepted meaning. Some definitions are based on three, or more, words beginning with the letter V. Puts et al. (2015) give an interesting discussion of the three-V definition and also address the difference between big data and administrative data. Very roughly, the three-V definition characterizes big data in terms of size (volume), diversity (variety), and streaming (velocity). Whether one of the three Vs will suffice, according to this paradigm, for data to be big data is open to debate. However, this debate is not pursued here because to do so might distract from the key point, i.e., there is more to data science than big data. In fact, experience suggests that some of the most difficult data problems do not have any of the Vs. Examples include data that are too few, rather than too many, as well as datasets containing missing data. Both of these situations are difficult, and important, in data science.
However one may wish to define data science, the key must always be data. For a piece of work to be considered data science, we require only that data are at its heart. At the time of writing, the FACETS Data Science section has three subject areas: Data Science Theory and Methods, Data Science Applications, and Research Data Management. Because data must be at the heart of the work, Data Science Theory and Methods and Data Science Applications are very much related. Consequently, it might be difficult for authors to determine which of these subject areas is most suitable for a particular manuscript. As a guideline, if the novelty principally concerns methodology, then the Data Science Theory and Methods area should be chosen whereas, if the novelty lies in the application, then Data Science Applications is more suitable. The historically important description of data science given by Hayashi (1998), and the related discussion herein, might give one the impression that methodological statistics work is unwelcome; however, this is not the case. All theory and methodology will be given full consideration provided that data are at the core of the work. As one would expect, work concerning data storage, security, privacy, etc. should be submitted to the Research Data Management subject area. Papers on other topics in data science, such as data ethics and the responsible use of data, are welcome and should also be submitted to the Research Data Management area.
Acknowledgements
The author is grateful to the Subject Editors for the Data Science section, as well as the Editor-in-Chief and several colleagues, for their thoughtful comments on this editorial.