Why China needs data sharing to address its air-quality challenge
In the fight against air pollution, sharing data within China and building scientific consensus on data-driven policy recommendations are at least as important as data collection. The scale of publicly available data can have dramatic implications for the insights possible: a series of research studies in limited data contexts can lead to the wrong conclusions if the contexts do not encompass the broader picture or if they are not connected. Despite the clear benefits of data sharing, there are systemic hurdles in China. In the past, collection of air-pollution data was fraught with distortion. Further, promotion and tenure requirements disincentivize the data sharing that is essential to informative results. Even so, overcoming the data-sharing and collaboration problem is not enough on its own. As demonstrated by the California Air Resources Board and by air-quality management in Houston, scientists must then come together around that shared data to inform the policy-making process.
As data sharing and collaboration can take multiple forms, we first differentiate between forms of data sharing to elucidate our argument. Data sharing is the exchange of data between members of a group for an express purpose. It can take many shapes, involving varying numbers of researchers at the same or different institutions. A data-sharing program may or may not involve making data public. Making data public is the most extreme form of data sharing, inasmuch as it makes the data available to everyone. In between public data sharing and data shared between only two entities is a data-sharing club, where data are shared based on a set of common principles. Traditionally, an ambitious international collaboration addressing an issue with the complexity of air pollution would be the product of a small group with the ability to expand as the project develops. The types and scale of questions answerable with shared data can be significantly greater than those answerable with the data any single group could collect and analyse on its own. In the absence of incentives to the contrary, data sharing can conflict with traditional cultural standards of data collection and ownership in academia. Houston, Texas’s history of addressing air pollution, discussed later in this paper, is a prime case study in data sharing without making the data public.
Data pooling is the construction of a database to hold data that members, or in some extreme cases the general public, can access at will, rather than for a particular research or policy goal. It may be thought of as a more sophisticated extension of data sharing. Data pooling, when executed well, can have the advantage of ensuring validity and uniformity of data. Data pooling also has the potential to leverage shared resources to answer more complex, far-reaching questions, as well as allowing new questions to be addressed as scientific understanding develops, encouraging the emergence of new knowledge. Pooling data can be difficult to implement, as it requires buy-in from participants who may be accustomed to holding data independently to make the most individual use of it. It also requires significant maintenance and management to build, curate and continually validate such a database. Databases such as those created by the Convention on Long-range Transboundary Air Pollution (CLRTAP) or the Aerosols, Clouds, and Trace gases Research Infrastructure (ACTRIS) – both of which are discussed later in the paper – are examples of data pooling [1,2].