Re-using data and results from other people’s research is crucial to consistent and cumulative research, yet it is also often a frustrating experience. Many raw data and results are not available at all, and those that are available are often hidden in PDFs or books and need to be extracted manually. Access to research data in industrial ecology, in particular, is very unsatisfactory [1]. IE researchers using data from literature sources often spend considerable time finding, extracting, and formatting data to use in their assessments, time that could instead be invested into quality control and analysis of model results. We all know the situation where we extract data from a literature source by hand, knowing that at least ten people must have done the same thing before us. Such processes are slow, prone to errors, frustrating, and a bad allocation of research funding.
The barriers to better data sharing are manifold [2], as are the opportunities for what could be done if only data were available more easily [1].
Another, related problem is that, to this day, data in industrial ecology are commonly seen as existing within the domain of particular methods or models, such as input-output, life cycle assessment, urban metabolism, or material flow analysis data. This artificial division of data by method contradicts the common phenomena described by those data: the objects and events in the industrial system, or socioeconomic metabolism. A consequence of this scattered organization of related data across methods is that IE researchers and consultants are often not aware of the different data sources that exist, or don’t realize that two seemingly independent datasets actually point to the same reference data source. Often, data compilers use method-specific system definitions and boundaries (for example, when measuring building area or the metal concentration of ores), which leads to assumptions that are often not properly documented and cause errors when the data are re-used by others.
The story
The data model and database topics have been with me ever since I started working in the field, back then with Daniel Müller at NTNU, where we built a stock-and-flow database for the research group. For several years after that, I refused to work on the database topic, fearing that such work could easily become a wasted effort, especially as a business model for using the database was lacking. But when tasked with building up the industrial ecology research group in Freiburg, the topic came up again, as I needed to develop infrastructure for storing and sharing the group’s data. Together with Mahadi, our programming student, we sat down and created a stock-flow database on our web server in 2017 and fed it with the steel cycle dataset [3]. That was a really nice tool, but too limited in its scope. We needed something bigger.
While working on a new software platform for material cycle modelling [4], I learned quite a lot about how to structure vastly different data (material composition of products, per capita stocks, energy consumption of processes, product lifetimes, yield coefficients, shares of something in something, …) so that they can be stored in a common data template and conveniently parsed and used by the software. Half a year after the launch of the stock-flow database, it clicked, and I managed to formulate a general data model for socioeconomic metabolism and to develop a data template for it, to be used in an upcoming data-intensive project (see the video lecture on the model [5]).
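To make the underlying idea concrete, here is a minimal sketch of it in Python. It is my own illustration, not the actual IEDC schema: the aspect names and figures are made up. The point is that no matter how different two datasets look, each data item can be reduced to a numerical value with a unit, indexed by a set of dataset-specific aspects.

```python
# Illustrative sketch only: a generic data item as value + unit + aspects.
# The aspect names and figures below are hypothetical, not IEDC entries.
from dataclasses import dataclass, field

@dataclass
class DataItem:
    value: float              # the numerical fact
    unit: str                 # its unit of measurement
    aspects: dict = field(default_factory=dict)  # e.g. region, time, product, material

# A per-capita stock figure and a product material composition entry are
# structurally identical; only their aspects differ:
stock_item = DataItem(value=9.0, unit="t/cap",
                      aspects={"region": "some country", "time": 2008, "material": "steel"})
composition_item = DataItem(value=20.0, unit="kg/item",
                            aspects={"product": "passenger car", "material": "copper", "time": 2015})
```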
Soon, it became clear that implementing the data model in a bespoke platform to exchange industrial ecology datasets would not be that much work, so I started procrastinating on other things to move this pet project forward. I invited Niko Heeren to join the team and together, we developed a relational database built on the general data model, a set of data parsers, and a user interface to the database, all of which are open source and can be implemented by individual researchers, groups, institutions, or the entire community [6].
In the latter case, the community-wide implementation, one can speak of an industrial ecology data commons (IEDC). After half a year of work (and ten years of meditation on the problem), we are now unveiling an IEDC prototype containing a diverse set of datasets from our own work and from the literature:
http://www.database.industrialecology.uni-freiburg.de/
On behalf of the project team, I would like to invite you to explore the database and have a look at the data model!
The prototype
The most salient feature of this database is that it contains close to 20 different data types, from process inventories to product material composition, Sankey diagrams, and historic population figures. All data that’s needed for socio-metabolic research. All in one common data format, all in one common database.
The smallest available datasets each contain only one data item; they are typically individual numbers describing facts in the text of reports or papers. The largest available dataset currently has 63,565 data items: the in-use stock of steel in 4 product groups, 146 countries, and 109 years. There are much larger datasets out there, like the EXIOBASE 3.4 time series of MRIO tables. Such data can be inventoried in our database as well, but it does not make sense to actually insert them, as they already come in an established data format.
To insert data into the database, the researcher needs to identify the data type [7] and the type-specific aspects of the dataset she wants to insert. The data description and the data are then entered into an Excel template, where the data can be formatted as a table with multi-index rows and columns or as a list of tuples. A parser then uploads the data to the database once all checks are passed. The entire extraction, formatting, and uploading process typically takes from about two hours for a new dataset down to about 30 minutes per dataset when several similar datasets are extracted and uploaded. In cases where reformatting the data into the Excel templates is too cumbersome, a custom parser can be written. The data are first inserted into a review database and, if successfully uploaded, moved to the main database.
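To illustrate the list-of-tuples route, the following sketch reads a template sheet with pandas and runs basic completeness checks before the data would be handed over to the upload step. The sheet name, column names, and file name are assumptions for illustration only; the actual templates and parsers in the IE_data_commons repository [6] define the real conventions.

```python
# Minimal sketch of checking a list-of-tuples Excel template before upload.
# Sheet, column, and file names below are assumed for illustration only.
import pandas as pd

def load_and_check(template_path: str) -> pd.DataFrame:
    df = pd.read_excel(template_path, sheet_name="Data")  # hypothetical sheet name
    # Each row is one data item: the dataset's aspect columns plus 'value' and 'unit'.
    required = {"region", "time", "value", "unit"}         # hypothetical aspect columns
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Template is missing columns: {sorted(missing)}")
    if df["value"].isna().any():
        raise ValueError("Empty data values found; fill or remove these rows.")
    return df

# Example use (file name is hypothetical):
# items = load_and_check("steel_stock_template.xlsx")
# ...pass 'items' on to the review-database upload routine.
```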
Using a database such as the one described here has several advantages: First, of course, the easy access to existing data. Second, the opportunity to upload one’s own data and have them stored for re-use (documentation purposes, funding requirements, higher impact of one’s work). Third, the increase in information and the quality assurance that occur when data are structured and formatted according to the data model.
The future
Now that the prototype is working, several questions arise: 1) Is this type of database what we need to facilitate data sharing in our community? 2) How can we modify and scale up the prototype, develop a business case, and make sure that data are shared legally? 3) Where can we obtain the financial and personnel resources to further develop our data infrastructure?
We need to find answers to these questions. Many scholars will need to contribute, and I hope that there will be a lot of resonance within the industrial ecology community. Your opinion and experience are needed! Please send your thoughts and comments to us: in4mation~at~indecol.uni-freiburg.de
References
[1] Hertwich EG, Heeren N, Kuczenski B, Majeau-Bettez G, Myers RJ, Pauliuk S, et al. Nullius in Verba. Advancing Data Transparency in Industrial Ecology. J Ind Ecol. 2018;22(1):6–17.
[2] Pfenninger S, DeCarolis J, Hirth L, Quoilin S, Staffell I. The importance of open data and software: Is energy research lagging behind? Energy Policy. 2017;101:211–5.
[3] Pauliuk S, Wang T, Müller DB. Steel all over the world: Estimating in-use stocks of iron for 200 countries. Resour Conserv Recycl. 2013;71:22–30.
[4] https://github.com/IndEcol/ODYM
[5] https://youtu.be/1aCynUvSVRY
[6] https://github.com/IndEcol/IE_data_commons
[7] http://www.database.industrialecology.uni-freiburg.de/resources/IEDC_DataTypes_Overview.pdf