Open Data, Open Standards and Open Source: field notes from the SeNeReKo project

[A guest post by Frederik Elwert, post-doctoral researcher at the Center for Religious Studies at Ruhr-University Bochum, Germany. He can be found on Twitter as @felwert.]

The Center for Religious Studies (CERES) at Ruhr-University Bochum is a great place to study religion, with a variety of scholars from different backgrounds contributing to interdisciplinary research projects. But despite its innovative research, it still inherits much of the conservatism of its constituent disciplines, Religious Studies, Theology, Indology, Islamic and Jewish Studies, and others. When we started with SeNeReKo, a small digital humanities project, in 2012, we got an opportunity to learn many new lessons on open scholarship.

I am glad to have the opportunity to share some of my ideas on this topic here on the OA-OA blog. I will not talk about Open Access, probably the last step in the research cycle, although colleagues of mine at CERES recently started collecting experiences with that as well, launching a small Open Access online journal called Entangled Religions last year. Instead, I will focus on the prerequisites of open research, which play an even greater role in the digital humanities: Open Data, Open Standards, and Open Source.

Open Data

SeNeReKo is short for the rather lengthy title “Semantic and social network analysis as a tool to study religious contact”. That title highlights the methodological focus of the project. In fact, when the German ministry of education and research funded this project (along with 23 others), its intention was to use previous years’ digitisation efforts as a starting point for advances in the analysis of existing collections. The data is there, now what to do with it? It was important for us not to have to digitise sources ourselves, something we would not have been able to do within the project’s time frame. But we still learned a lot about the difference between “existing data” and “reusable data”.

There are plenty of religious sources available online. A plethora of websites allows one to read and query databases of religious texts. But terms of use differ a lot, if they are made explicit at all. This is striking, given that most of these texts are hundreds (or thousands) of years old and should be seen as a common good. Yet in many cases, access to specific translations, editions or collections is restricted. To re-use data, however, one needs full access to it. This is a legal as well as a technical problem.

On the legal level, one must have the right to access, store and process data. I am not an expert in this area and will not elaborate on differences between countries, but this is often covered by academic freedom. For digital projects, however, the right to re-distribute is more crucial: in presenting their research results, digital projects are not limited to quoting small snippets of the text. They can also display a complete re-sampling of the original data, providing not only results of interpretation, but also tools for interpretation. This requires permission to present and thus to re-distribute the data. (The alternative is to provide only a very limited “distant” view of the data that does not allow one to re-assemble the original text, as in Google’s Ngram Viewer.)

But there is also a technical challenge: in order to re-use data in innovative ways, one needs access that goes beyond what the search interface of the collection’s website allows. If we want to re-model religious texts as networks of meaning, then we need to have all the information the source edition provides in a machine-readable form.

In our case, we were particularly interested in the Buddhist Pali Canon and in Ancient Egyptian sources as two exemplary cases. For both, digital editions exist that are regularly used by scholars in the respective fields. The Pali Canon is available in the Chattha Sangayana edition. This edition used to be sold on CD, but for some years now it has been freely available online. More interestingly, the website makes the machine-readable source files available in an (albeit outdated) TEI-XML format. However, the Vipassana Research Institute, which publishes this edition, does not explicitly state under which terms the data are available. After contacting them, we found them to be very liberal and open, but the case shows how missing licenses make it difficult to re-use data, even if it is published with the implicit intention of allowing various uses.

A collection of Ancient Egyptian texts is available from the Thesaurus Linguae Aegyptiae project of the Berlin-Brandenburg Academy of Sciences and Humanities. Their website provides quite sophisticated search options and translations, and links each word in the texts to the project’s dictionary. But the texts are available only through the interface the site provides. There is no way to download the underlying source data in a machine-readable format for further analyses. We found the Academy to be quite responsive to our requests, but it took lengthy negotiations, with legal counsel involved on both sides, until a data sharing contract was signed. As part of the contract, the Academy provided us with an XML export of their internal database, giving us access to all the information hidden behind the web interface.

For open scholarship to thrive, we need open data. In my opinion, large collections and editorial projects in particular cannot regard their data as a treasure that must be protected. These institutions are rather infrastructure providers: they should allow others to use their data in creative ways in order to generate new insights. Open licenses, as they increasingly emerge for Open Access publications, are also needed for open data.

Open Standards

Even after receiving the raw data from the Academy, we noticed that having the data is not enough: You also have to be able to read them. Reading Old Egyptian was not the issue, since we have competent Egyptologists in the project. But reading a project-specific database schema invented some 20 years ago with all linguistic information encoded using numeric codes proved to be an issue. It took us almost two years of data archaeology to decipher the format, which was only possible with the extensive help of researchers from the Academy who were familiar with the encoding schema.

When we planned how to deal with these project-specific formats, we decided to convert all the files to a standard format before actually doing our analysis. This allowed us to work on a common format for both our corpora, instead of adapting our methods to the specifics of each corpus’ format. In addition, it allowed us to develop our software in such a way that it can be applied to other texts and languages beyond our project with little or no adaptation.
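To illustrate what such a conversion step can look like, here is a minimal Python sketch. It is not the project's actual converter: the input elements (`sentence`, `token`, `form`), the numeric codes and the `POS_CODES` table are all invented for this example and do not reflect the Academy's real schema. The output uses the TEI elements `s` and `w` with `pos` and `lemma` attributes, which is only one of several ways TEI allows this information to be encoded.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical code table: the real database used its own numeric codes,
# which are not reproduced here.
POS_CODES = {"1": "noun", "2": "verb", "3": "adjective"}


def convert_sentence(source_xml: str) -> ET.Element:
    """Convert one (invented) project-specific <sentence> into a TEI <s>."""
    source = ET.fromstring(source_xml)
    sentence = ET.Element(f"{{{TEI_NS}}}s")
    for token in source.iter("token"):
        word = ET.SubElement(sentence, f"{{{TEI_NS}}}w")
        word.text = token.findtext("form", default="")
        code = token.get("pos", "")
        # Keep unknown codes visible instead of silently dropping them.
        word.set("pos", POS_CODES.get(code, f"unknown:{code}"))
        lemma = token.get("lemma")
        if lemma is not None:
            word.set("lemma", lemma)
    return sentence


# Placeholder input; word forms are dummies, not real corpus data.
SAMPLE = """
<sentence>
  <token pos="1" lemma="alpha"><form>alphas</form></token>
  <token pos="2" lemma="beta"><form>betas</form></token>
  <token pos="9"><form>gamma</form></token>
</sentence>
"""

if __name__ == "__main__":
    # Serialise with TEI as the default namespace.
    ET.register_namespace("", TEI_NS)
    print(ET.tostring(convert_sentence(SAMPLE), encoding="unicode"))
```

The point of the sketch is the design choice rather than the details: once numeric codes are resolved into readable values in a shared format, any later analysis code only has to understand that one format.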

Still, finding a common format in practice is not always easy. Often, there are no official standards, but rather a collection of de-facto standards, sometimes competing with each other. And using a standard file format does not automatically mean that data can be exchanged between projects and tools. The Pali Canon was already available in a TEI XML format, albeit in an outdated version that still required some work. So using TEI as the basis for converting our corpora seemed to be the obvious choice, also given how widely it is used in the digital humanities. But TEI can be seen as a family of formats rather than as a definite standard, with many alternative ways of encoding the same thing. So we learned that it is not only important to have a standard format, but also to agree upon how the standard should be interpreted. This can hardly be achieved on a general level; rather, it takes a limited community of people who share some goals and who agree on a way to build compatible editions. The EpiDoc initiative is a good example of such a community. For Egyptologists with more linguistic than epigraphic interests, an initial core group gathered in 2013 in Liège. People from the Thesaurus Linguae Aegyptiae, Berlin, the Ramsès project, Liège, the Rubensohn project, Berlin, and some others (including us) started to discuss different aspects of data interoperability. This discussion continues, so we hope to achieve a certain level of compatibility in the near future.

All this goes beyond the core work of our project. We could have used the data as they were, transformed them according to our needs, and done our analyses. But the data transformation took so much time that we felt obliged to spare future researchers these hassles. In an ideal world of open scholarship, a project like ours could just access the data in a common format and re-use it without major modifications. Documentation is a large part of this, but so is collaboration: interoperability cannot be achieved by one party alone; it requires agreement between data providers and data consumers. In this light, open standards are not just a matter of technical specifications, but of building a community that actively engages in dialogue.

Open Source

In the context of the digital humanities, but also in other disciplines that apply quantitative methods, analyses often involve writing code. Developing new analytical methods requires implementing them in code. But even applying existing methods to one’s material can be expressed in code: instead of browsing the menus of a statistical application like SPSS, the same analyses can be performed using code-driven statistical environments like R, and even SPSS allows one to run analyses programmatically rather than through its graphical user interface. Instead of analysing networks using the buttons that Gephi provides, one can also run network analyses from Python.
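As a rough illustration of what running a network analysis from Python can mean, the following sketch builds a simple word co-occurrence network and counts each word's neighbours. It is a toy example, not the method used in SeNeReKo: the function names, the choice of sentence-level co-occurrence and the sample sentences are all assumptions made for this example, and it uses only the Python standard library rather than a dedicated network package.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(sentences):
    """Count how often two distinct words share a sentence (undirected)."""
    edges = Counter()
    for sentence in sentences:
        # Sort so that each pair is always stored in the same order.
        words = sorted(set(sentence))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Number of distinct neighbours for every word in the network."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {word: len(others) for word, others in neighbours.items()}


# Invented toy sentences, already split into words; not real corpus data.
SENTENCES = [
    ["monk", "teaches", "dhamma"],
    ["monk", "meets", "king"],
    ["king", "hears", "dhamma"],
    ["monk", "dhamma"],
]

if __name__ == "__main__":
    network = cooccurrence_network(SENTENCES)
    for word, count in sorted(degree(network).items()):
        print(word, count)
```

Because every step is written down as code, the same analysis can be re-run on a different corpus, or inspected by someone else, without retracing a sequence of clicks.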

There are two aspects that I think are important when thinking about code in the context of open scholarship: sustainability and reproducibility. Regarding the first point, almost the same arguments apply as with open standards: of course, it is a good thing to release code that is produced in research projects under an open source license. But a truly sustainable open source project also requires writing documentation and building a community. We are trying to achieve this, at least to a certain extent, with our software for text network analysis, but only time will tell if we succeed.

But I think it is still relevant that we post our code online. That way, we allow other researchers to reproduce the analyses we performed, and possibly find errors that we overlooked. Ideally, every article we publish would be accompanied by the code for the underlying analyses. Since we constantly refine our methods, we make it possible to jump back to the state of our code that drove the analysis for a specific paper (see this example for my paper at the DHLU2013 conference). Anybody who has ever tried to reproduce statistical analyses following only a description of the process knows what a difference it makes to actually be able to inspect the analysis step by step. But this also affects publication strategies: new publishing formats and platforms like GitHub and Notebook Viewer for analyses in Python, or RPubs for analyses in R, allow one to publish reproducible articles that embed the code that drove the analysis. But they are currently used mainly by a very technical audience and are not integrated into general open access publishing models. Ideally, the publication of articles and the publication of code would be intertwined, allowing for open and reproducible scholarship.

Conclusion

Open Access is but one building block of open scholarship, albeit a very central one. In order to make scholarship fully open, other components of the research cycle should be open as well. This affects open data (like freely available editions of historical texts), open standards (like TEI for text encoding), and open source (for reproducible analyses). Much research in Religious Studies and Theology is textual scholarship, and it would benefit a lot from open access to its sources. Still, I feel that there is a lot left to improve, both technically and culturally.

Our work in the SeNeReKo project brought us into contact with many of these questions that we had not dealt with before. It was an opportunity to learn, to assess what is already possible, and to contribute our share to improving the situation. We know there is still a lot we can improve in our own scholarly practice, and we have not yet achieved the level of openness that we strive for. But our research would not have been possible without access to data, standardised ways to exchange data, and open source software. I believe there is much to gain for Religious Studies and Theology on the way to open scholarship.
