09 Feb A text and data mining wish list for GLAM

Europeana Research recently hosted a guest blog post by Beatrice Alex, Research Fellow at the School of Informatics at the University of Edinburgh and a member of the Edinburgh Language Technology Group. Beatrice describes her ‘Wish List to GLAM’ for providing access to data for text mining, and we thought this might be of interest to the broader TDM community.

Beatrice has been part of several digital humanities projects using TDM, including Trading Consequences and Palimpsest, which used different textual sources from archives and libraries.

An illustration of the life of a researcher using TDM: Beatrice's delivery of the British Library Nineteenth Century Books Collection on hard disk for the Palimpsest project.

“In the case of Trading Consequences, a project on mining information on nineteenth century commodity trading in the British Empire, we were able to text-mine, among other collections, summaries from the Kew Garden’s Directors’ Correspondence archive. In Palimpsest, a project on mining Edinburgh’s literary landscape, the largest datasets we obtained access to were a collection of books from HathiTrust and the Nineteenth Century Books Collection from the British Library Labs.”

While many technical barriers affect TDM, one issue Beatrice highlights comes before any of them: researchers first have to find out that a resource exists before they can ask for access to it.

“Getting access to such datasets required us to know about them to begin with. In many cases, our project ideas are inspired and enabled by relevant data. Applications for research funding are always much stronger if we can provide evidence for being able to work with a given dataset. If galleries, libraries, archives and museums (GLAMs) are interested in sharing their available datasets widely for text mining and other research purposes, then they need to be proactive in publicising and explaining how to get hold of them.”

Beatrice goes on to describe her experiences in different projects with obtaining legal permission to mine and with accessing data in forms ranging from hard disks of scanned book images, to Excel spreadsheets and attachments, to well-structured APIs. Data formats are also challenging to navigate, but with enough documentation a great deal is possible. Beatrice wraps up with a wish list for GLAM institutions interested in sharing their data for TDM, and we thought this was well worth sharing with a wider group of TDM practitioners, content providers and other stakeholders.

  1. publicise the data you would like to be used for text mining purposes
  2. give us information on what a collection or dataset contains (metadata, content, size, format)
  3. tell us how we can get hold of it
  4. find a mechanism to share the data easily
  5. if there are copyright issues, draw up a template agreement (which can be modified if necessary)
  6. provide the full text if you have it (not just the images), ideally in a consistent and well-formed format
  7. provide document-level metadata, if possible
  8. provide a URL for each source document, if possible, so that we can link back to you
  9. Merry data sharing.
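
To make items 2, 3 and 5 to 8 a little more concrete, here is a minimal sketch of how a data provider might describe a collection in a machine-readable way. It is only an illustration and not something proposed in Beatrice's post: the class names, the field names (`how_to_access`, `source_url` and so on) and all example values are assumptions invented for this sketch, and a real institution would more likely follow an established metadata standard.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DocumentRecord:
    """One source document (wish-list items 6 to 8). Field names are hypothetical."""
    doc_id: str
    title: str
    source_url: str  # stable URL so researchers can link back to the institution
    full_text: str   # the text itself, not only the page images


@dataclass
class DatasetDescription:
    """Collection-level information (wish-list items 2, 3 and 5)."""
    name: str
    contents: str        # what the collection contains
    size_documents: int  # number of documents
    data_format: str     # e.g. "plain text" or "XML"
    how_to_access: str   # how a researcher can get hold of the data
    licence: str         # licence or template agreement that applies
    documents: list = field(default_factory=list)


def missing_fields(description: DatasetDescription) -> list:
    """Return names of collection-level text fields that were left empty."""
    return [name for name, value in asdict(description).items() if value == ""]


# Invented example values, for illustration only.
example = DatasetDescription(
    name="Example letters collection",
    contents="Summaries of correspondence",
    size_documents=1,
    data_format="plain text",
    how_to_access="",  # left blank on purpose to show the check
    licence="Example template agreement",
    documents=[
        DocumentRecord("doc-1", "Example letter",
                       "https://example.org/doc-1", "Dear Sir ..."),
    ],
)

print(missing_fields(example))                # lists what still needs filling in
print(json.dumps(asdict(example), indent=2))  # a description that could be published
```

A description like this could sit alongside the data itself, so that researchers know what a collection contains and how to request it before they write a funding application.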