A Checklist to Publish Collections as Data in GLAM Institutions

Large-scale digitization in Galleries, Libraries, Archives and Museums (GLAM) has created the conditions for providing access to collections as data, opening new opportunities to explore, use and reuse digital collections. Strong proponents of collections as data are the Innovation Labs, which have provided numerous examples of publishing datasets under open licenses so that digital content can be reused in novel and creative ways. Within the current transition to the emerging data spaces, clouds for cultural heritage and open science, identifying practices that support more GLAM institutions in offering datasets becomes a priority, especially for smaller and medium-sized institutions. This paper answers the need to support GLAM institutions in the transition towards publishing their digital content and introducing collections as data services, which will also help them contribute efficiently to future data spaces and cultural heritage clouds. It offers a checklist that can be used both for creating and for evaluating digital collections suitable for computational use. The main contributions of this paper are i) a methodology for devising a checklist to create and assess digital collections for computational use; ii) a checklist to create and assess digital collections suitable for use with computational methods; iii) the assessment of the checklist against the practice of institutions innovating in the Collections as data field; and iv) the results obtained after its application, together with recommendations for the use of the checklist in GLAM institutions.


INTRODUCTION
During the past few decades, Galleries, Libraries, Archives and Museums (GLAM) have provided access to their collections and materials in digital format. Organisations have been exploring the benefits of adopting the concept of Labs to publish under open licenses in order to reuse the digital collections in innovative and creative ways [40]. Advances in technology have paved the way to publish digital collections suitable for computational use, known as Collections as data [49]. Furthermore, with the emergence of initiatives such as the common European Data Space for Cultural Heritage1 and the European Cultural Heritage Cloud,2 the need is even more urgent to incorporate Collections as data activities into the day-to-day operations of cultural heritage institutions, in combination with building the necessary capacities to proactively contribute to such initiatives.
Many GLAM organisations provide their contents for computational use in several ways. For instance, the Data Foundry at the National Library of Scotland provides metadata and digitised collections using a CC0 license.3 The Library of Congress provides access to information about historic newspapers and selected digitised newspaper pages as JSON, Linked Data and bulk data.4 The Bibliothèque nationale du Luxembourg provides access to a newspapers dataset with rich metadata using international XML standards such as Metadata Encoding and Transmission Standard (METS) and Analyzed Layout and Text Object (ALTO).5 These initiatives can encourage other GLAM organisations to publish their collections for computational use by following best practices and guidelines. However, as there is a wide diversity of approaches for publishing digital collections, organisations need proper assistance in selecting the approach best suited to their goals while, at the same time, considering researchers' and other reusers' needs. Several aspects might be considered in terms of how datasets are made available, including metadata formats (e.g., MARCXML, Dublin Core, JSON, etc.), data cleaning, licensing or documentation about the datasets.
This paper aims to define a checklist that can be used both for creating and for evaluating digital collections suitable for computational use that are published by relevant institutions in the GLAM sectors. This approach provides an easy-to-apply method to encourage small and medium-sized organisations to publish their digital collections as Collections as data. The main contributions of this paper are: i) a checklist to create and assess datasets suitable for use with computational methods; ii) the application of the checklist; and iii) the results obtained after applying it.
The paper is organised as follows: after a brief description of the state of the art in Section 2, Section 3 describes the methodology used to build the checklist. The application of the methodology and the results are shown in Section 4. The paper concludes with an overview of the methodology and future work.

RELATED WORK
The use of Artificial Intelligence and Machine Learning in the GLAM sectors has become a relevant topic, aiming at applying new methods to the rich digital collections made available by the organisations [15, 47, 49, 57]. In this sense, new initiatives on advancing the use of Artificial Intelligence have emerged, such as Artificial Intelligence for Libraries, Archives and Museums (AI4LAM)6 and NewsEye.7 Several aspects regarding data quality and transparency in terms of how the data is made available to the public (e.g., license, format, access, etc.) have become crucial for researchers willing to reuse the contents [13]. Many organisations, such as the Bibliothèque nationale de France, the British Library and the Rijksmuseum, focus on the application of new and advanced technologies to their digital materials [5, 21, 37]. In addition, organisations have explored the benefits and challenges of using Application Programming Interfaces (APIs) to make their digital collections available, as well as advanced vocabularies to describe the metadata [31, 41, 45, 54]. Moreover, features such as data cleaning and enrichment, the use of expressive controlled vocabularies instead of traditional metadata formats, the use of advanced and widespread APIs, and the use of common and well-known open licenses have become crucial to facilitate the reuse of the contents. These technological innovations are relevant to the efforts in building a data space for cultural heritage and need to meet the needs of different types of users [22].
Despite all these efforts, there is still room for improvement regarding the publication of digital collections suitable for computational use [13]. Adopting these new trends from scratch might be difficult for organisations for several reasons, e.g., the absence of dedicated personnel, a limited budget or the lack of advanced technical skills.
In this context, a checklist provides a powerful tool, as it presents a list of tasks, activities and behaviours to follow in order to achieve a systematic result. Checklists can help organisations avoid common mistakes and adopt best practices. In this way, the creation of checklists has emerged as an innovative method to convey best practices and guidelines. Several initiatives have tackled the definition and creation of checklists in other domains in the past, for instance for improving the reliability of artificial intelligence systems across their life cycle [27] and for evaluating software process line approaches [1]. One proposal describes a publication workflow as a checklist, including aspects such as source data management, reproducible data transformation, version control, data documentation and publication [50]. Other initiatives include a checklist for developing a machine learning project based on cultural heritage data [34] and a checklist for a Data Management Plan made available by the Digital Curation Centre [18].
Regarding Collections as data, previous work has proposed a methodology to select datasets for computationally driven research, applied to Spanish text corpora in order to encourage Spanish and Latin American institutions to publish machine-actionable collections [13]. A compilation of actions that can be taken to stimulate conversation, and to encourage and generate ideas and new possibilities concerning the publication of digital collections suitable for computational use, was recently published [48].
The use of advanced technologies such as Artificial Intelligence in combination with the rich data made available by GLAM organisations has raised important ethical issues [7, 51]. These include, for example, control over the data, including terms of service requirements, the subsistence of the organisation sharing the data, the anonymous release of data and the threat of potential re-identification, and awareness of potential uses of the data. While clearer guidelines and better coordination are needed [7, 48], libraries and universities are in a position to play a crucial role in education concerning unknown and future ethical issues.
These efforts provide an extensive demonstration of how to make digital collections available for computational use, giving particular attention to data quality, planning and experimentation. Nevertheless, to the best of our knowledge, none of the work to date provides an easy-to-follow and robust checklist to publish Collections as data in GLAM institutions. The checklist proposed here intends to encourage small- and medium-sized institutions to adopt the Collections as data principle in their daily workflows, following best practices and guidelines.

A CHECKLIST TO PUBLISH COLLECTIONS AS DATA IN GLAM INSTITUTIONS
Making digital collections available for computational use is a complex process. Examples in the literature follow different approaches, making it difficult to adopt and standardise the process. In this sense, institutions may face challenges when addressing the adoption of Collections as data due to the lack of expertise, guidelines and best practices. This section introduces the methodology used to create an easy-to-follow checklist to publish collections as data in the GLAM sector.
The checklist was constructed in four stages: i) relevant aspects for publishing digital collections suitable for computational use were identified based on existing implementations of the Collections as data principle and on a literature review; ii) potential issues and needs regarding how to make collections available as data were gathered from practitioners and researchers in GLAM and research institutions; iii) the checklist was built by synthesising the literature results and the issues and needs obtained in the previous steps; and iv) the checklist was tested and applied both as a tool for assessing a selection of datasets made available by GLAM institutions as proof of concept and as a supporting tool for creating collections as data.
The checklist proposed in this work is intended to encourage GLAM institutions to adopt Collections as data as a concept in their daily workflows. In addition, it could be distributed as additional and transparent information in the datasets for potential reusers and researchers.

Previous work based on data published by GLAM institutions
The first step is based on a literature review encompassing existing work on publishing checklists in different domains and data management plans, institutional reports from GLAM organisations about publishing digital collections for the public, and projects based on the reuse of digital collections with innovative and creative approaches. In addition, repositories (e.g., ACM Digital Library and dblp) were searched for recent research articles about the impact and reuse of digital collections in GLAM institutions. Appendix A shows the list of studies included in the review. The items were classified into five categories, as shown in Table 1.
Table 1. Literature review to create the checklist, classified into categories.

Identifying issues and information needs when implementing the Collections as data principle
We conducted an observational study regarding the knowledge about and uptake of the Collections as data principle in GLAM institutions using an online survey during the period 10 to 30 October 2022. Participation was voluntary. We invited participants to provide the name of their institution and contact information while leaving open the option of anonymity. Consent was obtained from all respondents to include the survey results anonymously.
A first core set of questions aimed to understand the respondents' existing experience with Collections as data, including the issues encountered in the early implementation phases, and to collect examples of datasets already published. A second core set of questions was included to identify to what extent the respondents felt sufficiently informed when starting to implement the Collections as data principle and to understand their information needs. Table 2 shows the questions used in the form sent to the participants.
The forty-three unique responses came from GLAM and research institutions with a geographical spread across the USA (26) and Europe (14), complemented by one Asian and two fully anonymous contributions. Figure 1 shows that over half of the respondents indicated a low level of experience with preparing collections as data, while nine were significantly experienced or experts. Similarly, the majority of respondents felt ill-informed when starting work on collections as data, with only two feeling very well informed (Figure 2).
The core issues encountered when creating collections suitable for computational use were those of data preparation and dataset structure, as well as matters of licensing and usage restrictions (Figure 3). Data preparation is hampered by data quality issues, particularly regarding OCR data, but also by incoherent data and inconsistencies, e.g., in the resources' descriptive metadata. Decisions on ontologies, vocabulary reconciliation, identifiers, overall structuring and packaging are all identified as obstacles when creating the dataset structure.
Figure 4 reveals that institutions primarily name access to examples of implementation, as well as both specific information on data preparation and general know-how about how to create collections as data, as knowledge that would have simplified their uptake of the Collections as data principle. A register of collections as data projects and descriptions of the dataset creation processes would inspire and support institutions with no relevant experience in the initial implementation stages. Specific sought-after information on data preparation includes information on standards and best practices relating to file formats, metadata and data structure, and on how to assess and (after selection) normalise the available data. General know-how should entail a user-friendly guide to tools, processes, decisions and necessary policy choices, and give insight into their implications for data modelling, data mapping and data reconciliation. On an organisational level, resources such as use cases showcasing the added value of collections as data could leverage strategic institutional support and encourage colleagues' and users' involvement.
The survey further discloses a primary need for information and guidelines regarding the preparation of data and the structuring of datasets. Documented examples of datasets could significantly support GLAM institutions. Similarly, detailed accounts of the creation process for specific existing collections as data could inspire and support institutions in the decisions and actions to be taken when developing their own data for computational use. There is also a call for access to user-friendly guidelines and general know-how. The checklist explicitly intends to provide an easy-to-use tool in this context.

The checklist
Based on the previous steps, a checklist to publish Collections as data was created, as shown in Table 3. A preliminary version was presented and discussed during an international webinar organised by the International GLAM Labs Community [11].8 An overview of each item is described below.

3.3.1 Provide a clear license allowing reuse of the dataset without restrictions.
The adoption of licenses that allow reuse will strengthen and expand the role of GLAM institutions in innovative scholarly communication. The use of permissive licenses is crucial to ease the understanding of the reuse possibilities and to facilitate the reuse of the digital collections [10, 48]. Researchers expect a clear and reliable statement about the terms under which the dataset can be used.
During the past few years, organisations have started to publish and promote the use of metadata and digital objects in their collections (or parts of them) under open licenses [13, 23, 36, 37, 43, 44]. While there are initiatives to develop national open licenses,9 Creative Commons licenses are a popular and widely used tool. Some examples of licenses, statements and tools used by GLAM institutions are the following:
• The Creative Commons Public Domain Mark indicates that data is in the public domain. For instance, the Moving Image Archive published by the Data Foundry at the National Library of Scotland is published under this tool.10
• The Creative Commons Public Domain Dedication (CC0) removes copyright restrictions on the use of the content. For instance, the British Library and the Library of Congress provide a selection of datasets published under this tool.11
• CC BY: data can be used when giving appropriate credit to the source. For instance, the organisational data provided by the National Library of Scotland12 is published under a CC BY license.
• National standards: other approaches are based on national licenses that describe how the data can be reused. For example, the Bibliothèque nationale de France made data available to the public on data.bnf.fr under the French Open License, which enables reuse and requires attribution.
• The Rights Statement "No known copyright" indicates that a work is likely to be free from copyright restrictions but that its public domain status cannot be entirely confirmed.
In addition, publication platforms such as GitHub and Zenodo allow users to select an appropriate license when publishing the contents. License information can be provided as textual information, including a link to the appropriate license,13 or using metadata fields to describe copyright details, such as the properties dc:rights and dcterms:license in the Dublin Core metadata schema.
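As an illustrative sketch of the metadata-based approach (not an actual institutional record; the title, rights statement and license URL below are placeholder values), license details could be embedded in a Dublin Core record as follows:

```xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Example digitised newspapers dataset</dc:title>
  <!-- Human-readable rights statement -->
  <dc:rights>Public domain. No known copyright restrictions.</dc:rights>
  <!-- Machine-actionable link to the full license text -->
  <dcterms:license>https://creativecommons.org/publicdomain/zero/1.0/</dcterms:license>
</metadata>
```

Combining a human-readable dc:rights statement with a machine-actionable dcterms:license link serves both researchers browsing the record and automated harvesters.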
Licensing the dataset must take into account the license of each of the resources contained in the dataset as these may vary.

3.3.2 Provide a suggestion of how to cite your dataset.
A suggested citation promotes access to and reusability of data, and helps reusers to properly cite the dataset. Best practices recommend including a preferred citation for the dataset [48].
A citation can be improved by using a permanent identifier to uniquely identify a resource such as a dataset [13]. Digital Object Identifiers (DOIs) are widely used by the community. For example, the datasets made available by the British Library and the National Library of Scotland provide a DOI as well as suggestions for citation. In fact, platforms such as Zenodo and DataCite provide a DOI for all published resources, including a citation in the most common citation formats such as BibTeX and APA.
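A preferred citation might, for instance, be offered in BibTeX form; the entry below is a hedged sketch in which the author, title, year and DOI are all placeholder values, not a real dataset record:

```bibtex
@misc{example_dataset_2022,
  author    = {{Example National Library}},
  title     = {Example Digitised Newspapers Dataset},
  year      = {2022},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.0000000},
  note      = {Dataset, version 1.0}
}
```

Publishing the citation in a machine-readable format such as BibTeX lets reference managers import it directly, reducing citation errors by reusers.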
Another practice is to describe the publication of a dataset in a research article, which can then be used as a citation, since journals provide citations in several formats. Examples include descriptions of the transformation of datasets into Linked Open Data (LOD) that have been published as research articles [20, 31].

3.3.3 Include documentation about the dataset.
Documentation is a key element to foster reuse by the community [48]. Documentation may include details about the original sources as well as the cleaning and transformation principles and actions performed, information about how to access and use the dataset, or a description of the quality of the content provided [23].
The documentation can be provided in several ways, such as a blog post, README files and tutorials. For example, Chronicling America provides information about the dataset by means of a dedicated website.14 Other examples are based on the use of README files, as is the case for the British Library.15

3.3.4 Use a public platform to publish the dataset.
Public platforms for making datasets available enable reusers to download the contents in bulk [48]. Some examples of free platforms are GitHub, Zenodo, Hugging Face and DataCite. However, some platforms may have size limitations, for which paid services may be required. For example, the National Library of Scotland uses cloud storage services for its large datasets.16 These platforms provide additional features, such as release management, that can be useful to publish different versions of the same dataset [51].

3.3.5 Share examples of use as additional documentation.
Examples of use of the contents provided by a digital collection are useful to inspire researchers [40, 48].
In particular, a Lab environment within a GLAM organisation is the place where reusers are able to find examples and prototypes based on the digital collections, which in many cases are made available under open licenses. For example, the KB Labs17 from the National Library of the Netherlands provide a list of tools, and the LC Labs from the Library of Congress include the experimental tool Newspaper Navigator, which allows users to browse the images extracted from the digitised newspapers database Chronicling America [35].
In other cases, reproducible Jupyter Notebooks are used to introduce researchers to how to access and reuse the datasets. A Jupyter Notebook combines textual descriptions and code in the form of cells that can be run step by step. Some examples are the GLAM Workbench18 and the GLAM Jupyter Notebooks from the Biblioteca Virtual Miguel de Cervantes [12].
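A first notebook cell typically loads a dataset file and summarises its coverage. The sketch below uses only the standard library, with a tiny in-memory sample standing in for a real metadata export (the fields and values are hypothetical):

```python
import csv
import io

# Hypothetical sample standing in for a real metadata export (fields will vary).
sample = """id,title,year
1,Example Gazette,1850
2,Example Herald,1850
3,Example Courier,1851
"""

# Parse the export into a list of dictionaries, one per record.
records = list(csv.DictReader(io.StringIO(sample)))
print(f"{len(records)} records loaded")

# Count records per year to get a first impression of the dataset's coverage.
by_year = {}
for rec in records:
    by_year[rec["year"]] = by_year.get(rec["year"], 0) + 1
print(sorted(by_year.items()))  # → [('1850', 2), ('1851', 1)]
```

In a published notebook, cells like this would be interleaved with prose explaining where the file comes from and how the fields are defined.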
Other approaches entail the publication of tutorials on platforms such as The Programming Historian [16] and Library Carpentry [4], and research articles in journals describing how the dataset was created and reused.

3.3.6 Give structure to the dataset.
A coherent internal organisation of a dataset is essential for researchers wishing to explore and query that dataset. Depending on the size and the type of contents, the structure will differ. Digital materials include a wide variety of content types, including images, maps, metadata, text, music and video, amongst others.
Some rules allow for a better understanding of the content provided by the dataset. One way to enhance this understanding is, for example, using self-describing folder names (e.g., text or images). Another approach could be based on the file format of the files provided (e.g., txt and xml). Each file included in the dataset may be named with its local identifier in the GLAM organisation. When several formats are provided for each resource (e.g., XML and JSON), a root folder can be created for each format.
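Applying these rules, a small text-and-XML dataset might be laid out as follows (an illustrative sketch; the identifiers, folder names and formats are hypothetical and will vary per collection):

```
dataset/
    text/
        000001.txt
        000002.txt
    xml/
        000001.xml
        000002.xml
    README
```

Here each resource keeps the same local identifier across format folders, so a reuser can match the plain text of an item to its XML counterpart without consulting external documentation.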
For example, the Bibliothèque nationale du Luxembourg made historical newspapers available as open data using a zip file.19 Each journal is included in a folder named with the title and the date. Each folder provides a set of folders according to the different types of content (images, pdf, text, thumbnails), the complete PDF and an XML file. Other approaches are based on metadata and provide a set of documents in different formats (e.g., Dublin Core and MARC), such as the Moving Image Archive.
More advanced initiatives, such as the BagIt File Packaging Format [33], describe a set of hierarchical file layout conventions for the storage and transfer of arbitrary digital content.
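The essentials of these conventions can be sketched in a few lines of Python. This is a simplified illustration of a BagIt-style layout (bag declaration, data/ payload and checksum manifest), not a complete implementation of the specification:

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir: Path, files: dict) -> None:
    """Create a minimal BagIt-style bag: declaration, payload and checksum manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # Bag declaration identifying the version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    # Write each payload file and record one checksum line per file.
    manifest_lines = []
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

# Build an example bag in a temporary directory (file name and content are hypothetical).
bag = Path(tempfile.mkdtemp()) / "example-bag"
make_bag(bag, {"page-000001.txt": b"example OCR text"})
print(sorted(p.name for p in bag.iterdir()))  # → ['bagit.txt', 'data', 'manifest-sha256.txt']
```

The checksum manifest allows a recipient to verify that every payload file arrived intact, which is the main appeal of BagIt for transferring large digitised collections.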
When providing large-size images, which are often the result of a digitization process, it can be useful to also provide reduced-size thumbnails based on the original images so that they can be visualised more easily and quickly. One additional aspect to consider is the cleaning of the data before publication. For example, post-correction OCR data is sometimes included in the case of digitization datasets, and metadata collections may require cleaning to remove unnecessary metadata fields.

3.3.7 Provide machine-readable metadata.
There is a wide variety of forms and formats to make metadata about digital resources (e.g., a dataset) available. The use of interoperable machine-readable metadata enhances discoverability and use, since the data is readily processed by a computer [59]. Some examples of vocabularies to provide metadata are MARC, Dublin Core, the Vocabulary of Interlinked Datasets (VoID) [58] and the Data Catalog Vocabulary (DCAT) [60]. Other initiatives are based on the Resource Description Framework (RDF) and schema.org. For example, the machine-readable metadata description using the vocabulary DCAT for the dataset National Bibliography of Scotland published by the Data Foundry is shown in Listing 1.
Listing 1. Machine-readable metadata description using the vocabulary DCAT for the dataset National Bibliography of Scotland published by the Data Foundry (excerpt; prefix declarations).

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
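Prefix declarations like those in Listing 1 are typically followed by the dataset description itself. The Turtle fragment below is an illustrative sketch of such a DCAT description (not the actual Data Foundry record; all URLs and literals are placeholders):

```turtle
<https://example.org/dataset/example-bibliography> a dcat:Dataset ;
    dct:title "Example bibliography dataset" ;
    dct:publisher <https://example.org/organisation/example-library> ;
    dct:license <https://creativecommons.org/publicdomain/zero/1.0/> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <https://example.org/downloads/example-bibliography.zip> ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip>
    ] .
```

Separating the abstract dcat:Dataset from its dcat:Distribution lets the same dataset be offered in several formats, each with its own download URL and media type.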

3.3.8 Include your dataset in collaborative editing platforms.
Collaborative editing platforms have become increasingly relevant in the GLAM context to create links and enrich collections [10, 26, 48]. Crowdsourcing approaches enable the community to contribute to the content in a collaborative environment.
Wikidata, for example, enables the creation of resources known as entities, adding properties to describe them. Editing is performed through an easy and accessible web interface. For example, the section dedicated to computational access to digital collections on the International GLAM Labs Community website includes a selection of Jupyter Notebook projects made available by relevant institutions that have been published in Wikidata [14].20 Wikidata provides a public API to access the data, enabling users to retrieve the contents. Table 4 shows an overview of Wikidata properties that can be useful to describe datasets.
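As a minimal sketch of using this API, the request URL for retrieving an entity can be assembled with the standard library; only the URL is built here so the example stays offline (Q42, the Wikidata item for Douglas Adams, merely stands in for a dataset item):

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def entity_request_url(qid: str, language: str = "en") -> str:
    """Build a wbgetentities request URL for a Wikidata entity."""
    params = {
        "action": "wbgetentities",  # retrieve the full entity record
        "ids": qid,                 # one or more Q-identifiers
        "languages": language,      # restrict labels/descriptions to one language
        "format": "json",           # machine-readable response
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

print(entity_request_url("Q42"))
```

The resulting URL can be fetched with any HTTP client; the JSON response contains the entity's labels, descriptions and property statements, including dataset-describing properties such as those in Table 4.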
3.3.9 Offer an API to access your repository.
The use of an API to make the dataset available is a key element to foster reuse [48]. APIs allow systems to communicate and to access and retrieve the entire dataset. In some cases, only a portion of the dataset may be retrieved for analysis using the API.
The use of an API to publish the digital contents may require additional features to be considered. For instance, when using IIIF, each resource should include a manifest describing the contents of that resource. For LOD, the adoption of URL patterns for the resources (e.g., author/id or author/name) is required, as well as an analysis of how the data will be modelled (e.g., classes used and number of properties) according to the controlled vocabularies used to describe the metadata. Table 5 introduces an overview of digital collections made available by institutions using a wide variety of APIs (columns: Dataset, API, URL).
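The URL-pattern decision can be made concrete with a small helper. This is a hedged sketch in which the base URL and patterns are hypothetical, not those of any particular institution:

```python
BASE_URL = "https://data.example.org/"  # hypothetical LOD base URL

def resource_url(resource_type: str, identifier: str) -> str:
    """Mint a stable, pattern-based URL for a LOD resource (e.g., author/id)."""
    return f"{BASE_URL}{resource_type}/{identifier}"

# One pattern per resource class keeps the URL space predictable for reusers.
print(resource_url("author", "12345"))
print(resource_url("work", "67890"))
```

Fixing such patterns before publication matters because LOD consumers treat resource URLs as persistent identifiers: once minted, they should not change.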
3.3.10 Develop a portal page.
Using a portal page for the dataset enhances visibility and provides additional information for reusing the data [48]. This information may include links to the dataset, visualisations, awards received, contact information, etc. For example, the Chronicling America dataset includes a dedicated website to access the contents, but also to explain how the API can be used and to provide information about the original sources and license.
In addition, platforms such as GitHub provide free services to publish websites that are stored as a code repository and enable the use of several themes.21

3.3.11 Add terms of use.
Best practices show the importance of adding terms of use describing the conditions of use for the data [48]. The content can be provided as an additional section on a portal page or as a text document.
For example, the British Library EThOS dataset includes a terms of use section that details copyright, liability and access statements.22 Other examples describe additional aspects, such as how to report content as inappropriate in situations where people's rights are violated.23

APPLICATION OF THE CHECKLIST
The checklist is intended as a tool for institutions to start implementing the Collections as data principle by giving a list of actions that can be performed so as to make collection data ready for computational use and reuse. Whilst not all items on the checklist must be executed to present collections as data, they give a clear direction when deciding which action to prioritise and which to defer to a later stage or to consider as unfitting for the specific collection. The checklist also serves as a tool for assessing existing collections as data, indicating the level of readiness for potential computational use.
This section provides cases for both of the above uses: it presents the results of applying the checklist to assess a selection of datasets made available by relevant GLAM institutions, and describes cases where the checklist provided guidance for implementing the Collections as data principle in an institutional context.

The checklist as a tool for assessing GLAM collections as data
A selection of datasets made available by GLAM institutions of widely varying sizes has been assessed against the checklist (see Table 6). The institutions and datasets are listed below.
• The British Library has a number of data services available to support different use cases, for example the content showcased on data.bl.uk and hosted on the open access British Library Research Repository.24 Some of these datasets will also have a corresponding Collection Guide.25
• Data Foundry is the National Library of Scotland's open data platform, which includes digitised datasets, metadata, spatial data and organisational data. For this example, we have chosen 'Encyclopaedia Britannica', the most-used dataset. This covers the first 8 editions (100 years) of the Encyclopaedia.26
• The Library of Congress (LC) recently published data.labs.loc.gov as an experimental sandbox for sharing data packages compiled as part of LC Labs' Mellon Foundation-funded Computing Cultural Heritage in the Cloud (CCHC) initiative.27 In this context, the Stereograph Card dataset consists of 39,526 stereograph card images from the 1850s through 1924, a subset of what was available online in the collection catalog in August 2022.
• The Royal Danish Library made available an API for its digitised collection as a result of a newspaper digitization project running from 2014 to 2017. The construction of the API has been a way to experiment with the OpenAPI standard.28
• Art In Flanders (AIF) is a dataset supported by Meemoo that includes more than 20,000 images of objects from Flemish museums and cultural institutions, comprising paintings, sculptures, archaeological artefacts, design objects, and more. Digital reproductions and descriptive metadata are being made available through the artinflanders.be platform.
• Miguel de Cervantes Virtual Library (BVMC) made available its main catalogue as Linked Open Data (LOD) using Resource, Description and Access (RDA) as its main vocabulary [9].
Table 7 shows the results obtained after the assessment in terms of the items provided by the checklist introduced in Section 3.

Table 6. Overview of the datasets and organisations used for the assessment.
Table 7. Overview of the results obtained when evaluating the checklist against a list of datasets made available by relevant GLAM institutions (one column per checklist item, 1-11).

License. All the datasets and platforms assessed provide a clear license. For example, the British Library has a formal access and reuse process to identify whether works are out of copyright or in copyright, the National Library of Scotland's Data Foundry provides the license for each dataset29 and the Library of Congress bases its reuse policies on the rights statement of the source collection. Table 6 introduces the licenses used in the datasets.
Suggested citation. In general, most of the datasets provide a persistent identifier such as a DOI. Other examples, such as the National Library of Scotland's Data Foundry, provide a suggested citation. The Library of Congress offers citation details for the source collections and the dataset creators and contributors. In other cases, a journal research article can be used to cite a dataset.
Documentation about the dataset. All the datasets provide dataset information and metadata as documentation in a wide diversity of manners (e.g., website, README) and granularity (e.g., collection and individual level). Other approaches are based on the use of machine-readable vocabularies based on RDF to describe the datasets, such as the Vocabulary of Interlinked Datasets (VoID).30 Table 8 shows an overview of the approaches followed by the institutions selected in this work.
For example, the Library of Congress provides three levels of documentation for the datasets made available on data.labs.loc.gov by means of different files: i) a README file with a technical overview of how the dataset was created (e.g., details of the dataset source collection, computational readiness and possible uses, dataset field descriptions and rights statement); ii) a data cover sheet file with a more substantive overview of the data and the collection from which it is derived (e.g., version information, background of collection, original format, reading room details, contact and metadata types); and iii) a data processing plan describing the goal of the experiment and a description of intended use, and data documentation regarding different aspects such as composition, provenance, compilation methods, preprocessing steps and potential risks to people and communities, amongst others.31 In addition, the BL has explored innovative approaches such as Datasheets for Datasets by including a datasheet in the datasets, documenting its motivation, composition, collection process, recommended uses, etc. to facilitate better communication between dataset creators and dataset consumers.32

Use of a public platform to publish the dataset. All the datasets are available by means of public platforms. However, there are differences across the institutions regarding the use of institutional and third-party platforms. For example, the BL uses both institutional and third-party platforms, including the British Library Research Repository,33 Flickr, Wikimedia, Hugging Face, and secondary publishers, depending on the type/format of data. In the case of the other institutions, the datasets are available by means of an institutional website (e.g., Lab section and dedicated website), as is the case for meemoo, BVMC and Data Foundry.

29 See, for example, https://data.nls.uk/data/digitised-collections/encyclopaedia-britannica/
30 http://vocab.deri.ie/void
Other organisations have different approaches depending on the content provided. For example, the Library of Congress provides access to datasets that have been officially acquired in the Selected Datasets Collection.34 For experimental or temporary datasets, access is provided on LC for Robots35 or on data.labs.loc.gov, which hosts datasets using cloud service providers.
Share examples of use. Many of the datasets assessed include examples of use as additional documentation to show how to reuse the contents. However, the approaches differ from one institution to another. For example, the National Library of Scotland's Data Foundry provides examples based on reproducible Jupyter Notebooks, and the project includes collaboration initiatives based on the reuse of the datasets.36 The BL shares examples of dataset reuse on its Digital Scholarship blog.37 The Library of Congress includes a section "Computational Readiness and possible uses" in the README files.38

Give structure to the dataset. While datasets are structured according to different requirements and contents, in general datasets are structured with reuse and data management in mind. For example, the National Library of Scotland's Data Foundry provides the datasets as zip files including folders per file format that can be easily identified by potential reusers. The Library of Congress has explored different ways to create and communicate coherence in dataset structure. Some examples are including dataset field descriptions in the README files, metadata and manifests for scripted and API access, and providing sample data and guidance on each data package page for ways to download the OCRed text, documentation and metadata.
Provide machine-readable metadata. All the datasets provide machine-readable metadata to describe the digital collections, based on Dublin Core and more advanced approaches based on controlled vocabularies. The metadata is provided in the form of additional files (e.g., XML) or through an API.
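As a minimal illustration of this checklist item, a flat Dublin Core record can be serialised to XML with nothing but the standard library. The element names follow the DCMI Elements namespace; the title, creator and rights values below are invented placeholders, not drawn from any of the datasets discussed here.

```python
# Sketch: emitting a Dublin Core record as XML with the standard
# library. The field values are invented placeholders.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields: dict) -> str:
    """Serialise a flat dict of Dublin Core elements to an XML string."""
    root = ET.Element("record")
    for name, value in fields.items():
        elem = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        elem.text = value
    return ET.tostring(root, encoding="unicode")

xml_record = dublin_core_record({
    "title": "Example digitised newspaper dataset",
    "creator": "Example National Library",
    "rights": "https://creativecommons.org/publicdomain/zero/1.0/",
})
```

In practice such a record would be published alongside the dataset files or served through an API, as the institutions above do.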
Include your dataset in collaborative edition platforms. While many institutions already provide information about their datasets in Wikidata, they are also interested in developing further Wikidata opportunities. Table 9 shows an overview of the Wikidata approaches in GLAM organisations. However, it is important to note that in some cases the information included is not related to datasets but to other initiatives such as projects and notebooks. In addition, other approaches are based on Wikimedia platforms. For example, a subset of the BL dataset is currently on Wikimedia Commons,39 which offers a useful introduction to the collection, including a Synoptic Index, as well as projects to georeference maps found in the texts.
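One lightweight way to register a dataset in Wikidata is to prepare tab-separated statements for the QuickStatements batch tool. The sketch below is our own illustration, not a workflow used by any of the institutions surveyed; the property IDs P31 (instance of) and P275 (copyright license) are real, but the item identifiers are given from memory and should be verified before use, and `Q0` is a placeholder for the dataset's own item.

```python
# Sketch: preparing QuickStatements rows to describe a dataset in
# Wikidata. Q-identifiers below should be double-checked; Q0 is a
# placeholder for the dataset item.
DATASET_ITEM = "Q0"          # placeholder item for the dataset
Q_DATA_SET = "Q1172284"      # "data set" (to be verified)
Q_CC_BY_40 = "Q20007257"     # "CC Attribution 4.0" (to be verified)

claims = [
    (DATASET_ITEM, "P31", Q_DATA_SET),    # instance of: data set
    (DATASET_ITEM, "P275", Q_CC_BY_40),   # copyright license
]
# QuickStatements accepts one tab-separated claim per line.
quickstatements = "\n".join("\t".join(claim) for claim in claims)
```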

Organisation Wikidata link
British Library https://www.wikidata.org/wiki/Wikidata:British_Library
National Library of Scotland https://www.wikidata.org/wiki/Q111411199
Miguel de Cervantes Virtual Library https://www.wikidata.org/wiki/Q111396572

Offer an API to access your repository. Most of the institutions have adopted APIs as a means to make available their collections, based on different standards and tools. For example, the BVMC provides the datasets through a public SPARQL endpoint. Others provide a blend of this recommendation, using APIs to source collections and using manifests from data packages to gather those packages via a JSON/YAML API, such as the Library of Congress.
However, some institutions have decided to provide the collections with simple, straightforward access through downloads to cater for those users whose technical skills are limited, such as students and artists.
Develop a portal page. All the institutions provide a data portal page including detailed information about the collections. Several examples included in this selection of organisations and datasets are the result of previous experimental data access points that have evolved into a new section, such as data.labs.loc.gov40 and data.cervantesvirtual.com.
Terms of use. Terms of use are provided by the organisations. In some cases, the information includes contact details, for example at the BL.41 In other examples, the information is provided only in the country's official language (e.g., Spanish).42 Table 10 shows an overview of the terms of use provided by the institutions.

39 https://commons.wikimedia.org/wiki/Commons:British_Library/Mechanical_Curator_collection
40 https://data.labs.loc.gov/
41 https://bl.iro.bl.uk/terms
42 See, for example, https://data.cervantesvirtual.com/condiciones-de-uso/

Providing a clear license: the case of meemoo and the Art in Flanders dataset

Providing a clear license allowing reuse of the dataset with as few restrictions as possible is one of the more complex items on the checklist. Managing these rights over time only adds to this complexity, as is shown in the case of meemoo's Art in Flanders (AIF) dataset. Rights management has gained much attention in the heritage sector over the last few years, in relation to developments in computing (e.g. Linked Open Data) and copyright legislation (e.g. the EU directive on copyright). In line with this, the need arose to review the rights labelling policy for the AIF platform. The ambition was twofold: i) to make the dataset as freely available for access and reuse as possible; and ii) to adopt more appropriate standard rights labels for communicating the rights information. In this sense, some issues were identified:
• Over the years, inadequacies had slipped into the rights information on the AIF platform. To give an example: a painting that had fallen into the public domain because its creator died more than 70 years ago, but with a copyright waiver (CC0) in the metadata and a photo credit © name-photographer on the picture.
• New copyright is claimed on digital surrogates of public domain works. The initiative wanted to get in line with Article 14 of the EU directive, which protects the public domain from surrogate rights. For so-called "2-dimensional" works (such as paintings) there is a clear consensus that no new copyright can be claimed on purely technical reproductions. However, for photographic reproductions of 3-dimensional objects, the situation is less straightforward, as these may meet the threshold of originality because of the photographer's personal choices in point of view, lighting, etc. So copyright and copyright restrictions might still apply here. An additional complexity is that the terms of meemoo's older photography contracts cannot be transposed into a standard open license, so more restrictive CC licenses are being used.
• Owners may impose restrictions on the use of reproductions made of works in their collections, even if they are in the public domain. Lacking more appropriate tools, these use conditions have been (improperly) translated into a restrictive CC license on the AIF platform.
The group of photographic reproductions of 3-dimensional artworks was particularly problematic, as the project was confronted with double-layered rights statuses. Two solutions were successively considered: 1) using separate rights labels, one for the rights status of the artwork and one for the rights status of the photo, and 2) using a single rights label that communicates the rights status and usage conditions for the resource as a whole. In parallel, the possibility of recontacting the photographers in question, and asking them to waive their rights, was considered.
For images where the owner of the cultural objects imposes use restrictions on reproductions, it was decided to adopt a rightsstatements.org label. These labels have been specifically devised for heritage institutions to communicate rights information in a standardised way when they do not own the copyright and therefore using a copyright license is legally not possible.
For the reproductions of two-dimensional works, it was proposed to use updated labels for the three main groups. Firstly, the majority are in the public domain and can be released with a Public Domain mark instead of a copyright waiver. These images are freely downloadable in high resolution and reusable for any purpose without any restrictions. Secondly, where collection owners have imposed use conditions, it was proposed to use the rights statement "contractual restrictions". However, since this is not a legally binding tool, the user still needs to agree to user terms before downloading the images. In parallel, these contracts are currently being reviewed with the goal of minimizing and standardizing the use restrictions as far as possible. Thirdly, when works are under copyright, images get the "in copyright" mark.
For the reproductions of 3-dimensional works, a one-label approach was chosen for a number of reasons. Firstly, a single rights label is more user-friendly. It requires less pre-existing knowledge and leaves less room for (mis)interpretation by the user than the multiple-label approach. So in a sense, this is also the more secure and controlled approach. Secondly, not all labels are entirely compatible. For instance, a CC BY license may allow reuse of an image of an artwork which is itself under full copyright and cannot be used without permission of the rights owner. In such cases, it was proposed to use the most restrictive label. In this way, pictures of artworks that are in copyright are tagged as such, even when the photographer agrees to a more open license. Additionally, pictures in the public domain get their label from the license agreed upon with the photographer.
In parallel, it was decided to include a maximum waiver of rights in (new) photography contracts. The process of clearing rights for older contracts is also in progress.
In summary, the updated 'open' licensing policy on the AIF platform proposes 4 rights labels for the main categories of images:
• Public Domain mark for images of 2D artworks in the public domain.
• CC0 for images of 3D artworks in the public domain.
• No Copyright - Contractual Restrictions for images of artworks in the public domain restricted by the collection owner (2D and 3D).
• In Copyright mark for images of artworks that are under copyright (2D and 3D).

The majority of images on AIF belong to the first two groups. In addition, providing access to the AIF dataset through an API is on the roadmap.
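The four-label policy above can be expressed as a small decision function. This is our own sketch of the logic for illustration; meemoo does not publish its policy as code, and the function name and parameters are ours.

```python
# Sketch of the updated AIF rights-labelling policy described above,
# written for illustration only (not meemoo's actual implementation).

def aif_rights_label(dimensions: int, in_copyright: bool,
                     owner_restrictions: bool = False) -> str:
    """Return the rights label for a reproduction of an artwork."""
    if in_copyright:
        # Applies to both 2D and 3D works still under copyright.
        return "In Copyright"
    if owner_restrictions:
        # Public domain, but the collection owner restricts reuse.
        return "No Copyright - Contractual Restrictions"
    if dimensions == 2:
        # Purely technical reproduction of a 2D public domain work.
        return "Public Domain Mark"
    # 3D works in the public domain: photographer waives rights.
    return "CC0"
```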

The checklist as a tool for implementing the Collections as data principle at KU Leuven Libraries
As apparent from the survey results presented in Section 3.2, when working towards disclosing collections as data, GLAM institutions are looking for inspiration and general insight into how to approach the implementation of the Collections as data principle. The checklist can give guidance to the process, allowing institutions to make informed decisions on what to focus on first. To give potential users of the checklist insight into this process, this section describes the case of KU Leuven Libraries'43 work on creating datasets for computational use.
Parameters for creating datasets depend on the context, the collection content, target users and intended use [49]. Six datasets (see Table 11) were created as part of the preparation for a hackathon aimed at researchers and postgraduate students from within KU Leuven [32]. The hackathon organisation was a collaborative effort of KU Leuven Libraries and the university's Faculty of Arts.44 Considering this context, and as this was the library's first endeavour in creating collections as data, it was decided to (at least temporarily) offer access to the datasets (item 4, use a public platform to publish the dataset) through KU Leuven's internal data portal. Share examples of use (item 5) and include your dataset in collaborative edition platforms (item 8) were irrelevant to this context, as participants were expected to work directly and independently with the source data. Regarding providing a clear license allowing reuse of the dataset without restrictions (item 1), it was decided to only include resources in the public domain in order to allow for full reuse by participants outside of the hackathon. The Public Domain mark was provided at resource level in the descriptive metadata of the resources included in the dataset. Add a terms of use (item 11) was satisfied by including a statement in the hackathon Code of Conduct.

To start, the metadata and data were identified and extracted from their respective repositories. The dataset was subsequently structured according to give structure to the dataset (item 6): each dataset contained a separate folder for each resource, within which there were subfolders for each of the representations of these resources, e.g. a folder for page-level OCR data, for page-level jp2 images, and for PDF. At the level of the resource, a JSON manifest was included describing the resource. At the dataset level, the descriptive metadata was included as an XML metadata dump and as a partially cleaned CSV and Excel file. A final CSV was also provided, revealing the concordance between all the files in the dataset.
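The layout just described (one folder per resource, one subfolder per representation, a JSON manifest per resource and a concordance CSV at dataset level) can be sketched with the standard library. File and folder names below are illustrative, not KU Leuven's actual naming scheme.

```python
# Sketch of the dataset layout described above; names are illustrative.
import csv
import json
import tempfile
from pathlib import Path

def build_dataset(root: Path, resources: list) -> Path:
    """Create the folder structure, manifests and concordance CSV."""
    rows = []
    for res in resources:
        res_dir = root / res
        # One subfolder per representation of the resource.
        for representation in ("ocr", "jp2", "pdf"):
            (res_dir / representation).mkdir(parents=True, exist_ok=True)
        # A JSON manifest describing the resource.
        manifest = {"id": res, "representations": ["ocr", "jp2", "pdf"]}
        (res_dir / "manifest.json").write_text(json.dumps(manifest))
        rows.append({"resource": res, "manifest": f"{res}/manifest.json"})
    # Dataset-level concordance linking every resource to its files.
    concordance = root / "concordance.csv"
    with concordance.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["resource", "manifest"])
        writer.writeheader()
        writer.writerows(rows)
    return concordance

root = Path(tempfile.mkdtemp())
build_dataset(root, ["resource_001", "resource_002"])
```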
The full dataset was uploaded to the internal KU Leuven active data portal ManGO (item 4, use a public platform to publish the dataset), where hackathon participants could access the data, execute downloads or (providing the necessary infrastructure to work with the large datasets) connect the high-performance computing infrastructure to the data through an API (item 9, offer an API to access the repository). In ManGO, dataset metadata was added, but for query purposes only. The documentation for each of the datasets included full information on the dataset structure, the descriptive metadata model, and some information at the level of the physical collection on the basis of which these datasets were created.
The library plans to further develop the datasets, improving their (re)usability by including terms of use at dataset level (item 11, add a terms of use), investigating possible locations to store and access the datasets for non-KU Leuven users (items 4, 9 and/or 10), as well as citation information, which is so far lacking as the data is inaccessible to non-KU Leuven users (item 2, provide a suggestion of how to cite your dataset), and improving the structure and completeness of the datasets' documentation (item 3, include documentation about the dataset). It will also include the DOI of the hackathon team posters to inspire potential users (item 5, share examples of use as additional documentation), and investigate and develop a metadata model for dataset metadata (item 7, provide machine-readable metadata). The library has already pushed other data to collaborative editing platforms such as Wikidata and considers doing the same for these datasets.45 Yet, due to a lack of personnel, this will be postponed to a later date.
As a whole, the checklist directed the preparation phase by providing a concise overview of which actions lead to collections as data, allowing for timely reflection on priorities. It now also supports decisions regarding next steps to take.

Towards a Collections as data platform at KBR, the Royal Library of Belgium
Inspired by the Collections as data movement [48, 49], in 2020 KBR, the Royal Library of Belgium, embarked on a 48-month project46 called DATA-KBR-BE47 (2020-2024). The aim of the project is to optimise KBR's ICT infrastructure to stimulate sustainable data-level access to KBR's digitised and born-digital collections for digital humanities research. A key output of the project is to design and implement an Open Data Platform (data.kbr.be) for publishing KBR's collections as data datasets.
Work on conceptualising the future DATA-KBR-BE platform began at an early stage in the project, during an initial Brainstorming Workshop held in November 2020. It was clear from the outset that a researcher-centred and iterative approach was needed to gather requirements for the design and development of the DATA-KBR-BE platform. A first important step in this process was to review some of the existing library data platforms, such as those of the national libraries of Luxembourg,48 the Netherlands,49 Scotland50 and the British Library.51 Questions explored included: What data is offered? How? In what format? What did people like and dislike about the platforms? The outcomes of this workshop were used as the basis for iteratively developing a checklist of needs. The emergence of the "Checklist to publish Collections as data in GLAM Institutions" introduced in Section 3.3 provided the project team with an ideal framework to help structure the development of the functional and technical requirements for the platform.
To prepare for the webinar in October 2022 [11],52 an initial analysis of the checklist was undertaken to assess which checklist items were most relevant for the DATA-KBR-BE project. Initially, develop a portal page (item 10) and give structure to the dataset (item 6) were identified as the most relevant items for the project team, as our aim was to develop the DATA-KBR-BE platform and we would need to understand how to structure the datasets that would be published there. However, it soon became apparent that many, if not all, of the checklist items would support the development of the DATA-KBR-BE platform. For example, provide a suggestion of how to cite your dataset (item 2) and provide a license allowing reuse of the dataset (item 1) were quickly seen as essential.
To use the checklist more systematically, a collaborative spreadsheet was designed to capture each of the functional and technical requirements for the DATA-KBR-BE platform, as shown in Figure 5.

Fig. 5. A collaborative spreadsheet for capturing the technical and functional requirements for the DATA-KBR-BE platform.

Column B is used for categorising each of the requirements based on the checklist, e.g. requirement 1, entry point for everything data-related at KBR, has been categorised in relation to the checklist item develop a portal page (item 10). This approach enabled us to: a) group the requirements by category; b) ensure that our requirements analysis was as exhaustive as possible by considering each of the checklist items; and c) provide feedback to the International GLAM Labs Community to further improve the checklist.
When reviewing the checklist items, the difference between use a public platform to publish the dataset (item 4) and develop a portal page (item 10) was not initially clear without further explanation. We were also unsure about include your dataset in a collaborative edition platform (item 8) and how it relates to items 4 and 10. Additionally, provide machine-readable metadata (item 7) caused quite some discussion in the DATA-KBR-BE project team. For example, what about human-readable metadata? What does machine-readable mean in this context? Why is it prioritised? Is this descriptive metadata? Are any particular standards recommended? How does this relate to structural metadata? Is this covered under give structure to the dataset (item 6)? The training and documentation aspects of the checklist, include documentation about the dataset (item 3) and share examples of use as additional documentation (item 5), were both seen as very relevant to the development of the DATA-KBR-BE platform. However, they would likely be added following the initial development of the platform itself. Finally, offer an API to access your repository (item 9) was out of the scope of the current DATA-KBR-BE project and will be addressed in a follow-up project, the KBR Virtual Lab.
In conclusion, the "Collections as Data Checklist" was a valuable tool as it helped ensure that the DATA-KBR-BE project team had considered as many of the aspects of the checklist as possible when developing our Collections as data platform.

Discussion
While various institutions have made their collections available using APIs, there are still barriers hindering the adoption of the Collections as data principle, such as the lack of resources as well as the desire to make the collections easily available to a broad user group by means of simple access and downloads. The GLAM datasets which were selected for assessment in this article present some similarities but also some differences, e.g., the type of content, the formats and standards used for digital delivery, how they can be accessed, the licensing, and the documentation provided.
The checklist is informed by the issues and needs identified within the literature review and is complemented by the contributions of the practitioners, who considered all the items included in the checklist relevant. In general, the practitioners observed a balance between simplicity and depth of practice. Some of them remarked that each of the items requires a different degree of maturity and prioritisation, which in some cases necessitates joint efforts by the community.
With regard to the application of the checklist as an assessment tool, and taking into account the wide variety of datasets provided by the GLAM institutions, the results obtained after the application of the checklist may differ amongst adopters of this approach. Initial results showed that the checklist is useful for identifying which aspects are relevant for a particular institution and, to some extent, easy to apply when making datasets available for computational use. In general, we observed that there is no fixed order when applying the items in the checklist. Rather, as the case of KU Leuven Libraries demonstrates, priorities depend on the context, the content, the intended use and the target users of the dataset. Furthermore, the checklist can further facilitate the development of infrastructures related to collections as data, as shown in the case of the DATA-KBR-BE platform.
In general, future work based on the items in the checklist is a common goal across the institutions wishing to make their collections available as data. Regarding the sharing of examples of practical implementations, the use of Jupyter Notebooks is increasing across organisations that are working in this area. For example, the British Library plans to improve reuse examples and their related tools on data.bl.uk in due course. Concerning the structure of the datasets, institutions are interested in improvements which can enhance the user experience.
While the institutional journeys into the delivery of collections as data differ and are not yet taking place in all institutions in the GLAM sector, an additional layer of complexity in the computational use of cultural data which needs to be accommodated is the development of the common European data space for cultural heritage, the European Cultural Heritage Cloud and other data spaces which will be using GLAM data, e.g. the European Open Science Cloud and large research infrastructures in the digital humanities. There is an ongoing effort to identify use cases for the data space for cultural heritage, and it would be helpful to coordinate this work with the future refinement of the checklist.

CONCLUSIONS
Over the past few years, there has been a growing interest in making available the digital collections published by GLAM organisations for computational use.
Based on previous work, we defined a methodology, described in Section 3, to build a checklist for the publication of collections as data. Our evaluation showed several examples of application that can be useful to encourage other institutions to publish their digital collections for computational use.
Future work to be explored includes the improvement of the methodology by including additional features such as carbon footprint assessment, ethical issues and quality, as well as the inclusion of additional collections as data provided by organisations as use cases.

Fig. 1. Survey results: "What is the level of your experience with preparing Collections as data?"

Fig. 4. Survey results: "What information would you like to have/have liked to have had when starting to work towards Collections as data? What knowledge would have made it easier?"

Table 2. Survey employed to retrieve information regarding the publication of Collections as data in GLAM institutions.
<https://doi.org/10.34812/7cda-ep21> a dcat:Distribution ;
    dcat:downloadURL <https://nlsfoundry.s3.amazonaws.com/metadata/nls-nbs-v2.zip> ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/xml> ;
    dcat:compressFormat <http://www.iana.org/assignments/media-types/application/zip> .

Table 4. Overview of Wikidata properties to describe a dataset as an entity.

Table 5. Overview of digital collections made available by relevant institutions using a wide variety of APIs.

Table 8. Overview of strategies to provide documentation about machine-actionable collections in GLAM institutions.

Table 9. Overview of Wikidata links related to GLAM organisations.

Table 10. Overview of terms of use provided by the organisations selected in this work.

Table 11. Overview of datasets created by KU Leuven Libraries for the BiblioTech 2023 Hackathon (13-23 March 2023). At the time of writing, the datasets are only accessible to KU Leuven staff and students.