Methods
In consultation with project advisors and partners, the Sounding Spirit Digital Library team developed methods for digital workflow, bibliographic research and curation, and metadata collection. All work was guided by relevant best practices and project needs.
- To produce uniform, high-quality digital files, we developed methods for aspects of digital workflow, including digitization, file management, and optical character recognition, based on the Federal Agencies Digital Guidelines Initiative (FADGI) guidelines.
- To create a resource representative of the variety of southern sacred song, we collaborated with subject matter experts, including project team and advisory board members, to develop bibliographic research and curatorial processes that looked beyond traditional canons, key terms, and collections.
- To curate research data that supports user engagement, research, and interoperability, we sought out technical advisors with whom we developed metadata collection protocols that draw on best practices in the fields of library technical services.
Digital Workflow
The Sounding Spirit Digital Library includes volumes shared from seven partner institutions. Each contributing institution digitized (or outsourced digitization of) songbooks according to project standards developed by the digital workflow working group. Together, group members workshopped and confirmed technical and capture specifications, digitization scope, and quality control measures.
- A downloadable version of the project’s digitization standards and workflow will soon be accessible here.
- Users can download page images and inspect technical metadata such as colorspace, ppi, and bit depth using the identify command of the command-line tool ImageMagick.
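The inspection described above can be sketched in Python. The file name is hypothetical, running the command requires ImageMagick on the PATH, and the output format assumes a recent ImageMagick release that prints bare numbers for resolution:

```python
import subprocess

def identify_command(path):
    """Build an ImageMagick identify call reporting colorspace,
    resolution (pixels per inch), and bit depth for one image."""
    return ["identify", "-units", "PixelsPerInch",
            "-format", "%[colorspace] %x %z", path]

def parse_identify(output):
    """Parse the space-separated identify output into a dict."""
    colorspace, ppi, depth = output.split()
    return {"colorspace": colorspace, "ppi": float(ppi), "bit_depth": int(depth)}

# Running the command itself (requires ImageMagick installed):
#   result = subprocess.run(identify_command("page_001.jpg"),
#                           capture_output=True, text=True, check=True)
#   print(parse_identify(result.stdout))
```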
Digitization
At the outset, partners captured and shared sample scans to assess institutional capacity for in-house digitization. Brown University, the University of Michigan, the University of Kentucky, and Southern Baptist Theological Seminary digitized volumes on site. Middle Tennessee State University and the University of Tulsa sent volumes to the project’s digitization vendor, Creekside Digital. Middle Tennessee State University digitized eight volumes on site, including works digitized for the pilot digital library and items identified for inclusion late in the project. Emory University sent volumes housed at the Pitts Theology Library to Creekside for digitization, but digitized books from the Stuart A. Rose Manuscript, Archives, and Rare Book Library and other library units on site. Select Pitts Theology Library books were also digitized on site, as were Emory pilot digital library volumes.
Whether scanning in house or via third-party vendor, partners adhered to project digitization standards with a few notable exceptions. To comply with institutional policies, volumes digitized by Brown University use an alternative colorspace profile and crop to page edges. A single volume digitized by Middle Tennessee State University uses an alternative colorspace profile and white background. Twenty-two volumes included in the pilot digital library were digitized before the development of the expanded library’s standards.
Preservation
Sounding Spirit Digital Library partners are responsible for the long-term preservation of master files created during digitization. Prior to transferring files to the project for processing and publication, partners that digitized in house converted TIFFs to JPEGs using Adobe Bridge, Adobe Photoshop, ImageMagick, or Preview. JPEG conversions were produced at the highest quality setting, preserving the original resolution and color space of the TIFF. Creekside Digital followed all project file preservation standards by sharing TIFFs that project team members converted to JPEGs in the same manner described above.
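As one illustration of this conversion step (among the several tools partners used), an ImageMagick invocation that keeps the TIFF's resolution and embedded color profile while writing a highest-quality JPEG might be built like this; the file name is hypothetical:

```python
from pathlib import Path

def jpeg_conversion_command(tiff_path):
    """Build an ImageMagick call converting a master TIFF to a
    highest-quality JPEG. ImageMagick carries over resolution and
    embedded color profiles by default, so only the quality
    setting needs to be specified."""
    src = Path(tiff_path)
    dst = src.with_suffix(".jpg")
    return ["convert", str(src), "-quality", "100", str(dst)]
```

Running the resulting command requires ImageMagick to be installed; partners using Photoshop, Bridge, or Preview achieved the same result through those tools' export settings.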
File Transfer and Storage
To minimize the risk of file corruption, project scans were delivered either as package files using the BagIt specification, transferred via SFTP, or as JPEGs on physical hard drives. The project’s digital infrastructure for partner files relies on a combination of Amazon Web Services (AWS) and Emory University-based servers and services. Backup copies of partner scans were stored in AWS Glacier, with access copies located in a directory structure on a local server synced bidirectionally with an AWS S3 bucket. This structure allowed project members to manipulate files both locally and in the S3 bucket, efficiently running processes in both environments.
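The BagIt specification guards against exactly the corruption risk described above by shipping payload checksums in manifest files such as manifest-sha256.txt. A minimal fixity check on a received bag can be sketched as follows; this is an illustration, not the project's actual validation tooling:

```python
import hashlib
from pathlib import Path

def verify_bag(bag_dir):
    """Verify payload fixity for a BagIt bag by recomputing each
    checksum listed in manifest-sha256.txt (lines of the form
    '<hexdigest>  <relative path>'). Returns the paths that failed."""
    bag = Path(bag_dir)
    failures = []
    for line in (bag / "manifest-sha256.txt").read_text().splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            failures.append(rel_path)
    return failures
```

An empty return value means every payload file matches its recorded checksum, so the transfer arrived intact.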
OCR and PDFs
Our project team and advisors developed a custom script to run optical character recognition (OCR) on batches of books using AWS Textract and then convert the results to ALTO XML. For volumes with files that exceeded Textract’s file size limit, we performed grayscale conversion on derivative files using an ImageMagick command as a preprocessing step. For project volumes published in a language that Textract doesn’t support, we ran local copies through Google Cloud Vision, which offers experimental support for these languages. We then ran an additional custom script to create an access PDF for each volume, pairing the OCR with lower-quality page images to keep file sizes practical.
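The grayscale preprocessing decision can be sketched as below. The size threshold here is illustrative only (consult the current Textract documentation for the actual limit), and the in-place ImageMagick conversion mirrors the kind of command the workflow describes:

```python
from pathlib import Path

# Illustrative threshold; check the current AWS Textract documentation
# for the actual per-file size limit before relying on this value.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024

def needs_preprocessing(path, limit=SIZE_LIMIT_BYTES):
    """True when a derivative page image exceeds the OCR service's
    assumed file size limit."""
    return Path(path).stat().st_size > limit

def grayscale_command(path):
    """Build an in-place ImageMagick grayscale conversion, the
    preprocessing step used to shrink oversized derivatives."""
    return ["convert", str(path), "-colorspace", "Gray", str(path)]
```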
Readux Ingest
To build the digital library in Readux, our team used its S3 batch ingest functionality to import books in batches. Each batch included page images along with their associated OCR files and PDFs. We separately used Readux’s metadata upload capability to ingest volume-level metadata and apply sitewide updates. Once all volumes are in Readux, team members will sort volumes into their associated collections and manually enter collection descriptions and volume summaries where available.
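At the core of assembling such a batch is pairing each page image with its OCR file. The sketch below illustrates that pairing under the assumption that images and OCR files share filename stems; it is a generic illustration, not Readux's actual ingest layout:

```python
from pathlib import Path

def pair_pages(image_dir, ocr_dir):
    """Pair page images with OCR files that share a filename stem,
    and report any file left unmatched. Illustrative only; consult
    the Readux documentation for its real ingest structure."""
    images = {p.stem: p for p in Path(image_dir).glob("*.jpg")}
    ocr = {p.stem: p for p in Path(ocr_dir).glob("*.xml")}
    matched = images.keys() & ocr.keys()
    pairs = {stem: (images[stem], ocr[stem]) for stem in matched}
    unmatched = sorted((images.keys() | ocr.keys()) - matched)
    return pairs, unmatched
```

A nonempty unmatched list flags pages whose images and OCR went out of sync before ingest.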
Quality Control
Multiple stages of quality control ensured strict adherence to project standards. Upon receipt of partner files, our team conducted spot checks of technical and capture specifications, requesting re-scans where needed. In conjunction with metadata collection, team members checked each book carefully for any undocumented missing, misordered, or duplicate pages. After applying OCR, our team ran a custom script to check for OCR failures on the volume or page level. During ingest, a script reported issues importing images, OCR, metadata, or supplementary files into our digital platform. Following ingest, our team reviewed each volume within the Readux user interface as a final check on the book’s presentation in the digital library.
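One of the automated checks described above, catching missing and duplicate pages, can be sketched over scan filenames; the naming scheme here is hypothetical:

```python
import re

def page_sequence_issues(filenames):
    """Report missing and duplicate page numbers in a volume's scan
    filenames (e.g. 'vol_0001.jpg'). Illustrates the kind of
    sequence check run during quality control."""
    numbers = sorted(int(re.search(r"(\d+)\.jpg$", name).group(1))
                     for name in filenames)
    seen, missing, duplicates = set(), [], []
    for n in numbers:
        if n in seen:
            duplicates.append(n)
        seen.add(n)
    if numbers:
        missing = [n for n in range(numbers[0], numbers[-1] + 1)
                   if n not in seen]
    return {"missing": missing, "duplicates": duplicates}
```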
Bibliographic Research and Curation
The Sounding Spirit Digital Library includes volumes representative of the many communities, traditions, and denominations present across the US South and its diasporas between 1850 and 1925. To build a corpus across institutional boundaries, the Sounding Spirit project team identified and prioritized criteria for including books. The resulting rubric supported bibliographic research and helped the project team select volumes. It also contributed to the development and implementation of project metadata collection and quality control.
- A downloadable version of the project’s prioritized criteria for bibliographic research will soon be accessible here.
- A downloadable version of the project’s “Checklist of Southern Sacred Music Imprints, 1850–1925” is available through Dataverse.
Selection and Condition Assessment
Sounding Spirit’s criteria for including books draw on the project’s foundational research questions and thematic scope. To enumerate and prioritize measurable criteria for the library and to generate specific search processes and extensive lists of search terms for each prioritized collection criterion, project music bibliographer Erin Fulton collaborated with advisors, external subject matter experts, and the core project team. Sounding Spirit used the resulting processes and terms to search the catalogs of our four initial partner institutions, modifying search terms as necessary to intersect with local cataloging practice. We supplemented this process with additional searches of select non-partner institutions and the Online Computer Library Center (OCLC) union catalog. This research helped us identify audiences, genres, periods, and geographies underrepresented in the holdings of our initial partner institutions. We transcribed bibliographic data from each relevant catalog record, along with project-specific information related to our criteria, creating a comprehensive checklist published in 2021 as the “Checklist of Southern Sacred Music Imprints, 1850–1925.”
From these findings, we recruited three additional partner institutions to contribute to this expanded digital library. We applied the same catalog search process to supplement the checklist with volumes from these additional partners’ holdings. To arrive at a list of candidates for digitization at each of the project’s partner archives, we excluded volumes held by non-partner archives and selected the earliest major editions among copies of titles with multiple attested editions and printings.
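The winnowing described above, dropping non-partner holdings and keeping the earliest edition of each title, can be sketched as follows; the row schema is invented for illustration, not the checklist's actual structure:

```python
def digitization_candidates(checklist, partner_holdings):
    """From checklist rows shaped like
    {'title': ..., 'holder': ..., 'year': ...}, keep copies held by
    partner archives and select the earliest edition of each title."""
    candidates = {}
    for row in checklist:
        if row["holder"] not in partner_holdings:
            continue
        current = candidates.get(row["title"])
        if current is None or row["year"] < current["year"]:
            candidates[row["title"]] = row
    return list(candidates.values())
```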
Staff at each partner institution carefully examined all digitization candidates by hand, assessing the books for condition and completeness. With input from partner archive staff, our Condition Assessment Working Group drafted a project Condition Assessment Guide and a spreadsheet template for pulling books and recording results. Archive staff pulled candidate volumes, recorded the presence of local duplicate copies, and checked to ensure the accuracy of our initial pull lists. In examining each copy, staff:
- Counted the total number of images to be captured for digitization,
- Documented any pages missing from the copy,
- Described any condition issues and recorded the general condition of the volume,
- Recorded the height and width of each copy, and
- Made a recommendation as to whether the book was a good candidate for digitization according to our standards.
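A spreadsheet template for recording the results of the steps above might be rendered as CSV like this; the column headings are modeled on those steps but are illustrative, not the project's actual template:

```python
import csv
import io

# Illustrative columns mirroring the condition assessment steps;
# the project's real spreadsheet template may differ.
FIELDS = ["call_number", "image_count", "missing_pages",
          "condition_notes", "height_cm", "width_cm",
          "recommend_digitization"]

def assessment_csv(rows):
    """Render condition assessment results as CSV for partner staff."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```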
These results enabled our team to choose among multiple copies of a work and to exclude digitization candidates that were fragile, damaged, or too incomplete. When occasional condition issues surfaced after books were sent for digitization, condition assessment results allowed us to readily identify replacement candidates where possible.
Metadata
Sounding Spirit Digital Library metadata is governed by a metadata application profile that establishes the scope and process for metadata collection for each digitized volume. Our Metadata and Research Data Working Group, composed of project team members and advisors with subject matter expertise and a range of relevant experiences, developed this profile, as well as a metadata guide and instructions. The combined resources provide detailed information on each metadata field, including the definition of each field, the source of the field’s information, and acceptable values for that field.
- A downloadable version of the project’s metadata application profile will soon be accessible here.
- A downloadable version of the project’s metadata guide and instructions will soon be accessible here.
Values
The scope of metadata collection defined in the metadata application profile was driven by three values aligned with our project’s overarching goals:
- Browsing and faceted search. Some cataloging conventions produce metadata that is unfamiliar or confusing to site visitors; we sought alternatives that make browsing the books as accessible as possible. We also considered how the needs and interests of our audience could best be met by collecting related metadata and identified which metadata fields would be most useful for faceting search results.
- Interoperability through data sharing partnerships. To access the range of disciplinary audiences and publics interested in these volumes, we forged metadata sharing partnerships with the Digital Library of Georgia (a state affiliate of the Digital Public Library of America), Répertoire International des Sources Musicales, and the Atla Digital Library. We developed our scope of metadata collection to meet the data sharing requirements for these partners and adopted external controlled vocabularies whenever possible to facilitate interoperability and support faceting.
- Quantitative research. To ensure our metadata would help researchers learn about vernacular sacred music, we included elements that relate to thematic areas of interest. Specifically, we foregrounded data pertaining to demographics, place, religion, genre, and format and developed locally controlled vocabularies to support research using such metadata as a controlled variable.
Metadata Fields and Categories
Project metadata fields fall into two categories:
- Those that can be populated without reading music, by referencing the book’s covers, front matter, and internal pages or by consulting catalog records, WorldCat records, and websites associated with external controlled vocabularies employed by the project, and
- Those requiring music literacy or subject matter expertise related to musicological, historical, or religious studies fields.
These distinctions allowed project team members with varying levels of subject matter expertise, including music literacy, to collect metadata.
Following the completion of metadata collection, team members spot checked metadata records, identifying areas where additional review could improve metadata quality and ultimately producing a prioritized list of issues to address through metadata quality control. Using OpenRefine, Python scripting, and our metadata collection platform, we made thousands of automated, semi-automated, and manual corrections, improving metadata quality and consistency.
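A minimal sketch of the automated corrections described above might normalize whitespace and punctuation and map observed variants onto a controlled vocabulary term; the variant mapping here is invented for illustration:

```python
import re

# Mapping from observed variants to preferred terms; these values
# are invented for illustration, not the project's vocabulary.
VOCABULARY = {"seven shape": "seven-shape", "7-shape": "seven-shape"}

def normalize_value(value):
    """Collapse internal whitespace, strip stray trailing punctuation,
    and map known variants onto a controlled vocabulary term."""
    cleaned = re.sub(r"\s+", " ", value).strip(" .,;")
    return VOCABULARY.get(cleaned.lower(), cleaned)
```

In practice the same kinds of rules were applied in bulk via OpenRefine transformations and scripts rather than one value at a time.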
In addition to contributing metadata to our data sharing partners, we also shared our metadata research findings, where possible, with those responsible for maintaining the external controlled vocabularies we used. Our project team contributed a dozen new place names or corrections to the Getty Thesaurus of Geographic Names, entries for more than 1,000 people to the Personal Name authority of the Répertoire International des Sources Musicales, and more than 200 personal names to the Library of Congress Name Authority File.