By Amanda Henrichs
Recently, the HathiTrust held its third semi-annual UnCamp at UC Berkeley: over two days, researchers and librarians from all over North America presented research, outreach efforts, and pedagogical strategies using the 16 million digitized volumes in its collection. Begun in 2008 as a partnership between the Big Ten Academic Alliance and the UC library system, HathiTrust now boasts partner institutions all over the world. As one of the largest repositories of digitized books, HathiTrust maintains “an international community of research libraries committed to the long-term curation and availability of the cultural record.” This ambitious goal is supported by the efforts of the partner libraries, which maintain the metadata records and provide the scans of the digitized texts.
A good portion of the UnCamp (the unusual name signals that participants set part of the agenda, including voting on sessions and electing session leaders on the day of the conference) featured reports on HathiTrust’s efforts to make its massive collections accessible. For much of its existence, HathiTrust has been building its collections and developing the infrastructure necessary to maintain such a large and disparate database. After all, the texts contained in HathiTrust are not owned by HathiTrust; all digital books and their records remain the property and responsibility of the library that digitized them.
The fact that the libraries retain control of and responsibility for their books is both one of HathiTrust’s main benefits and one of the main challenges facing its users. Because HathiTrust does not seek to own the books in question, it can offer a dramatically larger collection by pulling from libraries all over the world. On the flip side, because HathiTrust does not control any of the records, users of the database often face problems with incomplete metadata: for example, almost all records have titles, but only 80% have subject headings. And it’s safe to say that many exploratory searchers will not know the exact title they are looking for! These gaps in the records can pose real challenges for researchers at all levels, from faculty to undergraduates.
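To make the metadata-gap problem concrete, here is a minimal sketch of how one might measure field completeness across a batch of bibliographic records. The records and field names below are invented for illustration; they do not reflect HathiTrust’s actual metadata schema.

```python
# Hypothetical bibliographic records -- NOT HathiTrust's real schema.
records = [
    {"title": "Poems", "subject": "English poetry", "year": 1667},
    {"title": "Paradise Lost", "subject": None, "year": 1674},
    {"title": "Leviathan", "subject": "Political science", "year": 1651},
    {"title": None, "subject": None, "year": 1644},
]

def completeness(records, field):
    """Fraction of records where `field` is present and non-empty."""
    present = sum(1 for r in records if r.get(field))
    return present / len(records)

for field in ("title", "subject", "year"):
    print(f"{field}: {completeness(records, field):.0%} complete")
```

Run against a real export, a report like this quickly shows which fields a researcher can rely on for filtering and which would silently shrink a search.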
And yet, for researchers who are interested in computational text analysis (CTA) or in working with the metadata of library records such as author, location of publication, date, or subject (basically, the information that is found in bibliographic records), HathiTrust is the gold standard. So what are we to do with the fact that even the gold standard has missing and incomplete information?
One answer might be simply to make a list of the things that are missing, as a group of participants did in the final UnCamp session on Friday afternoon. A group of scholars, librarians, and archivists had signed up to talk about what to do with missing data when constructing a corpus; while we came from many different backgrounds, careers, and institutions, we all shared an interest in corpus construction. There, we compiled a list of the many different kinds of gaps or missing data that might affect our work; I’ll share that list with you here, along with examples where helpful.
Textual losses of n kinds:
1. Records are collected but lost. For example, a book might have been correctly documented in a card catalog, but the card was lost when digitizing the collection.
2. Books that were never collected at all. This is perhaps the most frustrating gap, since in this scenario we don’t even know what we’re missing.
3. Then there are texts that are collected but not described;
a. or not encoded. In 3 and 3a, description and encoding are different versions of the same problem: both mean a lack of some kind of digital record of the physical object. Whether that is a missing subject heading in the bibliographic metadata, or a lack of the more sophisticated tagging that enables the HathiTrust system (called Zephir) to manage duplicate records, missing description or encoding can render a text invisible.
4. Being protected by copyright can make a text missing! HathiTrust’s strength is in making the full text of out-of-copyright books available, though they are working on the complex legal negotiations necessary to digitize and make searchable books that are still in copyright. But if the researcher has a question dealing with books published after 1877, they might find that they can’t access the text, effectively rendering it missing for their purposes.
5. Lack of full-text digitization or bad optical character recognition (OCR, the process by which scanned page images are converted into machine-readable text that software can search or display) can be disastrous for a researcher. I personally work on seventeenth-century British poetry: unfortunately, much of the OCR for those texts in HathiTrust is utterly illegible. This means I might be able to search the metadata, but not the text itself, walling off certain kinds of research questions.
6. Or what if local archives are digitized, but without support they cease to exist? Many small libraries might be working to digitize local records; but digital books need support and maintenance just as physical books do. Often, smaller archives might not have the resources to maintain their collections, and they run the risk of disappearing.
a. or if there are archives that aren’t accessible to digitization, or archives that are closed? Some libraries simply do not digitize their texts; this is connected to #2, and we don’t even know what we’re missing from a closed archive.
b. Or, sometimes archives digitize microfilm rather than the book itself. This can produce poor-quality text, resulting in a lot of noise in the system.
7. From a social sciences researcher: what constitutes a research archive/set in the first place? What is excluded?
a. Sometimes there is a lack of awareness of where the data is coming from, and the limits that imposes. In other words, research bias can create gaps, even if that bias is as innocuous as a failure to consider that some of your texts might have poor-quality OCR.
i. This raises the (enormous) question: are we in fact scripting bias into our datasets in the construction of search algorithms?
8. The fact of publication reinforces gaps. HathiTrust is fundamentally concerned with published texts (see also #4 about the public domain). There are a few digitized manuscripts, but not many.
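Several of the gaps above can at least be flagged automatically when assembling a corpus. Here is a minimal sketch under invented assumptions: each record carries an OCR-quality score, a publication year, and a subject field (all hypothetical names, not HathiTrust’s actual schema), and we label each record with the kinds of gaps (#4, #5, 3a) that would make it unusable for a given project.

```python
# Hypothetical corpus records; field names (year, ocr_quality, subject)
# are illustrative only, not HathiTrust's real metadata schema.
COPYRIGHT_CUTOFF = 1877   # post-cutoff books may be inaccessible (see #4)
MIN_OCR_QUALITY = 0.8     # below this, treat full text as unreliable (see #5)

corpus = [
    {"id": "a1", "year": 1667, "ocr_quality": 0.35, "subject": "Poetry"},
    {"id": "a2", "year": 1850, "ocr_quality": 0.92, "subject": None},
    {"id": "a3", "year": 1901, "ocr_quality": 0.95, "subject": "History"},
]

def audit(record):
    """Return a list of gap labels for this record (empty list = usable)."""
    gaps = []
    if record["year"] > COPYRIGHT_CUTOFF:
        gaps.append("in copyright: full text may be unavailable")
    if record["ocr_quality"] < MIN_OCR_QUALITY:
        gaps.append("poor OCR: full-text search unreliable")
    if not record["subject"]:
        gaps.append("no subject heading: hard to discover")
    return gaps

usable = [r for r in corpus if not audit(r)]
for r in corpus:
    print(r["id"], audit(r) or "ok")
```

An audit like this does not fill any gap, of course; it only makes the gaps visible before they silently distort a research question.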
This long list started to seem overwhelming: can we in fact do any kind of research, when every question stumbles across a gap in the archival record? One really interesting possibility, though, is the new data capsule feature under development. Briefly, data capsules are a way for researchers to analyze data without actually downloading, reading, or personally accessing any of that data. (The full documentation is here.) What this means in practice is that researchers can create a corpus—or workset in HTRC parlance—load that corpus into a capsule, and run their analysis on the dedicated server space HathiTrust provides for capsule users. This kind of research doesn’t address all of the gaps or missing data by any means; but data capsules offer researchers the opportunity to access many more kinds of data, and to tailor their queries to account for some of those gaps.