| By Wendy Kurtz, Digital Humanities Specialist |
As the first Digital Humanities Specialist with Gale, I have been anticipating the official release of Gale’s Digital Scholar Lab since joining the team. I came aboard just over a year ago after completing my PhD in Hispanic Language and Literatures at the University of California, Los Angeles. Because of my involvement in digital humanities initiatives on campus and abroad, I was excited to put my experience into practice by participating in the development of Gale’s Digital Scholar Lab. During my four years as a Research and Instructional Technology Consultant with the UCLA Center for Digital Humanities, I supported humanities faculty and graduate students in the use of technology for instruction and contributed to digital research projects undertaken in cooperation with the center. When I learned about the goals for Digital Scholar Lab, I immediately saw the value of the research environment for both scholarship and classroom use. Over the past year, I have been a part of the evolution of Digital Scholar Lab as it progressed from alpha to beta and now to the first production release. This post describes the impetus for creating the Lab and gives an overview of the beta testing initiatives that informed the finalized design, workflow, features and functionality of the first release. It finishes by describing some of our next steps as we move beyond the initial launch.
Gale’s Digital Scholar Lab provides a new way to approach the millions of digital pages made available through Gale Primary Sources collections by facilitating the interrogation of these documents using text mining methodologies. In doing so, Digital Scholar Lab tackles some significant barriers to entry into the field of digital humanities, specifically for text mining and visualization projects. These barriers include the compilation and curation of textual data for analysis, as well as the integration of a variety of analysis tools to mine a corpus of documents.
Currently, analysis of digital texts found online from sources such as Project Gutenberg, Google Books, HathiTrust or any of Gale Primary Sources involves downloading one document of OCR (optical character recognition) text at a time, compiling these individual documents into a corpus, then running the collection through any number of text mining tools. In the first illustration below, we see one primary source document next to the OCR text output, which has been downloaded by the researcher, then uploaded and analyzed in Voyant. Using this method, the process of collecting, curating and formatting multiple documents to create a content set for analysis can take months or even years to complete. Ultimately this process often proves unsustainable for the compilation of larger corpora.
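To make the manual workflow concrete, here is a minimal Python sketch of its middle steps: gathering individually downloaded OCR text files into a corpus and running a very simple word count over it. It is an illustration only, not how Digital Scholar Lab or Voyant works internally. The function names `build_corpus` and `term_frequencies` are hypothetical, and the sketch assumes the OCR output has already been saved as plain `.txt` files in a single folder.

```python
import re
from collections import Counter
from pathlib import Path


def build_corpus(folder):
    """Read every .txt file in `folder` into a {filename: text} dict.

    Assumes each file holds the OCR text of one downloaded document.
    """
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # OCR output can contain stray bytes; replace them rather than fail.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus


def term_frequencies(corpus):
    """Count lowercased word tokens across all documents in the corpus."""
    counts = Counter()
    for text in corpus.values():
        # Runs of letters only (accented letters included); digits and
        # punctuation are dropped. Real projects usually need more careful
        # tokenization and OCR clean-up than this.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts
```

Calling `term_frequencies(build_corpus("ocr_downloads")).most_common(20)`, where `ocr_downloads` is a placeholder folder name, would list the twenty most frequent words. Even this small script leaves out the slow parts of the process, namely locating, downloading and curating each document by hand.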
Even once a large dataset has been created, there are still hurdles to clear before a humanities researcher can embark on the process of text mining and analysis. In many cases, the tools themselves are complex, and mastery of just one requires a considerable investment of time and effort. Even out-of-the-box applications, such as Gephi, aren’t aimed at the complete beginner and often assume some technical knowledge to install and use. Beyond that, tools like MALLET, until recently, required the use of command line operations, which can be intimidating to the novice. While Python and R are popular languages for querying data, learning them is out of scope for many students and researchers in the humanities. What we’ve built in Gale’s Digital Scholar Lab links the content sets you create within the platform directly with the digital tools to analyze them. Learning to use a command line interface or to write your own scripts is valuable, and the process certainly has its place in research and teaching. But for scholars new to the field, or within the context of a traditional humanities classroom (vs. a specific digital humanities course), the flexibility of Digital Scholar Lab offers tremendous value.
Digital Scholar Lab is a cloud-based research environment that allows students and scholars to apply natural language processing tools to raw OCR text in one platform. It was developed for use in the humanities to explore custom-curated corpora of documents. This infrastructure is designed specifically with humanists in mind, with original content that has been prepared for use in the platform.
Gale’s Digital Scholar Lab research environment provides:
- access to a wide variety of texts from Gale Primary Sources.
- the ability to build custom-curated content sets from these collections.
- access to powerful text mining tools which are embedded in the dataset curation process.
- the organization of research in one space.
- the ability to export the statistical data and visualization outputs from your analysis.
As we have continued iteratively developing Digital Scholar Lab, we have undergone several rounds of beta testing with a variety of institutions of higher learning to ensure that the production release provides the tools, content, and workflow desired by a wide spectrum of end-users in support of digital scholarship and text mining. Over the course of the past two years, we have evaluated eight different prototypes with a host of potential users, ranging from faculty and digital humanities practitioners to librarians and graduate students. Beta version testing occurred in the early part of 2018, and we have been implementing changes to the platform based on our testers’ feedback. We feel fortunate to have worked closely with these institutions and users in order to fine-tune our development efforts as we moved towards the official launch.
The Lab alleviates some of the pain points associated with the traditional workflows of text mining and visualization projects. We have designed the interface to be approachable, but that does not mean that the analysis methods included in the platform, their implications and the interpretation of their output are easy. Understanding how the default tool configurations work and how customizations affect the output of an analysis is not something that can (or should) be simplified. Digital Scholar Lab surfaces the complexities of the process, especially for novice users who may not be aware of the decisions that went into something as fundamental as how a publisher like Gale constructs an archive. Working with Digital Scholar Lab in a classroom setting offers multiple avenues of discussion for users with varying levels of digital literacy. These range from content and curation processes, such as OCR creation, metadata standards and data set building, to topics related to distant reading and the interpretation of visualizations. We have made the workflow and process of creating and then analyzing a personalized archive as transparent as possible for our users. They can evaluate not only their research outcomes but also the back-end methodologies implemented in Gale’s Digital Scholar Lab to get those results.
Even though today marks our official release and is an important milestone in Digital Scholar Lab’s journey, it’s only the beginning. Users will continue to see updates and improvements on a regular basis. We’ll continue to maintain a close connection with the Digital Scholar Lab community of scholars, librarians and students to respond to their needs as we plan future Lab development. As the Digital Humanities Specialists at Gale, Dr. Sarah Ketchley and I, along with other members of the digital scholarship team, will be working closely with users to help them achieve their research and pedagogical goals through use of the Lab. Looking forward, our next steps include an integrated OCR text cleaning solution within Gale’s Digital Scholar Lab, as well as enhancements to the tool suite and more robust interactivity for the visualizations.