Optical Character Recognition (OCR) Improvement: Enabling Deeper Historical Research through Innovation


By Gale Staff

For more than 20 years, Gale has pioneered the application of powerful optical character recognition (OCR) technology to create some of the world’s largest and most widely researched digital archives. With OCR, you can quickly turn a printed text into a searchable document.

Read the Q&A below with Gale senior product manager Megan Sullivan and learn more about OCR’s capabilities and improvements coming to The Times Digital Archive and Eighteenth Century Collections Online.

Can you provide some background information about how Gale has applied OCR technology to historical archives?

It started back in 2002 with the launch of the Gale Primary Sources program, which digitized the near-complete run of The Times (London). The following year, Gale digitized the 180,000 titles and 32 million pages that make up Eighteenth Century Collections Online (ECCO). With both projects, Gale Primary Sources was a forerunner in large-scale digitization initiatives with full-text search, and both archives remain heavily used today, with millions of user search queries annually.

Digitizing these collections can save users an incredible amount of time when they’re looking for something specific. What limitations are there when applying OCR to a historical collection?

We have made a tremendous amount of progress in the past 20 years. Some limitations remain today, and these are largely related to the original document itself. What condition is it in? How legible is the printed text? Does it have many annotations? Can one see the complete text, or is some text obscured by the central gutter of the book due to tight bindings? How old is it? Other factors may include the equipment used to scan the original document or the maturity of the OCR algorithm used at the time of creation. That said, OCR technology has improved significantly since we first published The Times Digital Archive and ECCO.

How is Gale considering advancements in OCR technology to address these limitations and what new opportunities do they provide?

We’re committed to the continuous development of the Gale Primary Sources program. As part of these ongoing efforts, Gale has rerun significant portions of The Times Digital Archive and ECCO through updated OCR technology to improve their searchability.

Can you tell us more about rerunning the OCR on ECCO and The Times, and what impact that might have for users?

Rerunning the OCR for ECCO and The Times will provide an improved dataset for users to search against when looking for sources in the collections. Improving the OCR data facilitates even more accurate search results, enabling users to bring their research questions into greater focus through a lens of historical documents.

Which parts of The Times and ECCO is Gale upgrading?

For The Times Digital Archive, we reran all newspaper issues published from 1785 to 1825 and from 1900 to 1920. For ECCO, we reran all documents written by women and BIPOC authors as well as any material written by prominent eighteenth-century authors as defined by the original ECCO selection criteria. We also reran anything identified as a dictionary, encyclopedia, or a periodical.

Given the large volume of content in The Times Digital Archive and ECCO, how did your team determine what to rerun to ensure user relevance?

For both The Times and ECCO, we used a combination of user data and research trends to identify content sets for OCR improvement. In many instances, we relied on user search trends and document retrieval data to better understand which topics and time periods users search and interact with. We also relied directly on user feedback. Our product team logs every user request that comes through to us, and we took these requests into consideration when identifying this content. Finally, we looked at broader trends in scholarship to analyze current research. For example, we identified works by women and BIPOC authors as top candidates for OCR improvement.

Will these updates affect the OCR Confidence scores?

The quality standards and scoring algorithms of newer OCR engines have evolved and become more rigorous over the years. This means that upgraded documents may have OCR Confidence scores that remain roughly equivalent, or in some cases they may decrease slightly. An OCR Confidence score reflects the confidence of the software that produced it, so it can be difficult to compare scores from the software used twenty years ago with scores from the software used today.

How can users access the full list of titles for documents affected by OCR upgrades?

Anyone interested in these upgrades can request title lists for The Times Digital Archive and ECCO by emailing [email protected]

Will users be able to access previous versions of the OCR after updates come into effect?

Yes, the original OCR will be archived and available to be shipped to users on a physical drive. If users require this data, they can send an email to [email protected]

Moving forward, will Gale continue with OCR updates in other digital archives?

We have plans to upgrade the OCR for additional products in 2024. Interested users can enter their email below to stay up to date when additional collections are upgraded!

Gale Digital Scholar Lab users who access and curate materials from these archives will see updated OCR and OCR Confidence scores automatically when the enhancements described above are released.
