by Ray Bankoski
Thanks to the digital age, we have access to information like never before. Although a single online database may host millions of digital images, users can easily and quickly find the exact piece they need by inputting a simple search term. Well, this process may seem simple to the user, but an incredible amount of work goes into creating the ease of access so many take for granted. This seemingly magical solution is metadata, and it is the key to the world-renowned Gale Digital Collections product line produced by Gale, part of Cengage Learning.
Metadata is an elastic model that varies greatly depending on the type of document it describes. The characteristics associated with a book differ from those of a newspaper; those of a newspaper differ from those of a journal, and so on. Each kind of primary source is described in a file called a Document Type Definition (DTD), while captured metadata is output in an Extensible Markup Language (XML) file.
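To make the idea of an XML metadata file more concrete, here is a minimal sketch in Python. The element and attribute names (document, page, articleTitle, and so on) are invented for illustration and are not taken from Gale's actual DTDs; the sketch only shows how captured metadata in XML form might be read back by a program.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record. The element and attribute names are
# illustrative assumptions, not Gale's real schema.
SAMPLE_XML = """
<document type="newspaper">
  <title>The Daily Example</title>
  <page number="3">
    <article>
      <articleTitle>Food Prices Rise</articleTitle>
    </article>
  </page>
</document>
"""


def summarize(xml_text):
    """Collect the document type, title, and per-page article titles."""
    root = ET.fromstring(xml_text.strip())
    pages = []
    for page in root.findall("page"):
        titles = [a.findtext("articleTitle") for a in page.findall("article")]
        pages.append({"number": page.get("number"), "articles": titles})
    return {
        "type": root.get("type"),
        "title": root.findtext("title"),
        "pages": pages,
    }


summary = summarize(SAMPLE_XML)
print(summary)
```

A book or a journal would presumably carry a different set of elements, which is why each kind of source gets its own definition.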
The phrase “capturing metadata” is a tad misleading. Metadata elements are actually defined by manually entering information about each and every page image into a file. This is a time-consuming task that requires a team of hundreds. Each page is placed under careful scrutiny, requiring an approach unique to its content type. Among the elements captured are page numbers, chapter headings, article titles, and graphic captions (when available). The process requires multiple operators to key in information from the same pages, compare their input, and review the results to ensure the capture is correct.
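The compare-and-review step can be pictured with a short Python sketch. It assumes, purely for illustration, that each operator's keying is held as a simple dictionary of field names and values; the function name and field names are hypothetical, and this is not a description of Gale's actual tooling.

```python
def compare_keyings(first, second):
    """Return the fields on which two operators' entries disagree.

    Each argument maps a field name to the value one operator keyed in.
    A field that is missing from one keying counts as a disagreement.
    """
    mismatches = {}
    for field in sorted(set(first) | set(second)):
        a, b = first.get(field), second.get(field)
        if a != b:
            mismatches[field] = (a, b)
    return mismatches


# Two hypothetical keyings of the same page image.
operator_a = {
    "page_number": "12",
    "chapter_heading": "The Voyage Home",
    "caption": "Map of the harbor",
}
operator_b = {
    "page_number": "12",
    "chapter_heading": "The Voyage Horne",  # a keying slip
    "caption": "Map of the harbor",
}

to_review = compare_keyings(operator_a, operator_b)
print(to_review)
```

Only the fields that disagree would need a reviewer's attention; fields on which both operators match can pass through.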
After the first level of capture has passed its quality control check, level two commences. This involves moving from a focus on words to a focus on image segments in a process known as intelligent tagging. First, all graphics receive an assigned graphic type (such as cartoon, portrait, or chart) and get recorded along with their associated captions. This added step allows users to search for specific graphics, such as a portrait of Abraham Lincoln or a chart of food prices in 1856. This same technique of intelligent tagging is applied to newspaper, journal, and magazine articles. Possible classifications include advertisements, illustrations, and birth notices. As a result, end users are able to limit their searches to return only the desired classifications, easily homing in on the exact information they want to find.
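As a rough illustration of how a classification can narrow a search, the Python sketch below filters a handful of tagged items by caption text and, optionally, by type. The TaggedItem class, the search function, and the sample records are all hypothetical; a real search platform would be far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedItem:
    kind: str     # assigned classification, e.g. "portrait" or "chart"
    caption: str  # caption recorded alongside the item


def search(items, term, kind=None):
    """Return items whose caption contains term, optionally limited to one kind."""
    term = term.lower()
    return [
        item
        for item in items
        if term in item.caption.lower() and (kind is None or item.kind == kind)
    ]


# Hypothetical sample records.
ITEMS = [
    TaggedItem("portrait", "Abraham Lincoln, seated"),
    TaggedItem("cartoon", "Abraham Lincoln and the rail splitters"),
    TaggedItem("chart", "Food prices in 1856"),
]

everything = search(ITEMS, "lincoln")
portraits = search(ITEMS, "lincoln", kind="portrait")
print(len(everything), len(portraits))
```

Without the classification, a search for "lincoln" returns both the portrait and the cartoon; with it, only the portrait comes back.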
Recognizing Hand Scripts
Handwritten notes and documents present special challenges, since this material cannot be captured using the Optical Character Recognition (OCR) software that is used to transform printed text into searchable words. At Gale, we have developed an exciting new proprietary process that aids in the capture of people’s names, place names, and dates found on these documents. The result is that a wealth of contextual information is being made discoverable for the very first time, thus opening the door to new research opportunities that just weren’t possible before.
Remaining Committed to Quality
Gale captures millions of pages of material each year and has committed to providing quality metadata to help users discover those pages with ease and speed. Our workflow allows for quick and accurate capture of basic information and draws on subject matter experts when more complex capture is needed. Every single metadata element collected is also reviewed for accuracy in a quality control facility operated by Gale.
Yes, Gale truly has made finding a needle in a large data stack possible. So the next time you use a Gale Digital Collection, consider all that has gone into giving you a quick and accurate search experience.
About the Author
In his role as VP, Electronic Asset Management, for Cengage Learning, Ray oversees all aspects of the capture, conversion and quality assurance of rare material for the Gale Digital Collections product line.