Saturday, July 25, 2020

The Machine-Readable Text: Methods of Conversion


Although the Workshop did not include a systematic examination of the methods for converting texts from paper (or from facsimile images) into machine-readable form, various speakers touched upon the matter. For example, WEIBEL reported that OCLC has experimented with a merging of multiple optical character recognition systems that will reduce errors from an unacceptable rate of 5 characters out of every 1,000 to a still unacceptable rate of 2 characters out of every 1,000.
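The text does not say how OCLC merged the output of its recognition systems. One common way to combine several engines is a per-character majority vote, and the following is a minimal sketch of that idea only, not OCLC's method. It assumes the outputs have already been aligned to equal length (real OCR output would need an alignment step first); the function names `majority_vote` and `errors_per_thousand` are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Merge pre-aligned, equal-length OCR outputs by per-position majority vote.

    Assumption: every string in `outputs` has the same length and position i
    refers to the same character on the page in each of them.
    """
    if not outputs:
        raise ValueError("need at least one OCR output")
    length = len(outputs[0])
    if any(len(text) != length for text in outputs):
        raise ValueError("this sketch assumes pre-aligned, equal-length outputs")
    merged = []
    for chars in zip(*outputs):
        # Pick the character most engines agree on at this position.
        # On a tie, the character seen first (earliest engine) wins.
        merged.append(Counter(chars).most_common(1)[0][0])
    return "".join(merged)


def errors_per_thousand(text, truth):
    """Character errors per 1,000 characters, against a known-correct text."""
    if len(text) != len(truth) or not truth:
        raise ValueError("texts must be non-empty and of equal length")
    wrong = sum(a != b for a, b in zip(text, truth))
    return 1000 * wrong / len(truth)


if __name__ == "__main__":
    truth = "machine-readable"
    engines = ["machine-readab1e", "mach1ne-readable", "machine-reabable"]
    merged = majority_vote(engines)
    print(merged, errors_per_thousand(merged, truth))
```

In this toy case each engine makes one error at a different position, so the vote recovers the correct text. When engines make the same mistake at the same position, voting cannot help, which is consistent with a merged rate that improves but does not reach zero.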

Pamela ANDRE presented an overview of NAL's Text Digitization Program, and Judith ZIDAR discussed the technical details. ZIDAR explained how NAL purchased hardware and software capable of performing optical character recognition (OCR) and text conversion and used its own staff to convert texts. The process, ZIDAR said, required extensive editing, and project staff found themselves considering alternatives, including rekeying, creating abstracts or summaries of texts, or both. NAL reckoned costs at $7 per page.

By way of contrast, Ricky ERWAY explained that American Memory had decided from the start to contract out conversion to external service bureaus. The criteria used to select these contractors were cost and quality of results, as opposed to methods of conversion. ERWAY noted that historical documents or books often do not lend themselves to OCR. Bound materials represent a special problem. In her experience, quality control (inspecting incoming materials, counting errors in samples) posed the most time-consuming aspect of contracting out conversion. ERWAY reckoned American Memory's costs at $4 per page, but cautioned that fewer cost elements had been included than in NAL's figure.
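The quality-control step described here, counting errors in samples of delivered material, can be illustrated with a short sketch. The source gives no sample sizes or acceptance thresholds, so every number and name below (`sample_pages`, `batch_error_rate`, `accept_batch`, `max_per_thousand`) is a hypothetical choice made for illustration.

```python
import random


def sample_pages(page_ids, sample_size, seed=None):
    """Draw a random sample of pages to proofread from a delivered batch."""
    pages = list(page_ids)
    rng = random.Random(seed)
    return rng.sample(pages, min(sample_size, len(pages)))


def batch_error_rate(error_counts, char_counts):
    """Errors per 1,000 characters, pooled over all sampled pages.

    error_counts[i] is the number of errors a proofreader found on page i,
    char_counts[i] is the number of characters on that page.
    """
    total_chars = sum(char_counts)
    if total_chars == 0:
        raise ValueError("sample contains no characters")
    return 1000 * sum(error_counts) / total_chars


def accept_batch(error_counts, char_counts, max_per_thousand):
    """Accept the batch if the sampled error rate is within the limit.

    max_per_thousand is a hypothetical contract threshold, not a figure
    taken from the Workshop.
    """
    return batch_error_rate(error_counts, char_counts) <= max_per_thousand


if __name__ == "__main__":
    chosen = sample_pages(range(1, 501), sample_size=3, seed=0)
    print("pages to inspect:", chosen)
    errors, chars = [3, 1, 2], [2000, 1500, 2500]
    print(batch_error_rate(errors, chars), accept_batch(errors, chars, 2.0))
```

With 6 errors found in 6,000 sampled characters, the pooled rate is 1 error per 1,000 characters. The proofreading itself remains manual, which fits ERWAY's observation that quality control was the most time-consuming part of contracting out conversion.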


Source: Project Gutenberg's LOC Workshop on Electronic Texts, by Library of Congress
