With 100% OCR on the line, are ASCII characters a bane or boon for outsourcers?

Published on February 29, 2012

Recent advances in Optical Character Recognition, or OCR for short, have led to the development of scanning equipment and software systems that are able to reproduce printed text with near-perfection. By scanning a printed page with specialized optical hardware, an OCR system breaks down the characters into recognizable shapes and recreates the document in memory, allowing the instantaneous, photographic transfer of documents from pen and paper to computer hard drive.

While OCR is not quite perfect (yet), efforts are underway to increase OCR sensitivity, giving text-reading hardware and software platforms the ability to gather more detail from printed characters and reproduce them more effectively. It may sound like science fiction, but it is undoubtedly science fact. And while reading text off a page might sound incredibly easy (five-year-olds do it every day), it turns out that computers and software don’t see the world the same way we do, and that’s causing problems with the implementation of OCR.

The primary culprit in the delay of 100% OCR is ASCII, one of the oldest and best-known sets of character codes in America and around the world. ASCII represents each character of the English alphabet (along with digits, punctuation marks, and special symbols like # or @) as a single numeric code. However, there is a multitude of programs producing alphabets from ASCII character codes, and each one looks a bit different.

Think of it like fonts: ASCII tells the computer to type the letter “A”, but that letter is going to look different depending on what kind of font the software program is presently using. It might look like “A” or “A” or “A” or one of a hundred thousand other font options out there. While we can easily recognize each of these letters as a capital A, an optical scanning device is easily confused by italics, serifs, or fonts it has never seen before. Since ASCII provides no dimensions for how letters are printed on the final page, it allows an essentially infinite number of different fonts, creating an insoluble problem for OCR readers.
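To make the point about character codes concrete, here is a minimal Python sketch. The helper name ascii_codes is invented for illustration and is not part of any OCR product. It shows that ASCII stores only a number for each character, with nothing about the shape, size, or font of the printed letter.

```python
def ascii_codes(text):
    """Return the ASCII code number for each character in text."""
    # encode() raises UnicodeEncodeError if a character is outside ASCII
    return list(text.encode("ascii"))


# A capital A is stored as 65 no matter which font later draws it.
print(ascii_codes("A"))    # [65]
print(ascii_codes("A#@"))  # [65, 35, 64]
```

The number 65 is all that ASCII records for a capital A. How that A ends up looking on paper is decided entirely by the font the printing program chooses, which is exactly the information an OCR reader has to guess at.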

Ultimately, the only solution to the problem, and the only way to reach the Holy Grail of 100% OCR, is to do away with the dated ASCII system. Only by introducing a new alphabet code at the source level, one that defines the dimensions of letters in a way that satisfies the needs of OCR developers, can the industry finally move to the next level of document compatibility.

– The Data Czar @ DEO

