Feature
posted 21 Jul 2006 in Volume 3 Issue 2
Information capture
Dangerously good
New developments in data capture technologies are making it easier than ever to transform paper documents and forms into digital information.
By Jessica Twentyman
DATA CAPTURE is growing up. For many years, getting information held on paper documents into back-end systems in data format was a hit-and-miss affair. Now, considerable advances in optical and intelligent character recognition (OCR/ICR) technologies mean that data capture is “ready for prime time and poised to deliver results,” says Robert Markham, an analyst with Forrester Research.
At its heart, OCR technology uses complex algorithms to interpret machine-written text in multiple international character sets, total numeric values and their related currencies; to relate line-item details to business rules; and to recognise a document’s type and origin, based on page layout.
ICR, meanwhile, is used primarily to interpret forms with boxes into each of which users print a capital letter or number. Often, instructions for entering text appear on the form itself. Under these conditions, handwriting tends not to differ much from one user to the next, so recognition rates are normally quite high.
These capabilities – in theory, at least – should result in a low number of ‘exceptions’ that require manual intervention and, therefore, take a great deal of the time and human energy out of processes such as accounts payable, for example. Character recognition, say its proponents, can help in any process in which text or data from forms and documents needs to be captured and delivered to another system.
“The acid test for OCR/ICR is whether or not it reduces ‘residual keystroke activity’ – the manual work that goes into re-keying and correcting capture data. The smaller that number is, the better,” says Rob Packington, sales and marketing director for OCR/ICR supplier SoftCo.
But despite the overall success of OCR and ICR in reducing that activity, says Markham of Forrester, almost one-quarter of all document-imaging enquiries his company received from clients in 2005 were specifically about OCR/ICR.
Why do users have so many concerns and questions about the potential accuracy of OCR/ICR? “It’s because significant variance in both the quality and orientation of scanned images and faxes makes it very hard to account for and correct in the OCR/ICR algorithms,” he says.
And in many cases, conventional OCR/ICR has let users down. At an average rate of 98%, conventional OCR still produces a significant number of errors per scanned page. “Assume a typical full-text page has 2,000 characters. Using this accuracy rate, 40 of the characters will be wrong,” says Markham. “Of those 40 errors, only the 60% of errors marked as suspicious are manually reviewed in typical quality-assurance processes. Errors not marked as suspicious must be caught by logical checks in mainframes, databases or other target applications.”
Historically, moreover, OCR engines have marked roughly two-and-a-half times as many characters as suspicious as there are actual OCR errors. The total number of characters marked as suspicious varies significantly depending on the OCR engine used, and the methodology for marking suspicious characters is configurable for each engine.
Either way, that still creates a hefty workload for data entry clerks, who spend an average of half-a-second checking suspicious characters that are, in fact, correct and one-and-a-half seconds checking suspicious characters that are errors. “Just one second of conventional OCR processing leads to almost 20 seconds of manual editing time, while also leaving almost four errors in each processed page,” says Markham.
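Markham's per-page arithmetic can be reproduced in a few lines. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes that the characters marked as suspicious (two-and-a-half times the error count) include the 60% of real errors that get flagged, with the remainder being correct characters flagged unnecessarily.

```python
def review_workload(chars_per_page=2000, accuracy=0.98,
                    flagged_error_share=0.60, suspicious_ratio=2.5,
                    secs_per_correct=0.5, secs_per_error=1.5):
    """Estimate manual checking effort for one scanned page.

    All defaults are the figures quoted in the article; the way they are
    combined here is an assumption, not Forrester's published model.
    """
    errors = chars_per_page * (1 - accuracy)        # wrong characters
    flagged_errors = errors * flagged_error_share   # errors a clerk will see
    missed_errors = errors - flagged_errors         # errors left for downstream checks
    suspicious = errors * suspicious_ratio          # everything marked for review
    false_alarms = suspicious - flagged_errors      # flagged but actually correct
    review_seconds = (false_alarms * secs_per_correct
                      + flagged_errors * secs_per_error)
    return {
        "errors": round(errors, 2),
        "flagged_errors": round(flagged_errors, 2),
        "missed_errors": round(missed_errors, 2),
        "suspicious": round(suspicious, 2),
        "false_alarms": round(false_alarms, 2),
        "review_seconds": round(review_seconds, 2),
    }


if __name__ == "__main__":
    for name, value in review_workload().items():
        print(f"{name}: {value}")
```

With the quoted figures, this gives 100 suspicious characters, 74 seconds of checking and 16 uncaught errors per page. Those totals are roughly consistent with the quoted "almost 20 seconds" and "almost four errors" per second of OCR processing only if a page takes about four seconds to process, a figure the article does not state.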
Given that scenario, it is no wonder that potential customers have serious doubts about the business benefits of OCR/ICR technologies. “It’s fair to say that OCR/ICR deployments used to have something of the ‘black arts’ surrounding them,” says June Dean, UK managing director of OCR/ICR specialist ITESoft.
“Users considered themselves lucky to capture what they were able to capture and simply had to put up with the amounts of data that were missed or misinterpreted.”
However, many organisations are unaware that advanced OCR/ICR technologies now exist to achieve much higher accuracy rates. “While these advanced techniques are widely utilised by imaging service bureaus, buyers should also recognise that using these techniques in-house will also enhance the benefits of automated information capture,” says Markham of Forrester.
Freestyle capture
So-called ‘free-form’ data capture is one of these advances, because the ability to capture and process the same information located in different places on varying documents should result in improved process efficiencies at lower cost as a result of less manual data entry.
That is important in a world where non-uniform documents are the norm – invoices are redesigned over time, birth certificates vary by place of birth, government forms such as licence applications and voter registrations vary by region, and forms that cross geographic boundaries come in many different languages.
Free-form OCR/ICR can capture information located in different places within documents, such as scanned invoices. “In the past, each invoice had to be predefined with the specific location of data to be captured, requiring companies to painstakingly build templates to feed into OCR/ICR engines,” says Jupp Stoepetie, CEO of OCR/ICR specialist ABBYY Europe.
“OCR/ICR technologies can now process semi- and unstructured forms that contain valuable information but aren’t laid out in a standard format, such as invoices, almost as well as they can process structured forms, such as credit card applications,” he says. In that way, two invoices that contain the same information, but in different locations on the page, can be processed almost as well as two identical invoices.
The first step in enabling free-form data capture is to set rules. High-level rules are set to classify document types and to identify what information to look for within certain document types. “For example, a rule could be that when the word ‘invoice’ is found, the document is classified as an invoice and the OCR engine then looks for phrases like ‘amount due’,” says Markham.
Further rules may also need to be set, he explains. “For example, if the phrase ‘claim number’ is also found in a document with the phrase ‘invoice’, it’s highly likely that the document is not an invoice after all and another set of rules needs to be applied to its capture.”
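The two rules Markham describes can be sketched as a small keyword classifier. This is a minimal illustration rather than how any commercial capture product works: the function names, the "needs_other_rules" label and the field-matching pattern are all hypothetical.

```python
import re

# Phrases to look for once a document type is known (illustrative only).
FIELD_PHRASES = {
    "invoice": ["amount due"],
}


def classify(text):
    """Apply the high-level rules: 'invoice' suggests an invoice, unless
    'claim number' also appears, in which case other rules are needed."""
    lowered = text.lower()
    if "invoice" in lowered:
        if "claim number" in lowered:
            return "needs_other_rules"
        return "invoice"
    return "unknown"


def extract_fields(text):
    """Classify the document, then pull the value following each phrase."""
    doc_type = classify(text)
    fields = {}
    for phrase in FIELD_PHRASES.get(doc_type, []):
        # Take the first run of non-space characters after the phrase.
        match = re.search(re.escape(phrase) + r"\s*:?\s*(\S+)",
                          text, re.IGNORECASE)
        if match:
            fields[phrase] = match.group(1)
    return doc_type, fields


if __name__ == "__main__":
    print(extract_fields("INVOICE 123\nAmount due: 450.00 GBP"))
    print(extract_fields("Invoice ref 9\nClaim number: C-77"))
```

Because the rules key on phrases rather than page coordinates, the same code handles an "amount due" field wherever it appears on the page, which is the point of free-form capture.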
Time to vote
Voting algorithms can also have a dramatic effect on the accuracy of character recognition. These statistically combine the results of multiple OCR/ICR engines to enhance recognition results by reducing the number of results that fall below a threshold and therefore require manual data entry and correction.
Voting algorithms must use at least two OCR/ICR engines, but in most cases, three or more engines are used to further increase the accuracy and reduce ‘false positives’ for unidentified characters requiring correction. “Deploying several OCR/ICR engines that use different identification algorithms and then programmatically comparing the results guarantees a chosen threshold for accuracy never seen before,” says Markham.
By employing voting within OCR/ICR technologies, he claims, organisations can reduce overall errors by up to 65%, saving almost three seconds per scanned page in error-correction time and almost four seconds per page in checking suspicious characters.
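A simple form of voting can be sketched as a character-by-character majority vote. This is a deliberately simplified illustration: it assumes each engine returns a string of the same length for the same field (real systems must first align the outputs and typically weigh per-character confidence scores), and the function name and `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def vote(readings, min_agreement=2):
    """Combine the outputs of several OCR/ICR engines for one field.

    readings      -- one string per engine, all the same length (assumption)
    min_agreement -- engines that must agree before a character is trusted

    Returns the voted text and the positions still needing manual review.
    """
    if len(readings) < 2:
        raise ValueError("voting needs at least two engines")
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this sketch assumes equal-length readings")

    voted = []
    suspicious = []
    for position, candidates in enumerate(zip(*readings)):
        winner, count = Counter(candidates).most_common(1)[0]
        voted.append(winner)
        if count < min_agreement:
            # Too little agreement: flag for a data entry clerk.
            suspicious.append(position)
    return "".join(voted), suspicious


if __name__ == "__main__":
    # Two of three engines agree on every character: nothing to review.
    print(vote(["lnvoice", "Invoice", "Invoice"]))
    # All three engines disagree on the middle character: flag position 1.
    print(vote(["1O0", "100", "1Q0"]))
```

With only two engines, any disagreement must be flagged; a third engine lets the majority settle many of those cases automatically, which is why three or more are commonly used.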
The savings generated by using voting algorithms with multiple OCR/ICR engines rather than conventional methods are “clear and substantial”, says Markham – plus prices have fallen as the voting algorithms used in highly specialised remittance and item processing equipment have gone downmarket into general-purpose document capture systems. In conjunction, then, with free-form capture, the level of automation, speed of processing, and reduction in manual labour make these technologies an absolute requirement for any organisation with substantial forms-processing requirements.
Market transitions
For that reason, vendor offerings are maturing to support user requirements. The capture market was once dominated by independent vendors, but many of these have been either acquired or merged. The remainder, including such suppliers as Kofax, partner actively with leading information management software suppliers. These developments are enabling users to combine advanced capture technologies more closely with business processes, such as mortgage lending in retail financial services; claims processing and underwriting in insurance; patient records in healthcare; various claims in the public sector; and accounts payable and accounts receivable across all industries.
Organisations interested in OCR/ICR, says Markham, need to transition from “scan to archive” to “scan to process”. And, at the same time, they need to consider that capture products come with varying levels of support for data recognition, ranging from single-engine OCR to multiple engine OCR with voting algorithms to achieve the highest possible accuracy rate. “If your organisation only deals in standard forms, simple recognition will likely suffice; if free-form data capture is the norm, look for high-end recognition capabilities in the capture solution,” he says.
Finally, they should focus on solutions that fit with existing enterprise content management (ECM) and BPM systems. If these are in place already, it is likely that the existing product will offer some kind of document capture capability either natively or via partnership. “Give preference to capture solutions that are either native to your ECM system or aligned via tight partnership,” he says.
Case study: Hammersmith Hospital
Hammersmith Hospital in London, part of the UK’s National Health Service (NHS), was the first hospital in Europe to take a ‘filmless’ approach to x-rays. It has now also taken steps to increase efficiency by providing electronic access to the print output results of common clinical tests. The hospital had a number of challenges to tackle, among them:
- Space restrictions: Because of the large amount of clinical equipment at the hospital, floor space is at a premium;
- Low tolerance for specialised training: The system needed to use standard office equipment in order to limit training requirements for clinicians who would capture images of patient test results, which were otherwise only available through manual filing and retrieval;
- Automated indexing: Automating the creation of indexes for future access was a high priority. This approach would keep manual data entry to a minimum. The indexes also had to contain metadata so that secure access to the images could be assured;
- Image quality: The fidelity of the captured images had to be of a very high quality to enable doctors to make medical decisions based on online access to the test results.
Armed with a list of priorities, the hospital undertook a search to match its requirements with a viable solution. During the course of the investigation, elements of the solution emerged and were tested, and a pathway towards implementation began to evolve.
The key elements included:
- High-quality images from a copier: Hammersmith Hospital tested many options to find the optimum image quality and resulting image file size. In addition, the scanning device needed a long duty cycle to minimise downtime. The space requirements of flatbed scanners did not provide a viable solution. When a new multi-function printer (MFP) with integrated scanning capability became available to replace the existing copier, it was tested for scan quality and was found to provide clinical-quality scan images of test results. The printer was from the Xerox WorkCentre line of office MFPs and offered a programmable touchscreen for capturing patient information to be used for indexing;
- Automated creation of image indexes: A combination of elements, such as the patient’s name, date of birth, and hospital number, is required for the index to provide for security and future access to test result images. This information is now collected through user input on a touchscreen on the copier, which was programmed to accept patient index information, and through automated index capture, accomplished through the use of Adobe Acrobat;
- Centralised access and control: All images are stored and indexed in Xerox DocuShare, which replaced the manual access methods used in the past. Many competing systems were evaluated against the key requirements, including integration with MFPs, use of standard off-the-shelf hardware, and simple user interface;
- Use of a standard browser for image access: Xerox DocuShare is used to view captured images because it does not require a browser plug-in to provide image access, thus allowing the use of a standard desktop and browser;
- Expanded process to other test devices: Hospitals generally do not implement document or imaging systems because many manufacturers of test devices provide integrated, proprietary image storage and access. This prevents a centralised access method of patient test results without the purchase and integration of a vendor-supplied interface. The integration into a centralised system is complicated by the absence of a common protocol among device manufacturers.
The results have been impressive, says Lee Lewis, a cardiovascular physiologist by profession who now manages cardiac services at the Hammersmith Hospitals NHS Trust. “The end results are an excellent way of turning diagnostic paper recordings – produced by a plethora of equipment, from many different manufacturers – into compact electronic records that can be retrieved by clinicians anywhere in the hospital from a single archive system,” he says.
Efficiency has been increased through early access to test results and the ability to re-purpose results for student access. The reduction in the number of re-tests due to lost test results has been dramatic. Patient records are now available through a secure login from a web browser – previously, hard copies of the test results needed to be made and delivered to the requesting physician, a process that took time.
Finally, the system does not require a full-time administrator for tasks like adding new users to the system and creating security profiles. Much of the system administration was automated through configuration during the implementation, making administration a straightforward process. The system was implemented from widely available off-the-shelf hardware – not bespoke development – with disaster recovery features, available as an add-on to Xerox DocuShare, added at a minimal additional cost.
Be careful: Apply OCR/ICR to the right products
- Volume of scanning to be OCR’d: Small batches can only be justified if they are part of a larger free-form capture effort. The expense and setup of OCR/ICR require a significant volume of images to deliver a payback that’s worth the investment. Outsourcing small batches to a service bureau makes economic sense for smaller capture efforts;
- Accuracy determined by how common the document type is: It pays to do research upfront on the common types of images that will be captured so as to verify that the OCR/ICR engines have high accuracy rates for the physical documents being scanned. For instance, is the font that is used common, or is it more error-prone for certain OCR/ICR engines?
- The correct fields are identified: This step involves setting up business logic to accurately identify the fields to be captured. For free-form capture, use integration with an enterprise resource planning (ERP) or financial system to verify and increase the accuracy rate.
- Actual labour currently involved: No OCR/ICR system will be without labour for error correction. Be sure to plan for error correction as part of the overall effort and also plan a quality-control programme.
Source: Forrester Research