Streamlining eDiscovery with optical character recognition (OCR)

Consider today’s eDiscovery process. The average legal case in South Africa consists of at least 5GB of data.

According to LexisNexis, a single GB can hold on average 64 784 Microsoft Word pages, or over 100 000 email files, 165 000 Excel files, more than 17 500 PowerPoint files, almost 678 000 text files or over 15 000 image files.

The vast scope of data that needs to be examined during the review process is why Technology Assisted Review (TAR) is so critical to keeping review costs down. But what happens when you are dealing with images and not text files? Can technology platforms still look for key search terms? Can an image be reviewed in the same way as a Word document, email or Excel spreadsheet, for example, or do legal teams need to revert to a manual review process?

The power of optical character recognition (OCR)

Thankfully, there is a solution. Optical character recognition, or OCR, is technology that recognises numbers, letters, and other written characters in ‘flattened’ images. This means that a scanned paper document or image can be converted into searchable electronic text. In other words, OCR converts image-based or paper-based discovery into machine-readable text.

OCR systems are made up of a combination of software and hardware solutions. First, hardware, such as a specialised circuit board or optical scanner, copies or reads the text. Software is then used for advanced processing by examining the text of a document, translating the characters into code and then using that code for data processing.

OCR is an incredibly advanced solution, using artificial intelligence to identify characters. This means that letters can be recognised even when they are remarkably different. Think how many ways ‘a’ and ‘g’ are written and how different their structures are across fonts. With OCR, this doesn’t matter. OCR software can even be used to decipher handwriting.

OCR and eDiscovery

When a legal team begins the eDiscovery process, they are dealing with thousands (or even hundreds of thousands) of documents. Many of these discoverable materials are received as images, such as TIFF files or flattened PDFs of scans. Unlike images or paper, electronic text is fully searchable. So how can images and paper be turned into electronic text to streamline the discovery phase?

OCR software not only identifies numbers, letters and text characters, but converts them from pixel-based pictures into readable text, which means documents can be electronically reviewed with search terms.

OCR is also critical if a legal team receives paper-based documents. Instead of manually reviewing each physical document, physical pages can be scanned, processed through an OCR system, and then rendered as computer files in a fraction of the time it would take a human to read them. From there, TAR software takes care of the first round of review.

Legal teams can search these files for names, keywords, dates, and any other text-based content, significantly reducing the amount of time required to both process and review images or paper-based discovery.
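To make the search step concrete, here is a minimal Python sketch of what searching OCR output might look like. It assumes the OCR stage has already produced plain text for each document. The document IDs, the sample text, and the helper names `search_documents` and `find_dates` are all hypothetical, and a real review platform would do far more than this.

```python
import re

# Hypothetical OCR output: document ID -> text extracted from a scanned page.
ocr_text = {
    "DOC-001": "Agreement signed by J. Naidoo on 12 March 2019 in Durban.",
    "DOC-002": "Invoice 4471 for consulting services, due 2019-04-30.",
    "DOC-003": "Minutes of meeting. No agreement was reached.",
}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Matches ISO dates (2019-04-30) and "12 March 2019" style dates only.
DATE_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"
)


def search_documents(documents, terms):
    """Map each search term to the sorted IDs of documents containing it."""
    hits = {}
    for term in terms:
        # Whole-word, case-insensitive match on the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits[term] = sorted(
            doc_id for doc_id, text in documents.items() if pattern.search(text)
        )
    return hits


def find_dates(documents):
    """Map each document ID to the list of date strings found in its text."""
    return {doc_id: DATE_PATTERN.findall(text) for doc_id, text in documents.items()}


if __name__ == "__main__":
    print(search_documents(ocr_text, ["agreement", "Naidoo"]))
    print(find_dates(ocr_text))
```

Once the text exists, a search for a name or keyword across the whole set is a single pass over the data, which is where the time saving over page-by-page reading comes from.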

The pros and cons of OCR in eDiscovery today

As you can see, there are significant benefits to using OCR in civil litigation cases. OCR streamlines and simplifies discovery, supporting better early case assessment (ECA). The ability to speed up the entire eDiscovery process will simultaneously improve accuracy (versus a manual team of human reviewers) and also drastically lower costs, particularly within the review phase.

OCR also allows discoverable information to be extracted as text and electronically stored, which prevents wasting time locating a particular document and eliminates the risk of misplacing a specific piece of paper.

However, OCR also has its shortcomings. Despite extremely advanced software, OCR is still not 100% accurate (although it is coming increasingly close), so misspelled words and missed search terms occasionally occur.
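The short Python sketch below illustrates how a single misread character can cause an exact search to miss a term, and how approximate matching might catch it. It is an illustration only, not a description of how any particular eDiscovery platform handles OCR errors. The sample sentence, the helper `fuzzy_hits`, and the 0.8 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def fuzzy_hits(text, term, threshold=0.8):
    """Return the words in text whose similarity to term meets the threshold."""
    term = term.lower()
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation before comparing.
        word = raw.strip(".,;:!?()\"'").lower()
        if word and SequenceMatcher(None, word, term).ratio() >= threshold:
            hits.append(word)
    return hits


# Hypothetical OCR output in which the letter "o" was misread as the digit "0".
ocr_line = "The c0ntract was signed by both parties."

exact_match = "contract" in ocr_line.lower()   # exact search misses the term
approx_matches = fuzzy_hits(ocr_line, "contract")

if __name__ == "__main__":
    print(exact_match)
    print(approx_matches)
```

A lower threshold would catch more OCR errors but would also return more false positives, so the value would need tuning against real data.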

Nevertheless, OCR will most likely eventually make paper-based discovery entirely obsolete, as anything on paper can be rapidly ingested for processing as eDiscovery. Meanwhile, OCR offers the ability to review TIFFs and PDFs as easily as any other type of Electronically Stored Information (ESI).