Detect Text in Images with Python - pytesseract vs. easyocr vs. keras_ocr

Introduction

In this article, we will explore how to extract text from images using Python. Several libraries can do this, and we will compare three popular ones: pytesseract, easyocr, and keras_ocr. To demonstrate their capabilities, we will use a dataset called Text OCR, which contains over a million annotations of text in images. Its varied images make it a demanding test for these libraries.

Overview of the Dataset

The Text OCR dataset consists of numerous images annotated with the text they contain. The dataset is organized into training and validation folders and includes several CSV and Parquet files that contain annotations, as well as metadata for each image. Each annotation includes a unique ID, associated image ID, bounding boxes for words, and the text itself.

Setting Up the Environment

For our experiments, we'll work within a Kaggle notebook, where most of the Python libraries we need are already installed. We start by importing the essentials:

import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
import matplotlib.pyplot as plt
from PIL import Image

We'll read in the Parquet files containing annotations and image metadata, and use glob to retrieve the paths for the image files.
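As a rough sketch of that loading step: the file names `annot.parquet` and `img.parquet`, the `*.jpg` pattern, and the idea that an image ID equals the file name without its extension are all assumptions for illustration, so adjust them to the files actually present in the dataset folder.

```python
import glob
import os


def load_tables(data_dir):
    """Read the annotation and image-metadata tables (file names are assumed)."""
    import pandas as pd  # imported here so index_images can be used on its own

    annot = pd.read_parquet(os.path.join(data_dir, "annot.parquet"))
    imgs = pd.read_parquet(os.path.join(data_dir, "img.parquet"))
    return annot, imgs


def index_images(image_dir, pattern="*.jpg"):
    """Map each image ID (assumed: file name without extension) to its path."""
    paths = sorted(glob.glob(os.path.join(image_dir, pattern)))
    return {os.path.splitext(os.path.basename(p))[0]: p for p in paths}
```

With such an index, looking up the file for a given annotation becomes a dictionary access instead of string concatenation on paths.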

Data Exploration

Before diving into text extraction, we need to visualize some of the images in the dataset along with their annotations. This helps us understand what kind of images and text we are working with.
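One possible way to do this with matplotlib is sketched below. It assumes each bounding box is stored as [x, y, width, height] in pixel coordinates; check the annotation table before relying on that layout. The function names are hypothetical helpers, not part of any library.

```python
def bbox_to_rect(bbox):
    """Convert an [x, y, width, height] box (assumed layout) to corner coordinates."""
    x, y, w, h = (float(v) for v in bbox)
    return x, y, x + w, y + h


def show_annotations(image_path, boxes_and_text):
    """Draw each (bbox, text) pair on top of the image and return the figure."""
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle
    from PIL import Image

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(Image.open(image_path))
    for bbox, text in boxes_and_text:
        x0, y0, x1, y1 = bbox_to_rect(bbox)
        ax.add_patch(
            Rectangle((x0, y0), x1 - x0, y1 - y0, fill=False, edgecolor="red")
        )
        ax.text(x0, y0, text, color="red", fontsize=8, va="bottom")
    ax.axis("off")
    return fig
```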

Text Extraction Methods

1. pytesseract

The first method we will explore is pytesseract, a Python wrapper for Google's Tesseract-OCR Engine. Because pytesseract only wraps the engine, the Tesseract binary itself must also be installed on the system. Although pytesseract is widely used for extracting text from documents, it may not perform as well on the diverse image types found in datasets like Text OCR.

To use pytesseract, we pass an image (a file path or a PIL image) to image_to_string:

import pytesseract

# image_file_name is the path to one image from the dataset
text = pytesseract.image_to_string(image_file_name, lang='eng')
print(text)

Running this on an example image gives us output to analyze, though on images like these the results may be far from optimal.
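image_to_string returns only plain text. When word positions and confidences are needed for comparison with the other libraries, pytesseract's image_to_data can be used instead. The sketch below is one way to filter its output; the helper names and the min_conf threshold are illustrative assumptions.

```python
def words_from_tesseract(data, min_conf=0.0):
    """Keep recognized words from an image_to_data dict, dropping blank rows
    and rows below the confidence threshold."""
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        conf = float(conf)  # may arrive as a string; -1 marks non-word rows
        if text.strip() and conf >= min_conf:
            words.append(
                {"text": text, "conf": conf, "box": (left, top, width, height)}
            )
    return words


def run_tesseract(image_path, min_conf=0.0):
    """Run Tesseract on one image and return the filtered word list."""
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image_path, lang="eng", output_type=Output.DICT
    )
    return words_from_tesseract(data, min_conf)
```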

2. easyocr

Next, we will test easyocr, which relies on deep learning models for both text detection and recognition. It is slower than pytesseract, especially without a GPU, but often yields better results on images like these.

To use easyocr, we create a reader object and invoke its readtext method:

import easyocr

# Create the reader once (it loads the English models), then reuse it for each image
reader = easyocr.Reader(['en'])
results = reader.readtext(image_file_name)

The output is a list with one entry per detected text region: the bounding box (four corner points), the detected text, and a confidence score.
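To work with these results in tabular form, the tuples can be flattened into plain records. The following is a minimal sketch; the function name, the record keys, and the min_conf threshold are assumptions chosen for illustration.

```python
def easyocr_to_records(results, min_conf=0.0):
    """Turn easyocr (box, text, confidence) tuples into flat dicts with an
    axis-aligned box, keeping only results at or above min_conf."""
    records = []
    for box, text, conf in results:
        if conf < min_conf:
            continue
        xs = [float(point[0]) for point in box]
        ys = [float(point[1]) for point in box]
        records.append(
            {
                "text": text,
                "conf": float(conf),
                "x_min": min(xs),
                "y_min": min(ys),
                "x_max": max(xs),
                "y_max": max(ys),
            }
        )
    return records
```

The resulting list can be passed straight to pd.DataFrame(...) for filtering or merging with the ground-truth annotations.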

3. keras_ocr

The final library we will compare is keras_ocr, which combines a detector and recognizer under a unified pipeline. While keras_ocr is not pre-installed in Kaggle, we can easily install it using pip:

!pip install keras-ocr

Then we can run the text extraction as follows:

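import keras_ocr

# The pipeline loads pretrained detector and recognizer weights
pipeline = keras_ocr.pipeline.Pipeline()
# recognize takes a list of images (or file paths) and returns one result list per image
results = pipeline.recognize([image_file_name])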
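The pipeline returns one list per input image, each containing (word, box) pairs where the box holds four corner points. keras_ocr does not report a confidence score here. A small sketch for flattening a batch of predictions follows; the function name and record keys are illustrative assumptions.

```python
def keras_ocr_to_records(predictions, image_ids):
    """Flatten per-image lists of (word, box) pairs into one list of dicts.

    predictions: the list returned by pipeline.recognize, one entry per image.
    image_ids:   identifiers for those images, in the same order.
    """
    records = []
    for image_id, image_preds in zip(image_ids, predictions):
        for word, box in image_preds:
            xs = [float(point[0]) for point in box]
            ys = [float(point[1]) for point in box]
            records.append(
                {
                    "image_id": image_id,
                    "text": word,
                    "x_min": min(xs),
                    "y_min": min(ys),
                    "x_max": max(xs),
                    "y_max": max(ys),
                }
            )
    return records
```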

Comparing Results

Having extracted text from images using all three methods, we will compare their performance. We will focus on key aspects such as accuracy, detection of bounding boxes, and any missing annotations.
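One simple way to put a number on accuracy is word-level recall: the share of annotated words that a library also returned, ignoring case and punctuation. This is only one possible metric, chosen here for illustration, and it ignores where on the image each word was found.

```python
import re
from collections import Counter


def normalize(word):
    """Lowercase a word and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def word_recall(truth_words, predicted_words):
    """Fraction of ground-truth words that also appear in the predictions.

    Words are matched as multisets, so a word annotated twice must be
    predicted twice to count fully.
    """
    truth = Counter(w for w in map(normalize, truth_words) if w)
    pred = Counter(w for w in map(normalize, predicted_words) if w)
    total = sum(truth.values())
    if total == 0:
        return 0.0
    matched = sum((truth & pred).values())
    return matched / total
```

Computing this per image for each library, then averaging, gives a rough ranking that can be checked against the visual comparison.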

Visualization

To visualize the results, we can use keras_ocr's built-in drawing helper, keras_ocr.tools.drawAnnotations, to draw annotations directly onto the images, allowing us to see clearly how well each library performed.

We will create a function to facilitate the plotting of results side-by-side for a direct comparison.
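A sketch of such a function is shown below, comparing easyocr and keras_ocr on one image. It assumes easy_results is the list returned by easyocr's readtext and keras_results is the prediction list for a single image (for example, the first element of what pipeline.recognize returns). The function names are hypothetical, and the easyocr output is reordered into the (word, box) layout that keras_ocr's drawing helper expects.

```python
def easyocr_to_word_boxes(results):
    """Reorder easyocr (box, text, conf) tuples into (text, box) pairs."""
    return [
        (text, [[float(x), float(y)] for x, y in box])
        for box, text, _conf in results
    ]


def plot_compare(image_path, easy_results, keras_results):
    """Draw easyocr and keras_ocr predictions for one image side by side."""
    import matplotlib.pyplot as plt
    import numpy as np
    import keras_ocr

    image = keras_ocr.tools.read(image_path)
    # drawAnnotations expects each box as a NumPy array of four (x, y) points
    easy_preds = [
        (text, np.array(box, dtype="float32"))
        for text, box in easyocr_to_word_boxes(easy_results)
    ]

    fig, axes = plt.subplots(1, 2, figsize=(16, 8))
    keras_ocr.tools.drawAnnotations(image, easy_preds, ax=axes[0])
    axes[0].set_title("easyocr")
    keras_ocr.tools.drawAnnotations(image, keras_results, ax=axes[1])
    axes[1].set_title("keras_ocr")
    return fig
```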

Conclusion

We have explored three libraries for text extraction from images (pytesseract, easyocr, and keras_ocr) and compared their performance on the Text OCR dataset. Each library has its strengths and weaknesses, and the right choice depends on the specific use case.

Keywords

Python
Image processing
Text extraction
pytesseract
easyocr
keras_ocr
Optical Character Recognition (OCR)
Dataset
Annotations

FAQ

Q1: What is Optical Character Recognition (OCR)?
A: OCR is a technology that converts different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data.

Q2: Which Python library should I choose for text extraction?
A: The choice of library depends on your project's specific requirements. For document-like text, pytesseract might suffice. For a more diverse set of images, easyocr or keras_ocr may be preferable because they tend to cope better with complex backgrounds.

Q3: What is the advantage of using deep learning-based OCR libraries?
A: Deep learning-based libraries, like easyocr and keras_ocr, tend to be more accurate and robust in detecting text in a variety of fonts and styles, especially in challenging image conditions.

Q4: Can I run these libraries on my local machine?
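A: Yes! You can install pytesseract, easyocr, and keras_ocr in your local Python environment with pip. Pay attention to the dependencies: pytesseract needs the Tesseract engine installed separately, while easyocr and keras_ocr depend on deep learning frameworks (PyTorch and TensorFlow, respectively) and run considerably faster with a GPU.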

Q5: How does the performance differ between these libraries?
A: Performance varies with the complexity of the images and the text. Test each library on a sample of your own data to determine which gives the best results.