In this article, we explore how to extract text from images using Python. Several libraries are available for text extraction; we compare three popular ones: pytesseract, easyocr, and keras_ocr. To demonstrate what each can do, we use a dataset called Text OCR, which contains over a million annotations of text in images. Its size makes it a good test bed for these libraries.
The Text OCR dataset consists of numerous images annotated with the text they contain. The dataset is organized into training and validation folders and includes several CSV and Parquet files that contain annotations, as well as metadata for each image. Each annotation includes a unique ID, associated image ID, bounding boxes for words, and the text itself.
For our experiments, we'll work within a Kaggle notebook, where most of the Python libraries we need are already installed. We start by importing the essentials:
import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
import matplotlib.pyplot as plt
from PIL import Image
We'll read in the Parquet files containing annotations and image metadata, and use glob to retrieve the paths for the image files.
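The loading step might look like the sketch below. The file names annot.parquet and img.parquet, the *.jpg pattern, and the helper names list_image_paths and load_tables are assumptions for illustration; adjust them to the actual dataset layout.

```python
import glob
import os


def list_image_paths(image_dir, pattern="*.jpg"):
    """Return a sorted list of image file paths found directly in image_dir."""
    return sorted(glob.glob(os.path.join(image_dir, pattern)))


def load_tables(dataset_root):
    """Load the annotation and image-metadata tables as DataFrames.

    The two file names below are assumptions about the dataset layout.
    """
    import pandas as pd  # imported here so the path helper works without pandas

    annot = pd.read_parquet(os.path.join(dataset_root, "annot.parquet"))
    imgs = pd.read_parquet(os.path.join(dataset_root, "img.parquet"))
    return annot, imgs
```

Sorting the paths keeps the order stable between runs, which makes it easier to refer back to "the first image" when comparing libraries.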
Before diving into text extraction, we need to visualize some of the images in the dataset along with their annotations. This helps us understand what kind of images and text we are working with.
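One way to draw the ground-truth annotations is sketched below. It assumes each annotation record has the keys image_id, bbox, and utf8_string, and that bbox is stored as [x, y, width, height] in pixels; both the key names and the box layout are assumptions to check against the real annotation table.

```python
def annotations_for_image(annotations, image_id):
    """Select the annotation records that belong to one image.

    `annotations` is a list of dicts, e.g. from annot.to_dict("records").
    The keys "image_id", "bbox", and "utf8_string" are assumed column names.
    """
    return [a for a in annotations if a["image_id"] == image_id]


def plot_ground_truth(image_path, annotations, image_id):
    """Draw the annotated boxes and their text on top of one image."""
    import matplotlib.patches as patches
    import matplotlib.pyplot as plt
    from PIL import Image

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(Image.open(image_path))
    for a in annotations_for_image(annotations, image_id):
        x, y, w, h = a["bbox"]  # assumed [x, y, width, height] in pixels
        ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor="red"))
        ax.text(x, y, a["utf8_string"], color="yellow", fontsize=8)
    ax.axis("off")
    return fig
```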
The first method we will explore is pytesseract, a Python wrapper for Google's Tesseract-OCR Engine. Although pytesseract is widely used for document text extraction, it may not perform as well on the diverse image types found in datasets like Text OCR.
To use pytesseract, we can invoke the following command:
import pytesseract
text = pytesseract.image_to_string(image_file_name, lang='eng')
print(text)
Running this on an example image shows what pytesseract can recover, though the results may be far from optimal on images that do not look like scanned documents.
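image_to_string returns only a block of text. To compare against word-level annotations, pytesseract also provides image_to_data, which reports a position and confidence for each word. The sketch below assumes the Tesseract binary is installed; the helper names words_from_tesseract_data and tesseract_words are hypothetical.

```python
def words_from_tesseract_data(data, min_conf=0.0):
    """Turn the dict returned by pytesseract.image_to_data into word records."""
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        conf = float(conf)  # conf may arrive as a string or a number
        # Rows describing blocks or lines have empty text and a confidence of -1
        if text.strip() and conf >= min_conf:
            words.append(
                {"text": text, "conf": conf, "bbox": (left, top, width, height)}
            )
    return words


def tesseract_words(image_path, min_conf=0.0):
    """Run Tesseract on one image and return word-level records."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), lang="eng", output_type=pytesseract.Output.DICT
    )
    return words_from_tesseract_data(data, min_conf)
```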
Next, we will test easyocr, which relies on deep learning models for text detection and recognition. It is slightly slower than pytesseract but often yields better results on images like these.
To use easyocr, we create a Reader object and call its readtext method:
import easyocr
reader = easyocr.Reader(['en'])
results = reader.readtext(image_file_name)
The output is a list of detections, each holding a bounding box, the detected text, and a confidence score.
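Each easyocr detection is a (box, text, confidence) tuple, where box is a list of four corner points. A small helper can flatten that into records that are easy to load into a DataFrame. This is a sketch; the name easyocr_to_records and the output keys are choices made for illustration.

```python
def easyocr_to_records(results, min_conf=0.0):
    """Flatten easyocr readtext output into one dict per detected word."""
    records = []
    for box, text, conf in results:
        if conf < min_conf:
            continue  # drop low-confidence detections
        xs = [float(p[0]) for p in box]
        ys = [float(p[1]) for p in box]
        records.append(
            {
                "text": text,
                "conf": float(conf),
                # axis-aligned rectangle enclosing the four corner points
                "x_min": min(xs), "y_min": min(ys),
                "x_max": max(xs), "y_max": max(ys),
            }
        )
    return records
```

With pandas available, pd.DataFrame(easyocr_to_records(results)) gives a table with one row per detection.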
The final library we will compare is keras_ocr, which combines a detector and recognizer under a unified pipeline. While keras_ocr is not pre-installed in Kaggle, we can easily install it using pip:
!pip install keras-ocr
Then we can run the text extraction as follows:
import keras_ocr
pipeline = keras_ocr.pipeline.Pipeline()
results = pipeline.recognize([image_file_name])
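pipeline.recognize returns one list of predictions per input image, and each prediction is a (word, box) pair whose box holds four corner points. Unlike easyocr, there is no confidence score. The sketch below flattens that nested output into records; keras_ocr_to_records and its output keys are hypothetical names.

```python
def keras_ocr_to_records(prediction_groups, image_names):
    """Flatten keras_ocr pipeline output into one dict per detected word.

    prediction_groups: the list returned by pipeline.recognize
    image_names: the inputs passed to recognize, in the same order
    """
    records = []
    for name, predictions in zip(image_names, prediction_groups):
        for word, box in predictions:
            # box is a 4x2 array-like of (x, y) corner points
            xs = [float(p[0]) for p in box]
            ys = [float(p[1]) for p in box]
            records.append(
                {
                    "image": name,
                    "text": word,
                    "x_min": min(xs), "y_min": min(ys),
                    "x_max": max(xs), "y_max": max(ys),
                }
            )
    return records
```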
Having extracted text from images using all three methods, we will compare their performance. We will focus on key aspects such as accuracy, detection of bounding boxes, and any missing annotations.
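A simple way to put a number on accuracy is to compare the words each library found against the annotated words for the same image. The sketch below computes word-level precision and recall. It ignores case and bounding boxes, which is a simplification chosen here rather than the method used in the original comparison, and word_scores is a hypothetical name.

```python
from collections import Counter


def word_scores(predicted, truth):
    """Return (precision, recall) for predicted words against annotated words.

    Words are compared case-insensitively, and repeated words are counted,
    so predicting "the" once when it appears twice matches only one of them.
    """
    pred = Counter(w.lower() for w in predicted if w.strip())
    true = Counter(w.lower() for w in truth if w.strip())
    matched = sum((pred & true).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(true.values()) if true else 0.0
    return precision, recall
```

Low recall points to missed annotations, while low precision points to spurious or misread words.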
To visualize the results, we can use built-in tools from keras_ocr to draw annotations directly onto the images, allowing us to clearly see how well each library performed.
We will create a function to facilitate the plotting of results side-by-side for a direct comparison.
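Such a function could be sketched as follows, using keras_ocr.tools.drawAnnotations, which expects (text, box) pairs. easyocr output has to be reordered into that shape first. The output of pytesseract's image_to_string carries no boxes, so it is left out here. The names easyocr_to_pairs and plot_side_by_side are hypothetical.

```python
def easyocr_to_pairs(results):
    """Reorder easyocr (box, text, confidence) tuples into (text, box) pairs."""
    return [
        (text, [[float(x), float(y)] for x, y in box])
        for box, text, _conf in results
    ]


def plot_side_by_side(image_path, predictions_by_library):
    """Draw each library's predictions on its own copy of the image.

    predictions_by_library maps a library name to a list of (text, box)
    pairs, where each box is four (x, y) corner points.
    """
    import keras_ocr
    import matplotlib.pyplot as plt
    import numpy as np

    image = keras_ocr.tools.read(image_path)
    n = len(predictions_by_library)
    fig, axes = plt.subplots(1, n, figsize=(8 * n, 8), squeeze=False)
    for ax, (name, pairs) in zip(axes[0], predictions_by_library.items()):
        # drawAnnotations indexes boxes as NumPy arrays, so convert here
        preds = [(text, np.array(box, dtype="float32")) for text, box in pairs]
        keras_ocr.tools.drawAnnotations(image, preds, ax=ax)
        ax.set_title(name)
    return fig
```

A call might look like plot_side_by_side(path, {"easyocr": easyocr_to_pairs(easy_results), "keras_ocr": keras_results[0]}), where easy_results and keras_results are the outputs of the two libraries for the same image.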
We have explored three libraries for text extraction from images (pytesseract, easyocr, and keras_ocr) and analyzed their performance using a richly annotated dataset. Each library has its strengths and weaknesses, and the right choice depends on the specific use case.
Q1: What is Optical Character Recognition (OCR)?
A: OCR is a technology that converts different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data.
Q2: Which Python library should I choose for text extraction?
A: The choice of library may depend on your project's specific requirements. For document-like texts, pytesseract might suffice. For a more diverse set of images, easyocr or keras_ocr may be preferable due to their better performance with complex backgrounds.
Q3: What is the advantage of using deep learning-based OCR libraries?
A: Deep learning-based libraries, like easyocr and keras_ocr, tend to be more accurate and robust in detecting text in a variety of fonts and styles, especially in challenging image conditions.
Q4: Can I run these libraries on my local machine?
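A: Yes! You can install pytesseract, easyocr, and keras_ocr in your local Python environment. Just be sure to follow the installation instructions, especially for any dependencies. For example, pytesseract is only a wrapper, so the Tesseract engine itself must be installed separately.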
Q5: How does the performance differ between these libraries?
A: Performance can vary based on the complexity of the images and text. It's suggested to test different libraries on your specific dataset to determine which provides the best results.