Python | Reading PDF content with OCR (Optical Character Recognition)

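There are several ways to extract text from a PDF in Python. The approach shown here converts each PDF page to an image with pdf2image and then recognizes the text in those images with pytesseract.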

# Import the required libraries
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

  
# PDF path

PDF_file = "d.pdf"

 
"""
Part #1: Convert the PDF pages to images
"""

 
# Render all PDF pages as images at 500 DPI and store them in a variable
pages = convert_from_path(PDF_file, dpi=500)

 
# Counter used to number the image file saved for each PDF page

image_counter = 1

 
# Loop through all pages stored above
for page in pages:

    # Declare the file name for each PDF page as JPG
    # For each page, the file name will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the page image to disk
    page.save(filename, "JPEG")

    # Increase the counter to update the file name
    image_counter = image_counter + 1

  
"""
Part #2: Recognize the text in the images using OCR
"""

# Variable to get the total number of pages

filelimit = image_counter - 1

  
# Create a text file to write the output

outfile = "out_text.txt"

  
# Open the file in append mode so that
# the content of all images is added to one file
f = open(outfile, "a")

 
# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):

    # Set the file name of the image to run OCR on
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"

    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))

    # The recognized text is stored in the variable text
    # Any string processing can be applied to the text
    # Basic formatting is done here
    # In many PDFs, if a word cannot be written in full at the end
    # of a line, a hyphen is added and the rest of the word is
    # written on the next line
    # For example: GeeksF-
    # orGeeks is half on the first line, the remainder on the next one
    # To remove this, we replace every '-\n' with ''
    text = text.replace("-\n", "")

    # Finally, write the processed text to the file
    f.write(text)

 
# Close the file after writing all the text.
f.close()

Output:

Input PDF:

Output text file:

As we can see, the PDF pages were converted into images. The images were then read and their content written to a text file.
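The hyphen-removal step of the script can be tried in isolation, without running OCR. The sketch below wraps it in a helper; the name join_hyphenated_lines is hypothetical and used only for illustration.

```python
def join_hyphenated_lines(text):
    """Rejoin words that were split with a hyphen at the end of a line.

    Hypothetical helper: it applies the same '-\\n' -> '' replacement
    that the script applies to the OCR output of each page.
    """
    return text.replace("-\n", "")


# The example from the script's comments: the word is split over two lines
print(join_hyphenated_lines("GeeksF-\norGeeks"))  # GeeksForGeeks
```

Note that this simple replacement also joins genuine compound words that happen to break at their hyphen, so "well-" followed by "known" on the next line becomes "wellknown".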

Benefits of this method include:

  1. It avoids text-based conversion, where the encoding scheme can lead to data loss.
  2. Even handwritten PDF content can be recognized using OCR.
  3. It is possible to recognize only particular pages of the PDF.
  4. The text is available as a variable, so any necessary preprocessing can be done.
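Recognizing only particular pages could look like the sketch below. It relies on the first_page and last_page parameters of pdf2image's convert_from_path; the function name ocr_page_range and the dictionary it returns are assumptions made for this illustration, not part of the original script.

```python
def ocr_page_range(pdf_path, first_page, last_page, dpi=500):
    """Return {page_number: text} for pages first_page..last_page (1-based, inclusive).

    Hypothetical helper for illustration only.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")

    # Imported here so the argument check above works even where
    # pdf2image and pytesseract are not installed
    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested pages instead of the whole document
    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=first_page, last_page=last_page
    )

    # image_to_string accepts PIL images, so no files need to be saved
    return {
        number: pytesseract.image_to_string(page)
        for number, page in enumerate(pages, start=first_page)
    }


# Example call (needs a real PDF plus Poppler and Tesseract on the system):
# texts = ocr_page_range("d.pdf", 2, 3)
```

Rendering fewer pages also reduces the time and memory spent on conversion, which matters at a high DPI such as 500.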

Disadvantages of this method include:

  1. Disk storage is used to store the images on the local system, although these images are small in size.
  2. OCR cannot guarantee 100% accuracy, although a computer-typed PDF yields very high accuracy.
  3. Handwritten PDFs are still recognized, but accuracy depends on various factors such as handwriting, page color, etc.
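The disk-storage drawback can be avoided, because pytesseract's image_to_string also accepts the PIL images that convert_from_path returns. The following is a minimal sketch under that assumption; the names append_ocr_text and ocr_pdf_without_temp_images are hypothetical, and the recognize parameter exists only so the writing logic can be exercised without the OCR stack.

```python
def append_ocr_text(pages, out_path, recognize):
    """Run `recognize` on each in-memory page and append the text to out_path.

    Hypothetical helper. Returns the number of pages processed.
    """
    count = 0
    # The with-statement closes the file even if OCR fails midway
    with open(out_path, "a", encoding="utf-8") as out_file:
        for page in pages:
            text = recognize(page)
            # Same hyphen clean-up as in the script
            out_file.write(text.replace("-\n", ""))
            count += 1
    return count


def ocr_pdf_without_temp_images(pdf_path, out_path, dpi=500):
    """OCR a PDF straight from memory, without saving page_N.jpg files."""
    # Imported here so append_ocr_text can be used without these packages
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return append_ocr_text(pages, out_path, pytesseract.image_to_string)


# Example call (needs a real PDF plus Poppler and Tesseract on the system):
# ocr_pdf_without_temp_images("d.pdf", "out_text.txt")
```

The trade-off is memory: all rendered pages are held in RAM at once, so for very long PDFs it may be preferable to process the document in page ranges.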