Python | Reading PDF content with OCR (Optical Character Recognition)

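There are several ways to read PDF content in Python. The approach shown here uses OCR and needs the following modules and tools:

pip3 install Pillow
pip3 install pytesseract
pip3 install pdf2image
sudo apt-get install tesseract-ocr

The PIL module is provided by the Pillow package, so that is the name to install. pdf2image also depends on the Poppler utilities (on Debian/Ubuntu, the poppler-utils package).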

The program consists of two parts.

Part #1 converts the PDF to image files. Each PDF page is stored as a separate image file. The saved images are named:
PDF page 1 -> page_1.jpg
PDF page 2 -> page_2.jpg
PDF page 3 -> page_3.jpg
...
PDF page n -> page_n.jpg

Part #2 recognizes the text in the image files and saves it to a text file. Here we process the images and convert them to text. Once we have the text as a string variable, we can apply any text processing to it. For example, in many PDFs, when a word does not fit at the end of a line, a hyphen ('-') is added and the word continues on the next line. For example:

 This is some sample text but this parti- 
cular word could not be written in the same line.

Basic preprocessing is applied to such words: each hyphen followed by a newline is removed so that the two halves are joined back into a full word. After preprocessing is complete, the text is saved in a separate text file.
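The hyphen-joining step described above can be sketched as a small helper. This is a minimal illustration, not part of the original program: the function name `join_hyphenated_words` is hypothetical, and the regular expression is one possible choice that only joins when the hyphen sits directly between two word characters.

```python
import re


def join_hyphenated_words(text: str) -> str:
    """Join words that were split across lines with a trailing hyphen.

    Hypothetical helper for illustration only.
    """
    # Match: word character, hyphen, optional spaces, line break,
    # optional spaces, word character. Keep only the two characters.
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


sample = (
    "This is some sample text but this parti-\n"
    "cular word could not be written in the same line."
)
print(join_hyphenated_words(sample))
```

Note that a rule like this cannot tell a line-break hyphen from a genuine compound that happens to wrap (for example "well-" followed by "known" on the next line), so some compounds may lose their hyphen.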

The input PDF used in the code is d.pdf.

Below is the implementation:

# Import libraries
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

# Path of the PDF
PDF_file = "d.pdf"

"""
Part #1: Convert the PDF to images
"""

# Store all PDF pages in a variable (rendered at 500 DPI)
pages = convert_from_path(PDF_file, 500)

# Counter used to name the image of each PDF page
image_counter = 1

# Loop through all pages stored above
for page in pages:

    # Declare the file name for each PDF page as JPG
    # For each page, the file name will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ...
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the page image to the system
    page.save(filename, "JPEG")

    # Increase the counter to update the file name
    image_counter = image_counter + 1

"""
Part #2: Recognize text from the images using OCR
"""

# Variable holding the total number of pages
filelimit = image_counter - 1

# Text file that receives the output
outfile = "out_text.txt"

# Open the file in append mode so that
# the content of all images is added to the same file
f = open(outfile, "a")

# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):

    # Set the file name to run OCR on
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ...
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"

    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))

    # The recognized text is stored in the variable text
    # Any string processing can be applied to it
    # Basic formatting is done here
    # In many PDFs, if a word cannot be written in full at the end
    # of a line, a hyphen is added and the rest of the word is
    # written on the next line
    # For example: this is an example of the text of this word here GeeksF-
    # orGeeks - half on the first line, the remainder on the next one.
    # To remove this, we replace every '-\n' with ''.
    text = text.replace("-\n", "")

    # Finally, write the processed text to the file
    f.write(text)

# Close the file after writing all the text
f.close()

Output:

Input PDF:

Output text file:

As we can see, the PDF pages were converted to images. The images were then read and their content was written to a text file.

Benefits of this method include:

  1. Avoiding text-based conversion, where encoding schemes can lead to data loss.
  2. Even handwritten PDF content can be recognized using OCR.
  3. Only certain PDF pages can be recognized.
  4. Get the text as a variable so that any necessary preprocessing can be done.
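The point about recognizing only certain pages can be sketched with the `first_page` and `last_page` parameters of pdf2image's `convert_from_path`. This is a hedged sketch rather than part of the original program: the wrapper name `convert_page_range` and its `convert` parameter (which lets a stand-in converter be passed instead of the real one) are assumptions made for illustration.

```python
def convert_page_range(pdf_path, first_page, last_page, dpi=500, convert=None):
    """Render only pages first_page..last_page (1-based, inclusive) as images.

    Hypothetical wrapper; `convert` defaults to pdf2image.convert_from_path.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    if convert is None:
        # Imported lazily so the wrapper can be tried with a stand-in converter.
        from pdf2image import convert_from_path
        convert = convert_from_path
    # pdf2image renders only the requested page range.
    return convert(pdf_path, dpi=dpi, first_page=first_page, last_page=last_page)
```

For example, `convert_page_range("d.pdf", 2, 3)` would return the images for pages 2 and 3 only, which can then be passed to the OCR step in the same way as the full page list.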

Disadvantages of this method include:

  1. Disk storage is used to hold the page images on the local system, although each image is relatively small.
  2. OCR cannot guarantee 100% accuracy, although a computer-printed PDF typically yields very high accuracy.
  3. Handwritten PDFs can still be recognized, but accuracy depends on factors such as the handwriting and the page color.
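The disk storage drawback can be reduced by running OCR on the page images in memory instead of saving them first, since `pytesseract.image_to_string` accepts a PIL image object as well as a file path. The following is a minimal sketch under that assumption; the names `ocr_pages_in_memory` and `extract_text`, and the `ocr` parameter used to swap in a different recognizer, are hypothetical.

```python
def ocr_pages_in_memory(pages, ocr=None):
    """Return the recognized text of each page image without writing image files.

    Hypothetical helper; `ocr` defaults to pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    return [ocr(page) for page in pages]


def extract_text(pdf_path, dpi=500):
    """Convert a PDF to page images in memory and return the combined OCR text."""
    from pdf2image import convert_from_path
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "".join(ocr_pages_in_memory(pages))
```

The trade-off is memory: all rendered pages are held in RAM at once, which can be significant for long documents at 500 DPI.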
