There are several ways to do this, including using libraries like
p > p> Exit: Input PDF: Output text file: Benefits of this method include: Disadvantages of this method include:
# Library import p>
# PDF path
Part # 1: Convert PDFs to Images
# Store all PDF pages in a variable p >
# Counter for storing images of each PDF page for an image
= code >
# Loop through all pages stored above
# Declare the file name for each PDF page as JPG
# For each page, the file name will be:
# PDF page 1 - & gt; page_1.jpg
# PDF page 2 - & gt; page_2.jpg
# PDF page 3 - & gt; page_3.jpg
# PDF page n - & gt; page_n.jpg
". jpg "
# Save the page image to the system
# Increase counter to update filename
Part # 2 - OCR image text recognition
# Variable to get the total number of pages
# Create a text file to write the output
outfile code >
# Open the file in add mode so that
# All content of all images is added to one file
# Iterate from 1 to total pages p>
# Set filename for OCR from p>
# Again, these files will be: p>
". jpg "
# Recognize text as a string in an image using pytesserct
# The recognized text is stored in a variable text
# Any string processing can be applied to the text
# Basic formatting was done here
# In many PDFs, at the end of the line if the word cannot
# be written in full, a hyphen is added.
# The rest of the word is written on the next line
# For example name: this is an example of the text of this word here GeeksF-
# orGeeks - half on the first line, remaining on the next one.
# To remove this, we replace every & # 39; - / n & # 39; to & # 39; & # 39 ;.
# Finally, write the processed text to a file.
f.write ( text)
# Close the file after writing all the text.
As we can see, the PDF pages have been transformed vans in images. The images were then read and the content written to a text file.
Output text file:
Benefits of this method include:
Disadvantages of this method include: