Working with PDFs in Python

Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.

In this article, we will learn how to perform the following operations using simple Python scripts:

  • Extracting text from a PDF
  • Rotating PDF pages
  • Merging PDF files
  • Splitting a PDF
  • Adding a watermark to PDF pages

Install

We will be using a third-party module, PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

  • Extracting document information (title, author, ...)
  • Splitting documents page by page
  • Merging documents page by page
  • Cropping pages
  • Merging multiple pages into a single page
  • Encrypting and decrypting PDF files
  • and more!

To install PyPDF2, run the following command from the command line:

  pip install PyPDF2 

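The module name is case sensitive when you import it in Python, so make sure that the letter y is lowercase and everything else is uppercase. The examples below were written against the PyPDF2 API before version 3.0, so to run them unchanged, install an earlier release, for example with pip install "PyPDF2<3". All code and PDFs used in this article are available here.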
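A quick way to check whether the installed release still provides the pre-3.0 names the examples use (PdfFileReader, PdfFileWriter, and so on, which were removed in PyPDF2 3.0) is to look at its version number. The following is a minimal sketch; the helper names has_legacy_api and installed_pypdf2_version are invented for this example, and it assumes that only the major version number matters.

```python
from importlib import metadata


def has_legacy_api(version_string):
    """Return True if the version string is from a PyPDF2 release before 3.0."""
    major = int(version_string.split(".")[0])
    return major < 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is not installed."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None


version = installed_pypdf2_version()
if version is None:
    print("PyPDF2 is not installed")
elif has_legacy_api(version):
    print("PyPDF2", version, "provides the names used in this article")
else:
    print("PyPDF2", version, "uses the newer names (PdfReader, PdfWriter, ...)")
```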

1. Extract text from PDF file

# import required modules
import PyPDF2

# open the PDF file in binary read mode
pdfFileObj = open('example.pdf', 'rb')

# create a PDF reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# print the number of pages in the PDF file
print(pdfReader.numPages)

# create a page object for the first page
pageObj = pdfReader.getPage(0)

# extract text from the page
print(pageObj.extractText())

# close the PDF file object
pdfFileObj.close()

The output of the above program looks like this:

 20 PythonBasics SRDoty August27,2008 Contents 1Preliminaries 4 1.1WhatisPython? ....... .............................. 4 1.2 Installation anddocumentation ................. ... ......... 4  [and some more lines ...]  

Let’s try to understand the above code piece by piece:

  •  pdfFileObj = open('example.pdf', 'rb')

    We open example.pdf in binary mode and save the file object as pdfFileObj.

  •  pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    Here we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.

  •  print(pdfReader.numPages)

    The numPages property gives the number of pages in the PDF file. In our case it is 20 (see the first line of the output).

  •  pageObj = pdfReader.getPage(0)

    Now we create an object of the PageObject class of the PyPDF2 module. The PDF reader has a getPage() method that takes a page number (zero-based, so the first page is 0) as an argument and returns a page object.

  •  print(pageObj.extractText())

    The page object has an extractText() method that extracts the text from a PDF page.

  •  pdfFileObj.close()

    Finally, we close the PDF file object.
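Because getPage() counts pages from 0 while people usually count from 1, a small conversion helper can prevent off-by-one mistakes. The sketch below is one way to do it; the function name page_index is invented for this example and is not part of PyPDF2.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            "page %d is out of range for a %d-page document" % (page_number, num_pages)
        )
    return page_number - 1


# With the 20-page example.pdf from above, page 1 maps to index 0
# and page 20 maps to index 19.
print(page_index(1, 20))   # 0
print(page_index(20, 20))  # 19
```

With a reader object at hand, the fifth page could then be fetched as pdfReader.getPage(page_index(5, pdfReader.numPages)).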

Note: Although PDF files are great for laying out text in a way that people can easily print and read, they are not easy for software to parse into plain text. As such, PyPDF2 may make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. Unfortunately, there is not much you can do about this; PyPDF2 may simply not work with some of your PDF files.
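One way to cope with such failures is to extract page by page and skip pages that raise an error instead of aborting the whole run. The sketch below assumes a reader object with the same numPages, getPage() and extractText() interface used above; the function name extract_all_text is hypothetical, and catching every Exception is a deliberately blunt choice for illustration.

```python
def extract_all_text(reader):
    """Return a list with the text of each page, using "" for pages that fail."""
    texts = []
    for index in range(reader.numPages):
        try:
            texts.append(reader.getPage(index).extractText())
        except Exception as error:  # blunt on purpose: any extraction failure
            print("could not extract page", index, "-", error)
            texts.append("")
    return texts


# Usage with the PyPDF2 API from this article (needs a file example.pdf):
#     import PyPDF2
#     with open("example.pdf", "rb") as f:
#         pages = extract_all_text(PyPDF2.PdfFileReader(f))
#     print(len(pages), "pages processed")
```

Pages that could not be read come back as empty strings, so the position of each entry in the list still matches its page index.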

2. Rotate PDF pages

# import required modules
import PyPDF2


def PDFrotate(origFileName, newFileName, rotation):

    # create the original PDF file object
    pdfFileObj = open(origFileName, 'rb')

    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # create a PDF writer object for the new PDF
    pdfWriter = PyPDF2.PdfFileWriter()

    # rotate each page
    for page in range(pdfReader.numPages):

        # create a rotated page object
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)

        # add the rotated page object to the PDF writer
        pdfWriter.addPage(pageObj)

    # create the new PDF file object
    newFile = open(newFileName, 'wb')

    # write the rotated pages to the new file
    pdfWriter.write(newFile)

    # close the original PDF file object
    pdfFileObj.close()

    # close the new PDF file object
    newFile.close()


def main():

    # original PDF file name
    origFileName = 'example.pdf'

    # new PDF file name
    newFileName = 'rotated_example.pdf'

    # rotation angle in degrees (clockwise)
    rotation = 270

    # call the PDFrotate function
    PDFrotate(origFileName, newFileName, rotation)


if __name__ == "__main__":
    # call the main function
    main()

After running the script, rotated_example.pdf contains the pages of example.pdf rotated 270 degrees clockwise.

Some important points related to the above code:

  • For rotation, we first create a PDF reader object for the original PDF.
  •  pdfWriter = PyPDF2.PdfFileWriter()

    The rotated pages will be written to a new PDF. To write PDFs, we use an object of the PdfFileWriter class of the PyPDF2 module.

  •  for page in range(pdfReader.numPages):
         pageObj = pdfReader.getPage(page)
         pageObj.rotateClockwise(rotation)
         pdfWriter.addPage(pageObj)

    Now we iterate over every page of the original PDF. We get the page object using the getPage() method of the PDF reader, then rotate the page using the rotateClockwise() method of the page object. We then add the rotated page object to the PDF writer using its addPage() method.

  •  newFile = open(newFileName, 'wb')
     pdfWriter.write(newFile)
     pdfFileObj.close()
     newFile.close()

    Now we need to write the PDF pages to a new PDF file. First, we open a new file object and write the pages to it using the write() method of the PDF writer object. Finally, we close the original PDF file object and the new file object.
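PDF page rotation works in quarter turns, so the rotation argument should be a multiple of 90. The sketch below shows one way to validate an angle and to express a counter-clockwise turn as the equivalent clockwise one before passing it to PDFrotate. The helper name clockwise_angle is invented for this example.

```python
def clockwise_angle(angle):
    """Map any multiple of 90 degrees onto 0, 90, 180 or 270 (clockwise).

    Negative values are read as counter-clockwise turns.
    """
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


print(clockwise_angle(270))  # 270
print(clockwise_angle(-90))  # 270: a quarter turn counter-clockwise
print(clockwise_angle(450))  # 90
```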

3. Merge PDF files

# import required modules
import PyPDF2


def PDFmerge(pdfs, output):

    # create a PDF merger object
    pdfMerger = PyPDF2.PdfFileMerger()

    # append the PDFs one at a time
    for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)

    # write the combined PDF to the output file
    with open(output, 'wb') as f:
        pdfMerger.write(f)


def main():

    # PDF files to merge
    pdfs = ['example.pdf', 'rotated_example.pdf']

    # output PDF file name
    output = 'combined_example.pdf'

    # call the PDFmerge function
    PDFmerge(pdfs=pdfs, output=output)


if __name__ == "__main__":
    # call the main function
    main()

The output of the above program is a combined PDF file, combined_example.pdf, obtained by merging example.pdf and rotated_example.pdf.

Let’s take a look at the important aspects of this program:

  •  pdfMerger = PyPDF2.PdfFileMerger()

    To merge PDFs, we use a ready-made class, PdfFileMerger, of the PyPDF2 module. Here we create a PDF merger object, pdfMerger.

  •  for pdf in pdfs:
         with open(pdf, 'rb') as f:
             pdfMerger.append(f)

    We add the file object of each PDF file to the PDF merger object using the append() method.

  •  with open(output, 'wb') as f:
         pdfMerger.write(f)

    Finally, we write the PDF pages to the output PDF using the write() method of the PDF merger object.
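The pdfs list does not have to be written out by hand. As a sketch, the hypothetical helper below (collect_pdfs is not part of PyPDF2) gathers every .pdf file in a folder in alphabetical order, which also fixes the order in which PDFmerge would append them.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the paths of all .pdf files directly inside folder, sorted by name."""
    return sorted(str(path) for path in Path(folder).glob("*.pdf"))


# Example: list the PDFs in the current directory.
for name in collect_pdfs("."):
    print(name)
```

The result can be passed straight to the merge function, for example PDFmerge(pdfs=collect_pdfs('.'), output='combined_example.pdf'). If the output file is written into the same folder, a second run would pick it up as an input, so writing the output elsewhere may be the safer choice.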

4. Split a PDF file

# import required modules
import PyPDF2


def PDFsplit(pdf, splits):

    # open the input PDF file
    pdfFileObj = open(pdf, 'rb')

    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # starting index of the first slice
    start = 0

    # ending index of the first slice
    end = splits[0]

    for i in range(len(splits) + 1):

        # create a PDF writer object for the (i+1)th split
        pdfWriter = PyPDF2.PdfFileWriter()

        # output PDF file name
        outputpdf = pdf.split('.pdf')[0] + str(i) + '.pdf'

        # add the pages of this slice to the PDF writer
        for page in range(start, end):
            pdfWriter.addPage(pdfReader.getPage(page))

        # write the split pages to the output PDF file
        with open(outputpdf, "wb") as f:
            pdfWriter.write(f)

        # move the start position to the next split
        start = end
        try:
            # set the end position for the next split
            end = splits[i + 1]
        except IndexError:
            # the last split runs to the end of the document
            end = pdfReader.numPages

    # close the input PDF file object
    pdfFileObj.close()


def main():

    # PDF file to split
    pdf = 'example.pdf'

    # page indices at which to split
    splits = [2, 4]

    # call the PDFsplit function to split the PDF
    PDFsplit(pdf, splits)


if __name__ == "__main__":
    # call the main function
    main()

The output will be three new PDF files: split 1 (pages 0 and 1), split 2 (pages 2 and 3), and split 3 (page 4 to the end).

No new function or class is used in the above program. Using simple logic and iteration, we split the given PDF into parts at the page indices given in the splits list.
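The slicing logic can be looked at separately from the PDF handling. The sketch below computes the same (start, end) page ranges that PDFsplit walks through; the function name split_ranges is made up for this illustration.

```python
def split_ranges(splits, num_pages):
    """Return (start, end) index pairs, end exclusive, for each part of the split."""
    boundaries = [0] + list(splits) + [num_pages]
    return list(zip(boundaries[:-1], boundaries[1:]))


# For the 20-page example.pdf and splits = [2, 4]:
print(split_ranges([2, 4], 20))  # [(0, 2), (2, 4), (4, 20)]
```

Each pair can be fed to range(start, end) to obtain the page indices of one output file, which matches the inner loop of PDFsplit.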

5. Add a watermark to PDF pages

# import required modules
import PyPDF2


def add_watermark(wmFile, pageObj):

    # open the watermark PDF file
    wmFileObj = open(wmFile, 'rb')

    # create a PDF reader object for the watermark PDF
    pdfReader = PyPDF2.PdfFileReader(wmFileObj)

    # merge the first page of the watermark PDF into the passed page object
    pageObj.mergePage(pdfReader.getPage(0))

    # close the watermark PDF file object
    wmFileObj.close()

    # return the watermarked page object
    return pageObj


def main():

    # watermark PDF file name
    mywatermark = 'watermark.pdf'

    # original PDF file name
    origFileName = 'example.pdf'

    # new PDF file name
    newFileName = 'watermarked_example.pdf'

    # create the original PDF file object
    pdfFileObj = open(origFileName, 'rb')

    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # create a PDF writer object for the new PDF
    pdfWriter = PyPDF2.PdfFileWriter()

    # add the watermark to each page
    for page in range(pdfReader.numPages):

        # create a watermarked page object
        wmpageObj = add_watermark(mywatermark, pdfReader.getPage(page))

        # add the watermarked page object to the PDF writer
        pdfWriter.addPage(wmpageObj)

    # new PDF file object
    # (the source listing is cut off at this point; the lines below are an
    # assumption that follows the same pattern as the rotation example)
    newFile = open(newFileName, 'wb')
    pdfWriter.write(newFile)
    pdfFileObj.close()
    newFile.close()


if __name__ == "__main__":
    main()
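The add_watermark function above opens and parses the watermark file again for every page. A possible variation, sketched below, reads the watermark page once and merges it into each page in turn. It assumes page objects with the mergePage() method used above; the helper name stamp_pages is hypothetical.

```python
def stamp_pages(pages, watermark_page):
    """Merge watermark_page into every page and return the pages as a list."""
    stamped = []
    for page in pages:
        page.mergePage(watermark_page)  # modifies the page in place
        stamped.append(page)
    return stamped


# Usage with the PyPDF2 API from this article (needs example.pdf and watermark.pdf):
#     import PyPDF2
#     with open("watermark.pdf", "rb") as wm, open("example.pdf", "rb") as src:
#         watermark = PyPDF2.PdfFileReader(wm).getPage(0)
#         reader = PyPDF2.PdfFileReader(src)
#         writer = PyPDF2.PdfFileWriter()
#         pages = [reader.getPage(i) for i in range(reader.numPages)]
#         for page in stamp_pages(pages, watermark):
#             writer.addPage(page)
#         with open("watermarked_example.pdf", "wb") as out:
#             writer.write(out)
```

In the usage outline, both input files stay open until the output has been written, so the writer never has to read from a closed file.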
