Working with PDFs in Python

Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video and business logic.

In this article, we will learn how we can perform various operations, such as:

  • Extract text from a PDF
  • Rotate PDF pages
  • Merge PDF files
  • Split a PDF
  • Add a watermark to PDF pages

using simple Python scripts!

Install

We will use a third-party module, PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

  • Extracting document information (title, author, ...)
  • Splitting documents page by page
  • Merging documents page by page
  • Cropping pages
  • Merging multiple pages into a single page
  • Encrypting and decrypting PDF files
  • and more!

To install PyPDF2, run the following command from the command line:

    pip install PyPDF2

The module name is case sensitive, so make sure that the letter y is lowercase and everything else is uppercase. The examples below use the PyPDF2 1.x and 2.x names (PdfFileReader, getPage() and so on); PyPDF2 3.0 removed these names, so install an earlier release (for example, pip install "PyPDF2<3.0") to run the code as written. All code and PDFs used in this tutorial are available here.
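Because the class and method names used in this article (PdfFileReader, getPage() and so on) belong to the older PyPDF2 API and were removed in PyPDF2 3.0, it can help to check which release is installed before running the examples. The sketch below is one possible way to do that with the standard library only. The helper names uses_legacy_api and installed_pypdf2_version are hypothetical, and the code assumes version strings of the form major.minor.patch.

```python
from importlib import metadata


def uses_legacy_api(version):
    """Return True if a PyPDF2 version string is older than 3.0."""
    # Assumption: the version string starts with an integer major version.
    major = int(version.split(".")[0])
    return major < 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pypdf2_version()
    if version is None:
        print("PyPDF2 is not installed")
    elif uses_legacy_api(version):
        print("PyPDF2", version, "provides the names used in this article")
    else:
        print("PyPDF2", version, "uses the newer names (PdfReader, pages, ...)")
```

If the check reports a 3.x release, the examples can still be run by installing an older release alongside, for instance in a separate virtual environment.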

1. Extract text from a PDF file

    # import the required module
    import PyPDF2

    # open the PDF file in binary read mode
    pdfFileObj = open('example.pdf', 'rb')

    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # print the number of pages in the PDF file
    print(pdfReader.numPages)

    # get the first page (page indexes start at 0)
    pageObj = pdfReader.getPage(0)

    # extract and print the text of the page
    print(pageObj.extractText())

    # close the PDF file object
    pdfFileObj.close()

The output of the above program looks like this:

    20
    PythonBasics SRDoty August27,2008 Contents 1Preliminaries 4 1.1WhatisPython? ....... .............................. 4 1.2 Installation anddocumentation ................. ... ......... 4
    [and some more lines ...]

Let's go through the above code piece by piece:

  • pdfFileObj = open('example.pdf', 'rb')

    We open example.pdf in binary read mode and save the file object as pdfFileObj.

  • pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    Here we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.

  • print(pdfReader.numPages)

    The numPages property gives the number of pages in the PDF file. In our case it is 20 (see the first line of the output).

  • pageObj = pdfReader.getPage(0)

    Now we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() method, which takes a page number (counting from 0) as an argument and returns the corresponding page object.

  • print(pageObj.extractText())

    The page object has an extractText() method that extracts the text from the PDF page.

  • pdfFileObj.close()

    Finally, we close the PDF file object.
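The walkthrough above reads only the first page. To read the whole document, the same calls can be placed in a loop. The following is a minimal sketch, assuming a reader object that provides numPages and getPage() as shown above. The function name extract_all_text is hypothetical.

```python
def extract_all_text(pdfReader):
    """Return the text of every page, joined with newlines.

    pdfReader is assumed to behave like the PdfFileReader object used
    above: it has a numPages property and a getPage(index) method whose
    result offers extractText().
    """
    pages = []
    for index in range(pdfReader.numPages):
        pages.append(pdfReader.getPage(index).extractText())
    return "\n".join(pages)
```

With the reader from the example above, extract_all_text(pdfReader) would return the text of all 20 pages as one string. It has to be called before pdfFileObj.close(), because the pages are read from the open file.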

Note: Although PDF files are great for laying out text in a way that is easy for people to print and read, they are not easy for software to parse into plain text. As a result, PyPDF2 may make mistakes when extracting text from a PDF, and may even be unable to open some PDFs at all. Unfortunately, there is not much you can do about this; PyPDF2 may simply not work with some of your PDF files.
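As the sample output above shows, extracted text often contains runs of dots (table-of-contents leaders) and irregular whitespace. The following is a minimal post-processing sketch that uses only the standard library. The function name clean_extracted_text and the choice of what to strip are assumptions for illustration, not part of PyPDF2.

```python
import re


def clean_extracted_text(text):
    """Tidy raw text extracted from a PDF page (illustrative only)."""
    # Replace dot leaders such as "......." (two or more dots) with a space.
    text = re.sub(r"\.{2,}", " ", text)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "1.1WhatisPython? ....... .............................. 4"
print(clean_extracted_text(raw))  # prints: 1.1WhatisPython? 4
```

This kind of cleanup cannot restore spaces that were lost during extraction (for example, "WhatisPython"); it only removes noise that is still visible in the text.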

2. Rotate PDF pages

    # import the required module
    import PyPDF2

    def PDFrotate(origFileName, newFileName, rotation):
        # open the original PDF file
        pdfFileObj = open(origFileName, 'rb')

        # create a PDF reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # create a PDF writer object for the new PDF
        pdfWriter = PyPDF2.PdfFileWriter()

        # rotate each page
        for page in range(pdfReader.numPages):
            # get the page and rotate it clockwise
            pageObj = pdfReader.getPage(page)
            pageObj.rotateClockwise(rotation)

            # add the rotated page object to the PDF writer
            pdfWriter.addPage(pageObj)

        # open the new PDF file for writing
        newFile = open(newFileName, 'wb')

        # write the rotated pages to the new file
        pdfWriter.write(newFile)

        # close the original PDF file object
        pdfFileObj.close()

        # close the new PDF file object
        newFile.close()

    def main():
        # original PDF file name
        origFileName = 'example.pdf'

        # new PDF file name
        newFileName = 'rotated_example.pdf'

        # rotation angle in degrees (clockwise)
        rotation = 270

        # call the PDFrotate function
        PDFrotate(origFileName, newFileName, rotation)

    if __name__ == "__main__":
        # call the main function
        main()

Here is what the first page of rotated_example.pdf looks like after rotation (right image):

Some important points about the above code:

  • For rotation, we first create a PDF reader object for the original PDF.

  • pdfWriter = PyPDF2.PdfFileWriter()

    The rotated pages will be written to a new PDF. To write PDFs, we use an object of the PdfFileWriter class of the PyPDF2 module.

  • for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)
        pdfWriter.addPage(pageObj)

    Now we iterate over every page of the original PDF. We get the page object with the getPage() method of the PDF reader, rotate it with the rotateClockwise() method of the page object, and then add the rotated page object to the PDF writer with its addPage() method.

  • newFile = open(newFileName, 'wb')
    pdfWriter.write(newFile)
    pdfFileObj.close()
    newFile.close()

    Finally, we write the PDF pages to a new PDF file. We open a new file object and write the pages to it with the write() method of the PDF writer object, then close both the original PDF file object and the new file object.
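Page rotation in a PDF works in steps of 90 degrees, and rotateClockwise() expects an angle that is a multiple of 90. Validating the angle before calling PDFrotate can catch bad input early. The helper below is a hypothetical sketch (normalize_rotation is not part of PyPDF2) that rejects other angles and maps any valid angle into the range 0 to 270.

```python
def normalize_rotation(angle):
    """Return a clockwise angle in {0, 90, 180, 270}, or raise ValueError."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Python's % returns a non-negative result for a positive modulus,
    # so -90 (counterclockwise) becomes 270 (clockwise).
    return angle % 360


print(normalize_rotation(270))  # prints: 270
print(normalize_rotation(-90))  # prints: 270
print(normalize_rotation(450))  # prints: 90
```

In main(), the rotation value could be passed through this helper before it is handed to PDFrotate.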

3. Merge PDF files

    # import the required module
    import PyPDF2

    def PDFmerge(pdfs, output):
        # create a PDF merger object
        pdfMerger = PyPDF2.PdfFileMerger()

        # append the PDFs one at a time
        for pdf in pdfs:
            with open(pdf, 'rb') as f:
                pdfMerger.append(f)

        # write the combined PDF to the output file
        with open(output, 'wb') as f:
            pdfMerger.write(f)

    def main():
        # PDF files to merge
        pdfs = ['example.pdf', 'rotated_example.pdf']

        # output PDF file name
        output = 'combined_example.pdf'

        # call the PDFmerge function
        PDFmerge(pdfs=pdfs, output=output)

    if __name__ == "__main__":
        # call the main function
        main()

The output of the above program is a combined PDF file, combined_example.pdf, obtained by merging example.pdf and rotated_example.pdf.

Let's take a look at the important aspects of this program:

  • pdfMerger = PyPDF2.PdfFileMerger()

    To merge PDFs, we use a pre-built class of the PyPDF2 module, PdfFileMerger. Here we create a PDF merger object, pdfMerger.

  • for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)

    We add the file object of each PDF file to the PDF merger object using its append() method.

  • with open(output, 'wb') as f:
        pdfMerger.write(f)

    Finally, we write the PDF pages to the output PDF using the write() method of the PDF merger object.
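Merging fails if one of the input files is not a readable PDF, so a quick pre-check of the input list can give a clearer error before PdfFileMerger is involved. The sketch below relies on the fact that a PDF file normally begins with the bytes %PDF-. It is a rough filter rather than a full validation, and the helper names looks_like_pdf and filter_pdfs are hypothetical.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def filter_pdfs(paths):
    """Keep only the paths that look like PDF files."""
    return [p for p in paths if looks_like_pdf(p)]


# Demonstration with two throwaway files, so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(bad, "wb") as f:
        f.write(b"not a pdf")
    checked = filter_pdfs([good, bad])
    print([os.path.basename(p) for p in checked])  # prints: ['good.pdf']
```

In the merge program, the pdfs list could be passed through filter_pdfs before calling PDFmerge. A file that passes this check may still be damaged in other ways.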

4. Split a PDF file

    # import the required module
    import PyPDF2

    def PDFsplit(pdf, splits):
        # open the input PDF file
        pdfFileObj = open(pdf, 'rb')

        # create a PDF reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # starting index of the first slice
        start = 0

        # ending index of the first slice
        end = splits[0]

        for i in range(len(splits) + 1):
            # create a PDF writer object for the (i+1)th split
            pdfWriter = PyPDF2.PdfFileWriter()

            # output PDF file name
            outputpdf = pdf.split('.pdf')[0] + str(i) + '.pdf'

            # add pages to the PDF writer
            for page in range(start, end):
                pdfWriter.addPage(pdfReader.getPage(page))

            # write the split pages to the output PDF
            with open(outputpdf, "wb") as f:
                pdfWriter.write(f)

            # move the start position to the next split
            start = end
            try:
                # set the end position for the next split
                end = splits[i + 1]
            except IndexError:
                # set the end position for the last split
                end = pdfReader.numPages

        # close the input PDF file object
        pdfFileObj.close()

    def main():
        # PDF file to split
        pdf = 'example.pdf'

        # page indexes at which to split
        splits = [2, 4]

        # call the PDFsplit function to split the PDF
        PDFsplit(pdf, splits)

    if __name__ == "__main__":
        # call the main function
        main()

The output will be three new PDF files: example0.pdf (pages 0 and 1), example1.pdf (pages 2 and 3) and example2.pdf (pages 4 to the end).

No new function or class is used in the above program. Using simple logic and iteration, we split the given PDF at the page indexes listed in splits.
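The splitting logic can be looked at on its own, without any PDF involved. The sketch below computes the same page ranges and output file names as the program above for splits = [2, 4] and a 20-page document. The helper names split_ranges and split_output_name are hypothetical, and os.path.splitext is swapped in for the pdf.split('.pdf') call as an alternative that also copes with file names containing ".pdf" more than once.

```python
import os


def split_ranges(splits, num_pages):
    """Return (start, end) page-index pairs, end exclusive, for each part."""
    boundaries = [0] + list(splits) + [num_pages]
    return list(zip(boundaries[:-1], boundaries[1:]))


def split_output_name(pdf, index):
    """Build an output name such as example0.pdf from example.pdf."""
    base, ext = os.path.splitext(pdf)
    return base + str(index) + ext


ranges = split_ranges([2, 4], 20)
names = [split_output_name("example.pdf", i) for i in range(len(ranges))]
print(ranges)  # prints: [(0, 2), (2, 4), (4, 20)]
print(names)   # prints: ['example0.pdf', 'example1.pdf', 'example2.pdf']
```

Each (start, end) pair corresponds to one pass of the outer loop in PDFsplit, with range(start, end) selecting the pages for that part.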

5. Add a watermark to PDF pages

    # import the required module
    import PyPDF2

    def add_watermark(wmFile, pageObj):
        # open the watermark PDF file
        wmFileObj = open(wmFile, 'rb')

        # create a PDF reader object for the watermark PDF
        pdfReader = PyPDF2.PdfFileReader(wmFileObj)

        # merge the first page of the watermark PDF into the passed page object
        pageObj.mergePage(pdfReader.getPage(0))

        # close the watermark PDF file object
        wmFileObj.close()

        # return the watermarked page object
        return pageObj

    def main():
        # watermark PDF file name
        mywatermark = 'watermark.pdf'

        # original PDF file name
        origFileName = 'example.pdf'

        # new PDF file name
        newFileName = 'watermarked_example.pdf'

        # open the original PDF file
        pdfFileObj = open(origFileName, 'rb')

        # create a PDF reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # create a PDF writer object for the new PDF
        pdfWriter = PyPDF2.PdfFileWriter()

        # add the watermark to each page
        for page in range(pdfReader.numPages):
            # create a watermarked page object
            wmpageObj = add_watermark(mywatermark, pdfReader.getPage(page))

            # add the watermarked page object to the PDF writer
            pdfWriter.addPage(wmpageObj)

        # new PDF file object