Invented by Adobe, PDF (Portable Document Format) is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video and business logic.
In this article, we will learn how to perform various operations on PDF files using simple Python scripts:
- Extracting text from a PDF
- Rotating PDF pages
- Merging PDF files
- Splitting a PDF
- Adding a watermark to PDF pages
We will use a third-party module, PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It is capable of:
- Extracting document information (title, author, ...)
- Splitting documents page by page
- Merging documents page by page
- Cropping pages
- Merging multiple pages into a single page
- Encrypting and decrypting PDF files
- and more!
To install PyPDF2, run the following command from the command line:
pip install PyPDF2
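The module name is case sensitive: the letter y must be lowercase and every other letter uppercase. Note that the class and method names used in this article (PdfFileReader, getPage(), extractText() and so on) are those of PyPDF2 releases before 3.0; release 3.0 replaced them with new names. All code and PDFs used in this article are available here.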
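As a quick check that the installation works, the first capability in the list above (reading document information) might be sketched as follows. This is a minimal sketch, assuming a PyPDF2 release earlier than 3.0, where PdfFileReader and getDocumentInfo() are available; the helper names describe_document and describe_file are made up for illustration.

```python
def describe_document(reader):
    """Return title, author and page count from a PdfFileReader-like object."""
    info = reader.getDocumentInfo()  # may be None when the PDF carries no metadata
    return {
        "title": info.title if info is not None else None,
        "author": info.author if info is not None else None,
        "pages": reader.numPages,
    }


def describe_file(path):
    """Open the PDF at `path` and describe it (hypothetical helper)."""
    import PyPDF2  # third-party; assumption: a release earlier than 3.0

    with open(path, "rb") as pdf_file:
        return describe_document(PyPDF2.PdfFileReader(pdf_file))
```

Calling describe_file('example.pdf') would return a small dictionary; the actual values depend on the metadata stored in the file.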
1. Extract text from PDF file
The output of the above program looks like this:
20 PythonBasics SRDoty August27,2008 Contents 1Preliminaries 4 1.1WhatisPython? ....... .............................. 4 1.2 Installation anddocumentation ................. ... ......... 4 [and some more lines ...]
Let's go through the above code piece by piece:
pdfFileObj = open('example.pdf', 'rb')
We open example.pdf in binary mode and save the file object as pdfFileObj.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Here we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.
print(pdfReader.numPages)
The numPages property gives the number of pages in the PDF file. In our case it is 20 (see the first line of the output).
pageObj = pdfReader.getPage(0)
Now we create an object of the PageObject class of the PyPDF2 module. The PDF reader has a getPage() method, which takes a page number (indexed from 0) as an argument and returns a page object.
print(pageObj.extractText())
The page object has an extractText() method that extracts the text from a PDF page.
pdfFileObj.close()
Finally, we close the PDF file object.
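The steps above can be assembled into one short script. The following is a sketch rather than the article's original listing: it assumes a PyPDF2 release earlier than 3.0 (where PdfFileReader, numPages, getPage() and extractText() exist), and the function names first_page_text and print_first_page are hypothetical.

```python
def first_page_text(reader):
    """Return (page count, text of page 0) from a PdfFileReader-like object."""
    page = reader.getPage(0)  # page numbers start at 0
    return reader.numPages, page.extractText()


def print_first_page(path="example.pdf"):
    """Open `path`, then print its page count and the text of its first page."""
    import PyPDF2  # third-party; assumption: a release earlier than 3.0

    # The with-block closes the file even if reading or extraction raises.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        num_pages, text = first_page_text(reader)
        print(num_pages)
        print(text)
```

Using a with-block instead of an explicit close() call is a design choice of this sketch; it has the same effect when everything succeeds.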
Note: Although PDF files are great for laying out text in a way that people can easily print and read, they are not easy for software to parse into plain text. As a result, PyPDF2 may make mistakes when extracting text from a PDF, and may even fail to open some PDFs at all. Unfortunately, there is not much you can do about this; PyPDF2 may simply be unable to work with some of your PDF files.
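Since extraction can fail on individual pages, one defensive option is to record a failure and carry on rather than abort the whole run. The sketch below assumes a reader object with the pre-3.0 numPages and getPage() interface; the function name extract_pages_leniently is hypothetical, and catching Exception broadly is an assumption made because the exact exception types differ between releases.

```python
def extract_pages_leniently(reader):
    """Return one entry per page: its text, or None if extraction failed."""
    texts = []
    for number in range(reader.numPages):
        try:
            texts.append(reader.getPage(number).extractText())
        except Exception:  # deliberately broad; see the assumption stated above
            texts.append(None)
    return texts
```

Pages that come back as None can then be listed and inspected by hand or passed to a different tool.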
2. Rotate PDF pages
Here is how the first page of rotated_example.pdf looks after rotation (right image):
Some important points related to the above code:
- For rotation, we first create a PDF reader object for the original PDF.
pdfWriter = PyPDF2.PdfFileWriter()
The rotated pages will be written to a new PDF. To write PDFs, we use an object of the PdfFileWriter class of the PyPDF2 module.
for page in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(page)
    pageObj.rotateClockwise(rotation)
    pdfWriter.addPage(pageObj)
Now we iterate over every page of the original PDF. We get each page object using the getPage() method of the PDF reader, rotate it using the rotateClockwise() method of the page object, and then add the rotated page to the PDF writer using its addPage() method.
newFile = open(newFileName, 'wb')
pdfWriter.write(newFile)
pdfFileObj.close()
newFile.close()
Now we write the PDF pages to a new PDF file. First we open a new file object and write the pages to it using the write() method of the PDF writer object. Finally, we close the original PDF file object and the new file object.
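Put together, the rotation steps above might look like the sketch below. It assumes a PyPDF2 release earlier than 3.0; the function names rotate_all_pages and rotate_pdf are hypothetical, and the check that the angle is a multiple of 90 is an addition of this sketch, made because rotateClockwise() only accepts such angles.

```python
def rotate_all_pages(reader, writer, rotation):
    """Copy every page from `reader` to `writer`, rotated clockwise by `rotation` degrees."""
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    for number in range(reader.numPages):
        page = reader.getPage(number)
        page.rotateClockwise(rotation)
        writer.addPage(page)


def rotate_pdf(source, target, rotation):
    """Write a rotated copy of `source` to `target` (the two paths must differ)."""
    import PyPDF2  # third-party; assumption: a release earlier than 3.0

    # Both files are closed automatically when the with-block ends.
    with open(source, "rb") as src, open(target, "wb") as dst:
        reader = PyPDF2.PdfFileReader(src)
        writer = PyPDF2.PdfFileWriter()
        rotate_all_pages(reader, writer, rotation)
        writer.write(dst)
```

For example, rotate_pdf('example.pdf', 'rotated_example.pdf', 270) would produce the rotated file discussed in this section, assuming 270 degrees is the desired rotation.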
3. Merge PDF files
The output of the above program is a combined PDF file, combined_instance.pdf, obtained by merging example.pdf and rotated_example.pdf.
Let's take a look at the important aspects of this program:
pdfMerger = PyPDF2.PdfFileMerger()
To merge files, we use a pre-built class, PdfFileMerger, of the PyPDF2 module.
Here we create a PDF merger object, pdfMerger.
for pdf in pdfs:
    with open(pdf, 'rb') as f:
        pdfMerger.append(f)
We now append the file object of each PDF file to the PDF merger object using its append() method.
with open(output, 'wb') as f:
    pdfMerger.write(f)
Finally, we write the merged pages to the output PDF using the write() method of the PDF merger object.
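The merge steps above can be wrapped into a reusable function. This is a sketch under the same assumption as before (a PyPDF2 release earlier than 3.0, where PdfFileMerger exists); merge_into and merge_pdfs are hypothetical names, and the merger is passed in as a parameter only so that the loop can be tried without the library.

```python
def merge_into(merger, paths, output):
    """Append each PDF in `paths` to `merger`, in order, then write to `output`."""
    for path in paths:
        with open(path, "rb") as pdf_file:
            merger.append(pdf_file)
    with open(output, "wb") as out_file:
        merger.write(out_file)


def merge_pdfs(paths, output):
    """Merge the PDFs at `paths` into a single file at `output`."""
    import PyPDF2  # third-party; assumption: a release earlier than 3.0

    merge_into(PyPDF2.PdfFileMerger(), paths, output)
```

The pages appear in the output in the order in which the paths are listed.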
4. Splitting PDF file
The output will be three new PDF files: split 1 (pages 0 and 1), split 2 (pages 2 and 3) and split 3 (page 4 to the end).
No new function or class is used in the above Python program. Using simple logic and iteration, we split the given PDF according to the given list of split points.
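The "simple logic and iteration" can be illustrated with a short sketch. It is not the article's original listing: the split points [2, 4] are inferred from the output described above, the names split_ranges, split_pdf and make_writer are hypothetical, and the reader and writer objects are assumed to follow the pre-3.0 PyPDF2 interface (numPages, getPage(), addPage()).

```python
def split_ranges(num_pages, splits):
    """Return (start, end) page ranges, end exclusive, cut at each index in `splits`."""
    # Ignore duplicate cut points and any that fall outside the document.
    cuts = sorted(s for s in set(splits) if 0 < s < num_pages)
    bounds = [0] + cuts + [num_pages]
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if start < end]


def split_pdf(reader, splits, make_writer):
    """Return one writer per range, each holding the pages of that range."""
    writers = []
    for start, end in split_ranges(reader.numPages, splits):
        writer = make_writer()  # e.g. PyPDF2.PdfFileWriter
        for number in range(start, end):
            writer.addPage(reader.getPage(number))
        writers.append(writer)
    return writers
```

For a 20-page document and split points [2, 4], split_ranges gives the ranges (0, 2), (2, 4) and (4, 20), matching the three parts listed above. Each returned writer would then be saved with its write() method to a separate output file.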
5. Adding a watermark to PDF pages