Bruno Behnken

Nov 04, 2023

Manipulating PDFs With pypdf

During the Python Brasil 2023 conference I gave a short talk about manipulating PDFs with the pypdf library, so I thought it would be a great idea to post the content here.

According to the documentation,

pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

Merging PDFs

The first thing I used pypdf for was to merge two PDFs. I needed to print both of them, but didn't want to take two files to the local LAN house (I don't have a printer at home anymore). The code is very simple: we use the PdfWriter class to write the merged PDF.

from pypdf import PdfWriter

merger = PdfWriter()

for pdf in ["pdf1.pdf", "pdf2.pdf"]:
    merger.append(pdf)

merger.write("print.pdf")
merger.close()

This code simply merges two files into one. Notice that we don't need to open the files ourselves: append accepts the file paths directly. But what if you don't want to merge all the pages? Maybe you need to select only some of them. That can be done with the following code:

from pypdf import PdfWriter

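merger = PdfWriter()

# Open the files ourselves; the with block closes them when we are done
with open("pdf1.pdf", "rb") as pdf1, open("pdf2.pdf", "rb") as pdf2:
    # pages=(0, 1) is a (start, stop) range: only the first page of pdf1
    merger.append(fileobj=pdf1, pages=(0, 1))
    # No pages argument: every page of pdf2
    merger.append(fileobj=pdf2)
    merger.write("print.pdf")

merger.close()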

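Notice that now we are opening the files with the built-in open function, and when we pass them to merger.append we can specify a tuple describing the range of pages to merge. The tuple works like Python's range: (0, 1) means start at page 0 and stop before page 1, so only the first page of pdf1 is merged.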

Instead of opening the files this way, we can use the PdfReader class from pypdf, which gives us many more possibilities to play with our PDFs. Let's start by doing something similar to what we did before, this time taking the first page of each file with PdfReader:

from pypdf import PdfWriter, PdfReader

pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()

writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0])

writer.write("print.pdf")
writer.close()

Now the reader variables hold PdfReader objects, which expose several properties of the opened PDFs. One of them is the pages attribute, which behaves like a list of all the pages in the PDF. Thus, instead of using the append method, we use the add_page method, passing it an element (a page) of the pages list.

Rotating pages

Now let's explore a bit more of what the PdfReader class makes possible. We can start by applying a transformation that rotates the pdf2 page we are merging:

from pypdf import PdfWriter, PdfReader

pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()

writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0].rotate(180))

writer.write("print.pdf")
writer.close()

The rotate(180) transformation rotates the first page of pdf2 by 180°, leaving it upside down.

Merging pages

Another transformation is to merge the two pages into a single one:

from pypdf import PdfWriter, PdfReader

pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()

pdf1_page = pdf1_reader.pages[0]
pdf2_page = pdf2_reader.pages[0]
pdf1_page.merge_page(pdf2_page)

writer.add_page(pdf1_page)
writer.write("print.pdf")
writer.close()

This code overlays the two pages. Note that merge_page is a method of the page object, not of the writer. The page passed as the argument (pdf2_page) is drawn on top of the page whose merge_page method is called (pdf1_page), so any element of pdf1_page becomes invisible wherever an element of pdf2_page covers it. The call also modifies pdf1_page in place, so be careful when using this transformation.

Many other transformations are available, and we have explored only a few of them. Take a look at the documentation for more examples of pypdf transformations.

Password protection

Another very handy pypdf feature is protecting PDFs with a password. For that, we call the merger.encrypt() method, giving it a password and an encryption algorithm:

from pypdf import PdfWriter

merger = PdfWriter()
merger.append("pdf1.pdf")

merger.encrypt("pythonbrasil2023", algorithm="AES-256-R5")

merger.write("print.pdf")
merger.close()

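Now the PDF is encrypted with the password pythonbrasil2023. Note that you can choose among several algorithms, some more secure than others, and that some require additional packages to work. As far as I know, the AES options, including the AES-256-R5 used here, need an extra cryptography package such as cryptography or PyCryptodome to be installed.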

Text extraction

pypdf also works for some basic text extraction from PDFs. It's worth noting that it does not read text from images (you will need an OCR tool for that), and that it does not extract text from scrambled fonts. That said, we can extract the text of a PDF and manipulate it as a string in our program. Check this code out:

from pypdf import PdfReader

reader = PdfReader("text.pdf")
page = reader.pages[0]
text = page.extract_text()
print("text:", text)

If everything went well, we should see the text of the page printed. Because it is in the text variable, we can perform any string operation with it. When I presented this code, one person in the audience pointed out an interesting use case for it: reading the pages of a book and submitting the strings to a text-to-speech algorithm, in order to produce an audiobook.
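For a use case like the audiobook, you would want the text of every page, not only the first. Here is a minimal sketch of a helper that does that. The name pages_to_text and its separator parameter are my own, not part of pypdf; the helper only relies on each page having an extract_text method, so it accepts reader.pages directly.

```python
def pages_to_text(pages, separator="\n\n"):
    """Join the extracted text of every page, skipping pages with no text."""
    chunks = []
    for page in pages:
        # Treat a missing result as an empty string and trim stray whitespace
        text = (page.extract_text() or "").strip()
        if text:
            chunks.append(text)
    return separator.join(chunks)


# Usage with pypdf (assumes a file named book.pdf exists):
# from pypdf import PdfReader
# book_text = pages_to_text(PdfReader("book.pdf").pages)
```

Skipping empty pages keeps blank or image-only pages from leaving long silent gaps in whatever consumes the text afterwards.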

Those are only some of the features offered by the pypdf library. Again, I encourage you to check the documentation to see more of its capabilities.