During the Python Brasil 2023 conference I presented a small talk about manipulating PDFs using the library pypdf, so
I thought it would be a great idea to post the content here.
According to the documentation,
pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the
pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text
and metadata from PDFs as well.
Merging PDFs
The first thing I used pypdf for was to merge two PDFs. I needed to print both of them, but didn't want to take two
files to the local lan house (a Brazilian internet café; I don't have a printer at home anymore). The code is very
simple, and we're using the PdfWriter class to write the merged PDF.
from pypdf import PdfWriter

merger = PdfWriter()
for pdf in ["pdf1.pdf", "pdf2.pdf"]:
    merger.append(pdf)

merger.write("print.pdf")
merger.close()
This code simply merges the two files into one. It's interesting to notice that we don't need to open the files
ourselves to merge them. But what if you don't want to merge all the pages? Maybe you need to select which pages get
merged. This can be done with the following code:
from pypdf import PdfWriter
merger = PdfWriter()
pdf1 = open("pdf1.pdf", "rb")
pdf2 = open("pdf2.pdf", "rb")
merger.append(fileobj=pdf1, pages=(0, 1))
merger.append(fileobj=pdf2)
merger.write("print.pdf")
merger.close()
Notice that now we are opening the files with the built-in open function, and when we pass them to merger.append we
can specify a tuple containing the range of pages to be merged. The tuple works like a (start, stop) pair for range,
so pages=(0, 1) selects only the first page of pdf1.
Instead of opening the files this way, we can use the PdfReader class from pypdf, which gives us many more
possibilities to play with our PDFs. Let's start by doing the same thing we did before, now using PdfReader:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0])
writer.write("print.pdf")
writer.close()
Now the reader variables hold PdfReader objects, which expose several properties of the opened PDFs. One of these
properties is the pages attribute, a list containing all the pages of the PDF. Thus, instead of using the append
method, we use the add_page method, passing it an element (a page) of the pages list.
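Because pages is an ordinary list of page objects, we can pick arbitrary pages from it. As a minimal sketch (input.pdf
here is a hypothetical multi-page file), copying every other page into a new PDF:

from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")  # hypothetical multi-page file
writer = PdfWriter()

# Take every other page by indexing into the pages list
for index in range(0, len(reader.pages), 2):
    writer.add_page(reader.pages[index])

writer.write("every_other_page.pdf")
writer.close()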
Rotating pages
Now let's explore a bit more of what the PdfReader class can do. We can start by applying a transformation that
rotates the pdf2 page we are merging:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0].rotate(180))
writer.write("print.pdf")
writer.close()
The rotate(180) transformation will rotate the first page of pdf2 by 180˚, leaving it upside down.
Merging pages
Another transformation would be to merge the two pages into a single one:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
pdf1_page = pdf1_reader.pages[0]
pdf2_page = pdf2_reader.pages[0]
pdf1_page.merge_page(pdf2_page)
writer.add_page(pdf1_page)
writer.write("print.pdf")
writer.close()
This code will overlap the two pages. Observe that merge_page is a method of the page object; it is not called on the
writer object. The page passed as an argument (pdf2_page) is stamped on top of the page whose merge_page method is
called (pdf1_page), which means any element of pdf1_page becomes invisible if an element of pdf2_page happens to
cover it when the overlap occurs. Be careful when using this transformation.
Many other transformations are available; we have explored just a handful of them. Take a look at the documentation
for more examples of pypdf transformations.
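As one more taste of what's there, here is a minimal sketch using pypdf's Transformation class (the 0.5 factor is an
arbitrary choice of mine) to shrink a page's content to half its size before writing it:

from pypdf import PdfWriter, PdfReader, Transformation

reader = PdfReader("pdf1.pdf")
writer = PdfWriter()

page = reader.pages[0]
# Scale the page content down to 50% in both dimensions
page.add_transformation(Transformation().scale(sx=0.5, sy=0.5))
writer.add_page(page)

writer.write("scaled.pdf")
writer.close()

Note that this scales the drawn content; the page box itself keeps its original dimensions.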
Password protection
Another very handy pypdf feature is protecting PDFs with a password. For that, we call the merger.encrypt() method,
giving it a password and an encryption algorithm:
from pypdf import PdfWriter
merger = PdfWriter()
merger.append("pdf1.pdf")
merger.encrypt("pythonbrasil2023", algorithm="AES-256-R5")
merger.write("print.pdf")
merger.close()
Now the PDF will be encrypted with the password pythonbrasil2023. It's important to notice that you can choose
between several algorithms, some more secure than others, and some that require additional packages to work.
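To read the protected file back, PdfReader can decrypt it given the same password (this is one of the cases where an
extra package, typically cryptography, is needed for the AES algorithms). A minimal sketch:

from pypdf import PdfReader

reader = PdfReader("print.pdf")
if reader.is_encrypted:
    reader.decrypt("pythonbrasil2023")  # same password used when encrypting

print(len(reader.pages))  # the pages are accessible after decryption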
Text extraction
pypdf also handles some basic text extraction from PDFs. It's worth noting that it does not read text from images
(you will need an OCR tool for that), and that it does not extract text from scrambled fonts. That said, we can
extract the text of a PDF and manipulate it as a string in our program. Check this code out:
from pypdf import PdfReader
reader = PdfReader("text.pdf")
page = reader.pages[0]
text = page.extract_text()
print("text:", text)
If everything went well, we should see the text of the page printed. Because it is in the text variable, we can
perform any string operation on it. When I presented this code, someone in the audience pointed out an interesting
use case for it: reading the pages of a book and feeding the strings to a text-to-speech engine, in order to produce
an audiobook.
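As a minimal follow-up sketch, we can join the text of every page and run an ordinary string operation on it, such as
a crude word count:

from pypdf import PdfReader

reader = PdfReader("text.pdf")

# Extract the text of every page and join everything into one string
full_text = "\n".join(page.extract_text() for page in reader.pages)

# From here, any string operation works, e.g. a crude word count
print(len(full_text.split()))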
Those are only some of the features offered by the pypdf library. I, again, encourage you to check the
documentation to see more of its capabilities.
I talked about Python Iterators in my last post, and how they work within loop structures. Now let's talk about Python
Generators. The first thing worth mentioning is that a Generator is actually a special Iterator that automatically
implements the __iter__ and __next__ methods, so you don't have to. Second, you don't have to declare a Generator
class (though you can still do it if you want to). A function that contains yield is called a Generator function, and
calling it will automatically instantiate a new Generator and return it to the caller.
yield works similarly to the return statement, but with one key difference: return hands a value back to the caller
and ends the function call, removing it from the memory stack; yield returns a value to the caller, transfers
execution control to the caller, and saves the state of the function call so it can be restored later.
Why would you want to restore the context of a function call after returning a value? Because you may need to return
more values. Let's explain this better with the same example we used in the last post: building a custom range.
def our_range(lower_boundary, upper_boundary):
    i = lower_boundary
    while i < upper_boundary:
        yield i
        i += 1
Now let's try that in a for loop:
>>> for number in our_range(1, 10):
...     print(number, end=' ')
...
1 2 3 4 5 6 7 8 9
You are now probably wondering how this works. When our_range is called, a Generator object is returned to be used by
the for loop. As we learned in the last post, at every iteration of the for loop the __next__ function is called.
When this happens, the body of our_range is executed. On the first call, i is set equal to lower_boundary, the while
condition is evaluated, and we execute yield i. At this point, 1 is yielded (returned) to the for loop and assigned
to number, which is then printed. On the next for iteration, __next__ is called again, and instead of executing the
function from the start, execution resumes on the line after the yield; in this case, i += 1, which assigns the value
2 to i. The while condition is evaluated again, and yield i is executed once more, this time yielding the value 2 to
the for loop, where it is assigned to number and printed. This goes on until the condition in the while loop
evaluates to False. When this happens, the function ends its execution without yielding a value, meaning the
Generator is exhausted. When an exhausted Generator has its __next__ function called, it raises a StopIteration
exception, which, in our case, is caught by the for and causes the loop to end, finishing our execution.
In a simple Iterator, when the __next__ function ends its execution, all its context is lost, so any values that we
can't afford to lose must be kept as attributes of the Iterator object. That is what I did with the i variable in the
Iterators post. When using a Generator, we can keep these values in variables inside the generator function, since
the context is not lost between calls.
The Real Generator Deal
You may be thinking “nice, but what real benefit comes with this 'saving the context' approach?” Well, the real deal
is: Iterators, generally speaking, must have all their data assigned to a variable (and thus stored in memory) in
order to iterate through it; Generators, because they save their context at every iteration, can generate the value
for each iteration on demand, and therefore don't need to have all the values available in memory.
Again, let's explain this better with an example: we will write a script that reads values from a txt file, performs
an operation (calling the function perform_operation) on each of them, and saves the results to a csv file.
def read_txt(filename):
    file = open(filename)
    return file.read().split('\n')

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Pay attention to the way we are reading the data. The read function brings the whole file into memory so that it can
be split into a list of lines. This means the script requires at least as much memory as the size of the file. If you
are processing a big file in a memory-restricted environment (for example, a container), your script may fail simply
because it ran out of memory (a MemoryError exception). Now let's fix that with a Generator.
def read_txt(filename):
    for row in open(filename):
        yield row

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Now we are reading and yielding one row at a time, which means that instead of requiring memory the size of the whole
file, we only require memory the size of a single line.
Generator Comprehension
This advantage becomes even clearer when we use generator comprehensions (also known as generator expressions), which
work the same way as list comprehensions. As an example, let's iterate through a million numbers. Using a list, this
would be:
numbers = [i for i in range(1_000_000)]
for number in numbers:
    print(number)
numbers is a list containing a million elements. Let's check its size:
>>> import sys
>>> sys.getsizeof(numbers)
8448728
If, instead of a list, we used a Generator, the code would be:
numbers = (i for i in range(1_000_000))
for number in numbers:
    print(number)
Let's now check the size of the numbers Generator:
>>> sys.getsizeof(numbers)
112
Both codes will behave the same, but the code using the Generator requires much less memory.
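As a usage note, a generator expression can also be passed straight to any function that consumes an iterable, such
as sum, with the same memory savings:

>>> sum(i for i in range(1_000_000))
499999500000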
To infinity and beyond
Because of the property of only storing the current element of a sequence, Generators are useful for representing
infinite sequences. If your teacher or boss asks you to build a sequence with all the natural numbers, you can either
say that it is impossible, or you can give them this Generator:
def natural_numbers():
    i = 0  # My natural numbers start with 0, yours can start with 1 if you want to (:
    while True:
        yield i
        i += 1
This Generator will never stop giving numbers, so it is a viable way of representing a sequence that never ends.
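Just be careful to never consume it with a bare for loop and no exit condition. One safe way to take only part of it
is itertools.islice, which lazily slices any iterator:

from itertools import islice

# Consume only the first five values of the infinite Generator
print(list(islice(natural_numbers(), 5)))  # [0, 1, 2, 3, 4]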
A few more Generator tricks
Generators are versatile, and here are some more things you can do with them.
Multiple yields
Unlike the return statement, yield can be used multiple times in the same function. Let's suppose you want a
Generator that yields a number, then yields the square of that number, and only then increments the number. This can
be done with the following code:
def numbers_and_squares():
    i = 0
    while True:
        yield i
        yield i ** 2
        i += 1
If we call the __next__ function repeatedly, the yielded values will be 0 0 1 1 2 4 3 9 ....
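We can verify this by driving the Generator manually with the built-in next function, which calls __next__ under the
hood:

>>> gen = numbers_and_squares()
>>> [next(gen) for _ in range(8)]
[0, 0, 1, 1, 2, 4, 3, 9]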
The close function
As we already know, Generators can represent infinite sequences, which means they will never stop returning values.
What if we want them to stop? Maybe we want to prevent an infinite loop, or maybe we want to define a “big enough”
value. We can use the close function for that. Let's put a stop to our infinite natural_numbers Generator.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.close()
    print(number)
On the first iteration the for loop will print 0; on the second iteration it will close the Generator and print 1; on
the next call the closed Generator will raise a StopIteration exception that is caught by the for, ending its loop.
The throw function
As we saw, a closed Generator raises a StopIteration exception if called again. What if you don't want this
exception? Maybe you want a ValueError, or an EOFError. You can make a Generator raise a custom exception using the
throw function.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.throw(EOFError)
    print(number)
Again, on the first iteration the for loop will print 0. On the second iteration, throw injects an EOFError into the
Generator at the yield where it is paused; since the Generator does not handle it, the exception propagates right
back out of the throw call (so this time the print never runs). Since we did not wrap the for loop in a try/except
block, this exception is not caught, and it breaks our execution:
Traceback (most recent call last):
File "<input>", line 10, in <module>
File "<input>", line 4, in natural_numbers
EOFError
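It's worth knowing that a Generator can also catch the exception injected by throw. A minimal sketch with a
hypothetical polite_numbers variant that wraps its yield in a try/except and ends gracefully instead:

def polite_numbers():
    i = 0
    while True:
        try:
            yield i
        except EOFError:
            return  # handle the injected exception and finish the Generator
        i += 1

numbers = polite_numbers()
print(next(numbers))  # 0
try:
    numbers.throw(EOFError)
except StopIteration:
    print("generator finished cleanly")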
The send function
Last, but not least, I have to tell you a secret I've been hiding until now: yield is not a statement. It is an
expression, which means its value can be assigned to a variable. You may think that this value is the same value the
yield returns to its caller, but it is actually the opposite: the caller can give the Generator a value, and this
value becomes the result of the yield expression. This is possible by using the send function. Let's suppose we want
our natural_numbers Generator to stop generating if the caller gives it a number bigger than 10. The code would then
be:
def natural_numbers():
    i = 0
    while True:
        number = yield i
        if type(number) == int and number > 10:
            break
        i += 1
Now let's test it:
>>> a = natural_numbers()
>>> next(a)
0
>>> next(a)
1
>>> a.send(5)
2
>>> next(a)
3
>>> a.send(100)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration
If you have a keen eye, you noticed that when calling the send function, the Generator yields another value. This
means that when you call send, a __next__ is called, right? Wrong. Again, the opposite is happening: when you call
__next__, it actually behaves like a send(None). This is why we are testing if number is an int: when we call
__next__, a None is assigned to number. Also, notice that when we send a number bigger than 10, the execution breaks
right away; we don't need to call __next__ for that to happen. Keep that in mind when sending numbers to Generators.
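This caller-to-Generator channel is what enables coroutine-style Generators. As a final sketch (running_average is a
hypothetical example of mine, not part of the standard library), here is a Generator that yields the average of all
values sent to it so far; note the initial next call needed to advance it to its first yield:

def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the Generator up to its first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0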
Final thoughts
Generators are a broad topic, and I have only partially covered it here. If you want to know more, I recommend you
read the Python Wiki page on Generators, and search for more information on Google. There are plenty of good
resources out there.
I was in college when I first met Python, and most of my code up to that point had been written in C. When making the
transition, I was convinced that this:

for (int i = 0; i < 10; i++)

translated to Python would be this:

for i in range(0, 10):
While that is not wrong, because the final behavior of the code is the same, what is going on under the surface is very
different. While C stores the i
variable in memory, incrementing and testing its value at every iteration, Python
instantiates a new Iterator.
An Iterator in Python is an instance of a class that implements the methods __iter__ and __next__. The __iter__
method is responsible for returning an Iterator object, which usually is the same instance that holds the method.
This means that, generally speaking, most __iter__ implementations will just return self, but more complex
implementations may have additional logic.
The __next__ method is where all the magic happens. When this method is called, it is expected to return the value
that will be used in the iteration, or raise a StopIteration exception if all the values have already been used. To
do that, the __next__ method must contain the logic that would go in the C for.
Let's explain this better with an example, creating our own implementation of range(x, y).
class OurRange:
    def __init__(self, lower_boundary, upper_boundary):
        self.i = lower_boundary
        self.limit = upper_boundary

    def __iter__(self):
        return self

    def __next__(self):
        if self.i == self.limit:
            raise StopIteration
        value = self.i
        self.i += 1
        return value
In this implementation, we store the boundary values as attributes, so they persist through the method calls. The
__next__ method first checks if the upper_boundary value has been reached, raising the StopIteration exception if it
has. If not, the current value is saved in a local variable, self.i is incremented by 1, and the original value is
returned. Let's see what happens when we use OurRange in a for loop.
>>> for i in OurRange(1, 10):
...     print(i, end=' ')
...
1 2 3 4 5 6 7 8 9
As we can see, it behaves exactly like range(1, 10).
Since OurRange is an object, we can also assign an instance to a variable and call __next__ manually. Let's try this.
>>> our_range = OurRange(0, 3)
>>> our_range.__next__()
0
>>> our_range.__next__()
1
>>> our_range.__next__()
2
>>> our_range.__next__()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "<input>", line 11, in __next__
StopIteration
Now let's try the same with range(0, 3). It must be noted that range by itself is not an Iterator, but its Iterator
can be obtained by calling the __iter__ method.
>>> range = range(0, 3)
>>> range = range.__iter__()
>>> range.__next__()
0
>>> range.__next__()
1
>>> range.__next__()
2
>>> range.__next__()
Traceback (most recent call last):
File "<input>", line 1, in <module>
StopIteration
As we can see, both implementations behave the same way: while the Iterator has not reached the limit, the values are
returned; when the limit is reached, a StopIteration exception is raised, and our call breaks. It is also important
to notice that when we are using a for loop, this exception is caught by the for itself, without us realizing it ever
happened.
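To make that explicit, here is a rough sketch of what the for loop does for us behind the scenes, written as a while
loop around our OurRange:

iterator = OurRange(1, 10).__iter__()  # the for loop calls __iter__ first
while True:
    try:
        i = iterator.__next__()  # one __next__ call per iteration
    except StopIteration:
        break  # the for loop swallows this exception and simply stops
    print(i, end=' ')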
Real World Example
So, where do Iterators apply in real world code?
An example would be to simplify the way you get values from sequential API calls. You can encapsulate the calling
logic in the __next__ method, raising StopIteration when the API has no more results (here, when it returns a 404).
Let's see an example.
import requests

class JsonPlaceholderCaller:
    def __init__(self):
        self.post = 1

    def __iter__(self):
        return self

    def __next__(self):
        resp = requests.get(f'https://jsonplaceholder.typicode.com/posts/{self.post}')
        if resp.status_code == 404:
            raise StopIteration
        self.post += 1
        return resp

for response in JsonPlaceholderCaller():
    print(response)
At every iteration, this for loop will call the JSON Placeholder API to get a new post, until a new post is not
found. Of course, in real-world code this class and the code that calls it would be separated into different layers.
I hope you enjoyed learning more about Python Iterators. My next post is about Python
Generators, which are also cool.