During the Python Brasil 2023 conference I presented a small talk about manipulating PDFs using the library pypdf, so
I thought it would be a great idea to post the content here.
According to the documentation,
pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the
pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text
and metadata from PDFs as well.
Merging PDFs
The first thing I used pypdf for was to merge two PDFs. I needed to print both of them, but didn't want to take two
files to the local lan house (a Brazilian internet café; I don't have a printer at home anymore). The code is very
simple, and we're using the PdfWriter class to write the merged PDF.
from pypdf import PdfWriter

merger = PdfWriter()
for pdf in ["pdf1.pdf", "pdf2.pdf"]:
    merger.append(pdf)

merger.write("print.pdf")
merger.close()
This code simply merges the two files into one. It's interesting to notice that we don't need to open the files
ourselves to merge them. But what if you don't want to merge all the pages? Maybe you need to select which pages get
merged. This can be done with the following code:
from pypdf import PdfWriter
merger = PdfWriter()
pdf1 = open("pdf1.pdf", "rb")
pdf2 = open("pdf2.pdf", "rb")
merger.append(fileobj=pdf1, pages=(0, 1))
merger.append(fileobj=pdf2)
merger.write("print.pdf")
merger.close()
Notice that now we are opening the files with the built-in open function, and when we pass them to merger.append we
can specify a tuple containing the range of pages to be merged. The tuple works like a (start, stop) pair for range,
so pages=(0, 1) selects only the first page of pdf1.
Instead of opening the files this way, we can use the PdfReader class from pypdf, which gives us many more
possibilities to play with our PDFs. Let's start by doing the same thing we did before, now using PdfReader:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0])
writer.write("print.pdf")
writer.close()
Now the reader variables hold PdfReader objects, which expose several properties of the opened PDFs. One of these
properties is the pages attribute, a list containing all the pages of the PDF. Thus, instead of using the append
method, we use the add_page method, passing it an element (a page) of the pages list.
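Because pages is an ordinary list of page objects, we can pick arbitrary pages from it. As a minimal sketch (input.pdf
here is a hypothetical multi-page file), copying every other page into a new PDF:

from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")  # hypothetical multi-page file
writer = PdfWriter()

# Take every other page by indexing into the pages list
for index in range(0, len(reader.pages), 2):
    writer.add_page(reader.pages[index])

writer.write("every_other_page.pdf")
writer.close()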
Rotating pages
Now let's explore a bit more of what the PdfReader class can do. We can start by applying a transformation that
rotates the pdf2 page we are merging:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
writer.add_page(pdf1_reader.pages[0])
writer.add_page(pdf2_reader.pages[0].rotate(180))
writer.write("print.pdf")
writer.close()
The rotate(180) transformation will rotate the first page of pdf2 by 180˚, leaving it upside down.
Merging pages
Another transformation would be to merge the two pages into a single one:
from pypdf import PdfWriter, PdfReader
pdf1_reader = PdfReader("pdf1.pdf")
pdf2_reader = PdfReader("pdf2.pdf")
writer = PdfWriter()
pdf1_page = pdf1_reader.pages[0]
pdf2_page = pdf2_reader.pages[0]
pdf1_page.merge_page(pdf2_page)
writer.add_page(pdf1_page)
writer.write("print.pdf")
writer.close()
This code will overlap the two pages. Observe that merge_page is a method of the page object; it is not called on the
writer object. The page passed as an argument (pdf2_page) is stamped on top of the page whose merge_page method is
called (pdf1_page), which means any element of pdf1_page becomes invisible if an element of pdf2_page happens to
cover it when the overlap occurs. Be careful when using this transformation.
Many other transformations are available; we have explored just a handful of them. Take a look at the documentation
for more examples of pypdf transformations.
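As one more taste of what's there, here is a minimal sketch using pypdf's Transformation class (the 0.5 factor is an
arbitrary choice of mine) to shrink a page's content to half its size before writing it:

from pypdf import PdfWriter, PdfReader, Transformation

reader = PdfReader("pdf1.pdf")
writer = PdfWriter()

page = reader.pages[0]
# Scale the page content down to 50% in both dimensions
page.add_transformation(Transformation().scale(sx=0.5, sy=0.5))
writer.add_page(page)

writer.write("scaled.pdf")
writer.close()

Note that this scales the drawn content; the page box itself keeps its original dimensions.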
Password protection
Another very handy pypdf feature is protecting PDFs with a password. For that, we call the merger.encrypt() method,
giving it a password and an encryption algorithm:
from pypdf import PdfWriter
merger = PdfWriter()
merger.append("pdf1.pdf")
merger.encrypt("pythonbrasil2023", algorithm="AES-256-R5")
merger.write("print.pdf")
merger.close()
Now the PDF will be encrypted with the password pythonbrasil2023. It's important to notice that you can choose
between several algorithms, some more secure than others, and some that require additional packages to work.
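To read the protected file back, PdfReader can decrypt it given the same password (this is one of the cases where an
extra package, typically cryptography, is needed for the AES algorithms). A minimal sketch:

from pypdf import PdfReader

reader = PdfReader("print.pdf")
if reader.is_encrypted:
    reader.decrypt("pythonbrasil2023")  # same password used when encrypting

print(len(reader.pages))  # the pages are accessible after decryption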
Text extraction
pypdf also handles some basic text extraction from PDFs. It's worth noting that it does not read text from images
(you will need an OCR tool for that), and that it does not extract text from scrambled fonts. That said, we can
extract the text of a PDF and manipulate it as a string in our program. Check this code out:
from pypdf import PdfReader
reader = PdfReader("text.pdf")
page = reader.pages[0]
text = page.extract_text()
print("text:", text)
If everything went well, we should see the text of the page printed. Because it is in the text variable, we can
perform any string operation on it. When I presented this code, someone in the audience pointed out an interesting
use case for it: reading the pages of a book and feeding the strings to a text-to-speech engine, in order to produce
an audiobook.
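As a minimal follow-up sketch, we can join the text of every page and run an ordinary string operation on it, such as
a crude word count:

from pypdf import PdfReader

reader = PdfReader("text.pdf")

# Extract the text of every page and join everything into one string
full_text = "\n".join(page.extract_text() for page in reader.pages)

# From here, any string operation works, e.g. a crude word count
print(len(full_text.split()))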
Those are only some of the features offered by the pypdf library. I, again, encourage you to check the
documentation to see more of its capabilities.
I talked about Python Iterators in my last post, and how they work within loop structures. Now let's talk about Python
Generators. The first thing worth mentioning is that a Generator is actually a special Iterator that automatically
implements the __iter__ and __next__ methods, so you don't have to. Second, you don't have to declare a Generator
class (though you can still do it if you want to). A function that contains yield is called a Generator function, and
calling it will automatically instantiate a new Generator and return it to the caller.
yield works similarly to the return statement, but with one key difference: return hands a value back to the caller
and ends the function call, removing it from the memory stack; yield returns a value to the caller, transfers
execution control to the caller, and saves the state of the function call so it can be restored later.
Why would you want to restore the context of a function call after returning a value? Because you may need to return
more values. Let's explain this better with the same example we used in the last post: building a custom range.
def our_range(lower_boundary, upper_boundary):
    i = lower_boundary
    while i < upper_boundary:
        yield i
        i += 1
Now let's try that in a for loop:
>>> for number in our_range(1, 10):
...     print(number, end=' ')
...
1 2 3 4 5 6 7 8 9
You are now probably wondering how this works. When our_range is called, a Generator object is returned to be used by
the for loop. As we learned in the last post, at every iteration of the for loop the __next__ function is called.
When this happens, the body of our_range is executed. On the first call, i is set equal to lower_boundary, the while
condition is evaluated, and we execute yield i. At this point, 1 is yielded (returned) to the for loop and assigned
to number, which is then printed. On the next for iteration, __next__ is called again, and instead of executing the
function from the start, execution resumes on the line after the yield; in this case, i += 1, which assigns the value
2 to i. The while condition is evaluated again, and yield i is executed once more, this time yielding the value 2 to
the for loop, where it is assigned to number and printed. This goes on until the condition in the while loop
evaluates to False. When this happens, the function ends its execution without yielding a value, meaning the
Generator is exhausted. When an exhausted Generator has its __next__ function called, it raises a StopIteration
exception, which, in our case, is caught by the for and causes the loop to end, finishing our execution.
In a simple Iterator, when the __next__ function ends its execution, all its context is lost, so any values that we
can't afford to lose must be kept as attributes of the Iterator object. That is what I did with the i variable in the
Iterators post. When using a Generator, we can keep these values in variables inside the generator function, since
the context is not lost between calls.
The Real Generator Deal
You may be thinking “nice, but what real benefit comes with this 'saving the context' approach?” Well, the real deal
is: Iterators, generally speaking, must have all their data assigned to a variable (and thus stored in memory) in
order to iterate through it; Generators, because they save their context at every iteration, can generate the value
for each iteration on demand, and therefore don't need to have all the values available in memory.
Again, let's explain this better with an example: we will write a script that reads values from a txt file, performs
an operation (calling the function perform_operation) on each of them, and saves the results to a csv file.
def read_txt(filename):
    file = open(filename)
    return file.read().split('\n')

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Pay attention to the way we are reading the data. The read function brings the whole file into memory so that it can
be split into a list of lines. This means the script requires at least as much memory as the size of the file. If you
are processing a big file in a memory-restricted environment (for example, a container), your script may fail simply
because it ran out of memory (a MemoryError exception). Now let's fix that with a Generator.
def read_txt(filename):
    for row in open(filename):
        yield row

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Now we are reading and yielding one row at a time, which means that instead of requiring memory the size of the whole
file, we only require memory the size of a single line.
Generator Comprehension
This advantage becomes even clearer when we use generator comprehensions (also known as generator expressions), which
work the same way as list comprehensions. As an example, let's iterate through a million numbers. Using a list, this
would be:
numbers = [i for i in range(1_000_000)]
for number in numbers:
    print(number)
numbers is a list containing a million elements. Let's check its size:
>>> import sys
>>> sys.getsizeof(numbers)
8448728
If, instead of a list, we used a Generator, the code would be:
numbers = (i for i in range(1_000_000))
for number in numbers:
    print(number)
Let's now check the size of the numbers Generator:
>>> sys.getsizeof(numbers)
112
Both codes will behave the same, but the code using the Generator requires much less memory.
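As a usage note, a generator expression can also be passed straight to any function that consumes an iterable, such
as sum, with the same memory savings:

>>> sum(i for i in range(1_000_000))
499999500000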
To infinity and beyond
Because of the property of only storing the current element of a sequence, Generators are useful for representing
infinite sequences. If your teacher or boss asks you to build a sequence with all the natural numbers, you can either
say that it is impossible, or you can give them this Generator:
def natural_numbers():
    i = 0  # My natural numbers start with 0, yours can start with 1 if you want to (:
    while True:
        yield i
        i += 1
This Generator will never stop giving numbers, so it is a viable way of representing a sequence that never ends.
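Just be careful to never consume it with a bare for loop and no exit condition. One safe way to take only part of it
is itertools.islice, which lazily slices any iterator:

from itertools import islice

# Consume only the first five values of the infinite Generator
print(list(islice(natural_numbers(), 5)))  # [0, 1, 2, 3, 4]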
A few more Generator tricks
Generators are versatile, and here are some more things you can do with them.
Multiple yields
Unlike the return statement, yield can be used multiple times in the same function. Let's suppose you want a
Generator that yields a number, then yields the square of that number, and only then increments the number. This can
be done with the following code:
def numbers_and_squares():
    i = 0
    while True:
        yield i
        yield i ** 2
        i += 1
If we call the __next__ function repeatedly, the yielded values will be 0 0 1 1 2 4 3 9 ....
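We can verify this by driving the Generator manually with the built-in next function, which calls __next__ under the
hood:

>>> gen = numbers_and_squares()
>>> [next(gen) for _ in range(8)]
[0, 0, 1, 1, 2, 4, 3, 9]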
The close function
As we already know, Generators can represent infinite sequences, which means they will never stop returning values.
What if we want them to stop? Maybe we want to prevent an infinite loop, or maybe we want to define a “big enough”
value. We can use the close function for that. Let's put a stop to our infinite natural_numbers Generator.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.close()
    print(number)
On the first iteration the for loop will print 0; on the second iteration it will close the Generator and print 1; on
the next call the closed Generator will raise a StopIteration exception that is caught by the for, ending its loop.
The throw function
As we saw, a closed Generator raises a StopIteration exception if called again. What if you don't want this
exception? Maybe you want a ValueError, or an EOFError. You can make a Generator raise a custom exception using the
throw function.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.throw(EOFError)
    print(number)
Again, on the first iteration the for loop will print 0. On the second iteration, throw injects an EOFError into the
Generator at the yield where it is paused; since the Generator does not handle it, the exception propagates right
back out of the throw call (so this time the print never runs). Since we did not wrap the for loop in a try/except
block, this exception is not caught, and it breaks our execution:
Traceback (most recent call last):
File "<input>", line 10, in <module>
File "<input>", line 4, in natural_numbers
EOFError
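It's worth knowing that a Generator can also catch the exception injected by throw. A minimal sketch with a
hypothetical polite_numbers variant that wraps its yield in a try/except and ends gracefully instead:

def polite_numbers():
    i = 0
    while True:
        try:
            yield i
        except EOFError:
            return  # handle the injected exception and finish the Generator
        i += 1

numbers = polite_numbers()
print(next(numbers))  # 0
try:
    numbers.throw(EOFError)
except StopIteration:
    print("generator finished cleanly")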
The send function
Last, but not least, I have to tell you a secret I've been hiding until now: yield is not a statement. It is an
expression, which means its value can be assigned to a variable. You may think that this value is the same value the
yield returns to its caller, but it is actually the opposite: the caller can give the Generator a value, and this
value becomes the result of the yield expression. This is possible by using the send function. Let's suppose we want
our natural_numbers Generator to stop generating if the caller gives it a number bigger than 10. The code would then
be:
def natural_numbers():
    i = 0
    while True:
        number = yield i
        if type(number) == int and number > 10:
            break
        i += 1
Now let's test it:
>>> a = natural_numbers()
>>> next(a)
0
>>> next(a)
1
>>> a.send(5)
2
>>> next(a)
3
>>> a.send(100)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration
If you have a keen eye, you noticed that when calling the send function, the Generator yields another value. This
means that when you call send, a __next__ is called, right? Wrong. Again, the opposite is happening: when you call
__next__, it actually behaves like a send(None). This is why we are testing if number is an int: when we call
__next__, a None is assigned to number. Also, notice that when we send a number bigger than 10, the execution breaks
right away; we don't need to call __next__ for that to happen. Keep that in mind when sending numbers to Generators.
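This caller-to-Generator channel is what enables coroutine-style Generators. As a final sketch (running_average is a
hypothetical example of mine, not part of the standard library), here is a Generator that yields the average of all
values sent to it so far; note the initial next call needed to advance it to its first yield:

def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the Generator up to its first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0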
Final thoughts
Generators are a broad topic, and I have only partially covered it here. If you want to know more, I recommend you
read the Python Wiki page on Generators, and search for more information on Google. There are plenty of good
resources out there.
I was in college when I first met Python, and most of my code up to that point had been written in C. When making the
transition, I was convinced that this:

for (int i = 0; i < 10; i++)

translated to Python would be this:

for i in range(0, 10):
While that is not wrong, because the final behavior of the code is the same, what is going on under the surface is very
different. While C stores the i
variable in memory, incrementing and testing its value at every iteration, Python
instantiates a new Iterator.
An Iterator in Python is an instance of a class that implements the methods __iter__ and __next__. The __iter__
method is responsible for returning an Iterator object, which usually is the same instance that holds the method.
This means that, generally speaking, most __iter__ implementations will just return self, but more complex
implementations may have additional logic.
The __next__ method is where all the magic happens. When this method is called, it is expected to return the value
that will be used in the iteration, or raise a StopIteration exception if all the values have already been used. To
do that, the __next__ method must contain the logic that would go in the C for.
Let's explain this better with an example, creating our own implementation of range(x, y).
class OurRange:
    def __init__(self, lower_boundary, upper_boundary):
        self.i = lower_boundary
        self.limit = upper_boundary

    def __iter__(self):
        return self

    def __next__(self):
        if self.i == self.limit:
            raise StopIteration
        value = self.i
        self.i += 1
        return value
In this implementation, we store the boundary values as attributes, so they persist through the method calls. The
__next__ method first checks if the upper_boundary value has been reached, raising the StopIteration exception if it
has. If not, the current value is saved in a local variable, self.i is incremented by 1, and the original value is
returned. Let's see what happens when we use OurRange in a for loop.
>>> for i in OurRange(1, 10):
...     print(i, end=' ')
...
1 2 3 4 5 6 7 8 9
As we can see, it behaves exactly like range(1, 10).
Since OurRange is an object, we can also assign an instance to a variable and call __next__ manually. Let's try this.
>>> our_range = OurRange(0, 3)
>>> our_range.__next__()
0
>>> our_range.__next__()
1
>>> our_range.__next__()
2
>>> our_range.__next__()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "<input>", line 11, in __next__
StopIteration
Now let's try the same with range(0, 3). It must be noted that range by itself is not an Iterator, but its Iterator
can be obtained by calling the __iter__ method.
>>> range = range(0, 3)
>>> range = range.__iter__()
>>> range.__next__()
0
>>> range.__next__()
1
>>> range.__next__()
2
>>> range.__next__()
Traceback (most recent call last):
File "<input>", line 1, in <module>
StopIteration
As we can see, both implementations behave the same way: while the Iterator has not reached the limit, the values are
returned; when the limit is reached, a StopIteration exception is raised, and our call breaks. It is also important
to notice that when we are using a for loop, this exception is caught by the for itself, without us realizing it ever
happened.
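To make that explicit, here is a rough sketch of what the for loop does for us behind the scenes, written as a while
loop around our OurRange:

iterator = OurRange(1, 10).__iter__()  # the for loop calls __iter__ first
while True:
    try:
        i = iterator.__next__()  # one __next__ call per iteration
    except StopIteration:
        break  # the for loop swallows this exception and simply stops
    print(i, end=' ')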
Real World Example
So, where do Iterators apply in real world code?
An example would be to simplify the way you get values from sequential API calls. You can encapsulate the calling
logic in the __next__ method, raising StopIteration when the API has no more results (here, when it returns a 404).
Let's see an example.
import requests

class JsonPlaceholderCaller:
    def __init__(self):
        self.post = 1

    def __iter__(self):
        return self

    def __next__(self):
        resp = requests.get(f'https://jsonplaceholder.typicode.com/posts/{self.post}')
        if resp.status_code == 404:
            raise StopIteration
        self.post += 1
        return resp

for response in JsonPlaceholderCaller():
    print(response)
At every iteration, this for loop will call the JSON Placeholder API to get a new post, until a new post is not
found. Of course, in real-world code this class and the code that calls it would be separated into different layers.
I hope you enjoyed learning more about Python Iterators. My next post is about Python
Generators, which are also cool.