Python Generators
I talked about Python Iterators in my last post, and how they work within loop structures. Now let's talk about Python Generators. The first thing worth mentioning is that a Generator is actually a special Iterator that automatically implements the __iter__ and __next__ methods, so you don't have to. Second, you don't have to declare a Generator class (though you can still do it if you want to). A function that contains yield is called a Generator function, and calling it will automatically instantiate a new Generator and return it to the caller.
yield works similarly to the return statement, but with one key difference: return sends a value back to the caller and ends the function call, removing it from the call stack; yield returns a value to the caller, transfers execution control to the caller, and saves the state of the function call so it can be restored later.
Why would you want to restore the context of a function call after returning a value? Because you may need to return more values. Let's explain this better with the same example we used in the last post: building a custom range.
def our_range(lower_boundary, upper_boundary):
    i = lower_boundary
    while i < upper_boundary:
        yield i
        i += 1
Now let's try that in a for loop:
>>> for number in our_range(1, 10):
...     print(number, end=' ')
...
1 2 3 4 5 6 7 8 9
You are now probably wondering how this works. When our_range is called, a Generator object is returned to be used by the for loop. As we learned in the last post, at every iteration of the for loop the __next__ function is called. When this happens, the body of our_range starts executing: i is set to the value of lower_boundary, the while condition is evaluated, and we execute yield i. At this point, 1 is yielded (returned) to the for loop and assigned to number, which is then printed. On the next for iteration, __next__ is called again, and instead of executing the function from the start, the execution resumes from the line after the yield; in this case, i += 1, which assigns the value 2 to i. The while condition is evaluated again, and yield i is executed once more, this time yielding the value 2 to the for loop, where it is assigned to number and printed. This goes on until the condition in the while loop evaluates to False. When this happens, the function ends its execution without yielding a value, meaning the Generator is exhausted. When an exhausted Generator has its __next__ function called, it raises a StopIteration exception, which, in our case, is caught by the for loop and causes it to end, finishing our execution.
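The same mechanics can be observed without a for loop by calling next() by hand (which calls __next__ for us); a smaller range keeps the session short:

```python
def our_range(lower_boundary, upper_boundary):
    i = lower_boundary
    while i < upper_boundary:
        yield i
        i += 1

gen = our_range(1, 3)
print(next(gen))  # 1: runs the body up to the first yield
print(next(gen))  # 2: resumes at i += 1, loops, and yields again
try:
    next(gen)  # the while condition is now False: the Generator is exhausted
except StopIteration:
    print('exhausted')
```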
In a simple Iterator, when the __next__ function ends its execution, all its context is lost, so any values that we can't afford to lose must be kept as attributes of the Iterator object. That is what I did with the i variable in the Iterators post. When using a Generator, we can keep these values in local variables inside the generator function, since the context is not lost between calls.
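For comparison, here is a sketch of a class-based Iterator equivalent of our_range (the last post's class isn't reproduced here, so details may differ): the loop state has to survive between calls as an attribute instead of a local variable.

```python
class OurRange:
    """Class-based Iterator version of our_range (a sketch):
    the counter lives in self.i instead of a local variable."""

    def __init__(self, lower_boundary, upper_boundary):
        self.i = lower_boundary
        self.upper_boundary = upper_boundary

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.upper_boundary:
            raise StopIteration
        value = self.i
        self.i += 1
        return value

print(list(OurRange(1, 4)))  # [1, 2, 3]
```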
The Real Generator Deal
You may be thinking “nice, but what real benefit comes with this 'saving the context' approach?” Well, the real deal is: Iterators, generally speaking, must have all their data assigned to a variable (and thus stored in memory) in order to iterate through it; Generators, because they save their context at every iteration, can generate each value on demand, so they don't need to have all the values available in memory at once.
Again, let's explain this better with an example: we will make a script that reads values from a txt file, performs an operation on each of them (calling the function perform_operation), and saves the results in a csv file.
def read_txt(filename):
    file = open(filename)
    return file.read().split('\n')

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Pay attention to the way we are reading the data. The read function will bring the whole file into memory so it can be split into lines and returned. This means this script will require at least as much memory as the size of the file. If you are processing a big file in a restricted memory environment (for example, a container), your script may fail simply because it ran out of memory (a MemoryError exception). Now let's fix that with a Generator.
def read_txt(filename):
    for row in open(filename):
        yield row

csv_file = open('filename.csv', 'w')
for line in read_txt('file.txt'):
    print(perform_operation(line), file=csv_file)
Now we are reading and yielding one row at a time, which means that instead of requiring as much memory as the whole file, we only require as much memory as a single line.
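Here is a self-contained version of the Generator script that you can actually run: the input file is created on the fly, and perform_operation is a hypothetical stand-in (the original post never defines it).

```python
import os
import tempfile

def perform_operation(line):
    # Hypothetical stand-in for the post's perform_operation
    return line.strip().upper()

def read_txt(filename):
    # Yield one row at a time instead of loading the whole file
    for row in open(filename):
        yield row

# Build a small sample file so the example is self-contained
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('alpha\nbeta\ngamma\n')

results = [perform_operation(line) for line in read_txt(path)]
print(results)  # ['ALPHA', 'BETA', 'GAMMA']
os.remove(path)
```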
Generator Comprehension
This advantage becomes even more clear when we use generator comprehension, which works the same way as list comprehension. As an example, let's iterate through a million numbers. Using a list, this would be:
numbers = [i for i in range(1_000_000)]
for number in numbers:
    print(number)
numbers is a list containing a million elements. Let's check its size:
>>> import sys
>>> sys.getsizeof(numbers)
8448728
If, instead of a list, we used a Generator, the code would be:
numbers = (i for i in range(1_000_000))
for number in numbers:
    print(number)
Let's now check the size of the numbers Generator:
>>> sys.getsizeof(numbers)
112
Both snippets behave the same, but the one using the Generator requires much less memory.
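This is why generator comprehensions shine when fed straight into a function that consumes an iterable: the values are produced one at a time, so the million numbers never exist in memory together.

```python
# sum() pulls one value at a time from the generator comprehension,
# so the full sequence is never materialized in memory
total = sum(i for i in range(1_000_000))
print(total)  # 499999500000
```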
To the infinity and beyond
Because of the property of only storing the current element of a sequence, Generators are useful for representing infinite sequences. If your teacher or boss asks you to build a sequence with all the natural numbers, you can either say that it is impossible, or you can give them this Generator:
def natural_numbers():
    i = 0  # My natural numbers start with 0, yours can start with 1 if you want to (:
    while True:
        yield i
        i += 1
This Generator will never stop giving numbers, so it is a viable way of representing a sequence that never ends.
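Of course, looping over it directly would never terminate, so the consumer has to decide when to stop. One standard way is itertools.islice, which takes only a fixed number of values from any iterable:

```python
from itertools import islice

def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

# islice stops after five values, so iterating the infinite Generator is safe
first_five = list(islice(natural_numbers(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```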
A few more Generator tricks
Generators are versatile, and here are some more things you can do with them.
Multiple yields
Unlike the return
statement, yield
can be used multiple times in the same function. Let's suppose you want a
generator that returns a number, returns the square of this number and then increments the number.
This can be done by the following code:
def numbers_and_squares():
    i = 0
    while True:
        yield i
        yield i ** 2
        i += 1
If we call the __next__ function repeatedly, the yielded values will be 0 0 1 1 2 4 3 9 ....
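We can verify that sequence by pulling a few values with next():

```python
def numbers_and_squares():
    i = 0
    while True:
        yield i
        yield i ** 2  # execution resumes here on every other __next__ call
        i += 1

gen = numbers_and_squares()
values = [next(gen) for _ in range(8)]
print(values)  # [0, 0, 1, 1, 2, 4, 3, 9]
```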
The close function
As we already know, Generators can represent infinite sequences, which means they will never stop yielding values. What if we want them to stop? Maybe we want to prevent an infinite loop, or maybe we want to define a “big enough” value. We can use the close function for that. Let's put a stop to our natural_numbers infinite Generator.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.close()
    print(number)
On the first iteration the for loop will print 0; on the second iteration it will close the Generator and print 1; on the next call the Generator will raise a StopIteration exception that will be caught by the for loop, ending it.
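The same behavior can be checked by collecting the values the loop sees instead of printing them:

```python
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
seen = []
for number in numbers:
    if number >= 1:
        numbers.close()  # the next __next__ call will raise StopIteration
    seen.append(number)
print(seen)  # [0, 1]
```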
The throw function
As we saw, a closed Generator will raise a StopIteration exception if called. What if you don't want this exception? Maybe you want a ValueError, or an EOFError. You can raise a custom exception inside a Generator using the throw function.
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
for number in numbers:
    if number >= 1:
        numbers.throw(EOFError)
    print(number)
Again, on the first iteration the for loop will print 0. On the second iteration it will call throw, which raises the EOFError inside the Generator, right at the yield where it is paused; the number 1 is never printed. Since the Generator does not handle the exception, and we did not wrap the for loop in a try/except block, the EOFError will not be caught and will break our execution:
Traceback (most recent call last):
File "<input>", line 10, in <module>
File "<input>", line 4, in natural_numbers
EOFError
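Wrapping the throw call in a try/except shows that the exception propagates to the caller immediately, because the Generator does nothing to catch it:

```python
def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

numbers = natural_numbers()
next(numbers)  # advance the Generator to its first yield
propagated = False
try:
    numbers.throw(EOFError)  # raised inside the Generator, at the paused yield
except EOFError:
    propagated = True  # the Generator did not handle it, so it reaches us
print(propagated)  # True
```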
The send function
Last, but not least, I have to tell you a secret I've been hiding until now: yield is not a statement. It is an expression, which means its result can be assigned to a variable. You may think that this result is the same value the yield returns to its caller, but it is actually the opposite: the caller can give the Generator a value, and this value will be the result of the yield expression. This is possible by using the send function. Let's suppose we want our natural_numbers Generator to stop generating if the caller gives it a number bigger than 10. The code would then be:
def natural_numbers():
    i = 0
    while True:
        number = yield i
        if type(number) == int and number > 10:
            break
        i += 1
Now let's test it:
>>> a = natural_numbers()
>>> next(a)
0
>>> next(a)
1
>>> a.send(5)
3
>>> next(a)
4
>>> a.send(100)
Traceback (most recent call last):
File "<input>", line 1, in <module>
StopIteration
If you have a keen eye, you noticed that when calling the send function, the Generator yields another value. This means that when you call send, a __next__ is called, right? Wrong. Again, the opposite is happening: calling __next__ is equivalent to calling send(None). This is why we are testing whether number is an int: when we call __next__, None will be assigned to number. Also, notice that when we send a number bigger than 10, the execution breaks right away; we don't need to call __next__ for that to happen. Keep that in mind when sending numbers to Generators.
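send opens the door to Generators that behave like small state machines. As a sketch (not from the original post), here is a running-total accumulator; note the initial next() call, which is needed to advance the Generator to its first yield before anything can be sent:

```python
def running_total():
    total = 0
    while True:
        value = yield total  # result of the caller's send() lands in value
        if value is not None:  # plain next() sends None; ignore it
            total += value

acc = running_total()
next(acc)            # prime the Generator: run it to the first yield
print(acc.send(5))   # 5
print(acc.send(10))  # 15
```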
Final thoughts
Generators are a broad topic, and I have only covered part of it. If you want to know more, I recommend reading the Python Wiki page on Generators and searching for more information online. There are plenty of good resources out there.