Threads and Signals in Python

For the last few days I’ve been working on a web spider (also known as a web
crawler; see Wikipedia) in Python. This is something I’ve been thinking about
doing for a
while, simply because it always seemed like it would be a good, fun,
educational programming challenge. I’ve been motivated to actually go ahead
and write one just now mainly on account of my burgeoning interest in
natural language processing, web searching, the semantic web and the overlap
between these three. I have a few ideas for projects that I’d like to try
out some day and they all require having a local copy of a small subset of
the web to tinker with. A spider seems the natural way to achieve this.

I’m very happy with how the spider is progressing and I’ll write about it in
some detail closer to when I actually release it, which shouldn’t be far off.
(And no, I haven’t forgotten about feedformatter; there’s a new version of that
in the works too.) The point of this entry is to discuss the interplay of
threads and signals in Python, which is something I had to contend with today.

My spider is multi-threaded. The main thread creates an instance of a UrlQueue,
which is just a simple subclass of the standard library’s Queue class,
and then spawns a number of worker threads. Each worker pulls URLs off this
queue, downloads the pages at those URLs and parses the HTML looking for links,
placing any new URLs it finds onto the queue to be handled later by the same or
a different thread. The whole thing is run from the command line, so when the
user hits Ctrl+C I’d really like each thread to finish dealing with its current
URL and then stop, so that the whole crawl finishes within a few extra seconds.
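
The UrlQueue and worker classes themselves aren’t shown in this post, so what
follows is only a rough sketch of the general shape, written for Python 3 (where
the Queue module is spelled queue). FAKE_WEB, enqueue and worker are
hypothetical names of my own, and a dictionary stands in for the real
downloading and HTML parsing.

```python
import queue
import threading

# Hypothetical stand-in for the web: each URL maps to the links on its page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": [],
}

url_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()


def enqueue(url):
    """Queue a URL the first time it is seen; ignore repeats."""
    with seen_lock:
        if url in seen:
            return
        seen.add(url)
    url_queue.put(url)


def worker():
    while True:
        url = url_queue.get()
        if url is None:                     # sentinel: no more work
            url_queue.task_done()
            break
        for link in FAKE_WEB.get(url, []):  # stand-in for download + parse
            enqueue(link)
        url_queue.task_done()


enqueue("http://a.example/")
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

url_queue.join()              # returns once every queued URL has been handled
for _ in threads:
    url_queue.put(None)       # one sentinel per worker
for t in threads:
    t.join()
```

Links are queued before task_done is called for the page they came from, so the
queue’s count of unfinished work cannot reach zero while there are still pages
left to visit.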

Those readers with a bit of Unix background will know that what Ctrl+C actually
does is send a “signal” (specifically, SIGINT) to the process. You can read up
on signals at Wikipedia. Any modern Unix has a signal system call
which lets you register a “signal handler”, a function which is called upon
receipt of a signal. Python gives you access to this system call via the
signal module, so you can register a signal handler for SIGINT and
make Ctrl+C do whatever you like. The default SIGINT handler, by the way,
simply raises a KeyboardInterrupt exception, so if you don’t want to use signals
you can put your entire program in a try/except structure and get more or less
the same effect.
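
Registering a handler might look roughly like the sketch below. It assumes
Python 3.8 or later, because it uses signal.raise_signal as a stand-in for the
user hitting Ctrl+C; handle_sigint and received are hypothetical names.

```python
import signal
import time

received = []   # hypothetical log of the signal numbers delivered to us


def handle_sigint(signum, frame):
    # Called by Python, in the main thread, after SIGINT has arrived.
    received.append(signum)


# Register the handler. signal.signal() may only be called from the main thread.
signal.signal(signal.SIGINT, handle_sigint)

signal.raise_signal(signal.SIGINT)      # stand-in for Ctrl+C

# Give the handler a moment to run (it normally runs almost immediately).
deadline = time.monotonic() + 2
while not received and time.monotonic() < deadline:
    time.sleep(0.01)

# Put Python's default handler back, so that Ctrl+C raises
# KeyboardInterrupt again.
signal.signal(signal.SIGINT, signal.default_int_handler)
```

With the default handler in place, wrapping the program in
try/except KeyboardInterrupt is the alternative mentioned above.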

The first problem is that Python’s signal module documentation explicitly
states that when multiple threads are running, only the main thread (i.e. the
thread that was created when your process started) will receive signals. So
the signal handler will execute only in one thread and not in all of them. In
order to get all threads to stop in response to a signal, you need the main
thread’s signal handler to communicate the stop message to the other threads.
You can do this in plenty of different ways, perhaps the simplest being to
have the main thread flip the value of a boolean variable that all threads
hold a reference to. This is not a huge problem, and I’ve done things like
this before.
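
A minimal sketch of the shared-flag idea, in Python 3, is below. It uses a
threading.Event in place of a bare boolean (my substitution, not necessarily
what the spider does), and stop_requested, finished and worker are hypothetical
names. To keep the example self-contained the handler is called directly rather
than being triggered by a real signal.

```python
import threading
import time

stop_requested = threading.Event()   # plays the role of the shared boolean
finished = []                        # hypothetical record of clean shutdowns


def handle_sigint(signum, frame):
    # Runs in the main thread: just flip the flag and return.
    stop_requested.set()


def worker(name):
    while not stop_requested.is_set():
        time.sleep(0.01)             # stand-in for handling one URL
    finished.append(name)            # current item done, now stop


threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

# In a real program this would be registered with
# signal.signal(signal.SIGINT, handle_sigint); here it is called by hand.
handle_sigint(None, None)

for t in threads:
    t.join()
```

Each worker only checks the flag between items, which is what lets it finish
its current URL before stopping.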

To my surprise today, this approach just wasn’t working. I put a print
statement in my signal handler and discovered that even the main thread wasn’t
receiving the SIGINT signal, even though it was definitely supposed to.

This leads to the second problem involved in mixing threads and signals. When
you send a signal to a multi-threaded Python program, that signal is put into a
queue. The main thread processes signals from that queue and invokes the
relevant handlers, but – and here’s the catch – it doesn’t do this until it has
something else to do as well. That is, if your main thread fires off a group
of worker threads and then sits there doing nothing while they work, then as far
as Python’s thread scheduler is concerned there is no need to give that main
thread any CPU time while the worker threads are actually doing something, so
your SIGINTs – and, indeed, any other signals – just pile up in the queue and
are never handled. Note that “doing nothing” while the worker threads work
includes sitting in a blocked state after a call to the join method of
a worker thread.

This means that if you want your main thread to be able to catch a Ctrl+C and
shut down all the worker threads, you need to make sure your main thread is
doing something while the others work. This doesn’t have to be anything
useful, of course: you can just call sleep in a loop, waking up every
second or so. The code I am now using looks a bit like this, and seems to
work as intended:

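from time import sleep

# Start threads
threads = []
for i in range(num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()

# Wait for threads to finish, waking up once a second so that any
# pending signals get handled
# (isAlive() is also spelled is_alive() from Python 2.6 onwards)
while True:
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
        break
    else:
        # Some threads are still going
        sleep(1)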

With this code, if I hit Ctrl+C while the worker threads are working, the
SIGINT gets put in the main thread’s signal queue. After no more than one
second, the sleep call in the infinite loop returns and the main
thread has something to do (check if all the threads have stopped yet). It
thus gets a slice of CPU time from the thread scheduler and so gets a chance to
handle any signals which have built up. If you need a bit more responsiveness,
you can sleep for less than a second, but the less you sleep the more CPU time
your main thread will chew up evaluating the expression in the if statement.
While on that subject, the any in that if statement is a new built-in
function that appeared in Python 2.5. An equivalent statement that should work
in earlier versions is:


if True not in [thread.isAlive() for thread in threads]:
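
For anyone who wants to try the whole arrangement end to end, here is a small
self-contained sketch for Python 3 on a Unix-like system. It is not the
spider’s actual code: the workers just sleep, stop_requested, worker and polls
are hypothetical names, and a threading.Timer sends the process a SIGINT after
half a second to stand in for the user pressing Ctrl+C.

```python
import os
import signal
import threading
import time

stop_requested = threading.Event()


def handle_sigint(signum, frame):
    # Runs in the main thread and tells the workers to stop.
    stop_requested.set()


def worker():
    while not stop_requested.is_set():
        time.sleep(0.05)      # stand-in for crawling one URL


signal.signal(signal.SIGINT, handle_sigint)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

# Stand-in for the user: send ourselves SIGINT after half a second.
threading.Timer(0.5, os.kill, args=(os.getpid(), signal.SIGINT)).start()

# Main thread: wake up regularly so pending signals get handled.
polls = 0
while any(t.is_alive() for t in threads):
    polls += 1
    time.sleep(0.1)

# Restore Python's default Ctrl+C behaviour.
signal.signal(signal.SIGINT, signal.default_int_handler)
```

The polling interval of 0.1 seconds is arbitrary; it trades responsiveness
against how often the main thread wakes up, as discussed above.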

I hope that this is helpful to somebody at some stage. Also, to give credit where
credit is due, this email by James Henstridge on the PyGtk mailing list is where
I got the insight to realise how to fix my spider. Thanks, James! It’s probably
also worth noting that there is a recipe in the ASPN Python Cookbook by Allen
Downey which proposes a solution to this problem involving the fork system call:
the main thread calls fork to create a child process. The worker threads are
spawned in the child process, leaving just one thread in the parent process,
which can thus receive the signal. The parent process can catch SIGINT and then
kill its child to get the desired effect. I feel that this approach is a bit
uglier than sleeping in a loop, but it may have advantages that make it the
better choice under certain circumstances.
