<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Luke Maurits &#187; threading</title>
	<atom:link href="http://luke.maurits.id.au/blog/tag/threading/feed/" rel="self" type="application/rss+xml" />
	<link>http://luke.maurits.id.au</link>
	<description>Assorted geekery</description>
	<lastBuildDate>Sun, 06 Mar 2011 06:52:47 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Threads and Signals in Python</title>
		<link>http://luke.maurits.id.au/blog/2008/03/threads-and-signals-in-python/</link>
		<comments>http://luke.maurits.id.au/blog/2008/03/threads-and-signals-in-python/#comments</comments>
		<pubDate>Tue, 25 Mar 2008 14:58:00 +0000</pubDate>
		<dc:creator>Luke Maurits</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[signals]]></category>
		<category><![CDATA[threading]]></category>

		<guid isPermaLink="false">http://luke.maurits.id.au/blog/2008/03/threads-and-signals-in-python/</guid>
		<description><![CDATA[For the last few days I&#8217;ve been working on a web spider (also known as a web
   crawler &#8211; see Wikipedia), in
   Python.  This is something I&#8217;ve been thinking about doing for a
   while, simply because it always seemed like it would be a good, fun,
   [...]]]></description>
			<content:encoded><![CDATA[<p>For the last few days I&#8217;ve been working on a web spider (also known as a web<br />
   crawler &#8211; see <a href="http://en.wikipedia.org/wiki/Web_crawler">Wikipedia</a>), in<br />
   Python.  This is something I&#8217;ve been thinking about doing for a<br />
   while, simply because it always seemed like it would be a good, fun,<br />
   educational programming challenge.  I&#8217;ve been motivated to actually go ahead<br />
   and <i>write</i> one just now mainly on account of my burgeoning interest in<br />
   natural language processing, web searching, the <a href="http://en.wikipedia.org/wiki/Semantic_web">semantic web</a> and the overlap<br />
   between these three.  I have a few ideas for projects that I&#8217;d like to try<br />
   out some day and they all require having a local copy of a small subset of<br />
   the web to tinker with.  A spider seems the natural way to achieve this.
</p>
<p>I&#8217;m very happy with how the spider is progressing and I&#8217;ll write about it in<br />
   some detail closer to when I actually release it (which shouldn&#8217;t be far off.<br />
   And, no, I haven&#8217;t forgotten about <a href="/software/feedformatter/">feedformatter</a>, there&#8217;s a new version of that<br />
   in the works too).  The point of this entry is to discuss the interplay of<br />
   threads and signals in Python, which is something I had to contend with today.
</p>
<p>My spider is multi-threaded.  The main thread creates an instance of a <tt>UrlQueue</tt>,<br />
   which is just a simple subclass of the standard library&#8217;s <a href="http://docs.python.org/lib/QueueObjects.html"><tt>Queue</tt></a> object,<br />
   and then spawns a number of worker threads which pull URLs off of this queue,<br />
   download the sites at those URLs and then parse the HTML looking for links,<br />
   placing any new URLs found onto the queue to be handled later by the same or a<br />
   different thread.  The whole thing is run from the command line, so I&#8217;d really<br />
   like it if, when the user hits Ctrl+C, each thread could finish dealing<br />
   with its current URL and then stop, so that the whole crawl finishes within a<br />
   few extra seconds.
</p>
<p>Those readers with a bit of Unix background will know that what Ctrl+C actually<br />
   does is send a &quot;signal&quot; (specifically, SIGINT) to the process.  You can read up<br />
   on <a href="http://en.wikipedia.org/wiki/Signal_%28computing%29">signals</a> at Wikipedia.  Any modern Unix has a <tt>signal</tt> system call<br />
   which lets you register a &#8220;signal handler&#8221;, a function which is called upon<br />
   receipt of a signal.  Python gives you access to this system call via the<br />
   <a href="http://docs.python.org/lib/module-signal.html"><tt>signal</tt></a> module, so you can register a signal handler for SIGINT and<br />
   make Ctrl+C do whatever you like.  The default SIGINT handler, by the way,<br />
   simply raises a <tt>KeyboardInterrupt</tt> exception, so if you don&#8217;t want to use signals<br />
   you can put your entire program in a try/except structure and get more or less<br />
   the same effect.
</p>
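<p>As a minimal sketch of registering such a handler, assuming a Unix system and a current Python 3: the names <tt>received</tt> and <tt>handle_sigint</tt> are made up for illustration, and the <tt>os.kill</tt> call merely stands in for a user pressing Ctrl+C.</p>

```python
import os
import signal
import time

received = []  # hypothetical record of the signals the handler has seen

def handle_sigint(signum, frame):
    # Runs in the main thread in place of the default handler,
    # so no KeyboardInterrupt is raised.
    received.append(signum)

# signal.signal returns the previous handler so it can be restored later.
previous_handler = signal.signal(signal.SIGINT, handle_sigint)

# Simulate Ctrl+C by sending SIGINT to this very process.
os.kill(os.getpid(), signal.SIGINT)
time.sleep(0.1)  # give the interpreter a moment to run the handler

# Put the original behaviour back.
signal.signal(signal.SIGINT, previous_handler)
```

<p>Restoring the previous handler at the end is optional; it is done here only so that the sketch leaves Ctrl+C behaving as it did before.</p>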
<p>The first problem is that Python&#8217;s signal module documentation explicitly<br />
   states that when multiple threads are running, only the main thread (i.e. the<br />
   thread that was created when your process started) will receive signals.  So<br />
   the signal handler will execute only in one thread and not in all of them.  In<br />
   order to get all threads to stop in response to a signal, you need the main<br />
   thread&#8217;s signal handler to communicate the stop message to the other threads.<br />
   You can do this in plenty of different ways, perhaps the simplest being to<br />
   have the main thread flip the value of a boolean variable that all threads<br />
   hold a reference to.  This is not a huge problem, and I&#8217;ve done things like<br />
   this before.
</p>
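<p>One way to sketch the shared flag in current Python 3 is with a <tt>threading.Event</tt>, which is swapped in here for the plain boolean described above. The names <tt>stop_requested</tt>, <tt>finished</tt> and <tt>worker</tt> are hypothetical, and the short <tt>sleep</tt> stands in for real work.</p>

```python
import threading
import time

stop_requested = threading.Event()  # hypothetical name for the shared flag
finished = []                       # records which workers shut down cleanly

def worker(name):
    # Hypothetical worker: handle one unit of work at a time and check the
    # flag between units, so the current unit is always completed.
    while not stop_requested.is_set():
        time.sleep(0.01)  # stand-in for downloading and parsing one URL
    finished.append(name)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for thread in threads:
    thread.start()

# In a real program this call would live inside the SIGINT handler.
stop_requested.set()

for thread in threads:
    thread.join()
```

<p>Because each worker only looks at the flag between units of work, it finishes whatever it is currently doing before it stops.</p>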
<p>To my surprise today, this approach just wasn&#8217;t working.  I put a print<br />
   statement in my signal handler and discovered that even the main thread wasn&#8217;t<br />
   receiving the SIGINT signal, even though it was definitely supposed to.
</p>
<p>This leads to the second problem involved in mixing threads and signals.  When<br />
   you send a signal to a multi-threaded Python program, that signal is put into a<br />
   queue.  The main thread processes signals from that queue and invokes the<br />
   relevant handlers, but &#8211; and here&#8217;s the catch &#8211; it doesn&#8217;t do this until it has<br />
   something else to do as well.  That is, if your main thread fires off a group<br />
   of worker threads and then sits there doing nothing while they work, then as far<br />
   as Python&#8217;s thread scheduler is concerned there is no need to give that main<br />
   thread any CPU time while the worker threads are actually doing something, so<br />
   your SIGINTs &#8211; and, indeed, any other signals &#8211; just pile up in the queue and<br />
   are never handled.  Note that &quot;doing nothing&quot; while the worker threads work<br />
   includes sitting in a blocked state after a call to the <tt>join</tt> method of<br />
   a worker thread.
</p>
<p>This means that if you want your main thread to be able to catch a Ctrl+C and<br />
   shut down all the worker threads, you need to make sure your main thread is<br />
   doing something while the others work.  This doesn&#8217;t have to be anything<br />
   useful, of course; you can just make a call to <a href="http://docs.python.org/lib/module-time.html"><tt>sleep</tt></a> in a loop every<br />
   second or so.  The code I am now using looks a bit like this, and seems to<br />
   work as intended:
</p>
<pre><code>from time import sleep

# num_threads and WorkerThread are defined elsewhere in the spider

# Start threads
threads = []
for i in range(0, num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()

# Wait for threads to finish
while True:
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
        break
    else:
        # Some threads are still going
        sleep(1)
</code></pre>
<p>With this code, if I hit Ctrl+C while the worker threads are working, the<br />
   SIGINT gets put in the main thread&#8217;s signal queue.  After no more than one<br />
   second, the <tt>sleep</tt> call in the infinite loop returns and the main<br />
   thread has something to do (check if all the threads have stopped yet).  It<br />
   thus gets a slice of CPU time from the thread scheduler and so gets a chance to<br />
   handle any signals which have built up.  If you need a bit more responsiveness,<br />
   you can sleep for less than a second, but the less you sleep the more CPU time<br />
   your main thread will chew up evaluating the expression in the if statement.<br />
   While on that subject, the <a href="http://docs.python.org/lib/built-in-funcs.html"><tt>any</tt></a> in that if statement is a new built-in<br />
   function that appeared in Python 2.5.  An equivalent statement that should work<br />
   in earlier versions is:
</p>
<p><code><br />
   if not True in [thread.isAlive() for thread in threads]:<br />
   </code>
</p>
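<p>The trade-off between responsiveness and CPU time can be made concrete with a rough sketch for current Python 3, where <tt>isAlive</tt> is spelled <tt>is_alive</tt>. The helpers <tt>wait_for_threads</tt> and <tt>start_workers</tt> are hypothetical, and the workers here do nothing but sleep.</p>

```python
import threading
import time

def wait_for_threads(threads, poll_interval=1.0):
    """Block until every thread has stopped and return the number of polls."""
    polls = 0
    while any(thread.is_alive() for thread in threads):
        polls += 1
        time.sleep(poll_interval)  # the main thread wakes up after each interval
    return polls

def start_workers(count, duration):
    # Hypothetical workers that simply sleep for `duration` seconds.
    threads = [threading.Thread(target=time.sleep, args=(duration,))
               for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads

# The same workload, polled coarsely and then finely.
coarse_polls = wait_for_threads(start_workers(3, 0.5), poll_interval=0.25)
fine_polls = wait_for_threads(start_workers(3, 0.5), poll_interval=0.01)
```

<p>The finer interval notices finished threads sooner, but it evaluates the liveness check many more times to do so.</p>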
<p>I hope that this is helpful to somebody at some stage.  Also, to give credit where<br />
   credit is due, <a href="http://www.daa.com.au/pipermail/pygtk/2002-March/002568.html">this email</a> by James<br />
   Henstridge on the PyGtk mailing list is where I got the insight to realise how<br />
   to fix my spider.  Thanks, James!  It&#8217;s probably also worth noting that there<br />
   is a <a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/496735">recipe</a><br />
   in the <a href="http://aspn.activestate.com/ASPN/Cookbook/Python">ASPN Python Cookbook</a><br />
   by Allen Downey which proposes a solution to this problem involving the<br />
   <tt>fork</tt> system call &#8211; the main thread calls <tt>fork</tt> to create a<br />
   child process.  The worker threads are spawned in the child process, leaving<br />
   just one thread in the parent process which can thus also receive a signal.  The<br />
   parent process can catch SIGINT and then <tt>kill</tt> its child to get the desired<br />
   effect.  I feel that this approach is a bit uglier than <tt>sleep</tt>ing in a<br />
   loop, but it may have advantages that make it the better choice under certain<br />
   circumstances.</p>
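<p>The general shape of the <tt>fork</tt> idea can be sketched roughly as follows. This is not the recipe&#8217;s actual code, only an illustration assuming a Unix system and a current Python 3; <tt>watch</tt>, <tt>child_main</tt> and <tt>handle_sigint</tt> are hypothetical names, and the choice of SIGKILL is an assumption.</p>

```python
import os
import signal

def watch(child_main):
    """Run child_main in a forked child; kill the child when SIGINT arrives."""
    child_pid = os.fork()
    if child_pid == 0:
        # Child process: this is where the worker threads would be spawned.
        try:
            child_main()
        finally:
            os._exit(0)  # never fall back into the parent's code

    def handle_sigint(signum, frame):
        # Parent process: single-threaded, so it reliably sees the signal.
        os.kill(child_pid, signal.SIGKILL)

    previous_handler = signal.signal(signal.SIGINT, handle_sigint)
    try:
        # Blocks until the child exits; the call is resumed after the handler runs.
        _, status = os.waitpid(child_pid, 0)
    finally:
        signal.signal(signal.SIGINT, previous_handler)
    return status
```

<p>Calling <tt>watch(crawl)</tt>, with <tt>crawl</tt> being whatever function starts the worker threads, returns the child&#8217;s wait status once the child has exited or been killed. Note that killing the child outright gives the workers no chance to finish their current URL.</p>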
]]></content:encoded>
			<wfw:commentRss>http://luke.maurits.id.au/blog/2008/03/threads-and-signals-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

