Posts tagged ‘programming’

Python vs C for performance

Via Reddit today I came across a fairly decent article on Python optimization tips and issues, which comes across fairly heavily in favour of the idea that by being careful and knowing what you’re doing, you can typically make Python implementations of
numerical algorithms fast enough for practical purposes, saving you the massive headaches associated with development in C.

This is a fairly relevant topic for me. A huge part of my PhD research is writing and playing with computational models of language acquisition. These models usually use Bayesian inference as a model of an “ideal learner” (doing this is fairly trendy in modern computational cognitive science, and not without good reason). When it comes to doing numerical Bayesian inference, a class of techniques known as Markov Chain Monte Carlo set the standard, and MCMC computations are the bread and butter of my programming work. Without going into too much detail, MCMC computations are highly iterative – you basically just do the same few steps over and over as fast as you can until your program converges on your answer.

When I first started writing MCMC models for my research, I did them in Python. I knew that my C skills weren’t great, that I’d be able to program a lot faster in Python, and that there were enough resources on the net for high performance Python computing that I’d be able to get my programs fast enough.  But after investing a lot of time trying to get my first Python MCMC program running fast – using things like numpy’s arrays and psyco – I still wasn’t happy with how slow things were going. I knuckled down and rewrote my model in C, and it absolutely blew the pants off my Python version. Of course I expected it to be faster, but it was much faster than I ever expected it to be. From then on in, I’ve written all of my MCMC stuff in C.  Since then I’ve become aware of PyMC, a Python library geared specifically toward MCMC. So obviously someone out there is able to make Python work for MCMC at a decent speed. This makes me wonder if maybe I really was doing something extremely wrong the whole time I was using Python for numerical work. Maybe it’s time I really dedicate some time to learning how to make Python faster for this kind of thing, to save myself time and effort in the future.

Even if it turns out that suffering through C implementations for the past 18 months has been something of a waste of time, to be honest I wouldn’t really regret it. Relying on C for my research has taught me more about it than I’d ever have learned otherwise. I am now comfortable with using malloc, calloc and free, I’ve done some multithreading work with the pthread API and I’ve learned how to use gdb and gprof.  I feel like I actually deserve to be able to say that I “know C”, as opposed to before when I knew the syntax but couldn’t really work effectively in C. Of course, I am still a long way from being any sort of guru, but I feel like I’ve taken important steps. I’ll be much more confident the next time I have to read and understand some C code.  A lot of people probably feel that in this day and age of Python and Ruby and Lua (and in less “webby” parts of the world, Java and C#) that competence in C is an obsolete skill. I think there’s a degree of truth in that, and certainly C shouldn’t be used for new projects without a compelling argument in favour of it (and a simple “it’s faster”, is not a compelling argument!). But at the same time I think that mastery of C is still an important rite of passage for serious programmers, and it feels good to have taken a number of extra steps in that direction.

But if I can work in Python from now on, I’m not going to miss the segaults.

Another Unix shell in Python

A while ago I posted the code for a simple Unix shell written in Python. Not too long after this I expressed my delight at discovering the Python standard library’s cmd module. It turns out that these two concepts go together really well. Here’s a much nicer simple Unix shell, less than 100 lines long. It still lacks input/output redirection or pipes, but by virtue of subclassing Cmd it has gained a command line history and tab completion. The tab completion is not the smoothest you’ll ever see – this is basically the first thing I could get almost working – but I think this solidly demonstrates that Cmd makes a fine base for a Pythonic Unix shell.

from cmd import Cmd
from glob import glob
from os import access, chdir, execv, fork, getenv, getcwd, listdir, wait, X_OK
from os.path import exists, isdir
from sys import exit

class Shell(Cmd):

        def _substitute_env_var(self, token):
                if token.startswith("$"):
                        return getenv(token[1:], "")
                else:
                        return token

        def precmd(self, line):
                words = line.split()
                words = map(self._substitute_env_var, words)
                return " ".join(words)

        def _get_executables(self):
                result = []
                for dir in getenv("PATH").split(":"):
                        try:
                                result.extend(filter(lambda(x): access(dir+"/"+x,X_OK), listdir(dir)))
                        except OSError:
                                pass
                return result

        def postcmd(self, stop, line):
                self.prompt = getenv("PS1", "$ ")
                self.executables = self._get_executables()
                if stop:
                        return True

        def completenames(self, text, *ignored):
                return filter(lambda(x): x.startswith(text), self.executables)

        def completedefault(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):], glob(prefix+"*"))

        def do_cd(self, line):
                if not line:
                        line = getenv("HOME", "/")
                try:
                        chdir(line.split()[0])
                except OSError, (errno, strerror):
                        if errno == 2:
                                print "Directory %s does not exist." % line.split()[0]

        def complete_cd(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):],filter(lambda(x): isdir(x), glob(prefix+"*")))

        def do_cwd (self, line):
                print getcwd()

        def do_EOF(self, line):
                print ""
                return True

        def default(self, line):

                argv = line.split()
                command = argv[0]

                found = False
                path = getenv("PATH")
                for dir in path.split(":"):
                        if exists("/".join((dir, command))):
                                found = True
                                if(fork() == 0):
                                        execv("/".join((dir, command)), argv)
                                else:
                                        wait()
                if not found:
                        print "%s: Not found" % command

shell = Shell()
shell.prompt = getenv("PS1", "$ ")
shell.executables = shell._get_executables()
shell.cmdloop()

Lately I’ve been trying to make an effort to use Python’s functional programming features, like map and filter, instead of cumbersome piles of fors, ifs and appends, which is evident in this code.

NetBSD 5.0 changes, link to Pypes

Well known German NetBSD guy Hubert Feyrer (personal website in English or German) made an entry in his NetBSD blog the other day discussing some user-visible changes in the upcoming NetBSD 5.0. It looks like something of a mixed bag to me, although it’s mostly good.

On the good side: audit-packages and download-vulnerability-list are scheduled to become part of the base system, which I’m very pleased by. I think this functionality is fantastic and amazingly rare (only FreeBSD has something similar, to my knowledge), so I’m glad to see it made easier. /tmp will now be per-user via the use of magic symlinks, which is fairly boring but I suppose has positive security implications. The old DHCP client has been replaced by a simpler one which uses about 50% as much memory – the absolute saving is quite small (1058 vs 548 Kb), but if there is no loss of functionality or stability then it makes sense to do this. Finally, the bootloader will have a config file. I’ve never had to need to configure the bootloader (which is delightfully simple) past its defaults, but this harms nobody is actually something I had simply assumed was already there. All in all some nice, simple changes – evolution, not revolution, as the OpenBSD folks are fond of saying.

On the bad side: NetBSD will soon have a webserver in the base system. Hubert himself says “Seriously, I have no idea why this is” and I couldn’t agree more. The only other OS I’ve known which does this is OpenBSD (which comes with a heavily patched version of Apache 1.3.x), and I’ve never really enjoyed it. Frankly, a webserver has no place in the base system of an OS. It’s something that belongs squarely within a packaging system, like pkgsrc or ports.

Not only is this decision aesthetically displeasing, but the choice of webserver itself borders on the surreal. The OpenBSD folks at least have some excuse for shipping a webserver, in the sense that they are acting in keeping with their “theme” of being a high-security OS by shipping a version of Apache which they feel is more secure than the off-the-shelf version (the Apache devs have refused to put some of the OpenBSD security patches into the official Apache). NetBSD 5.0 will ship with, wait for it, the bozotic HTTP server: a small and low-featured webserver that is somewhat similar to Acme software’s thttpd, with the primary difference that about one thousandth as many people have ever heard of it. I suppose one could make the argument that NetBSD are sticking to their own “theme” by shipping a very small, simple, minimal-dependency web server that would probably be ideal for embedded devices, but I don’t quite buy that – bozohttpd is available in pkgsrc and long has been, which makes the situation different to OpenBSD because their customised Apache is not in their ports tree.

I just don’t see the point in this. Not that I have a problem with bozohttpd itself – small and simple servers like it and thttpd were an inspiration for me when I was working on comanche – but of all the servers to jam somewhere one doesn’t belong, why pick one that so few people are familiar with and will be happy with? It’s much harder to find support for bozohttpd and it’s much harder if not actually impossible to get it to do things that a lot of people used to Apache are going to want it to do, i.e. any sort of dynamic content faster than CGI. I can’t see it getting used much, if at all, so why bother putting it in? It’s such a small thing in itself that it would seem ridiculous to call it bloat, but it sets a precedent of letting in stuff that nobody needs which, if maintained for a few release cycles, could lead to genuine bloat.

Unrelated link drop:
An entry in someone’s Livejournal about “Pypes”, an interesting Python object that overloads the | operator (usually used to perform bitwise or operations on integers) to apply a function passed to the constructor. The effect is that you can mimic the familiar and much loved “pipe” syntax of Unix shells. A brief taste:


>>> double = Pype(map, lambda x : x * 2)
>>> sum = Pype(reduce, lambda x,y: x+y)
>>> range(7) | double | sum
42

Interesting and clever stuff!

Australian Weather Hacking Project

Ever since I wrote my getauweather program in early 2006 I have been meaning to put it to some sort of good use and actually do something with it. Just a little overdue, I’ve spent the last week working to this end and today am ready to announce my Australian weather data hacking project. At this page I hope to make available progressively more complicated and interesting applications of the huge amount of data that the Bureau of Meteorology make available. Perhaps more importantly, I am going to make every effort to make that data available to the community at large so that people who aren’t me can do cool stuff with it as well.

At the moment, there is not an awful lot there. Every hour, a cron job runs a little Python script which uses getauweather (not the version you can currently download from my software page, a newer, better version that I’ll release in a few days when I am confident that it is working) to grab the latest data from all the weather stations and then:

  1. Reformats the data into CSV, XML and YAML, which anybody can then use in their own applications, for whatever purpose. The .csv file comes in at just under 100 kb and the two markup language files are in the area of 250 kb.
  2. Updates this Google map of weather stations.  This little visualisation really is my first use of either Javascript or the Google Maps API, so please forgive the fact that it really does suck. The page is a whopping 160 kb in size and, on most computers
    I’ve tried it on, Firefox will complain about how long it runs – just tell it to keep going and you’ll see results soon enough. I plan to use this map as a means to develop my Javascript skills, so hopefully it won’t suck for too much longer and will instead be powered by lightweight AJAX goodness. Of course, if you are already a Javascript wizard feel free to upstage me by making the coolest thing you can using those three freely available files above.

The next logical step for this project is to start logging these hourly weather results into a database which can be queried via a HTTP API. This will give me an excuse to finally try out CherryPy. Once such a database is up, AJAX magic will allow all sorts of awesome applications, and if people can download monthly dumps of the db then they can perform all sorts of fancy statistical analysis and the like. It should be good. I will note in passing that this undertaking is probably legally shaky for now.
Unlike the situation in the US, which has the eminently sensible rule that the federal government is not permitted to claim copyright on anything they produce, the Commonwealth of Australia feels quite entitled to copyright the BoM
observation data
. Apparently they are very easy going about granting licenses to repackage and redistribute the freely available stuff, so I’ll try to go down that route soon and, I suppose, pull the project if the BoM ends up objecting.
Personally, I am not entirely convinced that their claim of copyright can be legitimate. My understanding is that copyright law allows for protecting a particular expression of an idea, not the idea itself, which can only be protected by a patent. It’s not clear to me how this distinction applies to raw numbers, like weather station data, which can only possibly be expressed in one sensible possible way. At any rate, I don’t expect any trouble to show up on this front.

On a closing note, returning to the subject of improving my Javascript skills: In a previous entry I posted a link to a level of Super Mario World implemented in 14 kb of Javascript. I learned the other day that the same guy has gone ahead and started working on implementing Super Mario Kart the same way, and has previously done Wolfenstein 3D! The blog author, one Jacob Seidelin, is in fact quite the Javascript hacker, in the most traditional sense of the word – coding for fun with no specific goal
or direction, simply a desire to push boundaries and overcome limitations, which he is certainly doing. I’ll have to keep an eye on his work.

A little known awesome Python module

Via my constant companion, the programming reddit, the other day I came across a blog post by Doug Hellmann about a Python module named cmd, which contains one public class you can subclass to really easily make awesome command-driven programs in Python.

Reading about cmd, I thought it looked really awesome, couldn’t wait to use it, and was incredibly jealous that this Hellmann guy had thought of writing such a useful module before I had. It would have been a fairly easy but really useful and respectable bit of free software to have produced.

Then, reading the first comment on the post, I realised that I had misunderstood entirely what this was. cmd wasn’t a third party module that Hellmann had written and released. It was a part of the Python standard library that Hellmann was simply describing. This thing, this awesome thing, has been in Python since version 1.4, which was released last century, and before I had done any programming in anything other than Commodore BASIC 2.0, before I had any idea what Unix was. As far as I
personally am concerned it may as well have been the first thing Guido ever wrote.

In all my years of Python usage, I’ve never heard of cmd or seen it used!  Heck, I own a dead-tree copy of the O’Reilly book Python Standard Library and flip through it occasionally while waiting for the kettle to boil, or whatever, and I’ve never seen it in there. I see now that of course it is in there, but I never came across it in my random walks through that book, nor in any of the other Python books or blogs I have read. I am really astonished. This is something that should be much better publicised.

On an unrelated note, the anti-spam questions I discussed setting up in my last entry have completely eliminated the Russian spam problem! Rock on.

A Simple Unix Shell in Python

Here is an absolute bare-minimum Unix shell written in just 36 lines of Python
(including a few blank lines). It doesn’t do pipelines, input/output redirection,
tab-completion or have a command history, it’s just a really light wrapper around
the fork, execv and wait system calls, using the os
module
. The only internally implemented function is cd because a shell
is a painful thing to use without it.


from os import chdir, execv, fork, getenv, wait
from os.path import exists
from sys import exit

while True:

    # Get input
    prompt = getenv("PS1", "$ ")
    try:
        input = raw_input(prompt)
    except EOFError:
        exit()

    # Parse input
    argv = input.split()
    command = argv[0]

    # Do inbuilts
    if command == "cd":
        if len(argv) == 1:
           chdir(getenv("HOME", "/"))
        else:
           chdir(argv[1])
    # Do external
    else:
        found = False
        path = getenv("PATH")
        for dir in path.split(":"):
            if exists("/".join((dir, command))):
                found = True
                if(fork() == 0):
                    execv("/".join((dir, command)), argv)
                else:
                    wait()
    if not found:
        print "%s: Not found" % command

This works quite smoothly, it doesn’t feel laggy or anything when you’re
actually using it. I suppose this isn’t unusual because the Python code in use
probably just directly calls the underlying C system calls. It’s really pretty
neat, if not slightly silly.

I have a somewhat more complicated version working, but at the moment it’s still
looking a bit too ugly to post in a blog entry – it handles interpolation of
environment variables, multiple semi-colon delimited commands on a line and has the
beginnings of pipeline and redirection support.

I think I’ll end up writing an article that explains how Unix shells work,
using a Python shell as an example (because it’s much easier for the average
person to read than C), and put the complete shell up on my
software page.

On an unrelated note, here’s Super Mario World implemented in Javascript. Wow.

Threads and Signals in Python

For the last few days I’ve been working on a web spider (also known as a web
crawler – see Wikipedia), in
Python. This is something I’ve been thinking about doing for a
while, simply because it always seemed like it would be a good, fun,
educational programming challenge. I’ve been motivated to actually go ahead
and write one just now mainly on account of my burgeoning interest in
natural language processing, web searching, the semantic web and the overlap
between these three. I have a few ideas for projects that I’d like to try
out some day and they all require having a local copy of a small subset of
the web to tinker with. A spider seems the natural way to achieve this.

I’m very happy with how the spider is progressing and I’ll write about it in
some detail closer to when I actually release it (which shouldn’t be far off.
And, no, I haven’t forgotten about feedformatter, there’s a new version of that
in the works too). The point of this entry is to discuss the interplay of
threads and signals in Python, which is something I had to contend with today.

My spider is multi-threaded. The main thread creates an instance of a UrlQueue,
which is just a simple subclass of the standard library’s Queue object
and then spawns a number of worker threads which pull URLs off of this queue,
download the sites at those URLs and then parse the HTML looking for links,
placing any new URLs found onto the queue to be handled later by the same or a
different thread. The whole thing is run from the command line, so I’d really
like it if when the user hit Ctrl+C, each of the threads could finish dealing
with their current URL and then stop, so that the whole crawl finishes within a
few extra seconds.

Those readers with a bit of Unix background will know that what Ctrl+C actually
does is send a "signal" (specifically, SIGINT) to the process. You can read up
on signals at Wikipedia. Any modern Unix has a signal system call
which lets you register a “signal handler”, a function which is called upon
receipt of a signal. Python gives you access to this system call via the
signal module, so you can register a signal handler for SIGINT and
make Ctrl+C do whatever you like. The default SIGINT handler, by the way,
simply raises a KeyboardInterrupt exception, so if you don’t want to use signals
you can put your entire program in a try/except structure and get more or less
the same effect.

The first problem is that Python’s signal module documentation explicitly
states that when multiple threads are running, only the main thread (i.e. the
thread that was created when your process started) will receive signals. So
the signal handler will execute only in one thread and not in all of them. In
order to get all threads to stop in response to a signal, you need the main
thread’s signal handler to communicate the stop message to the other threads.
You can do this in plenty of different ways, perhaps the simplest being by
having the main thread flip the value of a boolean variable that all threads
hold a reference too. This is not a huge problem, and I’ve done things like
this before.

To my surprise today, this approach just wasn’t working. I put a print
statement in my signal handler and discovered that even the main thread wasn’t
receiving the SIGINT signal, even though it was definitely supposed to.

This leads to the second problem involved in mixing threads and signals. When
you send a signal to a multi-threaded Python program, that signal is put into a
queue. The main thread processes signals from that queue and invokes the
relevant handlers, but – and here’s the catch – it doesn’t do this until it has
something else to do as well. That is, if your main thread fires off a group
of worker threads and then sits there doing nothing while they work then as far
as Python’s thread scheduler is concerned there is no need to give that main
thread any CPU time while the worker threads are actually doing something, so
your SIGINTs – and, indeed, any other signals – just pile up in the queue and
are never handled. Note that "doing nothing" while the worker threads work
include sitting in a blocked state after a call to the join method of
a worker thread.

This means that if you want your main thread to be able to catch a Ctrl+C and
shut down all the worker threads, you need to make sure your main thread is
doing something while the others work. This doesn’t have to be anything
useful, of course, you can just make a call to sleep in a loop every
second or so. The code I am now using looks a bit like this, and seems to
work as intended:

# Start threads
threads = []
for i in range(0, num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()

# Wait for threads to finish
while True:
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
    break
    else:
        # Some threads are still going
        sleep(1)

With this code, if I hit Ctrl+C while the worker threads are working, the
SIGINT gets put in the main thread’s signal queue. After no more than one
second, the sleep call in the infinite loop returns and the main
thread has something to do (check if all the threads have stopped yet). It
thus gets a slice of CPU time from the thread scheduler and so gets a chance to
handle any signals which have built up. If you need a bit more responsiveness,
you can sleep for less than a second, but the less you sleep the more CPU time
your main thread will chew up evaluating the expression in the if statement.
While on that subject, the any in that if statement is a new built-in
function that appeared in Python 2.5. An equivalent statement that should work
in earlier versions is:


if not True in [thread.isAlive() for thread in threads]:

I hope that this is helpful to somebody at some stage. Also, to give credit where
credit is due, this email by James
Henstridge on the PyGtk mailing list is where I got the insight to realise how
to fix my spider. Thanks, James! It’s probably also worth noting that there
is a recipe
in the ASPN Python Cookbook
by Allen Downey which proposes a solution to this problem involving the
fork system call – the main thread calls fork to create a
child process. The worker threads are spawned in the child process, leaving
just one thread in the parent process which can thus also receive a signal. The
parent process can catch SIGINT and then kill its child to get the desired
effect. I feel that this approach is a bit uglier than sleeping in a
loop, but it may have advantages that make it the better choice under certain
circumstances.