Posts tagged ‘python’

Chrooting Python in OpenBSD 4.5

Just a quick technical entry today, on the off chance that this helps somebody, somewhere.

For reason’s I’ll go into another time, I’ve recently been trying to chroot a Python installation into the directory /srv/www on an OpenBSD 4.5 machine. I took the standard approach to this, which is to examine the output of ldd /usr/local/bin/python2.5:

/usr/local/bin/python2.5:
Start End Type Open Ref GrpRef Name
1c000000 3c004000 exe 1 0 0 /usr/local/bin/python2.5
05432000 2547c000 rlib 0 1 0 /usr/local/lib/libpython2.5.so.1.0
05dc3000 25dc7000 rlib 0 1 0 /usr/lib/libutil.so.11.0
0df1e000 2df44000 rlib 0 1 0 /usr/lib/libstdc++.so.47.0
0d543000 2d54d000 rlib 0 1 0 /usr/lib/libm.so.5.0
02a7c000 22a85000 rlib 0 1 0 /usr/lib/libpthread.so.11.1
06a56000 26a8f000 rlib 0 1 0 /usr/lib/libc.so.50.1
08db8000 08db8000 rtld 0 1 0 /usr/libexec/ld.so

and copy each of these files over to the corresponding location in the chroot directory (i.e. stuff from /usr/lib/ goes in /srv/www/usr/lib). But after doing this, testing the setup with chroot /srv/www python2.5 yielded the error: /usr/bin/local/python2.5: can't load library 'libpython2.5.so.1.0'.

I went near mad trying to figure out why this wasn’t working. The libpython file was clearly in the correct location. A PHP installation that I had chrooted into the very same place using the very same procedure worked flawlessly. There seemed to be no scope for rational explanation of why this wasn’t working.

In the end, I managed to solve this problem by copying the file /var/run/ld.so.hints into the corresponding location in the chroot directory. Everything worked perfectly after that. I don’t profess to have any idea why you need to do this (I discovered that it worked after much trial and error, based on random permutations of stuff found on the web), but you do. Hopefully this post saves some other poor hacker from wasting a few frustrated hours.

I note that the ld.so.hints file exists in the same location on recent NetBSD releases, so I assume this advice ports over to NetBSD as well. However, there is no such file in my install of Arch Linux, so things are probably different in Linux land. Linux distros do seem to have a file /etc/ld.so.conf which serves a similar purposes – it was actually reading about this that gave me the idea to look for something similar in my situation.

Python vs C for performance

Via Reddit today I came across a fairly decent article on Python optimization tips and issues, which comes across fairly heavily in favour of the idea that by being careful and knowing what you’re doing, you can typically make Python implementations of
numerical algorithms fast enough for practical purposes, saving you the massive headaches associated with development in C.

This is a fairly relevant topic for me. A huge part of my PhD research is writing and playing with computational models of language acquisition. These models usually use Bayesian inference as a model of an “ideal learner” (doing this is fairly trendy in modern computational cognitive science, and not without good reason). When it comes to doing numerical Bayesian inference, a class of techniques known as Markov Chain Monte Carlo set the standard, and MCMC computations are the bread and butter of my programming work. Without going into too much detail, MCMC computations are highly iterative – you basically just do the same few steps over and over as fast as you can until your program converges on your answer.

When I first started writing MCMC models for my research, I did them in Python. I knew that my C skills weren’t great, that I’d be able to program a lot faster in Python, and that there were enough resources on the net for high performance Python computing that I’d be able to get my programs fast enough.  But after investing a lot of time trying to get my first Python MCMC program running fast – using things like numpy’s arrays and psyco – I still wasn’t happy with how slow things were going. I knuckled down and rewrote my model in C, and it absolutely blew the pants off my Python version. Of course I expected it to be faster, but it was much faster than I ever expected it to be. From then on in, I’ve written all of my MCMC stuff in C.  Since then I’ve become aware of PyMC, a Python library geared specifically toward MCMC. So obviously someone out there is able to make Python work for MCMC at a decent speed. This makes me wonder if maybe I really was doing something extremely wrong the whole time I was using Python for numerical work. Maybe it’s time I really dedicate some time to learning how to make Python faster for this kind of thing, to save myself time and effort in the future.

Even if it turns out that suffering through C implementations for the past 18 months has been something of a waste of time, to be honest I wouldn’t really regret it. Relying on C for my research has taught me more about it than I’d ever have learned otherwise. I am now comfortable with using malloc, calloc and free, I’ve done some multithreading work with the pthread API and I’ve learned how to use gdb and gprof.  I feel like I actually deserve to be able to say that I “know C”, as opposed to before when I knew the syntax but couldn’t really work effectively in C. Of course, I am still a long way from being any sort of guru, but I feel like I’ve taken important steps. I’ll be much more confident the next time I have to read and understand some C code.  A lot of people probably feel that in this day and age of Python and Ruby and Lua (and in less “webby” parts of the world, Java and C#) that competence in C is an obsolete skill. I think there’s a degree of truth in that, and certainly C shouldn’t be used for new projects without a compelling argument in favour of it (and a simple “it’s faster”, is not a compelling argument!). But at the same time I think that mastery of C is still an important rite of passage for serious programmers, and it feels good to have taken a number of extra steps in that direction.

But if I can work in Python from now on, I’m not going to miss the segaults.

Zine, another Python blogging engine

I’m a bit late in blogging about this, but last month saw the release of Zine, a blog engine written in Python and in the spirit of WordPress – i.e. having a user-friendly web interface, nice looking themes, a lot of plugins, etc. It is the only Python blog engine that I know of in this style. These are very early days for the project, but already it looks pretty impressive and I have to admit I am tempted to start using it instead of investing more effort in CherryBlosxom.

Where CherryBlosxom uses CherryPy to handle the HTTP side of things, Zine uses Werkzeug (which, if I recall correctly from my quick peek when it was released, is actually even lighter and simpler than CherryPy). Where CherryBlosxom uses Cheetah for its templating, Zine uses Jinja (which I’ve never really looked at before). Where CherryBlosxom keeps everything on the filesystem, Zine uses SQLAlchemy (which I’ve never used but have consistently heard very high praise for) to store entries in MySQL, PostgreSQL or SQLite. The fact that it works with any of these three major database engines gives it a good advantage over WordPress, which only works on MySQL.

I don’t really see Zine as competition for CherryBlosxom as such, since they are both striving to fill extremely different ecological niches. I am sure that Zine will appeal to people the most and if Zine only gets better than how it looks now I expect that it won’t be too long before it is the preferred blog engine for Pythonistas, since it will be able to do anything WordPress can do just as well – and perhaps more, perhaps better. Nevertheless, I know that at first glance the two are going to look like their in competition by virtue of both being blog engines in Python (of which there are surprisingly few), so the release of Zine is somewhat motivating to put more effort into CherryBlosxom.

I’d like to put some proper, intelligent caching functionality into CherryBlosxom – so that GETs of entries or entry lists are cached (i.e. the filesystem isn’t hit to get the entry text each time) but as soon as someone comments on the entry the cache is invalidated (so that the next GET does hit the filesystem and pick up the new comment along the way). I’d also like to write some command line tools to make using CherryBlosxom easier, so that people aren’t directly writing files in their entries directory, and to automate things like spell checking, etc – much like the old setup I had for PyBlosxom, discussed here. Perhaps I’ll release CherryBlosxom 0.2 with some steps in this direction (oh, and RSS/Atom feeds, of course) sometime soon.

Another Unix shell in Python

A while ago I posted the code for a simple Unix shell written in Python. Not too long after this I expressed my delight at discovering the Python standard library’s cmd module. It turns out that these two concepts go together really well. Here’s a much nicer simple Unix shell, less than 100 lines long. It still lacks input/output redirection or pipes, but by virtue of subclassing Cmd it has gained a command line history and tab completion. The tab completion is not the smoothest you’ll ever see – this is basically the first thing I could get almost working – but I think this solidly demonstrates that Cmd makes a fine base for a Pythonic Unix shell.

from cmd import Cmd
from glob import glob
from os import access, chdir, execv, fork, getenv, getcwd, listdir, wait, X_OK
from os.path import exists, isdir
from sys import exit

class Shell(Cmd):

        def _substitute_env_var(self, token):
                if token.startswith("$"):
                        return getenv(token[1:], "")
                else:
                        return token

        def precmd(self, line):
                words = line.split()
                words = map(self._substitute_env_var, words)
                return " ".join(words)

        def _get_executables(self):
                result = []
                for dir in getenv("PATH").split(":"):
                        try:
                                result.extend(filter(lambda(x): access(dir+"/"+x,X_OK), listdir(dir)))
                        except OSError:
                                pass
                return result

        def postcmd(self, stop, line):
                self.prompt = getenv("PS1", "$ ")
                self.executables = self._get_executables()
                if stop:
                        return True

        def completenames(self, text, *ignored):
                return filter(lambda(x): x.startswith(text), self.executables)

        def completedefault(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):], glob(prefix+"*"))

        def do_cd(self, line):
                if not line:
                        line = getenv("HOME", "/")
                try:
                        chdir(line.split()[0])
                except OSError, (errno, strerror):
                        if errno == 2:
                                print "Directory %s does not exist." % line.split()[0]

        def complete_cd(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):],filter(lambda(x): isdir(x), glob(prefix+"*")))

        def do_cwd (self, line):
                print getcwd()

        def do_EOF(self, line):
                print ""
                return True

        def default(self, line):

                argv = line.split()
                command = argv[0]

                found = False
                path = getenv("PATH")
                for dir in path.split(":"):
                        if exists("/".join((dir, command))):
                                found = True
                                if(fork() == 0):
                                        execv("/".join((dir, command)), argv)
                                else:
                                        wait()
                if not found:
                        print "%s: Not found" % command

shell = Shell()
shell.prompt = getenv("PS1", "$ ")
shell.executables = shell._get_executables()
shell.cmdloop()

Lately I’ve been trying to make an effort to use Python’s functional programming features, like map and filter, instead of cumbersome piles of fors, ifs and appends, which is evident in this code.

NetBSD 5.0 changes, link to Pypes

Well known German NetBSD guy Hubert Feyrer (personal website in English or German) made an entry in his NetBSD blog the other day discussing some user-visible changes in the upcoming NetBSD 5.0. It looks like something of a mixed bag to me, although it’s mostly good.

On the good side: audit-packages and download-vulnerability-list are scheduled to become part of the base system, which I’m very pleased by. I think this functionality is fantastic and amazingly rare (only FreeBSD has something similar, to my knowledge), so I’m glad to see it made easier. /tmp will now be per-user via the use of magic symlinks, which is fairly boring but I suppose has positive security implications. The old DHCP client has been replaced by a simpler one which uses about 50% as much memory – the absolute saving is quite small (1058 vs 548 Kb), but if there is no loss of functionality or stability then it makes sense to do this. Finally, the bootloader will have a config file. I’ve never had to need to configure the bootloader (which is delightfully simple) past its defaults, but this harms nobody is actually something I had simply assumed was already there. All in all some nice, simple changes – evolution, not revolution, as the OpenBSD folks are fond of saying.

On the bad side: NetBSD will soon have a webserver in the base system. Hubert himself says “Seriously, I have no idea why this is” and I couldn’t agree more. The only other OS I’ve known which does this is OpenBSD (which comes with a heavily patched version of Apache 1.3.x), and I’ve never really enjoyed it. Frankly, a webserver has no place in the base system of an OS. It’s something that belongs squarely within a packaging system, like pkgsrc or ports.

Not only is this decision aesthetically displeasing, but the choice of webserver itself borders on the surreal. The OpenBSD folks at least have some excuse for shipping a webserver, in the sense that they are acting in keeping with their “theme” of being a high-security OS by shipping a version of Apache which they feel is more secure than the off-the-shelf version (the Apache devs have refused to put some of the OpenBSD security patches into the official Apache). NetBSD 5.0 will ship with, wait for it, the bozotic HTTP server: a small and low-featured webserver that is somewhat similar to Acme software’s thttpd, with the primary difference that about one thousandth as many people have ever heard of it. I suppose one could make the argument that NetBSD are sticking to their own “theme” by shipping a very small, simple, minimal-dependency web server that would probably be ideal for embedded devices, but I don’t quite buy that – bozohttpd is available in pkgsrc and long has been, which makes the situation different to OpenBSD because their customised Apache is not in their ports tree.

I just don’t see the point in this. Not that I have a problem with bozohttpd itself – small and simple servers like it and thttpd were an inspiration for me when I was working on comanche – but of all the servers to jam somewhere one doesn’t belong, why pick one that so few people are familiar with and will be happy with? It’s much harder to find support for bozohttpd and it’s much harder if not actually impossible to get it to do things that a lot of people used to Apache are going to want it to do, i.e. any sort of dynamic content faster than CGI. I can’t see it getting used much, if at all, so why bother putting it in? It’s such a small thing in itself that it would seem ridiculous to call it bloat, but it sets a precedent of letting in stuff that nobody needs which, if maintained for a few release cycles, could lead to genuine bloat.

Unrelated link drop:
An entry in someone’s Livejournal about “Pypes”, an interesting Python object that overloads the | operator (usually used to perform bitwise or operations on integers) to apply a function passed to the constructor. The effect is that you can mimic the familiar and much loved “pipe” syntax of Unix shells. A brief taste:


>>> double = Pype(map, lambda x : x * 2)
>>> sum = Pype(reduce, lambda x,y: x+y)
>>> range(7) | double | sum
42

Interesting and clever stuff!

A little known awesome Python module

Via my constant companion, the programming reddit, the other day I came across a blog post by Doug Hellmann about a Python module named cmd, which contains one public class you can subclass to really easily make awesome command-driven programs in Python.

Reading about cmd, I thought it looked really awesome, couldn’t wait to use it, and was incredibly jealous that this Hellmann guy had thought of writing such a useful module before I had. It would have been a fairly easy but really useful and respectable bit of free software to have produced.

Then, reading the first comment on the post, I realised that I had misunderstood entirely what this was. cmd wasn’t a third party module that Hellmann had written and released. It was a part of the Python standard library that Hellmann was simply describing. This thing, this awesome thing, has been in Python since version 1.4, which was released last century, and before I had done any programming in anything other than Commodore BASIC 2.0, before I had any idea what Unix was. As far as I
personally am concerned it may as well have been the first thing Guido ever wrote.

In all my years of Python usage, I’ve never heard of cmd or seen it used!  Heck, I own a dead-tree copy of the O’Reilly book Python Standard Library and flip through it occasionally while waiting for the kettle to boil, or whatever, and I’ve never seen it in there. I see now that of course it is in there, but I never came across it in my random walks through that book, nor in any of the other Python books or blogs I have read. I am really astonished. This is something that should be much better publicised.

On an unrelated note, the anti-spam questions I discussed setting up in my last entry have completely eliminated the Russian spam problem! Rock on.

A Simple Unix Shell in Python

Here is an absolute bare-minimum Unix shell written in just 36 lines of Python
(including a few blank lines). It doesn’t do pipelines, input/output redirection,
tab-completion or have a command history, it’s just a really light wrapper around
the fork, execv and wait system calls, using the os
module
. The only internally implemented function is cd because a shell
is a painful thing to use without it.


from os import chdir, execv, fork, getenv, wait
from os.path import exists
from sys import exit

while True:

    # Get input
    prompt = getenv("PS1", "$ ")
    try:
        input = raw_input(prompt)
    except EOFError:
        exit()

    # Parse input
    argv = input.split()
    command = argv[0]

    # Do inbuilts
    if command == "cd":
        if len(argv) == 1:
           chdir(getenv("HOME", "/"))
        else:
           chdir(argv[1])
    # Do external
    else:
        found = False
        path = getenv("PATH")
        for dir in path.split(":"):
            if exists("/".join((dir, command))):
                found = True
                if(fork() == 0):
                    execv("/".join((dir, command)), argv)
                else:
                    wait()
    if not found:
        print "%s: Not found" % command

This works quite smoothly, it doesn’t feel laggy or anything when you’re
actually using it. I suppose this isn’t unusual because the Python code in use
probably just directly calls the underlying C system calls. It’s really pretty
neat, if not slightly silly.

I have a somewhat more complicated version working, but at the moment it’s still
looking a bit too ugly to post in a blog entry – it handles interpolation of
environment variables, multiple semi-colon delimited commands on a line and has the
beginnings of pipeline and redirection support.

I think I’ll end up writing an article that explains how Unix shells work,
using a Python shell as an example (because it’s much easier for the average
person to read than C), and put the complete shell up on my
software page.

On an unrelated note, here’s Super Mario World implemented in Javascript. Wow.

Pyblosxom Hack Number 1

Here’s my first “pyBlosxom hack”. It’s not really a hack on the
pyBlosxom system
itself, it’s more of a “usage hack”, but I think it’s a relatively
neat one.

Back when this blog was statically rendered, I used to write the
entries in "http://pyblosxom.sourceforge.net/">Markdown, and they were
stayed that one on the file system. The statially rendered HTML
pages were in proper HTML, however, because I used the "http://pyblosxom.sourceforge.net/registry/text/PyMarkdown.html">Markdown
parser
for PyBlosxom. This worked just fine for static
rendering, but when I went dynamic I immediatley realised a huge
problem with this set up. For some reason the Markdown parser is
unbelievably slow. It took literally whole minutes for pyBlosxom to
render the latest 10 entries, which is obviously completely
unacceptable.

I found this quite odd at first, because I write my articles in
Markdown too, and use "http://www.freewisdom.org/projects/python-markdown/">Markdown in
Python
to translate them to HTML. I had always assumed that
PyBlosxom used the same Markdown translation code – afterall, why
would someone code a Python Markdown library if there was already
one out there? But it turns out that in fact this is what’s
happened. The PyBlosxom renderer uses completely different – and
obviously much less efficient – code to Markdown in Python.

The obvious solution to this problem would have been to wrap
Markdown in Python up in whatever interface pyBlosxom uses for
parsers, but I’ve solved it by doing something quite different
which gives me a fairly powerful interface to using pyBlosxom.

I’ve written a python script called makeentry which
does the following:

  • Starts up vi, my editor of choice, editing a temporary
    file in /tmp. I use this editing session to write an entry
    in Markdown. Note that I write just the entry, without the
    metadata that pyBlosxom would usually want at the start, like a
    title or tags.
  • Upon the vi process terminating after I finish writing
    the entry, it starts up "http://aspell.net/">aspell to spell check that
    file.
  • After spell checking the file, it (quickly!) translates the
    Markdown to HTML using Markdown in Python.
  • I then get prompted to entire a title and list of tags.
  • The title, tags and translated HTML entry are then all
    concatenated in the expected way into a file in my pyBlosxom entry
    directory (the filename is automatically generated from the title
    by converting to lowercase and replacing spaces with
    underscores).

This way I still get to write in Markdown, but with the
following benefits over wrapping Markdown in Python up with
pyBlosxom’s parser interface:

  • I get to do do spell checking (indeed, arbitrary
    pre-processing) before publishing my entry.
  • pyBlosxom reads the entry of the disk in HTML, so no time at
    all is consumed doing a translation (which is faster than even the
    fastest Markdown translator possible).

I quite like this usage paradigm. I’m hoping that sometime not
too far off I get the chance to add another level of
pre-processing: Pygments is a
code colouring system (written in Python, of course), which
translates code in just about any modern programming language into
HTML with appropriate span tags to perform syntactic code
colouration. I’d really like it if I could have my
makeentry script search the HTML entry for code
tags nested in pre tags (using "http://docs.python.org/lib/module-HTMLParser.html">HTMLParser
from the Python standard library) and automatically replace the
contents with colourised code using Pygments. This would be pretty
cool and shouldn’t be too hard. Keep an eye out for it in the
nearish future.

Threads and Signals in Python

For the last few days I’ve been working on a web spider (also known as a web
crawler – see Wikipedia), in
Python. This is something I’ve been thinking about doing for a
while, simply because it always seemed like it would be a good, fun,
educational programming challenge. I’ve been motivated to actually go ahead
and write one just now mainly on account of my burgeoning interest in
natural language processing, web searching, the semantic web and the overlap
between these three. I have a few ideas for projects that I’d like to try
out some day and they all require having a local copy of a small subset of
the web to tinker with. A spider seems the natural way to achieve this.

I’m very happy with how the spider is progressing and I’ll write about it in
some detail closer to when I actually release it (which shouldn’t be far off.
And, no, I haven’t forgotten about feedformatter, there’s a new version of that
in the works too). The point of this entry is to discuss the interplay of
threads and signals in Python, which is something I had to contend with today.

My spider is multi-threaded. The main thread creates an instance of a UrlQueue,
which is just a simple subclass of the standard library’s Queue object
and then spawns a number of worker threads which pull URLs off of this queue,
download the sites at those URLs and then parse the HTML looking for links,
placing any new URLs found onto the queue to be handled later by the same or a
different thread. The whole thing is run from the command line, so I’d really
like it if when the user hit Ctrl+C, each of the threads could finish dealing
with their current URL and then stop, so that the whole crawl finishes within a
few extra seconds.

Those readers with a bit of Unix background will know that what Ctrl+C actually
does is send a "signal" (specifically, SIGINT) to the process. You can read up
on signals at Wikipedia. Any modern Unix has a signal system call
which lets you register a “signal handler”, a function which is called upon
receipt of a signal. Python gives you access to this system call via the
signal module, so you can register a signal handler for SIGINT and
make Ctrl+C do whatever you like. The default SIGINT handler, by the way,
simply raises a KeyboardInterrupt exception, so if you don’t want to use signals
you can put your entire program in a try/except structure and get more or less
the same effect.

The first problem is that Python’s signal module documentation explicitly
states that when multiple threads are running, only the main thread (i.e. the
thread that was created when your process started) will receive signals. So
the signal handler will execute only in one thread and not in all of them. In
order to get all threads to stop in response to a signal, you need the main
thread’s signal handler to communicate the stop message to the other threads.
You can do this in plenty of different ways, perhaps the simplest being by
having the main thread flip the value of a boolean variable that all threads
hold a reference too. This is not a huge problem, and I’ve done things like
this before.

To my surprise today, this approach just wasn’t working. I put a print
statement in my signal handler and discovered that even the main thread wasn’t
receiving the SIGINT signal, even though it was definitely supposed to.

This leads to the second problem involved in mixing threads and signals. When
you send a signal to a multi-threaded Python program, that signal is put into a
queue. The main thread processes signals from that queue and invokes the
relevant handlers, but – and here’s the catch – it doesn’t do this until it has
something else to do as well. That is, if your main thread fires off a group
of worker threads and then sits there doing nothing while they work then as far
as Python’s thread scheduler is concerned there is no need to give that main
thread any CPU time while the worker threads are actually doing something, so
your SIGINTs – and, indeed, any other signals – just pile up in the queue and
are never handled. Note that "doing nothing" while the worker threads work
include sitting in a blocked state after a call to the join method of
a worker thread.

This means that if you want your main thread to be able to catch a Ctrl+C and
shut down all the worker threads, you need to make sure your main thread is
doing something while the others work. This doesn’t have to be anything
useful, of course, you can just make a call to sleep in a loop every
second or so. The code I am now using looks a bit like this, and seems to
work as intended:

# Start threads
threads = []
for i in range(0, num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()

# Wait for threads to finish
while True:
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
    break
    else:
        # Some threads are still going
        sleep(1)

With this code, if I hit Ctrl+C while the worker threads are working, the
SIGINT gets put in the main thread’s signal queue. After no more than one
second, the sleep call in the infinite loop returns and the main
thread has something to do (check if all the threads have stopped yet). It
thus gets a slice of CPU time from the thread scheduler and so gets a chance to
handle any signals which have built up. If you need a bit more responsiveness,
you can sleep for less than a second, but the less you sleep the more CPU time
your main thread will chew up evaluating the expression in the if statement.
While on that subject, the any in that if statement is a new built-in
function that appeared in Python 2.5. An equivalent statement that should work
in earlier versions is:


if not True in [thread.isAlive() for thread in threads]:

I hope that this is helpful to somebody at some stage. Also, to give credit where
credit is due, this email by James
Henstridge on the PyGtk mailing list is where I got the insight to realise how
to fix my spider. Thanks, James! It’s probably also worth noting that there
is a recipe
in the ASPN Python Cookbook
by Allen Downey which proposes a solution to this problem involving the
fork system call – the main thread calls fork to create a
child process. The worker threads are spawned in the child process, leaving
just one thread in the parent process which can thus also receive a signal. The
parent process can catch SIGINT and then kill its child to get the desired
effect. I feel that this approach is a bit uglier than sleeping in a
loop, but it may have advantages that make it the better choice under certain
circumstances.

Gopher Mirror and Site Redesign

Happy new year, everybody. On a quick personal note, I am a married man now (as of December 15, 2007, photos to be uploaded soon) and am off for a 3 week honeymoon in Europe (with a quick stop in Japan on the way to visit some good friends who missed the wedding) in just over a week.

You may have noticed that my website has changed a little in appearance over the past week. Some pages are probably throwing 404s or the like, as well. There’s something of a lengthy backround story to this, which I’ll share in this blog entry.

About a month ago, I discovered Gopher, an old (very early 90s) internet protocol for providing menu-driven access to hierarchically structured plaintext data. Gopher was basically the pre-cursor to the World Wide Web we know today. I had heard of Gopher before (in the Python standard library documentation, amongst other places), but didn’t know much about it and certainly had never tried it. Now I have, and I’ve decided that I like it. A lot. Admittedly this is partially because of its “retro charm”, but also because by forcing a hierarcical structure and removing all aspects of visual design, Gopher perfectly represents the the philosophy of information sharing that the internet was built on, while simultaneously achieving levels of speed, ease of use and elegence which have completely vanished from the modern web. This is the kind of thing I really enjoy.

Oh, and there’s no adverts.

I decided that I would like to mirror this website in Gopherspace, for the learning experience and the fun of it, as something of a statement of my preference for simplicity in computing and – admittedly – on account of the fact that according to automated Goperspace searching tools, doing so would make me administrator of the only Gopher server in Australia! Serious geek credit. I’m really quite incredulous of this fact, but it seems to be true. It is also reported that “There are no known gopher servers in Africa, Eurasia, the Middle East or the Indian subcontinent”. What are the geeks of the world doing, honestly?

To this end, I have spent the last week or so redesigning the way my site is managed. This had always been a fairly ad hoc afair, with lots of ugly PHP scripts (eww), hand written HTML code and rather more manual work than I’d have liked involved in even simple layout changes, etc. The need to conveniently make the same content available over two different protocols (HTTP and Gopher) in two different formats (HTML and plain text) demanded a complete overhaul, and I’m happy to report that something quite flexible and usable appears to be emerging from my work in this direction.

PHP is out and Python scripts are (as they probably always should have been) in. I am using the Cheeltah templating engine (devel/py-cheetah in pkgsrc) to do the grunt work. I write all of my articles in Markdown, as plain text files which are free of any metadata like titles, subtitles, keywords, etc. This metadata is stored in a configuration file compatible with the ConfigParser module in the Python standard library. A simple script reads the content files and the metadata and then passes that information to Cheetah templates – one set of templates handles the creation of my website (the Markdown is first transformed into HTML by “Markdown in Python“, /textproc/py-markdown in pkgsrc), another set of templates handles the creation of my Gopherspace, where pages are served up in their original Markdown, which is relatively readable. I am in the process of writing a script to also update an RSS feed whenever a new article runs through the above process, so that people can readily keep track of updates to my site. Surprisingly, there is not a lot of good code out there for generating RSS feeds with Python. I am putting something together using effbot.org’s ElementTree library, working from this handy example.

This whole new system works rather nicely and I am happy that managing my website from now on will be easy, flexible and Pythonic. But even moreso I am happy with how easily and quickly the system has come together. By using existing free software projects and making design decisions which leverage Python’s strengths (like using ConfigParser comapitlbe config files) I feel like I have made a working product in an amount of time so short it is completely out of proportion to the product’s functionality. It’s still quite ugly and unpolished at the moment, but I think that I will work on it over time and hopefully end up releasing at least parts of it on this website eventually.

Anyway, the end result for you of all this is that you can now point something – either a proper Gopher client (try net/gopher in pkgsrc) or a web browser that supports Gopher (Firefox is one) – to luke.maurits.id.au) on port 70 and see this site mirrored in Gopherspace! It’s a bit messy for now, but I do hope to clean it up soon. The server I am using is PyGopherd (/net/pygopherd in pkgsrc), a modern Gopher server written, of course, in Python.

Enjoy the anachronism!