Archive for March 2008

Problems with Ubuntu

Following on from my post about the unavoidability of Flash, maybe a week or two ago I took a plunge back into the world of Linux after a goodly 3 years or so using NetBSD as my main desktop operating system. I decided to check out this Ubuntu thing that everybody seems so very worked up about these days. More specifically, I tried the Xubuntu flavour, because I find Xfce to be less repugnant than the bloated giants that are Gnome and KDE.

Here’s a list of things I’ve discovered about it so far which have bothered me. I consider all of them to be fairly serious flaws, at least from an idealistic standpoint:

  • The “base” installation bears no resemblance to the conceptual ideal of a base installation – it contains many things which are in no reasonable way necessary for your system to work (including Perl and Python and Ruby!), yet misses out things that one would reasonably expect to be a part of any Unix installation and in many cases are necessary for the system to work, like NFS support (client or server).
  • By default, the root account is disabled (i.e. you cannot log in as root from the console or use su from a shell) and you are forced to use sudo to do anything with root powers. At no point whatsoever during the installation process are you told that this is happening, even though no other Unix-like operating system on Earth behaves this way.
  • In a default installation, typing vi at a shell doesn’t start vi, it starts vim. I don’t like being lied to by my computer.
  • Third party software that you install via apt-get gets put in /usr/bin/, the exact same place as stuff that was installed as part of the system. /usr/local remains empty and there is no separation of system from extras, as is seen in the BSD systems and as I’m pretty sure was the norm when I last used Linux.
  • Whenever you install a piece of server software (like, for instance, the OpenSSH server, which, astonishingly, is not part of the titanic base install) Ubuntu immediately starts this server up, using the default configuration. You don’t get asked if you’d like to start it now. You don’t get a chance to change the default configuration to something that is appropriate for your environment. It just gets started. I think this one is particularly bad.

With all that said and done, having Flash working is pretty nice, and the package management has been an absolute dream so far, compared to pkgsrc. It doesn’t have some of pkgsrc’s cooler features, like license auditing or automated checking for security vulnerabilities, but the three simple facts that:

  1. Everything you could ever want is in there,
  2. Everything is available as a binary package,
  3. Updating everything is easy,

make those little sacrifices entirely negligible. It’s awesome, just a shame that the rest of the system has such severe deficiencies. I suppose the obvious thing to do based on this experience is to check out plain old Debian, although I also have my eye on Arch Linux. Hopefully I can find something out there that doesn’t suck. I’m somewhat worried about the age-old issue of Debian’s packages being prehistoric, but I suppose that if it’s so bad, I can live with Xubuntu – most of the issues above are things you only have to deal with once.

On an unrelated note, you’ve probably noticed the appearance of commenting on this blog – I think that this works, so feel free to have at it. The road to getting comments to work in pyBlosxom was winding and fraught with peril, and my next entry will probably be about that.

Oh, and the commencement of my PhD has been delayed by probably about a week due to various administrative complications. Sigh.

Threads and Signals in Python

For the last few days I’ve been working on a web spider (also known as a web
crawler – see Wikipedia), in
Python. This is something I’ve been thinking about doing for a
while, simply because it always seemed like it would be a good, fun,
educational programming challenge. I’ve been motivated to actually go ahead
and write one just now mainly on account of my burgeoning interest in
natural language processing, web searching, the semantic web and the overlap
between these three. I have a few ideas for projects that I’d like to try
out some day and they all require having a local copy of a small subset of
the web to tinker with. A spider seems the natural way to achieve this.

I’m very happy with how the spider is progressing and I’ll write about it in
some detail closer to when I actually release it (which shouldn’t be far off.
And, no, I haven’t forgotten about feedformatter, there’s a new version of that
in the works too). The point of this entry is to discuss the interplay of
threads and signals in Python, which is something I had to contend with today.

My spider is multi-threaded. The main thread creates an instance of a UrlQueue,
which is just a simple subclass of the standard library’s Queue object
and then spawns a number of worker threads which pull URLs off of this queue,
download the sites at those URLs and then parse the HTML looking for links,
placing any new URLs found onto the queue to be handled later by the same or a
different thread. The whole thing is run from the command line, so I’d really
like it if when the user hit Ctrl+C, each of the threads could finish dealing
with their current URL and then stop, so that the whole crawl finishes within a
few extra seconds.

Those readers with a bit of Unix background will know that what Ctrl+C actually
does is send a "signal" (specifically, SIGINT) to the process. You can read up
on signals at Wikipedia. Any modern Unix has a signal system call
which lets you register a “signal handler”, a function which is called upon
receipt of a signal. Python gives you access to this system call via the
signal module, so you can register a signal handler for SIGINT and
make Ctrl+C do whatever you like. The default SIGINT handler, by the way,
simply raises a KeyboardInterrupt exception, so if you don’t want to use signals
you can put your entire program in a try/except structure and get more or less
the same effect.

The first problem is that Python’s signal module documentation explicitly
states that when multiple threads are running, only the main thread (i.e. the
thread that was created when your process started) will receive signals. So
the signal handler will execute only in one thread and not in all of them. In
order to get all threads to stop in response to a signal, you need the main
thread’s signal handler to communicate the stop message to the other threads.
You can do this in plenty of different ways, perhaps the simplest being to
have the main thread flip the value of a boolean variable that all threads
hold a reference to. This is not a huge problem, and I’ve done things like
this before.

To my surprise today, this approach just wasn’t working. I put a print
statement in my signal handler and discovered that even the main thread wasn’t
receiving the SIGINT signal, even though it was definitely supposed to.

This leads to the second problem involved in mixing threads and signals. When
you send a signal to a multi-threaded Python program, that signal is put into a
queue. The main thread processes signals from that queue and invokes the
relevant handlers, but – and here’s the catch – it doesn’t do this until it has
something else to do as well. That is, if your main thread fires off a group
of worker threads and then sits there doing nothing while they work then as far
as Python’s thread scheduler is concerned there is no need to give that main
thread any CPU time while the worker threads are actually doing something, so
your SIGINTs – and, indeed, any other signals – just pile up in the queue and
are never handled. Note that "doing nothing" while the worker threads work
includes sitting in a blocked state after a call to the join method of
a worker thread.

This means that if you want your main thread to be able to catch a Ctrl+C and
shut down all the worker threads, you need to make sure your main thread is
doing something while the others work. This doesn’t have to be anything
useful, of course, you can just make a call to sleep in a loop every
second or so. The code I am now using looks a bit like this, and seems to
work as intended:

from time import sleep

# Start threads
threads = []
for i in range(0, num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()

# Wait for threads to finish
while True:
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
        break
    else:
        # Some threads are still going
        sleep(1)

With this code, if I hit Ctrl+C while the worker threads are working, the
SIGINT gets put in the main thread’s signal queue. After no more than one
second, the sleep call in the infinite loop returns and the main
thread has something to do (check if all the threads have stopped yet). It
thus gets a slice of CPU time from the thread scheduler and so gets a chance to
handle any signals which have built up. If you need a bit more responsiveness,
you can sleep for less than a second, but the less you sleep the more CPU time
your main thread will chew up evaluating the expression in the if statement.
While on that subject, the any in that if statement is a new built-in
function that appeared in Python 2.5. An equivalent statement that should work
in earlier versions is:


if not True in [thread.isAlive() for thread in threads]:

I hope that this is helpful to somebody at some stage. Also, to give credit where
credit is due, this email by James
Henstridge on the PyGtk mailing list is where I got the insight to realise how
to fix my spider. Thanks, James! It’s probably also worth noting that there
is a recipe
in the ASPN Python Cookbook
by Allen Downey which proposes a solution to this problem involving the
fork system call – the main thread calls fork to create a
child process. The worker threads are spawned in the child process, leaving
just one thread in the parent process which can thus also receive a signal. The
parent process can catch SIGINT and then kill its child to get the desired
effect. I feel that this approach is a bit uglier than sleeping in a
loop, but it may have advantages that make it the better choice under certain
circumstances.
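
For anyone who wants to see the whole arrangement in one place, here is a minimal sketch of how the signal handler, the shared stop flag and the sleeping main loop might fit together. It is not the spider's actual code: the worker below is a stand-in that just sleeps, the names stop_requested, handle_sigint, crawl and main are made up for illustration, and it assumes a modern Python 3 (threading.Event and is_alive rather than the Python 2.5 names used above).

```python
import signal
import threading
import time

# Shared stop flag (hypothetical name). An Event plays the role of the
# boolean variable that every thread holds a reference to.
stop_requested = threading.Event()


def handle_sigint(signum, frame):
    # Python runs signal handlers in the main thread only, so the handler
    # just flips the shared flag and lets the workers notice it themselves.
    stop_requested.set()


class WorkerThread(threading.Thread):
    """Stand-in worker: 'handles' one item at a time until asked to stop."""

    def __init__(self):
        super().__init__()
        self.processed = 0

    def run(self):
        while not stop_requested.is_set():
            time.sleep(0.01)  # placeholder for downloading and parsing a URL
            self.processed += 1


def crawl(num_threads, poll_interval=1.0):
    # Start threads
    threads = [WorkerThread() for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    # Wait for threads to finish, waking up regularly so that the main
    # thread keeps getting a chance to run the signal handler.
    while any(thread.is_alive() for thread in threads):
        time.sleep(poll_interval)
    return threads


def main():
    # Must be called from the main thread; signal.signal raises
    # ValueError if it is called from any other thread.
    signal.signal(signal.SIGINT, handle_sigint)
    crawl(num_threads=4)
```

Calling main() from a script and then hitting Ctrl+C should let each worker finish its current item and exit, after which the loop in crawl notices that no threads are alive and returns.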

Research explained

In my last entry I said I’d explain the research page which mysteriously appeared during my site redesign. Here’s the story.

My first job at m.Net Corporation was basically to refine and extend some work done as part of a joint research project between m.Net and a research psychologist from my alma mater. This psychologist was Daniel Navarro, an insanely smart guy who, despite being a psychologist, actually understands things like maths and statistics and can even write code (though to be fair his code sometimes sucks).

Working together we had moderate success in adapting latent Dirichlet allocation, a mathematical model originally developed for natural language processing, to a collaborative filtering problem as part of m.Net’s customer analytics research. It was pretty cool stuff, and I learned a lot. I was genuinely surprised and excited to realise that some psychologists actually do things like heavy Bayesian statistics and intense number crunching, instead of just blindly assuming that all the world’s data is normally distributed and interpreting simple linear regression as the Word of God (which is what mathematicians generally assume psychologists spend all of their time doing – it’s a reputation they deserve for teaching their students from a book called Statistics Without Maths for Psychology. I mean, really). Check out MIT’s Computational Cognitive Science Group for a better idea of the cool kind of stuff some people do.

Anyway, about a month ago Dan mentioned to me in passing that an internal PhD scholarship in the School of Psychology may be about to become available, and suggested that if I were interested he could try to convince them to let me apply for it, on the grounds that teaching a mathematician the basics of psychology is about 100 times easier than teaching a psychologist the basics of mathematics and hence recruiting mathematicians is actually a smarter way to produce good research in mathematical psychology. I said I was interested, because I really did find the LDA work I did interesting, and he said he’d try but that I shouldn’t get my hopes up because it was a long shot. So I didn’t. Fast forward to earlier this week and the last bit of paperwork has gone through and the scholarship is mine. Sometime before the end of the month I’ll be starting a PhD, with Dan as my supervisor.

The topic of study has not yet been finally decided, but it will revolve in some way around how humans firstly learn and subsequently understand language (and these are obviously related problems, considering the way that advanced language is typically learned via explanation using simpler language) and involve as much mathematical modelling and number crunching as I can possibly squeeze into it. I’m very excited about possible applications of these models, to things like improving the “intelligence” of tools like search engines and news aggregators and, perhaps more ambitiously, using software to “bootstrap” the semantic web by auto-generating RDF files en masse.

So you can expect any papers or the like that I write in the course of this PhD to appear on my research page, any software I write as part of it to appear on my software page (under a BSD license, of course), and occasional thoughts to appear in this blog.

This doesn’t explain the fact that all of this is happening with probability 0.5. I’ll leave that for another entry.

Oh, and I am trying to arrange to stay on at m.Net for one day a week during the PhD, because it’s an awesome place to work and I’d be genuinely sad to leave for good.

Another new look

As you can almost certainly see, I’ve given my site a new look, again. This one is substantially less simple than any of my previous ones (although I’d like to think it’s still clean), and I think it happens to look pretty good. It’s not my own work, of course – the CSS was done by Erwin Aligam, who makes a whole bunch of really nice, neat looking templates available under a Creative Commons license on his website, StyleShout. Thanks, Erwin! Using someone else’s template instead of doing another one myself means that (1) my site actually looks good and (2) my silly urges to fiddle with the site layout are promptly satisfied before I waste too much time better spent producing, y’know, actual content.

During the reworking of the site, I’ve tried my best to make sure that any existing bookmarks won’t break. You can still access any of my articles at their old URLs, without problems. The main structural change to the site is that a lot of sections from the old site – programming, unix, humanitarian computing, maths and cryptography – have all been subsumed into a larger “writings” section. I did this for two reasons. One was that it made my navigation menu shorter, which better lends itself to certain site designs and layouts. The main reason, though, was that sometimes I want to write something which doesn’t neatly fit into any of the above categories, but which is a bit too long and/or formal to make into a blog entry. Having a single writings page means I can dump these miscellaneous things at the bottom and be done with it, rather than having to come up with a whole new section.

Some of you may also have noticed the new “research” section with an unexpected place holding message. Everything regarding this will be explained in a coming entry that I’ll probably write tomorrow.

The Unavoidability of Flash

Before the main thrust of this entry, I just wanted to point out that I (finally) got around to putting up at least a first version of my article on password storage, which has been linked to by my SQL injection article for a long time but hasn’t actually existed until earlier this week. Enjoy, and feedback is welcome!

Anyway, the main point of this article is that lately I have found myself ever more dissatisfied over the lack of availability of Flash on my home desktop machine. For those of you who didn’t realise, Flash is only available as a binary plugin for the mainstream operating systems and NetBSD isn’t amongst those. Getting Flash to work in NetBSD has always been a bit hit and miss. There is a wide range of possible solutions (and I discuss most of them, I think, in my NetBSD survival guide), mostly based around various kinds of emulation. These solutions work to wildly varying degrees, depending on everything from the versions of Flash, NetBSD and Firefox involved to, apparently, the current phase of the moon. At the moment, Flash is effectively not working for me – video is jerky and intermittent and audio is non-existent. It’s not good enough for 9 out of 10 uses of Flash.

Now, this has been the situation for years, ever since I started using NetBSD.  But I used to absolutely not care. You only need to go a few years back in time to arrive at an internet in which Flash was completely and utterly useless and technical people could quite happily go without it. The uses of Flash could be summarised almost completely as:

  • Hideous banner adverts on websites which included video and/or sound. These things are often mind blowingly obnoxious (doing things like playing sound when rolled over with the mouse) and invariably not interesting enough to be
    worth the increased loading time and security risk.
  • Website navigation systems created by incompetent and inconsiderate web developers who had no concept of convenience or accessibility and were perfectly happy to make people with dial up connections wait for 10 minutes for their site to load and for people who didn’t use a supported OS or browser to simply not be able to see it. Invariably, these navigation systems offered nothing of value which couldn’t be achieved using faster, safer, and more accessible HTML, perhaps with Javascript, and the associated websites were entirely missable. There’s a great rant about the problems with this sort of site here.
  • Interactive games or lengthy animations, the kind of things people email around to everybody they have ever met. Most of the time these things were a fairly mindless, unwelcome distraction from actual work. Sometimes they were genuinely amusing (I used to be quite fond of the Strongbad email animations on Homestar Runner). In either case, they were something one could live without pretty easily.

These 3 categories accounted for 90% of the Flash on the web. I used to consider Flash as a cancer on the web, sucking up vast resources and creating substantial division amongst the online community, while rarely contributing anything of value. I was happy, even proud, to not have a working Flash installation on my computer. I felt liberated. And then YouTube came along.

At first I simply ignored YouTube as well. I thought the idea of using Flash to distribute video was stupid. I did not understand what the problem would be with simply providing direct links to mpeg or avi video files which could be downloaded via HTTP or FTP. This would let anybody enjoy these videos regardless of their personal choice of operating system or browser. Furthermore, in the early days YouTube seemed to me to be little more than the new version of the final dot point in my list of Flash uses above – a way to distribute stupid, possibly amusing (but probably not) 5 minute videos that wasted my time. And some of the comments left on YouTube videos rank very highly amongst the stupidest things that humans have ever written (a point made in this xkcd comic). YouTube? No, thanks.

However, today I am forced to admit that YouTube has become useful. Maybe it became useful a long time ago and I missed it while grumbling with my stone tools and bearskin clothes in my Flash-free cave, I’m not sure. To be sure, there is still a tremendous amount of crap on YouTube, complete with shockingly stupid comments. But at the same time, a lot of intelligent, creative people are using YouTube to broadcast stuff which is genuinely interesting, educational or useful. After Itojun passed away I learned that he had posted a series of videos on YouTube explaining the basics of (what else?) IPv6, in both Japanese and English. Just last night, my brother-in-law Gareth pointed me in the direction of some YouTube videos by Johnny Chung Lee, a hacker from CMU, who has done some really clever stuff with the Nintendo Wii’s “Wiimote”, like building quick and cheap head-tracking hardware, electronic whiteboards and finger trackers. I also recently found via Reddit a video demonstrating “Shredz64”, a port of the popular Guitar Hero game to the Commodore 64, which uses the actual PlayStation guitar controller, hooked up to the C64 through a home-made adapter. These are just some things I’ve found relatively recently and thought were awesome – I have to assume there is a plethora of similar stuff on YouTube.

It’s not just YouTube, either. YouTube has popularised the notion of embedded video streaming in web sites. It crops up in a lot of places, and it’s often used for good things. Google’s technical talks come to mind first, but they’re not alone. Not only is there a lot of other stuff out there now, but it’s clear that there is only going to be more in the future. For better or worse, this is the medium that the internet community as a whole has chosen. I don’t doubt that if, for instance, internet-based citizen journalism takes off (and I sincerely hope that it does), YouTube or YouTube-like technology will be behind it.

Clearly, the situation regarding Flash has changed since I last evaluated it.  It now looks like these days I have more to lose than I do to gain by forsaking Flash. This is a sad situation, to be sure. It’s always a sad situation when in order to fully participate in the wonder of the internet one has to have one’s freedom of choice of OS and browser limited by the will of a company which stubbornly refuses to release source code, or at least file format documentation (Why not, Adobe? The Flash player is (financially) free anyway!). But pragmatism has to trump idealism at some point. Maybe, with Flash, this point has been reached?

Another mathematician’s lament

Via Reddit this morning I came across a 25 page essay by a research mathematician turned maths teacher named Paul Lockhart. The essay is called “A Mathematician’s Lament”. Here’s the discussion on Reddit, the introductory article and here is a .pdf of the essay itself. Since things on the web have an annoying habit of disappearing after a few years, I’m locally hosting a copy of the essay here. It’s a fantastic piece of writing which is essentially a critique of the way that mathematics is currently taught to students at high school. I’m really encouraged to see the overwhelmingly positive response to this on Reddit and I’ve been motivated to write a bit on the subject matter myself.

Although it’s not something I really talk about a lot (because I usually doubt the ability of non-mathematicians to appreciate it), I have been, in my mind, critical of modern mathematics education for a long time. There’s no way I could not be, having experienced both the dizzying intellectual highs of Galois theory and the soul destroying drudgery of being told “here’s how to solve a particular kind of problem. Now solve these 50 instances”. Here is a brief overview of mathematics in my life.

I never dreamed of being a mathematician. I was not good at it during high school. I wasn’t /terrible/ at it, mind you, and I wouldn’t say I was afraid of it, but I usually achieved scores somewhere in the 70-80% bracket on my assessments. I was average at best. It was rare for me to enjoy mathematics, too. I didn’t usually hate it, but sometimes I did. I remember quite clearly walking to the mathematics classes toward my final year of high school with feelings of intense dread. This was during the death march of preparation for the final exam at the end of the year. There was no actual teaching involved, just revision. We would be given gigantic problem sets and told to work through them, for the entire lesson (sometimes for two consecutive lessons!). Our teacher would sit silently at her desk and wait for students to approach her for individual help with problems they couldn’t solve. And that was it. Nearly an hour, or nearly two hours, of tedious silence and solitary drudgery through hordes and hordes of unmotivated questions.

I did enjoy physics, though. Once again, not at first. At some stage I read Stephen Hawking’s “A Brief History of Time” and I was absolutely fascinated. That book drew me into physics with absolute force. I followed it quickly with Brian Greene’s “The Elegant Universe” (a fantastic book from which was eventually – perhaps inevitably – made a rather mediocre television series). These books got me fascinated by physics and I was kept that way by a physics teacher I had later, one who had only relatively recently left university, who had real theoretical physics research experience and could still radiate enthusiasm for man’s quest to understand the universe. By the time I finished high school, I had a qualitative, lay person’s understanding of the broad concepts involved in relativity and quantum mechanics and modern cosmology. I could not wait to go to university to study theoretical physics and learn all the gritty details.

As anybody who has taken a first year physics course can probably appreciate, I was disappointed to say the least. First year physics scratches the very surface of relativity theory, and I’m not sure quantum mechanics was even mentioned. It was largely a rehash of what I had already learned in high school, except now we were allowed to do it with calculus (which of course could not be done at high school, because very many people in the physics classes were not also taking a maths course that included calculus). To be fair, doing this sort of stuff with calculus is the way to do it and we did need to be taught that. We also had to spend many hours doing experiments that I thought were stupid; measuring the density of brass using archaic equipment, or verifying that momentum really is conserved. I won’t say that I didn’t enjoy first year physics, because on the whole I probably did. But it was not what I had been longing for. Not by half.

At the same time as physics had been generally disappointing me, something quite unexpected was happening on the other side of campus. By necessity, I was taking a lot of maths classes at the same time. But this was not the dreary drudge work I had done in high school. This was a fantastic, brave new world! For the first time in my life, mathematics was not a disparate collection of unrelated problem classes which I was told how to solve without explanation and then made to grind through a multitude of. Mathematics was a whole, a flowing poetry of ideas which built on each other successively. It was a cohesive fabric woven of rigorous thought, a crystalline iceberg of internal consistency. I hadn’t seen anything like it before.

Half way through my second year of university, I took a deep breath and followed my gut instinct by dropping out of my physics degree and taking up a place in mathematics. It was not an easy choice to make – I deliberated over it for literally months. Furthermore, there was more motivation involved than I’ve written about here (I got quite swept up in the rage against reductionism in physics that the authors of several books on chaos theory and emergence exposed me to, but that’s something for another entry), but in very large part my decision to switch from one discipline to the other was a direct result of the rapturous joy I felt at my first true exposure to real mathematics. I am in the utmost agreement with anybody who expounds Lockhart’s sentiment that by presenting mathematics to the young people of the world in a way which robs it of its creative aspects, which shatters it into disparate, boring parts without even hinting at the interconnected beauty of the whole, we are predisposing them to hate it, to fear it, to fail at it. We are also denying them the ability to enjoy one of mankind’s greatest achievements. It should be done differently.

Lockhart’s ideas about teaching mathematics in a way which emphasises its simple naturalness and creativeness – its underlying core of just thinking about things, developing a feel for problems, experimenting with potential solutions, refining ideas – deserve a lot of attention. Unfortunately, and somewhat pessimistically, I don’t hold out a lot of hope for progress in this direction. The reason for this is that almost everybody who is in a position to change this situation has themselves already had their ability to appreciate maths destroyed by the drudge and grind style of education. Nevertheless, it’s encouraging to see these ideas being well articulated. If you know anybody involved in mathematics education, I encourage you to point them in the direction of Lockhart’s writing.

Another New Feedformatter

Well, true to my word, Feedformatter 0.3 is out tonight. I think I will make this the last of the “Release early, release often” rush releases. There is really very little sense to it. That said, I am enjoying this project and am pleased with the direction it is heading. All of the releases so far have been kind of ugly because they’ve been one-day improvements upon the previous version. Because the original was a quick-and-dirty solution that I didn’t so much design as just beat around until it worked, none of the subsequent versions have looked much better. I think I’ll leave 0.4 until this weekend sometime and make sure it is a substantial improvement. I know my way around the problem space much better by now and should be able to produce something that is half-way decent. Please look forward to it (as they say in Japan)!

In unrelated news, I have been reading the docs for CherryPy these past few days and have been thinking of giving it a shot with my new lighttpd setup. I have an idea for a first project (that leverages some of my existing free software) that I’ll write about when it looks closer to actually happening.

Lighttpd and new feedformatter

Last night I replaced the Apache 1.3.x webserver which had been hosting this site with lighttpd (pronounced “lighty”), a very small, light and fast webserver which emphasises the use of FastCGI to overcome the limitations of traditional CGI, instead of embedding language interpreters in the server. This is a view point that I approve of, for reasons of security and freedom of choice in server/language pairings. I’ve actually tried switching to lighty before, but ended up not because I couldn’t get PHP working with FastCGI (a requirement for my TombSaver page). It turns out if I’d read the MESSAGE that pkgsrc shows after you install php I almost certainly would have, but oh well. It’s done now and I’m happy with the change.

I’ve also released a new version of feedformatter – already! I am taking the “release early, release often” idea to quite an extreme with this latest project (normally I wouldn’t release anything in the state that feedformatter 0.1 and 0.2 have been). Realistically, this is no big problem – in all probability nobody has even used 0.1 yet anyway. The new version includes “pretty printing” of feeds (with newlines and indentation), a first stab at some compatibility with the Universal Feed Parser, better feed validation (though there is still a long way to go on this front) and slightly tidier code.

Yes, there probably will be a 0.3 release in the next day or two.

Now with feeds!

Travels

I’m back from honeymoon! Some of you may have noticed the nifty new travel maps that are up on my homepage. I expect these will change fairly slowly over time, due to the costs of international travel. I’ll try to get a working photo gallery of honeymoon shots up soon.

Web framework progress

Some of you may have noticed that there are now links from my homepage to valid RSS 1.0, RSS 2.0 and Atom 1.0 feeds for articles published on this site. These are generated by a Python module I wrote specifically for the task, which I have released on my software page as the Universal Feed Formatter (in reference to the well known, widely used and much loved Universal Feed Parser). I was actually surprised that I had to write my own module to achieve this. There is a lot of Python code on the internet for parsing various feed formats, but very little for producing the feeds themselves. I certainly couldn’t find anything that could take a single dictionary structure and produce files in various formats like feedformatter can. Hopefully someone else can take advantage of this convenience.
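To give a feel for the “one dictionary in, several formats out” idea, here is a minimal sketch using only the standard library. It is not feedformatter’s actual API: the dictionary keys, the function names `to_rss2` and `to_atom1` and the example.org URLs are all assumptions made up for illustration, and no validation is attempted.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed description; these key names are illustrative only
# and are not the structure that feedformatter itself expects.
feed = {
    "title": "Example site",
    "link": "http://example.org/",
    "description": "Articles from an example site",
    "updated": "2008-03-01T12:00:00Z",
    "author": "Example Author",
    "items": [
        {
            "title": "First article",
            "link": "http://example.org/first",
            "summary": "A short summary.",
            "updated": "2008-03-01T12:00:00Z",
        },
    ],
}


def to_rss2(feed):
    """Serialise the dictionary as a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for key in ("title", "link", "description"):
        ET.SubElement(channel, key).text = feed[key]
    for item in feed["items"]:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "description").text = item["summary"]
    return ET.tostring(rss, encoding="unicode")


def to_atom1(feed):
    """Serialise the same dictionary as a bare-bones Atom 1.0 document."""
    root = ET.Element("feed", xmlns=ATOM_NS)
    ET.SubElement(root, "title").text = feed["title"]
    ET.SubElement(root, "link", href=feed["link"])
    ET.SubElement(root, "id").text = feed["link"]
    ET.SubElement(root, "updated").text = feed["updated"]
    author = ET.SubElement(root, "author")
    ET.SubElement(author, "name").text = feed["author"]
    for item in feed["items"]:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "title").text = item["title"]
        ET.SubElement(entry, "link", href=item["link"])
        ET.SubElement(entry, "id").text = item["link"]
        ET.SubElement(entry, "updated").text = item["updated"]
        ET.SubElement(entry, "summary").text = item["summary"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_rss2(feed))
    print(to_atom1(feed))
```

The point of the design is that the caller only ever builds one structure, and each output format gets its own small serialiser that picks out the fields it needs.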

feedformatter is now integrated with the simple web framework that I mentioned in my last entry. You’ll also notice that I have a working (though imperfect) sitemap up, again generated by the framework. With these things done, I think I’ve now accomplished all of my original goals for this project. The code is by no means clean or reliable, so I won’t be releasing it at the moment, but it works and can be progressively polished over time. I will probably do this before I begin work on implementing some sort of commenting system for my articles.

I have been thinking, vaguely, about extending the framework to include blogging, and replacing pyblosxom with it. My reason for this is not really a direct dissatisfaction with pyblosxom. It’s the fact that a lot of the plugins that people write for pyblosxom do not work well (or at all!) with pyblosxom’s static rendering mode (which is the only mode I will use because I refuse to dynamically render static content each time it is viewed). This deficiency is the reason that there is no pagination on this blog (yet). Eventually this will become a problem, at which point I’ll either need to hack someone else’s pyblosxom plugin or switch to a new blogging platform – that new platform may as well be an extension of my own framework, because that will mean one less set of templates I need to maintain to match the rest of my site.

Web log analysis

Several months ago I installed the www/webalizer package from pkgsrc on my web server – it’s a web log analyser that I run from cron every hour. It compiles basic statistics on hits to my website (most popular pages, most popular entry and exit pages, viewer country statistics based on GeoIP, etc.) and then produces HTML reports. I kept a half-hearted eye on these statistics for the first few days, but then mostly forgot about them. I revisited my stats pages earlier this week, and was pleased to see how much traffic I was apparently getting.

Intrigued, I decided to step my analyses up a bit by configuring my web server to log user agents and referring URLs in addition to the basic information already logged. Now able to see user agents, it’s become clear that most of the traffic I thought I was getting was not actually from people but rather search engine crawlers. Oops. I’ve changed my webalizer settings now to ignore these hits, but it will be a while before I can collect meaningful statistics on the genuine human traffic.
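For the curious, the sort of filtering involved can be sketched in a few lines of Python. This is not how webalizer does it and not my actual configuration: it assumes the server writes the common “combined” log format, and the list of crawler hints, the function names and the sample log lines are all made up for illustration.

```python
import re

# Assumed "combined" log format:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical substrings that suggest a crawler; real lists are much longer.
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")


def is_crawler(agent):
    """Guess whether a user agent string belongs to a search engine crawler."""
    agent = agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


def count_hits(lines):
    """Count human, crawler and unparseable hits in an iterable of log lines."""
    counts = {"human": 0, "crawler": 0, "unparsed": 0}
    for line in lines:
        match = COMBINED.match(line.strip())
        if match is None:
            counts["unparsed"] += 1
        elif is_crawler(match.group("agent")):
            counts["crawler"] += 1
        else:
            counts["human"] += 1
    return counts


# Made-up sample lines using documentation-reserved addresses.
sample = [
    '192.0.2.1 - - [10/Mar/2008:13:55:36 +1030] "GET /index.html HTTP/1.1" '
    '200 2326 "http://example.org/" "Mozilla/5.0 (X11; U; NetBSD i386; '
    'en-US; rv:1.8.1.12) Gecko/20080301 Firefox/2.0.0.12"',
    '192.0.2.2 - - [10/Mar/2008:13:56:01 +1030] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    'not a log line',
]

if __name__ == "__main__":
    print(count_hits(sample))
```

Matching on substrings of the user agent is crude, since any client can claim to be anything, but it is enough to show why the raw hit counts were so misleading.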

The most interesting things the log analysis reveals at this point are:

  1. My NetBSD survival guide is the most popular page on the site. In fact, with some googling I was even able to discover that the URL for that page was given out in an OpenBSD IRC channel earlier this year! The survival guide was actually in fairly poor shape all this time, so I’ve put some effort into expanding and polishing it lately, given the important role it seems to play for my site. I still have a bit more to write, though, so watch that page over the next week or two for some activity.
  2. More than one person has wound up at this blog while searching for information on Itojun’s cause of death toward the end of last year. I did a lot of searching trying to find this out myself, and have come to the conclusion that there is not currently, and probably never will be, a definite answer to this on the web. The only real leads I’ve found so far are a claim on the OpenBSD news site undeadly.org that it was a car accident and a claim on Slashdot that it was suicide – neither of these is substantiated by any kind of hard evidence. It seems clear that Itojun’s family and close friends wish the cause to remain private, and I think the best thing would be for his well wishers to respect that.