Archive for October 2008

Bayesian inference in phylogenetics

This blog entry represents quite a substantial departure from my usual subject matter, in that it has a lot to do with molecular biology. To say that this is not my area of expertise would be an understatement. I have no formal education in biology beyond the bare minimum that every Australian high school graduate must get – I ditched it for physics and chemistry at the first possible opportunity. My entry point into the material discussed here is this paper, which I found by virtue of it being cited in this paper, which is substantially more relevant to my current field. So I make no guarantees of complete factual accuracy in what follows, although I’d like to think I haven’t misunderstood anything too severely.

Phylogenetics is the study of evolutionary relatedness between organisms – identifying which plants or animals have common ancestors. The end result of such study is the production of phylogenetic trees, or “trees of life”, which look something like this and diagrammatically convey our best estimate as to when and where two closely related species have diverged earlier in evolutionary history. Historically, I get the impression this has been a 1very labour intensive process, and one where it has been difficult to get any sort of objective measure of reliability: evolutionary relatedness has had to been inferred based on observed similarities in things like large-scale physical traits or behaviour.

Modern molecular biology, however, has opened the door to, appropriately enough, so-called molecular phylogenetics, in which evolutionary relatedness is determined in a more direct and objective way by comparing organisms in terms of molecular structure – specifically, DNA or some form of RNA. In this case each organism is represented by a GATTACA-style sequence of base pairs and we assess relatedness by using some sort of mathematical model of the evolutionary process itself. As a simplest approximation, we might represent evolution by a sequence of independent mutations, where a particular base pair is picked uniformly at random from the entire sequence and replaced by another base pair picked uniformly at random from the 16 possible pairs. It is possible to refine this first order approximation by making mutations in some areas of the sequence more likely than others and/or making some mutations more likely than others.

With a collection of aligned sequences and such a mathematical model of evolution, we can then proceed to infer phylogenetic trees by any one of a number of possible techniques – by appealing to parsimony, and seeking the tree which minimises the total number of required mutations; by appealing to maximum likelihood and seeking the tree whose underlying sequence of mutations is more probably than that underlying any others; or by using Bayesian inference to compute a posterior probability distribution over trees. These heavily computational approaches to molecular phylogenetics are called (once again, appropriately enough), computational phylogenetics. The Bayesian case is the most immediately interesting to me, because it’s the only one I’ve read about in significant detail and because the specific algorithms used typically belong to the same broad class of algorithms that I use in my own work (Markov Chain Monte Carlo approximations), although Wikipedia suggests that this technique is controversial at best. The trees produced via Bayesian methods in the paper cited above, though, certainly seem reasonable to my non-expert eye.

The awesome explanatory power of the ideas of evolution mapped onto genetics is poignantly apparent when we step back and think about what is being achieved here. We are providing as input to a computation the structure of certain molecules sampled from organisms – invisible, tiny fragments of whole animals – and, making only assumptions about mutation processes on these molecules (which are known for a fact to occur) we are receiving as output tree structures of relatedness which are in excellent broad agreement with the tree structures we made in the pre-genetic days based on our observations of macroscopic physical features of whole animals. How inspiring! How giddyingly exciting it is to think that it is by no means impossible that in the future, after we have refined our techniques of , developed more efficient and accurate numerical methods for Bayesian inference or other statistical techniques, and greatly increased the processing speeds of our computers, we may be able to input into a great machine cell samples from all the plants and animals we have discovered and, after a lengthy process of computation, be rewarded one day with a glance at the complete history of all life currently found on this planet!

Cherryblosxom 0.1 released

Back on Thursday I made the first public release of Cherryblosxom, calling it 0.1. It is probably of minimal use to anybody because it comes with no documentation and is missing crucial features (e.g. RSS/Atom feeds), but I think even a mostly-useless 0.1 release can be beneficial for a project in that it maintains a sense of momentum.

As you’ll be able to see if your browser does RSS/Atom auto-discovery, the development version of Cherryblosxom that powers this blog has basic RSS/Atom functionality – provided by a development version of feedformatter. I hope to have both of those development versions released sometime in the next week, but I have substantially less free time for that sort of thing for the next fortnight (and have had for the last week, hence the relative lack of blogging). If I do get the Cherryblosxom version out as 0.2, perhaps I’ll make some minor concessions in the direction of user friendliness. Once the product itself is a little more complete I will disttools to make it easy to install (in the usual way, python setup.py install) and eventually I’ll put it on PyPi.

I’ve not forgotten about my Prime Time series, but when I next have time to put out a “proper” blog entry I hope to talk briefly about the use of Bayesian inference in phylogenetics. This is admittedly not something I know a lot about, but I feel like I know enough to find it justifiably fascinating.

Correlation attacks in Wikipedia

The other day I wrote a Wikipedia article on correlation attacks. I first noticed a red link to that page from the article on stream ciphers (meaning that the correlation attack article had been referred to
but did not yet actually exist) when I was studying the at RMIT
in early 2006. I came back to the stream cipher article recently for some reason and was astonished to
find that the correlation attack link was still red – nobody had filled this rather substantial hole
(correlation attacks are a very basic part of stream cipher cryptanalysis) in Wikipedia in close to
three years! So I’ve made something of a start on it. I think there is plenty of scope for elaboration,
still.

I haven’t thought much at all about this sort of thing for a very long time and I quite enjoyed revisiting
it. Stream ciphers are not anywhere near as well publicised or discussed on the web as block ciphers are and
so they’re probably a weak point in most self-taught people’s understanding of cryptography. Arguably this
makes sense because they’re not used as often in the civilian world as block ciphers, but there are a lot of
interesting and fun problems associated with stream ciphers that it seems a shame to just miss out on.

Prime Time 02: The music of the primes

Time for the second installment of my series on primes. This time I want to talk about the distribution of prime numbers: how primes are spread out along the line of positive integers. Are there any patterns in the distribution, or is it entirely random? This is a question which has concerned number theorists for centuries. Answers to questions about the distribution of primes can be surprisingly hard to find, and sometimes tantalisingly vague. There are a lot of results in this field of study. The canonical one is the aptly-named prime number theorem, but I’m going to simply gloss over this result at the end of the post. Instead I am going to focus mainly on two other ideas which aren’t as “powerful” as the prime number theorem, but are much more accessible to beginners.

Before I go any further, I’ll point out that the title of this entry, “the music of the primes”, isn’t my own creation, it’s the title of a famous book aimed at introducing some of the magic  of primes to the public at large. The phrase has since become common within the field.  I actually can’t guarantee that it originated with that book. At any rate, I borrowed it.

Twin primes, and beyond

One extreme in the distribution of primes is the existence of the so-called twin primes. The term twin primes refers to any two primes who differ by two, i.e. if p is a prime and p + 2 is a prime, the p and p + 2 are twin primes. There are a lot of examples of twin primes near the start of the positive integers: 5 and 7 are twin primes, as are 11 and 13 and 17 and 19. Twin primes are as close together as two primes can be: it is impossible for p and p + 1 to both be primes, since at least one of two consecutive
integers has to be even, and even numbers are, by definition, always divisible by 2. There is exactly one exception to this rule – 2 and 3 are consecutive and are both primes. This works because 2 is the only even prime number – which leads some mathematicians to jokingly call it “the odd” prime, a play on words using the alternate meaning of “odd” as “unusual”.  This says a lot about the sense of humor of a lot of mathematicians…

How many twin primes are there? As it turns out, nobody knows for certain. The twin prime conjecture states that there is an infinite number of twin primes. As the name suggests, this is a conjecture, not a theorem – it hasn’t been proven, although people have been trying to prove it for centuries. Certainly, most people who have studied the problem believe that the conjecture is true. With modern computers it’s relatively easy to search for twin primes automatically, and in this way number theorists
have found a lot of them. The largest pair of twin primes yet found is 2003663613 x 2195000 – 1 and 2003663613 x 2195000 + 1, two huge numbers with 58711 digits in them! The fact that as our computers get faster we keep finding higher and higher twin primes is just one reason to believe in the infinitude of twin primes (although it certainly isn’t a guarantee that we won’t one day stop finding them). There are a number of rough arguments based on known facts that make it feel more sensible that there are infinitely many twin primes than that there aren’t. Regardless of whether or not the twin prime conjecture is true, the fact is that there are a lot of twin primes out there: that is, there are a lot of cases where primes are distributed as close together as possible, which is an important consideration in trying to develop a feel for how the prime numbers are spread out.

Before moving on to the opposite extreme of prime distribution, I should point out that the idea of twin primes can be generalised pretty easily to prime triplets and various other higher level groupings. Some of these close groupings of primes are also conjectured to be infinitely abundant.

Large gaps in the primes

Having looked at the issue of twin primes, where primes are positioned as close together as they possibly can be, now I want to look at the opposite extreme: where the number line contains long stretches of non-primes (usually called “composites”) in between consecutive primes. There’s a really cool, kind of surprising and very easy to prove result about these situations, which simply says this:

For any positive integer n, there exist n consecutive integers, none of which are prime.

That is to say, you can pick any number you like – even the number of grains of sand on Earth or the number of stars in the universe, or any other outrageously large number – and it is guaranteed that somewhere along the number line there are that many non-prime integers in a row. The proof of this result doesn’t just establish that these prime-free stretches exist, it explicitly shows us how to construct them. This might sound obvious, but not all so-called “existence proofs” in mathematics are nice enough to come with instructions like this. Sometimes we prove something exists by assuming that it doesn’t and deriving a contradiction (the same logical strategy that we saw Euclid use to prove the infinitude of primes earlier), which doesn’t necessarily leave us with any idea how to find the thing we just proved the existence of. Maths can be funny like that. Anyway, this is one of the nicer cases where the proof shows us exactly where to look.

We need to sort out a tiny bit of notation first. When a mathematician puts an exclamation mark after an integer, like 42!, what they mean is the product of that integer, in this case 42, with every positive integer less than it, all the way down to one. So, for example:

5! = 5 x 4 x 3 x 2 x 1 = 120,

10! = 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 3628800

15! = 15 x 14 x 13 x 12 x 11 x 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 1307674368000

As you can see, these numbers get very large very quickly. By the way, when reading these equations out, the exclamation mark is pronounced “factorial”, i.e. 42! is “forty-two factorial”.  Now we’re ready to prove our result. Let’s say we want to find a prime-free gap of length 1000, i.e. 1000 non-prime numbers in a row. These numbers are simply:

1001! + 2, 1001! + 3, 1001! + 4, …, 1001! + 1001.

These numbers are clearly consecutive and there are also clearly 1000 of them. Are any of them prime? 1001! + 2 isn’t prime, because both sides of the addition sign are divisible by 2: 2 obviously is, and 1001! is divisible by any integer less than 1002, because it is, by definition, 1001 x 1000 x 999 x 998 x … x 2 x 1. So the first number in the sequence isn’t prime. 1001! + 3 also isn’t prime, because for the same reason both sides of the sum are divisible by 3. It should now be clear that all of the numbers in the sequence are divisible by something other than themselves, because of the way they are constructed. This isn’t a proof, yet, of course, it’s merely an example, but it’s very easy to generalise. To find n consecutive non-primes for any value of n, you just have to consider the sequence:

(n+1)! + 2, (n+1)! + 3, (n+1)! + 4, …, (n+1)! + (n+1).

So here we have a result that is at the opposite extreme of the conjectured infinitude of twin primes: there are an infinite number of cases where primes are spaced as far apart as you like.  In the interests of showing off some pedantic preciseness that being a good mathematician necessitates, I should point out that these theorem establishes that there exists at least one sequence of n consecutive composite numbers for any n. The sequence that is explicitly constructed in the proof is not guaranteed to be the only sequence of the length, nor the first such sequence to occur along the line. This isn’t really an important distinction for the purposes of demonstrating a contrast to the conjectured infinitude of twin primes, but it might be an important detail to keep in mind if we were to ever, say, use this result to attempt to prove another. But I’m digressing.

The big picture

So far we’ve only seen a kind of frustrating inconsistency in the distribution of the primes: sometimes we find them as close together as they can possibly get, and sometimes we find them so far apart that our minds can’t really grasp the significance of the number of composites between them. We haven’t seen yet anything on whereabouts these two phenomena happen along the number line or whether one happens more often than the other. The truth is that hard and fast rules about the distribution of primes are something modern mathematics is distinctly lacking. The more widely applicable a rule is, the more vague it seems to be. A sense of romance about this tantalising mix of apparent randomness and apparent order is what attracts a lot of people to the study of primes in the first place.

In the interests of completeness, having shown you some very specific results about the distribution of primes, I should talk about the prime number theorem, which is one of those results that says something kind of vague, but is more widely applicable.
The formal statement of the theorem that you’ll see on Wikipedia is not very accessible to non-mathematicians, so here’s a rough or, as a mathematician would say, “hand waving” statement that captures the gist of it:

As the positive integer n gets larger and larger, the chances of a random number picked from somewhere close by to n being prime gets closer and closer to 1 / ln(n).

Here ln(n) is the natural logarithm of n, which, to be very vague, is just something associated with a number than you can ask your calculator to figure out for you, much like the square root of n or the sine or cosine of n, etc.  This hand waving result can give us an intuitive sense for what happens to the distribution of primes as we get far out into the number line. If n = 100,000, then ln(n) is roughly 11.5, so about 1 in 12 numbers near 100,000 is prime. If n = 1,000,000, then ln(n) is roughly 13.8, so
about 1 in 14 numbers near 1,000,000 is prime. At n = 10,000,000 it’s 1 in 16 and at n = 100,000,000 it’s in in 18. If n is a googol (that is, a 1 followed by 100 zeros), about 1 in 230 numbers is prime. So we can see that prime numbers become more spread out the further we go along the line, but only in an “on average” sense of picking random numbers “near” some point on the line. The precise details of what is going on at any particular point are something a mystery until we actually crunch the numbers and look.

As a final point of interest, for those of you who get a lot more mileage out of a picture than a pile of words and theorems, here’s a graph showing the length of the gap between consecutive primes for the first 1000 such gaps. I got this data from the On-Line Encyclopedia of Integer Sequences (gaps between consecutive primes is sequence A001223). I drew the graph using gnuplot. At the very bottom left you can see a single red cross that is lower than every other cross on the graph: that represents the gap of 1 between the primes 2 and 3. There’s only one cross this low down because this is the only gap of 1 that ever happens in the sequence of primes, as we discussed 1earlier. All of the red crosses on the next level up represent gaps of 2: there is one cross for
every twin prime. The reason that the crosses appear to come in horizontal stripes is simply reflective of the fact that the gap between any two primes always has to be even (except in that one exceptional case of 2 and 3) – if you add an odd number to any prime other than 2 you get an even (and hence non-prime) number. At roughly n = 200 and n = 950 you can see unusually high crosses sitting above the crowd – hints at the fact we saw earlier that the gap between consecutive primes can actually
become arbitrarily large.

An (almost) 150 year old mystery

Although I’ve said all I really want to say on the distribution of primes for this entry, I would be remiss if I wrote something about the distribution of the primes and didn’t give at least a passing mention to something called the Riemann hypothesis. This is the grand daddy of unsolved problems in modern mathematics. No kidding. If you prove this hypothesis, first proposed in 1859 by Bernhard Riemann (a very important German mathematician who, amongst other things, developed mathematical ideas which later came to be important in Einstein’s theory of general relativity), to be true, you will win one million US dollars from the Clay Mathematics Institute (proving the Riemann hypothesis is one of seven so-called “millennium problems” that the CMI is offering a one million dollar reward each for) and serious fame. But don’t think this is easy money: the world’s very best mathematicians have been considering this problem for almost 150 years and nobody has found a proof yet. Aside 1from the simple attractiveness of solving a problem nobody else could in 150 years, one of reasons that so much effort is thrown at proving the Riemann hypothesis true is that if it is true, we’ll be able to use it to prove a lot of other interesting results. One of
these is a result similar to the prime number theorem but much stronger which would greatly improve our understanding of the distribution of prime numbers. This is some pretty heavy mathematics, and I freely admit that I don’t know enough about the connection between the Riemann hypothesis and prime distribution to talk intelligently about it here – but I know that is a very important connection and provides a lot of the motivation for actually working on the problem.

Like the twin prime conjecture, most mathematicians who have studied the Riemann hypothesis strongly suspect it is true – we just haven’t found a way to prove it beyond doubt yet.

Next prime time

Next prime time I want consider a slightly more practical issue about primes, namely: how do we actually know that a number is prime? This might sound silly, but when we are talking about the search for new primes larger than any found before, it is an entirely practical question to ask. We can’t take a candidate number n which might or might not be a new prime and then try dividing it by 2, then by 3, then by 4, and so until we’ve tried dividing it by everything up to n.  If n has 12978189 digits in it (like the current largest known prime) then trial division by every number less than n would take even the fastest computer on Earth an amount of time many orders of magnitude longer than our best estimates of the age of the universe. Thus we enter the field of
primality testing, which is all about clever tricks to test whether or not a number is prime as quickly as possible. Although the question this field tries to answer is conceptually simple, it turns out that the best known methods for efficient testing involve extremely advanced mathematics. I’ll introduce to one of the simplest improvements on trial division and discuss its strengths and weaknesses. Please look forward to it!

Prime time 01: An infinity of building blocks

Time for the first entry in my promised series on primes. In this entry I hope to introduce prime numbers as the fundamental building blocks of integers (whole numbers for those of you who’ve been out of school for a while!), and then show how we know, beyond any doubt, that an infinite number of primes exists. Both of these results, or results very similar to them, are ancient, and have been either known or suspected for centuries. Despite this, they’re not things people ever seem to mention when they introduce someone to the concept of primality, which I think is a strong contributing factor to why few people understand the intense sense of interest and reverence that mathematicians seem to bestow upon the primes.

Primes as building blocks

So, what do I mean when I speak of primes as building blocks? I am referring to a result known to number theorists as the fundamental theorem of arithmetic, which states this:

Every positive integer greater than 1 can be uniquely written as a product of prime powers.

For example, we can write the number 6936 in the form:

6936 = 23 x 3 x 172,

which is special in the sense that 2, 3 and 17 are all primes – we’ve expressed 6936 as a product of prime powers. This way of writing a number is usually called its “canonical factorisation”. Here are a few more examples of canonical factorisations:

282475249 = 710,

12345 = 3 x 5 x 823,

860552 = 23 x 7 x 112 x 127.

There are two aspects to this theorem – the fact that such a representation exists for every integer, and the fact that it is unique, i.e. there is only one such representation for every integer. I won’t discuss the rigorous proof of either aspect here as I don’t think the details are sufficiently enlightening to impose the burden of pedantic detail on readers who are not mathematicians (later in this post I will go though a proof where I think this trade off is more than fair). But it is easy enough to develop an intuitive understanding of why the first aspect is true: if a number is itself a prime then the canonical factorisation of that number is just that number itself, and we’re done. If a number isn’t prime then by definition it is divisible by something else, in which case we can write it as a product of two other numbers. If those other two numbers are both prime then we’ve arrived at a canonical factorisation. If they aren’t, then once again by definition we must be able to split them up further. Recursive application of this simple principle eventually leaves us with a product of primes, which constitutes the kind of factorisation we’re looking for.

In light of the fundamental theorem of arithmetic, it makes sense to say that prime numbers are building blocks for the integers – every integer is made from a finite number of primes multiplied together. With this in mind, it is entirely natural for number theorists to make primes an important object of study – just as natural as it is for biologists to study cells as the building blocks of living organisms, or for chemists to study atoms as the building blocks of molecules. In the same way that an understanding of cells or atoms can lead to a better understanding of biological processes or chemical reactions, sometimes it is possible to answer a question about a particular number by first breaking it into its prime components and then answering the question about each prime component individually.

Hopefully it is now clear that the property of being prime is in no way an arbitrary numerical curiosity like being perfect.  There is a combination of theoretical fundamentality and practical usefulness in primality that makes it entirely deserving of close attention and careful consideration.

The infinitude of primes

Having established the importance of primes as building blocks of all other integers, a natural matter of concern is the question: “How many primes are there?”. It may be tempting to suggest that there “obviously” has to be an infinite number of primes to account for the fact that there are obviously an infinite number of integers. But this is not the case. A finite set of primes would be sufficient to generate an infinite amount of integers, since there is an infinite number of integers from which to select powers. If there were only 3 primes in existence, let’s call them p1, p2 and p3, we can still multiply them together in an infinite number of ways:

p11 x p20 x p30,

p10 x p21 x p30,

p10 x p20 x p31,

p11 x p21 x p30,

p11 x p20 x p31,

p10 x p21 x p31,

p11 x p21 x p31,

p12 x p20 x p30,

p12 x p21 x p30,

p12 x p20 x p31,

It is (hopefully!) clear that we need never stop – we may continue raising our mere three primes to arbitrary choices of arbitrarily high integer powers. So there is some substance to the question after all: is it possible that after, say, one million primes, the set of possible integers you can make by multiplying together different powers of primes actually encompasses every integer?

It turns out that there are, in fact, an infinite number of primes, but I feel it is important to establish that this is not obvious, as I have had people tell me it is in the past.  The proof that there is an infinite number of primes is attributed to the ancient Greek mathematician Euclid, and it is in my opinion one of the most important proofs that a non-mathematician can be shown. It is elegantly simple, demonstrates a method of reasoning known as reductio ad absurdum which is extremely powerful and is used everywhere in mathematics, and offers a glimpse at the kind of thinking that mathematicians actually do, which is almost nothing at all like what most non-mathematicians think it is.

The proof starts by assuming that there is a finite number of primes. We don’t need to specify an exact number, it just needs to be less than infinity. We’ll call it “n” so we can refer to it as something. If this makes you uncomfortable, you can mentally
replace n by 42,000,000 everywhere after this point, it won’t change the details. What’s important is that based on this assumption we’re going to logically derive something that contradicts the assumption. We’ll see exactly what that means for our assumption later.   Let’s continue…

So, we have n primes, let’s call them p1, p2, …, pn. We can multiply all of these primes together and then add 1 to the result to get a new number, which we’ll call m:

m = p1 x p2 x … x pn + 1

Let’s think a bit about m, specifically about what we can divide it by. Remember that the fundamental theorem of arithmetic says that we can write m, just like any other integer, as a product of prime powers. That means that we should be able to divide m by one of our primes p1, …, pn. So let’s try to do that:

We can’t divide m by p1. That would leave us with:

m / p1= p2 x p3 x … x pn + 1 / p1,

and 1 / p1 certainly isn’t a whole number. We can’t divide m by p2 either, as that would leave us with:

m / p2 = p1 x p3 x … x pn + 1 / p2,

and again 1 / p2 isn’t a whole number. In fact, it should now be clear that m has been cleverly constructed so that it cannot be divided by any of the n primes in our set. Which, since the fundamental theorem of arighmetic is true (I know I did not prove
this earlier, but hopefully gave you sufficient grounds on which to trust me!), means that there must be some other prime number out there, missing from our set, which divides m. But we started with the explicit assumption that p1 through to pn comprised the entire, finite set of primes. We have contradicted ourselves!

What does this mean? We’ve shown that assuming that only a finite number of primes exist (any finite number of primes – see how the actual value of n played no important role in getting to this point?) leads directly to a logical inconsistency. The entire point of mathematics is that things should be consistent. Any assumption which leads to inconsistency has to be wrong and so we reject it. This is a general technique that mathematicians use often to decide whether or not something is true: if by assuming something is true you can use logical steps in thinking to “prove” something you know is false (or, equivalently, to contradict something you know to be true), then that initial something you assumed must be false, to preserve the consistency of mathematics as a whole. In this particular case, the initial assumption was that there are only a finite number of primes, and based on this assumption we were able to contradict something that followed from the known truth of the fundamental theorem
of arithmetic. Thus we are forced to reject our original supposition, which leads to our goal: certainty that there must be infinitely many primes.

If you’ve followed this argument for why there must be an infinite number of primes, you shouldn’t discount the significance of that achievement! This very line of thinking dates back to at least 300 years BC, and the very same logical steps have been followed through by countless people in every country of the world ever since then. This may be one of the most often-shared abstract mental experiences of the human race! It also provides an excellent first glimpse into the world of pure mathematics, which has no concern whatsoever with actually performing computations or solving equations or doing any of the other things you probably remember maths being all about in school. Instead its all about figuring out universal truths about numbers (and other things), like the fundamental theorem of arithmetic, or answering tricky questions with certainty, not by relying upon particular examples or special cases or gut instinct, but by using clever logical arguments, like Euclid’s argument for the infinitude
of primes.

Next prime time

Now that we have some appreciation for why primes are interesting, and know that there are infinitely many of them out there, I’ll use the next Prime Time entry to discuss what feels like the next natural question: where do we find the primes? Is there a pattern in the way they are spread out through the integers? Are they sprinkled about evenly? Are they bunched up toward the low numbers, or are they bunched up out somewhere near infinity? Are they distributed completely randomly, with no rhyme or reason? Questions like these and their sometimes surprising answers are a source of a lot of the sense of mystery and captivation that a lot of people feel about primes. Please look forward to it!

On the nonsense of “cyberbullying”

This article in The Australian represents some of the latest noise in a steadily increasing clamour in the Australian mainstream media about the phenomenon of “cyberbullying”, which is defined as “harassment via text messaging, internet or email” (that’s right, internet or email. It’s clear that “internet” should be glossed here as “social networking sites”, like Facebook or MySpace). Since the beginning of this trend I have had very little respect for the notion of cyberbullying as a serious problem and have been disappointed to see that the vast majority of discussion on the matter betrays stunning ignorance on the part of those most upset by it. Although it’s a trifling issue not overly relevant to the overall debate, there’s some evidence of this in The Australian’s embarassing claim that “the latest addition to the cyber vocabulary is “flaming” — a form of online verbal abuse using capital letters to express aggression”. Use of the word “flaming” to refer to online abuse is, of course, as old as computer networking itself. The jargon file, the most authoritative source on hacker slang, suggests that this use was practiced at places like MIT in the late 60s. You don’t need to be deeply immersed in hacker culture to know this, either. Anybody with any appreciable internet usage – certainly anybody who is actually qualified to be writing for the IT section of a major newspaper – heard the term “flaming” years before the term “cyberbullying” had been dreamed up. But I digress.

Much more troubling than the media’s failure to get basic terminology right is the fact that nowhere in the media have I seen anybody bring up the most immediately obvious difference between cyberbullying and real bullying, the difference which makes cyberbullying a non-issue: the ability to effortlessly block the bully with 100% efficiency. It might be very nearly or even completely impossible for a young child to avoid the schoolyard bully who catches the same bus as him, or who hangs out in central hallways or infront of the most popular piece of playground equipment – but it takes literally less than a minute to stop somebody from ever contacting you again via email or any of the major instant messaging or social networking sites. The process is quick, painless, free and foolproof.

I expect a lot of people would retort to this fact with the suggestion that we cannot reasonably expect young children to know how to perform such complicated tasks with their computers (a rather disingenuous claim, as most school aged children either are or quickly will become more proficient with using computers than the parents who would make the claim). This argument can pretty much be rejected outright. Here’s a screenshot of the conversation window of Windows Live Messenger (courtesy of www.msgweb.nl), which is by far the most commonly used instant messaging platform amongst Australian youth. The icons along the top, with the pictures of telephones and playing cards, are used to send other people files, initiate voice conversations, play games with them, etc. Children sure as hell know that they’re there and know how to use them: they control some of the main functionality of the program. The rightmost icon, of the green face behind a prohibitive red circle? It “blocks” the person you’re currently talking to. They can’t send you messages anymore. They can’t see whether or not you’re online. They cannot bully you. And it will stay that way until you decide to unblock them. It’s not a complicated series of menus of dialogs children have to navigate to achieve this, they have to use a single intuitive-looking button in a familiar part of the window. If they can’t figure it out, their parents certainly should be able to. If the children and their parents combined can’t work this out in one night, or if parents aren’t being involved at all at with what their young children are doing online, then there are much bigger problems afoot than cyberbullying.

Even if we grant both school children and their parents incredible levels of stupidity, a half hour classroom session and a pamphlet sent home with screenshots at the start of each school year would get people up to speed on this. It’s just not a big problem, at all. Yet, to quote the Australian article, “the Government has admitted it is yet to fully understand the problem — or the extent of it”. That’s quite a worrying indictment of our government’s understanding of some of the most basic and common things that happen on the internet. What’s going to happen when they try to tackle an internet related problem that’s actually…a problem?

I have never seen any media treatment of the issue even mention that blocking contact from people on the internet is possible, let alone trivial. Instead they spout ridiculous ideas about how the government can help solve this problem, about how there should be laws passed against cyberbullying and court cases should be held (this isn’t suggested in the referenced article above, but it certainly has been discussed in the media). What planet are these people on to think that these solutions are even possible, to say nothing of sensible? For the government to monitor all the various different forms of communication by which cyberbullying supposedly takes place (many of which are based on proprietary protocols subject to compatibility-breaking changes without noice) in anything close to real time is technically infeasible without spending a truly vast amount of time and money, entirely out of proportion to the severity of the problem, and would also constitute giving the government a disturbing amount of control over the private communication of citizens, which is something fraught with problems both ideological and, as I discussed recently, practical.

Without such incredibly expensive and invasive monitoring, any sort of court cases over the issue are out of the question on the simple grounds that nobody can prove beyond reasonable doubt that anybody said anything to anyone. Are we going to take little Johnny’s word and an easily doctored screenshot or print out as sufficient evidence to find little Timmy guilty of something? And even if we are, what punishment are we going to impose? Is little Timmy to be banned from Windows Live Messenger for some period of time? How would this be enforced? Timmy’s parents’ ISP could block connections from their computer to known Live Messenger servers (in principle, that is, once again for them to actually start doing this in practice would cost time and money to put the procedures in place), but that would indiscriminantly block everyone in Timmy’s household from using the service, including his well-behaved and angelic sister who has never flamed anybody in her life. It would also do nothing to stop Timmy from using emails or Facebook or any other online medium to continue doing exactly what he was doing before. The entire concept of any sort of government control of the problem is plainly riddled with problems and shortcommings and should not be taken seriously by anyone.

In summary: cyberbullying is something it is realistically impossible for the government to fix and absolutely trivial for individual parents, or even their children by themseles, to fix. All it would take is a minimal amount of education about something they really should have figured out for themselves by now. The supposed severity of the problem as it has been portrayed in the media is a gross exaggeration born staggering technial ignorance, and perhaps also cynical acceptance of the fact that sensationalist news about supposedly serious problems sells well. I only hope someone in the media or government actually realises this eventually, and I look forward to the day I never have to hear the term cyberbullying again.

Link dump 000001

Here’s a quick dump of a few interesting links I’ve stumbled across in the last week or so, in lieu of a proper entry for today:

  • Brazilian American hacker Gustavo Duarte has a fascinating series of blog posts about computer internals and low level operation: motherboard chipsets and the memory map, how computers boot up and the kernel boot process. These subjects represent some of the greatest gaps in my understanding of computers, and these excellent posts went a long way toward plugging those gaps.
  • The Ludwig von Mises Institute has a very interesting article on the political effects of the internet.
  • As a former iaidoka I have a fairly strong general interest in anything related to swords, including the smithing process, so I quite enjoyed this blog post explaining recent scientific discoveries relating to the legendary Damascus blades of the Crusades.
  • Peter Norvig, best known as co-author of a popular textbook on artificial intelligence, has a page of infrequently asked Python questions that I’ve somehow manage to avoid stumbling across until now. Some of the questions are odd or silly, but there looks to be a few genuinely useful ideas in there, and all the ideas, even the silly ones, are thought provoking and a good tour of the less well known corners of Python.
  • Here’s a positively fascinating video (and a surprisingly old one considering that I haven’t heard of anything even remotely this cool happening in this field more recently) narrating the evolution of simulated block creatures in virtual environments. A more thorough version of this video, say 30 minutes long, would make a fantastic addition to high school education on the issue of evolution.

I am starting to feel pretty confident about Cherryblosxom’s current state. A 0.1 supremely alpha release may very well happen this week. Stay tuned!

Absurd UK surveillance ideas

Amongst others, the Times Online is reporting on considerations by ministers of the UK government of a plan to store and monitor every email sent by every person in Britain. The supposed reason that such an insane system is need, of course, is to fight terrorism. If we (like almost everyone else in government or the media) set aside the all too salient fact that terrorism typically kills less Britons each year than accidental drownings, and suppose that the government really should be spending time and money trying to do something about it, ample grounds still exist for criticism of this scheme.

While the UK government may conceivably be able to eventually muster the sheer amount of hardware required for intercepting and storing such a vast quantity of emails, it is entirely infeasible that they are ever going to have the ability to read any encrypted emails that they may have harvest. Furthermore, competent terrorists know this. Competent terrorists know they can use PGP or GnuPG to encrypt their emails and rest assured that the UK government simply cannot read them, short of physically apprehending the terrorists and torturing passphrases out of them. The very fact that as soon as a major terrorist incident happens the relevant government starts making loud noises about the threat encryption poses makes absolutely sure that terrorists know they can do this. So they will do it, and this scheme will fail at its intended task, wasting a horrendous amount of taxpayer’s money and putting undue strain on the country’s internet infrastructure. It’s a horrible idea.

But it gets worse.

All the innocent non-terrorists in Britain will, with a few rare exceptions, continue not to encrypt their emails, so these will be collected and stored by the government. This is a cause for tremendous concern because the UK government has recently made it embarassingly clear to the world that when it comes to the secure storage of sensitive data, they are nothing short of incompetent. Just look at these incidents – each of them from 2007. To be fair to the UK government, they’re not alone in this regard, and Google will help you find just as many or more breaches of a similar scale by the US government.

Naturally the loss and theft of hard drives and disks is bound to happen from time to time, but the possible impact of these breaches can be reduced to zero by using readily and cheaply available encryption technology. In none of the cases cited above was this data encrypted like it should have been, suggesting that data security is either not taken seriously by the UK government or it is handled by people not qualified to be handling it. When unencrypted disks full of everyday citizen’s personal emails are lost or stolen or bribed away from the government’s hands – and based on all the evidence we have so far, this is more likely than this email surveillane scheme actually thwarting a terrorist plot – end up anonymously posted to the internet, the consequences will be severe.

Details about people’s personal finances, love lives, political and religious beliefs will be exposed for all to see. Commerically sensitive material of every imaginable kind will be available to every company’s most feared competitor. Identity theft, industrial espionage, harassment and stalking are all likely consequences. The risk is simply far too great, and entirely disproportionate to any reasonably expectable benefits.

This rant says nothing about the basic principles of freedom and privacy that this issue obviously treads on (for a well-written and concise rebuttal to the standard issue “If you’ve done nothing wrong then you’ve got nothing to hide” justifications that are inevitably thrown around on this matter, see Bruce Schneier’s excellent “The Eternal Value of Privacy“), which are also well worth consideration. In an attempt to make the rejection of massive government surveillance programs appeal to a wider audience, in this post I’ve gone with a slight twist on an old saying and not resorted to considering malice where it is adequate to consider incompetence.

Although not relevant here, it bears mentioning in closing that the ideas of government incompetence at secure data storage discussed here should be the first thing that pops into your head when a government suggests (and, depressingly, this really does happen) that they should keep a record of everyone’s fingerprints, eye scans, DNA or any other biometric credential. When those details are lost or stolen (and how confident can you be that they never will be?), you can’t have them replaced like you can your credit card and passport. They’ll be on the internet for good.

Prime time

As was reported at various places (see, e.g., the BBC), in August this year a prime number (in particular a Mersenne prime) was discovered which was larger than any previously known prime – the first 10 million digit prime number! The number, 243112609-1, was discovered by a distributed computing project called GIMPS, the Great Internet Mersenne Prime Search. Congratulations to everyone involved!

I have been wanting for a while now to introduce more non-programming related content to this blog and the discovery of this new prime has motivated me to write somewhat regularly about issues in number theory and about prime numbers and related concepts especially. I think prime numbers are very misunderstood, or perhaps it would be better to say “under-understood”. Almost everybody understands what a prime number is, but very rarely is anybody ever told why it is that mathematicians get excited by them. The BBC story linked to above, for example, offers not even a token explanation as to why millions of people donate their spare computer power to a world-wide collaborative effort to find large new primes, or why organisations like the EFF are willing to offer large cash prizes for people who find primes. Do prime numbers actually have practical uses? Yes! Are there good intellectual reasons to be more interested in them than all the other, composite numbers? Yes! Is there something special and enchanting about primes that causes people to become obsessed with trying to understand them? Yes! But it’s an exceedingly rare school teacher who mentions any of these things when teaching kids what a prime number is. To be fair, perhaps that in particular isn’t unreasonable – in Australia, at least, children are taught what a prime number is at a fairly young age, too young to really appreciate many of these things. Perhaps the problem is that primes, and number theory in general, aren’t ever revisited later in a child’s education, where these things could be made clear.

I have personally fallen “victim” to this. Until I switched from a physics to a mathematics major in university, I had no idea why anybody would particularly care about prime numbers. I thought that the division of numbers into primes and composites was no more compelling than the division of numbers into odd numbers and even numbers, into perfect numbers and imperfect numbers, into triangular, square, etc. numbers. Many of these distinctions struck me as rather arbitrary and uninteresting ones, based on frankly quite childish and superficial observations. I still hold this opinion today for most of these divisions, but I have learned a lot about primes (and their abstract algebraic generalisations!) since then and now understand that they certainly do not belong in the same mental pigeon hole as perfect or square numbers.

In an attempt to help spread some of this understanding, over the next few months I’ll try to write entries explaining some of the reasons prime numbers are interesting, some of their practical applications, explain some simple algorithms to do with prime related concepts (like primality testing and factorisation) and point out some of the many compelling mysteries about primes that continue to capture the attention of mathematicians today. A survey of interesting stuff to do with primes will take readers all the way from ancient Greece, around 300 BC, to the early internet days of the 1970s and 1980s, with a few detours through 17th century France. I’ll try my best to make the material interesting and accessible to non-mathematicians. Please look forward to it!

Another Unix shell in Python

A while ago I posted the code for a simple Unix shell written in Python. Not too long after this I expressed my delight at discovering the Python standard library’s cmd module. It turns out that these two concepts go together really well. Here’s a much nicer simple Unix shell, less than 100 lines long. It still lacks input/output redirection or pipes, but by virtue of subclassing Cmd it has gained a command line history and tab completion. The tab completion is not the smoothest you’ll ever see – this is basically the first thing I could get almost working – but I think this solidly demonstrates that Cmd makes a fine base for a Pythonic Unix shell.

from cmd import Cmd
from glob import glob
from os import access, chdir, execv, fork, getenv, getcwd, listdir, wait, X_OK
from os.path import exists, isdir
from sys import exit

class Shell(Cmd):

        def _substitute_env_var(self, token):
                if token.startswith("$"):
                        return getenv(token[1:], "")
                else:
                        return token

        def precmd(self, line):
                words = line.split()
                words = map(self._substitute_env_var, words)
                return " ".join(words)

        def _get_executables(self):
                result = []
                for dir in getenv("PATH").split(":"):
                        try:
                                result.extend(filter(lambda(x): access(dir+"/"+x,X_OK), listdir(dir)))
                        except OSError:
                                pass
                return result

        def postcmd(self, stop, line):
                self.prompt = getenv("PS1", "$ ")
                self.executables = self._get_executables()
                if stop:
                        return True

        def completenames(self, text, *ignored):
                return filter(lambda(x): x.startswith(text), self.executables)

        def completedefault(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):], glob(prefix+"*"))

        def do_cd(self, line):
                if not line:
                        line = getenv("HOME", "/")
                try:
                        chdir(line.split()[0])
                except OSError, (errno, strerror):
                        if errno == 2:
                                print "Directory %s does not exist." % line.split()[0]

        def complete_cd(self, text, line, begidx, endidx):
                prefix = line.split()[-1]
                return map(lambda(x): x[len(prefix)-len(text):],filter(lambda(x): isdir(x), glob(prefix+"*")))

        def do_cwd (self, line):
                print getcwd()

        def do_EOF(self, line):
                print ""
                return True

        def default(self, line):

                argv = line.split()
                command = argv[0]

                found = False
                path = getenv("PATH")
                for dir in path.split(":"):
                        if exists("/".join((dir, command))):
                                found = True
                                if(fork() == 0):
                                        execv("/".join((dir, command)), argv)
                                else:
                                        wait()
                if not found:
                        print "%s: Not found" % command

shell = Shell()
shell.prompt = getenv("PS1", "$ ")
shell.executables = shell._get_executables()
shell.cmdloop()

Lately I’ve been trying to make an effort to use Python’s functional programming features, like map and filter, instead of cumbersome piles of fors, ifs and appends, which is evident in this code.