Tom Morris

23 February 2010

A pungent mix of programming, philosophy, pedanticism, procrastination, perplexity, peripheral political polemic, and platters of preposterousness.

A programming language dictates possibility, not fate

I just read on Twitter that a speaker at Future of Web Apps said that Python was “slow”. This later became “slow at threading”. Now, this may be a perfectly reasonable statement - the Python users I know tell me that Python sucks at threading. Of course, it probably doesn’t suck as much as Ruby does at threading! Heh.

I thought I had better set the sprinkler system going pre-emptively. When we discuss the performance characteristics of software, people get very hung up on language and ignore everything else.

Last year, I wrote a script to extract data out of a whole bunch of HTML files I had downloaded. I used Ruby with Hpricot. I had about 2.7 GB of HTML files to go through and I wanted to extract the contents of the h1 element and a span with a particular classname. Hpricot was pretty slow back then. And I hadn’t learned Nokogiri, which I now know. I cooked up a Ruby script in about fifteen minutes that did what I thought I wanted. I ran some tests and worked out that it was going to take four days for the script to run. This was because each iteration was going to take five or six seconds. This is on a two-point-whatever GHz dual-core MacBook.

I rewrote the code in Java. I could have rewritten it in C, I guess, but I’m a Java weenie and I love my garbage collector. The code I wrote in Java used java.io.BufferedReader to read through the files, loading each line into a String object, then using java.lang.String’s startsWith method on the string to see if it matched the HTML element strings. Once I had read all the lines that I cared about, I closed the file, moved on, and spat out the strings I cared about to System.out. This is one pretty tight for-loop. This code took about 25 minutes to write - plus a few minutes to go and get a library off the Internet - and the code took about 8 minutes to run. It turned out that I had made a goofup in one of the variable names. So after those eight minutes had run, I changed the code, recompiled it and it worked.
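A minimal sketch of that kind of loop follows. It is not the original code, which isn't reproduced in this post: the class name LineScanner, the scan method and the sample prefixes are all made up for illustration, and it assumes each element of interest begins on its own line (after trimming whitespace), which is what lets startsWith stand in for a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of line-by-line prefix matching; not the original program.
class LineScanner {

    // Returns every line (trimmed) that starts with one of the given prefixes.
    static List<String> scan(Reader source, String... prefixes) {
        List<String> hits = new ArrayList<>();
        // try-with-resources closes the reader once the last line has been read
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                for (String prefix : prefixes) {
                    if (trimmed.startsWith(prefix)) {
                        hits.add(trimmed);
                        break; // one match per line is enough
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample input; a real run would pass a FileReader per file instead.
        String page = "<html>\n<h1>Title</h1>\n<p>skip me</p>\n"
                + "<span class=\"author\">Tom</span>\n</html>\n";
        for (String hit : scan(new StringReader(page), "<h1>", "<span class=\"author\">")) {
            System.out.println(hit);
        }
    }
}
```

The point of the sketch is how little work happens per line: no DOM is built and nothing is kept in memory beyond the matching strings, which is a different algorithm from parsing every document in full, whatever language either one is written in.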

Now, at this point, there are two possible reactions.

If you are a programmer you go “No fucking shit. You wrote some slow code and it was slow. You wrote some fast code and it was quick. What are you, some kind of beautiful and unique snowflake?” Well, actually, that is what a very cynical programmer would say. Most programmers would say “Okay, that makes sense.”

If you are a tech blogger who has learned their story-finding skills from TechCrunch, you now say “BREAKING NEWS: Ruby is slower than Java. Stop the presses!”

Feel free to substitute Python for Ruby or C for Java or whatever. The same story will be told over and over again. You don’t believe me? Remember all the hoopla when Twitter changed their message queue over from Starling to Kestrel. A sane technical decision from the Twitter team turned into an absolutely farcical pissing contest.

To show you how ludicrous these comparisons are, I wrote three scripts - one in Ruby, one in Java and one in C. All three do the same thing: print out “Hello World” 100,000 times.

Running ‘time’ on the compiled C script gave me these results - real: 0m0.462s, user: 0m0.046s, sys: 0m0.132s.

Running ‘time’ on the interpreted Ruby script gave me these results - real: 0m0.664s, user: 0m0.178s, sys: 0m0.155s.

Running ‘time’ on the compiled Java gave me these results - real: 0m1.358s, user: 0m0.809s, sys: 0m0.360s.
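For reference, the Java entry in a test like this can be tiny. The sketch below is an assumed reconstruction, not the program that produced the numbers above; the class name HelloLoop and the printHello method are invented for illustration. As an extra, it reports on stderr how long the loop itself took, so that figure can be set against the ‘real’ number from ‘time’. Whatever is left over is time the process spent outside the loop, most likely starting up and shutting down.

```java
import java.io.PrintStream;

// Hypothetical stand-in for the Java "Hello World" test; not the original program.
class HelloLoop {

    // Prints "Hello World" count times and returns the loop's wall-clock time in nanoseconds.
    static long printHello(PrintStream out, int count) {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            out.println("Hello World");
        }
        out.flush();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long nanos = printHello(System.out, 100_000);
        // Report on stderr so the timing line does not mix with the output being measured.
        System.err.printf("loop took %.1f ms%n", nanos / 1_000_000.0);
    }
}
```

Running it as `time java HelloLoop > /dev/null` gives both numbers side by side.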

Not surprisingly, C is faster than Ruby. And Ruby is faster than Java, not surprisingly (the JVM takes time to start up - to do what, exactly? Print 100,000 Hello Worlds?). Now, please go and build all your web applications in C. It is obviously faster than Ruby and Java. Speed is the only thing that matters, remember. And the only thing that dictates speed is what language you choose. What algorithms you use? No consequence at all. Libraries? Nuh-uh. It is not like different types of software have different performance characteristics, or like deciding what programming language (and compiler or interpreter or whatever) to use is a complex decision made up of many factors. No, much better to trust what some goofball with an arbitrary ‘time’ output says. Or better, just trust what some goofball on a tech industry blog who barely knows what HTML stands for says. Arrington and pals know a lot more about programming than your programmers do, remember.

Or you could do what sane people do: use the right tool for the job, and test out speed claims in something vaguely approaching a scientific manner. And maybe get your programmers to decide what programming language to use, rather than the technical press or the bloggers. Base your language and technical decisions on your own problems. That my code took eight minutes in Java, written with the sort of constructs Java gives me, rather than four days in Ruby, written with the sort of approach I take to Ruby code, is pretty irrelevant. (Hint: I could have written it in Ruby to run a lot faster. I just happen to know Java’s IO libraries slightly better than I do Ruby’s.) The problem that you might be facing may have absolutely nothing to do with file IO or HTML parsing or whatever. Just turn off the chatterbox and do what is right for your own situation based on actual good reasons rather than what is the current hype on the blogs.
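As one example of what "vaguely approaching a scientific manner" could mean in practice, here is a rough sketch of a timing harness: run the task a few times to warm up, measure it repeatedly, and report the median rather than a single run. Everything in it (the MedianTimer class, the method names, the warm-up and run counts, the stand-in workload) is an assumption made for illustration, and it is a long way short of a proper benchmarking tool.

```java
import java.util.Arrays;

// Hypothetical timing harness for illustration; names and run counts are arbitrary choices.
class MedianTimer {

    // Runs the task `warmups` times unmeasured, then `runs` times measured,
    // and returns the median measured wall-clock time in nanoseconds.
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    // Median of the samples; for an even count, the mean of the two middle values.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        // Stand-in workload: build the 100,000-line output in memory.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("Hello World\n");
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty output");
            }
        };
        System.out.println("median ns: " + medianNanos(task, 5, 21));
    }
}
```

The useful part is swapping in your own workload for the stand-in one: the number that matters is the one measured on the problem you actually have.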