Tom Morris

9 December 2009

A pungent mix of programming, philosophy, pedanticism, procrastination, perplexity, peripheral political polemic, and platters of preposterousness.

How I built the LeWeb IRC bot in my pyjamas

So, today was the first day of LeWeb. The theme of LeWeb this year is “the real-time web”. Appropriately, I guess, Twitter broke down just before the conference started.

Last time I went to LeWeb, I decided what I really wanted was for all the tweets to go into IRC. I stayed up all night last time trying to get it to work. This time, I did it! And it was pretty easy. I built it in about an hour. Here is how easy it was: I built it after just waking up, while sitting in bed in my pyjamas.

The architecture is really simple: I have a Ruby script (“grabber”) that uses the Twitter Streaming API to get everything that contains the string ‘leweb’ or ‘leweb09’. This just tracks everything coming in and adds it to a Distributed Ruby (DRb) Queue object. I added some minimal filtering to this: basically to remove any tweet with the word “RT” at the start. I have another script that uses JRuby to instantiate a PircBot object and use that to post everything that comes across the queue - with three seconds between posts. Obviously, I have a third script that’s just the DRb server. Finally, I have an IRb (actually, it was mostly a jirb - the JRuby equivalent of an IRb shell) running to let me inspect and monitor the queue - earlier, when testing, I opened up another IRb shell and had it print out the total number of items in the queue every second so I could debug it and get it down to processing them as fast as they were coming in without overloading the server. I found that 3 seconds was ideal - any slower and you end up building up a backlog of unprocessed tweets, any faster and the IRC server would complain that you are flooding it.
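For the curious, here is a minimal sketch of how the three pieces could fit together around a shared DRb queue. It is an illustration under assumptions, not the code that ran at LeWeb: the URI, the method names and the regular expression are all made up for this example, and the Twitter and PircBot parts are left out entirely (the poster just yields each tweet to a block).

```ruby
require 'drb/drb'

# Hypothetical address for the shared queue; any free port would do.
QUEUE_URI = 'druby://localhost:8787'

# DRb server script: expose a single thread-safe Queue as the front object.
# A standalone script would follow this with DRb.thread.join to stay alive.
def start_queue_server(uri = QUEUE_URI)
  DRb.start_service(uri, Queue.new)
  DRb.uri
end

# Grabber side: treat a tweet as a retweet if it starts with the word "RT".
def retweet?(text)
  text.lstrip.match?(/\ART\b/)
end

# Push a tweet onto the queue unless it is a retweet.
# Returns true if the tweet was queued, false if it was filtered out.
def enqueue_tweet(queue, text)
  return false if retweet?(text)
  queue.push(text)
  true
end

# Poster side: pop tweets one at a time, hand each to the block
# (where the IRC bot would send it), then pause between posts.
# With limit: nil it runs forever; queue.pop blocks while the queue is empty.
def drain(queue, delay: 3, limit: nil)
  posted = 0
  until limit && posted >= limit
    yield queue.pop
    posted += 1
    sleep delay
  end
  posted
end
```

The grabber and the poster would each get a handle on the same queue with `DRbObject.new_with_uri(QUEUE_URI)` and pass it to `enqueue_tweet` or `drain`. Monitoring from an IRb shell is then just a matter of calling `size` on that handle in a loop.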

Why the DRb queue? Well, the gem I use to get the data out of Twitter’s Streaming API uses a native C binding, so I couldn’t really use that on JRuby, at least not without lots of faffing around with FFI. I’m familiar with PircBot as I’ve used it in Scala - but obviously, Scala couldn’t talk to a Distributed Ruby queue, and I was too lazy to go and install something like RabbitMQ. Ruby is a pretty nice way to build stuff using the Twitter Streaming API. Performance-wise, the DRb server uses the most memory - 21.6MB - and it often bursts to CPU usage of up to 59% - it is using two threads. The “grabber” uses 4.4MB, virtually no CPU and it’s a single thread. The JRuby process uses about 28% of the CPU, a consistent 55.4MB of RAM and 18 threads. It’s been running for the last seven hours, and the whole setup has been pretty stable. In the morning, I had to kill grabber a few times after it would just hang up. JRuby has been ridiculously awesome though: I really love JRuby and the JVM. My machine did crash, although I blame the ridiculous number of Firefox tabs I had going. Did I mention I’ve been running this from my laptop?

In a few hours, once everything has settled down a bit, I’m planning on making a link-flood preventer. There’s an unfortunate problem with the stream - when a site like ReadWriteWeb posts an entry, there’s an absolute ton of accounts that tweet/retweet it. My immediate plan is to basically have another server, probably a local web server backed by a SQLite database, that will be able to take a URL, disentangle the original URL if it’s bit.ly/is.gd/tinyurl or any of the other URL shorteners that are widely used, and see if that link has been posted in the last 15 minutes. If it has, it’ll return false, while if it hasn’t, it’ll return true. 15 minutes might be too long - I haven’t got a problem with having people reposting the same link - people in the chatroom should be able to see the links going out - it’s really to stop it when there’s 20 tweets in rapid succession that are all linking to the same item. Alternatively, to prevent rapid multiposting, I may just keep an array of the last 500 received tweets, pass the latest string to it, do a Levenshtein comparison on it and if it’s above a certain magic number, not post it. All of these may be a bit computationally taxing - if anyone has any better ideas on how to despamify a live Twitter stream, I’m game. I’m cruising the Java Collections Framework documentation finding the relevant way of doing it (I love the JCF by the way. Google says nobody has said this before, so let me be the first: I truly love the Java Collections Framework.).
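Neither idea is built yet, but both are small enough to sketch in plain Ruby. What follows is a rough illustration under assumptions, not a finished design: the class names, the in-memory hash (standing in for the SQLite-backed server) and the distance threshold of 10 are all invented for the example, and un-shortening bit.ly-style URLs is skipped entirely. It also reads the “magic number” as a maximum edit distance, so a tweet counts as a duplicate when its distance to a recent tweet is at or below the threshold.

```ruby
# Remembers when each URL was last posted (hypothetical in-memory stand-in
# for the planned SQLite-backed service).
class RecentLinks
  def initialize(window_seconds = 15 * 60)
    @window = window_seconds
    @last_posted = {}
  end

  # Returns true (and records the URL) if it has not been posted within
  # the window; returns false if it has.
  def fresh?(url, now = Time.now)
    @last_posted.delete_if { |_, at| now - at > @window }
    return false if @last_posted.key?(url)
    @last_posted[url] = now
    true
  end
end

# Classic dynamic-programming Levenshtein distance, keeping only two rows.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Keeps the last `capacity` tweets and flags near-duplicates of any of them.
class NearDuplicateFilter
  def initialize(capacity = 500, max_distance = 10)
    @capacity = capacity
    @max_distance = max_distance
    @recent = []
  end

  # True if `text` is within max_distance edits of a remembered tweet.
  # The tweet is remembered either way, and the oldest one is dropped
  # once the buffer is full.
  def duplicate?(text)
    dup = @recent.any? { |old| levenshtein(old, text) <= @max_distance }
    @recent << text
    @recent.shift if @recent.size > @capacity
    dup
  end
end
```

The worry about cost is real for the second approach: each incoming tweet is compared against up to 500 stored ones, and every comparison is quadratic in tweet length. Checking the link first and only falling back to the Levenshtein comparison for tweets without a link would be one way to keep that down.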

I also flirted with using the WhatLanguage gem to remove non-English tweets - when we started, I had the idea that the stream would be too overwhelming for the server, and that we’d be getting hundreds of tweets per second - filtering out the non-English stuff seemed like one way to make it a teeny amount more manageable. I quickly removed it because (a) it’s presumptuous to think everyone in the channel isn’t interested in the non-English tweets, (b) the language detection algorithm isn’t flawless, (c) because of (a) and (b), the CPU time being spent on language detection was wasteful.

Ideally, I’d like to make running a Twitter to IRC bridge very easy. Twitter’s commitment to opening up the Firehose for everybody in the new year should make this possible. Tomorrow, I’ll post the source code of what I did today. I’m thinking it may be easier to produce an all-in-one Java package. PircBot is very good. Someone has also produced a Jakarta Commons HTTPClient-based TwitterClient for Java. This is cool, but I didn’t have time to fully explore it this morning. Ideally, the bridge software could be as simple as a Java (or, better, Scala) app that has an SWT front-end - so a person can literally just double-click a JAR, tap in their Twitter authentication details, the query they want to use and the IRC server credentials and hit ‘start’. It should be that easy.

Malet Street, London, WC1E 6