Tom Morris

A pungent mix of programming, philosophy, pedanticism, procrastination, perplexity, peripheral political polemic, and platters of preposterousness.

Opt-in image filter: enabling censorware?

I’ve been meaning to write a long and detailed assessment of the opt-in image filter debate currently raging in Wikimedia circles. There are lots of power plays going on and not a lot of good faith.

I’ve been watching from the sidelines: I’m fairly agnostic about the whole thing. My only contributions thus far have been to challenge what I see as bad arguments. If I were pushed, I’d say I’m mildly in favour of the proposal, but if it doesn’t happen, I won’t be too crestfallen. As I said, my primary interest is in the quality of the arguments (there’s a reason I’m a philosophy graduate student…).

One argument I’ve heard over and over again runs something like this…

We shouldn’t have an image filter as the categorisation system that comes with it would enable others to filter Wikipedia.

Basically, to enable an opt-in image filter, we’d build up categories and schemata for the opt-in filter which could then be reused by others who want to prevent others from having access to Wikipedia. It’s basically a utilitarian objection: rather than objecting to the principle of the filter, it is an objection to the probable knock-on effects that creating the filter would have on others.

There’s nothing wrong with the logical structure of the argument, but I do have some very strong doubts as to whether you should accept the conclusions.

First of all, a slight philosophical objection. The same argument could be made against a lot of reuse. People who create photos or music or anything else and license it as public domain or as CC BY or CC BY-SA run the risk that someone they don’t like ends up using “their” content. I wouldn’t be too pleased if I found that one of the articles I’d written for Wikinews or one of the photos I’d put on Commons turned up on websites affiliated with, say, the British National Party. But that’s a risk I run from licensing stuff freely. I willingly take that risk because the benefits of having things like Wikipedia far outweigh the downside of having politically disagreeable, well, dicks reusing the content. It’s the same risk we take with open source: what if some big weapons giant starts using your code to power their weapons systems? What if someone takes your open source wiki system and starts something as profoundly stupid and anti-intellectual as Conservapedia? The answer: well, that sucks. I don’t see why the same answer shouldn’t apply to this kind of objection.

English Wikipedia already has the “bad image list”: a list of shocking images, each of which can only be included in the articles it is listed for. If you want to use one elsewhere, an admin has to update the list. It’s basically to prevent that delightful image “Autofellatio6.jpg” from being inserted into My Little Pony articles and other amusing bits of vandalism. Does the bad image list enable censorware? Yes. But it has kind of an important and useful function: preventing vandalism. Similarly, the doctrine of double effect can be called into play here: yes, we may be building up a list of categories that could be reused by censorware sellers, but that’s not our primary intention.

Anyway, the major objection is a lot more major than this charge of inconsistency. The major objection is simply that the sort of filtering is different enough for it not to matter.

The nature of an opt-in image filter is very different from a filter that isn’t opt-in.

Imagine you wanted to build Net Nanny or one of the other brands of what Wikipedia calls content-control software. The goal is simple: there’s a bunch of bad, evil, no good content out there on the Wild West Web that you want to prevent little Bobby in Missouri (or Manchester or Minsk or Matsumoto or Mangaung) from getting to. Some of it is images, some of it is particular websites, some of it is specific pages, whatever. You build up a big old list of URLs and other factors such that you can give the system a URL and for a given set of categories, it can say yes or no. If you get it right, little Bobby doesn’t have to see Tubgirl or 1man1jar… ever.
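To make that concrete, here is a minimal sketch of the "give it a URL and a set of categories, get back yes or no" lookup. Everything in it is made up for illustration: the hosts, the path prefixes, the category names and the `is_blocked` function are assumptions, not how any real censorware product stores or matches its lists.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist: (host, path prefix) -> categories the entry falls under.
BLOCKLIST = {
    ("shock.example", "/"): {"shock"},
    ("wiki.example", "/images/explicit/"): {"nudity"},
}


def is_blocked(url, enabled_categories):
    """Return True if the URL matches a list entry in any enabled category."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    enabled = set(enabled_categories)
    for (listed_host, prefix), categories in BLOCKLIST.items():
        # An entry applies when the host matches and the path sits under its prefix.
        if host == listed_host and path.startswith(prefix):
            if categories & enabled:
                return True
    return False
```

The interesting part is not the lookup, which is trivial, but who gets to decide what goes in the list and which categories are switched on.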

But if you get it wrong, there are problems. If you have false positives, that’s fairly bad. People laugh at you for your false positives. They make snarky blog posts saying “har har har, you can’t look up ‘Same-sex marriage’ on Wikipedia because this shitty censorware thinks the word ‘sex’ means it must be pornography”. And, yes, that’s a real example: my university has (or at least did a few years ago, things may have changed) a censorware system that blocked the article on English Wikipedia for same-sex marriage. Or you’ll get censorware that blocks websites about breast cancer or testicular cancer because, well, breasts and testicles are naughty. And if you are a government that implements it on publicly-accessible wifi hotspots in places like libraries and airports, you may get angry civil libertarian types laughing at you on BoingBoing.net… which due to the censorware used by the government, you probably won’t be able to read. So false positives are bad for your public image: a few are okay, but go too far and you end up being the corporate equivalent of the prudish philistine who tries to put some boxer shorts on Michelangelo’s David because why would that nice ninja from Teenage Mutant Ninja/Hero Turtles be spending his time making nude sculptures of the important moral figures of our Judeo-Christian heritage rather than fighting crime!?

But false negatives are much, much worse for the censorware makers. Because once a false negative sneaks through the censorware, it’s game over. If little Bobby does see Tubgirl, he need only copy it onto a USB flash drive and stow that away somewhere his parents can’t find, with a filename like “homework”. And maybe little Bobby will share said picture with his friends at school. And maybe in return one of his classmates who doesn’t have prudish parents who install software with Orwellian names like ‘Net Nanny’ will download him some better pornography and some lovely images of self-inflicted chainsaw suicides or whatever it is the kids are into this week. And, as I said, it’s game over. Once you peek behind the veil of censorship, all those things you wanted to keep little Bobby away from start finding their way in. First it’ll be sex with farm animals, and then the Communist Manifesto, then he’ll want to go to college, and then he’ll want to edit Wikipedia! Pass the smelling salts!

So, if you wanna make censorware, it’s gotta be pretty damn strict. And you’ve also got to keep the false positives down for PR purposes because otherwise snarky people will relentlessly mock you. Oh, and you’ve got to keep your lists secret because this is capitalism and competition requires secrecy. And if you leak the list, people will start poking around on those websites.

Making an image filter is a lot simpler, then, because the requirements are different. The goal of the proposed image filter (and a large number of different variants on the same theme one could conjure up to answer different objections) isn’t to prevent access at all. It’s to enable individuals to opt out of displaying some images. It doesn’t need to be a comprehensive list of all things that meet the criteria for potentially controversial, nor does it need to work as hard as the censorware manufacturers to keep false positives low. If I decide that I don’t want to see bums and willies and boobies and so on (because I’m at work or on the train or in a public library or whatever), it doesn’t actually matter to me much if the filter isn’t 100% comprehensive. It isn’t trying to stop me from seeing any images of a particular class, it’s just giving me the option to view them or not. If one slips through (a false negative), it’s not game over either. It just means that the filter wasn’t as good as it could be. Whatever. If, while anti-vandalism patrolling, I get to see 90% fewer shocking images, great, sign me up. If I get to see 25% fewer shocking images, great, whatever. Anything better than zero is just fine.

And what about false positives? Okay, I wouldn’t necessarily want 100% false positives, and 50% would be pushing it a bit, but really, if all I have to do is click the image and it pops back in, the cost of a false positive is damn near negligible.
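A rough sketch of the behaviour described above, just to show how little is at stake when the filter gets something wrong. The class, its method names and the tag names are all hypothetical; this is not the design anyone has actually proposed for Wikimedia, only one way the "hide it unless I click" idea could work.

```python
class OptInImageFilter:
    """Toy model of a reader-controlled filter: it hides images, it never blocks them."""

    def __init__(self, image_tags):
        # Hypothetical data: file name -> set of category tags.
        self.image_tags = image_tags
        # Empty by default, so nothing is filtered until the reader opts in.
        self.hidden_categories = set()
        # Images the reader has clicked to bring back.
        self.revealed = set()

    def opt_in(self, category):
        self.hidden_categories.add(category)

    def reveal(self, image):
        # Undoing a false positive costs exactly one click.
        self.revealed.add(image)

    def is_shown(self, image):
        if image in self.revealed:
            return True
        tags = self.image_tags.get(image, set())
        # An untagged image (a false negative) is simply shown as normal.
        return not (tags & self.hidden_categories)
```

A false negative falls straight through `is_shown` and a false positive is undone by `reveal`. Neither failure is anything like the "game over" that a censorware maker has to worry about.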

The sort of categorisation system that would flow from this is very different because the costs of inclusion or failure to include are so different from those in the Net Nanny type of case. Could the Net Nannies of the world use the lists and categories that get generated from an opt-in image filter? Sure. But why would they bother? They would still need to go over them to check for false positives and false negatives, because of the costs of both.

Back in 2008, I built something called the nsfw profile. It’s a GRDDL profile for defining certain links as not safe for work. (Don’t worry about what a GRDDL profile is.) The idea of the thing is that you could add nsfw as a class on links and then attach custom behaviour or maybe some kind of nice browser trick that would warn you about the not safe for work link. It didn’t take off because, well, for whatever reasons, but imagine if it had, and the whole world had started adding descriptive markup to their links so that browsers could work out what links are NSFW and so on. Could you build Net Nanny on top of this? Of course not. Again, false positives would be too high and the false negatives would be even higher.
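For the curious, this is roughly what a consumer of that markup might look like: a small sketch using Python’s standard `html.parser` to pull out links whose class attribute includes nsfw. The `NsfwLinkFinder` class and `find_nsfw_links` function are names invented for this example, and it skips the GRDDL side entirely.

```python
from html.parser import HTMLParser


class NsfwLinkFinder(HTMLParser):
    """Collects the href of every <a> element carrying the 'nsfw' class."""

    def __init__(self):
        super().__init__()
        self.nsfw_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute is a space-separated list, and may be missing.
        classes = (attributes.get("class") or "").split()
        if "nsfw" in classes and attributes.get("href"):
            self.nsfw_links.append(attributes["href"])


def find_nsfw_links(html):
    finder = NsfwLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.nsfw_links
```

Note that it only finds links somebody has bothered to mark. Anything unmarked sails through, which is exactly why you couldn’t build censorware on it.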

Now, for the reasons I’ve given, I don’t think censorware firms would be very likely to use the resulting categories and lists from the image filter as part of their listings.

But there’s more. Let’s put ourselves back in the position of designing some censorware like Net Nanny. If you wanted to make sure that people could get access to Wikipedia but didn’t get to see, err, Double_penetration.svg, what would you do? Obviously, you can’t block Wikipedia. That’d be stupid. Well, if I were making some censorware, I’d probably just do a recursive category search starting at Category:Human sexuality on enwiki, then I’d hire a bunch of people to poke through each page and mark it as “porn” or “not porn”. Then I’d take the list of all the “porn” pages, scrape each one, work out what images are on there and add those to the list of naughty images. Then I’d ask the MediaWiki API to give me a list of all the inter-wiki links from all those pages to the other language versions. Then I’d scrape those to get a list of all the files they use and add any that aren’t already on the bad images list to that list. I’d pop all the pages on the list too, and then I’d set up a cron job to run once a month to find new images and new pages, run them past our minimum wage porn raters and… there you go, you’ve got a pretty damn good list of the sex stuff you need to filter from Wikipedia to protect the Flanders family from The Simpsons. If you are a repressive regime or a corporate censorware manufacturer, filtering the porn from Wikipedia is the easy bit: it gets a bit harder out there on the rest of the web where there isn’t a volunteer community dutifully sorting pictures into categories with names like Suggestive use of sticking out tongue and Cameltoes.
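The first step of that recipe, the recursive category search, can be sketched in a few lines. To keep it runnable offline, the sketch walks a tiny made-up category graph (`TOY_CATEGORIES`) instead of calling the MediaWiki API; the `walk_category` function, the `get_members` callback and the depth limit are assumptions for illustration. A real script would swap in a function that asks the API for each category’s members.

```python
from collections import deque

# Hypothetical stand-in for the live category graph: category -> member titles.
# Members beginning with "Category:" are subcategories; the rest are pages.
TOY_CATEGORIES = {
    "Category:Human sexuality": ["Category:Sexual acts", "Human sexuality"],
    "Category:Sexual acts": ["Category:Human sexuality", "Kiss"],  # deliberate cycle
}


def walk_category(root, get_members, max_depth=3):
    """Breadth-first walk from root, returning every page title found."""
    seen_categories = {root}
    pages = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for title in get_members(category):
            if title.startswith("Category:"):
                # Category graphs contain loops, so never visit one twice,
                # and stop descending once the depth limit is reached.
                if title not in seen_categories and depth < max_depth:
                    seen_categories.add(title)
                    queue.append((title, depth + 1))
            else:
                pages.add(title)
    return pages


def toy_members(category):
    return TOY_CATEGORIES.get(category, [])
```

Calling `walk_category("Category:Human sexuality", toy_members)` gives the set of pages to hand to the human raters. The depth limit matters in practice, because category trees wander a long way from where they started.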

If you do want to build censorware that finds all the naughty on Wikipedia, the Wikimedia community has done most of the work for you already. You just need a few Python scripts and some minimum wage porn raters (I have a funny feeling that in a recession, there will be plenty of people wanting to get paid to categorise porn).

If we really want to stop censorware companies from reusing a category system for images on the Wikimedia sites, the panoply of sexual image categories on Wikimedia Commons shows that it might be a bit late for that objection. As with content, we shouldn’t worry too much about how people reuse it, we should worry more about whether we are providing the best service for readers and editors (again, I don’t want to subject my fellow public transport users to some of the stuff I see while anti-vandal patrolling).

Censors may be humourless philistines, but they aren’t total morons. If they want to find the naughty stuff on Wikipedia and block it for their users, they are more than capable of doing so. Worrying about whether they would reuse our filtering categories is a complete red herring. Our non-filtering categories provide most of what they need already, and I’m betting nobody is going to call for those to be shut down. Objection overruled!