We are seven people working together on a website that has 200,000 registered users. We have no QA department, no customer service department, nobody but us monitoring the site. Meanwhile, thousands of entries and comments are added to our database every day. Most of them are innocent, positive entries about people accomplishing their goals. Others are vile attacks on other users, childish pranks, heated religious debates, or perfectly valid content that happens to be meant for a mature audience. There is a lot of truly disgusting, malicious, and irrelevant material on our site, but luckily most people don’t know about it, because we are able to keep it off the main pages of our site. (Note that we don’t censor or forbid it except in extreme circumstances; we merely keep it out of the “display window” pages of our site.)
How do we do that? Well, we can’t read every entry or comment and determine if it is tasteful and representative of the site’s purpose. Like every web company facing this situation and serving user-created content, we have to use automated filtering on the incoming stream of entries. Since computers aren’t smart enough to determine the “gist” of what an entry contains, they need to filter text using keywords. The keywords can be the four-letter variety or they can be standard words that are often used in racy contexts. Consider something like “meet young girls” as a goal. None of the words are racy, but the phrase can be considered inappropriate depending on who is writing it, and their purpose. The computer can’t understand any of this, and can’t parse phrases well (e.g. “meet young girls” vs “meet some young girls” vs “meet me a young girl”); it can only find words (and parts of words) in the text streaming in. So that’s what it does.
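To make the word-level matching concrete, here is a minimal sketch in Python. It is not our actual filter: the names `WHOLE_WORDS`, `WORD_PARTS`, and `is_flagged` are invented for illustration, and the entries are placeholders rather than our real keyword list. It only shows the general idea of checking incoming text for listed words and word parts without trying to understand phrases.

```python
import re

# Hypothetical placeholder entries; the real list is far larger and is not given here.
WHOLE_WORDS = {"atheism", "theism"}   # must match a complete word
WORD_PARTS = ("girl",)                # may match anywhere inside a word


def is_flagged(text):
    """Return True if the text contains any listed word or word part."""
    lowered = text.lower()
    # Split into alphabetic words; punctuation and digits act as separators.
    words = set(re.findall(r"[a-z]+", lowered))
    if words & WHOLE_WORDS:
        return True
    # Substring check, so "girl" also matches "girls" or "girlfriend".
    return any(part in lowered for part in WORD_PARTS)


for goal in ("meet young girls", "meet some young girls",
             "meet me a young girl", "run a marathon"):
    print(goal, "->", is_flagged(goal))
```

Because the check works on words rather than phrases, all three phrasings of the example goal are treated the same way. The substring check would also flag something like “take my girlfriend to Paris”, which is the price of not parsing meaning.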
There’s no point trying to list all the words, word parts, and word patterns we use, or having us justify each keyword being there. It’s a big list built from lots of real-world use. But religion seems to be at the center of this particular conspiracy theory, so I’ll touch on it. Many discussions about religion become combative and inflammatory; in fact, religion could be considered one of the topics most likely to inspire heated debate rather than simple progress-tracking on life goals. So some of our keywords involve religion. Keep in mind that the computer isn’t taking sides on one religion or another, on atheism vs theism, or whatever. It’s also not making a statement about how the site is meant to be used. It is merely flagging text as possibly containing inflammatory arguments or mature content, and keeping that text from appearing on the high-traffic pages of the site, just in case. More often than not, as some have noticed, the filter catches “false positives”: completely innocent content that happens to contain keywords which are sometimes found in inappropriate contexts. But, since we can’t read all the content, we have to use this imperfect, automated filtering to keep the Zeitgeist and other high-traffic pages relatively innocent and relatively representative of what 43 Things is all about.
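The difference between flagging and forbidding can be sketched the same way. This is an illustration under assumptions, not a description of our code: `Entry`, `store`, `high_traffic_view`, and the `KEYWORDS` entries are all hypothetical. The point is only that a flagged entry is still saved; it is simply left out of the high-traffic listing.

```python
from dataclasses import dataclass

# Hypothetical placeholder keywords; the real list is not reproduced here.
KEYWORDS = ("religion", "atheism")


@dataclass
class Entry:
    text: str
    flagged: bool = False


def store(entries, text):
    """Save every entry; flagging never blocks or deletes it."""
    lowered = text.lower()
    entry = Entry(text, flagged=any(k in lowered for k in KEYWORDS))
    entries.append(entry)
    return entry


def high_traffic_view(entries):
    """Texts eligible for display-window pages such as the Zeitgeist."""
    return [e.text for e in entries if not e.flagged]


site = []
store(site, "learn to juggle")
store(site, "study the history of religion")  # innocent, but flagged anyway
print(len(site))                # both entries are still stored
print(high_traffic_view(site))  # only the unflagged entry is listed
```

The second entry is exactly the kind of false positive described above: nothing is wrong with it, but it contains a keyword, so in this sketch it stays on the site while being kept off the high-traffic view.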
This is standard practice for dynamic websites, and it actually works pretty well.
