Today brings a guest post from MIT senior (and good friend) Paul Kominers. Full post after the jump.
On Wednesday, the feature elves dropped off a new addition to Facebook’s long-suffering Groups feature: “Groups for Schools.” Colleges and universities can now have super-groups, accessible only to users with email addresses valid at the college, which aggregate sub-groups for various segments of campus. In my case, this is Groups at MIT.
Groups at MIT’s inaccuracies are chuckle-worthy. In real life, I am Course 14 and Course 17 (read: majoring in Economics and Political Science), a resident of Random Hall, and a member of the Class of 2012. But all of Facebook’s top suggestions for sub-groups I should join are dorms in which I do not reside, academic fields which I do not study, and graduating classes of which I am not a member. The top three suggestions are “East Campus” (a dorm); “Course 6” (EECS); and “Course 18” (Mathematics).
Facebook’s algorithms are trying to nail down my identity. They are doing a pretty lackluster job of it. Facebook failed to identify “me,” I think, for a fairly obvious reason: the central principle of Groups at MIT’s recommendation system seems to be “How many friends does he have in this sub-group?” Naturally, I would have many friends in a group for East Campus, where I spend much of my time, or for Courses 6 and 18, respectively the first- and third-most popular majors on campus.
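That “count the friends” principle is simple enough to sketch in a few lines. The function and data below are hypothetical, not Facebook’s actual code; they just show how a friend-overlap recommender produces exactly this failure mode when your friends cluster somewhere you don’t belong.

```python
from collections import Counter

def suggest_groups(user_friends, group_members, top_n=3):
    """Rank sub-groups by how many of the user's friends belong to each.

    A minimal sketch of the recommendation principle described above;
    the names and data here are illustrative, not Facebook's API.
    """
    counts = Counter()
    for group, members in group_members.items():
        counts[group] = len(user_friends & members)
    return [group for group, _ in counts.most_common(top_n)]

# Hypothetical data: my friends cluster in a popular dorm and major.
friends = {"alice", "bob", "carol", "dave", "erin"}
groups = {
    "East Campus": {"alice", "bob", "carol"},
    "Course 6":    {"bob", "carol", "dave", "erin"},
    "Course 14":   {"alice"},  # my actual major, but few friends in it
}

print(suggest_groups(friends, groups))
# → ['Course 6', 'East Campus', 'Course 14']
```

The recommender ranks Course 6 and East Campus above my real major, because overlap with my friends is the only signal it sees.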
These errors are amusing, but their broad implications are a bit troubling. The core problem Facebook is trying to solve is the same one at the heart of many high-tech services that work on imperfect information: “Given the data that I can see, what can I extrapolate about reality?” Groups at MIT rephrases it thus: “Given what I know about this person’s friends, which sub-groups will be the most appealing?” Netflix’s recommendation system asks its own version: “Based on what movies this user likes, what other movies will this user like?” Such services mine your data in order to draw a conclusion about how best to achieve their goals.
In both of these cases, mismatches are probably harmless or amusing. You may not want to be bombarded with inaccurate group suggestions, and you may not appreciate the hours spent watching Daredevil when you were hoping for something of the caliber of Iron Man. But these are minor annoyances, the electronic equivalent of hunting for your keys. The issue is that computer engineers are trying to solve the same problem of imperfect information in much more important territory.
Consider algorithmic dispute resolution, one of Chris’s areas of expertise: “Given that this content has been marked as abusive, what are the odds that the content is actually abusive?” Or spam filters, which have a well-known proclivity for catching legitimate, and sometimes important, messages: “Given how users have reacted to emails like this, what are the odds that this email is spam?” Or smart infrastructure: “Given what these sensors are indicating, what are the odds that the system needs an automated intervention?”
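The dispute-resolution question above is, at bottom, a Bayes’ rule calculation, and working it through shows why flags are weaker evidence than they feel. The numbers below are invented for illustration; nobody outside these companies knows the real rates.

```python
def posterior_abusive(base_rate, true_positive, false_positive):
    """P(actually abusive | flagged), by Bayes' rule.

    base_rate:      fraction of all content that is actually abusive
    true_positive:  chance abusive content gets flagged
    false_positive: chance benign content gets flagged anyway
    All three numbers below are hypothetical.
    """
    p_flagged = (true_positive * base_rate
                 + false_positive * (1 - base_rate))
    return (true_positive * base_rate) / p_flagged

# Suppose 1% of content is abusive, flagging catches 90% of it,
# and 5% of benign content gets flagged too.
print(round(posterior_abusive(0.01, 0.90, 0.05), 3))
# → 0.154
```

Under these assumed numbers, only about 15% of flagged content is actually abusive: because benign content vastly outnumbers abusive content, even a small false-positive rate swamps the true positives.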
In all of these cases, when the algorithms start failing, we can experience serious problems. Chris documented a case where links to a blog post of his were blocked for abuse on Facebook. The blog post in question documented a political link that had itself been blocked for abuse on Facebook. Facebook’s algorithms for determining abuse had decided that you could not post the link, and you could not post links that discussed your inability to discuss the link. Apparently, the first rule of Facebook’s “Report Abuse” button is that you do not talk about Facebook’s “Report Abuse” button.
Facebook ultimately apologized, and I attribute no bad intentions to the company. But legitimate speech was quashed because of an error in the algorithms that identify abuse. Facebook’s anti-spam system alone checks 25 billion items daily. We do not know what the error rate is. One percent? A tenth of a percent? Even a hundredth of a percent means 2.5 million items daily are improperly marked as spam. How much speech is incorrectly disrupted as an inherent risk of doing business on Facebook?
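The arithmetic behind that 2.5 million figure is worth seeing at each candidate error rate. The 25 billion daily checks come from Facebook’s own reporting above; the error rates are the hypotheticals just discussed.

```python
DAILY_ITEMS = 25_000_000_000  # Facebook's reported daily anti-spam checks

# Hypothetical error rates: 1%, a tenth of a percent, a hundredth of a percent
for error_rate in (0.01, 0.001, 0.0001):
    misclassified = int(DAILY_ITEMS * error_rate)
    print(f"{error_rate:.2%} error rate -> {misclassified:,} items/day")
# → 1.00% error rate -> 250,000,000 items/day
# → 0.10% error rate -> 25,000,000 items/day
# → 0.01% error rate -> 2,500,000 items/day
```

Even the most optimistic rate here leaves millions of daily misclassifications, which is the point: at this scale, “very accurate” and “harmless” are not the same thing.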
When designing computers to solve problems, we need to take great care with potential pitfalls. Users can lie or be mistaken: just because users mark content as abusive does not mean that it actually is. Machines only have as much context as we can give them: just because I am Facebook friends with probably a hundred students in Course 6 does not mean that I have any interest in joining a Course 6 Facebook group. Developers may not think through their users’ behaviors: just because a user has sent email to another user does not mean those two people are friends. Users can outguess algorithms; malicious hackers can learn to attack better than automated computer security knows to defend.
More and more such decision problems are being delegated to algorithms, and that makes ensuring the quality of those algorithms more important. I do not want to stand in the way of progress. As someone who has been receiving hundreds of (mercifully filtered) spam messages daily for the last two weeks, I recognize that there are great benefits to be reaped by applying algorithms well. Ideally, I want my inbox clean, my network secure, and my buildings and utilities smart. Of course, to achieve these goals we will have to accept a nonzero error rate. But that does not mean we should not try to make the error rate very, very low.
Paul Kominers, MIT ’12, studies economics and political science.