Using our powers for good – how web security software can help to transcribe old books

What would you do if someone asked you to help transcribe an old book onto a website? Chances are, you’d say no on the basis that you have other things to do, or simply that it just doesn’t sound very interesting. And yet, millions of people every day are helping with precisely this task, and most are completely unaware that they’re helping out.

It’s all thanks to a computer program developed by Luis von Ahn and colleagues at Carnegie Mellon University. Their goal was to slightly alter a simple task that all web users encounter and convert it from wasted time into something productive. That task – and you will all have done this before – is to look at an image of a distorted word and type what it is in a box. It often turns up when you’re trying to post on a blog or sign up for an account.

The distorted word is called a CAPTCHA and, playing fast and loose with the spirit of acronyms, it stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. Its point is to make users prove that they are human, because modern computer programs cannot discern the distorted letters as well as humans can. CAPTCHAs are visual sentinels that protect against automated programs that would otherwise overbuy tickets for sale at inflated prices, set up millions of fake email accounts for spamming or inundate polls, forums and blogs with comments.

They have become so commonplace that von Ahn estimates that people type in over 100 million CAPTCHAs every day. And even though the goal of improving web security is a worthwhile one, these efforts add up to hundreds of thousands of hours that are effectively wasted on a daily basis. Now, von Ahn’s team have found a way of tapping this effort and putting it to better use – to help decipher scanned words, and usher old printed books into the digital age.

Reverse-Turing tests

As von Ahn writes, the goal of these projects is to “preserve human knowledge and to make information more accessible to the world.” Digitising books makes them simpler to search and store, but doing so is easier said than done. Books can be scanned and their words decoded by optical character recognition software, but these programmes are still far from perfect. And any weaknesses they have are exacerbated by the faded ink and yellowing paper of the very texts that we are most interested in preserving.

So recognition software is automated but only about 80% accurate. Humans are far more accurate; if two fleshy scribes work independently and check any discrepancies in their transcripts, they can achieve an accuracy of over 99%. We, however, are far from automated and usually quite expensive to hire.

The new system, aptly named reCAPTCHA, combines the best of both worlds by asking people to decipher words that software cannot, while solving CAPTCHAs. Instead of random words or characters, it creates CAPTCHAs using words from scanned texts that recognition software has struggled to read.

Two different recognition programmes scour the texts in question and, if their readings differ, the words are classified as “suspicious”. Each suspicious word is placed alongside a “control” word that is already known. The pair is distorted even further, and used to make a CAPTCHA. The user has to solve both words to prove their humanity – if they get the control word right, the system assumes that they are genuine and gains a bit of confidence that their guess for the suspicious word is also right.
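
As a rough illustration of the pairing step, here is a minimal Python sketch. The function names (`find_suspicious`, `check_response`), the sample words and the simple case-insensitive comparison are all assumptions made for this example; the description above does not go into that level of detail about the real system.

```python
def find_suspicious(reading_a, reading_b):
    """Return the positions at which two OCR readings disagree.

    reading_a and reading_b are hypothetical word lists produced by two
    different recognition programmes for the same scanned text, assumed
    here to be aligned word for word.
    """
    return [i for i, (a, b) in enumerate(zip(reading_a, reading_b)) if a != b]


def check_response(control_answer, typed_control, typed_suspicious):
    """Treat the user as human only if the control word is typed correctly.

    On success, return the user's guess for the suspicious word so that it
    can be counted as one vote; otherwise return None.
    """
    if typed_control.strip().lower() == control_answer.lower():
        return typed_suspicious.strip().lower()
    return None


# Made-up readings of the same four scanned words.
ocr_a = ["the", "qulck", "brown", "fox"]
ocr_b = ["the", "quick", "brown", "f0x"]
suspicious = find_suspicious(ocr_a, ocr_b)

# A user who gets the control word "morning" right contributes a guess.
guess = check_response("morning", "Morning", "quick")
```

The point of the design is that the system never needs to know the answer to the suspicious word in advance: the control word alone decides whether the user passes, and the second answer is collected as evidence.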

Every suspicious word is sent to multiple users and if the first three people to see it all provide the same guess, it shunts over to the pool of control words. If the humans disagree, a voting system kicks in and the most popular answer is taken as the right one. Users have an option to discard the word if it’s illegible, and if this happens six times without any guesses being made, the word is marked as “unreadable” and discarded.
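
The rules in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `resolve_word` and the status labels are invented here, "the first three people" is read as the first three people who offer a guess, and the vote is taken once at least three guesses are in (the description above does not say when voting closes).

```python
from collections import Counter


def resolve_word(responses):
    """Decide the status of a suspicious word from responses in arrival order.

    Each response is a guessed string, or None if the user discarded the
    word as illegible. Returns a (status, answer) tuple.
    """
    guesses = [r for r in responses if r is not None]
    discards = len(responses) - len(guesses)

    # Six discards without a single guess: mark the word as unreadable.
    if discards >= 6 and not guesses:
        return ("unreadable", None)

    # The first three guesses all agree: promote to the control pool.
    if len(guesses) >= 3 and len(set(guesses[:3])) == 1:
        return ("control", guesses[0])

    # The humans disagree: take the most popular answer (assumed rule).
    if len(guesses) >= 3:
        answer, _count = Counter(guesses).most_common(1)[0]
        return ("voted", answer)

    # Not enough information yet; keep showing the word to users.
    return ("pending", None)
```

For example, `resolve_word(["quick", "quack", "quick", "quick"])` falls through to the vote because the first three guesses disagree, and "quick" wins with three of the four votes.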

At first, von Ahn’s team tested the reCAPTCHA system using 50 scanned articles from the New York Times archive taken as far back as 1860 and totalling just over 24,000 words. The reCAPTCHA system achieved an excellent accuracy of 99.1%, getting only 216 words wrong and far outstripping the meagre 83.5% rate managed by standard recognition software.

Human transcription services guarantee an accuracy of 99% or better, so reCAPTCHA certainly lives up to that exacting standard. Indeed, when humans were asked to do the same task, they made 189 errors, just 27 fewer than the programme. The neck-and-neck nature of the two scores is all the more impressive because unlike a human reader, reCAPTCHA cannot make use of context to decode a word’s identity.
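
The figures above can be checked with a few lines of Python. The total of exactly 24,000 words is an assumption (the trial is described only as "just over 24,000 words"), so the results are approximate, and the error count for the standard software is derived from its quoted 83.5% rate rather than reported directly.

```python
TOTAL_WORDS = 24_000  # assumption: the text says "just over 24,000"


def accuracy(errors, total=TOTAL_WORDS):
    """Percentage of words transcribed correctly."""
    return 100 * (1 - errors / total)


recaptcha_accuracy = round(accuracy(216), 1)  # reCAPTCHA: 216 errors
human_accuracy = round(accuracy(189), 1)      # human transcribers: 189 errors

# Approximate number of words the standard software got wrong at 83.5%.
ocr_errors = round(TOTAL_WORDS * (1 - 0.835))

print(recaptcha_accuracy, human_accuracy, ocr_errors)
```

On these assumptions the software alone misreads several thousand words, while reCAPTCHA and the human transcribers each get only around two hundred wrong.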

Virtual security

That’s all well and good, but are there selfish reasons for a website to use reCAPTCHA, if its goal of preserving its own security (quite understandably) outweighs any interest in text conservation? Certainly, according to the researchers. Because the new system only uses words that are unrecognisable to current optical character recognition software, it’s actually more secure than current CAPTCHAs are.

Conventional CAPTCHAs use a small number of predictable rules to distort a set of characters and various groups have developed learning programmes that can solve them with over 90% accuracy. But the same techniques always fail to solve reCAPTCHAs because on top of the usual twists, this system has two extra levels of ‘encryption’ – the random fading of the underlying text and ‘noisy’ distortion caused by the scanning process. There’s a certain irony in making something state-of-the-art out of the old and the inaccurate.

It’s an interesting advance – von Ahn was in fact the person responsible for developing CAPTCHAs in their current form, so it’s perhaps unsurprising that his team have developed the next escalation of this technology.

Some might suggest that CAPTCHAs are a bit annoying anyway, so having to fill two out would seem like too onerous a task for today’s short attention spans. Not so – most CAPTCHAs are strings of random characters and these take just as long to solve as two actual English words.

Recycling effort

These guarantees, along with the prospect of doing something worthy, have already turned reCAPTCHA into a bit of an online hit. It’s being used by over 40,000 websites and it’s already making an impact. In its first year, web users solved over 1.2 billion reCAPTCHAs and deciphered over 440 million words – the equivalent of 17,600 books. At the moment, the programme is deciphering over 4 million suspicious words (about 160 books) every day. For human scribes to do the same task in the same time-frame, you’d need a workforce of over 1,500 people working 40-hour weeks.
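
The book equivalents quoted above imply a particular book length, which a short Python calculation makes explicit. The variable names are invented for this example, and the only inputs are the figures already given in the text.

```python
WORDS_FIRST_YEAR = 440_000_000  # words deciphered in the first year
BOOKS_FIRST_YEAR = 17_600       # stated book equivalent
WORDS_PER_DAY = 4_000_000       # current daily rate

# Implied length of one "book" in words.
words_per_book = WORDS_FIRST_YEAR // BOOKS_FIRST_YEAR

# Daily output expressed in books of that length.
books_per_day = WORDS_PER_DAY / words_per_book

print(words_per_book, books_per_day)
```

Both conversions rest on the same implied book length of 25,000 words, so the yearly and daily figures are consistent with each other.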

It’s a fantastic idea – turning web users into unwitting satellite processors, and making constructive use of a necessary but ultimately unproductive activity. This ethos, of treating human processing power as a resource to be conserved just as electricity or gas should be, underlies a lot of the team’s other work. They have developed online games that can analyse photos and audio recordings, and their work has inspired another group to create Foldit, a game in which people compete to work out the ideal structure of a protein.

Even pictures of cats can be put to good use. A Microsoft programme called ASIRRA uses images of cats and dogs as CAPTCHAs. Users have to select all the images of one or the other, but the twist is that all the photos come from animal shelters and users who take a liking to one of the animals can adopt it.

Now if only someone could harness the countless hours of effort wasted on trolling or posting comments on YouTube, we’d all be laughing.

Reference: Science doi: 10.1126/science.1160379