The Turing Test, which is intended to detect human-like intelligence in a machine, is fundamentally flawed. But that doesn’t mean it can’t be improved or modified. Here are eight proposed alternatives that could help us distinguish bot from human.
Perhaps the best way to detect intelligence in a machine is to have it tell you when it’s appropriate to laugh (see #2). Illustration by Tara Jacoby.
Can digital computers think? In the 1950s, computer science pioneer Alan Turing asked this question another way: “Are there imaginable digital computers which would do well in the imitation game?” While Turing’s original query speculated on a computer’s ability to participate in a simple party game, the question today is widely interpreted as “Are there imaginable digital computers which could convincingly imitate a human participating in a conversation?” If such a computer is said to exist, the reasoning goes, then that computer may also be considered intelligent.
Turing’s test has been the subject of much debate over the years. One of the biggest objections revolves around the assessment’s heavy emphasis on natural language processing skills, which encompass a very narrow measure of intelligence. Another complaint, fuelled by the 2014 Loebner Prize controversy, is that the test encourages deception as a means to achieving victory; the Russian chatbot Eugene Goostman “passed” the Turing Test by convincing one-in-three Loebner Prize judges that it was a 13-year-old non-native English-speaking Ukrainian boy. The bot used tricks, rather than bona fide intelligence, to win. That’s clearly not what Turing intended.
In light of incidents like these, and in consideration of the test’s inherent weaknesses, a number of thinkers have put forth ideas on how the Turing test could be improved, modified, or replaced altogether.
1. Winograd Schema Challenge
Hector Levesque, a professor of Computer Science at the University of Toronto, says that chatbots are effective at fooling some judges into thinking they’re human. But such a test, he says, merely reveals how easy it is to fool some humans — especially via short, text-based conversations.
To remedy this, Levesque devised the Winograd Schema Challenge (WSC), which he says is a superior alternative to the Turing Test. Named after Stanford University computer scientist Terry Winograd, the test presents a number of multiple-choice questions within a very specific format.
Here are some examples:
Q: The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
Q: The town councillors refused to give the demonstrators a permit because they feared (advocated) violence. Who feared (advocated) violence?
Answer 0: the town councillors
Answer 1: the angry demonstrators
If the first question is posed with the word “big,” the answer is “0: the trophy.” If it is posed instead with the word “small,” the answer is “1: the suitcase.” The answer to the second question is similarly dependent upon whether the sentence incorporates the word “feared” or “advocated.”
The answers to these questions seem pretty simple, right? Sure — if you’re a human. Answering correctly requires skills that remain elusive for computers, such as spatial and interpersonal reasoning, knowledge about the typical sizes of objects, how political protests unfold, and other types of commonsense reasoning.
2. The Marcus Test
NYU cognitive scientist Gary Marcus is an outspoken critic of the Turing Test in its current format. Along with computer scientists Manuela Veloso and Francesca Ross, he recently chaired a workshop on the importance of thinking ” Beyond the Turing Test.” The event brought together a number of experts who came up with some interesting ideas, some of which appear on this list. Marcus himself has devised his own alternative, which I’m calling the Marcus Test.
Here’s how he explained it to The New Yorker:
[B]uild a computer program that can watch any arbitrary TV program or YouTube video and answer questions about its content — “Why did Russia invade Crimea?” or “Why did Walter White consider taking a hit out on Jessie?” Chatterbots like Goostman can hold a short conversation about TV, but only by bluffing. (When asked what “Cheers” was about, it responded, “How should I know, I haven’t watched the show.”) But no existing program — not Watson, not Goostman, not Siri — can currently come close to doing what any bright, real teenager can do: watch an episode of “The Simpsons,” and tell us when to laugh.
Great idea! If a computer can truly detect and comprehend humour, sarcasm, and irony — and then explain it in a meaningful way — then there must be some serious cogitations going on inside its silicon skull.
3. The Lovelace Test 2.0
Named in honour of Ada Lovelace (pictured) — the world’s first computer programmer — this test aims to detect an artificial intelligence by gauging its capacity for creativity. The test was originally developed in 2001 by Selmer Bringsjord and colleagues, who contended that, if an artificial agent could create a true work of art in a way that was inexplicable to its developer, there must be a human-like intelligence at work.
The Lovelace Test was recently upgraded by Georgia Tech professor Mark Riedl to remedy the ambiguity and subjectivity implicit in this approach.
The basic rules of the Lovelace 2.0 Test of Artificial Creativity and Intelligence go like this:
The artificial agent passes if it develops a creative artifact from a subset of artistic genres deemed to require human-level intelligence and the artifact meets certain creative constraints given by a human evaluator.
The human evaluator must determine that the object is a valid representative of the creative subset and that it meets the criteria. (The created artifact needs only meet these criteria — it does not need to have any aesthetic value.)
A human referee must determine that the combination of the subset and criteria is not an impossible standard.
For example, the judge could ask the agent in question to create a jazz piece in the spirit of Dave Brubeck, or paint a Monet-like impressionist landscape. Then judge will then have to decide how well the agent fared in this task given the requirements. So unlike the original test, the judges can work within a defined set of constraints, and without having to make value judgements. What’s more, the test makes it possible to compare the relative intelligence of different agents.
4. The Construction Challenge
Charlie Ortiz, senior principal manager of AI at Nuance Communications, came up with this one. Formerly known as the IKEA Challenge, this test is an effort to create a physically embodied version of the Turing Test. A fundamental weakness of the Turing Test, says Ortiz, is that it focuses on verbal behaviour while neglecting two important elements of intelligent behaviour: perception and physical action. Computers subjected to the Turing Test, after all, don’t have eyes or hands. As Ortiz pointed out to io9, “These are significant limitations: the field of AI has always assigned great importance to the ability to perceive the world and to act upon it.”
Ortiz’s Construction Challenge is a way to overcome this limitation. Here’s how he described it to io9:
In the Construction Challenge, a set of regular competitions will be organised around robots that can build physical structures such as IKEA-like modular furniture or Lego structures. To do this, a robot entrant will have to process verbal instructions or descriptions of artifacts that must be built, manipulate physical components to create the intended structures, perceive the structures at various stages of construction, and answer questions or provide explanations during the construction.
A separate track will look at scenarios involving collaborative construction of such structures with a human agent. Another track will investigate the learning of commonsense knowledge about physical artifacts (as a child might) through the manipulation of toys, such as Lego blocks, while interacting with a human teacher.
The added benefit of creating such a challenge is that it could foster the development of robots that can succeed in many larger-scale construction tasks, including setting up camps, either on Earth or beyond.
5. The Visual Turing Test
Like Ortiz’s challenge, the Visual Turing Test is an effort to diminish the natural language bias implicit in Turing’s original test. Computer scientists Michael Barclay and Antony Galton from the University of Exeter in the U.K. have developed a test that challenges a machine to mimic the visual abilities of humans.
Humans and software were asked a simple question about the scene depicted above: “Where is the coffee cup?” As you can see each of the multiple choice answers is technically correct — but some, Barclay and Galton note, can be considered more “correct” (i.e. more “human”) than others. As Celeste Biever and Richard Fisher explain at New Scientist:
The ability to describe to someone else where an object is relative to other things sounds like a simple task. In fact, making that choice requires several nuanced and subjective judgements, including the relative size of objects, their uniqueness relative to other objects and their relevance in a particular situation. Humans do it intuitively, but machines struggle.
New Scientist has an interactive version of the test, which challenges you to identify “human” answers from those typical of a computer. You can take it for yourself here.
6. The Reverse Turing Test
What if we switched things around a bit, and rejigged the test such that the machine had to be capable of identifying a human? Such a “test” currently exists in the form of CAPTCHAs — those annoying anti-spam procedures. If the test-taker can accurately transpose a series of wobbly characters, the computer knows it’s dealing with a human.
This verification technique has given rise to an arms race between CAPTCHA and the developers of CAPTCHA-busting bots ; but this game of one-upmanship could conceivably lead to evaluative systems that are exceedingly good at identifying humans from machines. It’s anyone’s guess what such a system might look like in practice, but the case can be made that a machine’s ability to recognise a human via a conversation is itself a reflection of intelligence.
7. Digital Dissection
We need more than behavioural tests to prove that a machine is intelligent; we also need to demonstrate that it contains the cognitive faculties required for human-like intelligence. In other words, we need some proof that it possesses the machine equivalent of a complex and dynamic brain (even if that brain amounts to a series of sophisticated algorithms ). In order to accomplish this, we’ll need to identify the machine-equivalents of the neural correlates of consciousness (NCC). Such an understanding would, in theory, let us know whether we’re dealing with a simulation (a “pretend” mind) or a bona fide emulation.
This is all easier said than done; neuroscientists are still struggling to define NCCs in humans, and much about the human brain remains a mystery . As a viable alternative to the Turing Test, we’ll have to set this one aside for now. But as a potential pathway towards the development of an artificial brain — and even artificial consciousness (AC) — it hold tremendous promise.
8. All of the Above
As shown by the work of Gary Marcus and others, the point of all this isn’t necessarily to create a successor to the Turing Test, but rather a set of tests. Call it the Turing Olympics. By confronting an AI with a diverse set of challenges, judges stand a far better chance of distinguishing bot from human.
One Last Consideration: Revise the Rules of the Loebner Prize
All this being said, some experts don’t believe the current limitations of the Turing Test don’t have to do with the test itself, but the ways in which it’s conducted and judged. Writing in Spectrum IEEE, Lee Gomes explains:
Harvard’s Stuart Shieber, for example, says that many of the problems associated with the test aren’t the fault of Turing but instead the result of the rules for the Loebner Prize, under the auspices of which most Turing-style competitions have been conducted, including last summer’s. Shieber says that Loebner competitions are tailor-made for chatbot victories because of the way they limit the conversation to a particular topic with a tight time limit and encourage nonspecialists to act as judges. He says that a full Turing test, with no time or subject limits, could do the job that Turing predicted it would, especially if the human administering the test was familiar with the standard suite of parlor tricks that programmers use to fool people.
Would these considerations constitute an improvement? Absolutely. But they still don’t get around the bias toward natural language processing skills.