IBM has edged past Microsoft in its latest speech-recognition test, achieving a word error rate of 5.5 percent.
That error rate roughly matches the one in 20 words that humans miss when listening to a conversation, and puts IBM ahead of Microsoft’s recent best of 5.9 percent, recorded in October, which in turn beat IBM’s top 2016 performance of 6.9 percent.
But IBM stresses that it could still be a while before machines can beat humans at understanding conversations. Microsoft claimed it had reached “human parity” when announcing its 5.9 percent word error rate, but IBM says its new study shows that claim was premature.
That’s why IBM principal research scientist George Saon says, “We’re not popping the champagne yet.”
“As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent,” Saon said.
“While our breakthrough of 5.5 percent is a big one, this discovery of human parity at 5.1 percent proved to us we have a way to go before we can claim technology is on par with humans.”
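Word error rate, the metric behind all of these figures, is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the recognizer’s transcript and a reference transcript, divided by the number of reference words. A minimal sketch of that calculation, not IBM’s or Microsoft’s actual scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word in a 20-word reference yields the roughly
# one-in-20 (5 percent) rate attributed to human listeners:
ref = ("the quick brown fox jumps over the lazy dog near "
       "the old barn by the quiet river at dawn today")
hyp = ("the quick brown fox jumps over the lazy dog near "
       "the old barn by the quiet river at dawn tonight")
print(word_error_rate(ref, hyp))  # 0.05
```

Published benchmarks like Switchboard use standardized scoring tools with agreed text-normalization rules, so the exact figures depend on more than this bare edit-distance formula.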
To achieve the 5.5 percent error rate, IBM combined LSTM (long short-term memory) and WaveNet language models with three strong acoustic models. It tested these models on the Switchboard corpus, a collection of recorded phone conversations between strangers on assigned topics.
IBM also tested its network on a different and more challenging body of conversations known as ‘CallHome’, which consists of casual chats between family members on a range of topics that aren’t fixed beforehand.
IBM scored a 10.3 percent word error rate on this test and found that human performance here was 6.8 percent.
In a research paper, the company highlights a flaw in the Switchboard test: speaker overlap between the training and test sets. It notes that “36 out of 40 test speakers appear in the training data, some in as many as eight different conversations, and our acoustic models are very good at memorizing speech patterns seen during training.”
The larger gap on the CallHome test reflects the fact that neither the acoustic nor the language models had been exposed to the test speakers’ data.
IBM says it is using these speech-recognition advances to add new features to its Watson Speech to Text service.