This lecture is a review of what is known about modeling human speech recognition (HSR). A model is proposed, and data are tested against the model.
There are many theories, or points of view, on how human speech recognition functions, yet few of them are comprehensive. What is needed is a set of models, supported by experimental observation, that characterize how human speech recognition actually works. Finally, there is the practical problem of building a machine recognizer. One way to do this is to build a machine recognizer based on reverse engineering of human recognition. This has not been the traditional approach to automatic speech recognition (ASR).
What is needed is insight into why this large gap between human performance and present-day machine performance exists. Author Jont Allen addresses this and other questions.
Immediately following the Second World War, between 1947 and 1955, several classic papers quantified the fundamentals of human speech information processing and recognition. In 1947 French and Steinberg published their classic study on the articulation index. In 1948 Claude Shannon published his famous work on the theory of information. In 1950 Fletcher and Galt published their theory of the articulation index, a theory Fletcher had worked on for 30 years, which integrated his classic studies of loudness and speech perception with models of speech intelligibility. In 1951 George Miller wrote the first book on the subject, Language and Communication, analyzing human speech communication using Claude Shannon's just-published theory of information. Finally, in 1955 George Miller and Patricia Nicely published the first extensive analysis of phone decoding, in the form of confusion matrices, as a function of the speech-to-noise ratio. This work extended the Bell Labs speech articulation studies with ideas from Shannon's information theory. Both Miller and Fletcher showed that speech, as a code, is remarkably robust to the mangling distortions of filtering and noise.