Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition, or voice recognition) is the process of converting a speech signal to a set of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).

Defining the Problem


According to the "Survey of the State of the Art in Human Language Technology" (1997) by Ron Cole et al., speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, for such applications as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve text formatting or speech understanding.

Speech recognition systems can be characterized by many parameters as in the table below.

Parameter        Range
Speaking mode    Isolated words to continuous speech
Speaking style   Read speech to spontaneous speech
Enrollment       Speaker-dependent to speaker-independent
Vocabulary       Small (< 20 words) to large (> 20,000 words)
Language model   Finite-state to context-sensitive
Perplexity       Small (< 10) to large (> 100)
SNR              High (> 30 dB) to low (< 10 dB)
Transducer       Voice-cancelling microphone to telephone

An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment (a user must provide samples of his or her speech before using them), whereas other systems are said to be speaker-independent, in that no enrollment is necessary.

Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, where the permissible words following each word are explicitly given. More general language models approximating natural language are specified in terms of a context-sensitive grammar. One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied.

In addition, some external parameters can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.
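The loose definition of perplexity given above can be illustrated with a small sketch. The network below is a made-up finite-state language model (the words and the `SUCCESSORS` table are assumptions for illustration only), and each permitted successor is assumed to be equally likely, so perplexity reduces to the geometric mean of the branching factor along a sentence.

```python
import math

# Hypothetical finite-state language model: for each word, the words
# that are permitted to follow it. "<s>" and "</s>" mark sentence start/end.
SUCCESSORS = {
    "<s>": ["call", "dial"],
    "call": ["home", "office", "mom", "dad"],
    "dial": ["home", "office"],
    "home": ["</s>"],
    "office": ["</s>"],
    "mom": ["</s>"],
    "dad": ["</s>"],
}


def perplexity(words):
    """Geometric mean of the number of permitted successors along the sentence."""
    path = ["<s>"] + list(words) + ["</s>"]
    log_sum = 0.0
    for prev, nxt in zip(path, path[1:]):
        choices = SUCCESSORS[prev]
        if nxt not in choices:
            raise ValueError(f"{nxt!r} may not follow {prev!r} in this network")
        # Assumption: uniform probability 1/len(choices) for each successor.
        log_sum += math.log(len(choices))
    return math.exp(log_sum / (len(path) - 1))
```

For the sentence "call home" the branching factors are 2, 4, and 1, so the perplexity is the cube root of 8, approximately 2. A real language model assigns unequal probabilities to successors, in which case the same quantity is computed from the per-word log probabilities rather than from raw counts.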

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal:

  1. The acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in "two", "true", and "butter" in American English. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.
  2. Acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer.
  3. Within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.
  4. Differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Speech Recognition Technology


In terms of technology, most current technical textbooks emphasize the hidden Markov model (HMM) as the underlying technology. It should be noted, however, that dynamic programming approaches, neural network-based approaches, and knowledge-based approaches were also studied intensively in the 1980s and 1990s.

Performance of Speech Recognition Systems


Depending on several factors, speech recognition systems can show a wide range of performance as measured by word error rate. These factors include the environment, the speaking rate of the speaker, and the context (or the grammar) being used in recognition.
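As a minimal sketch of how word error rate is commonly computed, the function below counts the substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript (a word-level edit distance) and divides by the number of reference words. The function name and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, if the reference is "i would like to make a collect call" and the recognizer drops the word "a", there is one error in eight reference words, giving a word error rate of 0.125. Because insertions are counted, the rate can exceed 1.0 for very poor output.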

Most users of speech recognition would agree that dictation machines can achieve very high performance in controlled conditions. Some confusion about performance arises because the terms speech recognition and dictation are often used interchangeably.

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. It should be noted that these optimal conditions usually mean that the test subjects have 1) speaker characteristics matching the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g., office space). This explains why some users, especially those with accents, might find that the recognition rate is perceptibly lower than the expected 98% to 99%.

Other limited-vocabulary systems that require no training can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organisations.

Use


Commercial systems for speech recognition have been available off the shelf since the 1990s. Despite the apparent success of the technology, few people use such speech recognition systems on their desktop computers. It appears that most computer users can create and edit documents and interact with their computer more quickly with conventional input devices, a keyboard and mouse, even though most people are able to speak considerably faster than they can type. Using both keyboard and speech recognition simultaneously, however, can in some cases be more efficient than using either input alone. A typical office environment, with a high level of background speech, is one of the most adverse environments for current speech recognition technologies, and large-vocabulary, speaker-independent systems designed to operate in such environments have significantly lower recognition accuracy. The typical achievable recognition rate as of 2005 for large-vocabulary speaker-independent systems is about 80%-90% in a clear environment, but can be as low as 50% in scenarios such as a cellular phone with background noise. Additionally, heavy use of the speech organs can result in vocal loading.

Speech recognition systems have found use where text must be entered extremely quickly. They are used in legal and medical transcription and in the generation of subtitles for live sports and current affairs programs on television. In these settings the recognizer is not applied directly to the original speech; instead, an operator re-speaks the dialog into software trained on the operator's voice. The operator also has special training: first, to speak clearly and consistently to maximize recognition accuracy; second, to indicate punctuation by various techniques; and often also domain-specific training (especially in medical or legal contexts, where the operator needs to know specialized vocabulary and procedures). In courtrooms and similar situations where the operator's voice would disturb the proceedings, he or she may sit in a soundproofed booth or wear a Stenomask or similar device.

Speech recognition is sometimes a necessity for people who have difficulty interacting with their computers through a keyboard, for example, those with serious carpal tunnel syndrome, impaired extremities, or other physical limitations.

Speech recognition technology is increasingly used for telephone applications such as travel booking and information, financial account information, customer service call routing, and directory assistance. Using constrained grammar recognition, such applications can achieve remarkably high accuracy. Research and development in speech recognition technology has continued to grow as the cost of implementing such voice-activated systems has dropped and their usefulness and efficiency have improved. For example, recognition systems optimized for telephone applications can often supply information about the confidence of a particular recognition; if the confidence is low, the application can prompt callers to confirm or repeat their request (for example, "I heard you say 'billing', is that right?"). Furthermore, speech recognition has enabled the automation of certain applications that cannot be automated using push-button interactive voice response (IVR) systems, such as directory assistance and systems that allow callers to "dial" by speaking names listed in an electronic phone book. Nevertheless, speech recognition-based systems remain the exception because push-button systems are still much cheaper to implement and operate.
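The confidence-driven confirmation described above might be sketched as follows. The function name, the two thresholds, and the reprompt wording are hypothetical values chosen for illustration; a deployed application would tune them against its own recognizer and call data.

```python
def next_action(hypothesis, confidence, accept_at=0.80, confirm_at=0.45):
    """Decide how a telephone application reacts to one recognition result.

    `confidence` is assumed to be a score in [0, 1] reported by the recognizer;
    `accept_at` and `confirm_at` are illustrative thresholds, not standard values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if confidence >= accept_at:
        # High confidence: act on the recognized request directly.
        return ("accept", hypothesis)
    if confidence >= confirm_at:
        # Medium confidence: ask the caller to confirm what was heard.
        return ("confirm", f"I heard you say '{hypothesis}', is that right?")
    # Low confidence: ask the caller to repeat the request.
    return ("reprompt", "Sorry, I didn't catch that. Please repeat your request.")
```

With these assumed thresholds, a result of "billing" at confidence 0.6 produces the confirmation prompt quoted in the text, while the same word at confidence 0.9 is accepted without confirmation.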

Speech recognition is also used for speech fluency evaluation and language instruction.

Noisy Channel Formulation of Statistical Speech Recognition


Many modern approaches, such as HMM-based and ANN-based speech recognition, are based on the noisy channel formulation (see also Alternative formulation of speech recognition). In that view, the task of a speech recognition system is to search for the most likely word sequence given the acoustic signal. In other words, the system searches for the most likely word sequence \tilde{W} among all possible word sequences W^*, given the acoustic signal A (which some call the observation sequence, following hidden Markov model terminology):

\tilde{W} = \arg\max_{W \in W^*} \Pr(W \mid A)

Based on Bayes' rule, the above formulation can be rewritten as

\tilde{W} = \arg\max_{W \in W^*} \frac{\Pr(A \mid W) \Pr(W)}{\Pr(A)}

Because the acoustic signal, and hence \Pr(A), is the same regardless of which word sequence is chosen, the above can usually be simplified to

\tilde{W} = \arg\max_{W \in W^*} \Pr(A \mid W) \Pr(W)

The term \Pr(A \mid W) is generally called the acoustic model. The term \Pr(W) is generally known as the language model.
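The following toy sketch shows the decision rule in the log domain, where the product of the acoustic and language model terms becomes a sum. The two candidate sentences and all probability values are invented for illustration; a real recognizer searches a vastly larger space and obtains these scores from trained models.

```python
import math


def decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate W maximizing log Pr(A|W) + log Pr(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + language_logprob[w])


CANDIDATES = ["how to wreck a nice beach", "how to recognize speech"]

# Made-up acoustic model scores log Pr(A|W): the two sentences sound alike,
# and here the acoustics slightly favour the first one.
ACOUSTIC = {
    "how to wreck a nice beach": math.log(0.6),
    "how to recognize speech": math.log(0.4),
}

# Made-up language model scores log Pr(W): the second sentence is assumed
# to be a far more plausible word sequence.
LANGUAGE = {
    "how to wreck a nice beach": math.log(0.001),
    "how to recognize speech": math.log(0.05),
}

best = decode(CANDIDATES, ACOUSTIC, LANGUAGE)

# Using the acoustic model alone (a flat language model) picks the other sentence.
FLAT = {w: 0.0 for w in CANDIDATES}
acoustic_only = decode(CANDIDATES, ACOUSTIC, FLAT)
```

In this example the combined score selects "how to recognize speech", whereas the acoustic score alone would select "how to wreck a nice beach", which illustrates why the language model term matters. The denominator \Pr(A) is omitted because it is identical for every candidate and cannot change which one is the maximum.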

Both acoustic modeling and language modeling are important areas of study in modern statistical speech recognition. This entry focuses on explaining the use of the hidden Markov model (HMM) because it is very widely used in many systems. (Language modeling also has many other applications, such as smart keyboards and document classification; please refer to the corresponding entries.)

Approaches of Statistical Speech Recognition


Hidden Markov Model (HMM)-based Speech Recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). These are statistical models that output a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. That is, over a short time span in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model over many such stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech and decorrelating the (log) spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word or, for more general speech recognition systems, each phoneme will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
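The cepstral computation described above can be sketched in a few lines using only the standard library. This is a simplified illustration under stated assumptions: a Hamming window, a naive discrete Fourier transform, the logarithm of the magnitude spectrum, and a type-II cosine transform. Practical front ends typically add further steps (for example a perceptually motivated filterbank), which are omitted here; the function name and the test signal are made up for the example.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=13):
    """Return the first `num_coeffs` cepstral coefficients of one speech frame."""
    n = len(frame)
    # Taper the frame with a Hamming window to reduce spectral leakage.
    windowed = [
        sample * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, sample in enumerate(frame)
    ]
    # Log magnitude spectrum for frequency bins 0 .. n/2 (naive DFT, O(n^2)).
    bins = n // 2 + 1
    log_spectrum = []
    for k in range(bins):
        value = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)
        )
        log_spectrum.append(math.log(max(abs(value), 1e-10)))  # floor avoids log(0)
    # Cosine transform (DCT-II) decorrelates the log spectrum; keep the first terms.
    return [
        sum(v * math.cos(math.pi * q * (m + 0.5) / bins) for m, v in enumerate(log_spectrum))
        for q in range(num_coeffs)
    ]


# Hypothetical 20 ms frame sampled at 8 kHz: two sinusoids standing in for speech.
SAMPLE_RATE = 8000
FRAME = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1270 * t / SAMPLE_RATE)
    for t in range(160)
]
FEATURES = cepstral_coefficients(FRAME)
```

One useful property of this representation is that making the frame louder by a constant factor adds the same amount to every log-spectrum bin, so only the zeroth coefficient changes while the others stay the same. In a recognizer, one such 13-dimensional vector would be produced for each successive frame.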

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques that would take too long to explain properly here, but, to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so that phones with different left and right contexts have different realizations), and tree clustering of the contexts to handle unseen contexts. It would use cepstral normalization to normalize for different recording conditions and, depending on how much time the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and might in addition use heteroscedastic linear discriminant analysis (HLDA); alternatively, the system might skip the delta and delta-delta coefficients and use LDA, followed perhaps by HLDA or a global semitied covariance transform (also known as maximum likelihood linear transform, MLLT). A company with a large amount of training data would probably want to consider discriminative training techniques such as maximum mutual information (MMI), MPE, or (for short utterances) MCE; if a large amount of speaker-specific enrollment data were available, more wholesale speaker adaptation could be done using MAP or, at least, tree-based maximum likelihood linear regression.

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. There is a choice between dynamically creating a combination hidden Markov model that includes both the acoustic and language model information, and combining them statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). The static approach is simpler to work with, but it is memory hungry.
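The Viterbi search mentioned above can be illustrated on a toy model. The sketch below uses a two-state HMM with discrete observations and invented probabilities (the state names, observation symbols, and all numbers are assumptions for illustration); real systems use Gaussian mixture output distributions over feature vectors and far larger state spaces.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # best[s]: log probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, step = {}, {}
        for s in states:
            # Best predecessor of s for this time step.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            step[s] = prev
        best = new_best
        backpointers.append(step)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


def logs(table):
    """Convert a nested table of probabilities to log probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}


# Hypothetical toy model: silence versus speech, observed as quiet/loud frames.
STATES = ["sil", "speech"]
START = logs({"sil": 0.8, "speech": 0.2})
TRANS = logs({"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}})
EMIT = logs({"sil": {"quiet": 0.9, "loud": 0.1}, "speech": {"quiet": 0.3, "loud": 0.7}})

PATH, SCORE = viterbi(["quiet", "loud", "loud"], STATES, START, TRANS, EMIT)
```

For the observation sequence quiet, loud, loud, the best path in this toy model is sil, speech, speech. Working in log probabilities avoids numerical underflow on long utterances, which is why the scores are summed rather than multiplied.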

Neural Network-based Speech Recognition

Another approach to acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are used where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, and the reported results are generally better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

Dynamic Time Warping (DTW)-based Speech Recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions, i.e., the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
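A minimal sketch of the dynamic time warping recurrence follows, assuming one-dimensional numeric sequences and absolute difference as the local distance. In speech recognition the sequences would instead be feature vectors with a vector distance, and practical implementations usually add constraints on how far the warping path may stray; neither refinement is shown here.

```python
def dtw_distance(a, b):
    """Minimal accumulated distance between sequences a and b under time warping."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


# The same contour produced at two different speeds.
SLOW = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
FAST = [1, 2, 3, 2, 1]
```

Here `dtw_distance(SLOW, FAST)` is zero because the slow sequence is the fast one with each element held twice as long, whereas a plain element-by-element comparison could not even be applied to sequences of different lengths.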

Knowledge-based Speech Recognition

Current Research problems

  • Speech recognition systems are based on simplified stochastic models, so any aspects of the speech that may be important to recognition but are not represented in the models cannot be used to aid in recognition.
  • Speech segmentation, the division of the continuous speech signal into elementary units, is a very difficult problem. This task actually consists of two separate problems: the breakup and classification of the signal into a string of discrete "atomic" sounds (phonemes), and the division of that string into meaningful substrings (words or, more generally, lexical units). For most languages, the first task is already quite difficult, partly because of co-articulation, which causes phonemes to interact or combine, even across word boundaries. The second task is not trivial either, because in normal spoken speech there are no pauses between words. An example often quoted in the field is the phrase "how to wreck a nice beach", which, when spoken, sounds like "how to recognize speech". Proper segmentation therefore depends on context, syntax, and semantics, meaning human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer.
  • Intonation and sentence stress can play an important role in the interpretation of an utterance. As a simple example, utterances that might be transcribed as "go!", "go?" and "go." can clearly be recognized by a human, but determining which intonation corresponds to which punctuation is difficult for a computer. Most speech recognition systems are unable to provide any information about an utterance other than which words were pronounced, so information about stress and intonation cannot be used by the application using the recognizer. Researchers are currently investigating emotion recognition, which may have practical applications. For example, if a system detects anger or frustration, it can try asking different questions or forward the caller to a live operator.
  • In a system designed for dictation, an ordinary spoken signal doesn't provide sufficient information to create a written form that obeys the normal rules for written language, such as punctuation and capitalization. These systems typically require the speaker to explicitly say where punctuation is to appear.

Commercial Speech Recognition: A Survey of Market players


The challenge for developers of Automatic Speech Recognition (ASR) engines is that the end customer judges them on one criterion: did it understand what I said? That leaves little room for differentiation. Of course, there are areas like multi-language support, tuning tools, and integration API (the proposed standard MRCP or proprietary), but recognition quality is most visible. Because of the complex algorithms and language models required to implement a high-quality speech recognition engine, it is difficult both for new companies to enter this market and for existing vendors to maintain the investment level necessary to keep up and move ahead.

Currently, Nuance Communications (formerly known as ScanSoft) dominates the speech recognition market for server-based telephony and PC applications. There are several small vendors, such as Aculab, Fonix Speech Group, Loquendo, LumenVox, Sensory Inc., and Verbio, but they are essentially niche players. Nuance's speech recognition business is actually composed of SpeechWorks and the products of several former niche players. IBM has also participated in the speech recognition engine market, but its ViaVoice product has gained traction primarily in the desktop command and control (grammar-constrained) and dictation markets. Nuance also makes Dragon NaturallySpeaking, a desktop dictation system with theoretically possible recognition rates of up to 99 percent. In practice this is impossible to achieve, because people use a vocabulary of over 300,000 words and these words have not all been put into the vocabulary of NaturallySpeaking.

A newcomer to this area, Talking Desktop, combines speech recognition with text-to-speech to provide an interactive software experience.

Philips Speech Recognition Systems is the market leader in enterprise healthcare speech recognition systems with its flagship product SpeechMagic (according to a report of Frost & Sullivan of December 2005). SpeechMagic is installed in more than 7000 professional sites world-wide, supports 23 recognition languages and boasts a portfolio of more than 150 specialized recognition vocabularies.

For Mac users, iListen from MacSpeech, Inc. is available. Based on the Philips speech engine, iListen has been shipping on the Mac since 2000.

Speaker-independent speech recognition embedded for mobile phones is one of the fastest growing market segments. Grammar-based command and control and even dictation systems can now be purchased in mobile handsets from operators such as Cingular Wireless, Sprint PCS, Verizon Wireless, and Vodafone. Voicesignal is the dominant vendor in this rapidly growing segment. Microsoft, Nuance, and IBM have also announced intentions to enter this segment.

The big software heavyweights, Microsoft (Speech Server) and IBM, are now making substantial investments in speech recognition. Microsoft introduced speech recognition in Office 2002, then greatly improved it in Office 2003 (with engine 6.1), and has now invested heavily in upgrading speech recognition in Windows Vista. While speech recognition in Office was mainly geared to dictation, Vista Speech has also added extensive command-and-control functionality. IBM claims to have put one hundred speech researchers on the problem of taking ASR beyond the level of human speech recognition by 2010. Bill Gates is also making very large investments in speech recognition research at Microsoft. At SpeechTEK, Gates predicted that by 2011 the quality of ASR will catch up to human speech recognition. IBM and Microsoft are still well behind Nuance in market share.

Low-cost speech recognition chips for toys and consumer electronics are supplied by Sensory Inc. and Extell Technologies.

For further information


Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek which is a more up to date book (1998). Keep an eye on government sponsored competitions such as those organised by DARPA (the telephone speech evaluation was most recently known as Rich Transcription). In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting (if you are very brave). You could also search for Carnegie Mellon University's SPHINX toolkit. For information on the current state of the commercial products, their comparative features, and the range of languages they cover see http://sr.even-zohar.com.


This article is licensed under the GNU Free Documentation License. It uses material from the "Speech recognition".
