VOICE RECOGNITION USING DSP
This paper deals with the process of automatically recognizing who is speaking on the basis of individual information contained in speech waves. Speaker recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of a person's speech that show up irrespective of what is being said. In a text-dependent system, on the other hand, recognition of the speaker's identity is based on his or her speaking one or more specific phrases, such as passwords, card numbers, or PIN codes. This paper describes a text-independent speaker recognition system that uses mel-frequency cepstrum coefficients (MFCCs) to process the input signal and a vector quantization (VQ) approach to identify the speaker. The system is implemented in MATLAB. The technique is used to control access to services such as voice dialing, banking by telephone, database access services, voice mail, security control for confidential information areas, and remote access to computers.
Principles of Speaker Recognition:
Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.
At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.
(a) Speaker identification
(b) Speaker verification
Figure 1. Basic structures of speaker recognition systems
All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment session or training phase, while the second is referred to as the operation session or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing phase (Figure 1), the input speech is matched against the stored reference models and a recognition decision is made.
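The difference between identification and verification comes down to the final decision rule. The following is a minimal sketch in Python (the paper's MATLAB implementation is not shown here). The function names, the score dictionary, and the threshold values are hypothetical, and the sketch assumes that every registered speaker has already been given a distance score for the test utterance, where a lower score means a better match.

```python
def identify(scores):
    """Speaker identification: return the registered speaker whose
    reference model has the lowest distance to the test utterance."""
    return min(scores, key=scores.get)


def verify(claimed, scores, thresholds):
    """Speaker verification: accept the identity claim only if the distance
    to the claimed speaker's model is at or below that speaker's threshold."""
    return scores[claimed] <= thresholds[claimed]


# Hypothetical distances between one test utterance and three reference models.
scores = {"alice": 4.2, "bob": 1.3, "carol": 2.8}
# Hypothetical speaker-specific thresholds computed in the training phase.
thresholds = {"alice": 2.0, "bob": 2.0, "carol": 2.5}

print(identify(scores))                     # bob
print(verify("bob", scores, thresholds))    # True
print(verify("carol", scores, thresholds))  # False
```

Identification always returns one of the registered speakers, whereas verification can reject a claim outright, which is why only verification needs a threshold.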
Speech Feature Extraction
The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.
The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
Figure 2. An example of speech signal
Mel-frequency cepstrum coefficients processor:
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to variations in the speech signal than the speech waveforms themselves.
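One commonly used approximation of the mel scale is mel(f) = 2595 log10(1 + f/700). This formula is an assumption here, since the paper does not state which mapping it uses. The Python sketch below converts between Hz and mel and places filter center frequencies at equal mel intervals; the function names and the choice of 20 filters are illustrative only.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(f_low, f_high, count):
    """Return `count` center frequencies in Hz that are equally spaced
    on the mel scale strictly between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (count + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(count)]


# Illustrative choice: 20 filters covering 0 Hz to 5 kHz.
centers = mel_spaced_centers(0.0, 5000.0, 20)
for c in centers:
    print(round(c, 1))
```

Because the centers are equidistant in mels, their spacing in Hz grows with frequency, which is the near-linear-then-logarithmic behavior described above. Under this formula 1000 Hz maps to approximately 1000 mels.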
Figure 3. Block diagram of the MFCC processor
Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). Typical values are N = 256 and M = 100.
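Frame blocking with a frame length of N samples and a frame shift of M samples can be sketched in Python as follows. The function name is hypothetical, and dropping trailing samples that do not fill a complete frame is an assumption of this sketch rather than something the paper specifies.

```python
def frame_blocking(signal, frame_len=256, frame_shift=100):
    """Split `signal` into overlapping frames of `frame_len` samples (N),
    each starting `frame_shift` samples (M) after the previous one.
    Trailing samples that do not fill a whole frame are dropped (assumption)."""
    if not 0 < frame_shift < frame_len:
        raise ValueError("require 0 < frame_shift < frame_len (M < N)")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames


# A dummy 1000-sample signal whose values equal their sample indices.
signal = list(range(1000))
frames = frame_blocking(signal)
print(len(frames))    # 8
print(frames[1][0])   # 100
```

Adjacent frames overlap by N - M = 156 samples. At a sampling rate of 10 kHz, a 256-sample frame spans 25.6 msec, which lies inside the interval over which speech is treated as stationary.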
Figure 4. Conceptual diagram illustrating vector quantization codebook formation.
One speaker can be discriminated from another based on the location of centroids.
The next step is to build a speaker-specific VQ codebook for each speaker using that speaker's training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:
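Discrimination by centroid location can be illustrated with a small Python sketch that scores a set of test vectors against each speaker's codebook and picks the codebook with the lowest average distortion. The codebooks and test vectors are made-up two-dimensional values, the function names are hypothetical, and the Euclidean distance is an assumption, since the paper does not name its distance measure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword (centroid)."""
    total = sum(min(euclidean(v, c) for c in codebook) for v in vectors)
    return total / len(vectors)


# Hypothetical codebooks: two centroids per speaker in a 2-D feature space.
codebooks = {
    "speaker_1": [(1.0, 1.0), (2.0, 3.0)],
    "speaker_2": [(6.0, 5.0), (7.0, 7.0)],
}
# Hypothetical feature vectors extracted from an unknown utterance.
test_vectors = [(1.1, 0.9), (2.2, 2.9), (1.8, 3.1)]

best = min(codebooks,
           key=lambda name: average_distortion(test_vectors, codebooks[name]))
print(best)  # speaker_1
```

The test vectors lie close to the centroids of speaker_1 and far from those of speaker_2, so speaker_1 yields the smaller distortion.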
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword $y_n$ according to the rule
$y_n^{+} = y_n (1 + \epsilon)$
$y_n^{-} = y_n (1 - \epsilon)$
where n varies from 1 to the current size of the codebook, and $\epsilon$ is a splitting parameter (we choose $\epsilon = 0.01$).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.
Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to a cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.
Figure 5. Flow diagram of the LBG algorithm
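The LBG procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's MATLAB code. It rests on several assumptions: squared Euclidean distance as the similarity measure, a stopping rule based on the relative decrease in average distortion (one reading of the "preset threshold"), a target codebook size that is a power of two, and empty cells that simply keep their previous codeword. All function and parameter names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lbg(training, size, eps=0.01, tol=1e-3):
    """Design a codebook of `size` codewords (a power of two) by splitting."""
    if size < 1 or size & (size - 1):
        raise ValueError("size must be a power of two")
    codebook = [centroid(training)]                    # step 1
    while len(codebook) < size:                        # step 6 (outer loop)
        # Step 2: split every codeword y into y(1 + eps) and y(1 - eps).
        codebook = [[c * (1 + s * eps) for c in y]
                    for y in codebook for s in (1, -1)]
        prev = float("inf")
        while True:                                    # step 5 (inner loop)
            # Step 3: assign each training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            total = 0.0
            for v in training:
                dists = [sq_dist(v, y) for y in codebook]
                k = dists.index(min(dists))
                cells[k].append(v)
                total += dists[k]
            distortion = total / len(training)
            # Step 4: move each codeword to the centroid of its cell;
            # an empty cell keeps its old codeword (assumption).
            codebook = [centroid(cell) if cell else y
                        for cell, y in zip(cells, codebook)]
            # Stop once the distortion no longer decreases noticeably.
            if prev - distortion <= tol * distortion:
                break
            prev = distortion
    return codebook


# Hypothetical 2-D training vectors forming two well-separated clusters.
training = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
            (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
codebook = lbg(training, 2)
print(sorted([round(c, 3) for c in y] for y in codebook))
# [[1.0, 1.0], [5.0, 5.0]]
```

In this toy example the single initial centroid at (3, 3) is split once, and the two resulting codewords settle on the centers of the two clusters.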
The following are some of the areas where this technology is applied:
Training air traffic controllers
Telephony and other domains
Even when much care is taken, it is difficult to obtain an efficient speaker recognition system, because the task is complicated by highly variable input speech signals. The principal source of this variance is the speaker. Speech signals in training and testing sessions can differ greatly due to many factors, such as changes in a person's voice over time, health conditions (e.g. the speaker has a cold), and speaking rate. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Because of all these difficulties, this technology is still an active area of research.