Accent Recognition System - Human Computer Interaction


Project Summary: Designing a speech recognition system for various accents of the English language by studying the important features of vowel segments.

Programming Language: Python
Project Status: Complete
Project Guide: Dr. Pradip K. Das | Professor, IIT Guwahati

Vowels have a distinct harmonic spectrum and can be recognized easily because they are the voiced parts of speech. Although work has been done on accent recognition based on formants and phonemes, focusing on vowels can improve efficiency considerably. The objective of this project is to study and understand the importance of vowels in recognizing accents of the English language. The performance of this method is also analysed.

The articulation of vowels differs from that of other phonemes. Vowels are classified on the basis of three criteria: height, horizontal position (backness) and roundness. The limits of height and horizontal position define a vowel space, which differs from language to language. The Cardinal Vowel system is defined on the basis of these vowel spaces.

Method I : Formant Based Analysis

The difference between the first (F1) and second (F2) formant frequencies of each vowel is stored in a codebook. F1 is inversely related to vowel height, and F2 is related to the degree of backness in the vowel space. After feature extraction, the probability of each accent is calculated for the word.
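The codebook idea can be sketched roughly as follows. This is a minimal illustration, not the method used in the cited work: the accent names and codebook values are made-up placeholders, and turning codebook distances into probabilities with a softmax is an assumption, since the text does not say how the probability is computed.

```python
import math

# Hypothetical codebook: accent -> vowel -> typical (F2 - F1) in Hz.
# The numbers are illustrative placeholders, not measured values.
CODEBOOK = {
    "accent_A": {"a": 500.0, "i": 2000.0, "u": 600.0},
    "accent_B": {"a": 650.0, "i": 1800.0, "u": 750.0},
}


def formant_difference(f1_hz, f2_hz):
    """Feature stored in the codebook: gap between the first two formants."""
    return f2_hz - f1_hz


def accent_probabilities(observed, codebook=CODEBOOK, scale_hz=100.0):
    """Turn codebook distances into one probability per accent.

    observed: list of (vowel, f1_hz, f2_hz) tuples for the vowels of one word.
    scale_hz: assumed softness of the match; larger values flatten the result.
    """
    scores = {}
    for accent, vowels in codebook.items():
        # Sum of squared, scaled deviations from the codebook entries.
        cost = sum(
            ((formant_difference(f1, f2) - vowels[vowel]) / scale_hz) ** 2
            for vowel, f1, f2 in observed
        )
        scores[accent] = -cost
    # Softmax over negative costs (an assumption); subtract the max for stability.
    top = max(scores.values())
    weights = {accent: math.exp(s - top) for accent, s in scores.items()}
    total = sum(weights.values())
    return {accent: w / total for accent, w in weights.items()}


# Two vowels from one hypothetical word.
probs = accent_probabilities([("a", 700.0, 1200.0), ("i", 300.0, 2300.0)])
```

Here `probs` maps each accent in the codebook to a value between 0 and 1, and the values sum to 1. In practice the formant frequencies would come from a formant tracker, which is outside the scope of this sketch.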

Method II : Phoneme Based Analysis

In different languages, sounds are restricted to particular environments or contexts, where they show either contrastive distribution (swapping them changes the meaning) or complementary distribution (swapping them has no effect on meaning). Sounds in complementary distribution are called allophones. Whenever an accent-specific phoneme is observed, the accent of the speaker can thereby be detected.

Technology, Entertainment, Design (TED) talks are used to form the dataset for accent recognition. Five speakers were chosen, and each of the five cardinal English vowels (/a/, /e/, /i/, /o/, /u/) was manually segmented from their talks.

The five accents of the English language chosen for system training, each represented by one speaker, are African, American, Asian, Australian and European.

1. Data Pre-processing : DC shift correction and normalization are applied to the manually segmented vowels of the training set.

2. MFCC Calculation : A reduced feature set, the Mel-frequency cepstral coefficients (MFCCs), is calculated for each frame of each data sample.

3. Pronunciation Dictionary : The frame-wise MFCCs for each vowel of each speaker are averaged into one model per class, and the models are stored in a look-up dictionary called the pronunciation dictionary (PD). Whenever an isolated vowel is given, the automatic speech recognition (ASR) system uses this dictionary to recognize the accent of the sample. The stored MFCC model for a particular vowel and accent is therefore taken to represent all variations of that vowel in that accent.

4. Recognition Procedure : First, the test sample is pre-processed and its frame-wise MFCCs are computed. The similarity between each stored dictionary model and the test sample is then calculated. The accent of the closest model is taken to be the accent of the speaker.
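The four steps above can be sketched end to end as follows. This is a minimal NumPy-only illustration under stated assumptions, not the project's actual code: the frame length, hop size, filter count and number of coefficients are assumed defaults, "normalization" is assumed to mean peak normalization, the class model is assumed to be the mean MFCC vector over all frames, and similarity is assumed to be Euclidean distance. All function names are hypothetical, and the input signals are synthetic tones rather than real vowel segments.

```python
import numpy as np


def preprocess(signal):
    """Step 1: remove the DC shift, then normalize the peak amplitude to 1."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            bank[i - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    """Step 2: frame-wise MFCCs (window, power spectrum, mel energies, log, DCT)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(power @ bank.T + 1e-10)
    # DCT-II basis written out explicitly so that only NumPy is needed.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T  # shape: (n_frames, n_coeffs)


def build_dictionary(training):
    """Step 3: {(accent, vowel): [signal, ...]} -> {(accent, vowel): mean MFCC vector}."""
    dictionary = {}
    for key, signals in training.items():
        frames = np.vstack([mfcc(preprocess(s)) for s in signals])
        dictionary[key] = frames.mean(axis=0)
    return dictionary


def recognize(signal, vowel, dictionary):
    """Step 4: return the accent whose stored model for this vowel is closest."""
    sample = mfcc(preprocess(signal)).mean(axis=0)
    candidates = {a: m for (a, v), m in dictionary.items() if v == vowel}
    return min(candidates, key=lambda a: np.linalg.norm(sample - candidates[a]))


# Synthetic stand-ins for segmented vowels (tones with a DC offset), not real speech.
rng = np.random.default_rng(0)
t = np.arange(4000) / 16000.0


def tone(f0):
    return 0.3 + np.sin(2 * np.pi * f0 * t) + 0.01 * rng.standard_normal(t.size)


models = build_dictionary({("accent_A", "a"): [tone(300.0)],
                           ("accent_B", "a"): [tone(1500.0)]})
predicted = recognize(tone(320.0), "a", models)
```

Averaging all frames into a single vector keeps the dictionary small but discards how the vowel changes over time. A frame-sequence comparison would be an alternative if that information matters.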

• Comparison of /a/ - For /a/, the Asian and European values are close, whereas the African, Australian and American values behave similarly to one another.

• Comparison of /e/ - For /e/, the Australian and American accents are quite similar. The African, Asian and European accents are altogether different.

• Comparison of /i/ - For /i/, the Asian and African accents are quite similar, whereas the American, Australian and European accents show different properties.

• Comparison of /o/ - For /o/, all accents follow different trends.

• Comparison of /u/ - For /u/, the African and American accents have similar features, whereas the Asian, Australian and European accents show no pattern.
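One way such per-vowel comparisons could be quantified is to rank accent pairs by the distance between their stored models. The sketch below is an assumption about how "similar" might be measured (Euclidean distance between feature vectors). The function name is hypothetical, and the vectors are placeholders chosen only to exercise the code, not results from this project.

```python
from itertools import combinations

import numpy as np


def closest_and_farthest(models):
    """models: {accent: feature vector for one vowel}.

    Returns the most similar pair, the least similar pair, and the full
    table of pairwise Euclidean distances.
    """
    distances = {
        (a, b): float(np.linalg.norm(np.asarray(models[a]) - np.asarray(models[b])))
        for a, b in combinations(sorted(models), 2)
    }
    ranked = sorted(distances, key=distances.get)
    return ranked[0], ranked[-1], distances


# Placeholder 3-dimensional vectors, not measured MFCC models.
example = {
    "accent_A": [1.0, 0.0, 0.0],
    "accent_B": [1.1, 0.0, 0.0],
    "accent_C": [5.0, 5.0, 5.0],
}
closest, farthest, table = closest_and_farthest(example)
```

With real dictionary models in place of `example`, the smallest entries of `table` would correspond to the accent pairs described as similar for that vowel.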

Studying isolated accented vowels gives an idea of how accents of the same language vary and how this variation can be handled, so that automatic speech recognition systems can cope with it and be used for more general purposes. Distinguishing the same vowel across accents more accurately will require additional constraints to be taken into account. Whether the system is accent-independent or accent-dependent, more training data will help generalize the features appropriately.