USF Libraries
USF Digital Collections

Representation and interpretation of manual and non-manual information for automated American Sign Language recognition


Material Information

Title:
Representation and interpretation of manual and non-manual information for automated American Sign Language recognition
Physical Description:
Book
Language:
English
Creator:
Parashar, Ayush S
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla.
Publication Date:
2003
Subjects

Subjects / Keywords:
expectation maximization
relational distributions
space of probability functions
face detection and tracking
principal component analysis
Dissertations, Academic -- Computer Science -- Masters -- USF   ( lcsh )
Genre:
government publication (state, provincial, territorial, dependent)   ( marcgt )
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: Continuous recognition of sign language has many practical applications, and it can help to improve the quality of life of deaf persons by facilitating their interaction with the hearing populace in public situations. This has led to some research in automated continuous American Sign Language (ASL) recognition. However, most work in continuous ASL recognition has used only top-down Hidden Markov Model (HMM) based approaches, and no prior work has used facial information, which is considered to be fairly important. In this thesis, we explore a bottom-up approach based on the use of Relational Distributions and the Space of Probability Functions (SoPF) for intermediate-level ASL recognition. We also use non-manual information, first to decrease the number of deletion and insertion errors, and second to find whether the ASL sentence contains 'Negation', for which we use motion trajectories of the face. The experimental results show: - The SoPF representation works well for ASL recognition. The accuracy based on the number of deletion errors is 95% when considering the 8 most probable signs in the sentence, and 88% when considering the 6 most probable signs. - Using facial (non-manual) information increases the accuracy for the top 6 signs from 88% to 92%; thus the face does carry information content. - It is difficult to directly combine the manual information (from hand motion) with the non-manual (facial) information to improve the accuracy, for two reasons: 1. The manual images are not synchronized with the non-manual images; for example, the same facial expression is not present at the same manual position in two instances of the same sentence. 2. Another problem in finding the facial expression related to a sign occurs when a strong non-manual indicating 'Assertion' or 'Negation' is present in the sentence; in such cases the facial expressions are totally dominated by the face movements indicated by 'head shakes' or 'head nods'. - The number of sentences that have 'Negation' in them and are correctly recognized with the help of motion trajectories of the face is 27 out of 30.
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2003.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Ayush S Parashar.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 80 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001416917
oclc - 52832761
notis - AJJ4769
usfldc doi - E14-SFE0000055
usfldc handle - e14.55
System ID:
SFS0024751:00001


This item is only available as the following downloads:


Full Text


REPRESENTATION AND INTERPRETATION OF MANUAL AND NON-MANUAL INFORMATION FOR AUTOMATED AMERICAN SIGN LANGUAGE RECOGNITION

by

AYUSH S. PARASHAR

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Major Professor: Sudeep Sarkar, Ph.D.
Dmitry Goldgof, Ph.D.
Barbara Loeding, Ph.D.

Date of Approval: July 9, 2003

Keywords: Expectation Maximization, Relational Distributions, Space of Probability Functions, Face Detection and Tracking, Principal Component Analysis

Copyright 2003, Ayush S. Parashar

DEDICATION

This thesis is dedicated to Ma, Papa and Bhaiya.

ACKNOWLEDGEMENTS

I would like to thank Dr. Sudeep Sarkar for giving me this opportunity to work with him. I owe him a lot for having given direction to my thoughts and unlimited encouragement. Without his insights and patient guidance this work would not have been possible. I also thank Dr. Dmitry Goldgof and Dr. Barbara Loeding for the time they took to review this thesis and for their helpful comments. I am grateful to my family, which has always supported me. I would also like to thank Ms. Jolene Bertlo and Ms. Peggy Kledzik of USF Student Disability Services for signing the ASL video sequences. I cannot forget the help and support of all my friends and colleagues, who have helped me every step of the way.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Overview of the Approach
1.2 Thesis Organization
CHAPTER 2 PRIOR WORK
2.1 Static Sign Recognition
2.2 Isolated Sign Recognition
2.3 Continuous Sign Recognition
2.4 Recognition of Head Shakes and Nods
2.5 Existing ASL Datasets
CHAPTER 3 DATA COLLECTION
3.1 ASL Conventions
3.2 Dataset
3.3 Camera Setup and Recording Parameters
3.4 Statistics of ASL Signs and Sentences
CHAPTER 4 SEGMENTATION OF SKIN PIXELS USING EM
4.1 Color Space
4.2 Expectation Maximization (EM) Algorithm for Pixel Classification
4.3 Connected Blob Analysis
CHAPTER 5 MOTION MODELING USING RELATIONAL DISTRIBUTIONS AND SPACE OF PROBABILITY FUNCTIONS
5.1 Relational Distributions
5.2 Space of Probability Functions
CHAPTER 6 FACE DETECTION AND TRACKING
6.1 Detecting Face Using Eye Templates
6.2 Masking the Face With an Elliptical Structure
6.3 Eigen Representation for Condensing the Facial Expression Information
6.4 Motion Trajectory of Face

CHAPTER 7 EVALUATION FRAMEWORK AND RESULTS
7.1 Similarity Measures and Position of Sign in the Sentence
7.2 Deletion and Insertion Errors
7.3 Experiments and Results
7.3.1 Integrating Non-Manual Information
7.4 Using Motion Trajectories of Face to Find 'Negation'
CHAPTER 8 CONCLUSION AND DISCUSSION
8.1 Future Work
REFERENCES

LIST OF TABLES

Table 2.1 Snapshot of Research in Continuous Word Recognition in ASL.
Table 3.1 25 Sentences Used With ASL Transcription.
Table 3.2 39 ASL Gestures Used.
Table 4.1 Algorithm for Segmentation of Skin Pixels.
Table 7.1 Eight Signs With Smallest Distances and Their Position.
Table 7.2 Eight Signs With Smallest Distances and Their Position Using Non-Manual Information.
Table 7.3 Training and Testing Sets for 5-Fold Cross Validation.
Table 7.4 Deletion Error for the Five Test Cases.
Table 7.5 Insertion Errors Without Use of Non-Manual Information.
Table 7.6 Insertion Errors With and Without Use of Non-Manual Information.
Table 7.7 Deletion Errors With and Without Use of Non-Manual Information.

LIST OF FIGURES

Figure 1.1 Model for ASL Recognition.
Figure 1.2 Processes in Low and Intermediate Level Recognition.
Figure 3.1 Images from Data Collection.
Figure 3.2 Camera Setup for ASL Data Acquisition.
Figure 3.3 Images Used for Synchronization.
Figure 4.1 Segmentation of Skin Pixels Using EM.
Figure 4.2 Fitted Gaussians After One Run of EM Process.
Figure 4.3 Final Fitted Gaussians.
Figure 5.1 Finding Edge Image of Skin Blobs.
Figure 5.2 Edge Pixel Based 2-ary Relational Distribution.
Figure 5.3 Variations in Relational Distributions.
Figure 5.4 Eigenvectors of SoPF.
Figure 5.5 Fall of Eigenvalues for Relational Distributions.
Figure 6.1 Flowchart for the Face Detection and Tracking.
Figure 6.2 Eye Templates Used and Mean Image of the Eye Templates.
Figure 6.3 Eye Detection and Masking of Face With the Elliptical Structure.
Figure 6.4 Anthropometric Facial Proportions and Masking of Face With the Elliptical Structure.
Figure 6.5 Image Showing the Face After Eye Detection and Masking With the Elliptical Structure.
Figure 6.6 Fall of Eigenvalues for Facial Expressions.
Figure 6.7 Face Movement and Its Motion Trajectory for a Sentence.
Figure 6.8 Motion Trajectories of Face for Various Sentences.
Figure 7.1 Bottom-Up Approach for Intermediate Level Recognition.

Figure 7.2 Eigenvectors of Face Space.
Figure 7.3 Eigenvectors of SoPF.
Figure 7.4 Fall of Eigenvalues for the Face Space & SoPF.
Figure 7.5 Distribution of Signs Within the Sentence.
Figure 7.6 The Position of the Signs in the Sentence.
Figure 7.7 Eight Signs With Minimum Distance When Correlated With the Sentence.
Figure 7.8 Eight Signs With Minimum Distance in Sentence Using the Non-Manual Information.
Figure 7.9 Variation in Deletion Error With Number of Signs Considered Per Sentence (n).
Figure 7.10 Flowchart Explaining the Strategy to Combine the Manual and Non-Manual Information.
Figure 7.11 Aspect Ratio (W/H) for the Sentence With 'Negation' in It.
Figure 7.12 Scatter Plot of Width, Height and Aspect Ratio of Motion Trajectories of Face.

REPRESENTATION AND INTERPRETATION OF MANUAL AND NON-MANUAL INFORMATION FOR AUTOMATED AMERICAN SIGN LANGUAGE RECOGNITION

Ayush S. Parashar

ABSTRACT

Continuous recognition of sign language has many practical applications, and it can help to improve the quality of life of deaf persons by facilitating their interaction with the hearing populace in public situations. This has led to some research in automated continuous American Sign Language (ASL) recognition. However, most work in continuous ASL recognition has used only top-down Hidden Markov Model (HMM) based approaches, and no prior work has used facial information, which is considered to be fairly important. In this thesis, we explore a bottom-up approach based on the use of Relational Distributions and the Space of Probability Functions (SoPF) for intermediate-level ASL recognition. We also use non-manual information, first to decrease the number of deletion and insertion errors, and second to find whether the ASL sentence contains 'Negation', for which we use motion trajectories of the face. The experimental results show:

- The SoPF representation works well for ASL recognition. The accuracy based on the number of deletion errors is 95% when considering the 8 most probable signs in the sentence, and 88% when considering the 6 most probable signs.

- Using facial (non-manual) information increases the accuracy for the top 6 signs from 88% to 92%; thus the face does carry information content.

- It is difficult to directly combine the manual information (information from hand motion) with the non-manual (facial) information to improve the accuracy, for the following two reasons:

1. The manual images are not synchronized with the non-manual images. For example, the same facial expression is not present at the same manual position in two instances of the same sentence.

2. Another problem in finding the facial expression related to a sign occurs when a strong non-manual indicating 'Assertion' or 'Negation' is present in the sentence. In such cases the facial expressions are totally dominated by the face movements, which are indicated by 'head shakes' or 'head nods'.

- The number of sentences that have 'Negation' in them and are correctly recognized with the help of motion trajectories of the face is 27 out of 30.

CHAPTER 1
INTRODUCTION

The problem of automated sign language recognition can be put across as, "Given a video of a sign language sentence, can we identify the signs in the sentence and reconstruct the sentence?" The solution to this problem has many practical implications. First, advances in automated sign language recognition are necessary to improve the quality of life of deaf persons by facilitating their interaction with the hearing populace in public situations. For instance, the use of innovative computer technologies can provide a solution to the dilemma a security screener faces in attempting to communicate with deaf passengers during the course of daily business activities. It can also be helpful in other places like a courtroom, a convention, or even a grocery store. On another note, human-computer interaction (HCI) is gradually moving towards a modality where speech recognition will play a major role. While speech recognition has made rapid advances, gesture recognition is lagging behind. With this gradual shift to speech-based I/O devices, there is a great danger that persons who rely solely on sign languages for communication will be deprived of access to state-of-the-art technology unless there are significant advances in automated recognition of sign languages. Second, the problem of automated sign language recognition is also worthwhile from a scientific and technological point of view, since advances on this problem would definitely impact the general problem of automated gesture recognition, which is at the core of designing the next generation of man-machine interfaces.

Most of the work in continuous sign language recognition [59, 60, 61, 50] has used HMMs (Hidden Markov Models) for recognition and has not in any way used facial or non-manual information. (Non-manual refers to information including facial expressions, face motion, and torso movement, while manual refers to the information conveyed through the motion of the hands.) Although facial information is considered to be fairly important [61, 2], no prior work on continuous ASL recognition has made use of it. There has been some work on detecting 'head shakes' and 'nods' only [12, 21, 29], but these works do not show results on continuous sign language. In this thesis we have used non-manual information to decrease the insertion and deletion errors and to find whether there is 'Negation' in the sentence using the motion trajectories of the face. We also look at a different, bottom-up approach for motion modeling and recognition using relational distributions and the Space of Probability Functions (SoPF), which has been successfully used for gait-based identification [45].

1.1 Overview of the Approach

The bottom-up driven approach is sketched in Figure 1.1. There are three levels of processing. The first level is responsible for low-level segmentation of skin pixels and detection of the face. It also deals with finding the skin pixels in motion. This task can be specifically difficult in the presence of a complex background. The second or intermediate level consists of modeling the motion using relational distributions and the SoPF. This level also deals with using facial or non-manual information and combining it with manual information to reduce the deletion and insertion errors. The third or topmost level consists of using context and grammatical information from ASL (American Sign Language) phonology to constrain and correctly predict the number and type of signs present in the sentence, and also the exact positions where they occur in the sentence. Hence the topmost level is responsible for exact construction of the sentence. In this work we have concentrated on the intermediate-level recognition, to find the signs which are most probable to occur in the sentence.

The individual processes at the low and intermediate levels are shown in Figure 1.2. As can be seen from Figure 1.2, we have used the Expectation Maximization (EM) algorithm for the low-level process of skin blob detection. We use simple template matching to detect the eyes and hence detect the face. The shaded portion in Figure 1.2 shows the intermediate level of processing. Relational distributions and the Space of Probability Functions (SoPF) are used for modeling the motion of the hands, while a PCA space is used to represent the facial expression information. The information obtained from the manual and non-manual parts is combined at this intermediate level to reduce the deletion and insertion errors. This information is then given to the upper level of processing, as seen in Figure 1.2.

Figure 1.1. Model for ASL Recognition. The figure shows the overall approach to recognizing continuous signs in an ASL sentence with the help of non-manual expressions.

Figure 1.2. Processes in Low and Intermediate Level Recognition.

1.2 Thesis Organization

In Chapter 2 we look at the prior work done in ASL recognition. This is followed, in Chapter 3, by a description of the data we collected for the various ASL sentences used in this work, the camera parameters during data collection, and the protocol followed to mark the signs for training. In Chapter 4 we look at the method used to find skin blobs using the EM algorithm. Chapter 5 gives an account of the motion modeling theory used. Chapter 6 discusses the method used to detect the face and find the motion trajectories of the face. In Chapter 7, we look at the various experiments performed, the similarity measures adopted, and the results. We conclude with Chapter 8.

CHAPTER 2
PRIOR WORK

Sign languages are complex, abstract linguistic systems, with their own grammars [54]. American Sign Language (ASL) is one such visual/gestural sign language, and it is the primary means of communication of deaf people in America and parts of Canada. A lot of research has been done on the syntactical structure of ASL. After the germinal work by Stokoe [55], in which he described the phonological units of ASL, "cheremes", as well as ASL's structure, researchers like Liddell & Johnson [34], Neidle et al. [38] and Brentari [10] have described different syntactical structures for American Sign Language. Due to its very well defined syntactical structure and grammar [38], and the practical significance of its automated recognition, there has been significant interest in the vision community in research related to automated recognition of ASL.

The problem of automated sign language recognition is usually thought to be a relatively easy subclass of the general problem of recognizing human gestures, on which there has been a lot of research [67, 8, 64, 6, 22, 32]. However, in practice automated ASL recognition is at least different, if not harder, than recognizing human gestures such as those needed to interface with a graphics display program or to control a robot, partly because the set of signs in ASL is fixed; one cannot expect to train a signer to sign in a way that makes it easier for the algorithm, rather the algorithm should be able to account for the vagaries of signing. Besides, the performance expectations in ASL recognition go beyond recognizing individual signs (words) into the domain of sentences, paragraphs, or even prosody.

Previous work in sign language recognition has addressed the recognition of static gestures and isolated signs, as well as continuous sentence recognition. There has also been some work in extracting non-manual features such as 'head nods' and 'head shakes'. In this chapter we look at the prior work done in these different categories of sign language recognition, especially ASL recognition. We also look at various existing ASL datasets for automated ASL recognition.

2.1 Static Sign Recognition

Cui and Weng [15, 16, 18, 19, 17] have worked on the problem of recognizing static signs against complex backgrounds. They used a learning-based prediction and verification scheme with most expressive features to detect 40 static hand postures taken from ASL. Zhao and Quek [70] have used a recursive inductive scheme that is able to acquire hand pose models in the form of disjunctive normal form expressions involving multi-valued features. The algorithm correctly classified 94% of the gesture images in the testing set, the total number of distinct gestures considered being 15. Triesch and Malsburg [58] classified static hand postures against complex backgrounds using elastic graph matching. They used 10 different postures against complex backgrounds and achieved an accuracy of 86%.

2.2 Isolated Sign Recognition

The majority of the work in ASL has been in isolated sign recognition involving motion. The earliest work in ASL recognition that we are aware of is from 1992 by Charanyaphan and Marble [13], who devised an image processing technique that correctly recognized 27 signs out of a total of 31. Since then, there has been a lot of work in isolated sign recognition [65, 26, 62, 31, 25, 28, 69]. The most common method employed has been neural networks [65, 26, 62, 31]. Examples of other methods include Gupta and Ma [25], who used edge contour based features to achieve almost 100% accuracy, while Kadous [28] used decision trees to classify individual signs. Another more recent work in isolated sign recognition that concentrates on the segmentation problem is that by Yang, Ahuja and Tabb [69], who present a fully automatic vision-based method to find motion trajectories and use their shapes to classify 40 isolated signs. First, the image frames are segmented based on constancy of intensity. Second, these segmented regions are corresponded across frames assuming affine transformations. Third, these inter-frame motion vectors are stitched together to form motion trajectories. Last, a time-delay neural network is used to classify these trajectories.

2.3 Continuous Sign Recognition

Work in continuous sign recognition has only used Hidden Markov Models (HMMs). Foreign sign language recognizers also tend to use HMMs: Chinese [35, 63], German [3, 4], Netherlands [1], Taiwanese [33] and British [57]. In Table 2.1 we categorize the prior work in recognition of continuous words, embedded in short sentences, in terms of the type of input data, the size of the experimental data set, the features used, the technique used, and the reported recognition rates.

Starner and Pentland [50] were the first to seriously use HMMs for continuous sign recognition. Their HMMs had 4 states with one skip transition and multidimensional, independent, Gaussian observations. With these they achieved near perfect recognition with sentences of fixed structure, i.e. containing personal pronoun, verb, noun, adjective, personal pronoun, in that order. Vogler and Metaxas [59, 60, 61] have been instrumental in significantly pushing the state of the art in automated ASL recognition using HMMs. In terms of the basic HMM formalism, they have explored many variations, such as context-dependent HMMs, HMMs coupled with partially segmented sign streams, and parallel HMMs. One of the very exciting lines of work suggested by them to tackle the scalability problem of HMMs is to design systems to recognize cheremes, the 'phonemes' of ASL, instead of words. Cheremes differ with respect to hand shape, hand orientation, wrist orientation, location, and movement. They extracted these cheremic features using 3D magnetic tracking systems. To control the combinatorial explosion of possible states, they assume independence of the attributes characterizing the cheremes and use separate HMMs for each channel, i.e. parallel HMMs.

It can be noticed that most of the work in continuous sign language recognition has avoided the very basic problem of segmentation/tracking of hands by using wearable devices, such as colored gloves, data gloves, or magnetic markers, to directly get the location features. For example, Vogler and Metaxas [59, 60, 61] have used a 3D magnetic tracking system, Starner and Pentland [50] have used colored gloves, while Ma et al. [35, 63] have used Cybergloves. Also, the signing of the ASL sentences in the work by Vogler and Metaxas [59, 60, 61] and Starner and Pentland [50] is not done by a native ASL interpreter/signer, which adds an element of non-naturalness to the way the sentences have been signed. In addition, no work on continuous sign language recognition has in any way made use of the non-manual markers present in ASL. For example, many a time 'Negation' in an ASL sentence is conveyed through only a 'head shake'. There is no work which takes this into account. There has been some work on the use of facial information alone, but not in combination with continuous sentence recognition. We will see this in the next section.

Table 2.1. Snapshot of Research in Continuous Word Recognition in ASL.

UPenn [59][60][61]: input data: 3 orthogonally placed cameras, magnetic markers, indoors, black clothes and background; data set: 53 signs, 489 sentences with 2-12 words each; features: 3D wrist position and orientation; technique: HMM; recognition rates: continuous signs, context independent: 88%, context dependent: 90%, with bigram context: 92%, with epenthesis modeling: 92%.

MIT [49][52][50][51][53]: input data: one color video, 5 fps, indoors, colored gloves; data set: 40 signs, 494 sentences with 5 words each; features: 2D position of hands, bounding ellipse of hands; technique: HMM; recognition rates: continuous signs with rule-based grammar, colored gloves: 99%; no grammar, colored gloves: 91%; with grammar, skin tone: 92%; no grammar, skin tone: 75%.

China [35][63]: input data: Cybergloves and 3SPACE position trackers; data set: 220 signs, 80 sentences with 2-15 words each; features: 18 hand joint angles, 12 positions and orientations; technique: HMM; recognition rates: isolated signs: 99%; continuous signs: 93%.

German [3][4]: input data: one video camera, colored gloves; data set: 97 signs; features: hand shape, hand orientation, and location; technique: HMM; recognition rates: continuous signs: 92%.

Netherlands [1]: input data: one color camera, colored gloves, black clothes, white background, 13 fps; data set: 262 signs, 14 sentences with 3 to 5 words each; features: 31 locations on dominant hand, 3 locations on other hand, and distance between hands; technique: HMM; recognition rates: isolated signs: 91%; continuous signs: 73%.

Taiwan [33]: input data: one data glove and Polhemus 3D tracker; data set: 250 signs, 196 sentences; features: 10 position and 10 orientation features; technique: word segmentation + HMM; recognition rates: isolated gestures: 94%; short sentences: 83%; long sentences: 88%.

Japan [44]: input data: Cybergloves; data set: 200 sentences over 960 words; features: hand shape, palm direction, and linear motion; technique: word segmentation + minCost; recognition rates: continuous words: 87%; sentences: 58%.

2.4 Recognition of Head Shakes and Nods

Extraction of non-manual features for ASL is starting to receive more attention. Erdem and Sclaroff [12, 21] describe a 3D model based tracking framework to detect 'head shakes' and 'head nods', which are important forms of non-manual ASL communication, from regular visual spectrum images. Another head nod and shake detector was proposed by Kapoor and Picard [29]; however, it is based on eye-pupil tracking using IR cameras. But there has been no work which uses facial expressions to boost the accuracy in continuous sign language recognition.

2.5 Existing ASL Datasets

Research in automated ASL is empirical and is dependent on the existence of a good corpus of data. Until recently this was non-existent. The largest corpus used in ASL contains a vocabulary of around 50 signs, embedded in 500 or so sentences. Foreign sign language works tend to use between 200 to 300 words but with fewer (14 to 200) sentences. Only recently has there been a concerted effort in systematically constructing an ASL corpus for public dissemination. At Boston University, Neidle et al. [39] have created such a dataset using SignStream, which is a system for linguistic annotation, storage, and retrieval of ASL and other forms of gestural communication. One of the issues that we ran into in using this dataset for automated sign recognition is that the video is sampled too coarsely: on an average there are 5.8 frames per sign. However, the dataset is a challenging one, since many signs are similar, specifically the single-handed signs with the hand occluding the face, which makes segmentation very hard. There are severe coarticulation effects. In some sentences a word is signed using one hand while in another sentence the same word is signed using both hands due to the coarticulation effect. (The way the ASL gestures are signed changes depending on the signs which precede and follow them; this is the coarticulation effect.)

The Purdue ASL database [36], which was designed taking into account the issues important for automated recognition, consists of 2576 ASL video sequences from 14 native signers, imaged under two different lighting conditions, but with a black background. The data has three parts. First is a set of videos of isolated signs with distinct motion patterns, specifically designed for the analysis of motion in ASL. Second is a set of words with distinct hand shapes, where the hand shape is constant but the place of articulation, direction and motion might vary. Third is data from several ASL sentences to study prosody and sentence structure. This database is yet to be released. Hence for this work we collected a small database, which is described in the next chapter.

CHAPTER 3
DATA COLLECTION

In this chapter we will look at the sentences in the data collection, the ASL signs used for them, the protocol followed for the data collection, the camera setup, and the statistics of the ASL signs as well as the sentences. We also discuss the problem of synchronization of the frames of the face with the body. But first, we take a look at the conventions used in this work with regard to ASL.

3.1 ASL Conventions

We use the following conventions in the thesis:

- Text in italics indicates a sentence in English. For example, 'I can lipread'.
- Text in capitalized letters indicates an ASL transcription. For example, 'LIPREAD CAN I'.
- Text in capitalized letters also indicates an ASL gloss. For example, the ASL gloss for the sign 'lipread' is 'LIPREAD'.
- Negation in a sentence signed using non-manual markers is indicated by ^NOT or 'Negation'.
- A multiword gloss for a single sign in ASL is indicated by a hyphen. For example, 'DONT-KNOW' is a multiword gloss for a single sign in ASL.

3.2 Dataset

We selected the sentences for the database keeping in mind the context of communicating with deaf people at an airport. Table 3.1 shows the sentences used for the data collection with their ASL transcription. It should be noticed that for some sentences, 'Negation' is conveyed only through 'head shakes'. For example, for the sentences 'I understand' and 'I don't understand' the ASL transcription is the same ('I UNDERSTAND'). The only difference is that in the sentence 'I don't understand' there is the presence of a 'head shake', i.e. a non-manual expression, to convey the presence of 'Negation' in the sentence. The ASL signs used in these sentences are given in Table 3.2.

An ASL interpreter was used to collect the data. English sentences were given to the signer, who signed them in ASL. After the data collection, the ASL signs used in the sentences were marked with the help of the signer. It should be noted that the ASL signs were named as suggested by the native signer, and the frames which contained each ASL gesture were also marked as suggested by the signer. The total number of sentences, as can be seen in Table 3.1, is 25, while the total number of distinct ASL signs present is 39, as can be seen from Table 3.2. The total number of ASL signs that occurred in the 25 sentences (including the ASL signs occurring multiple times) is 65. Data was collected for 5 instances of each sentence. There was variation in the way some sentences were signed. For example, the sentence 'If the plane is delayed, I'll be mad' was signed as 'AIRPLANE POSTPONE AGAIN, MAD I' as well as 'AIRPLANE AGAIN POSTPONE, MAD I'. Also, in one of the instances of the sentence 'I packed my suitcase' the ASL sign 'I' was not present. This may be the case in some other sentences also. The reason, as given by the signer, was that signs like 'I' are implicit while conversing in ASL and hence can be excluded. Also, it should be noted that for the different signs 'I', 'ME' & 'MINE', only one sign 'I' is considered, because all three signs are very similar to each other and hardly any distinction is observed between them when they are signed in different sentences.

Table 3.1. 25 Sentences Used With ASL Transcription. ^NOT in the ASL transcription indicates that the 'Negation' is conveyed through a 'head shake'.

'I can lipread': LIPREAD CAN I
'I can't lipread': LIPREAD CANNOT (^NOT) I
'I don't understand': I UNDERSTAND (^NOT)
'I understand': I UNDERSTAND
'You don't understand me': YOU UNDERSTAND ME (^NOT)
'I packed my suitcase': SUITCASE I PACK FINISH
'I don't know': DON'T-KNOW I (^NOT)
'I don't have the key': I NOT (^NOT) HAVE KEY
'That's mine!': THAT ONE MINE IT
'I need that!': I NEED THAT
'I need my phone!': MY PHONE, NEED
'Why?': WHY?
'What do you mean?': MEAN?
'Where is my suitcase?': SUITCASE WHERE?
'Can I move my suitcase now?': SUITCASE MOVE CAN I?
'If the plane is delayed again, I'll be mad': AIRPLANE POSTPONE AGAIN, MAD I
'I already bought my ticket': TICKET BUY FINISH
'There was such a long line of people, wait was infuriating!': PEOPLE LONG LINE-WAIT ANGRY!
'Where is gate?': GATE WHERE?
'My luggage is heavy': LUGGAGE-HEAVY
'Yes': YES
'No': NO
'I just gave my ticket over there': MY TICKET JUST GAVE
'Where is passport?': ID-PAPERS WHERE?
'Passport is on table': ID-PAPERS TABLE

Table 3.2. 39 ASL Gestures Used: Lipread, Can, I, Cannot, Understand, You, Suitcase, Pack, Finish, Don't-Know, Have, Ticket, Key, That, It, Need, Phone, Why, Mean, Where, Move, Airplane, Again, Postpone, Mad, Buy, Not, People, Long, Line, Angry, Gate, Luggage-Heavy, Yes, No, Just, Gave, ID-Papers, Table.

3.3 Camera Setup and Recording Parameters

The cameras used were consumer-grade Canon Optura cameras. Two cameras were used for collecting data: one for capturing the body and the other for capturing the face image. The cameras are progressive-scan, single-CCD cameras capturing images at a rate of 30 frames per second. The shutter speed was kept at 1/250 and auto-focus was left on, as the signer was essentially at infinity. The cameras stream compressed digital video to DV tape at 25 Mbits per second by applying 4:1:1 chrominance sub-sampling and quantization, and lossy intra-frame adaptive quantization of DCT coefficients. The imagery was recovered from tape. The camera was accessed over its IEEE 1394 FireWire interface using Pinnacle's micro DV 300 PC board. The result is a stand-alone video file stored using Sony's Digital Video (DV)-specific dvsd codec in a Microsoft AVI wrapper. This capture from tape does not re-compress and is not additionally lossy. Finally, the imagery is transcoded from DV to 24-bit RGB using the Sony decoder and the result is written as PPM files, one file per frame. This representation trades off storage efficiency for ease of access.

The face and body images obtained are shown in Figure 3.1, and the camera setup can be seen in Figure 3.2. The signer was asked to stand at a distance of 2.05 m from the camera capturing the face images, while a gray screen was placed at a distance of 2.75 m from the same camera. It can be noticed that the camera capturing the whole body is at an angle with respect to the signer. This is to avoid any obstacle in the sight of the camera which captures the frontal image of the face. The shutter speed of 1/250 helps in reducing blur in the image caused by fast movements of the hand. Normal lighting was used, and at higher shutter speeds this causes a little blur in the images. Hence lighting conditions should be taken care of in the future when using cameras at higher shutter speeds.

For synchronization of the facial and body images, the subject was asked to bring a hand in front of the face before and after signing a sentence. The frame with the hand at the uppermost position with respect to the face was chosen as the synchronizing frame. The correctness of the synchronization was checked by comparing the frames with the uppermost position of the hand in front of the face when the signer brings her hand in front of the face at the end of the sentence. The images used for manual synchronization can be seen in Figure 3.3.

Figure 3.1. Images from Data Collection. (a) shows an image of the face taken for the sign 'PHONE', captured by camera 'A', and the synchronous image of the body captured by camera 'B' is shown in (b).

Figure 3.2. Camera Setup for ASL Data Acquisition. Camera 'A' was used for capturing the image of the face while 'B' was used for capturing the image of the body.

Figure 3.3. Images Used for Synchronization. (a) and (b) show the images used for manual synchronization of the rest of the images. Note that in both images the hand is brought in front of the face, and the highest position of the hand in front of the face is used for synchronization.

3.4 Statistics of ASL Signs and Sentences

The total number of distinct sentences present in the data set is 25. Each sentence is recorded 5 times; hence overall 125 sentences are present. The number of distinct ASL signs present is 39, and in one set of 25 sentences 65 ASL signs are present. For all 125 sentences, 325 ASL signs are present. The number of ASL signs present in a sentence varied from 1 to 5. On an average, 2.7 signs are present per sentence. The number of frames in a sentence varied from 51 to 163. The longest sentence in terms of time is 'AIRPLANE POSTPONE AGAIN, MAD I', made up of 163 frames, while the shortest sentence is 'YES', made up of 51 frames. The average number of frames in a sentence is 90.263, and hence on an average a sentence is 3 seconds long (image frames are recorded at 30 frames per second). Sign length varied from 4 frames for the sign 'CANNOT' to 71 frames for the sign 'LUGGAGE-HEAVY'. On an average a sign has 18.1 frames, i.e. it is 0.6 seconds long.

CHAPTER 4
SEGMENTATION OF SKIN PIXELS USING EM

In continuous sign language recognition, segmentation of the hands and face is an important step. In most of the previous work on continuous ASL recognition the tracking of the face and hands has been made easy using colored gloves [49] or magnetic markers [59]. Foreign sign language recognizers have also used colored gloves [35], [3], [1] or data gloves [44], [33]. Only recently has there been work, by Yang, Ahuja and Tabb [69], that solves the segmentation problem to a good extent, but it has only been used for isolated sign recognition. Since segmentation and tracking are not the major objective, we have used a simple EM algorithm, with the features being the a, b components of the 'Lab' color space, to segment skin pixels from the image. We use the EM algorithm twice: first to separate the background, and a second time to separate the clothing of the signer from the skin pixels. Then we use connected component analysis on the image obtained after applying EM twice, to throw away pixels which are close to skin color but do not form a blob big enough to be part of a hand or the face. The algorithm for segmentation is given in Table 4.1. Although the algorithm is simple, it is applied on image data that does not rely on colored gloves; it should be noted that there was no use of data gloves or colored gloves for data collection.

Table 4.1. Algorithm for Segmentation of Skin Pixels.
1. Convert the image pixels from RGB to Lab space.
2. Use EM to classify pixels as foreground (skin) or background (non-skin).
3. Use EM on the foreground image obtained from step 2 to further classify pixels as foreground (skin) or background (non-skin).
4. Use connected component analysis as a post-processing step to find skin blobs.

4.1 Color Space

The RGB color space is not good for representing the color information of pixels, as changes in lighting conditions create problems. Hence we have used the Lab color space for representing the color information of pixels. In this color space there are three components: L (luminance), a (red/green chrominance) and b (yellow/blue chrominance). This color space is approximately perceptually uniform, and hence distances in this space are meaningful [68]. The Lab color space is defined with regard to the CIE XYZ space. The formulae for converting pixel information from RGB to the Lab color space by a nonlinear mapping are given below:

X = 0.490 R + 0.310 G + 0.200 B
Y = 0.177 R + 0.813 G + 0.011 B
Z = 0.000 R + 0.010 G + 0.990 B

L = 116 (Y / Y_n)^{1/3} - 16
a = 500 [ (X / X_n)^{1/3} - (Y / Y_n)^{1/3} ]
b = 200 [ (Y / Y_n)^{1/3} - (Z / Z_n)^{1/3} ]

Here X_n = 0.980, Y_n = 1.000 & Z_n = 1.183, and they are the X, Y & Z coordinates of the reference white patch. We neglect the L channel to reduce the effect of illumination.
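The conversion above is simple to implement. The sketch below (a minimal illustration in Python/NumPy, assuming RGB values already scaled to [0, 1]; neither the language nor the scaling is specified by the thesis) computes the a and b chrominance features that are fed to the EM step:

```python
import numpy as np

def rgb_to_ab(rgb):
    """Convert an H x W x 3 RGB image (values in [0, 1]) to the a, b
    chrominance channels of Lab, using the linear RGB -> XYZ matrix and
    the reference white given in Section 4.1."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Linear RGB -> XYZ mapping.
    X = 0.490 * R + 0.310 * G + 0.200 * B
    Y = 0.177 * R + 0.813 * G + 0.011 * B
    Z = 0.000 * R + 0.010 * G + 0.990 * B

    # Reference white patch coordinates.
    Xn, Yn, Zn = 0.980, 1.000, 1.183

    # Cube roots of the normalized tristimulus values.
    fx = np.cbrt(X / Xn)
    fy = np.cbrt(Y / Yn)
    fz = np.cbrt(Z / Zn)

    # L would be 116 * fy - 16, but it is discarded to reduce the
    # effect of illumination; only a and b are returned.
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.stack([a, b], axis=-1)
```

Each pixel's (a, b) pair then becomes the two-dimensional feature vector used by the EM classification described next.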

4.2 Expectation Maximization (EM) Algorithm for Pixel Classification

The Expectation Maximization algorithm has been used for finding maximum likelihood parameters when there is missing or incomplete data [20], [11]. We use the Expectation Maximization algorithm here to find the maximum likelihood parameters of a mixture of two Gaussians in the feature space. One of the Gaussians is used to represent the foreground pixels (skin) while the other Gaussian is used to represent the background pixels (non-skin). At each pixel in the image there is a two-class problem, whether the pixel belongs to the foreground (skin) or the background (non-skin), and hence the use of two Gaussians to model these two classes. The observation is a vector made up of the a and b values (Lab space) of the pixel; L is not used, to reduce illumination errors. We use a mixture of two Gaussians, and the form of the probability density is as follows:

f(x | \Theta) = \sum_{i=1}^{2} \alpha_i f_i(x | \theta_i)

where
x is the feature vector made up of the a and b (Lab space) values of the pixel,
\alpha_i is the prior probability of class i, with \sum_{i=1}^{2} \alpha_i = 1,
\Theta is the set of parameters defining the mixture of two Gaussians (\theta_1, \theta_2, \alpha_1, \alpha_2), and
f_i is a multivariate Gaussian density characterized by the parameters \theta_i, i.e. the mean \mu_i and covariance \Sigma_i.

The multivariate Gaussian density is as follows:

f_i(x | \theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)

The algorithm works for a random initialization of p(i | x_j, \Theta^t). But for the results shown, the posteriors have been initialized as follows:

p(foreground | x_j, \Theta_{initial}) = \frac{R + G + B}{255 \cdot 3}
p(background | x_j, \Theta_{initial}) = 1 - p(foreground | x_j, \Theta_{initial})

such that

p(foreground | x_j, \Theta_{initial}) + p(background | x_j, \Theta_{initial}) = 1

where R, G & B are the pixel values of the red, green and blue channels respectively. The update equations are given below:

\alpha_i^{t+1} = \frac{1}{N} \sum_{j=1}^{N} p(i | x_j, \Theta^t)

\mu_i^{t+1} = \frac{\sum_{j=1}^{N} x_j \, p(i | x_j, \Theta^t)}{\sum_{j=1}^{N} p(i | x_j, \Theta^t)}

\Sigma_i^{t+1} = \frac{\sum_{j=1}^{N} p(i | x_j, \Theta^t) (x_j - \mu_i^{t+1})(x_j - \mu_i^{t+1})^T}{\sum_{j=1}^{N} p(i | x_j, \Theta^t)}

p(i | x_j, \Theta) = \frac{\alpha_i f_i(x_j | \theta_i)}{\sum_{k=1}^{2} \alpha_k f_k(x_j | \theta_k)}

where
i indexes the Gaussians of the mixture (here i = 1, 2),
N is the total number of observations, in this case the total number of pixels considered, and
p(i | x_j, \Theta) is the probability of the j-th pixel being in the i-th class given that it has value x_j and the mixture parameters are \Theta.

The EM algorithm is iterated until the following conditions are fulfilled:

|\mu_a^t - \mu_a^{t+1}| < 0.0001 and |\mu_b^t - \mu_b^{t+1}| < 0.0001

that is, the EM algorithm is iterated until the shift in the means of the a as well as the b component of the Lab space is less than 0.0001 for both Gaussians (foreground and background). The pixels that have p(foreground | x_j, \Theta) > p(background | x_j, \Theta) are kept.

It should be noted that the EM algorithm is applied twice, first to segment out the pixels of the skin and dress of the signer from the background, and then to segment out the skin pixels from the skin-dress pixels. Figure 4.1 shows the application of the EM algorithm. Figures 4.1(d), (e) & (f) show the images obtained after EM is applied on (a), (b) & (c). The number of iterations in this case, when EM is used to separate the dress and skin pixels from the background, varies from 55 to 65. Figures 4.1(g), (h) & (i) show the images obtained after EM is applied on (d), (e) & (f). The number of iterations in this case, when EM is used for the second time to separate the skin pixels from the skin-dress pixels, varies from 35 to 45. The Gaussians obtained after applying EM for the first and second time can be seen in Figure 4.2 and Figure 4.3 respectively. In Figure 4.2 the Gaussian with \mu_a = 21.60 and \mu_b = 72.53 represents the skin and dress pixels while the other Gaussian represents the background pixels. In Figure 4.3 the Gaussian with \mu_a = 44.35 and \mu_b = 113.673 represents the skin pixels while the other Gaussian represents the dress pixels.

4.3 Connected Blob Analysis

Connected component analysis is used as a post-processing step on the image obtained after segmenting the skin pixels using EM. Blobs of size greater than 200 pixels are kept. This helps to remove pixels which are close to skin color but do not form a skin blob big enough to be part of a hand or the face, thus reducing errors in segmentation. The mask obtained is then used to find edge pixels which are part of the skin blobs. The masks obtained after connected blob analysis, and the images obtained at the various steps of the EM algorithm, are shown in Figures 4.1(j), (k) & (l).
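For concreteness, the loop below is a minimal sketch of the two-Gaussian EM of Section 4.2, written in Python/NumPy with illustrative names (none of them come from the thesis); it follows the posterior initialization, the update equations, and the 0.0001 stopping criterion stated above:

```python
import numpy as np

def em_two_gaussians(features, init_post, max_iter=200, tol=1e-4):
    """Fit a two-component Gaussian mixture to an N x 2 array of (a, b)
    features with EM. `init_post` holds the initial foreground posteriors,
    e.g. (R + G + B) / (255 * 3) per pixel. Returns posteriors and the
    mixture parameters (priors, means, covariances)."""
    N, d = features.shape
    post = np.column_stack([init_post, 1.0 - init_post])   # p(i | x_j)
    mu = np.zeros((2, d))
    prev_mu = np.full((2, d), np.inf)
    alpha = np.full(2, 0.5)
    cov = [np.eye(d) for _ in range(2)]

    for _ in range(max_iter):
        # M-step: re-estimate priors, means and covariances from posteriors.
        for i in range(2):
            w = post[:, i]
            alpha[i] = w.mean()
            mu[i] = (features * w[:, None]).sum(axis=0) / w.sum()
            diff = features - mu[i]
            cov[i] = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / w.sum()

        # Stop once the shift in both component means falls below tol.
        if np.all(np.abs(mu - prev_mu) < tol):
            break
        prev_mu = mu.copy()

        # E-step: recompute posteriors from the weighted Gaussian densities.
        dens = np.zeros((N, 2))
        for i in range(2):
            inv = np.linalg.inv(cov[i])
            norm = 1.0 / ((2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov[i])))
            diff = features - mu[i]
            dens[:, i] = alpha[i] * norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))
        post = dens / dens.sum(axis=1, keepdims=True)

    return post, alpha, mu, cov
```

Foreground pixels are those whose final foreground posterior exceeds the background posterior; as described above, such a routine would be run once against the background and a second time on the surviving pixels to separate skin from clothing.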

Figure 4.1. Segmentation of Skin Pixels Using EM. (a), (b) & (c) show three images from the sentence 'LIPREAD CAN I': (a) is from the sign 'LIPREAD', (b) is from the sign 'CAN' & (c) is from the sign 'I'. (d), (e) & (f) show the skin pixels obtained after the first application of EM. It should be noted that in (d) only skin pixels remain, while in (e) & (f) some pixels from the dress of the signer are also present. (g), (h) & (i) show the skin pixels obtained after the second application of EM (i.e. applying EM to images (d), (e) & (f)). As can be seen, the skin pixels are segmented to a far better extent. Connected blob analysis is done on (g), (h) & (i) to obtain (j), (k) & (l).
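The size filtering of Section 4.3 could be implemented as follows. This is a sketch that assumes SciPy's ndimage module for the connected-component labeling (the thesis does not name a particular tool); only the 200-pixel threshold comes from the text:

```python
import numpy as np
from scipy import ndimage

def keep_large_blobs(skin_mask, min_size=200):
    """Post-process a binary skin mask: label 8-connected components and
    keep only blobs with more than `min_size` pixels, discarding
    skin-colored speckle too small to be a hand or the face."""
    structure = np.ones((3, 3), dtype=int)            # 8-connectivity
    labels, num = ndimage.label(skin_mask, structure=structure)
    if num == 0:
        return np.zeros_like(skin_mask, dtype=bool)
    sizes = ndimage.sum(skin_mask, labels, index=np.arange(1, num + 1))
    keep = np.flatnonzero(sizes > min_size) + 1        # label ids start at 1
    return np.isin(labels, keep)
```

The resulting mask is what restricts the Canny edge pixels used in the next chapter to the skin blobs.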

Figure 4.2. Fitted Gaussians After One Run of the EM Process. The features are the a & b components of the Lab color space. (a), (b), (c) & (d) show different views of the same Gaussians. The Gaussian distributions are shown for the image in Figure 4.1(b). The output obtained from this first step can be seen in Figure 4.1(e).

Figure 4.3. Final Fitted Gaussians. The features are the a & b components of the Lab color space. (a), (b), (c) & (d) show different views of the same Gaussians. The Gaussian distributions are shown for the image in Figure 4.1(e). The output obtained from this second step can be seen in Figure 4.1(h).

CHAPTER 5
MOTION MODELING USING RELATIONAL DISTRIBUTIONS AND SPACE OF PROBABILITY FUNCTIONS

From an ASL point of view, we would like our feature representation to be invariant to translation, to take care of movement of the whole body during signing, but it need not be rotationally invariant. Also, since these features would be based on low-level primitives extracted using background subtraction and color-based region growing, which are known to be noisy processes, the representation should degrade gracefully with low-level missed detections and false alarms. Philosophically, we believe that the organization or structure or relationships among low-level primitives are more important than the primitives themselves. Thus, the attributes of the individual primitives might vary quite a bit, but as long as the spatial relationships among them are preserved, recognition is still possible. For instance, in an image of a face, even if one changes the eyes to star shapes and the nose to an inverted triangle, the resulting shape would still be recognized as a face. Our need is to devise a mechanism to capture this structure so that we can use its change with time to model high-level motion patterns.

Graphs have been the most commonly used mechanism for capturing these relationships among primitives [9, 66, 48, 30]. However, the study of the variation of a graph over time requires solving the correspondence problem between image primitives, which is a computationally difficult problem. We avoid this need for primitive-level correspondence by focusing on the statistical distribution of the relational attributes observed in the image. The use of feature attribute histograms is not new. Distributions of local feature filter outputs have been used for recognition [46]. Local orientation histograms have been used for pattern recognition [37] and gesture recognition [24]. However, the only uses of relational histograms that we are aware of are by Huet and Hancock [27], who used them to model line distributions in the context of image database indexing, and by Belongie and Malik, who used them to model shape contexts of features [5], again for image databases. The novelty of the present contribution lies in that we offer a strategy for incorporating dynamic aspects. We use this representation, which has been successfully used for identification of a person from gait [41, 42, 40]. In this chapter we describe the statistical model for motion analysis as defined in [41]. We start with the definition of the concept of Relational Distributions, followed by the theoretical description of the Space of Probability Functions (SoPF). For an in-depth analysis of this motion modeling theory we refer the reader to [40].

5.1 Relational Distributions

Let F = {f_1, ..., f_N} represent the set of N primitives in an image, such as edge pixels, interest points, or region patches, let F^k represent a random k-tuple of primitives, and let the relationship among these k-tuple primitives, such as distance, orientation, or some spatial distribution measure, be denoted by R^k. Let the relationships R^k be characterized by a set of M attributes A^k = {A^k_1, ..., A^k_M}. Then the shape of the object can be represented by the joint probability functions P(A^k = a^k), also denoted by P(a^k_1, ..., a^k_M) or P(a^k), where a^k_i is the (discretized, in practice) value taken by the relational attribute A^k_i. We term these probabilities the Relational Distributions. One possible interpretation of these distributions is: given an image, if you randomly pick k-tuples of primitives, what is the probability that they will exhibit the relational attributes a^k? What is P(A^k = a^k)?

The representation of these relational distributions can be in parametric forms or in non-parametric, histogram or bin-based forms. The advantage of parametric forms, such as a mixture of Gaussians, is the low representational overhead. However, we have noted that these relational distributions exhibit complicated shapes that do not readily afford modeling using a combination of simply shaped distributions. So a non-parametric, histogram-based form is better. To reduce the size that is associated with a histogram-based representation, we propose the Space of Probability Functions (SoPF).

We illustrate the concept of Relational Distributions using the edge pixels of skin blobs as the features. We apply the Canny edge detector over each image frame and select only those edge pixels that belong to the skin blobs. Figure 5.1 shows the edges selected using the skin blobs that are created by the segmentation of skin pixels described in the previous chapter. To capture the structure between edge pixels, we use the distances between two edge pixels in the vertical and horizontal directions (dx, dy) as the attributes. We normalize the distance between the pixels by a distance D, which is related to the size of the object in the image, to make it somewhat scale invariant. We have taken the scaling constant to be the height of the image. Note that the choice of attributes is such that the probability representation is invariant to translation and somewhat invariant with respect to scale. Figure 5.2(a) depicts the attributes that are computed between the two pixels. Figure 5.2(c) shows the relational distribution for the edge image shown in Figure 5.2(b), where brighter pixels denote higher probabilities. Figure 5.2(d) shows a 3D bar plot of the probability values. Note the concentration of high values in certain regions of the probability event space.

As the hands of the signer move, the relational distributions will change; motion of the hands introduces non-stationarity in the relational distributions. Figure 5.3 shows some more examples of the 2-ary relational distributions for the sign 'CAN'. Notice the change in the distributions as the hands come down. The change in the vertical direction of the relational distributions can be seen clearly as the hands come down, while there is comparatively less change in the other direction. Thus we have used relational distributions to model the image frames with respect to the manual aspect of ASL recognition. Each image is represented by a relational distribution.

5.2 Space of Probability Functions

As discussed before, there is a need to reduce the size that is associated with a histogram-based representation of a relational distribution. This is done using the Space of Probability Functions. Let P(a^k, t) represent the relational distribution at time t, and let

\sqrt{P(a^k, t)} = \sum_{i=1}^{n} c_i(t) \Phi_i(a^k) + \mu(a^k) + \eta(a^k)    (5.1)


Figure 5.1. Finding the Edge Image of Skin Blobs. A sample frame from the sentence `PEOPLE LONG LINE-WAIT ANGRY!' is shown in (a), the skin color blob detected using EM is shown in (b), and its corresponding edges in (c).


Figure 5.2. Edge Pixel Based 2-ary Relational Distribution. (a) The two attributes characterizing the relationship between two edge pixels. (b) Edge pixels in an image. (c) The relational distribution P(dx, dy), with P(0, 0) at the top left corner of the image; brighter pixels denote higher probabilities. (d) The relational distribution shown as a 3D bar plot.


Figure 5.3. Variations in Relational Distributions. (a)-(e) show the image frames in the sign `CAN', (f)-(j) are the corresponding edge images, and (k)-(o) are the relational distributions shown as 3D plots for the same frames. As the hands go down for the sign `CAN', the variation in the distributions in the vertical direction can be seen clearly; since both hands are in the same position horizontally, there is nearly no change in the distributions in the horizontal direction.


describe the square root of each relational distribution as a linear combination of orthogonal basis functions, where the φ_i(a_k) are orthonormal functions, μ(a_k) is a mean function defined over the attribute space, and ε(a_k) is a function capturing small random noise variations with zero mean and small variance. We refer to this space as the Space of Probability Functions (SoPF). We use the square root function so that we arrive at a space where the distances are not arbitrary but are related to the Bhattacharyya distance between the relational distributions, which is an appropriate distance measure for probability distributions. The proof for this is in [45]. Given a set of relational distributions, {P(a_k, t_i) | i = 1, ..., T}, the SoPF can be arrived at by using the Karhunen-Loeve (KL) transform or, for the discrete case, by principal component analysis (PCA), or a combination of PCA and Linear Discriminant Analysis (LDA) if we want to model in the presence of signer and background variations. In practice, we can consider the subspace spanned by a few (N << n) dominant vectors associated with the large eigenvalues. Thus, a relational distribution can be represented using these N coordinates (the c_i(t)s), which is a more compact representation than a normalized histogram-based representation.

Note that this use of PCA is different from other uses of this technique in motion tracking. For example, Black and Jepson [7] also used PCA, but in the context of tracking and matching moving objects. The representation also is different: they use PCA over the image pixel space whereas we use it over relational probability functions. Sclaroff and Pentland [47] use PCA to obtain canonical shape descriptions of deformable objects for recognition or indexing. They use a finite-element model to deform one shape into another, guided by primitive correspondences. Shape similarity is quantified by the nature of the deformation, captured by the coefficients associated with a few modal shapes that were inferred using PCA. Our shape representation does not require a prior model, nor does it assume perfect segmentation of the object from the background. The other attractive aspects are that it does not require primitive tracking or correspondence, it is amenable to learning, and there is no assumption about single-pixel movement between frames.

The eigenvectors of the SoPF associated with the fifteen largest eigenvalues are shown in Figure 5.4. The space was trained for the 39 signs.
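A minimal sketch of how such a space could be learned with plain PCA is given below. The eigen decomposition is done via an SVD of the centered, square-rooted, flattened histograms; the function and variable names are ours, and this is a sketch rather than the exact code used in this work.

import numpy as np

def learn_sopf(rds, num_dims=15):
    """Learn a Space of Probability Functions from relational distributions.

    rds : array of shape (T, B, B), one B x B relational distribution per frame.
    Returns the mean function, the dominant basis functions, and the
    coordinates c_i(t) of every training frame in the SoPF."""
    X = np.sqrt(rds.reshape(len(rds), -1))   # square root ties distances to the
                                             # Bhattacharyya distance
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)   # PCA via SVD
    basis = vt[:num_dims]                    # dominant orthonormal functions phi_i
    coords = Xc @ basis.T                    # c_i(t) for each training frame
    return mu, basis, coords

def project_into_sopf(rd, mu, basis):
    """Project a new relational distribution into the learned SoPF."""
    return (np.sqrt(rd.ravel()) - mu) @ basis.T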


Figure 5.4. Eigenvectors of SoPF. Dominant dimensions of the learned SoPF over the 39 signs.

Figure 5.5. Fall of Eigenvalues for Relational Distributions.

The size of each relational distribution is 30 x 30. The vertical axes of the images plot the distance attribute dy, and the distance attribute dx is along the horizontal axes. The first eigenvector shows 3 modes in it. The bright spot in the second eigenvector emphasizes the differences in the attribute dx between the two features. The third eigenvector is radially symmetric, emphasizing the differences in both attributes. Figure 5.5 shows the sorted eigenvalues for the relational distributions. Notice that most of the energy of the variation is captured by the few large eigenvalues. It can be seen that the eigenvectors associated with the 15 largest eigenvalues suffice. This number is only a small fraction of the 900 entries in the 30 x 30 histogram representation of a relational distribution. A sentence in ASL forms a trace in this Space of Probability Functions, and signs are detected in the sentence by correlating their traces with the trace of the sentence. We will see this in Chapter 7.


CHAPTER 6
FACE DETECTION AND TRACKING

To capture non-manual expressions in ASL, it is important to accurately detect the face so that the motion of the face can be disregarded when extracting the facial expression involved in the sentence. At the same time, it is also important to find the motion trajectories of the face to determine whether the `Negation' non-manual is used in the sentence. In this chapter we discuss how face detection is performed and the way in which we find the motion trajectories of the face for a sentence. Figure 6.1 depicts the algorithm to detect the face and to find the motion trajectories, which we expand upon next. Note that all the face information processing is performed on the image sequence obtained from the camera focused just on the face.

6.1 Detecting the Face Using Eye Templates

There are various sophisticated approaches to detect faces [43], [14], [56]. Here we adopt a very simple approach: we find the eyes within the given image and thus detect the face. We use template matching for eye detection. Different eye templates are used and a mean eye template is constructed from these templates. The eye templates used and the mean eye template are shown in Figure 6.2. It should be noted that the eye templates used to form the mean eye template are from persons other than the signer. Eye templates were obtained from 4 different persons using the same camera and the same imaging conditions as used for the data collection of the ASL sentences. We correlate the mean eye template with the input image. An example of an input image is shown in Figure 6.3(a). After correlation with the mean eye template, a distribution is obtained which indicates the probability of the presence of the center of the rectangular box bounding the eye at each position.


Figure 6.1. Flowchart for Face Detection and Tracking. The flowchart shows, in detail, the eye detection within the face image and the finding of the motion trajectories of the eye-center for the whole sentence.

Figure 6.2. Eye Templates Used and Mean Image of the Eye Templates. The first 12 images show the eye templates of 4 different persons, while the last image is the mean of these 12 eye templates.


Figure 6.3. Eye Detection and Masking of the Face With the Elliptical Structure. (a) shows the first image of the sentence `LIPREAD CAN I', synchronized with the body image. (b) shows the distribution of correlation of the mean eye template with the facial image in (a); brighter spots convey higher chances of finding the eye-center at that location. (c) shows the bounding rectangle around the eyes. (d) shows the ellipse fitted to the face after detecting the eyes and then translated to a fixed position (since (a) is the first image of the sentence). The subsequent images in the sentence are translated with respect to this first image for finding the motion trajectories of the face.

The distribution for the image in Figure 6.3(a) is shown in Figure 6.3(b). Brighter spots indicate a higher probability of finding the center of the rectangular box bounding the eye at that position. It is clearly seen in this image that there are bright spots near the centers of the two eyeballs. The brightest spot in the distribution image is taken as the center of the rectangular box bounding the eye. The rectangular box is of the same size as the eye templates. The face image in which the eyes are bounded by the rectangular boxes can be seen in Figure 6.3(c). The correlation is done over the whole image only for the first image frame of the sentence. For the subsequent images, the center of the rectangular box bounding the eye is found by correlating around the neighborhood of the center found in the previous image. A window of 10 pixels in width and height is considered for the neighborhood search.
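A sketch of this template-matching step, using normalized cross-correlation over either the whole image (first frame) or a small neighborhood (subsequent frames), is given below. The grayscale input, the function names, and the brute-force search are our own simplifications; they are not the exact routines used in this work.

import numpy as np

def match_template(image, template, search=None):
    """Return the (row, col) of the top-left corner of the window that best
    matches the template under normalized cross-correlation.

    search : optional (r0, r1, c0, c1) region restricting where the window's
             top-left corner may lie; None searches the whole image."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    r0, r1, c0, c1 = search or (0, image.shape[0] - th, 0, image.shape[1] - tw)
    best, best_pos = -np.inf, (r0, c0)
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            w = image[r:r + th, c:c + tw]
            wn = (w - w.mean()) / (w.std() + 1e-8)
            score = float((wn * t).sum())
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos

def track_eye_centers(frames, mean_eye_template, window=10):
    """Full-image search on the first frame, then a +/- window-pixel
    neighborhood search around the previous eye-center, as described above."""
    th, tw = mean_eye_template.shape
    centers = []
    for i, frame in enumerate(frames):
        if i == 0:
            r, c = match_template(frame, mean_eye_template)
        else:
            pr, pc = centers[-1][0] - th // 2, centers[-1][1] - tw // 2
            region = (max(pr - window, 0), min(pr + window, frame.shape[0] - th),
                      max(pc - window, 0), min(pc + window, frame.shape[1] - tw))
            r, c = match_template(frame, mean_eye_template, search=region)
        centers.append((r + th // 2, c + tw // 2))   # eye-center = box center
    return centers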


6.2 Masking the Face With an Elliptical Structure

After the detection of the eyes, we mask the face with an elliptical structure. We use the golden ratio for the face [23] to construct the elliptical mask. The golden ratio is given by:

    height (h) / width (w) = (1 + √5) / 2

where w and h indicate the width and height of the face (Figure 6.4). The width of the face is obtained from the eye template, which is 70 pixels in this case. The height is calculated from the above formula and comes out to be 111 pixels in this case. We use two ellipses to construct the elliptical structure for masking. The two ellipses and the elliptical structure used are shown in Figure 6.4, while the image of the face after eye detection and after masking with the elliptical structure is shown in Figure 6.5. As can be seen in Figure 6.4, four parameters (the major and minor axes of the two ellipses) are needed to construct the elliptical structure from the two ellipses, Ellipse I and Ellipse II. For Ellipse I, the two extents are w and dh, with values of 70 and 75 pixels respectively; the lower portion of Ellipse I is used. For Ellipse II, the two extents are w and dw, with values of 70 and 36 pixels respectively; the upper portion of Ellipse II is used. It should be noted that h = dw + dh, and its value is 111 pixels as calculated from the formula. The demarcated facial portion obtained by masking the elliptical structure onto the face is used to extract facial information.

6.3 Eigen Representation for Condensing the Facial Expression Information

To reduce the computational complexity it is necessary to represent the face image obtained after masking with the elliptical structure in a lower dimensional space. Principal Component Analysis of these images reduces the dimensionality of the image data. Thus each face image can be represented by the coordinates of orthogonal eigenvectors.
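The elliptical mask can be sketched as below. Here dh and dw are treated as the vertical extents of the lower and upper half-ellipses (so that h = dh + dw = 111 pixels, as stated above); this interpretation, the anchoring of the two ellipses at the eye line, and the function name are our assumptions rather than details taken from the original implementation.

import numpy as np

def elliptical_face_mask(shape, eye_line_center, w=70, dh=75, dw=36):
    """Boolean face mask built from two half-ellipses of common width w.

    shape           : (rows, cols) of the face image.
    eye_line_center : (row, col) point where the two half-ellipses are joined
                      (assumed here to be the mid-point between the eyes).
    Ellipse I (lower half, vertical extent dh) covers the cheeks and chin;
    Ellipse II (upper half, vertical extent dw) covers the forehead."""
    rows, cols = np.indices(shape)
    r0, c0 = eye_line_center
    a = w / 2.0
    lower = (((cols - c0) / a) ** 2 + ((rows - r0) / dh) ** 2 <= 1.0) & (rows >= r0)
    upper = (((cols - c0) / a) ** 2 + ((rows - r0) / dw) ** 2 <= 1.0) & (rows < r0)
    return lower | upper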


Figure 6.4. Anthropometric Facial Proportions and Masking of the Face With the Elliptical Structure. The left side shows the face after eye detection and the right side shows the demarcated facial portion obtained after masking.

Figure 6.5. Image Showing the Face After Eye Detection and Masking With the Elliptical Structure. (a) shows the eye detection within the facial image and (b) shows the image of the face after masking with the elliptical structure. The elliptical structure is fitted to the face after detecting the eyes. The face image shown in (b) is used as the input for understanding the facial expressions involved in ASL sentences.


Figure 6.6. Fall of Eigenvalues for Facial Expressions.

To form the PCA space for `face', we use face images of the 39 signs for training. Figure 6.6 shows the sorted eigenvalues for the face images. Notice that most of the energy of the variation is captured by the few large eigenvalues. It can be seen that the eigenvectors associated with the 20 largest eigenvalues suffice. This number is very small compared to 72 x 100 = 7200, which is the size of the image, i.e., the original dimensionality of the face image data. The face images (images obtained after masking with the elliptical structure) of the sentence form a trace in the PCA space of `face', and the similarity of signs in the sentence based on facial expressions is obtained by correlating the traces of the signs with the trace of the sentence. We will see this in Chapter 7.

6.4 Motion Trajectory of the Face

It is important to find the motion trajectory of the face to see whether the ASL sentence has any `Negation' in it. For example, the sentence `I don't understand' is signed exactly the same as `I understand' in the manual marking, except that there is a distinct `head shake' indicating `Negation' in the sentence `I don't understand'. The face motion trajectory is obtained by detecting the eyes in the facial image. We first detect the eyes in the first frame of the sentence by correlation, then we select the eye-center for this first image. The eye-center is the center pixel of the rectangular box bounding the eye (Figure 6.4) and is used to represent the face motion trajectory. We also find the


eye-center for the rest of the images in the sentence, considering only a neighborhood of 10 pixels around the eye-center obtained in the previous image. The face location in the first frame is subtracted from all subsequent frames to arrive at a somewhat translation invariant representation. Figure 6.7 shows the motion of the face in the sentence `LIPREAD CANNOT I'. The face images for all the frames in the sentence are arranged with respect to the first image (Figure 6.7(a)). Figure 6.7(l) shows the motion trajectory for the same sentence; the horizontal motion of the face indicating the `head shake' can be clearly seen in it. Figure 6.8 shows motion trajectories of the face for various sentences: (a), (b) and (c) clearly show the presence of `Negation' in the sentences, (d) shows vertical motion of the face, indicating a `head nod', while (e) and (f) show the motion trajectories for sentences in which no positive or negative meaning is conveyed.
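The translation-compensated trajectory described here amounts to subtracting the first frame's eye-center from every frame; a small sketch (with our own function name) follows.

import numpy as np

def face_motion_trajectory(eye_centers):
    """Subtract the first frame's eye-center from every frame so that the
    trajectory is (somewhat) invariant to where the face starts in the image.

    eye_centers : list of (row, col) eye-center locations, one per frame of
                  the sentence."""
    traj = np.asarray(eye_centers, dtype=float)
    return traj - traj[0]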


Figure 6.7. Face Movement and its Motion Trajectory for the Sentence `LIPREAD CANNOT I'. (a) is the first frame in the sentence. (b)-(k) represent frames 9, 13, 16, 19, 20, 31, 43, 48, 53 and 55 respectively. These frames show the movement of the head with respect to the first frame. (l) shows the motion trajectory of the face for the complete sentence. The motion trajectory is plotted for the pixel at the center of the eye. The motion trajectory is nearly horizontal due to the `head shake', which conveys a negative aspect in the sentence.


(a) `DONT-KNOW I' (b) `I NOT HAVE KEY' (c) `NO' (d) `YES' (e) `YOU UNDERSTAND ME' (f) `SUITCASE I PACK FINISH'
Figure 6.8. Motion Trajectories of the Face for Various Sentences. (a), (b) and (c) show the motion trajectories of the face for sentences having `Negation' in them. It can be seen that they all show head motion in the horizontal direction, which indicates a `head shake'. (d) has head motion in the vertical direction, which indicates something positive being conveyed through a head nod; in fact it is the motion trajectory for the sign `YES'. (e) and (f) show motion trajectories for sentences which do not convey a negative meaning.


CHAPTER 7
EVALUATION FRAMEWORK AND RESULTS

In this chapter we discuss the strategy used to combine the non-manual information with the manual information. We have used the facial expression information(1) to reduce the deletion and insertion errors, while face motion information is used, through the motion trajectories of the face, to find whether the sentence contains `Negation' or not. The chapter also presents the five different cases we have considered for training and testing, and we report the results on the same. It was noticed that combining the non-manual information directly with the manual information is very difficult for the following reasons:

The non-manuals are not synchronized with the manuals. For example, if in one sentence the eyebrows are raised at a frame x, then the same expression may occur at some other neighborhood frame x+dx in another instance of the sentence (x being the frame with the same manual position of the hands in both instances of the sentence). This causes problems in finding the same facial expression for a sign in another instance of the same sentence. It should be noted that a sign is marked(2) for training with respect to its manual part, and the synchronous face images are taken as its non-manual part. These synchronous face images do not have the same facial expressions in both instances of the sentence.

Another problem in finding the facial expression related with a sign occurs when there is a strong non-manual indicating `Assertion' or `Negation' in the sentence. In such cases the facial expressions are totally dominated by the face movements, indicated by `head shakes' or `head nods'.

(1) Non-manual information and facial expression information are both used here to indicate the facial expression only; non-manual is the term mostly used in ASL phonology to indicate the information present in facial expression, face motion and torso movement.
(2) Marking a sign means selecting its starting and ending frames to get the sequence of frames in the sign, based on manual information.


Figure 7.1. Bottom-Up Approach for Intermediate Level Recognition. The figure shows intermediate level recognition of ASL signs in the sentence with the aid of non-manual features like facial expressions and facial motion. This is a bottom-up approach for intermediate level recognition and does not use any context or grammatical information from ASL.

As a result, we adopted a bottom-up approach in which we first detect the top n signs that are most probable to occur in the sentence based on the manual information. We then use the non-manual information on top of it to discard the δ signs(1) that have very low probability of occurring in the sentence, considering the facial expressions. At the same time, we use the facial motion, via the motion trajectories of the face, to find whether the sentence has `Negation' in it. The selection of n and δ will be discussed later. It should be noted that we use manual information, facial expressions and facial motion (motion trajectories of the face) as independent channels. This bottom-up approach yields the best use of the non-manual aspects in ASL. The block diagram showing the approach is given in Figure 7.1. We start the chapter with the method used to recognize a sign using the SoPF trace of the manual part of a sentence and the PCA trace of the non-manual part.

(1) n and δ denote numbers.


7.1 Similarity Measures and Position of a Sign in the Sentence

Training is done on relational distributions of the manual part for the various signs to construct an SoPF space. In the same way, training on the face images masked with the elliptical structure, as shown in Figure 6.5(b), for the various signs gives the face space, or PCA space, for facial expressions. Articulated motion sweeps a path or trace through SoPF space; in the same way, a change in expressions sweeps a path through face space. The eigenvectors of the face space are shown in Figure 7.2, while the eigenvectors of the relational distributions can be seen in Figure 7.3. The various "expression" features of the face, like the lips, eyes and eyebrows, which are used most while signing, can be seen distinctly in the eigenvectors of the face space. The face space and the relational distributions were trained for the 39 signs given in Table 3.2, with four instances of each sign. Signs were taken for training from each sentence in which they occurred, to reduce the coarticulation effect. Figure 7.4(a) & (b) shows the fall of eigenvalues for the facial expressions and the relational distributions. For the face space, the number of eigenvectors considered was 20, while that for the SoPF was 15.

Essential grammatical information is conveyed by a variety of facial expressions: (a) the eyebrows may be raised, lowered, narrowed, etc.; (b) the cheeks may puff or be concave; (c) the lips may raise, purse, etc.; (d) the nose may contort, wrinkle, etc.; and (e) the eyes may blink, close, or open widely, as well as gaze in specific directions [2]. (a), (b), (c), (d) & (e) are facial expressions that involve no facial motion. These expressions can be seen clearly in the eigenvectors of the face space in Figure 7.2. Eigenvectors (4) and (9) describe (a); (5) and (10) describe (b); (3), (4), (8), (13), (14), (15) and (19) describe (c); (3), (12) and (15) describe (d); while eigenvectors (17) & (19) highlight part (e). It is observed that, due to the lip movements of the ASL gestures being signed, part (c) is emphasized the most among all the facial expressions by the eigenvectors. Since there is only one signer, the first eigenvector (1) captures the facial features of the signer, like the eyes, nose and lips.

Distances in SoPF space quantify the motion involved in the manual part, while those in the face space quantify changes in facial expression. In this work, we adopt a simple distance measure between two traces to find a sign in the sentence using the manual and non-manual parts. For the manual part, we use correlation of the trace of the trained sign


Figure 7.2. Eigenvectors of Face Space. Dominant dimensions (numbered (1) through (20)) of the learned facial expressions over the 39 signs.


Figure 7.3. Eigenvectors of SoPF. Dominant dimensions of the learned SoPF over the 39 signs.

over the trace of the sentence to get the smallest distance of the sign within the sentence. For the non-manual part, we correlate the trace of the trained sign near the time neighborhood where the smallest distance for the manual part has been found, to get the smallest distance of the sign within the sentence. The formula to find the minimum distance of a sign within a sentence, by correlating the trace of the sign with the sentence for the manual part, is given below. The square of the Euclidean distance is used as the distance metric.

    \mathrm{Corr}(s_m, S_m)(i) = \frac{1}{N_{s_m}} \sum_{t=i}^{N_{s_m}+i} \sum_{k=1}^{E} \left( c_k^{s_m}(t-i) - c_k^{S_m}(t) \right)^2, \qquad i = 1, \ldots, N_{S_m} - N_{s_m}

where
Corr(s_m, S_m): the distribution of the sign s_m in the sentence S_m after correlating the trace of the sign with the sentence,
Corr(s_m, S_m)(i): the distance of the sign s_m in the sentence S_m at the i-th position,
s_m: the image sequence for the manual part of the sign, with sign length N_{s_m},
S_m: the image sequence for the manual part of the whole sentence, with sentence length N_{S_m},
E: the total number of eigenvectors considered for the SoPF space,
c_k: the coordinate along the k-th eigenvector.

The minimum distance of the sign in the sentence and its position for the manual part are given by the formulas below.


Figure 7.4. Fall of Eigenvalues for Face Space & SoPF. (a) & (b) show the fall of eigenvalues for the face space and the SoPF respectively.


    \mathrm{Similarity}(s_m, S_m) = \min_{i = 1, \ldots, N_{S_m}-N_{s_m}} \mathrm{Corr}(s_m, S_m)(i)

    \hat{i} = \arg\min_{i = 1, \ldots, N_{S_m}-N_{s_m}} \mathrm{Corr}(s_m, S_m)(i)

    \mathrm{Position}(s_m, S_m) = \hat{i} + N_{s_m}/2

where
Similarity(s_m, S_m): the minimum distance of the sign within the sentence,
\hat{i}: the value of i for which Corr(s_m, S_m)(i) is minimum,
Position(s_m, S_m): the position where the sign s_m occurs in the sentence S_m.

Thus we get the position of the sign within the sentence using the manual part. We use this position to find the smallest distance of the trained sign in the sentence for the non-manual part. We correlate the trained facial expression of the sign in the time neighborhood of this position and find the smallest distance. The formula for this is given below:

    \mathrm{Corr}(s_{nm}, S_{nm})(i) = \frac{1}{N_{s_{nm}}} \sum_{t = i - N_{s_{nm}}/2}^{i + N_{s_{nm}}/2} \sum_{k=1}^{E} \left( c_k^{s_{nm}}\!\left(t - (i - N_{s_{nm}}/2)\right) - c_k^{S_{nm}}(t) \right)^2, \qquad i = \mathrm{Position}(s_m, S_m) - \Delta, \ldots, \mathrm{Position}(s_m, S_m) + \Delta

where
Corr(s_nm, S_nm): the distribution of the sign s_nm in the sentence S_nm after correlating the trace of the sign with the sentence within a neighborhood of Δ frames around Position(s_m, S_m),
Corr(s_nm, S_nm)(i): the distance of the sign s_nm in the sentence S_nm at the i-th position,
s_nm: the image sequence for the non-manual part of the sign, with sign length N_{s_nm},
S_nm: the image sequence for the non-manual part of the whole sentence, with sentence length N_{S_nm},
E: the total number of eigenvectors considered for the PCA space of the non-manual part (facial expressions),
c_k: the coordinate along the k-th eigenvector.

It should be noted that N_{s_nm} = N_{s_m} and N_{S_nm} = N_{S_m}.
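The formulas above translate almost directly into code. The sketch below assumes that the manual and non-manual traces are arrays of per-frame coordinates in the SoPF and face space respectively; the function names, the array interface, and the default neighborhood size are ours, not taken from the original implementation.

import numpy as np

def corr_distances(sign_trace, sentence_trace):
    """Corr(s, S)(i): mean squared Euclidean distance between the sign trace
    and the sentence trace when the sign is slid to start at frame i.

    sign_trace     : array (N_s, E) of per-frame coordinates of the sign.
    sentence_trace : array (N_S, E) of per-frame coordinates of the sentence."""
    n_s = len(sign_trace)
    n_windows = len(sentence_trace) - n_s + 1
    dists = np.empty(n_windows)
    for i in range(n_windows):
        diff = sentence_trace[i:i + n_s] - sign_trace
        dists[i] = (diff ** 2).sum() / n_s
    return dists

def similarity_and_position(sign_trace, sentence_trace):
    """Similarity(s, S) = min_i Corr(s, S)(i); Position(s, S) = i_hat + N_s/2."""
    dists = corr_distances(sign_trace, sentence_trace)
    i_hat = int(np.argmin(dists))
    return float(dists[i_hat]), i_hat + len(sign_trace) // 2

def nonmanual_similarity(sign_face_trace, sentence_face_trace, position, delta=5):
    """Smallest non-manual distance over windows whose centers lie within
    +/- delta frames of the position found from the manual part (delta = 5 is
    an assumed neighborhood size, not a value taken from this work)."""
    dists = corr_distances(sign_face_trace, sentence_face_trace)
    centers = np.arange(len(dists)) + len(sign_face_trace) // 2
    in_nbhd = np.abs(centers - position) <= delta
    return float(dists[in_nbhd].min()) if in_nbhd.any() else float(dists.min())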


Using the above, we get the minimum distance of the sign in the sentence for the facial expressions, which can be calculated as follows:

    \mathrm{Similarity}(s_{nm}, S_{nm}) = \min_{i = \mathrm{Position}(s_m, S_m) - \Delta, \ldots, \mathrm{Position}(s_m, S_m) + \Delta} \mathrm{Corr}(s_{nm}, S_{nm})(i)

Let us look at the distribution of signs in the sentence with an example. Figure 7.5(a) shows the distribution, i.e., Corr((LIPREAD)_m, (LIPREAD CAN I)_m), for the sign `LIPREAD' in the sentence `LIPREAD CAN I' using the manual part only. As can be seen, the sign is detected where the distance is minimum, at frame 12. Thus

    Similarity((LIPREAD)_m, (LIPREAD CAN I)_m) = 8.336
    Position((LIPREAD)_m, (LIPREAD CAN I)_m) = 12

In the same way, Figure 7.5(b) & (c) show the distributions of the signs `CAN' & `I' in the same sentence. It can be seen that `LIPREAD', `CAN' & `I' occur as the first three signs with the smallest distances. Hence the sentence can be said to have perfect recognition with respect to the use of the manual part. The occurrence of these three signs within the sentence can be clearly seen in Figure 7.6, with the respective positions of the signs and their lengths. The epenthesis movements, indicated by E, can also be clearly seen in between the signs. This relates to the movement-hold model of ASL [34]. Table 7.1 shows the positions and distances of the eight signs with the least distance when all the signs are sorted in ascending order of distance. The positions of these eight signs and their names can be seen in Figure 7.7. This is all with respect to using manual information only. It is also observed that the signs `LIPREAD', `PHONE' and `JUST' are detected close to each other; all of them have motion in front of the face, which results in nearly the same traces for these signs in the SoPF. The same eight signs are correlated with the sentence for finding the smallest distance


Figure 7.5. Distribution of Signs Within the Sentence. (a), (b) and (c) show the distributions of the signs `LIPREAD', `CAN' & `I' when correlated with the sentence `LIPREAD CAN I'. The y axis shows the distance of the sign at that position in the sentence; the smaller the distance, the higher the chances of finding the sign at that frame of the sentence. It can be seen that, within the sentence, `LIPREAD' occurs first, then `CAN', and `I' has its minimum distance at the last position among the three signs. The second minimum in the case of the sign `CAN' occurs when the signer brings the hand down after signing `I'.


Figure 7.6. The Position of the Signs in the Sentence. The figure shows the positions of the signs `LIPREAD', `CAN' & `I' in the sentence `LIPREAD CAN I'; E indicates the epenthesis movements present between two signs. It can be observed that `LIPREAD' occurs first, followed by `CAN' and then `I'. There is a large void after frame 49, where the sign `I' ends. This is because the signer held her hand for signing `I' in front of the chest until frame 60. This relates to the movement-hold model of ASL [34]. From frame 60 until frame 72 the signer brings the hand down, while after frame 72 only the face is present, and hence the distance curve is nearly flat, as can be seen in the previous figure of the distribution of signs within the sentence.


Figure 7.7. Eight Signs With Minimum Distance When Correlated With the Sentence. The figure shows the eight signs in the sentence `LIPREAD CAN I' with minimum distances and the positions where they occur in the sentence. It can be seen that the signs `LIPREAD', `CAN' & `I' are the three signs with minimum distance, and hence the sentence can be accurately detected. The signs which are similar to each other can also be seen. Specifically, the three signs `LIPREAD', `PHONE' & `JUST' are detected very close to each other; all three have motion in front of the face, which results in nearly the same traces for these signs in the SoPF. Table 7.1 shows the frames at which these signs are detected and the distances of the signs at those positions.

Figure 7.8. Eight Signs With Minimum Distance in the Sentence Using the Non-Manual Information. The figure shows the same eight signs in the sentence `LIPREAD CAN I' as given in Figure 7.7, with their positions in the sentence, but here the distances are found using the non-manual part. The distances for these signs are found by correlating the signs in the neighborhood of the positions where they occurred in the sentence using the manual part. Table 7.2 shows the exact positions and distances for the signs.


Table 7.1. Eight Signs With Smallest Distances and their Positions. The eight signs with the smallest distances in the sentence `LIPREAD CAN I' and the frame numbers where they occur. As can be seen from the table, the signs `LIPREAD', `PHONE' & `JUST' occur very close to each other. The total number of frames in the sentence is 79.

Sign      Distance   Frame Number
I         1.427      47
LIPREAD   8.336      12
CAN       15.812     30
BUY       22.266     63
PHONE     22.826     13
HAVE      27.049     27
JUST      35.281     14
KEY       37.989     54

Table 7.2. Eight Signs With Smallest Distances and their Positions Using Non-Manual Information. The table shows the eight signs with the smallest distances in the sentence `LIPREAD CAN I' and the frame numbers where they occur, using the non-manual information. As can be seen from the table, the signs `LIPREAD', `PHONE' & `JUST' occur very close to each other. The total number of frames in the sentence is 79.

Sign      Distance      Frame Number
I         53444.96      47
LIPREAD   730150.53     13
CAN       833214.84     25
JUST      1386294.03    12
BUY       2344207.14    70
KEY       27.049        56
PHONE     4660461.13    17
HAVE      5378854.12    29


7.2 Deletion and Insertion Errors

In this section, we define the types of errors that can occur. We sort the signs on the basis of the minimum distance of each sign in the sentence. Then we pick the n signs with the least distances; these n signs have the highest probability of occurrence in the sentence. If a sign is part of a sentence but is not present in these n signs, then a deletion error has occurred: this is a sign which is a part of the sentence but has been deleted from it. The number of deletion errors depends on n; as n increases, errors go down, but the cost of the higher level processing increases since it has to consider more possibilities. Similarly, we look at these n signs to find the number of insertion errors. A sign which is not a part of the sentence but has a distance less than the last correctly occurring sign of the sentence is wrongly inserted in the sentence; such a sign is said to have caused an insertion error. It should be noted that an insertion error is caused by a sign which is similar to a sign present in the sentence. Also, shorter signs tend to show smaller distances when correlated with the sentence and hence cause insertions. Insertion errors can be reduced using context knowledge or by grouping signs that are very similar. Here we have reduced the insertion errors using the facial expression information in the sentence. Deletion errors are caused by the following reasons:

- Segmentation errors may cause some of the skin blobs to be missed, resulting in deletion errors.
- The same sign may be signed very differently each time, causing the trained sign to be very different from the sign present in the test sentence.
- The signer may have omitted some signs altogether in the sentence. For example, in one of the instances of the sentence `I packed my suitcase' the ASL sign `I' was not present.

It should be noted that, in a way, deletion errors indicate detection errors while insertion errors indicate false alarms, although in the traditional sense they are not exactly detection errors and false alarms respectively.
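These error definitions can be made concrete with a small helper. The sketch below assumes the candidate signs for a sentence are already sorted by ascending distance; the function name and the list interface are ours.

def deletion_insertion_errors(ranked_signs, true_signs, n):
    """Count deletion and insertion errors for one sentence.

    ranked_signs : vocabulary signs sorted by ascending distance in the sentence.
    true_signs   : the signs actually present in the sentence.
    n            : number of top-ranked signs retained.

    A true sign missing from the top n is a deletion; a sign that is not in the
    sentence but is ranked above the last correctly detected true sign counts
    as an insertion."""
    top_n = ranked_signs[:n]
    deletions = [s for s in true_signs if s not in top_n]
    found_ranks = [top_n.index(s) for s in true_signs if s in top_n]
    if not found_ranks:
        return len(deletions), 0
    last_correct = max(found_ranks)
    insertions = [s for s in top_n[:last_correct] if s not in true_signs]
    return len(deletions), len(insertions)

# Example with the ranking of Table 7.1 (true signs LIPREAD, CAN, I; n = 6):
# deletion_insertion_errors(['I', 'LIPREAD', 'CAN', 'BUY', 'PHONE', 'HAVE',
#                            'JUST', 'KEY'], ['LIPREAD', 'CAN', 'I'], 6) -> (0, 0)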


Table 7.3. Training and Testing Sets for 5-Fold Cross Validation. For example, in Case III the signs are trained on the instances from sentences 1, 2, 4 & 5, while testing is done on the instances of sentence 3.

Case   Train      Test
I      1,2,3,4    5
II     1,2,3,5    4
III    1,2,4,5    3
IV     1,3,4,5    2
V      2,3,4,5    1

7.3 Experiments and Results

We use 5-fold cross validation. We train the SoPF space for the relational distributions and the PCA space for the facial expressions using the signs from four instances of each sentence, while we test on the fifth instance. It should be noted that some of the signs occur in more than one sentence, as seen in Table 3.1. All the instances of these signs are used for training to reduce the coarticulation effect. Table 7.3 shows the training and testing sets for the various scenarios. The number of deletion errors is reported for different values of n. Recall that n is the total number of signs considered per sentence which have the least distances. The variation in the deletion errors with change in n for the five cases can be seen in Figure 7.9(a), while Figure 7.9(b) shows the total deletion errors over all five cases. In Figure 7.9(a) the deletion errors are out of a total of 65 signs present in the data set of one instance of the 25 sentences, while in Figure 7.9(b) the deletion errors are out of a total of 325 signs present in the data set considering every case. The same information is shown in Table 7.4. As can be seen, there is a drastic increase in deletion errors when the value of n decreases from 8 to 6. The accuracy with respect to deletion errors is given by:

    Accuracy_D = (Total correct signs recognized in the top n signs) / (Total signs present)

Thus at n = 6 the accuracy is (325 − 39)/325 = 88%, while the accuracy at n = 8 is 94.46%. The insertion errors at n = 6 are listed in Table 7.5.


Table 7.4. Deletion Errors for the Five Test Cases. The table shows the number of deletion errors for the various cases at different values of n. As can be seen from the table, nearly all five cases show a similar number of deletion errors at the same value of n.

n    Case I   Case II   Case III   Case IV   Case V   Total
16   2        1         1          1         3        8
15   3        1         1          1         3        9
14   3        1         1          1         4        10
13   3        1         1          2         4        11
12   4        1         1          2         4        12
11   4        1         1          2         4        12
10   4        1         2          2         5        14
9    4        1         2          3         6        16
8    4        2         4          3         6        18
7    5        5         5          4         8        27
6    8        7         8          6         10       39

Table 7.5. Insertion Errors for n = 6, Without Using Non-Manual Information.

Case       Insertion Errors Without Using Non-Manual Information
Case I     18
Case II    18
Case III   21
Case IV    28
Case V     26
Total      111

7.3.1 Integrating Non-Manual Information

The strategy we use to improve the accuracy at n = 6 using the facial or non-manual information is outlined below.

1. Find the n signs with the least distances in the sentence using the manual information.
2. Find the distances for the same n signs found in Step 1, using the non-manual information.
3. Sort these signs in ascending order of the distances obtained from the non-manual information.
4. Discard the δ signs having the largest distances in the sorted list obtained in Step 3.
5. Keep the remaining n − δ signs from Step 1.

This strategy leads to a decrease in insertion as well as deletion errors; a minimal sketch of the re-ranking is given below.
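The sketch assumes that per-sign distance functions for the manual and non-manual channels are available; the names and the callable interface are ours, not taken from the original implementation.

def top_signs_with_nonmanual(vocab, manual_dist, nonmanual_dist, n=8, discard=2):
    """Steps 1-5: keep the n manually ranked candidates, then drop the
    `discard' candidates whose non-manual distances are largest.

    vocab          : iterable of sign labels.
    manual_dist    : sign -> smallest manual (SoPF) distance in the sentence.
    nonmanual_dist : sign -> smallest non-manual (face space) distance near the
                     position found from the manual channel."""
    # Step 1: the n candidates with the least manual distances.
    candidates = sorted(vocab, key=manual_dist)[:n]
    # Steps 2-4: rank the same candidates by non-manual distance and discard
    # the worst `discard' of them.
    dropped = set(sorted(candidates, key=nonmanual_dist)[n - discard:])
    # Step 5: keep the remaining candidates, still ordered by manual distance.
    return [s for s in candidates if s not in dropped]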


Figure 7.9. Variation in Deletion Errors With the Number of Signs Considered Per Sentence (n). (a) shows the variation for the five different cases, while (b) shows the total deletion errors over all five cases.


Table 7.6. Insertion Errors for n = 6, With and Without Using Non-Manual Information. It should be noted that for every case there is a decrease in insertion errors.

Case    Without Using Non-Manual Information   Using Non-Manual Information
I       18                                      12
II      18                                      11
III     21                                      15
IV      28                                      23
V       26                                      19
Total   111                                     80

We choose n = 8 and δ = 2 because there is a significant increase in deletion errors when the value of n is changed from 8 to 6. Also, since the average number of signs per sentence is 2.7, we look for the 6 most probable signs per sentence, which is nearly twice the average number of signs per sentence. This gives fairly conservative results to be fed to the upper layers (Figure 7.1) of the continuous sign language recognition approach (Figure 1.1), which can further use the context information of these signs, based on the grammar of ASL, to recognize the sentence.

An example that shows the reduction in insertion and deletion errors for the sentence `PEOPLE LONG LINE-WAIT ANGRY' is shown in Figure 7.10. It can be seen from Figure 7.10 that the non-manual information correctly recognizes the sentence, but since there is more information in the manuals, we keep the distances of the manuals only. Thus, we have used the non-manuals only to eliminate δ = 2 signs, to find the 6 (n − δ = 8 − 2 = 6) most probable signs occurring in the sentence. In this case the signs `HAVE', `I' & `BUY' have been inserted, while the sign `LONG', which occurs in 7th position, has been deleted when we consider only the top 6 signs. But the use of non-manual information prevents the deletion of the sign `LONG' while removing the inserted sign `HAVE'. Thus, in this case, the decrease in deletion as well as insertion errors is 1. Table 7.6 shows the decrease in insertion errors for the five cases in the 5-fold cross validation when the top six signs are considered, and Table 7.7 shows the decrease in deletion errors. Deletion errors are critical because they dictate the detection rate at that particular rank (n). It is observed that there is a considerable decrease in deletion errors because of the use of non-manual information. The accuracy obtained before and after using non-manual information is shown below; also shown is the total number of sentences that are perfectly recognized.


Figure 7.10. Flowchart Explaining the Strategy to Combine the Manual and Non-Manual Information. The example is shown for the sentence `PEOPLE LONG LINE-WAIT ANGRY'. The decrease in insertion and deletion errors is one for this example.

Table 7.7. Deletion Errors for n = 6, With and Without Using Non-Manual Information. It should be noted that for every case there is a decrease in deletion errors.

Case    Without Using Non-Manual Information   Using Non-Manual Information
I       8                                       6
II      7                                       3
III     8                                       7
IV      6                                       3
V       10                                      7
Total   39                                      26


Figure 7.11. Aspect Ratio (W/H) for a Sentence With `Negation' in it.

By perfectly recognized we mean a sentence which has no insertion as well as no deletion errors. The number of sentences perfectly recognized is the same with and without using non-manual information.

- Deletion errors without using non-manual information for n = 6: 39. Accuracy_D = (325 − 39)/325 = 88%.
- Deletion errors after using non-manual information for n = 6: 26. Accuracy_D = (325 − 26)/325 = 92%.
- Deletion errors without using non-manual information for n = 8: 18. Accuracy_D = (325 − 18)/325 = 94.46%.
- Insertion errors without using non-manual information for n = 6: 111.
- Insertion errors after using non-manual information for n = 6: 80.
- The total number of perfectly recognized sentences is 58 out of 125.

7.4 Using Motion Trajectories of the Face to Find `Negation'

`Negation' in a sentence is indicated by a `head shake'. We use the aspect ratio, width and height of the trajectory as the features to recognize the presence of `Negation' in a sentence. The aspect ratio is calculated using the width (W) and height (H) of the motion trajectory of the face for the sentence (Figure 7.11), and is given by:


Figure 7.12. Scatter Plot of Width, Height and Aspect Ratio of Motion Trajectories of the Face. Red circles indicate the sentences which have `Negation' in them, while blue asterisks indicate normal sentences. (a) shows the 3D view, while (b), (c) and (d) show 2D views of the same plot.


    Aspect Ratio = W / H

The scatter plot of width, height and aspect ratio is given in Figure 7.12. It can be seen from Figure 7.12 that the sentences that have `Negation' in them, indicated by red circles, cannot be easily separated from the normal sentences, which are indicated by asterisks in the scatter plot. For a `head shake', there should be some amount of horizontal movement of the face, which sets a minimum width for the motion trajectory of the face, while the vertical movement should be limited, which sets a maximum allowable height for the trajectory. Also, by its very nature, for a `head shake' the width of the motion trajectory of the face should be greater than its height, which means that the aspect ratio should definitely be greater than 1. We use the above reasoning and consider sentences whose motion trajectories have an aspect ratio greater than 1.25, a width greater than 40, and a height less than 50, to have `Negation' in them. Out of the 30 sentences in the database which have `Negation' in them, 27 were correctly recognized, while there were 18 false alarms out of the remaining 95 sentences. The false alarms were mainly because of sentences like `GATE WHERE', which also have motion of the face in the horizontal direction. The missed detections are in sentences that specifically have the sign `ME' in them (for example, `You don't understand me'). This is because in the sign `ME' there is a natural downward movement of the face, and hence the height of the motion trajectory increases considerably, causing the aspect ratio to decrease. If the trajectory is examined temporally, there is a period of time in which there is motion of the head purely in the horizontal direction, indicating `Negation'; but when the motion trajectory of the whole sentence is used, this causes a missed detection. It should be noted that no training was done to classify the motion trajectories of the face.
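The rule can be written down directly; the sketch below takes a translation-compensated trajectory (for example, the output of the face_motion_trajectory sketch in Chapter 6) and applies the thresholds stated above. The function name is ours, and, as noted, no training is involved.

import numpy as np

def has_negation(trajectory, min_aspect=1.25, min_width=40, max_height=50):
    """Rule-based `Negation' check on a face motion trajectory.

    trajectory : array of (row, col) displacements relative to the first frame."""
    traj = np.asarray(trajectory, dtype=float)
    height = traj[:, 0].max() - traj[:, 0].min()   # vertical extent H
    width = traj[:, 1].max() - traj[:, 1].min()    # horizontal extent W
    aspect = width / (height + 1e-8)               # aspect ratio W / H
    return aspect > min_aspect and width > min_width and height < max_height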


CHAPTER 8
CONCLUSION AND DISCUSSION

We presented a framework for continuous sign language recognition. We also looked at a strategy for combining the non-manual information with the manual information to improve the accuracy by decreasing the deletion and insertion errors. The bottom-up approach for continuous ASL recognition explained in this work uses context dependent information from ASL phonology to the least extent. Also, it does not bypass the figure-ground segmentation problem. In addition, it considers the use of facial expressions and face movement to boost the accuracy obtained from the manual information. This bottom-up approach is different from the commonly used HMM-based top-down approaches, which tend to bypass the segmentation problems and combine the context dependent information at the same level of processing.

The motion model based on relational distributions and the Space of Probability Functions works well for recognizing signs in ASL. This can be seen from the number of deletion errors at different values of n: for n = 6 the accuracy based on the number of signs deleted is 88%, while at n = 8 the accuracy is 94.46%. The SoPF representation is invariant to translation, and as it is based on relationships among low level features, it degrades gracefully with low-level missed detections and false alarms.

The use of non-manual information increases the accuracy from 88% to 92%, which is a significant difference. It is difficult to directly use the facial information because of the following reasons:

- Manual images are not synchronized with non-manual images. For example, the same facial expression is not present at the same manual position in two instances of the same sentence.


- Another problem in finding the facial expression related with a sign occurs when there is a strong non-manual indicating `Assertion' or `Negation' in the sentence. In such cases the facial expressions are totally dominated by the face movements, which are indicated by `head shakes' or `head nods'.

Hence we have used a strategy in which we first detect n signs based on manual information, and then try to remove signs from them using non-manual information. This strategy works for two reasons:

- The distances are found within the sentence only for the n signs using the non-manual information. This helps, as the use of non-manual information over the whole vocabulary tends to produce many false alarms.
- Two signs which may be similar in the manual part, because of nearly similar hand motion, can be very different when non-manual information is considered.

Similar signs can be found using the distribution of signs in the sentence; the signs which are similar to each other are detected close to each other. This information can be further used at a higher level, with context and grammatical information from ASL, to reduce the false alarms. The number of sentences that have `Negation' in them and are correctly recognized with the help of the motion trajectories of the face is 27 out of 30.

8.1 Future Work

In this work we have used a simple face detection algorithm. In the future, a far more robust algorithm for face detection can be used. Also, a formalism which can find raised eyebrows and other such features of facial expression can be developed. It would also be interesting to look at changes in accuracy with change in viewpoint for the manual part. Also, depth information can be utilized using stereo images. Finally, the motion trajectories of the face can be trained to find `Negation' in the sentence.


REFERENCES

[1] M. Assan and K. Grobel. Video-based sign language recognition using Hidden Markov Models. In I. Wachsmuth and M. Frohlich, editors, International Gesture Workshop: Gesture and Sign Language in Human-Computer Interaction, volume 1371 of Lecture Notes in Computer Science. Springer, 1998.

[2] B. Bahan and C. Neidle. Non-manual realization of agreement in American Sign Language. Master's thesis, Boston University, 1996.

[3] B. Bauer and H. Hienz. Relevant features for video-based continuous sign language recognition. In International Conference on Automatic Face and Gesture Recognition, pages 440-445, 2000.

[4] B. Bauer, H. Hienz, and K.F. Kraiss. Video-based continuous sign language recognition using statistical methods. In International Conference on Pattern Recognition, volume II, pages 463-466, 2000.

[5] S. Belongie and J. Malik. Matching with shape contexts. In Workshop on Content-Based Access of Image and Video Libraries, pages 20-26, 2000.

[6] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gestures and expressions. In H. Burkhardt and B. Neumann, editors, European Conference on Computer Vision, volume 1406 of LNCS Series, pages 909-924, Freiburg, Germany, 1998. Springer-Verlag.

[7] M.J. Black and A.D. Jepson. EigenTracking: Robust matching and tracking of articulated objects using view-based representation. In European Conference on Computer Vision, pages 329-342, 1996.

[8] A. Bobick and A. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:1325-1337, December 1997.

[9] K.L. Boyer and A.C. Kak. Structural stereo for 3-D vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):144-166, 1988.

[10] D. Brentari. A Prosodic Model of Sign Language Phonology. MIT Press, 2000.

[11] C. Carson, M. Thomas, S. Belongie, J. Hellerstein, and J. Malik. Blobworld: A system for region based image retrieval and indexing. In Third International Conference on Visual Information Systems. Springer, 1999.

[12] M. L. Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6), June 1999.


[13] C. Charanyapan and A. Marble. Image processing system for interpreting motion in the American Sign Language. Journal of Biomedical Engineering, 14:419-425, 1992.

[14] A. Colmenarez and T. Huang. Face detection with information based maximum discrimination, 1997.

[15] Y. Cui, D. Swets, and J. Weng. Learning-based hand sign recognition using SHOSLIF-M. In International Conference on Computer Vision, pages 631-636, 1995.

[16] Y. Cui and J. Weng. View-based hand segmentation and hand-sequence recognition with complex backgrounds. In International Conference on Pattern Recognition, pages 617-621, 1996.

[17] Y. Cui and J. Weng. Appearance-based hand sign recognition from intensity image sequences. Computer Vision and Image Understanding, 78(2):157-176, May 2000.

[18] Y. Cui and J.J. Weng. Hand segmentation using learning-based prediction and verification for hand-sign recognition. In Computer Vision and Pattern Recognition, pages 88-93, 1996.

[19] Y. Cui and J.J. Weng. A learning-based prediction-and-verification segmentation scheme for hand sign image sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):798-804, Aug. 1999.

[20] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.

[21] U. M. Erdem and S. Sclaroff. Automatic detection of relevant head gestures in American Sign Language communication. In International Conference on Pattern Recognition, pages 460-463, 2002.

[22] S. S. Fels and G. E. Hinton. Glove-TalkII: A neural-network interface which maps gestures to parallel formant speech synthesizer controls. IEEE Transactions on Neural Networks, 8(5):977-984, Sept. 1997.

[23] L. G. Farkas and I. R. Munro. Anthropometric Facial Proportions in Medicine. Charles C. Thomas, Springfield, IL, 1987.

[24] W.T. Freeman and M. Roth. Orientation histograms for hand and gesture recognition. In International Workshop on Face and Gesture Recognition, pages 296-301, 1995.

[25] L. Gupta and S. Ma. Gesture-based interaction and communication: Automated classification of hand gesture contours. IEEE Transactions on Systems, Man, and Cybernetics: Part C, 31(1):114-120, Feb. 2001.

[26] J. Hamilton and E. Micheli-Tzanakou. Alopex neural network for manual alphabet recognition. In IEEE Conference on Engineering in Medicine and Biology: Engineering Advances: New Opportunities for Biomedical Engineers, pages 1109-1110, 1994.

[27] A.B. Huet and E.R. Hancock. Line pattern retrieval using relational histograms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(13):1363-1370, 1999.


[28] M. Kadous. Machine recognition of Auslan signs using powergloves: Towards large-lexicon recognition of sign language. In Proceedings of the Workshop on the Integration of Gesture in Language and Speech, pages 165-174, 1996.

[29] A. Kapoor and R. W. Picard. A real-time head nod and shake detector. In Workshop on Perspective User Interfaces, Nov. 2001.

[30] Y. Keselman and S. Dickinson. Generic model abstraction from examples. In Computer Vision and Pattern Recognition, pages I:856-863, 2001.

[31] J.S. Kim, W. Jang, and Z.N. Bien. A dynamic gesture recognition system for the Korean sign language (KSL). IEEE Transactions on Systems, Man, and Cybernetics: Part B, 26(2):354-359, April 1996.

[32] H. Lee and J. Kim. An HMM-based threshold model approach for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:961-973, Oct. 1999.

[33] R.H. Liang and M. Ouhyoung. A real-time continuous gesture recognition system for sign language. In International Conference on Automatic Face and Gesture Recognition, pages 558-567, 1998.

[34] S. K. Liddell and R. E. Johnson. American Sign Language: The phonological base. Sign Language Studies, 64:195-277, 1989.

[35] J. Ma, W. Gao, C. Wang, and J. Wu. A continuous Chinese sign language recognition system. In International Conference on Automatic Face and Gesture Recognition, pages 428-433, 2000.

[36] A. M. Martinez, R. B. Wilbur, R. Shay, and A.C. Kak. Purdue RVL-SLLL ASL database for automatic recognition of American Sign Language. In IEEE International Conference on Multimodal Interfaces, 2002.

[37] R.K. McConnell. Method of and apparatus for pattern recognition. U.S. Patent No. 4,567,610, January 1986.

[38] C. Neidle, J. Kegl, D. MacLaughlin, B. Bahan, and R. Lee. The Syntax of American Sign Language: Functional Categories and Hierarchical Structure. MIT Press, Cambridge, MA, 2000.

[39] C. Neidle, S. Sclaroff, and V. Athitsos. A tool for linguistic and computer vision research on visual-gestural language data. Behavior Research Methods, Instruments, and Computers, 33(3):311-320, Nov. 2001.

[40] I. Robledo. Motion Model Based on Statistics of Feature Relations: Human Identification from Gait. PhD thesis, USF, 2002.

[41] I. Robledo and S. Sarkar. Experiments on gait analysis by exploiting nonstationarity in the distribution of feature relations. In International Conference on Pattern Recognition, volume I, pages 385-388, 2002.

[42] I. Robledo and S. Sarkar. Representation of the evolution of feature relationship statistics: Human gait-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Feb. 2003.

[43] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection, 1998.
[44] H. Sagawa and M. Takeuchi. A method for recognizing a sequence of sign language words represented in a Japanese sign language sentence. In International Conference on Automatic Face and Gesture Recognition, pages 434-439, 2000.
[45] S. Sarkar and I. Robledo. Discrimination of motion based on traces in the space of probability functions over feature relations. In Computer Vision and Pattern Recognition, volume I, pages 976-983, 2001.
[46] B. Schiele and J.L. Crowley. Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1):31-50, 2000.
[47] S. Sclaroff and A.P. Pentland. Modal matching for correspondence and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6):545-561, June 1995.
[48] K. Siddiqi, A. Shokoufandeh, S.J. Dickinson, and S.W. Zucker. Shock graphs and shape matching. International Journal of Computer Vision, 35(1):13-32, 1999.
[49] T. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. Master's thesis, MIT Media Lab., 1995. Also Media Lab VISMOD Tech. Rep. 316.
[50] T. Starner and A.P. Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In Symposium on Computer Vision, pages 265-270, 1995.
[51] T. Starner and A.P. Pentland. Visual recognition of American Sign Language using hidden Markov models. In International Conference on Automatic Face and Gesture Recognition, 1995.
[52] T. Starner and A.P. Pentland. Real-time American Sign Language from video using hidden Markov models. In M. Shah and R. Jain, editors, Motion-Based Recognition, chapter 10. Kluwer Academic Publishers, 1997.
[53] T. Starner, J. Weaver, and A.P. Pentland. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371-1375, December 1998.
[54] W. C. Stokoe. Sign Language Structure: An Outline of the Visual Communication System of the American Deaf. Linstok Press, 1978.
[55] W.C. Stokoe. Sign Language Structure. University of Buffalo Press, 1960.
[56] Kah-Kay Sung and Tomaso Poggio. Example-based learning for view-based human face detection, 1998.
[57] G. J. Sweeney and A. C. Downton. Sign language recognition using a cheremic architecture. In IEE Conference Publication No. 433, pages 483-486, 1997.

[58] J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex backgrounds. In International Conference on Automatic Face and Gesture Recognition, pages 170-175, 1996.
[59] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In International Conference on Computer Vision, pages 363-369, 1998.
[60] C. Vogler and D. Metaxas. Parallel hidden Markov models for American Sign Language recognition. In International Conference on Computer Vision, pages 116-122, 1999.
[61] C. Vogler and D. Metaxas. A framework for recognizing the simultaneous aspects of American Sign Language. Computer Vision and Image Understanding, 81:358-384, 2001.
[62] M. B. Waldron and S. Kim. Isolated ASL sign recognition system for deaf persons. IEEE Transactions on Rehabilitation Engineering, 3(3):261-271, 1995.
[63] C. Wang, W. Gao, and S. Shan. An approach based on phonemes to large vocabulary Chinese sign language recognition. In International Conference on Automatic Face and Gesture Recognition, pages 393-398, 2002.
[64] A. Wilson and A. Bobick. Parametric hidden Markov model for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:884-900, Sept. 1999.
[65] E.J. Wilson and G. Anspach. Applying neural network developments to sign language translation. In IEEE-SP Workshop on Neural Networks for Signal Processing, pages 301-310, 1993.
[66] R.C. Wilson and E.R. Hancock. Graph matching by configurational relaxation. In International Conference on Pattern Recognition, pages B:563-566, 1994.
[67] Y. Wu and T. S. Huang. Vision-based gesture recognition: A review. Lecture Notes in Computer Science, 1739, 1999.
[68] G. Wyszecki and W. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, second edition, 1982.
[69] M. H. Yang, N. Ahuja, and M. Tabb. Extraction of 2D motion trajectories and its application to hand gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:168-185, Aug. 2002.
[70] M. Zhao and F. K. H. Quek. RIEVL: Recursive induction learning in hand gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1174-1185, 1998.