USF Libraries
USF Digital Collections

Diarization, localization and indexing of meeting archives


Material Information

Title:
Diarization, localization and indexing of meeting archives
Physical Description:
Book
Language:
English
Creator:
Vajaria, Himanshu
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2008
Subjects

Subjects / Keywords:
Speaker localization
Speaker diarization
Audio-visual association
Meeting indexing
Multimedia analysis
Dissertations, Academic -- Computer Science & Engineering -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Summary:
ABSTRACT: This dissertation documents the research performed on the topics of localization, diarization and indexing in meeting archives. It surveys existing work in these areas, identifies opportunities for improvement and proposes novel solutions for each of these problems. The framework resulting from this dissertation enables various kinds of queries, such as identifying the participants of a meeting, finding all meetings for a particular participant, locating a particular individual in the video and finding all instances of speech from a particular individual. Also, since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can be easily ported to other domains of multimedia analysis. Speaker diarization involves determining the number of distinct speakers and identifying the durations when they spoke in an audio recording. We propose novel solutions for the segmentation and clustering sub-tasks, based on graph spectral clustering. The resulting system yields a diarization error rate of around 20%, a relative improvement of 16% over the currently popular diarization technique, which is based on hierarchical clustering. The most significant contribution of this work lies in performing speaker localization using only a single camera and a single microphone by exploiting long term audio-visual co-occurrence. Our novel computational model allows identifying regions in the image belonging to the speaker even when the speaker's face is non-frontal and even when the speaker is only partially visible. This approach results in a hit ratio of 73.8%, compared to a mutual information (MI) based approach which results in a hit ratio of 52.6%, illustrating its suitability in the meeting domain. The third problem addresses indexing meeting archives to enable retrieving all segments from the archive during which a particular individual speaks, in a query by example framework. By performing audio-visual association and clustering, a target cluster is generated per individual that contains multiple multimodal samples for that individual, to which a query sample is matched. The use of multiple samples results in a retrieval precision of 92.6% at 90% recall, compared to a precision of 71% at the same recall achieved by a unimodal, unisample system.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2008.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Himanshu Vajaria.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 105 pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002001109
oclc - 319595333
usfldc doi - E14-SFE0002581
usfldc handle - e14.2581
System ID:
SFS0026898:00001




Full Text



Diarization, Localization and Indexing of Meeting Archives

by

Himanshu Vajaria

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Co-Major Professor: Rangachar Kasturi, Ph.D.
Co-Major Professor: Sudeep Sarkar, Ph.D.
Dmitry Goldgof, Ph.D.
Ravi Sankar, Ph.D.
Thomas Sanocki, Ph.D.

Date of Approval: February 21, 2008

Keywords: speaker localization, speaker diarization, audio-visual association, meeting indexing, multimedia analysis

Copyright 2008, Himanshu Vajaria

DEDICATION

To my parents.

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank Prof. Ranga Kasturi for providing me the opportunity to work with him. I am most appreciative of his patience, guidance and emotional support during difficult times. He has not only inspired me to become a better researcher but also to become a better person. This work would not have been possible without the guidance of Prof. Sarkar. His fervor for research and tendency to think outside the box has always inspired me. I would also like to thank my committee members - Prof. Dmitry Goldgof, Prof. Ravi Sankar and Prof. Thomas Sanocki - for their time, valuable suggestions and fresh perspectives on the subject. Without my peers and friends in the Vision Lab - Vasant Manohar, Padmanabhan Soundararajan, Pranab Mohanty and Tanmoy Islam - the years spent pursuing this degree would have felt much longer. On a personal front, I would like to thank my mother, who reminds me every day that nothing is impossible, and to eat my vegetables. My father, for teaching me early on that there is no shortcut to success. My best friend Sachin Rathod, for always being there. Finally, I thank my wife, Kadambari, who has always stood by me and borne the brunt of my travails.

NOTE TO READER

The original of this document contains color that is necessary for understanding the data. The original dissertation is on file with the USF library in Tampa, Florida.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 Motivation and Overview
  1.2 Scope and Contributions
  1.3 Outline of Dissertation
CHAPTER 2 SPEAKER DIARIZATION USING ONLY AUDIO
  2.1 Related Work
  2.2 Proposed Approach
  2.3 Segmentation
  2.4 Clustering
  2.5 Results
    2.5.1 Segmentation
    2.5.2 Clustering
  2.6 Summary and Conclusions
CHAPTER 3 AUDIO-VISUAL DIARIZATION AND LOCALIZATION
  3.1 Related Work
  3.2 Atomic Temporal Primitives
  3.3 Clustering and Localization
    3.3.1 Intermediate Clusters
    3.3.2 Audio and Video Models
    3.3.3 Iterative Clustering
  3.4 Results
    3.4.1 Dataset and Performance Measures
    3.4.2 Localization
    3.4.3 Comparison With Mutual Information Based Localization
    3.4.4 Diarization
  3.5 Summary and Conclusion
CHAPTER 4 INDEXING MEETING ARCHIVES
  4.1 Evaluation of Biometric Algorithms
    4.1.1 Description of a Typical Biometric System
    4.1.2 Database
    4.1.3 Methods
    4.1.4 Results and Discussion
  4.2 Identification in Meeting Videos
    4.2.1 Meeting Segmentation
    4.2.2 Audio-Visual Association
    4.2.3 Modeling Individual Score Distributions
    4.2.4 Results
  4.3 Summary and Conclusions
CHAPTER 5 SUMMARY AND CONCLUSIONS
REFERENCES
LIST OF PUBLICATIONS
ABOUT THE AUTHOR

LIST OF TABLES

Table 2.1 A Brief Review of Previous Work Dealing With Audio Based Speaker Diarization, Localization and Tracking.
Table 2.2 Meeting Dataset and Associated Meta-Data for the Sixteen Meetings Used in the Experiments.
Table 3.1 A Brief Review of Previous Work Dealing With Joint Audio-Visual Diarization, Localization and Tracking.
Table 4.1 Summary of Existing Literature Dealing With Biometric Recognition Using Face and Voice.

LIST OF FIGURES

Figure 1.1 Aspects of Meeting Analysis.
Figure 2.1 System Flowchart for Audio Based Speaker Diarization.
Figure 2.2 Changepoint Detection Using the Bayesian Information Criterion.
Figure 2.3 Illustration of Segmentation.
Figure 2.4 Graph Spectral Clustering Framework.
Figure 2.5 Segmentation Performance.
Figure 2.6 Clustering Performance Using the Cluster Purity Metric.
Figure 2.7 Clustering Performance Using the DER Metric.
Figure 3.1 System Flowchart for Joint Localization and Diarization.
Figure 3.2 Graph Spectral Clustering Framework.
Figure 3.3 Illustration of the Eigen-Blob Method for Speaker Localization.
Figure 3.4 Illustration of the Mutual Information Method for Speaker Localization.
Figure 3.5 The NIST Meeting Room Setup.
Figure 3.6 A Sample Image for Each of the Meetings Recorded by Camera 1.
Figure 3.7 A Sample Image for Each of the Meetings Recorded by Camera 2.
Figure 3.8 Localization of Speakers in the First Camera View.
Figure 3.9 Localization of Speakers in the Second Camera View.
Figure 3.10 Localization Performance Using the Eigen-Blob and Mutual Information (MI) Based Methods.
Figure 3.11 Comparison of Diarization Performance Using Audio-Only and Audio-Video Information.
Figure 4.1 Comparison Between Query by Keyword and Query by Example.
Figure 4.2 Components of a Typical Biometric System.
Figure 4.3 Camera Setup for Indoor Collection of Face Images.
Figure 4.4 Sample Indoor and Outdoor Face Images.
Figure 4.5 Performance of Individual Modalities in Different Environments.
Figure 4.6 Results for Intramodal Fusion for Face in the Indoor-Outdoor Scenario.
Figure 4.7 Effect of Score Normalization on Face Probability Densities.
Figure 4.8 Results for Intramodal Fusion for Voice in the Indoor-Outdoor Scenario.
Figure 4.9 Effect of Score Normalization on Voice Probability Densities.
Figure 4.10 Results of Multimodal Fusion in the Indoor-Outdoor Scenario.
Figure 4.11 System Schema for Meeting Indexing.
Figure 4.12 Retrieval Performance in the Sample-Sample Matching Framework.
Figure 4.13 Retrieval Performance in the Sample-Cluster Matching Framework.

LIST OF ALGORITHMS

Algorithm 2.1 Three Subwindow Segmentation Algorithm

DIARIZATION, LOCALIZATION AND INDEXING OF MEETING ARCHIVES

Himanshu Vajaria

ABSTRACT

This dissertation documents research performed in the areas of localization, diarization and indexing in meeting archives. It surveys existing work in these areas, identifies opportunities for improvement and proposes novel solutions for each of these problems. The framework resulting from this dissertation enables various kinds of queries, such as identifying the participants of a meeting, finding all meetings for a particular participant, locating a particular individual in the video and finding all instances of speech from a particular individual. Also, since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can be easily ported to other domains of multimedia analysis.

Speaker diarization involves determining the number of distinct speakers and identifying the durations when they spoke in an audio recording. We propose novel solutions for the segmentation and clustering sub-tasks, based on graph spectral clustering. The resulting system yields a diarization error rate of around 20%, a relative improvement of 16% over the currently popular diarization technique, which is based on hierarchical clustering.

The most significant contribution of this work lies in performing speaker localization using only a single camera and a single microphone by exploiting long term audio-visual co-occurrence. Our novel computational model allows identifying regions in the image belonging to the speaker even when the speaker's face is non-frontal and even when the speaker is only partially visible. This approach results in a hit ratio of 73.8%, compared to an MI based approach which results in a hit ratio of 52.6%, illustrating its suitability in the meeting domain.

The third problem addresses indexing meeting archives to enable retrieving all segments from the archive during which a particular individual speaks, in a query by example framework. By performing audio-visual association and clustering, a target cluster is generated per individual that contains multiple multimodal samples for that individual, to which a query sample is matched. The use of multiple samples results in a retrieval precision of 92.6% at 90% recall, compared to a precision of 71% at the same recall achieved by a unimodal, unisample system.

CHAPTER 1
INTRODUCTION

1.1 Motivation and Overview

Meetings are an important and frequently occurring event in our daily lives, where information is disseminated, ideas are discussed and decisions are taken. They are a crucible for complex human interactions, where a variety of individual behaviors and group interactions manifest themselves. Thus researchers from various fields such as behavioral psychology, human computer interface design, computer vision and signal processing have focused efforts on analyzing meetings. In recent years, with the advent of smart meeting spaces, vast amounts of meeting data are recorded. The overwhelming amount of information available from such recordings underscores the need for systems that can automatically analyze meetings.

As shown in Figure 1.1, the thorough analysis of a meeting involves answering questions such as "who?", "when?", "where?", "which?", "what?" and "why?". This dissertation focuses on the "who?", "when?" and "where?" questions. Determining who attended the meeting involves ascertaining the identity of each person in the meeting based on their face and speech samples, a task of biometric recognition. The task of finding the different instances when a participant spoke in the meeting by analyzing the audio/visual signal is called speaker diarization, and finding where the person is located in the video frame is a task of speaker localization.

Figure 1.1 Aspects of Meeting Analysis. The figure shows questions pertaining to various aspects of meeting analysis along with sample queries. The questions addressed by this dissertation are marked by a red rectangle (dotted). The specific chapter(s) dealing with a question are also listed.

Solving these three elementary problems enables returning results for a broad range of queries and is also a prerequisite for various other tasks. As an example, speaker diarization is required for generating intelligible transcripts. Also, valuable insights regarding the meeting, such as the mood of the meeting and the type of the meeting, can be gleaned by analyzing the speech durations and speaker turn patterns. Localizing the current speaker is required for panning the camera to the current speaker and also to highlight only that region of the video where the person is seated. Similarly, identifying the meeting participants is required for generating the meeting roster and also to answer queries for meetings which a particular individual attended.

The diarization problem has been viewed mostly as an audio-only processing task, where the number of speakers and the durations when they speak are found by analyzing the audio. Similarly, localization, which involves finding the image region where the person is seated, is usually viewed as a video processing task. However, since humans perceive the world through multiple senses in an integrated manner, it is only natural that automated systems that desire to achieve a similar cognition of human interactions adopt a multimodal approach. For example, it is hypothesized that speech and movements are produced in conjunction to express the same thought. Thus a person exhibits more movement when speaking than when listening. Such information can be used to aid both the speaker diarization and localization tasks, which is the approach taken in this work.

Much research has been carried out on meeting data that has been recorded in a special room rigged with multiple microphones and multiple cameras. This is possible when the meeting is recorded in a specially rigged smart room. In contrast, this work focuses on meetings recorded with a constrained setup consisting of a single camera and a single microphone. The motivation behind using such a simple setup is that it is extremely portable - in essence, any device with a camera and a microphone, suitably placed, can be used to record a meeting. This enables recording ad-hoc meetings in different rooms using a laptop and then uploading them to a central archive. The use of a constrained setup raises new problems that require radically different solutions than those already proposed in the literature, making research in this area exciting.

Some recent works have addressed the problem of audio-visual association in single camera, single microphone recordings, all of which focus on finding instantaneous relations between the audio and video signals. These approaches have been demonstrated on short video clips in which there are a few (usually two) speakers facing the camera. However, as shown in this dissertation, these approaches do not work well in the meeting room scenario because the faces are not always frontal and hence an instantaneous relation between audio and video does not always exist. The approach taken in this work instead focuses on finding long term audio-visual co-occurrences, which is better suited to the meeting room domain.

In summary, this work develops a framework to index meetings by jointly analyzing the audio-visual signals to solve core underlying tasks such as determining who spoke when, and the identity and location of the speaker. The indexing enables different kinds of queries, such as query by discussion topic, for an individual's comments or for activities (such as note-taking, drawing on the board, etc.), and for searching all meetings attended by a particular individual. The core contributions and the scope of this work are described in Section 1.2 and the outline of this dissertation is presented in Section 1.3.

1.2 Scope and Contributions

In this dissertation, the following assumptions are made regarding the meetings:

- The meeting involves multiple participants, but the number of participants is not known a priori.
- The meeting is recorded using only a single stationary camera and a single microphone, and these streams are time synchronized.
- No assumptions are made regarding the placement of the microphone and the camera. In particular, because the participants face each other, all faces may not be visible to the camera.
- The participants stay seated in their chairs throughout each meeting.
- The nature of the meetings is unknown - i.e., they may be group discussions, debates, brainstorming sessions or any other such kind of meeting.
- The system does not require training to learn parameters - i.e., all parameters are automatically learned from the test data on the fly, and different meetings will be processed with different parameters.

The core aspects of this work are:

- Joint Audio-Visual Diarization and Localization: The major contribution of this work is the synergetic use of audio and video to perform tasks that have traditionally been approached via a single modality. In particular, at the time this work was performed, most approaches to speaker diarization relied solely on audio analysis. Similarly, typical approaches to person localization were based on features extracted only from the video signal. In this work, the diarization and localization are performed jointly, in an iterative manner, by combining information from both the audio and the video streams.
- Indexing Meeting Archives: The second major contribution of this work is a novel framework for indexing meeting archives using a query by example framework. The framework allows a single face or voice sample to be submitted as the query. By performing audio-visual association, samples from the missing modality are found and this expanded sample set is used as the query. Meetings in the archive are processed to group extracted face and speech samples and generate clusters of face-speech samples per user. The availability of multiple samples per user allows generating user-specific subspaces that result in better retrieval. At the same time, the query sample can be matched to all samples in the cluster to make a decision regarding the cluster, rather than matching the query to individual samples in the database. This leads to a more robust decision. The amalgamation of these three ideas leads to better retrieval performance, in essence by casting the identification problem as a clustering problem.

1.3 Outline of Dissertation

This dissertation is split into three main chapters. It has been written with a view to keeping each of these chapters modular, so that the interested reader may skip directly to a chapter of interest. A brief description of what is to be found in each chapter follows.

Chapter 2 takes a look at the task of speaker diarization using only the audio stream. It surveys and briefly compares the two major approaches for speaker diarization. The segmentation-clustering approach, under which our proposed algorithm falls, is discussed in detail. This is followed by a description of the algorithm and its performance on the NIST meeting room pilot corpus.

Chapter 3 carries the major contribution of this work - a proposed algorithm for joint audio-visual speaker diarization and localization. The chapter begins with a survey of various Mutual Information based techniques that have been applied to the task of finding audio-visual associations. It highlights how the meeting room domain differs from the domains in which these methods have been applied and discusses why these algorithms fail in the meeting domain. The proposed algorithm for audio-guided video localization and video-enhanced speaker diarization is then presented. A quantitative comparison between the Mutual Information approach and the proposed algorithm is performed by testing these approaches on the NIST meeting room pilot corpus.

Chapter 4 describes the approach used to index meeting archives. A novel framework is presented that casts the identification problem as a clustering and verification problem. Since sample variations affect the verification performance of biometric algorithms, a study quantifying this effect is presented first. This is followed by a description of the three aspects of the framework. The first involves expanding the search query by automatically associating samples from a missing modality. The second is the clustering of multimodal samples into target clusters to enable a sample-to-cluster match. The third is the generation of user-specific subspaces to generate similarity scores, which obviates the need for normalization and for learning user-specific weights for fusion. The proposed indexing scheme is evaluated for the task of extracting speaker segments from the NIST meeting archive.

Chapter 5 carries the conclusion of this dissertation and highlights the contributions of the research. It also proposes improvements to the proposed approaches and highlights promising areas for future work in the meeting analysis domain.

CHAPTER 2
SPEAKER DIARIZATION USING ONLY AUDIO

Speaker diarization can be loosely defined as determining "who spoke when?" from an audio signal containing speech from multiple people. This topic has been of much interest to the signal and speech processing communities, with most of the emphasis laid on diarization of broadcast news and telephone recordings [16,41,43,64,83,92]. This chapter deals with diarization in meeting rooms, which is, in comparison, a fairly new domain. However, it has received much attention because diarization is a key task in the analysis of meetings. For instance, it is a prerequisite for enabling queries such as "who said what?" as well as for higher level tasks such as generating intelligible transcripts. In addition, the analysis of speaker turn patterns is an integral part of meeting analysis, as it provides information regarding the mood and the type of a meeting.

Most of the proposed systems [10,83,94,118] involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers or their identities are unknown a priori. A very commonly used method is called bottom-up clustering or hierarchical agglomerative clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. This obviously requires determining the initial clusters, which in itself is a difficult problem. The typical approach to determining the initial clusters is to partition the recording into contiguous segments such that each segment contains speech from only a single person. This task is commonly referred to as speaker segmentation or changepoint detection.

The proposed approach also follows the segmentation-clustering paradigm. First, a silence detector is used to discard durations of silence from the recording. Then segmentation is performed using a modification of the popular Bayesian Information Criterion (BIC) based changepoint detector. Segments from the same speaker are then clustered together, but the hierarchical clustering algorithm is not used because it has some drawbacks. First, due to the spontaneous nature of speech in meeting room data, the segment sizes are usually small and do not yield robust models. Second, the approach is locally greedy, as merging is performed based only on the distance between two segments. Third, the algorithm has significant computational complexity, as it requires creating a new mixture model and finding the distance between the new model and all existing models at each iteration. In meeting conversations, where the number of initial segments is typically high, this approach is slow.

Instead, we propose a two level clustering algorithm, in which the model complexity is increased based on the cluster size. The first level uses a graph spectral clustering scheme to cluster together the audio segments, which are modeled using unimodal Gaussians with a full covariance matrix. This is more robust than agglomerative clustering, as graph spectral clustering takes into account the distances between all segments. The output of this stage are intermediate clusters that have low coverage (do not contain all segments from an individual) but are pure (all segments within the cluster belong to the same individual). In the second stage, since the intermediate clusters contain enough data, they can be reliably modeled using Gaussian Mixture Models. Also, since the number of intermediate clusters is much smaller than the original number of segments (typically by a factor of 10), the agglomerative clustering scheme can be used without drastic speed reduction. The proposed approach compares favorably to the hierarchical clustering algorithm, as is quantitatively illustrated in Section 2.5.

The rest of this chapter is structured as follows. Section 2.1 surveys the state of the art in speaker diarization. The segmentation and clustering modules are discussed in Sections 2.3 and 2.4 respectively. Section 2.5 presents the segmentation and clustering results, comparing this system to one that uses agglomerative clustering. In addition, this section also presents a study to uncover factors that explain the variation in performance across different meetings.

2.1 Related Work

This section reviews previous works that perform speaker diarization using either a single microphone or multiple microphones. Table 2.1 categorizes these approaches based on whether they are designed for a single or multiple microphone setup, the problems they address, the assumptions they make, the computational model and the datasets on which they illustrate their performance. Although the review gives greater attention to the single microphone scenario, as it compares directly to this work, the multiple microphone scenario is briefly reviewed for completeness.

Table 2.1 A Brief Review of Previous Work Dealing With Audio Based Speaker Diarization, Localization and Tracking.

Works | Sensor Setup | Problems Addressed | Assumptions | Method | Typical Datasets
[6,43,64,68,92] | single Microphone (sM) | Speaker diarization | Non-overlapping speech | Spectral clustering | NIST meeting corpus [33]
[29,60,112,122] | multiple Microphones (mM) | Speaker localization, tracking | - | Time delay of arrival | Meeting Room data from NIST [33], CLEAR [100]
[109,110] (This work) | single Microphone (sM) | Speaker diarization | Non-overlapping speech | Two-stage spectral clustering | NIST Meeting Room data [33]

Speaker diarization in a meeting room scenario has received much attention, warranting large scale evaluations such as the NIST speaker recognition evaluations [33]. The task definition involves identifying the number of distinct speakers in the speech recording and the times during which they spoke. The NIST meeting setup actually involves multiple microphones at different locations, but one of the optional evaluation scenarios is performing speaker diarization with a single distant microphone.

Even for the multiple microphone condition, most methods use delay and sum beamforming to obtain a single enhanced channel with an improved signal to noise ratio (SNR), and the time-delay information between individual microphones is not used past this stage (e.g., for localization). Taking this into consideration, although multiple microphones are used, for the purpose of comparison we consider this to be the case of a single microphone (sM), as no localization information is used. Since the dataset available for this work contained only such a single enhanced channel, the audio aspect of our work falls in this category.

A good overview of audio based approaches to speaker diarization is provided in [105]. In general, two distinct approaches exist for speaker diarization. The first is a sequential approach where speaker change points are detected in the audio stream and then speech segments belonging to the same speaker are clustered together. In the second approach, the change points and clusters are jointly estimated in an HMM framework. In most audio works, the speech signal is represented by LPC or MFCC features and Gaussian mixtures are used to model the signal.

The sequential approach, introduced by [3], is the most widely used approach [6,43,92] and has been shown to perform well at the NIST evaluations. A typical sequential approach has been described in [68], which consists of three steps. The first step is to determine the segment boundaries, which is usually done by computing the BIC criterion at each time instant and finding peaks in the resultant BIC curve. The peaks indicate time instants where a speaker change is most likely to have occurred. Once segment boundaries are obtained, the next step is to model the segments, which is typically done using Gaussian Mixture Models (GMM). For robustness, the entire speech segment is first modeled by a GMM which is then adapted for each segment using the maximum a posteriori (MAP) criterion. The final step involves clustering the segments by an agglomerative clustering scheme. The distance between each pair of segments is computed, usually using a generalized likelihood ratio, and the closest two segments are merged. The merged segments are remodeled by a new GMM and the distances between the new segment and other segments are computed. This iterative clustering continues until a stopping criterion is met. The stopping criterion typically uses a likelihood ratio measure, such as the BIC.

In the second approach, change points and clusters are jointly estimated in an HMM framework. The joint approach to speaker segmentation involves initially modeling the entire speech recording by a single GMM, assuming only one speaker, and then iteratively adding speaker models till a stopping criterion is met. This approach, known as Evolutive HMM [63], involves building an initial GMM from the entire speech segment. Then a short speech segment (about 3 seconds) is chosen such that it maximizes the likelihood ratio between the current speaker model and an alternative background model obtained from a training set. This segment is used for adapting a new speaker model. A Viterbi decoding is carried out to resegment the initial speech segment, followed again by an adaptation stage. This resegmentation-adaptation is carried out iteratively until the segmentation does not change. Then another three seconds of speech are sought that maximize the likelihood ratio between the current speech and existing models, and this segment is used to adapt the model for a new speaker. The process stops if the overall likelihood does not increase by the addition of a new speaker, or if there is no other 3 second interval available for training another model.

Most approaches that use multiple microphones (mM) to localize the source can be classified as those based on estimating the time delay of arrival or those based on direct methods [117]. Approaches of the first type rely on the time delay of arrival (TDOA) principle, i.e. the fact that sound produced at one instant will be heard at different instants at different microphones. When multiple pairs of microphones are available, the source location can be inferred using the various pairwise delays if the sensor geometry is known precisely. In these methods, the first step is to estimate the delay between each pair of microphones, usually using a cross correlation of the spectrogram. A popular method to do this is the generalized cross correlation phase transform GCC-PHAT [57], as it is resilient to reverberations. Knowing the geometry of the individual microphones and the delay between each pair, a Linear Intersection Algorithm (LIA) [13] is used to estimate the location of the source. The works in [29,112] are based on this method.

Direct methods perform source localization based on beamforming [58] or steered response power (SRP), where the microphone array is steered through various angles to find the angle that maximizes the summed response [60,122]. Although works in this category focus on their 3D localization and tracking performance, the extension to speaker diarization is simple. For situations in which the participants stay seated in the same location, the instantaneous location estimates can be clustered in the spatial domain. Speech frames originating from distinct spatial clusters can be assumed to belong to different speakers, effectively performing speaker diarization.

Tracking the speaker over time involves filtering the instantaneous location estimates provided by these methods using a Kalman or Particle Filter formulation. The tracking is performed using either the Kalman filter or its extensions, such as the extended Kalman filter (EKF), unscented Kalman filter and iterative extended Kalman filter (IEKF). Another approach has been the use of Particle Filters (PF) for tracking. The next section describes the proposed diarization method, which follows the segmentation-clustering paradigm.

2.2 Proposed Approach

The first step of the proposed diarization method involves detecting and discarding durations of silence from the recording. The speech signal is then segmented into homogeneous segments - i.e. segments containing speech from only a single individual. Next, individual segments are clustered so that each cluster contains all the speech from only a single speaker and the number of clusters is equal to the number of speakers. The clustering is done in two stages. First, the segments are modeled using unimodal Gaussians and graph spectral clustering is used to derive intermediate clusters. The intermediate clusters are then modeled using Gaussian Mixture Models and iteratively merged using a hierarchical agglomerative clustering scheme to obtain the final clusters.

The flowchart in Figure 2.1 illustrates the method. The Bayesian Information Criterion (BIC) is used to segment the audio stream into homogeneous segments, i.e. durations of the clip where the audio information is similar. Graph spectral clustering is then used to obtain intermediate clusters by fusing segments with similar audio characteristics. Since the intermediate clusters contain large enough amounts of data, they can be used to robustly estimate audio models. In the second step of merging, a hierarchical agglomerative clustering scheme is used to determine the final clusters. Here, the closest pair of clusters is merged and a new audio model is derived for the resulting cluster. The agglomerative clustering process continues iteratively until a BIC based stopping criterion is met. Ideally there should be one cluster per speaker that contains all speech uttered by the speaker.

Figure 2.1 System Flowchart for Audio Based Speaker Diarization. The Bayesian Information Criterion (BIC) is used to segment the audio stream into segments with similar audio characteristics. The first stage of clustering is performed via graph spectral clustering to find durations of speech from a single speaker. Gaussian Mixture Models are built from the intermediate clusters and the final clusters are obtained via agglomerative clustering. Ideally, each of the final clusters should contain all the speech from each individual speaker.

Section 2.3 describes the speech/silence detection and segmentation procedures. The two step clustering procedure is described in Section 2.4.
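The two-stage pipeline just described can be summarized in a few lines of code. The following is a minimal sketch, not the dissertation's own implementation; every helper function in it (is_silence, bic_segment, spectral_cluster, agglomerative_merge) is a hypothetical placeholder for a stage that is elaborated in Sections 2.3 and 2.4.

    # Minimal sketch of the proposed two-stage diarization pipeline.
    # All helpers are hypothetical placeholders for the stages described
    # in Sections 2.3 and 2.4.

    def diarize(frames):
        """Map a stream of audio feature frames to one cluster per speaker."""
        # 1. Detect and discard durations of silence from the recording.
        speech = [f for f in frames if not is_silence(f)]          # hypothetical
        # 2. BIC-based change-point detection yields homogeneous segments.
        segments = bic_segment(speech)                             # hypothetical
        # 3. First stage: unimodal Gaussian per segment, graph spectral
        #    clustering into intermediate clusters.
        intermediate = spectral_cluster(segments, n_clusters=16)   # hypothetical
        # 4. Second stage: GMM per intermediate cluster, agglomerative
        #    merging until a BIC-based stopping criterion is met.
        return agglomerative_merge(intermediate)                   # hypothetical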

2.3 Segmentation

The first step of processing involves detecting speaker change points, i.e. time instants when a person stops or starts speaking. The audio signal was sampled at 16 kHz and the Mel-frequency cepstral coefficients (MFCCs) were extracted using 32 filters and a bandwidth ranging from 166 Hz to 4000 Hz. These settings were chosen based on an empirical study [35] which found them to yield good results on the speaker verification task. The MFCC features were extracted at a rate of 30 Hz, instead of the commonly used 100 Hz, to maintain consistency with the video frame rate for the audio-visual experiments that are described in Chapter 3. The dimensionality of the audio features was reduced by projecting them into a PCA subspace. Let A(t) = [a_1(t), a_2(t), ..., a_m(t)]^T, where a_1(t), a_2(t), ..., a_m(t) are the MFCCs projected in a PCA space.
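As a concrete illustration of these settings, the sketch below extracts the features using the third-party librosa and scikit-learn libraries. The dissertation does not name an implementation; the file name, the use of twelve coefficients (taken from the Figure 2.2 caption), and the number of retained PCA components are assumptions made here for illustration.

    import librosa
    import numpy as np
    from sklearn.decomposition import PCA

    # Load the enhanced audio channel at 16 kHz (file name is hypothetical).
    y, sr = librosa.load("meeting.wav", sr=16000)
    hop = sr // 30  # frame shift giving the ~30 Hz frame rate used in the text

    # Twelve MFCCs from 32 mel filters over the 166-4000 Hz band.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_mels=32,
                                fmin=166.0, fmax=4000.0, hop_length=hop)

    # Project the frames into a PCA subspace; the output dimensionality m
    # is not specified in the text, so 10 is an assumed value.
    A = PCA(n_components=10).fit_transform(mfcc.T)  # shape: (frames, m)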

The segmentation module ideally partitions the timeline into non-overlapping contiguous segments such that each segment contains time instances where only a single person speaks. Although a substantial speaker overlap may exist in some meetings, the formulation assumes non-overlapping speech. The problem is formalized as follows. Consider the set {A}, each member of which represents the MFCC feature vector A(t), i.e. A = ∪ A(t), where t are the discrete time instants represented by the set {t} = [1, 2, ..., T]. Let {I} represent the set of N speakers in the meeting, {I} = [1, 2, ..., N], and let G(t) ∈ I represent the true identity of the speaker at time t. Segmentation involves partitioning the set {t} into contiguous, non-overlapping subsets (segments) {s_k}, such that {t} = ∪ {s_k}, where each subset consists of contiguous time instants, {s_k} = [t_{k-1}, ..., t_k), with k = [1, 2, ..., nS] for nS segments. The set {S} = ∪ s_k of segments partitions the original timeline {t} to maximize the objective function in Equation 2.1:

    SP = \frac{1}{T} \sum_{s_k \in \{S\}} \max_{i \in I} \sum_{t \in s_k} \delta(G(t) - i)    (2.1)

where \delta(i) = 1 if i = 0 and 0 otherwise. The inner term counts the number of frames belonging to the dominant speaker in the current segment and the outer summation sums this over all segments. The total sum is normalized by the number of frames in the recording, T. If the segmentation is ideal, each segment will contain frames from only one speaker and the segmentation purity (SP) will be 1.

Although Equation 2.1 measures the overall purity of the segments (SP), it does not take into account the number of segments generated, and so a partitioning where each frame is a segment will have perfect performance. However, this does not happen in practice, and so, although flawed, the metric is a useful indicator of the segmentation performance. Also, oversegmentation is not a critical issue because of the clustering stage that follows.
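A direct transcription of Equation 2.1 might look as follows; this is a sketch, not code from the dissertation, and it assumes the ground-truth speaker identities G(t) are encoded as non-negative integers.

    import numpy as np

    def segment_purity(G, boundaries):
        """Segment purity SP (Equation 2.1).

        G:          array of true speaker ids G(t), one per frame (ints >= 0).
        boundaries: sorted change-point indices t_k, including 0 and len(G).
        """
        T = len(G)
        dominant = 0
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            # Count the frames of the dominant speaker within this segment.
            dominant += np.bincount(G[start:end]).max()
        return dominant / T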

The task of segmentation is now posed as one of finding the time instants t_k that mark the beginnings (and endings) of segments. To determine whether a time instant t corresponds to a change-point, a set of frames A_{W1} = (x_{t-T_w}, x_{t-T_w+1}, ..., x_{t-1}) preceding t is compared to a set of frames A_{W2} = (x_{t+1}, x_{t+2}, ..., x_{t+T_w}) following it (where T_w is the window size). The frames are modeled parametrically, and if the two sets are deemed to be generated by different models, the time instant t represents a change-point. If we assume a unimodal Gaussian model for each speaker, i.e. A_i ~ N(\mu_i, \Sigma_i), then the null hypothesis can be expressed as:

    H_0: (x_{t-T_w}, ..., x_t, ..., x_{t+T_w}) ~ N(\mu_c, \Sigma_c)

The alternate hypothesis is that two different models are needed to explain the data in each window:

    H_1: (x_{t-T_w}, ..., x_t) ~ N(\mu_1, \Sigma_1) and (x_{t+1}, ..., x_{t+T_w}) ~ N(\mu_2, \Sigma_2)

Assuming that the frames x(t) are independent and identically distributed with a unimodal distribution, the hypothesis that gives the maximum likelihood is chosen. Choose H_0 if

    p(x_{t-T_w}, ..., x_t, ..., x_{t+T_w} | \mu_c, \Sigma_c) > p(x_{t-T_w}, ..., x_t | \mu_1, \Sigma_1) \cdot p(x_{t+1}, ..., x_{t+T_w} | \mu_2, \Sigma_2)

However, because H_1 uses more model parameters, it is much more likely to fit the data better. For an appropriate comparison, the number of model parameters must be taken into account. The Bayesian Information Criterion (BIC) does exactly this by penalizing the log likelihood of each model by a penalty term proportional to the number of model parameters. The BIC [90] is a model selection criterion; given a set of features A = (A_1, A_2, ..., A_N), it selects the model that maximizes the likelihood of the data. Since the likelihood increases with the number of model parameters, a penalty term proportional to the number of parameters d is introduced to favor simpler models. The BIC for a model M with parameters \theta is defined as

    BIC(M) = \log p(A | \theta) - \frac{1}{2} \lambda d \log N    (2.2)

where \lambda = 1 and N is the number of feature vectors.

The task of change-point detection is cast into the model selection framework. We want to determine whether the entire feature set A is better represented by a single model M_C, or whether there exist two models M_1 and M_2 that can better represent the feature sets A_{W1} = (A_1, A_2, ..., A_i) and A_{W2} = (A_{i+1}, A_{i+2}, ..., A_N) respectively.

Defining \Delta BIC to be BIC(M_1, M_2) - BIC(M_C), and under the assumption that A follows a Normal distribution, we have

    \Delta BIC(i) = \frac{N}{2} \log |\Sigma_A| - \frac{i}{2} \log |\Sigma_{A_{W1}}| - \frac{N-i}{2} \log |\Sigma_{A_{W2}}| - \frac{1}{2} \lambda \left( d + \frac{d(d+1)}{2} \right) \log N    (2.3)

A positive value of \Delta BIC justifies the alternative hypothesis and suggests that the time instant i is a change-point.

The use of the Bayesian Information Criterion (BIC) for speaker diarization was first suggested in [19]. It was shown in [21] that this method outperforms segmentation methods based on the symmetric Kullback-Leibler distance (KL2) and the generalized likelihood ratio. Various modifications have since been suggested to improve the speed and robustness of the original algorithm [21,106]. In this work we propose a different modification to the original technique [19] that ensures continuity of the BIC curve, in addition to achieving better speed and reliability in detecting short segments.
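Equation 2.3 translates directly into a few lines of NumPy. The sketch below is an illustration of the criterion, not the dissertation's own code, and it assumes both sub-windows contain enough frames for full-rank covariance estimates.

    import numpy as np

    def delta_bic(X, i, lam=1.0):
        """Delta-BIC of Equation 2.3 for a candidate change point at frame i.

        X: (N, d) array of feature frames in the analysis window.
        A positive return value suggests frame i is a speaker change point.
        """
        N, d = X.shape
        logdet = lambda frames: np.linalg.slogdet(np.cov(frames.T))[1]
        penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(N)
        return 0.5 * (N * logdet(X)
                      - i * logdet(X[:i])
                      - (N - i) * logdet(X[i:])) - penalty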

The modified segmentation technique, based on a sliding window approach, is described by Algorithm 2.1 and illustrated in Figure 2.2.

Algorithm 2.1 Three Subwindow Segmentation Algorithm

    procedure SEGMENTATION(X, T, N)        // Feature stream, stream length, window size
        shift <- 0
        while shift + N/3 <= T do          // Do until end of the feature stream is reached
            SW_L <- X[0 + shift, ..., N/3 - 1 + shift]    // Initialize the window, subwindows
            SW_C <- X[N/3 + shift, ..., 2N/3 - 1 + shift]
            SW_R <- X[2N/3 + shift, ..., N - 1 + shift]
            SW   <- X[0 + shift, ..., N - 1 + shift]
            for i in SW_C do
                X_W1 <- SW[0, ..., i - 1]
                X_W2 <- SW[i, ..., N]
                X    <- SW
                Compute DeltaBIC(i) using Equation 2.3    // Compute BIC for middle subwindow
            end for
            shift <- shift + N/3           // Shift all windows by one third of window length
        end while
        return BIC
    end procedure

The sliding window is split into three equal subwindows and the BIC values are computed only for frames in the middle subwindow. All frames in the window to the left of the current frame form the set X_W1 and frames to the right form the set X_W2. The union of these frames forms the set X. Equation 2.3 is then used to compute the BIC value for the frame. After BIC values for all frames in the middle subwindow have been computed, the windows are shifted by a third of the window length and the entire procedure is repeated.

After a BIC curve for the entire signal has been computed, the next step is to detect peaks. In the original algorithm, all peaks that correspond to a positive value are considered as change points. From Equation 2.3, it is apparent that this implicitly assumes thresholding the likelihood by a factor dependent on \lambda. Although in the original algorithm \lambda was set to 1, subsequent works suggest tuning it to a different value based on the dataset. Alternatively, the penalty term can be discarded and an explicit thresholding approach can be used, as in [21], where peak values greater than \alpha\sigma were chosen as change points, where \alpha is an empirically chosen parameter and \sigma is the standard deviation of the BIC curve.
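A runnable rendering of Algorithm 2.1, reusing the delta_bic sketch above, might look as follows. It is a sketch under the assumptions that the window length N is divisible by three and that the loop requires a full window (a slight tightening of the pseudocode's loop condition).

    import numpy as np

    def three_subwindow_bic(X, N):
        """Slide a window of N frames over the feature stream X of shape
        (T, d), computing Delta-BIC only for frames in the middle subwindow
        and then shifting all windows by one third of the window length."""
        T = len(X)
        bic = np.full(T, np.nan)  # frames never scored stay NaN
        shift = 0
        while shift + N <= T:
            window = X[shift:shift + N]
            for i in range(N // 3, 2 * N // 3):  # middle subwindow only
                bic[shift + i] = delta_bic(window, i)
            shift += N // 3
        return bic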

Figure 2.2 Changepoint Detection Using the Bayesian Information Criterion. The first image (top left) shows an audio sample which is then processed to extract twelve Mel-frequency cepstral coefficients at 30 Hz (top right). The BIC (Bayesian information criterion) curve is then computed by a sliding window approach. The window consists of three parts and the BIC values are only computed for frames in the central window. The window is then slid by a third of its length and the computations are repeated to obtain a BIC value for the entire signal (bottom right). The BIC is then lowpass filtered and differentiated to extract local maxima, which correspond to audio change points (shown using red diamonds in the bottom left image).

In our method, the BIC curve is first normalized to lie between 0 and 1, smoothed, and a derivative is taken to find the peaks. Peaks in the signal that lie above the median are then chosen as the change points. This obviates the need for empirically tuning any parameters. In addition to reduced computation compared to the original method, this approach overcomes the two issues. By considering only small windows on each side, the method can reliably detect short segments. At the same time, because for each frame the ratio of the largest to smallest subwindow does not exceed 2:1, the computation is robust. The BIC curve does not suffer from sudden discontinuities because, even when the window is slid, there is a 66% overlap between the frames used for computation.
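The parameter-free peak picking just described could be sketched as follows using SciPy; the smoothing length is an assumed value, since the text does not specify one, and this is an illustration rather than the dissertation's own code.

    import numpy as np
    from scipy.ndimage import uniform_filter1d
    from scipy.signal import find_peaks

    def change_points(bic, smooth_len=15):
        """Normalize the BIC curve to [0, 1], smooth it, find local maxima,
        and keep only the peaks that lie above the median of the curve."""
        b = np.nan_to_num(bic, nan=np.nanmin(bic))
        b = (b - b.min()) / (b.max() - b.min() + 1e-12)  # normalize to [0, 1]
        b = uniform_filter1d(b, size=smooth_len)         # low-pass filter
        peaks, _ = find_peaks(b)                         # local maxima
        return peaks[b[peaks] > np.median(b)]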

Figure 2.3 Illustration of Segmentation. Gray regions indicate durations when a person is speaking. The smoothed BIC curve shown in blue is overlaid on the image. Correct detections are shown by triangles, false alarms are shown by triangles within a circle and missed detections are marked by arrows.

Figure 2.3 illustrates the working of this technique on a clip where three people are conversing. The gray areas indicate the time instants when an individual is speaking. The smoothed BIC curve is shown in blue and the detected peaks are marked by red triangles. It can be seen that there are two deletions and quite a few insertions. The insertions are not so much of a problem because similar segments will be merged in the clustering stage.

2.4 Clustering

Once individual segments are obtained, the next task is to group together segments from the same speaker. Formally speaking, the clustering task involves creating a set of clusters {C} = {c_1, c_2, ..., c_M}, where each cluster c_j contains a set of segments s ∈ S containing speech from the same dominant speaker, so as to maximize the objective function in Equation 2.4, where M is the number of clusters found by the system:

    CP = \frac{1}{T} \sum_{c_j \in \{C\}} \max_{i \in I} \sum_{t \in c_j} \delta(G(t) - i)    (2.4)

The measure for cluster purity (CP) is similar to the measure for segment purity. The inner term counts the number of frames from the dominant speaker in each cluster, which is summed over all clusters and normalized by the total length of the stream, T. It should be noted that optimal performance requires that M = N, i.e., the number of clusters found by the system is equal to the number of speakers. In the ideal case, there is a one-to-one mapping between clusters and speakers, and a penalty is incurred when segments having different dominant speakers are clustered together. The purity of the clusters is also affected by segmentation errors, and the SP is an upper bound on the CP.

Most works that perform speaker diarization using a single microphone [46,118] follow a hierarchical agglomerative clustering scheme for merging segments. In these works, the audio features from each segment (e.g. MFCCs, LPCs) are modeled using a Gaussian mixture model and pairwise distances between all segments are found. The pair of segments having the least distance is merged together and a new mixture model is created for the merged segment. This merging procedure continues iteratively until a stopping criterion is met.

However, this approach has some drawbacks. First, due to the spontaneous nature of speech in meeting room data, the segment sizes are usually small, as evident from Figure 2.3. The Expectation Maximization (EM) algorithm does not produce consistent models when only small amounts of data are available for learning. Additionally, the segmentation procedure produces many false change points, leading to shorter segments and exacerbating this problem.

Second, the approach is locally greedy, as merging is performed based only on the distance between two segments. Also, the algorithm has significant computational complexity, as it requires creating a new mixture model and finding the distance between the new model and all existing models at each iteration. When the number of initial segments is high, this method is considerably slow.

To overcome the drawbacks of the agglomerative scheme, we propose a two level clustering algorithm, in which the model complexity is increased based on the cluster size. The first level uses a graph spectral clustering scheme to cluster together the audio segments, which are modeled using unimodal Gaussians with a full covariance matrix. The output of this stage is a smaller number of larger intermediate clusters. In the second stage, since the intermediate clusters contain enough data, they can be reliably modeled using Gaussian Mixture Models. Also, since the number of intermediate clusters is much smaller than the original number of segments (typically by a factor of 10), the agglomerative clustering scheme can be used without drastic speed reduction. The second level is necessary to allow incorporation of a stopping criterion, which is not possible in the graph spectral clustering framework.

For the first level, we formulate the clustering problem as a graph partitioning problem, as shown in Figure 2.4. In this framework, each segment is represented as a node and the similarities between nodes serve as the edge weights w_ij. Various distance measures have been used to compute distances between segment models, such as the symmetric Kullback-Leibler (KL2) divergence [92], the Gaussian likelihood ratio [64] and the Bhattacharya distance [46]. In this work, similar to [118], we used the BIC distance as the edge weights.

Figure 2.4 Graph Spectral Clustering Framework. The clustering framework involves denoting each segment as a node and the negative of the BIC distance between two segments as edge weights to obtain a fully connected graph. This leads to a matrix representation (bottom right). Graph spectral clustering of the nodes (segments) results in clusters belonging to individual speakers.

The clustering problem now reduces to a graph partitioning problem where the objective is to minimize the term

    \sum_{ij} w_{ij} (L_i - L_j)^2    (2.5)

where L_i is the label accorded to each segment, the idea being that similar segments (having large weights) should be assigned the same label. To avoid trivial solutions, i.e. L = 1 or L = 0, the following constraints are imposed:

    \sum_i L_i = 0 \quad \text{and} \quad \sum_i L_i^2 = 1    (2.6)

It is shown in [89] that the solution to this problem is the second eigenvector of the Laplacian matrix obtained from the edge weights as

    L_W(i, j) = \begin{cases} \sum_{j, j \neq i} w_{ij} & \text{if } i = j \\ -w_{ij} & \text{if } i \neq j \end{cases}    (2.7)

Given the second eigenvector, the nodes (segments) are partitioned into two clusters based on the signs of the eigenvector's entries. Each cluster is then recursively bipartitioned till the desired number of clusters is found. In this implementation, graph spectral clustering was used to obtain sixteen clusters from the initial segments. The choice of sixteen clusters was motivated by the observation that meeting room data usually contains fewer than sixteen speakers, and so the data is not under-clustered. Also, at sixteen clusters, we find for our dataset that each cluster contains a sufficiently large amount of data to robustly estimate more complex models.

Although unimodal Gaussians may be appropriate to model the initial audio segments, because only a limited amount of data is available, the intermediate clusters contain enough data to warrant the use of more complex models. We model the intermediate clusters using the popular UBM-GMM technique described in [84]. In this technique, a Universal Background Model (UBM) is built using the entire speech in the meeting. The UBM is essentially a Gaussian Mixture Model, which in this implementation consists of sixteen mixtures with diagonal covariance matrices and is learnt using the Expectation Maximization (EM) algorithm. This is followed by a mean-only maximum a posteriori (MAP) adaptation [10] to produce GMMs for each cluster. The distance between the GMMs is computed using the Euclidean distance between the means, as shown in Equation 2.8. It was shown in [10] that this correlates with a Monte Carlo estimation of the KL2 distance.

    d(a_i, a_j) = \sqrt{ \sum_{k=1}^{K} \sum_{d=1}^{D} w_k \frac{ (m_{i,k,d} - m_{j,k,d})^2 }{ \sigma^2_{k,d} } }    (2.8)

Here d(a_i, a_j) is the distance between clusters i and j, computed for a GMM with K components and data with D dimensions, and m_{i,k,d} represents the mean of the d-th dimension of the k-th component for cluster i. An agglomerative clustering scheme iteratively merges together the intermediate clusters till a BIC based stopping criterion is met.

This approach to audio based clustering differs from previous works in two aspects. First, a two level clustering scheme is used with different model complexities for each level, based on the amount of data. Second, a graph spectral clustering scheme is used in the first stage, which affords both speed and performance improvement compared to using only hierarchical agglomerative clustering, as shown in Section 2.5.2.
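Both computational pieces of this section reduce to a few lines each. The sketch below shows the recursive bipartition step implied by Equation 2.7 and the cluster distance of Equation 2.8; it is an illustration under the assumptions of a precomputed symmetric similarity matrix W and MAP-adapted means sharing the UBM's weights and variances, not the dissertation's own code.

    import numpy as np

    def spectral_bipartition(W):
        """Split nodes into two groups by the signs of the second eigenvector
        of the Laplacian L = D - W (Equation 2.7); applied recursively until
        the desired number of clusters (sixteen here) is reached."""
        L = np.diag(W.sum(axis=1)) - W
        _, vecs = np.linalg.eigh(L)   # eigenvectors of the symmetric Laplacian
        return vecs[:, 1] >= 0        # boolean membership of each segment

    def gmm_mean_distance(means_i, means_j, weights, variances):
        """Cluster distance of Equation 2.8 between two mean-only MAP-adapted
        GMMs. means_*: (K, D) arrays; weights: (K,); variances: (K, D)."""
        sq = weights[:, None] * (means_i - means_j) ** 2 / variances
        return np.sqrt(sq.sum())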

2.5 Results

The proposed speaker diarization approach was tested on a subset of the NIST pilot room meeting corpus [38]. The entire corpus contains 19 meetings recorded in a room rigged with five cameras and four table mounted microphones. Of the four table microphones, three are omni-directional microphones and one is a 4-channel directional microphone. Each participant in the room has two personal microphones - a head microphone and a directional lapel microphone.

There are two audio channels available from each meeting. The first audio channel is a gain-normalized mix of all the head-boom microphone recordings and the second channel is a gain-normalized mix of all the distant microphones. The audio data has a sampling frequency of 44 kHz and a resolution of 16 bits per sample. Of the 19 meetings in the corpus, two meetings did not have associated speech transcripts and the excerpt from one meeting was dominated by a single speaker. The remaining 16 meetings were used for the study and their speech transcripts were used to generate the speaker diarization ground-truth.

In all of the meetings, participants are seated around a central table and interact casually. Depending on the type of the meeting, the participants discuss a given topic, play games, make presentations or use the blackboard. From time to time, participants take notes, stretch and sip water or coffee from their mugs. The audio signals from these meetings are quite complex because the meetings are unscripted and of long duration. The signal consists of short utterances, frequent overlaps in speech and non-speech sounds such as wheezing, laughing, coughing, etc.

From various evaluations [33,100], it is apparent that speaker diarization and localization algorithms have a wide range of performance variation and that the performance seems to be extremely data dependent. To investigate the factors that affect algorithm performance, we use the following variables to quantitatively characterize a meeting (a sketch computing the last two follows this list):

- Number of participants in the room (No. Speakers): We assume this metric will correlate with the video performance - the greater the number of speakers, the closer together they will be seated, leading to occlusions. Secondly, the overall amount of distracting motion may also be proportional to the number of speakers. Also, although the number of speakers may not directly correlate with audio performance, a larger number of people results in increased background noise due to the various sounds that listeners make when they swivel in their chairs, tap the table or flip pages.

- Speaker Entropy (Entropy): This is a measure of speaker domination in a meeting. A low entropy indicates that only a few speakers speak for most of the time, whereas a high entropy indicates that the participants involved spoke more or less for about the same duration. The entropy is computed as

      H(meeting) = -\sum_{i}^{N} P(S_i) \log(P(S_i))    (2.9)

      P(S_i) = \frac{dur(S_i)}{\sum_{i}^{N} dur(S_i)}    (2.10)

  where N is the number of speakers involved in the meeting, dur(S_i) is the total time duration for which person S_i speaks and P(S_i) is the percentage of time (i.e. probability) that person S_i speaks during the meeting.

- Short Utterance Ratio: Detection of short speech utterances still remains a challenge for the speaker segmentation task. Also, robust modeling of short utterances is a problem, leading to lower clustering performance. The short utterance duration is the sum of all utterance durations that are less than three seconds. The short utterance ratio is derived by normalizing the short utterance duration by the total speech duration and is computed as

      SUR = \frac{ \sum_{i,j} \widetilde{dur}(U_{ij}) }{ \sum_{i,j} dur(U_{ij}) }, \quad \widetilde{dur}(U) = \begin{cases} 0 & \text{if } dur(U) > 3 \text{ seconds} \\ dur(U) & \text{if } dur(U) \leq 3 \text{ seconds} \end{cases}    (2.11)

  where U_{ij} is the j-th utterance by the i-th speaker.
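Both meeting-level factors are simple to compute from per-speaker and per-utterance durations. The following is a sketch; the base of the logarithm in Equation 2.9 is not specified in the text, so the natural logarithm is an assumption made here.

    import numpy as np

    def speaker_entropy(speaker_durations):
        """Speaker entropy H (Equations 2.9-2.10) from total per-speaker
        speech durations in seconds."""
        p = np.asarray(speaker_durations, dtype=float)
        p = p / p.sum()
        return float(-(p * np.log(p)).sum())

    def short_utterance_ratio(utterance_durations, threshold=3.0):
        """Short utterance ratio SUR (Equation 2.11): total duration of
        utterances no longer than `threshold` seconds, normalized by the
        total speech duration."""
        u = np.asarray(utterance_durations, dtype=float)
        return float(u[u <= threshold].sum() / u.sum())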

PAGE 35

Table2.2MeetingDatasetandAssociatedMeta-DatafortheSixteenMeetings UsedintheExperiments. No. Name Type Speakers Entropy ShortSeg.Ratio 1 20011115-1050 FocusGroup 4 1.25 0.1049 2 20011211-1054 Planning 3 1.28 0.0785 3 20020111-1012 Planning 6 1.37 0.1211 4 20020213-1012 StaffMeeting 6 1.87 0.1796 5 20020214-1148 Interactionwithexpert 6 1.85 0.1822 6 20020304-1352 Gameplaying 4 1.59 0.1704 7 20020305-1007 Planning 7 1.79 0.1834 8 20020627-1010 Staffmeeting 6 2.26 0.2905 9 20020731-1409 Gameplaying 4 2.03 0.3164 10 20020815-1316 Problemsolving 4 1.61 0.1501 11 20020904-1322 Interactionwithexpert 4 1.93 0.1563 12 20020911-1033 Interactionwithexpert 7 1.39 0.0625 13 20030702-1419 Focusgroupdiscussion 4 1.88 0.1922 14 20030729-1519 Informationgathering 4 1.97 0.1845 15 20031204-1125 Informationgathering 4 2.26 0.2515 16 20031215-1412 Focusgroupdiscussion 5 1.89 0.1875 Firsttheperformanceofthesegmentationmoduleisanalyzedinsubsection2.5 .1.Insubsection2.5.2,thediarizationperformanceusingtheHierarchicalAgglomerativec lusteringiscompared toourtwo-levelapproach.2.5.1Segmentation Inthesegmentandclusterspeakerdiarizationframework,thersttaskis thedetectionofchangepointsthatpartitiontheclipintosegmentswhichcontainspeechfromonlyasingle individual.As discussedinsection2.3,thesegmentationperformanceismeasuredbythese gmentpurity( SP )measuregivenbyEquation2.1.Figure2.5showsthesegmentationperformanc eforeachofthesixteen clips,usingaudioinformation.Foreachclip,segmentationwasperformedus ingbothaudiochannels(individually)andtheresultsareshownbythebarslabeledasAudio1 andAudio2. Theaveragesegmentpurityfromaudiochannel1and2is92.5%and91.1 %,respectively;the differenceinsegmentationperformancebetweenthetwochannelsisnotre markable.However,the performancevariationacrossmeetingsissignicantwiththebestandwors tperformancesat96.2% and87.6%respectively.Wethenanalyzedifanyofthefactorsoutlinedea rliercanexplainthevariationinperformanceacrossmeetings.AnANOVAanalysisrevealsthe R 2 valueas0.78and0.74for 23


Figure 2.5 Segmentation Performance. Results obtained by using the BIC based method for segmentation. Audio 1 and Audio 2 denote the purity of the segments found using audio channels 1 and 2 respectively. The average segment purity from audio channels 1 and 2 is 92.5% and 91.1%; this difference between channels is small compared to the variation in segment purity across the meetings.

It is intuitive that these two metrics should correlate with segmentation performance. The number of meeting room attendees is not by itself an important predictor of performance because often some attendees do not participate much in the conversation and hence do not impact the dynamics of the conversation. The entropy is a measure of uncertainty in the identity of the current speaker. A high entropy means that the speakers who participated spoke for more or less equal durations. This implies that errors in segmentation and the ensuing clustering will drastically impact the performance metrics. The SUR is probably the most intuitive metric for segmentation. When the SUR is high, it implies that there are many short segments in the conversation and hence many change points. This affects segmentation in two ways: firstly, more change points need to be determined, and secondly, less data is available to the BIC models from short segments to make decisions. The results and their ANOVA analysis highlight the effect of SUR and entropy on the segment purity. We now add segment purity (SP) as another factor during the clustering performance analysis.
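Regressions of this kind are simple to reproduce. The sketch below (Python with NumPy; the numeric values are purely illustrative placeholders, not the dissertation's data) shows how an R² for segment purity against entropy and SUR can be obtained with ordinary least squares:

    import numpy as np

    # Illustrative placeholder values only -- not the actual per-meeting data.
    entropy = np.array([1.25, 1.28, 1.37, 1.87, 1.85, 1.59, 1.79, 2.26])
    sur     = np.array([0.105, 0.079, 0.121, 0.180, 0.182, 0.170, 0.183, 0.291])
    sp      = np.array([0.962, 0.958, 0.945, 0.912, 0.910, 0.920, 0.905, 0.876])

    X = np.column_stack([np.ones_like(entropy), entropy, sur])  # design matrix
    beta, *_ = np.linalg.lstsq(X, sp, rcond=None)               # OLS fit
    resid = sp - X @ beta
    r2 = 1.0 - resid.var() / sp.var()                           # coefficient of determination
    print("R^2 =", r2)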


2.5.2 Clustering

A two stage clustering is performed: the first step involves merging audio segments based on graph spectral clustering. In the second step these intermediate clusters are used to build Gaussian mixture models, which are then merged via a hierarchical agglomerative clustering to obtain the desired number of clusters. The clustering performance metric is similar to the segment purity metric and is computed using Equation 2.4.

Figure 2.6 compares the cluster purity obtained using the two stage clustering scheme (GSC+HAC) with the hierarchical agglomerative clustering (HAC) approach. The cluster purity using the traditional HAC approach is 76.68% and 72.3% for channels 1 and 2 respectively. This is a sharp drop from the segmentation performance. However, the two level clustering scheme performs better, with a CP of 80.84% and 77.45% for channels 1 and 2, respectively. In addition to the improved accuracy, the (GSC+HAC) method is also considerably faster, taking an average of 45 seconds compared to the average of 213 seconds taken by the HAC method.

Figure 2.6 Clustering Performance Using the Cluster Purity Metric. Comparison of two approaches for audio-only clustering. The GSC+HAC method performs better than the traditional HAC method.

To understand the reasons for the considerable drop in purity, relations were sought between the drop in performance (SP - CP) and the variables described before. An ANOVA analysis indicated at 90% confidence that the initial segment purity and entropy were the factors correlated with the drop in performance.


The R² value for the model was 0.71. A similar effect of entropy on clustering performance was also observed in [51]. The dependence of clustering performance on initial segment purity is intuitive, as segments containing speech from more than one speaker are likely to be incorrectly clustered. Also, entropy is a coarse measure of the speaker transition dynamics. A high entropy value signifies that multiple people are speaking in the recording, which makes the clustering task harder. For instance, consider the case where a single person speaks for most of the meeting, as in a presentation. In this case the system is required to find one dominant cluster which will consist of contiguous segments. This is comparatively easier than determining the number of speakers and correctly grouping disjoint segments for each speaker. As a result, a lower entropy results in better performance. The effect of SP on CP is also obvious, because segments containing speech from multiple speakers result in noisy models which are prone to incorrect clustering.

The final results of diarization were presented in Figure 2.6 using the cluster purity metric defined by Equation 2.4. However, a popular measure for gauging the performance of a speaker diarization system is the diarization error metric, defined by NIST in [38]. For the sake of completeness and to enable comparison of this system with others, we also present results using this metric in Figure 2.7. The average DERs of audio-only diarization are 19.15% and 22.54% for channels 1 and 2 respectively. As of the most recent NIST evaluations, RT07, the best performing system was from ICSI. That system had a DER of about 22%, which indicates that the proposed system is quite competitive with other state of the art systems, although a proper evaluation requires comparing the two algorithms on the same dataset.

2.6 Summary and Conclusions

The problem of speaker diarization requires determining "who spoke when?" from an audio recording. Aside from being an important aspect of meeting analysis, it is also required to solve higher level problems such as categorizing meetings and generating intelligible transcripts. As a result, this problem is now receiving significant attention from researchers, and large scale evaluations, such as the ones by NIST [33], have been conducted to determine the state of the art. Most approaches for speaker diarization follow a segmentation-clustering paradigm, where the audio signal is first partitioned into homogeneous segments and then segments from the same speaker are clustered together.


Figure 2.7 Clustering Performance Using the DER Metric. Comparison of two approaches for audio-only clustering. The GSC+HAC method has an average DER of 19.15% and 22.54% for channels 1 and 2 respectively, compared to DERs of 33.32% and 27.7% for the HAC method.

The diarization system described in this chapter follows the same approach, but achieves better performance than comparable methods by improving both the segmentation and clustering algorithms.

For segmentation, we present a modification to the sequential BIC based method that uses a three-part sliding window and computes the BIC values for only the middle subwindow. The proposed segmentation approach is better at detecting the short speech segments that are typical of meeting conversations. In addition, it is also computationally efficient, operating in O(n), and does not require tuning any parameters. The major contribution highlighted in this chapter is the novel two stage clustering algorithm used to group together segments from the same individual. This framework allows increasing model complexity in response to the amount of data. Graph spectral clustering is used at the first stage to obtain a larger number of intermediate clusters by grouping together short segments that are modeled by unimodal Gaussians. Unlike hierarchical clustering, which is a greedy approach, graph spectral clustering takes into account distances between all segments when making clustering decisions. Secondly, it does not have to iteratively compute the distance between all segments, which results in faster computation. The diarization results indicate an average relative improvement of 16% compared to the hierarchical clustering based approach, with a speedup by a factor of 5.


Although it is obvious that the performance of any system depends on the data on which it operates, an understanding of the precise factors that affect the performance is desired. The study conducted examined the effects of the number of speakers, entropy and short segments on the performance of the segmentation module. It was found that entropy and short utterances were the major factors that affect the performance, and that an inverse relation exists between these factors and the purity of segments. These findings resonate well with the intuition held by most researchers in the field. Secondly, we also found that for our system, the clustering performance is directly related to the segment purity. This is expected, as impure segments tend to combine erroneously during the clustering stage. These studies also highlight that the performance of the above system can be improved by incorporating a resegmentation stage prior to clustering, as has been done in other works.

To summarize, the system has a diarization error rate of around 20% and performs in less than 0.1 real-time. These metrics place it well when compared to state of the art systems. The significant gains in speed and accuracy are obtained through the incorporation of graph spectral clustering and a modified segmentation system. The next chapter shows how the diarization accuracy can be improved by incorporating video information.


CHAPTER 3
AUDIO-VISUAL DIARIZATION AND LOCALIZATION

The overwhelming amount of information available from meeting videos underscores the need for comprehensive event-detection techniques to enable different kinds of queries, such as query by discussion topic, for an individual's comments, or for specific activities (such as note-taking, making a presentation, illustrating using the whiteboard, etc.). Determining who spoke when (speaker diarization) and locating the current speaker (speaker localization) are prerequisite tasks for enabling such queries. Speaker diarization and localization are also required for other higher level tasks such as generating intelligible transcripts or automatically panning the camera to the current speaker. This chapter describes a novel framework to jointly perform diarization and localization using a synergetic fusion of audio and video.

Since semantic analysis of meetings is considered an important area of research, many evaluations on speaker diarization and speaker localization have been conducted, such as the ones by NIST [33] and CLEAR [101], where meetings are recorded in special rooms rigged with multiple microphones and cameras. However, this work focuses on meetings recorded using a constrained recording setup consisting of a single camera and a single microphone. The reason to focus on this constrained scenario stems from its broader applicability: as audio and video sensors are integrated into small portable devices such as laptops and cell phones, they too can be used to record meetings, converting any location into a virtual meeting room.

Speaker localization and diarization in the single camera, single microphone scenario has been investigated in works such as [27, 34, 44, 55, 71], where the problem is posed as one of detecting synchronous audio and video events. In these works, mutual information (MI) based approaches have been successfully demonstrated on short duration clips which involve a few (usually two) speakers facing the camera. In situations such as broadcast news and people interacting with a kiosk, the person looks directly at the camera when speaking and the face(s) cover a large portion of the image.


Since the faces are frontal and the speaker's face and lips are clearly visible when speech is heard, an instantaneous correlation between the audio and video channels exists, and MI based approaches work quite well.

However, meeting room videos are very different, as multiple persons are seated facing each other, and so all faces observed by the camera are not frontal. Also, since the camera is placed much farther from the participants, faces have a low resolution. Additionally, participants often exhibit a high degree of movement for short intervals even when they do not speak, such as when taking notes, sipping coffee, or swiveling in a chair, and such movements are falsely correlated with the speech. For these reasons, we have found that MI based approaches do not perform well on meeting datasets [109].

We propose a different framework for audio-visual integration, motivated by the following observations. It is obvious that a strong correlation exists between the lip movements of a speaker and the resultant speech, which has been exploited in MI based works. There also exists a loose correlation between a person's speech and head/hand gestures, which has been demonstrated in works such as [54, 82, 88, 111]. In addition to the correlation of speech with lips and gestures, we also observe that, in general, a person exhibits more movement during speech. To maintain eye contact, the head turns from one listener to the other and usually bobs up and down during speech because of the jaw movement. Also, the speaker's hands and shoulders move involuntarily when an idea is expressed. Such movements are not correlated with the speech, i.e. there does not exist an instantaneous mapping between audio and video features. Rather, a long term co-occurrence of speech and movement exists, i.e. over longer time durations, people tend to exhibit more body movements when speaking than when listening. In this work, we exploit this phenomenon of co-occurrence of speech and body movements to perform the speaker diarization and localization tasks, with the underlying assumption that, on average, a speaker exhibits more movement than a listener.

The flowchart in Figure 3.1 illustrates our approach. The audio and video features are first concatenated to form a joint feature stream. The Bayesian information criterion (BIC) is used to find discontinuities in the joint feature stream, effectively partitioning it into atomic temporal primitives (ATPs) that have homogeneous audio-visual characteristics. Ideally, each ATP contains speech from a single individual. However, because an ATP is of short duration, the visual information in it may not be reliable. For example, a speaker may exhibit hand movements in one ATP but not in another, or a listener may exhibit spurious movement in an ATP. Hence a one-step audio-visual clustering of the ATPs is error prone.


Figure 3.1 System Flowchart for Joint Localization and Diarization. Audio and video features are fused to form a joint feature stream. The Bayesian information criterion (BIC) approach is used to segment the joint stream into segments that are similar in audio and video characteristics. Since global video features are computed, durations of speech from the same speaker may be segmented because of different motion characteristics in the video, such as movement from other speakers or different movements from the same speaker. The first stage of clustering is performed using only the audio signal to find durations of speech from a single speaker. Using the intermediate clusters, Gaussian mixture models are built for the audio and eigen-blob models are built for the video. The second stage of clustering uses distances between both these models to arrive at the final clustering solution. Ideally, each of the final clusters should contain all the speech from each individual speaker. The second eigenvector corresponds to the dominant motion mode in the image and localizes the speaker.

Instead, ATPs are first clustered using only the audio by graph spectral clustering (GSC) to obtain intermediate clusters of longer duration. The number of intermediate clusters is chosen to be more than the actual number of clusters, to ensure that they have high purity, i.e., each intermediate cluster contains speech from a single person. This also means that the clusters have lower coverage, i.e., there are multiple intermediate clusters that contain speech from the same person, which need to be merged to obtain the final results.

Since the intermediate clusters are of longer durations, the effect of spurious listener movements in their video frames is mitigated. Also, the longer temporal interval allows aggregating motion evidence from different body regions of the speaker. Robust video models can thus be built from these intermediate clusters, which can then be used along with the audio models in an iterative clustering framework to merge clusters from the same person.

The distance between each pair of clusters is a weighted combination of the distance between their audio and video models. At each step of the iterative clustering procedure, the closest pair of clusters is merged, refining the diarization results. Since the merged cluster is of longer duration, eigen-blob models built from the video will lead to better localization of the speaker in the corresponding video frames.


In turn, better video models will positively influence the clustering procedure in the next iteration. This clustering-modeling cycle continues till the number of final clusters equals the number of speakers, which is determined by a stop criterion that uses both audio and video. Ideally, each of the final clusters should contain all the speech from only a single person, and its video model should precisely localize the speaker in the image.

The outline for the rest of this chapter is as follows. The creation of a joint feature stream and partitioning of the meeting into ATPs is described in section 3.2. Section 3.3 deals with clustering ATPs into intermediate clusters and the iterative audio-visual clustering framework. Section 3.4 presents the improvements in diarization by incorporating video and compares eigen-blob and MI based localization results. Section 3.5 carries the conclusion of this chapter and outlines directions for future work. The next section compares this work to previous approaches that have used audio, video, or their combination to perform speaker diarization and localization.

3.1 Related Work

Person detection/tracking and speaker diarization are two tasks that have been heavily studied by the computer vision and signal processing communities respectively, and until recently, there had been relatively little effort to use one modality to complement the other. As a result, speaker diarization has been viewed mostly as a purely audio processing task, whereas speaker localization is viewed as a subset of the person detection/tracking task. However, in the past few years there has been much emphasis on integrating audio and video information to perform these tasks. Evaluation efforts such as the Rich Transcription evaluation (RT) [33] conducted by NIST, and the CLassification of Events Activities and Relationships (CLEAR) [100] conducted jointly by NIST and CHIL, aim to further such research by providing a common database and evaluation framework to assess the state of the art.

In this section, previous works that use audio/video or their combination to perform speaker diarization and localization (and tracking) are reviewed. Table 3.1 categorizes them based on the number and types of audio/video sensors used, the specific problems they address, the characteristics of the audio-video data that their methods exploit, and typical datasets on which their methods are illustrated. Since the focus of this work is on using a single camera and a single microphone to localize the current speaker, this review gives greater attention to the single camera, single microphone (sCsM) scenario. However, in the interest of completeness, other scenarios are briefly reviewed.


Table 3.1 A Brief Review of Previous Work Dealing With Joint Audio-Visual Diarization, Localization and Tracking.

    Works                              | Sensor Setup                                   | Problems Addressed                               | Method                                                       | Typical Datasets
    [80,104,114,119]                   | single camera (sC)                             | Person detection, tracking                       | Motion, appearance models                                    | TRECVID [98]
    [23,34,44,55,67,71,73,96,97]       | single camera, single microphone (sCsM)        | Speaker localization (2D), tracking, diarization | Audio-visual synchrony                                       | Broadcast news [98], CUAVE [72], generic data from kiosk scenarios
    [7,9,15,18,20,25,52,79,95,113,123] | multiple cameras, multiple microphones (mCmM)  | Speaker diarization, localization and tracking   | Kalman or particle filter based integration of unimodal cues | Meeting rooms, lectures [33,100]
    [109,110] (this work)              | single camera, single microphone (sCsM)        | Speaker diarization, localization                | Long term co-occurrence between speech and movement          | NIST meeting room pilot corpus [38]

The computer vision community has laid great emphasis on solving the person (and face) detection and tracking problem over the last few decades, especially under the constraint of a single camera view. Works in the single camera (sC) scenario differ from each other based on the appearance and motion models used for detection, the framework used for tracking, and the classifier that is employed. The visual appearance of an object can be factored into its shape, color and texture. Typically, the shape information is based on the contour of the foreground blob [40, 103]. Color information is captured via histograms in the normalized RGB [30], YUV [119] or other color spaces, and the foreground texture is usually quantified via an edge histogram [26]. Since human gait has a typical periodic motion and the short term trajectory is usually smooth, motion cues can serve to distinguish people from other moving objects and clutter. Thus, motion characteristics for person detection and tracking have been used in [24, 42, 91]. As always, a combination of different appearance and motion models can be used to make the system robust [115, 120, 121].


Various classifiers such as SVMs [26, 91], neural networks [103] and MAP [119] have been used for the person detection and tracking task. Recently, interest has developed beyond the coarse tracking of humans, and the focus is on tracking body parts to estimate pose, gaze and gestures, which are essential features for higher level analysis in meeting rooms. A good survey of works that approach this problem can be found in [1, 39, 74, 80, 116].

Since no audio information is available in the sC scenario, the tasks are limited to detecting and tracking people (and/or faces) in the 2D plane. Speech activity cannot be truly inferred just from video analysis, but it serves as an essential module when used in conjunction with an audio based tracking scheme, as will be seen when reviewing mCmM works.

The multimodal approach (for a single camera and a single microphone) has usually been demonstrated in scenarios where the speakers are facing the cameras. As a result the faces are frontal and an instantaneous synchrony between the audio and video signals exists. In the sCsM scenario, the problem is typically formulated as finding projections that maximize the mutual information between the projected audio and video signals.

The germinal work in this area is due to Hershey and Movellan [44]. In their work, image pixels are assumed to be conditionally independent given the audio signal, and the joint distribution of pixel intensity and audio energy is assumed to be Gaussian. They then evaluate the mutual information between the audio energy and each pixel's intensity, which is equivalent to computing the correlation under the Gaussian assumption. In [55, 96], the approach was modified by considering multidimensional audio and video vectors. Canonical correlation was used to learn projections that maximize the correlation between the projected variables.

In [27, 34], projections are learnt so that the mutual information between the projected audio and video signals is maximized. Their contribution was the use of non-parametric densities to represent the joint signal space. In [23], a time delayed neural network was used to learn audio-visual correlations and a spatio-temporal search performed in test sequences to detect correlated pixels. In [67], a sparse representation of the video signal enables representing contours of moving objects in an intuitive way. The image is decomposed and represented as a weighted sum of transformed basis functions. A correlation between each transformed basis function and the audio energy is used to find the image regions that correlate with the audio. This approach allows performing instantaneous audio-video correlation even when the sound emitting object is moving across the scene.


In all of the above audio-visual works, the focus is on modeling relationships between the audio and video signals. The formulation focuses on finding pixels that are highly related to the audio. Thus the assumption is that the underlying cause due to which the audio and video signals arise always expresses itself in both modalities, and that this relationship is instantaneous. This assumption holds for cases where two people are facing the camera and taking turns at speaking, or when the object generating the sound is visible in the video. However, in the meeting room scenario where only a single camera is used, it is not possible to capture the frontal faces or even the profile views of all participants. Thus our problem formulation focuses on finding the average correlation between the audio and video signal, rather than focusing on the instantaneous correlation.

The recent availability of meeting room data [5] captured from a wide number and variety of audio and video sensors has made it possible for researchers to develop and demonstrate techniques for the mCmM scenario. The approaches developed are similar to those mentioned for the multiple microphone scenario in the previous chapter.

In [15, 25, 52, 79, 95, 123], audio and video cues are used to independently locate and track the speaker, and the final location estimates are obtained by integrating the individual estimates at a higher level. In [25], audio and video estimates are used for localization, but only visual tracking is performed. However, for track verification, both audio and video cues are used. In [52], skin detection is used to find people in the image and the steered response power (SRP) is evaluated only over those regions to determine the speaker. In [79], motion, head shape and skin color are used to detect the angular location of a person from the video, and a time delay of arrival method is used to find a hyperbola of possible source locations from the audio. The final location estimate is obtained from the intersection of the video and audio estimates. In [15], 3D localization is performed independently using audio and video, speech models are built for each individual, and the entire information is used to determine the location and speaking activity. A similar approach is used in [95], but instead of identifying speakers based on the spectral content, audio-visual synchrony is used to determine the current speaker.

In [7, 9, 18, 20, 113], a particle filter formulation is used to jointly estimate the location and speaking activity of individuals using audio and video cues. The works differ in terms of whether multi-person or single-person state spaces are considered, the sampling paradigm, and the observation and dynamic models.


3.2 Atomic Temporal Primitives

The first step of our approach involves splitting the meeting into contiguous durations that we term atomic temporal primitives (ATPs). Each ATP should be homogeneous, i.e. it should contain speech and movement from only a single speaker. This implies that its joint audio-visual features should be well modeled using a single Gaussian. By definition, neighboring ATPs will be explained using significantly different models, and so the task of splitting the meeting into ATPs involves finding time instances where the underlying model changes. The primary focus of this step is to derive strictly homogeneous ATPs, even at the cost of over-segmentation, since contiguous ATPs containing speech from the same person can be clustered together later.

The ATP boundaries are found using the joint audio-visual stream. The audio signal is sampled at 16 kHz, and the Mel-frequency cepstral coefficients (MFCCs) are extracted using 32 filters with the bandwidth ranging from 166 Hz to 4000 Hz. These settings are chosen based on an empirical study [35] which found them to yield good results on the speaker verification task. The video features, which intend to capture motion, are obtained using image differences (three frames apart). The difference images are thresholded to suppress jitter noise, dilated by a 3x3 circular mask, downsampled from the original size of 480x720 to 48x72, and finally vectorized. The audio/video features are then projected onto PCA subspaces to reduce their dimensionality, and a joint audio-visual subspace is obtained by concatenating the coefficients using Equation 3.1. Since the video frame rate is 30 Hz, to enable a concatenation of the features, MFCC features from the audio were also extracted at the same rate.

X(t) = \begin{bmatrix} \alpha \, A(t) \\ V(t) \end{bmatrix}    (3.1)

Here A(t) = [a_1(t), a_2(t), ..., a_m(t)]^T, where a_1(t), a_2(t), ..., a_m(t) are the MFCCs projected in a PCA space. Similarly, V(t) = [v_1(t), v_2(t), ..., v_n(t)]^T, where v_1(t), v_2(t), ..., v_n(t) are obtained by projecting the video features onto a PCA space. The scaling factor \alpha = |\Sigma_V| / |\Sigma_A| is used to make the variance of the audio features equal to that of the video features.

The Bayesian information criterion (BIC) is then used to split the joint feature stream into primitives containing consistent audio and video features. The task is posed as one of finding the time instants t_k that correspond to ATP boundaries.
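A minimal sketch of the joint stream construction of Equation 3.1 follows (Python with NumPy; the random arrays stand in for real MFCC and difference-image features, and the PCA dimensionalities are arbitrary choices, since the text does not fix m and n). The boundary test itself is described next.

    import numpy as np

    def pca_project(feats, n_comp):
        """Project row-wise feature vectors onto their top principal components."""
        centered = feats - feats.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:n_comp].T

    T = 300                                    # frames at 30 Hz (placeholder)
    audio_feats = np.random.randn(T, 32)       # stands in for MFCCs
    video_feats = np.random.rand(T, 48 * 72)   # stands in for difference images

    A = pca_project(audio_feats, 10)           # audio coefficients A(t)
    V = pca_project(video_feats, 10)           # video coefficients V(t)

    # Scaling factor alpha = |Sigma_V| / |Sigma_A| from Equation 3.1, read here
    # as a ratio of covariance determinants (one possible interpretation).
    alpha = np.linalg.det(np.cov(V, rowvar=False)) / np.linalg.det(np.cov(A, rowvar=False))
    X = np.hstack([alpha * A, V])              # joint feature stream X(t)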


To determine if t_k corresponds to a boundary, subsets of the signal X, preceding and following t_k, are examined to infer whether they are produced by the same person or by two different persons. Let T_w represent the length of each subset, X_{W1} = {X(t_k - T_w), X(t_k - T_w + 1), ..., X(t_k - 1)} represent the subset preceding t_k, X_{W2} = {X(t_k + 1), X(t_k + 2), ..., X(t_k + T_w)} represent the subset following it, and X_C represent the union of the two. The BIC is used to determine if X_C can be adequately represented by a single model M_C, or if two models M_1 and M_2 are required to represent X_{W1} and X_{W2}, respectively.

The BIC for a model with parameters \theta is defined as

BIC(M) = \log p(X \mid \theta) - \frac{1}{2} \lambda d \log N    (3.2)

where \lambda = 1, d is the number of model parameters, and N is the number of feature vectors. The second term, proportional to d, penalizes complex models. Defining \Delta BIC to be BIC(M_1, M_2) - BIC(M_C), and modeling X_{W1}, X_{W2} and X_C by unimodal Gaussian distributions with full covariance matrices, we have

\Delta BIC(t_k) = \frac{N}{2} \log |\Sigma_X| - \frac{N_1}{2} \log |\Sigma_{X_{W1}}| - \frac{N_2}{2} \log |\Sigma_{X_{W2}}| - \frac{1}{2} \lambda \left( d + \frac{d(d+1)}{2} \right) \log N    (3.3)

where N_1 and N_2 are the numbers of features in X_{W1} and X_{W2}, respectively, and N = N_1 + N_2. A positive value for the \Delta BIC justifies the use of two models, indicating that t_k is a change point.

Prior to detecting ATP boundaries, a speech-silence detector is run to eliminate silence frames. A k-means clustering algorithm is initialized with the mean and the minimum of the audio energy. MFCCs in the resulting speech and silence clusters are modeled by unimodal Gaussians and the maximum likelihood criterion is used to identify silence frames. Contiguous silence frames longer than half a second in duration are termed silence segments and eliminated prior to further processing.

Our implementation of the BIC framework is described in detail in section 2.3. In essence, it uses a sliding window approach for finding ATP boundaries, similar to [21], with the major difference being the incorporation of video information. The motivation for using both audio and video features to obtain the primitives stems from the following observation: a change in the speaker is indicated by a change in the model producing the audio features, which is the premise of the BIC approach.
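The \Delta BIC test of Equation 3.3 reduces to a few covariance log-determinants. A minimal sketch, assuming full-covariance unimodal Gaussians as in the text:

    import numpy as np

    def delta_bic(X1, X2, lam=1.0):
        """Delta-BIC for a candidate change point (Equation 3.3).
        X1, X2: (N1, d) and (N2, d) feature windows on either side of t_k."""
        Xc = np.vstack([X1, X2])
        n1, n2, n = len(X1), len(X2), len(Xc)
        d = Xc.shape[1]
        logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
        penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
        return (n / 2) * logdet(Xc) \
             - (n1 / 2) * logdet(X1) - (n2 / 2) * logdet(X2) - penalty

    # A positive return value favors two models, i.e. t_k is a change point.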


Often, a change in speaker is also reflected by a change in the video dynamics. After a person stops speaking, they often change posture, by leaning back further into their chair, indicating through a non-verbal mechanism that the floor is open. Similarly, just prior to speaking, a person attempts to gain their audience's attention by leaning forward or extending their arm into the common space to indicate a desire to hold the floor. Thus, a change in speaker is also reflected by a change in the image regions where motion occurs.

Since the difference images are projected as a low dimensional vector and modeled by a unimodal multivariate Gaussian (across time), a change in the image region will be modeled by a different Gaussian model. The joint Gaussian is more sensitive to speaker changes than models built for either modality alone. This however comes at the cost of increased false detections due to the video, such as when a person reaches out to grab a cup of coffee when someone else is speaking. However, since the primitives will be merged together in the clustering stage, false detects are not as expensive as missed detects.

3.3 Clustering and Localization

Once the feature stream has been split into ATPs, the next goal is to merge all ATPs containing speech from the same individual. The localization task involves determining the image region in the video frames of those ATPs where the speaker is seated. These two tasks can be performed sequentially: speaker diarization can be performed first using only the audio, and then video frames from the final clusters can be analyzed to locate the speaker. Alternatively, since the video contains information about the current speaker, both audio and video features can be used from the ATPs to jointly perform diarization and localization.

However, since individual ATPs tend to be of short durations, the visual information in them is not very consistent. For example, where a person utters just a few sentences, we observe that there is little accompanying motion, and this situation exacerbates when the person is facing away from the camera. Similarly, an ATP can contain speech from one person but motion from more than one individual, as occurs when someone is taking notes. The hypothesis on which this work is based is that on average a speaker exhibits more movement than a listener, and this holds when considering longer time durations. Thus, instead of obtaining video models directly from the ATPs, the ATPs are first clustered using only the audio to obtain fewer, larger clusters.


Video models can be reliably estimated from these larger intermediate clusters, and then be used to influence diarization in the iterative diarization-localization process.

The rest of this section is structured as follows. In subsection 3.3.1, we describe the procedure of obtaining intermediate clusters from the ATPs. In subsection 3.3.2, we describe how audio and video models are built from these intermediate clusters, and in 3.3.3 we describe the iterative diarization-localization procedure.

3.3.1 Intermediate Clusters

Figure 3.2 Graph Spectral Clustering Framework. The clustering framework involves denoting each ATP as a node and the \Delta BIC distance between two ATPs as edge weights to obtain a fully connected graph. This leads to a matrix representation (bottom right). Graph spectral clustering of the nodes produces intermediate clusters.

The initial clustering of ATPs is performed using only audio, by modeling their MFCCs by a unimodal Gaussian with a full covariance matrix. The clustering problem is formulated as a graph partitioning problem, as shown in Figure 3.2. Each ATP is represented as a node and the similarity between ATPs serves as the edge weight (w_ij). Various distance measures have been used to compute distances between Gaussian models, such as the symmetric Kullback-Leibler (KL2) divergence [92], the Gaussian likelihood ratio [64], and the Bhattacharya distance [46]. In this work, similar to [118], we used the \Delta BIC distance as the edge weights. The clustering problem now reduces to a graph partitioning problem where the objective is to minimize the term


\sum_{ij} w_{ij} (L_i - L_j)^2    (3.4)

where L_i is the label accorded to each segment. To minimize the error term, similar segments (having large weights) should be assigned the same label. To avoid trivial solutions, i.e. L = 1 or L = 0, the following constraints are imposed:

\sum_i L_i = 0 \quad \text{and} \quad \sum_i L_i^2 = 1    (3.5)

It is shown in [89] that the solution to this problem is the second eigenvector of the Laplacian matrix obtained from the edge weights as

L_W(i, j) = \begin{cases} \sum_{j', j' \ne i} w_{ij'} & \text{if } i = j \\ -w_{ij} & \text{if } i \ne j \end{cases}    (3.6)

Based on the second eigenvector, the nodes are partitioned into two clusters, based on the sign of the eigenvector's entries. Each cluster is then recursively bipartitioned till the desired number of clusters is found. The number of clusters can be chosen based on domain information. In our experiments, graph spectral clustering was used to obtain sixteen clusters from the initial segments. The choice of sixteen clusters was motivated by the observation that meeting room data usually contains fewer than sixteen speakers, and so the data is not under-clustered. Also, at sixteen clusters, we find for our dataset that each cluster contains a sufficiently large amount of data to robustly estimate GMMs for the audio and eigen-blob models for the video.
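The recursive bipartitioning can be sketched as follows (Python with NumPy; which cluster to split next is not specified in the text, so splitting the largest remaining cluster is our assumption):

    import numpy as np

    def spectral_bipartition(W):
        """Split graph nodes by the sign of the second eigenvector of the
        Laplacian L = D - W (Equation 3.6)."""
        L = np.diag(W.sum(axis=1)) - W
        _, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
        return vecs[:, 1] >= 0          # sign pattern defines the two clusters

    def recursive_gsc(W, n_clusters):
        """Recursively bipartition until n_clusters clusters remain."""
        clusters = [np.arange(len(W))]
        while len(clusters) < n_clusters:
            k = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            idx = clusters.pop(k)       # heuristic: split the largest cluster
            mask = spectral_bipartition(W[np.ix_(idx, idx)])
            clusters += [idx[mask], idx[~mask]]
        return clusters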


3.3.2 Audio and Video Models

Audio features of an intermediate cluster are modeled using the UBM-GMM technique described in [84]. In this technique, a universal background model (UBM) is built using the entire speech in the meeting. The UBM is essentially a Gaussian mixture model, which in this implementation consists of eight Gaussians with diagonal covariance matrices learned using the expectation maximization (EM) algorithm. This is followed by a mean-only maximum a posteriori (MAP) adaptation [10] to produce GMMs for each cluster. The distance between the GMMs is computed using the Euclidean distance between the means, as shown in Equation 3.7. It was shown in [10] that this correlates with a Monte Carlo estimation of the KL2 distance. Here d(a_i, a_j) is the distance between clusters i and j computed for a GMM with K components and data with D dimensions; \mu_{i,k,d} represents the mean of the d-th dimension of the k-th component for cluster i.

d(a_i, a_j) = \sqrt{ \sum_{k=1}^{K} \sum_{d=1}^{D} w_k \, \frac{(\mu_{i,k,d} - \mu_{j,k,d})^2}{\sigma_{k,d}^2} }    (3.7)

The intermediate clusters also serve as the starting point for speaker localization. A video model is built for each intermediate cluster by analyzing the eigenvectors of its video features (difference images). Let the matrix Z represent the set of vectorized difference images of an intermediate cluster and let C_Z represent the covariance of Z. Solving Equation 3.8 yields the eigenvectors of C_Z as the column entries of V, where \Lambda is the corresponding eigenvalue matrix.

C_Z V = V \Lambda    (3.8)

Since eigenvectors are projections that reduce the covariance of the projected variables, they effectively group pixels that move together. If the dominant speaker moves the most in the set of frames, the primary eigenvector partitions the image into two regions: one belonging to the speaker and the other to spurious background movements. However, it cannot be determined which of the two regions corresponds to the speaker from only the primary eigenvector. Since the second eigenvector is orthogonal to the first, it splits the dominant component of the first eigenvector, which is the region that represents the speaker's location.

Mathematically, if v_1 is the largest eigenvector and v_2 is the second largest eigenvector of frames from the cluster c_i, then the part v_i which represents the selected region is given by Equation 3.9, where v_1^+ and v_1^- are the positive and negative parts of the primary eigenvector, and T is the transpose operator.

v_i = \begin{cases} |v_1^+| & \text{if } |v_2^T v_1^+| < |v_2^T v_1^-| \\ |v_1^-| & \text{if } |v_2^T v_1^-| < |v_2^T v_1^+| \end{cases}    (3.9)

The dominant region v_i is then normalized so that its sum equals one and serves as the eigen-blob model for the cluster c_i. This eigen-blob model (v_i) is basically a probability density function representing the likelihood of a pixel belonging to the speaker's location.
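The eigen-blob construction of Equations 3.8 and 3.9 can be sketched directly (Python with NumPy; Z holds one cluster's vectorized difference images, one per row):

    import numpy as np

    def eigen_blob(Z):
        """Eigen-blob model for one intermediate cluster (Equations 3.8-3.9).
        Z: (num_frames, num_pixels) vectorized difference images.
        Returns a pdf over pixels localizing the dominant mover."""
        C = np.cov(Z, rowvar=False)           # pixel covariance C_Z
        _, vecs = np.linalg.eigh(C)           # ascending eigenvalues
        v1, v2 = vecs[:, -1], vecs[:, -2]     # top two eigenvectors
        v1_pos = np.maximum(v1, 0.0)          # positive part of v1
        v1_neg = np.maximum(-v1, 0.0)         # negative part of v1
        # Keep the part that the second eigenvector splits (Equation 3.9).
        part = v1_pos if abs(v2 @ v1_pos) < abs(v2 @ v1_neg) else v1_neg
        return part / part.sum()              # normalize so the mass is one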


Figure 3.3 Illustration of the Eigen-Blob Method for Speaker Localization. Figures (a) and (e) show sample images from two intermediate clusters in which the speaker is located at the top-left and bottom-left respectively. (b) and (f) show the respective principal eigenvectors. The second eigenvectors, shown in (c) and (g), split the dominant region (positive or negative) of the primary eigenvectors. The dominant region represents the speaker's location and is shown in (d) and (h).


Figure 3.3 illustrates the eigen-blob localization for two intermediate clusters from a meeting of four people. The first eigenvector has non-zero components corresponding to the moving parts of the image; in addition, the sign of the eigenvector further divides the moving portions into two parts (shown by two different colors). The second eigenvector, which captures the next dominant mode of motion correlation and is orthogonal to the first eigenvector, is used to identify the portion from the speaker.

Intermediate clusters belonging to the same speaker should have similar video characteristics. Specifically, the eigen-blob models should overlap, and the degree of overlap can be considered a measure of similarity. Since the eigen-blob models are non-parametric densities signifying the speaker's location within the image, the distance between two models is computed using the symmetric Kullback-Leibler (KL2) measure, as shown in Equation 3.10.

d(v_i, v_j) = \frac{1}{2} \left[ \sum_a v_i(a) \log \frac{v_i(a)}{v_j(a)} + \sum_a v_j(a) \log \frac{v_j(a)}{v_i(a)} \right]    (3.10)

As a comparison to the eigen-blob localization approach, we also implemented the mutual information based localization technique, using an implementation similar to that in [71]. The mutual information between two multivariate random variables A and V is given by Equation 3.11, where p(a, v) is the joint probability distribution and p(a) and p(v) are the marginal distributions.

I(A; V) = \sum_{a \in A} \sum_{v \in V} p(a, v) \log \frac{p(a, v)}{p(a) \, p(v)}    (3.11)

The audio signal a(t) is represented by the MFCCs and the video signal v(t) consists of the pixel-wise difference of images three frames apart. The MI is computed between each video pixel v_xy(t) and the audio signal a(t). The MI is computed every frame using a two second window to estimate the probability distributions p(a), p(v) and p(a, v), which are assumed Gaussian.

Figure 3.4 illustrates sample results of localization using the MI approach. The mutual information is computed between the audio and each pixel in the image. Pixels that are highly correlated with the audio have a higher mutual information value. The mutual information image is thresholded to discard low value pixels and the resulting filtered image is displayed in (b) and (d). Since pixels belonging to a single person are connected, the localization output is considered as the connected component with the largest average MI.
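In code, the KL2 comparison of Equation 3.10 between two eigen-blob pdfs is compact; the small epsilon guarding empty bins is our addition, not specified in the text:

    import numpy as np

    def kl2(p, q, eps=1e-12):
        """Symmetric Kullback-Leibler distance between two pixel pdfs (Eq. 3.10)."""
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))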


Figure 3.4 Illustration of the Mutual Information Method for Speaker Localization. Figure (a) shows a sample image where the person in the top left is speaking, and (c) shows a sample image where the person on the bottom left is speaking. (b) and (d) show the mutual information images for (a) and (c) respectively.

The representative MI localization images in Figure 3.4 show that the MI performs better when the speaker is facing the camera, as seen in (b), than when the person is facing away from the camera, as in (d). Interestingly, (d) shows that even when the face is not visible, there are regions around the speaker's body that are correlated with the speech. Compared to Figure 3.4 (b) and (d), the localization results are better in Figure 3.3 (d) and (h). We believe that this is because the MI approach seeks instantaneous correlations between the pixels and the speech, a relation which is non-robust in the meeting domain, whereas the eigen-blob approach seeks correlated pixels in frames that are determined to belong to the same speaker using the audio channel. Since the eigen-blob approach considers longer durations, spurious movements by non-speakers are averaged out, leading to better localization results. A quantitative comparison of the two localization approaches is presented in section 3.4.2.

3.3.3 Iterative Clustering

Once audio and video models have been built, the intermediate clusters are merged using an agglomerative clustering framework in which the audio and video models are refined at each step. The procedure involves merging the closest pair of clusters into a new cluster and obtaining a new audio and a new video model for that cluster.


Since the merged cluster contains more data than either of the individual clusters, models derived from it will be more robust and representative of the speaker's audio-visual characteristics.

The distance between each pair of clusters is computed by combining the distances between the audio and video models, as shown in Equation 3.12, where d(a_i, a_j) and d(v_i, v_j) are computed using Equation 2.8 and Equation 3.10 respectively, and \beta_{ij} is a weighting term that determines the influence of video on the overall distance and is calculated using Equation 3.13.

d(c_i, c_j) = (2 - \beta_{ij}) \, d(a_i, a_j) + \beta_{ij} \, d(v_i, v_j)    (3.12)

\beta_{ij} = \min(\beta_i, \beta_j), \quad \text{where } \beta_i = \sum v_{a,i}    (3.13)

Equation 3.13 requires some explanation. The eigen-blob model v_i for a cluster is sometimes fragmented over multiple persons. This happens because of consistent co-occurring motion, such as hand movements of a person who takes notes when someone else is speaking. Such fragmentation incorrectly reduces the distance between video models of the involved persons, and negatively influences the clustering procedure. Let v_{a,i} be the connected component of v_i that represents the maximum fraction of v_i, i.e. it is the blob that captures the maximum fraction of the pdf. Then \beta_i, which is the sum over v_{a,i}, can be considered a fragmentation measure of the eigen-blob model v_i; \beta_i will be one if v_i is not fragmented, and low if v_i is severely fragmented. The weighting term \beta_{ij} represents the confidence in the computed video distance. If either of the two eigen-blob models is fragmented, \beta_{ij} will have a low value, reflecting lesser confidence in the localization and reducing the contribution of d(v_i, v_j) to the inter-cluster distance.

Equation 3.12 is used to compute the pairwise distance between all of the intermediate clusters, and the pair with the lowest distance is merged. GMMs for speech and eigen-blob models for video are now built from the merged cluster and the iterative procedure continues till a stopping criterion is met. The stopping criterion is a function of the distance between audio and video models. If the \Delta BIC of the two audio clusters being merged is negative, or if there is no overlap between the eigen-blob models, the algorithm terminates. The incorporation of eigen-blob models into the stop criterion prevents merging clusters belonging to different speakers that have similar vocal characteristics.
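The fragmentation weighting and the combined distance can be sketched as follows (Python with NumPy/SciPy; thresholding the pdf at zero to define connectivity is our assumption):

    import numpy as np
    from scipy import ndimage

    def fragmentation_measure(v, shape):
        """beta_i of Equation 3.13: probability mass of the largest connected
        blob of the eigen-blob pdf v (reshaped to the image grid)."""
        img = v.reshape(shape)
        labels, n = ndimage.label(img > 0)
        if n == 0:
            return 0.0
        masses = ndimage.sum(img, labels, index=range(1, n + 1))
        return float(np.max(masses))          # 1.0 when v is one single blob

    def cluster_distance(d_audio, d_video, beta_i, beta_j):
        """Weighted audio-visual inter-cluster distance (Equation 3.12)."""
        beta = min(beta_i, beta_j)
        return (2.0 - beta) * d_audio + beta * d_video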


Figure 3.5 The NIST Meeting Room Setup [38]. Meetings are recorded using four fixed cameras, one on each wall, and a floating camera on the east wall. The audio is recorded using four table microphones and three wall-mounted microphone arrays, in addition to a lapel and a head microphone for each participant.

3.4 Results

3.4.1 Dataset and Performance Measures

The proposed audio and audio-visual speaker diarization and localization approaches are tested on a subset of the NIST pilot room meeting corpus [38]. The corpus contains 19 meetings recorded in a room rigged with five cameras and four table mounted microphones. Of the five cameras, four are fixed cameras mounted on each wall of the room facing the central table, and the fifth is a floating camera that zooms onto salient activities in the scene such as the speaker, the whiteboard, or the projection screen. Of the four table microphones, three are omni-directional microphones, and one is a 4-channel directional microphone. Each participant in the room also wears a head microphone and a directional lapel microphone. The meeting room setup is shown in Figure 3.5.

The database available to us contains the five camera feeds for each meeting, and the videos have a spatial resolution of 720x480 sampled at 29.97 Hz. There are two audio channels packaged with each video: one is a gain-normalized mix of all the head microphones, and the second is a gain-normalized mix of all the distant microphones.


The audio data is sampled at 44 kHz and has a resolution of 16 bits per sample. Of the 19 meetings, three meetings were excluded from the experiments because two of them did not have associated ground truth and the third consisted entirely of a presentation by one person. Eight audio-visual pairings are considered for each meeting by pairing each of the four fixed cameras with each of the two audio channels, resulting in 128 (16x8) meeting clips. From each clip, the first 30 seconds are discarded, and the next 10 minutes are chosen, resulting in approximately 21 hours of data.

In all of the meetings, participants are seated around a central table and interact casually. Depending on the type of the meeting, the participants discuss a given topic, plan events, play games or attend presentations. From time to time, participants may take notes, stretch, and sip drinks. In some of the meetings (5 and 9-12), participants leave their chairs to use the whiteboard or distribute materials. The audio and video signals from these meetings are quite complex because the meetings are unscripted and of long durations. Since only a single camera view is considered at a time, most faces are non-frontal and sometimes participants are only partially visible. In some meetings, a participant may not be visible at all in a particular camera view. Even when the faces are frontal, they are often occluded by the person's own hand. Similarly, the audio signal is complex, consisting of short utterances, frequent overlaps in speech, and non-speech sounds such as wheezing, laughing, coughing, etc. Sample images of all clips from cameras 1 and 2 are shown in Figures 3.6 and 3.7 respectively.

To quantitatively assess the localization performance, a coarse ground truth is defined by static boxes around each person. The output of the eigen-blob localization method is the dominant blob v_a obtained from the final cluster. This is a static region in the image which localizes the person in all meeting frames where the person spoke. The output of MI localization is the connected component of the MI image with the highest average MI, and so this region varies from frame to frame. For a frame i, a hit occurs if the region output by a localization method for that frame, S(i), intersects with the ground-truth box around the speaker, B(i), as shown in Equation 3.14.

h(i) = \begin{cases} 1 & \text{if } S(i) \cap B(i) \ne \emptyset \\ 0 & \text{otherwise} \end{cases}    (3.14)
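Equation 3.14 is an overlap test between a predicted region and a ground-truth box. A minimal sketch, assuming both are given as boolean pixel masks of the same image size:

    import numpy as np

    def box_mask(shape, top, left, bottom, right):
        """Rasterize a static ground-truth box into a boolean mask."""
        m = np.zeros(shape, dtype=bool)
        m[top:bottom, left:right] = True
        return m

    def hit(pred_mask, gt_mask):
        """Equation 3.14: h(i) = 1 if the predicted region S(i) intersects
        the ground-truth box B(i), else 0."""
        return int(np.any(pred_mask & gt_mask))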


Figure 3.6 A Sample Image for Each of the Meetings Recorded by Camera 1. Each figure shows one image from the entire meeting. Some of the participants face the camera while the others face away from it. Camera 3 is mounted on the opposite wall and provides a similar view.


Figure 3.7 A Sample Image for Each of the Meetings Recorded by Camera 2. The figures show one image for each of the meetings recorded by camera 2. Most of the participants are visible in profile view with varying degrees of occlusion. Camera 4 is mounted on the opposite wall and provides a similar view.


Only non-overlapping speech frames where the speaker is visible (completely or partially) in a camera view are evaluated. Representing the subset of frames over which localization is scored as T_e, the hit ratio is computed as the ratio of hits to the number of frames in T_e, as shown in Equation 3.15.

HitRatio = \frac{\sum_{i \in T_e} h(i)}{|T_e|}    (3.15)

The diarization performance is measured using the diarization error rate (DER) defined in [38]. To compute the DER, a one-one mapping is performed between the M system clusters (final clusters) and the N reference clusters (one for each speaker), such that the mapping maximizes the total number of frames in agreement. The DER is then computed on non-overlapping speech frames as the ratio of the number of incorrectly mapped frames to the total number of frames in the recording.
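Both measures are simple to score given frame-level outputs. A minimal sketch (Python with NumPy/SciPy): the Hungarian assignment below realizes the one-one mapping, and the sketch scores only the mapped-frame errors described above, not a full NIST scoring tool.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hit_ratio(hits):
        """Equation 3.15: hits is a boolean array h(i) over the scored frames T_e."""
        return float(np.mean(hits))

    def der(system, reference):
        """Diarization error rate over non-overlapping speech frames.
        system, reference: integer cluster labels per frame, starting at 0."""
        M, N = system.max() + 1, reference.max() + 1
        agree = np.zeros((M, N))
        for m in range(M):
            for n in range(N):
                agree[m, n] = np.sum((system == m) & (reference == n))
        rows, cols = linear_sum_assignment(-agree)   # maximize total agreement
        return 1.0 - agree[rows, cols].sum() / len(system)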


3.4.2 Localization

For the eigen-blob localization method, the v_a blob obtained from a final cluster serves to localize the speaker in the video frames of that cluster. Figures 3.8 and 3.9 illustrate the eigen-blob models for some of the final clusters in meetings 3 and 6. The two meetings have very different characteristics: meeting 3 is a planning meeting with frequent note-taking activity, while meeting 6 is a card game scenario with participants frequently reaching out to the center of the table to pick and drop cards.

Figures 3.8 (a)-(d) show the eigen-blob models for the final clusters of meeting 3. The model correctly localizes only the speaker for the two difficult cases in (a) and (c), where the speaker is facing away from the camera and where the speaker is partially visible, respectively. The frontal person in (b) is also correctly localized, but in (d), a blob of the model lies on a non-speaker. Similarly, (e)-(h) show the models for four of the six final clusters of meeting 6. Although the models mostly localize the current speaker, fragmented blobs lying on other speakers are seen in (f) and (h). Particularly, (h) reveals that there is note-taking activity associated with the person's speech.

Figure 3.9 shows the eigen-blob models for the same data in the second camera view, where only the profiles of the participants are visible. The models correctly localize the speaker in most clusters, but blobs lie on non-speakers in (c), (d) and (f). A green bounding box shows the true location of the speaker and the red box indicates non-speakers covered by the model. In most cases, the dominant blob (v_a) of the eigen-blob model correctly localizes the speaker. However, when a non-speaker exhibits consistent motion that exceeds the speaker's motion, v_a incorrectly localizes the non-speaker, which happens for the cluster shown in Figure 3.8 (d).

Figure 3.8 Localization of Speakers in the First Camera View. Figures (a)-(d) show localization of the four speakers from meeting 3 and (e)-(h) show localization of four of the six speakers from meeting 6. In (d), (f), and (h), fragments of the eigen-blob model lie on the non-speakers, either because of imperfect clustering or co-occurring motion, as in (h) where note-taking activity usually co-occurs with the speech of this particular speaker. The true speaker location is marked by a green box and a non-speaker covered by the model is marked by a red box.

Figure 3.9 Localization of Speakers in the Second Camera View. Figures (a)-(d) show localization of four persons from meeting 3 and figures (e)-(h) show localization of four of the six participants of meeting 6. In figures (c), (d), and (h), fragments of the eigen-blob model lie on the non-speakers. True speaker locations are marked by green boxes and non-speakers covered by the model are marked by red boxes.


Figure 3.10 Localization Performance Using the Eigen-Blob and Mutual Information (MI) Based Methods. For the eigen-blob method, the v_a blob is considered as the system's output, and for the MI method, the region with the highest average mutual information is considered as the system's output. A hit occurs if the system output overlaps with the speaker's true location. Only non-overlapping speech frames in which the speaker is visible are scored. Since each audio channel in each meeting can be individually coupled with each of the four camera views, four performances are obtained per audio channel per meeting. A solid bar indicates the mean of these values and range bars indicate the maximum and minimum of the performances.

3.4.3 Comparison With Mutual Information Based Localization

Figure 3.10 compares the localization performances of the MI and eigen-blob methods using the HitRatio metric defined in Equation 3.15. Each audio channel is paired with the four cameras, and the localization result is presented as the mean of the four HitRatios, with error bars indicating the maximum and minimum of the four values. The localization results for the two audio channels are shown separately for the two localization methods, resulting in four bars per meeting. For each meeting, the subset of frames T_e over which the HitRatio is computed may differ if all speakers are not visible in all camera views.

At first glance we observe that the eigen-blob localization procedure works well in most meetings, but performs poorly on a subset of meetings (5, 9-12). As mentioned earlier, these meetings violate the assumption that the participants stay seated, which leads to poor localization results. This is because the eigen-blob models are split between the true speaker and a moving participant. The large motion magnitude generated by a moving person causes the v_a blob to localize the moving person instead of the speaker.


Incidentally, these are the only meetings where MI based localization performs better. This is because the MI is computed over short time windows and hence adapts to a change in speaker location.

The average HitRatio across the dataset for the eigen-blob localization method is 65.24% and 62.04% for channels 1 and 2 respectively, which is substantially higher than the 51.65% and 49.93% obtained using MI. If the five meetings (5, 9-12) are dropped, the difference is even more pronounced, with eigen-blob localization yielding 73.8% and 69.97% for channels 1 and 2 respectively, compared to MI's 52.67% and 51.44% for the respective channels.

Comparing across channels, we observe that the localization methods tend to perform better with channel 1, as its speech quality is better than that of channel 2. We also see that the variation of performance across cameras is much lower for the eigen-blob method than the MI method. The MI method performs better when the dominant speaker faces the camera, whereas the eigen-blob method is much more invariant to changes in camera view.

Errors in eigen-blob localization stem from two sources: one, an intermediate cluster may contain speech from other speakers and those frames will be marked with the location of the dominant speaker. Two, non-speakers that exhibit continuous motion over a long duration (swiveling on a chair throughout the meeting) will cause fragments of the eigen-blob model to lie on their location. If the non-speaker motion is consistent and of larger magnitude than the speaker's motion, the v_a blob incorrectly localizes the non-speaker.

Since no audio clustering is performed in the MI based methods, the method is not affected by diarization errors. Errors occur when a non-speaker's movements show stronger correlation with the audio signal, which occurs when a listener exhibits significant motion for short durations. The MI technique incorrectly localizes such movements, incurring a drop in performance. The situation worsens when the speaker is facing away from the camera, as visible motion from the speaker is less to begin with and hence easily overwhelmed by spurious background motion.

3.4.4 Diarization

In this subsection we present the results of speaker diarization, the task of determining "who spoke when?". We compare the performance of audio-only diarization to that of audio-visual diarization using the DER metric.


The audio-only speaker diarization framework is similar to the audio-visual diarization framework described earlier, except that no video features were used. ATPs were found using only the audio signal, the distance between intermediate clusters was based only on the distance between GMMs, and the stopping criterion was based only on the \Delta BIC.

For each meeting, two audio-only diarization results are obtained, one for each channel. Similar to localization, four diarization results are obtained for each audio channel, by combining each of the four cameras individually with the audio channel. The results are presented as the mean of the four results, with error bars indicating the maximum and minimum of the four results. The DER is computed over all non-overlapping speech frames irrespective of whether the speaker is visible or not in a camera view.

Figure 3.11 Comparison of Diarization Performance Using Audio-Only and Audio-Video Information. For most meetings, the incorporation of video information results in a lower DER. Meetings 5 and 9-12 do not gain much improvement by incorporating video because of poor localization in these meetings. Since each audio channel is paired individually with four camera views, four results are obtained, with the solid bar indicating the mean and range bars indicating the maximum and minimum of the four DERs.

Figure 3.11 compares the performance of the diarization scheme when using only audio to that when using both audio and video. The average DERs of audio-only diarization are 19.15% and 22.54% for channels 1 and 2 respectively. The incorporation of localization results in average DERs of 16.27% and 18.42%, which corresponds to relative improvements of 15.0% and 18.02% respectively.


However, it can be seen that the contribution of video is not consistent in all meetings. Specifically, meetings 5 and 9-12 do not gain much by incorporating video distances. The reason is that in these meetings, some participants leave their seats for short durations, leading to fragmented video models and resulting in a low value of \beta when computing inter-cluster distances using Equation 3.12. If these meetings are eliminated, the average DERs for channels 1 and 2 are 19.62% and 24.01% for audio-only diarization and 15.47% and 17.8% for audio-video diarization. This represents relative improvements of 21.16% and 25.87% respectively.

Figure 3.11 reveals that the average DER for channel 2 is higher than that for channel 1. This is expected, since channel 1 is obtained from head microphones and thus has better quality than channel 2, which is recorded from distant microphones. A similar pattern is found for audio-visual diarization, since the localization process is essentially guided by the audio.

Errors in the audio-only diarization stem from multiple sources, such as the speech/silence detector, missed ATP boundaries and incorrect clustering. In addition to the above sources, poor localization contributes to errors in the audio-visual diarization. Poor localization may place fragments of v_i on a non-speaker, incorrectly reducing the video distance between models belonging to different speakers. In the extreme case, where the speaker is not visible in a particular camera, this situation is inevitable. However, in such cases, the eigen-blob model is split almost evenly amongst the other participants. This results in a low value of \beta for v_i, reducing the contribution of the video to the inter-cluster distance and preventing the poor localization from adversely affecting the diarization process.

3.5 Summary and Conclusion

Speaker diarization and localization are problems that are receiving increasing attention from various research communities. Most of the work is on meetings recorded in rooms which are rigged with multiple audio-visual sensors, which limits the meeting venue to such smart rooms. This work focuses on meetings recorded using a simple setup consisting of one camera and one microphone, which is easily portable, eliminating this limitation. Previous approaches that perform joint audio-visual analysis in the single camera, single microphone scenario seek correlations between the two spaces. These solutions assume that the audio and video signals are instantaneously correlated, but as demonstrated in this work, the assumption does not hold in the meeting domain.


In the proposed approach, instead of formulating the problem as finding correlations across spaces, clustering is performed in individual spaces. The association across spaces is based on the assumption that, on average, a speaker exhibits more body movements than a listener. Clustering in the audio space is performed first to find longer durations in which a single person is speaking. Assuming that the dominant motion in the corresponding video frames is due to the speaker's movement, speaker localization involves finding pixels with high temporal correlation in the movement space using a novel eigen-analysis approach. Clusters are then merged iteratively based on their distances in both the audio and visual spaces. By merging clusters from the same speaker, better audio and video models are progressively built until the number of clusters is equal to the number of speakers. Here, localization is essentially guided by the audio, and diarization benefits from the visual information.

The approach is evaluated on a substantially large dataset (21 hours) of unscripted real meetings. The dataset is obtained by pairing two audio channels of different sound qualities with four different camera views. Localization results on this challenging dataset find that the eigen-blob based method outperforms the MI based method by about 40% relative. In addition, the eigen-blob based localization is less sensitive to changes in camera view. The novelty of the diarization process is in its use of motion information from the entire body, rather than just the face.

Results obtained by incorporating video information into the clustering process lead to a relative improvement of about 16% over that of using audio alone. These are encouraging results given the nature of the meetings and the quality of the video signal. Of course, the video results would be much more accurate in predicting the speaker if each participant's face was visible and the lips were tracked. However, these results show that there are cues in the video beyond the face that tell of speech activity. This work exploits such video information on a global scale without relying on explicit face/head/hand detection and without assuming frontal faces. Also, since the approach does not require training or a priori information, it is readily adaptable to other domains.


CHAPTER 4
INDEXING MEETING ARCHIVES

Meetings are an important aspect of any organization's day to day working, where information is disseminated, ideas are discussed and decisions are taken. Recognizing their significance, many organizations have begun archiving their meetings, and as can be expected, the size of such archives is growing constantly. For the archives to be of practical use, an efficient query mechanism is needed that can accurately identify portions of a meeting that match a user's query. While the previous chapter dealt with determining "who spoke when?" and localizing the speaker in a single meeting, this chapter describes the retrieval of meeting segments from the entire archive that match a user's query.

A variety of individual behaviors and group interactions manifest themselves in meetings, and thus meeting recordings are complex multimedia documents that form the basis for a rich set of queries. Some of the ways in which a user might want to query a meeting archive are:

- By the participants who attend a meeting
- By durations of the meeting when a participant spoke
- By activities that characterize a meeting (note-taking, vote of agreement)
- By the meeting category (presentation, group discussion, etc.)

This work specifically focuses on indexing archives based on "who spoke when?". This allows various queries such as finding all segments of the archive where a particular person spoke. In addition, such indexing is a prerequisite to other higher level queries such as meeting categorization and activity/event recognition.

The way in which a user may query the archive can vary, such as query by keyword or query by example. Figure 4.1 shows a schematic outlining the two frameworks. To support queries by keywords, the various types of targets have to be defined in advance and annotated by the keywords. For example, to search the archive for all of John's speech samples, at least some of his speech samples


Figure 4.1 Comparison Between Query by Keyword and Query by Example. The figure shows the requisite steps in the two frameworks. Query by keyword requires that the user possesses the domain knowledge required to express his query by an appropriate keyword. In addition, targets in the database have to be annotated with corresponding keywords. In the query by example framework, the user must supply sample data as the query. The onus is on the system to extract targets from the database and match them with the sample to return appropriate results.

must be tagged by his name and modeled. Querying the archive is then a classification problem, where features from untagged speech samples are matched against John's speech model. One shortcoming


is that the user needs domain knowledge; for example, a user browsing a meeting cannot find a participant's samples unless he knows his/her name. A second and more severe limitation is that queries can only be run for targets whose models already exist.

The query by example framework is much more flexible: the user may select a speech sample from the meeting as a query. The user does not need to know the identity of the speaker. The user presents this query to the system and the system extracts all speech segments from the meeting and matches them with the query sample to generate the results. In concept, the query by example framework is much less restrictive as the query can be directly defined from the data without trying to describe the concept in words. This overcomes two issues. First, there is no loss in translation, which occurs when a complex multimodal event is described by a few keywords. Second, it is possible to search for events that have not been specifically categorized. Because of these reasons, a query by example framework is pursued in this work. Also, the query by example framework can be used first to generate clusters containing similar targets, which can then be annotated with appropriate keywords.

In the proposed retrieval framework, the recognition of individuals using speech and/or face plays a major role. This facet of the indexing falls under the realm of biometrics. Typically in biometrics applications, biometric samples are obtained through explicit user co-operation and these segmented samples are compared to similar samples in the database. In practical applications, care is taken to attenuate variations in the system and the environment so that samples are acquired under similar conditions. Unfortunately, in the meeting room scenario, such variations are inevitable. The participants may be seated in different positions in different meetings or meetings may be held in entirely different rooms, affecting the sample acquisition process.

Thus the evaluation of a retrieval system consists of evaluating the algorithms that perform biometric recognition and then evaluating the retrieval performance of the entire system. Section 4.1 of this chapter first evaluates the performance of biometric algorithms on static samples using a special database that enables studying the performance of algorithms when samples are collected under varying conditions. This provides an insight into the robustness of the algorithms to sample variations and benchmarks their performance. In addition, studies are also conducted to ascertain the best method to combine information from face and voice for the purpose of biometric identification. The findings of this study influenced the design of the meeting indexing system described in section 4.2.


4.1 Evaluation of Biometric Algorithms

As mentioned earlier, the performance of the retrieval system depends in large part on the performance of the systems that perform recognition using face and/or voice. The performance of these biometric recognition systems in turn is affected by sample variations that arise due to variations in the system/environment. Although the effect of sample variations on accuracy is a well known phenomenon, the effect must be quantified to understand the contribution of sample variations to retrieval accuracy. Thus, prior to evaluating the performance of the retrieval system, we first conducted studies to evaluate the performance of the face and voice recognition systems under controlled and mismatched conditions. Studies were also conducted to evaluate popular normalization and fusion techniques, to find the optimal method to combine information from the two modalities under different conditions.

For the sake of completeness, a thorough introduction to biometric systems is provided in section 4.1.1. The section highlights the issues that affect the performance of such systems and describes the popular approaches used to overcome them. Section 4.1.2 describes the face and voice database that was created specifically to study the effect of sample variations on the system accuracy. Section 4.1.3 describes the normalization and fusion procedures used to combine information from multiple samples and multiple modalities. The results of this study, and their implications on the meeting indexing system, are presented in Section 4.1.4.

4.1.1 Description of a Typical Biometric System

This section provides a brief description of biometric systems. We first present a schema of a typical biometric system, followed by definitions of relevant terms, and then highlight the various factors that affect system performance and the approaches taken to overcome them. The overview provided here is by no means comprehensive; for a more detailed introduction, we refer the interested reader to the publication by Jain et al. [50].

Figure 4.2 shows the various components of a typical biometric system. The first step of any biometric system is data collection, also called sample acquisition. Here, the user presents his/her biometric to the system, which is captured using an appropriate sensor. For example, face images are captured using a regular RGB camera whereas voice samples are captured using a cardioid microphone. Various presentation effects affect the biometric sample; face images can vary due to expressions and voice


Figure 4.2 Components of a Typical Biometric System. A typical biometric system consists of four modules that perform data collection, signal processing, storage and decision making. The first step involves acquiring a biometric sample using a suitable sensor. Features extracted from the sample are stored in the database. During matching, features from the database are matched with features from the current sample to generate a decision.

samples can be affected by the user's mood. Oftentimes, the acquired sample undergoes preliminary checks to ensure that it is of sufficient quality.

Once the sample has been acquired and deemed to be of good quality, it passes through the signal processing stage. The first step of this stage is usually segmenting the region of interest from the biometric sample. In the case of biometric samples that are images, such as face and fingerprint, this means foreground-background segmentation and discarding the background. In the case of voice, this means removing durations of silence from the speech sample. This is followed by a feature extraction stage, whose goal is to extract representative and discriminative features from the sample. The features may be stored in a database as a reference for future matching. During matching, the features are then represented in a manner required by the classification algorithm, e.g., voice features may be used to


generate a Gaussian mixture model and fingerprint minutiae may be represented in polar co-ordinates. During the matching stage, the classifier compares features from the current sample to features from one or more samples in the database and outputs a score. The decision module compares the score with a pre-set threshold to determine if the user's claim should be accepted or rejected.

The field of biometrics is fairly new and borrows heavily from various fields such as signal processing theory, probability theory, database indexing, and pattern recognition, to name a few. Consequently, various terms have been borrowed from the literature and more often than not their meanings have been modified to better suit this application domain. In addition, new terms have been introduced to denote entities for which there already exist definitions in other fields. Here, we present a short list of terms and definitions frequently encountered in the biometrics literature and used in this chapter. For an exhaustive list, the reader is referred to [62].

Modality: This refers to the physiological or behavioral trait exploited by the system to perform recognition. A modality may be face, voice, iris, fingerprints, gait, DNA, etc.

Enrollment: This is the process by which new users are added to the system. The process involves adding one (or more) samples from one (or more) modalities of the individual. Additional meta-information is collected with each template of a particular user.

Verification: This is one of the two modes of biometric recognition. The user claims to be a particular identity and provides a sample to the system. The system then matches the enrolled sample of the claimed identity with the user's current sample to determine if the identity claim is valid.

Identification: This is the second mode of biometric recognition. The user only presents the system with a sample and makes no claim about his/her identity. The system matches the user's sample with all of the enrolled samples to determine the true identity of the person.

Gallery: The gallery consists of all samples that are enrolled by the system.

Probe: The term probe is used to denote the sample presented to the system by the user. This probe is then matched to one or more samples in the gallery.


Score: A biometric system computes the similarity (or distance) between the probe and a gallery sample. This similarity (or distance) is reported as a scalar value, also called a score.

Match Score (or Genuine Score): If it is known that the probe and gallery samples came from the same person, the score is termed a match score or a genuine score.

Non-match Score (or Impostor Score): If it is known that the probe and gallery samples came from different persons, the score is termed a non-match or impostor score.

Decision Threshold: A biometric system computes the score by computing the similarity (or distance) between two samples. The system binarizes the score, i.e., outputs a decision indicating match or non-match, by comparing the score with a preset value also called a decision threshold.

False Acceptance Rate (FAR): A False Acceptance occurs when the system incorrectly determines that samples from two different persons belong to the same person. The False Acceptance Rate is the percentage of time that an impostor will be accepted by the system.

False Rejection Rate (FRR): A False Rejection occurs when the system incorrectly determines that samples from the same person come from two different persons. The False Rejection Rate is the percentage of time that a genuine person will be rejected by the system.

True Acceptance Rate (TAR): The True Acceptance Rate is the percentage of time that a genuine person will be accepted. It is related to the FRR: TAR = 100% - FRR.

Receiver Operating Characteristic Curve (ROC): For a biometric system, the TAR and FAR change based on the decision threshold and there exists a monotonic relation between the two, i.e., if the system threshold is varied to reject more of the impostor claims, certain genuine claims will also be rejected, and vice versa. The ROC curve is generated by varying the decision threshold between two extremes and plotting the resulting TAR against the FAR. The ROC provides a comprehensive measure of system performance.

Multimodal Biometric System: A multimodal biometric system combines information from more than one modality, in most cases improving the overall system accuracy.


Fusion: The term fusion, as used in biometrics, loosely means to combine information from multiple sources.

Intramodal Fusion: In this work, intramodal fusion is used to denote fusion where the sources of information are multiple samples of the same person from the same modality.

Multimodal Fusion: Here, the sources of information are from samples of different modalities.

Score level fusion: One of the most popular ways in which to combine information from two systems is to perform a weighted summation of their individual scores to obtain a final score which is then compared against the decision threshold. This is referred to as score level fusion.

Decision level fusion: When only binary decisions (accept/reject) are available from the individual systems, rules such as AND/OR can be applied to obtain a final decision from the decisions of individual systems. This is known as decision level fusion.

Feature level fusion: Often when combining information from multiple systems operating on the same modality, feature level fusion is used to first obtain a better sample from the individual samples and perform recognition based on this enhanced sample. E.g., combining two simultaneously recorded speech signals to obtain a signal with a higher signal to noise ratio, or obtaining a face image with increased resolution using a super-resolution algorithm on multiple face images.

A biometric system is essentially a pattern classification system that involves feature extraction, representation and matching. Like any pattern classifier, its performance depends on the input data. The key sources of error in biometric recognition stem from the following effects.

Channel effects: These are errors introduced by changes in system parameters, assuming that the sample provided by the user is always the same. As shown in Figure 4.2, the system consists of the acquisition sensor that captures the user sample, the signal processing stage that extracts features from the sample and the matching module that computes the distance between two samples. Imperfections in any of these three modules contribute to the system error. However, the sensor parameters are the most susceptible to environmental and temporal variations, and changes in the sensor also affect the downstream modules. For example, the gain of a camera


may vary with ambient brightness, a change in humidity affects fingerprint sensors, and ambient noise affects parameters of a microphone. This leads to different images of the same sample, which affects recognition accuracy.

Presentation effects: When a user presents a sample to the system, he/she may not be consistent over all trials. For example, a person might smile when posing for one face image and not do so for another. Finger placement on the fingerprint sensor may have variations in orientation and translation. Voice samples may be uttered at different speeds. These are referred to as presentation effects in the biometrics literature. The variations in a user's particular sample may make it more similar to a sample from another user rather than the user's enrolled sample, affecting system performance.

There are many issues that can be studied when carrying out a thorough evaluation of a biometric system. However, in this dissertation we conducted only the following studies, which focus on quantifying the effect of sample variations and identifying the optimal method of combining information when more than one sample and more than one modality are available. The two modalities investigated in this study are face and voice, because they are relevant to the meeting scenario where face and voice samples can be obtained in an unobtrusive manner by analyzing the video and audio streams.

Sample variations: To introduce a natural variation in samples, we collect samples in two starkly different environments. The first is a typical office environment. In this indoor environment, illumination and background noise variations are at a minimum. The second is outside a building, where there is no control over illumination or noise. Sample variations due to varying environmental conditions are known to degrade the performance of biometric systems that are known to perform well under controlled conditions. One aim of this study was to quantitatively evaluate such performance degradation.

Score normalization: The adage that "two heads are better than one" holds well in pattern recognition, where multiple algorithms are regularly combined to yield a system with better performance than its individual constituents. However, there are certain issues that must be addressed prior to such combination. The two algorithms being combined may have different output ranges. In our case, the face system outputs a score between 0 and 10 whereas the voice


system outputs a score between 0 and 1. The ranges must be normalized prior to combination. The study investigates two well known schemes, min-max and z-scaling, for the purpose of normalization.

Score fusion: The role of fusion is to find the optimal way in which information from two or more systems can be combined to obtain a single score. Essentially, it is to find the mapping of two scores $x \in \mathbb{R}$ and $y \in \mathbb{R}$ to $s = f(x, y)$, $s \in \mathbb{R}$, that leads to the optimal performance.

Combining information from multiple samples: Typically, the performance of the system improves as the number of samples on which the decision is based increases. This is very relevant to the meeting scenario as multiple samples can be extracted per individual from a meeting recording. Consequently, studies are conducted that assess the performance improvement when using multiple samples.

Combining information from multiple modalities: Using information from more than one modality increases both system performance and robustness. Since meetings are audio-visual recordings, speech and face images are two modalities that are readily available. Studies were conducted to quantitatively assess the advantage of using both modalities over a single modality.

In this study, commercial biometric systems were used to collect face and voice samples from the same subjects in varying environment conditions. This realistic evaluation on a dataset of 116 subjects shows that the system performance degrades when samples are affected by uncontrolled variations, but by multimodal score fusion the performance is enhanced by 20%.

To study the effect of environmental variations, a database containing samples acquired in different environments is required. However, most multimodal databases for face and voice are collected in an indoor environment. The M2VTS [78] and XM2VTSDB [65] databases contain voice, 2D and 3D face images, collected in an indoor environment. Similarly, the CMU-PIE database (sim03cmu) captures pose, illumination and expression variations of face images only in an indoor environment. Certain databases that were created to study the effect of variations contain samples collected both indoors and outdoors, but unfortunately are not multimodal; e.g., the FERET database [76] consists of face images collected both indoors and outdoors, but does not contain voice. When voice is recorded outdoors, usually cellular phones are used as the capturing device [81], which would be unsuitable for meeting


rooms. Studies on voice databases where the same subject is recorded both indoors and outdoors have not been published either.

Because of the dearth of suitable databases, most face-speech fusion studies have been conducted on indoor databases which have limited variations. Various approaches have been explored for score-level and decision-level fusion. Brunelli and Falavigna [14] fused face and voice samples collected in an indoor environment using the weighted product approach. The multimodal fusion resulted in a recognition rate of 98%, where the recognition rates of the face and voice systems were 91% and 88%, respectively. Bigun et al. [12] used Bayesian statistics to fuse face and voice data using the M2VTS database (which only contains indoor samples). Jain et al. [47] performed fusion on face, fingerprint and voice data based on the Neyman-Pearson rule. The database used for this experiment consists of samples from 50 subjects in an indoor environment. Sanderson and Paliwal [87] compared and fused face and voice data in clean and noisy conditions. Although the data was collected indoors, they simulate a noisy environment by corrupting the data with additive white Gaussian noise at different signal to noise ratio levels.

Thus, from the literature surveyed, we see that most fusion experiments involving face and voice use an indoor dataset or try to induce variations by adding noise. Table 4.1 summarizes different studies and contrasts them with this work, which focuses on experiments that use face and voice samples collected in indoor and outdoor environments.


Table 4.1 Summary of Existing Literature Dealing With Biometric Recognition Using Face and Voice.

| Study | Algorithm used | DB size | Covariates of Interest | Top Individual Performance | Fusion Performance |
|---|---|---|---|---|---|
| UK-BWG [62] | Face, voice: Commercial | 200 | Time: 1-2 month separation (indoor) | TAR* at 1% FAR**: Face: 96.5%, Voice: 96% | - |
| Brunelli [14] | Face: Hierarchical correlation; Voice: MFCC | 87 | Time: 3 sessions, time unknown (indoor) | Face: TAR = 92% at 4.5% FAR; Voice: TAR = 63% at 15% FAR | TAR = 98.5% at 0.5% FAR |
| Jain [47] | Face: Eigenface; Voice: Cepstrum coeff. based | 50 | Time: two weeks (indoor) | TAR at 1% FAR: Face: 43%, Voice: 96.5%, Fingerprint: 96% | Face + Voice + Fingerprint: 98.5% |
| Sanderson and Paliwal [87] | Face: PCA; Voice: MFCC | 43 | Time: 3 sessions (indoor); noise addition to voice | Equal Error Rate: Face: 10%, Voice: 12.41% | Equal Error Rate: 2.86% |
| This study | Face, voice: Commercial | 116 | Location: indoor and outdoor (same day) | TAR at 1% FAR, indoor-outdoor: Face: 80%, Voice: 67.5% | TAR = 98% at 1% FAR |

*TAR - True Acceptance Rate. **FAR - False Acceptance Rate.

4.1.2 Database

For a realistic testing of multimodal authentication systems, the dataset should mimic the operational environment. Many of the reported tests for biometric fusion are conducted on a multimodal database that is composed of single biometric databases collected for different individuals. Some databases like XM2VTS [65], DAVID [22], MyIdea [28] and Biomet [36] that contain face and voice samples from the same user are populated from an indoor controlled lab environment. The BANCA [8] database contains outdoor face and voice samples, but they were collected in four different geographical locations and four different languages, making it unsuitable for this study. A recent overview of multibiometric databases is given in [31].

For evaluating the robustness of algorithms to sample variations, a database that captures variations that occur naturally in operational scenarios is needed. Since none of the existing public databases were suitable for this study, a new database had to be created, details of which are also provided in [108].


Figure 4.3 Camera Setup for Indoor Collection of Face Images. Three light sources L1, L2, L3 were used in conjunction with normal office lights.

The database contains face and speech samples of 116 subjects collected at indoor and outdoor locations on the same day. While creating the database, care was taken so that it would involve extreme variations but under naturally occurring conditions. The outdoor environment was near the entrance to a building. The indoor environment was similar to a quiet office environment. A separate section of the office was rigged to collect the face images. Figure 4.3 shows the indoor setup to capture the face photos.

The subject is seated on a chair with adjustable height. Two images of the subject were taken. For the first one, the lights L1, L2 and L3 were turned off and only regular office lights remained on. The second one is taken with all three lights and office lights turned on. This captures illumination variation similar to that in [66].

For the outdoor environment, the two photos were against different backgrounds. Figure 4.4 shows the indoor and outdoor face images for one of the subjects. Figure 4.4(a) was taken with all three lights on and Figure 4.4(b) was taken with only the normal office lights on. Figures 4.4(c) and (d) show the outdoor face image against two different backgrounds. The cameras used had an effective resolution of 8 mega-pixels and each image was 3264x2448 pixels in size.

For speech samples, the user utters a fixed phrase, "University of South Florida 33620". The indoor samples were captured simultaneously using a steerable array microphone and a regular cardioid microphone. The outdoor samples were captured using only the cardioid microphone. The sampling


Figure 4.4 Sample Indoor and Outdoor Face Images. Figure (a) was taken with all lights on. Figure (b) was taken with only normal office lights on. Figures (c) and (d) show outdoor background variations.

frequency was 11 kHz. For this study, only samples using the cardioid microphone were used to avoid effects of equipment bias in the indoor-outdoor scenario.

4.1.3 Methods

We study the performance of the individual unimodal systems in the indoor-indoor, indoor-outdoor and outdoor-outdoor scenarios. Next, we evaluate the performance improvement obtained through intramodal fusion on face and intramodal fusion on voice in the indoor-outdoor scenario. The performance improvement obtained by multimodal fusion is then compared to the intramodal performances.

Information fusion in biometrics is possible at the sensor, feature, score or decision levels [4]. In sensor level fusion, data from multiple sensors of the same modality are combined (e.g. via summing microphone outputs or mosaicing images from different cameras). The combined signal may offer a better SNR or more information and lead to lower classification errors. Fusion at the feature level


usually involves concatenating feature vectors from individual modalities. The features may be from the same modality, such as infrared and color images of the face, or from different modalities, such as face and speech features.

Score level fusion approaches involve using the scores from individual matchers to arrive at a classification decision, whereas in decision level schemes, accept/reject decisions are made by the individual systems and the final decision is obtained via a voting (e.g. majority, AND, OR) between the decisions of individual systems.

From the biometric literature it is apparent that fusion at the score level is the most preferred approach, because score level fusion allows fusing two modalities without requiring knowledge of the individual systems or access to the extracted features [48]. Also, having greater information than that available at the decision level allows for more flexible fusion approaches. Score level approaches can be further classified into classification and combination approaches. In the classification approach, a decision boundary is learned in the score space that separates the genuine and impostor classes, whereas in the combination approach, outputs of individual systems are combined into a single score based on which the classification is made.

Amongst the more popular classification approaches are support vector machines [37], neural networks [53], decision trees [11], and random forests [61]. In [59], these and other classification approaches are compared for the task of verification using hand shape and texture. In the combination approach, the raw scores have to be preprocessed before they can be combined into a single score. This is to account for the different score ranges of individual systems and the nature of the scores (similarity vs. difference). The scores can then be combined into a single score and the final accept/reject decision is taken based on this score.

In this work we illustrate the effectiveness of simple score combination schemes in improving intramodal and multimodal performance in indoor and outdoor scenarios. Similar to the investigation in [99], commercial systems are used in this study to ensure that the system performance does not depend on fine tuning algorithm parameters, which increases the generality and reproducibility of the results. Also, the results obtained will be better predictors of performance in an operational scenario where commercial systems are usually deployed.

We chose one of the top three performing systems at the FRVT 2002 [75] for face. The face recognition system uses Local Feature Analysis (LFA) to represent face images in terms of local


statistically derived building blocks. In this technique, face images can be represented by an irreducible set of building elements. These building elements are universal features like eye co-ordinates, nose position, etc. The uniqueness of a sample is based upon these characteristics as well as their relative geometrical position on a face image. The enrollment time was 15 seconds and verification time was less than 2 seconds.

The speaker recognition system is text dependent and uses a time warping algorithm to align the time axes of the input utterance and the reference template model of a speaker. The system will accept a voice sample of up to 10 seconds and enroll the user in less than 0.2 seconds. The matching procedure takes approximately one tenth of a second. The range of scores for the face system was 0-10 and for the voice system was 0-1. Thus score normalization was required prior to fusion. Various approaches have been proposed for normalization and fusion, and many works empirically evaluate various combinations to find the best performing combination for their experiments. However, a recent study [85] comparing seven different normalization techniques on different multimodal biometric systems concludes that no single normalization method performs best for all systems. Similarly, Ulery et al. [107] evaluate different combinations of normalization and fusion on a variety of fusion tasks and conclude that the best combination depends on a variety of factors. Since most multibiometric experiments (like this study) are conducted on a single multibiometric database, different studies lead to different conclusions regarding the best normalization and fusion scheme on that database. In light of such observations, it was decided to experiment with two simple normalization schemes (z-score and min-max) and two basic fusion rules (sum and max) that have been used as a baseline in most studies. The normalization and fusion schemes evaluated in this study are discussed below.

Normalization involves transforming the raw scores of different modalities to a common domain using a mapping function. When the function parameters are determined from a fixed training set, it is known as fixed score normalization [14]. Alternatively, these parameters can be derived for each probe-gallery interaction, yielding different parameters for different probes. This is known as adaptive normalization, where user and modality specific information can be incorporated in the normalization framework [99]. In this work, adaptive score normalization is used to transform a score, and the function parameters are determined from the set of scores obtained by matching the current probe to the entire gallery.


Formally speaking, let $G$ represent the gallery templates, $P$ represent the probe samples, and $S_{pg}$ represent the match score of a particular probe $p$, $p \in P$, with gallery template $g$, $g \in G$. Then $S_{pG}$ represents the vector of scores obtained when a probe $p$ is matched against the entire gallery $G$. In min-max normalization, the minimum and the maximum of this score vector are used to obtain the normalized score according to Equation 4.1. The normalized scores lie in the range 0 to 1.

$$ S'_{pg} = \frac{S_{pg} - \min(S_{pG})}{\max(S_{pG}) - \min(S_{pG})} \qquad (4.1) $$

In z-score normalization, the mean and standard deviation of the score vector $S_{pG}$ are used to normalize the scores as shown in Equation 4.2. The normalized score vector will have a mean of 0 and a standard deviation of 1.

$$ S'_{pg} = \frac{S_{pg} - \mathrm{mean}(S_{pG})}{\mathrm{std}(S_{pG})} \qquad (4.2) $$

Once normalized, the scores from different modalities can be combined using simple operations such as max, min, sum or product. The sum and product rules allow for weighted combinations of scores. The weighting can be matcher specific, user-specific [99] or based on sample quality [70]. In matcher specific weighting, weights are chosen to be proportional to the accuracy of the matcher (e.g. inversely proportional to the Equal Error Rate of the matcher). In user specific weighting, weights are assigned based on how well the matcher is known to perform for a particular individual. Similarly, in quality based weighting, the weights are assigned based on the quality of the sample (features) presented to the matcher.

Although weighting techniques do offer performance improvements, the performance of simple sum and max fusion does not lag far behind [99]. In addition, these fusion methods do not require any training. Hence in this work, we evaluated two basic fusion techniques, the sum rule and the max rule, the details of which are discussed in [56]. For the simple sum technique, we take the average of the scores being fused. Let $N$ be the number of modalities we wish to fuse, and let $S_i$ represent the score matrix of modality $i$, where $i = 1, 2, \ldots, N$. The simple sum technique is then given by Equation 4.3, and the max score is simply the largest of the scores obtained from each modality, as given by Equation 4.4.

$$ S'' = \frac{1}{N} \sum_{i} S_i \qquad (4.3) $$

$$ S'' = \max_i (S_i) \qquad (4.4) $$
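To make Equations 4.1 through 4.4 concrete, the following is a minimal sketch of adaptive normalization and the two fusion rules. It is an illustrative re-implementation, not the code used in this study; the array names and shapes are assumptions (one row of gallery scores per probe, as in the adaptive scheme described above).

```python
import numpy as np

def min_max_norm(scores):
    """Adaptive min-max normalization (Eq. 4.1): each row holds one
    probe's scores against the entire gallery."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / (hi - lo)

def z_score_norm(scores):
    """Adaptive z-score normalization (Eq. 4.2)."""
    mu = scores.mean(axis=1, keepdims=True)
    sd = scores.std(axis=1, keepdims=True)
    return (scores - mu) / sd

def fuse(face, voice, rule="max"):
    """Sum rule (Eq. 4.3) averages the modality scores;
    max rule (Eq. 4.4) keeps the larger one."""
    stacked = np.stack([face, voice])
    return stacked.mean(axis=0) if rule == "sum" else stacked.max(axis=0)

# Hypothetical raw score matrices: 116 probes x 116 gallery templates,
# with the face scores in [0, 10] and the voice scores in [0, 1].
face_raw = np.random.rand(116, 116) * 10
voice_raw = np.random.rand(116, 116)
fused = fuse(min_max_norm(face_raw), min_max_norm(voice_raw), rule="max")
```

Because the normalization is computed per probe from its own gallery score vector, no training set is needed, which matches the training-free property claimed for these baseline rules.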


4.1.4 Results and Discussion

This section carries the results of the following studies:

- The performance of individual biometric systems in the indoor-indoor, indoor-outdoor and outdoor-outdoor environments.
- Improvement of performance in the indoor-outdoor scenario by the use of multiple samples of the same modality (intramodal fusion).
- Improvement of performance in the indoor-outdoor scenario by the use of samples from different modalities (multimodal fusion).

First we compare the system performance when the samples are acquired under varying conditions to the performance when samples are acquired under the same conditions. The performance under varying conditions is studied by using indoor samples as the gallery and outdoor samples as the probe. We also refer to this as the indoor-outdoor scenario. This performance is then compared to performances in two other scenarios. In one, both gallery and probe samples come from an indoor environment (the indoor-indoor scenario) and in the other, both samples come from an outdoor environment (the outdoor-outdoor scenario).

For the indoor-indoor scenario, 116 genuine scores were generated by treating one set of indoor samples as the gallery and the other as the probes. The number of impostor scores was 13,340 (116 x 115). Another set of genuine and impostor scores was generated by reversing the gallery and probe samples. This could be done because the algorithm is not symmetric (swapping the gallery sample with the probe sample yields a different score). Thus a total of 232 genuine and 26,680 impostor scores were used to generate the indoor-indoor ROC (Receiver Operating Characteristic) curve. The same procedure was followed for the outdoor-outdoor scenario. For the indoor-outdoor scenario, each outdoor sample was matched to both the indoor samples, and so four genuine and 460 impostor


scores were generated for each subject. Thus, the ROC curve was generated from a total of 464 (116 x 4) genuine and 53,360 (116 x 460) impostor scores.

Figure 4.5 Performance of Individual Modalities in Different Environments. ROC curves describing the performance of (a) face and (b) voice systems in different scenarios.

Figure 4.5 compares the performance of both modalities in all three scenarios. From Figure 4.5(a) we observe that the face system performs very well in the indoor-indoor scenario, achieving 98% TAR at 1% FAR. When both probe and gallery are outdoor samples, the performance drops by 10%. The worst performance is in the indoor-outdoor scenario, where the performance drops by about 20% compared to the indoor-indoor scenario.

Figure 4.5(b) shows that the voice system has 88% TAR at 1% FAR in the indoor-indoor scenario. The system performance is almost the same even in the outdoor-outdoor scenario. Although a little surprising, this can be accounted for by the fact that outdoor samples were corrupted by ambient noise, whereas indoor samples were subject to noise from people speaking in the background. The signal to noise ratio in the outdoor-outdoor scenario is lower than that in the indoor-indoor scenario, but the noise energy is distributed over a wide range of frequencies. On the other hand, noise in the indoor-indoor scenario is from other people talking in the background. This noise has lower energy but occurs in the same frequency range as the speaker's signal. We hypothesize that these conflicting factors even out and result in similar performance for voice in the indoor-indoor and outdoor-outdoor scenarios. The worst performance of the voice system occurs in the indoor-outdoor scenario, where the performance drops to 67% TAR at 1% FAR.
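For reference, ROC curves such as those in Figure 4.5 can be produced by sweeping a decision threshold over the pooled genuine and impostor scores, following the definitions of TAR and FAR in Section 4.1.1. The sketch below is illustrative only; the random score arrays stand in for the commercial matchers' outputs and the counts mirror the indoor-indoor protocol above.

```python
import numpy as np

def roc_curve(genuine, impostor, num_points=100):
    """Sweep a decision threshold over similarity scores and report
    (FAR, TAR) pairs: the fractions of impostor and genuine scores
    accepted at each threshold."""
    all_scores = np.concatenate([genuine, impostor])
    thresholds = np.linspace(all_scores.min(), all_scores.max(), num_points)
    far = np.array([(impostor >= t).mean() for t in thresholds])
    tar = np.array([(genuine >= t).mean() for t in thresholds])
    return far, tar

# Hypothetical scores: genuine scores tend to exceed impostor scores.
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 232)      # e.g., 232 genuine scores
impostor = rng.normal(0.4, 0.1, 26680)   # e.g., 26,680 impostor scores
far, tar = roc_curve(genuine, impostor)
```

An operating point such as "TAR at 1% FAR" is then read off by locating the threshold whose FAR is closest to 0.01.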


We observe that both the face and voice systems suffer a drastic reduction in performance (about 20%) in the indoor-outdoor scenario compared to the indoor-indoor scenario. The face system suffers because a model built solely from indoor templates cannot capture the light variations of outdoor images. Similarly, the voice system suffers because of mismatched noise conditions.

Now we study how the poor performance of unimodal systems in the indoor-outdoor scenario can be improved by exploiting information from multiple samples of the same modality. First, we study the role of min-max and z-score normalization in improving the system performance. Next, we investigate the effects of sum and max fusion on system performance. Finally, we study the combined effects of normalization and fusion on the individual face and voice system performances.

For each subject, the two outdoor probes were matched to two indoor samples, generating four genuine and 460 impostor scores. Thus, for the raw and normalized ROCs, a total of 464 genuine and 53,360 impostor scores were used. Because there are six ways of combining four genuine scores ($\binom{4}{2}$), a total of 696 genuine and 80,040 impostor scores were used for the fusion ROCs.

Figure 4.6 shows several ROCs characterizing the performance of the face system in the indoor-outdoor scenario. Figure 4.6(a) shows the effect of score normalization. The raw performance of the face system is 80% TAR at 1% FAR. Both z-score and min-max normalization are effective, improving the performance by 13% and 15%, respectively.

Figures 4.6(b), (c), and (d) show the result of sum and max rule fusion of the raw, min-max normalized and z-score normalized scores, respectively. The performance improvement in each case is about 3%-5%. These curves also show that the max rule performs better than the sum rule and that min-max normalization slightly outperforms z-score normalization.

Min-max normalization and max fusion result in 96.4% TAR at 1% FAR, yielding a 16.4% improvement over the original system. The major contribution to this improvement is from normalization.

The drastic improvement in performance after normalization can be explained with the aid of Figure 4.7. The genuine scores generated by the system have a high variance, as can be seen from Figure 4.7(a). However, the separation between an individual's genuine and impostor scores is low. Thus a probe that generates high genuine scores will also generate considerably high impostor scores. Since the genuine distribution has high variance, some genuine scores are lower than such a probe's impostor scores, leading to classification errors. Score normalization results in translating and stretching the scores so that the overlapping area between the genuine and impostor scores is reduced, as seen in Figure 4.7(b), resulting in better performance.


Figure 4.6 Results for Intramodal Fusion for Face in the Indoor-Outdoor Scenario. Figure (a) shows the effect of different score normalization techniques. Figure (b) shows the effect of different fusion techniques. Figures (c) and (d) show the effect of fusion on the min-max and z-score normalized scores, respectively.

Figure 4.8 shows various ROCs characterizing the voice system in the indoor-outdoor scenario. Figure 4.8(a) shows that min-max normalization performs better than z-score normalization and improves the original performance by 12%. Figure 4.8(b) shows that for the raw scores, fusion using the max rule is better than the sum rule, improving the original performance by 7%. Figures 4.8(c) and (d) show the effect of different fusion schemes on the min-max and z-score normalized scores, respectively. The additional improvement from fusion is 4-7%. Overall, using min-max normalization and max rule fusion, the original system performance improves by 17%.


Figure 4.7 Effect of Score Normalization on Face Probability Densities. Figure (a) shows the genuine and impostor score distributions of the raw face system scores and Figure (b) shows the score distributions for the z-score normalized scores.

Interestingly, from Figure 4.8(a), we observe that z-score normalization worsens the performance at FARs below 1%. To investigate this further, we studied the histograms of the raw and z-score normalized scores. From Figure 4.9 we observe that the raw genuine score distribution is skewed so that the mode is greater than the mean. It is also evident that the distribution is not Gaussian and hence z-score normalization will not yield optimal results [65, 78, 93]. Z-score normalization pulled the mode closer to the mean, leading to greater overlap with the tail of the impostor distribution, worsening the system performance at low FARs.

From Figures 4.6 and 4.8 we observe that min-max normalization has a slight edge over z-score normalization, as reported in various indoor-indoor studies. Also, we find that max rule fusion outperforms sum rule fusion, which differs from observations in some other studies [45, 48, 86]. This is because the score distributions in our indoor-outdoor study vary from those found in indoor studies. We conjecture that the optimality of the fusion rule depends on the distribution of the genuine scores: if the genuine scores are Gaussian distributed, then the sum rule usually performs better. However, if the match score distributions are skewed, as seen in Figure 4.7(a) and Figure 4.9(a), then the max rule seems to be better.

Having studied the performance of the individual algorithms in different scenarios and the possible improvements from using information from multiple samples of the same modality, we now turn our attention to fusing information from multiple modalities (multimodal fusion).


Figure 4.8 Results for Intramodal Fusion for Voice in the Indoor-Outdoor Scenario. Figure (a) shows the effect of different score normalization techniques. Figure (b) shows the effect of different fusion techniques. Figures (c) and (d) show the effect of fusion on the min-max and z-score normalized scores, respectively.

For multimodal fusion in the indoor-outdoor scenario, two outdoor probe samples were matched to two indoor gallery samples, generating four genuine scores. Four genuine scores for each modality give rise to sixteen possible multimodal fusion combinations ($\binom{4}{1} \times \binom{4}{1}$). Thus, for each normalization and fusion scheme, the ROC curve was generated using 1856 genuine and 213,440 impostor scores.

Scores from the face and voice systems were normalized using either min-max or z-score normalization and fused using the sum or max rule. Figure 4.10(a) shows the system performance obtained using these different normalization and fusion schemes. We see that schemes using max rule fusion outperform those using the sum rule. As with intramodal fusion, we see that min-max normalization


Figure 4.9 Effect of Score Normalization on Voice Probability Densities. Figure (a) shows the genuine and impostor score distributions of the raw voice system scores and Figure (b) shows the score distributions for the z-score normalized scores.

Figure 4.10 Results of Multimodal Fusion in the Indoor-Outdoor Scenario. Figure (a) shows the performance of different normalization and fusion schemes for combining face and voice scores. Figure (b) compares the performance of multimodal fusion to intramodal fusion of the individual modalities.

followed by max rule fusion has the best performance. Compared to the performance of the original face system (shown in Figure 4.5), this scheme yields an 18.7% improvement in performance.

Figure 4.10(b) compares the performance of multimodal fusion to the intramodal fusion performance of face and voice. After intramodal fusion, the face system performs almost as well as the multimodal system. The multimodal system performance is 2.4% greater than that of the face system.


4.2 Identification in Meeting Videos

Meeting recordings are complex audio-visual signals that capture individual and group behaviors. The study of such interactions is of significant interest in various fields such as behavioral psychology and human computer interface design. Since meeting recordings serve as an accurate record of interactions between individuals of an organization, the automatic processing of meetings is an important application in its own right. For instance, organizations may want to automatically extract minutes of the meeting or have a list of all topics discussed in a particular meeting. Thus many organizations have begun archiving their meetings, leading to huge archives. For the archives to be of practical use, an efficient query mechanism is needed to extract information from such meetings.

Determining the identity of individuals participating in a meeting is probably one of the most important tasks from a query point of view. Often a user might want to find all meetings attended by a person, and solving the identification task equates to answering this query. Other higher level tasks such as finding who proposed a particular idea also require solving the identification problem. Unlike the traditional biometric identification scenario, where only static samples of faces and speech are to be matched across a gallery, recognition in meeting rooms is essentially continuous. Multiple samples of a person's speech and face images can be extracted from a meeting and used for matching. An additional level of difficulty is also introduced in this domain because the samples are inherently non-optimal, as they are obtained without requiring user co-operation. For example, the speech samples obtained may contain speech from other persons or background noise, and face images may be non-frontal. The difficulty is compounded by the fact that no explicit information regarding sample quality is available. The problem is essentially open-set identification; it is not known a priori whether the person exists in any other meeting of the archive.

This section details a query system to find all segments containing speech from a particular speaker in a meeting archive. Finding all speech segments from a particular user is an important task by itself as well as a prerequisite for other higher level queries. For example, in queries such as "Find where John speaks about the budget", the first step would be to query the archive for all of John's speech and then analyze the corresponding transcript for the word "budget".

The query to the system consists of a single speech sample or a single face image of the participant. The first step is to partition the meeting into homogeneous segments, contiguous durations of


the meeting which contain speech from only a single person. In the next step, we perform cross-modal association and clustering within the meeting to group speech segments belonging to the same person and find the associated face images. The resulting clusters, containing multiple speech and face samples from the same individual, serve as targets to which the query is matched. Since each individual's cluster contains multiple face-voice samples, match and non-match distributions can be learned for each individual. The query to the system is a single sample from a single modality, so audio-visual association is first performed to find the associated second modality from the query meeting. The expanded query, consisting of a face and a voice sample, is matched to all face-voice samples in the target cluster. Since the match and non-match distributions for the cluster's segments are known, a likelihood ratio based decision framework is employed. The average of the likelihood ratio over all samples of the cluster is used to determine if the cluster matches the query.

The proposed system is based on three novel ideas. First, we observe that typically in a meeting, there are a few participants and multiple speech samples from them. Rather than match a query speech sample individually with each speech segment from a target meeting, speech segments within the target meeting can be clustered first to obtain target clusters, one for each speaker. Since the recording conditions within the meeting are the same, samples from the same meeting can be compared more accurately. The query sample can then be matched to multiple samples of a target cluster, allowing for a more robust decision.

Second, since meetings consist of audio-visual recordings, face images of the speaker can be associated with the speech samples. The comparison of two meeting segments can now use the distances between their respective speech and face samples, instead of relying on either face or speech alone. The addition of a second modality improves the retrieval performance.

Third, different modalities provide better discrimination amongst different users. For certain subjects, speech can be more discriminative than face and for others, the opposite may hold true. Thus a decision framework in which speech and face distances are combined using user-specific weights outperforms a framework that uses static weighting [49].

The system schema shown in Figure 4.11 illustrates how the proposed approach combines these three ideas. Meetings in the archive are first segmented into individual segments containing speech from only a single user. Using audio-visual clustering, speech segments from the same speaker are clustered together and the speaker's face image is associated with each speech segment.


Figure 4.11 System Schema for Meeting Indexing. Each meeting of the archive is preprocessed as follows: Audio-visual segmentation is carried out to partition the meeting into homogeneous segments which contain speech from only a single speaker. Using a novel audio-visual association system, an association is found between each person's speech and face images, and samples from the same person are grouped together to form target clusters. From these clusters, user-specific match and non-match score distributions can be generated. The query to the system can be either a single face image or a speech sample. By performing audio-visual association, a sample from the missing modality can be found to produce a bimodal query. The bimodal query sample is then matched with each bimodal sample from the target cluster and a likelihood ratio decision framework determines if the target cluster belongs to the query.

Since each of these target clusters contains multiple face-voice samples, match and non-match score distributions can be learned for each individual.

Given a speech or face sample from the query meeting, audio-visual association is performed first to find samples from the second modality. This converts the unimodal query sample into an expanded bimodal query. This bimodal query is matched to each face-voice sample from a target cluster, generating multiple bivariate scores. The decision whether the target cluster belongs to the query sample is based on the average likelihood ratio of the scores.

The outline of the rest of this chapter is as follows. The partitioning of the meeting into segments is described in section 4.2.1. Section 4.2.2 describes the audio-visual clustering framework that merges segments from the same speaker and associates a face image with each speech segment.


Section 4.2.3 deals with modeling the match and non-match score distributions per individual. Section 4.2.4 presents the improvements in segment retrieval and section 4.3 carries the conclusion of this chapter.

4.2.1 Meeting Segmentation

The first processing step involves splitting the meeting into contiguous durations that contain speech from only a single speaker. A detailed description of this process is given in Section 3.2 and a brief overview is presented here for completeness.

Segment boundaries are found using the joint audio-visual stream. The motivation for using both audio and video features to determine speaker change-points stems from the observation that speech and movement are coexpressive [111]. Typically, since a speaker exhibits more movement than a listener, a change in speaker is also reflected by a change in the image region where movement occurs.

The audio signal is sampled at 16 kHz, and the Mel-frequency cepstral coefficients (MFCCs) are extracted using 32 filters with the bandwidth ranging from 166 Hz to 4000 Hz. These settings are chosen based on an empirical study [35] which found them to yield good results on the speaker verification task. The video features, which intend to capture motion, are obtained using image differences (three frames apart). The difference images are thresholded to suppress jitter noise, dilated by a 3x3 circular mask, downsampled from the original size of 480x720 to 48x72, and finally vectorized. The audio/video features are then projected onto PCA subspaces to reduce their dimensionality, and a joint audio-visual subspace is obtained by concatenating the coefficients using Equation 4.5.

$$ X(t) = \begin{bmatrix} \alpha \cdot A(t) \\ V(t) \end{bmatrix} \qquad (4.5) $$

Here $A(t) = [a_1(t), a_2(t), \ldots, a_m(t)]^T$, where $a_1(t), a_2(t), \ldots, a_m(t)$ are the MFCCs projected in a PCA space. Similarly, $V(t) = [v_1(t), v_2(t), \ldots, v_n(t)]^T$, where $v_1(t), v_2(t), \ldots, v_n(t)$ are obtained by projecting the video features onto a PCA space. The scaling factor $\alpha = |\Sigma_V| / |\Sigma_A|$ is used to make the variance of the audio features equal to that of the video features.
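The following is a minimal sketch of the joint feature construction in Equation 4.5, assuming the MFCC frames and vectorized difference images have already been computed. The function names, PCA dimensionalities and the use of the determinant of the covariance matrix as the generalized variance in the scaling factor are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def joint_av_features(mfcc, motion, m=10, n=10):
    """Build the joint audio-visual feature X(t) of Equation 4.5.

    mfcc:   (T, d_a) array of MFCC frames.
    motion: (T, d_v) array of vectorized difference images.
    Both are projected onto their top PCA components, and the audio
    part is scaled so the two feature sets have comparable variance.
    """
    def pca_project(feats, k):
        centered = feats - feats.mean(axis=0)
        # Right singular vectors are the principal directions.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:k].T

    A = pca_project(mfcc, m)     # audio coefficients a_1..a_m
    V = pca_project(motion, n)   # video coefficients v_1..v_n
    # Generalized variances |Sigma_V| and |Sigma_A| set the scale alpha.
    alpha = np.linalg.det(np.cov(V.T)) / np.linalg.det(np.cov(A.T))
    return np.hstack([alpha * A, V])  # one joint vector per frame
```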


In this work, the Bayesian Information Criterion (BIC) [19] is used for segmenting the meeting into homogeneous audio-visual segments. The task is posed as one of finding the time instants $t_k$ that correspond to segment boundaries. To determine if $t_k$ corresponds to a boundary, subsets of the signal $X$, preceding and following $t_k$, are examined to infer whether they are produced by the same person or by two different persons. Let $T_w$ represent the length of each subset, $X_{W_1} = \{X(t_k - T_w), X(t_k - T_w + 1), \ldots, X(t_k - 1)\}$ represent the subset preceding $t_k$, $X_{W_2} = \{X(t_k + 1), X(t_k + 2), \ldots, X(t_k + T_w)\}$ represent the subset following it, and $X_C$ represent the union of the two. The BIC is used to determine if $X_C$ can be adequately represented by a single model $M_C$ or if two models $M_1$ and $M_2$ are required to represent $X_{W_1}$ and $X_{W_2}$, respectively.

Defining $\Delta$BIC to be BIC($M_1, M_2$) - BIC($M_C$) and modeling $X_{W_1}$, $X_{W_2}$ and $X_C$ by unimodal Gaussian distributions with full covariance matrices, we have

$$ \Delta \mathrm{BIC}(t_k) = -\frac{N_1}{2} \log |\Sigma_{X_{W_1}}| - \frac{N_2}{2} \log |\Sigma_{X_{W_2}}| + \frac{N}{2} \log |\Sigma_X| - \frac{1}{2} \lambda \left( d + \frac{d(d+1)}{2} \right) \log N \qquad (4.6) $$

where $N_1$ and $N_2$ are the number of features in $X_{W_1}$ and $X_{W_2}$, respectively, and $N = N_1 + N_2$. $\lambda$ is a tuning parameter set to 1 and $d$ is the number of model parameters. A positive value of the $\Delta$BIC justifies the use of two models, indicating that $t_k$ is a change-point and thus a segment boundary.
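A compact sketch of the $\Delta$BIC test of Equation 4.6 is given below, assuming the two feature windows around a candidate boundary have already been extracted. The function and variable names are illustrative; this is not the evaluation code of the dissertation.

```python
import numpy as np

def delta_bic(window1, window2, lam=1.0):
    """Delta-BIC change-point test (Eq. 4.6).

    window1, window2: (N1, d) and (N2, d) arrays of joint
    audio-visual features preceding and following the candidate
    boundary t_k. A positive return value favors two models,
    i.e. a speaker change at t_k.
    """
    n1, d = window1.shape
    n2 = window2.shape[0]
    n = n1 + n2
    both = np.vstack([window1, window2])

    def logdet_cov(x):
        # log-determinant of the full covariance matrix
        return np.linalg.slogdet(np.cov(x.T))[1]

    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return (0.5 * n * logdet_cov(both)
            - 0.5 * n1 * logdet_cov(window1)
            - 0.5 * n2 * logdet_cov(window2)
            - penalty)
```

In practice the test is swept over candidate instants $t_k$, and local maxima of the statistic that exceed zero are declared segment boundaries.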


4.2.2 Audio-Visual Association

The next goal is to generate target clusters, one for each speaker, that contain all speech segments for that speaker. The association task is to find the speaker's face image for each segment. The clustering of segments is based on their acoustic and visual content. MFCCs extracted at 30 Hz are used as audio features and absolute difference images (three frames apart) capture the motion. Assuming that the speakers do not leave their positions, the association task is three-fold. First, we find time durations during which an individual speaks by performing temporal clustering of the audio signal. This groups together multiple segments in which the same person is speaking. Then, we perform spatial clustering in the corresponding video frames using an eigen-analysis method to find the modes of motion, which are the different image regions belonging to the different participants. Since a speaker exhibits more movement than the listeners, the dominant motion mode will be the one associated with the speaker. Thus an association between the person's speech and location (image region) is found. For details on the audio-visual segmentation-localization algorithm, the reader is referred to [109, 110] and to Section 3.3 of this dissertation. Once the speaker's region in the image is found, a Haar face detector [115] is run on the region of interest to find the speaker's face.

The overall query system is not restricted to using the proposed method for audio-visual association. Techniques that rely on instantaneous audio-visual synchrony [55, 67, 96] can also be used to determine the speaker's face region. The proposed audio-visual association method outperforms such Mutual Information (MI) based techniques when only a single camera is used to record the meeting and the speaker's face may not be frontal. However, if a meeting is recorded from multiple camera views, the speaker's face may be frontal in at least one of these views, satisfying the instantaneous synchrony assumption on which MI methods are based.

4.2.3 Modeling Individual Score Distributions

In the past few years, significant effort has been devoted to learning user-specific modality weights and thresholds [2, 102]. Most of the focus has been on learning user-specific discriminative models and using SVMs for classification. Recently, a generative framework that uses a likelihood ratio based decision was proposed by Nandakumar et al. [69]. Their fusion scheme has several advantages, the most salient of which is its ability to handle arbitrary scales and distributions of match scores and correlation between the scores of multiple matchers. In addition, the framework is elegant because it performs implicit weighting of modalities. However, the proposed framework models all match (non-match) scores as a mixture of Gaussians and hence is not adapted to learning user-specific models. In this work, we extend the proposed likelihood ratio based fusion framework to learn user-specific models, to provide better discrimination and hence better retrieval performance.

In our approach, the speaker's face images are found by running a Haar face detector on the speaker's image region. The individual segments from each target cluster are used as speech samples. One face image is associated with each speech sample. Thus a target cluster now contains multiple speech-face paired samples for a user. For computing distances between speech segments, a KL2-GMM speaker recognition algorithm is used [84]. For face recognition a commercial algorithm is used.

From the clustering phase of the audio-visual association, the segment-individual memberships are known.


Thus, for each individual, the scores can be categorized as match scores (scores between two segments belonging to the same individual) and non-match scores (scores between one segment belonging to the individual and the other segment belonging to a different individual). Similarly, face images extracted from a speech cluster's associated image region are matched against themselves and against faces extracted from other regions to generate match and non-match face scores, respectively. A mixture of Gaussians is then used to model the match and non-match distributions for each individual. Estimation of the Gaussian mixture model from the score distribution is done using the algorithm proposed in [32].

Each query face-voice sample is then matched to all face-voice pairings in the target cluster. Let $S_{qi} = [F_{qi}\ V_{qi}]$ represent the generated bivariate score obtained by matching the query with the $i$th sample from a target cluster. For each target cluster $C$, the likelihood of the cluster belonging to the query, $L_{qC}$, is computed using Equation 4.7.

$$ L_{qC} = \frac{1}{|C|} \sum_{\forall i \in C} \frac{p(S_{qi} \mid \theta_M)}{p(S_{qi} \mid \theta_{NM})} \qquad (4.7) $$

Here $\theta_M$ and $\theta_{NM}$ represent the densities of the match and non-match score distributions, respectively.
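To make the decision rule concrete, the sketch below fits per-individual match and non-match Gaussian mixtures over bivariate (face, voice) scores and computes the average likelihood ratio of Equation 4.7 for a query against one target cluster. It is an illustrative re-implementation that uses scikit-learn's GaussianMixture rather than the estimation algorithm of [32]; the array shapes, mixture size and names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_score_model(scores, components=2):
    """Fit a Gaussian mixture to bivariate [face, voice] scores."""
    gmm = GaussianMixture(n_components=components, covariance_type="full")
    gmm.fit(scores)
    return gmm

def cluster_likelihood_ratio(query_scores, match_gmm, nonmatch_gmm):
    """Average likelihood ratio of Equation 4.7 over one target cluster.

    query_scores: (|C|, 2) array, one [face, voice] score pair per
    cluster sample matched against the expanded bimodal query.
    """
    log_ratio = (match_gmm.score_samples(query_scores)
                 - nonmatch_gmm.score_samples(query_scores))
    return np.exp(log_ratio).mean()

# Hypothetical per-cluster score pairs gathered during clustering.
match_scores = np.random.rand(200, 2)     # same-person score pairs
nonmatch_scores = np.random.rand(800, 2)  # different-person score pairs
m_gmm = fit_score_model(match_scores)
nm_gmm = fit_score_model(nonmatch_scores)
L = cluster_likelihood_ratio(np.random.rand(15, 2), m_gmm, nm_gmm)
```

Because the mixtures are fit per individual, differences in how discriminative face and voice are for that person are absorbed into the densities themselves, which is the implicit user-specific modality weighting referred to above.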


4.2.4 Results

The proposed approach improves the segment retrieval performance over using only a single sample of one modality as the query. The improvements stem from three major factors. First, by clustering segments in the archive, multiple samples of a modality are obtained, which allows a sample-cluster match that outperforms decisions based on sample-sample matching. Second, the use of audio-visual association to find the complementary modality allows a multimodal comparison rather than a unimodal one. Third, the modeling of match and non-match distributions per individual, which equates to learning user-specific weights for combining information from the two modalities, improves results over using the same modality weights for combination. The results presented here intend to highlight the individual contributions of these three factors.

The proposed segment retrieval method is tested on a subset of meetings from the NIST pilot meeting room archive [38]. The dataset used consists of 16 meetings, recorded from four different camera views. The audio channel consists of a gain normalized mix from multiple distant microphones. The video frame rate is 29.97 Hz with a spatial resolution of 720x480, and the audio data is sampled at 44 kHz with a resolution of 16 bits per sample. The number of participants in each meeting varies from three to nine and the total number of unique participants in the dataset is 50. The number of different meetings that a person participates in varies from one to six.

As illustrated in Figure 4.11, each meeting of the archive undergoes audio-visual segmentation, which partitions the meeting into contiguous segments containing speech from only a single user. Then, by performing audio-visual association, the image region locating each speaker in the meeting is found in each of the four camera views. The Haar face detector [115] is run on these regions of interest to find the speaker's face in all four camera views. Since it can detect only frontal faces, typically faces are found in one or two of the camera views. Also, since a significantly long duration of speech is considered, multiple faces of the speaker are found. To simplify the design of the experiments and to aid comparison, only a single face image is randomly chosen from the found faces and associated with the speech sample.

The retrieval performance using the unimodal samples is shown in Figure 4.12. The precision of the face and voice systems at 90% recall is 64% and 71%, respectively. Since audio-visual association allows us to associate the person's face with the speech sample, the distance between two segments can be computed as a function of the distances between the segments' face and voice samples. Since the systems provide scores in different ranges, z-normalization was used to transform the scores into comparable ranges and the scores were summed to compute the final distance. This equates to assigning equal weights to both modalities in the score combination framework. As can be expected, this results in an improvement of about 10% over the best performing individual system, yielding a precision of 80% at 90% recall.

The evaluation of improvement due to multiple voice samples proceeds as follows. For each segment used as the query, the remaining segments of the meeting undergo audio-visual clustering to form target clusters. By matching all the samples within a cluster to each other, the distribution of match scores can be generated per cluster. Similarly, by matching the samples of the current cluster with samples from other clusters, the non-match distribution can be generated. Since we have two modalities, these scores have a bivariate distribution. The bimodal query is then matched to each of the target cluster's face-speech samples and the average likelihood ratio is computed for the query-target match using the match and non-match distributions for that cluster. It should be noted that modeling the score distributions per individual facilitates both normalization and user-specific modality weighting.
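For reference, the precision-at-recall operating points quoted here and in Figures 4.12 and 4.13 follow the standard ranked-retrieval computation, sketched below. This is illustrative only; the score and relevance arrays are hypothetical stand-ins for the archive's query results, and the helper assumes the target recall is actually reachable.

```python
import numpy as np

def precision_at_recall(scores, relevant, target_recall=0.9):
    """Rank archive segments by similarity to the query and report
    precision at the rank where the desired recall is first reached.

    scores:   (n,) similarity of each archive segment to the query.
    relevant: (n,) 0/1 flags marking segments from the query's speaker.
    """
    order = np.argsort(scores)[::-1]            # best matches first
    hits = np.cumsum(relevant[order])           # relevant retrieved so far
    recall = hits / relevant.sum()              # non-decreasing in rank
    precision = hits / np.arange(1, len(scores) + 1)
    return precision[np.searchsorted(recall, target_recall)]
```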


The evaluation of the improvement due to multiple voice samples proceeds as follows. For each segment used as the query, the remaining segments of the meeting undergo audio-visual clustering to form target clusters. By matching all the samples within a cluster to each other, the distribution of match scores can be generated per cluster. Similarly, by matching the samples of the current cluster with samples from other clusters, the non-match distribution can be generated. Since we have two modalities, these scores have a bivariate distribution. The bimodal query is then matched to each of the target cluster's face-speech samples, and the average likelihood ratio is computed for the query-target match using the match and non-match distributions for that cluster. It should be noted that modeling the score distributions per individual facilitates both normalization and user-specific modality weighting. This performance, obtained by fusing face and voice using individual-specific models, is shown in Figure 4.13 by the black curve. The precision is 92.6% at 90% recall.

Figure 4.12. Retrieval Performance in the Sample-Sample Matching Framework. Each segment's face or voice sample is matched with all corresponding samples from the archive to generate the precision-recall curves, shown in red for face and blue for voice. The fusion performance, obtained by z-normalization and sum rule fusion, is shown in green. At 90% recall, the precision is 64% for face and 71% for voice. By performing audio-visual association to associate face and voice samples, the performance after fusion is 81% precision at 90% recall, an improvement of about 10% over the best individual modality.

If we do not use user-specific models, and instead group together match and non-match scores arising from different clusters, we forgo the advantage of user-specific weighting of the modalities and instead have the same modality weights for all users. This generates the same decision boundary for all users and is akin to having fixed modality weights. The performance under this condition is shown by the magenta curve in Figure 4.13. The precision is 88% at 90% recall.


Figure 4.13. Retrieval Performance in the Sample-Cluster Matching Framework. The meetings are partitioned into clusters, such that each cluster contains all speech and face samples from a single individual. These samples are used to learn user-specific score distributions. The retrieval performance is evaluated using a leave-one-out framework, where segment samples acting as a query are segregated from the cluster and then matched with each segment in the cluster. The black curve shows the performance of the fused system when individual-specific models are used: at 90% recall, the precision is 92.6%. If match and non-match scores from different clusters are grouped together to generate global match and non-match models, the performance drops to 88% precision at 90% recall. The performance for the individual face and voice systems is a precision of 78% and 77%, respectively, at 90% recall.

Next, we evaluate the performance of the individual systems when multiple samples are used, which is similar to the evaluation of the global models, except that the models are built using only one modality. The performance of the two systems is shown in Figure 4.13, using a red curve for face and a blue curve for voice. The face system achieves 78% precision at 90% recall, and the voice system has a similar performance of 77% precision at the same recall rate.


4.3 Summary and Conclusions

In conclusion, this chapter describes in detail all the components necessary for creating a meeting indexing system that finds all segments of a meeting archive where a particular individual spoke. Solving this problem enables answering various other queries, such as finding all meetings which a particular user attended or finding meetings attended by a specific set of users. In addition, other higher-level tasks such as meeting categorization require solving this subproblem.

This chapter proposed a fully automatic method for querying meeting archives to find speech segments from the same person. The method uses audio-visual processing to first partition the meeting into homogeneous segments that contain speech from only a single user. Then audio-visual clustering and association is performed, which associates face and voice samples from the same user and groups these samples into target clusters. The speech query is expanded by audio-visual processing to find the associated face sample, and the expanded bimodal query is used to retrieve target clusters. A novel matching framework was introduced that exploits the nature of meeting videos: multiple samples from an individual are used to learn user-specific, joint face-voice distributions. By directly using the user-specific models to generate likelihood ratios, the system performs implicit normalization and user-specific modality weighting.

In the first section of this chapter, we described studies conducted on a novel, truly multimodal database that contained sample variations due to differences in environmental conditions. This is one of the first studies dealing with both face and voice, and their intramodal and multimodal fusion, in a scenario where samples are subject to natural environmental variations. The study mimics a realistic scenario, as face and voice samples were collected from the same person under conditions typical of an operational setting.

From this study, we uncovered certain interesting observations. Firstly, we observe that for both face and voice, the indoor-outdoor performance is the worst. The fact that the outdoor-outdoor performance is better than the indoor-outdoor performance suggests that the systems perform better when the enrollment and verification conditions are similar, even if the conditions are noisy. This observation can guide the design of the meeting indexing system by noting that recognition performance within the same meeting will be better than recognition across meetings.


Secondly, we find that the max rule performs better than the sum rule on our dataset, which differs from previously reported results. This is probably due to the indoor-outdoor nature of our database, which results in highly skewed genuine score distributions. For meetings, this means that the optimal fusion rule will be data dependent and that under adverse conditions, when one of the modalities is prone to noise, the max rule may be a better choice.
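As an illustration, fixed-rule fusion over the z-normalized modality scores might be sketched as follows; choosing between the rules would be an empirical, data-dependent decision, as observed above, and the function name is illustrative.

    import numpy as np

    def fuse_fixed_rule(face_scores, voice_scores, rule="sum"):
        # Combine two z-normalized score arrays with a fixed rule. The
        # sum rule pools evidence from both modalities, while the max
        # rule lets the stronger modality dominate, which can help when
        # the other modality is corrupted by noise.
        stacked = np.vstack([face_scores, voice_scores])
        return stacked.sum(axis=0) if rule == "sum" else stacked.max(axis=0)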


Thirdly, we find that score normalization significantly improves the performance (12-15% at 1% FAR) for both systems. This improvement is much greater than that observed in other studies and can be attributed to the indoor-outdoor nature of the dataset. Since we expect similar variations in meeting rooms, normalization of the scores will be an important processing step.

Finally, we find that in the indoor-outdoor scenario, the performance of intramodal fusion of the stronger modality approaches that of multimodal fusion. Similar observations are made in [17], where a fusion of 2D and 3D face images is studied. Thus, in order to improve meeting indexing performance, exploiting information from multiple samples is a key idea, especially because multiple samples are readily available from the continuous audio-visual recording.

These findings indicate that, to improve the accuracy of a retrieval system, one must exploit all available modalities. Secondly, one must use multiple samples, if available, to improve the decision framework. Thirdly, since the modalities may not perform equally well for all users, a mechanism is required to generate user-specific modality weights. These findings influenced the design of the meeting indexing system described in Section 4.2.

The proposed retrieval system was tested on an archive consisting of 16 meeting excerpts of 20 minutes each. The archive contains 50 unique subjects, most of whom appear in multiple meetings. Retrieval results indicate that the proposed system performs much better than a system based on sample-sample matching. The system is fast and robust because it uses low-dimensional samples and high-level features which can be extracted in real time from the audio-visual signal.

CHAPTER 5

SUMMARY AND CONCLUSIONS

This dissertation proposed solutions to three distinct but related problems that are fundamental to the analysis of meetings. The first problem addressed was that of audio-only speaker diarization, which is the task of automatically determining the number of speakers and the durations when they spoke from the audio portion of a meeting recording. The proposed solution involves a split and merge paradigm, where a modified three-part sliding window technique, based on the Bayesian Information Criterion (BIC), was used to split the meeting into segments containing speech from only a single speaker. This modification to the original BIC segmentation yielded improvements in both speed and accuracy. The second contribution to the audio-only diarization problem was the use of a two-stage clustering solution, where the model complexity was increased in response to the amount of available data. Initial segments, which contain little data, were modeled with unimodal Gaussians, and graph spectral clustering was used to merge them into larger intermediate clusters. These intermediate clusters, containing more data, were modeled by a mixture of Gaussians and clustered using a hierarchical agglomerative clustering scheme until the stopping criterion was met. The proposed diarization solution resulted in a diarization error rate of 20%, compared to the 25% achieved by a typical segmentation-clustering solution.
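For reference, a minimal sketch of a BIC-based change test at a single candidate boundary is shown below, assuming full-covariance Gaussian models over acoustic feature frames; the three-part sliding window refinement itself is not reproduced here, and the penalty weight lam is an illustrative tuning parameter.

    import numpy as np

    def gaussian_loglik(frames):
        # Log-likelihood of frames (n, d) under a single maximum-likelihood
        # full-covariance Gaussian; the quadratic term reduces to n * d.
        n, d = frames.shape
        _, logdet = np.linalg.slogdet(np.cov(frames, rowvar=False))
        return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

    def delta_bic(left, right, lam=1.0):
        # Positive values favor declaring a speaker change between the
        # left and right windows (cf. the BIC criterion of [19]).
        n, d = len(left) + len(right), left.shape[1]
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        both = np.vstack([left, right])
        return (gaussian_loglik(left) + gaussian_loglik(right)
                - gaussian_loglik(both) - penalty)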


We also introduced a novel paradigm to jointly perform diarization and localization using both audio and video in an iterative framework. The novelty of the proposed solution lies in its use of motion cues from the entire body, rather than just the face, to predict the speaker in a meeting. First, a coarse speaker diarization is performed to find durations of speech from the same speaker; then, by performing an eigen-analysis on the corresponding video frames, regions with highly correlated movement are found. These regions are assumed to lie on the speaker's image, resulting in a coarse localization. The diarization and localization results are then iteratively refined by incorporating video information into the diarization process and using the refined diarization results to guide localization. It was shown that incorporating video information improved the diarization performance by reducing the DER from 20% to 16%. The localization results obtained using this approach outperformed those obtained using the Mutual Information approach by about 40%.

Finally, we also performed meeting indexing using a query by example framework. The proposed solution uses audio-visual association to expand a unimodal query into a bimodal query. Meetings of the archive are processed to generate target clusters, one per individual, again using audio-visual association. Each cluster contains multiple face and speech samples of a user. The availability of multiple samples allows making a single decision per target cluster based on multiple matches, rather than making multiple decisions using sample-sample matching, as is typically done in identification problems. As a second advantage, multiple multimodal samples allow generating user-specific subspaces and using a classification framework rather than a score combination framework. This allows learning user-specific decision boundaries without the need to explicitly learn user-specific modality weights for fusion. A comparison of the two frameworks shows that the proposed solution yields a precision of 92.6% at 90% recall, versus a precision of 71% at the same recall for a system that uses only a single-sample, unimodal query.

In summary, the contributions of this work can be categorized as those that introduce new paradigms for solving the problems and those that demonstrate superior solutions to a problem using algorithms novel to the field. Contributions in the first category include the use of motion cues from the entire body to predict the current speaker, and the idea of posing identification as a clustering problem for meeting indexing. The use of two-stage clustering for audio-visual association, and the use of user-specific subspaces to perform indexing in meeting rooms, fall in the second category.

One of the core contributions of this work was the joint audio-visual association used to determine the speaker and his/her location in the image. The solution proposes the use of long term co-occurrences between speech and video to perform these tasks, rather than instantaneous association, which may not always be possible in the meeting domain. However, some speakers' faces are frontal in a camera view, and for such participants instantaneous audio-visual association is a plausible alternative. It is desirable to have a solution that takes into account both long term co-occurrences and instantaneous association to solve the diarization and localization tasks.
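As an illustration of the long-term co-occurrence idea, the sketch below finds image blocks whose motion co-varies over one speaker's speech durations via a simple eigen-analysis; the block decomposition and the use of a single principal component are simplifying assumptions, not the exact formulation used in this work.

    import numpy as np

    def correlated_motion_map(motion_energy):
        # motion_energy: (n_frames, n_blocks) array of per-block motion
        # energy over the frames assigned to one speaker by the coarse
        # diarization.
        centered = motion_energy - motion_energy.mean(axis=0)
        # The leading right singular vector weights blocks by how strongly
        # their motion co-varies; high-weight blocks are assumed to lie on
        # the speaker's image.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        weights = np.abs(vt[0])
        return weights / weights.max()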


The proposed solutions for speaker diarization, like other approaches that use only a single microphone, cannot handle overlapping speech. In some kinds of meetings, such as group discussions, a significant portion of the meeting may contain overlapping speech, making this a problem worth addressing. Although Independent Component Analysis has been used to solve the blind source separation problem, it requires the signal to be recorded with as many microphones as there are sources. Lately, methods have been proposed that allow separation when the number of microphones is less than the number of sources. However, the problem of separating multiple simultaneous speech sources from a single microphone is largely unaddressed and warrants research.

A direction for future improvement of the proposed diarization-localization framework could be to deal with participants walking around in the room. One solution to this problem could be to perform the eigen-analysis over shorter durations and use appearance-based models to compute similarity. Another possibility is to use region tracking to account for such movement prior to performing the eigen-analysis.

This work has shown that there are cues beyond the face that indicate speech activity. It was observed that, in general, over long enough time durations, the speaker tends to exhibit more movement than the listener. In a sense, only the magnitude of motion was exploited in this work. However, each person has unique body language, which can also be used to recognize the individual. It would be interesting to determine whether the “motion-mannerism” of an individual can be used as a soft biometric. Finally, an extension of the proposed work would be to use the framework to detect events and activities in the video and perform meeting categorization based on speaker turn patterns and movements.


REFERENCES

[1] J. Aggarwal and Q. Cai. Human motion analysis: a review. Computer Vision and Image Understanding, 73(3):428–440, 1999.

[2] J. Aguilar, D. Romero, J. Garcia, and J. Rodriguez. Adapted user-dependent multimodal biometric authentication exploiting general information. Pattern Recognition Letters, 26:2628–2639, 2005.

[3] J. Ajmera and C. Wooters. A robust speaker clustering algorithm. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'03), pages 411–416, 2003.

[4] P. Aleksic and A. Katsaggelos. Audio-visual biometrics. Proceedings of the IEEE, 94(11):2025–2044, Nov. 2006.

[5] AMI. Augmented multiparty interaction. Available at http://corpus.amiproject.org, last accessed May 2007.

[6] X. Anguera, C. Wooters, and J. M. Pardo. Robust speaker diarization for meetings: ICSI RT06S meeting evaluation system. Third International Workshop on Machine Learning for Multimodal Interaction, 4299:346–358, 2006.

[7] H. Asoh, F. Asano, T. Yoshimura, Y. Motomura, N. Ichimura, I. Hara, J. Ogata, and K. Yamamoto. An application of a particle filter to Bayesian multiple sound source tracking with audio and video information fusion. Proceedings of the International Conference on Information Fusion, pages 805–812, 2004.

[8] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J. Thiran. The BANCA database and evaluation protocol. Proceedings of the 4th International Conference on Audio and Video Based Biometric Person Authentication (AVBPA), pages 625–638, 2003.

[9] M. Beal, N. Jojic, and H. Attias. A graphical model for audio-visual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):828–836, 2003.

[10] M. Ben, M. Betser, F. Bimbot, and G. Gravier. Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs. Proceedings of the International Conference on Spoken Language Processing (ICSLP), 2004.

[11] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz. Fusion of face and speech data for person identity verification. IEEE Transactions on Neural Networks, 10(5):1065–1074, 1999.

[12] E. Bigun, J. Bigun, B. Duc, and S. Fischer. Expert conciliation for multimodal person authentication systems using Bayesian statistics. Proceedings of the First International Conference on Audio Visual Based Person Authentication (AVBPA), pages 291–300, 1997.


[13] M. Brandstein. A framework for speech source localization using sensor arrays. PhD thesis, Brown University, 1995.

[14] R. Brunelli and D. Falavigna. Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10:955–966, 1995.

[15] C. Busso, S. Hernanz, C. W. Chu, S. I. Kwon, S. Lee, P. Georgiou, I. Cohen, and S. Narayanan. Smart room: Participant and speaker localization and identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2:1117–1120, 2005.

[16] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain. Towards using STT for broadcast news speaker diarization. Proceedings of DARPA RT04, 2004.

[17] K. Chang, K. Bowyer, and P. Flynn. An evaluation of multimodal 2D+3D face biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):619–624, 2005.

[18] N. Checka, K. Wilson, M. Siracusa, and T. Darrell. Multiple person and speaker activity tracking with a particle filter. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5:17–21, 2004.

[19] S. Chen and P. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In DARPA Speech Recognition Workshop, 1998.

[20] Y. Chen and Y. Rui. Real-time speaker tracking using particle filter sensor fusion. Proceedings of the IEEE, 92(3):485–494, 2004.

[21] S. S. Cheng and H. M. Wang. A sequential metric-based audio segmentation method via the Bayesian information criterion. In Proceedings of Eurospeech, 2003.

[22] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and R. Johnston. Design issues for a digital audio-visual integrated database. IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication, 28(7):1–7, 1996.

[23] R. Cutler and L. Davis. Look who's talking: Speaker detection using video and audio correlation. Proceedings of the IEEE International Conference on Multimedia, 3:1589–1592, 2000.

[24] R. Cutler and L. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781–796, 2000.

[25] R. Cutler, Y. Rui, A. Gupta, J. Cadiz, I. Tashev, L. He, A. Colburn, Z. Zhang, Z. Liu, and S. Silverberg. Distributed meetings: A meeting capture and broadcasting system. Proceedings of ACM Multimedia, 2002.

[26] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1:886–893, 2005.

[27] T. Darrell, J. W. Fisher, P. Viola, and W. Freeman. Audio-visual segmentation and “The Cocktail Party Effect”. International Conference on Multimodal Interfaces, pages 32–40, 2000.

[28] B. Dumas, C. Pugin, J. Hennebert, D. Petrovska-Delacretaz, A. Humm, F. Evequoz, R. Ingold, and D. Von-Rotz. MyIdea multimodal biometrics database: description of acquisition protocols. Proceedings of the Third COST 275 Workshop (COST 275), pages 59–62, 2005.


[29] T. Dvorkind and S. Gannot. Speaker localization using the unscented Kalman filter. Proceedings of the Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA'05), C:3–4, 2005.

[30] H. Eng, J. Wang, A. Kam, and W. Yau. A Bayesian framework for robust human detection and occlusion handling using a human shape model. International Conference on Pattern Recognition, 2:257–260, 2004.

[31] M. Faundez-Zanuy, J. Fierrez-Aguilar, J. Ortega-Garcia, and J. Gonzalez-Rodriguez. Multimodal biometric databases: an overview. IEEE Aerospace and Electronic Systems Magazine, 21(8):29–37, 2006.

[32] M. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002.

[33] J. Fiscus, N. Radde, J. Garofolo, A. Le, J. Ajot, and C. Laprun. Rich transcription 2005 spring meeting recognition evaluation. In Machine Learning for Multimodal Interaction, 2005.

[34] J. Fisher and T. Darrell. Probabilistic models and informative subspaces for audiovisual correspondence. Proceedings of the European Conference on Computer Vision, 3:592–603, 2002.

[35] T. Ganchev, N. Fakotakis, and G. Kokkinakis. Comparative evaluation of various MFCC implementations on the speaker verification task. Proceedings of the 10th International Conference on Speech and Computer (SPECOM 2005), 1:191–194, 2005.

[36] S. Garcia-Salicetti, C. Beumier, G. Chollet, B. Dorizzi, J. L. les Jardins, J. Lunter, Y. Ni, and D. Petrovska-Delacretaz. BIOMET: a multimodal person authentication database including face, voice, fingerprint, hand and signature modalities. Proceedings of the Fourth International Conference on Audio and Video Based Biometric Person Authentication (AVBPA), pages 845–853, 2003.

[37] S. Garcia-Salicetti, M. Mellakh, L. Allano, and B. Dorizzi. Multimodal biometric score fusion: the mean rule vs. support vector classifiers. Proceedings of the Thirteenth European Signal Processing Conference (EUSIPCO), 2005.

[38] J. S. Garofolo, C. D. Laprun, M. Michel, V. M. Stanford, and E. Tabassi. The NIST meeting room pilot corpus. Proceedings of the Language Resources and Evaluation Conference, 2004.

[39] D. M. Gavrila. The visual analysis of human movement: a survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.

[40] D. M. Gavrila and J. Giebel. Shape-based pedestrian detection and tracking. IEEE Intelligent Vehicle Symposium, 1:8–14, 2002.

[41] V. Gupta, P. Kenny, P. Ouellet, G. Boulianne, and P. Dumouchel. Combining Gaussianized/non-Gaussianized features to improve speaker diarization of telephone conversations. IEEE Signal Processing Letters, 14:1040–1043, 2007.

[42] T. Haga, K. Sumi, and Y. Yagi. Human detection in outdoor scene using spatio-temporal motion analysis. Proceedings of the International Conference on Pattern Recognition (ICPR), 4:331–334, 2004.


[43] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S. J. Young. Segment generation and clustering in the HTK broadcast news transcription system. Proceedings of the 1998 DARPA Broadcast News Transcription and Understanding Workshop, pages 133–137, 1998.

[44] J. Hershey and J. R. Movellan. Audio vision: Using audio-visual synchrony to locate sounds. Advances in Neural Information Processing Systems, 12:813–819, 2000.

[45] M. Indovina, U. Uludag, R. Snelick, A. Mink, and A. K. Jain. Multimodal biometric authentication methods: a COTS approach. Proceedings of the Workshop on Multimodal User Authentication, pages 99–106, 2003.

[46] A. Iyer, U. Ofoegbu, R. Yantorno, and B. Solenski. Generic modeling applied to speaker count. IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pages 327–330, 2006.

[47] A. K. Jain, L. Hong, and Y. Kulkarni. A multimodal biometric system using fingerprints, face and speech. 2nd International Conference on Audio Visual Based Person Authentication (AVBPA), pages 182–187, 1999.

[48] A. K. Jain, K. Nandakumar, and N. Ross. Score normalization in multimodal biometric systems. Pattern Recognition, 38:2270–2285, 2005.

[49] A. K. Jain and A. Ross. Learning user-specific parameters in a multibiometric system. Proceedings of the International Conference on Image Processing (ICIP), pages 57–60, 2002.

[50] A. K. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Special Issue on Image and Video Based Biometrics, 14(1):4–20, 2004.

[51] Q. Jin, K. Laskowski, T. Schultz, and A. Waibel. Speaker segmentation and clustering in meetings. Proceedings of the NIST RT-04 Spring Meeting Recognition Evaluation Workshop at the 29th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.

[52] B. Kapralos, M. Jenkin, and E. Milios. Audio-visual localization of multiple speakers in a video teleconferencing setting. International Journal of Imaging Systems and Technologies, 13:95–105, 2003.

[53] T. Kar-Ann and Y. Wei-Yun. Combination of hyperbolic functions for multimodal biometrics data fusion. IEEE Transactions on Systems, Man and Cybernetics, 34(2):1996–1209, 2004.

[54] S. Kettebekov, M. Yeasin, and R. Sharma. Prosody based co-analysis for continuous recognition of co-verbal gestures. Proceedings of the International Conference for Machine Intelligence, pages 161–166, 2002.

[55] E. Kidron and Y. Schechner. Pixels that sound. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 88–95, 2005.

[56] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.

[57] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:320–337, 1976.


[58] H. Krim and M. Viberg. Two decades of array signal processing research: The parametric approach. IEEE Signal Processing Magazine, 13:67–94, 1996.

[59] A. Kumar and A. Zhang. Personal recognition using hand shape and texture. IEEE Transactions on Image Processing, 15(8):2454–2461, 2006.

[60] G. Lathoud and I. A. McCowan. A sector-based frequency-domain approach to detection and localization of multiple speakers. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 18–23, 2005.

[61] Y. Ma, B. Cukic, and H. Singh. A classification approach to multi-biometric score fusion. Audio and Video Based Biometric Person Authentication (AVBPA), 3546:484–493, 2005.

[62] T. Mansfield, G. Kelly, D. Chandler, and J. Kane. Biometric product testing final report 1.0. U.K. National Physical Lab, 2001.

[63] S. Meignier, J.-F. Bonastre, and S. Igounet. E-HMM approach for learning and adapting sound models for speaker indexing. Proceedings of the Odyssey Speaker and Language Recognition Workshop, pages 175–180, 2001.

[64] S. Meignier, D. Moraru, C. Fredouille, J. Bonastre, and L. Besacier. Step-by-step and integrated approaches in broadcast news speaker diarization. Computer Speech and Language, 20(2-3):303–330, 2005.

[65] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. Proceedings of the 2nd International Conference on Audio Visual Based Person Authentication (AVBPA), pages 72–77, 1999.

[66] J. Min, P. Flynn, and K. Bowyer. Using multiple gallery and probe images per person to improve performance of face recognition. Notre Dame Computer Science and Engineering Technical Report TR-03-7, 2003.

[67] G. Monaci, O. Divorra Escoda, and P. Vandergheynst. Analysis of multimodal sequences using geometric video representations. Signal Processing, 86(12):3534–3548, 2006.

[68] D. Moraru, L. Besacier, S. Meignier, C. Fredouille, and J. Bonastre. Speaker diarization in the ELISA consortium over the last 4 years. Rich Transcription 2004 Workshop, 2004.

[69] K. Nandakumar, Y. Chen, S. C. Dass, and A. K. Jain. Likelihood ratio based biometric score fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):342–347, 2008.

[70] K. Nandakumar, Y. Chen, A. K. Jain, and S. Dass. Quality-based score level fusion in multibiometric systems. International Conference on Pattern Recognition, 4:473–476, 2006.

[71] H. J. Nock, G. Iyengar, and C. Neti. Speaker localization using audio-visual synchrony: An empirical study. Proceedings of the Conference on Image and Video Retrieval, pages 488–499, 2003.

[72] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. Moving talker, speaker-independent feature study and baseline results using the CUAVE Multimodal Speech Corpus. EURASIP Journal on Applied Signal Processing, 11:1189–1201, 2002.


[73] V. Pavlovic, J. Rehg, A. Garg, and T. Huang. Multimodal speaker detection using error feedback dynamic Bayesian networks. IEEE Conference on Computer Vision and Pattern Recognition, pages 34–43, 2000.

[74] A. Pentland. Looking at people: Sensing for ubiquitous and wearable computing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):107–119, 2000.

[75] J. Phillips, P. Grother, R. Micheals, D. Blackburn, E. Tabassi, and M. Bone. Face recognition vendor test 2002. IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2003.

[76] J. Phillips, H. Moon, J. Rauss, and S. Rizvi. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1090–1104, 2000.

[77] P. Phillips, K. Bowyer, D. Reynolds, and P. Griffin. Multimodal versus unimodal performance in biometrics. Panel Discussion at the Biometric Consortium Conference, available at http://www.biometrics.org, September 20, 2004.

[78] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database. Proceedings of the First International Conference on Audio and Video Based Biometric Person Authentication (AVBPA), pages 403–409, 1997.

[79] G. Pingali, G. Tunali, and I. Carlbom. Audio-visual tracking for natural interactivity. Proceedings of the ACM International Conference on Multimedia (MM), pages 415–420, 1999.

[80] F. Porikli. Achieving real-time object detection and tracking under extreme conditions. Journal of Real-Time Image Processing, 2006.

[81] M. Przybocki and A. Martin. NIST's assessment of text independent speaker recognition performance. COST 275 Workshop, 2002.

[82] F. Quek, D. McNeill, R. Ansari, X.-F. Ma, R. Bryll, S. Duncan, and K. E. McCullough. Gesture cues for conversational interaction in monocular video. In Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 119–126, 1999.

[83] D. Reynolds and P. Torres-Carrasquillo. The MIT Lincoln Laboratories RT-04F diarization systems: Applications to broadcast audio and telephone conversations. Fall 2004 Rich Transcription Workshop (RT-04), 2004.

[84] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian Mixture Models. Digital Signal Processing, 10:19–41, 2000.

[85] S. Ribaric and I. Fratric. Experimental evaluation of matching-score normalization techniques on different multimodal biometric systems. IEEE Mediterranean Electrotechnical Conference, pages 498–501, 2006.

[86] A. Ross and A. K. Jain. Information fusion in biometrics. Pattern Recognition Letters, 24:2115–2125, 2003.

[87] C. Sanderson and K. Paliwal. Information fusion and person verification using speech and face information. IDIAP-RR 02-33, pages 33–42, 2002.


[88] M. Sargin, O. Aran, A. Karpov, F. Ofli, Y. Yasinnik, S. Wilson, E. Erzin, Y. Yemez, and A. Tekalp. Gesture-speech correlation analysis and speech driven gesture synthesis. IEEE International Conference on Multimedia, pages 893–896, 2006.

[89] S. Sarkar and P. Soundararajan. Supervised learning of large perceptual organization: Graph spectral partitioning and learning automata. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):504–525, 2000.

[90] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[91] H. Sidenbladh. Detecting human motion with support vector machines. Proceedings of the 17th International Conference on Pattern Recognition, 2:188–191, 2004.

[92] M. Siegler, U. Jain, B. Raj, and R. Stern. Automatic segmentation, classification, and clustering of broadcast news audio. Proceedings of the DARPA Speech Recognition Workshop, pages 97–99, 1997.

[93] T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression Database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1615–1618, 2003.

[94] R. Sinha, S. E. Tranter, F. Gales, and P. C. Woodland. The Cambridge University March 2005 speaker diarisation system. European Conference on Speech Communication and Technology (Interspeech), pages 2437–2440, 2005.

[95] M. Siracusa, L.-P. Morency, K. Wilson, J. Fisher, and T. Darrell. A multimodal approach for determining speaker location and focus. Proceedings of the International Conference on Multimodal Interfaces (ICMI), 2003.

[96] M. Slaney and M. Covell. FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. Advances in Neural Information Processing Systems, 14, 2001.

[97] P. Smaragdis and M. Casey. Audio/visual independent components. 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA 2003), pages 709–714, 2003.

[98] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVID. Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330, 2006.

[99] R. Snelick, U. Uludag, A. Mink, M. Indovina, and A. K. Jain. Large scale evaluation of multimodal biometric authentication using state-of-the-art systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:450–455, 2005.

[100] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan. The CLEAR 2006 evaluation. Proceedings of the First International CLEAR Evaluation Workshop, 4122:1–45, 2006.

[101] R. Stiefelhagen and R. Bowers. CLEAR evaluations. Available at http://www.clearevaluation.org/, last accessed April 2007.

[102] K. Toh, X. Jian, and W. Yau. Exploiting global and local decisions for multimodal biometrics verification. IEEE Transactions on Signal Processing, 52(10):3059–3072, 2004.


[103] D. Toth and T. Aach. Detection and recognition of moving objects using statistical motion detection and Fourier descriptors. International Conference on Image Analysis and Processing, pages 430–435, 2003.

[104] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. International Conference on Computer Vision (ICCV), pages 255–261, 1999.

[105] S. Tranter and D. Reynolds. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech and Language Processing, 14(5):1557–1565, 2006.

[106] A. Tritschler and R. Gopinath. Improved speaker segmentation and segments clustering using the Bayesian information criterion. Proceedings of Eurospeech, pages 679–682, 1999.

[107] B. Ulery, A. Hicklin, C. Watson, W. Fellner, and P. Hallinan. Evaluation of selected biometric fusion techniques. Studies of Biometric Fusion, NIST Tech. Rep. IR 7346, 2006.

[108] H. Vajaria, T. Islam, P. Mohanty, S. Sarkar, R. Sankar, and R. Kasturi. An outdoor biometric system: evaluation of normalization fusion schemes for face and voice. Proceedings of SPIE, 6202, 2006.

[109] H. Vajaria, T. Islam, S. Sarkar, R. Sankar, and R. Kasturi. Audio segmentation and speaker localization in meeting videos. International Conference on Pattern Recognition, 2:1150–1153, 2006.

[110] H. Vajaria, S. Sarkar, and R. Kasturi. Exploring long term co-occurrence between speech and body movements. Submitted to IEEE Transactions on Circuits and Systems for Video Technology, 2008.

[111] L. Valbonesi, R. Ansari, D. McNeill, F. Quek, S. Duncan, K. E. McCullough, and R. Bryll. Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures. Proceedings of the European Signal Processing Conference, pages 75–78, 2002.

[112] J. Vermaak and A. Blake. Nonlinear filtering for speaker tracking in noisy and reverberant environments. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5:3021–3024, 2001.

[113] J. Vermaak, M. Gagnet, A. Blake, and P. Perez. Sequential Monte Carlo fusion of sound and vision for speaker tracking. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1:741–746, 2001.

[114] V. Vezhnevets, V. Sazonov, and A. Andreeva. A survey on pixel-based skin color detection techniques. Proceedings of Graphicon-2003, pages 85–92, 2003.

[115] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IEEE International Conference on Computer Vision, 2:734–741, 2003.

[116] L. Wang, W. Hu, and T. Tan. Recent developments in human motion analysis. Pattern Recognition, 36(3):585–601, 2003.

[117] D. Ward, E. Lehmann, and R. Williamson. Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Transactions on Speech and Audio Processing, 11(6):826–836, 2003.


[118] C. Wooters, J. Fung, B. Peskin, and X. Anguera. Towards robust speaker segmentation: The ICSI-SRI fall 2004 diarization system. Fall 2004 Rich Transcription Workshop (RT-04), 2004.

[119] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.

[120] S. M. Yoon and H. Kim. Real-time multiple people detection using skin color, motion and appearance information. International Workshop on Robot and Human Interactive Communication, pages 331–334, 2004.

[121] J. Zhou and J. Hoang. Real time robust human detection and tracking system. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 3:149–156, 2005.

[122] D. Zotkin and R. Duraiswami. Accelerated speech source localization via a hierarchical search of steered response power. IEEE Transactions on Speech and Audio Processing, 12(5):499–508, 2004.

[123] D. Zotkin, R. Duraiswami, and L. Davis. Multimodal 3D tracking and event detection via the particle filter. IEEE Int. Conf. on Computer Vision, Workshop on Detection and Recognition of Events in Video (ICCV-EVENT), pages 20–27, 2001.


LIST OF PUBLICATIONS

H. Vajaria, S. Sarkar, R. Kasturi, “Clip Retrieval using Multi-modal Biometrics in Meeting Archives”, to appear in International Conference on Pattern Recognition, 2008. (Chapters 3, 4)

H. Vajaria, S. Sarkar, R. Kasturi, “Exploring Long Term Speech-Body Movement Co-occurrence for Speaker Diarization and Localization in Meetings”, submitted to IEEE Transactions on Circuits and Systems for Video Technology. (Chapters 2, 3)

H. Vajaria, T. Islam, S. Sarkar, R. Sankar, R. Kasturi, “Audio Segmentation and Speaker Localization in Meeting Videos”, International Conference on Pattern Recognition, 2006. [won best student paper award] (Chapters 2, 3)

H. Vajaria, T. Islam, P. Mohanty, S. Sarkar, R. Sankar, R. Kasturi, “An outdoor biometric system: evaluation of normalization and fusion schemes for face and voice”, SPIE Defense and Security Symposium, April 2006. (Chapter 4)

H. Vajaria, T. Islam, P. Mohanty, S. Sarkar, R. Sankar, R. Kasturi, “Indoor-Outdoor Face and Voice Fusion”, Pattern Recognition Letters, Volume 28, Issue 12, September 2007. (Chapter 4)


ABOUT THE AUTHOR

Himanshu Vajaria earned his M.S. in Electrical Engineering from the Pennsylvania State University and a Ph.D. in Computer Science and Engineering from the University of South Florida. His research interests lie in the fields of image processing, computer vision and pattern recognition, with a focus on multimedia analysis, biometrics, object recognition and tracking, and remote sensing.