
Detecting group turns of speaker groups in meeting room conversations using audio-video change scale-space


Material Information

Title:
Detecting group turns of speaker groups in meeting room conversations using audio-video change scale-space
Physical Description:
Book
Language:
English
Creator:
Krishnan, RaviKiran
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Conversation change
Temporal scales
Turn pattern
Multimedia analysis
Taxonomy
Dissertations, Academic -- Computer Science & Engineering -- Masters -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Automatic analysis of conversations is important for extracting high-level descriptions of meetings. In this work, as an alternative to linguistic approaches, we develop a novel, purely bottom-up representation, constructed from both audio and video signals, that helps us characterize and build a rich description of the content at multiple temporal scales. Nonverbal communication plays an important role in describing information about the communication and the nature of the conversation. We consider simple audio and video features to extract these changes in conversation. In order to detect these changes, we consider the evolution of the detected change, using the Bayesian Information Criterion (BIC) at multiple temporal scales to build an audio-visual change scale-space. Peaks detected in this representation yield group-turn-based conversational changes at different temporal scales. We use the NIST Meeting Room corpus to test our approach. Four clips of eight minutes are extracted from this corpus at random, and the other ten are extracted beginning 90 seconds after the start of the full video in the corpus. A single microphone and a single camera are used from the dataset. The group turns detected in this test gave an overall detection result of 82% when compared with different thresholds at a fixed group turn scale range, and a best result of 91% for a single video. Conversation overlaps, changes and their inferred models offer an intermediate-level description of meeting videos that is useful in summarization and indexing of meetings. Since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can easily be added as a feature to other multimedia analysis techniques.
Thesis:
Thesis (MSCS)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by RaviKiran Krishnan.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004646
usfldc handle - e14.4646
System ID:
SFS0027961:00001




Full Text

Detecting Group Turns of Speaker Groups in Meeting Room Conversations Using Audio-Video Change Scale-Space

by

Ravikiran Krishnan

A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Sudeep Sarkar, Ph.D.
Rangachar Kasturi, Ph.D.
Dmitry Goldgof, Ph.D.

Date of Approval:
June 30, 2010

Keywords: Conversation change, Temporal scales, Turn pattern, Multimedia analysis, Taxonomy

Copyright © 2010, Ravikiran Krishnan

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Prof. Sudeep Sarkar, for many insightful conversations during the development of the ideas, for his kindness and, most of all, for his patience. He showed me different ways to approach a research problem and the need to be persistent to accomplish any goal. I am also appreciative of his emotional support during difficult times. I would also like to thank my committee members, Prof. Dmitry Goldgof and Prof. Rangachar Kasturi, for their time and valuable suggestions during the past two years. I thank the Department of Computer Science and its professors at the University of South Florida for providing me with financial support to pursue my master's degree.

I would like to take this opportunity to thank all my friends who helped me get through two years of graduate school. I thank my parents, who have always been an inspiration to me. Finally, I would like to express my profound appreciation to my brothers Srinivas and Guru. Their continuing encouragement, understanding and guidance have helped me be a better person.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1  INTRODUCTION
  1.1  Overview and Motivation
  1.2  Assumptions
  1.3  Contribution
  1.4  Outline of Thesis

CHAPTER 2  PREVIOUS WORK

CHAPTER 3  TURN PATTERNS
  3.1  Speaker Turn Patterns
  3.2  Group Turn Patterns
  3.3  Commonly Occurring Group Turn Patterns
    3.3.1  Long Conversation with Short Overlaps
    3.3.2  Short Exchange
    3.3.3  Long Conversational Overlaps
    3.3.4  Short Speech Exchanges
  3.4  Synopsis

CHAPTER 4  AUDIO-VIDEO CHANGE SCALE-SPACE
  4.1  Features Used
    4.1.1  Mel-Frequency Cepstral Coefficients (MFCC)
  4.2  Proposed Approach
    4.2.1  Single-Scale Representation
    4.2.2  Build Scale-Space: A Bottom-Up Approach
    4.2.3  Scale-Space Break-Up
  4.3  Synopsis

CHAPTER 5  DATASET AND RESULTS
  5.1  Dataset
  5.2  Results
    5.2.1  Group Turn Scale Range
    5.2.2  Peak Cut-Off in Group Turn
    5.2.3  Group Turn Scale Range Analysis

CHAPTER 6  CONCLUSIONS
  6.1  Conclusions
  6.2  Future Work

REFERENCES

LIST OF TABLES

Table 5.1: Analysis of dataset

LIST OF FIGURES

Figure 1.1: Turn pattern for 5-minute video using audio ground truth.
Figure 2.1: Previous work based on action and conversational turn patterns.
Figure 3.1: Conceptual sketch of speaker turn patterns.
Figure 3.2: Sample speaker turn pattern.
Figure 3.3: Conceptual sketch of five group turns.
Figure 3.4: Sample group turn pattern.
Figure 3.5: Schematics of different conversations.
Figure 4.1: Features used in group turn segmentation.
Figure 4.2: MFCC extraction flow diagram.
Figure 4.3: Taxonomy of conversation change.
Figure 4.4: An example audio-video change scale-space.
Figure 4.5: Three different audio-video change scale-spaces.
Figure 4.6: Two different group turn patterns with ground truth.
Figure 4.7: Conceptual break-up of an audio-video change scale-space.
Figure 5.1: The NIST meeting room setup.
Figure 5.2: A frame from each clip from the dataset.
Figure 5.3: True and false positive plot for variable threshold.

Figure 5.4: Reducing range of group turn scales.
Figure 5.5: Three different segments showing group turn patterns.
Figure 5.6: True and false positive plot for different group scale ranges.
Figure 5.7: Different segments of group turn with different scale ranges.

Detecting Group Turns of Speaker Groups in Meeting Room Conversations Using Audio-Video Change Scale-Space

Ravikiran Krishnan

ABSTRACT

Automatic analysis of conversations is important for extracting high-level descriptions of meetings. In this work, as an alternative to linguistic approaches, we develop a novel, purely bottom-up representation, constructed from both audio and video signals, that helps us characterize and build a rich description of the content at multiple temporal scales. Nonverbal communication plays an important role in describing information about the communication and the nature of the conversation. We consider simple audio and video features to extract these changes in conversation. In order to detect these changes, we consider the evolution of the detected change, using the Bayesian Information Criterion (BIC) at multiple temporal scales to build an audio-visual change scale-space. Peaks detected in this representation yield group-turn-based conversational changes at different temporal scales.

We use the NIST Meeting Room corpus to test our approach. Four clips of eight minutes are extracted from this corpus at random, and the other ten are extracted beginning 90 seconds after the start of the full video in the corpus. A single microphone and a single camera are used from the dataset. The group turns detected in this test gave an overall detection result of 82% when compared with different thresholds at a fixed group turn scale range, and a best result of 91% for a single video.

Conversation overlaps, changes and their inferred models offer an intermediate-level description of meeting videos that is useful in summarization and indexing of meetings. Since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can easily be added as a feature to other multimedia analysis techniques.

CHAPTER 1
INTRODUCTION

1.1 Overview and Motivation

A meeting, as we know, is a dynamically changing entity. Automatic analysis of a meeting's content and building rich descriptions of it are still difficult problems. Researchers from various fields such as behavioral psychology, human-computer interface design, human communication behavior, computer vision and signal processing have focused efforts on analyzing multimedia content, particularly in meetings.

A single meeting is a complex temporal chain of dynamically changing events. For example, a staff meeting can be looked upon as a discussion when considered in its entirety, but it can also be described as a sequence of characterizations appropriate over smaller temporal windows, such as argument, discussion, note-taking, and breaks. Automatically inferring this wide range of temporally dependent descriptions is still a fundamental problem. Bridging this gap and describing the dynamics of meetings would be an important step toward solving it.

Most researchers focus on analyzing meetings by trying to answer questions such as "Who?", "When?", "Where?" and "Which?". One question that really helps in building an intermediate-level description of the meeting conversation is "How?" How is the conversation changing? One important property for analyzing how the meeting is occurring, without lexical features, is the turn pattern. Taking turns to talk is a common communication ordering in any type of multimedia content, but it is more prominent in the conversations involved in meetings. In order to detect the turn pattern, the temporal structure of each speaker must be identified. This task is not trivial, as the conversations involve overlapping speech.

Turn patterns provide a record of speech transitions in a conversation. According to the conversational analysts in [1], "wherever possible the speaker's current turn will be interpreted as implicating some action by the responder in the immediate next turn. Similarly, the respondent's subsequent talk will, where possible, be interpreted as related to the immediate prior turn." An example is a responder agreeing or disagreeing with what a presenter is saying. Such instances are usually detected with the use of lexical features. In this work, nonverbal cues are used, which makes the detection language independent.

Figure 1.1 shows the result from the ground truth of a meeting room video. Only five minutes of the meeting are used, for better presentation. The nodes are the states, and the labels are the numbers of the people speaking. The state labels are given by binary strings, which specify the subjects speaking from right to left. Examples of labels: 0101, Subject 1 and Subject 3 are speaking together; 1000, only Subject 4 is speaking; 0001, only Subject 1 is speaking. The edges in the graph show that there is a transition from one state to another. A solid line specifies a state change from a lower number (when the labels are converted to decimal) to a higher one, and a dotted line specifies a change in state from a higher number to a lower number. This is done only for visualizing turn patterns.

A single speaker in the meeting can speak at different times in the meeting, and this speaker can speak at the same time as another speaker. This is called overlapping speech with the current speaker at a particular time in the meeting. Identification of the temporal structure depending on scales does not define "what was being spoken", but defines the conversation changes. In [2], speaker turn patterns are used for speaker diarization. The input stream is segmented to get the speaker change point for each speaker. Before segmentation, the overlap and silence frames are dropped. This prevents over-segmentation for speaker diarization, but dropping the overlap frames neglects some important group conversational dynamics. In order to incorporate these overlap frames, a higher-level abstraction is needed. The focus of this work is to provide an intermediate-level description of conversation change in meeting rooms. We use a multiscale approach to segment the input stream, which results in the detection of group turn patterns.
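The binary state labels used in Figure 1.1 are straightforward to decode programmatically. The helper below is our own illustrative sketch (not code from the thesis) that recovers the active speaker IDs from a right-to-left label string:

```python
def active_speakers(label: str) -> list[int]:
    """Decode a state label such as '0101' into 1-based subject IDs.

    Bits are read right to left: the rightmost character is Subject 1,
    the next is Subject 2, and so on.
    """
    return [i + 1 for i, bit in enumerate(reversed(label)) if bit == "1"]

print(active_speakers("0101"))  # [1, 3] -- Subjects 1 and 3 speak together
print(active_speakers("1000"))  # [4]    -- only Subject 4 is speaking
print(active_speakers("0001"))  # [1]    -- only Subject 1 is speaking
```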
Communication is a dynamic process. Every conversation has its own temporal scale. For example, the amount of time taken for speaker exchanges in an argument may be different from that in a discussion. Information about these unknown temporal changes is

important in describing the state of a conversation. To analyze these unknown scale variations in conversations, we consider a multi-scale representation of an audio stream.

Figure 1.1: Turn pattern for a 5-minute video using audio ground truth. Each state in the transition graph represents a state the meeting is in, in terms of how many speakers are speaking at the same time.

In a meeting room, there are always a few dominant groups of speakers. These speakers are usually interrupted by other speakers. These interruptions may be considered speaker changes in single-scale BIC, as is the usual practice, but they do not provide us with a high-level description of the conversation between the dominant groups. Our method provides a way to detect these longer-scale conversation changes, called group turns. In order to detect group turns, which implicitly describe the nature of the conversation, a multiscale approach is considered. The audio-video joint feature set is represented as a scale-space, called the "audio-video change scale-space". The representation is a scale-dependent description of different turn patterns in conversations, and a change in this scale-space shows the multiscale temporal change point structure of the entire conversation.

These patterns suggest conversation states without considering what was spoken. Many algorithms detect the speaker change point based on the BIC using a single scale or

one-length window that fits best for a particular dataset. This type of speaker change point detection provides only a single-scale temporal structure of the conversation. A single scale cannot determine the length of an individual person speaking in a conversation. Our effort focuses on obtaining finer to coarser conversations in a meeting room. Finer conversations are speaker conversations in which the speaker turn is happening at shorter intervals of time. Coarser conversations are those with speaker turns at larger intervals. Therefore, determining the length of a conversation is the most important step toward describing speaker turn patterns.

In order to recognize types of conversations, group turn patterns must be identified. This helps us understand the communication order in meetings. This ordering can be classified into groups of similar conversations such as silence, monologue and polylogue conversations. Commonly occurring general conversation models are described, which can be used for classification of conversation changes based on group turns.

In summary, the proposed approach can be used as a playback feature to go to a particular group turn in a meeting, and to detect dominant groups in the meeting by clustering. These scale-space representations and detected turn patterns can be used as features in content-based queries of meetings. The inferred group turn models offer an intermediate-level description of meeting videos that can also be useful in summarization and indexing of meetings.

1.2 Assumptions

The following assumptions are made regarding the meetings being analyzed:

1. The meeting involves multiple participants, but the number of participants is not known.

2. The meeting is recorded using only a single stationary camera and a single microphone, and these streams are time synchronized.

3. No assumptions are made regarding the placement of the microphone and the camera. In particular, because the participants face each other, all faces may not be visible to the camera.

4. The participants stay seated in their chairs throughout each meeting.

5. The nature of the meetings is unknown; i.e., they may be group discussions, debates, brainstorming sessions or any such kind of meeting.

1.3 Contribution

This work focuses mainly on three concepts. First, we define two types of turn patterns: speaker and group turn patterns. Based on group turns, we describe general conversational models and their corresponding peaks. Second, we implement a BIC-based multiple-scale organization of an audio-video stream, which we call the "audio-video change scale-space". A scale represents a temporal window that is used to find a conversation change in terms of the two turn patterns. This representation is a scale-dependent description of different turn patterns. Each turn pattern is associated with a temporal scale range in the scale-space. The third aspect of our approach is automatic detection of group turns. Our approach provides a higher-level description of the conversation between the dominant groups by detecting group turns. Based on this work, a paper [3] will be presented as an oral presentation at the International Conference on Pattern Recognition (ICPR), 2010.

1.4 Outline of Thesis

The chapters in this thesis proceed from the theory of group turn patterns to results that demonstrate the detection of group turns. The first chapter is an introduction to turn patterns.

Chapter 2 presents the prior work published before this thesis. The previous work describes different types of turn patterns, which are categorized into group action turn patterns and conversational turn patterns.

Chapter 3 describes the two turn patterns. The first is the speaker turn pattern, whose concepts are described using definitions and illustrations. The second is the group turn pattern, which is the main focus of this thesis. In addition, the chapter describes the theory and conceptual visualization of the most commonly occurring group turn patterns.

Chapter 4 describes the proposed approach for detecting group turn patterns. The chapter starts by describing the features extracted from the video clip, followed by the proposed approach. The steps for detecting group turn patterns are described in detail, with importance given to the scale-space creation and the Bayesian Information Criterion.

Chapter 5 describes the dataset used. This is followed by the results of detecting group turn patterns while varying the group turn pattern scale ranges, to get the optimum scale for detecting group turn patterns.

CHAPTER 2
PREVIOUS WORK

This chapter reviews previous works that performed automatic analysis of action turn patterns and conversational turn patterns in settings such as meeting rooms. Social interaction understanding has inspired a surge of interest in developing computational techniques for automatic conversational analysis. As an alternative to linguistic approaches, nonverbal cues have been a major domain for automatic analysis of conversational interaction in recent years. A group conversation can proceed through various communication phases in the course of sharing information. Multiparty discourse can range over group discussions, presentations, casual arguments, formal meetings and many more [4]. Figure 2.1 categorizes the related approaches into action turn patterns and conversational turn patterns such as speaker and group turn patterns.

In a formal meeting room scenario, a group can be assumed to be engaged in various conversational activities. Some of these activities involve the use of different artifacts such as projector screens, whiteboards, and notepads for note taking. In [5, 6], a group meeting is segmented into location-specific turns of actions. This is also a type of turn pattern, but it does not involve speech in the recognition or segmentation into these turns. This approach used a supervised technique called Hidden Markov Models, or HMMs. The features used in this technique are simple nonverbal audio and visual cues. Multiple cameras and multiple microphones were used. Some of the simple audio features used were pitch, energy and spectral frequency. The visual features extracted were each participant's skin color, motion blobs and location, for the indication of body motion and head position. This approach used a multi-modal meeting manager corpus [6]. The results were measured in terms of action recognition rate.

Even though action recognition was convincing and the works showed the value of audio-visual fusion, such methods had a drawback of overfitting the data when learning, as the

Figure 2.1: Previous work based on action and conversational turn patterns. The action turn pattern is based on classifying the meeting into different types of actions. The conversational turn pattern is divided into two groups, namely speaker turn patterns and group turn patterns.

data available was limited. Some of the segmentation and recognition of meeting events and actions is given in [7, 8, 9]. These works addressed the concerns by having a multilayer HMM framework. One layer of the HMM framework models basic individual activities from low-level audio-visual features, and the next level of the HMM framework models the interactions. This approach performed reasonably well on the M4 corpus. Some of the actions segmented and recognized are discussion, monologue, note-taking, presentation, and white-board use. These actions provide a different type of turn-taking behavior. This does not involve segmentation of meetings in terms of conversations.

In [10], personal states and meeting states were inferred separately through Finite State Machines (FSMs), where only visual cues were used. States such as a user's standing-sitting states were also analyzed. The social dynamics of the people involved are also described. Detection of actions and activities (sequences of actions) is done using a set of object-oriented finite state machines that run in parallel with one another. The lowest-level inputs to this hierarchy were the observations using visual cues like head gaze and hand position. Similar efforts were

pursued in [11, 12]. In [11], the contexts used for interaction detection include head gesture, attention from others, speech tone, speaking time, interaction occasion, and information about previous interactions. A support vector machine (SVM) classifier is adopted to recognize human interactions. The AMI project [13] also deals with interaction issues including turn-taking, gaze behavior, influence and talkativeness. In [12], four kinds of classification models, including Support Vector Machines (SVM), Bayesian Nets, Naïve Bayes and Decision Trees, were selected to infer the type of each interaction. Brdiczka et al. [14] adopted speech features for the automatic detection of interaction group configurations.

A hierarchical approach of segmenting and recognizing human interactions depending on who was speaking was adopted in [15]. In conversation analysis, turn patterns are usually referred to as speaker turn patterns. In [1], speaker diarization is a term used to cluster all the speaker-specific homogeneous sections of an individual speaker. Speaker diarization is a basic problem, and the use of speaker diarization gives a temporal structure of the conversation to determine who was speaking at that instant of time. An overview of speaker diarization systems is provided in [16]. More recently, new types of multi-modal [17] features such as prosody, speaker turns, etc. were added as features to classify meeting scenarios using a multi-stream Dynamic Bayesian Network (DBN) technique, which is adopted in [18, 19]. In [19], two DBN architectures were investigated: a two-level hidden Markov model (HMM) in which the acoustic observations were concatenated, and a multi-stream DBN in which two separate observation sequences were modeled. The first level of the DBN in multi-level DBNs decomposed the interaction patterns as sequences of sub-activities without labeling them or extracting any meaning from these patterns. The sub-activities are just parameters that are learned during training. This approach showed improvement, but the complex structure of the models made it hard to interpret. The second DBN processed other features and integrated them at a higher level.

Speaker turns [20, 21, 2, 22, 23] are mostly used to describe answers to questions like "Who?", "When?" and "Where?". The speaker turn is used for speaker diarization, localization, and floor control analysis [24]. There have been similar efforts to combine gesture, gaze and speech together. In [20], the audio features were extracted and segmented using the Bayesian

Information Criterion (BIC). The segmentation indicated speaker change points. This was the first time that a model selection framework such as BIC was used. A lot of work on speaker diarization has been based on this type of segmentation. The multi-modal aspect was also introduced. In [2], the audio and video were combined to form a joint feature vector. These features were then segmented using the BIC and then clustered. A corresponding video model was also built for these clusters in order to localize the speaker. Speaker turn patterns usually do not involve overlapping speech. There have been some developments in incorporating overlapping speech and classifying the speech into activities, but they are still limited [25, 26, 27]. Labeling activities based on any feature set is a challenging problem. The activities allow us to give a higher-level description of the conversation and action. The labeling of different types of conversation should include the conversation dynamics of groups of speakers. One such effort is in describing a higher abstraction of conversation called the group turn. This group turn also includes overlapping speech, as it is no longer only a signal-level description.

A group turn pattern is a conversational turn pattern that involves overlapping speech, with a single speaker or a group of speakers speaking simultaneously. Previously, in language analysis, this group turn pattern has been described [28]. Some other works, such as [29, 30, 25, 26, 27], have been partially successful in determining the overlapping speech segments in a conversation. These are again used to describe actions or to analyze the amount of overlapping speech. In [31], overlap detection is done by an HMM-based technique. In our work, the focus is to create a multi-temporal scale-space for conversation to detect group turn patterns, which distinguishes this work from previous works on action recognition and speaker turn patterns.
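The BIC change test first applied to speaker segmentation in [20] can be sketched as follows. This is the standard full-covariance Gaussian delta-BIC formulation (our generic sketch, not the thesis's code); `lam` is the usual penalty weight, and a positive score favours a change at the candidate frame:

```python
import numpy as np

def delta_bic(X: np.ndarray, t: int, lam: float = 1.0) -> float:
    """Gaussian delta-BIC for a candidate change point t in a (N, d)
    window of feature vectors. Positive values favour modelling the
    window as two segments, i.e. a change at frame t."""
    N, d = X.shape

    def logdet_cov(A):
        # log|covariance|, lightly regularized for numerical stability
        return np.linalg.slogdet(np.cov(A, rowvar=False) + 1e-8 * np.eye(d))[1]

    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(N)
    return 0.5 * (N * logdet_cov(X)
                  - t * logdet_cov(X[:t])
                  - (N - t) * logdet_cov(X[t:])) - penalty

# Two synthetic "speakers" with different feature statistics:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])
print(delta_bic(X, 100) > delta_bic(X, 50))  # the true boundary scores highest
```

In practice the score is scanned over all candidate points in a window, and a change is declared at a local maximum that exceeds zero.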

CHAPTER 3
TURN PATTERNS

Turn patterns are a communication ordering of all the speakers in a conversation. This pattern shows us whose turn it was to speak at a particular time instant. The pattern shows a combination of information about the speakers, such as who was speaking, when, and how many times a speaker spoke. These patterns also show the overlapping and/or simultaneous speech involved in the conversation. The overlapping speech in a conversation can reveal the nature of the conversation. Longer overlaps are an indication of disagreement or quarrel. More overlapping speech in a conversation leads to more arguments than discussions.

In meeting rooms, a group of people conversing can be seen as proceeding through diverse conversational phases. These phases can be detected through the temporal structure of the conversation, according to "who spoke when". These various phases are called the conversation changes. Turn patterns describe the temporal structure of the meeting, which in turn offers an intermediate-level description of the meeting. The turn pattern does not describe what was being spoken, as lexical features are not used, but it can describe how the conversation changed or progressed in time. We define a speaker having his or her turn to speak as "having the floor" [29]. We have defined conversation change in terms of speaker turns and group turns [29][30], and a scale represents a temporal window that is used to find a conversation change in terms of the two turn patterns described below.

3.1 Speaker Turn Patterns

A turn consists of the sequence of talkspurts and pauses by a speaker who "has the floor". A speaker loses the floor when there is a pause by the speaker or when a speaker has fallen silent. A speaker gains the floor back when the speaker starts to speak and is not interrupted

by a pause for at least 3 seconds [28]. Without this criterion, even the shortest unilateral sound would be designated as a turn. The time slice of 1.5 seconds was chosen because it is estimated to be the mean duration of a phonemic clause, and evidence shows that the phonemic clause is a basic unit in the encoding and decoding of speech. A speaker turn occurs when a speaker loses the floor and another speaker gains it.

Figure 3.1: Conceptual sketch of speaker turn patterns. There are two figures: a sketch (left) of a speaker turn pattern involving three speakers, which is indicative of a speaker-student turn pattern, and a sketch (right) of a speaker turn pattern involving four speakers, which is indicative of a discussion or an argument.

In Figure 3.1, two conceptual speaker turn patterns are shown. A speaker turn pattern involving three speakers, S1, S2, and S3, is shown (left). This type of speaker turn pattern is typical in presentations, where one speaker (S1) is dominant and has the floor during most of the conversation, and the other two speakers (S2 and S3) gain and lose the floor at sparse intervals; therefore communication is always headed by the dominant speaker. Speakers S2 and S3 do not have any arrows between them, indicating that there is no speaker turn from speaker S3 to speaker S2 and vice versa. Another speaker turn pattern is shown (right), where there are four speakers involved in the conversation. In this speaker turn pattern, more interaction occurs between all the speakers. This indicates that all the speakers have gained or lost the floor more frequently, which in turn indicates a discussion or an argument.

Figure 3.2 shows the audio ground truth of an excerpt of a small conversation in the meeting-room corpus used. There are three speakers, and all of them have gained the floor and lost it.
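The floor-holding rule above (talkspurts separated only by short pauses belong to the same turn) can be sketched as a simple interval merge. This helper is our own illustration; the pause threshold is left as a parameter, since the text cites both a 1.5 s phonemic-clause estimate and a 3 s criterion:

```python
def merge_talkspurts(spurts, max_pause=1.5):
    """Merge one speaker's talkspurts, given as (start, end) times in
    seconds, into turns. Spurts separated by a pause shorter than
    `max_pause` stay in the same turn; a longer pause loses the floor."""
    turns = []
    for start, end in sorted(spurts):
        if turns and start - turns[-1][1] < max_pause:
            turns[-1][1] = max(turns[-1][1], end)  # bridge the short pause
        else:
            turns.append([start, end])             # floor regained: new turn
    return [tuple(t) for t in turns]

# A 0.5 s pause is bridged into one turn; a 4 s pause starts a new turn:
print(merge_talkspurts([(0, 2), (2.5, 4), (8, 9)]))  # [(0, 4), (8, 9)]
```

A speaker turn would then be declared whenever one speaker's merged turn ends and a different speaker's turn begins.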

Figure 3.2: Sample speaker turn pattern. The vertical axis is the speaker ID and the horizontal axis is time. The white patches indicate that a particular speaker is speaking.

3.2 Group Turn Patterns

A group turn occurs when a speaker or a group of speakers loses the floor, and a speaker or a group of speakers gains it. The definition of a group is described in terms of simultaneous speech. Simultaneous speech is a type of speech where one or more speakers who do not have the floor are overlapping a speaker or a group of speakers who have the floor. An instance in which a group turn occurs from a speaker who has the floor to a group of speakers, with the same speaker keeping the floor, is called overlapping speech. Overlaps are synonymous with interruptive simultaneous speech. The group turn is described to cover instances where individual turn takers who have the floor are effectively "drowned out" by the group.

Figure 3.3 shows different group turn patterns as nodes G1, G2, G3, ..., Gn. Each node can be considered a supernode consisting of speaker turn patterns. Each group turn pattern consists of a different number of speakers. The dynamics of each group turn pattern are also different. The first group turn pattern, G1, shows one type of speaker turn pattern. Each different speaker turn pattern can be segmented as a group turn pattern. Node G3, where there is

Figure 3.3: Conceptual sketch of five group turns. Each turn is different from the others.

more interaction between the speakers, has a different number of speakers involved in the conversation. Figure 3.4 shows the audio ground truth of an excerpt of a small conversation in the meeting-room corpus used. There are five speakers, and all of them have gained the floor. A change scale-space and its division will provide the necessary information to determine whether a change point is a speaker turn or a group turn.

3.3 Commonly Occurring Group Turn Patterns

A polylogue contains a sequence of group turn patterns, and these patterns describe speaker and simultaneous-speech turn patterns. According to these turn patterns, we define conversational models that reflect the most commonly occurring group turn patterns and describe the turn changes with respect to a scale-space. Four types of conversational models are described:

1. Long Conversation with Short Overlaps

2. Short Exchange

3. Long Conversational Overlaps

4. Short Speech Exchanges
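Given per-frame voice-activity ground truth for each speaker, the group state at each frame, and hence candidate group turn boundaries (frames where the set of floor holders changes), can be read off directly. A minimal sketch using the right-to-left label convention of Figure 1.1; the helper functions are ours:

```python
import numpy as np

def group_states(activity: np.ndarray) -> list:
    """Per-frame state labels from a (frames, speakers) 0/1 matrix.
    Column j is speaker j+1; labels read right to left as in Figure 1.1."""
    return ["".join(str(int(b)) for b in row[::-1]) for row in activity]

def state_changes(labels: list) -> list:
    """Frame indices where the set of active speakers changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

A = np.array([[1, 0, 0, 0],   # only speaker 1 has the floor
              [1, 0, 1, 0],   # speakers 1 and 3 overlap
              [1, 0, 1, 0],
              [0, 0, 0, 1]])  # only speaker 4
labels = group_states(A)
print(labels)                 # ['0001', '0101', '0101', '1000']
print(state_changes(labels))  # [1, 3]
```

The thesis's contribution is to recover such boundaries without ground truth, from peaks in the audio-video change scale-space.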

Figure 3.4: Sample group turn pattern. The vertical axis is the speaker ID and the horizontal axis is time. The white patches indicate that a particular speaker is speaking. Different combinations of speakers speak at different time instances.

A conversation that suggests a dispute, quarrel and/or disagreement is an argument. Looking for these types of conversational turn patterns will lead to the detection of simultaneous speech. Constant small overlapping exchanges indicate that there is disagreement.

3.3.1 Long Conversation with Short Overlaps

Long conversational change points will be detected at the group turn scale range, and simultaneous speech of short length will be detected at the speaker turn scale range.

Figure 3.5[a] is a schematic corresponding to long, non-overlapping conversations, indicating a discussion with short overlaps, which is considered to be responders agreeing with or acknowledging the communicator with words such as "Hmm" and "Uh", and some laughter (giggles) by another person.
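The intuition that long segments surface at coarse scales while short exchanges surface at fine scales can be illustrated with a toy change scale-space. The thesis computes the change score with the BIC; the mean-difference score below is a simplified stand-in of our own that only demonstrates the multi-scale construction:

```python
import numpy as np

def change_scale_space(F: np.ndarray, scales) -> np.ndarray:
    """Toy change scale-space for a (T, d) feature sequence F.

    For each temporal scale w, the score at frame t compares the mean of
    the w frames before t with the w frames after t (the thesis uses a
    BIC score here instead). Rows index scales, columns index time."""
    T = len(F)
    S = np.zeros((len(scales), T))
    for i, w in enumerate(scales):
        for t in range(w, T - w):
            S[i, t] = np.linalg.norm(F[t - w:t].mean(axis=0)
                                     - F[t:t + w].mean(axis=0))
    return S

# A single change at frame 50 produces a ridge of peaks across all scales;
# a brief overlap would instead peak only in the fine-scale rows.
F = np.vstack([np.zeros((50, 2)), np.ones((50, 2))])
S = change_scale_space(F, scales=[5, 10, 20])
print([int(np.argmax(row)) for row in S])  # [50, 50, 50]
```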

Figure 3.5: Schematics of different conversations (a)-(d). White horizontal bars show who has the floor. Black bars show that the speaker has fallen silent. For each speaker floor change, the corresponding regions (group and speaker) in which peaks occur are shown.

3.3.2 Short Exchange

Short-exchange speech is another category of speech, in which there is a break in a monologue and there is a very short speech, such that the ratio of the monologue to this short speech is very high; this suggests a short exchange of speech. Short exchanges occur when a person is looking for approval or disapproval of the speech he is currently making. Words of approval alone may not be considered a short exchange; only spoken words that do not decrease the monologue-to-short-speech ratio can be considered a short exchange.

Figure 3.5[d] depicts a conversation in which a speaker tends to wait for an approval, an acknowledgment, or even a small sentence indicating the responder's view. Lower temporal scales will detect all these individual speaker changes or turns.

3.3.3 Long Conversational Overlaps

Figure 3.5[c] shows the schematic of a conversation starting with long overlaps and continuing into a discussion. Long simultaneous speech will have change points at the group turn scale range. Different group patterns will have different temporal scales of conversations. The

conversation that represents the switch into another state is an indication that arguments have stopped and speakers have resumed discussions.

3.3.4 Short Speech Exchanges

Short speech exchanges happen when the speaker turns are rapid. The rapid speech might be due to some disagreement or small talk between the speakers, or a discussion that is in progress in which every speaker has very little to say. The ratio between the speakers is low compared to short-exchange speech.

In Figure 3.5[b], short-speech-exchange scenarios are shown, which are prominently found across conversations that tend to be quarrels or disputes. Short exchanges would be detected at shorter scales, as the conversation length of an individual speaker or a group of speakers would be short. Hence, the peaks will show up at the speaker turn scale range, and if the conversation length falls in the scale range of group turns, then peaks would also be detected at that scale.

3.4 Synopsis

In this chapter, we describe two types of conversational turn patterns. The focus of this work is to detect group turns. Group turns are a turn-taking behavior in a group having a conversation. The changes to the conversations in the group turn will yield information on how the conversation was going, i.e., in terms of states like discussion, argument and presentation. Even though detecting these states is still a problem, detecting group turn patterns will provide valuable insight into the group conversational pattern. The most commonly occurring group turn patterns can be found in everyday conversations, where there are overlaps, intrusions into someone's speech, question asking, and agreement with another person's speech. The group turn pattern will provide an idea as to which groups were conversing the most, leading to the dominant group of speakers in the conversation.
CHAPTER 4
AUDIO-VIDEO CHANGE SCALE-SPACE

4.1 Features Used

We use both audio and video features to build an audio-video change scale-space. The audio signal is captured at 44 kHz and processed to extract 23 Mel-Frequency Cepstral Coefficients (MFCC) at a 30 Hz rate. This is done to bring the temporal dimensionality of the audio to the same rate as the video. The MFCCs are then projected into principal component analysis (PCA) space to obtain a 23-dimensional feature vector.

The video frame rate is 30 Hz at a 720x480 resolution. We use the gray-scale difference image, as it is computationally less expensive and performs better than optical flow on our data, where the inter-frame displacements are high [33]. We downsample the difference image by a factor of 20. The image dimensions are further reduced to 23 by PCA to correspond with the dimensionality of the audio features. We combine the audio and video features to create a combined audio and video feature set. These combined features are then projected into another PCA space. We normalize the audio and video by the covariances in the respective spaces. Figure 4.1 shows a video broken into audio and video features: the MFCC features extracted for an audio stream are shown as an image (left), and the image difference features are shown as an image (right).

4.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

The audio features used in the process of building the audio-video change scale-space are MFCC features. MFCC features are the dominant features used in any speech recognition system [34]. They have the ability to represent the speech amplitude spectrum in a compact form, which has been the reason for their success. They are derived from a type of cepstral representation of the audio stream. The cepstrum is formed by taking the Fourier transform of the audio spectrum. In the mel-frequency spectrum, frequency bands are equally spaced on the mel-scale. This is a better approximation of the speech signal than linearly spaced frequency bands [35].

Figure 4.1: Features used in group turn segmentation. Audio features are Mel-frequency Cepstral Coefficients (left) and video features are the image differences (right). Using audio-video nonverbal cues, a multi-temporal scale conversation change is grouped into speaker and group turns.

Figure 4.2: MFCC extraction flow diagram. The Fourier transform of the input audio signal is taken, mapped to the mel scale, and logs are taken. A discrete cosine transform is performed to obtain the required number of MFCCs.

Figure 4.2 shows the process of extracting the MFCC features. Small speech signal sections that are statistically stationary are modeled. The window function is typically a
Hamming window, which removes the edge effect. The DFT of the signal is taken and mel-frequency warping is done. A "mel" is a unit of measure of the perceived pitch or frequency of a tone, and the scale is linear when the frequency is less than 1 kHz. The frequency axis is warped according to the mel-scale.

The steps involved in calculating MFCC features for an audio stream include the following:

1. Take the Fourier transform of a windowed excerpt of a signal.
2. Map the powers of the spectrum obtained above onto the mel-scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

The MFCC features are calculated at 30 frames per second. This is done to match the temporal dimensions of the video. Only the first 23 [2] MFCC features are considered. These features are then projected to a PCA space, but the dimensions are not reduced. Section 4.1.2 provides the necessary description of calculating principal component analysis.

4.2 Proposed Approach

The first step in the proposed method involves acquiring MFCC audio features and image difference video features. They are joined together to form the joint feature vector. The speech signal is then segmented into homogeneous segments, i.e., segments containing similar speech. In the first segmentation, the temporal window, or scale, used to calculate the segments is 3 seconds. The next step is to increase the scale at some interval and segment at each scale. This builds up a scale-space of segments and is called the audio-video change scale-space. It shows the change of different speakers having the floor at different scales. This scale-space is then broken up into two different turn scale ranges. The next step is to detect the group turns. The change values in the range of group turn patterns are summed to get the group turn BIC curve.
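The five numbered steps above can be sketched directly with NumPy and SciPy. This is a minimal single-frame illustration under stated assumptions, not the thesis implementation: the mel-scale formula and triangular filter construction are common conventions, while the 32-filter and 23-coefficient settings echo values quoted later in this chapter; the 1470-sample frame corresponds to 1/30 s at 44 kHz.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
    # Triangular filters equally spaced on the mel scale (step 2).
    f_hi = f_hi or sr / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, sr, n_filters=32, n_coeffs=23):
    frame = frame * np.hamming(len(frame))          # window to remove edge effects
    spec = np.abs(np.fft.rfft(frame)) ** 2          # step 1: power spectrum
    fb = mel_filterbank(n_filters, len(frame), sr)
    energies = fb @ spec                            # step 2: mel-warped powers
    logs = np.log(energies + 1e-10)                 # step 3: logs of mel powers
    return dct(logs, type=2, norm='ortho')[:n_coeffs]  # steps 4-5: DCT, keep first coeffs

coeffs = mfcc(np.random.randn(1470), sr=44100)      # one 1/30 s frame at 44 kHz
```

Repeating this for every frame yields the 23-dimensional, 30 Hz MFCC stream described in Section 4.1.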
The audio-video change scale-space (ACSS) is a two-dimensional function over time t and scale σ whose value is given by

    ACSS(t, σ) = ΔBIC(t−σ, t+σ)    (4.1)

where ΔBIC(t−σ, t+σ) is the change in Bayesian Information Criterion (BIC) value between considering a single multivariate Gaussian model for the MFCC coefficients, X, over the time window t−σ to t+σ versus separate Gaussian models over t−σ to t and over t to t+σ. We used the single Gaussian BIC representation as described in [20]:

    ΔBIC(t−σ, t+σ) = (N/2) log|X_{t−σ,t+σ}| − (N₁/2) log|X_{t−σ,t}| − (N₂/2) log|X_{t,t+σ}| − (λ/2) (d + d(d+1)/2) log N    (4.2)

where λ is a penalty term typically chosen as 1.0, d is the number of MFCC coefficients, |X| is the determinant of the covariance of the sequence of vectors X over the indicated window, N is the number of vectors in the full window, and N₁ and N₂ are the numbers of vectors in the two half windows. A large value of the BIC indicates change in statistics.

We build a multiple scale BIC curve, U(t), using:

    U(t) = Σ_{σ ∈ group turn} ACSS(t, σ)    (4.3)

The summed BIC curve is then low-pass filtered and salient peaks are extracted. A peak is considered salient if it exceeds both its neighboring minima by ασ_U [36], where σ_U is the standard deviation of the BIC curve and α is a constant. The higher the peak, the higher the scale at which there was a change point. Low peaks are considered to be speaker turn scale range change points; higher peaks are considered to be at the group turn scale range. Two close group turn peaks indicate detected simultaneous speech.

The taxonomy of conversation change in Figure 4.3 illustrates the proposed method. The Bayesian Information Criterion (BIC) is used to segment the joint feature vectors into segments. These segments are durations of the clip in which the audio and video information are similar. This segmentation is for the lowest scale (temporal window) possible in the speech segmentation, i.e., 3 seconds. We build a multiscale representation of the joint-feature vectors by
segmenting at each scale. The multi-scale representation of the features is broken up with respect to scale into two different regions: group turns and speaker turns.

Section 4.2.1 describes the segmentation procedure for a single temporal window (scale). The next two steps, building a scale-space and breaking the scales into different turns, are described in Sections 4.2.2 and 4.2.3, respectively.

Figure 4.3: Taxonomy of conversation change. Using audio-video nonverbal cues, a multi-temporal scale conversation change is grouped into speaker and group turns, and commonly occurring conversations for group turns are shown.

4.2.1 Single-Scale Representation

The first step in building the multi-scale representation is the description of the model selection criterion used to detect changes in statistics, the Bayesian Information Criterion (BIC) [37]. Estimating the dimension of a model is a model selection problem widely studied in statistics. Using this model selection criterion, the changes to the joint audio-visual stream are detected.

The audio signal is sampled at 16 kHz, and the Mel-Frequency Cepstral Coefficients (MFCCs) are extracted using 32 filters with the bandwidth ranging from 166 Hz to 4000 Hz. These
settings are chosen from a study [38] that found them to produce good results in speaker recognition and verification tasks. The MFCCs are extracted at 30 Hz, to match the temporal dimensionality of the video. The video features, which are intended to capture motion, are obtained by image differences three frames apart. The difference images are thresholded to suppress jitter noise, dilated by a 3x3 circular mask, downsampled from the original size of 480x720 to 48x72, and vectorized. These features are then projected onto PCA space to reduce dimensionality. The audio and video coefficients are joined together by multiplying the features by a scaling factor, which is used to make the variances of the audio features equal to those of the video features.

The Bayesian Information Criterion (BIC) was introduced for speaker change detection in [20]. Consider a speaker losing the floor to another speaker or another group of speakers at a time instant t. To determine whether time instant t corresponds to a change-point, a time window preceding t is compared to a time window following t. The frames in the two windows are modeled parametrically, and if the two sets of frames in the corresponding windows are deemed to be generated by different models, the time instant t represents a change-point. The BIC, given a set of data X = {x₁, …, x_N}, selects the model that maximizes the likelihood of the data. Since the likelihood increases with the number of model parameters, a penalty term proportional to the number of parameters d is introduced to favor simpler models. The BIC for a model M with parameters Θ is defined as

    BIC(M) = log p(X | Θ) − (λ/2) d log N    (4.4)

where λ is the penalty term (ideally equal to 1) and N is the number of feature vectors. The problem of determining the change point, which indicates a speaker change at the lowest scale, can be converted into a model selection problem. As given in Equation 4.2, the Gaussian BIC represents the models. Accordingly, there are two possible hypotheses. If we assume a unimodal Gaussian model for a speaker or a group of speakers having the floor, then the null hypothesis
would be

    H₀: x_{t−σ}, …, x_t, …, x_{t+σ} ∼ N(μ₀, Σ₀)    (4.5)

The alternate hypothesis is that two different models are needed to describe the data in each window:

    H₁: x_{t−σ}, …, x_t ∼ N(μ₁, Σ₁)  and  x_t, …, x_{t+σ} ∼ N(μ₂, Σ₂)    (4.6)

A positive value of the BIC justifies the alternative hypothesis and suggests that the time instant t is a change-point.

The next step in speaker turn detection is to find the peaks that are actually the individual speaker change points. In previous works, speaker turn is identified by removing all the overlap speech and running a silence detector to delete all the silence frames. This causes every change point to be an individual speaker change point.

This single-scale segmentation detects only speaker turn patterns. When a speaker turn is switched to a group of speakers, the single-scale BIC detects that change as a speaker change and not a group change. At a particular instant of time, if more than one speaker is speaking, there is overlap speech. This overlap speech is incorrectly segmented as a speaker change by speaker diarization algorithms. Although it outperforms methods based on symmetric Kullback-Leibler (KL2) divergence and generalized likelihood, the single-scale BIC method still fails in the case of overlap speech. In order to address this overlap speech problem, a multiscale representation is built. Even though this representation does not segment who and how many speakers were speaking in a group turn segment, it separates group turns from speaker turns.

4.2.2 Build Scale-Space: A Bottom-Up Approach

In Section 4.2.1, single-scale speaker change detection is explained. Detecting speaker turns using a single scale works only when the overlap speech and the silence frames are removed. But this is not ideally the case in many speech exchange systems. There is bound to be some amount of overlap speech involved in the conversations. In order to solve this problem, overlap speech sections have to be detected. Usually, very small overlaps do not matter, as they indicate a
responder's agreement or disagreement with the communicator. These types of overlap speech are accompanied by words such as "Hmmm", "Yes", "No". The goal is to detect overlap speech sections that have a group of speakers speaking together without having the floor. In Section 4.2.3, an explanation of the maximum scales used for each type of turn is presented.

In most recent works, there has been an effort to detect multispeaker speech activity [39]. Unsupervised learning of overlapped speech from multiple channels is one of the techniques used. But the basic problem of who is speaking at what time during overlap speech remains unsolved. As there is a call for more robust and accurate speech activity detection systems, multi-party conversation in general is currently the focus of much attention. One such approach is building a conversation change scale-space, which provides a model selection framework for detecting speaker change points at multiple temporal scales, using nonverbal features from audio and video. In order to build an audio-video change scale-space, a purely bottom-up approach is used.

The audio-video change scale-space (ACSS) is defined as a two-dimensional function over time t and scale σ, whose value is given in Equation 4.1. This scale space is visualized in Figure 4.4. It is a two-dimensional plot of change against scale and time. Brightness indicates the likelihood of change for that time frame at a particular scale: the brighter the value, the more likely is the change.

Figure 4.4: An example audio-video change scale-space. Time: horizontal axis; scale: vertical axis.
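A single value of this scale-space, the ΔBIC of Equation (4.2) evaluated at one (t, σ) pair, can be sketched as below. This is a toy illustration with made-up data, not the thesis implementation: the small ridge term, the sampling grid of times and scales, and measuring σ in frames are assumptions; the λ = 1 default follows the text.

```python
import numpy as np

def log_det_cov(X):
    # Log-determinant of the covariance of a (frames x dims) block,
    # with a small ridge added for numerical stability (an implementation choice).
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    sign, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(X, t, sigma, lam=1.0):
    # Change in BIC: one Gaussian over [t-sigma, t+sigma) versus
    # separate Gaussians over [t-sigma, t) and [t, t+sigma), as in Eq. (4.2).
    left, right, both = X[t-sigma:t], X[t:t+sigma], X[t-sigma:t+sigma]
    n, d = both.shape
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return (n / 2 * log_det_cov(both)
            - sigma / 2 * log_det_cov(left)
            - sigma / 2 * log_det_cov(right)
            - penalty)

# ACSS(t, sigma): evaluate delta_bic over a grid of times and scales on
# synthetic features with an abrupt statistical change at frame 150.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(5, 1, (150, 4))])
acss = np.array([[delta_bic(X, t, s) for t in range(60, 240, 10)]
                 for s in (30, 45, 60)])
```

In this synthetic example the column of `acss` nearest the true change point carries a large positive value, while columns inside a homogeneous region stay low, which is exactly the bright-versus-dark contrast described for Figure 4.4.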
(a) (b) (c)

Figure 4.5: Three different audio-video change scale-spaces. The change scale-space is shown as intensity changes. Bright intensity corresponds to the likelihood of conversation change at a particular scale (vertical axis) and time instant (horizontal axis). Each change scale-space is for an 8-minute conversation.
The entire conversation scale-space can be considered as capturing the texture of conversation. There are many short conversation patterns inside this texture. In Figure 4.5, three example audio-video change scale-spaces are shown; each of the videos for which a scale-space is shown has its own texture of conversation. To understand this scale-space visualization, a good approach is to look at all the brightest spots in the images. These indicate statistical changes in the joint-feature vector. As the scale grows longer, there are fewer time instants where the image is bright, because all the small speaker changes become statistically insignificant at coarser scales. In the finer, or shorter, scales, there tend to be more bright spots because of speaker changes happening.

In Figure 4.6, two different group turn patterns are shown, each with the corresponding window where the group has the floor. The first group of speakers (red) has more speaker turns, which indicates more exchanges. When this type of conversation is involved, there are a lot of bright spots on the scale-space. The scale of the bright spots for this turn depends on the length of change of the speaker holding the floor. In the second group of speakers (blue), there are only three changes of speakers holding the floor. This is indicative of a single dominant speaker involved in the conversation.

4.2.3 Scale-Space Break-Up

In Chapter 3, a discussion of two types of turn patterns is provided. These turn patterns can be identified in the audio-video change scale-space as scale ranges. For each turn pattern, a specific number of scales is allocated. This is a break-up of the scales in terms of speaker and group turns. Figure 4.7 shows a conceptual break-up of the entire scale range: the division of the scale-space into regions where speaker and group turn patterns can be detected.

Speaker turns require scales of finer lengths. Finer scales capture small changes in statistics and provide accurate change points for speaker change detection. The group turn scale range is decided depending on the length of the video. The speaker turn pattern range is small and is equal to 6 seconds [2]. In the case of the scale-space break-up, the speaker turn range is increased from 3 seconds to 6 seconds. This is done because, even though recognizing a speaker turn requires only 3 seconds, an additional 3 seconds is allotted to deliberately miss all
the overlap speech or silence within the range of those extra 3 seconds. This results in coarser scales being assigned to the next level of turn patterns.

The next scale range is assigned to group turn patterns. In this type of turn pattern, coarser-length changes in a single speaker or group of speakers who have the floor are detected. The group turn scale range starts where the speaker turn scale range ends. In the dataset being used, all the videos are of the same length; as a result, the maximum range of this turn pattern is fixed. Ideally, this range increases as the length of the video increases. The higher the scale range, the coarser the detection of turns.

Figure 4.6: Two different group turn patterns with ground truth. The brightest spots indicate change. The first group turn (red) involves a different group of speakers losing and gaining the floor. The second group turn (blue) involves a completely different set of speakers in the conversation. Scale: vertical axis; time: horizontal axis.
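The break-up of the scale axis described above reduces to simple bookkeeping, sketched below. The 30 Hz frame rate and the 3 s/6 s/60 s boundaries come from the text; the 1-second scale sampling is an assumption for illustration.

```python
# Split the scale axis (temporal window lengths, in seconds) into the
# speaker-turn and group-turn ranges described in Section 4.2.3.
FRAME_RATE = 30                       # Hz, from the text
scales_sec = list(range(3, 61))       # assumed 1 s sampling, 3 s up to 60 s

speaker_turn = [s for s in scales_sec if s <= 6]   # finer scales, up to 6 s
group_turn = [s for s in scales_sec if s > 6]      # coarser scales, above 6 s

# Window half-lengths in frames, as used when indexing the feature stream.
scales_frames = [s * FRAME_RATE for s in group_turn]
```

Summing the ACSS rows that correspond to `group_turn` then yields the group turn BIC curve of Equation 4.3.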
Figure 4.7: Conceptual break-up of an audio-video change scale-space. The change scale-space is shown as intensity changes. Bright intensity corresponds to the likelihood of conversation change at a particular scale (vertical axis) and time instant (horizontal axis). This change scale-space is for an 8-minute conversation.

Once the group turn scale range is decided, the next step is to detect the group turn patterns. To detect a group turn pattern, at each instant of time, the BIC values of the entire group turn scale range are summed using Equation 4.3. This gives a summed BIC curve that represents the scales of only group turns. Change points are detected in this summed BIC curve in order to get the group turns. The summed BIC curve is then low-pass filtered and salient peaks are extracted. A peak is considered salient if it exceeds both its neighboring minima by ασ_U [36], where σ_U is the standard deviation of the BIC curve and α is a constant.

Figure 4.7 illustrates the conceptual break-up of this technique on an excerpt in which five people are conversing. The black areas indicate the time instants in which a single speaker or a group of speakers was speaking.

4.3 Synopsis

In this chapter, segmentation of conversation according to group turns is discussed. In order to segment, we build a scale-space representation of the entire conversation. This gives the temporal structure of the conversation and the changes taking place at various lengths. The segmentation, at each scale, is done using the Bayesian Information Criterion. The length of
the temporal window, which is the scale, is varied to build the scale-space. The scale-space is then broken up into speaker and group turn ranges. This representation uses audio-visual joint features to build up the scale-space. A discussion of MFCC features is also included in this chapter.
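The group-turn detection step summarized in this chapter (summing the ACSS over the group-turn scale rows as in Equation 4.3, low-pass filtering, and keeping salient peaks) can be sketched as below. The moving-average filter, the salience constant α = 1, and the synthetic input are illustrative assumptions; only the salience rule itself follows the text.

```python
import numpy as np

def neighboring_min(U, t, step):
    # Walk from the peak until the curve turns upward: the adjacent local minimum.
    i = t
    while 0 < i + step < len(U) and U[i + step] <= U[i]:
        i += step
    return U[i]

def group_turn_peaks(acss, group_rows, alpha=1.0, width=5):
    # Equation (4.3): sum change values over the group-turn scale range.
    U = acss[group_rows].sum(axis=0)
    # Low-pass filter: simple moving average (assumed filter choice).
    U = np.convolve(U, np.ones(width) / width, mode='same')
    thresh = alpha * U.std()
    # A peak is salient if it exceeds both neighboring minima by alpha * std.
    peaks = [t for t in range(1, len(U) - 1)
             if U[t] >= U[t - 1] and U[t] >= U[t + 1]
             and U[t] - neighboring_min(U, t, -1) > thresh
             and U[t] - neighboring_min(U, t, +1) > thresh]
    return U, peaks

# Synthetic ACSS whose group-turn rows carry changes near t = 25 and t = 70.
t = np.arange(100)
bump = 10 * (np.exp(-((t - 25) / 5.0) ** 2) + np.exp(-((t - 70) / 5.0) ** 2))
acss = np.tile(bump / 3, (3, 1))
U, peaks = group_turn_peaks(acss, group_rows=slice(0, 3))
```

On this synthetic input, the two salient peaks recovered are the two simulated group-turn change points, while the flat regions produce no spurious detections.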
CHAPTER 5
DATASET AND RESULTS

5.1 Dataset

The proposed approach is tested on a subset of the NIST pilot meeting-room corpus [40]. The corpus contains 19 meetings recorded in a room rigged with five cameras and four table-mounted microphones. Of the five cameras, four are fixed cameras mounted on each wall of the room facing the central table, and the fifth is a floating camera that zooms onto salient activities in the scene, such as the speaker, the whiteboard, or the projection screen. Of the four table microphones, three are omni-directional microphones, and one is a 4-channel directional microphone. The meeting room setup is shown in Figure 5.1. Each participant in the room also wears a head microphone (a directional lapel microphone).

The database available to us contains the five camera feeds for each meeting, and the videos have a spatial resolution of 720x480 sampled at 29.97 Hz. There are two audio channels packaged with each video: one is a gain-normalized mix of all the head microphones, and the second is a gain-normalized mix of all the distant microphones. The audio streams are sampled at 44 kHz and have a resolution of 16 bits per sample. Of the 19 meetings, three were excluded from the experiments because two of them did not have associated ground truth and the third consisted entirely of a presentation by one person. Audio-visual pairings are considered for each meeting by pairing each of the fixed cameras with one of the audio channels, resulting in 15 meeting clips. From each video (1-15), the first 90 seconds are discarded, and the next 8 minutes are chosen, resulting in approximately 7 segments per video. Videos 1-4 in Table 5.1 are chosen at random from among the videos. We have 7 x 15 = 105 segments.

In all of the meetings, participants are seated around a central table and interact casually. Depending on the type of the meeting, the participants discuss a given topic, plan events, play
games, or attend presentations. From time to time, participants may take notes, stretch, and sip drinks. In some of the meetings, participants leave their chairs to use the whiteboard or distribute materials.

Figure 5.1: The NIST meeting room setup. Meetings are recorded using four fixed cameras, one on each wall, and a floating camera on the east wall [40]. The audio is recorded using four table microphones and three wall-mounted microphone arrays, in addition to a lapel and a head microphone for each participant.

The audio and video signals from these meetings are quite complex because the meetings are unscripted and of long duration. Since only a single camera view is considered at a time, most faces are non-frontal and sometimes participants are only partially visible. In some meetings, a participant may not be visible at all in a particular camera view. Even when the faces are frontal, they are often occluded by the person's own hand. Similarly, the audio signal is complex, consisting of short utterances, frequent overlaps in speech, and non-speech sounds such as wheezing, laughing, coughing, etc. Sample images of all clips from the dataset are shown in Figure 5.2. In order to quantitatively characterize the meetings in the dataset, we use the following variables:
Table 5.1: Analysis of the dataset with respect to the number of speakers and speaker entropy (speaker dominance, given by which speaker is speaking per time instant).

No.  Video Name          Type                   No. of Segments  Speakers  Entropy
1    NIST 20020627-1010  Staff Meeting          7                6         2.13
2    NIST 20020305-1007  Planning               7                7         1.62
3    NIST 20020815-1316  Problem solving        7                4         1.63
4    NIST 20020213-1012  Planning               7                6         1.91
5    NIST 20011115-1050  Focus Group            7                4         1.25
6    NIST 20011211-1054  Planning               7                3         1.28
7    NIST 20020111-1012  Planning               7                6         1.37
8    NIST 20020213-1012  Staff Meeting          7                6         1.87
9    NIST 20020214-1148  Interaction w/ expert  7                6         1.85
10   NIST 20020304-1352  Game playing           7                4         1.59
11   NIST 20020305-1007  Planning               7                7         1.79
12   NIST 20020627-1010  Staff meeting          7                6         2.26
13   NIST 20020731-1409  Game playing           7                4         2.03
14   NIST 20020815-1316  Problem solving        7                4         1.61
15   NIST 20020904-1322  Interaction w/ expert  7                4         1.93

Number of participants in the room (Speakers): We assume this metric will correlate with the video performance: the greater the number of speakers, the closer together they will be seated, leading to occlusions. Secondly, the overall amount of distracting motion may also be proportional to the number of speakers. Also, although the number of speakers may not directly correlate with audio performance, a larger number of people results in increased background noise due to the various sounds that listeners make when they swivel in their chairs, tap the table, or flip pages.

Speaker Entropy (Entropy): This is a measure of speaker domination in a meeting. A low entropy indicates that only a few speakers speak for most of the time, whereas a high entropy indicates that the participants involved spoke more or less for about the same duration. The entropy is computed as

    H(conversation) = − Σᵢⁿ P(Sᵢ) log P(Sᵢ)    (5.1)

    P(Sᵢ) = d(Sᵢ) / Σᵢⁿ d(Sᵢ)    (5.2)
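Equations (5.1) and (5.2) can be sketched as below; the speaking durations are made-up illustrative values, and the base-2 logarithm is an assumption (the chapter does not state the base, though the table's entropy values are consistent with it).

```python
import math

def speaker_entropy(durations):
    # P(S_i): fraction of total speaking time for speaker i  (Eq. 5.2)
    total = sum(durations)
    probs = [d / total for d in durations]
    # H = -sum P(S_i) log P(S_i)  (Eq. 5.1); base 2 is an assumption
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One dominant speaker gives low entropy; equal speakers give high entropy.
low = speaker_entropy([300, 20, 20, 20])   # seconds spoken per speaker
high = speaker_entropy([90, 90, 90, 90])
```

For four equally talkative speakers the entropy reaches its maximum of log2(4) = 2 bits, consistent with the upper values seen in the table.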
(a)-(o)

Figure 5.2: A frame from each clip of the dataset. The frames are from the same camera for the whole dataset; this camera view (b-o) is from the west wall [40].
where n is the number of speakers involved in the meeting, d(Sᵢ) is the total time duration for which person Sᵢ speaks, and P(Sᵢ) is the percentage of time (i.e., the probability) that person Sᵢ speaks during the meeting.

Figure 5.3: True and false positive plot for a variable threshold. The threshold points correspond to peak selection done for values ranging from max(U) to max(U)/16. The highest value with low false positives is for a threshold of max(U)/3. The respective true positives and false positives are shown.

5.2 Results

5.2.1 Group Turn Scale Range

One of the first results in group turn pattern detection is the visual comparison of changes in the entire scale-space with only the group turn region. The speaker turn range is eliminated and the group turn range is chosen. Figure 5.4 shows the comparison of the entire scale-space with different scale ranges of group turns. The result is that all the speaker turn patterns are eliminated and only the group turn pattern is considered. As shown in Figure 5.4, the changes are less dense than in the original scale-space. This shows a higher abstraction of the conversation: the changes indicate a change in group turn rather than a single turn. Figure 5.4 also shows that when the range of the group turn is reduced (b-d), distinguishing between group turns and speaker turns becomes harder. This is due to the scale range reducing to a single-scale BIC curve (e). A similar result is quantitatively analyzed later in the chapter.
(a) (b) (c) (d) (e)

Figure 5.4: Reducing the range of group turn scales. (a) is the entire scale space; (b)-(d) show the group turn scale-space reduced in steps of 20, from 72 scales at (b) to 32 scales at (d), respectively; (e) is the single-scale BIC curve.
5.2.2 Peak Cut-Off in Group Turn

The cut-off value is in terms of the maximum value of U, obtained from Equation 4.3, for each segment. A cut-off value greater than max(U)/2 has the lowest true positive and false positive rates. A cut-off value greater than max(U)/3 has the highest percentage of true positives identified for all the segments. In Figure 5.3, a true positive/false positive plot for varying threshold values with a fixed group turn scale range is shown for the entire dataset. The plot in Figure 5.3 gives the average values over the entire dataset.

Before group turn patterns can be detected, the entire scale space must be broken down into small, equal segments, which are then marked with group turn ground truth. In Figure 5.5, three different types of segments with detected group turn patterns are shown. Each of the three results is a segment and consists of the audio ground truth (top row), the change scale-space, and the multiscale change points detected on the summed BIC curve. For each segment, the group turns were manually marked on the ground truth; they are marked by diamond-shaped markers under the ground truth of the segments. Every change point detected as the best group turn in a segment lies within 1 second behind or ahead of the ground-truth-marked group turn. There are other change points detected in the same segment, which are the result of the change point from the previous segment. The cumulative rates of different cut-off values for peak selection are used for each clip.

(a) (b) (c)

Figure 5.5: Three different segments showing group turn patterns. Peaks (bottom) and the corresponding ground truth (top) are shown with hand-marked group turn patterns (diamonds). In the audio ground truth (top), a black patch represents a speaker having the floor.
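The true/false positive accounting used in these results, where a detected change point counts as correct if it falls within 1 second of a hand-marked group turn, can be sketched as below. The greedy one-to-one matching policy is an assumption; the thesis does not spell out the exact bookkeeping.

```python
def score_detections(detected, ground_truth, tol=1.0):
    # Greedily match each detected change point (in seconds) to an unused
    # ground-truth group turn within +/- tol seconds.
    unused = list(ground_truth)
    tp = 0
    for t in sorted(detected):
        hit = next((g for g in unused if abs(g - t) <= tol), None)
        if hit is not None:
            unused.remove(hit)   # each ground-truth turn can be matched once
            tp += 1
    fp = len(detected) - tp      # detections with no nearby ground truth
    fn = len(ground_truth) - tp  # ground-truth turns that were missed
    return tp, fp, fn

# Hypothetical segment: three detections against three marked group turns.
tp, fp, fn = score_detections([4.2, 11.0, 30.5], [4.0, 12.0, 21.0])
```

Sweeping the cut-off from max(U) down to max(U)/16 and re-scoring with this routine produces the threshold curve of Figure 5.3.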
5.2.3 Group Turn Scale Range Analysis

In this section, we discuss the changes to the detection of group turns with respect to the scale chosen. The scale range of a group turn starts from 6 seconds. This starting scale is chosen because changes to the speaker turn can be identified at this scale [28]. For all the segments in the dataset, the maximum scale range for group turns is chosen at 60 seconds. This scale is reduced continuously and the group turn range detection is calculated. The highest scale (σmax) chosen was 40. The peak detection rates are calculated at scale 40, and then the range is increased by one scale on either side of σmax. The maximum range of the group turn is from σmax − σmax/2 to σmax + σmax/2. The detection rate remains the same for a few ranges of scales, then changes. This is due to the nature and the length of the conversation. Figure 5.6 shows the true positive and false positive plot for the scales at which there was a change in the detection rate. This plot is for the entire dataset, with accuracies averaged over all the videos. The smaller the scale range, the fewer the false positives and true positives. When the range is finally reduced to a single scale, the detection is almost zero, as none of the group turn change points occurring in the conversation are detected. This also suggests that every group turn pattern is a speaker turn pattern, and not vice-versa.

Figure 5.6: True and false positive plot for different group scale ranges. The changes are at the above points, which indicate the scale at which there was a change. The lowest scale will have fewer false positives.
(a) (b) (c) (d)

Figure 5.7: Different segments of group turns with different scale ranges. Peaks (bottom) and the corresponding ground truth (top) are shown with hand-marked group turn patterns (diamonds). In the audio ground truth (top), a black patch represents a speaker having the floor.
CHAPTER 6
CONCLUSIONS

6.1 Conclusions

The separation of speaker turns from group turns provides a higher-level abstraction of meeting room conversations. The speaker turn involves the detection of persons speaking individually, while the group turn provides detection of groups of speakers speaking together, indicating overlap speech. The group turn is an intermediate-level feature that can be used in video classification and retrieval. It can also be used as a playback feature, in order to jump to the section of the conversation in which a particular group of speakers is speaking.

Conversations are a set of complex turn patterns. Recognizing conversation changes using speaker and group turn patterns provides a rich description of meeting room conversations. We presented a novel representation, an audio-video change scale-space, that provides a snapshot of the conversations at multiple temporal scales, built in a purely bottom-up fashion. We demonstrated how this representation is used in automated group turn detection in meeting segments. There are two types of turn patterns described in this work: speaker and group turn patterns. We need both types of turn patterns to effectively describe a conversation change. The focus of this thesis is on distinguishing the speaker turn from the group turn.

6.2 Future Work

Future uses of this representation could include automatically generating rich descriptions of meetings by capturing multiple temporal scales. It is an intermediate-level feature for describing conversations. The detection of another level of hierarchy above group turns is one of the main focuses in the future. Another focus is toward classification of different group turn patterns, in
order to classify the conversation into different states such as presentation, discussion, argument, break, and even silence.
REFERENCES

[1] S. Eggins and D. Slade. Analysing Casual Conversation. Equinox, 1997.

[2] H. Vajaria, T. Islam, S. Sarkar, R. Sankar, and R. Kasturi. Audio segmentation and speaker localization in meeting videos. In Proc. of ICPR, 2:1150-1153, 2006.

[3] R. Krishnan and S. Sarkar. Detecting group turn patterns in conversations using audio-video change scale-space. To be presented in ICPR, 2010.

[4] M. R. Naphade and T. S. Huang. Extracting semantics from audio-visual content: the final frontier in multimedia retrieval. IEEE Transactions on Neural Networks, 13:793-810, 2002.

[5] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. In Proc. of the IEEE Int. Conf. on Multimedia (ICME), 2006.

[6] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, P. Wellner, D. Moore, and H. Bourlard. Modeling human interactions in meetings. In Proc. of ICASSP, 3, 2003.

[7] Dong Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud. Modeling individual and group actions in meetings: A two-layer HMM framework. In Proc. of the IEEE CVPR Workshop on Event Mining, 2004.

[8] S. Reiter, B. Schuller, and G. Rigoll. Extracting semantics from audio-visual content: the final frontier in multimedia retrieval. IEEE Transactions on Neural Networks, 13:793-810, 2002.

[9] M. R. Naphade and T. S. Huang. Segmentation and recognition of meeting events using a two-layered HMM and a combined MLP-HMM approach. IEEE Transactions on Neural Networks, 13:793-810, 2002.
[10] T. Teixeira, J. Deokwoo, G. Dublon, and A. Savvides. Recognizing activities from context and arm pose using finite state machines. In Proc. of the Third ACM/IEEE International Conference on Distributed Smart Cameras, 1-8, 2009.

[11] Z. Yu, H. Aoyama, M. Ozeki, and Y. Nakamura. Collaborative capturing and detection of human interactions in meetings. In Proc. of PERVASIVE, 65-69, 2008.

[12] Z. Yu, Z. Yu, Y. Ko, X. Zhou, and Y. Nakamura. Inferring human interactions in meetings: A multimodal approach. In Proceedings of the 6th International Conference on Ubiquitous Intelligence and Computing, 14-24, 2009.

[13] A. Nijholt, R. J. Rienks, J. Zwiers, and D. Reidsma. Online and off-line visualization of meeting information and meeting support. The Visual Computer, 22:695-976, 2006.

[14] O. Brdiczka, J. Maisonnasse, P. Reignier, and J. L. Crowley. Detecting small group activities from multimodal observations. Applied Intelligence, 30:47-57, 2009.

[15] Peng Dai, Linmi Tao, and Guangyou Xu. Audio-visual fused online context analysis toward smart meeting room. IEEE Transactions on Neural Networks, 13:793-810, 2002.

[16] S. E. Tranter and D. A. Reynolds. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14:1557-1565, 2006.

[17] K. Otsuka, H. Sawada, and J. Yamato. Automatic inference of cross-modal nonverbal interactions in multiparty conversations: "who responds to whom, when, and how?" from gaze, head gestures, and utterances. In Proc. of the Ninth International Conference on Multimodal Interfaces, 255-262, 2007.

[18] A. Dielmann and S. Renals. Dynamic Bayesian networks for meeting structuring. In Proc. of ICASSP, 5:629-632, 2004.

[19] A. Dielmann and S. Renals. Dynamic Bayesian networks for meeting structuring. IEEE Transactions on Multimedia, 9:25-36, 2007.
[20] S. Chen and P. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Speech Recognition Workshop, 127-132, 1998.

[21] S. Kwon and S. Narayanan. Speaker change detection using a new weighted distance measure. In Proc. of the ICSLP, 4:2537-2540, 2002.

[22] M. Kotti, E. Benetos, and C. Kotropoulos. Automatic speaker change detection with the Bayesian information criterion using MPEG-7 features and a fusion scheme. In Proc. of the IEEE International Symposium on Circuits and Systems, 2006.

[23] M. Kotti, L. G. P. M. Martins, E. Benetos, J. S. Cardoso, and C. Kotropoulos. Automatic speaker segmentation using multiple features and distance measures: a comparison of three approaches. In Proc. of the IEEE International Conference on Multimedia and Expo, 1101-1104, 2006.

[24] S. Duncan. Some signals and rules for taking speaker turns in conversations. Journal of Personality and Social Psychology, 2:283-292, 1972.

[25] N. Campbell, T. Sadanobu, M. Imura, N. Iwahashi, S. Noriko, and D. Douxchamps. A multimedia database of meeting and informal interactions for tracking participant involvement and discourse flow. In Proc. of the Language Resources and Evaluation Conference (LREC), 2006.

[26] N. Campbell and D. Douxchamps. Processing image and audio information for recognizing discourse participation status through features of face and voice. In Proc. of INTERSPEECH, 2007.

[27] C. M. Adda-Decker, P. P. G. Adda, B. M. Philippe, and H. Benoit. Annotation and analysis of overlapping speech in political interviews. In Proc. of the Sixth International Language Resources and Evaluation (LREC), 2008.

[28] J. Jaffe and S. Feldstein. Rhythms of Dialogue. New York: Academic Press, 1970.


[29] L. M. Dabbs, Jr., and R. B. Ruback. Vocal patterns in male and female groups. Personality and Social Psychology Bulletin, 518–525, 1984.

[30] A. J. Sellen. Speech patterns in video-mediated conversations. In Proc. of SIGCHI, 49–59, 1992.

[31] K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland. Overlapped speech detection for improved speaker diarization in multiparty meetings. In Proc. of ICASSP, 4353–4356, 2008.

[32] S. Renals and D. Ellis. Audio information access from meeting rooms. In Proc. of ICASSP, 4:744, 2003.

[33] F. Quek, D. McNeill, R. Ansari, X. F. Ma, R. Bryll, S. Duncan, and K. E. McCullough. Gesture cues for conversational interaction in monocular video. In RATFG-RTS, 1999.

[34] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Inc., 1993.

[35] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Readings in Speech Recognition, 65–74, 1990.

[36] S. S. Cheng and H. M. Wang. A sequential metric-based audio segmentation method via the Bayesian information criterion. In Proc. of Eurospeech, 945–948, 2003.

[37] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.

[38] T. Ganchev, N. Fakotakis, and G. Kokkinakis. Comparative evaluation of various MFCC implementations on the speaker verification task. In Proceedings of the 10th International Conference on Speech and Computer (SPECOM), 1:191–194, 2005.

[39] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke. The meeting project at ICSI. In Proc. of the Human Language Technology Conference, 1–7, 2001.


[40] M. Michel, J. Ajot, and J. Fiscus. The NIST meeting room phase II corpus. In Proc. of MLMI, 3:1–3, 2006.

