USF Libraries
USF Digital Collections

Association of sound to motion in video using perceptual organization


Material Information

Title:
Association of sound to motion in video using perceptual organization
Physical Description:
Book
Language:
English
Creator:
Ravulapalli, Sunil Babu
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2006
Subjects

Subjects / Keywords:
Video surveillance
Sound association
Auditory scene analysis
Auditory object
Motion detection
Dissertations, Academic -- Computer Engineering -- Masters -- USF
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: Technological developments and innovations of the first forty years of the digital era have primarily addressed either the audio or the visual senses. Consequently, designers have primarily focused on the audio or the visual aspects of design. In the context of video surveillance, the data under consideration has always been visual. However, in light of new behavioral and physiological studies which established proof of cross modality in human perception, i.e. humans do not process audio and visual stimuli separately but perceive a scene based on all the stimuli available, similar cues are being used to develop a surveillance system which uses both the audio and visual data available. Human beings can easily associate a particular sound to an object in the surroundings. Drawing from such studies, we demonstrate a technique by which we can isolate concurrent audio and video events and associate them based on perceptual grouping principles. Associating sound to an object can form a part of a larger surveillance system by producing a better description of objects. We represent audio in the pitch-time domain and use image processing algorithms such as line detection to isolate significant events. These events are then grouped based on the gestalt principles of proximity and similarity, which operate in audio. Once auditory events are isolated we can extract their periodicity. In video, we can extract objects by using simple background subtraction. We extract motion and shape periodicities of all the objects by tracking their position or the number of pixels in each frame. By comparing all the periodicities in audio and video using a simple index we can easily associate audio to video. We show results on five scenarios in outdoor settings with different kinds of human activity, such as running and walking, and other moving objects such as balls and cars.
Thesis:
Thesis (M.A.)--University of South Florida, 2006.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Sunil Babu Ravulapalli.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 38 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001793783
oclc - 145505134
usfldc doi - E14-SFE0001524
usfldc handle - e14.1524
System ID:
SFS0025842:00001




Full Text



PAGE 1

Association of Sound to Motion in Video Using Perceptual Organization

by

Sunil Babu Ravulapalli

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Major Professor: S. Sarkar, Ph.D.
Rangachar Kasturi, Ph.D.
Ravi Sankar, Ph.D.

Date of Approval: March 29, 2006

Keywords: Video Surveillance, Sound Association, Auditory Scene Analysis, Auditory Object, Motion Detection

Copyright 2006, Sunil Babu Ravulapalli

PAGE 2

ACKNOWLEDGEMENTS

I would like to thank Dr. Sudeep Sarkar for giving me this opportunity to work with him. I am grateful to him for his enthusiasm for research and his moral support. Without his patient guidance this work would not have been possible. I also thank Dr. Rangachar Kasturi and Dr. Ravi Sankar for the valuable time they took to review this thesis and for their helpful comments. I am very grateful to my loving Father, Mother, and Sister, without whose moral support this work would not have reached completion. I am indebted to them for being a perpetual source of inspiration and motivation for me. Finally, I cannot forget the help and support of all my friends who have been with me in every step of this process.

PAGE 3

TABLE OF CONTENTS

LIST OF TABLES ii
LIST OF FIGURES iii
ABSTRACT vi
CHAPTER 1 INTRODUCTION 1
1.1 Cross Modality in Human Perception 2
1.2 Cross Modality in Artificial Perception 6
1.3 Contribution of this Thesis 8
1.4 Layout of Thesis 8
CHAPTER 2 HUMAN PERCEPTION OF SOUND 9
2.1 Cues Used by ASA Process 9
2.2 Effects of ASA on Perception 10
2.3 Similarity in Audio and Vision 11
2.4 Theory of Indispensable Attributes 11
CHAPTER 3 AUDIO VIDEO ASSOCIATION 14
3.1 Grouping Sound Events 14
3.1.1 Spectrogram Representation 14
3.1.2 Line Detection 18
3.1.3 Grouping 20
3.1.4 Property Extraction 20
3.2 Grouping Video Objects 21
3.3 Association 22
CHAPTER 4 RESULTS 26
4.1 Scenario 1: Person Bouncing Ball and Walking Person 26
4.2 Scenario 2: Walking Person and Moving Car 28
4.3 Scenario 3: Person Walking and Another Person Running 29
4.4 Scenario 4: Person Walking and Bouncing Ball 30
4.5 Scenario 5: Person Walking, Person Running and Bouncing Ball 31
CHAPTER 5 CONCLUSIONS 36
REFERENCES 37

PAGE 4

LIST OF TABLES

Table 3.1 Orientation Groups 20
Table 4.1 Difference Percentages: Video Object 1 = Bouncing Ball, Video Object 2 = Footsteps 27

PAGE 5

LIST OF FIGURES

Figure 1.1 Simplified Representation of a Stimulus Sequence that Exhibits "Freezing" Effect [5] 3
Figure 1.2 Simplified Representation of a Stimulus to Measure Motion After Effect (MAE) [6] 4
Figure 1.3 Attentional Modulation of MAE Perception [6] 5
Figure 1.4 Attentional Modulation of fMRI Decay Time in Area MT+ [6] 6
Figure 2.1 Grouping by Gestalt Principles (a) Grouping by Proximity (b) Grouping by Similarity [2] 11
Figure 2.2 An Illusion Which is Seen as Two Faces or a Vase Depending on Our Interpretation of the Background 12
Figure 2.3 Theory of Indispensable Attributes in Vision (a) Two Yellow Spots at Same Spot, Observer Sees One Yellow Spot (b) One Yellow and One Blue Create One White Spot (c) Two Lights Create Two Separate Spots, Regardless of Color Observer Sees Two Spots [18] 12
Figure 2.4 Theory of Indispensable Attributes in Audio (a) One Speaker Plays Two Ds, Listener Hears One D Sound (b) One Speaker Plays D While Another Plays a D, Listener Hears One Sound (c) D and F are Played Over One Speaker, Listener Hears Two Sounds [18] 13
Figure 3.1 Overview of Association of Sound to Object in Video 15
Figure 3.2 Computational Model of Grouping Sound Events 16
Figure 3.3 Spectrograms (Top Left) Footsteps and Moving Car (Top Right) Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps 17
Figure 3.4 Thresholded Spectrograms (Top Left) Footsteps and Moving Car (Top Right) Sound of Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps 18

PAGE 6

Figure 3.5 Spectrogram Images with Detected Lines (Top Left) Footsteps and Moving Car (Top Right) Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps 19
Figure 3.6 Isolated Auditory Events (Top Left) Bird Chirping (Top Right) Footsteps (Middle Left) Ball Bounce (Middle Right) Footsteps (Bottom Left) Car Horn (Bottom Right) Truck Horn 23
Figure 3.7 Column Wise Summation of an Auditory Group 24
Figure 3.8 A Walking Person Exhibits Shape Change Periodicity 24
Figure 3.9 A Bouncing Ball Exhibits Track Change Periodicity 25
Figure 3.10 (Top) Spectrogram of the Sound Waveform Received. (Bottom) Corresponding Frames for Each Event in the Spectrogram 25
Figure 4.1 (Left) Spectrogram (Middle) Thresholded Spectrogram (Right) Detected Lines 26
Figure 4.2 Audio Groups 27
Figure 4.3 (Top) Video Frames (Middle) Thresholded Frames (Bottom) Periodicity Curves 28
Figure 4.4 Association of Sound. Top Row Shows the Lines Detected from the Spectrogram of the Sound Wave from Which We Obtain Audio Objects Shown in the Second Row. Third Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row 29
Figure 4.5 Spectrogram and the Corresponding Audio Groups 30
Figure 4.6 (Top) Video Frames (Bottom) Periodicity Curves 31
Figure 4.7 Top Row Shows the Spectrogram on the Left Side and Lines Detected on the Right Side. Two of the Audio Objects are Shown in the Third Row. Fourth Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row 32
Figure 4.8 Top Three Rows Show the Spectrogram, Detected Lines and the Audio Objects. Fourth Row Shows the Periodicity Curves of Objects in the Video. The Fifth Row Shows Thresholded Frames and the Last Row Shows Video Frames 33
Figure 4.9 Top Row Shows the Spectrogram and the Next Row Shows the Lines Detected. Two of the Audio Objects are Shown in the Third Row. Fourth Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row 34

PAGE 7

Figure 4.10 Top Row Shows the Lines from the Spectrogram. Three of the Audio Objects are Shown in the Second Row. Third Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row 35

PAGE 8

ASSOCIATION OF SOUND TO MOTION IN VIDEO USING PERCEPTUAL ORGANIZATION

Sunil Babu Ravulapalli

ABSTRACT

Technological developments and innovations of the first forty years of the digital era have primarily addressed either the audio or the visual senses. Consequently, designers have primarily focused on the audio or the visual aspects of design. In the context of video surveillance, the data under consideration has always been visual. However, in light of new behavioral and physiological studies which established proof of cross modality in human perception, i.e. humans do not process audio and visual stimuli separately but perceive a scene based on all the stimuli available, similar cues are being used to develop a surveillance system which uses both the audio and visual data available. Human beings can easily associate a particular sound to an object in the surroundings. Drawing from such studies, we demonstrate a technique by which we can isolate concurrent audio and video events and associate them based on perceptual grouping principles. Associating sound to an object can form a part of a larger surveillance system by producing a better description of objects. We represent audio in the pitch-time domain and use image processing algorithms such as line detection to isolate significant events. These events are then grouped based on the gestalt principles of proximity and similarity, which operate in audio. Once auditory events are isolated we can extract their periodicity. In video, we can extract objects by using simple background subtraction. We extract motion and shape periodicities of all the objects by tracking their position or the number of pixels in each frame. By comparing all the periodicities in audio and video using a simple index we can easily associate audio to video.

PAGE 9

We show results on five scenarios in outdoor settings with different kinds of human activity, such as running and walking, and other moving objects such as balls and cars.

PAGE 10

CHAPTER 1
INTRODUCTION

Automated surveillance addresses real-time observation of people, vehicles and other moving objects within a complicated environment, leading to a description of their actions and interactions. The technical issues include moving object detection and tracking, object classification, human motion analysis and activity understanding. The most commonly used sensors for surveillance are imaging sensors, e.g. video cameras and thermal imaging systems. There are a number of video surveillance systems based on a single camera or on hundreds of cameras. To achieve continuous monitoring, infrared (IR) cameras are used along with the optical cameras under low illumination. However, motion detection and tracking are often uncertain and incomplete, as they require recognition methods that can handle the probabilities accurately. Of late, "detection and tracking" systems based on non-imaging measurements are being experimented with. Measurement of force on load cells [1] has already been used. Though the performance of such sensors is high, the cost of their usage is also very high. In this regard the microphone turns out to be a good sensor due to its compact size, low cost, small data volume and low power consumption. Also, recent experiments in the psychological domain have shown that human perception does not process senses such as vision, sound and smell separately, but rather perceives a scene based on the fusion of all the modalities available at a particular instant. This has prompted researchers in the computer vision community to make use of the rich multimedia information (especially audio) in a video sequence for video surveillance. Several studies on fusing audio-video data for object detection and tracking have been reported in the literature. In the following section we will take a look at human perception, which formed the basis for the use of cross-modal fusion in artificial systems. Then we go on to look at some literature that uses cross-modal fusion in artificial systems.

PAGE 11

1.1 Cross Modality in Human Perception

Bregman in [2] described a number of Gestalt principles for auditory scene analysis in which he stressed the resemblance between audition and vision, because principles of perceptual organisation such as similarity, good continuation and common fate seem to play a similar role in the two modalities. This similarity led researchers to believe that the human perceptual system might utilize information from all available modalities, thus giving rise to interest in the cross-modal domain. There are several examples of cross modal influences in which one may assume that the perceptual system indeed contributes information from different sensory modalities to a single event. There is quite a bit of literature showing that arbitrary combinations of modalities heighten perceptual awareness and lower reaction time compared with just the unimodal features. Cross-modal combinations of features not only enhance perceptual processing but can also change the percept. The prime example is the McGurk effect [3], in which speech information from sound and vision is presented. When listeners hear "baba" and at the same time watch a speaker articulating "gaga", they tend to combine the information from the two sources into "dada". Cross modal interactions have also been observed in the spatial domain. For example, synchronized sounds and light flashes at different spatial locations tend to be localized closer together (the ventriloquist effect). The common finding is that there is a substantial effect of the light flashes on the location of the sound [4]. In a series of experiments Vroomen and De Gelder explored something they called the "freezing effect" [5]. It is an illusion that occurs when an abrupt sound is presented during a rapidly changing visual display. In this illusion the participant feels as if the sound is pinning the visual stimulus for a short moment so that the visual display "freezes", helping the participant to identify the visual. Participants saw a 4 x 4 matrix of flickering dots that was created by rapidly presenting four different displays alternated by a mask (Figure 1.1). Each visual contained four dots in random positions. One of the four displays consisted of four dots that made up a diamond, and it could appear in any four corners of the matrix. The task of the participants was to detect the position of the diamond. One set of experiments involved only low tones being played along with all the visuals; in another experiment a high

PAGE 12

Figure 1.1 Simplified Representation of a Stimulus Sequence that Exhibits "Freezing" Effect [5]

tone was played along with the target visual. The average number of correct responses was 55% when only low tones were played and 66% when a high tone was played along with the target. In another experiment, psychophysical and functional magnetic resonance imaging (fMRI) data was acquired while subjects processed a given situation. A visual illusion which produces the motion aftereffect (MAE) was used to assess the effects of visual and auditory attention on motion processing in the human area MT+ (dorsal stream visual area). The MAE occurs after a period of motion adaptation, in which subjects view a unidirectionally moving stimulus. When subjects subsequently view a stationary stimulus, it appears to move in the opposite direction for a short period of time [6]. Subjects performed three scans in a single imaging session [7]:

PAGE 13

Figure 1.2 Simplified Representation of a Stimulus to Measure Motion After Effect (MAE) [6]

Localizer scan
Fixation-only scan
Attentional MAE scan

The localizer scan (Figure 1.2a) was used to identify area MT+ in each subject. In this scan subjects fixated a central crosshair and passively viewed reversing stimuli alternated with stationary stimuli. This stimulus, however, does not induce any MAE. The fixation-only scan (Figure 1.2b) was designed to measure MT+ activity and the MAE in the absence of attentional demands. Blocks of moving and stationary stimuli were alternated. In half of the moving blocks, the stimulus moved only in the outward direction so as to induce a motion aftereffect. The attentional MAE scan (Figure 1.2c) was designed to determine whether attention to a central visual or auditory stimulus during motion adaptation has an impact on the signal in area MT+ and the MAE. In all of the moving blocks, the stimulus moved only in the

PAGE 14

Figure 1.3 Attentional Modulation of MAE Perception [6]

outward direction. Subjects performed either a visual or an auditory task during the moving block. In the visual task, subjects saw a series of letters in random order, which replaced the fixation crosshair. In the auditory task, subjects heard a series of letters presented at the same rate as the moving blocks. For both visual and auditory tasks, subjects were instructed to press a button whenever they saw or heard a vowel. Also, functional and structural MR data were obtained using a 1.5 T GE magnet and a custom-designed head coil. The duration of the MAE after expanding motion, averaged across subjects, was 9.44 seconds (Figure 1.3). MAE duration was reduced when subjects attended to the central visual task during expanding motion, to an average of 7.54 seconds. Also, when subjects attended to an auditory task during expanding motion, the duration of the MAE was reduced to 6.67 seconds. Similarly, the signal in MT+ remained higher after blocks that induced a motion aftereffect than after blocks when no aftereffect occurred. This is shown by the markedly prolonged decay time for the expanding motion condition as compared to the reversing motion condition (Figure 1.4). The decay time following blocks of expanding motion was reduced significantly when subjects attended to the central visual task during adaptation. It was also found that auditory attention alters the decay time in area MT+. These results demonstrate that auditory attention can influence motion processing in an early visual area.

PAGE 15

Figure 1.4 Attentional Modulation of fMRI Decay Time in Area MT+ [6]

1.2 Cross Modality in Artificial Perception

A greater understanding of human perception has led researchers to use cross modality in numerous projects for increased accuracy and reliability. The use of cross modality has already appeared in localization in video, surveillance and multimedia indexing. Lo and Goubran proposed a new method for performing audio-video talker localization [8] that explores the reliability of the individual localization estimates such as audio, motion detection, and skin color detection. Audio signals are captured by a six element microphone array. The talker's direction is obtained through a beamforming algorithm. Audio data is digitized and localization is performed using the delay-and-sum method [9]. In video, motion detection is done using background subtraction, and for skin color detection the luma-chroma space is used. The reliability information is estimated from the audio and video separately. The reliability information, in conjunction with a simple summing voter, is used to fuse the localization results. Based on the voter output, a majority rule is then used to make the final decision on the active talker's current location. The results show that adding the reliability information during fusion improves localization performance when compared to audio only, motion detection only, or skin color detection only. Lately, surveillance systems are using both audio and video sensors to reveal and track the presence of an intruder. The system described in [10] is composed of a mobile agent and several

PAGE 16

static agents cooperating in the tracking task. The mobile agent is a vision agent composed of an omni-directional vision system and a mobile robot. The static agents are acoustic agents composed of self-steerable microphone arrays. To detect the intruder the image is segmented into the moving foreground and the static background. An incremental background subtraction algorithm is used to account for a dynamic environment. In this technique the background image is not a static image, but is updated frame after frame, slowly incorporating changes in the scene. The acoustic agent receives the position of the intruder from the static agent, and a beamforming algorithm is used to direct the microphone array toward the acoustic source. As a sequence of frames is obtained, the signal is reconstructed by applying the overlap-add method to the result of the IFFT block. The acoustic signal obtained is used to train an HMM (Hidden Markov Model) of the steps of the intruder. When the person moves, the learnt HMM can be used to distinguish one person moving in the environment from another. The acoustic agent is also able to calculate the position of the intruder with respect to itself. The localisation algorithm is implemented using a neural network based algorithm. The measurements of the position of the intruder coming from the static acoustic agents are fused using a modified Kalman filter. In [11] a content based video parsing and indexing method is presented which analyzes both information sources (audio and video) and accounts for their inter-relations and synergy to extract high-level semantic information. Low-level audio analysis involves the tasks of speech-silence discrimination, linear predictive (LP) analysis and derivation of LP cepstral coefficients of the speech-voiced frames. The visual input is the entire video sequence, on which shot boundary (scene-cut) detection and video segmentation are performed. The first frame of each detected shot is subsequently processed by a face detection algorithm followed by a facial feature extraction module aiming at extracting the mouth location. A mouth template is thus obtained and used for mouth tracking in the remaining frames of the face shot. Multimodal interaction can serve to enhance the content findings and offer a more detailed content description about the same video instances. For example, since speaker recognition performed on the audio source is prone to recognition errors, visual information can be used to detect the presence of talking people, and the speaker that exhibits maximum presence likelihood is the winner. The interaction of audio and visual semantic labels such as speech, silence, speaker

PAGE 17

identity, face presence and face absence supplies the user with more detailed information and a degree of underlying context. Speaker localisation using audio-video cues at the signal level has been explored in [12]. It uses the correlation between the visual motion of the mouth and the corresponding audio data generated when a person speaks. A time delayed neural network is used to learn the audio-video correlation, which is then used to do a spatio-temporal search for a speaking person. Kidron, Schechner and Elad present a method which detects pixels that are associated to a sound while filtering out other dynamic pixels [13]. They use canonical correlation analysis (CCA) by exploiting the typical spatial sparsity of audio-visual events.

1.3 Contribution of this Thesis

The main contributions of this thesis consist of:

Separating more than one concurrent audio and video event using a feature based approach with a single audio and a single video sensor.
Usage of higher level primitives and grouping, making audio-video association more robust.
Use of a cluttered environment where more than one object exists, and association of sound to the corresponding object at a particular instant.
Use of auditory scene analysis (ASA) rather than the usual signal processing approaches to process sound.

1.4 Layout of Thesis

This chapter provided motivation and background for this work. The next chapter gives an introduction to human perception of sound. The third chapter describes our approach for the audio-video association. The fourth chapter consists of the results. The final chapter concludes this work and lists the scope for future work.

PAGE 18

CHAPTER 2
HUMAN PERCEPTION OF SOUND

Humans use their sense of hearing to understand the properties of sound-producing events. Often, we are interested in a single stream of events, such as a violin playing, a person talking, or a car approaching. In a natural listening environment, however, the acoustic energy produced by each event sequence is mixed, at the listener's ears, with energy arising from other concurrent events. "Auditory scene analysis" (ASA) is a process in which the auditory system takes the mixture of sound that it derives from a complex natural environment and sorts it into packages of acoustic evidence in which each package has probably arisen from a single source of sound [2]. The performance of the human auditory system with regard to the ability to analyze the auditory environment, the localization of sound sources and the perception of speech is striking. Even in complex acoustical scenarios or under adverse acoustical conditions the human system can separate and recognise auditory events very robustly. ASA provides answers to how the brain can build separate perceptual descriptions of sound-generating events despite the mixing of evidence. The first thing it does is to analyze the incoming array of sound into a large number of frequency components. Then, by putting together the right set of frequency components over time, a signal is recognized. From the point of view of theoretical signal processing it is striking that the auditory system can achieve all this with just two receivers, the left and the right ear. As a consequence, these properties of the human auditory system have been a motivation to simulate it by means of computer models.

2.1 Cues Used by ASA Process

The perceptual segregation of sounds in a sequence depends upon differences in their frequencies, pitches, timbres, amplitudes, and locations, and upon sudden changes of

PAGE 19

these variables. Segregation also increases as the duration of silence between sounds in the same frequency range gets longer. The perceptual fusion of simultaneous components to form single perceived sounds depends on their onset and offset, frequency separation, regularity of spectral spacing, harmonic relations and parallel amplitude modulation. Different cues for stream segregation compete to control the grouping, and different cues have different strengths. Primitive grouping occurs even when the frequency and timing of the sequence is unpredictable. An increased biasing toward stream segregation builds up with longer exposure to sounds in the same frequency region. Stream segregation is context-dependent, involving the competition of alternative organizations.

2.2 Effects of ASA on Perception

A change in perceptual grouping can alter the perception of rhythms, melodic patterns, and overlap of sounds. Patterns of sounds whose members are distributed into more than one perceptual stream are much harder to perceive than those wholly contained within a single stream. Perceptual organization can affect perceived loudness and spatial location. The rules of ASA try to prevent the crossing of streams in frequency, whether the acoustic material is a sequence of discrete tones or continuously gliding tones. Known principles of ASA can predict the camouflage of melodies and rhythms when interfering sounds are interspersed or mixed with a to-be-recognized sequence of sounds. The apparent continuity of sounds through masking noise depends on ASA principles. Stimuli include frequency glides, amplitude-varying tones, and narrow-band noises.

PAGE 20

Figure 2.1 Grouping by Gestalt Principles (a) Grouping by Proximity (b) Grouping by Similarity [2]

A perceptual stream can alter another one by capturing some of its elements. The apparent spatial position of a sound can be altered if some of its energy becomes grouped with other sounds. The segregation of streams of visual apparent motion works in exactly the same way as auditory stream segregation.

2.3 Similarity in Audio and Vision

Bregman [2] showed the similarities between the Gestalt principles in vision and audition. Just as grouping by proximity operates in visual space, it also operates in auditory pitch. McPherson, Ciocca and Bregman [14] have shown that good continuation operates in audition in an analogous way to vision. The concept of amodal completion as it is used in vision [15] has been given a number of different names in audition: the acoustic tunnel effect [16], perceptual restoration [17] and the continuity effect [2]. Since all these phenomena abide by the same laws of grouping and organization, a framework that accounts for them needs to be used.

2.4 Theory of Indispensable Attributes

A perceptual object is that which is susceptible to figure-ground segregation. Early processing produces elements that require grouping. Grouping occurs following the principles described by the Gestalt psychologists (Figure 2.1); it produces Gestalts or perceptual organizations, which are also putative perceptual objects. Attention selects one putative object to become figure (Figure 2.2) and relegates all other information to ground [18]. There is little doubt

PAGE 21

Figure 2.2 An Illusion Which is Seen as Two Faces or a Vase Depending on Our Interpretation of the Background

Figure 2.3 Theory of Indispensable Attributes in Vision (a) Two Yellow Spots at Same Spot, Observer Sees One Yellow Spot (b) One Yellow and One Blue Create One White Spot (c) Two Lights Create Two Separate Spots, Regardless of Color Observer Sees Two Spots [18]

that grouping and figure-ground segregation describe processes that are meaningful for auditory perception. Grouping is a well-established auditory phenomenon. Another phenomenon that characterizes visual objects is the formation and assignment of edges. However, an edge in audio seems to be an abstract concept. Kubovy and Van Valkenburg [18] develop the theory of indispensable attributes (TIA), which states that, in vision, objects are formed in the space-time domain, whereas auditory objects are formed in the pitch-time domain. Spatial separation is an indispensable attribute for vision. Imagine presenting to an observer two spots of light on a surface (Figure 2.3a). Both of them are yellow and they coincide; the observer will report one

PAGE 22

Figure 2.4 Theory of Indispensable Attributes in Audio (a) One Speaker Plays Two Ds, Listener Hears One D Sound (b) One Speaker Plays D While Another Plays a D, Listener Hears One Sound (c) D and F are Played Over One Speaker, Listener Hears Two Sounds [18]

light. Now suppose we change the color of the lights, so that one spot is blue and the other is yellow, but they still coincide (Figure 2.3b); the observer will report one white light. For the observer to see more than one light, they must occupy different spatial locations (Figure 2.3c). Pitch separation is an indispensable attribute for sound. Imagine simultaneously playing two 440 Hz sounds for a listener (Figure 2.4a). Both of them are played over the same loudspeaker; the listener will report hearing one sound. Now suppose we play these two sounds over two loudspeakers; the listener will still report hearing one sound (Figure 2.4b). For the listener to report more than one sound, they must be separated in frequency. Thus, pitch separation is an indispensable attribute for audio perception. By an analogous argument, time is an indispensable attribute for both vision and audition. The TIA thus forms a heuristic tool for extending theories of visual perception into the domain of auditory perception.

PAGE 23

CHAPTER 3
AUDIO VIDEO ASSOCIATION

In this chapter we explain the various audio and video algorithms used in this thesis. An overview of the technique is shown in Figure 3.1.

3.1 Grouping Sound Events

Traditionally there have been several approaches to audio signal processing. The signal processing approaches to audio analysis are dominated by linear prediction and cepstrum analysis. Linear prediction is used for speech coding, and cepstral features are the heart of speech recognition. Another popular technique, Independent Component Analysis (ICA), tries to make the extracted sources as statistically independent as possible. However, ICA is a general purpose signal processing technique and requires as many mixtures as the number of sources. Also, it needs auditory-specific constraints to be incorporated. Auditory scene analysis (ASA) forms the next generation of audio processing. An overview of the steps used in this thesis to process audio is shown in Figure 3.2.

3.1.1 Spectrogram Representation

As explained in the previous chapter, to process audio using ASA techniques we need a framework in the pitch-time domain. In this thesis we use the spectrogram to represent audio. Spectrograms are usually calculated from the time signal using the Discrete-Time Fourier Transform (DTFT):

X(\omega) = \sum_{n} x(n)\, e^{-j\omega n}

PAGE 24

Figure 3.1 Overview of Association of Sound to Object in Video

If x(n) is time-limited to a duration of N samples, then we can sample the continuous function X(\omega) at N uniformly spaced points. This corresponds to forming a periodic signal x'(n) of infinite duration and period N by concatenating the length-N sequences x(n), and computing its Fourier series expansion. This version of the Fourier transform is written as:

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\omega_k n}, \qquad \omega_k = \frac{2\pi k}{N}

This expression is the Discrete Fourier Transform, or DFT, of the length-N discrete time sequence x(n). Here, k is the DFT bin number, and \omega_k is the discrete frequency of bin k. The spectrogram uses a slightly different form of the DFT called the short time Fourier transform (STFT). The STFT is a formulation that can represent sequences of any length by breaking them into shorter blocks, or frames, and applying the DFT to each block. Digitally sampled data, in the time domain, is broken up into frames, which usually overlap, and Fourier

PAGE 25

Figure 3.2 Computational Model of Grouping Sound Events

transformed to calculate the magnitude of the frequency spectrum for each frame. Each frame then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. The spectrums, or time plots, are then 'laid side by side' to form the image. The horizontal axis represents time, the vertical axis is frequency, and the intensity of each point in the image represents the amplitude of a particular frequency at a particular time.

\mathrm{spectrogram}(t, \omega) = |\mathrm{STFT}(t, \omega)|^{2}

There are several issues involved in choosing a good value for N. First, in order to take advantage of the computational efficiency of the FFT algorithm, we want to take N to be a power of two. Secondly, the visual display produced by the analysis will be represented by N samples of the DTFT. The larger we make N, the closer the DFT will approximate the smooth

PAGE 26

Figure 3.3 Spectrograms (Top Left) Footsteps and Moving Car (Top Right) Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps

function represented by the DTFT. A value that is too small, while not throwing away any information, will produce a very coarse visual display that may lead to misinterpretation of the data. Some scaled spectrograms from outdoor scenes are shown in Figure 3.3. The intensity values in the spectrogram are represented by 256 gray scale values. Most of the pixels in the spectrogram image are produced by noise; only the very high gray scale values in the spectrogram image correspond to any significant events in the audio. To filter out the noise we threshold the image with a relatively high value (Figure 3.4).
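As a concrete illustration of this spectrogram-and-threshold step, the short Python sketch below builds a 256-level gray scale spectrogram image and keeps only its strongest responses. It is not the thesis implementation: the file name, frame length, overlap and threshold value are illustrative assumptions.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono audio clip (the file name is hypothetical).
fs, x = wavfile.read("scene.wav")
if x.ndim > 1:
    x = x[:, 0]                        # keep a single channel

# Short-time Fourier analysis with N = 1024 samples per frame (a power of two)
# and 50% overlap; both values are illustrative choices, not the thesis settings.
N = 1024
f, t, Sxx = spectrogram(x, fs=fs, nperseg=N, noverlap=N // 2)

# Map the magnitudes to 256 gray scale values, as in the thesis.
S_db = 10 * np.log10(Sxx + 1e-12)
img = np.uint8(255 * (S_db - S_db.min()) / (S_db.max() - S_db.min()))

# Keep only the very high values; the exact threshold used in the thesis is not stated.
binary = (img > 200).astype(np.uint8)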

PAGE 27

Figure 3.4 Thresholded Spectrograms (Top Left) Footsteps and Moving Car (Top Right) Sound of Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps

3.1.2 Line Detection

A common observation from inspection of several spectrograms is that significant audio events such as a walking person, a bouncing ball, a car horn, etc. produce straight lines in the spectrogram. In this thesis Steger's line detection [19] has been used to extract these straight lines. Steger's line detection proceeds in four stages. First, isolated line points are extracted. These points are characterized as having a zero crossing in the first directional derivative in the direction where the second directional derivative attains its maximum absolute value. Therefore, the algorithm needs to calculate the partial derivatives of the image. This is done by convolving it with the partial derivatives of a Gaussian smoothing kernel with scale σ. The result of this step is individual line points and their subpixel locations, as well as the directions

PAGE 28

Figure 3.5 Spectrogram Images with Detected Lines (Top Left) Footsteps and Moving Car (Top Right) Chirping Bird and Car Horn (Bottom Left) Car Horn (Bottom Right) Bouncing Ball and Footsteps

perpendicular to the line. In the second stage of the algorithm the individual line points are linked into lines. The algorithm starts at points that have a second directional derivative larger than a chosen high value, and follows lines until the second directional derivative is smaller than a chosen low value. This procedure is known as a hysteresis threshold operation. The third stage of the algorithm extracts the width of the line for each line point. The line width is extracted by searching for points that have a maximum in the absolute value of the gradient in the direction given by the normal to the line. Only pixels lying on a search line normal to the current line point are considered. The length of the search line is 2.5σ on each side of the line. The final stage of the algorithm is to correct the line position and width. Some spectrogram images after line detection are shown in Figure 3.5.
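As a rough sketch of only the first of these stages, the code below marks candidate line points from the eigenvalues of a Gaussian-smoothed Hessian, which is the core idea behind this kind of ridge detector; the linking, width estimation and hysteresis stages are omitted, and the scale and threshold are arbitrary assumptions rather than the values used in the thesis.

import numpy as np
from scipy.ndimage import gaussian_filter

def line_points(img, sigma=1.5, rel_thresh=0.5):
    """Mark pixels whose second directional derivative is strong (candidate line points)."""
    img = img.astype(float)
    # Second derivatives via Gaussian derivative filters (axis 0 = rows, axis 1 = columns).
    drr = gaussian_filter(img, sigma, order=(2, 0))
    dcc = gaussian_filter(img, sigma, order=(0, 2))
    drc = gaussian_filter(img, sigma, order=(1, 1))

    # Hessian eigenvalue with the largest magnitude at each pixel.
    mean = (drr + dcc) / 2.0
    root = np.sqrt(((drr - dcc) / 2.0) ** 2 + drc ** 2)
    lam1, lam2 = mean + root, mean - root
    strongest = np.where(np.abs(lam1) > np.abs(lam2), lam1, lam2)

    # Bright lines on a dark background give a large-magnitude second derivative
    # across the line; keep only the strongest fraction of responses.
    return np.abs(strongest) > rel_thresh * np.abs(strongest).max()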

PAGE 29

Table 3.1 Orientation Groups

Group    Orientation
1        0-30 and 150-180
2        30-60
3        60-120
4        120-150

3.1.3 Grouping

Significant audio events generally produce parallel lines in the spectrogram. Now our task is to group the lines which belong to a specific event. Once we obtain the spectrogram image with the lines representing only the significant audio events, the orientation and centroid of each line are calculated. The orientation is the angle between the x-axis and the major axis of the ellipse that has the same second moments as the line. The centroid is the center of mass of the region. The lines are then grouped based on orientation values as shown in Table 3.1. Though this step will group all lines with similar orientation, lines with different pitch information may also be grouped together. To avoid this we further group the lines in each orientation group based on the centroid. The coordinates of the centroids of the lines are clustered using a simple k-means algorithm. This location based grouping extracts lines that are close together and along the same coordinate axis, signifying lines belonging to the same auditory event. This process, however, still results in groups which are formed due to low frequency noise in the spectrogram. These groups are ignored during the association stage, where they do not find any video object to associate themselves to. Some of the audio groups are shown in Figure 3.6.

3.1.4 Property Extraction

Once we isolate the auditory events from the spectrogram it is easy to estimate their periodicity. First we find the column-wise summation of each auditory group as shown in Figure 3.7. The peaks in the auditory signal can be obtained by finding the local maxima in the column summation array of the audio group. By finding the corresponding column of each local maximum we can find the column difference between peaks by subtracting consecutive peak column numbers. This can be converted to time by using:

PAGE 30

Time Between Peaks = (Column Difference / Total Number of Columns) × Total Time

The average of the time differences between the local maxima gives the average periodicity of the auditory object.

3.2 Grouping Video Objects

Moving objects can be detected by background subtraction. Once we subtract the background frame from each frame, we threshold the image to get a binary version with just the objects in motion. Video objects exhibit two types of periodicities: shape change periodicity or track periodicity. The periodic behavior of objects such as humans exhibits shape change periodicity and can be estimated by measuring the change in the number of pixels in the bounding box in each frame. This is based on the fact that in a 2D image the number of pixels corresponding to the human decreases as both legs come together and increases as the step is completed (Figure 3.8). The periodic behavior of objects such as a bouncing ball can be calculated by tracking the centroid of the bounding box in each frame (Figure 3.9). Once we have the periodic curves we can find the periodicity by applying periodicity transforms [20]. The Periodicity Transform searches for the best periodic characterization of the length-N signal x. The underlying technique is to project x onto some periodic subspace S_p. This periodicity is then removed from x, leaving the residual r stripped of its p-periodicities. Both the projection and the residual r may contain other periodicities, and so may be decomposed into other q-periodic components by further projection onto S_q. This decomposition is accomplished directly in terms of periodic sequences and not in terms of frequency or scale, as done by the Fourier and Wavelet Transforms. Unlike most transforms, the set of basis vectors is not specified a priori; rather, the Periodicity Transform finds its own best set of basis elements. Though we estimate both periodicities for each object, only one type of periodicity will be relevant to an object. For example, the human might exhibit periodic motion by monitoring the change in pixels in a bounding box, but tracking the centroid will reveal only

PAGE 31

a straight line. The periodicity transform will estimate the periodicity as zero and hence we can ignore it. The periodicity estimate in terms of number of frames can be converted to time by:

Video Periodicity = (Number of Frames / Total Number of Frames) × Total Time

3.3 Association

Audio and video data are highly correlated; for example, the sound of a ball bounce will be accompanied by an event in the video where the ball comes in contact with the floor (Frames 21, 52, 83, 116 in Figure 3.10). Similarly, the sound of a footstep will be accompanied by the foot of a person hitting the floor (Frames 36, 50, 63, 91, 106, 118, 131 in Figure 3.10). We simply need to keep track of when we hear the sound, and in video we need to keep track of when the ball or the foot comes in contact with the floor. Essentially, the periodicities of objects in both audio and video should be similar. Video will have exactly the same number of periodicities as the number of objects in the scene. However, in audio, due to noise, our procedure might pick up false periodicities. To eliminate these we simply do a percentage difference check between the video and audio periodicities. The lowest differences give us the video object and audio object which are most similar.

Difference Percentage = |Audio Periodicity − Video Periodicity| / Audio Periodicity %
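A minimal sketch of this association index, assuming the audio and video periodicities have already been estimated; the function name and the per-video-object matching are illustrative (the thesis simply keeps the smallest difference percentages, which gives the same pairing in the example below).

def associate(audio_periods, video_periods):
    """Match each video object to the audio object with the closest periodicity."""
    matches = {}
    for v, vp in enumerate(video_periods):
        if vp == 0:                    # no periodicity found, e.g. the moving car in Scenario 2
            matches[v] = None
            continue
        diffs = [abs(ap - vp) / ap for ap in audio_periods]
        matches[v] = min(range(len(diffs)), key=diffs.__getitem__)
    return matches

# Scenario 1 periodicities from Table 4.1: audio 1.06, 0.8, 0.46 s; video 1.04, 0.52 s.
print(associate([1.06, 0.8, 0.46], [1.04, 0.52]))   # {0: 0, 1: 2}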

PAGE 32

Figure 3.6 Isolated Auditory Events (Top Left) Bird Chirping (Top Right) Footsteps (Middle Left) Ball Bounce (Middle Right) Footsteps (Bottom Left) Car Horn (Bottom Right) Truck Horn
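The grouping step of Section 3.1.3, which turns detected lines into auditory events like those in Figure 3.6, can be sketched as follows. The line list, the way Table 3.1 is encoded and the choice of k for the k-means step are all illustrative assumptions; the thesis does not state how k is chosen.

import numpy as np
from scipy.cluster.vq import kmeans2

def orientation_bin(theta_deg):
    """Orientation groups of Table 3.1 (group 1 wraps around 0/180 degrees)."""
    if theta_deg < 30 or theta_deg >= 150:
        return 1
    if theta_deg < 60:
        return 2
    if theta_deg < 120:
        return 3
    return 4

def group_lines(lines, k=2):
    """lines: list of (orientation_deg, centroid_row, centroid_col) for each detected line."""
    by_orientation = {}
    for theta, r, c in lines:
        by_orientation.setdefault(orientation_bin(theta), []).append((r, c))

    auditory_events = []
    for cents in by_orientation.values():
        pts = np.array(cents, dtype=float)
        kk = min(k, len(pts))
        _, labels = kmeans2(pts, kk, minit="points", seed=0)
        for lbl in range(kk):
            auditory_events.append(pts[labels == lbl])   # one cluster = one candidate event
    return auditory_events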

PAGE 33

Figure 3.7 Column Wise Summation of an Auditory Group

Figure 3.8 A Walking Person Exhibits Shape Change Periodicity
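Given one isolated auditory group (a binary image containing only that group's lines), its average periodicity follows from the column-wise summation of Figure 3.7 as described in Section 3.1.4. A sketch, using scipy's find_peaks for the local maxima as an assumption:

import numpy as np
from scipy.signal import find_peaks

def audio_periodicity(group_img, total_time):
    """Average period, in seconds, of one auditory group."""
    col_sum = group_img.sum(axis=0)          # column-wise summation (Figure 3.7)
    peaks, _ = find_peaks(col_sum)           # local maxima = individual sound events
    if len(peaks) < 2:
        return 0.0
    col_diff = np.diff(peaks)                # columns between consecutive peaks
    seconds_per_column = total_time / group_img.shape[1]
    return float(col_diff.mean() * seconds_per_column)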

PAGE 34

Figure 3.9 A Bouncing Ball Exhibits Track Change Periodicity

Figure 3.10 (Top) Spectrogram of the Sound Waveform Received. (Bottom) Corresponding Frames for Each Event in the Spectrogram
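The video-side measurements of Section 3.2 can be sketched in the same spirit. The period estimate below uses a plain autocorrelation peak as a stand-in for the Periodicity Transform of [20], and it assumes a single moving object per frame; the thesis tracks each object's bounding box separately, and the difference threshold is an arbitrary assumption.

import numpy as np

def shape_and_track_curves(frames, background, diff_thresh=30):
    """Per-frame pixel count (shape change) and centroid row (track) of the moving region."""
    pixel_counts, centroid_rows = [], []
    for f in frames:
        mask = np.abs(f.astype(int) - background.astype(int)) > diff_thresh
        rows, _ = np.nonzero(mask)
        pixel_counts.append(int(mask.sum()))                     # curve as in Figure 3.8
        centroid_rows.append(rows.mean() if rows.size else 0.0)  # curve as in Figure 3.9
    return np.array(pixel_counts), np.array(centroid_rows)

def video_periodicity(curve, fps):
    """Rough period in seconds from the first autocorrelation peak (zero if none is found)."""
    x = curve - curve.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    peaks = [k for k in range(1, len(ac) - 1) if ac[k] > ac[k - 1] and ac[k] > ac[k + 1]]
    return peaks[0] / fps if peaks else 0.0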

PAGE 35

CHAPTER 4
RESULTS

In this chapter we consider certain scenarios consisting of objects that exhibit periodic motion. The audio and video results of each scenario are shown in this thesis as well as on the CD provided with this report.

4.1 Scenario 1: Person Bouncing Ball and Walking Person

First, we consider a small video clip showing a person bouncing a ball and a person walking (CLIP 1 on CD). The corresponding spectrogram of the audio signal, the thresholded spectrogram and the detected lines are generated as described in the previous chapter (Figure 4.1). The auditory groups formed are shown in Figure 4.2. At this point we get three objects, though in reality there are only two sound producing objects; the third auditory object is formed due to noise. Using local maxima to detect peaks in the audio signal, the periodicities of the audio groups shown are estimated to be 1.06, 0.8 and 0.46 seconds respectively. Some of the video frames are shown in the top section of Figure 4.3. Motion segmentation gives us two objects. Each object gives us either shape periodicity or track periodicity. How-

Figure 4.1 (Left) Spectrogram (Middle) Thresholded Spectrogram (Right) Detected Lines

PAGE 36

Figure 4.2 Audio Groups

Table 4.1 Difference Percentages: Video Object 1 = Bouncing Ball, Video Object 2 = Footsteps

Audio Object Number (Periodicity)    Video Object Number (Periodicity)    Difference %
1 (1.06 seconds)                     1 (1.04 seconds)                     0.02%
2 (0.8 seconds)                      1 (1.04 seconds)                     0.30%
3 (0.46 seconds)                     1 (1.04 seconds)                     1.26%
1 (1.06 seconds)                     2 (0.52 seconds)                     0.51%
2 (0.8 seconds)                      2 (0.52 seconds)                     0.35%
3 (0.46 seconds)                     2 (0.52 seconds)                     0.13%

ever, the type of periodicity is irrelevant for association. The relevant periodicity curves are in the bottom section of Figure 4.3. Once we apply the periodicity transform we get periodicities of 1.04 seconds and 0.52 seconds respectively. Once we obtain the auditory and visual periodicities, we calculate the difference percentages (Table 4.1). Since we have two video objects, the smallest two percentages form the association. Hence, from the table, the first association is formed between visual object 1 and audio object 1, and the second association is formed between visual object 2 and audio object 3. Visually, the bouncing ball video object is associated with audio object three and the walking human is associated with audio object one, as shown in Fig. 4.4. Spectrogram images of the isolated auditory events can also be used to regenerate the corresponding audio signal. By retaining, in the original spectrogram matrix (obtained by applying the Fourier transform to the original waveform), only the columns in which at least one pixel is equal to one, and making all others zero, we can then apply a frame by frame inverse Fourier transform and append the signals to obtain the corresponding sound [21]. We can

PAGE 37

Figure 4.3 (Top) Video Frames (Middle) Thresholded Frames (Bottom) Periodicity Curves

regenerate the sound corresponding to the relevant auditory object. CLIP 2 and CLIP 3 on the CD exhibit the audio-video association. Follow the yellow box in each clip and hear the sound which has been associated to it.

4.2 Scenario 2: Walking Person and Moving Car

Now, we consider another scenario in which we have a walking person and a car passing by in the background (CLIP 4 on CD). The corresponding spectrogram of the audio signal and some of the auditory groups formed are shown in Fig. 4.5. We expect to find one audio object for the footsteps and another for the sound of the car. However, the sound of a car creates a noise-like signature in the spectrogram, and hence the line detection cannot pick up the sound of the car. In this case, one audio object corresponds to the person's footsteps and the other is formed due to noise. The periodicities of the audio groups shown are 0.53 and 1.54 seconds respectively. Some of the video frames are shown in the top section of Fig. 4.6. The periodicity curves

PAGE 38

Figure 4.4 Association of Sound. Top Row Shows the Lines Detected from the Spectrogram of the Sound Wave from Which We Obtain Audio Objects Shown in the Second Row. Third Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row

are in the bottom section of Fig. 4.6. The walking person exhibits shape periodicity; however, the moving car does not exhibit any periodicity. Once we apply the periodicity transform we get periodicities of 0.55 and zero seconds respectively. Association relates audio object one to the walking person (CLIP 5 on CD), but no audio object is associated to the moving car. We indicate this in Fig. 4.7 by not showing any bi-directional arrow for the second video object.

4.3 Scenario 3: Person Walking and Another Person Running

Now, we consider a scenario in which we have a walking person and another person running (CLIP 6 on CD). The corresponding spectrogram of the audio signal, the detected lines and

PAGE 39

Figure 4.5 Spectrogram and the Corresponding Audio Groups

some of the auditory groups formed are shown in the top three rows of Fig. 4.8. The periodicities of the audio groups shown are 0.23, 0.5 and 0.4 seconds respectively. Some of the video frames are shown in the bottom row of Fig. 4.8. The periodicity curves are in the fourth row of Fig. 4.8. Both video objects exhibit shape periodicities. Once we apply the periodicity transform we get periodicities of 0.52 and 0.41 seconds respectively. The first audio object is formed due to noise. Association relates the second audio object to the running person (CLIP 7 on CD). The third corresponds to the footsteps of the walking person (CLIP 8 on CD).

4.4 Scenario 4: Person Walking and Bouncing Ball

Now, we consider a scenario in which we have a walking person and a bouncing ball (CLIP 9 on CD). The corresponding spectrogram of the audio signal, the detected lines and some of the auditory groups formed are shown in the top three rows of Fig. 4.9. The periodicities of the audio groups shown are 0.38 and 0.73 seconds respectively. Some of the video frames are shown in the bottom row of Fig. 4.9. The periodicity curves are in the fourth row of Fig. 4.9. Once we apply the periodicity transform we get periodicities of 0.4 and 0.7 seconds respectively. A percentage difference check associates the first audio object with the footsteps of the walking person (CLIP 11 on CD) and the second with the bouncing ball (CLIP 10 on CD).


Figure 4.6 (Top) Video Frames (Bottom) Periodicity Curves

4.5 Scenario 5: Person Walking, Person Running and Bouncing Ball

Now, we consider another scenario in which we have a walking person, a running person and a bouncing ball (CLIP 12 on CD). The detected lines and some of the auditory groups formed are shown in the top two rows of Fig. 4.10. The periodicities of the audio groups shown are 0.61, 1.1 and 0.22 seconds respectively. Some of the video frames are shown in the bottom row of Fig. 4.10 and the periodicity curves are in the third row. Once we apply the periodicity transform we get periodicities of 0.6, 1.1 and 0.23 seconds respectively. The first corresponds to the footsteps of the running person (CLIP 13 on CD), the second to the bouncing ball (CLIP 14 on CD) and the third to the walking person (CLIP 15 on CD).
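For the video side, the thresholded frames and pixel-count curves shown in the figures can be reproduced in outline as below. This is only a sketch under simplifying assumptions: a static camera, a median-frame background in place of whatever background model the thesis uses, and bounding boxes for each tracked object supplied from elsewhere.

import numpy as np

def shape_curve(frames, box, thresh=30):
    # frames: (N, H, W) array of grayscale frames; box: (top, bottom, left, right)
    # for one tracked object.  Returns the foreground pixel count per frame,
    # i.e. the shape-periodicity curve handed to the period estimator.
    frames = np.asarray(frames, dtype=float)
    background = np.median(frames, axis=0)        # simple static background
    fg = np.abs(frames - background) > thresh     # thresholded frames
    top, bottom, left, right = box
    return fg[:, top:bottom, left:right].sum(axis=(1, 2))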


Figure 4.7 Top Row Shows the Spectrogram on the Left Side and Lines Detected on the Right Side. Two of the Audio Objects are Shown in the Third Row. Fourth Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row


Figure 4.8 Top Three Rows Show the Spectrogram, Detected Lines and the Audio Objects. Fourth Row Shows the Periodicity Curves of Objects in the Video. The Fifth Row Shows Thresholded Frames and the Last Row Shows Video Frames


Figure 4.9 Top Row Shows the Spectrogram and the Next Row Shows the Lines Detected. Two of the Audio Objects are Shown in the Third Row. Fourth Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row


Figure 4.10 Top Row Shows the Lines from the Spectrogram. Three of the Audio Objects are Shown in the Second Row. Third Row Shows the Periodicity Curves of Objects in the Video Extracted from Video Frames Shown in the Last Row


CHAPTER 5
CONCLUSIONS

In this thesis, we present a technique by which we can associate sound to motion in video. Specifically, we associate sound with objects that exhibit periodicity in shape change or in their motion tracks. Our method uses a feature-based approach that groups high-level primitives, thus avoiding the noise associated with using low-level features. This approach works well even in a cluttered environment and can associate more than one sound with the respective objects at the same instant.

Contrary to traditional signal processing techniques, we use auditory scene analysis techniques to process the audio. This gives us the separate audio events present in the sound signal, which are then grouped based on the periodicity of the audio signal. Unlike many well-known signal processing techniques that use multiple microphones to collect audio data, we rely on audio data received by a single microphone.

In an age when video surveillance has become of utmost importance, association of sound to video can help zoom in on interesting objects in the video. It can also help determine whether the sound produced is within the visual range; the camera can then pan to see if it can find any object of interest. Though this technique primarily uses periodicity as its cue to form the audio-video association, we believe the concurrent nature of audio-video data exhibits many more high-level primitives that can be used for association. Multi-sensor processing finds use in many areas, including video surveillance. Psychological evidence and analysis of biological systems have shown that audio-visual fusion enhances perception.

