
Mining associations using directed hypergraphs


Material Information

Title:
Mining associations using directed hypergraphs
Physical Description:
Book
Language:
English
Creator:
Simha, Ramanuja N
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Association Rules
Clustering
Discretization
Financial Time-series
Multi-valued Attributes
Similarity
Dissertations, Academic -- Computer Science -- Masters -- USF   ( lcsh )
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: This thesis proposes a novel directed-hypergraph-based model for any database. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature. The association rules for multi-valued attributes are integrated in building the directed hypergraph model. This model allows us to capture attribute-level associations and their strength. Based on this model, we provide association-based similarity notions between any two attributes and present a method for finding clusters of similar attributes. We then propose algorithms to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithms, and classifier through experiments on a financial time-series data set (S&P 500).
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2011.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Ramanuja N. Simha.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 67 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0004856
usfldc handle - e14.4856
System ID:
SFS0028122:00001




Full Text



Mining Associations Using Directed Hypergraphs

by

Ramanuja N. Simha

A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Rahul Tripathi, Ph.D.
Yicheng Tu, Ph.D.
Xiaoning Qian, Ph.D.

Date of Approval: March 9, 2011

Keywords: Association Rules, Multi-valued Attributes, Financial Time-series, Discretization, Clustering, Similarity, Dominator

Copyright © 2011, Ramanuja N. Simha

DEDICATION

to my parents, brother, and sister

ACKNOWLEDGEMENTS

This thesis has been written under the kind supervision of Dr. Rahul Tripathi, my thesis advisor. I am extremely grateful for his helpful guidance and feedback while reviewing the various versions of the thesis draft. His in-depth technical insights and comments have been immensely motivational to me in achieving a high quality of research.

I am extremely grateful to Dr. Mayur Thakur from Google Inc., for being a collaborator in this project and for providing me with ideas and in-depth technical insights during the discussions. We had a number of e-mail and phone conversations during which his comments have been highly motivational to me.

I wish to thank Dr. Rahul Tripathi for permitting me to include work done jointly with him. The work in Chapter 3 and Chapter 4 was jointly done with him.

My thesis committee members were Dr. Rahul Tripathi, Dr. Yicheng Tu, and Dr. Xiaoning Qian. I am extremely thankful to all of them for their helpful comments.

I have learnt immensely about achieving very high levels of academic quality, technical excellence, and aptitude to learn, by taking theory courses under Dr. Rahul Tripathi, and by working as a Teaching Assistant for theory courses and as a Research Assistant under Dr. Rahul Tripathi. I have been fortunate for having had such an opportunity. During this period, the experiences hugely contributed in shaping my research interests in algorithms and problem solving.

I wish to thank Srinivasan Ramkumar from Qualcomm/Riverbed Technology for providing support and motivation throughout my studies by understanding the need to conquer challenges for becoming successful in graduate research.

I would like to thank my parents, my brother, and my sister for giving me inspiration and support whenever I needed during my studies.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
ABSTRACT

CHAPTER 1 INTRODUCTION
    1.1 Background and Motivation
    1.2 Thesis Overview

CHAPTER 2 PRELIMINARIES
    2.1 Approximation Algorithms
        2.1.1 Set Cover
        2.1.2 Graph Dominating Set
        2.1.3 The t-clustering
    2.2 Directed Hypergraphs
    2.3 Data Mining
        2.3.1 Classification Rule Mining
        2.3.2 Clustering

CHAPTER 3 MODELING ASSOCIATIONS IN DATABASES USING DIRECTED HYPERGRAPHS
    3.1 Associations Between Multi-Valued Attributes
        3.1.1 Discretization
    3.2 Association Hypergraphs
        3.2.1 Constructing Association Hypergraphs
    3.3 Association-Based Similarity Between Multi-Valued Attributes
        3.3.1 In-Similarity and Out-Similarity
        3.3.2 Clusters of Similar Attributes

CHAPTER 4 COMPUTATIONAL PROBLEMS
    4.1 Leading Indicators
        4.1.1 An Adaptation of Graph Dominating Set Approximation Algorithm
        4.1.2 An Adaptation of Set Cover Approximation Algorithm
    4.2 Association-Based Classifier

CHAPTER 5 EXPERIMENTATION
    5.1 Association Hypergraph Modeling
        5.1.1 Discretization
        5.1.2 Choice of Parameters
    5.2 Association Characteristics of Financial Time-Series
    5.3 Association-Based Similarity
        5.3.1 Comparison with Euclidean Similarity
        5.3.2 Clusters of Financial Time-Series
    5.4 Leading Indicators of Financial Time-Series
    5.5 Association-Based Classifier
        5.5.1 Evaluation

CHAPTER 6 CONCLUSIONS AND FUTURE WORK

REFERENCES

ABOUT THE AUTHOR

LIST OF TABLES

Table 3.1 Patient database.
Table 3.2 Patient database after discretization.
Table 3.3 Gene database.
Table 3.4 Gene database after discretization.
Table 3.5 Personal interest database.
Table 3.6 Personal interest database after discretization.
Table 3.7 An example association table AT for the combination $(\{A_1, A_2\}, \{A_3\})$.
Table 5.1 The directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors and for each configuration choice are shown.
Table 5.2 The 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors and for each configuration choice are shown.
Table 5.3 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 5).
Table 5.4 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 6).

LIST OF FIGURES

Figure 5.1 Weighted degree distribution.
Figure 5.2 Euclidean similarity comparison.
Figure 5.3 Clusters of financial time-series for configuration C1.
Figure 5.4 Classification confidence distribution of the association-based classifier for in-sample and out-sample data for configuration C1.

LIST OF ALGORITHMS

1 A greedy algorithm for computing a set cover.
2 The t-clustering algorithm.
3 Perceptron learning rule.
4 The k-means algorithm.
5 A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of graph dominating set approximation.
6 A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of set cover approximation.
7 Enhancement 1.
8 Enhancement 2.
9 Association-based classifier.

ABSTRACT

This thesis proposes a novel directed-hypergraph-based model for any database. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature. The association rules for multi-valued attributes are integrated in building the directed hypergraph model. This model allows us to capture attribute-level associations and their strength. Based on this model, we provide association-based similarity notions between any two attributes and present a method for finding clusters of similar attributes. We then propose algorithms to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithms, and classifier through experiments on a financial time-series data set (S&P 500).

CHAPTER 1

INTRODUCTION

1.1 Background and Motivation

Data mining involves searching for interesting patterns and/or classifying data. Association rules help to discover interesting patterns by identifying implication relationships among attribute-value pairs present in the data. Similarly, classification rules help to classify data by predicting values of specific attributes in the data. In general, any association or classification rule consists of two components: the antecedent and the consequent. Association rules may have more than one attribute in their consequent, whereas classification rules always have a single attribute in their consequent. An example of an association rule is: "If a customer buys milk and diapers, then the customer also buys beer and eggs." Here, milk, diapers, beer, and eggs are attributes, and they each take value '1', where '1' denotes present and '0' denotes absent in the association rule. An example of a classification rule is: "If weather is sunny and humidity is low, then play is true." Here, weather, humidity, and play are attributes, and the values 'sunny' and 'low' help to predict the value 'true' for the attribute play. Attributes can be either categorical, e.g., days of a week, or quantitative, e.g., stock price. Agrawal et al. [AIS93] presented association rules that target identifying implication relationships among categorical attributes that take only 0/1 values. Such association rules are called boolean association rules. Srikant et al. [SA96] introduced the more general quantitative association rules that accommodate both categorical and quantitative attributes. Henceforward, we will refer to quantitative association rules as association rules.

Association rules find their use in identifying implication relationships among attributes in market-basket type data. This data type is transactional data where each observation in the database is a transaction consisting of items purchased by a customer. Much work [BAG99, SA95, AS94, SVA97, NLHP98] has been done in mining association rules that satisfy constraints such as minimum support (a measure of significance) and minimum confidence (a measure of predictive ability). Association rules can be used to solve the following problems: finding clusters of similar attributes, finding a small subset of attributes that influences a large section of other attributes, and finding classification rules to predict values of attributes. For instance, in market-basket type data, a practical application of association rules is to identify clusters of similar items based on the customer sales information. This helps to understand patterns in sales of items and to group items based on customer interests. Similarly, identifying a small subset of items that influence the sales of all other items in market-basket type data may help in recognizing major sales indicators. Also, using the classification rules, the purchase of particular items for customers could be predicted based on the prior purchases made by them.

In fact, applications of association rules go far beyond the realms of just the market-basket type domain. Some other domains where association rules have important applications are as follows: in medicine for identifying relationships among medical conditions and diseases [Ord06], in bioinformatics for identifying interrelationships among genes [CH03, CSCR+06], in social networks for identifying social relationships, and in finance for identifying prediction relationships among stocks. We provide below some examples from these domains that will help elucidate the applications of association rules. In each of these examples, 35% is the confidence and 5% is the support of the stated rule.

1. "35% of the times when the age of a patient is in the range 40-49 years and the cholesterol of the same patient is in the range 220-229 mg/dL, the patient's blood-pressure is in the range 130-139 mm Hg" and this rule occurs in 5% of the total observations. Here, age, cholesterol, and blood-pressure are attributes.

2. "35% of the times when gene 1 and gene 2 in a patient are underexpressed, gene 3 is overexpressed in the patient" and this rule occurs in 5% of the total observations. Here, gene 1, gene 2, and gene 3 are attributes.

3. "35% of the times when a person has high interest in reading and playing, the person has low interest in music" and this rule occurs in 5% of the total observations. Here, reading, playing, and music are attributes.

4. "35% of the times when the price of stock A and of stock B go up, the price of stock C goes down" and this rule occurs in 5% of the total observations. Here, stocks A, B, and C are attributes.

Our present work is motivated by the goal of building a model that inherently handles many-to-many relationships, thus enabling us to capture these relationships among attributes of a database more accurately. Such a model is expected to be suited for handling problems such as similarity and clustering because many of the relationships exhibited in the real world are not restricted to be one-to-one.

Knobbe et al. [KH06] proposed an approach to mine a small set of binary attributes that help differentiate observations in a database. Siebes et al. [SVL06] and Bringmann et al. [BZ07] proposed techniques for compressing a database. The compressed set of binary attributes could then be used in a data mining classifier to avoid overfitting. The above methods mainly attempt to identify informative sets of binary attributes, whereas our approach attempts to build a generic model for any database containing multi-valued attributes and address a variety of problems such as similarity, clustering, leading indicators, and classification.

In this thesis, we propose a novel model for any database using a directed hypergraph in which the nodes represent attributes and the directed hyperedges represent many-to-many relationships among the attributes. The weights on directed hyperedges capture the likeliness of association in a particular direction. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature [SA96]. The association rules for multi-valued attributes are integrated in building the directed hypergraph model. Based on this model, we provide association-based similarity notions between any two attributes, present a method for finding clusters of similar attributes, and propose an algorithm to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithm, and classifier through experiments on a financial time-series data set (S&P 500).
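The support and confidence figures quoted in the four example rules of Section 1.1 can be illustrated with a minimal sketch. It assumes that support is the fraction of all observations in which both the antecedent and the consequent hold, and that confidence is the fraction of antecedent-matching observations in which the consequent also holds. The function name, the dictionary-based record layout, and the toy stock data are hypothetical and are not taken from the thesis.

```python
def support_and_confidence(observations, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    observations: list of dicts mapping attribute name -> (discretized) value.
    antecedent, consequent: dicts mapping attribute name -> required value.
    This record layout is an illustrative assumption.
    """
    def matches(obs, pattern):
        # An observation matches when every listed attribute has the required value.
        return all(obs.get(attr) == val for attr, val in pattern.items())

    n_antecedent = sum(1 for obs in observations if matches(obs, antecedent))
    n_both = sum(
        1 for obs in observations
        if matches(obs, antecedent) and matches(obs, consequent)
    )
    support = n_both / len(observations) if observations else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data in the spirit of the stock example (rule 4).
observations = [
    {"A": "up", "B": "up", "C": "down"},
    {"A": "up", "B": "up", "C": "up"},
    {"A": "down", "B": "up", "C": "down"},
    {"A": "up", "B": "down", "C": "up"},
]
support, confidence = support_and_confidence(
    observations, {"A": "up", "B": "up"}, {"C": "down"}
)
print(support, confidence)
```

On this toy data the rule "A up and B up implies C down" holds in one of four observations and in one of the two observations that match the antecedent.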

1.2 Thesis Overview

We present background on approximation algorithms and data mining in Chapter 2. The approximation algorithms presented address the following problems: finding a minimum cost set cover, finding a minimum size graph dominator, and finding a minimum diameter clustering. We then discuss some fundamental concepts and approaches in data mining such as linear regression, the perceptron rule learning algorithm, and k-means clustering.

In Chapter 3, we introduce the notion of association rules for multi-valued attributes, which is an adaptation of quantitative association rules known in the literature. We then propose a novel directed-hypergraph-based modeling for any database that allows us to compute association-based similarity between multi-valued attributes and find clusters of similar multi-valued attributes.

Using the directed hypergraph model, we propose in Chapter 4 two greedy algorithms for identifying a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. The first greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cardinality dominating set in graphs. The second greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cost set cover. Finally, we present an association-based classifier that can be used to predict values of attributes.

In Chapter 5, we conduct experiments on financial time-series obtained from Yahoo Finance [Yah10] on the following problems: (a) finding clusters of similar attributes, (b) finding leading indicators, and (c) predicting values of financial time-series using the association-based classifier. We propose an equi-depth partitioning technique to discretize the financial time-series in the S&P 500. This transformation is used to construct a database that is suitable for the association hypergraph modeling. The directed hypergraph $H$ required for the experiments is then constructed using the database. We show the association characteristics of financial time-series by presenting the degree distribution of nodes in the association hypergraph, by displaying the directed edges and directed hyperedges with the highest association confidence value (ACV) for selected financial time-series, and by comparing the directed hyperedges with the highest ACV with those of the corresponding two directed edges. We also display the clusters of financial time-series in the similarity graph, compute the leading indicators for a collection of financial time-series, and present the statistics of financial time-series predictions.

In this thesis we show that directed hypergraphs, due to their inherent structure, help us to capture interesting characteristics that exhibit many-to-one relationships among attributes in a database. The proposed model displays the versatile usage of directed hypergraphs in addressing various problems relevant to mining associations in databases. Our experiments on financial time-series demonstrate that directed hypergraphs capture more relationships than directed graphs. In the stock market domain, our modeling allows us to address problems such as computing financial time-series similarity, computing financial time-series leading indicators, and predicting financial time-series values, using a common framework/methodology.

CHAPTER 2

PRELIMINARIES

2.1 Approximation Algorithms

There are many problems whose decision version is NP-complete and whose optimization version is NP-hard, i.e., no polynomial-time algorithm for obtaining an optimal solution is currently known. Due to their practical importance, it is extremely useful to have an efficient polynomial-time solution to such problems by carrying out any of the following: (a) constraining the input, (b) introducing randomization in the solution approach, and (c) obtaining a near optimal solution. In this section, we shall focus on obtaining a near optimal solution to hard problems. Approaches that produce a near optimal solution are used in the design of polynomial-time approximation algorithms. The quality measure of an approximation algorithm is given by its approximation factor.

Let us say that we have an optimization problem. Based on the problem, an optimal solution may be a solution with either the maximum value or the minimum value. In other words, the problem may be a maximization or a minimization.

Definition 2.1 For an optimization problem $A$ that takes input $I$, an algorithm $ALG$ has an approximation ratio $\rho$ if the cost $ALG_A(I)$ of the solution produced by the algorithm on input $I$ is within a factor of $\rho$ of the cost $OPT_A(I)$ of an optimal solution on the same input $I$. That is,

$$\max\left(\frac{ALG_A(I)}{OPT_A(I)}, \frac{OPT_A(I)}{ALG_A(I)}\right) \le \rho.$$

For a minimization problem, the ratio $ALG_A(I)/OPT_A(I)$ gives a factor by which the solution cost produced by the algorithm exceeds the solution cost of an optimal solution. For a maximization problem, the ratio $OPT_A(I)/ALG_A(I)$ gives a factor by which the solution cost of an optimal solution exceeds the solution cost produced by the algorithm.
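As a small worked illustration of Definition 2.1, the sketch below evaluates the ratio for one minimization instance and one maximization instance. The function name and the sample costs are hypothetical, and positive costs are assumed.

```python
def approximation_ratio(alg_cost, opt_cost):
    """Ratio of Definition 2.1: max(ALG/OPT, OPT/ALG), assuming positive costs."""
    if alg_cost <= 0 or opt_cost <= 0:
        raise ValueError("costs must be positive")
    return max(alg_cost / opt_cost, opt_cost / alg_cost)


# Minimization: the algorithm pays 6 where the optimum pays 3.
ratio_min = approximation_ratio(6, 3)
# Maximization: the algorithm gains 40 where the optimum gains 50.
ratio_max = approximation_ratio(40, 50)
print(ratio_min, ratio_max)
```

Taking the maximum of the two quotients lets the same definition cover minimization and maximization problems, since exactly one quotient is at least 1 in each case.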

2.1.1 Set Cover

Definition 2.2 Let $U$ be a universe consisting of $n$ elements and let $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ be a collection of subsets of $U$. A set cover $SC$ is a subcollection of $\mathcal{S}$ that covers all the elements in $U$.

Algorithm 1 presented below computes a set cover $SC$ of size $ALG_{SC}(I)$ given a universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$ as input $I$. Let $Cover$ denote the set of elements that are currently covered by the algorithm. The greedy strategy of the algorithm is as follows: for every subset $S_i \in \mathcal{S}$ that is not part of the set cover yet, the algorithm computes its cost effectiveness $\alpha(S_i)$ that reflects $S_i$'s covering ability, i.e., $\alpha(S_i)$ is the average cost paid by the greedy algorithm to cover the elements in $S_i$ that are not already in $Cover$, i.e., $1/|S_i - Cover|$. During each iteration of the algorithm, until all the elements in $U$ are covered, the subset $S_i$ with the lowest average cost (highest cost effectiveness) is added to the set cover. In other words, a price of $1/|S_i - Cover|$ is paid to cover each element in $S_i - Cover$. Therefore, the cost of the set cover obtained is $\sum_{k=1}^{n} price(u_k)$. The greedy set cover algorithm can be implemented in linear time.

Theorem 2.3 [Joh74, Lov75, Chv79] Given a universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$ as input $I$, the cost of the set cover $ALG_{SC}(I)$ computed by Algorithm 1 is at most a factor of $O(\log n)$ greater than the cost of an optimal set cover $OPT_{SC}(I)$ in $O(n \log m)$ time.

Proof Let $u_1, u_2, \ldots, u_n \in U$ be the order in which the elements are added to $Cover$ by Algorithm 1. Since an optimal solution can cover all the elements in $U$ with a cost of $OPT_{SC}(I)$, there must always exist a set having an average cost of at most $OPT_{SC}(I)/|U - Cover|$. We know that $|U - Cover|$ is at least $n - k + 1$ when an element $u_k$ is about to be covered. Since the algorithm picks the lowest average cost subset during each iteration, we have

$$price(u_k) \le \frac{OPT_{SC}(I)}{|U - Cover|} \le \frac{OPT_{SC}(I)}{n - k + 1}.$$

Input: A universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$
Output: A set cover $SC$
begin
    $SC \leftarrow \emptyset$;
    $Cover \leftarrow \emptyset$;
    while $Cover \ne U$ do
        foreach $S \in \mathcal{S}$ do
            $count \leftarrow 0$;
            foreach $u \in S$ do
                if $u \notin Cover$ then
                    $count \leftarrow count + 1$;
                end
            end
            $\alpha(S) \leftarrow 1/count$ if $count > 0$, and $\infty$ otherwise;
        end
        Let $S'$ be such that $\alpha(S') = \min_{S \in \mathcal{S}} \alpha(S)$;
        $Cover \leftarrow Cover \cup S'$;
        $SC \leftarrow SC \cup \{S'\}$;
    end
    return $SC$;
end
Algorithm 1: A greedy algorithm for computing a set cover.

Therefore, the cost of the set cover $ALG_{SC}(I)$ is

$$ALG_{SC}(I) = \sum_{k=1}^{n} price(u_k) \le \sum_{k=1}^{n} \frac{OPT_{SC}(I)}{n - k + 1} = \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n}\right) OPT_{SC}(I) \le (1 + \ln n) \cdot OPT_{SC}(I) = O(\log n) \cdot OPT_{SC}(I).$$

This completes the proof.

2.1.2 Graph Dominating Set

Definition 2.4 Let $G = (V, E)$ be a graph. A dominating set is a subset $DomSet \subseteq V$ of vertices such that, for each vertex $v \in V$, either there exists an edge $(u, v) \in E$ such that $u \in DomSet$, or $v \in DomSet$.
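The greedy strategy of Algorithm 1 in Section 2.1.1 can be sketched in Python as follows. This is a minimal illustration under the unit-cost assumption used there, so picking the subset with the lowest average cost is the same as picking the subset that covers the most uncovered elements. The function name, the index-based return value, and the sample instance are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Greedy unit-cost set cover in the spirit of Algorithm 1.

    universe: iterable of hashable elements.
    subsets: list of sets, each assumed to be a subset of the universe.
    Returns the indices of the chosen subsets, in the order they were picked.
    """
    universe = set(universe)
    if not universe <= set().union(*subsets):
        raise ValueError("the subsets do not cover the universe")

    covered = set()
    chosen = []
    while covered != universe:
        # Lowest average cost 1/|S - Cover| means largest number of new elements.
        best = max(range(len(subsets)),
                   key=lambda i: len(set(subsets[i]) - covered))
        chosen.append(best)
        covered |= set(subsets[best]) & universe
    return chosen


# Hypothetical instance with five elements and four subsets.
universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
cover = greedy_set_cover(universe, subsets)
print(cover)
```

Ties are broken in favour of the subset with the smallest index, which is one arbitrary choice among several that keep the approximation guarantee.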

Theorem 2.5 [Joh74, Lov75, Chv79] There is a greedy $O(\log n)$-approximation algorithm for the graph dominating set problem that, given any graph $G = (V, E)$ as input $I$, computes a dominating set of $G$ whose size is within $O(\log n)$ of the optimal dominating set size, where $n$ is the number of vertices and $m$ is the number of edges, of $G$.

The following construction can be used to transform an instance of graph dominating set into an instance of set cover. Let $U = V$ be the set of vertices that need to be covered and let $\mathcal{S} = \{S_1, S_2, \ldots, S_n\}$ be subsets of vertices, where each subset $S_i$ contains vertex $v_i$ and its neighborhood set $N(v_i)$. This instance of set cover can be solved using the approximation result in Theorem 2.3. Since every set added by Algorithm 1 to the set cover always adds only a vertex to the graph dominating set, the approximation guarantee provided in the theorem for the size of the set cover holds for the size of the graph dominating set.

2.1.3 The t-clustering

Let us assume we have a set of attributes, e.g., financial time-series, genes, or images, and we are required to find independent groups of attributes where attributes belonging to the same group are similar to each other. The general clustering problem addresses this by defining an objective function based on which the attributes are grouped. A variant of the general clustering problem is the problem $t$-clustering, which groups the attributes based on how close they are to each other. In other words, the objective of $t$-clustering is to group attributes into $t$ clusters so that the maximum distance between any two attributes within the same cluster (also called the diameter of the grouping, a.k.a. clustering) is minimized.

This problem also assumes that the Euclidean distance function displays metric properties. A distance function $d(\cdot, \cdot)$ on a point set $X = \{x_1, x_2, \ldots, x_n\}$ is said to have metric properties if and only if it satisfies:

1. $d(x_1, x_2) \ge 0$ for all $x_i \in X$, and
2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$, and
3. $d(x_1, x_2) = d(x_2, x_1)$, and
4. $d(x_1, x_2) \le d(x_1, x_3) + d(x_3, x_2)$ (Triangle Inequality).
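The four metric properties listed above can be checked by brute force on a small finite point set. The sketch below is only an illustration: the helper names, the tolerance parameter, and the sample points are assumptions, and the squared Euclidean distance is included purely as a counterexample that violates the triangle inequality.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(p, q)


def squared_euclidean(p, q):
    """Squared Euclidean distance; not a metric in general."""
    return math.dist(p, q) ** 2


def is_metric(points, d, tol=1e-9):
    """Brute-force check of the four metric properties on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                      # 1. non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):        # 2. zero exactly for equal points
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # 3. symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:   # 4. triangle inequality
            return False
    return True


# Three collinear points (hypothetical sample).
points = [(0, 0), (3, 4), (6, 8)]
print(is_metric(points, euclidean))
print(is_metric(points, squared_euclidean))
```

The check is cubic in the number of points, so it is meant for sanity-testing a distance function on small samples rather than for large data sets.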

Definition 2.6 Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of points with a distance function between any pair of points $d(\cdot, \cdot)$ that satisfies the metric properties and let $t$ be an integer. A $t$-clustering $C = \{C_1, C_2, \ldots, C_t\}$ is a partition of $X$ into $t$ clusters $C_1, C_2, \ldots, C_t$ by designating $t$ points of $X$ as centers, such that $C$ minimizes the diameter $Diam$ over all possible clusterings. That is, $C$ minimizes

$$Diam(C) = \max_{i} \max_{x_a, x_b \in C_i} d(x_a, x_b).$$

Algorithm 2 presented below finds such a partition of $X$ by designating some $t$ points as cluster centers. Initially, any point $\mu_1 \in X$ is picked as the first cluster center. During each iteration of the algorithm, it finds a point $\mu_i \in X$ that is the farthest from the existing centers $\mu_1, \mu_2, \ldots, \mu_{i-1}$, i.e., a point $\mu_i \in X$ that maximizes $\min_{j}$ […]
less than $r$ to its closest center. From the triangle inequality, the diameter of the $t$-clustering is at most $2r$. Also, from the way the centers are chosen in every iteration of Algorithm 2, we know that all the centers $\mu_1, \mu_2, \ldots, \mu_t$ and the point $u$ are placed at least $r$ distance apart from each other. If a $t$-clustering of these points is found, then the diameter of the clustering is at least $r$, i.e., $OPT_{TC}(I) \ge r$. From above we have

$$ALG_{TC}(I) \le 2 \cdot OPT_{TC}(I).$$

This completes the proof.

2.2 Directed Hypergraphs

Directed hypergraphs are a generalization of directed graphs in which each directed hyperedge has one or more source (tail) vertices and has one or more destination (head) vertices. They have found a variety of applications in Computer Science, e.g., in propositional logic [AG97], databases [ADS83, ADS85], scheduling [LS95, GS02], routing in dynamic networks [Pre00], bioinformatics [KNO+03], and data mining [CDP04]. Theoretical problems related to directed hypergraphs have been studied in [Vie04, TT09].

A related notion is undirected hypergraphs, which generalize undirected graphs.

Definition 2.8 [Ber73] An undirected hypergraph $G$ is a pair $(V_G, E_G)$, where $V_G$ is a finite set of vertices and $E_G \subseteq 2^{V_G}$ is a finite set of hyperedges such that, for every hyperedge $e_i \in E_G$, $e_i \ne \emptyset$.

Definition 2.9 [GLPN93] A directed hypergraph $H$ is a pair $(V, E)$, where $V$ is a finite set of vertices and $E \subseteq 2^V \times 2^V$ is a finite set of directed hyperedges such that, for every directed hyperedge $e = (T, H) \in E$, $T \ne \emptyset$, $H \ne \emptyset$, and $T \cap H = \emptyset$. Here, $T$ is called the tail set and $H$ is called the head set of $e$.

Notice that directed hypergraphs are different from undirected hypergraphs [Ber73]. While directed hypergraphs generalize directed graphs, undirected hypergraphs generalize undirected graphs.
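Definition 2.9 translates directly into a small data structure. The sketch below is one possible representation, not the one used in the thesis: the class name, the method names, and the optional edge weight are assumptions made for illustration. It enforces the three conditions on every hyperedge: a non-empty tail set, a non-empty head set, and disjoint tail and head sets.

```python
class DirectedHypergraph:
    """Minimal directed hypergraph: vertices plus weighted (tail, head) hyperedges."""

    def __init__(self):
        self.vertices = set()
        # Maps (frozenset tail, frozenset head) -> weight (weight is an assumption).
        self.hyperedges = {}

    def add_hyperedge(self, tail, head, weight=1.0):
        tail, head = frozenset(tail), frozenset(head)
        if not tail or not head:
            raise ValueError("tail and head sets must be non-empty")
        if tail & head:
            raise ValueError("tail and head sets must be disjoint")
        self.vertices |= tail | head
        self.hyperedges[(tail, head)] = weight

    def out_edges(self, vertex):
        """Hyperedges whose tail set contains the given vertex."""
        return [(t, h) for (t, h) in self.hyperedges if vertex in t]


# Hypothetical example: a 2-to-1 hyperedge and an ordinary directed edge.
hg = DirectedHypergraph()
hg.add_hyperedge({"A", "B"}, {"C"}, weight=0.35)
hg.add_hyperedge({"A"}, {"C"}, weight=0.20)
print(sorted(hg.vertices), len(hg.out_edges("A")), len(hg.out_edges("B")))
```

An ordinary directed edge is the special case in which both the tail set and the head set contain exactly one vertex, which is why the same structure can hold both kinds of edge.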

PAGE 21

2.3DataMining Datamininginvolveslearninginformationorconceptsfromdatabasesandusingthemto solveproblems.Basedonthestructureofthedatabase,therearedierentproblemsthat canbeaddressed.Forexample,supposeourdatabasecontainsattributessuchashumidity, weather,temperature,andtypeofplay,andcontainsobservationsthatinfactarerecords ofvaluesthattheseattributestakeondierentdays.Inordertolearnaboutthetypeof play,onecanxtypeofplayasthe classicationattribute andlookintothevaluesofother attributes.Aclassierisanalgorithmthatlearnsrulesabouttheclassicationattribute basedonotherattributevalues.Thisapproachcanbeusedtopredicttheclassattribute valueofunseenobservations.Sincetheclassierisbuiltunderthesupervisionofasetof observations,i.e.,trainingdataset,thismethodisknownassupervisedclassication.The classvaluemayalsobenon-discreteinnature. Incertaincircumstances,adatabasemaynothaveanyinformationtodistinguisha particularattributeastheclassicationattribute.Insuchascenario,learninginformation aboutthedatabasemaynotaidinthepredictionofaclassvalue.However,learning informationnotonlycorrespondstothepredictionofsomeattributevalueinthedatabase, butalsocanbevisualizedasinferringrelationshipsamongattributes.Forexample,in thesampledatabasediscussedearlier,aninterestingassociationcanbeasfollows:The patternofhumidityhavingthevalue80%,temperaturehavingthevalue90F,andweather havingthevalueRainyisveryfrequentinthedatabase."Suchaninferenceisknownasan associationrule.Here,thereisnoparticularattributewhosevalueispredicted.But,aswe haveseenintheexample,learninginformationcorrespondstoidentifyinginterestingand usefulpatternsamongattributes. 
In the previous example, we saw that learning information can be visualized in terms of inferring relationships among attributes. If this concept is generalized, then it can be termed as identifying groups of attributes with similar characteristics. Here, characteristics may relate to the patterns among attributes identified earlier. This problem is known as clustering, as the objective is to find attributes and classify them into groups based on their characteristics. For example, let the set of attributes be a set of financial time-series and the set of observations be the stock prices recorded for the financial time-series on various days. An interesting question here is to group the financial time-series into clusters containing similar financial time-series. The quality of the clustering obtained is generally verified based on the information available to the user. In this case, one way to verify the quality of the financial time-series clusters obtained is by looking at the sectors of the time-series. Here, the quality of the clustering may be defined as good if a high percentage of the time-series that belong to the same cluster are from the same sector.

2.3.1 Classification Rule Mining

We review the linear regression classifier used to construct classification rules. Linear regression can be used to predict the class attribute value when there are non-discrete attributes. Let $A_1, A_2, \ldots, A_n$ denote the attributes and let $O_1, O_2, \ldots, O_m$ denote the observations corresponding to the attributes. Let there be a database that is visualized as an $m \times n$ table containing $m$ rows and $n$ columns, where rows correspond to observations and columns correspond to attributes. Let $o_{ij}$ denote the value of the $j$'th attribute for the $i$'th observation, where $1 \le i \le m$ and $1 \le j \le n$. For convenience, let us say that the last attribute, $A_n$, is the class attribute whose value needs to be predicted. In order to predict $A_n$'s value, linear regression builds an attribute weight assignment model where $w_1, w_2, \ldots, w_{n-1}$ are the weights assigned to the first $n-1$ attributes. The weight assignment model classifier is iteratively built by computing the predicted value for $A_n$ for each observation. For the $i$'th observation, $A_n$'s predicted value is given by:

$$w_1 o_{i1} + w_2 o_{i2} + \cdots + w_{n-1} o_{i,n-1} = \sum_{j=1}^{n-1} w_j o_{ij}.$$

Intuitively, in order to have an accurate prediction of $A_n$ for the $i$'th observation, the difference between $A_n$'s predicted value $\sum_{j=1}^{n-1} w_j o_{ij}$ and $A_n$'s actual value $o_{in}$ has to be minimized. Thus, for the $i$'th observation, where $1 \le i \le m$, the classifier is constructed by reducing the square of the difference $\left(o_{in} - \sum_{j=1}^{n-1} w_j o_{ij}\right)^2$. Overall, linear regression minimizes

$$\sum_{i=1}^{m} \left(o_{in} - \sum_{j=1}^{n-1} w_j o_{ij}\right)^2.$$

Using the attribute weights in the constructed classifier, the class attribute value of an unseen observation can be predicted.
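The objective above can be evaluated directly for any candidate weight vector. The following is a minimal sketch, not part of the thesis: the function names and the three-row table are hypothetical, and each row is assumed to hold the first $n-1$ attribute values followed by the class value $o_{in}$.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Predicted class value: sum over j of w_j * o_ij."""
    return sum(w * o for w, o in zip(weights, features))


def sum_squared_error(weights: Sequence[float], table: List[List[float]]) -> float:
    """Sum over all observations of (o_in - predicted value)^2.

    Assumption: each row is [o_i1, ..., o_i(n-1), o_in].
    """
    total = 0.0
    for row in table:
        *features, actual = row
        total += (actual - predict(weights, features)) ** 2
    return total


# Hypothetical table: two attributes plus a class column.
table = [
    [1.0, 2.0, 5.0],
    [2.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
]

perfect_fit = sum_squared_error([1.0, 2.0], table)   # weights reproduce every class value
worse_fit = sum_squared_error([1.0, 1.0], table)     # errors 2, 0, 1
```

A fitting procedure would search for the weights with the smallest such sum; the sketch only evaluates the objective.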

The basic idea used in linear regression is to express the class attribute value as a linear combination of other attributes. When the attribute values are discrete, expressing the class attribute as a linear combination of other discrete attributes fails to predict the value of the class attribute. This is due to the fact that the error computed between the predicted class value and the actual class value has little meaning, as this error does not indicate how far the predicted value is from the actual value. Also, the predictions may lie outside the set of discrete values allowed in the data set.

We now look at another approach to predict the class attribute value. Let $A_1, A_2, \ldots, A_n$ denote the attributes and let $O_1, O_2, \ldots, O_m$ denote the observations corresponding to the attributes as before. Let us assume the class attribute $A_n$ is 0/1-valued. This approach tries to separate the observations into either 0-valued or 1-valued by using a hyperplane. The equation of the hyperplane is as follows:

$$w_0 A_0 + w_1 A_1 + w_2 A_2 + \cdots + w_{n-1} A_{n-1} = 0.$$

During the construction of the hyperplane, the weight $w_0$ of attribute $A_0$, called bias, is used to contribute a constant value in the equation of the hyperplane and hence $A_0 = 1$. The algorithm in Algorithm 3, called the perceptron learning rule, is used to construct the hyperplane. A perceptron [Ros58] is a binary classifier that classifies an observation into the first class if the sum from the equation is greater than 0 and classifies an observation into the second class otherwise.

    for $i \leftarrow 0$ to $n-1$ do
        $w_i \leftarrow 0$;
    end
    while there exists an incorrectly classified observation in the training data do
        for $j \leftarrow 1$ to $m$ do
            if $O_j$ is currently incorrectly classified then
                if $O_j$ belongs to the first class then
                    Add values of attributes $A_0, A_1, \ldots, A_{n-1}$ in this observation to
                    weights $w_0, w_1, \ldots, w_{n-1}$ in the equation;
                end
                else
                    Subtract values of attributes $A_0, A_1, \ldots, A_{n-1}$ in this observation from
                    weights $w_0, w_1, \ldots, w_{n-1}$ in the equation;
                end
            end
        end
    end

Algorithm 3: Perceptron learning rule.

We see that the algorithm iteratively either increments or decrements the attribute weights by the attribute values of the observation, if the class attribute value of that observation is incorrectly classified in the equation of the hyperplane. The goal is to predict class values of all the observations correctly. If the data set is not linearly separable, the above algorithm would not terminate. In such cases, the algorithm can be terminated forcefully after the execution of a certain number of iterations.

2.3.2 Clustering

Definition 2.10 [Llo82] Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of points with a distance function $d(\cdot,\cdot)$ between any pair of points, and let $k$ be an integer. A $k$-means-clustering $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ is a partition of $X$ into $k$ clusters $C_1, C_2, \ldots, C_k$ such that $\mathcal{C}$ minimizes the sum of squares of distances of points from the centroid $\mu_i = \sum_{x_j \in C_i} x_j / |C_i|$ for each cluster $C_i \in \mathcal{C}$. That is, $\mathcal{C}$ minimizes

$$\sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2.$$

The $k$-means clustering algorithm, provided in Algorithm 4, is a well known technique to find a $k$-means-clustering given points in Euclidean space. Initially $k$ random cluster centers are picked. The $k$-means algorithm then works iteratively. In each iteration, the algorithm assigns the remaining points to the closest cluster center and computes the centroid for each of the $k$ clusters. These centroids are the new cluster centers for the next iteration. The algorithm continues until the cluster centers between two consecutive iterations remain unchanged.

The output of the $k$-means algorithm depends on the initial cluster centers that are picked. The $k$-means algorithm in each iteration picks the centroids of the points in the current clusters as new cluster centers NewCC. The algorithm terminates when the current cluster centers CurrentCC and new cluster centers NewCC are the same.

    Input: The number of clusters $k$ and a set of points $X = \{x_1, x_2, \ldots, x_n\}$ with a
           distance function $d(x_a, x_b)$ between any pair of points
    Output: A clustering $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$
    begin
        NewCC $\leftarrow$ Pick any $k$ points, say, $\mu_1, \mu_2, \ldots, \mu_k$;
        CurrentCC $\leftarrow \emptyset$;
        while CurrentCC $\ne$ NewCC do
            CurrentCC $\leftarrow$ NewCC;
            Create $k$ clusters such that the $i$'th cluster is $C_i = \{$all $x \in X$ whose closest
            center is $\mu_i \in$ CurrentCC$\}$;
            for $i \leftarrow 1$ to $k$ do
                $\mu_i \leftarrow \sum_{x_j \in C_i} x_j / |C_i|$;
            end
            NewCC $\leftarrow \{\mu_1, \mu_2, \ldots, \mu_k\}$;
        end
    end

Algorithm 4: The $k$-means algorithm.

The $k$-means clustering algorithm is suitable for finding clusters in a set of points that contains subgroups of points with symmetric shapes, such as circular ones, since these shapes allow the points to be assigned to unique centroids. This makes the algorithm iteratively move towards the unique centroids of the individual subgroups of points in order to obtain a stable clustering. When the input contains subgroups of points that have asymmetric shapes with irregular boundaries, the algorithm may keep moving the centroids for many iterations, repeatedly modifying the clustering configuration before it stabilizes. The worst case running time for the $k$-means algorithm is known to be $2^{\Omega(\sqrt{n})}$ [AV06].
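The iteration of Algorithm 4 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the thesis's implementation: points are tuples of floats compared by squared Euclidean distance, the initial centers are passed in explicitly so that the run is deterministic, an empty cluster keeps its previous center, and `max_iter` is a hypothetical safety cap.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: List[Point], initial_centers: List[Point],
            max_iter: int = 100) -> Tuple[List[List[Point]], List[Point]]:
    new_cc: List[Point] = list(initial_centers)
    current_cc: Optional[List[Point]] = None
    clusters: List[List[Point]] = []
    for _ in range(max_iter):            # assumption: cap on iterations
        if current_cc == new_cc:         # centers unchanged: stable clustering
            break
        current_cc = new_cc
        clusters = [[] for _ in current_cc]
        for p in points:                 # assign each point to its closest center
            nearest = min(range(len(current_cc)),
                          key=lambda i: squared_distance(p, current_cc[i]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster keeps its old center.
        new_cc = [centroid(c) if c else current_cc[i]
                  for i, c in enumerate(clusters)]
    return clusters, new_cc


# Hypothetical input: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
clusters, centers = k_means(points, [(0.0, 0.0), (10.0, 0.0)])
```

Passing different initial centers can produce a different clustering, which is the dependence on initialization noted above.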

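CHAPTER 3
MODELING ASSOCIATIONS IN DATABASES USING DIRECTED HYPERGRAPHS

3.1 Associations Between Multi-Valued Attributes

Let $\mathcal{D}$ be a database in the form of an $m \times n$ table, where the rows correspond to observations and the columns correspond to multi-valued attributes. Let $\mathcal{O} = \{O_1, O_2, \ldots, O_m\}$ be the set of observations and $\mathcal{A} = \{A_1, A_2, \ldots, A_n\}$ be the set of attributes. The table entry for each attribute $A_i$ and each observation $O_j$ is a value from a fixed finite set $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$. We denote such a database $\mathcal{D}$ in the form of $\mathcal{D}(\mathcal{A}, \mathcal{O}, \mathcal{V})$. For any $X \subseteq \mathcal{A} \times \mathcal{V}$, let $\pi_1(X)$ denote $\{A_i \mid \exists v_j\, (A_i, v_j) \in X\}$.

We next present the definition of an association rule for multi-valued attributes and the support and confidence measures for such an association rule.

Definition 3.1 An association rule for multi-valued attributes (in short, mva-type association rule) in a database $\mathcal{D}(\mathcal{A}, \mathcal{O}, \mathcal{V})$ is an implication relationship of the form $X \overset{mva}{\Longrightarrow} Y$, where $X, Y \subseteq \mathcal{A} \times \mathcal{V}$ and $\pi_1(X)$ and $\pi_1(Y)$ are disjoint subsets of $\mathcal{A}$.

Definition 3.2 The support and confidence measures are generalized for multi-valued attributes in a database $\mathcal{D}(\mathcal{A}, \mathcal{O}, \mathcal{V})$ as follows:

1. Let $X = \{(A_{i_1}, v_{j_1}), (A_{i_2}, v_{j_2}), \ldots, (A_{i_r}, v_{j_r})\}$ be any subset of $\mathcal{A} \times \mathcal{V}$. The support of $X$, denoted by $\mathrm{Supp}(X)$, is defined as the fraction of observations in $\mathcal{D}$ for which $A_{i_1}$ takes value $v_{j_1}$, $A_{i_2}$ takes value $v_{j_2}$, $\ldots$, and $A_{i_r}$ takes value $v_{j_r}$.

2. Let $X \overset{mva}{\Longrightarrow} Y$ be an mva-type association rule. Then the confidence of this rule, denoted by $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y)$, is defined as follows:

$$\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}.$$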

The above definition of an mva-type association rule has been adapted from the definition of a quantitative association rule [SA96] with a minor change to simplify the definition. In an mva-type association rule, attributes are associated with values from a fixed finite set, whereas in a quantitative association rule, attributes are associated with either categorical values (e.g., zip code, make of car) or intervals (e.g., age, income). During the process of discovering quantitative association rules, the attribute values are then mapped to discrete values. On the other hand, our definition of database $\mathcal{D}$ (in particular, the set of values $\mathcal{V}$) assumes that the attribute values are already mapped to discrete values. In this sense, our definition of mva-type association rules simplifies the definition of quantitative association rules given in [SA96].

Note that the definitions of support and confidence in the market-basket type database can be viewed as a special case of Definition 3.2. For instance, let $A_1$, $A_2$, and $A_3$ be 0/1-valued (i.e., binary) attributes. Then, the measure "support of $\{A_1, A_2\}$" in the market-basket type data can be seen as equivalent to $\mathrm{Supp}(\{(A_1, 1), (A_2, 1)\})$ and the measure "confidence of $\{A_1, A_2\} \Longrightarrow \{A_3\}$" can be seen as equivalent to $\mathrm{Conf}(\{(A_1, 1), (A_2, 1)\} \overset{mva}{\Longrightarrow} \{(A_3, 1)\})$ given by Definition 3.2.

We now give some examples of databases containing multi-valued attributes, observations, and the values that attributes take for each observation. These examples are from various domains, such as medicine, bioinformatics, and social networks.

Our first example is a Patient database.

Example 3.3 Consider a Patient database in Table 3.1 where each observation consists of attribute values such as age, cholesterol, blood-pressure, and heart-rate of different patients. Here, patient 1 has age 25 years, cholesterol 105 mg/dL, blood pressure 135 mmHg, and heart-rate 75 beats per minute. Similarly, records for other patients can be read from this table.

In order to improve the usability of a database that contains attributes each of which can take real values in an arbitrary range, it is a general practice to discretize the attribute values. In the patient database in Table 3.1, for each attribute value $a_i$, we consider the discretized value $\lfloor a_i / 10 \rfloor$. Table 3.2 displays the discretized database.
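The discretization step can be reproduced with a few lines of code. The sketch below is illustrative only: the dictionary layout and the function name `discretize` are assumptions, while the numbers are the patient records of Table 3.1 in the order (Age, Cholesterol, Blood-Pressure, Heart-Rate).

```python
from typing import Dict, Tuple

# Patient records from Table 3.1: id -> (A, C, B, H).
patients: Dict[int, Tuple[int, int, int, int]] = {
    1: (25, 105, 135, 75),
    2: (62, 160, 165, 85),
    3: (32, 125, 139, 71),
    4: (12, 95, 105, 67),
    5: (38, 129, 135, 75),
    6: (39, 121, 117, 71),
    7: (41, 134, 145, 73),
    8: (85, 125, 155, 78),
}


def discretize(value: int, width: int = 10) -> int:
    """Map a non-negative value a to floor(a / width)."""
    return value // width


discretized = {
    pid: tuple(discretize(a) for a in row) for pid, row in patients.items()
}
```

Each entry of `discretized` corresponds to one row of the discretized patient table.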

Table 3.1 Patient database.

    Observations   Attributes
    Patient Id     Age   Cholesterol   Blood-Pressure   Heart-Rate
    Id             A     C             B                H
    1              25    105           135              75
    2              62    160           165              85
    3              32    125           139              71
    4              12    95            105              67
    5              38    129           135              75
    6              39    121           117              71
    7              41    134           145              73
    8              85    125           155              78

Table 3.2 Patient database after discretization.

    Observations   Attributes
    Patient Id     Age   Cholesterol   Blood-Pressure   Heart-Rate
    Id             A     C             B                H
    1              2     10            13               7
    2              6     16            16               8
    3              3     12            13               7
    4              1     9             10               6
    5              3     12            13               7
    6              3     12            11               7
    7              4     13            14               7
    8              8     12            15               7

Let us consider the mva-type association rule $X \overset{mva}{\Longrightarrow} Y$, where $X = \{(A, 3), (C, 12)\}$ and $Y = \{(B, 13)\}$. This is an implication relationship that states: If the age of a patient is in the range 30-39 and the cholesterol of the same patient is in the range 120-129 mg/dL, then it is likely that the patient's blood-pressure is in the range 130-139 mmHg. Here, $\mathrm{Supp}(X) = 3/8 = 0.375$ and $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y)/\mathrm{Supp}(X) = 2/3 = 0.667$.

Our next example is a Gene database.

Example 3.4 Consider a Gene database in Table 3.3 where each observation consists of attribute values such as gene 1 expression value, gene 2 expression value, gene 3 expression value, and gene 4 expression value of different patients. Here, patient 1 has gene 1 expression value 54.23, gene 2 expression value 66.22, gene 3 expression value 342.32, and gene 4 expression value 422.21. Similarly, records for other patients can be read from this table.

Table 3.3 Gene database.

    Observations   Attributes
    Patient Id     Gene 1    Gene 2    Gene 3    Gene 4
    Id             G1        G2        G3        G4
    1              54.23     66.22     342.32    422.21
    2              541.21    324.21    165.21    852.21
    3              321.67    125.98    139.43    71.11
    4              123.87    95.54     105.88    678.65
    5              388.44    129.33    135.65    754.32
    6              399.98    121.54    117.55    719.33
    7              414.33    134.73    145.32    733.22
    8              855.78    125.93    155.76    789.43

Table 3.4 Gene database after discretization.

    Observations   Attributes
    Patient Id     Gene 1    Gene 2    Gene 3    Gene 4
    Id             G1        G2        G3        G4
    1              ↓         ↓         ↔         ↔
    2              ↔         ↓         ↓         ↑
    3              ↓         ↓         ↓         ↓
    4              ↓         ↓         ↓         ↑
    5              ↔         ↓         ↓         ↑
    6              ↔         ↓         ↓         ↑
    7              ↔         ↓         ↓         ↑
    8              ↑         ↓         ↓         ↑

In the gene database in Table 3.3, for each attribute value $a_i$, we consider the discretized value $\downarrow$ if $0 \le a_i \le 333$, the discretized value $\leftrightarrow$ if $334 \le a_i \le 666$, and the discretized value $\uparrow$ if $667 \le a_i \le 999$. Table 3.4 displays the discretized database.

Let us consider the mva-type association rule $X \overset{mva}{\Longrightarrow} Y$, where $X = \{(G2, \downarrow), (G3, \downarrow)\}$ and $Y = \{(G4, \uparrow)\}$. This is an implication relationship that states: If gene 2 and gene 3 in a patient are underexpressed, then it is likely that gene 4 is overexpressed in the patient. Here, $\mathrm{Supp}(X) = 7/8 = 0.875$ and $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y)/\mathrm{Supp}(X) = 6/7 = 0.857$.

Our next example is a Personal Interest database.

Example 3.5 Consider a Personal Interest database from a social network in Table 3.5, where each observation consists of a rating for attributes such as 'read', 'play', 'music', and 'eat' of different people, where 0 denotes the lowest interest and 10 denotes the highest interest. Here, person 1 has a rating of 10 for read, 10 for play, 3 for music, and 5 for eat. Similarly, records for other people can be read from this table.

Table 3.5 Personal interest database.

    Observations   Attributes
    Person Id      Read   Play   Music   Eat
    Id             R      P      M       E
    1              10     10     3       5
    2              7      9      4       6
    3              3      1      9       10
    4              5      1      10      7
    5              9      8      2       6
    6              8      10     7       6
    7              5      4      6       5
    8              8      10     1       8

Table 3.6 Personal interest database after discretization.

    Observations   Attributes
    Person Id      Read   Play   Music   Eat
    Id             R      P      M       E
    1              h      h      l       m
    2              m      h      m       m
    3              l      l      h       h
    4              m      l      h       m
    5              h      h      l       m
    6              h      h      m       m
    7              m      m      m       m
    8              h      h      l       h

In the personal interest database in Table 3.5, for each attribute value $a_i$, we consider the discretized value $l$ (low) if $0 \le a_i \le 3$, $m$ (moderate) if $4 \le a_i \le 7$, and $h$ (high) if $8 \le a_i \le 10$. Table 3.6 displays the discretized database.

Let us consider the mva-type association rule $X \overset{mva}{\Longrightarrow} Y$, where $X = \{(R, h), (P, h)\}$ and $Y = \{(M, l)\}$. This is an implication relationship that states: If a person has high interest in reading and playing, then it is likely that the person has low interest in music. Here, $\mathrm{Supp}(X) = 4/8 = 0.5$ and $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y)/\mathrm{Supp}(X) = 3/4 = 0.75$.

Thus, with the above examples, we have seen databases containing multi-valued attributes.
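The support and confidence values quoted for the personal interest example can be checked mechanically. The following sketch is illustrative: the helper names `support` and `confidence` and the dictionary encoding of item sets are assumptions, the rows are those of Table 3.6, and returning 0 when the antecedent has zero support is a convention chosen here, not one stated in the thesis.

```python
from typing import Dict, List

# Rows of Table 3.6, one string per person, in attribute order R, P, M, E.
rows = "hhlm mhmm llhh mlhm hhlm hhmm mmmm hhlh".split()
interest: List[Dict[str, str]] = [dict(zip("RPME", row)) for row in rows]


def support(db: List[Dict[str, str]], items: Dict[str, str]) -> float:
    """Fraction of observations in which every attribute takes the given value."""
    matches = sum(1 for obs in db if all(obs[a] == v for a, v in items.items()))
    return matches / len(db)


def confidence(db: List[Dict[str, str]], x: Dict[str, str], y: Dict[str, str]) -> float:
    """Supp(X u Y) / Supp(X); assumption: 0.0 when Supp(X) is zero."""
    supp_x = support(db, x)
    if supp_x == 0:
        return 0.0
    return support(db, {**x, **y}) / supp_x


supp_x = support(interest, {"R": "h", "P": "h"})
conf_xy = confidence(interest, {"R": "h", "P": "h"}, {"M": "l"})
```

Here `supp_x` and `conf_xy` correspond to $\mathrm{Supp}(X)$ and $\mathrm{Conf}(X \overset{mva}{\Longrightarrow} Y)$ for the rule of Example 3.5.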

3.1.1 Discretization

We provide a methodology to discretize attribute values in financial time-series databases in Section 5.1.1. In these databases, every observation corresponds to a reading taken at a particular time and the order of observations is important. Our discretization methodology captures the relationship between any two consecutive observations in a financial time-series database. Thus, in the resulting database with discretized attribute values, the order of observations is irrelevant.

3.2 Association Hypergraphs

Definition 3.6 An association hypergraph $\mathcal{H}$ for a database $\mathcal{D}(\mathcal{A}, \mathcal{O}, \mathcal{V})$ is a directed hypergraph whose vertex set $V$ is $\mathcal{A}$ and hyperedge set $E$ consists of directed hyperedges $(T, H)$, where $T$ and $H$ are disjoint subsets of $\mathcal{A}$. Each directed hyperedge $e = (T, H)$ has an association confidence value in the range $[0, 1]$, denoted $ACV(e)$ or $ACV(T, H)$, and an association table (as shown in Table 3.7), denoted $AT(e)$ or $AT(T, H)$, that are defined as follows:

1. The association confidence value of a directed hyperedge $(\{t_1, \ldots, t_r\}, \{h_1, \ldots, h_s\})$ equals

$$\sum_{v_1, \ldots, v_r \in \mathcal{V}} \mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\}) \times \mathrm{Conf}(\{(t_1, v_1), \ldots, (t_r, v_r)\} \overset{mva}{\Longrightarrow} H^*),$$

   where $H^* = \{(h_1, v_1^*), \ldots, (h_s, v_s^*)\}$ and $v_1^*, \ldots, v_s^*$ depend on $v_1, \ldots, v_r$ such that they maximize the confidence of the mva-type association rule $\{(t_1, v_1), \ldots, (t_r, v_r)\} \overset{mva}{\Longrightarrow} \{(h_1, v_1'), \ldots, (h_s, v_s')\}$ over all choices of $v_1', \ldots, v_s' \in \mathcal{V}$. In other words, it equals $\sum_{v_1, \ldots, v_r \in \mathcal{V}} \mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\} \cup H^*)$, where $H^*$ is as defined above.

2. The association table of a directed hyperedge $(\{t_1, t_2, \ldots, t_r\}, \{h_1, h_2, \ldots, h_s\})$ has rows corresponding to the set of all possible values that $t_1, \ldots, t_r$ can take from $\mathcal{V}$. The row corresponding to $t_1 = v_1, \ldots, t_r = v_r$, where $v_1, \ldots, v_r \in \mathcal{V}$, is a list that contains
   (a) $\mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\})$,
   (b) the values $v_1^*, \ldots, v_s^* \in \mathcal{V}$ defined in part (1) above, and
   (c) the confidence of the mva-type association rule $\{(t_1, v_1), \ldots, (t_r, v_r)\} \overset{mva}{\Longrightarrow} \{(h_1, v_1^*), \ldots, (h_s, v_s^*)\}$.
   In other words, every row corresponds to an mva-type association rule.

The motivation for representing a database $\mathcal{D}$ using a directed hypergraph is to capture a more general implication relationship between attributes than the one identified by mva-type association rules. For example, we now want to answer the following question: "Regardless of the values the attributes in set $T$ take, what is the likeliness of predicting the values of the attributes in set $H$?". The above intuition helps us to model such a relationship between attributes in $T$ and attributes in $H$ using a directed hyperedge $(T, H)$ and to capture the likeliness as the association confidence value $ACV(T, H)$ of this directed hyperedge.

In this thesis, we consider only associations of the form $(T, H)$, where $T$ and $H$ are disjoint subsets of attributes, $|T| \le 2$, and $|H| \le 1$. Having no constraint on $|T|$ and $|H|$ adds complexity in the model because of the numerous possibilities involved. Extending our methodology to model associations in databases to the general case is a subject of future work. Henceforth, we use the term association hypergraph to refer to this restricted case, and call any directed hyperedge $(T, H)$ in which $|T| = 1$ a directed edge and one in which $|T| = 2$ a 2-to-1 directed hyperedge.

3.2.1 Constructing Association Hypergraphs

The association hypergraph $\mathcal{H}$ for a database $\mathcal{D}(\mathcal{A}, \mathcal{O}, \mathcal{V})$ has node set $\mathcal{A}$ and the hyperedge set consists of directed hyperedges of the form $(T, H)$, where $T$ and $H$ are disjoint subsets of $\mathcal{A}$. We construct directed hyperedges in the order of their head set. For a fixed combination of two or fewer attributes, say $\{A_1, A_2\}$, and any other attribute, say $A_3$, we determine whether $(\{A_1, A_2\}, \{A_3\})$ could be included as a directed hyperedge of $\mathcal{H}$ by checking whether the combination is $\gamma$-significant according to Definition 3.7.
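The association confidence value of Definition 3.6 can be computed directly from a discretized table by using its second form: for every combination of tail values, count the observations that also carry the most frequent head value, then divide the total by the number of observations. The sketch below is an illustration under assumptions: the function name `acv` and the data layout are hypothetical, and the data are the discretized personal interest records of Table 3.6.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

# Rows of Table 3.6, in attribute order R, P, M, E.
rows = "hhlm mhmm llhh mlhm hhlm hhmm mmmm hhlh".split()
interest: List[Dict[str, str]] = [dict(zip("RPME", row)) for row in rows]


def acv(db: List[Dict[str, str]], tail: Sequence[str], head: Sequence[str]) -> float:
    """Association confidence value of the directed hyperedge (tail, head).

    Groups observations by their tail values; within each group the most
    frequent head value contributes its count, i.e. Supp(X u H*) * |db|.
    """
    groups: Dict[tuple, Counter] = defaultdict(Counter)
    for obs in db:
        tail_values = tuple(obs[a] for a in tail)
        head_values = tuple(obs[a] for a in head)
        groups[tail_values][head_values] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(db)


acv_rp_m = acv(interest, ["R", "P"], ["M"])   # 2-to-1 directed hyperedge
acv_r_m = acv(interest, ["R"], ["M"])         # directed edge
acv_empty_m = acv(interest, [], ["M"])        # empty tail: most frequent value of M
```

On this data the value grows as attributes are added to the tail (3/8, then 6/8, then 7/8), which is the kind of comparison a significance check between a hyperedge and its sub-tails relies on.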

Table 3.7 An example association table AT for the combination $(\{A_1, A_2\}, \{A_3\})$.

    Index   Values of         Support                               Most frequent value   Confidence
            $A_1$ and $A_2$   $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\})$   of $A_3$ ($v_3^*$)    $\mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(A_3, v_3^*)\})$
    1       <1, 1>            0.14                                  2                     0.43
    2       <1, 2>            0.03                                  1                     0.62
    3       <1, 3>            0.06                                  1                     0.75
    4       <2, 1>            0.21                                  3                     0.45
    5       <2, 2>            0.11                                  2                     0.38
    6       <2, 3>            0.17                                  1                     0.66
    7       <3, 1>            0.01                                  2                     0.49
    8       <3, 2>            0.04                                  3                     0.81
    9       <3, 3>            0.23                                  2                     0.73
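Given an association table, the association confidence value of its hyperedge is the sum over rows of support times confidence. The short sketch below applies this to the nine rows of Table 3.7; the tuple layout and the function name `acv_from_table` are assumptions made for illustration.

```python
from typing import List, Tuple

# Rows of Table 3.7: (values of A1 and A2, support, most frequent A3 value, confidence).
association_table: List[Tuple[Tuple[int, int], float, int, float]] = [
    ((1, 1), 0.14, 2, 0.43),
    ((1, 2), 0.03, 1, 0.62),
    ((1, 3), 0.06, 1, 0.75),
    ((2, 1), 0.21, 3, 0.45),
    ((2, 2), 0.11, 2, 0.38),
    ((2, 3), 0.17, 1, 0.66),
    ((3, 1), 0.01, 2, 0.49),
    ((3, 2), 0.04, 3, 0.81),
    ((3, 3), 0.23, 2, 0.73),
]


def acv_from_table(table: List[Tuple[Tuple[int, int], float, int, float]]) -> float:
    """Sum of support * confidence over all rows of an association table."""
    return sum(supp * conf for _, supp, _, conf in table)


total_support = sum(supp for _, supp, _, _ in association_table)
acv_value = acv_from_table(association_table)
```

For this table the supports add up to 1 and the resulting ACV is 0.5775.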

Definition 3.7 Consider a combination $(T, H)$ for inclusion as a directed hyperedge of the association hypergraph $\mathcal{H}$, where $|T| \ge 1$. For $\gamma \ge 1$, we say that $(T, H)$ is $\gamma$-significant if

$$ACV(T, H) \ge \gamma \cdot \max_{v \in T} \{ACV(T - \{v\}, H)\}.$$

If $(T, H)$ is $\gamma$-significant, then we include this directed hyperedge in $\mathcal{H}$. Otherwise we skip this combination and proceed to the next one. The weight of a directed hyperedge $(T, H)$ is set to $ACV(T, H)$.

The association table for the directed hyperedge $(\{A_1, A_2\}, \{A_3\})$ is constructed as follows. $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\})$ is computed by counting observations in the database for which $A_1$'s value is $v_1$ and $A_2$'s value is $v_2$. $\mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(A_3, v_3^*)\})$ is then computed by counting, among such observations, those for which $A_3$'s value is $v_3^*$, where $v_3^*$ is the most frequent value for $A_3$ among them. This process is repeated for all possible values that $A_1$ and $A_2$ can take from $\mathcal{V}$.

Theorem 3.8 Let $A$, $B$, and $X$ be attributes over common observations. Then the following holds:

1. $ACV(\{A\}, \{X\}) \ge ACV(\emptyset, \{X\})$.
2. $ACV(\{A, B\}, \{X\}) \ge \max\{ACV(\{A\}, \{X\}), ACV(\{B\}, \{X\})\}$.

Proof We prove part (1) only, writing $A_1$ for $A$ and $A_3$ for $X$; the proof of part (2) is similar. Assume that the symbolic representations of $A_1$ and $A_3$ are over the alphabet $\{1, 2, \ldots, k\}$. Since $A_1$ and $A_3$ have common observations, all the observations of $A_1$ and $A_3$ contribute towards building the association table $AT$ of the combination $(\{A_1\}, \{A_3\})$. Suppose that there are $d$ rows in $AT$. Let $\mathrm{Maj}(d)$ denote the number of times the most frequent value of $A_3$ occurs in these rows. Then, the fraction of the most frequent value of $A_3$ is $\mathrm{Maj}(d)/d$. W.l.o.g., assume that the most frequent value of $A_3$ is 1.

Out of $d$ rows in $AT$, let $d_i$ of them have value $i$ for $A_1$. Thus, we have $\sum_{i=1}^{k} d_i = d$. Assume that, for each $1 \le i \le k$, the number of rows in $AT$ such that $A_1$ takes value $i$ and $A_3$ takes its most frequent value, 1, is $a_i$. Then, we have $\sum_{i=1}^{k} a_i = \mathrm{Maj}(d)$. For every $1 \le i \le k$, restricting $AT$ to all rows in which $A_1$ takes value $i$, let $\mathrm{Maj}(d_i)$ denote the number of rows the most frequent value of $A_3$ belongs to in this restricted $AT$.

By definition, $ACV(\{A_1\}, \{A_3\})$ is given by $\sum_{i=1}^{k} (d_i/d) \cdot (\mathrm{Maj}(d_i)/d_i)$, which simplifies to $\sum_{i=1}^{k} \mathrm{Maj}(d_i)/d$. Next, notice that for every $1 \le i \le k$, $\mathrm{Maj}(d_i) \ge a_i$, since the most frequent value of $A_3$ out of all rows of $AT$ that have $A_1 = i$ must occur at least $a_i$ times. It follows that $\sum_{i=1}^{k} \mathrm{Maj}(d_i) \ge \sum_{i=1}^{k} a_i = \mathrm{Maj}(d)$. Hence, we get that

$$ACV(\{A_1\}, \{A_3\}) = \sum_{i=1}^{k} \mathrm{Maj}(d_i)/d \ge \mathrm{Maj}(d)/d = ACV(\emptyset, \{A_3\}).$$

This completes the proof.

3.3 Association-Based Similarity Between Multi-Valued Attributes

Undirected hypergraphs have been used to cluster binary attributes and subsets of binary attributes. Han et al. [HKKM98] used undirected hypergraphs where nodes represent binary attributes and hyperedges represent subsets of attributes. They applied an undirected hypergraph clustering algorithm to identify clusters of similar attributes. Ozdal et al. [OA04] also used undirected hypergraphs where nodes represent patterns (i.e., subsets of attributes) and hyperedges represent relationships among patterns, and proposed a clustering approach for these hypergraph models. Lent et al. [LSW97] used clustering to group association rules involving multi-valued attributes. This grouping was done based on certain attribute conditions that segment the observations in the database.

We propose similarity notions that measure association-based similarity between any two nodes of the association hypergraph. We can imagine two nodes $A$ and $B$ to be in-similar if significantly many directed hyperedges that contain $A$ in their head set lead to valid directed hyperedges that contain $B$ in their head set when $A$ is replaced by $B$. Here, $A$ and $B$ are in-similar since they both share similar incoming directed hyperedges. Likewise, $A$ and $B$ can be regarded as out-similar by relating to the tail set instead of the head set of directed hyperedges. These similarity notions are then used to define clustering of multi-valued attributes.

Notation 3.9 Let $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2.

1. For any $A \in V$, $out_{\mathcal{H}}(A)$ denotes the set of all directed hyperedges of $\mathcal{H}$ whose tail set contains $A$.
2. For any $A \in V$, $in_{\mathcal{H}}(A)$ denotes the set of all directed hyperedges of $\mathcal{H}$ whose head set contains $A$.
3. For any $A_1, A_2 \in V$ and $e = (T, H) \in out_{\mathcal{H}}(A_1)$, $e|_{T: A_1 \leftarrow A_2}$ denotes the directed hyperedge $(T', H')$ whose head set $H'$ is $H$ and whose tail set $T'$ is formed from $T$ by replacing node $A_1$ by node $A_2$ (i.e., $H' = H$ and $T' = (T - \{A_1\}) \cup \{A_2\}$). For any set of directed hyperedges $S$ and nodes $A_1$ and $A_2$, $S|_{T: A_1 \leftarrow A_2}$ denotes the set $\bigcup_{e \in S} \{e|_{T: A_1 \leftarrow A_2}\}$.
4. For any $A_1, A_2 \in V$ and $e = (T, H) \in in_{\mathcal{H}}(A_1)$, $e|_{H: A_1 \leftarrow A_2}$ denotes the directed hyperedge $(T', H')$ whose tail set $T'$ is $T$ and whose head set $H'$ is formed from $H$ by replacing node $A_1$ by node $A_2$ (i.e., $T' = T$ and $H' = (H - \{A_1\}) \cup \{A_2\}$). For any set of directed hyperedges $S$ and nodes $A_1$ and $A_2$, $S|_{H: A_1 \leftarrow A_2}$ denotes the set $\bigcup_{e \in S} \{e|_{H: A_1 \leftarrow A_2}\}$.

Notation 3.10 Let $A_1$ and $A_2$ be attributes and $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2. Let $\emptyset$ denote the empty directed hyperedge.

1. $out_{\mathcal{H}}(A_1) \otimes out_{\mathcal{H}}(A_2)$ is the set of directed hyperedge pairs $(e, f)$ s.t. $e \in out_{\mathcal{H}}(A_1)$, $f \in out_{\mathcal{H}}(A_2)$, and $e = f|_{T: A_2 \leftarrow A_1}$. The definition of $in_{\mathcal{H}}(A_1) \otimes in_{\mathcal{H}}(A_2)$ is similar.
2. $out_{\mathcal{H}}(A_1) \oplus out_{\mathcal{H}}(A_2)$ is the union of the following sets of directed hyperedge pairs:
   - $out_{\mathcal{H}}(A_1) \otimes out_{\mathcal{H}}(A_2)$,
   - $(e, \emptyset)$ s.t. $e \in out_{\mathcal{H}}(A_1)$ and $e \ne f|_{T: A_2 \leftarrow A_1}$ for each $f \in out_{\mathcal{H}}(A_2)$, and
   - $(\emptyset, f)$ s.t. $f \in out_{\mathcal{H}}(A_2)$ and $e|_{T: A_1 \leftarrow A_2} \ne f$ for each $e \in out_{\mathcal{H}}(A_1)$.
   The definition of $in_{\mathcal{H}}(A_1) \oplus in_{\mathcal{H}}(A_2)$ is similar.

3.3.1 In-Similarity and Out-Similarity

In Definition 3.11, we define the similarity notions in-similarity and out-similarity for any pair $A_1, A_2$ of attributes. The in-similarity of attributes $A_1$ and $A_2$, denoted by $\text{in-sim}_{\mathcal{H}}(A_1, A_2)$, is the weighted fraction of directed hyperedges $e \in in_{\mathcal{H}}(A_1) \cup in_{\mathcal{H}}(A_2)$ such that switching $A_1$ to $A_2$ in the head set of $e$ results in another directed hyperedge; the weights are the association confidence values of directed hyperedges. Similarly, the out-similarity of attributes $A_1$ and $A_2$, denoted by $\text{out-sim}_{\mathcal{H}}(A_1, A_2)$, is the weighted fraction of directed hyperedges $e \in out_{\mathcal{H}}(A_1) \cup out_{\mathcal{H}}(A_2)$ such that switching $A_1$ to $A_2$ in the tail set of $e$ results in another directed hyperedge.

Definition 3.11 Let $A_1$ and $A_2$ be attributes and $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2. The following similarity notions are defined for $A_1$ and $A_2$:

1. $$\text{out-sim}_{\mathcal{H}}(A_1, A_2) = \frac{\sum_{(e, f) \in out_{\mathcal{H}}(A_1) \otimes out_{\mathcal{H}}(A_2)} \min\{ACV(e), ACV(f)\}}{\sum_{(e, f) \in out_{\mathcal{H}}(A_1) \oplus out_{\mathcal{H}}(A_2)} \max\{ACV(e), ACV(f)\}}.$$

2. $$\text{in-sim}_{\mathcal{H}}(A_1, A_2) = \frac{\sum_{(e, f) \in in_{\mathcal{H}}(A_1) \otimes in_{\mathcal{H}}(A_2)} \min\{ACV(e), ACV(f)\}}{\sum_{(e, f) \in in_{\mathcal{H}}(A_1) \oplus in_{\mathcal{H}}(A_2)} \max\{ACV(e), ACV(f)\}}.$$

Example 3.12 Suppose $\mathcal{H}$ has directed hyperedges $a = (\{A_1, A_3\}, \{A_6\})$, $b = (\{A_1, A_4\}, \{A_6\})$, $c = (\{A_2, A_3\}, \{A_6\})$, $d = (\{A_2, A_4, A_5\}, \{A_6\})$, and $e = (\{A_4, A_5\}, \{A_6\})$, where $A_1$, $A_2$, $A_3$, $A_4$, $A_5$, and $A_6$ are attributes. Let the ACVs of $a$, $b$, $c$, $d$, and $e$ be 0.4, 0.5, 0.6, 0.7, and 0.8, respectively. Then, we have $out_{\mathcal{H}}(A_1) \otimes out_{\mathcal{H}}(A_2) = \{(a, c)\}$, $out_{\mathcal{H}}(A_1) \oplus out_{\mathcal{H}}(A_2) = \{(a, c), (b, \emptyset), (\emptyset, d)\}$, and so the weighted $\text{out-sim}_{\mathcal{H}}(A_1, A_2) = \frac{0.4}{0.6 + 0.5 + 0.7} = 0.22$.

3.3.2 Clusters of Similar Attributes

We define below the notion of a similarity graph induced by any subset $\mathcal{S}$ of attributes that assigns every attribute pair $\{A_1, A_2\}$ in $\mathcal{S}$ an undirected edge whose weight depends on the in-similarity and the out-similarity of the pair.

Definition 3.13 Let $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2. Given any collection $\mathcal{S}$ of attributes, a similarity graph $SG(\mathcal{S}) = (V', E')$ induced by $\mathcal{S}$ in $\mathcal{H}$ is an undirected, weighted, complete graph whose node set $V'$ is $\mathcal{S}$ and edge set $E'$ contains all attribute pairs in $\mathcal{S}$ such that, for every edge $\{A_1, A_2\} \in E'$, its weight $d(A_1, A_2)$ is defined as

$$1 - \frac{\text{in-sim}_{\mathcal{H}}(A_1, A_2) + \text{out-sim}_{\mathcal{H}}(A_1, A_2)}{2}.$$

Our objective is to determine a partition of $\mathcal{S}$ into subsets of attributes such that attributes within each subset are highly similar in their associative characteristics. The $t$-clustering algorithm by Gonzalez [Gon85], presented in Algorithm 2 (Chapter 2), finds such a partition of $\mathcal{S}$ by designating some $t$ attributes as cluster centers. The algorithm takes the parameter $t$ in the input and assigns each attribute to its closest cluster center. This is a factor 2-approximation algorithm for minimizing the diameter of the $t$-clustering, assuming that the distances (i.e., weights) satisfy the metric properties. Here, the diameter of a $t$-clustering is the maximum distance between any two data points that are within the same cluster.

If our original association hypergraph $\mathcal{H}$ has $n$ vertices and $m$ directed hyperedges, then the construction of the similarity graph $SG(\mathcal{S})$ takes time $O(m^2)$, and the computation of the $t$-clustering of vertices in $SG(\mathcal{S})$ takes additional time $O(t\,|\mathcal{S}|)$.
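The out-similarity that enters the edge weights of the similarity graph can be sketched in code and checked against Example 3.12. This is an illustrative sketch under assumptions: hyperedges are encoded as (tail, head) pairs of frozensets mapped to their ACVs, the empty directed hyperedge is given ACV 0, a zero denominator yields similarity 0, and the name `out_sim` is hypothetical. Hyperedges whose tail contains both attributes are not treated specially.

```python
from typing import Dict, FrozenSet, Set, Tuple

Edge = Tuple[FrozenSet[str], FrozenSet[str]]   # (tail set, head set)


def swap_in_tail(e: Edge, old: str, new: str) -> Edge:
    """Replace node `old` by node `new` in the tail set of e."""
    tail, head = e
    return ((tail - {old}) | {new}, head)


def out_sim(acv: Dict[Edge, float], a1: str, a2: str) -> float:
    out1 = [e for e in acv if a1 in e[0]]
    out2 = [e for e in acv if a2 in e[0]]
    numerator = 0.0
    denominator = 0.0
    matched: Set[Edge] = set()
    for e in out1:
        f = swap_in_tail(e, a1, a2)
        if f in acv:                       # pair (e, f) of the matched set
            numerator += min(acv[e], acv[f])
            denominator += max(acv[e], acv[f])
            matched.add(f)
        else:                              # pair (e, empty); assumption: ACV(empty) = 0
            denominator += acv[e]
    for f in out2:
        if f not in matched:               # pair (empty, f)
            denominator += acv[f]
    return numerator / denominator if denominator else 0.0


def edge(tail: str, head: str) -> Edge:
    return (frozenset(tail.split()), frozenset(head.split()))


# Hyperedges a, b, c, d, e of Example 3.12 with their ACVs.
example = {
    edge("A1 A3", "A6"): 0.4,
    edge("A1 A4", "A6"): 0.5,
    edge("A2 A3", "A6"): 0.6,
    edge("A2 A4 A5", "A6"): 0.7,
    edge("A4 A5", "A6"): 0.8,
}
similarity = out_sim(example, "A1", "A2")
```

The in-similarity can be sketched the same way by swapping nodes in the head set instead of the tail set.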

CHAPTER4 COMPUTATIONALPROBLEMS Aninterestingquestionthatarisesinthecontextofassociationrulesisaboutdevisinga methodologytobuildclassicationrules.Aclassicationrulesuersfrom overtting when therulecloselymodelsaparticularcharacteristicofthetrainingdatasetand,duetothis specicity,therulefailstopredictthevalueoftheattributeinanunseentestdataset.A classicationrulesuersfrom undertting whentherulemodelsnoparticularcharacteristic ofthetrainingdatasetand,duetothisgenerality,therulefailstopredictthevalueofthe attributebothinthetrainingdatasetandinanunseentestdataset.Liuetal.[LHM98] addressedtheaboveproblemsbyminingassociationrulesthathaveonlyoneattributein theirconsequent,proposingapruningbasedalgorithmtogenerateclassicationrules,and testingonanunseendataset.Bayardo[Bay97]proposedvariousmethodsthatagainuse pruningtoreducethenumberofassociationrulesdiscoveredbystandardassociationrule miningalgorithms.Butthequalityoftherulesisunclearastheyhavenotbeentested onanunseendataset.Liuetal.[LHM99]addressedtheaboveproblemsbyintroducing multipleminimumsupportsduringminingofassociationrules. Inthissection,weusethedirectedhypergraphbasedmodeltorstproposealgorithms toidentifyasmallsubsetofattributesthatinuencethecharacteristicsofalmostallother attributesandthenpresenttheassociation-basedclassierthatcanbeusedtopredictthe valuesofattributes. 4.1LeadingIndicators Aleadingindicator X foranyset S ofattributesisasubsetof S suchthatknowingonly thevaluesfortheattributesin X allowsustoinferthevalueforallattributesin S)-252(X Motivatedbythenotionofadominatingsetforanygraphthatessentiallycapturesthe 30

PAGE 40

propertyofasubsetofnodescoveringallthenodesofthegraphbyatmostoneedge,we denebelowthenotionofadominatorforasetofverticesofanyassociationhypergraph. Ourhypothesisisthatadominatorfornodescorrespondingtotheset S ofattributesin theassociationhypergraphmodelingattributerelationshipsgivesaleadingindicatorfor S Denition4.1 Adominatorforaset S ofverticesinanassociationhypergraph H = V;E isaset X V suchthat,forevery u 2S)-155(X ,thereisadirectedhyperedge e = T;H 2 E suchthat T X and u 2 H .Thatis,eachnode u 2S)-237(X iscoveredusingonlydirected hyperedgeswhosetailsetisfromtheset X Wenextprovidetwogreedyalgorithmsforidentifyingasubsetofattributesknownasa leadingindicatorthatinuencesthevaluesofalmostallotherattributes.Therstgreedy algorithmisbasedonanadaptationofthegreedy O log n -approximationalgorithmfor computingaminimumcardinalitydominatingsetingraphs.Thesecondgreedyalgorithm isbasedonanadaptionofthegreedy O log n -approximationalgorithmforcomputinga minimumcostsetcover. 4.1.1AnAdaptationofGraphDominatingSetApproximationAlgorithm Algorithm5isagreedyalgorithmforcomputingadominatorforanyset S ofvertices inanassociationhypergraph H = V;E .Itisanadaptationofthegreedy O log n approximationalgorithmforcomputingaminimumcardinalitydominatingsetingraphs. Inordertominimizethesizeofthedominator,thefollowinggreedyheuristicisapplied:for everynode u thatisnotpartofthedominatingsetyet,thealgorithmcomputesthenode eectiveness u thatreects u 'scoveringability.Duringeachiterationofthealgorithm untilallthenodesinthegrapharecovered,thenodewiththehighesteectivenessvalue isaddedtothedominatorset.Thealgorithmrunsintime O jSjj E j 4.1.2AnAdaptationofSetCoverApproximationAlgorithm Algorithm6computesadominatorforanyset S ofverticesofanyassociationhypergraph H = V;E .Itisanadaptationofthegreedy O log n -approximationalgorithmforSet Cover.Thealgorithmmaintainsavariable DomSet tostorethepartiallyconstructed 31

PAGE 41

dominatorsetof H andavariable CoveredSet tostorethesetofverticescoveredbythose in DomSet .Thisisaniterativealgorithmsuchthatineachiterationsomesubsetofvertices t isgreedilychosenandmadepartof DomSet .Thealgorithmmaintainsavariable T that storessubsetsofverticesforpossibleinclusionin DomSet .Initially, T isthecollectionof alltailsetsofdirectedhyperedgesin H ,i.e., T = f T e j e 2 E g Input :Aset S ofverticesandanassociationhypergraph H = V;E Output :Adominator DomSet fortheset S ofvertices. begin 1 DomSet ; ; 2 CoveredSet ; ; 3 while CoveredSet 6 = S do 4 foreach vertex u 2 V )]TJ/F15 10.9091 Tf 10.909 0 Td [(DomSet do 5 if u 62 CoveredSetand u 2S then 6 u 1; 7 end 8 else 9 u 0; 10 end 11 u u + X v 62 CoveredSet ^ v 2 S L u;v ; 12 where L u;v max e : u 2 T e ^ v 2 H e w e j T e )]TJ/F20 7.9701 Tf 6.586 0 Td [(DomSet j 13 end 14 Let u 0 2 V besuchthat u 0 =max u 62 DomSet u ; 15 DomSet DomSet [f u 0 g ; 16 CoveredSet CoveredSet [f u 0 g[f v 2Sj9 e 2 E s.t. v 2 H e 17 and T e DomSet g ; 18 end 19 return DomSet ; 20 end 21 Algorithm5:Agreedyalgorithmforcomputingdominatorsindirectedhypergraphswhich isanadaptationofgraphdominatingsetapproximation. Thealgorithmiteratesuntilalltheverticesintheset S arecovered.Foreverysubset t 2 T ,ameasure t isdenedthatreects t 'scoveringabilityinthefollowingsense: t containsallnewverticesthatcanbecoveredbyincluding t in DomSet .Ineach iterationinLine5, t iscomputedforevery t 2 T asfollows.First, t isassigned tothenumberofverticesthatarein t aswellas S ,butnotyetin CoveredSet Lines6{ 12.Next, t addsupthenumberofverticesin S)]TJ/F22 10.9091 Tf 19.801 0 Td [(CoveredSet that t coversviasome directedhyperedge e whosetailset T e ispartof t Lines13{17.Specically,forthelatter 32

PAGE 42

contributionin t ,everydirectedhyperedge e istraversedandifboth T e t and H e 2S)]TJ/F22 10.9091 Tf 29.558 0 Td [(CoveredSet ,then H e iscountedinthecomputationof t .Clearly,allsets t 2 T whoseeectiveness t iszeroareinsignicant;Line15discardssuchsetsfrom T sothattheyarenotconsideredinlateriterations.Towardstheendofeveryiteration, theset t ofmaximumeectivenessisincludedin DomSet andthevariable CoveredSet is updatedtoaccountforthechanged DomSet Lines20{22. Sinceeachiterationtakes O j E j 2 timeandthereare jSj iterations,thecomputation timeofthealgorithmis O jSjj E j 2 Wenowprovidetwoenhancementswiththegoalofimprovingthecomputationtime ofAlgorithm6andalsoofreducingthesizeofthedominatorsetoutputbythisalgorithm.Enhancement1,presentedinAlgorithm7,shouldbeaddedbetweenLines20{21, andEnhancement2,presentedinAlgorithm8,shouldbeaddedbetweenLines22{24of Algorithm6. Enhancement1. DuringeachiterationofthewhileloopinLine5,Algorithm6includes in DomSet asubset t 0 whoseeectiveness t 0 isthehighestamongall t 2 T .This enhancementconsidersthescenariowhentherearemorethanonepossiblecandidate t 0 in someiteration.Insuchacase,thecandidatesubset t 0 thatcontributestheleastnumberof elementsto DomSet i.e.,thatminimizes j t 0 )]TJ/F22 10.9091 Tf 11.13 0 Td [(DomSet j mightbebetterthananyother candidate t 0 0 Enhancement2. WeknowthatAlgorithm6updatesthecoverageof DomSet inevery iterationbygoingovereachdirectedhyperedgewhosetailsetisapartof DomSet and addingverticesintheheadsetthatarein S to CoveredSet .Byremovingasubset t that isalreadyapartofthe DomSet from T ,theenhancementsavesthecomputationtime requiredtogooverallthedirectedhyperedgesinthesubsequentiterationsofthealgorithm tocompute t 4.2Association-BasedClassier Let S bethesetofattributes A 1 ;A 2 ;:::;A t andlettheseattributestakevalues v 1 ;v 2 ;:::;v t respectively.Let T beanothersetofattributes,disjointfrom S .Theassociation-based 33

PAGE 43

Input :Aset S ofverticesandanassociationhypergraph H = V;E Output :Adominator DomSet fortheset S ofvertices. begin 1 DomSet ; ; 2 CoveredSet ; ; 3 T f T e j e 2 E g ; 4 while CoveredSet 6 = S do 5 foreach set t 2 T do 6 t 0; 7 foreach vertex u 2 t do 8 if u= 2 CoveredSet and u 2S then 9 t t +1; 10 end 11 end 12 foreach directedhyperedge e suchthat T e t do 13 if H e = 2 CoveredSet and H e 2S then 14 t t +1; 15 end 16 end 17 if t ==0 then T T )-222(f t g ; 18 end 19 Let t 0 besuchthat t 0 max t 2 T t ; 20 DomSet DomSet [ t 0 ; 21 CoveredSet CoveredSet [f t 0 g[f v 2Sj9 e 2 E s.t. v 2 H e and 22 T e DomSet g ; 23 end 24 return DomSet ; 25 end 26 Algorithm6:Agreedyalgorithmforcomputingdominatorsindirectedhypergraphswhich isanadaptationofsetcoverapproximation. foreach set t 2 T do 1 if t 0 == t then 2 if j t 0 )]TJ/F22 10.9091 Tf 10.91 0 Td [(DomSet j > j t )]TJ/F22 10.9091 Tf 10.909 0 Td [(Domset j then 3 t 0 t ; 4 end 5 end 6 end 7 Algorithm7:Enhancement1. foreach set t 2 T do 1 if t DomSet then 2 T T )]TJ/F22 10.9091 Tf 10.909 0 Td [(t ; 3 end 4 end 5 Algorithm8:Enhancement2. 34

classifier determines the values of all attributes in $\mathcal{T}$ given the values of attributes in $\mathcal{S}$. For this problem, we will assume that $\mathcal{S}$ is a dominator for $\mathcal{T}$ in the association hypergraph $H$ and so $\mathcal{S}$ can be computed using the algorithm presented in Section 4.1. This assumption stands on our earlier stated hypothesis that a dominator for any set of attributes is also a leading indicator for the set.

Input: An association hypergraph $H = (V, E)$ modeling attribute relationships, a set $\mathcal{T}$ of attributes, and a set $\mathcal{S} = \{(A_1, v_1), (A_2, v_2), \ldots, (A_t, v_t)\}$, where $A_1, A_2, \ldots, A_t$ are attributes and $v_1, v_2, \ldots, v_t \in \mathcal{V}$ are their respective values.
Output: An assignment of values that assigns each attribute $Y \in \mathcal{T}$ its best classified value $y^*$ and the classification confidence val[$y^*$] associated with every such assignment $y^*$ to $Y$.
begin
  foreach attribute $Y \in \mathcal{T}$ do
    for $y \leftarrow 1$ to $k$ do
      val[$y$] $\leftarrow 0$;
    end
    foreach directed hyperedge $e = (T, H) \in E$ with $H = \{Y\}$ and $T \subseteq \{A_1, A_2, \ldots, A_t\}$ do
      Let $T$ be $\{A_1, A_2\}$ and let $y$ be the most frequent value of $Y$ given "$A_1 = v_1$" and "$A_2 = v_2$";
      val[$y$] $\leftarrow$ val[$y$] $+$ Supp$(\{(A_1, v_1), (A_2, v_2)\}) \times$ Conf$(\{(A_1, v_1), (A_2, v_2)\} \stackrel{mva}{\Longrightarrow} \{(Y, y)\})$;
    end
    Let $y^* \in \{1, \ldots, k\}$ be such that val[$y^*$] $= \max_{y \in \{1, \ldots, k\}}$ val[$y$];
    val[$y^*$] $\leftarrow$ val[$y^*$] $/ \sum_{y \in \{1, \ldots, k\}}$ val[$y$];
    Output "$(Y, y^*)$, val[$y^*$]";
  end
end
Algorithm 9: Association-based classifier.

To compute the value for any attribute $Y \in \mathcal{T}$ given only the values $v_1, v_2, \ldots, v_t$ of a set $\mathcal{S}$ of the attributes, we iterate over all directed hyperedges of $H$. For each directed hyperedge $e$ whose tail set is a subset of the attributes in $\mathcal{S}$ and whose head set is $\{Y\}$, we examine the AT associated with $e$. Specifically, assume that $e = (\{A_1, A_2\}, \{Y\})$. Using the AT of $e$, we find Supp$(\{(A_1, v_1), (A_2, v_2)\})$ and Conf$(\{(A_1, v_1), (A_2, v_2)\} \stackrel{mva}{\Longrightarrow} \{(Y, y)\})$ where $y \in \mathcal{V}$ is the most frequent value of $Y$ given that $A_1$ takes value $v_1$ and $A_2$ takes value $v_2$. The contribution of the directed hyperedge $e$ in the value assignment $y$ of $Y$ is Supp$(\{(A_1, v_1), (A_2, v_2)\}) \times$ Conf$(\{(A_1, v_1), (A_2, v_2)\} \stackrel{mva}{\Longrightarrow} \{(Y, y)\})$. The total contribution

of all directed hyperedges in the value assignment $y$ of $Y$ is denoted by val[$y$]. At the end of all iterations, we choose the value $y^*$ of $Y$ for which val[$y^*$] is maximum. The classification confidence associated with the value assignment $y^*$ to $Y$ is then the normalized val[$y^*$] $\in [0, 1]$.

We see that the contributions from all mva-type association rules of both relevant directed edges and relevant directed hyperedges are taken into consideration during the process of determination of the value of an attribute. This approach avoids overfitting by not fixing the value assignment for an attribute by looking only into a particular mva-type association rule that closely models some characteristic of the data set. Also, this approach avoids underfitting as it considers the appropriate weights for all mva-type association rules before fixing the value assignment for an attribute.

Algorithm 9 describes this solution approach. The input is an association hypergraph $H = (V, E)$ modeling attribute relationships, a set $\mathcal{T}$ of attributes, and a set $\mathcal{S} = \{(A_1, v_1), (A_2, v_2), \ldots, (A_t, v_t)\}$, where $A_1, A_2, \ldots, A_t$ are attributes and $v_1, v_2, \ldots, v_t \in \mathcal{V}$ are their respective values. The algorithm returns an assignment of values that assigns each attribute $Y \in \mathcal{T}$ its best classified value $y^*$ and also returns the classification confidence val[$y^*$] associated with every such assignment $y^*$ to $Y$. Its running time is $O(k^2 |\mathcal{T}||E|)$.
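The voting scheme of Algorithm 9 can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the implementation used in the experiments: the function name `classify`, the edge representation `(tail_attributes, head_attribute, table)`, and the layout of `table` (mapping a tuple of tail values to a hypothetical `(support, confidence, most_frequent_head_value)` triple standing in for a row of the AT) are all introduced here for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

# Assumed stand-in for one AT row: (support, confidence, most frequent head value).
Row = Tuple[float, float, int]
Edge = Tuple[Tuple[str, ...], str, Dict[Tuple[Hashable, ...], Row]]


def classify(
    target: str, known: Dict[str, Hashable], edges: List[Edge]
) -> Tuple[Optional[int], float]:
    """Return (best value, normalized confidence) for `target` given known values."""
    val: Dict[int, float] = defaultdict(float)
    for tail, head, table in edges:
        # Use only hyperedges that point at the target and whose tail is fully known.
        if head != target or not all(a in known for a in tail):
            continue
        row = table.get(tuple(known[a] for a in tail))
        if row is None:
            continue
        support, confidence, y = row
        val[y] += support * confidence  # contribution of this hyperedge
    if not val:
        return None, 0.0
    total = sum(val.values())
    best = max(sorted(val), key=lambda y: val[y])
    return best, (val[best] / total if total > 0 else 0.0)
```

Summing the contributions over all relevant hyperedges, instead of trusting the single strongest rule, is the design choice the text motivates above.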

CHAPTER 5
EXPERIMENTATION

The Standard & Poor's (S&P) 500 is a market-value-weighted index of the stock prices of some 500 large publicly held companies. We used Yahoo Finance [Yah10] to obtain stock information for all financial time-series in S&P 500. In this section, all analysis is based on the daily closing stock price information from Jan 1, 1995 to Dec 21, 2009. We had to restrict the start date to Jan 1, 1995 since a number of financial time-series in the current S&P 500 started trading only from the mid 90s. As a result of this cleanup of our data, the number of financial time-series in our analysis is 346. These time-series belong to the following industrial sectors: Basic Materials (BM), Capital Goods (CG), Conglomerates (C), Consumer Cyclical (CC), Consumer Noncyclical (CN), Energy (E), Financial (F), Healthcare (H), Services (SV), Technology (T), Transportation (TP), and Utilities (U). Each industrial sector is subdivided into sub-sectors, e.g., Technology (T) has 11 sub-sectors including Communications Equipment, Computer Hardware, Computer Networks, Computer Peripherals, Computer Services, Computer Storage Devices, Electronic Instr. and Controls, Office Equipment, Scientific and Technical Instr., Semiconductors, and Software and Programming. The total number of sub-sectors over the entire sectors is 104.

5.1 Association Hypergraph Modeling

5.1.1 Discretization

We now describe how to transform a financial time-series data set into a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ suitable for the association hypergraph modeling. For each financial time-series in the data set, we create a delta time-series, which is a list of real numbers whose $i$'th entry is the fractional change in the closing stock price of the $(i+1)$'th day relative to the closing stock price of the $i$'th day. We then compute a $k$-threshold vector, for some integer $k \ge 2$, for each

financial time-series. A $k$-threshold vector for a financial time-series $A$ is a $(k-1)$-tuple $\langle a_1, a_2, \ldots, a_{k-1} \rangle$ such that, for every $1 \le i \le k$, we have $a_{i-1}$
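The delta time-series construction described in Section 5.1.1 can be sketched in Python as follows. The function name `delta_series` and the plain list-of-floats input are assumptions made for illustration; only the fractional-change definition comes from the text.

```python
from typing import List, Sequence


def delta_series(closes: Sequence[float]) -> List[float]:
    """Fractional change of each day's closing price relative to the previous day."""
    # Entry i is (close[i + 1] - close[i]) / close[i].
    return [(nxt - cur) / cur for cur, nxt in zip(closes, closes[1:])]
```

A series of n closing prices yields n - 1 delta values; the sketch assumes no closing price is zero.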

with a mean ACV of 0.437 and the configuration C2 leads to 109,810 directed edges with a mean ACV of 0.288 and 274,048 2-to-1 directed hyperedges with a mean ACV of 0.288.

5.2 Association Characteristics of Financial Time-Series

Figure 5.1(a) shows the weighted in-degree distribution of the nodes of the association hypergraph for configuration C1. Here the weighted in-degree of a node $v$ is $\sum_{e : \{v\} = H(e)} w(e)$, i.e., the sum of weights of all hyperedges entering $v$. Figure 5.1(b) shows the weighted out-degree distribution of the nodes of the association hypergraph for configuration C1. Here the weighted out-degree of a node $v$ is $\sum_{e : v \in T(e)} \frac{w(e)}{|T(e)|}$, i.e., the sum of normalized weights of all hyperedges leaving $v$. The mean ACV of directed edges is 0.43 and of 2-to-1 directed hyperedges is 0.44.

Using our directed hypergraph based modeling, we are able to deduce some interesting facts about the current stock market that would have been difficult to validate by other means. Our findings concern the relationship between producers and consumers in the stock market.

Roughly speaking, a producer is any entity (company) that has very few dependencies on others for its resource requirements. A producer thrives mostly on its own or has few resource requirements compared to the rest. Some notable sectors that are likely to have entities in the producer category are the BM, CG, E, and SV sectors. On the other hand, a consumer is an entity that is highly dependent on other entities or end-users for its own functioning. Sectors such as CC, CN, H, SV, and T are likely to have entities in the consumer category. A particular exception is the SV sector, in which there are both producers and consumers depending on the particular industry they belong to. For example, entities in the SV sector that deal with real estate operations (e.g., Kimco Realty Corporation) are mostly producers, whereas entities in the SV sector that provide basic services to the end user (e.g., Yahoo! Inc.) mostly satisfy the consumer category.

Time-series that are less dependent on other time-series for their resources (e.g., Producers) are likely to be more predictable in comparison to others. Why? Since these time-series do not depend on other time-series for raw materials and other basic requirements, their

Figure 5.1 Weighted degree distribution. (a) Weighted in-degree distribution and (b) Weighted out-degree distribution. The ids of nodes are on the $x$-axis and their weighted in-degree/out-degree values are on the $y$-axis.

value in the market can be estimated by analyzing the demand of the sources of raw materials on which they rely. The weighted in-degree of a node in our directed hypergraph model indicates the level of predictability of the corresponding time-series. The greater the weighted in-degree of a node is, the more predictable the corresponding time-series is. We see from Figure 5.1(a) that XOM (E sector) and GT (CC sector) have high weighted in-degree values in comparison to others from our chosen list. In the case of GT (The Goodyear Tire & Rubber Company), although the sector it belongs to is different from the ones we listed for the producer category, one of the basic materials this time-series depends on is rubber, which is a naturally occurring compound. We also experimentally found that, of the time-series with the top 25 weighted in-degree values, 72% belong to sectors BM (iron, gold, silver, and metal mining industries), E (oil, gas, and coal industries), and SV (real estate industries).

Time-series that have direct interactions with end-users for their products (e.g., Consumers) are more likely to be good predictors. Why? These time-series happen to be good predictors as they are affected by the immediate actions of end-users. The economy is driven by the end-user's requirements. By being in direct contact with the end-user, the performance of the consumer time-series in the market follows direct patterns of the behavior of the end-user.

The weighted out-degree of a node in our modeling indicates how much that node (time-series) can predict other time-series. The greater the weighted out-degree of a node is, the higher its ability to predict other time-series is. We see from Figure 5.1(b) that PG (CN sector) and JNJ (H sector) have high out-degree values in comparison to others from our chosen list. We also experimentally found that, of the time-series with the top 25 weighted out-degree values, 84% belong to sectors H (facilities, biotechnology, and drugs), SV (particularly, business, real estate, and restaurant services), and T (software, hardware, and semiconductors).

Table 5.1 shows, for each configuration, the directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors. For example, the best prediction directed edge for GT (The Goodyear Tire & Rubber Company) shown in Row 3 for C1 and C2 represents a relationship between GT and PPG (PPG

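Table 5.1 The directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors and for each configuration choice are shown. The sector information of each financial time-series is indicated in parenthesis (google finance).

| Row | Time-series | Configuration | Top directed edge | Top 2-to-1 directed hyperedge |
|---|---|---|---|---|
| 1 | EMN (BM) | C1 | PPG (BM) → EMN (BM) | AVY (BM), GT (CC) → EMN (BM) |
| | | C2 | PPG (BM) → EMN (BM) | BLL (BM), IFF (BM) → EMN (BM) |
| 2 | HON (CG) | C1 | TXT (C) → HON (CG) | CAT (CG), ITT (T) → HON (CG) |
| | | C2 | UTX (CG) → HON (CG) | BA (CG), ROK (T) → HON (CG) |
| 3 | GT (CC) | C1 | PPG (BM) → GT (CC) | DOW (BM), F (CC) → GT (CC) |
| | | C2 | PPG (BM) → GT (CC) | ETN (T), FMC (BM) → GT (CC) |
| 4 | PG (CN) | C1 | CL (CN) → PG (CN) | CLX (CN), K (CN) → PG (CN) |
| | | C2 | CL (CN) → PG (CN) | ABT (H), CPB (CN) → PG (CN) |
| 5 | XOM (E) | C1 | CVX (E) → XOM (E) | HES (E), SLB (E) → XOM (E) |
| | | C2 | CVX (E) → XOM (E) | COG (E), PEG (U) → XOM (E) |
| 6 | AIG (F) | C1 | C (F) → AIG (F) | BEN (F), PGR (F) → AIG (F) |
| | | C2 | C (F) → AIG (F) | AON (F), CI (F) → AIG (F) |
| 7 | JNJ (H) | C1 | MRK (H) → JNJ (H) | IFF (BM), SYY (SV) → JNJ (H) |
| | | C2 | MRK (H) → JNJ (H) | CL (CN), PEP (CN) → JNJ (H) |
| 8 | JCP (SV) | C1 | M (SV) → JCP (SV) | FDO (SV), GPS (SV) → JCP (SV) |
| | | C2 | M (SV) → JCP (SV) | COST (SV), HD (SV) → JCP (SV) |
| 9 | INTC (T) | C1 | LLTC (T) → INTC (T) | EMC (T), QCOM (T) → INTC (T) |
| | | C2 | XLNX (T) → INTC (T) | CTXS (T), QCOM (T) → INTC (T) |
| 10 | FDX (TP) | C1 | AXP (F) → FDX (TP) | EXPD (TP), ITT (T) → FDX (TP) |
| | | C2 | AXP (F) → FDX (TP) | EXPD (TP), BAC (F) → FDX (TP) |
| 11 | TE (U) | C1 | PGN (U) → TE (U) | PEG (U), SO (U) → TE (U) |
| | | C2 | AEP (U) → TE (U) | SO (U), TEG (U) → TE (U) |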

Industries, Inc.). This relationship may be interpreted in terms of GT procuring raw materials (e.g., precipitated silicas) from PPG for the manufacturing or processing of rubber. A more interesting many-to-one relationship is represented by the best prediction 2-to-1 directed hyperedge for GT shown in Row 3 for C1. Here, the 2-to-1 directed hyperedge for GT represents a relationship of GT with DOW (The Dow Chemical Company) and F (Ford Motor Company). This relationship may be interpreted in terms of GT procuring raw materials (e.g., polyurethane polymer) from DOW, whereas the relationship with F may be attributed towards F utilizing the products (e.g., tires) from GT. Thus, we see that 2-to-1 directed hyperedges provide meaningful and more interesting information than directed edges do.

Table 5.2 shows, for each configuration, the 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors. For example, in Row 5, for C1 the accuracy of HES (Hess Corporation) predicting XOM (Exxon Mobil Corporation) is 0.55 and SLB (Schlumberger Limited) predicting XOM is 0.54, but both of them together predicting XOM is 0.58. This shows that the combination of two financial time-series leads to a better predicting 2-to-1 directed hyperedge.
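The weighted in-degree and out-degree defined earlier in this section can be computed with a short Python sketch. The function name `weighted_degrees` and the representation of a hyperedge as `(tail, head, weight)` with a single head node are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed representation: (tail nodes, head node, weight w(e)).
WeightedEdge = Tuple[Tuple[str, ...], str, float]


def weighted_degrees(
    edges: Iterable[WeightedEdge],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (weighted in-degree, weighted out-degree) per node."""
    in_degree: Dict[str, float] = defaultdict(float)
    out_degree: Dict[str, float] = defaultdict(float)
    for tail, head, weight in edges:
        # In-degree: sum of weights of hyperedges entering the head node.
        in_degree[head] += weight
        # Out-degree: weight shared equally among the tail nodes.
        for node in tail:
            out_degree[node] += weight / len(tail)
    return dict(in_degree), dict(out_degree)
```

Dividing by the tail size means a node in a 2-to-1 hyperedge receives half of that hyperedge's weight, which matches the normalization in the out-degree definition.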
Table 5.2 The 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors and for each configuration choice are shown.

| Row | Time-series | Configuration | Top 2-to-1 directed hyperedge | Directed edge 1 | Directed edge 2 |
|---|---|---|---|---|---|
| 1 | EMN (BM) | C1 | AVY, GT → EMN: 0.52 | AVY → EMN: 0.49 | GT → EMN: 0.49 |
| | | C2 | BLL, IFF → EMN: 0.37 | BLL → EMN: 0.32 | IFF → EMN: 0.33 |
| 2 | HON (CG) | C1 | CAT, ITT → HON: 0.53 | CAT → HON: 0.5 | ITT → HON: 0.49 |
| | | C2 | BA, ROK → HON: 0.38 | BA → HON: 0.33 | ROK → HON: 0.33 |
| 3 | GT (CC) | C1 | DOW, F → GT: 0.51 | DOW → GT: 0.48 | F → GT: 0.47 |
| | | C2 | ETN, FMC → GT: 0.37 | ETN → GT: 0.33 | FMC → GT: 0.33 |
| 4 | PG (CN) | C1 | CLX, K → PG: 0.53 | CLX → PG: 0.5 | K → PG: 0.49 |
| | | C2 | ABT, CPB → PG: 0.36 | ABT → PG: 0.32 | CPB → PG: 0.32 |
| 5 | XOM (E) | C1 | HES, SLB → XOM: 0.58 | HES → XOM: 0.55 | SLB → XOM: 0.54 |
| | | C2 | COG, PEG → XOM: 0.37 | COG → XOM: 0.33 | PEG → XOM: 0.31 |
| 6 | AIG (F) | C1 | BEN, PGR → AIG: 0.54 | BEN → AIG: 0.51 | PGR → AIG: 0.51 |
| | | C2 | AON, CI → AIG: 0.37 | AON → AIG: 0.33 | CI → AIG: 0.33 |
| 7 | JNJ (H) | C1 | IFF, SYY → JNJ: 0.48 | IFF → JNJ: 0.45 | SYY → JNJ: 0.45 |
| | | C2 | CL, PEP → JNJ: 0.36 | CL → JNJ: 0.32 | PEP → JNJ: 0.31 |
| 8 | JCP (SV) | C1 | FDO, GPS → JCP: 0.51 | FDO → JCP: 0.48 | GPS → JCP: 0.48 |
| | | C2 | COST, HD → JCP: 0.37 | COST → JCP: 0.32 | HD → JCP: 0.33 |
| 9 | INTC (T) | C1 | EMC, QCOM → INTC: 0.55 | EMC → INTC: 0.52 | QCOM → INTC: 0.52 |
| | | C2 | CTXS, QCOM → INTC: 0.4 | CTXS → INTC: 0.35 | QCOM → INTC: 0.35 |
| 10 | FDX (TP) | C1 | EXPD, ITT → FDX: 0.52 | EXPD → FDX: 0.49 | ITT → FDX: 0.46 |
| | | C2 | EXPD, BAC → FDX: 0.37 | EXPD → FDX: 0.33 | BAC → FDX: 0.33 |
| 11 | TE (U) | C1 | PEG, SO → TE: 0.55 | PEG → TE: 0.52 | SO → TE: 0.52 |
| | | C2 | SO, TEG → TE: 0.4 | SO → TE: 0.35 | TEG → TE: 0.35 |

5.3 Association-Based Similarity

5.3.1 Comparison with Euclidean Similarity

Figure 5.2 compares in-similarity and out-similarity values with Euclidean similarity values for configuration C1. Here, the Euclidean similarity between any two financial time-series $A$ and $B$ is computed as follows. Let $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$ where $a_i$ and $b_i$ are the values that $A$ and $B$ take in the $i$'th observation. The Euclidean distance between $A$ and $B$ is defined as $ED(A, B) = \| \mathrm{normalized}(A) - \mathrm{normalized}(B) \|$ where for any vector $V = (v_1, v_2, \ldots, v_n)$, $\mathrm{normalized}(V) = (v_1 / \|V\|, v_2 / \|V\|, \ldots, v_n / \|V\|)$ and $\|V\| = \left( \sum_{i=1}^{n} v_i^2 \right)^{1/2}$. Now, the Euclidean similarity $ES(A, B)$ between $A$ and $B$ is defined as $ES(A, B) = 1 - \frac{1}{2} ED(A, B)$. Note that $ES(A, B)$ is a real value in the range $[0, 1]$ such that a higher value indicates a greater similarity. Figure 5.2 shows that Euclidean similarity does not differentiate between financial time-series pairs as distinctly as our similarity measures do. This could be because of the fact that Euclidean similarity accounts for pair-wise differences in price variations on a day-to-day basis whereas the similarity measures account for the closeness in being associated with common sets of financial time-series on an average basis.

5.3.2 Clusters of Financial Time-Series

Figure 5.3 shows a clustering of financial time-series for configuration C1. The clusters are obtained using the approach explained in Section 3.3.2. Here, the collection $\mathcal{S}$ is the set of all financial time-series in our data set. The value of parameter $t$ in the $t$-clustering algorithm is set to 104, which is the total number of sub-sectors over the entire sectors as pointed out at the beginning of Chapter 5. The first cluster center is picked from the sector T as this sector has the maximum number of financial time-series in our data set.

We experimentally verified that the weight function in Definition 3.13 satisfies the triangle inequality property, and hence the factor 2-approximation of the optimal diameter of the $t$-clustering is in fact achieved by the algorithm. For clarity of display, Figure 5.3 shows only clusters of size greater than 6. The edges that connect all cluster centers to other nodes in their clusters and the edges that interconnect the cluster centers are also shown in the figure. This partial similarity graph consists of 256 nodes and 298 edges. To show the quality of the clustering obtained, the following information is relevant: (i) the mean diameter over all clusters obtained is 0.83 and the overall mean distance in $SG(\mathcal{S})$ is 0.89 and (ii) the largest cluster of size 29 contains all financial time-series from the sector T.

5.4 Leading Indicators of Financial Time-Series

In this experiment, we find the leading indicators for the collection $\mathcal{S}$ of all financial time-series using the approach explained in Section 4.1. In order to obtain a dominator set that covers the rest of the financial time-series via directed edges and 2-to-1 directed hyperedges of high ACV, we set a threshold for ACV (ACV-threshold) and discard all directed edges and 2-to-1 directed hyperedges below this threshold during the computation of the dominator. Now, for the computation of dominators for configurations C1 and C2, we consider

Figure 5.2 Euclidean similarity comparison. (a) In-similarity for configuration C1 vs Euclidean similarity and (b) Out-similarity for configuration C1 vs Euclidean similarity. The in-similarity/out-similarity values are on the $x$-axis and the corresponding Euclidean similarity values are on the $y$-axis.
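The Euclidean similarity used as the baseline in Figure 5.2, $ES(A, B) = 1 - \frac{1}{2} ED(A, B)$ on normalized vectors, can be sketched in Python as follows. The function name `euclidean_similarity` and the explicit errors for mismatched lengths or zero vectors are assumptions added for illustration.

```python
import math
from typing import Sequence


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """ES(A, B) = 1 - ED(A, B) / 2, with ED taken between unit-normalized vectors."""
    if len(a) != len(b):
        raise ValueError("both series must have the same number of observations")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector cannot be normalized")
    distance = math.sqrt(
        sum((x / norm_a - y / norm_b) ** 2 for x, y in zip(a, b))
    )
    return 1.0 - distance / 2.0
```

Since two unit vectors are at most distance 2 apart, the result always lies in $[0, 1]$; vectors pointing in the same direction score 1 and opposite vectors score 0.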

Figure 5.3 Clusters of financial time-series for configuration C1. Note that this figure is made up of multiple colors and the colors correspond to sectors in the financial domain. The big circles represent cluster centers and the small ones represent other nodes. The size of the big circles is directly proportional to the number of nodes assigned to them. The small circles are attached to their respective cluster centers.

the following choices for thresholds: (i) top 40% directed hyperedges w.r.t. ACVs: this sets ACV-threshold to 0.45 for C1 and sets ACV-threshold to 0.32 for C2, (ii) top 30% directed hyperedges w.r.t. ACVs: this sets ACV-threshold to 0.46 for C1 and sets ACV-threshold to 0.33 for C2, and (iii) top 20% directed hyperedges w.r.t. ACVs: this sets ACV-threshold to 0.47 for C1 and sets ACV-threshold to 0.34 for C2. Tables 5.4 and 5.3 show the size of the dominator for almost all the financial time-series in our data set. Here, in row 1 of Table 5.4, for the case when ACV-threshold is set to 0.45, our greedy approximation algorithm (Section 4.1) finds a dominator of size 16 that covers 96% of all financial time-series in our data set.

5.5 Association-Based Classifier

In this experiment, we evaluate the accuracy of the assignments given by the Association-Based Classifier on several test data sets. For each training data set, we construct an association hypergraph $H = (V, E)$ using the procedure described in Section 3.2. Next, in the corresponding test data set, all the financial time-series are converted to their respective delta time-series and then discretized using the methodology described in Section 5.1.1. We choose a small collection $\mathcal{S}$ of financial time-series, usually a dominator for all financial time-series in our data sets. The values of every financial time-series $A$ in the dominator are already known from the discretized representation of $A$. The best predicted value of all other financial time-series in our data sets is computed using the association-based classifier presented in Section 4.2. We define the classification confidence for any financial time-series $Y$ on a particular test data set as the fraction of days on which the value assigned by our classifier matches the value in $Y$'s discretized representation, obtained from the same data set.

We also compute the accuracy of value assignments given by other data mining classifiers such as the support vector machine (SVM), multilayer perceptron, and logistic regression. For the experiments here, the classifiers provided by Weka [HFH+09] are used. The following methodology is used to predict the values of any financial time-series $Y$ by constructing a training data set whose feature set is the set $\mathcal{S}$ of financial time-series: Consider a directed

Table 5.3 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 5).

| Row | Configuration | ACV threshold (top % hyperedges) | Dominator Size | Percent Covered | In-sample: Association-Based Classifier | Out-sample: Association-Based Classifier | Out-sample: SVM | Out-sample: Multilayer Perceptron | Out-sample: Logistic Regression |
|---|---|---|---|---|---|---|---|---|---|
| 1 | C1 | 0.45 (40%) | 13 | 99 | 0.643 | 0.719 | 0.546 | 0.716 | 0.541 |
| | | 0.46 (30%) | 15 | 95 | 0.646 | 0.723 | 0.509 | 0.718 | 0.508 |
| | | 0.47 (20%) | 22 | 94 | 0.65 | 0.724 | 0.494 | 0.719 | 0.492 |
| 2 | C2 | 0.32 (40%) | 20 | 96 | 0.646 | 0.716 | 0.429 | 0.627 | 0.231 |
| | | 0.33 (30%) | 30 | 96 | 0.649 | 0.719 | 0.433 | 0.638 | 0.238 |
| | | 0.34 (20%) | 31 | 91 | 0.65 | 0.722 | 0.403 | 0.633 | 0.224 |

Table 5.4 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 6).

| Row | Configuration | ACV threshold (top % hyperedges) | Dominator Size | Percent Covered | In-sample: Association-Based Classifier | Out-sample: Association-Based Classifier | Out-sample: SVM | Out-sample: Multilayer Perceptron | Out-sample: Logistic Regression |
|---|---|---|---|---|---|---|---|---|---|
| 1 | C1 | 0.45 (40%) | 16 | 96 | 0.651 | 0.723 | 0.526 | 0.717 | 0.519 |
| | | 0.46 (30%) | 22 | 93 | 0.653 | 0.723 | 0.514 | 0.718 | 0.510 |
| | | 0.47 (20%) | 26 | 91 | 0.656 | 0.728 | 0.515 | 0.725 | 0.512 |
| 2 | C2 | 0.32 (40%) | 28 | 96 | 0.65 | 0.721 | 0.429 | 0.627 | 0.231 |
| | | 0.33 (30%) | 40 | 90 | 0.652 | 0.722 | 0.433 | 0.638 | 0.238 |
| | | 0.34 (20%) | 36 | 78 | 0.652 | 0.72 | 0.403 | 0.633 | 0.224 |

Figure 5.4 Classification confidence distribution of the association-based classifier for in-sample and out-sample data for configuration C1. The start year for the training data set is 1996. Figure (a) uses the dominator obtained from Algorithm 5 and Figure (b) uses the dominator obtained from Algorithm 6.

hyperedge $e$ in $H$ such that $e = (\{A_1, A_2\}, \{Y\})$ and $A_1, A_2 \in \mathcal{S}$. The training data set is built by using each row in $AT(e)$ as a data point. Here, the particular value assignment $A_1 = v_1$ and $A_2 = v_2$ is the feature value, and the corresponding value $y$ of $Y$ (defined in Definition 3.6) is the class value.

5.5.1 Evaluation

Tables 5.4 and 5.3 list, for each dominator, the mean classification confidence over all financial time-series for configurations C1 and C2. Here, in-sample indicates the training data set for constructing the directed hypergraph model and out-sample indicates the test data set for evaluating the value assignments made by the classifiers. The in-sample contains financial time-series data from Jan 1, 1996 to Dec 31, 2008 and the out-sample contains financial time-series data from Jan 1, 2009 to Dec 31, 2009. From these tables, it is clear that the association-based classifier outperforms SVM, Logistic Regression, and Multilayer Perceptron for both configurations C1 and C2. Also, its mean classification confidence is consistent regardless of $k$, whereas the mean classification confidence of the other classifiers decreases as $k$ increases.

Figures 5.4(a) and 5.4(b) show the distribution of the mean classification confidence over various in-sample and out-sample data for configuration C1, where the dominator having ACV-threshold of 0.45 is chosen to be $\mathcal{S}$. In these figures, the mean classification confidence for each in-sample data has been computed by increasing the training data set incrementally one year at a time, starting from Jan 1, 1996 and ending at Dec 31, 2008. The corresponding out-sample contains financial time-series data for one year immediately following the last day in the training data set. For instance, if the training data set is from Jan 1, 1996 to Dec 31, 2001, then the corresponding test data set contains financial time-series from Jan 1, 2002 to Dec 31, 2002. From these figures, it is evident that the association-based classifier achieves mean classification confidence in the range 0.60 to 0.75 on both in-sample and out-sample data. The higher classification confidence values for out-sample data compared to in-sample data may be attributed to the fact that the out-sample data is substantially smaller than the in-sample data.
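The evaluation protocol above (a training window that grows one year at a time, each followed by a one-year test window, scored by the fraction of days predicted correctly) can be sketched in Python as follows. The function names `expanding_windows` and `classification_confidence` are hypothetical, and calendar-year boundaries are assumed from the dates quoted in the text.

```python
from datetime import date
from typing import Iterator, Sequence, Tuple

Window = Tuple[date, date]


def expanding_windows(
    first_year: int, last_train_year: int
) -> Iterator[Tuple[Window, Window]]:
    """Yield (training window, test window) pairs, growing training by one year."""
    for end_year in range(first_year, last_train_year + 1):
        train = (date(first_year, 1, 1), date(end_year, 12, 31))
        # The test window is the calendar year right after the training window.
        test = (date(end_year + 1, 1, 1), date(end_year + 1, 12, 31))
        yield train, test


def classification_confidence(
    predicted: Sequence[int], actual: Sequence[int]
) -> float:
    """Fraction of days on which the predicted value matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Calling `expanding_windows(1996, 2008)` gives 13 pairs, the last of which trains on 1996 through 2008 and tests on 2009.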

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

We proposed a directed hypergraph based model to capture attribute-level associations and their strength in any database. We tested this model on a financial time-series data set (S&P 500). The clustering method based on our similarity notions allowed us to find clusters of financial time-series. The association-based classifier coupled with the leading indicator of all the financial time-series exhibited a methodology to use mva-type association rules and predict values of financial time-series. We also demonstrated the consistency of our model by varying $k$ throughout our experiments.

Our work raises interesting questions on the applications of association rule mining. It might be fruitful to explore associations by applying the directed hypergraph model on data sets such as gene databases, social network data sets, and medical databases. It would be useful to understand how the different parameters ($k$, and the sizes of head and tail sets) affect the model. Examples of applications of association rule mining in the context of social network data sets and medical databases have been presented in Section 3.1.

In genetic research, a common problem is to effectively model the interrelationship among multiple genes. By recording the gene expression values of a set of genes, researchers work towards obtaining the gene expression values of others. In general, knowledge about the genes helps researchers and physicists to understand the genetic state of patients and the genetic conditions for diseases.

Anastassiou [Ana07] provided a synergy-based method to analyze the interactions among multiple interacting genes. Such a framework can be used to predict the presence of a disease or the absence of a disease, based on given gene expression values. Although this is recent work in disease prediction, there has been continuous interest in locating genes with similar characteristics. For example, Eisen et al. [ESBB98] provided a variation of clustering to

find clusters of similar genes. Finding a group of genes that instigate a certain disease is very important in recognizing a medical condition in patients.

The directed hypergraph based model, i.e., the association hypergraph, presented in this thesis can be used to model gene interactions. By modeling the gene interactions using an association hypergraph, we can address the following problems: first, identify clusters of similar genes and predict gene expression values of a set of genes, and second, identify disease causing conditions present among a set of genes and predict the presence of a disease or the absence of a disease. Let a gene database consist of gene expression values recorded from patients. Also, let the database contain information on the status of patients being affected by a certain disease. Here genes and diseases are the multi-valued attributes, and each observation consists of the gene expression values and the disease status of a particular patient.

The first problem can be addressed by considering the part of the gene database that only consists of the multi-valued attributes that correspond to the gene expression values and constructing an association hypergraph using the technique described earlier in Section 3.1. By considering the multi-valued attributes that correspond to the gene expression values, interactions among the genes can be modeled using directed hyperedges. Then, groups of similar genes can be identified by finding clusters of similar multi-valued attributes, as given in Section 3.3.2. Also, by knowing the gene expression values of a subset of genes, the gene expression values of the remaining genes can be predicted using the association-based classifier given in Section 4.2.

The second problem can be addressed by considering the entire gene database and constructing an association hypergraph using the technique described earlier in Section 3.1. In this problem, we are interested in obtaining a prediction for the presence of a disease or the absence of a disease in patients. Therefore, during the construction of the association hypergraph, the directed hyperedges whose head set consists of a disease are the only ones that get included in the association hypergraph. The remaining directed hyperedges are not considered for inclusion in the association hypergraph. Now, by knowing the gene expression values of a subset of genes for a patient, a disease prediction can be obtained by using the association-based classifier given in Section 4.2.

REFERENCES

[ADS83] G. Ausiello, A. D'Atri, and D. Sacca. Graph algorithms for functional dependency manipulation. Journal of the ACM, 30:752–766, 1983.
[ADS85] G. Ausiello, A. D'Atri, and D. Sacca. Strongly equivalent directed hypergraphs. Analysis and Design of Algorithms for Combinatorial Problems, 25:1–25, 1985.
[AG97] G. Ausiello and R. Giaccio. On-line algorithms for satisfiability formulae with uncertainty. Theoretical Computer Science, 171–2:3–24, 1997.
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD 93, pages 207–216. ACM, 1993.
[Ana07] D. Anastassiou. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology, 3:1–8, 2007.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB 94, pages 487–499. Morgan Kaufmann, 1994.
[AV06] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In ACM Symposium on Computational Geometry. ACM, 2006.
[BAG99] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In ICDE 99, pages 188–197. IEEE, 1999.
[Bay97] R. Bayardo. Brute-force mining of high-confidence classification rules. In KDD 97, pages 123–126. ACM, 1997.
[Ber73] C. Berge. Graphs and Hypergraphs. North-Holland, 1973.
[BZ07] B. Bringmann and A. Zimmermann. The chosen few: On identifying valuable patterns. In ICDM 07, pages 63–72. IEEE, 2007.
[CDP04] S. Chawla, J. Davis, and G. Pandey. On local pruning of association rules using directed hypergraphs. In ICDE 04, page 832. IEEE, 2004.
[CH03] C. Creighton and S. Hanash. Mining gene expression databases for association rules. Bioinformatics, 19:79–86, 2003.
[Chv79] V. Chvatal. A greedy heuristic for the set covering problem. Mathematics of Operations Research, 4:233–235, 1979.
[CSCR+06] P. Carmona-Saez, M. Chagoyen, A. Rodriguez, O. Trelles, J. Carazo, and A. Pascual-Montano. Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics, 19:79–86, 2006.

[ESBB98] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95:14863–14868, December 1998.
[GLPN93] G. Gallo, G. Longo, S. Pallottino, and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42–3:177–201, 1993.
[Gon85] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[GS02] G. Gallo and M. Scutella. A note on minimum makespan assembly plans. European Journal of Operational Research, 142:309–320, 2002.
[HFH+09] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The weka data mining software: An update. SIGKDD Explorations, 11, 2009.
[HKKM98] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph based clustering in high-dimensional data sets: A summary of results. IEEE Data Eng. Bull., 21:15–22, 1998.
[Joh74] D. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256–278, 1974.
[KH06] A. Knobbe and E. Ho. Maximally informative k-itemsets and their efficient discovery. In KDD 06, pages 237–244. ACM, 2006.
[KNO+03] L. Krishnamurthy, J. Nadeau, G. Ozsoyoglu, M. Ozsoyoglu, G. Schaeffer, M. Tasan, and W. Xu. Pathways database system: an integrated system for biological pathways. Bioinformatics, 19:930–937, 2003.
[LHM98] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In KDD 98, pages 80–86. ACM, 1998.
[LHM99] B. Liu, W. Hsu, and Y. Ma. Mining association rules with multiple minimum supports. In KDD 99, pages 337–341. ACM, 1999.
[Llo82] S. Lloyd. Least square quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.
[Lov75] L. Lovasz. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13:383–390, 1975.
[LS95] W. Lin and M. Sarrafzadeh. A linear arrangement problem with applications. In ISCAS 95, pages 57–60, 1995.
[LSW97] B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE 97, pages 220–231. IEEE, 1997.
[NLHP98] R. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In SIGMOD 98, pages 13–24. ACM, 1998.
[OA04] M. Ozdal and C. Aykanat. Hypergraph models and algorithms for data-pattern-based clustering. Data Mining and Knowledge Discovery, 9:29–57, 2004.

[Ord06] C. Ordonez. Comparing association rules and decision trees for disease prediction. In HIKM 06, pages 17–24. ACM, 2006.
[Pre00] D. Pretolani. A directed hypergraph model for random time dependent shortest paths. European Journal of Operational Research, 123:315–324, 2000.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. In VLDB 95, pages 407–419. Morgan Kaufmann, 1995.
[SA96] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD 96, pages 1–12. ACM, 1996.
[SVA97] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In KDD 97, pages 67–73. ACM, 1997.
[SVL06] A. Siebes, J. Vreeken, and M. Leeuwen. Item sets that compress. In SDM 06. SIAM, 2006.
[TT09] M. Thakur and R. Tripathi. Linear connectivity problems in directed hypergraphs. Theoretical Computer Science, 410–29:2592–2618, 2009.
[Vie04] A. Vietri. The complexity of arc-colorings for directed hypergraphs. Discrete Applied Mathematics, 143-3:266–271, 2004.
[Yah10] Yahoo.com. Yahoo finance. http://finance.yahoo.com/, 2010.

ABOUT THE AUTHOR

Ramanuja Simha studied information science at The National Institute of Engineering affiliated to the Visvesvaraya Technological University from 2001 to 2005. He graduated with a Bachelor of Engineering in Information Science. Then, he worked at Tesco Hindustan Service Center as a Software Engineer from 2005 to 2008, and had the title of a Senior Software Engineer when he left the firm. He began his graduate studies in computer science at the University of South Florida in 2008. There, he pursued research in Algorithms and Data Mining under the supervision of Prof. Rahul Tripathi. During this period, he also completed a summer internship at the National Center for Atmospheric Research in the Computational and Information Systems Laboratory.