Mining associations using directed hypergraphs

Citation
Mining associations using directed hypergraphs

Material Information

Title:
Mining associations using directed hypergraphs
Creator:
Simha, Ramanuja N
Place of Publication:
[Tampa, Fla]
Publisher:
University of South Florida
Publication Date:
Language:
English

Subjects

Subjects / Keywords:
Association Rules
Clustering
Discretization
Financial Time-series
Multi-valued Attributes
Similarity
Dissertations, Academic -- Computer Science -- Masters -- USF ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Abstract:
ABSTRACT: This thesis proposes a novel directed-hypergraph-based model for any database. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature. The association rules for multi-valued attributes are integrated into building the directed hypergraph model. This model captures attribute-level associations and their strength. Based on this model, we provide association-based similarity notions between any two attributes and present a method for finding clusters of similar attributes. We then propose algorithms to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithms, and classifier through experiments on a financial time-series data set (S&P 500).
Thesis:
Thesis (M.S.C.S.)--University of South Florida, 2011.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 67 pages.
Statement of Responsibility:
by Ramanuja N. Simha.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
E14-SFE0004856 ( USFLDC DOI )
e14.4856 ( USFLDC Handle )

Postcard Information

Format:
Book


Full Text
Advisor:
Tripathi, Rahul
Host Item:
USF Electronic Theses and Dissertations.
URL:
http://digital.lib.usf.edu/?e14.4856



PAGE 1

Mining Associations Using Directed Hypergraphs
by
Ramanuja N. Simha

A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Rahul Tripathi, Ph.D.
Yicheng Tu, Ph.D.
Xiaoning Qian, Ph.D.

Date of Approval: March 9, 2011

Keywords: Association Rules, Multi-valued Attributes, Financial Time-series, Discretization, Clustering, Similarity, Dominator

Copyright © 2011, Ramanuja N. Simha

PAGE 2

DEDICATION

to my parents, brother, and sister

PAGE 3

ACKNOWLEDGEMENTS

This thesis has been written under the kind supervision of Dr. Rahul Tripathi, my thesis advisor. I am extremely grateful for his helpful guidance and feedback while reviewing the various versions of the thesis draft. His in-depth technical insights and comments have been immensely motivational to me in achieving a high quality of research.

I am extremely grateful to Dr. Mayur Thakur from Google Inc., for being a collaborator in this project and for providing me with ideas and in-depth technical insights during the discussions. We had a number of e-mail and phone conversations during which his comments have been highly motivational to me.

I wish to thank Dr. Rahul Tripathi for permitting me to include work done jointly with him. The work in Chapter 3 and Chapter 4 was jointly done with him.

My thesis committee members were Dr. Rahul Tripathi, Dr. Yicheng Tu, and Dr. Xiaoning Qian. I am extremely thankful to all of them for their helpful comments.

I have learnt immensely about achieving very high levels of academic quality, technical excellence, and aptitude to learn, by taking theory courses under Dr. Rahul Tripathi, and by working as a Teaching Assistant for theory courses and as a Research Assistant under Dr. Rahul Tripathi. I have been fortunate for having had such an opportunity. During this period, the experiences hugely contributed in shaping my research interests in algorithms and problem solving.

I wish to thank Srinivasan Ramkumar from Qualcomm/Riverbed Technology for providing support and motivation throughout my studies by understanding the need to conquer challenges for becoming successful in graduate research.

I would like to thank my parents, my brother, and my sister for giving me inspiration and support whenever I needed during my studies.

PAGE 4

TABLE OF CONTENTS

LIST OF TABLES iii
LIST OF FIGURES iv
LIST OF ALGORITHMS v
ABSTRACT vi
CHAPTER 1 INTRODUCTION 1
1.1 Background and Motivation 1
1.2 Thesis Overview 4
CHAPTER 2 PRELIMINARIES 6
2.1 Approximation Algorithms 6
2.1.1 Set Cover 7
2.1.2 Graph Dominating Set 8
2.1.3 The t-clustering 9
2.2 Directed Hypergraphs 11
2.3 Data Mining 12
2.3.1 Classification Rule Mining 13
2.3.2 Clustering 15
CHAPTER 3 MODELING ASSOCIATIONS IN DATABASES USING DIRECTED HYPERGRAPHS 17
3.1 Associations Between Multi-Valued Attributes 17
3.1.1 Discretization 22
3.2 Association Hypergraphs 22
3.2.1 Constructing Association Hypergraphs 23
3.3 Association-Based Similarity Between Multi-Valued Attributes 26
3.3.1 In-Similarity and Out-Similarity 27
3.3.2 Clusters of Similar Attributes 28
CHAPTER 4 COMPUTATIONAL PROBLEMS 30
4.1 Leading Indicators 30
4.1.1 An Adaptation of Graph Dominating Set Approximation Algorithm 31
4.1.2 An Adaptation of Set Cover Approximation Algorithm 31
4.2 Association-Based Classifier 33

PAGE 5

CHAPTER 5 EXPERIMENTATION 37
5.1 Association Hypergraph Modeling 37
5.1.1 Discretization 37
5.1.2 Choice of Parameters 38
5.2 Association Characteristics of Financial Time-Series 39
5.3 Association-Based Similarity 43
5.3.1 Comparison with Euclidean Similarity 43
5.3.2 Clusters of Financial Time-Series 45
5.4 Leading Indicators of Financial Time-Series 45
5.5 Association-Based Classifier 48
5.5.1 Evaluation 52
CHAPTER 6 CONCLUSIONS AND FUTURE WORK 53
REFERENCES 55
ABOUT THE AUTHOR End Page

PAGE 6

LIST OF TABLES

Table 3.1 Patient database. 19
Table 3.2 Patient database after discretization. 19
Table 3.3 Gene database. 20
Table 3.4 Gene database after discretization. 20
Table 3.5 Personal interest database. 21
Table 3.6 Personal interest database after discretization. 21
Table 3.7 An example association table AT for the combination $(\{A_1, A_2\}, \{A_3\})$. 24
Table 5.1 The directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors and for each configuration choice are shown. 42
Table 5.2 The 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors and for each configuration choice are shown. 44
Table 5.3 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 5). 49
Table 5.4 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 6). 50

PAGE 7

LIST OF FIGURES

Figure 5.1 Weighted degree distribution. 40
Figure 5.2 Euclidean similarity comparison. 46
Figure 5.3 Clusters of financial time-series for configuration $C_1$. 47
Figure 5.4 Classification confidence distribution of the association-based classifier for in-sample and out-sample data for configuration $C_1$. 51

PAGE 8

LIST OF ALGORITHMS

1 A greedy algorithm for computing a set cover. 8
2 The t-clustering algorithm. 10
3 Perceptron learning rule. 14
4 The k-means algorithm. 16
5 A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of graph dominating set approximation. 32
6 A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of set cover approximation. 34
7 Enhancement 1. 34
8 Enhancement 2. 34
9 Association-based classifier. 35

PAGE 9

ABSTRACT

This thesis proposes a novel directed-hypergraph-based model for any database. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature. The association rules for multi-valued attributes are integrated into building the directed hypergraph model. This model captures attribute-level associations and their strength. Based on this model, we provide association-based similarity notions between any two attributes and present a method for finding clusters of similar attributes. We then propose algorithms to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithms, and classifier through experiments on a financial time-series data set (S&P 500).

PAGE 10

CHAPTER 1
INTRODUCTION

1.1 Background and Motivation

Data mining involves searching for interesting patterns and/or classifying data. Association rules help to discover interesting patterns by identifying implication relationships among attribute-value pairs present in the data. Similarly, classification rules help to classify data by predicting values of specific attributes in the data. In general, any association or classification rule consists of two components: the antecedent and the consequent. Association rules may have more than one attribute in their consequent, whereas classification rules always have a single attribute in their consequent. An example of an association rule is: "If a customer buys milk and diapers, then the customer also buys beer and eggs." Here, milk, diapers, beer, and eggs are attributes, and they each take value '1', where '1' denotes present and '0' denotes absent in the association rule. An example of a classification rule is: "If weather is sunny and humidity is low, then play is true." Here, weather, humidity, and play are attributes, and the values 'sunny' and 'low' help to predict the value 'true' for the attribute play. Attributes can be either categorical, e.g., days of a week, or quantitative, e.g., stock price. Agrawal et al. [AIS93] presented association rules that target identifying implication relationships among categorical attributes that take only 0/1 values. Such association rules are called boolean association rules. Srikant et al. [SA96] introduced the more general quantitative association rules that accommodate both categorical and quantitative attributes. Henceforward, we will refer to quantitative association rules as association rules.

Association rules find their use in identifying implication relationships among attributes in market-basket type data. This data type is transactional data where each observation in the database is a transaction consisting of items purchased by a customer. Much work [BAG99, SA95, AS94, SVA97, NLHP98] has been done in mining association rules

PAGE 11

that satisfy constraints such as minimum support (a measure of significance) and minimum confidence (a measure of predictive ability). Association rules can be used to solve the following problems: finding clusters of similar attributes, finding a small subset of attributes that influences a large section of other attributes, and finding classification rules to predict values of attributes. For instance, in market-basket type data, a practical application of association rules is to identify clusters of similar items based on the customer sales information. This helps to understand patterns in sales of items and to group items based on customer interests. Similarly, identifying a small subset of items that influence the sales of all other items in market-basket type data may help in recognizing major sales indicators. Also, using the classification rules, the purchase of particular items for customers could be predicted based on the prior purchases made by them.

In fact, applications of association rules go far beyond the realms of just the market-basket type domain. Some other domains where association rules have important applications are as follows: in medicine for identifying relationships among medical conditions and diseases [Ord06], in bioinformatics for identifying interrelationships among genes [CH03, CSCR+06], in social networks for identifying social relationships, and in finance for identifying prediction relationships among stocks. We provide below some examples from these domains that will help elucidate the applications of association rules. In each of these examples, 35% is the confidence and 5% is the support of the stated rule.

1. "35% of the times when the age of a patient is in the range 40-49 years and the cholesterol of the same patient is in the range 220-229 mg/dL, the patient's blood-pressure is in the range 130-139 mmHg" and this rule occurs in 5% of the total observations. Here, age, cholesterol, and blood-pressure are attributes.
2. "35% of the times when gene 1 and gene 2 in a patient are underexpressed, gene 3 is overexpressed in the patient" and this rule occurs in 5% of the total observations. Here, gene 1, gene 2, and gene 3 are attributes.
3. "35% of the times when a person has high interest in reading and playing, the person has low interest in music" and this rule occurs in 5% of the total observations. Here, reading, playing, and music are attributes.

PAGE 12

4. "35% of the times when the price of stock A and of stock B go up, the price of stock C goes down" and this rule occurs in 5% of the total observations. Here, stocks A, B, and C are attributes.

Our present work is motivated by the goal of building a model that inherently handles many-to-many relationships, so that these relationships among attributes of a database can be captured more accurately. Such a model is expected to be suited for handling problems such as similarity and clustering because many of the relationships exhibited in the real world are not restricted to be one-to-one.

Knobbe et al. [KH06] proposed an approach to mine a small set of binary attributes that help differentiate observations in a database. Siebes et al. [SVL06] and Bringmann et al. [BZ07] proposed techniques for compressing a database. The compressed set of binary attributes could then be used in a data mining classifier to avoid overfitting. The above methods mainly attempt to identify informative sets of binary attributes, whereas our approach attempts to build a generic model for any database containing multi-valued attributes and address a variety of problems such as similarity, clustering, leading indicators, and classification.

In this thesis, we propose a novel model for any database using a directed hypergraph in which the nodes represent attributes and the directed hyperedges represent many-to-many relationships among the attributes. The weights on directed hyperedges capture the likeliness of association in a particular direction. We introduce the notion of association rules for multi-valued attributes, which is an adaptation of the definition of quantitative association rules known in the literature [SA96]. The association rules for multi-valued attributes are integrated into building the directed hypergraph model. Based on this model, we provide association-based similarity notions between any two attributes, present a method for finding clusters of similar attributes, and propose an algorithm to identify a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. Finally, we present an association-based classifier that can be used to predict values of attributes. We demonstrate the effectiveness of our proposed model, notions, algorithm, and classifier through experiments on a financial time-series data set (S&P 500).

PAGE 13

1.2 Thesis Overview

We present background on approximation algorithms and data mining in Chapter 2. The various approximation algorithms presented are: finding minimum cost set cover, finding minimum size graph dominator, and finding minimum diameter clustering. We then discuss some fundamental concepts and approaches in data mining such as linear regression, perceptron rule learning algorithm, and k-means clustering.

In Chapter 3, we introduce the notion of association rules for multi-valued attributes, which is an adaptation of quantitative association rules known in the literature. We then propose a novel directed hypergraph based modeling for any database that allows us to compute association-based similarity between multi-valued attributes and find clusters of similar multi-valued attributes.

Using the directed hypergraph model, we propose in Chapter 4 two greedy algorithms for identifying a subset of attributes known as a leading indicator that influences the values of almost all other attributes. The first greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cardinality dominating set in graphs. The second greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cost set cover. Finally, we present an association-based classifier that can be used to predict values of attributes.

In Chapter 5, we conduct experiments on financial time-series obtained from Yahoo Finance [Yah10] on the following problems: (a) finding clusters of similar attributes, (b) finding leading indicators, and (c) predicting values of financial time-series using the association-based classifier. We propose an equi-depth partitioning technique to discretize the financial time-series in the S&P 500. This transformation is used to construct a database that is suitable for the association hypergraph modeling. The directed hypergraph H required for the experiments is then constructed using the database. We show the association characteristics of financial time-series by presenting the degree distribution of nodes in the association hypergraph, by displaying the directed edges and directed hyperedges with highest association confidence value (ACV) for selected financial time-series, and by comparing the directed hyperedges with the highest ACV with those of the corresponding two directed edges. We

PAGE 14

also display the clusters of financial time-series in the similarity graph, compute the leading indicators for a collection of financial time-series, and present the statistics of financial time-series predictions.

In this thesis we show that directed hypergraphs, due to their inherent structure, help us to capture interesting characteristics that exhibit many-to-one relationships among attributes in a database. The proposed model displays the versatile usage of directed hypergraphs in addressing various problems relevant to mining associations in databases. Our experiments on financial time-series demonstrate that directed hypergraphs capture more relationships than directed graphs and, in the stock market domain, our modeling allows us to address problems such as computing financial time-series similarity, computing financial time-series leading indicators, and predicting financial time-series values, using a common framework/methodology.

PAGE 15

CHAPTER 2
PRELIMINARIES

2.1 Approximation Algorithms

There are many problems whose decision version is NP-complete and whose optimization version is NP-hard, i.e., no polynomial-time algorithm for obtaining an optimal solution is currently known. Due to their practical importance, it is extremely useful to have an efficient polynomial time solution to such problems by carrying out any of the following: (a) constraining the input, (b) introducing randomization in the solution approach, and (c) obtaining a near optimal solution. In this section, we shall focus on obtaining a near optimal solution to hard problems. Approaches that produce a near optimal solution are used in the design of polynomial time approximation algorithms. The quality measure of an approximation algorithm is given by its approximation factor.

Let us say that we have an optimization problem. Based on the problem, an optimal solution may be a solution with either the maximum value or the minimum value. In other words, the problem may be a maximization or a minimization.

Definition 2.1 For an optimization problem $A$ that takes input $I$, an algorithm $ALG$ has an approximation ratio $\rho$ if the cost $ALG_A(I)$ of the solution produced by the algorithm on input $I$ is within a factor of $\rho$ of the cost $OPT_A(I)$ of an optimal solution on the same input $I$. That is,

$$\max\left(\frac{ALG_A(I)}{OPT_A(I)}, \frac{OPT_A(I)}{ALG_A(I)}\right) \le \rho.$$

For a minimization problem, the ratio $ALG_A(I)/OPT_A(I)$ gives a factor by which the solution cost produced by the algorithm exceeds the solution cost of an optimal solution. For a maximization problem, the ratio $OPT_A(I)/ALG_A(I)$ gives a factor by which the solution cost of an optimal solution exceeds the solution cost produced by the algorithm.
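As a small illustration of Definition 2.1, the ratio can be computed directly from the two costs. This is only a sketch: the function name `approximation_ratio` is an assumption introduced here, and it presumes both costs are strictly positive.

```python
def approximation_ratio(alg_cost: float, opt_cost: float) -> float:
    """Return max(ALG/OPT, OPT/ALG) for two strictly positive solution costs.

    The max makes the same expression usable for minimization problems
    (where ALG >= OPT) and maximization problems (where OPT >= ALG).
    """
    if alg_cost <= 0 or opt_cost <= 0:
        raise ValueError("both costs must be strictly positive")
    return max(alg_cost / opt_cost, opt_cost / alg_cost)


# Minimization: the algorithm pays 6 where the optimum pays 4.
ratio_min = approximation_ratio(6, 4)
# Maximization: the algorithm achieves 8 where the optimum achieves 10.
ratio_max = approximation_ratio(8, 10)
```

An algorithm has approximation ratio $\rho$ when this quantity is at most $\rho$ on every input, not only on a single instance.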

PAGE 16

2.1.1 Set Cover

Definition 2.2 Let $U$ be a universe consisting of $n$ elements and let $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ be a collection of subsets of $U$. A set cover $SC$ is a subcollection of $\mathcal{S}$ that covers all the elements in $U$.

Algorithm 1 presented below computes a set cover $SC$ of size $ALG_{SC}(I)$ given a universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$ as input $I$. Let $Cover$ denote the set of elements that are currently covered by the algorithm. The greedy strategy of the algorithm is as follows: for every subset $S_i \in \mathcal{S}$ that is not part of the set cover yet, the algorithm computes its cost effectiveness $\alpha(S_i)$ that reflects $S_i$'s covering ability, i.e., $\alpha(S_i)$ is the average cost paid by the greedy algorithm to cover the elements in $S_i$ that are not already in $Cover$, i.e., $1/|S_i - Cover|$. During each iteration of the algorithm until all the elements in $U$ are covered, the subset $S_i$ with the lowest average cost (highest cost effectiveness) is added to the set cover. In other words, a price of $1/|S_i - Cover|$ is paid to cover each element in $S_i - Cover$. Therefore, the cost of the set cover obtained is $\sum_{k=1}^{n} price(u_k)$. The greedy set cover algorithm can be implemented in linear time.

Theorem 2.3 [Joh74, Lov75, Chv79] Given a universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$ as input $I$, the cost of the set cover $ALG_{SC}(I)$ computed by Algorithm 1 is at most a factor of $O(\log n)$ greater than the cost of an optimal set cover $OPT_{SC}(I)$ in $O(n \log m)$ time.
Proof Let $u_1, u_2, \ldots, u_n \in U$ be the order in which the elements are added to $Cover$ by Algorithm 1. Since an optimal solution can cover all the elements in $U$ with a cost of $OPT_{SC}(I)$, there must always exist a set having an average cost of at most $OPT_{SC}(I)/|U - Cover|$. We know that $|U - Cover|$ is at least $n - k + 1$ when an element $u_k$ is about to be covered. Since the algorithm picks the lowest average cost subset during each iteration, we have

$$price(u_k) \le \frac{OPT_{SC}(I)}{|U - Cover|} \le \frac{OPT_{SC}(I)}{n - k + 1}.$$

PAGE 17

Input: A universe $U$ of size $n$ and a collection of subsets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ of $U$
Output: A set cover $SC$
begin
    SC ← ∅;
    Cover ← ∅;
    while Cover ≠ U do
        foreach S ∈ 𝒮 do
            count ← 0;
            foreach u ∈ S do
                if u ∉ Cover then
                    count ← count + 1;
                end
            end
            if count > 0 then α(S) ← 1/count else α(S) ← ∞;
        end
        Let S' be such that α(S') = min_{S ∈ 𝒮} α(S);
        Cover ← Cover ∪ S';
        SC ← SC ∪ {S'};
    end
    return SC;
end
Algorithm 1: A greedy algorithm for computing a set cover.

Therefore, the cost of the set cover $ALG_{SC}(I)$ is

$$ALG_{SC}(I) = \sum_{k=1}^{n} price(u_k) \le \sum_{k=1}^{n} \frac{OPT_{SC}(I)}{n - k + 1} = \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n}\right) OPT_{SC}(I) = O(\log n) \cdot OPT_{SC}(I).$$

This completes the proof.

2.1.2 Graph Dominating Set

Definition 2.4 Let $G = (V, E)$ be a graph. A dominating set is a subset $DomSet \subseteq V$ of vertices such that, for each vertex $v \in V$, either there exists an edge $(u, v) \in E$ such that $u \in DomSet$ or $v \in DomSet$.
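The greedy rule of Algorithm 1 and the notion of a dominating set in Definition 2.4 can be tried out together in a few lines of Python. The sketch below is illustrative only: the function names, the adjacency-dictionary input format, and the tie-breaking rule (the first best subset wins) are assumptions made here, not part of the thesis. Picking the subset with the lowest average cost $1/|S - Cover|$ is the same as picking the subset that covers the most uncovered elements, which is what the code does. The second function applies the greedy cover to closed neighborhoods (a vertex together with its neighbors), which is the usual way to obtain a dominating set from a set cover routine.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[Hashable],
                     subsets: List[Set[Hashable]]) -> List[int]:
    """Return indices of the chosen subsets in the order the greedy rule picks them."""
    uncovered = set(universe)
    chosen: List[int] = []
    while uncovered:
        if not subsets:
            raise ValueError("no subsets given for a non-empty universe")
        # Lowest average cost 1/|S - Cover| == largest number of newly covered elements.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            raise ValueError("the subsets do not cover the universe")
        uncovered -= gain
        chosen.append(best)
    return chosen


def greedy_dominating_set(adjacency: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
    """Greedy dominating set: run greedy set cover on closed neighborhoods."""
    vertices = list(adjacency)
    closed_neighborhoods = [{v} | set(adjacency[v]) for v in vertices]
    picked = greedy_set_cover(set(vertices), closed_neighborhoods)
    return {vertices[i] for i in picked}


# Hypothetical toy instances.
cover = greedy_set_cover({1, 2, 3, 4, 5}, [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}])
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
dominators = greedy_dominating_set(path)
```

On the toy instances, the cover uses the subsets at indices 0 and 3, and the path on five vertices is dominated by vertices 2 and 4.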

PAGE 18

Theorem 2.5 [Joh74, Lov75, Chv79] There is a greedy $O(\log n)$-approximation algorithm for the graph dominating set problem that, given any graph $G = (V, E)$ as input $I$, computes a dominating set of $G$ whose size is within a factor of $O(\log n)$ of the optimal dominating set size, where $n$ is the number of vertices and $m$ is the number of edges of $G$.

The following construction can be used to transform an instance of graph dominating set into an instance of set cover. Let $U = V$ be the set of vertices that need to be covered and let $\mathcal{S} = \{S_1, S_2, \ldots, S_n\}$ be subsets of vertices, where each subset $S_i$ contains vertex $v_i$ and its neighborhood set $N(v_i)$. This instance of set cover can be solved using the approximation result in Theorem 2.3. Since every set added by Algorithm 1 to the set cover always adds only a vertex to the graph dominating set, the approximation guarantee provided in the theorem for the size of the set cover holds for the size of the graph dominating set.

2.1.3 The t-clustering

Let us assume we have a set of attributes, e.g., financial time-series, genes, or images, and we are required to find independent groups of attributes where attributes belonging to the same group are similar to each other. The general clustering problem addresses this by defining an objective function based on which the attributes are grouped. A variant of the general clustering problem is the problem t-clustering, which groups the attributes based on how close they are to each other. In other words, the objective of t-clustering is to group attributes into $t$ clusters so that the maximum distance between any two attributes within the same cluster (also called the diameter of the grouping, a.k.a. clustering) is minimized. This problem also assumes that the Euclidean distance function displays metric properties. A distance function $d(\cdot, \cdot)$ on a point set $X = \{x_1, x_2, \ldots, x_n\}$ is said to have metric properties if and only if it satisfies:

1. $d(x_1, x_2) \ge 0$ for all $x_i \in X$, and
2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$, and
3. $d(x_1, x_2) = d(x_2, x_1)$, and
4. $d(x_1, x_2) \le d(x_1, x_3) + d(x_3, x_2)$ (Triangle Inequality).
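The four metric properties and the diameter objective can be checked mechanically on a small point set. The following is a minimal sketch, assuming points are given as coordinate tuples; the names `is_metric` and `diameter` and the tolerance parameter are assumptions introduced for illustration. It uses the standard-library `math.dist` for the Euclidean distance and contrasts it with the squared Euclidean distance, which violates the triangle inequality.

```python
from itertools import combinations, product
from math import dist, isclose


def is_metric(points, d, tol=1e-12):
    """Check the four metric properties of d on a finite point set."""
    for a, b in product(points, repeat=2):
        if d(a, b) < 0:                                # 1. non-negativity
            return False
        if (d(a, b) == 0) != (a == b):                 # 2. zero distance iff same point
            return False
        if not isclose(d(a, b), d(b, a), abs_tol=tol):  # 3. symmetry
            return False
    for a, b, c in product(points, repeat=3):
        if d(a, b) > d(a, c) + d(c, b) + tol:          # 4. triangle inequality
            return False
    return True


def diameter(clusters, d):
    """Largest distance between two points that share a cluster."""
    return max(
        (d(a, b) for cluster in clusters for a, b in combinations(cluster, 2)),
        default=0.0,
    )


# Hypothetical collinear points: Euclidean distance passes, squared distance fails.
points = [(0, 0), (3, 4), (6, 8)]
euclidean_ok = is_metric(points, dist)
squared_ok = is_metric(points, lambda a, b: dist(a, b) ** 2)
two_cluster_diameter = diameter([[(0, 0), (3, 4)], [(6, 8)]], dist)
```

Checking every triple costs cubic time in the number of points, so this is meant for sanity checks on small examples rather than for large data sets.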

PAGE 19

Definition 2.6 Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of points with a distance function $d(\cdot, \cdot)$ between any pair of points that satisfies the metric properties and let $t$ be an integer. A t-clustering $C = \{C_1, C_2, \ldots, C_t\}$ is a partition of $X$ into $t$ clusters $C_1, C_2, \ldots, C_t$ by designating $t$ points of $X$ as centers, such that $C$ minimizes the diameter $Diam$ over all possible clusterings. That is, $C$ minimizes

$$Diam(C) = \max_{i} \max_{x_a, x_b \in C_i} d(x_a, x_b).$$

Algorithm 2 presented below finds such a partition of $X$ by designating some $t$ points as cluster centers. Initially, any point $\mu_1 \in X$ is picked as the first cluster center. During each iteration of the algorithm, it finds a point $\mu_i \in X$ that is the farthest from the existing centers $\mu_1, \mu_2, \ldots, \mu_{i-1}$, i.e., a point $\mu_i \in X$ that maximizes $\min_j$
PAGE 20

less than $r$ to its closest center. From triangle inequality, the diameter of the t-clustering is at most $2r$. Also, from the way the centers are chosen in every iteration of Algorithm 2, we know that all the centers $\mu_1, \mu_2, \ldots, \mu_t$ and the point $u$ are placed at least $r$ distance apart from each other. If a t-clustering of these points is found, then the diameter of the clustering is at least $r$, i.e., $OPT_{TC}(I) \ge r$. From above we have

$$ALG_{TC}(I) \le 2 \cdot OPT_{TC}(I).$$

This completes the proof.

2.2 Directed Hypergraphs

Directed hypergraphs are a generalization of directed graphs in which each directed hyperedge has one or more source (tail) vertices and has one or more destination (head) vertices. They have found a variety of applications in Computer Science, e.g., in propositional logic [AG97], databases [ADS83, ADS85], scheduling [LS95, GS02], routing in dynamic networks [Pre00], bioinformatics [KNO+03], and data mining [CDP04]. Theoretical problems related to directed hypergraphs have been studied in [Vie04, TT09].

A related notion is undirected hypergraphs, which generalize undirected graphs.

Definition 2.8 [Ber73] An undirected hypergraph $G$ is a pair $(V_G, E_G)$, where $V_G$ is a finite set of vertices and $E_G \subseteq 2^{V_G}$ is a finite set of hyperedges such that, for every hyperedge $e_i \in E_G$, $e_i \neq \emptyset$.

Definition 2.9 [GLPN93] A directed hypergraph $H$ is a pair $(V, E)$, where $V$ is a finite set of vertices and $E \subseteq 2^V \times 2^V$ is a finite set of directed hyperedges such that, for every directed hyperedge $e = (T, H) \in E$, $T \neq \emptyset$, $H \neq \emptyset$, and $T \cap H = \emptyset$. Here, $T$ is called the tail set and $H$ is called the head set of $e$.

Notice that directed hypergraphs are different from undirected hypergraphs [Ber73]. While directed hypergraphs generalize directed graphs, undirected hypergraphs generalize undirected graphs.
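Definition 2.9 translates naturally into a small data structure that enforces the three constraints on a directed hyperedge (non-empty tail, non-empty head, disjoint tail and head). The classes below are a minimal sketch with names chosen here for illustration; they are not the representation used in the thesis, and edge weights are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Hashable, Iterable, Set


@dataclass(frozen=True)
class DirectedHyperedge:
    tail: FrozenSet[Hashable]
    head: FrozenSet[Hashable]

    def __post_init__(self) -> None:
        # Constraints from Definition 2.9.
        if not self.tail or not self.head:
            raise ValueError("tail and head sets must be non-empty")
        if self.tail & self.head:
            raise ValueError("tail and head sets must be disjoint")


@dataclass
class DirectedHypergraph:
    vertices: Set[Hashable] = field(default_factory=set)
    edges: Set[DirectedHyperedge] = field(default_factory=set)

    def add_edge(self, tail: Iterable[Hashable],
                 head: Iterable[Hashable]) -> DirectedHyperedge:
        """Validate and store a hyperedge; its endpoints are added to the vertex set."""
        edge = DirectedHyperedge(frozenset(tail), frozenset(head))
        self.vertices |= edge.tail | edge.head
        self.edges.add(edge)
        return edge


# A 2-to-1 hyperedge ({A1, A2}, {A3}) and an ordinary directed edge ({A1}, {A3}).
graph = DirectedHypergraph()
two_to_one = graph.add_edge({"A1", "A2"}, {"A3"})
plain_edge = graph.add_edge({"A1"}, {"A3"})
```

An ordinary directed graph is the special case in which every tail and head set has exactly one vertex, which is why the second edge above is accepted by the same structure.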

PAGE 21

2.3 Data Mining

Data mining involves learning information or concepts from databases and using them to solve problems. Based on the structure of the database, there are different problems that can be addressed. For example, suppose our database contains attributes such as humidity, weather, temperature, and type of play, and contains observations that in fact are records of values that these attributes take on different days. In order to learn about the type of play, one can fix type of play as the classification attribute and look into the values of other attributes. A classifier is an algorithm that learns rules about the classification attribute based on other attribute values. This approach can be used to predict the class attribute value of unseen observations. Since the classifier is built under the supervision of a set of observations, i.e., training data set, this method is known as supervised classification. The class value may also be non-discrete in nature.

In certain circumstances, a database may not have any information to distinguish a particular attribute as the classification attribute. In such a scenario, learning information about the database may not aid in the prediction of a class value. However, learning information not only corresponds to the prediction of some attribute value in the database, but also can be visualized as inferring relationships among attributes. For example, in the sample database discussed earlier, an interesting association can be as follows: "The pattern of humidity having the value 80%, temperature having the value 90°F, and weather having the value Rainy is very frequent in the database." Such an inference is known as an association rule. Here, there is no particular attribute whose value is predicted. But, as we have seen in the example, learning information corresponds to identifying interesting and useful patterns among attributes.
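How "frequent" a pattern is, and how well one part of it predicts another, are usually quantified by the support and confidence measures named in Chapter 1. The sketch below uses the standard definitions from the association-rule literature (support is the fraction of observations matching a pattern, and confidence is the support of antecedent and consequent together divided by the support of the antecedent). The function names and the toy observations are assumptions made for illustration, and the thesis's own definitions for multi-valued attributes are given later in Chapter 3.

```python
from typing import Dict, List

Observation = Dict[str, str]


def support(observations: List[Observation], pattern: Dict[str, str]) -> float:
    """Fraction of observations in which every attribute takes the value in `pattern`."""
    matches = sum(
        all(obs.get(attr) == value for attr, value in pattern.items())
        for obs in observations
    )
    return matches / len(observations)


def confidence(observations: List[Observation],
               antecedent: Dict[str, str],
               consequent: Dict[str, str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    antecedent_support = support(observations, antecedent)
    if antecedent_support == 0:
        raise ValueError("the antecedent never occurs in the observations")
    return support(observations, {**antecedent, **consequent}) / antecedent_support


# Hypothetical four-day database in the spirit of the weather example.
days = [
    {"humidity": "high", "weather": "rainy", "play": "no"},
    {"humidity": "high", "weather": "rainy", "play": "no"},
    {"humidity": "low", "weather": "sunny", "play": "yes"},
    {"humidity": "high", "weather": "sunny", "play": "yes"},
]
pattern_support = support(days, {"humidity": "high", "weather": "rainy"})
rule_confidence = confidence(days, {"humidity": "high"}, {"play": "no"})
```

Here the pattern (humidity high, weather rainy) has support 2/4, and the rule "humidity high implies play no" has confidence 2/3, since three days have high humidity and two of those have play equal to no.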
In the previous example, we saw that learning information can be visualized in terms of inferring relationships among attributes. If this concept is generalized, then it can be termed as identifying groups of attributes with similar characteristics. Here, characteristics may relate to the patterns among attributes identified earlier. This problem is known as clustering, as the objective is to find attributes and classify them into groups based on their characteristics. For example, let the set of attributes be a set of financial time-series

PAGE 22

andthesetofobservationsbethestockpricesrecordedforthenancialtime-serieson variousdays.Aninterestingquestionhereistogroupthenancialtime-seriesintoclusters containingsimilarnancialtime-series.Thequalityoftheclusteringobtainedisgenerally veriedbasedontheinformationavailabletotheuser.Inthiscase,onewaytoverify thequalityofthenancialtime-seriesclustersobtainedisbylookingatthesectorsofthe time-series.Here,thequalityoftheclusteringmaybedenedgoodifahighpercentageof thetime-seriesthatbelongtothesameclusterarefromthesamesector. 2.3.1ClassicationRuleMining Wereviewthelinearregressionclassierusedtoconstructclassicationrules.Linearregressioncanbeusedtopredicttheclassattributevaluewhentherearenon-discreteattributes. Let A 1 ;A 2 ;:::;A n denotetheattributesandlet O 1 ;O 2 ;:::;O m denotetheobservations correspondingtotheattributes.Lettherebeadatabasethatisvisualizedasa m n table containing m rowsand n columns,whererowscorrespondtoobservationsandcolumnscorrespondtoattributes.Let o ij denotethevalueofthe j 'thattributeforthe i 'thobservation where1 i m and1 j n .Forconvenience,letussaythatthelastattribute, A n betheclassattributewhosevalueneedstobepredicted.Inordertopredict A n 'svalue, linearregressionbuildsanattributeweightassignmentmodelwhere w 1 ;w 2 ;:::;w n )]TJ/F20 7.9701 Tf 6.587 0 Td [(1 are theweightsassignedtotherst n )]TJ/F15 10.9091 Tf 10.747 0 Td [(1attributes.Theweightassignmentmodelclassier isiterativelybuiltbycomputingthepredictedvaluefor A n foreachobservation.Forthe i 'thobservation, A n 'spredictedvalueisgivenby: w 1 o i 1 + w 2 o i 2 + + w n )]TJ/F20 7.9701 Tf 6.586 0 Td [(1 o in )]TJ/F20 7.9701 Tf 6.586 0 Td [(1 = n )]TJ/F20 7.9701 Tf 6.587 0 Td [(1 X j =1 w j o ij : Intuitively,inordertohaveanaccuratepredictionof A n forthe i 'thobservation,the dierencebetween A n 'spredictedvalue P n )]TJ/F20 7.9701 Tf 6.587 0 Td [(1 j =1 w j o ij and A n 'sactualvalue o in hastobe minimized.Thus,forthe i 'thobservation,where1 i m ,theclassierisconstructed 
by reducing the squared difference $\left( o_{in} - \sum_{j=1}^{n-1} w_j o_{ij} \right)^2$. Overall, linear regression minimizes

\[ \sum_{i=1}^{m} \left( o_{in} - \sum_{j=1}^{n-1} w_j o_{ij} \right)^2. \]

Using the attribute weights in the constructed classifier, the class attribute value of an unseen observation can be predicted.
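As a rough illustration of the minimization above, the following Python sketch computes the weights $w_1, \ldots, w_{n-1}$ by solving the normal equations of the least-squares problem. The names `fit_weights` and `predict` and the toy table are hypothetical (they do not come from the thesis), and the sketch assumes the predictor columns are linearly independent.

```python
def fit_weights(rows):
    """Least-squares weights for predicting the last column from the others.

    Assumption: the predictor columns are linearly independent, so the
    normal equations (X^T X) w = X^T y have a unique solution.
    """
    n = len(rows[0]) - 1  # number of predictor attributes A_1 .. A_{n-1}
    a = [[sum(r[p] * r[q] for r in rows) for q in range(n)] for p in range(n)]
    b = [sum(r[p] * r[n] for r in rows) for p in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(n):
            if k != col:
                f = a[k][col] / a[col][col]
                a[k] = [x - f * y for x, y in zip(a[k], a[col])]
                b[k] -= f * b[col]
    return [b[i] / a[i][i] for i in range(n)]


def predict(weights, values):
    """Predicted class value: the weighted sum of the attribute values."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical table: the class attribute (last column) equals 2*A1 + 3*A2.
table = [(1, 0, 2), (0, 1, 3), (1, 1, 5), (2, 1, 7)]
w = fit_weights(table)
prediction = predict(w, (3, 2))
```

Because the toy class attribute is an exact linear combination of the other two columns, the recovered weights reproduce it without error; on real data the residual sum of squares would be positive.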


The basic idea used in linear regression is to express the class attribute value as a linear combination of other attributes. When the attribute values are discrete, expressing the class attribute as a linear combination of other discrete attributes fails to predict the value of the class attribute. This is because the error computed between the predicted class value and the actual class value has little meaning: it does not indicate how far the predicted value is from the actual value. Also, the predictions may lie outside the set of discrete values allowed in the dataset.

Algorithm 3: Perceptron learning rule.

    for $i \leftarrow 0$ to $n-1$ do
        $w_i \leftarrow 0$;
    end
    while there exists an incorrectly classified observation in the training data do
        for $j \leftarrow 1$ to $m$ do
            if $O_j$ is currently incorrectly classified then
                if $O_j$ belongs to the first class then
                    Add values of attributes $A_0, A_1, \ldots, A_{n-1}$ in this observation to
                    weights $w_0, w_1, \ldots, w_{n-1}$ in the equation;
                end
                else
                    Subtract values of attributes $A_0, A_1, \ldots, A_{n-1}$ in this observation from
                    weights $w_0, w_1, \ldots, w_{n-1}$ in the equation;
                end
            end
        end
    end

We now look at another approach to predict the class attribute value. Let $A_1, A_2, \ldots, A_n$ denote the attributes and let $O_1, O_2, \ldots, O_m$ denote the observations corresponding to the attributes as before. Let us assume the class attribute $A_n$ is 0/1-valued. This approach tries to separate the observations into either 0-valued or 1-valued by using a hyperplane. The equation of the hyperplane is as follows:

\[ w_0 A_0 + w_1 A_1 + w_2 A_2 + \cdots + w_{n-1} A_{n-1} = 0. \]

During the construction of the hyperplane, the weight $w_0$ of attribute $A_0$, called bias, is used to contribute a constant value in the equation of the hyperplane and hence $A_0 = 1$. The algorithm in Algorithm 3, called the perceptron learning rule, is used to construct


the hyperplane. A perceptron [Ros58] is a binary classifier that classifies an observation into the first class if the sum from the equation is greater than 0 and into the second class otherwise.

We see that the algorithm iteratively either increments or decrements the attribute weights by the attribute values of the observation, if the class attribute value of that observation is incorrectly classified by the equation of the hyperplane. The goal is to predict the class values of all the observations correctly. If the dataset is not linearly separable, the above algorithm would not terminate. In such cases, the algorithm can be terminated forcefully after a certain number of iterations.

2.3.2 Clustering

Definition 2.10 [Llo82] Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of points with a distance function $d(\cdot, \cdot)$ between any pair of points, and let $k$ be an integer. A $k$-means clustering $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ is a partition of $X$ into $k$ clusters $C_1, C_2, \ldots, C_k$ such that $\mathcal{C}$ minimizes the sum of squares of distances of points from the centroid $\mu_i = \sum_{x_j \in C_i} x_j / |C_i|$ for each cluster $C_i \in \mathcal{C}$. That is, $\mathcal{C}$ minimizes

\[ \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2. \]

The $k$-means clustering algorithm, provided in Algorithm 4, is a well-known technique to find a $k$-means clustering given points in Euclidean space. Initially, $k$ random cluster centers are picked. The $k$-means algorithm then works iteratively. In each iteration, the algorithm assigns the remaining points to the closest cluster center and computes the centroid for each of the $k$ clusters. These centroids are the new cluster centers for the next iteration. The algorithm continues until the cluster centers between two consecutive iterations remain unchanged.

The output of the $k$-means algorithm depends on the initial cluster centers that are picked. In each iteration, the $k$-means algorithm picks the centroids of the points in the current clusters as new cluster centers NewCC. The algorithm terminates when the current cluster centers CurrentCC and the new cluster centers NewCC are the same. The $k$-means clustering


Algorithm 4: The $k$-means algorithm.

    Input: The number of clusters $k$ and a set of points $X = \{x_1, x_2, \ldots, x_n\}$ with a
           distance function $d(x_a, x_b)$ between any pair of points
    Output: A clustering $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$
    begin
        NewCC $\leftarrow$ Pick any $k$ points, say, $\mu_1, \mu_2, \ldots, \mu_k$;
        CurrentCC $\leftarrow \emptyset$;
        while CurrentCC $\ne$ NewCC do
            CurrentCC $\leftarrow$ NewCC;
            Create $k$ clusters such that the $i$'th cluster is $C_i = \{$all $x \in X$ whose closest
            center is $\mu_i \in$ CurrentCC$\}$;
            for $i \leftarrow 1$ to $k$ do
                $\mu_i \leftarrow \sum_{x_j \in C_i} x_j / |C_i|$;
            end
            NewCC $\leftarrow \{\mu_1, \mu_2, \ldots, \mu_k\}$;
        end
    end

algorithm is suitable for finding clusters in a set of points that contains subgroups of points with symmetric shapes, such as circular ones, since these shapes allow the points to be assigned to unique centroids. This makes the algorithm iteratively move towards the unique centroids of the individual subgroups of points in order to obtain a stable clustering. When the input contains subgroups of points that have asymmetric shapes with irregular boundaries, the algorithm moves the centroids without being able to obtain a stable clustering, thus constantly modifying the clustering configuration. The worst-case running time of the $k$-means algorithm is known to be $2^{\Omega(\sqrt{n})}$ [AV06].
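The loop of Algorithm 4 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `k_means` is hypothetical, the initial centers are passed in explicitly rather than picked at random, squared Euclidean distance is used, and an empty cluster simply keeps its previous center (a case the pseudocode does not address).

```python
def k_means(points, initial_centers):
    """Repeat assignment and centroid steps until the centers stop changing."""
    current = None
    new = [tuple(map(float, c)) for c in initial_centers]
    clusters = []
    while current != new:
        current = new
        clusters = [[] for _ in current]
        for p in points:
            # Index of the closest current center (squared Euclidean distance).
            nearest = min(
                range(len(current)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, current[i])),
            )
            clusters[nearest].append(p)
        new = []
        for i, members in enumerate(clusters):
            if members:
                # Centroid: coordinate-wise mean of the cluster members.
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its old center.
                new.append(current[i])
    return clusters, current


# Hypothetical input: two well-separated groups of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
clusters, centers = k_means(points, [(0, 0), (10, 0)])
```

On this input the centers move once, to the midpoints of the two groups, and the second pass leaves them unchanged, which triggers the termination condition CurrentCC = NewCC.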


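CHAPTER 3
MODELING ASSOCIATIONS IN DATABASES USING DIRECTED HYPERGRAPHS

3.1 Associations Between Multi-Valued Attributes

Let $D$ be a database in the form of an $m \times n$ table, where the rows correspond to observations and the columns correspond to multi-valued attributes. Let $\mathcal{O} = \{O_1, O_2, \ldots, O_m\}$ be the set of observations and $\mathcal{A} = \{A_1, A_2, \ldots, A_n\}$ be the set of attributes. The table entry for each attribute $A_i$ and each observation $O_j$ is a value from a fixed finite set $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$. We denote such a database $D$ in the form of $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$. For any $X \subseteq \mathcal{A} \times \mathcal{V}$, let $\pi_1(X)$ denote $\{A_i \mid \exists v_j, (A_i, v_j) \in X\}$.

We next present the definition of an association rule for multi-valued attributes and the support and confidence measures for such an association rule.

Definition 3.1 An association rule for multi-valued attributes (in short, mva-type association rule) in a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ is an implication relationship of the form $X \stackrel{mva}{\Longrightarrow} Y$, where $X, Y \subseteq \mathcal{A} \times \mathcal{V}$ and $\pi_1(X)$ and $\pi_1(Y)$ are disjoint subsets of $\mathcal{A}$.

Definition 3.2 The support and confidence measures are generalized for multi-valued attributes in a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ as follows:

1. Let $X = \{(A_{i_1}, v_{j_1}), (A_{i_2}, v_{j_2}), \ldots, (A_{i_r}, v_{j_r})\}$ be any subset of $\mathcal{A} \times \mathcal{V}$. The support of $X$, denoted by $\mathrm{Supp}(X)$, is defined as the fraction of observations in $D$ for which $A_{i_1}$ takes value $v_{j_1}$, $A_{i_2}$ takes value $v_{j_2}$, $\ldots$, and $A_{i_r}$ takes value $v_{j_r}$.

2. Let $X \stackrel{mva}{\Longrightarrow} Y$ be an mva-type association rule. Then the confidence of this rule, denoted by $\mathrm{Conf}(X \stackrel{mva}{\Longrightarrow} Y)$, is defined as follows:

\[ \mathrm{Conf}(X \stackrel{mva}{\Longrightarrow} Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}. \]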
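The two measures of Definition 3.2 translate directly into code. The sketch below is illustrative only: it assumes a database represented as a list of dictionaries mapping attribute names to values, the function names `support` and `confidence` are hypothetical, and the four observations are made up for the example.

```python
def support(db, items):
    """Fraction of observations in which every (attribute, value) pair holds."""
    hits = sum(all(obs[a] == v for a, v in items) for obs in db)
    return hits / len(db)


def confidence(db, x, y):
    """Conf(X => Y) = Supp(X u Y) / Supp(X); assumes Supp(X) > 0."""
    return support(db, set(x) | set(y)) / support(db, x)


# Hypothetical database with three multi-valued attributes.
db = [
    {"A1": 1, "A2": 2, "A3": 1},
    {"A1": 1, "A2": 2, "A3": 3},
    {"A1": 1, "A2": 1, "A3": 1},
    {"A1": 2, "A2": 2, "A3": 1},
]
x = {("A1", 1), ("A2", 2)}
y = {("A3", 1)}
supp_x = support(db, x)        # two of four observations match X
conf_xy = confidence(db, x, y)  # one of those two also matches Y
```

A rule whose antecedent never occurs has undefined confidence; the sketch leaves that case to the caller.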


The above definition of an mva-type association rule has been adapted from the definition of a quantitative association rule [SA96], with a minor change to simplify the definition. In an mva-type association rule, attributes are associated with values from a fixed finite set, whereas in a quantitative association rule, attributes are associated with either categorical values (e.g., zip code, make of car) or intervals (e.g., age, income). During the process of discovering quantitative association rules, the attribute values are then mapped to discrete values. On the other hand, our definition of database $D$ (in particular, the set of values $\mathcal{V}$) assumes that the attribute values are already mapped to discrete values. In this sense, our definition of mva-type association rules simplifies the definition of quantitative association rules given in [SA96].

Note that the definitions of support and confidence in the market-basket type database can be viewed as a special case of Definition 3.2. For instance, let $A_1$, $A_2$, and $A_3$ be 0/1-valued (i.e., binary) attributes. Then, the measure "support of $\{A_1, A_2\}$" in the market-basket type data can be seen as equivalent to $\mathrm{Supp}(\{(A_1, 1), (A_2, 1)\})$, and the measure "confidence of $\{A_1, A_2\} \Longrightarrow \{A_3\}$" can be seen as equivalent to $\mathrm{Conf}(\{(A_1, 1), (A_2, 1)\} \stackrel{mva}{\Longrightarrow} \{(A_3, 1)\})$ given by Definition 3.2.

We now give some examples of databases containing multi-valued attributes, observations, and the values that attributes take for each observation. These examples are from various domains, such as medicine, bioinformatics, and social networks.

Our first example is a Patient database.

Example 3.3 Consider a Patient database in Table 3.1 where each observation consists of attribute values such as age, cholesterol, blood-pressure, and heart-rate of different patients. Here, patient 1 has age 25 years, cholesterol 105 mg/dL, blood pressure 135 mmHg, and heart-rate 75 beats per minute. Similarly, records for other patients can be read from this table.
In order to improve the usability of a database that contains attributes each of which can take real values in an arbitrary range, it is a general practice to discretize the attribute values. In the Patient database in Table 3.1, for each attribute value $a_i$, we consider the discretized value $\lfloor a_i / 10 \rfloor$. Table 3.2 displays the discretized database.
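The discretization step is small enough to check mechanically. The following sketch applies $\lfloor a_i / 10 \rfloor$ to the Patient values of Table 3.1; the dictionary layout and the names `patients`, `discretize`, and `discretized` are assumptions made for the illustration.

```python
# Patient Id -> (Age, Cholesterol, Blood-Pressure, Heart-Rate), as in Table 3.1.
patients = {
    1: (25, 105, 135, 75),
    2: (62, 160, 165, 85),
    3: (32, 125, 139, 71),
    4: (12, 95, 105, 67),
    5: (38, 129, 135, 75),
    6: (39, 121, 117, 71),
    7: (41, 134, 145, 73),
    8: (85, 125, 155, 78),
}


def discretize(value):
    """Map a non-negative reading to floor(value / 10)."""
    return value // 10


discretized = {
    pid: tuple(discretize(v) for v in row) for pid, row in patients.items()
}
```

Each resulting tuple is one row of the discretized database; for example, patient 1 becomes (2, 10, 13, 7).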


Table 3.1 Patient database.

    Observations   Attributes
    Patient Id     Age       Cholesterol   Blood-Pressure   Heart-Rate
    (Id)           (A)       (C)           (B)              (H)
                   [years]   [mg/dL]       [mmHg]           [beats/min]
    1              25        105           135              75
    2              62        160           165              85
    3              32        125           139              71
    4              12         95           105              67
    5              38        129           135              75
    6              39        121           117              71
    7              41        134           145              73
    8              85        125           155              78

Table 3.2 Patient database after discretization.

    Observations   Attributes
    Patient Id     Age    Cholesterol   Blood-Pressure   Heart-Rate
    (Id)           (A)    (C)           (B)              (H)
    1              2      10            13               7
    2              6      16            16               8
    3              3      12            13               7
    4              1       9            10               6
    5              3      12            13               7
    6              3      12            11               7
    7              4      13            14               7
    8              8      12            15               7

Let us consider the mva-type association rule $X \stackrel{mva}{\Longrightarrow} Y$, where $X = \{(A, 3), (C, 12)\}$ and $Y = \{(B, 13)\}$. This is an implication relationship that states: if the age of a patient is in the range 30-39 and the cholesterol of the same patient is in the range 120-129 mg/dL, then it is likely that the patient's blood-pressure is in the range 130-139 mmHg. Here, $\mathrm{Supp}(X) = 3/8 = 0.375$ and $\mathrm{Conf}(X \stackrel{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y) / \mathrm{Supp}(X) = 2/3 = 0.667$.

Our next example is a Gene database.

Example 3.4 Consider a Gene database in Table 3.3 where each observation consists of attribute values such as gene 1 expression value, gene 2 expression value, gene 3 expression value, and gene 4 expression value of different patients. Here, patient 1 has gene 1 expression value 54.23, gene 2 expression value 66.22, gene 3 expression value 342.32, and gene 4 expression value 422.21. Similarly, records for other patients can be read from this table.


Table 3.3 Gene database.

    Observations   Attributes
    Patient Id     Gene 1    Gene 2    Gene 3    Gene 4
    (Id)           (G1)      (G2)      (G3)      (G4)
    1               54.23     66.22    342.32    422.21
    2              541.21    324.21    165.21    852.21
    3              321.67    125.98    139.43     71.11
    4              123.87     95.54    105.88    678.65
    5              388.44    129.33    135.65    754.32
    6              399.98    121.54    117.55    719.33
    7              414.33    134.73    145.32    733.22
    8              855.78    125.93    155.76    789.43

Table 3.4 Gene database after discretization.

    Observations   Attributes
    Patient Id     Gene 1   Gene 2   Gene 3   Gene 4
    (Id)           (G1)     (G2)     (G3)     (G4)
    1              ↓        ↓        ↔        ↔
    2              ↔        ↓        ↓        ↑
    3              ↓        ↓        ↓        ↓
    4              ↓        ↓        ↓        ↑
    5              ↔        ↓        ↓        ↑
    6              ↔        ↓        ↓        ↑
    7              ↔        ↓        ↓        ↑
    8              ↑        ↓        ↓        ↑

In the Gene database in Table 3.3, for each attribute value $a_i$, we consider the discretized value ↓ if $0 \le a_i \le 333$, the discretized value ↔ if $334 \le a_i \le 666$, and the discretized value ↑ if $667 \le a_i \le 999$. Table 3.4 displays the discretized database.

Let us consider the mva-type association rule $X \stackrel{mva}{\Longrightarrow} Y$, where $X = \{(G2, ↓), (G3, ↓)\}$ and $Y = \{(G4, ↑)\}$. This is an implication relationship that states: if gene 2 and gene 3 in a patient are underexpressed, then it is likely that gene 4 is overexpressed in the patient. Here, $\mathrm{Supp}(X) = 7/8 = 0.875$ and $\mathrm{Conf}(X \stackrel{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y) / \mathrm{Supp}(X) = 6/7 = 0.857$.

Our next example is a Personal Interest database.

Example 3.5 Consider a Personal Interest database from a social network in Table 3.5, where each observation consists of a rating for attributes such as 'read', 'play', 'music', and 'eat' of different people, where 0 denotes the lowest interest and 10 denotes the highest


interest. Here, person 1 has a rating of 10 for read, 10 for play, 3 for music, and 5 for eat. Similarly, records for other people can be read from this table.

Table 3.5 Personal interest database.

    Observations   Attributes
    Person Id      Read   Play   Music   Eat
    (Id)           (R)    (P)    (M)     (E)
    1              10     10      3       5
    2               7      9      4       6
    3               3      1      9      10
    4               5      1     10       7
    5               9      8      2       6
    6               8     10      7       6
    7               5      4      6       5
    8               8     10      1       8

Table 3.6 Personal interest database after discretization.

    Observations   Attributes
    Person Id      Read   Play   Music   Eat
    (Id)           (R)    (P)    (M)     (E)
    1              h      h      l       m
    2              m      h      m       m
    3              l      l      h       h
    4              m      l      h       m
    5              h      h      l       m
    6              h      h      m       m
    7              m      m      m       m
    8              h      h      l       h

In the Personal Interest database in Table 3.5, for each attribute value $a_i$, we consider the discretized value $l$ (low) if $0 \le a_i \le 3$, $m$ (moderate) if $4 \le a_i \le 7$, and $h$ (high) if $8 \le a_i \le 10$. Table 3.6 displays the discretized database.

Let us consider the mva-type association rule $X \stackrel{mva}{\Longrightarrow} Y$, where $X = \{(R, h), (P, h)\}$ and $Y = \{(M, l)\}$. This is an implication relationship that states: if a person has high interest in reading and playing, then it is likely that the person has low interest in music. Here, $\mathrm{Supp}(X) = 4/8 = 0.5$ and $\mathrm{Conf}(X \stackrel{mva}{\Longrightarrow} Y) = \mathrm{Supp}(X \cup Y) / \mathrm{Supp}(X) = 3/4 = 0.75$.

The above examples thus illustrate databases containing multi-valued attributes.
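The Personal Interest example can be reproduced end to end with a short script. The sketch below discretizes the ratings of Table 3.5 into l/m/h and recomputes the support and confidence of the rule $\{(R, h), (P, h)\} \stackrel{mva}{\Longrightarrow} \{(M, l)\}$; the helper names `level` and `support` and the list-of-dictionaries layout are assumptions, not part of the thesis.

```python
# Ratings (Read, Play, Music, Eat) for persons 1..8, as in Table 3.5.
ratings = [
    (10, 10, 3, 5), (7, 9, 4, 6), (3, 1, 9, 10), (5, 1, 10, 7),
    (9, 8, 2, 6), (8, 10, 7, 6), (5, 4, 6, 5), (8, 10, 1, 8),
]


def level(a):
    """Discretize a 0..10 rating into low / moderate / high."""
    if a <= 3:
        return "l"
    return "m" if a <= 7 else "h"


# One dictionary per observation, keyed by the attribute letters R, P, M, E.
db = [dict(zip("RPME", map(level, row))) for row in ratings]


def support(items):
    """Fraction of observations matching every (attribute, value) pair."""
    return sum(all(obs[a] == v for a, v in items) for obs in db) / len(db)


x = {("R", "h"), ("P", "h")}
y = {("M", "l")}
supp_x = support(x)
conf_xy = support(x | y) / supp_x
```

The same two functions, with a different discretizer, would check the Patient and Gene examples as well.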


3.1.1 Discretization

We provide a methodology to discretize attribute values in financial time-series databases in Section 5.1.1. In these databases, every observation corresponds to a reading taken at a particular time and the order of observations is important. Our discretization methodology captures the relationship between any two consecutive observations in a financial time-series database. Thus, in the resulting database with discretized attribute values, the order of observations is irrelevant.

3.2 Association Hypergraphs

Definition 3.6 An association hypergraph $\mathcal{H}$ for a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ is a directed hypergraph whose vertex set $V$ is $\mathcal{A}$ and whose hyperedge set $E$ consists of directed hyperedges $(T, H)$, where $T$ and $H$ are disjoint subsets of $\mathcal{A}$. Each directed hyperedge $e = (T, H)$ has an association confidence value in the range $[0, 1]$, denoted $ACV(e)$ or $ACV(T, H)$, and an association table (as shown in Table 3.7), denoted $AT(e)$ or $AT(T, H)$, that are defined as follows:

1. The association confidence value of a directed hyperedge $(\{t_1, \ldots, t_r\}, \{h_1, \ldots, h_s\})$ equals

\[ \sum_{v_1, \ldots, v_r \in \mathcal{V}} \mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\}) \times \mathrm{Conf}(\{(t_1, v_1), \ldots, (t_r, v_r)\} \stackrel{mva}{\Longrightarrow} H^*), \]

where $H^* = \{(h_1, v_1^*), \ldots, (h_s, v_s^*)\}$ and $v_1^*, \ldots, v_s^*$ depend on $v_1, \ldots, v_r$ such that they maximize the confidence of the mva-type association rule $\{(t_1, v_1), \ldots, (t_r, v_r)\} \stackrel{mva}{\Longrightarrow} \{(h_1, v'_1), \ldots, (h_s, v'_s)\}$ over all choices of $v'_1, \ldots, v'_s \in \mathcal{V}$. In other words, it equals $\sum_{v_1, \ldots, v_r \in \mathcal{V}} \mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\} \cup H^*)$, where $H^*$ is as defined above.

2. The association table of a directed hyperedge $(\{t_1, t_2, \ldots, t_r\}, \{h_1, h_2, \ldots, h_s\})$ has rows corresponding to the set of all possible values that $t_1, \ldots, t_r$ can take from $\mathcal{V}$. The row corresponding to $t_1 = v_1, \ldots, t_r = v_r$, where $v_1, \ldots, v_r \in \mathcal{V}$, is a list that contains


(a) $\mathrm{Supp}(\{(t_1, v_1), \ldots, (t_r, v_r)\})$,
(b) the values $v_1^*, \ldots, v_s^* \in \mathcal{V}$ defined in part (1) above, and
(c) the confidence of the mva-type association rule $\{(t_1, v_1), \ldots, (t_r, v_r)\} \stackrel{mva}{\Longrightarrow} \{(h_1, v_1^*), \ldots, (h_s, v_s^*)\}$.

In other words, every row corresponds to an mva-type association rule.

The motivation for representing a database $D$ using a directed hypergraph is to capture a more general implication relationship between attributes than the one identified by mva-type association rules. For example, we now want to answer the following question: "Regardless of the values the attributes in set $T$ take, what is the likeliness of predicting the values of the attributes in set $H$?" This intuition helps us to model such a relationship between attributes in $T$ and attributes in $H$ using a directed hyperedge $(T, H)$ and to capture the likeliness as the association confidence value $ACV(T, H)$ of this directed hyperedge.

In this thesis, we consider only associations of the form $(T, H)$, where $T$ and $H$ are disjoint subsets of attributes, $|T| \le 2$, and $|H| = 1$. Having no constraint on $|T|$ and $|H|$ adds complexity to the model because of the numerous possibilities involved. Extending our methodology to model associations in databases to the general case is a subject of future work. Henceforth, we use the term association hypergraph to refer to this restricted case, and call any directed hyperedge $(T, H)$ in which $|T| = 1$ a directed edge and one in which $|T| = 2$ a 2-to-1 directed hyperedge.

3.2.1 Constructing Association Hypergraphs

The association hypergraph $\mathcal{H}$ for a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ has node set $\mathcal{A}$, and the hyperedge set consists of directed hyperedges of the form $(T, H)$, where $T$ and $H$ are disjoint subsets of $\mathcal{A}$. We construct directed hyperedges in the order of their head set. For a fixed combination of two or fewer attributes, say $\{A_1, A_2\}$, and any other attribute, say $A_3$, we determine whether $(\{A_1, A_2\}, \{A_3\})$ could be included as a directed hyperedge of $\mathcal{H}$ by checking whether the combination is significant according to Definition 3.7.
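The "in other words" form of the association confidence value suggests a simple way to compute it from the raw observations: group the observations by the values of the tail attributes, and in each group count the most frequent combination of head values. The sketch below does this for the discretized Patient database of Table 3.2. The function name `acv` and the data layout are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

# Discretized Patient database (Table 3.2): attributes A, C, B, H.
rows = [
    (2, 10, 13, 7), (6, 16, 16, 8), (3, 12, 13, 7), (1, 9, 10, 6),
    (3, 12, 13, 7), (3, 12, 11, 7), (4, 13, 14, 7), (8, 12, 15, 7),
]
db = [dict(zip("ACBH", r)) for r in rows]


def acv(db, tail, head):
    """Sum over tail-value groups of the count of the most frequent head
    value combination in that group, divided by the number of observations."""
    groups = defaultdict(Counter)
    for obs in db:
        t_vals = tuple(obs[a] for a in tail)
        h_vals = tuple(obs[a] for a in head)
        groups[t_vals][h_vals] += 1
    return sum(max(c.values()) for c in groups.values()) / len(db)


acv_empty = acv(db, (), ("B",))        # empty tail: one group with all rows
acv_a = acv(db, ("A",), ("B",))
acv_c = acv(db, ("C",), ("B",))
acv_ac = acv(db, ("A", "C"), ("B",))
```

On this small table the values are 0.375, 0.875, 0.75, and 0.875 respectively, so adding attributes to the tail never lowers the association confidence value here.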


Table 3.7 An example association table AT for the combination $(\{A_1, A_2\}, \{A_3\})$.

    Index   Values of         Support                          Most frequent value   Confidence
            A_1 and A_2       Supp({(A_1,v_1),(A_2,v_2)})      of A_3 (v_3^*)        Conf({(A_1,v_1),(A_2,v_2)} => {(A_3,v_3^*)})
    1       <1, 1>            0.14                             2                     0.43
    2       <1, 2>            0.03                             1                     0.62
    3       <1, 3>            0.06                             1                     0.75
    4       <2, 1>            0.21                             3                     0.45
    5       <2, 2>            0.11                             2                     0.38
    6       <2, 3>            0.17                             1                     0.66
    7       <3, 1>            0.01                             2                     0.49
    8       <3, 2>            0.04                             3                     0.81
    9       <3, 3>            0.23                             2                     0.73
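Given an association table, the association confidence value of Definition 3.6 is the sum over rows of support times confidence. As a quick check, the sketch below applies this to the rows of Table 3.7; the tuple layout and variable names are assumptions chosen for the illustration.

```python
# Rows of Table 3.7: (values of A1 and A2, support, most frequent A3, confidence).
association_table = [
    ((1, 1), 0.14, 2, 0.43),
    ((1, 2), 0.03, 1, 0.62),
    ((1, 3), 0.06, 1, 0.75),
    ((2, 1), 0.21, 3, 0.45),
    ((2, 2), 0.11, 2, 0.38),
    ((2, 3), 0.17, 1, 0.66),
    ((3, 1), 0.01, 2, 0.49),
    ((3, 2), 0.04, 3, 0.81),
    ((3, 3), 0.23, 2, 0.73),
]

# ACV = sum over rows of Supp(tail values) * Conf(tail values => best head value).
acv = sum(supp * conf for _, supp, _, conf in association_table)

# The supports of all tail-value combinations should account for every observation.
total_support = sum(supp for _, supp, _, _ in association_table)
```

For this table the sum works out to 0.5775, and the supports add up to 1.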


Definition 3.7 Consider a combination $(T, H)$ for inclusion as a directed hyperedge of the association hypergraph $\mathcal{H}$, where $|T| \ge 1$. For $\gamma \ge 1$, we say that $(T, H)$ is $\gamma$-significant if

\[ ACV(T, H) \ge \gamma \cdot \max_{v \in T} \{ ACV(T - \{v\}, H) \}. \]

If $(T, H)$ is $\gamma$-significant, then we include this directed hyperedge in $\mathcal{H}$. Otherwise we skip this combination and proceed to the next one. The weight of a directed hyperedge $(T, H)$ is set to $ACV(T, H)$.

The association table for the directed hyperedge $(\{A_1, A_2\}, \{A_3\})$ is constructed as follows. $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\})$ is computed by counting observations in the database for which $A_1$'s value is $v_1$ and $A_2$'s value is $v_2$. $\mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \stackrel{mva}{\Longrightarrow} \{(A_3, v_3^*)\})$ is then computed by counting those of these observations for which $A_3$'s value is $v_3^*$, where $v_3^*$ is the most frequent value for $A_3$ among them. This process is repeated for all possible values that $A_1$ and $A_2$ can take from $\mathcal{V}$.

Theorem 3.8 Let $A_1$, $A_2$, and $A_3$ be attributes over common observations. Then the following holds:

1. $ACV(\{A_1\}, \{A_3\}) \ge ACV(\emptyset, \{A_3\})$.
2. $ACV(\{A_1, A_2\}, \{A_3\}) \ge \max \{ ACV(\{A_1\}, \{A_3\}), ACV(\{A_2\}, \{A_3\}) \}$.

Proof We prove part (1) only, as the proof of part (2) is similar. Assume that the symbolic representations of $A_1$ and $A_3$ are over the alphabet $\{1, 2, \ldots, k\}$. Since $A_1$ and $A_3$ have common observations, all the observations of $A_1$ and $A_3$ contribute towards building the association table $AT$ of the combination $(\{A_1\}, \{A_3\})$. Suppose that there are $d$ rows in $AT$. Let $\mathrm{Maj}(d)$ denote the number of times the most frequent value of $A_3$ occurs in these rows. Then, the fraction of the most frequent value of $A_3$ is $\mathrm{Maj}(d)/d$. W.l.o.g., assume that the most frequent value of $A_3$ is 1.

Out of $d$ rows in $AT$, let $d_i$ of them have value $i$ for $A_1$. Thus, we have $\sum_{i=1}^{k} d_i = d$. Assume that, for each $1 \le i \le k$, the number of rows in $AT$ such that $A_1$ takes value $i$ and $A_3$ takes its most frequent value, 1, is $a_i$. Then, we have $\sum_{i=1}^{k} a_i = \mathrm{Maj}(d)$. For every $1 \le i \le k$, restricting $AT$ to all rows in which $A_1$ takes value $i$, let $\mathrm{Maj}(d_i)$ denote the number of rows the most frequent value of $A_3$ belongs to in this restricted $AT$.


By definition, $ACV(\{A_1\}, \{A_3\})$ is given by $\sum_{i=1}^{k} (d_i / d) \cdot (\mathrm{Maj}(d_i) / d_i)$, which simplifies to $\sum_{i=1}^{k} \mathrm{Maj}(d_i) / d$. Next, notice that for every $1 \le i \le k$, $\mathrm{Maj}(d_i) \ge a_i$, since the most frequent value of $A_3$ out of all rows of $AT$ that have $A_1 = i$ must occur at least $a_i$ times. It follows that $\sum_{i=1}^{k} \mathrm{Maj}(d_i) \ge \sum_{i=1}^{k} a_i = \mathrm{Maj}(d)$. Hence, we get that

\[ ACV(\{A_1\}, \{A_3\}) = \sum_{i=1}^{k} \mathrm{Maj}(d_i) / d \ge \mathrm{Maj}(d) / d = ACV(\emptyset, \{A_3\}). \]

This completes the proof.

3.3 Association-Based Similarity Between Multi-Valued Attributes

Undirected hypergraphs have been used to cluster binary attributes and subsets of binary attributes. Han et al. [HKKM98] used undirected hypergraphs where nodes represent binary attributes and hyperedges represent subsets of attributes. They applied an undirected hypergraph clustering algorithm to identify clusters of similar attributes. Ozdal et al. [OA04] also used undirected hypergraphs where nodes represent patterns (i.e., subsets of attributes) and hyperedges represent relationships among patterns, and proposed a clustering approach for these hypergraph models. Lent et al. [LSW97] used clustering to group association rules involving multi-valued attributes. This grouping was done based on certain attribute conditions that segment the observations in the database.

We propose similarity notions that measure association-based similarity between any two nodes of the association hypergraph. We can imagine two nodes $A$ and $B$ to be in-similar if significantly many directed hyperedges that contain $A$ in their head set lead to valid directed hyperedges that contain $B$ in their head set when $A$ is replaced by $B$. Here, $A$ and $B$ are in-similar since they both share similar incoming directed hyperedges. Likewise, $A$ and $B$ can be regarded as out-similar by relating to the tail set instead of the head set of directed hyperedges. These similarity notions are then used to define clustering of multi-valued attributes.


Notation 3.9 Let $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2.

1. For any $A \in V$, $out_{\mathcal{H}}(A)$ denotes the set of all directed hyperedges of $\mathcal{H}$ whose tail set contains $A$.

2. For any $A \in V$, $in_{\mathcal{H}}(A)$ denotes the set of all directed hyperedges of $\mathcal{H}$ whose head set contains $A$.

3. For any $A_1, A_2 \in V$ and $e = (T, H) \in out_{\mathcal{H}}(A_1)$, $e|_{T: A_1 \to A_2}$ denotes the directed hyperedge $(T', H')$ whose head set $H'$ is $H$ and whose tail set $T'$ is formed from $T$ by replacing node $A_1$ by node $A_2$ (i.e., $H' = H$ and $T' = (T - \{A_1\}) \cup \{A_2\}$). For any set of directed hyperedges $S$ and nodes $A_1$ and $A_2$, $S|_{T: A_1 \to A_2}$ denotes the set $\bigcup_{e \in S} \{ e|_{T: A_1 \to A_2} \}$.

4. For any $A_1, A_2 \in V$ and $e = (T, H) \in in_{\mathcal{H}}(A_1)$, $e|_{H: A_1 \to A_2}$ denotes the directed hyperedge $(T', H')$ whose tail set $T'$ is $T$ and whose head set $H'$ is formed from $H$ by replacing node $A_1$ by node $A_2$ (i.e., $T' = T$ and $H' = (H - \{A_1\}) \cup \{A_2\}$). For any set of directed hyperedges $S$ and nodes $A_1$ and $A_2$, $S|_{H: A_1 \to A_2}$ denotes the set $\bigcup_{e \in S} \{ e|_{H: A_1 \to A_2} \}$.

Notation 3.10 Let $A_1$ and $A_2$ be attributes and $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2. Let $\emptyset$ denote the empty directed hyperedge.

1. $out_{\mathcal{H}}(A_1) \sqcap out_{\mathcal{H}}(A_2)$ is the set of directed hyperedge pairs $(e, f)$ s.t. $e \in out_{\mathcal{H}}(A_1)$, $f \in out_{\mathcal{H}}(A_2)$, and $e = f|_{T: A_2 \to A_1}$. The definition of $in_{\mathcal{H}}(A_1) \sqcap in_{\mathcal{H}}(A_2)$ is similar.

2. $out_{\mathcal{H}}(A_1) \sqcup out_{\mathcal{H}}(A_2)$ is the union of the following sets of directed hyperedge pairs:
   (a) $out_{\mathcal{H}}(A_1) \sqcap out_{\mathcal{H}}(A_2)$,
   (b) $(e, \emptyset)$ s.t. $e \in out_{\mathcal{H}}(A_1)$ and $e \ne f|_{T: A_2 \to A_1}$ for each $f \in out_{\mathcal{H}}(A_2)$, and
   (c) $(\emptyset, f)$ s.t. $f \in out_{\mathcal{H}}(A_2)$ and $e|_{T: A_1 \to A_2} \ne f$ for each $e \in out_{\mathcal{H}}(A_1)$.
   The definition of $in_{\mathcal{H}}(A_1) \sqcup in_{\mathcal{H}}(A_2)$ is similar.

3.3.1 In-Similarity and Out-Similarity

In Definition 3.11, we define the similarity notions in-similarity and out-similarity for any pair $A_1, A_2$ of attributes. The in-similarity of attributes $A_1$ and $A_2$, denoted by


in-sim$_{\mathcal{H}}(A_1, A_2)$, is the weighted fraction of directed hyperedges $e \in in_{\mathcal{H}}(A_1) \cup in_{\mathcal{H}}(A_2)$ such that switching $A_1$ to $A_2$ in the head set of $e$ results in another directed hyperedge; the weights are the association confidence values of directed hyperedges. Similarly, the out-similarity of attributes $A_1$ and $A_2$, denoted by out-sim$_{\mathcal{H}}(A_1, A_2)$, is the weighted fraction of directed hyperedges $e \in out_{\mathcal{H}}(A_1) \cup out_{\mathcal{H}}(A_2)$ such that switching $A_1$ to $A_2$ in the tail set of $e$ results in another directed hyperedge.

Definition 3.11 Let $A_1$ and $A_2$ be attributes and $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2. The following similarity notions are defined for $A_1$ and $A_2$:

1. \[ \text{out-sim}_{\mathcal{H}}(A_1, A_2) = \frac{\sum_{(e, f) \in out_{\mathcal{H}}(A_1) \sqcap out_{\mathcal{H}}(A_2)} \min \{ ACV(e), ACV(f) \}}{\sum_{(e, f) \in out_{\mathcal{H}}(A_1) \sqcup out_{\mathcal{H}}(A_2)} \max \{ ACV(e), ACV(f) \}}. \]

2. \[ \text{in-sim}_{\mathcal{H}}(A_1, A_2) = \frac{\sum_{(e, f) \in in_{\mathcal{H}}(A_1) \sqcap in_{\mathcal{H}}(A_2)} \min \{ ACV(e), ACV(f) \}}{\sum_{(e, f) \in in_{\mathcal{H}}(A_1) \sqcup in_{\mathcal{H}}(A_2)} \max \{ ACV(e), ACV(f) \}}. \]

Example 3.12 Suppose $\mathcal{H}$ has directed hyperedges $a = (\{A_1, A_3\}, \{A_6\})$, $b = (\{A_1, A_4\}, \{A_6\})$, $c = (\{A_2, A_3\}, \{A_6\})$, $d = (\{A_2, A_4, A_5\}, \{A_6\})$, and $e = (\{A_4, A_5\}, \{A_6\})$, where $A_1$, $A_2$, $A_3$, $A_4$, $A_5$, and $A_6$ are attributes. Let the ACVs of $a$, $b$, $c$, $d$, and $e$ be 0.4, 0.5, 0.6, 0.7, and 0.8, respectively. Then, we have $out_{\mathcal{H}}(A_1) \sqcap out_{\mathcal{H}}(A_2) = \{(a, c)\}$ and $out_{\mathcal{H}}(A_1) \sqcup out_{\mathcal{H}}(A_2) = \{(a, c), (b, \emptyset), (\emptyset, d)\}$, and so

\[ \text{weighted out-sim}_{\mathcal{H}}(A_1, A_2) = \frac{0.4}{0.6 + 0.5 + 0.7} = 0.22. \]

3.3.2 Clusters of Similar Attributes

We define below the notion of a similarity graph induced by any subset $\mathcal{S}$ of attributes, which assigns every attribute pair $\{A_1, A_2\}$ in $\mathcal{S}$ an undirected edge whose weight depends on the in-similarity and the out-similarity of the pair.

Definition 3.13 Let $\mathcal{H} = (V, E)$ be an association hypergraph as defined in Section 3.2.
Given any collection $\mathcal{S}$ of attributes, a similarity graph $SG_{\mathcal{S}} = (V', E')$ induced by $\mathcal{S}$ in $\mathcal{H}$ is an undirected, weighted, complete graph whose node set $V'$ is $\mathcal{S}$ and whose edge set $E'$ contains all attribute pairs in $\mathcal{S}$ such that, for every edge $\{A_1, A_2\} \in E'$, its weight $d(A_1, A_2)$ is defined as

\[ d(A_1, A_2) = 1 - \frac{\text{weighted in-sim}_{\mathcal{H}}(A_1, A_2) + \text{weighted out-sim}_{\mathcal{H}}(A_1, A_2)}{2}. \]

Our objective is to determine a partition of $\mathcal{S}$ into subsets of attributes such that attributes within each subset are highly similar in their associative characteristics. The $t$-clustering


algorithm by Gonzalez [Gon85], presented in Algorithm 2 (Chapter 2), finds such a partition of $\mathcal{S}$ by designating some $t$ attributes as cluster centers. The algorithm takes the parameter $t$ in the input and assigns each attribute to its closest cluster center. This is a factor-2 approximation algorithm for minimizing the diameter of the $t$-clustering, assuming that the distances (i.e., weights) satisfy the metric properties. Here, the diameter of a $t$-clustering is the maximum distance between any two data points that are within the same cluster.

If our original association hypergraph $\mathcal{H}$ has $n$ vertices and $m$ directed hyperedges, then the construction of the similarity graph $SG_{\mathcal{S}}$ takes time $O(m^2)$, and the computation of the $t$-clustering of vertices in $SG_{\mathcal{S}}$ takes additional time $O(|t| |\mathcal{S}|)$.
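To make the similarity computation behind the edge weights concrete, the sketch below evaluates the out-similarity of Definition 3.11 on the hypergraph of Example 3.12. It is a minimal sketch under stated assumptions: hyperedges are represented as pairs of frozensets mapped to their ACVs, the names `replace_in_tail` and `out_sim` are hypothetical, an unmatched hyperedge is paired with an empty hyperedge of ACV 0 (as the arithmetic of Example 3.12 suggests), and hyperedges whose tail contains both attributes are not treated specially.

```python
def replace_in_tail(edge, old, new):
    """Return the hyperedge obtained by swapping `old` for `new` in the tail."""
    tail, head = edge
    return (tail - {old}) | {new}, head


def out_sim(acv, a1, a2):
    """Weighted out-similarity; `acv` maps (tail, head) frozenset pairs to ACVs."""
    out1 = [e for e in acv if a1 in e[0]]
    out2 = [f for f in acv if a2 in f[0]]
    matched = [(e, f) for e in out1 for f in out2
               if e == replace_in_tail(f, a2, a1)]
    matched_e = {e for e, _ in matched}
    matched_f = {f for _, f in matched}
    numerator = sum(min(acv[e], acv[f]) for e, f in matched)
    denominator = sum(max(acv[e], acv[f]) for e, f in matched)
    # Unmatched hyperedges pair with the empty hyperedge (assumed ACV 0).
    denominator += sum(acv[e] for e in out1 if e not in matched_e)
    denominator += sum(acv[f] for f in out2 if f not in matched_f)
    return numerator / denominator if denominator else 0.0


def edge(tail, head):
    return frozenset(tail), frozenset(head)


# Hyperedges a, b, c, d, e of Example 3.12 with their ACVs.
acv = {
    edge({"A1", "A3"}, {"A6"}): 0.4,
    edge({"A1", "A4"}, {"A6"}): 0.5,
    edge({"A2", "A3"}, {"A6"}): 0.6,
    edge({"A2", "A4", "A5"}, {"A6"}): 0.7,
    edge({"A4", "A5"}, {"A6"}): 0.8,
}
similarity = out_sim(acv, "A1", "A2")
```

Rounded to two decimals, the result agrees with the value 0.22 given in Example 3.12. An in-similarity function would follow the same pattern with head sets in place of tail sets.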


CHAPTER 4
COMPUTATIONAL PROBLEMS

An interesting question that arises in the context of association rules is how to devise a methodology to build classification rules. A classification rule suffers from overfitting when the rule closely models a particular characteristic of the training dataset and, due to this specificity, fails to predict the value of the attribute in an unseen test dataset. A classification rule suffers from underfitting when the rule models no particular characteristic of the training dataset and, due to this generality, fails to predict the value of the attribute both in the training dataset and in an unseen test dataset. Liu et al. [LHM98] addressed the above problems by mining association rules that have only one attribute in their consequent, proposing a pruning-based algorithm to generate classification rules, and testing on an unseen dataset. Bayardo [Bay97] proposed various methods that again use pruning to reduce the number of association rules discovered by standard association rule mining algorithms. But the quality of the rules is unclear, as they have not been tested on an unseen dataset. Liu et al. [LHM99] addressed the above problems by introducing multiple minimum supports during mining of association rules.

In this chapter, we use the directed hypergraph based model to first propose algorithms to identify a small subset of attributes that influence the characteristics of almost all other attributes, and then present the association-based classifier that can be used to predict the values of attributes.

4.1 Leading Indicators

A leading indicator $X$ for any set $\mathcal{S}$ of attributes is a subset of $\mathcal{S}$ such that knowing only the values for the attributes in $X$ allows us to infer the value for all attributes in $\mathcal{S} - X$. Motivated by the notion of a dominating set for any graph, which essentially captures the


property of a subset of nodes covering all the nodes of the graph by at most one edge, we define below the notion of a dominator for a set of vertices of any association hypergraph. Our hypothesis is that a dominator for nodes corresponding to the set $\mathcal{S}$ of attributes in the association hypergraph modeling attribute relationships gives a leading indicator for $\mathcal{S}$.

Definition 4.1 A dominator for a set $\mathcal{S}$ of vertices in an association hypergraph $\mathcal{H} = (V, E)$ is a set $X \subseteq V$ such that, for every $u \in \mathcal{S} - X$, there is a directed hyperedge $e = (T, H) \in E$ such that $T \subseteq X$ and $u \in H$. That is, each node $u \in \mathcal{S} - X$ is covered using only directed hyperedges whose tail set is from the set $X$.

We next provide two greedy algorithms for identifying a subset of attributes, known as a leading indicator, that influences the values of almost all other attributes. The first greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cardinality dominating set in graphs. The second greedy algorithm is based on an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cost set cover.

4.1.1 An Adaptation of Graph Dominating Set Approximation Algorithm

Algorithm 5 is a greedy algorithm for computing a dominator for any set $\mathcal{S}$ of vertices in an association hypergraph $\mathcal{H} = (V, E)$. It is an adaptation of the greedy $O(\log n)$-approximation algorithm for computing a minimum cardinality dominating set in graphs. In order to minimize the size of the dominator, the following greedy heuristic is applied: for every node $u$ that is not part of the dominating set yet, the algorithm computes the node effectiveness $\alpha(u)$ that reflects $u$'s covering ability. During each iteration of the algorithm, until all the nodes in the graph are covered, the node with the highest effectiveness value is added to the dominator set. The algorithm runs in time $O(|\mathcal{S}| |E|)$.

4.1.2 An Adaptation of Set Cover Approximation Algorithm

Algorithm 6 computes a dominator for any set $\mathcal{S}$ of vertices of any association hypergraph $\mathcal{H} = (V, E)$. It is an adaptation of the greedy $O(\log n)$-approximation algorithm for Set Cover. The algorithm maintains a variable DomSet to store the partially constructed


dominator set of $\mathcal{H}$ and a variable CoveredSet to store the set of vertices covered by those in DomSet. This is an iterative algorithm such that in each iteration some subset of vertices $t$ is greedily chosen and made part of DomSet. The algorithm maintains a variable $T$ that stores subsets of vertices for possible inclusion in DomSet. Initially, $T$ is the collection of all tail sets of directed hyperedges in $\mathcal{H}$, i.e., $T = \{ T(e) \mid e \in E \}$.

Algorithm 5: A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of graph dominating set approximation.

    Input: A set $\mathcal{S}$ of vertices and an association hypergraph $\mathcal{H} = (V, E)$
    Output: A dominator DomSet for the set $\mathcal{S}$ of vertices.
    begin
        DomSet $\leftarrow \emptyset$;
        CoveredSet $\leftarrow \emptyset$;
        while CoveredSet $\ne \mathcal{S}$ do
            foreach vertex $u \in V -$ DomSet do
                if $u \notin$ CoveredSet and $u \in \mathcal{S}$ then
                    $\alpha(u) \leftarrow 1$;
                end
                else
                    $\alpha(u) \leftarrow 0$;
                end
                $\alpha(u) \leftarrow \alpha(u) + \sum_{v \notin \text{CoveredSet} \wedge v \in \mathcal{S}} L(u, v)$,
                where $L(u, v) \leftarrow \max_{e : u \in T(e) \wedge v \in H(e)} \frac{w(e)}{|T(e) - \text{DomSet}|}$
            end
            Let $u' \in V$ be such that $\alpha(u') = \max_{u \notin \text{DomSet}} \alpha(u)$;
            DomSet $\leftarrow$ DomSet $\cup \{u'\}$;
            CoveredSet $\leftarrow$ CoveredSet $\cup \{u'\} \cup \{ v \in \mathcal{S} \mid \exists e \in E$ s.t. $v \in H(e)$
                and $T(e) \subseteq$ DomSet$\}$;
        end
        return DomSet;
    end

The algorithm iterates until all the vertices in the set $\mathcal{S}$ are covered. For every subset $t \in T$, a measure $\alpha(t)$ is defined that reflects $t$'s covering ability in the following sense: $\alpha(t)$ counts all new vertices that can be covered by including $t$ in DomSet. In each iteration (in Line 5), $\alpha(t)$ is computed for every $t \in T$ as follows. First, $\alpha(t)$ is assigned the number of vertices that are in $t$ as well as $\mathcal{S}$, but not yet in CoveredSet (Lines 6-12). Next, $\alpha(t)$ adds up the number of vertices in $\mathcal{S} -$ CoveredSet that $t$ covers via some directed hyperedge $e$ whose tail set $T(e)$ is part of $t$ (Lines 13-17). Specifically, for the latter


contribution in $\alpha(t)$, every directed hyperedge $e$ is traversed and if both $T(e) \subseteq t$ and $H(e) \in \mathcal{S} -$ CoveredSet, then $H(e)$ is counted in the computation of $\alpha(t)$. Clearly, all sets $t \in T$ whose effectiveness $\alpha(t)$ is zero are insignificant; Line 18 discards such sets from $T$ so that they are not considered in later iterations. Towards the end of every iteration, the set $t'$ of maximum effectiveness is included in DomSet and the variable CoveredSet is updated to account for the changed DomSet (Lines 20-22).

Since each iteration takes $O(|E|^2)$ time and there are $|\mathcal{S}|$ iterations, the computation time of the algorithm is $O(|\mathcal{S}| |E|^2)$.

We now provide two enhancements with the goal of improving the computation time of Algorithm 6 and also of reducing the size of the dominator set output by this algorithm. Enhancement 1, presented in Algorithm 7, should be added between Lines 20-21, and Enhancement 2, presented in Algorithm 8, should be added between Lines 22-24 of Algorithm 6.

Enhancement 1. During each iteration of the while loop in Line 5, Algorithm 6 includes in DomSet a subset $t'$ whose effectiveness $\alpha(t')$ is the highest among all $t \in T$. This enhancement considers the scenario in which there is more than one possible candidate $t'$ in some iteration. In such a case, the candidate subset $t'$ that contributes the least number of elements to DomSet (i.e., that minimizes $|t' -$ DomSet$|$) might be better than any other candidate $t''$.

Enhancement 2. We know that Algorithm 6 updates the coverage of DomSet in every iteration by going over each directed hyperedge whose tail set is a part of DomSet and adding vertices in the head set that are in $\mathcal{S}$ to CoveredSet. By removing from $T$ a subset $t$ that is already a part of DomSet, the enhancement saves the computation time required to go over all the directed hyperedges in the subsequent iterations of the algorithm to compute $\alpha(t)$.

4.2 Association-Based Classifier

Let $\mathcal{S}$ be the set of attributes $A_1, A_2, \ldots, A_t$ and let these attributes take values $v_1, v_2, \ldots, v_t$, respectively. Let $\mathcal{T}$ be another set of attributes, disjoint from $\mathcal{S}$. The association-based


Algorithm 6: A greedy algorithm for computing dominators in directed hypergraphs which is an adaptation of set cover approximation.

    Input: A set $\mathcal{S}$ of vertices and an association hypergraph $\mathcal{H} = (V, E)$
    Output: A dominator DomSet for the set $\mathcal{S}$ of vertices.
     1  begin
     2      DomSet $\leftarrow \emptyset$;
     3      CoveredSet $\leftarrow \emptyset$;
     4      $T \leftarrow \{ T(e) \mid e \in E \}$;
     5      while CoveredSet $\ne \mathcal{S}$ do
     6          foreach set $t \in T$ do
     7              $\alpha(t) \leftarrow 0$;
     8              foreach vertex $u \in t$ do
     9                  if $u \notin$ CoveredSet and $u \in \mathcal{S}$ then
    10                      $\alpha(t) \leftarrow \alpha(t) + 1$;
    11                  end
    12              end
    13              foreach directed hyperedge $e$ such that $T(e) \subseteq t$ do
    14                  if $H(e) \notin$ CoveredSet and $H(e) \in \mathcal{S}$ then
    15                      $\alpha(t) \leftarrow \alpha(t) + 1$;
    16                  end
    17              end
    18              if $\alpha(t) == 0$ then $T \leftarrow T - \{t\}$;
    19          end
    20          Let $t'$ be such that $\alpha(t') = \max_{t \in T} \alpha(t)$;
    21          DomSet $\leftarrow$ DomSet $\cup\ t'$;
    22          CoveredSet $\leftarrow$ CoveredSet $\cup \{t'\} \cup \{ v \in \mathcal{S} \mid \exists e \in E$ s.t. $v \in H(e)$ and
    23              $T(e) \subseteq$ DomSet$\}$;
    24      end
    25      return DomSet;
    26  end

Algorithm 7: Enhancement 1.

    foreach set $t \in T$ do
        if $\alpha(t') == \alpha(t)$ then
            if $|t' -$ DomSet$| > |t -$ DomSet$|$ then
                $t' \leftarrow t$;
            end
        end
    end

Algorithm 8: Enhancement 2.

    foreach set $t \in T$ do
        if $t \subseteq$ DomSet then
            $T \leftarrow T - \{t\}$;
        end
    end
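The set-cover style greedy of Algorithm 6, together with the two enhancements, can be sketched in Python as follows. This is a simplified illustration rather than a line-by-line transcription, and its deviations are assumptions of the sketch: every hyperedge has a single head vertex, each newly covered vertex is counted once in the effectiveness (so a head reached by two hyperedges is not double-counted), only vertices of the target set are added to the covered set, and the loop stops early if no candidate covers anything new. The name `greedy_dominator` and the example hypergraph are hypothetical.

```python
def greedy_dominator(targets, edges):
    """edges: list of (tail, head) with tail a frozenset and head one vertex."""
    targets = set(targets)
    dom, covered = set(), set()
    # Candidate sets are the distinct tail sets, kept in a fixed order.
    candidates = sorted({tail for tail, _ in edges}, key=sorted)
    while not targets <= covered and candidates:

        def gain(t):
            # Target vertices newly covered by adding t: members of t plus
            # heads of hyperedges whose tail lies inside t.
            heads = {h for tail, h in edges if tail <= t}
            return len(((t | heads) & targets) - covered)

        # Highest gain wins; ties go to the set that adds the fewest new
        # vertices to the dominator (the idea of Enhancement 1).
        best = max(candidates, key=lambda t: (gain(t), -len(t - dom)))
        if gain(best) == 0:
            break  # the remaining targets cannot be covered
        dom |= best
        covered |= best & targets
        covered |= {h for tail, h in edges if tail <= dom and h in targets}
        # Enhancement 2: drop candidates already contained in the dominator.
        candidates = [t for t in candidates if not t <= dom]
    return dom, covered


# Hypothetical association hypergraph on attributes A1..A5.
edges = [
    (frozenset({"A1"}), "A3"),
    (frozenset({"A1", "A2"}), "A4"),
    (frozenset({"A2"}), "A5"),
    (frozenset({"A3"}), "A4"),
]
targets = {"A1", "A2", "A3", "A4", "A5"}
dom, covered = greedy_dominator(targets, edges)
```

In this example the tail set {A1, A2} covers all five attributes in one step, so it is chosen as the dominator in the first iteration.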

classifier determines the values of all attributes in $\mathcal{T}$ given the values of attributes in $\mathcal{S}$. For this problem, we will assume that $\mathcal{S}$ is a dominator for $\mathcal{T}$ in the association hypergraph $H$ and so $\mathcal{S}$ can be computed using the algorithm presented in Section 4.1. This assumption stands on our earlier stated hypothesis that a dominator for any set of attributes is also a leading indicator for the set.

Algorithm 9: Association-based classifier.

Input: An association hypergraph $H = (V, E)$ modeling attribute relationships, a set $\mathcal{T}$ of attributes, and a set $\mathcal{S} = \{(A_1, v_1), (A_2, v_2), \ldots, (A_t, v_t)\}$, where $A_1, A_2, \ldots, A_t$ are attributes and $v_1, v_2, \ldots, v_t \in \mathcal{V}$ are their respective values.
Output: An assignment of values that assigns each attribute $Y \in \mathcal{T}$ its best classified value $y^*$ and the classification confidence val[$y^*$] associated with every such assignment $y^*$ to $Y$.

 1  begin
 2    foreach attribute $Y \in \mathcal{T}$ do
 3      for $y \leftarrow 1$ to $k$ do
 4        val[$y$] $\leftarrow 0$;
 5      end
 6      foreach directed hyperedge $e = (T, H) \in E$ with $H = \{Y\}$ and $T \subseteq \{A_1, A_2, \ldots, A_t\}$ do
 7        Let $T$ be $\{A_1, A_2\}$ and let $y$ be the most frequent value of $Y$ given "$A_1 = v_1$" and "$A_2 = v_2$";
 8        val[$y$] $\leftarrow$ val[$y$] $+ \mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\}) \times \mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$;
 9      end
10      Let $y^* \in \{1, \ldots, k\}$ be such that val[$y^*$] $= \max_{y \in \{1, \ldots, k\}}$ val[$y$];
11      val[$y^*$] $\leftarrow$ val[$y^*$] $/ \sum_{y \in \{1, \ldots, k\}}$ val[$y$];
12      Output "$(Y, y^*)$, val[$y^*$]";
13    end
14  end

To compute the value for any attribute $Y \in \mathcal{T}$ given only the values $v_1, v_2, \ldots, v_t$ of a set $\mathcal{S}$ of the attributes, we iterate over all directed hyperedges of $H$. For each directed hyperedge $e$ whose tail set is a subset of the attributes in $\mathcal{S}$ and whose head set is $\{Y\}$, we examine the AT associated with $e$. Specifically, assume that $e = (\{A_1, A_2\}, \{Y\})$. Using the AT of $e$, we find $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\})$ and $\mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$, where $y \in \mathcal{V}$ is the most frequent value of $Y$ given that $A_1$ takes value $v_1$ and $A_2$ takes value $v_2$. The contribution of the directed hyperedge $e$ in the value assignment $y$ of $Y$ is $\mathrm{Supp}(\{(A_1, v_1), (A_2, v_2)\}) \times \mathrm{Conf}(\{(A_1, v_1), (A_2, v_2)\} \overset{mva}{\Longrightarrow} \{(Y, y)\})$. The total contribution

of all directed hyperedges in the value assignment $y$ of $Y$ is denoted by val[$y$]. At the end of all iterations, we choose the value $y^*$ of $Y$ for which val[$y^*$] is maximum. The classification confidence associated with the value assignment $y^*$ to $Y$ is then the normalized val[$y^*$] $\in [0, 1]$.

We see that the contributions from all mva-type association rules of both relevant directed edges and relevant directed hyperedges are taken into consideration during the process of determination of the value of an attribute. This approach avoids overfitting by not fixing the value assignment for an attribute by looking only into a particular mva-type association rule that closely models some characteristic of the data set. Also, this approach avoids underfitting as it considers the appropriate weights for all mva-type association rules before fixing the value assignment for an attribute.

Algorithm 9 describes this solution approach. The input is an association hypergraph $H = (V, E)$ modeling attribute relationships, a set $\mathcal{T}$ of attributes, and a set $\mathcal{S} = \{(A_1, v_1), (A_2, v_2), \ldots, (A_t, v_t)\}$, where $A_1, A_2, \ldots, A_t$ are attributes and $v_1, v_2, \ldots, v_t \in \mathcal{V}$ are their respective values. The algorithm returns an assignment of values that assigns each attribute $Y \in \mathcal{T}$ its best classified value $y^*$ and also returns the classification confidence val[$y^*$] associated with every such assignment $y^*$ to $Y$. Its running time is $O(k^2 |\mathcal{T}||E|)$.
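The voting scheme of Algorithm 9 can be illustrated with a short Python sketch for a single target attribute. The data layout is hypothetical: each hyperedge is assumed to carry an association table that maps a combination of tail values to a `(support, most frequent head value, confidence)` triple, and the function name `classify` is introduced here only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical association-table layout for one hyperedge: each combination of
# tail values maps to (support, most frequent head value, confidence of that rule).
AssocTable = Dict[Tuple[int, ...], Tuple[float, int, float]]
Hyperedge = Tuple[Tuple[str, ...], str, AssocTable]  # (tail attributes, head, AT)


def classify(edges: List[Hyperedge], known: Dict[str, int],
             target: str) -> Optional[Tuple[int, float]]:
    """Return (best value, normalized confidence) for `target`, or None if no rule applies."""
    val: Dict[int, float] = defaultdict(float)
    for tail, head, table in edges:
        # Only hyperedges that point at the target and whose tail values are all known.
        if head != target or any(a not in known for a in tail):
            continue
        row = table.get(tuple(known[a] for a in tail))
        if row is None:
            continue
        support, y, confidence = row
        val[y] += support * confidence  # contribution of this hyperedge to value y
    total = sum(val.values())
    if total == 0:
        return None
    best = max(val, key=val.get)
    return best, val[best] / total
```

Because every applicable hyperedge votes with weight support times confidence, a single strong rule cannot fix the assignment on its own, which mirrors the overfitting argument given above.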

CHAPTER 5
EXPERIMENTATION

The Standard & Poor's 500 (S&P 500) is a market-value-weighted index of the stock prices of some 500 large publicly held companies. We used Yahoo Finance [Yah10] to obtain stock information for all financial time-series in the S&P 500. In this section, all analysis is based on the daily closing stock price information from Jan 1, 1995 to Dec 21, 2009. We had to restrict the start date to Jan 1, 1995 since a number of financial time-series in the current S&P 500 started trading only from the mid 90s. As a result of this clean up of our data, the number of financial time-series in our analysis is 346. These time-series belong to the following industrial sectors: Basic Materials (BM), Capital Goods (CG), Conglomerates (C), Consumer Cyclical (CC), Consumer Noncyclical (CN), Energy (E), Financial (F), Healthcare (H), Services (SV), Technology (T), Transportation (TP), and Utilities (U). Each industrial sector is subdivided into sub-sectors (e.g., Technology (T) has 11 sub-sectors including Communications Equipment, Computer Hardware, Computer Networks, Computer Peripherals, Computer Services, Computer Storage Devices, Electronic Instr. and Controls, Office Equipment, Scientific and Technical Instr., Semiconductors, and Software and Programming). The total number of sub-sectors over the entire sectors is 104.

5.1 Association Hypergraph Modeling

5.1.1 Discretization

We now describe how to transform a financial time-series data set into a database $D(\mathcal{A}, \mathcal{O}, \mathcal{V})$ suitable for the association hypergraph modeling. For each financial time-series in the data set, we create a delta time-series, which is a list of real numbers whose $i$'th entry is the fractional change in the closing stock price of the $(i+1)$'th day relative to the closing stock price of the $i$'th day. We then compute a $k$-threshold vector, for some integer $k \geq 2$, for each

financial time-series. A $k$-threshold vector for a financial time-series $A$ is a $(k-1)$-tuple $\langle a_1, a_2, \ldots, a_{k-1} \rangle$ such that, for every $1 \leq i \leq k$, we have $a_{i-1}$
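The delta time-series defined in Section 5.1.1 can be sketched as follows. This is only an illustration of the fractional-change computation; the function name `delta_series` is an assumption, and the sketch deliberately stops before the discretization step because the rest of the $k$-threshold definition is not available in the text above.

```python
from typing import List, Sequence


def delta_series(closing: Sequence[float]) -> List[float]:
    """Entry i is the fractional change of the close on day i+1 relative to day i."""
    return [(closing[i + 1] - closing[i]) / closing[i]
            for i in range(len(closing) - 1)]
```

A series of $n$ closing prices yields $n - 1$ fractional changes, so a price move from 100 to 110 contributes an entry of roughly 0.1.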

with a mean ACV of 0.437 and the configuration C2 leads to 109,810 directed edges with a mean ACV of 0.288 and 274,048 2-to-1 directed hyperedges with a mean ACV of 0.288.

5.2 Association Characteristics of Financial Time-Series

Figure 5.1(a) shows the weighted in-degree distribution of the nodes of the association hypergraph for configuration C1. Here the weighted in-degree of a node $v$ is $\sum_{e : \{v\} = H(e)} w(e)$, i.e., the sum of weights of all hyperedges entering $v$. Figure 5.1(b) shows the weighted out-degree distribution of the nodes of the association hypergraph for configuration C1. Here the weighted out-degree of a node $v$ is $\sum_{e : v \in T(e)} \frac{w(e)}{|T(e)|}$, i.e., the sum of normalized weights of all hyperedges leaving $v$. The mean ACV of directed edges is 0.43 and of 2-to-1 directed hyperedges is 0.44.

Using our directed hypergraph based modeling, we are able to deduce some interesting facts about the current stock market that would have been difficult to validate by other means. Our findings concern the relationship between producers and consumers in the stock market.

Roughly speaking, a producer is any entity (company) that has very little dependency on others for its resource requirements. A producer thrives mostly on its own or has few resource requirements compared to the rest. Some notable sectors that are likely to have entities in the producer category are the BM, CG, E, and SV sectors. On the other hand, a consumer is an entity that is highly dependent on other entities or end-users for its own functioning. Sectors such as CC, CN, H, SV, and T are likely to have entities in the consumer category. A particular exception is the SV sector, in which there are both producers and consumers depending on the particular industry they belong to. For example, entities in the SV sector that deal with real estate operations (e.g., Kimco Realty Corporation) are mostly producers, whereas entities in the SV sector that provide basic services to the end user (e.g., Yahoo! Inc.) mostly satisfy the consumer category.

Time-series that are less dependent on other time-series for their resources (e.g., Producers) are likely to be more predictable in comparison to others. Why? Since these time-series do not depend on other time-series for raw materials and other basic requirements, their

Figure 5.1 Weighted degree distribution. (a) Weighted in-degree distribution and (b) weighted out-degree distribution. The ids of nodes are on the $x$-axis and their weighted in-degree (out-degree) values are on the $y$-axis.
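The weighted degrees plotted in Figure 5.1 follow directly from the two sums given in Section 5.2. A minimal Python sketch, assuming each hyperedge is stored as a hypothetical `(tail set, head vertex, weight)` triple and using an illustrative function name:

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Hyperedge = Tuple[FrozenSet[str], str, float]  # (tail set, head vertex, weight)


def weighted_degrees(edges: List[Hyperedge]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (weighted in-degree, weighted out-degree) per vertex."""
    in_deg: Dict[str, float] = defaultdict(float)
    out_deg: Dict[str, float] = defaultdict(float)
    for tail, head, weight in edges:
        in_deg[head] += weight  # full weight enters the head vertex
        for v in tail:
            out_deg[v] += weight / len(tail)  # weight is shared among tail vertices
    return dict(in_deg), dict(out_deg)
```

Dividing by the tail size means that a 2-to-1 directed hyperedge credits each of its two tail vertices with half of its weight, whereas a directed edge credits its single tail vertex with the full weight.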

value in the market can be estimated by analyzing the demand of the sources of raw materials on which they rely. The weighted in-degree of a node in our directed hypergraph model indicates the level of predictability of the corresponding time-series. The greater the weighted in-degree of a node is, the more predictable the corresponding time-series is. We see from Figure 5.1(a) that XOM (E sector) and GT (CC sector) have high weighted in-degree values in comparison to others from our chosen list. In the case of GT (The Goodyear Tire & Rubber Company), although the sector it belongs to is different from the ones we listed for the producer category, one of the basic materials this time-series depends on is rubber, which is a naturally occurring compound. We also experimentally found out that, of the time-series with the top 25 weighted in-degree values, 72% belong to sectors BM (iron, gold, silver, and metal mining industries), E (oil, gas, and coal industries), and SV (real estate industries).

Time-series that have direct interactions with end-users for their products (e.g., Consumers) are more likely to be good predictors. Why? These time-series happen to be good predictors as they are affected by the immediate actions of the end-users. The economy is driven by the end-user's requirements. By being in direct contact with the end-user, the performance of the consumer time-series in the market directly follows the behavior patterns of the end-user.

The weighted out-degree of a node in our modeling indicates how much that node (time-series) can predict other time-series. The greater the weighted out-degree of a node is, the higher its ability to predict other time-series is. We see from Figure 5.1(b) that PG (CN sector) and JNJ (H sector) have high out-degree values in comparison to others from our chosen list. We also experimentally found out that, of the time-series with the top 25 weighted out-degree values, 84% belong to sectors H (facilities, biotechnology, and drugs), SV (particularly, business, real estate, and restaurant services), and T (software, hardware, and semiconductors).

Table 5.1 shows, for each configuration, the directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors. For example, the best prediction directed edge for GT (The Goodyear Tire & Rubber Company) shown in Row 3 for C1 and C2 represents a relationship between GT and PPG (PPG

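Table 5.1 The directed edge and the 2-to-1 directed hyperedge with the highest ACV for each selected financial time-series from different sectors and for each configuration choice are shown. The sector information of each financial time-series is indicated in parentheses (Google Finance).

| Row | Time-series | Configuration | Top directed edge | Top 2-to-1 directed hyperedge |
|---|---|---|---|---|
| 1 | EMN (BM) | C1 | PPG (BM) → EMN (BM) | AVY (BM), GT (CC) → EMN (BM) |
| 1 | EMN (BM) | C2 | PPG (BM) → EMN (BM) | BLL (BM), IFF (BM) → EMN (BM) |
| 2 | HON (CG) | C1 | TXT (C) → HON (CG) | CAT (CG), ITT (T) → HON (CG) |
| 2 | HON (CG) | C2 | UTX (CG) → HON (CG) | BA (CG), ROK (T) → HON (CG) |
| 3 | GT (CC) | C1 | PPG (BM) → GT (CC) | DOW (BM), F (CC) → GT (CC) |
| 3 | GT (CC) | C2 | PPG (BM) → GT (CC) | ETN (T), FMC (BM) → GT (CC) |
| 4 | PG (CN) | C1 | CL (CN) → PG (CN) | CLX (CN), K (CN) → PG (CN) |
| 4 | PG (CN) | C2 | CL (CN) → PG (CN) | ABT (H), CPB (CN) → PG (CN) |
| 5 | XOM (E) | C1 | CVX (E) → XOM (E) | HES (E), SLB (E) → XOM (E) |
| 5 | XOM (E) | C2 | CVX (E) → XOM (E) | COG (E), PEG (U) → XOM (E) |
| 6 | AIG (F) | C1 | C (F) → AIG (F) | BEN (F), PGR (F) → AIG (F) |
| 6 | AIG (F) | C2 | C (F) → AIG (F) | AON (F), CI (F) → AIG (F) |
| 7 | JNJ (H) | C1 | MRK (H) → JNJ (H) | IFF (BM), SYY (SV) → JNJ (H) |
| 7 | JNJ (H) | C2 | MRK (H) → JNJ (H) | CL (CN), PEP (CN) → JNJ (H) |
| 8 | JCP (SV) | C1 | M (SV) → JCP (SV) | FDO (SV), GPS (SV) → JCP (SV) |
| 8 | JCP (SV) | C2 | M (SV) → JCP (SV) | COST (SV), HD (SV) → JCP (SV) |
| 9 | INTC (T) | C1 | LLTC (T) → INTC (T) | EMC (T), QCOM (T) → INTC (T) |
| 9 | INTC (T) | C2 | XLNX (T) → INTC (T) | CTXS (T), QCOM (T) → INTC (T) |
| 10 | FDX (TP) | C1 | AXP (F) → FDX (TP) | EXPD (TP), ITT (T) → FDX (TP) |
| 10 | FDX (TP) | C2 | AXP (F) → FDX (TP) | EXPD (TP), BAC (F) → FDX (TP) |
| 11 | TE (U) | C1 | PGN (U) → TE (U) | PEG (U), SO (U) → TE (U) |
| 11 | TE (U) | C2 | AEP (U) → TE (U) | SO (U), TEG (U) → TE (U) |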

Industries, Inc.). This relationship may be interpreted in terms of GT procuring raw materials (e.g., precipitated silicas) from PPG for the manufacturing or processing of rubber. A more interesting many-to-one relationship is represented by the best prediction 2-to-1 directed hyperedge for GT shown in Row 3 for C1. Here, the 2-to-1 directed hyperedge for GT represents a relationship of GT with DOW (The Dow Chemical Company) and F (Ford Motor Company). This relationship may be interpreted in terms of GT procuring raw materials (e.g., polyurethane polymer) from DOW, whereas the relationship with F may be attributed to F utilizing the products (e.g., tires) from GT. Thus, we see that 2-to-1 directed hyperedges provide meaningful and more interesting information than directed edges do.

Table 5.2 shows, for each configuration, the 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors. For example, in Row 5, for C1 the accuracy of HES (Hess Corporation) predicting XOM (Exxon Mobil Corporation) is 0.55 and of SLB (Schlumberger Limited) predicting XOM is 0.54, but the accuracy of both of them together predicting XOM is 0.58. This shows that the combination of two financial time-series leads to a better predicting 2-to-1 directed hyperedge.
Table 5.2 The 2-to-1 directed hyperedge with the highest ACV and the constituent directed edges for each selected financial time-series from different sectors and for each configuration choice are shown.

| Row | Time-series | Configuration | Top 2-to-1 directed hyperedge | ACV | Directed edge 1 | ACV | Directed edge 2 | ACV |
|---|---|---|---|---|---|---|---|---|
| 1 | EMN (BM) | C1 | AVY, GT → EMN | 0.52 | AVY → EMN | 0.49 | GT → EMN | 0.49 |
| 1 | EMN (BM) | C2 | BLL, IFF → EMN | 0.37 | BLL → EMN | 0.32 | IFF → EMN | 0.33 |
| 2 | HON (CG) | C1 | CAT, ITT → HON | 0.53 | CAT → HON | 0.5 | ITT → HON | 0.49 |
| 2 | HON (CG) | C2 | BA, ROK → HON | 0.38 | BA → HON | 0.33 | ROK → HON | 0.33 |
| 3 | GT (CC) | C1 | DOW, F → GT | 0.51 | DOW → GT | 0.48 | F → GT | 0.47 |
| 3 | GT (CC) | C2 | ETN, FMC → GT | 0.37 | ETN → GT | 0.33 | FMC → GT | 0.33 |
| 4 | PG (CN) | C1 | CLX, K → PG | 0.53 | CLX → PG | 0.5 | K → PG | 0.49 |
| 4 | PG (CN) | C2 | ABT, CPB → PG | 0.36 | ABT → PG | 0.32 | CPB → PG | 0.32 |
| 5 | XOM (E) | C1 | HES, SLB → XOM | 0.58 | HES → XOM | 0.55 | SLB → XOM | 0.54 |
| 5 | XOM (E) | C2 | COG, PEG → XOM | 0.37 | COG → XOM | 0.33 | PEG → XOM | 0.31 |
| 6 | AIG (F) | C1 | BEN, PGR → AIG | 0.54 | BEN → AIG | 0.51 | PGR → AIG | 0.51 |
| 6 | AIG (F) | C2 | AON, CI → AIG | 0.37 | AON → AIG | 0.33 | CI → AIG | 0.33 |
| 7 | JNJ (H) | C1 | IFF, SYY → JNJ | 0.48 | IFF → JNJ | 0.45 | SYY → JNJ | 0.45 |
| 7 | JNJ (H) | C2 | CL, PEP → JNJ | 0.36 | CL → JNJ | 0.32 | PEP → JNJ | 0.31 |
| 8 | JCP (SV) | C1 | FDO, GPS → JCP | 0.51 | FDO → JCP | 0.48 | GPS → JCP | 0.48 |
| 8 | JCP (SV) | C2 | COST, HD → JCP | 0.37 | COST → JCP | 0.32 | HD → JCP | 0.33 |
| 9 | INTC (T) | C1 | EMC, QCOM → INTC | 0.55 | EMC → INTC | 0.52 | QCOM → INTC | 0.52 |
| 9 | INTC (T) | C2 | CTXS, QCOM → INTC | 0.4 | CTXS → INTC | 0.35 | QCOM → INTC | 0.35 |
| 10 | FDX (TP) | C1 | EXPD, ITT → FDX | 0.52 | EXPD → FDX | 0.49 | ITT → FDX | 0.46 |
| 10 | FDX (TP) | C2 | EXPD, BAC → FDX | 0.37 | EXPD → FDX | 0.33 | BAC → FDX | 0.33 |
| 11 | TE (U) | C1 | PEG, SO → TE | 0.55 | PEG → TE | 0.52 | SO → TE | 0.52 |
| 11 | TE (U) | C2 | SO, TEG → TE | 0.4 | SO → TE | 0.35 | TEG → TE | 0.35 |

5.3 Association-Based Similarity

5.3.1 Comparison with Euclidean Similarity

Figure 5.2 compares in-similarity and out-similarity values with Euclidean similarity values for configuration C1. Here, the Euclidean similarity between any two financial time-series $A$ and $B$ is computed as follows. Let $A = \langle a_1, a_2, \ldots, a_n \rangle$ and $B = \langle b_1, b_2, \ldots, b_n \rangle$, where $a_i$ and $b_i$ are the values that $A$ and $B$ take in the $i$'th observation. The Euclidean distance between $A$ and $B$ is defined as $ED(A, B) = \|\mathrm{normalized}(A) - \mathrm{normalized}(B)\|$, where for any vector $V = \langle v_1, v_2, \ldots, v_n \rangle$, $\mathrm{normalized}(V) = \langle v_1/\|V\|, v_2/\|V\|, \ldots, v_n/\|V\| \rangle$ and $\|V\| = \left(\sum_{i=1}^{n} v_i^2\right)^{1/2}$. Now, the Euclidean similarity $ES(A, B)$ between $A$ and $B$ is defined as $ES(A, B) = 1 - \frac{1}{2} ED(A, B)$. Note that $ES(A, B)$ is a real value in the range $[0, 1]$ such that a higher value indicates a greater similarity. Figure 5.2 shows that Euclidean similarity does not differentiate between financial time-series pairs as distinctly as our similarity measures do. This could be because Euclidean similarity accounts for pair-wise differences in price variations on a day-to-day basis, whereas our similarity measures account for the closeness in being associated with common sets of financial time-series on an average basis.

5.3.2 Clusters of Financial Time-Series

Figure 5.3 shows a clustering of financial time-series for configuration C1. The clusters are obtained using the approach explained in Section 3.3.2. Here, the collection $\mathcal{S}$ is the set of all financial time-series in our data set. The value of parameter $t$ in the $t$-clustering algorithm is set to 104, which is the total number of sub-sectors over the entire sectors as pointed out at the beginning of Chapter 5. The first cluster center is picked from the sector T as this sector has the maximum number of financial time-series in our data set.

We experimentally verified that the weight function in Definition 3.13 satisfies the triangle inequality property, and hence the factor 2-approximation of the optimal diameter of the $t$-clustering is in fact achieved by the algorithm. For clarity of display, Figure 5.3 shows only clusters of size greater than 6. The edges that connect all cluster centers to other nodes in their clusters and the edges that interconnect the cluster centers are also shown in the figure. This partial similarity graph consists of 256 nodes and 298 edges. To show the quality of the clustering obtained, the following information is relevant: (i) the mean diameter over all clusters obtained is 0.83 and the overall mean distance in $SG_{\mathcal{S}}$ is 0.89 and (ii) the largest cluster of size 29 contains all financial time-series from the sector T.

5.4 Leading Indicators of Financial Time-Series

In this experiment, we find the leading indicators for the collection $\mathcal{S}$ of all financial time-series using the approach explained in Section 4.1. In order to obtain a dominator set that covers the rest of the financial time-series via directed edges and 2-to-1 directed hyperedges of high ACV, we set a threshold for ACV (ACV-threshold) and discard all directed edges and 2-to-1 directed hyperedges below this threshold during the computation of the dominator. Now, for the computation of dominators for configurations C1 and C2, we consider

Figure 5.2 Euclidean similarity comparison. (a) In-similarity for configuration C1 vs Euclidean similarity and (b) out-similarity for configuration C1 vs Euclidean similarity. The in-similarity (out-similarity) values are on the $x$-axis and the corresponding Euclidean similarity values are on the $y$-axis.
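The Euclidean similarity used as the baseline in Figure 5.2 can be computed as in the following Python sketch of the definition in Section 5.3.1. The function name is illustrative, and the sketch assumes both series have the same length and a non-zero norm.

```python
import math
from typing import Sequence


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """ES(A, B) = 1 - ED(A, B) / 2, with ED taken between unit-normalized vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dist = math.sqrt(sum((x / norm_a - y / norm_b) ** 2 for x, y in zip(a, b)))
    return 1.0 - dist / 2.0
```

Since the distance between two unit vectors is at most 2, the result stays in $[0, 1]$: identical directions give a similarity of 1 and exactly opposite directions give 0.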

Figure 5.3 Clusters of financial time-series for configuration C1. Note that this figure is made up of multiple colors and the colors correspond to sectors in the financial domain. The big circles represent cluster centers and the small ones represent other nodes. The size of the big circles is directly proportional to the number of nodes assigned to them. The small circles are attached to their respective cluster centers.
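The $t$-clustering itself is described in Section 3.3.2, which is not reproduced here. As a rough illustration only, the following Python sketch shows the classical farthest-first traversal of [Gon85], which gives a factor 2-approximation when the distance satisfies the triangle inequality. Whether the thesis algorithm matches this sketch step for step is an assumption, as are the function name, the signature, and the requirement that distinct points are at positive distance from each other.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def farthest_first_clustering(points: Sequence[Hashable], t: int,
                              dist: Callable[[Hashable, Hashable], float],
                              first_center: Hashable) -> Dict[Hashable, List[Hashable]]:
    """Pick up to t centers greedily, then attach every point to its nearest center."""
    centers = [first_center]
    # Distance from each point to the closest center chosen so far.
    nearest = {p: dist(p, first_center) for p in points}
    while len(centers) < min(t, len(points)):
        nxt = max(points, key=lambda p: nearest[p])  # farthest point becomes a center
        centers.append(nxt)
        for p in points:
            nearest[p] = min(nearest[p], dist(p, nxt))
    clusters: Dict[Hashable, List[Hashable]] = {c: [] for c in centers}
    for p in points:
        clusters[min(centers, key=lambda c: dist(p, c))].append(p)
    return clusters
```

In the experiment above, `t` would be 104 and `first_center` a time-series from the sector T, with `dist` playing the role of the weight function of Definition 3.13.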

the following choices for thresholds: (i) top 40% directed hyperedges w.r.t. ACVs, which sets ACV-threshold to 0.45 for C1 and to 0.32 for C2, (ii) top 30% directed hyperedges w.r.t. ACVs, which sets ACV-threshold to 0.46 for C1 and to 0.33 for C2, and (iii) top 20% directed hyperedges w.r.t. ACVs, which sets ACV-threshold to 0.47 for C1 and to 0.34 for C2. Tables 5.3 and 5.4 show the size of a dominator for almost all the financial time-series in our data set. Here, in row 1 of Table 5.4, for the case when ACV-threshold is set to 0.45, our greedy approximation algorithm (Section 4.1) finds a dominator of size 16 that covers 96% of all financial time-series in our data set.

5.5 Association-Based Classifier

In this experiment, we evaluate the accuracy of the assignments given by the association-based classifier on several test data sets. For each training data set, we construct an association hypergraph $H = (V, E)$ using the procedure described in Section 3.2. Next, in the corresponding test data set, all the financial time-series are converted to their respective delta time-series and then discretized using the methodology described in Section 5.1.1. We choose a small collection $\mathcal{S}$ of financial time-series, usually a dominator for all financial time-series in our data sets. The values of every financial time-series $A$ in the dominator are already known from the discretized representation of $A$. The best predicted value of all other financial time-series in our data sets is computed using the association-based classifier presented in Section 4.2. We define the classification confidence for any financial time-series $Y$ on a particular test data set as the fraction of days on which the value assigned by our classifier matches the value in $Y$'s discretized representation, obtained from the same data set.

We also compute the accuracy of value assignments given by other data mining classifiers such as the support vector machine (SVM), multilayer perceptron, and logistic regression. For experiments here, the classifiers provided by Weka [HFH+09] are used. The following methodology is used to predict the values of any financial time-series $Y$ by constructing a training data set whose feature set is the set $\mathcal{S}$ of financial time-series: Consider a directed

Table 5.3 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 5).

| Row | Configuration | ACV threshold (top % hyperedges) | Dominator size | Percent covered | In-sample: Association-Based Classifier | Out-sample: Association-Based Classifier | Out-sample: SVM | Out-sample: Multilayer Perceptron | Out-sample: Logistic Regression |
|---|---|---|---|---|---|---|---|---|---|
| 1 | C1 | 0.45 (40%) | 13 | 99 | 0.643 | 0.719 | 0.546 | 0.716 | 0.541 |
| 1 | C1 | 0.46 (30%) | 15 | 95 | 0.646 | 0.723 | 0.509 | 0.718 | 0.508 |
| 1 | C1 | 0.47 (20%) | 22 | 94 | 0.65 | 0.724 | 0.494 | 0.719 | 0.492 |
| 2 | C2 | 0.32 (40%) | 20 | 96 | 0.646 | 0.716 | 0.429 | 0.627 | 0.231 |
| 2 | C2 | 0.33 (30%) | 30 | 96 | 0.649 | 0.719 | 0.433 | 0.638 | 0.238 |
| 2 | C2 | 0.34 (20%) | 31 | 91 | 0.65 | 0.722 | 0.403 | 0.633 | 0.224 |

Table 5.4 The size of a dominator for all financial time-series and the mean classification confidence of different classifiers for each configuration are shown (obtained using Algorithm 6).

| Row | Configuration | ACV threshold (top % hyperedges) | Dominator size | Percent covered | In-sample: Association-Based Classifier | Out-sample: Association-Based Classifier | Out-sample: SVM | Out-sample: Multilayer Perceptron | Out-sample: Logistic Regression |
|---|---|---|---|---|---|---|---|---|---|
| 1 | C1 | 0.45 (40%) | 16 | 96 | 0.651 | 0.723 | 0.526 | 0.717 | 0.519 |
| 1 | C1 | 0.46 (30%) | 22 | 93 | 0.653 | 0.723 | 0.514 | 0.718 | 0.510 |
| 1 | C1 | 0.47 (20%) | 26 | 91 | 0.656 | 0.728 | 0.515 | 0.725 | 0.512 |
| 2 | C2 | 0.32 (40%) | 28 | 96 | 0.65 | 0.721 | 0.429 | 0.627 | 0.231 |
| 2 | C2 | 0.33 (30%) | 40 | 90 | 0.652 | 0.722 | 0.433 | 0.638 | 0.238 |
| 2 | C2 | 0.34 (20%) | 36 | 78 | 0.652 | 0.72 | 0.403 | 0.633 | 0.224 |
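The classification confidence reported in Tables 5.3 and 5.4 is, per the definition in Section 5.5, the fraction of days on which the assigned value matches the discretized value. A minimal Python sketch of that measure for one time-series follows; the function name and the error handling are assumptions made for illustration.

```python
from typing import Sequence


def classification_confidence(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of days on which the predicted discretized value equals the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

The mean classification confidence in the tables would then be the average of this quantity over all predicted time-series.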

Figure 5.4 Classification confidence distribution of the association-based classifier for in-sample and out-sample data for configuration C1. The start year for the training data set is 1996. Figure (a) uses the dominator obtained from Algorithm 5 and Figure (b) uses the dominator obtained from Algorithm 6.

hyperedge $e$ in $H$ such that $e = (\{A_1, A_2\}, \{Y\})$ and $A_1, A_2 \in \mathcal{S}$. The training data set is built by using each row in $AT(e)$ as a data point. Here, the particular value assignment $A_1 = v_1$ and $A_2 = v_2$ is the feature value, and the corresponding value $y$ of $Y$ (defined in Definition 3.6) is the class value.

5.5.1 Evaluation

Tables 5.3 and 5.4 list, for each dominator, the mean classification confidence over all financial time-series for configurations C1 and C2. Here, in-sample indicates the training data set for constructing the directed hypergraph model and out-sample indicates the test data set for evaluating the value assignments made by the classifiers. The in-sample contains financial time-series data from Jan 1, 1996 to Dec 31, 2008 and the out-sample contains financial time-series data from Jan 1, 2009 to Dec 31, 2009. From these tables, it is clear that the association-based classifier outperforms SVM, Logistic Regression, and Multilayer Perceptron for both configurations C1 and C2. Also, its mean classification confidence is consistent regardless of $k$, whereas the mean classification confidence of the other classifiers decreases as $k$ increases.

Figures 5.4(a) and 5.4(b) show the distribution of the mean classification confidence over various in-sample and out-sample data for configuration C1, where the dominator having ACV-threshold of 0.45 is chosen to be $\mathcal{S}$. In these figures, the mean classification confidence for each in-sample data has been computed by increasing the training data set incrementally one year at a time, starting from Jan 1, 1996 and ending at Dec 31, 2008. The corresponding out-sample contains financial time-series data for one year immediately following the last day in the training data set. For instance, if the training data set is from Jan 1, 1996 to Dec 31, 2001, then the corresponding test data set contains financial time-series from Jan 1, 2002 to Dec 31, 2002. From these figures, it is evident that the association-based classifier achieves mean classification confidence in the range 0.60 to 0.75 on both in-sample and out-sample data. The higher classification confidence values for out-sample data compared to in-sample data may be attributed to the fact that the out-sample data is substantially smaller than the in-sample data.
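The incremental training scheme behind Figure 5.4 can be sketched as a generator of (training window, test window) pairs. This is an illustration under stated assumptions: the function name is hypothetical, and the sketch assumes that the smallest training window is the single year 1996, which the text does not state explicitly.

```python
from datetime import date
from typing import Iterator, Tuple

Window = Tuple[date, date]  # (first day, last day), both inclusive


def expanding_windows(first_year: int, last_train_year: int) -> Iterator[Tuple[Window, Window]]:
    """Grow the training window one year at a time; test on the year that follows."""
    for end_year in range(first_year, last_train_year + 1):
        train = (date(first_year, 1, 1), date(end_year, 12, 31))
        test = (date(end_year + 1, 1, 1), date(end_year + 1, 12, 31))
        yield train, test
```

Calling `expanding_windows(1996, 2008)` produces, among others, the pair from the example above (training on 1996 to 2001, testing on 2002) and ends with training on 1996 to 2008 and testing on 2009.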

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

We proposed a directed hypergraph based model to capture attribute-level associations and their strength in any database. We tested this model on a financial time-series data set (S&P 500). The clustering method based on our similarity notions allowed us to find clusters of financial time-series. The association-based classifier, coupled with the leading indicator of all the financial time-series, provided a methodology to use mva-type association rules to predict values of financial time-series. We also demonstrated the consistency of our model by varying $k$ throughout our experiments.

Our work raises interesting questions on the applications of association rule mining. It might be fruitful to explore associations by applying the directed hypergraph model on data sets such as gene databases, social network data sets, and medical databases. It would be useful to understand how the different parameters ($k$ and the sizes of head and tail sets) affect the model. Examples of applications of association rule mining in the context of social network data sets and medical databases have been presented in Section 3.1.

In genetic research, a common problem is to effectively model the interrelationship among multiple genes. By recording the gene expression values of a set of genes, researchers work towards obtaining the gene expression values of others. In general, knowledge about the genes helps researchers and physicians to understand the genetic state of patients and the genetic conditions for diseases.

Anastassiou [Ana07] provided a synergy-based method to analyze the interactions among multiple interacting genes. Such a framework can be used to predict the presence of a disease or the absence of a disease, based on given gene expression values. Although this is recent work in disease prediction, there has been continuous interest in locating genes with similar characteristics. For example, Eisen et al. [ESBB98] provided a variation of clustering to

find clusters of similar genes. Finding a group of genes that instigate a certain disease is very important in recognizing a medical condition in patients.

The directed hypergraph based model, i.e., the association hypergraph, presented in this thesis can be used to model gene interactions. By modeling the gene interactions using an association hypergraph, we can address the following problems: (1) identify clusters of similar genes and predict gene expression values of a set of genes, and (2) identify disease causing conditions present among a set of genes and predict the presence of a disease or the absence of a disease. Let a gene database consist of gene expression values recorded from patients. Also, let the database contain information on the status of patients being affected by a certain disease. Here genes and diseases are the multi-valued attributes, and each observation consists of the gene expression values and the disease status of a particular patient.

The problem stated in (1) can be addressed by considering the part of the gene database that only consists of the multi-valued attributes that correspond to the gene expression values and constructing an association hypergraph using the technique described earlier in Section 3.1. By considering the multi-valued attributes that correspond to the gene expression values, interactions among the genes can be modeled using directed hyperedges. Then, groups of similar genes can be identified by finding clusters of similar multi-valued attributes, as given in Section 3.3.2. Also, by knowing the gene expression values of a subset of genes, the gene expression values of the remaining genes can be predicted using the association-based classifier given in Section 4.2.

The problem stated in (2) can be addressed by considering the entire gene database and constructing an association hypergraph using the technique described earlier in Section 3.1. In this problem, we are interested in obtaining a prediction for the presence of a disease or the absence of a disease in patients. Therefore, during the construction of the association hypergraph, the directed hyperedges whose head set consists of a disease are the only ones that get included in the association hypergraph. The remaining directed hyperedges are not considered for inclusion in the association hypergraph. Now, by knowing the gene expression values of a subset of genes for a patient, a disease prediction can be obtained by using the association-based classifier given in Section 4.2.

REFERENCES

[ADS83] G. Ausiello, A. D'Atri, and D. Sacca. Graph algorithms for functional dependency manipulation. Journal of the ACM, 30:752–766, 1983.
[ADS85] G. Ausiello, A. D'Atri, and D. Sacca. Strongly equivalent directed hypergraphs. Analysis and Design of Algorithms for Combinatorial Problems, 25:1–25, 1985.
[AG97] G. Ausiello and R. Giaccio. On-line algorithms for satisfiability formulae with uncertainty. Theoretical Computer Science, 171(1–2):3–24, 1997.
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD 93, pages 207–216. ACM, 1993.
[Ana07] D. Anastassiou. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology, 3:1–8, 2007.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB 94, pages 487–499. Morgan Kaufmann, 1994.
[AV06] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In ACM Symposium on Computational Geometry. ACM, 2006.
[BAG99] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In ICDE 99, pages 188–197. IEEE, 1999.
[Bay97] R. Bayardo. Brute-force mining of high-confidence classification rules. In KDD 97, pages 123–126. ACM, 1997.
[Ber73] C. Berge. Graphs and Hypergraphs. North-Holland, 1973.
[BZ07] B. Bringmann and A. Zimmermann. The chosen few: On identifying valuable patterns. In ICDM 07, pages 63–72. IEEE, 2007.
[CDP04] S. Chawla, J. Davis, and G. Pandey. On local pruning of association rules using directed hypergraphs. In ICDE 04, page 832. IEEE, 2004.
[CH03] C. Creighton and S. Hanash. Mining gene expression databases for association rules. Bioinformatics, 19:79–86, 2003.
[Chv79] V. Chvatal. A greedy heuristic for the set covering problem. Mathematics of Operations Research, 4:233–235, 1979.
[CSCR+06] P. Carmona-Saez, M. Chagoyen, A. Rodriguez, O. Trelles, J. Carazo, and A. Pascual-Montano. Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics, 19:79–86, 2006.

[ESBB98] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95:14863–14868, December 1998.
[GLPN93] G. Gallo, G. Longo, S. Pallottino, and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2–3):177–201, 1993.
[Gon85] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[GS02] G. Gallo and M. Scutella. A note on minimum makespan assembly plans. European Journal of Operational Research, 142:309–320, 2002.
[HFH+09] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The weka data mining software: An update. SIGKDD Explorations, 11, 2009.
[HKKM98] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph based clustering in high-dimensional data sets: A summary of results. IEEE Data Eng. Bull., 21:15–22, 1998.
[Joh74] D. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256–278, 1974.
[KH06] A. Knobbe and E. Ho. Maximally informative k-itemsets and their efficient discovery. In KDD 06, pages 237–244. ACM, 2006.
[KNO+03] L. Krishnamurthy, J. Nadeau, G. Ozsoyoglu, M. Ozsoyoglu, G. Schaeffer, M. Tasan, and W. Xu. Pathways database system: an integrated system for biological pathways. Bioinformatics, 19:930–937, 2003.
[LHM98] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In KDD 98, pages 80–86. ACM, 1998.
[LHM99] B. Liu, W. Hsu, and Y. Ma. Mining association rules with multiple minimum supports. In KDD 99, pages 337–341. ACM, 1999.
[Llo82] S. Lloyd. Least square quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.
[Lov75] L. Lovasz. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13:383–390, 1975.
[LS95] W. Lin and M. Sarrafzadeh. A linear arrangement problem with applications. In ISCAS 95, pages 57–60, 1995.
[LSW97] B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE 97, pages 220–231. IEEE, 1997.
[NLHP98] R. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In SIGMOD 98, pages 13–24. ACM, 1998.
[OA04] M. Ozdal and C. Aykanat. Hypergraph models and algorithms for data-pattern-based clustering. Data Mining and Knowledge Discovery, 9:29–57, 2004.

[Ord06] C. Ordonez. Comparing association rules and decision trees for disease prediction. In HIKM 06, pages 17–24. ACM, 2006.
[Pre00] D. Pretolani. A directed hypergraph model for random time dependent shortest paths. European Journal of Operational Research, 123:315–324, 2000.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. In VLDB 95, pages 407–419. Morgan Kaufmann, 1995.
[SA96] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD 96, pages 1–12. ACM, 1996.
[SVA97] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In KDD 97, pages 67–73. ACM, 1997.
[SVL06] A. Siebes, J. Vreeken, and M. Leeuwen. Item sets that compress. In SDM 06. SIAM, 2006.
[TT09] M. Thakur and R. Tripathi. Linear connectivity problems in directed hypergraphs. Theoretical Computer Science, 410(27–29):2592–2618, 2009.
[Vie04] A. Vietri. The complexity of arc-colorings for directed hypergraphs. Discrete Applied Mathematics, 143(1–3):266–271, 2004.
[Yah10] Yahoo.com. Yahoo finance. http://finance.yahoo.com/, 2010.

ABOUT THE AUTHOR

Ramanuja Simha studied information science at The National Institute of Engineering, affiliated to the Visvesvaraya Technological University, from 2001 to 2005. He graduated with a Bachelor of Engineering in Information Science. Then, he worked at Tesco Hindustan Service Center as a Software Engineer from 2005 to 2008, and had the title of Senior Software Engineer when he left the firm. He began his graduate studies in computer science at the University of South Florida in 2008. There, he pursued research in Algorithms and Data Mining under the supervision of Prof. Rahul Tripathi. During this period, he also completed a summer internship at the National Center for Atmospheric Research in the Computational and Information Systems Laboratory.

