Goal attainment on long tail web sites : an information foraging approach

Material Information

Title:
Goal attainment on long tail web sites : an information foraging approach
Physical Description:
Book
Language:
English
Creator:
McCart, James
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:
2009

Subjects

Subjects / Keywords:
Clickstream research
Information foraging theory
Web mining
Information scent
Data mining
Dissertations, Academic -- Information Sys and Decision Sci -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
This dissertation sought to explain goal achievement at limited traffic "long tail" Web sites using Information Foraging Theory (IFT). The central thesis of IFT is that individuals are driven by a metaphorical sense of smell that guides them through patches of information in their environment. An information patch is an area of the search environment with similar information. Information scent is the driving force behind why a person makes a navigational selection amongst a group of competing options. As foragers are assumed to be rational, scent is a mechanism by which to reduce search costs by increasing the accuracy on which option leads to the information of value. IFT was originally developed to be used in a "production rule" environment, where a user would perform an action when the conditions of a rule were met. However, the use of IFT in clickstream research required conceptualizing the ideas of information scent and patches in a non-production rule environment. To meet such an end this dissertation asked three research questions regarding (1) how to learn information patches, (2) how to learn trails of scent, and finally (3) how to combine both concepts to create a Clickstream Model of Information Foraging (CMIF). The learning of patches and trails was accomplished by using contrast sets, which distinguished between individuals who achieved a goal and those who did not. A user- and site-centric version of the CMIF, which extended and operationalized IFT, presented and evaluated hypotheses. The user-centric version had four hypotheses and examined product purchasing behavior from panel data, whereas the site-centric version had nine hypotheses and predicted contact form submission using data from a Web hosting company. In general, the results show that patches and trails exist on several Web sites, and the majority of hypotheses were supported in each version of the CMIF.
This dissertation contributed to the literature by providing a theoretically-grounded model which tested and extended IFT; introducing a methodology for learning patches and trails; detailing a methodology for preprocessing clickstream data for long tail Web sites; and focusing on traditionally under-studied long tail Web sites.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2009.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by James McCart.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0003190
usfldc handle - e14.3190
System ID:
SFS0027506:00001


Full Text

Goal Attainment On Long Tail Web Sites: An Information Foraging Approach

by

James A. McCart

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Information Systems and Decision Sciences
College of Business
University of South Florida

Co-Major Professor: Donald J. Berndt, Ph.D.
Co-Major Professor: Balaji Padmanabhan, Ph.D.
Joni L. Jones, Ph.D.
Richard P. Will, Ph.D.

Date of Approval: October 13, 2009

Keywords: clickstream research, information foraging theory, web mining, information scent, data mining

Copyright 2009, James A. McCart

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  1.1 Research Questions
    1.1.1 Research Question 1 – Learning Patches
    1.1.2 Research Question 2 – Learning Scent Trails
    1.1.3 Research Question 3 – Clickstream Model of Information Foraging
  1.2 Contributions
  1.3 Dissertation Structure
Chapter 2 Literature Review
  2.1 Terminology
  2.2 Prior Research
    2.2.1 Multiple Objectives
    2.2.2 Browsing
    2.2.3 Purchasing
    2.2.4 Goal Achievement
  2.3 Datasets
    2.3.1 Type
    2.3.2 Sector
    2.3.3 Time, Duration, and Size
  2.4 Metrics
    2.4.1 Analysis Level
    2.4.2 Metric Categories
  2.5 Conclusion
Chapter 3 Theory
  3.1 Optimal Foraging Theory
    3.1.1 Prey Model
    3.1.2 Patch Model
  3.2 Adaptive Control of Thought-Rational Theory
    3.2.1 Central Production System
    3.2.2 Production Learning
    3.2.3 Chunk
    3.2.4 Declarative Memory

  3.3 Information Foraging Theory
    3.3.1 Information Scent
    3.3.2 Information Patch
    3.3.3 SNIF-ACT
  3.4 Conclusion
Chapter 4 Hypotheses
  4.1 Information Foraging
    4.1.1 Page Evaluation
    4.1.2 Sample Session
  4.2 Clickstream Model of Information Foraging
    4.2.1 User-centric
    4.2.2 Site-centric
  4.3 Conclusion
Chapter 5 Methodology
  5.1 User-centric Clickstream Model of Information Foraging
    5.1.1 Dataset Sample
    5.1.2 Metrics
  5.2 Site-centric Clickstream Model of Information Foraging
    5.2.1 Dataset Sample
    5.2.2 Metrics
  5.3 Metric Testing
  5.4 Conclusion
  5.A Clickstream Complexity Appendix
    5.A.1 Example Clickstreams
    5.A.2 Graph Theory
    5.A.3 Compactness
    5.A.4 Stratum
  5.B Learning Patches and Scent Trails Appendix
    5.B.1 Information Patches
    5.B.2 Scent Trails
Chapter 6 Datasets
  6.1 User-centric Dataset
    6.1.1 Preprocessing of Original Dataset
    6.1.2 Final Dataset
  6.2 Site-centric Dataset
    6.2.1 Preprocessing of Original Dataset
    6.2.2 Final Dataset
  6.3 Conclusion
Chapter 7 Results
  7.1 User-centric Clickstream Model of Information Foraging
    7.1.1 Descriptive Statistics
    7.1.2 Hypotheses Testing

  7.2 Site-centric Clickstream Model of Information Foraging
    7.2.1 Descriptive Statistics
    7.2.2 Hypotheses Testing
    7.2.3 Sensitivity Analysis
  7.3 Conclusion
Chapter 8 Temporal Aspects of Information Foraging
  8.1 Methodology
    8.1.1 Dataset Sample
    8.1.2 Progressive Calculations
    8.1.3 Metrics
  8.2 Results
    8.2.1 Descriptive Statistics
    8.2.2 Hypotheses Testing
  8.3 Conclusion
Chapter 9 Conclusion
  9.1 Limitations
  9.2 Contributions
  9.3 Future Research
References
About the Author

List of Tables

Table 1 Prior Literature: Results
Table 2 Prior Literature: Datasets
Table 3 Prior Literature: Metrics
Table 4 OFT: Example Prey Types for a Brown Bear
Table 5 OFT: Example Diet for a Brown Bear
Table 6 OFT: Example Single Patch Type for a Brown Bear
Table 7 OFT: Example Multiple Patch Types for a Brown Bear
Table 8 User-centric: Hypotheses
Table 9 Relation of Hypotheses to Information Foraging Theory
Table 10 Site-centric: Hypotheses
Table 11 User-centric: Example Sessions
Table 12 User-centric: Example User-Sessions
Table 13 User-centric: Model Metrics
Table 14 Site-centric: Example Session Tuples
Table 15 Site-centric: Example Session Statistics by Contact Goal
Table 16 Site-centric: Model Metrics
Table 17 Site-centric: Example Valuable Patches
Table 18 Site-centric: Example Visited Patches
Table 19 Site-centric: Example Last Visited Patches
Table 20 Site-centric: Example Valuable Trails
Table 21 Site-centric: Example Followed Trails
Table 22 Site-centric: Example Last Followed Trails
Table 23 Contingency Table for RETURN and VISITED
Table 24 Example T-test Metric Testing Dataset
Table 25 Example Wilcoxon Metric Testing Dataset
Table 26 Example Sign Test Metric Testing Dataset

Table 27 Site-centric: Example Visitor Clickstream Complexity Metrics
Table 28 Site-centric: Example Contingency Table for a Potential Contrast Set
Table 29 Site-centric: Example Potential Contrast Sets
Table 30 User-centric: Preprocessing of Original Dataset Statistics
Table 31 User-centric: Preprocessing Parameters
Table 32 Example Outlier Points
Table 33 Example Outlier Distances
Table 34 User-centric: Parameter Values for DBSCAN
Table 35 User-centric: Final Dataset Statistics
Table 36 User-centric: Web Site Characteristic Statistics
Table 37 User-centric: Session Characteristic Statistics
Table 38 User-centric: User-session Characteristic Statistics
Table 39 Site-centric: Preprocessing of Original Dataset Statistics
Table 40 Site-centric: Preprocessing Parameters
Table 41 Site-centric: Parameter Values for DBSCAN
Table 42 Site-centric: Conflicting Contact Goals
Table 43 Site-centric: Conflicting Contact Goal Pages – Website D
Table 44 Site-centric: Website D Page Visitations
Table 45 Site-centric: All Contact Goals Stats
Table 46 Site-centric: Websites with 50 Goal Sessions – Contact Goals Stats
Table 47 Site-centric: Final Dataset Statistics
Table 48 Site-centric: Web Site Characteristic Statistics
Table 49 Site-centric: Session Characteristic Statistics
Table 50 User-centric: User-sessions by Site
Table 51 User-centric: Metric Statistics
Table 52 User-centric: Assumptions of Statistical Tests
Table 53 User-centric: Metric Normality and Skew
Table 54 User-centric: Results
Table 55 User-centric: Hypotheses Results Summary
Table 56 Site-centric: Sessions by Site
Table 57 Site-centric: Metric Statistics
Table 58 Site-centric: Metric Statistics (Significant – 0.05)

Table 59 Site-centric: Patch and Trail Metric Statistics (Significant – 0.05)
Table 60 Site-centric: Example Patches
Table 61 Site-centric: Example Trails
Table 62 Site-centric: Assumptions of Statistical Tests
Table 63 Site-centric: Metric Normality and Skew
Table 64 Site-centric: Metric Normality and Skew (Significant – 0.05)
Table 65 Site-centric: Results
Table 66 Site-centric: Results (Significant – 0.05)
Table 67 Site-centric: Hypotheses Results Summary
Table 68 Site-centric: Sensitivity Analysis Metric Statistics
Table 69 Site-centric: Metric Statistics (Significant – 0.01)
Table 70 Site-centric: Metric Statistics (Significant – 0.05)
Table 71 Site-centric: Metric Statistics (Supported – 0.25)
Table 72 Site-centric: Metric Statistics (Supported – 0.50)
Table 73 Site-centric: Metric Statistics (Supported – 0.75)
Table 74 Site-centric: Metric Statistics (Supported – 1.00)
Table 75 Site-centric: Metric Statistics (Supported – 1.25)
Table 76 Site-centric: Metric Statistics (Supported – 1.50)
Table 77 Site-centric: Number of Patches by Site
Table 78 Site-centric: Number of Trails by Site
Table 79 Site-centric: Patch Size by Site
Table 80 Site-centric: Trail Size by Site
Table 81 Site-centric: Patch Coverage by Site
Table 82 Site-centric: Trail Coverage by Site
Table 83 Site-centric: Patch Value by Site
Table 84 Site-centric: Trail Value by Site
Table 85 Site-centric: Patch Visitation by Site
Table 86 Site-centric: Trail Following by Site
Table 87 Site-centric: Patches and Trails Hypotheses Results Summary
Table 88 Site-centric: Results (Significant – 0.01)
Table 89 Site-centric: Results (Significant – 0.05)
Table 90 Site-centric: Results (Supported – 0.25)

Table 91 Site-centric: Results (Supported – 0.50)
Table 92 Site-centric: Results (Supported – 0.75)
Table 93 Site-centric: Results (Supported – 1.00)
Table 94 Site-centric: Results (Supported – 1.25)
Table 95 Site-centric: Results (Supported – 1.50)
Table 96 Temporal Site-centric: Example Sessions
Table 97 Temporal Site-centric: Example Dataset Processing
Table 98 Temporal Site-centric: Model Metrics
Table 99 Temporal Site-centric: Sessions by Site
Table 100 Temporal Site-centric: Metric Statistics
Table 101 Temporal Site-centric: Metric Statistics (Significant – 0.05)
Table 102 Temporal Site-centric: Results
Table 103 Temporal Site-centric: Results (Significant – 0.05)
Table 104 Temporal Site-centric: Hypotheses Results Summary

List of Figures

Figure 1 Power Law Distribution
Figure 2 Shopping Strategy Typology
Figure 3 Category of Pages Viewed in a Path
Figure 4 User-centric Versus Site-centric Data
Figure 5 Example Metric Level of Analysis
Figure 6 OFT: Simulated Optimal Diet
Figure 7 OFT: Patchy Environment
Figure 8 OFT: Example Patch Gain Function
Figure 9 OFT: Example Year One Patch
Figure 10 OFT: Example Year Two Patch
Figure 11 OFT: Example Year Three Patch
Figure 12 OFT: Example Optimal Multi-Patch Time
Figure 13 ACT-R: 5.0 Architecture
Figure 14 ACT-R: Example Production Rules
Figure 15 ACT-R: Example Cognitive Problem-Solving Process
Figure 16 ACT-R: Example Production Compilation
Figure 17 ACT-R: Example Chunks
Figure 18 ACT-R: Declarative Memory Network Structure
Figure 19 ACT-R: Example Memory Schematic
Figure 20 SNIF-ACT: Production Rules
Figure 21 SNIF-ACT: Site-leaving Actions
Figure 22 SNIF-ACT: Hypothetical Distribution of Link Utilities
Figure 23 SNIF-ACT: Hypothetical Production Probabilities
Figure 24 Consumer Decision Process Model and Information Foraging Theory
Figure 25 User-centric: Example User Clickstream Graph
Figure 26 User-centric: Example Forager Path

Figure 27 Site-centric: Example User Clickstream Graph
Figure 28 User-centric: createUserSessions Algorithm
Figure 29 Site-centric: Example Clickstream Web Graphs
Figure 30 Site-centric: Example Clickstream Graph and Matrices
Figure 31 Site-centric: Example Itemsets by Dataset
Figure 32 Site-centric: Example Patterns by Dataset
Figure 33 User-centric: Goal Sessions by Web Sites
Figure 34 Example Outlier Points
Figure 35 User-centric: Sorted 4-Dist Graphs
Figure 36 User-centric: Outlier Points Plot
Figure 37 User-centric: Website Sessions Histograms
Figure 38 User-centric: Website Conversion Histogram
Figure 39 User-centric: Session Pages Viewed Histograms
Figure 40 User-centric: Session Duration Histograms
Figure 41 User-centric: User-session Sessions Histograms
Figure 42 Site-centric: Distinct Form Submissions Histogram
Figure 43 Site-centric: Goal Sessions by Website Histogram
Figure 44 Site-centric: Sorted 4-Dist Graphs
Figure 45 Site-centric: Outlier Points Plot
Figure 46 Site-centric: Contact Goals Per Website
Figure 47 Site-centric: Goals Per Contact Goal
Figure 48 Site-centric: Websites with 50 Goal Sessions Per Contact Goal
Figure 49 Site-centric: Websites' Activity
Figure 50 Site-centric: Website Pages Histograms
Figure 51 Site-centric: Website Sessions Histograms
Figure 52 Site-centric: Website Conversion Histogram
Figure 53 Site-centric: Session Pages Viewed Histograms
Figure 54 Site-centric: Session Duration Histograms
Figure 55 User-centric: Difference Plots
Figure 56 Site-centric: Difference Plots
Figure 57 Site-centric: Patch and Trail Difference Plots (Significant – 0.05)
Figure 58 Site-centric: Patch and Trail Sample Size by Significance/Support Levels

Figure 59 Site-centric: All Patch and Trail Metrics by Significance/Support Levels
Figure 60 Site-centric: Patch and Trail Metrics by Significance/Support Levels
Figure 61 Site-centric: Average Patch and Trail Statistics Per Site
Figure 62 Site-centric: Trail and Patch p-Values by Significance/Support Levels
Figure 63 Temporal Site-centric: processDataset Algorithm

Goal Attainment on Long Tail Web Sites: An Information Foraging Approach

James A. McCart

ABSTRACT

This dissertation sought to explain goal achievement at limited traffic “long tail” Web sites using Information Foraging Theory (IFT). The central thesis of IFT is that individuals are driven by a metaphorical sense of smell that guides them through patches of information in their environment. An information patch is an area of the search environment with similar information. Information scent is the driving force behind why a person makes a navigational selection amongst a group of competing options. As foragers are assumed to be rational, scent is a mechanism by which to reduce search costs by increasing the accuracy on which option leads to the information of value. IFT was originally developed to be used in a “production rule” environment, where a user would perform an action when the conditions of a rule were met. However, the use of IFT in clickstream research required conceptualizing the ideas of information scent and patches in a non-production rule environment. To meet such an end this dissertation asked three research questions regarding (1) how to learn information patches, (2) how to learn trails of scent, and finally (3) how to combine both concepts to create a Clickstream Model of Information Foraging (CMIF).

The learning of patches and trails was accomplished by using contrast sets, which distinguished between individuals who achieved a goal and those who did not. A user- and site-centric version of the CMIF, which extended and operationalized IFT, presented and evaluated hypotheses. The user-centric version had four hypotheses and examined product purchasing behavior from panel data, whereas the site-centric version had nine hypotheses and predicted contact form submission using data from a Web hosting company.

In general, the results show that patches and trails exist on several Web sites, and the majority of hypotheses were supported in each version of the CMIF. This dissertation contributed to the literature by providing a theoretically-grounded model which tested and extended IFT; introducing a methodology for learning patches and trails; detailing a methodology for preprocessing clickstream data for long tail Web sites; and focusing on traditionally under-studied long tail Web sites.

Chapter 1
Introduction

Understanding the browsing behavior of users at Web sites has been the objective of much of the research employing data about users' Web usage (commonly known as “clickstream data”). Especially salient has been the investigation of factors relating to choice behavior, where choice is typically concerned with the purchase of a product (Bucklin et al., 2002). Besides having a general understanding of why users behave the way they do, such knowledge also forms the basis for developing mechanisms to influence choice. For example, to steer a visitor towards a purchase, dynamic on-the-fly changes may be made to a Web site in terms of its “...pages, link choices, promotional interventions, and prices and product assortments” (Bucklin et al., 2002, pg. 252).

Such a general understanding of factors affecting choice, however, has been difficult to obtain. In part, the difficulty arises because conceptual research focusing on the theories and ideas which provide an explanation of a user's behavior has been limited (Bucklin et al., 2002). This lack of a theoretical base makes it harder to reconcile and synthesize the results of clickstream research, and thus to obtain a clearer picture of those factors.

Finding an appropriate theory to use is challenging in light of the type of data available. Clickstream data provides information on the actions of a user (e.g., what pages were visited, how much time was spent at a site), but nothing else. A person's attitudes, emotions, intentions, and other such concepts are unknown. However, many theories examining an individual's behavior in information systems research rely on such concepts as attitudes and intentions (e.g., Theory of Planned Behavior (Ajzen, 1991)) and thus are not appropriate to use. Therefore, a theory is needed which can (1) explain behavior based on a user's action and (2) be appropriately applied to the clickstream domain.

Within the last decade, a theory called Information Foraging Theory (IFT) has emerged which explains the searching behavior of individuals as they hunt for information (Pirolli and Card, 1999). The thesis of IFT is that an individual is driven by a metaphorical sense of smell that guides them through patches of information in their environment based on their information goal (i.e., what they are trying to accomplish) (Pirolli, 2007). As they “forage”, individuals evaluate whether to continue browsing in their current patch of information or leave to hunt for another one. Central to this theory are the concepts of information patches and information scent. Information patches are distinct areas of the search environment which differ in informational content. Information scent is the driving force of why a person makes a navigational selection amongst a group of competing options.

IFT itself builds on more established theories such as Optimal Foraging Theory (OFT) (Stephens and Krebs, 1986) and the Adaptive Control of Thought-Rational Theory (ACT-R) (Anderson et al., 2004). OFT is an ecological theory concerned with explaining the foraging behavior of animals as they hunt for food. OFT assumes each animal goes through a search–encounter–decision process as they forage, with the goal being to maximize net energy gained. To maximize energy, the animal is faced with the decision of which prey to eat or how long to forage in a patch. OFT is used to explain the behavioral elements of people foraging for information.

ACT-R is a psychological theory of the human mind that includes the cognitive architecture and process by which cognition works. IFT uses a production rule system from ACT-R to determine probabilistically which action is selected based on its utility within the context of a user's current goal. For example, an action to click on a hyperlink may be chosen over backing up to a previously visited page because following the hyperlink may be more likely to lead to the information being sought. ACT-R is used to explain at a cognitive level why actions are performed.
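As an illustration of utility-based probabilistic selection, the following minimal sketch assumes a Boltzmann (softmax) choice rule, which is one common way of turning utilities into selection probabilities. The action names, utility values, and temperature parameter are hypothetical; they are not taken from ACT-R or from this dissertation.

```python
import math
import random


def selection_probabilities(utilities, temperature=1.0):
    """Return softmax probabilities for each candidate action.

    `utilities` maps an action name to a numeric utility (assumed values).
    A lower `temperature` makes the highest-utility action dominate.
    """
    # Subtracting the maximum utility keeps exp() numerically stable
    # and leaves the resulting probabilities unchanged.
    top = max(utilities.values())
    weights = {a: math.exp((u - top) / temperature) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose_action(utilities, temperature=1.0, rng=random):
    """Sample one action according to its softmax probability."""
    probs = selection_probabilities(utilities, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical utilities for a forager looking at one Web page.
utilities = {"click_link": 2.0, "go_back": 0.5, "leave_site": 0.1}
probs = selection_probabilities(utilities)
```

Under these assumed utilities, clicking the hyperlink is the most probable action, yet backing up or leaving the site can still be selected occasionally, which is what makes the selection probabilistic rather than deterministic.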
IFT was originally developed to be used in a “production rule” environment, where a user would perform an action when the conditions of a rule were met. However, the use of IFT in clickstream research requires conceptualizing the ideas of IFT in a non-production rule environment. In essence, this requires utilizing user action to infer the cognitive process and thus the reasoning behind the observed behavior. To meet such an end this dissertation describes how information patches and trails of information scent can be learned from clickstream data. However, the main focus of this dissertation is to determine how the concepts of IFT can be used to build a clickstream model of information foraging (CMIF). The model relies on measures derived from clickstream data representing IFT concepts to explain goal achievement at “long tail” Web sites that have limited traffic. Goal achievement is from the perspective of the online firm and consists of something the firm would like to happen at their Web site (i.e., a choice). This dissertation examines Web sites where the goal is the purchase of a product or the submission of a contact form.

The term “long tail” refers to a Web site that resides in the tail of a power law distribution (Anderson, 2006). Figure 1 shows a hypothetical power law distribution illustrating Web sites and their popularity in terms of the number of visits they received [1]. The head of the curve (darkly shaded portion) represents the most popular Web sites such as Amazon.com and eBay.com. The long drawn-out tail of the curve (lightly shaded portion) extends to include all other Web sites.

Figure 1: Power Law Distribution

The decision to analyze a user's behavior at long tail Web sites was motivated by the ability of IFT to guide analysis. Compared to sites in the head, long tail Web sites have significantly smaller amounts of data, which is precisely where theory can help guide analysis the most. Lacking theory, analysis would require large amounts of data to work well with commonly used techniques such as data mining. Long tail Web sites by their very nature are prohibitively sparse in data, which hampers the application of such an exploratory approach.

The remainder of this chapter is organized as follows. First, the research questions guiding this dissertation are introduced in §1.1. A brief discussion of the contributions of this dissertation is given in §1.2. Finally, §1.3 provides a brief overview of the structure of this dissertation.

[1] The power law distribution of Web sites and traffic has been previously confirmed through empirical study (Adamic and Huberman, 2001) and simulation (Kavassalis et al., 2004). Power-law distributions have also been observed in numerous other instances such as the sales of products (Anderson, 2006); frequency of word usage in English text; number of telephone calls received; frequency of family names in the United States; and citations of academic papers (Newman, 2005).
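To make the head and tail of Figure 1 concrete, the sketch below assumes that visits fall off with popularity rank r as top_visits / r^alpha. The number of sites, the visit count of the top-ranked site, the exponent, and the head size of 100 sites are illustrative assumptions, not figures from this dissertation.

```python
def power_law_visits(num_sites, top_visits=1_000_000, alpha=1.0):
    """Visits for each popularity rank r (1-based): top_visits / r**alpha."""
    return [top_visits / rank ** alpha for rank in range(1, num_sites + 1)]


def tail_share(visits, head_size):
    """Fraction of all visits that go to sites outside the top `head_size`."""
    return sum(visits[head_size:]) / sum(visits)


# Hypothetical population of 10,000 Web sites, with the top 100 as the "head".
visits = power_law_visits(num_sites=10_000)
share = tail_share(visits, head_size=100)
```

Under these assumptions each individual tail site receives only a sliver of the traffic of a head site, even though the tail as a whole still accounts for a large share of visits. This per-site sparsity is the property that motivates a theory-guided analysis.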

1.1 Research Questions

The following subsections describe the research questions guiding this dissertation. The first research question is in relation to the concept of information patches. The second research question more fully explores the concept of information scent. Finally, the third research question brings all the concepts of IFT together to develop a clickstream model of information foraging.

1.1.1 Research Question 1 – Learning Patches

An information patch is defined as an area of the search environment with similar information (Pirolli, 2007). Within a Web context, what constitutes a patch is dependent on the level of analysis being examined. At a high level of analysis, an entire Web site can be considered a patch. When examined from a finer-grained level of analysis, each individual page of a Web site can also be considered a patch. While such conceptualizations of a patch are straightforward, they are effectively being defined by the creator of the content rather than the user.

The Web, however, is a pliable environment where foragers have the choice of what material to view. Effectively, this allows a forager to define their own information patch that is uniquely relevant to their goal. Such patches may consist of a group of Web pages, which individually may mean very little, but when combined provide an area of the search environment that is seen as valuable to the user. Therefore, the first research question attempted to discover how such patches can be learned.

Research Question 1: How can information patches be learned from a long tail Web site?

Although each user is free to define patches of value as they see fit, certain patterns of patches may emerge among foragers with similar information goals. From the viewpoint of the online firm, knowing who values what patch can provide insights into the information goal of the forager. By categorizing patches as valuable to goal-achievers or non-goal-achievers, the firm may be able to better explain goal achievement at long tail sites dependent on what patches were visited by a user. Therefore, a measure was also developed which quantified a user's visitation of valuable goal patches.
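The idea of categorizing a candidate patch as valuable to goal-achievers or non-goal-achievers can be sketched with a 2x2 contingency table. The sketch assumes that a patch is a set of pages, that a session "visits" a patch when it views every page in the set, and that a Pearson chi-square statistic is used to compare the two groups. The page names and session counts are made up, and the search over candidate patches and the thresholds used in the dissertation are not reproduced here.

```python
def contingency(sessions, patch):
    """Count sessions by (visited the patch?, achieved the goal?)."""
    table = {(v, g): 0 for v in (True, False) for g in (True, False)}
    for pages, goal in sessions:
        # A session visits the patch if the patch is a subset of its pages.
        table[(patch <= pages, goal)] += 1
    return table


def support_difference(table):
    """Patch support among goal sessions minus support among non-goal sessions."""
    goal_total = table[(True, True)] + table[(False, True)]
    non_goal_total = table[(True, False)] + table[(False, False)]
    return table[(True, True)] / goal_total - table[(True, False)] / non_goal_total


def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table (1 degree of freedom)."""
    a, b = table[(True, True)], table[(True, False)]
    c, d = table[(False, True)], table[(False, False)]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical sessions: (set of pages viewed, whether the goal was achieved).
sessions = (
    [({"home", "services", "pricing"}, True)] * 8
    + [({"home", "about"}, True)] * 2
    + [({"services", "pricing"}, False)] * 2
    + [({"home"}, False)] * 8
)
patch = {"services", "pricing"}
table = contingency(sessions, patch)
```

With these made-up counts the candidate patch is visited by 80% of goal sessions and 20% of non-goal sessions, and the statistic exceeds 3.841, the critical value for one degree of freedom at the 0.05 level, so the patch would be treated as distinguishing the two groups.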

1.1.2 Research Question 2 – Learning Scent Trails

Information scent is the driving force behind why a person makes a navigational selection amongst a group of competing options. As foragers are assumed to be rational, scent is a mechanism by which foragers reduce their search costs by increasing their accuracy on which option leads to the information of value (Pirolli, 2007). Based on the information goal of a forager, each hyperlink on a Web page gives off a scent. The higher the scent, the more likely the page that is being linked to may contain the information being sought. Similar to a bloodhound that follows a scent trail over distances to find an item of interest, a forager also follows a scent trail to find the information they seek over multiple Web pages. The second research question sought to explain how scent trails may be learned.

Research Question 2: How can information scent trails be learned from a long tail Web site?

Similar to the learning of patches, each user may have their own scent trail. However, patterns may exist from fragments of scent trails that emerged among foragers with similar information goals. These fragments of scent trails are of value to the online firm in distinguishing between possible goal-achievers and non-goal-achievers. When a user follows these known fragments of scent trails it may provide clues into their information goal and thus help in explaining goal achievement at long tail sites. Thus, a measure was developed which computed the following of goal scent trails.

1.1.3 Research Question 3 – Clickstream Model of Information Foraging

The previous two research questions examined the concepts of information scent and patches individually. However, the real value of IFT is its ability to combine aspects of a user's search environment (i.e., patches) and their actions (i.e., scent) together. Thus the main focus of this dissertation and the final research question was how these concepts could be combined using clickstream data to infer goal achievement.

Research Question 3: How can information foraging theory and clickstream data be used to explain the achievement of a goal at a long tail Web site?

To answer the third research question, two versions of a clickstream model of information foraging (CMIF) were created which used clickstream metrics to represent the concepts of information scent and patches. The user-centric (UC) model exploited user-centric data (Padmanabhan et al., 2001) about a forager's entire browsing behavior to explain goal achievement at a long tail Web site. This model compared a forager's behavior across multiple Web sites. However, due to user-centric data typically being aggregated at the session level, the model lacked depth at individual Web sites.

Since data about a user's entire clickstream over multiple sites is rarely available to an online firm, a site-centric (SC) version of the model employing site-centric data (Padmanabhan et al., 2001) was also developed. Page-level data made the site-centric model capable of analyzing patches at all levels of analysis along with information scent at a Web site.

1.2 Contributions

Listed below are the major contributions of this dissertation.

First, this dissertation demonstrated how IFT could be used as a theoretical basis for clickstream research. Through the creation of two versions of a clickstream model of information foraging, the concepts of IFT were quantified outside of a production rule environment. In addition, the CMIF not only operationalized the core concepts of IFT, but also extended the theory by introducing memory, forager-independent valuation of patches and trails, along with refined definitions of scent (e.g., strict and relaxed scent). Once tested, many of the core concepts of IFT were supported, as were many of the theoretical extensions. Thus, this dissertation not only demonstrated the ability of IFT to explain goal achievement, but it also introduced theoretical extensions which provided a more in-depth explanation of goal behavior.
This dissertation also presented a methodology on how to learn patches and scent trails using not only significant, but also supported contrast sets. Measures were also created which quantified a forager's visitation of patches and following of trails. The metrics measured the most valuable, last, and summation of all patches and trails that were visited or followed. For those Web sites within the CMIF that discovered patches and trails, the measures were capable of distinguishing goal from non-goal sessions according to a forager's visitation and following behavior.

The third contribution was a methodology that detailed how to preprocess datasets with long tail Web sites. In particular, a separate user- and site-centric methodology was presented which highlighted the unique challenges associated with preprocessing each dataset. For example, a process was provided for the site-centric dataset about how to locate and select a single definable goal on Web sites which have more than one available goal.

Finally, due to the presence of IFT guiding analysis, traditionally under-studied long tail Web sites were able to be examined even in light of their sparse datasets. As far as can be determined, this dissertation is the first to empirically study goal achievement on long tail Web sites.

1.3 Dissertation Structure

The structure of the remaining chapters of this dissertation is outlined below. Each item in the list provides a brief summary of the main purpose of each chapter.

Chapter 2. Literature Review – An overview of prior clickstream research along with the datasets and metrics used in that research.

Chapter 3. Theory – A detailed explanation of information foraging theory along with the two theories IFT draws from: ACT-R and OFT.

Chapter 4. Hypotheses – The hypotheses for both the user- and site-centric versions of the CMIF (third research question). In addition, an explanation of the extensions to IFT is provided.

Chapter 5. Methodology – A separate methodology for the user- and site-centric versions of the CMIF is presented that covers the data used, how measures were calculated, and finally how the hypotheses were tested. The appendix contains a description of how to learn patches and trails (first two research questions).

Chapter 6. Datasets – A detailed explanation of the series of preprocessing steps each dataset went through to obtain a final dataset.

Chapter 7. Results – A listing and discussion of the results for each of the three research questions. Descriptive statistics are provided about learned patches and trails. In addition, statistical tests and a discussion of each of the hypotheses for the third research question are also provided.

Chapter 8. Temporal – An alternate time-sensitive representation of the site-centric CMIF. The methodology, results, and discussion are provided for the seven tested hypotheses.

Chapter 9. Conclusion – Summarizes this dissertation and provides limitations, contributions, and directions for future research.


Chapter 2
Literature Review

This chapter provides a summary of prior research which has focused on the behavior of visitors at Web sites using clickstream data. A brief list of terms commonly used throughout this dissertation is provided first in §2.1. Then prior research is summarized and classified by research focus in §2.2. Table 1 lists the general research questions and results of prior research, while §2.2.1–2.2.4 give more in-depth descriptions of the literature. The datasets and metrics used in each study are then discussed in §2.3 and §2.4, respectively.

2.1 Terminology

In order to be clear and consistent, definitions of terms commonly used throughout this dissertation are provided below. Each bolded term is followed by its definition. If any synonymous terms exist they are italicized in parentheses immediately following the bolded term.

Path – sequence of Web pages viewed during a session or user session.

Sector – a collection of Web sites with products, services, and/or information which are of a similar nature (e.g., food).

Session – a time-contiguous sequence of Web page views at the same Web site for the same visitor.

User Session – a time-contiguous sequence of Web page views at any number of Web sites for the same visitor.

Visitor (user) – a person making Hypertext Transfer Protocol (HTTP) requests at a single Web site or multiple Web sites.

Web page (page) – a file written in Hypertext Markup Language (HTML) containing information that is viewable via a Web browser (e.g., index.html).

Web site (site) – a collection of Web pages housed under the same top- and second-level domain name (e.g., amazon.com).
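The difference between a session and a user session can be made concrete with a short sketch that splits a visitor's page views into time-contiguous runs. The definitions above do not say how large a gap breaks contiguity, so the 30-minute `max_gap` default is an assumption, as are the `PageView` record, the `sessionize` helper, and the sample data. Grouping by visitor and site yields sessions, while grouping by visitor alone yields user sessions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Callable, Dict, Hashable, List, NamedTuple


class PageView(NamedTuple):
    visitor: str
    site: str        # top- and second-level domain, e.g. "amazon.com"
    page: str
    time: datetime


def sessionize(
    views: List[PageView],
    key: Callable[[PageView], Hashable],
    max_gap: timedelta = timedelta(minutes=30),  # assumed inactivity cutoff
) -> List[List[PageView]]:
    """Split page views into time-contiguous runs within each key group."""
    groups: Dict[Hashable, List[PageView]] = defaultdict(list)
    for view in sorted(views, key=lambda v: v.time):
        groups[key(view)].append(view)

    sessions: List[List[PageView]] = []
    for group in groups.values():
        current = [group[0]]
        for previous, following in zip(group, group[1:]):
            if following.time - previous.time > max_gap:
                sessions.append(current)   # gap too long: close the run
                current = []
            current.append(following)
        sessions.append(current)
    return sessions


# Hypothetical page views for one visitor.
start = datetime(2009, 1, 1, 10, 0)
views = [
    PageView("u1", "a.com", "index.html", start),
    PageView("u1", "a.com", "products.html", start + timedelta(minutes=5)),
    PageView("u1", "b.com", "index.html", start + timedelta(minutes=7)),
    PageView("u1", "a.com", "contact.html", start + timedelta(minutes=90)),
]

site_sessions = sessionize(views, key=lambda v: (v.visitor, v.site))
user_sessions = sessionize(views, key=lambda v: v.visitor)

# A path is the sequence of pages viewed during a session or user session.
first_user_path = [v.page for v in user_sessions[0]]
```

Under these assumptions the visitor has three sessions (two at a.com and one at b.com) but only two user sessions, since the first user session spans both sites.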


2.2 Prior Research

In keeping with prior frameworks, the objective of a visitor at a site can be classified as browsing or purchasing (Bucklin et al., 2002). A browsing objective reflects how a visitor may navigate within a site (Bucklin and Sismeiro, 2003), across multiple sites (Park and Fader, 2004), or how site visits evolve over time (Moe and Fader, 2004a). Conversely, a purchasing objective is interested in discovering factors which affect a visitor's propensity to purchase (Sismeiro and Bucklin, 2004). However, the purchasing objective can be seen as a specific instance of the more general goal achievement objective as many sites have purposes other than purchasing (e.g., filling in a contact form, posting a message, responding to a survey). Therefore, the objective of a visitor can be classified as browsing, purchasing, achieving a goal¹, or exploring multiple objectives simultaneously (Moe, 2003).

Table 1 categorizes past studies by which objective the research was examining and then summarizes the research questions and results obtained. A more thorough description of prior research is provided in the subsections following the table.

¹ Although purchasing is a subset of goal achievement it is retained as a separate objective since numerous studies specifically examine purchasing behavior.


Table 1: Prior Literature Results

MULTIPLE OBJECTIVES

Article: Kalczynski et al. (2006)
Research Question/Purpose: How well do clickstream-complexity measures predict task completion?
Results: Two Web site-independent clickstream-complexity measures representing the linearity and density of a session were found to perform the best with accuracies between 65% and 93% depending on the task and site.

Article: Moe (2003)
Research Question/Purpose: What visitor behavior can be uncovered from the pattern and type of pages viewed?
Results: Four groups of visitors differing in search behavior and purchasing horizon were found, along with a fifth group of non-serious visitors. The purchase probability of each group differed depending on how immediate the purchase was and how directed the browsing behavior was.

BROWSING

Article: Bucklin and Sismeiro (2003)
Research Question/Purpose: Do visitors change the way they browse a Web site at the session or site level?
Results: Visitors did dynamically change their browsing behavior at both the session and site level. Within a session browsers exhibited lock-in as they browsed deeper into a Web site. Across sessions a learning effect was observed which reduced the number of pages viewed, but not the duration spent on each page.

Article: Danaher et al. (2006)
Research Question/Purpose: What factors affect visit duration?
Results: Age interacted with gender, Web site functionality, and the graphical content of a site negatively with regards to the duration spent on a Web site. Age interacted positively increasing duration for older visitors for higher levels of text and advertisements on a Web site.

Article: Johnson et al. (2004)
Research Question/Purpose: Does reduced search cost lead to increased search?
Results: Overall search levels were low across the three sectors examined. Browsing behavior was also found to differ depending on sector and level of activity.

Article: Moe and Fader (2004a)
Research Question/Purpose: To model individual-level evolving visit patterns over time.
Results: Examining data at an individual-level contradicted aggregated visit patterns. More frequent visits and an increase in visiting rates increased visitors' probability of purchasing.

Article: Park and Fader (2004)
Research Question/Purpose: Understand cross-site visiting behavior at the individual level.
Results: An ability to predict when a visitor will first visit a Web site given their visiting pattern at another site.

Article: Zhang et al. (2006)
Research Question/Purpose: How does search cost, product characteristics, previous search behavior, and consumer characteristics affect search depth?
Results: Lower search costs and prior search behavior were positively correlated with search depth. Price and consumer characteristics were positively correlated to search depth for only certain product types.

PURCHASING

Article: Moe and Fader (2004b)
Research Question/Purpose: To model individual-level dynamic conversion behavior.
Results: The individual-level model contradicted aggregated conversion trends. Over time the overall purchase probability of a visitor decreased, repeat visits had less of an impact on purchasing, and visitor experience raised the purchasing threshold.

Article: Montgomery et al. (2004)
Research Question/Purpose: Can the path a visitor takes through a Web site help predict purchase?
Results: Future paths were predicted with greater accuracy by the model using paths and by allowing search behavior (i.e., exploratory, directed) to change during a session. Purchase prediction was 10% and 21% accurate after a visitor viewed one page and six pages, respectively.

Article: Padmanabhan et al. (2001)
Research Question/Purpose: What are the implications of using site-centric (i.e., incomplete) data versus user-centric (i.e., complete) data?
Results: Models using user-centric data outperformed models using site-centric data by a wide margin. Using site-centric data can lead to erroneous results since significant metrics in site-centric models may no longer be significant in user-centric models.

Article: Sismeiro and Bucklin (2004)
Research Question/Purpose: Does viewing the purchasing process as a series of tasks increase prediction accuracy?
Results: The multi-task model outperformed the competing single task models supporting the series of tasks concept. The model metrics differed in effect sign, size, and significance between tasks indicating some metrics were better predictors of some tasks over others.

Article: Van den Poel and Buckinx (2005)
Research Question/Purpose: How well do different types of metrics predict purchases?
Results: Detailed clickstream metrics, which were divided according to the underlying content of the page (e.g., product information, community pages), were found to be the most important predictors of purchase.

GOAL ACHIEVEMENT

Article: Chatterjee et al. (2003)
Research Question/Purpose: To model a visitor's probability of clicking a banner advertisement.
Results: Advertisements exhibited “wearout” such that multiple exposures reduced the probability of a visitor clicking an advertisement. Infrequent visitors were also more likely to click on a banner advertisement than frequent visitors.


2.2.1 Multiple Objectives

Moe (2003) created a two-dimensional typology and sought to discover metrics which helped categorize the within-session shopping strategies of visitors. The first dimension of the typology, search behavior, was dichotomized into following a directed or exploratory pattern² (Janiszewski, 1998). Directed searching occurs when a visitor has a particular goal or product in mind (Rowley, 2000). Exploratory search, on the other hand, takes an undirected approach where the visitor may not be attempting to locate a particular product or meet a specific goal. The time horizon, in which the expected purchase is to take place, either immediately or in the future, was the second dimension of the typology.

As seen in figure 2, four categories of shopping strategies emerged from the typology: directed buying; search and deliberation; hedonic browsing (i.e., exploratory or stimulus-driven browsing); and knowledge building. Each strategy was expected to have a unique pattern of the type, variety, and number of repeat viewings of particular types of pages.

                     Search Behavior
Purchasing Horizon   Directed                     Exploratory
Immediate            Directed buying (20.0%)      Hedonic browsing (1.4%)
Future               Search/deliberation (6.6%)   Knowledge building (< 0.1%)

Figure 2: Shopping Strategy Typology (Moe, 2003)

Using seven weeks of data from a nutritional supplement store, Moe (2003) empirically tested the typology using cluster analysis and found all four theorized categories were present along with a fifth category of non-serious visitors³. The two most important metrics found for discriminating between shopping strategies within a session were the number of different category pages viewed and the maximum number of times a product page was viewed (Moe, 2003). Figure 2 also contains the conversion rate of each category, in parentheses, which was found to range from < 0.1% to 20.0%.

² Bloch et al. (1986) created a framework for consumer information search and also delineated between two search behaviors, pre-purchase and ongoing search. Pre-purchase search, which was defined as seeking to facilitate decision making about a particular goal, maps to the directed search behavior. Ongoing search maps to the exploratory search behavior and was defined as searching that is independent of a goal.

³ Visitors in the fifth category, in general, viewed two pages and spent a short amount of time on each page. Due to the limited browsing behavior exhibited on the site by these visitors before leaving they were not considered as having a serious interest in the site. Nicholas et al. (2007) termed those non-serious visitors as 'bouncers' who go from site to site without deeply penetrating or frequently returning to the Web site of interest.

Examining the behavior of visitors performing purchasing and information-seeking tasks over five Web sites, Kalczynski et al. (2006) used the navigational complexity of a visitor's session to help predict the completion of tasks. The central idea of navigational complexity is the correspondence with an underlying search behavior (e.g., Moe, 2003) where, for example, a less complex session is associated with a directed search behavior whereas greater complexity in a session points toward an exploratory search behavior. Using graph theory, each visitor's session was decomposed into a clickstream graph which represented the Web pages and links traversed within a Web site and allowed for the calculation of navigational complexity.

A total of 485 sessions, in a controlled experiment, attempted to complete three purchasing and three information-seeking tasks with the overall success rates for the tasks varying from 8.8% to 56% (Kalczynski et al., 2006). Two clickstream graph-complexity metrics representing the linearity and density of a session were used in binary logistic regression models for each task. Overall, the models correctly classified a session between 64.9% and 93.1% of the time depending on the Web site and task⁴ (Kalczynski et al., 2006).

2.2.2 Browsing

Moe and Fader (2004a) explored the pattern and evolution over time of a visitor's browsing behavior at the individual-level. The authors argued that aggregating browsing behavior at the site-level to create general traffic patterns may lead to a false understanding of the complete browsing behavior occurring at a site (Moe and Fader, 2004a). For instance, aggregated data may indicate an upward trend in both the number of visitors and rates of visits to a site. The inclusion of new visitors may, however, be masking a decline in visiting rates for experienced visitors (i.e., established customers).
Moe and Fader (2004a) used eight months of user-centric data focusing on Amazon.com and CDNow.com to validate a nonstationary evolving visit model. The model took into account an individual's heterogeneity, visiting rate, and evolution of visiting rates over time. Compared against an exponential-gamma timing process, which did not allow for change over time, the evolving visit model was more accurate in estimating the likelihood and when a visitor would return to a site (5% overprediction versus 37%) (Moe and Fader, 2004a). In addition, the distribution of visiting rates did show a decline in visiting rates for experienced visitors which contradicted the aggregated trends. Furthermore, more frequent visits and an increase in visiting rates were found to be significant in terms of a visitor's probability of purchasing (16.6% versus 11.1% and 5.5% versus 2.4%, respectively) (Moe and Fader, 2004a).

Also concerned with aggregated statistics being used to infer visitors' browsing behavior, Bucklin and Sismeiro (2003) created an individual-level model of browsing behavior within a Web site. The first aspect of the model accounted for a visitor's decision to continue browsing the site or exit the site. The second aspect was concerned with the duration of time a visitor spent on each individual page. Using a type II tobit model and one month of data from a commercial automotive Web site, four distinct browsing behaviors were identified: a learning effect, within-site lock-in, time-constraints, and a cost-benefit view.

The results of the first behavior, learning effect, were consistent with prior research showing the overall duration of sessions decreased with each subsequent session (Johnson et al., 2003). Although the overall session duration and number of pages viewed decreased, the duration spent on each page did not significantly differ from previous sessions (Bucklin and Sismeiro, 2003). The second behavior was based on the concept of lock-in (Johnson et al., 2003; Zauberman, 2003); however, in this context the lock-in corresponded to a visitor becoming more engrossed as they continued to browse a Web site within the same session instead of over time. The results supported this idea of within-site lock-in since the amount of time spent viewing each page increased as the number of pages viewed in a session increased (Bucklin and Sismeiro, 2003).

⁴ The model with 93.1% accuracy was for the task with only 8.8% success. As only the overall accuracy and not the specificity and sensitivity were provided, the practical benefit of the model is unknown; Kalczynski et al. (2006) acknowledged this limitation.
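The caveat about the 93.1% accuracy reported for the task with an 8.8% success rate can be illustrated with simple arithmetic. The sketch below compares the reported accuracy against a trivial classifier that always predicts the more common outcome. The two percentages are taken from the text; the `majority_class_accuracy` helper and the comparison itself are added here for illustration and were not part of the cited study.

```python
def majority_class_accuracy(success_rate: float) -> float:
    """Accuracy of a trivial classifier that always predicts the more common outcome."""
    return max(success_rate, 1.0 - success_rate)


success_rate = 0.088       # success rate of the task, as reported in the text
reported_accuracy = 0.931  # model accuracy for that task, as reported in the text

baseline = majority_class_accuracy(success_rate)
improvement = reported_accuracy - baseline

print(f"baseline accuracy:    {baseline:.3f}")
print(f"gain over baseline:   {improvement:.3f}")
```

Because so few sessions succeeded at this task, always predicting failure would already be correct for most sessions. This is why overall accuracy alone, without sensitivity and specificity, says little about how well successful sessions were identified.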
Time-constraints, the third behavior, showed the probability of a visitor staying on the Web site decreased as the overall session duration increased. The final behavior demonstrated visitors likely performed some type of cost-benefit analysis since a page with greater amounts of information increased a visitor's probability of staying on the Web site. However, the probability of a user leaving the site can also increase with greater levels of information. For example, reading all the information on a page may result in longer page durations which translate into longer session durations. Due to time-constraints, longer session durations then lead to a greater probability of a user leaving the Web site (Bucklin and Sismeiro, 2003).

Echoing the concerns of Padmanabhan et al. (2001) about using incomplete data, Park and Fader (2004) posited the timing and frequency of future visits to a site can be better explained by examining visiting behavior at other sites. Specifically, the browsing behavior in terms of visit timing and visit rates compared to other sites can be examined. For instance, one visitor may have high visit timing in which a visit to one site is followed shortly by a visit to another site. A different visitor may have a high visit rate where the number of visits to each site is similar, regardless of the coincidence of visit timing. The relationship of both these concepts to a visitor's browsing behavior can be used to predict future visits to a site of interest.

Using a multi-variate timing mixture model with closed-form analytic expressions, Park and Fader (2004) looked at the browsing behavior of visitors from two pairs of sites within the book and music sectors. Four models were compared, which differed based on whether correlation in visit timing and rates was accounted for, with the proposed model accounting for both correlations. The proposed model was found to provide the best fit and performed well when long spaces of time occurred between visits (Park and Fader, 2004). However, when visits to different sites occurred on the same day, the proposed model failed to perform as well. The proposed model also outperformed the other three models in identifying zero class customers (i.e., customers who have not visited the Web site) who become non-zero class customers (i.e., customers who will visit the Web site) in the future (Park and Fader, 2004).

Since the average duration a visitor spends on a Web site is a component of the stickiness of that site (i.e., ability to attract and keep the interest of a visitor) (Bhat et al., 2002), Danaher et al. (2006) set out to uncover the factors that affect visit duration. The resulting model took into account two sources of individual-level heterogeneity in the form of demographics and a visitor's situational characteristics for a particular visit to a site (e.g., weekday versus weekend visit, number of previous visits). Site-level heterogeneity included measures of the textual, graphical, and advertising content of a Web site. Measures representing the background complexity and overall Web site functionality⁵ were also included in the model.

⁵ Functionality was measured as the average of 19 binary items indicating the presence or absence of features on the Web site such as “...online help, search functions, sitemaps, user registration, e-mail contact availability, chat rooms, and message boards” (Danaher et al., 2006, pgs. 186–187).

Using a month of panel data from 1,655 panelists for the 50 most-visited Web sites in the dataset, the developed model demonstrated that all three sources of heterogeneity were significant in explaining duration. Although all were significant, visitors' situational characteristics accounted for almost 80% of the variance explained (Danaher et al., 2006), providing support that clickstream metrics in the absence of demographics and Web site characteristics can explain a substantial part of visitors' behavior. In terms of specific metrics, age was found to interact significantly with almost all of the demographic and Web site-specific metrics. For instance, the functionality of a Web site and age interacted such that an increase in functionality decreased the duration a visitor spent on the site the older the visitor was. The opposite relationship was found between age and advertising content where visit duration increased for older visitors when visiting Web sites with more advertisements.

Due to the relatively costless nature of searching on the Internet, Johnson et al. (2004) sought to answer the question: do reduced search costs lead to increases in search behavior? To answer that question, search behavior was operationalized into three components consisting of the depth, dynamics, and activity of search. The resulting Hierarchical Bayesian model accounted for a household's visitation of multiple sites within the same sector (depth), change in search behavior over time (dynamics), and amount of overall activity in a sector (activity).

Focusing on three sectors (books, music, and air travel), the search behavior of households at 51 of the most visited Web sites (13 books, 16 music, and 22 air travel) within the dataset was analyzed. Consistent with prior research, it was found that overall households searched very little (Zauberman, 2003) in all three sectors, although more search behavior was found within the air travel sector than the others (Johnson et al., 2004). However, households searching within the air travel sector were more likely to gravitate toward a preferred Web site over time (Johnson et al., 2004), thus indicating a propensity for less search in the future. Not surprisingly, a relationship between activity and depth of search was significant for all sectors, indicating households that were more active in a sector were more likely to search across sites (Johnson et al., 2004).

Following in the footsteps of Johnson et al. (2004), Zhang et al. (2006) also examined the search behavior of households, albeit using data collected four years later. The time span between the datasets highlighted the contrasting search behavior of households from the infancy of e-commerce to its relative maturity. Looking at both product price and the quality of the e-commerce store, an analytical model and propositions of a household's online search behavior were created. Examining two of the three same sectors as Johnson et al. (2004) (music and air travel) and one new sector (computer hardware), linear regression models were used to test hypotheses derived from the analytical model's propositions. The hypotheses sought to determine the relationship of search depth to search cost, product characteristics, previous search behavior, and consumer characteristics.

Compared to prior research, overall search depth increased and loyalty to a Web site decreased (Zhang et al., 2006), which is contrary to the belief that households would gravitate towards a preferred Web site over time (Johnson et al., 2004). It was also found that households took both the price of the product and the quality of the e-commerce store into consideration (Zhang et al., 2006). Like Danaher et al. (2006), who found age was an important moderating variable for visit duration to a site, age was also found to be positively related to search depth, although only within the air travel sector. All told, the linear regression models had adjusted R² values between 4.4% and 11.5% (Zhang et al., 2006), indicating other metrics may also be of interest for explaining search depth.

2.2.3 Purchasing

Montgomery et al. (2004) sought to predict purchase conversion by examining the path a visitor took as they browsed a Web site. The path was assumed to provide clues into the goals of the visitor and consisted of the sequence and types of pages (e.g., home, product, and category) viewed throughout a session.
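To show how a path of page types might be represented in practice, the sketch below stores a session as a list of (page type, page identifier) pairs and derives a few simple path metrics from it. It also computes the two within-session metrics that Moe (2003) found most discriminating: the number of different category pages viewed and the maximum number of times a product page was viewed. The sample path, the page identifiers, and the helper names are hypothetical, and this is not the model used by Montgomery et al. (2004).

```python
from collections import Counter
from typing import List, Tuple

PageView = Tuple[str, str]  # (page type, page identifier)


def type_sequence(path: List[PageView]) -> List[str]:
    """Sequence of page types viewed, in order."""
    return [page_type for page_type, _ in path]


def distinct_pages(path: List[PageView], page_type: str) -> int:
    """Number of different pages of the given type that were viewed."""
    return len({page_id for t, page_id in path if t == page_type})


def max_repeat_views(path: List[PageView], page_type: str) -> int:
    """Largest number of times any single page of the given type was viewed."""
    counts = Counter(page_id for t, page_id in path if t == page_type)
    return max(counts.values(), default=0)


# Hypothetical session path.
path = [
    ("home", "/"),
    ("category", "/books"),
    ("product", "/books/1"),
    ("product", "/books/2"),
    ("category", "/music"),
    ("product", "/books/1"),
]

sequence = type_sequence(path)
categories_seen = distinct_pages(path, "category")
most_views = max_repeat_views(path, "product")
```

Keeping the page identifier alongside the page type is a design choice made here so that both the ordered type sequence and the per-page counts can be computed from one structure.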
Figure 3 provides an example of two distinct paths from two visitors who eventually arrive at the same product page. The first visitor appears to have taken a direct route to the product of interest, thus exhibiting a deliberate path. In contrast, the second visitor appears to be browsing, due to the number of product and category pages being viewed. These two behaviors are very similar to the search behavior dimension of the shopping strategy typology from Moe (2003). However, unlike Moe (2003) which categorized a visitor as having a static search behavior for the entire session, the dynamic multinomial probit model by Montgomery et al. (2004) included the ability to account for changes to a visitor's search behavior within a session. Therefore, while a visitor may not have a specific goal at the beginning of a session, they may transition at some point in the session into having a goal or vice-versa.

Visitor   Sessionᵃ
1         ⟨H, C, P⟩
2         ⟨H, C, P, P, P, C, P, P, C, P, P, H, C, P⟩
ᵃ Types of pages: H = Home; C = Category; P = Product.

Figure 3: Category of Pages Viewed in a Path

One month of panel data focusing only on visitors to BarnesandNoble.com was used to empirically evaluate the accuracy of the proposed model. First, the general accuracy of the model's ability to correctly predict future paths based on prior paths of the same visitor was evaluated. Using a holdout sample, future paths were predicted with 83.2% accuracy (Montgomery et al., 2004). Second, it was found that models which allowed for search behavior to change within a session were more accurate at predicting paths than other models (Montgomery et al., 2004). Lastly, the accuracy of predicting purchase conversion by the end of a session using path information of that session was evaluated. As a path is a discrete set of pages viewed, the purchase conversion prediction can be calculated after each page viewed. Using a holdout sample the accuracy after one page and six pages viewed was 10.4% and 21.2%, respectively, with the accuracy increasing as more pages were viewed (Montgomery et al., 2004).

Predicting if a purchase would be made during a visitor's next visit to a site, Van den Poel and Buckinx (2005) investigated the importance of four different categories of metrics on purchase prediction. The first category of metrics aggregated clickstream measures for a particular visitor regarding all their previous visits to the site. The second category provided detailed clickstream metrics according to the particular content being visited (e.g., a product page), as opposed to the entire site in general. The third and fourth categories dealt with demographic and past purchase metrics, respectively.

An exploratory approach to cull the list of 92 available metrics down to a reasonable set for use in logit models was done via three competing metric selection methods. Using 10 months of data from a commercial wine seller, 11 distinct metrics were used to create the models corresponding to the metric selection method employed. The criteria for judging the models found the best model using the validation dataset was low in accuracy for both the proportional chance criterion (Morrison, 1969) and the area under receiver operating characteristic curve criterion (Fischer et al., 2003). Although not extremely accurate, Van den Poel and Buckinx (2005) did find the detailed metrics provided the greatest predictive performance.

Padmanabhan et al. (2001) investigated the implications of using incomplete clickstream data to train models for prediction purposes. Specifically, the purpose was to determine if a purchase would occur during the remainder of a session or at any point in the future. The potential problem of using incomplete data is only a visitor's browsing behavior for the particular site of interest is observed. For instance, figure 4 provides an example of two types of data for two visitors. Examining only the site-centric data, it appears both visitors are similar since they have both visited three pages at site A. However, if user-centric data is examined instead, the picture of the two visitors' browsing sessions is much different. Visitor 1 is visiting two other sites in addition to site A, whereas visitor 2 is only visiting site A.

Visitor   User-centricᵃ                   Site-centricᵇ
1         ⟨A_1, A_2, B_1, A_3, C_1, C_2⟩  ⟨A_1, A_2, A_3⟩
2         ⟨A_1, A_2, A_3⟩                 ⟨A_1, A_2, A_3⟩
ᵃ Notation: X_y indicates the y-th page viewed from site X.
ᵇ Assumes site-centric data is for site A.

Figure 4: User-centric Versus Site-centric Data

To explore the effects of using such incomplete data, Padmanabhan et al. (2001) recreated a site-centric dataset from six months of user-centric data. A linear regression, logistic regression, classification tree, and neural network were created for each class of problem (i.e., purchase in the current session or future purchase). The results of each model were compared against the site- and user-centric datasets. All the models using user-centric data had significantly higher lifts compared to the models built using site-centric data (Padmanabhan et al., 2001). In addition, some metrics found to be significant in site-centric models were insignificant in the user-centric models, thus leading to the possibility that erroneous conclusions may be reached from relying solely on site-centric data (Padmanabhan et al., 2001). Lastly, some highly significant metrics were only available in the user-centric dataset (Padmanabhan et al., 2001) highlighting the importance of using a complete picture of a visitor's browsing behavior.
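The relationship between the two kinds of data in figure 4 can be sketched as a simple projection: a site-centric view keeps only the page views that belong to the site of interest. The example uses the two visitors from figure 4. The `to_site_centric` helper and the tuple encoding are assumptions for illustration and do not reproduce the procedure Padmanabhan et al. (2001) used to recreate their site-centric dataset.

```python
from typing import List, Tuple

PageView = Tuple[str, int]  # (site X, y) meaning the y-th page viewed from site X


def to_site_centric(user_session: List[PageView], site: str) -> List[PageView]:
    """Keep only the page views a single site would observe."""
    return [view for view in user_session if view[0] == site]


# User-centric sessions for the two visitors in figure 4.
visitor_1 = [("A", 1), ("A", 2), ("B", 1), ("A", 3), ("C", 1), ("C", 2)]
visitor_2 = [("A", 1), ("A", 2), ("A", 3)]

site_view_1 = to_site_centric(visitor_1, "A")
site_view_2 = to_site_centric(visitor_2, "A")

# Information that is lost in the site-centric view of visitor 1.
other_sites_1 = {site for site, _ in visitor_1} - {"A"}

print(site_view_1 == site_view_2)  # the two visitors look identical to site A
print(sorted(other_sites_1))       # sites visitor 1 also browsed
```

From site A's perspective the two visitors are indistinguishable, even though only visitor 1 browsed other sites during the session.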


Instead of attempting to predict the probability of purchasing as a discrete purchase or not purchase outcome, Sismeiro and Bucklin (2004) viewed the purchasing process as a series of tasks to be completed. Each task (e.g., find a product, add the product to the shopping cart, checkout) is sequential in nature and requires prior tasks to already have been completed. Therefore, the product of a chain of conditional probabilities can be calculated for a visitor after each task has been completed (Sismeiro and Bucklin, 2004). For example, the probability can be calculated for a visitor adding a product to their shopping cart given the visitor has already found the product. In addition to the task-completion aspect of their model, Sismeiro and Bucklin (2004) also allowed for heterogeneity across visitors at the geographical county level.

In order to evaluate the multi-stage binary choice model, Sismeiro and Bucklin (2004) gathered 70 days of clickstream data from a major commercial Web site in the automotive industry. Three sequential tasks were defined as critical junctures leading up to the purchase of an automobile: completing the configuration of an automobile; inputting personal information; and completing an order. To determine the effectiveness of the task-completion approach, two single-task hierarchical probit models, one with dummy variables representing the completion of the first two tasks and one without, were compared against the multi-stage binary choice model. The multi-stage model outperformed both single-task models in hit rate and mean square error (MSE) for predicting vehicle orders (Sismeiro and Bucklin, 2004). In addition, the multi-stage model demonstrated that some metrics' effect signs differ depending on the task and some metrics are valuable for predicting the completion of some tasks but not for others (Sismeiro and Bucklin, 2004). One stated limitation of Sismeiro and Bucklin (2004) was the requirement that each task must be performed in the order specified. The model cannot consider alternate routes a visitor takes at a site that may also lead to a purchase⁶.

⁶ Sismeiro and Bucklin (2004) cite Amazon.com's “One-Click” checkout service as an example of an alternate route.

Recognizing that visitors may have distinct purchasing patterns, Moe and Fader (2004b) investigated how purchasing probabilities can be improved by taking into account visitor heterogeneity and visit history. Specifically, they created a conversion model which contained six components. The first component was a baseline probability of purchasing for each visitor which was independent of the visitor's past history. The positive effect on purchasing (i.e., visit effects) was the second component and assumed that each visit increased, although by varying amounts, the likelihood of a future purchase. The third component, purchasing threshold, allowed for a negative effect on purchasing which may be caused by the risk aversion or reluctance of a visitor to purchase. A decreasing threshold would indicate less of a negative effect on a visitor's purchasing probability. The fourth component permitted heterogeneity across visitors by differing visit effects and purchasing thresholds for each visitor. The fifth component allowed for changes and evolutions in the visit effects and purchasing threshold over time. Lastly, the model included a component to remove shoppers who were considered “hard-core never-buyers” and had no intention of ever purchasing (Moe and Fader, 2004b).

Using an eight-month sample of panel data focusing on Amazon.com, visit effects were found to accumulate over time and increase purchasing probabilities (Moe and Fader, 2004b). However, the conversion model showed an overall decrease in purchasing probabilities over time. The third and fifth components showed repeat visits had less of an impact on purchasing over time and purchasing thresholds increased as visitors became more experienced (Moe and Fader, 2004b). These results mirror Moe and Fader (2004a), indicating the importance of exploring visitor-level data as the conversion model contradicted the aggregated dataset's demonstration of increasing purchasing probabilities. Lastly, the conversion model outperformed a logistic regression model, duration model, beta-binomial, and historical conversion rates by having the lowest relative error in predicting conversion rates (14.7% predicted versus 15.7% actual).

2.2.4 Goal Achievement

Chatterjee et al. (2003) analytically modeled a visitor's response to banner advertisement exposure on an ad-sponsored magazine Web site⁷. The proposed model allowed for heterogeneity of visitors and their sessions (both within and across) to the Web site. Three benchmark models, with varying levels of heterogeneity, were compared against the proposed model.
Using data from 1995, the ability of the model to predict the click-through of banner advertisements of 3,611 visitors for two sponsors was tested. The proposed model obtained a 41% hit rate compared to the three alternative models obtaining a 2.4%, 24.2%, and 33.3% hit rate, respectively (Chatterjee et al., 2003). As expected, it was found that “wearout” to banner exposure was a factor and thus the probability of clicking on a banner advertisement was higher when a visitor was first exposed to the advertisement. In addition, the “wearout” concept also extended over sessions such that infrequent visitors were more likely to click on a banner advertisement than frequent visitors (Chatterjee et al., 2003).

[7] A notable point about this type of goal achievement is it can occur multiple times within the same session (i.e., a visitor can click multiple banner advertisements during a session). Although multiple purchases or other goals may be achieved within the same session, it is unlikely to occur with much frequency.

2.3 Datasets

Table 2 provides information about the dataset used in each study. As clickstream research is typically data-driven, the results found in prior literature may be specific to particular datasets. Thus, having knowledge of the datasets used may be helpful in understanding differing results. Each dataset is broken down by type, sector the site or sites of interest belong to, year when the data was collected, duration of data collection, and size of the dataset.

2.3.1 Type

The type of dataset used can be categorized as either site-centric or user-centric, terms coined by Padmanabhan et al. (2001) to refer to the focus of a clickstream dataset. Site-centric data is focused on the site itself and is defined as “...clickstream data collected at a site augmented with user demographics and cookies to identify users” (Padmanabhan et al., 2001, pg. 154). Although site-centric data is advantageous in terms of being readily available to site-owners (although they might not have access to demographics) and including all traffic to a site, it can only provide information on a visitor's browsing behavior on that site. A visitor's entire browsing session (i.e., user session), which may include browsing at other sites, cannot be obtained from site-centric data.

User-centric data overcomes the disadvantage of site-centric data by providing a visitor's entire session regardless of the different sites visited. Having complete information, user-centric data has proven to be more accurate than site-centric data when building models predicting purchasing probabilities (Padmanabhan et al., 2001). User-centric data is defined as “...site-centric data plus data on where else the user went in the current session” (Padmanabhan et al., 2001, pgs. 154–155).
User-centric data is typically obtained from randomly selected participants who are representative of the population at large. Unfortunately, some more recent user-centric datasets lack details of each page a user visits during their session. This limitation prevents paths and other such techniques from being studied using these datasets.


In addition to the “pure” site- and user-centric datasets, some studies construct site-centric datasets from user-centric data for a single site or set of sites. These constructed site-centric datasets typically view the site or sites of interest in isolation without regard to where a visitor may browse at other sites (e.g., Montgomery et al., 2004).

2.3.2 Sector

The sites examined in a study's dataset can be categorized into a general sector according to the main purpose of the site or type of products sold. Awareness of the sector may be desirable since browsing behavior has been found to vary by sector (Johnson et al., 2004). Therefore, the results obtained from a site in one sector may not be generalizable to sites in another sector. For example, Van den Poel and Buckinx (2005) looked at an e-commerce site selling wine, which is within the food sector. The results from such a sector may not generalize well to other sectors as wine is a perishable good which may require restocking as consumed. A site within the consumer electronics sector may have drastically different traffic patterns and factors that lead to purchase due to the non-perishable aspect of the products sold.

Sites were categorized into sectors according to the Open Directory Project (ODP). A search on a site's domain was done on ODP and the most relevant category returned was used as the sector. For studies which did not explicitly mention the sites used, the sector was located according to the general purpose or type of product sold. As sites may change drastically over time, the purpose or type of product sold at the dataset's time was used to categorize the site [8].

2.3.3 Time, Duration, and Size

As the Internet has seen an evolution of visitors' behavior to sites (Zhang et al., 2006), the year or years in which a dataset was collected can have a direct impact on the results obtained (cf. Johnson et al., 2004; Zhang et al., 2006). The duration of data collection can also have a profound effect on the results.
For instance, collecting data for one month in a cyclical industry may result in differing conclusions when compared to data collected in the same industry over a longer period of time. Lastly, the size of the dataset is provided. For site-centric data the number of monthly visitors can be considered an accurate, albeit conservative, estimate of actual visitors [9]. As user-centric data represent only a sub-sample of all visitors to a site, such datasets cannot accurately represent the size of a site. However, many studies using user-centric datasets focus on large well-known sites such as Amazon.com (Moe and Fader, 2004b), BarnesAndNoble.com (Montgomery et al., 2004), and CDNow.com (Moe and Fader, 2004a) in order to obtain adequately sized samples and generalizability.

[8] Amazon.com circa 1998 sold only books (Moe and Fader, 2004b) and thus would be assigned to the Book sector. Today, however, Amazon.com sells many different categories of products ranging from consumer electronics to bedding and thus would be assigned to the General sector.

[9] As part of the preprocessing of clickstream data, sessions consisting of only a single page are typically discarded (Bucklin and Sismeiro, 2003). Therefore, site-centric data measures of monthly visitors are likely a conservative estimate of actual unique visitors to a site.


Table 2: Prior Literature: Datasets

Article | Dataset Type [a] | Sector | Year | Duration | Monthly Visitors [b]
--------|------------------|--------|------|----------|---------------------
MULTIPLE OBJECTIVES
Kalczynski et al. (2006) [f] | site-centric | construction, financial, government, insurance, & travel | N/A | N/A | 97
Moe (2003) | site-centric | nutritional supplements | 2000 | 7 weeks | 3,508
BROWSING
Bucklin and Sismeiro (2003) | site-centric | autos | 1999 | 1 month | 164,429
Danaher et al. (2006) | user-centric | all | 2000 | 1 month | 1,665 [d]
Johnson et al. (2004) | user-centric | books, music, & travel | 1997-1998 | 1 year | 893 [d]
Moe and Fader (2004a) | site-centric | books & music | 1998 | 8 months | 741 [d]
Park and Fader (2004) | user-centric | books & music | 1997-1998 | 8 months | 1,039 [d]
Zhang et al. (2006) | user-centric | music, computer hardware, & travel | 2002 | 6 months | 2,277 [d]
PURCHASING
Moe and Fader (2004b) | site-centric [c] | books | 1998 | 8 months | 536
Montgomery et al. (2004) | site-centric [c] | general | 2002 | 1 month | 1,160
Padmanabhan et al. (2001) | both [c] | multiple | N/A | 6 months | 3,297
Sismeiro and Bucklin (2004) | site-centric | autos | 2000-2001 | 70 days | 37,597
Van den Poel and Buckinx (2005) | site-centric | food | 2001-2002 | 11 months | 126
GOAL ACHIEVEMENT
Chatterjee et al. (2003) | site-centric | magazine | 1995 | 7 months | 479 [e]

[a] A dataset is considered site-centric if only information about the particular site or sites under study were examined independently of any other sites the visitor may have visited. Thus, while the data may be user-centric in nature (i.e., panel data) the researchers have taken a site-centric approach by disregarding browsing behavior at other sites.

[b] Monthly visitors were calculated as shown in equation 2.1. Traffic patterns were assumed to be constant throughout a dataset's duration. If the specific dates of the dataset were not provided, approximations were made.

$$\text{monthly visitors} = \frac{\text{total number of visitors}}{\text{duration of dataset in days}} \times 30 \tag{2.1}$$

[c] Indicates a dataset constructed from user-centric data. These datasets only reflect a subset of all monthly visitors to a site.

[d] Indicates total monthly visitors across all sites analyzed in the dataset.

[e] Indicates a subset of all monthly visitors to a site that met specified criteria.

[f] Captured clickstream data from experimental subjects performing a prespecified task.


2.4 Metrics

Table 3 provides information about the metrics used in each study. Categorized by the focus of the research, the metrics used are described in terms of their level of analysis and general type.

2.4.1 Analysis Level

The metrics used for clickstream research can be defined at four basic levels of analysis: session, site, sector, and user. At the session-level, which is the most detailed, each session of a visitor is typically treated as an independent data point. A session-level metric is based only on information within the same session (e.g., number of clicks in a session, time spent during a session). When all sessions at a particular site for a visitor are aggregated together, they represent the site level of analysis. At the site level of analysis each visitor is considered a data point, regardless of the number of sessions at a site. Metrics at the site-level can provide a historical perspective of a visitor's browsing history at that site (e.g., conversion rate for the visitor).

The sector-level performs another level of aggregation for all sites visited within the same sector. As more than one site is available, the sector-level can provide metrics that compare a visitor's browsing and purchasing habits across various sites within the sector (e.g., percentage of visits to site A). Finally, the user-level aggregates every sector, which includes all sites and all sessions, of a visitor's browsing behavior together. Similar to sector-level metrics, user-level metrics are also able to compare browsing and purchasing habits, albeit at a higher level of analysis (i.e., the sector) [10].

Figure 5 is an example of the different levels of analysis possible for a single user. The user $U_1$ depicted in figure 5 had ten sessions ($S_{1\text{-}10}$) at four sites ($I_{1\text{-}4}$) within three sectors ($C_{1\text{-}3}$). Taking a site-centric approach, either session-level or site-level metrics could be used. For instance, a site-centric approach at site $I_1$ could examine each of the user's sessions ($S_{1\text{-}3,5}$) as individual data points; all sessions aggregated to the site-level ($I_1$) as a single data point; or a combination of both where the last session ($S_5$) is used at the session-level and all previous sessions ($S_{1\text{-}3}$) are aggregated at the site-level for historical metrics.
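The levels of analysis can be made concrete with a small Python sketch. The session-to-site and site-to-sector assignments follow figure 5; the per-session page counts, and the names `SESSION_SITE`, `SITE_SECTOR`, `PAGES`, `aggregate`, and `history_and_last`, are hypothetical and exist only to give the aggregation something to sum.

```python
from collections import defaultdict

# Session -> site and site -> sector assignments, as in figure 5.
SESSION_SITE = {1: "I1", 2: "I1", 3: "I1", 5: "I1",
                4: "I2", 9: "I2", 6: "I3",
                7: "I4", 8: "I4", 10: "I4"}
SITE_SECTOR = {"I1": "C1", "I2": "C1", "I3": "C2", "I4": "C3"}

# Hypothetical session-level metric: pages viewed in each session.
PAGES = {s: s + 2 for s in range(1, 11)}


def aggregate(pages):
    """Roll a session-level metric up to the site, sector, and user levels."""
    site, sector = defaultdict(int), defaultdict(int)
    for session, value in pages.items():
        site_id = SESSION_SITE[session]
        site[site_id] += value
        sector[SITE_SECTOR[site_id]] += value
    return dict(site), dict(sector), sum(pages.values())


def history_and_last(pages, site_id):
    """Combined approach: last session kept at the session level,
    all earlier sessions at the same site aggregated as history."""
    sessions = sorted(s for s, i in SESSION_SITE.items() if i == site_id)
    *earlier, last = sessions
    return sum(pages[s] for s in earlier), pages[last]


site_totals, sector_totals, user_total = aggregate(PAGES)
print(site_totals, sector_totals, user_total)
print(history_and_last(PAGES, "I1"))
```

A site-centric dataset for site I1 would only ever see the `"I1"` entries; the sector and user totals require the user-centric view.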
Taking a user-centric approach, not only can session- and site-level metrics be used, but also sector- and user-level metrics. For instance, continuing the example of user $U_1$ from figure 5, at sector $C_1$ all site-level data can be aggregated as a single data point for the user. In addition, the site- and session-level metrics would also be available for both sites ($I_{1\text{-}2}$) and the corresponding sessions ($S_{1\text{-}5,9}$). At the most general level, user metrics aggregated from all sectors ($C_{1\text{-}3}$) for the user would be represented as a single data point. Furthermore, all other level-of-analysis metrics would be available for all the sectors, sites, and sessions of the user.

[10] Catledge and Pitkow (1995) also noted varying levels of analysis. However, they only considered the session and user level of analysis.

User (U):    $U_1 = \{C_{1\text{-}3}\}$
Sector (C):  $C_1 = \{I_{1\text{-}2}\}$, $C_2 = \{I_3\}$, $C_3 = \{I_4\}$
Site (I):    $I_1 = \{S_{1\text{-}3,5}\}$, $I_2 = \{S_{4,9}\}$, $I_3 = \{S_6\}$, $I_4 = \{S_{7\text{-}8,10}\}$
Session (S): $S_1, S_2, S_3, S_5$; $S_4, S_9$; $S_6$; $S_7, S_8, S_{10}$

Figure 5: Example Metric Level of Analysis

Each level of analysis can be further broken down into general and detailed metrics. General metrics refer to any metrics at a particular level of analysis that include all relevant behavior, without regard to the underlying content of the site. Detailed metrics, on the other hand, break metrics down by the type of content being viewed (Van den Poel and Buckinx, 2005). For instance, a general session-level metric would be the number of pages viewed in a session whereas a detailed session-level metric would be the number of product pages viewed in a session. Aggregating all the detailed metrics for a level of analysis would result in the general metrics for the same level of analysis.

2.4.2 Metric Categories

Direct marketers have used Recency, Frequency, and Monetary (RFM) metrics to segment customers for decades (Shaver, 1996; Stone and Jacobs, 2001) and summarize their prior behavior (Fader et al., 2005). Recency refers to the amount of time elapsed since a particular action or behavior has been observed, frequency is concerned with the number of times the same action or behavior is made, and monetary deals with the amount of money spent on current or past purchases. For example, the amount of time since a visitor last visited a Web site, the number of pages viewed during a session, and the total amount spent on a previous purchase would represent a recency, frequency, and monetary metric, respectively.
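The three RFM examples given above can be sketched as follows. This is a minimal illustration, not a metric definition drawn from any of the reviewed studies: the `Session` record, its field names, and the `rfm` function are hypothetical, and the sketch assumes each session carries a start time, a page count, and an optional purchase amount.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Session:
    start: datetime
    pages_viewed: int
    purchase_amount: float = 0.0  # 0.0 means no purchase in this session


def rfm(history: List[Session],
        current: Session) -> Tuple[Optional[timedelta], int, float]:
    """Return (recency, frequency, monetary) for the current session.

    recency   : time elapsed since the visitor's last prior visit
    frequency : number of pages viewed during the current session
    monetary  : amount spent on the most recent prior purchase
    """
    prior = sorted(history, key=lambda s: s.start)
    recency = current.start - prior[-1].start if prior else None
    frequency = current.pages_viewed
    purchases = [s.purchase_amount for s in prior if s.purchase_amount > 0]
    monetary = purchases[-1] if purchases else 0.0
    return recency, frequency, monetary


history = [
    Session(datetime(2008, 1, 1), pages_viewed=4, purchase_amount=25.0),
    Session(datetime(2008, 1, 10), pages_viewed=2),
]
current = Session(datetime(2008, 1, 15), pages_viewed=6)
print(rfm(history, current))
```

Server logs would normally supply the first two fields directly; the purchase amount is the field that, as noted in the text, is often missing from clickstream data.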


As the basic underlying goal of identifying valuable customers is common in both direct marketing and clickstream research, the classic RFM metrics can be a logical starting point for categorizing the metrics used in the clickstream literature. However, differences between the online and offline environment affect the data available and thus the types of metrics which can be used [11]. Within the RFM metrics both recency and frequency are well represented in clickstream research, but monetary metrics have not seen widespread usage since datasets do not explicitly contain pricing information of visitors' purchases [12].

Besides RFM metrics, user characteristics (i.e., demographics) have been used in both direct marketing and clickstream research with some regularity. Although not available in a user's clickstream, user-centric panel data typically obtain demographic data separately which are then mapped to the appropriate clickstream. Duration and timing are two measures more specific to clickstream research, due to the ease with which they can be obtained. Duration deals with the amount of time spent doing an action or behavior, whereas timing is the rate at which an action or behavior is done. The amount of time spent on a Web site and the visitation rate of a visitor to a site are examples of duration and timing, respectively. Lastly, the structure of the Internet and characteristics of Web sites and their pages lend themselves to a wide variety of other metrics which do not fit into the previously discussed metrics. The referring Web site, number of links on a page, and size of a page in bytes are all examples of the type of metrics which belong in the “other” category.

Table 3 classifies the metrics used in prior literature according to the categories to which they belong: demographics, recency, frequency, monetary, duration, timing, and other metrics.

[11] Take the example of determining the number of products viewed for an online store versus an offline catalog. Online, a simple count of the number of pages viewed with product information would provide the relevant information. Offline, attempting to gain information for such a metric would be prohibitively expensive since some type of observation would be needed for each viewer of the catalog.
[12] Monetary metrics can be obtained from secondary sources and have been examined in Van den Poel and Buckinx (2005), but they are not a natural byproduct found in server logs and other such sources that clickstream data are typically gathered from. Some user-centric datasets, however, do provide monetary values for products sold.


Table 3: Prior Literature: Metrics

Columns: Article | Analysis Level | Demographics, Recency, Frequency, Monetary, Duration, Timing, Other

MULTIPLE OBJECTIVES
Kalczynski et al. (2006) | session | Y
Moe (2003) | session | YYY
BROWSING
Bucklin and Sismeiro (2003) | session & site | YYY
Danaher et al. (2006) | session | YYYY
Johnson et al. (2004) | sector | Y Y
Moe and Fader (2004a) | session | YY
Park and Fader (2004) | sector | YY
Zhang et al. (2006) | site & sector | YYY
PURCHASING
Moe and Fader (2004b) | session | Y
Montgomery et al. (2004) | session | Y Y
Padmanabhan et al. (2001) | site & sector | YYYY
Sismeiro and Bucklin (2004) | session & site | YYYY
Van den Poel and Buckinx (2005) | session & site | YYY Y
GOAL ACHIEVEMENT
Chatterjee et al. (2003) | session & site | YYY


2.5 Conclusion

The preceding sections in this chapter organized and summarized research using clickstream data for prediction. All told, the majority of research has focused on the browsing (Johnson et al., 2004) and purchasing behavior (Padmanabhan et al., 2001) of Web users at e-commerce sites, with little attention being paid to alternative objectives (i.e., goal achievement) or contexts (e.g., informational Web sites). As there are many different types of Web sites (Jaillet, 2003), focusing on just e-commerce sites is done at the expense of understanding visitor behavior at other interesting and valuable non-e-commerce Web sites.

In terms of data, user-centric datasets are commonly used when examining browsing behavior, but with the exception of Padmanabhan et al. (2001) are non-existent for purchasing or goal achievement behaviors. The sectors examined for browsing behavior generally overlap (e.g., books and music), allowing comparisons of results. For purchasing and goal achievement, however, there is little overlap and some of the sectors analyzed differ substantially from one another (e.g., automobiles versus wine). According to Zhang et al. (2006), who found browsing differs by sector, such little overlap may make results difficult to compare over studies. Lastly, many of the datasets are fairly dated. Although beneficial from the standpoint of comparing and contrasting changes over time (cf. Johnson et al., 2004; Zhang et al., 2006), the vast changes in the Internet and Web users over the past few years may point toward a need for more recent and thus relevant datasets.
Although many studies used metrics that did not fit neatly into the categories of table 3, general patterns of the types of metrics used can still be seen. Overall, frequency appears to be the most commonly used type of metric as every single study except for Kalczynski et al. (2006) and Montgomery et al. (2004) included some aspect of counting in their models. Duration metrics were also commonly used for all types of research. Lastly, timing metrics were more heavily used in browsing while recency was more common in purchasing and goal achievement. Determining how well these types of metrics do for other objectives and contexts along with finding a common set of metrics can provide the basis for better understanding visitor behavior. Furthermore, looking outside these metric types into the “other” category [13] can also help provide explanation into the “whys” of visitor behavior.

[13] However, these “other” metrics should be readily available to all Web sites and not be an artifact of a particular site or how it is organized.


Chapter 3

Theory

Information Foraging Theory (IFT) “...aims to explain and predict how people will best shape themselves for their information environments and how information environments can best be shaped by people” (Pirolli, 2007, pg. 3). Simply stated, one aspect of IFT is its ability to explain the behavior of a person as they search for information within a pliable environment. Central to the theory of information foraging are the concepts of information scent and patches. Information scent is the driving force of why a person makes a navigational selection amongst a group of competing options. Information patches are distinct areas of the search environment which differ in their informational content. The synthesis of behavior (i.e., information scent) and environment (i.e., information patches) provides for a rich theory of information foraging.

IFT has a strong theoretical foundation by drawing upon Optimal Foraging Theory (OFT) (Stephens and Krebs, 1986) and the Adaptive Control of Thought-Rational Theory (ACT-R) (Anderson et al., 2004), two well-known theories within their respective fields. OFT is an ecological theory concerned with explaining the foraging behavior of animals as they hunt for food. ACT-R is a psychological theory of the human mind that includes the cognitive architecture and process by which cognition works. Within IFT, OFT is used to explain the behavioral elements of people foraging for information (i.e., why they go about searching), whereas ACT-R's purpose is to explain the mechanism of how the behavior is being driven at the cognitive level.

The remainder of this chapter is organized as follows. First, introductions to OFT and ACT-R are provided in §3.1 and §3.2 as background information for IFT. Then details regarding the two central concepts of IFT are presented in §3.3, followed by two versions of a model that test the theory.


3.1 Optimal Foraging Theory

The aim of optimal foraging theory (OFT) is to explain the feeding behaviors and adaptations of animals (Stephens and Krebs, 1986). OFT has been used to describe the commuting behavior of seabirds to distal feeding grounds (Nevitt, 2000); assess the nutritional ratios of ants' food (Kay, 2002); predict the feeding strategies of coyotes (MacCracken and Hansen, 1987); and test the group foraging behaviors of cranes (Alonso et al., 1995). In addition, OFT has also been applied to humans by explaining the hunting and gathering practices of the Ache of eastern Paraguay (Hawkes et al., 1982); variability in Amazonian Indians' diet selections (Hames and Vickers, 1982) [1]; decisions between ambiguous and unambiguous choices (Rode et al., 1999); and applicability of OFT in information seeking behaviors (Sandstrom, 1994).

The overarching assumption of OFT is that animals have developed beneficial foraging adaptations and behaviors that increase their net energy. A gain in net energy (above an animal's metabolic requirement) allows spare energy to be spent on vital non-feeding activities such as fighting, fleeing, and reproducing (Stephens and Krebs, 1986). As animals with higher levels of spare energy are more likely to survive and reproduce, successive generations are assumed to inherit those beneficial foraging adaptations and behaviors.

Following MacArthur and Pianka (1966), the general concepts used in OFT are those of predators, prey, and patches (Stephens and Krebs, 1986). Predators are the animals doing the foraging (i.e., hunting for food) and their behaviors are the focal point of OFT. Prey refers to any item of food that a predator may consume such as a rabbit, berry, or plant root. Each type of prey differs in its prevalence in the environment along with the amount of energy the predator expends and gains from chasing and eating the prey, respectively. A patch is some area of the environment which contains prey. Like prey, patches of different types demonstrate variability in terms of the net energy a predator gains from foraging within them.
Predators are assumed to forage for food according to a sequential search–encounter–decision process (Stephens and Krebs, 1986). While searching, an animal uses its sensory abilities to pick up on cues to help locate prey or patches. For example, seabirds use their sense of smell to (1) locate patches over thousands of kilometers from their nesting colony and (2) find prey within those patches (Nevitt, 2000). Without sensory guidance, the forager's probability of encountering prey or patches is effectively reduced to random chance. Searching stops once prey or a patch has been located (i.e., an encounter has occurred). At the point of encounter the forager makes a decision of how to proceed.

[1] A more complete review of OFT's use in anthropological research can be found in Smith (1983).

The two conventional models of OFT agree on the search–encounter–decision process, but fundamentally differ in what decision to make once an encounter takes place. The prey model (Charnov and Orians, 1973) asks the question “attack or continue searching?” (Stephens and Krebs, 1986, pg. 13) when encountering prey. In the patch model (Charnov, 1976) the forager asks the question “how long to stay in a patch?” (Stephens and Krebs, 1986, pg. 14) when a patch has been encountered.

As the overarching assumption of OFT is the increase of net energy, both models are concerned with maximizing the average rate of energy intake. Therefore, each model uses a variant of Holling's disc equation (equation 3.1) (Holling, 1959). The average rate of energy intake is represented as $R$ and is what both the prey and patch models are maximizing. Depending on the model used, $\lambda$ is either the rate of encounter of prey or a patch. $e$ is the average energy gained from each encounter, $s$ is the search cost per unit of time, and finally $h$ is the average handling time per encounter. Within each model the theory assumes predators have perfect information regarding the characteristics (e.g., $\lambda$, $e$, $h$) of their prey or patches (Stephens and Krebs, 1986). Even though animals do not possess perfect information, the theory has still found empirical support even when the assumption of perfect information has been violated (Kay, 2002).

$$R = \frac{\lambda e - s}{1 + \lambda h} \tag{3.1}$$

3.1.1 Prey Model

The prey model determines if an animal should attack and consume a particular type of prey or continue searching for other prey types (Charnov and Orians, 1973). The decision is made by maximizing the average rate of energy intake by prey type to find the optimal diet (Stephens and Krebs, 1986). Within the prey model there are $n$ different prey types encountered at random with $i$ representing the $i$-th prey type. Let $D$ represent the set of prey types such that $D = \{1, 2, \ldots, n\}$. Associated with each prey type are the following characteristics (notations follow Pirolli (2007)):


• $t_{Bi}$ = average time between locating prey of type $i$.
• $\lambda_i$ = rate of encounter of prey type $i$ when searching ($1/t_{Bi}$).
• $t_{Wi}$ = handling time associated with pursuing, capturing, and consuming prey type $i$.
• $g_i$ = net energy gain from consuming prey type $i$.
• $\pi_i$ = profitability of prey type $i$ ($g_i/t_{Wi}$).
• $p_i$ = probability of attacking prey type $i$ upon encounter [3].

The long-term average rate of energy intake for all prey types is determined from equation 3.2 (a variant of Holling's disc equation) (Stephens and Krebs, 1986).

$$R = \frac{\sum_{i \in D} p_i \lambda_i g_i}{1 + \sum_{i \in D} p_i \lambda_i t_{Wi}} \tag{3.2}$$

The inclusion of some prey types into a forager's diet, when compared to the alternatives, may never be worth the energy to attack and consume. Determining which prey types should be excluded from consideration is expressed via the probability of attacking a prey type ($p_i$). In order to maximize $R$ the prey model follows the zero-one rule, which states a prey type is either always attacked ($p_i = 1$) or always ignored ($p_i = 0$) (Stephens and Krebs, 1986). As expressed in equation 3.3, prey types are excluded when the profitability of prey type $i$ is less than the average rate of energy intake of all other prey types [4].

$$p_i = \begin{cases} 0 & \text{if } \pi_i < \dfrac{\sum_{j \in D \setminus \{i\}} \lambda_j g_j}{1 + \sum_{j \in D \setminus \{i\}} \lambda_j t_{Wj}} \\ 1 & \text{otherwise} \end{cases} \tag{3.3}$$

Once the probability of attacking each prey type has been determined, the decision then turns to selecting which prey types to include for an optimal diet. The two-step prey algorithm makes the selection based on the profitability of a prey type compared to the current average rate of energy ($R$) (Stephens and Krebs, 1986). The first step is to rank the $k$ remaining prey types in order of decreasing profitability such that $\pi_1 > \pi_2 > \ldots > \pi_k$. In the second step each prey type is added to the forager's diet until equation 3.4 is true. The last prey type added to the diet is the lowest ranking prey type in the diet. If the equation is never true then all prey types are included in the diet.

$$R(k) = \frac{\sum_{i=1}^{k} \lambda_i g_i}{1 + \sum_{i=1}^{k} \lambda_i t_{Wi}} > \pi_{k+1} \tag{3.4}$$

[3] Probability is the only characteristic of a prey type the forager has control over.

[4] The derivation of equation 3.2 to determine which prey types should or should not be attacked when encountered (i.e., equation 3.3) can be found in Stephens and Krebs (1986).

Figure 6 shows a simulated example of ten different prey types available to a predator. The figure illustrates the relationship between the profitability of a prey type ($\pi_k$) and the average rate of energy ($R(k)$). In this example, the average rate of energy is maximized when the five most profitable prey types are added to the forager's diet. This maximization is the point at which the current rate of energy ($R(5) = 0.9888$) is greater than the profitability of the next prey type ($\pi_6 = 0.9838$) (equation 3.4). As illustrated, adding additional prey types or removing any of the five selected prey types leads to a decrease in $R$ and thus a sub-optimal diet.

Figure 6: OFT: Simulated Optimal Diet (horizontal axis: Prey Type $k$; vertical axis: Energy Rate / Profitability; series: $\pi_k$ and $R(k)$)

Notable about equation 3.4 is that the inclusion of a prey type into a forager's diet is independent of its rate of encounter (Charnov and Orians, 1973). The decision to add a prey type is dependent, however, on the rate of encounter for those prey types ranked higher than the current prey type. For example, in a situation with two prey types the decision to include the second prey type is only dependent on (1) its profitability ($\pi_2$) and (2) the rate of encounter, net energy, and handling time of the first prey type ($\lambda_1$, $e_1$, $h_1$).


Foraging Example

To illustrate the prey model consider a hypothetical example of a brown bear foraging for food [5]. Brown bears are known for their diverse diets (Garshelis, 2007) and therefore the decision of which prey types to include in their diets is germane to the discussion of the prey model. Assume that four different prey types are present in the bear's environment. Each of the four prey types and their characteristics are listed in table 4.

Table 4: OFT: Example Prey Types for a Brown Bear

Prey Type | $t_B$ | $t_W$ | $g$ | $\pi$ | $p$ [a]
----------|-------|-------|-----|-------|--------
Deer | 3,600 sec | 2,580 sec | 3,200 kCal | 1.2403 kCal/sec | 1
Berries | 12 sec | 180 sec | 200 kCal | 1.1111 kCal/sec | 1
Squirrels | 6 sec | 600 sec | 610 kCal | 1.0167 kCal/sec | 1
Chipmunks | 6 sec | 720 sec | 500 kCal | 0.6944 kCal/sec | 0

[a] As calculated from equation 3.3.

Based on the characteristics of the other prey types, the profitability of chipmunks will never be high enough to warrant the bear eating them and therefore $p_{chipmunks}$ is set to 0 (equation 3.3). For the first step of the prey algorithm the remaining prey types are ranked in descending order according to their profitability, which yields $\pi_{deer} > \pi_{berries} > \pi_{squirrels}$. The results of the iterative second step of the prey algorithm can be seen in table 5. The $R$ column is the long-term average rate of energy for the included prey types (left-hand side of equation 3.4) and the $\pi$ column is the profitability of the next lowest ranking prey type (right-hand side of equation 3.4). The final column, Stop?, is set to yes if the last added prey type causes the inequality to be true (i.e., $R > \pi$) and set to no otherwise [6].

As seen in table 5 a diet consisting of deer and berries is optimal for the bear. Eating only deer or choosing to eat all three prey types would result in a sub-optimal rate of energy as illustrated by the lower values of $R$.

[5] The foraging example was adapted from Pirolli (2007).

[6] Although the algorithm would stop after deer and berries are included in the diet, the calculation for including squirrels into the diet was done for illustrative purposes.
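The brown bear example can be traced with a short Python sketch of equations 3.3 and 3.4. The prey characteristics are the values listed in table 4; the names `PREY`, `rate`, `profitability`, and `optimal_diet` are illustrative choices and do not come from the cited sources.

```python
# Prey type -> (t_B in sec, t_W in sec, g in kCal), as listed in table 4.
PREY = {
    "deer": (3600.0, 2580.0, 3200.0),
    "berries": (12.0, 180.0, 200.0),
    "squirrels": (6.0, 600.0, 610.0),
    "chipmunks": (6.0, 720.0, 500.0),
}


def rate(diet, prey):
    """Average rate of energy intake for a diet (left side of equation 3.4).

    The encounter rate of each type is lambda = 1 / t_B.
    """
    gain = sum(prey[name][2] / prey[name][0] for name in diet)
    handling = sum(prey[name][1] / prey[name][0] for name in diet)
    return gain / (1.0 + handling)


def profitability(name, prey):
    """pi = g / t_W."""
    return prey[name][2] / prey[name][1]


def optimal_diet(prey):
    # Equation 3.3: drop any type whose profitability is below the
    # average rate obtained from all the other types.
    kept = [n for n in prey
            if profitability(n, prey) >= rate([m for m in prey if m != n], prey)]
    # Step 1: rank the remaining types by decreasing profitability.
    ranked = sorted(kept, key=lambda n: profitability(n, prey), reverse=True)
    # Step 2: add types until R(k) exceeds the profitability of the next type.
    diet = []
    for k, name in enumerate(ranked):
        diet.append(name)
        if k + 1 < len(ranked) and rate(diet, prey) > profitability(ranked[k + 1], prey):
            break
    return diet


print(optimal_diet(PREY))
print(rate(["deer"], PREY), rate(["deer", "berries"], PREY))
```

Under these inputs the sketch excludes chipmunks in the first filter and stops after adding berries, matching the walk-through above.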


Table 5: OFT: Example Diet for a Brown Bear

Included Prey Types | $R(k)$ | $\pi_{k+1}$ | Stop?
--------------------|--------|-------------|------
Deer | 0.5178 kCal/sec | 1.1111 kCal/sec | No
Deer & Berries | 1.0502 kCal/sec | 1.0167 kCal/sec | Yes
Deer, Berries, & Squirrels | 1.0215 kCal/sec | n/a | n/a

3.1.2 Patch Model

The patch model determines the optimal duration an animal should forage within any number of patch types (Charnov, 1976). The decision of how long to spend in a patch of a particular type is determined by maximizing the average rate of energy intake (similar to the prey model) (Stephens and Krebs, 1986). Within the patch model there are $n$ different patch types with $i$ representing the $i$-th patch type. Let $P$ represent the set of patch types such that $P = \{1, 2, \ldots, n\}$. Associated with each patch type are the following characteristics [7]:

• $t_{Bi}$ = average time between locating patches of type $i$.
• $\lambda_i$ = rate of encounter of patch type $i$ when searching ($1/t_{Bi}$).
• $t_{Wi}$ = the amount of time spent searching within patch type $i$ (i.e., patch residence time) [8].
• $g_i(t_{Wi})$ = net energy gain from patch type $i$ when $t_{Wi}$ time units are spent foraging within the patch (i.e., gain function).

The long-term average rate of energy intake for all patch types is determined from equation 3.5 (a variant of Holling's disc equation) (Stephens and Krebs, 1986).

$$R = \frac{\sum_{i \in P} \lambda_i g_i(t_{Wi})}{1 + \sum_{i \in P} \lambda_i t_{Wi}} \tag{3.5}$$

To better illustrate patches and their characteristics consider a hypothetical example of a seabird foraging for food in an environment with multiple patches of a single patch type. Figure 7 details the environment where the solid line represents the seabird's path as it forages. The squares are patches and within the patches are sources of food shown as fish. The horizontal axis represents time, where the time spent within a patch is $t_W$ and the time spent between patches is $t_B$. As seen in figure 7 the seabird only eats some of the available fish in each of the patches. Since there is a finite amount of food, each patch demonstrates diminishing returns of energy as a function of time. Due to this diminishing return it would have been suboptimal for the seabird to remain in a patch until total depletion.

Figure 7: OFT: Patchy Environment (adapted from Pirolli, 2007, pg. 32)

The gain function associated with a patch type, $g_i(t_{Wi})$, determines the amount of energy gained per unit of time spent foraging within a patch. Each gain function is “...assumed to be a well-defined, continuous, deterministic, and negatively accelerated (curving down) function” (Stephens and Krebs, 1986, pg. 25) [9]. Figure 8 illustrates a gain function with the horizontal axis representing time spent foraging in the patch and the vertical axis representing net energy gain. Such a gain function helps explain the behavior of the seabird. Initially the seabird realized a rapid energy gain as there were many fish within the patch. However, as fewer fish were available less energy was gained per unit of time. Thus at some point it was more worthwhile for the seabird to travel to another patch rather than remain in the current patch.

Figure 8: OFT: Example Patch Gain Function (horizontal axis: Time; vertical axis: Energy Gain; curve: $g(t_W)$)

[7] Notations follow Pirolli (2007).

[8] Patch residence time is the only characteristic of a patch type the forager has control over.

[9] Stephens and Krebs (1986) acknowledged some gain functions may not exhibit an eventual negative acceleration. When patches are searched systematically their gain functions may exhibit a depletion of resources without any depression.


The determination of how much time to spend in each patch type is made on the basis of the Marginal Value Theorem (Charnov, 1976). For patch types exhibiting a negatively accelerated gain function (as shown in figure 8), the theorem is capable of determining the optimal allocation of time across any number of patch types so as to maximize the average rate of energy within the environment. To obtain optimality the theorem states the predator should continue to forage within a patch type until the marginal rate (i.e., slope of the gain function) equals the average rate of energy gain (equation 3.5). Following such a requirement means by definition the marginal rate of each patch type will be equal to the average rate. Equation 3.6 states the equality condition where $g'(\hat{t}_{Wi})$ is the marginal rate of patch type $i$ and $R(\hat{t}_{W1}, \hat{t}_{W2}, \ldots, \hat{t}_{Wn})$ is the average rate of energy calculated from the optimal vector of times for each patch type.

$$\begin{aligned} g'(\hat{t}_{W1}) &= R(\hat{t}_{W1}, \hat{t}_{W2}, \ldots, \hat{t}_{Wn}) \\ g'(\hat{t}_{W2}) &= R(\hat{t}_{W1}, \hat{t}_{W2}, \ldots, \hat{t}_{Wn}) \\ &\;\;\vdots \\ g'(\hat{t}_{Wn}) &= R(\hat{t}_{W1}, \hat{t}_{W2}, \ldots, \hat{t}_{Wn}) \end{aligned} \tag{3.6}$$

In situations where only a single patch type exists, the average rate of energy intake can be simplified as shown in equation 3.7.

$$R(t_W) = \frac{\lambda g(t_W)}{1 + \lambda t_W} \tag{3.7}$$

The reduction to only a single patch type also simplifies the marginal value theorem as shown in equation 3.8.

$$g'(\hat{t}_{W1}) = R(\hat{t}_{W1}) \tag{3.8}$$

Examples of the patch model being used to find the optimal foraging time when (1) only a single patch type exists and (2) when multiple patch types exist are presented next [10].

[10] The patch examples were adapted from Charnov (1976); Stephens and Krebs (1986); and Pirolli (2007).


Single Patch Type Example

Consider an example of a brown bear foraging for berries over a three-year period. Within the bear's environment there exist multiple patches of a single type representing berry bushes. With each season the characteristics of the patch type change. Table 6 details the characteristics of the patch type for each of the three years.

Table 6: OFT: Example Single Patch Type for a Brown Bear

Year   $t_B$    $g(t_W)$                   $R$                $\hat{t}_W$ [a]
Y1     10 sec   $-0.8 t_W^2 + 6.5 t_W$     0.9593 kCal/sec    3.4629 sec
Y2     5 sec    $-0.8 t_W^2 + 6.5 t_W$     1.5385 kCal/sec    3.1009 sec
Y3     5 sec    $-2.5 t_W^2 + 17.5 t_W$    3.7702 kCal/sec    2.7460 sec

[a] As calculated from equation 3.8.

In the first year the optimal time to spend in a patch was 3.4629 sec. Illustrated graphically in figure 9, the optimal point is where the dashed line with its origin at $t_B$ lies tangential to the gain function.

Figure 9. OFT: Example Year One Patch (horizontal axis: time, divided into between-patch and within-patch time; vertical axis: energy gain; shows $g_{y1}(t_W)$, $R_{y1}$, and $\hat{t}_{W,y1}$)

In the second year, an area of the environment previously destroyed by wildfire bloomed with berry bushes. This represents an increase in the number of patches available to the bear and therefore a decrease in the time between patches ($t_B$). Figure 10 shows graphically how the decrease in time between patches leads to a reduction in the time spent within a patch. Although less time is spent per patch, the average rate of energy gain ($R_{y2}$) is higher during the second year. With lower moving costs, the bear is better served to move to another patch once the marginal rate within the current patch drops to $R_{y2}$.
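The Table 6 values can be checked numerically. The following is a minimal sketch, not part of the original analysis, that assumes each gain function is the quadratic $g(t) = -a t^2 + b t$ and that the encounter rate is $\lambda = 1/t_B$. The function name `optimal_patch_times`, the `(t_between, a, b)` tuple layout, and the use of bisection are illustrative choices. The routine searches for the rate $R$ at which every patch type's marginal rate equals the average rate (equations 3.6 and 3.8).

```python
def optimal_patch_times(patches, tol=1e-12):
    """Solve the marginal value theorem for quadratic gain functions.

    patches: list of (t_between, a, b), with gain g(t) = -a*t**2 + b*t.
    Returns (R, [t_1, ..., t_n]) such that g_i'(t_i) = R for each patch type.
    """
    def times_for(rate):
        # Invert g'(t) = b - 2*a*t = rate; a patch type is skipped (t = 0) if b <= rate.
        return [max(0.0, (b - rate) / (2.0 * a)) for _, a, b in patches]

    def average_rate(times):
        # R = sum(lam_i * g_i(t_i)) / (1 + sum(lam_i * t_i)), assuming lam_i = 1 / t_between_i
        gain = sum((-a * t * t + b * t) / tb for (tb, a, b), t in zip(patches, times))
        time = 1.0 + sum(t / tb for (tb, _, _), t in zip(patches, times))
        return gain / time

    lo, hi = 0.0, max(b for _, _, b in patches)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # If the achievable average rate exceeds the trial marginal rate, raise the trial rate.
        if average_rate(times_for(mid)) > mid:
            lo = mid
        else:
            hi = mid
    rate = (lo + hi) / 2.0
    return rate, times_for(rate)


# The three years of Table 6, each treated as a single patch type.
for year, patch in [("Y1", (10.0, 0.8, 6.5)), ("Y2", (5.0, 0.8, 6.5)), ("Y3", (5.0, 2.5, 17.5))]:
    rate, (t_within,) = optimal_patch_times([patch])
    print(f"{year}: R = {rate:.4f} kCal/sec, t_W = {t_within:.4f} sec")
```

Under these assumptions the printed rates and within-patch times should agree with Table 6 to four decimal places. Passing several tuples at once solves the multiple patch type condition of equation 3.6 with the same routine, for instance the two patch types listed in Table 7.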


Figure 10. OFT: Example Year Two Patch (horizontal axis: time, divided into between-patch and within-patch time; vertical axis: energy gain; shows $g_{y1}(t_W)$, $R_{y1}$, $\hat{t}_{W,y1}$, $R_{y2}$, and $\hat{t}_{W,y2}$)

A bountiful rain during the third year increased the density of berry bushes within each patch. A greater density of bushes represents a more valuable patch. Therefore, the gain function for the patch type is changed to reflect greater energy gains per unit of time spent in a patch. Figure 11 illustrates the difference when the gain function of a patch type changes. In this situation the gain function reflected an increase in energy gain and therefore the amount of time spent foraging within a patch is reduced. As a result of the new gain function the average rate of energy gain ($R_{y3}$) is higher than the previous year.

Figure 11. OFT: Example Year Three Patch (horizontal axis: time, divided into between-patch and within-patch time; vertical axis: energy gain; shows $g_{y2}(t_W)$, $R_{y2}$, $\hat{t}_{W,y2}$, $g_{y3}(t_W)$, $R_{y3}$, and $\hat{t}_{W,y3}$)

Multiple Patch Types Example

In the previous example the brown bear's environment only consisted of a single patch type. However, as brown bears forage over large territories spanning thousands of square miles (Garshelis, 2007), it is likely more than one patch type exists in their environment (e.g., forests, rivers). In this example there are two patch types available to the brown bear. Table 7 details the characteristics of each of the patch types. Notably, the patch types differ in their time between patches and


gain function.

Table 7: OFT: Example Multiple Patch Types for a Brown Bear

Patch Type   $t_B$    $g(t_W)$                   $R$                $\hat{t}_W$ [a]
1            10 sec   $-0.8 t_W^2 + 6.5 t_W$     3.9061 kCal/sec    1.6212 sec
2            5 sec    $-2.5 t_W^2 + 17.5 t_W$    3.9061 kCal/sec    2.7188 sec

[a] As calculated from equation 3.6.

When in the specified environment, the bear will spend 1.6212 seconds in the first patch type and 2.7188 seconds in the second type. Figure 12 graphically illustrates the optimal time to spend in each patch type and also the average rate of energy gain. As the marginal rate for each patch type is the same as the average rate, the tangential lines all have the same slope and are thus parallel to one another.

Figure 12. OFT: Example Optimal Multi-Patch Time (horizontal axis: within-patch time; vertical axis: energy gain; shows $g_{p1}(t_W)$, $g_{p2}(t_W)$, $\hat{t}_{W,p1}$, $\hat{t}_{W,p2}$, and $R$)

3.2 Adaptive Control of Thought-Rational Theory

The ACT-R theory aims to explain human cognition by (1) describing an architecture of the human mind and (2) the process by which cognition occurs within the stated architecture (Anderson et al., 2004). The theoretical foundation for ACT-R is rational analysis which assumes “...each component of the cognitive system is optimized with respect to demands from the environment, given its computational limitations” (Taatgen and Anderson, 2002, pg. 130). The theory and architecture of ACT-R have been used in research areas such as perception and attention (Byrne, 2001); learning and memory (Fu et al., 2006); problem solving and decision making (Gray et al., 2005);


language processing (Anderson et al., 2001); and other domains relevant to this dissertation, such as information search (Pirolli and Card, 1999).

Figure 13 illustrates the basic architecture of ACT-R 5.0 (Anderson et al., 2004) which consists of modules, buffers, and a central production system. Each module within ACT-R is independent of the others and is responsible for a particular task [11]. The visual and motor modules are part of the perceptual-motor system that interacts with the external environment [12]. The visual module controls vision, attending to and identifying objects in the visual space. The manual module directs the hands to perform actions (e.g., picking up an object, clicking a mouse button). The intentional module keeps track of a stack of goals, intentions, and the current state of the problem at hand [13]. Finally, the declarative module interacts with declarative memory (i.e., what is known).

Figure 13. ACT-R: 5.0 Architecture (Anderson et al., 2004, pg. 1037)

[11] The ACT-R theory does not state the modules listed are the only valid modules used in human cognition (Anderson et al., 2004). Rather, these modules are at the core of the system developed thus far.

[12] Although ACT-R is a foundational theory for IFT, only the higher cognition portion of the theory is used. The perceptual-motor system is not detailed in IFT and thus that portion of ACT-R is only briefly described here. For more information about the perceptual-motor system the reader can refer to Anderson et al. (2004).

[13] A stack is simply a Last In First Out (LIFO) data structure. Whenever a new item is added to the stack it is “pushed” onto the top of the stack. When retrieving an item from a stack the topmost item of the stack is “popped” off the top of the stack.


3.2.1 Central Production System

The central production system (CPS) is responsible for coordinating the activities of all the independent modules (Anderson et al., 2004). The CPS does not directly interact with each module; rather, buffers (i.e., working memory) act as intermediaries providing an area for information exchange [14]. Each buffer, however, is limited in capacity to a single chunk of information (i.e., unit of knowledge) (Miller, 1956) at a single time period (Anderson et al., 2004). Such a limitation reflects humans' limited working memory capacity. For example, only memories being focused on in long-term memory are available at any one time as opposed to having all memories available at all times.

The CPS represents procedural memory (i.e., how to do things) in the form of productions which consist of a set of rules known as production rules. Each production rule consists of a condition or set of conditions and then some action to perform when the condition or conditions are true (i.e., when the conditions match the current state). Figure 14 illustrates six production rules (PR1-PR6) in the form IF condition THEN action. The CPS uses productions and information from the buffers in order to (1) find rules which match the current state, (2) select the most beneficial rule, and finally (3) execute a rule which results in some action (Anderson et al., 2004).
(a) PR1: IF goal is Write-answer & answer unknown THEN set and push subgoal Find-solution to the goal stack
(b) PR2: IF goal is Write-answer & answer unknown THEN quit and pop the goal from the goal stack
(c) PR3: IF goal is Write-answer & answer known THEN write answer and pop the goal from the goal stack
(d) PR4: IF goal is Find-solution & answer unknown & operation is addition & N1 known & N2 known THEN set and push subgoal Add-numbers to the goal stack
(e) PR5: IF goal is Find-solution & answer known THEN pop the goal from the goal stack
(f) PR6: IF goal is Add-numbers & N1 known & N2 known THEN retrieve answer and pop the goal from the goal stack

Figure 14. ACT-R: Example Production Rules, adapted from Anderson et al. (2001, pg. 338)

[14] Both the modules and the CPS can read from and write to the corresponding buffer.

A pattern matching mechanism within the CPS determines if the contents of any of the buffers match the condition of any of the rules. If a match exists the production rule is selected and then


fired (i.e., executed). In situations with multiple matching production rules, conflict resolution is undertaken where a conflict set is formed and the rule with the highest probability (based on utility) is selected and executed. The utility of a rule is determined from past experiences of the production rule within the context of the current goal (stored in the goal buffer). The utility of production $i$ is calculated from equation 3.9 [15] as $U_i$ (Anderson et al., 2001). $P_i$ is the probability of achieving the goal using production $i$ based on past performance. $G$ is the expected gain from successfully completing the goal independent of the production used. $C_i$ is the average time previous attempts using production $i$ took (i.e., cost) to complete the goal. Finally, $\varepsilon$ represents random noise.

\[
U_i = P_i G - C_i + \varepsilon \tag{3.9}
\]

The actual selection of a production is dependent on its probability as expressed in equation 3.10. $U$ represents the utility of a production and $t$ controls the noise in the utilities (Fu et al., 2006). The actual selection of a production is thus probabilistically based rather than determined by absolute utility values.

\[
P_i = \frac{e^{U_i/t}}{\sum_{j}^{n} e^{U_j/t}} \tag{3.10}
\]

Consider an example of a student attempting to solve the following equation on a math test [16]: $4+1$. Using the production rules in figure 14, the cognitive process of the student (broken down at each step in the process) is illustrated in figure 15. The top portion of the figure lists the goals in the goal stack, with the topmost goal signifying the goal in the goal buffer (i.e., the goal currently being attended to). The middle section shows the production rules selected by the CPS to fire. Finally, the bottom portion represents the contents of the retrieval buffer. The values specified for the goal stack and retrieval buffer are representative after the corresponding production has fired.
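Equations 3.9 and 3.10 can be sketched in a few lines. This is an illustration only: the function names, the Gaussian form of the noise term, and every numeric value are assumptions made for the example rather than values taken from the ACT-R literature.

```python
import math
import random


def production_utility(p_success, goal_gain, cost, noise_sd=0.0, rng=random):
    """Equation 3.9: U_i = P_i * G - C_i + noise (Gaussian noise is an assumption)."""
    noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
    return p_success * goal_gain - cost + noise


def selection_probabilities(utilities, t):
    """Equation 3.10: P_i = exp(U_i / t) / sum_j exp(U_j / t)."""
    peak = max(utilities)  # subtracting the maximum leaves the ratios unchanged
    weights = [math.exp((u - peak) / t) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical conflict set {PR1, PR2}; the inputs are illustrative only.
u_pr1 = production_utility(p_success=0.9, goal_gain=20.0, cost=5.0)
u_pr2 = production_utility(p_success=0.1, goal_gain=20.0, cost=1.0)
probs = selection_probabilities([u_pr1, u_pr2], t=2.0)
print(probs)
```

With the noise switched off, PR1 has the larger utility and therefore the larger selection probability. Raising `t` flattens the distribution, which is how the parameter controls the influence of noise on the choice.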
[15] Anderson et al. (2001) do not explicitly include the noise term (i.e., $\varepsilon$) in $U_i$. However, other ACT-R researchers do (e.g., Fu et al. (2006)) and as the inclusion is more specific the noise term is provided in equation 3.9.

[16] The arithmetic example was adapted from Anderson et al. (2001).

As the student's overall goal is to write down the answer to the problem, the goal Write-answer is added to the goal stack. In the first step the CPS performs pattern matching on the current goal and finds two production rules (PR1 and PR2) are valid rules. Since there are two viable candidates the utility of each production is then calculated. Assuming the utility of PR1 was higher, the student will attempt to solve the problem by adding the goal Find-solution to the goal stack in step two. However, part of the process of finding a solution is the summation of the two given numbers (N1 and N2), which is represented in the addition of goal Add-numbers at step three. When production PR6 fires in step four, part of the action involves retrieving the chunk representing the addition of 4 and 1 from declarative memory. Once in possession of the answer the goal Add-numbers is removed from the stack since it is no longer needed (i.e., the numbers have been added). At step five the current goal is Find-solution and since the answer is known production PR5 is fired which removes the goal from the stack. Finally at step six, the only remaining goal is Write-answer and since the answer is known production PR3 will fire which will (1) cause the student to write the answer down and (2) remove the goal from the stack, thus ending the cognitive process.

Step   Goal Stack (topmost goal first)             Central Production System   Retrieval Buffer
1      Write-answer                                Conflict Set {PR1, PR2}
2      Find-solution, Write-answer                 PR1
3      Add-numbers, Find-solution, Write-answer    PR4
4      Find-solution, Write-answer                 PR6                         4+1=5
5      Write-answer                                PR5
6      (empty)                                     PR3

Figure 15. ACT-R: Example Cognitive Problem-Solving Process

3.2.2 Production Learning

ACT-R is capable of learning new production rules via a mechanism called production compilation (Taatgen and Anderson, 2002). Compilation can occur when two production rules are used in sequence to request and then retrieve a chunk from declarative memory. A single production rule is created which aggregates the two production rules and embeds the declarative knowledge into the rule [17]. Learning in this context removes the potentially expensive operation of chunk retrieval from declarative memory.

[17] When a new production rule is created, the original production rules it was created from are not removed from procedural memory.

To illustrate production compilation, consider the previous example (figure 14) where production PR4 requested production PR6 to retrieve an answer from declarative memory. In the example 4 and 1 were provided as numbers to add with the answer of 5 being retrieved from declarative memory. As learning occurs the production rules PR4 and PR6 can be combined for the special case of $4+1=5$. Therefore, steps three and four from figure 15 are combined, resulting in a reduction of the overall number of steps from six to five. Figure 16 illustrates the new production rule PR7 created through production compilation. Now when $4+1$ is encountered the utilities of productions PR4 and PR7 will be compared to determine which production is fired (equations 3.9 and 3.10). The likely eventual outcome is that the utility of production PR7 will be higher, as its cost does not include (1) the firing of another production and (2) the retrieval of chunk five from declarative memory.

(a) PR7: IF goal is Find-solution & answer unknown & operation is addition & N1 is 4 & N2 is 1 THEN set answer to 5 and pop the goal from the goal stack

Figure 16. ACT-R: Example Production Compilation

3.2.3 Chunk

As mentioned in section 3.2.1 a chunk represents a single unit of knowledge (Miller, 1956). The unit of knowledge differs by chunk and can refer to a word, digit, color, shape, phrase, or other such patterns (Simon, 1974). In ACT-R each chunk is of a particular type and associated with slots which represent another chunk or some other value (Stewart and West, 2007). Figure 17 shows an example of three different chunks stored in declarative memory (Anderson et al., 2001; Stewart and West, 2007). Each of the chunks is given a name (e.g., four-plus-one, five, large-friendly-dog) for convenience along with a type (e.g., addition, integer, dog) and some slots. In obtaining the answer to the previous example of $4+1$, the chunk four-plus-one would have been activated which would have led to the retrieval of chunk five (since the sum slot of four-plus-one refers to the five chunk). If a person was trying to recall knowledge about a large, friendly dog instead, chunk large-friendly-dog would be retrieved.
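The six-step walk-through of figure 15, together with the chunk lookup it relies on, can be sketched as a small goal-stack interpreter. This is a simplified illustration and not ACT-R itself: conflict resolution is replaced by the stated assumption that PR1 outranks PR2, chunks are plain dictionaries, and the chunks `four` and `one` as well as all function names are hypothetical additions made for the example.

```python
# Declarative memory as named chunks with slots (chunks "four" and "one" are assumed).
DECLARATIVE_MEMORY = {
    "four-plus-one": {"isa": "addition", "addend1": "four", "addend2": "one", "sum": "five"},
    "four": {"isa": "integer", "value": 4},
    "one": {"isa": "integer", "value": 1},
    "five": {"isa": "integer", "value": 5},
}


def retrieve_sum(memory, n1, n2):
    """Find an addition chunk whose addends match, then follow its sum slot."""
    for chunk in memory.values():
        if chunk.get("isa") != "addition":
            continue
        if memory[chunk["addend1"]]["value"] == n1 and memory[chunk["addend2"]]["value"] == n2:
            return memory[chunk["sum"]]["value"]
    raise LookupError("no matching addition chunk in declarative memory")


def solve(n1, n2, memory):
    """Return the fired productions, stack snapshots, and answer for Write-answer."""
    stack = ["Write-answer"]  # the last element is the top of the goal stack
    answer = None
    trace = []
    while stack:
        goal = stack[-1]
        if goal == "Write-answer" and answer is None:
            rule = "PR1"  # conflict set {PR1, PR2}; PR1 assumed to have the higher utility
            stack.append("Find-solution")
        elif goal == "Write-answer":
            rule = "PR3"  # write the answer and pop the goal
            stack.pop()
        elif goal == "Find-solution" and answer is None:
            rule = "PR4"
            stack.append("Add-numbers")
        elif goal == "Find-solution":
            rule = "PR5"
            stack.pop()
        else:  # goal == "Add-numbers"
            rule = "PR6"  # retrieve the answer chunk and pop the goal
            answer = retrieve_sum(memory, n1, n2)
            stack.pop()
        trace.append((rule, list(stack)))
    return trace, answer


trace, answer = solve(4, 1, DECLARATIVE_MEMORY)
for rule, stack in trace:
    print(rule, stack)
```

The sketch fires PR1, PR4, PR6, PR5, and PR3 in that order, matching steps two through six of figure 15, and the retrieval in PR6 follows the sum slot of the four-plus-one chunk to the five chunk.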


(a) Four-plus-one
    Chunk: four-plus-one
    isa addition
    addend1 four
    addend2 one
    sum five
(b) Five
    Chunk: five
    isa integer
    value 5
(c) Large-friendly-dog
    Chunk: large-friendly-dog
    isa dog
    size large
    manner friendly

Figure 17. ACT-R: Example Chunks (Anderson et al., 2001; Stewart and West, 2007)

3.2.4 Declarative Memory

As seen in step four of figure 15, retrieving information from long-term memory is an important process of human cognition. Within ACT-R declarative knowledge is encoded as a network structure (Anderson and Pirolli, 1984). The network consists of nodes (i.e., chunks) connected via links (Collins and Loftus, 1975). Links are determined based on the association between nodes. Strongly associated nodes are located in close proximity to one another, whereas weakly associated nodes are distal from one another. Nodes may also be indirectly associated via intermediary nodes. Figure 18 provides an example of the network structure found in declarative memory. Each ellipse represents a node, while each line is a link and thus represents an association between nodes.

Figure 18. ACT-R: Declarative Memory Network Structure

Spreading activation is the process by which chunks related to a given source chunk can be retrieved from memory (Collins and Loftus, 1975). When some cue is attended to (e.g., when a user reads a particular word) the chunk $j$ representing that cue is activated in memory (Anderson and Pirolli, 1984). The activation then spreads from the source of the activation (i.e., the cue) throughout the entire network activating any associated nodes. The spreading occurs instantaneously throughout the network and the strength of activation at each node decays exponentially with distance from the source (Anderson and Pirolli, 1984). The end result is that more strongly activated nodes represent knowledge which is more relevant to the activation source.

The total activation associated with chunk $i$ (i.e., a chunk in declarative memory) when chunk $j$ is the source of activation is expressed in equation 3.11 [18] as $A_i$ (Anderson et al., 2004). $B_i$ is the base-level activation which takes into account the history of chunk $i$ independent of chunk $j$ (Anderson and Milson, 1989). The base-level activation is dependent on the frequency and recency of prior activations of chunk $i$ (Anderson et al., 2004) and follows a power law of learning and forgetting where activation strength increases with recent and repeated usage (Anderson et al., 2001). $W_j$ is a weight representing the amount of attention being paid to the source chunk $j$. $S_{ji}$ is the strength of association between chunks $j$ and $i$. Finally, $\varepsilon$ is noise associated with the activation process.

\[
A_i = B_i + \sum_{j} W_j S_{ji} + \varepsilon \tag{3.11}
\]

The manner in which spreading activation applies to information retrieval is based on the strength of activation for the chunk to be retrieved (chunk $i$) when the source of activation is the proximal cue chunk $j$. The strength of a chunk's activation determines “...its probability of being retrieved and its speed of retrieval” (Anderson et al., 2004, pg. 1042). Therefore, if chunk $i$ has weak activation it may (1) not be retrieved or (2) take too long to retrieve. However, absolute activation strength does not guarantee chunk retrieval since each chunk has a retrieval probability as expressed in equation 3.12 (Anderson et al., 2004). In equation 3.12, $A_i$ is the total activation of chunk $i$, $\tau$ represents a threshold which the activation must be above, and $s$ is related to the variance of activation noise (Anderson et al., 2004).
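As a numerical illustration of equations 3.11 and 3.12, the sketch below computes the total activation of a chunk and the resulting retrieval probability, using the standard ACT-R logistic form for the latter. The function names and all numeric values (base levels, weights, association strengths, threshold, and noise scale) are hypothetical and chosen only to show the qualitative behavior.

```python
import math


def total_activation(base_level, sources, noise=0.0):
    """Equation 3.11: A_i = B_i + sum_j W_j * S_ji + noise.

    sources is a list of (W_j, S_ji) pairs, one per source chunk j.
    """
    return base_level + sum(w * s for w, s in sources) + noise


def retrieval_probability(activation, threshold, s):
    """Equation 3.12: P_i = 1 / (1 + exp(-(A_i - tau) / s))."""
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / s))


# Hypothetical strengths: the source chunk is strongly associated with one
# target chunk and only weakly associated with another.
a_strong = total_activation(base_level=0.5, sources=[(1.0, 2.0)])
a_weak = total_activation(base_level=0.5, sources=[(1.0, 0.1)])
p_strong = retrieval_probability(a_strong, threshold=1.0, s=0.4)
p_weak = retrieval_probability(a_weak, threshold=1.0, s=0.4)
print(p_strong, p_weak)
```

A chunk whose activation sits exactly at the threshold is retrieved with probability 0.5, and a smaller `s` makes the transition around the threshold sharper.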
\[
P_i = \frac{1}{1 + e^{-(A_i - \tau)/s}} \tag{3.12}
\]

[18] Anderson et al. (2004) do not explicitly include the noise term (i.e., $\varepsilon$) in $A_i$. However, other ACT-R researchers do (e.g., Fu et al. (2006)) and as the inclusion is more specific the noise term is provided in equation 3.11.

Assume in figure 18 that nodes A, B, and C represent chunks four-plus-one, five, and large-friendly-dog, respectively, from figure 17. At step four of the student's process for solving the equation $4+1$ (figure 15), the source of activation would have been chunk four-plus-one. Based on the given network structure the chunks four-plus-one and five are directly and closely associated with one another, indicating some degree of similarity. Therefore, the total activation of chunk five


would likely be high and probably lead to a successful retrieval of the chunk. The same success would be less likely if the large-friendly-dog chunk was to be retrieved given chunk four-plus-one as the source of activation. A weaker activation would be likely since the distance between the chunks is greater and there are no direct associations [19].

3.3 Information Foraging Theory

The theory of information foraging is concerned with not only the way a person searches within their environment, but also how the environment can be shaped to better facilitate foraging. Therefore, research has used IFT to not only look at navigational patterns of foragers, but also how information environments can be altered to facilitate foraging. IFT has been used to inform the design of graphical user interface controls (e.g., checkboxes, listboxes) which provide social activity visualizations as navigational cues (Willett et al., 2007); highlight ScentTrails on Web pages which facilitate a user's search for information (Olston and Chi, 2003); find optimal browsing paths for large pictures displayed in limited viewing areas (Xie et al., 2006); explain navigational choices within source code during program maintenance tasks (Lawrance et al., 2007); describe the effects of delay, familiarity, and breadth on users' performance, attitude, and intentions at Web sites (Galletta et al., 2006); and examine the role of scent in the decision to browse a menu as opposed to searching a Web site (Katz and Byrne, 2003).
[19] Although the link between an addition problem and a large, friendly dog seems totally unrelated, such associations may exist within a person's mind. Thus the activation of chunk four-plus-one may in fact allow retrieval of chunk large-friendly-dog, especially when taking into account the probabilistic nature of chunk retrieval. For example, the summation problem may lead to an association with summer. Summer may be associated with summer breaks from school which in turn is associated with early childhood. Childhood may then be associated with the family pet that was in turn a large, friendly dog.

The foundational theories OFT and ACT-R are used by IFT to explain the behavioral and cognitive aspects of information foraging. Like OFT, the same sequential search–encounter–decision process is used to explain the basic behaviors of an information forager. Similar to how animals search for patches using their sense of smell, information foragers use a metaphorical sense of smell to locate and follow an information scent trail. The mechanism by which this information scent works is explained via the ACT-R theory. Once an information patch (i.e., an item of interest) has been located the decision turns to answering the question of “how long to stay in a patch?” from the classical patch model. ACT-R also explains the details of how the decision of when to


stay or leave a patch is determined by the information forager.

The following sections present an in-depth explanation of the concepts of information scent and information patches followed by a description of two versions of an IFT model (SNIF-ACT 1.0 and 2.0).

3.3.1 Information Scent

Information scent is “the detection and use of cues, such as World Wide Web (Web) links... that provide users with concise information about content that is not immediately available” (Pirolli, 2007, pg. 68). The concept of information scent corresponds to the search portion of the search–encounter–decision process from OFT. Just as in the wild, a lack of scent makes encountering the item of interest less likely. However, unlike animals hunting for food where one berry is just as beneficial as another berry, information is not as interchangeable. Rather, information of value should be (1) relevant to an information forager's goal and (2) novel (Sandstrom, 1994).

The two main ways in which information scent is used are to (1) guide users to the information being sought and (2) provide a general impression of the available content within a patch. In a Web environment cues are obtained from the text and images associated with a hyperlink. The predicted utility of a link is based on how the cues from a link match a user's goal (i.e., the probability of a link providing a Web page with the desired information [20]). The link with the highest predicted utility (i.e., scent) is then selected as the next navigational choice.

The scent of a link is based on the goal of a user. The user's goal $G$ is the desired distal information where $i$ represents each goal feature (i.e., each word of a goal). Each proximal cue (i.e., link), $L$, on the Web page indicates the distal content of the linked page, where $j$ represents each cue feature (i.e., each word of the link) [21]. The features for both $G$ and $L$ are represented cognitively as chunks (Miller, 1956).

The value of link $L$ in the context of goal $G$ is expressed in equation 3.13 as the sum activation (equation 3.11) of each goal feature (Pirolli, 2007).

[20] Such a relationship between link text and the content of the linked page has been demonstrated empirically by Davison (2000).
[21] Common stop words from hyperlinks such as and, the, a, etc. are not included as features of a cue.
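The link value described in this section (the summed activation of the goal features, with the words of a link acting as activation sources and stop words removed) can be sketched as follows. The sketch assumes zero base-level activation, no noise, and a single attention weight. The association strengths are invented for the illustration, and the names `features`, `link_value`, and `STRENGTH` are hypothetical.

```python
STOP_WORDS = {"and", "the", "a"}


def features(text):
    """Split a goal or link into word features, dropping common stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def link_value(goal, link, strength, weight=1.0):
    """Sum over goal features i of A_i = sum_j W_j * S_ji (B_i and noise taken as 0)."""
    return sum(
        weight * strength.get((j, i), 0.0)
        for i in features(goal)
        for j in features(link)
    )


# Invented association strengths S_ji, keyed as (link word j, goal word i).
STRENGTH = {
    ("red", "white"): 1.0,
    ("roses", "lily"): 1.5,
    ("roses", "flowers"): 2.0,
    ("trees", "lily"): 0.2,
    ("cherry", "flowers"): 0.8,
    ("trees", "flowers"): 0.5,
}

goal = "white lily flowers"
values = {link: link_value(goal, link, STRENGTH) for link in ["red roses", "the cherry trees"]}
best_link = max(values, key=values.get)
print(values, best_link)
```

With these invented strengths the link “red roses” receives the larger value and would be chosen. Selecting the maximum is the deterministic limit; a noisy utility makes the choice probabilistic instead.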


\[
V_{L|G} = \sum_{i \in G} A_i \tag{3.13}
\]

The choice of which link to select is based on the link with the highest utility within the context of the goal $G$. Expressed in equation 3.14, the utility is the value of link $L$ (equation 3.13) along with a random component $\varepsilon$ (Pirolli, 2007). The random component represents user and context variability.

\[
U_{L|G} = V_{L|G} + \varepsilon_{L|G} \tag{3.14}
\]

Similar to the way that ACT-R selects which production to fire probabilistically, IFT determines which link to select via probability. Equation 3.15 is the probability of selecting a given link $L$ from a set of links $C$ within the context of goal $G$ (Pirolli, 2007). $U_{L|G}$ is the utility of the link being evaluated, $U_{k|G}$ represents the utility for each link in the set, and $\mu$ reflects a scaling parameter for random noise.

\[
\Pr(L \mid C, G) = \frac{e^{\mu U_{L|G}}}{\sum_{k \in C} e^{\mu U_{k|G}}} \tag{3.15}
\]

To illustrate the concept of information scent, consider the following example of a person searching for information on the Web. The goal ($G$) of the user is to find information regarding “white lily flowers.” Figure 19 represents the relevant fragment of the user's declarative memory where each feature chunk $i$ of the goal is represented as a black ellipse. On the current Web page the user is presented with two links, “red roses” ($L_1$) and “cherry trees” ($L_2$). The chunk features of links $L_1$ and $L_2$ are symbolized as light gray and dark gray ellipses in figure 19, respectively. As seen in figure 19 the features of $L_1$ are closer than those of $L_2$ to the goal chunks and thus more similar and more likely to strongly activate the features of the goal. Therefore, the scent of link $L_1$ is stronger (i.e., has a higher utility) and the user will select the first link [22].

[22] This example assumes the noise from equation 3.11 and the random component of equation 3.14 are comparable across link features and links.

Although based on ACT-R, the concept of information scent in IFT deviates from the ACT-R theory in three main ways (Pirolli, 2007). First, the source of activation in ACT-R is the goal chunk. In IFT the chunk representing the feature of a proximal cue is the source and the goal


chunk is the destination. Second, the purpose of spreading activation in ACT-R is to retrieve some chunk from declarative memory. IFT is not interested in the retrieval of a chunk, but rather the total level of activation on a goal chunk. Lastly, the utility of which link to select is not based on past successes and failures as the utility of productions in ACT-R is. Instead, utility is based on the total activation strength of a link, which does not take into account past performance. A lack of history therefore means knowledge of prior associations between links and success is not considered. For example, the utility of the link “contact us” would be determined independent of any prior successes a user has had when clicking on a similarly named link to find contact information for a Web site.

Figure 19. ACT-R: Example Memory Schematic, adapted from Collins and Loftus (1975, pg. 412)

3.3.2 Information Patch

An information patch is a grouping of information where “...it is easier to navigate and process information that resides within the same patch than to navigate and process information across patches” (Pirolli, 2007, pg. 49). Within a Web context what constitutes an information patch can differ depending on the level of analysis. At a high level an individual Web site could be considered a patch whereas at a lower level the Web pages within a single Web site could each be considered a patch. Prior research has not explicitly made distinctions between patches at differing levels of analysis. In order to be clear, the terms site-patch and page-patch will refer to patches which constitute an entire Web site or a Web page within a site, respectively.

Although the definition of what a patch is differs by level of analysis, the relationship of similarity within and across patches does not differ. For example, information within a Web page is more similar than information across Web pages of the same site which, in turn, is more similar than information on Web pages of another site. Likewise, a single Web site will have more coinciding information compared to


another Web site.

Such a topical patchy structure of the Web has been empirically demonstrated (Davison, 2000), strengthening the logical argument for a patchy Web. The similarity of content between pages, from most similar to least, was (1) linked pages within the same domain; (2) unlinked pages within the same domain; (3) linked pages to different domains; and finally (4) random pages [23] (Davison, 2000). An increase in link distance (i.e., degrees of separation) within the same domain has also been associated with decreases in page content similarity (Pirolli, 2007). The aforementioned research lends support to the assertion of a patchy grouping of information on the Internet where patches in “close proximity” to one another are more similar than patches farther apart.

As the Internet exhibits a patchy structure, the patch model from OFT (Charnov, 1976) is appropriate to use and can determine the optimal length of stay for a forager. The decision to stay in or leave a patch is determined such that a person will “...forage in an information patch until the expected potential of that patch is less than the mean expected value of going to a new patch” (Pirolli, 2007, pg. 81). The patch leaving rule is mathematically stated in equation 3.16 (Pirolli, 2007) as a variant of the Marginal Value Theorem (Charnov, 1976). $U(x)$ is the utility of a forager in their current state and $\bar{U}$ is the mean utility of other patches. Thus just like the marginal value theorem, the visitor will continue to forage in the patch as long as the utility (i.e., marginal value) is higher than the average of all other patches (i.e., average rate of return). Once the current state utility is equal to the mean, the forager will leave the patch.

\[
U(x) > \bar{U} \tag{3.16}
\]

3.3.3 SNIF-ACT

Pirolli (2007) implemented two versions of a model based on IFT called SNIF-ACT (Scent-based Navigation and Information Foraging in the ACT architecture). As the ACT architecture is a major component of IFT, a set of production rules was defined for both versions of the SNIF-ACT model which characterized users' actions while foraging. Figure 20 lists each of the pertinent productions showing how a user starts processing a new page; evaluates links on a page; and decides amongst clicking a link, going back to a previous page, or leaving the site.
[23] Similarity within a single page is not included since by definition no page can be more similar to a page than itself.


The first four productions of figure 20 (Start-process-page, Process-links-on-a-page, Attend-to-link, Read-and-evaluate-link) are concerned with the cognitive aspects of reading, attending, and evaluating links on a page. In terms of OFT, if any of the first four productions fire then that is a decision by the visitor to continue foraging within the same page-patch. The final three productions also relate to the patch-leaving rule of OFT, although their levels of analysis differ. Production Click-link represents either a decision to leave a page-patch or site-patch depending on whether the link was internal or external to the Web site. Production Leave-site relates to the leaving of a site-patch and production Backup-a-page is concerned with going to an already visited page-patch. In any case, a production representing the leaving of either a page-patch or site-patch should fire when the marginal value of the patch drops to the average rate of return for all patches.

(a) Start-process-page: IF goal is Start-next-patch & there is a task description & there is a browser & browser on unprocessed page THEN set and push subgoal Process-page to the goal stack
(b) Process-links-on-page: IF goal is Process-page & there is a task description & there is a browser & there is an unprocessed link THEN set and push subgoal Process-link to the goal stack
(c) Attend-to-link: IF goal is Process-link & there is a task description & there is a browser & there is an unattended link THEN choose an unattended link and attend to it
(d) Read-and-evaluate-link: IF goal is Process-link & there is a task description & there is a browser & the current attention is on a link THEN read and evaluate the link
(e) Click-link: IF goal is Process-link & there is a task description & there is a browser & there is an evaluated link & the link has highest activation THEN click on the link
(f) Leave-site: IF goal is Process-link & there is a task description & there is a browser & there is an evaluated link & the mean activation on page is low THEN leave the site and pop the goal from the goal stack
(g) Backup-a-page: IF goal is Process-link & there is a task description & there is a browser & there is an evaluated link & the mean activation on page is low THEN go back to the previous page

Figure 20. SNIF-ACT: Production Rules (Pirolli, 2007, pg. 97)


Also fundamental to IFT is the concept of information scent which is determined by the level of activation (equation 3.11) of a goal from a given link. The base-level activation of the goal chunk ($B_i$) in both versions of SNIF-ACT was assumed to be static (i.e., the goal did not change) and thus $B_i$ was set to zero (Pirolli, 2007). The amount of attention paid to a link cue ($W_j$) was modeled as exponentially decaying with respect to the number of cues in a link as shown in equation 3.17 (Pirolli, 2007). $W$ and $d$ are scaling factors and $n$ is the number of cues (i.e., words) in a link.

\[
W_j = W e^{-dn} \tag{3.17}
\]

To determine similarity between a cue and goal feature ($S_{ji}$), a measure from information theory known as pointwise mutual information (PMI) was used (Church and Hanks, 1989). The formula for PMI (equation 3.18) determines the association between two words $i$ and $j$ (or in IFT between a cue and goal feature). The numerator in equation 3.18 is the probability of the two words occurring together whereas the denominator specifies the probability of the words occurring independently. When normalized, a PMI score of 0 indicates no association whereas a score of 1 means perfect association between the two words. PMI has been found to be a good proximal measure of the associations a person may make between chunks within their own declarative memory. For example, PMI was more accurate on tests of synonymy than typical college applicants taking the Test of English as a Foreign Language (TOEFL) (Turney, 2001).

\[
PMI(i, j) = \log \frac{\Pr(ij)}{\Pr(i)\Pr(j)} \tag{3.18}
\]

SNIF-ACT 1.0

The first version of SNIF-ACT assumed foragers evaluated all links on a page before deciding which link to select (Pirolli, 2007). The model was tested against protocol data collected by Card et al. (2001). Four student subjects were given two experimental information finding Web tasks. The first task required the subject to obtain the date and a picture of a comedy group performing at a college campus. For the second task, subjects were instructed to find four posters from the movie Antz. The keystrokes, mouse movements, eye movements, Web pages visited, and think-aloud comments were captured from each subject.
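The two quantities that feed the scent computation, the attention weight of equation 3.17 and the PMI association of equation 3.18, can be sketched as follows. The sketch assumes the natural logarithm and estimates probabilities from raw co-occurrence counts. The default scaling factors and all counts are hypothetical, and no normalization of the PMI score is attempted.

```python
import math


def attention_weight(n_cues, W=1.0, d=0.2):
    """Equation 3.17: W_j = W * exp(-d * n), with n the number of words in the link."""
    return W * math.exp(-d * n_cues)


def pmi(count_ij, count_i, count_j, total):
    """Equation 3.18: log(Pr(ij) / (Pr(i) * Pr(j))), estimated from document counts."""
    p_ij = count_ij / total
    p_i = count_i / total
    p_j = count_j / total
    return math.log(p_ij / (p_i * p_j))


# Hypothetical corpus of 1000 documents in which each word appears in 100 documents.
independent = pmi(count_ij=10, count_i=100, count_j=100, total=1000)  # co-occur as often as chance predicts
associated = pmi(count_ij=50, count_i=100, count_j=100, total=1000)   # co-occur five times more often
print(independent, associated)
print(attention_weight(2), attention_weight(6))
```

Words that co-occur exactly as often as chance predicts score approximately zero, more frequent co-occurrence gives a positive score, and longer links spread attention more thinly across their words.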


The model was evaluated on the ability of information scent to predict link-following and site-leaving actions. From the eight data sets (four subjects with two tasks apiece), a total of 91 link clicks were captured. Using the concept of information scent, the SNIF-ACT model's prediction of which link would be followed was found to be significantly different from random selection ($\chi^2(30) = 18{,}589.45$; $p < 0.0001$) (Pirolli, 2007). Such a result lends credence to the idea of information scent being an indicator used by people to locate proximal information.

In terms of site-leaving, the SNIF-ACT model was also found to follow the patch-leaving rule whereby the subjects foraged in a site-patch until the “...expected potential of that patch is less than the mean expected value of going to a new patch” (Pirolli, 2007, pg. 81). Figure 21 illustrates how a drop in information scent can be a cue for the value of a site-patch. The scent of the last page visited at a Web site was, on average, lower than the average scent of the first page of a new Web site. Therefore, this lack of scent was an indicator to the subject that this site-patch does not contain the sought-after goal information.

Figure 21. SNIF-ACT: Site-leaving Actions (Pirolli, 2007, pg. 100) (horizontal axis: page visit prior to leaving Web site, from Last 3 to Last; vertical axis: information scent; reference line: average for first page of next Web site)

SNIF-ACT 2.0

The second version of SNIF-ACT removed the unrealistic assumption from SNIF-ACT 1.0 that foragers would attend to and evaluate each link before making a decision of where to go. Instead, a learning mechanism was used which relied on the concept of satisficing (Simon, 1956). As a forager has imperfect information and limited computational facilities an optimal decision is unlikely. However, a decision which satisfies a need at some specified level is probable. Therefore, with regards to satisficing the forager would continue to evaluate links in SNIF-ACT until a “good enough” link was found (even though the link might not be optimal). The determination of what is

PAGE 76

“goodenough”reliesontheabilityoftheforagertolearnas informationisuncoveredwhileforaging. Inordertoimplementsuchalearningmechanismtheutilitie sandprobabilitiesofproductions Attend-to-link Click-link ,and Backup-a-page wereupdatedtoincludethehistoryoflinksalready attendedtoandpagesalreadyvisited 24 (Pirolli,2007).Afterevaluatingalink,theforagerisfac ed withthedecisionofwhethertoattendtothenextlink,click apreviouslyevaluatedlink,orleave thepage.Determinedfromeachproduction'sutility(equat ions3.19-3.21),theforager'sultimate decisionisbasedontheprobabilitiesforeachproduction( equations3.22-3.24)(Pirolli,2007). Equation3.19istheutilityforproduction Attend-to-link where U L j G representstheutilityofthe currentlink(equation3.14)and n isthecurrentnumberoflinksalreadyevaluated. U A ( n +1)= U A ( n )+ U L j G 1+ n (3.19) Theutilityforproduction Click-link isshowninequation3.20where max ( U L j G ) isthemaximumlinkutilityofthelinksevaluatedsofarand k isascalingparameter. U C ( n +1)= U C ( n )+ max ( U L j G ) 1+ k + n (3.20) Takingintoaccountthevalueofpriorpagesandthecostofba ckingup,equation3.21representstheutilityforproduction Backup-a-page U Page istheaverageutilityofpreviouslyvisited pages(withinthesameWebsite), U L ( n ) istheaveragelinkutilityoflinks 1 to n ,and C Back reectsthecostofreturningtoapreviouspage. U B ( n +1)= U Page U L ( n ) C Back (3.21) Theprobabilitiesforeachofthethreeproductionrulesare expressedinequations3.22-3.24. Eachequationistheprobabilityofselectingthegivenprod uctionaftertheevaluationof n linkson apage.Ineachequation, representsascalingparameter. Pr ( Attend-to-link ;n )= exp h U A ( n ) i exp h U A ( n ) i +exp h U B ( n ) i +exp h U C ( n ) i (3.22) 24 Production Leave-site wasnotupdatedsincetheexperimentusingSNIF-ACT2.0took placeonasingleWebsite. 63


$$\Pr(\text{Click-link}, n) = \frac{\exp[\mu U_C(n)]}{\exp[\mu U_A(n)] + \exp[\mu U_B(n)] + \exp[\mu U_C(n)]} \tag{3.23}$$

$$\Pr(\text{Backup-a-page}, n) = \frac{\exp[\mu U_B(n)]}{\exp[\mu U_A(n)] + \exp[\mu U_B(n)] + \exp[\mu U_C(n)]} \tag{3.24}$$

Example

To better visualize the interplay amongst the three production rules, a hypothetical example of a Web page with 15 links is provided. The distribution of link utilities ($U_{L_j|G}$) is defined by the function $15e^{-0.7x} + 1$ and shown graphically in figure 22. Noticeable is the sharp decline in scent from the first link ($U_{L_1|G} = 8.4488$) to the last link ($U_{L_{15}|G} = 1.0004$) on the page.

[Figure 22: SNIF-ACT: Hypothetical Distribution of Link Utilities (Pirolli, 2007). Scent-based utility $U_{L|G}$ plotted against the sequential position of link $L$.]

To simulate the utilities and probabilities for each production the following measures were set in accordance with Pirolli (2007): $k$ was set to 5 (equation 3.20); $U_{Page}$ and $C_{Back}$ were set to 10 and 5 (equation 3.21); and $\mu$ was set to 1 (equations 3.22-3.24). The probability of a forager choosing from each of the three productions given the stated link utility distribution is illustrated in figure 23.

[Figure 23: SNIF-ACT: Hypothetical Production Probabilities (Pirolli, 2007). Probability of choice for Attend, Backup, and Click plotted against the number of links evaluated ($n$).]

In figure 23 the probability of attending to the next link is high when only a couple of links have already been evaluated. This represents the forager learning the value of the current page's links. After more links are evaluated ($n \approx 4$) the forager is better informed of the existence of any highly-scented links which may lead to a goal. Therefore, the probability of clicking on a link rises to its highest level. However, each successive link's scent ($n \gtrsim 5$) drops and begins leveling off near the minimum scent value. Considering none of the previous links were satisfactory in causing the forager to click on them, the likelihood of any of the remaining low-scent links causing a link-following action is low (as evidenced by the decline in the probability of clicking a link). As the forager reaches the end of the links, the probability of the visitor returning to a previously visited page continues to increase.

Model Validation

The SNIF-ACT 2.0 model was tested against data from Chi et al. (2003) for fit to both link-following actions and the decision of foragers to go back a page. 244 subjects were recruited to complete some portion of a total of 32 information foraging tasks on four different Web sites (eight tasks per site). To test the SNIF-ACT 2.0 model, Pirolli (2007) included 74 subjects who completed the tasks at two of the Web sites. The Web sites were chosen due to the static nature of their Web pages (i.e., the content and links of the pages did not change dynamically). Eight of the tasks took place on Yahoo!'s help Web site while the remaining eight occurred on ParcWeb's internal company intranet. Unlike SNIF-ACT 1.0, which was tested against individual clickstreams, the aggregated statistics of the subjects and the SNIF-ACT 2.0 model were compared.

Using linear regression, the aggregated SNIF-ACT model obtained good fit for both the ParcWeb tasks ($R^2 = 0.72$) and the Yahoo! tasks ($R^2 = 0.90$) (Pirolli, 2007). The high $R^2$ further bolsters the support found in SNIF-ACT 1.0 that information scent is a reliable indicator of the navigational choices a visitor makes when foraging. To test for subjects returning to a previous page another linear regression model was created. Similar to the link-following results, good fit was also found for the ParcWeb tasks ($R^2 = 0.73$) and the Yahoo! tasks ($R^2 = 0.80$) (Pirolli, 2007). Since backing up a page is concerned with leaving a patch at the page-patch level, the results do not directly bolster the results found in SNIF-ACT 1.0 (which looked at the site-patch level). Instead, the results provide initial supporting evidence of the patch-leaving rule at the page-patch level.

3.4 Conclusion

The preceding sections provided a thorough review of OFT, ACT-R, and IFT. Since IFT draws quite heavily from both OFT and ACT-R, details of each theory were included to give a more complete understanding of IFT. Specifically, the basic prey and patch models from OFT and the architecture and mechanisms for cognition from ACT-R were described. A discussion of information scent and patches, in regards to IFT, was then given along with the details of the SNIF-ACT 1.0 and 2.0 models.
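The hypothetical 15-link example from the SNIF-ACT 2.0 discussion (equations 3.19-3.24) can be sketched in a few lines. The parameter values and the link-utility function come from the text. Two things are assumptions made only so the sketch can run: the starting utilities of Attend-to-link and Click-link are taken to be zero, and the average link utility in equation 3.21 is taken over all links evaluated so far, including the current one. The output is therefore a qualitative illustration and not a reproduction of figure 23.

```python
import math
from typing import List, Tuple

K = 5.0        # scaling parameter k in equation 3.20
U_PAGE = 10.0  # average utility of previously visited pages (equation 3.21)
C_BACK = 5.0   # cost of returning to a previous page (equation 3.21)
MU = 1.0       # scaling parameter in equations 3.22-3.24


def link_utility(position: int) -> float:
    """Scent-based utility of the link at a 1-based position: 15*exp(-0.7x) + 1."""
    return 15.0 * math.exp(-0.7 * position) + 1.0


def simulate(n_links: int = 15) -> List[Tuple[int, float, float, float]]:
    """Return (links evaluated, Pr(attend), Pr(click), Pr(backup)) per step."""
    u_attend = 0.0  # assumed initial utility, not stated in the text
    u_click = 0.0   # assumed initial utility, not stated in the text
    seen: List[float] = []
    rows = []
    for n in range(n_links):  # n = number of links already evaluated
        seen.append(link_utility(n + 1))
        u_attend = (u_attend + seen[-1]) / (1 + n)           # equation 3.19
        u_click = (u_click + max(seen)) / (1 + K + n)        # equation 3.20
        u_backup = U_PAGE - sum(seen) / len(seen) - C_BACK   # equation 3.21
        weights = [math.exp(MU * u) for u in (u_attend, u_click, u_backup)]
        total = sum(weights)                                 # equations 3.22-3.24
        rows.append((n + 1, *(w / total for w in weights)))
    return rows


if __name__ == "__main__":
    for evaluated, p_attend, p_click, p_backup in simulate():
        print(f"{evaluated:2d}  attend={p_attend:.3f}  click={p_click:.3f}  backup={p_backup:.3f}")
```

Under these assumptions the sketch shows the same qualitative pattern the text describes: attending dominates for the first few links, clicking peaks after a handful of links, and backing up becomes the most likely choice as the remaining links level off near the minimum scent.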


Chapter 4

Hypotheses

Information foraging theory (IFT) is concerned with the information gathering search process. However, investigating goal achievement on long tail Web sites is focused on how information gathering characteristics can be used to predict action, such as submitting a contact form. In order for the possibility of action to occur, a visitor must move beyond the information gathering search stage to a decision-making point where an action may or may not take place. Therefore, the information gathering characteristics which are likely to lead to a conversion are those which bring a visitor closer to meeting their information requirements. Figure 24 illustrates how IFT is used within the decision making process.

[Figure 24: Consumer Decision Process Model and Information Foraging Theory]

On the left-hand side of figure 24 is the consumer decision-making process (CDP) model (Engel et al., 1990). The purpose of the CDP model is to illustrate the basic stages a consumer goes through when faced with a decision. Although the stages are depicted linearly in the figure, the process itself may be an iterative one.

The decision making process begins when a consumer recognizes some need to be met. In order to fulfill their need, the consumer will then search for information to find possible solutions. In the next stage, the potential alternatives found are evaluated against one another until a single alternative is selected. In the final stage, the consumer reflects on the outcomes of the process.

On the right-hand side of figure 24 are the main concepts of IFT. In IFT, user goals initiate and drive the search process. Thus, the goal of the user affects the scent of every cue encountered and how a patch is judged. In addition, the scent of a link also affects which patches will be selected to forage within.

The manner in which IFT applies to the CDP model is shown via lines L1 and L2 in figure 24. Line L1 demonstrates that the need being recognized in CDP is the user goal in IFT. This need or goal is the reason for the process or foraging to occur. In addition, the need or goal sets the context for all subsequent activity. Line L2 illustrates that information scent and patches are concerned with the information search process. Information scent is used to locate patches of information and foraging within a patch obtains relevant information.

Noticeable is how IFT only applies to the first two stages of the CDP model. However, the possibility of a goal being achieved on a Web site can only occur in the fourth stage, once a choice has been made. In order to get to the fourth stage, enough information must first be gathered and any alternatives need to be evaluated. The termination of the information search process occurs at “...some point because the person judges that he has enough information to move to the next stage in the problem-solving or decision-making process” (Browne et al., 2007, pg. 91).
The determination of when enough information has been gathered is via a cognitive stopping rule (Browne et al., 2007) as illustrated by line L3 in figure 24. The cognitive stopping rule may be concerned with the fulfillment of a single criterion, list of items, amount of information, amount of new information, or when understanding of the information stabilizes (Browne et al., 2007; Pitts and Browne, 2004). Regardless of the cognitive stopping rule a visitor uses to judge the sufficiency of their gathered information, some rule must be met before there is a chance of a goal occurring.

Once a forager has stopped collecting information, the alternatives are evaluated. The alternatives may be between multiple products or services; or simply between selecting this product or service or not. If the choice is made for some product or service then the forager may perform some action (e.g., submitting contact information); if not, the forager may leave the site looking for more information and other alternatives.

The remainder of this chapter is outlined as follows: first a brief review is given of the way in which an information forager browses. Next, the user- and site-centric clickstream models of information foraging (CMIF) are introduced. The hypotheses generated from the models to help answer research question 3 (listed below) are then presented in §4.2.1 for the user-centric model and §4.2.2 for the site-centric model.

Research Question 3: How can information foraging theory and clickstream data be used to explain the achievement of a goal at a long tail Web site?

4.1 Information Foraging

The basic way in which an information forager evaluates every Web page they are presented with is explained in the first subsection below. An example session is then shown of a user's clickstream as they hunt for information over multiple Web sites. Using the concepts of information foraging, the rationale behind the user's browsing behavior is provided.

4.1.1 Page Evaluation

When presented with a page, a forager has four basic actions which can be selected at any particular time: (1) evaluate or continue to evaluate the links on a page; (2) click on an already evaluated link; (3) go to a previously visited page; or (4) leave the site (Pirolli, 2007). The probability of what action a forager will choose changes over time. When first presented with a new page, it is more probable that the user will begin evaluating the page compared to the other three actions. The purpose of evaluation is to get a general sense of the value of the page and its links.

With continued evaluation, it is likely the probability of at least one of the other three actions becomes higher than the probability for further evaluation. This change in probabilities is due to the concept of satisficing (Simon, 1956; Pirolli, 2007), where the user will continue to evaluate a page until a link with a “good enough” scent is found or it is determined the page does not contain any “good enough” links. If a highly-scented link is found, it will be clicked. If not, the user will either back up to a previously visited page or leave the site in search of a Web site with a higher mean expected value than the current site (Pirolli, 2007).

The rationale behind why scent is beneficial to a user is due to the costs associated with browsing. Foragers are assumed to be rational and thus try to reduce their search costs while hunting for information (Pirolli, 2007). As each additional page viewed incurs a search cost, taking meandering, wrong, or already traversed paths is less efficient than taking a direct path to the information sought. Information scent is a mechanism by which foragers are able to reduce their search costs by increasing their accuracy on which option leads to information of value. Therefore, a forager will click a link if the scent is deemed high enough to efficiently lead to valuable information. If none of the links provide sufficiently high scent, the forager will perform one of the other three actions in anticipation the action will lead to higher-scented links.

4.1.2 Sample Session

By definition, long tail Web sites do not generate heavy traffic. Their relative obscurity means it is unlikely many new visitors will know of the site's existence let alone its uniform resource locator (URL). However, the widespread use of search engines by Internet users (comScore, Inc., 2007a) provides a gateway to these long tail Web sites. The results from search engines also provide links to a number of other known and unknown Web sites too. Therefore, an information forager has easy access to numerous Web sites when hunting for information.
Figure 25 shows an example of the clickstream of a successful foraging trip by a user searching for information about an upcoming gig for a comedy troupe at a college campus (Card et al., 2001). The figure is an adaptation of a Web behavior graph (WBG) (Card et al., 2001) which illustrates each Web page visited by a user. The figure is meant to be read left to right and top to bottom. Each rectangular box represents a Web page and each rounded box represents the results returned from a search query. The letter in each box is the Web site and the number is the Web page at that site. All the boxes from the same Web site are shaded the same color. Straight arrows represent the user clicking a link from one Web page to another. Curved arrows at the end of a line represent a user returning to whatever Web page is listed first on the next row down. Vertical lines indicate a return to a previously visited Web page. Figure 25 is a graphical representation of the user's clickstream.

[Figure 25: User-centric: Example User Clickstream Graph – Adapted from (Card et al., 2001)]

The clickstream illustrated in figure 25 is user-centric in nature since it includes the browsing behavior of the forager across every Web site the user visited (Padmanabhan et al., 2001). A site-centric version of this clickstream would only include the browsing behavior at a single Web site without knowledge of what occurred at the other Web sites. The term user-session refers to a time-contiguous sequence of Web pages viewed at any Web site from the same user, such as seen in a user-centric clickstream. In contrast, session represents a time-contiguous sequence of Web pages viewed at the same Web site by the same user, like in a site-centric clickstream.

Foraging Explanation

In figure 25 the user started their user-session within patch A (i.e., Web site A) at page A1, a Web page with search capabilities, and entered a search query. After evaluating the results of the query on A2, none of the resulting links had a high enough scent to warrant clicking on and thus the user returned to the first page. Re-evaluating the value of the patch in light of page A1 and the results returned on A2, the user decided to leave the site for patch B.

Site B also had search capabilities and the user again entered a search query. This time, while evaluating the results of the query on B2, one of the links had a high enough scent to cause the user to click on it. On site C the user found a highly-scented link to site D and clicked that link. On site D, the user can be seen as having relatively poor scent due to the inefficient revisiting of a number of pages (D1, D2, and D3) multiple times.

After determining that the value of patch D had dropped below what could be expected elsewhere, the user returned back to the previous patch (site C) and finally back to the results page of site B. Re-evaluating the links on the results page B2 led the user to select another link which led to site E. The scent throughout site E was strong as the user did not backtrack. In addition, the user found the information they sought about the comedy troupe on page E4.

The preceding example illustrated how concepts from IFT can be used to explain users' browsing behaviors. For example, a lack of highly scented links from any of the results returned on page A2 explains why the user backtracked to page A1 after executing a search. In addition, the movement from Web site A to B can be deciphered as the user believing information of greater value could be obtained from another patch. The next section presents a clickstream model developed from IFT which captures these concepts using clickstream metrics.

4.2 Clickstream Model of Information Foraging

The clickstream model of information foraging uses clickstream metrics to represent the concepts of information scent and patches. The user-centric (UC) model is presented first which uses information about a forager's entire browsing behavior to determine the overall scent at and the value of a Web site. Since data about a user's entire clickstream is rarely available, a site-centric (SC) version of the model is also presented which provides an alternative conceptualization of the IFT concepts using only site-centric data.

4.2.1 User-centric

Of the four possible actions a user may take at any point on any page, only three of those actions are directly observable via a user's clickstream: click on a link, return to a previously visited page, and leave the site. Although the determination of scent is internally represented as the activation between the features of the links and goal (Pirolli, 2007), the observable actions of a user's clickstream can be used as proxies for determining how a user perceived scent and judged the value of a patch.
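The three observable actions can be recovered from an ordered list of page identifiers, as the following minimal sketch suggests. It assumes page identifiers in the style of figure 25, where the leading letter names the Web site and the number names the page. The labelling rules are a simplification introduced here, not the dissertation's procedure: any move to an already visited page counts as a return, any other move to a different site counts as leaving the site, and everything else counts as a link click. The sample clickstream is hypothetical.

```python
from typing import List, Tuple


def site_of(page_id: str) -> str:
    """Return the site letter of a page identifier such as 'A1'."""
    return page_id[0]


def observable_actions(clickstream: List[str]) -> List[Tuple[str, str, str]]:
    """Label each transition as 'click-link', 'return-to-page', or 'leave-site'."""
    visited = set()
    labelled = []
    for previous, current in zip(clickstream, clickstream[1:]):
        visited.add(previous)
        if current in visited:
            action = "return-to-page"   # page was already seen in this user-session
        elif site_of(current) != site_of(previous):
            action = "leave-site"       # first arrival at a page on another site
        else:
            action = "click-link"       # new page on the same site
        labelled.append((previous, current, action))
    return labelled


if __name__ == "__main__":
    # Hypothetical user-session: search at site A, back up, then move on to B and C.
    for step in observable_actions(["A1", "A2", "A1", "B1", "B2", "C1"]):
        print(step)
```

The fourth action, evaluating links on a page, leaves no trace in this representation, which is why only relative judgments of scent can be drawn from such data.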


The judgment of information scent or the value of the patch cannot, however, be determined in absolute terms from a user's clickstream since there is no absolute to compare against. Rather, judgment must be done in relative terms. For example, assume a user is on a page which has one link that goes to page A and another to page B. If the user's clickstream shows page A was visited next then the link to page A had higher scent than the link to page B. The actual scent and thus the difference in scent between the links are unknown.

Potentially more important than the scent of each individual link, however, is the overall scent and patch value at a particular Web site in comparison to the other Web sites visited. Such a means of determining which Web site was of more value to a user provides a clue into which site might fulfill the goal of the user. For example, in figure 25 the user visited three pages multiple times on site D which would indicate a poor scent at the site. Contrast that browsing behavior with site E where four pages were visited only a single time.

The relative judgment between sites is also important in cases where the user's information goal is complex. For such goals it is likely the clickstream of a user will be complex regardless of the site being visited. If judged in absolute terms, it would seem unlikely the user would find the information they sought at any site. However, if judged relatively, it may be found that one site, while still having an overall low scent, has a higher scent than the other sites and thus was the most useful.

For example, assume a user visited 15 pages at one site, with six of those pages being distinct. At another site the user also visited 15 pages, with only five of those pages being distinct. In absolute terms, the scent at both Web sites appears to be poor since a number of previously visited pages were visited again. However, relative to one another, the first site appears to have a stronger scent than the second.

The following subsections illustrate manners in which the value of a patch and level of information scent can be gleaned from the clickstream of a user. By taking a user-centric viewpoint, many of the proposed conceptualizations are relative to the user's browsing behavior at other Web sites. Table 8 lists the nine hypotheses of the user-centric model. The following subsections provide the rationale behind each of the hypotheses.


Table 8: User-centric: Hypotheses

Hypothesis #   Hypothesis

INFORMATION PATCH – SITE-PATCH
UC1   Higher total duration spent at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.
UC2   Higher number of pages viewed at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.
UC3   Returning to this site-patch during the same user-session will be positively associated with achieving a goal on this long tail Web site.
UC4   Returning to this site-patch during a different user-session will be positively associated with achieving a goal on this long tail Web site.

INFORMATION PATCH – PAGE-PATCH
UC5   Visitation of more highly valued goal page-patches at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
      (a) maximum value of any visited goal page-patch.
      (b) value from the last visited goal page-patch.
      (c) summation of values from all visited goal page-patches.
UC6   Higher median total duration spent within visited goal page-patches at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.

STRICT INFORMATION SCENT
UC7   A lower proportion of repeatedly visited pages at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.
UC8   A more linear clickstream at this site-patch relative to other site-patches in this user-session will be positively associated with achieving a goal on this long tail Web site.

RELAXED INFORMATION SCENT
UC9   Following of more highly valued goal scent trails at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
      (a) maximum value of any followed goal scent trail.
      (b) value from the last followed goal scent trail.
      (c) summation of values from all followed scent trails.


Information Patch

An information patch is a grouping of similar information, like a Web page or Web site (Pirolli, 2007). What a patch represents depends on the level of analysis being examined. At a high level, an entire Web site can be considered a patch. At a lower level, a Web page or set of Web pages may be considered a patch. The term site-patch is used to denote an entire Web site as a patch, while page-patch refers to an individual Web page or set of Web pages as a patch.

The first four hypotheses in this section examine how browsing behavior can lead to goal achievement by considering the Web site as a patch (i.e., site-patch). A benefit of taking a site-patch perspective is only coarse data on browsing behavior is required. The last two hypotheses in this section, however, take a more detailed viewpoint by focusing on specific pages or sets of pages being visited (i.e., page-patches). Although concentrating on page-patches requires finer-grained data, the lower level of analysis may tease out differences not seen at the site-patch level between goal- and non-goal-achieving foragers.

Site-patch

Since a forager has imperfect information and limited computational facilities, an optimal decision of how long to spend in a site-patch is unlikely. Instead, a forager is likely to employ satisficing (Pirolli, 2007; Simon, 1956), making a decision that satisfies a need (e.g., rate of information gain) at some specified level. When reading online texts for learning, satisficing is a commonly used technique (Reader and Payne, 2007). Using satisficing, a forager will continue to spend time reading pages on a Web site as long as information of value is being obtained. Therefore, a higher total duration spent at one site-patch relative to other site-patches can be associated with obtaining more information relevant to a user's information goal, which leads to Hypothesis UC1.

Hypothesis UC1: Higher total duration spent at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.
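One way to make the "relative to other site-patches" part of Hypothesis UC1 concrete is sketched below. The operationalization is an assumption for illustration only: total time at the site of interest is divided by the mean total time at the other sites in the same user-session. The record layout (`site`, `seconds`) and the sample values are hypothetical, and this section does not state the measure the dissertation actually uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

PageView = Tuple[str, float]  # (site identifier, seconds spent on the page)


def duration_by_site(page_views: List[PageView]) -> Dict[str, float]:
    """Total duration per site-patch within one user-session."""
    totals: Dict[str, float] = defaultdict(float)
    for site, seconds in page_views:
        totals[site] += seconds
    return dict(totals)


def relative_duration(page_views: List[PageView], site: str) -> Optional[float]:
    """Duration at `site` divided by the mean duration at the other sites.

    Returns None when the site was not visited or no other site exists
    to compare against (an assumed convention).
    """
    totals = duration_by_site(page_views)
    others = [total for name, total in totals.items() if name != site]
    if site not in totals or not others:
        return None
    return totals[site] / (sum(others) / len(others))


if __name__ == "__main__":
    # Hypothetical user-session spanning three site-patches.
    session = [("A", 30.0), ("A", 10.0), ("B", 20.0), ("E", 60.0), ("E", 60.0)]
    print(relative_duration(session, "E"))  # above 1: more time here than elsewhere
    print(relative_duration(session, "B"))  # below 1: less time here than elsewhere
```

A value above 1 indicates the forager spent more time at this site-patch than at a typical other site-patch in the same user-session, which is the direction Hypothesis UC1 associates with goal achievement.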
Prior research has found mixed support for the association between absolute total duration and the achievement of a goal. A positive, negative, and insignificant association was found dependent on the task on one e-commerce Web site (Sismeiro and Bucklin, 2004). A positive and insignificant association was found using site-centric and user-centric data at another group of e-commerce Web sites, respectively (Padmanabhan et al., 2001).

Each additional page visited represents a decision point where the user believed the value of continuing to browse at this site-patch was higher than what they expected to find elsewhere. In a similar vein as hypothesis UC1, a forager will continue to visit pages within a site-patch as long as information of interest is still being obtained. Therefore, more pages viewed at one site-patch relative to others can be associated with obtaining more information relevant to a user's information goal, which leads to Hypothesis UC2.

Hypothesis UC2: Higher number of pages viewed at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.

Empirically, support has also been mixed for the association between absolute number of pages viewed and conversion. Prior research has found a positive association (Awad et al., 2006; Moe, 2003), no association (Chatterjee et al., 2003), and mixed association depending on the task (Sismeiro and Bucklin, 2004) or type of pages viewed (Van den Poel and Buckinx, 2005).

While foraging within a site-patch, a user forms a general opinion of the value of the Web site. When leaving one site-patch for another, a forager believes greater value may be found elsewhere. However, if a user returns shortly after leaving, the forager was unable to find a more valuable site-patch. Therefore, the site-patch of interest is more likely than other site-patches to contain the information necessary to fulfill the user's goal, which leads to Hypothesis UC3.

Hypothesis UC3: Returning to this site-patch during the same user-session will be positively associated with achieving a goal on this long tail Web site.

When the span of time between visits is greater, returning to a site-patch demonstrates the positive evaluation of the site in two manners. First, the act of returning to a site indicates the forager originally valued the site-patch enough to remember its existence. Second, having a general recollection of the site and then returning also indicates the site-patch is expected to contain the information needed to fulfill the user's goal, which leads to Hypothesis UC4.
Hypothesis UC4: Returning to this site-patch during a different user-session will be positively associated with achieving a goal on this long tail Web site.

Prior research has found positive, negative, and insignificant support depending on the task for the association between returning to a Web site after a session has ended and achieving a goal (Sismeiro and Bucklin, 2004). As far as can be determined, the exit and return of a user during a user-session has not been examined in prior research.

Page-patch

As previously discussed, a page-patch consists of a Web page or set of Web pages that collectively provide information for an individual. However, certain page-patches may provide more useful information to a user than others. The identification of which page-patches are useful is likely to be similar amongst foragers with comparable goals. Patches predominately useful to goal-achieving foragers are known as goal page-patches, whereas non-goal page-patches are patches primarily of use to non-goal-achieving foragers.

A user who visits more highly valued goal page-patches is likely to have a goal similar to the goal-achieving foragers on that Web site. The value of a patch is considered in three different ways: maximum, most recent, and summation. Maximum value contends a highly valued patch visited at any point during a session is needed for the forager to judge the site favorably and thus consider achieving a goal. The value of the most recent (i.e., last visited patch) conjectures a goal is more likely to be achieved soon after visiting a highly valued patch. Finally, summation hypothesizes that the overall evaluation of the Web site, in terms of its valuable patches, affects the decision of a forager to achieve a goal or not.

In comparison to other Web sites visited during a user-session, a user who visits relatively more valuable goal page-patches at this Web site is more likely to achieve a goal, which leads to Hypothesis UC5.

Hypothesis UC5: Visitation of more highly valued goal page-patches at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
(a) maximum value of any visited goal page-patch.
(b) value from the last visited goal page-patch.
(c) summation of values from all visited goal page-patches.
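The three definitions of value in Hypothesis UC5 amount to three summaries of the same ordered list of patch values, as this minimal sketch shows. The numeric values are hypothetical, and how a value is assigned to each goal page-patch is outside this section and simply taken as given. Returning zeros for a session that visited no goal page-patch is also an assumed convention.

```python
from typing import Dict, List


def summarize_patch_values(values_in_visit_order: List[float]) -> Dict[str, float]:
    """Summarize goal page-patch values visited during one session at a site.

    'max'  -> (a) maximum value of any visited goal page-patch
    'last' -> (b) value from the last visited goal page-patch
    'sum'  -> (c) summation of values from all visited goal page-patches
    """
    if not values_in_visit_order:
        # Assumed convention: no goal page-patch visited means zero value.
        return {"max": 0.0, "last": 0.0, "sum": 0.0}
    return {
        "max": max(values_in_visit_order),
        "last": values_in_visit_order[-1],
        "sum": sum(values_in_visit_order),
    }


if __name__ == "__main__":
    # Hypothetical values of three goal page-patches, in the order visited.
    print(summarize_patch_values([0.25, 0.75, 0.5]))
```

The same three summaries apply unchanged to any other ordered list of values collected during a session.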


Positive, negative, and non-significant associations between specific pages and conversion have been found in prior research (Sismeiro and Bucklin, 2004). Differences between the types of pages visited and conversion rate have also been found at one e-commerce Web site (Moe, 2003). The actual relationship between types of pages viewed and conversion was found to be mixed at another e-commerce site (Van den Poel and Buckinx, 2005). As far as can be determined, relationships between groups of pages (of potentially different types) and conversion have not been examined in prior research.

The simple visitation of goal page-patches, however, does not provide a complete indication of how a forager actually processes a page or set of pages. For example, if a forager spends a very short amount of time in a goal page-patch it may signal the user did not fully recognize the value of the patch. A lack of recognition may be because of a poorly expressed information goal or simply a different information goal from previous goal-achieving foragers. Regardless, either reason would be unlikely to result in goal achievement at this Web site.

Similar to hypothesis UC1, a forager will continue to spend time reading pages within goal patches as long as information of value is being obtained. However, unlike hypothesis UC1, only the time spent on pages within already identified valuable goal page-patches is considered. Thus, a higher median total duration spent within goal page-patches at one site-patch relative to other site-patches can be associated with obtaining more information relevant to a user's information goal, which leads to hypothesis UC6.[1]

[1] Goal page-patches are unique to each site.

Hypothesis UC6: Higher median total duration spent within visited goal page-patches at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.

As far as can be determined, prior research has not specifically examined the association between amount of time spent on goal page-patches and goal achievement.

Information Scent

This section presents three hypotheses dealing with information scent. In the first two hypotheses, information scent is characterized by considering a user's entire session as a single monolithic piece. In both these hypotheses a fairly strict definition of information scent is considered which views any inefficiencies in a user's clickstream (e.g., backtracking) as having poorer scent. The last hypothesis takes a more detailed viewpoint by looking at information scent among different fragments of a user's session. In this hypothesis a more relaxed characterization of information scent is used which recognizes that complex sessions may still be of high scent even in the presence of some inefficiencies.

Strict Information Scent

When a forager has a single well-defined goal in mind it would be expected the user would exhibit a focused search pattern (Moe, 2003). With a well-defined goal, the forager is better able to evaluate the scent of each link and hence make more accurate navigational choices. Viewed as a whole, such navigational choices for a forager with high levels of scent should result in a directed clickstream.

A directed path is characterized by few (if any) repeat visitations of pages, since it is assumed a rational forager would obtain any and all information from a page the first time it was visited. However, as even well-defined goals may be complex and hence result in less than direct clickstreams, scent relative to other Web sites visited is more appropriate to examine than absolute scent. Therefore, a goal is more likely to be achieved when a smaller proportion of pages are visited multiple times at this Web site relative to other sites, which leads to hypothesis UC7.

Hypothesis UC7: A lower proportion of repeatedly visited pages at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site.

Empirically, the proportion of repeatedly visited pages has differed depending on the type of page and focus of the browser. For example, Moe (2003) found that directed shoppers at an e-commerce site viewed mostly unique product brand pages, somewhat unique category pages, and not very unique product pages. As far as can be determined, the use of proportion of repeated pages for an entire session has not been examined in prior research.
Taking a finer-grained conceptualization of strict information scent considers the overall complexity of a user's clickstream, as opposed to just general backtracking behavior. A less complex clickstream is one which exhibits a linear path through a site (Senecal et al., 2005), which is indicative of high scent. As path information is used to determine complexity, backtracking behavior at many different pages rather than a single page may be teased out from a session.

For example, consider a user's browsing behavior at two Web sites. At one site seven pages were visited and four of those pages were unique. All of the non-unique pages were the home page which was used as the main hub for all the other pages being visited. At the other site the same number of total pages and unique pages were visited. At this Web site, however, each non-unique page was different from one another. Although the clickstreams from both Web sites have the same proportion of repeatedly visited pages, the clickstream from the second site is more linear and thus less complex than the first.

With high scent, a forager will exhibit a less complex and more linear clickstream than with low scent. However, similar to the previous hypothesis, absolute clickstream complexity is not appropriate to consider in light of potentially complex information goals. Therefore, a less complex clickstream, in terms of linearity, at this Web site relative to other Web sites is more likely to lead to goal achievement, which leads to Hypothesis UC8.

Hypothesis UC8: A more linear clickstream at this site-patch relative to other site-patches in this user-session will be positively associated with achieving a goal on this long tail Web site.
Sessioncomplexityhasbeenusedtosuccessfullydiscrimin ateusersviatheirclickstreaminto highandlowscoringgroups(McEneaney,2001);intheuseofp roductrecommendationagents (Senecaletal.,2005);andinpredictingthecompletionofi nformationalande-commercetasks (Kalczynskietal.,2006).RelaxedInformationScent Theprevioustwohypothesesconsideredthesessionasawhol eandassumedtwothings.First, inefcienciesinauser'sclickstreamwereconsideredindi catorsofpoorscent.However,certain “inefciencies”mayinsteadbeapartofthenaturaldecisio nmakingprocessofauser.Forexample,Moe(2003)foundthatwhendirectedshoppersweredecid ingbetweenproducts,theirclickstreamsdemonstratedmultiplerepeatedvisitstothepages oftheproductsbeingconsidered.Second,itwasassumedtheforagerhadasingleinformationgoal inmindwhenforaging.However, Montgomeryetal.(2004)demonstratedthatmodelswhichacc ountedforchangesinvisitors' goalsonane-commerceWebsiteperformedbetteratpredicti ngconversionthanmodelswhich 80

only allowed for a single goal.

As the information goal of a forager may change during a session (or subgoals may be introduced as information is obtained), the scent of links will change in accordance with the current goal. Therefore, a link to a page which has already been visited might be selected again because (1) the link has the highest scent for the new current goal and (2) the scent gives an indication of novel information on the linked page. So even though the path at the aggregated session level of analysis may appear undirected due to a non-linear path or repeated viewings of the same page, if a forager's clickstream were separated by goal, a more directed manner of browsing within the context of the current information goal would likely be seen.

Figure 26 illustrates an example of an undirected path at the session level of analysis and a directed path at the goal level. The entire session consists of five page views with 60% of those pages being unique. Although the session as a whole does not appear to be directed, breaking the session down by the user's information goals reveals a different pattern. Within the context of a particular information goal, the pages viewed were unique, as evidenced by the 100% path uniqueness for each goal's path subset.

Path Subset           Pages Viewed   Path Uniqueness
Entire Session        A B C B A      60%
Information Goal 1    A B C          100%
Information Goal 2    B A            100%

Figure 26: User-centric: Example Forager's Path

Thus, a finer-grained conceptualization of information scent is needed which is capable of detecting high scent in situations of changing information goals and “inefficient” behavior. To meet that need, goal scent trails and non-goal scent trails [2] are used in a similar spirit as goal and non-goal page-patches from hypothesis UC5. Goal scent trails are path fragments that goal-achieving foragers predominately follow. Non-goal scent trails are predominately followed by non-goal achieving foragers.

[2] Olston and Chi (2003) introduced the concept of ScentTrails, which highlighted the path a user should take given an information goal. ScentTrails differ from scent trails in that the former shows a path through a Web site given a user's goal, whereas the latter uses past foragers' behavior to determine goal and non-goal path fragments.
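The path uniqueness figures in the Figure 26 example can be reproduced with a short sketch. It assumes that uniqueness is the number of distinct pages divided by the number of page views, and that the boundary between the two information goals is already known; the function and variable names are illustrative only and do not come from the dissertation.

```python
def path_uniqueness(pages):
    """Share of page views in a path that are distinct pages (0.0 to 1.0)."""
    if not pages:
        return 0.0
    return len(set(pages)) / len(pages)


# Page views from the Figure 26 example; the split between the two
# information goals is assumed to be known in advance.
session = ["A", "B", "C", "B", "A"]
goal_paths = {
    "Information Goal 1": session[:3],  # A B C
    "Information Goal 2": session[3:],  # B A
}

session_uniqueness = path_uniqueness(session)  # 3 distinct / 5 views = 0.6
goal_uniqueness = {goal: path_uniqueness(path) for goal, path in goal_paths.items()}
```

Viewed as a whole, the session looks repetitive, but each goal-level fragment contains no repeated page.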

By only using portions of users' paths to derive scent trails, those parts of a session most and least aligned with goal-achieving can be teased from an entire session. A user who follows more highly valued goal scent trails is likely to have a goal similar to the goal-achieving foragers on that Web site. In the same manner as hypothesis UC5, the value of a scent trail is defined in three ways: maximum, most recent, and summation. Maximum value contends the most highly valued scent trail that is followed at any point during a session is needed for the forager to judge the site favorably and thus consider achieving a goal. The value of the most recent (i.e., last followed) scent trail conjectures a goal is more likely to be achieved soon after following a highly scented trail. Finally, summation hypothesizes that the overall evaluation of the Web site, in terms of its valuable trails, affects the decision of a forager to achieve a goal or not.

When compared to other Web sites visited during the same user-session, a forager who follows relatively more valuable goal scent trails at this Web site is more likely to achieve a goal, which leads to Hypothesis UC9.

Hypothesis UC9: Following of more highly valued goal scent trails at this site-patch relative to other site-patches within a user-session will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
(a) maximum value of any followed goal scent trail.
(b) value from the last followed goal scent trail.
(c) summation of values from all followed scent trails.

Path information has been used successfully in clickstream research to predict future path selections (Montgomery et al., 2004). Various ways of representing paths have also been tested. Path fragments, which take into account the order, adjacency, and recency of information, have been found to be more accurate for predicting future paths than other manners of representing paths (Yang et al., 2004). As far as can be determined, the use of path fragments which distinguish between groups of a Web site population has not been examined in prior research.

Relation of Hypotheses to Information Foraging Theory

For each of the nine hypotheses, table 9 lists whether the hypothesis is testing or extending IFT. For each IFT-extending hypothesis, a short description is provided below which explains in what way the theory is being extended.

Table 9: Relation of Hypotheses to Information Foraging Theory

Hypothesis #   Hypothesis               Extends IFT?

INFORMATION PATCH – SITE-PATCH
UC1            Duration                 No
UC2            Number of pages          No
UC3            Leaving and returning    Partially
UC4            Returning back           Yes

INFORMATION PATCH – PAGE-PATCH
UC5            Patch visitation         Yes
UC6            Patch duration           Partially

STRICT INFORMATION SCENT
UC7            Unique pages             Yes
UC8            Linear clickstream       Yes

RELAXED INFORMATION SCENT
UC9            Trail following          Yes

The first two hypotheses (UC1–UC2) test IFT without extending the theory. Both of the hypotheses test the theory's expectation that users employ the concept of satisficing when foraging for information (Pirolli, 2007; Simon, 1956). As patches are assumed to exhibit diminishing returns, a visitor should only forage within a patch as long as they are satisfied with the rate of information gain they are obtaining.

The third hypothesis (hypothesis UC3) partially extends IFT. The idea is not novel that a forager would leave a patch when the rate of information gain falls below the mean rate of gain obtainable from the environment. However, the Marginal Value Theorem (Charnov, 1976) assumes an optimal forager with perfect information. Since foragers are known to possess imperfect information, the actual judgment on the mean rate of gain obtainable from other patches may be incorrect.

Therefore, a forager may return to the original patch after exploring other parts of the environment and realizing the original site still provided the highest rate of information gain.

Hypothesis UC4 is considered an extension to IFT because it introduces memory from past sessions. When searching for information, a forager will use information scent to guide them to patches of interest (e.g., a Web site). The level of scent recognized by a forager is dependent on the strength of chunks activated from declarative memory (Pirolli, 2007). It is assumed that when a forager visits a Web site of value, greater attention will be paid to the cues that represent that site compared to sites of a lower value. Greater cue attention will in turn more strongly activate the chunks representing those cues in declarative memory (Anderson et al., 2004). At a later time, when the forager has an information goal that may be achieved from the valuable Web site, those chunks representing the Web site will have a greater probability of being retrieved (than chunks representing lower-valued Web sites) from declarative memory due to being previously activated.

The hypotheses dealing with page-patches (UC5–UC6) are also considered an extension because IFT does not define patches as being associated with a particular group of foragers (e.g., goal versus non-goal sessions). Instead, the patchy structure of the Web is assumed to be independent of a forager's information goal (Pirolli, 2007). Hypothesis UC5 is also an extension to the theory because patches in IFT are not given value independent of the current forager. Instead, the value of a patch is determined by an individual's behavior within that patch [3] (e.g., time spent).

The final three hypotheses are seen as an extension to IFT too. Within IFT, scent is viewed as a real-time mechanism that foragers use to select a navigational option (e.g., selecting which link to click next). While all three hypotheses still assume scent works by the same mechanism, an overall level of scent from a forager's aggregated behavior is conceptualized instead. In addition, hypothesis UC9 also extends IFT by introducing the concept of trails of scent that are common amongst foragers.

4.2.2 Site-centric

The site-centric model is useful when only the clickstream of a forager at a single site is known. As a result of having incomplete data, however, two ways in which concepts are defined to tap the main constructs of IFT in the user-centric model cannot be used in the site-centric model.

[3] For example, value is equated with duration in hypotheses UC1 and UC6.

Instead, alternative forms of conceptualizing the constructs are needed.

The first way the definitions differ is in the usage of a forager's browsing behavior at the site of interest relative to their browsing behavior at the other sites visited during their user-session. Since the site-centric model has no knowledge of browsing behavior at other Web sites, comparisons are instead made relative to a fixed value of zero (i.e., in absolute terms) [4]. For example, users were assumed to have spent zero minutes and viewed zero pages on other Web sites. Thus, the site-centric model was reduced to only using a visitor's absolute browsing behavior at the site of interest.

The second difference between the two models is the ability to determine if the forager left the site and then came back during the session. Site-centric clickstream data would simply show a contiguous clickstream, regardless of whether the forager left the site or not. However, site-centric datasets typically have access to a referring field which shows which URL a user came from (Fielding et al., 1999; Gourley and Totty, 2002). The use of the referring field is not without disadvantages, as common browsing behaviors may lead the field to be blank (e.g., typing in a URL, using a bookmark). Despite these limitations, the use of referring information does provide a means that site-centric datasets may use to determine if foragers have left and returned to the site of interest within a session.

For example, figure 27 illustrates a site-centric view of the clickstream data available from a user. By looking at the user's entire clickstream (figure 25), it is known that the user left the site and returned after visiting page D1 the third time. But the fact that the user left the site and returned cannot be determined from simply examining the site-centric clickstream as shown in figure 27. However, assuming the user followed links, the referring field would indicate page C1 was visited after the third D1 page and thus the forager left the site and returned.

With those two differences in mind, the hypotheses are restated for the site-centric clickstream model of information foraging in table 10.

[4] Chapter 8 provides a comparison of browsing behavior relative to users who had previously achieved a goal at the site of interest. The temporal version of the site-centric model assumes deviations from known goal-achieving browsing behavior indicate lower levels of scent or patch value and thus a lower probability of a goal being achieved.
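As a rough illustration of how the referring field can reveal that a forager left a site and returned, the sketch below flags a session when any page view after the first carries a referrer from a different host. The host names, the tuple layout, and the exact-hostname comparison are assumptions made for the example; blank referrers are treated as uninformative, in line with the limitation noted above.

```python
from urllib.parse import urlparse


def left_and_returned(page_views, site_host):
    """page_views: ordered (url, referrer) pairs; referrer may be None or ''.

    The first page view is skipped because a forager cannot leave and
    return to a site before the session has started.
    """
    for _url, referrer in page_views[1:]:
        if not referrer:
            continue  # blank referrer (typed URL, bookmark): no evidence
        if urlparse(referrer).hostname != site_host:
            return True
    return False


# Hypothetical session at d.example: the last view arrives from c.example,
# so the forager must have left the site and come back.
views = [
    ("http://d.example/d1", "http://search.example/results"),
    ("http://d.example/d2", "http://d.example/d1"),
    ("http://d.example/d1", "http://c.example/c1"),
]
returned_flag = left_and_returned(views, "d.example")

# A session whose later referrers are blank or internal gives no such evidence.
quiet_flag = left_and_returned(
    [("http://d.example/d1", ""), ("http://d.example/d2", None)], "d.example"
)
```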

Table 10: Site-centric: Hypotheses

Hypothesis #   Hypothesis

INFORMATION PATCH – SITE-PATCH
SC1   Higher total duration spent at this site-patch will be positively associated with achieving a goal on this long tail Web site.
SC2   Higher number of pages viewed at this site-patch will be positively associated with achieving a goal on this long tail Web site.
SC3   Returning to this site-patch during the same session will be positively associated with achieving a goal on this long tail Web site.
SC4   Returning to this site-patch during a different session will be positively associated with achieving a goal on this long tail Web site.

INFORMATION PATCH – PAGE-PATCH
SC5   Visitation of more highly valued goal page-patches will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
      (a) maximum value of any visited goal page-patch.
      (b) value from the last visited goal page-patch.
      (c) summation of values from all visited goal page-patches.
SC6   Higher median total duration spent within visited goal page-patches at this site-patch will be positively associated with achieving a goal on this long tail Web site.

STRICT INFORMATION SCENT
SC7   A lower proportion of repeatedly visited pages at this site-patch will be positively associated with achieving a goal on this long tail Web site.
SC8   A more linear clickstream at this site-patch will be positively associated with achieving a goal on this long tail Web site.

RELAXED INFORMATION SCENT
SC9   Following of more highly valued goal scent trails will be positively associated with achieving a goal on this long tail Web site, where value is defined as the:
      (a) maximum value of any followed goal scent trail.
      (b) value from the last followed goal scent trail.
      (c) summation of values from all followed scent trails.

The site-centric hypotheses have the same theoretical relation to IFT as the user-centric hypotheses. §4.2.1 provides an explanation of which hypotheses extend IFT and how the theory was extended.

Figure 27: Site-centric: Example User Clickstream Graph – Adapted from (Card et al., 2001)

4.3 Conclusion

This chapter provided an explanation of how IFT, a theory concerned with information search, can be used to help predict action. In addition, a brief overview was given of how an information forager processes a page, along with an example of the process using user-centric data. The user- and site-centric clickstream models of information foraging were then introduced. Finally, hypotheses generated from the user- and site-centric models were presented.

Chapter 5
Methodology

This chapter outlines the steps taken to test the hypotheses for both the user-centric (UC) and site-centric (SC) clickstream models of information foraging. The methodology for the user-centric model is presented first in §5.1, followed by the site-centric model in §5.2. For each model a description is given of the data sample and of how each hypothesis' measure was calculated. Finally, §5.3 outlines the statistical tests used to test each hypothesis.

5.1 User-centric Clickstream Model of Information Foraging

The first subsection below describes how the user-centric dataset was processed to create user-sessions [1]. In the final subsection, details are given on how measures for the model's hypotheses were calculated. However, since only session-level information about a forager was available in the data, only measures which were calculable from the given data are presented (hypotheses UC1–UC4).

5.1.1 Dataset Sample

The user-centric dataset consists of a set of $n$ sessions $S = (S_0, S_1, \ldots, S_{n-1})$, where $S_i$ represents a single session tuple. Each tuple consists of eight pieces of information: a unique identifier for the user, session, Web site, and referring domain; date and time the session started; number of pages viewed; how much time was spent on the site; and if the session resulted in a purchase being made (i.e., a goal). Table 11 illustrates a set of session tuples.

[1] Summary statistics about the user-centric dataset can be found in chapter 6.

Table 11: User-centric: Example Sessions

User   Session   Website   Referring Domain   Date and Time      Pages   Duration   Goal?
U5     S1        W7        R3                 5/25/08 15:39:07   12      17 min     Yes
U5     S2        W8        n/a                5/25/08 15:40:58   2       3 min      No
U6     S3        W5        n/a                5/25/08 15:53:02   5       9 min      No
U7     S4        W6        R1                 5/25/08 16:02:34   3       4 min      No

User-sessions

The user-centric model is based on the idea of user-sessions, which allow for an examination of a forager's behavior at one site relative to their behavior at other sites. A user-session $U$ contains a target session $T$ and a set of $n$ other sessions $S$, where $S = \{S_0, S_1, \ldots, S_{n-1}\}$. $T$ and $S_i$ are both tuples that represent information about a particular session (as illustrated in table 11).

The target session $T$ is a session that occurred at a long tail e-commerce site. In the dataset, a Web site was flagged as an e-commerce site if at least one purchase was made at the site by any user at any point during the dataset's time period. Web sites that made up the lowest 20% of all goals achieved were considered long tail e-commerce sites. A random sample of 20% of those long tail e-commerce Web sites with at least 50 goal sessions were selected for analysis [2]. Each session taking place at one of the selected long tail e-commerce sites became a target session for a potentially valid user-session.

To become a valid user-session, there must have been at least one other session at an e-commerce site by the user during the time the target session was active [3]. A session was considered active during the target session if it ended 30 minutes or less before the start of the target session [4]. In addition, the session must have also ended by the end of the target session [5]. At least one other session was required for a valid user-session in order to calculate relative behavior from the target

[2] Further details about the selection of long tail e-commerce sites may be found in §6.1.1.
[3] The other session could take place on any e-commerce site from the dataset.
[4] A 30 minute window before the beginning of the session was used because prior research has used a timeout period of 30 minutes for defining sessions (Bucklin and Sismeiro, 2003; Sismeiro and Bucklin, 2004; Van den Poel and Buckinx, 2005).
[5] The purpose of this research was to predict goal achievement at a particular instant in time (i.e., when the target session ended). Including sessions that ended after the target session would rely on data from the future. An entire session was removed from a user-session because the comScore dataset only included session-level information. Therefore a session's browsing behavior could only be determined after a session had ended. If page-level information had been available instead, the session's information known up to the target session's end would have been used.

session. Target sessions which did not have any other sessions during the window of time were not considered valid user-sessions and hence were not used in the analysis.

The createUserSessions algorithm in figure 28 illustrates the basic steps followed to create the user-sessions. The algorithm requires a set of long tail e-commerce Web sites and the number of minutes to use for a time window to be passed to the method when it is called. For each Web site, a set of sessions which visited the site were returned (line 19). These sessions were target sessions for potential user-sessions. For each target session, all other sessions by the user which (1) visited an e-commerce Web site and (2) ended between the specified number of minutes before the start of the target session and the end of the target session were returned (line 22) [6]. If at least one other session was returned from line 22, then a user-session was created and added to the set of valid user-sessions (line 25). This process continued for each potential target session from each of the long tail e-commerce Web sites. After all processing was complete, a set of valid user-sessions was returned from the algorithm (line 29).

Table 12 illustrates how the createUserSessions algorithm operates. The table lists a subsample of sessions from the same user at e-commerce sites sorted by session date and time. The “Long Tail?” column specifies whether the site visited was a long tail e-commerce Web site. The final column, “Target?”, specifies which long tail Web site was the focus of the user-session. The target column is provided to make clear which Web site would be used to compare against, especially in the situation of multiple long tail sites existing within the same user-session.

Assuming Web site $W_5$ was currently being processed, then session $S_4$ would have been returned from line 19 of the algorithm. Each of the returned sessions from the Web site would have then been iterated through. When session $S_4$ was processed, sessions $S_3$ and $S_5$ would have been returned from line 22 of the algorithm. Session $S_3$ would have been returned because the end of the session was within 30 minutes of the start of target session $S_4$ (11:35:00 - 11:33:00 = 2:00). Session $S_2$ would not have been included because the end of the session was more than 30 minutes from the start of the target session (11:35:00 - 11:04:00 = 31:00). Although session $S_5$ started after the target session, it would still be included because the end of the session was equal to or less than the end of the target session (11:44:00 for both sessions).

Since two other sessions were found for the target session, a valid user-session would have been

[6] The target session was not returned in the set of other sessions in line 22 of the createUserSessions algorithm.

 1  /**
 2   * Parameters: (a) Set of long tail Web sites:
 3   *                 W = {W_0, W_1, ..., W_{n-1}}
 4   *             (b) Time window duration in minutes: timeWindow
 5   * Returns:    Set of valid user-sessions:
 6   *                 U = {U_0, U_1, ..., U_{m-1}}
 7   *             where U_i is a tuple:
 8   *                 <T, {S_0, S_1, ..., S_{N-1}}>
 9   * Methods:    (a) getSessions(w): returns set of sessions from Web site w
10   *             (b) getOtherSessions(s, time): returns set of
11   *                 sessions for this user from any e-commerce Web site
12   *                 within the specified window of time
13   *             (c) createUserSession(s, O): returns a valid user-session
14   */
15  createUserSessions(W, timeWindow) {
16      U = {};
17
18      foreach (w ∈ W) {
19          S = getSessions(w);
20
21          foreach (s ∈ S) {
22              O = getOtherSessions(s, timeWindow);
23
24              if (‖O‖ > 0) {
25                  U += createUserSession(s, O);
26              }
27          }
28      }
29      return U;
30  }

Figure 28: User-centric: createUserSessions Algorithm

Table 12: User-centric: Example User-Sessions

User-session   Session   Website   Date and Time      Duration   Long Tail?   Target?
               S1        W8        5/25/08 10:00:00   17 min     No           –
               S2        W4        5/25/08 11:00:00   4 min      No           –
U1             S3        W7        5/25/08 11:30:00   3 min      No           –
U1, U2         S4        W5        5/25/08 11:35:00   9 min      Yes          U1
U1, U2         S5        W6        5/25/08 11:40:00   4 min      Yes          U2
               S6        W2        5/25/08 13:00:00   19 min     No           –
U3             S7        W1        5/25/08 15:00:00   23 min     No           –
U3             S8        W3        5/25/08 15:30:00   31 min     Yes          U3
               S9        W2        5/25/08 18:05:00   23 min     Yes          U4
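The algorithm in figure 28 can be approximated with a minimal Python sketch. It is not the dissertation's implementation: the `Session` record, the single-list input, and the assumption that every listed session took place at an e-commerce site are all illustrative simplifications. The session data mirrors Table 12, and W2 is placed in the long tail set only so that S9 is examined as a target.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Session:
    sid: str
    user: str
    site: str
    start: datetime
    minutes: int

    @property
    def end(self):
        return self.start + timedelta(minutes=self.minutes)


def create_user_sessions(sessions, long_tail_sites, time_window):
    """Return (target, others) pairs for every valid user-session."""
    window = timedelta(minutes=time_window)
    user_sessions = []
    for target in sessions:
        if target.site not in long_tail_sites:
            continue
        # Other sessions by the same user that ended no earlier than
        # `time_window` minutes before the target began and no later
        # than the target's own end.
        others = [
            s for s in sessions
            if s is not target
            and s.user == target.user
            and target.start - window <= s.end <= target.end
        ]
        if others:  # at least one other session is required
            user_sessions.append((target, others))
    return user_sessions


def at(hhmm):
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2008, 5, 25, hour, minute)


rows = [("S1", "W8", "10:00", 17), ("S2", "W4", "11:00", 4),
        ("S3", "W7", "11:30", 3), ("S4", "W5", "11:35", 9),
        ("S5", "W6", "11:40", 4), ("S6", "W2", "13:00", 19),
        ("S7", "W1", "15:00", 23), ("S8", "W3", "15:30", 31),
        ("S9", "W2", "18:05", 23)]
sessions = [Session(sid, "U", site, at(t), m) for sid, site, t, m in rows]

valid = create_user_sessions(sessions, {"W5", "W6", "W3", "W2"}, 30)
others_by_target = {t.sid: [o.sid for o in others] for t, others in valid}
```

With a 30-minute window, target S4 is paired with S3 and S5, target S8 with S7, and S9 produces no valid user-session because no other session ended inside its window.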

created on line 25 of the algorithm. The valid user-session would have included target session $S_4$ and other sessions $S_3$ and $S_5$.

Table 12 also illustrates three other potential user-sessions. Both user-sessions $U_2$ and $U_3$ would have been valid because they both included other sessions beyond their target sessions (session $S_4$ for $U_2$ and $S_7$ for $U_3$). User-session $U_4$ would not have been valid because there were not any other sessions within the time window of session $S_9$. User-session $U_4$ would not have been included in the analysis.

5.1.2 Metrics

Table 13 summarizes the metrics used to test the hypotheses for the user-centric clickstream model. The name of each metric along with a description of how it was calculated is provided. In addition, the hypothesis which corresponds to the metric is also provided in the table. A more in-depth description of the metrics is given in the following subsections.

Table 13: User-centric: Model Metrics

Hypothesis #   Metric    Description

INFORMATION PATCH – SITE-PATCH
UC1            RELDUR    Duration in minutes spent on a Web site relative to median time spent during other sessions.
UC2            RELPGS    Number of pages viewed on a Web site relative to median number of pages from other sessions.
UC3            RETURN    If visitor left the Web site and returned during the same session.
UC4            VISITED   If visitor had previously visited the Web site before.

OTHER
n/a            GOAL      Whether a goal occurred during the session.

To help clarify the notation being used below for the metrics, each user-session $U$ contains a target session $T$ and a set of $n$ other sessions $S$, where $S = \{S_0, S_1, \ldots, S_{n-1}\}$. $T$ and $S_i$ are both tuples that represent information about a particular session (see §5.1.1 for more details).

Information Patch – Site-Patch

RELDUR is the total duration in minutes a visitor spent at the target Web site relative to the median time spent at other sites within the same user-session. The relative duration of the user-session $U$ is calculated from equation 5.1, where $\text{duration}(i)$ is the duration spent during session $i$. To acquire RELDUR, the median duration of the other sessions $S$ in the user-session $U$ is subtracted from the total duration of the target session $T$.

\[
\text{RELDUR} = \text{duration}(T) - \underset{i \in S}{\operatorname{median}}\,[\text{duration}(i)] \tag{5.1}
\]

RELPGS is the number of pages viewed at the target Web site relative to the median number of pages viewed at other sites within the same user-session. The relative number of pages for the target session $T$ is calculated as shown in equation 5.2, where $\text{pages}(i)$ is the number of pages viewed during session $i$. To obtain RELPGS, the median number of pages viewed at the other Web sites is subtracted from the number of pages viewed during the target session.

\[
\text{RELPGS} = \text{pages}(T) - \underset{i \in S}{\operatorname{median}}\,[\text{pages}(i)] \tag{5.2}
\]

RETURN is a binomial variable which is true if the user left and returned to the target Web site during the user-session and false otherwise. A user is designated as leaving and returning to the target Web site if another session is active during some part of the target session. This can occur if a new session is started while time is still being spent at the target Web site. Another situation where this can occur is if a session was started before the target session and continues to be active during some portion of the target session.

For example, RETURN would be true if session $S_4$ from user-session $U_1$ (table 12) was the target session. This is because session $S_5$ started (11:40:00) during the time session $S_4$ was still active (11:35:00 to 11:44:00). RETURN would be false, however, if session $S_8$ from user-session $U_3$ was the target session. Since session $S_7$ was finished (15:00:00 to 15:23:00) before session $S_8$ began (15:30:00), the forager could not have left and returned to the Web site from $S_8$.

VISITED is a binomial variable which is true if the forager had visited the target Web site during another session at some point in the past and false otherwise.
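The relative metrics and the overlap rule for RETURN can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: sessions are reduced to (start, end) pairs, the durations come from user-session U1 and U3 in table 12, and the page counts are hypothetical because table 12 does not list them.

```python
from datetime import datetime, timedelta
from statistics import median


def relative_metric(target_value, other_values):
    """Target value minus the median over the other sessions (eqs. 5.1, 5.2)."""
    return target_value - median(other_values)


def returned(target, others):
    """True if any other session was active during part of the target session."""
    t_start, t_end = target
    return any(start < t_end and end > t_start for start, end in others)


def span(hhmm, minutes):
    hour, minute = map(int, hhmm.split(":"))
    start = datetime(2008, 5, 25, hour, minute)
    return start, start + timedelta(minutes=minutes)


# User-session U1: target S4 (9 min), other sessions S3 (3 min) and S5 (4 min).
reldur_u1 = relative_metric(9, [3, 4])   # 9 - 3.5 = 5.5
relpgs_u1 = relative_metric(6, [2, 5])   # hypothetical page counts: 6 - 3.5 = 2.5

return_u1 = returned(span("11:35", 9), [span("11:30", 3), span("11:40", 4)])
return_u3 = returned(span("15:30", 31), [span("15:00", 23)])
```

Session S5 overlaps S4, so RETURN is true for U1; S7 ended before S8 began, so RETURN is false for U3.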
VISITED is calculated by examining the prior sessions of a forager and determining if the user had ever visited the Web site of interest

before.

Other

The mutually exclusive binomially distributed metric GOAL specifies whether a purchase was made during the session. If a goal was achieved during the session, GOAL would have the value of true. Otherwise, GOAL would have a value of false.

5.2 Site-centric Clickstream Model of Information Foraging

In the first subsection below, the methodology is presented on how the data was used to test the site-centric model [7]. The final subsection details how the measures for the site-centric hypotheses were calculated. Unlike the user-centric dataset, the site-centric dataset was at the page level and thus each of the measures for the site-centric hypotheses was able to be calculated.

5.2.1 Dataset Sample

The supplied data contained a set of $n$ sessions $S = (S_0, S_1, \ldots, S_{n-1})$, where $S_i$ represents a single session. Each session ($S_i$) contained a set of $m$ page information tuples $P = (P_{i0}, P_{i1}, \ldots, P_{i,m-1})$, where $P_{ij}$ represents information about a particular page viewed during a session. Each page information tuple was made up of seven pieces of information: a unique identifier for the session, Web site, referring domain, and page viewed; date and time the page was viewed; how much time was spent on the page; and if the page represented a contact goal being achieved.

Table 14 illustrates a set of page tuples for session $S_9$ at Web site $W_4$. Of note is the right-censored nature of the site-centric data. The duration on the final page of the session is missing because it is not known when the next page was visited by this user (at this site or another).

Table 15 provides some basic statistics on the number of pages viewed and total duration of session $S_9$. The first row of the table shows statistics using the entire session. However, only those parts of a session occurring before the achievement of a contact goal were used in the analysis. This truncation was done because the problem being investigated was the prediction of goal achievement during the remainder of a session. Thus, prediction was done from a point right before a form submission occurred.

[7] Summary statistics about the site-centric dataset can be found in chapter 6.
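The truncation step can be sketched in a few lines. The page views below mirror the session $S_9$ example (pages A, B, C, D, A, E with the final duration unknown); representing each view as a (page, seconds) pair and skipping the unknown final duration when summing are assumptions made for the illustration.

```python
def truncate_before_goal(page_views, goal_page):
    """Keep only the page views that precede the first view of goal_page."""
    kept = []
    for page, seconds in page_views:
        if page == goal_page:
            break
        kept.append((page, seconds))
    return kept


def summarize(page_views):
    """Return (pages viewed, total known duration in seconds)."""
    known = [seconds for _page, seconds in page_views if seconds is not None]
    return len(page_views), sum(known)


# Session S9: the last page is right-censored, so its duration is None.
session_s9 = [("A", 32), ("B", 93), ("C", 111), ("D", 95), ("A", 9), ("E", None)]

entire = summarize(session_s9)                              # (6, 340)
cg1 = summarize(truncate_before_goal(session_s9, "D"))      # (3, 236)
cg2 = summarize(truncate_before_goal(session_s9, "R"))      # (6, 340): R never visited
```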

Table 14: Site-centric: Example Session Tuples

Session   Website   Referrer   Page   Date and Time      Duration   Contact Goal
S9        W4        W6         A      5/25/08 15:37:02   32 s       n/a
S9        W4        W4         B      5/25/08 15:37:34   93 s       n/a
S9        W4        W9         C      5/25/08 15:39:07   111 s      n/a
S9        W4        W4         D      5/25/08 15:40:58   95 s       CG1
S9        W4        W4         A      5/25/08 15:42:33   9 s        n/a
S9        W4        W4         E      5/25/08 15:42:42   n/a        n/a

Table 15: Site-centric: Example Session Statistics by Contact Goal

                  Pages   Duration
Entire session    6       340 s
Contact Goal 1    3       236 s
Contact Goal 2    6       340 s

To illustrate the truncation of a session, assume contact goal $CG_1$ was being examined (represented as page $D$). For session $S_9$, only activity on pages $A$, $B$, and $C$ would be used (as illustrated in the second row of table 15). If contact goal $CG_2$ were being examined instead (represented as page $R$), then the activity from the entire session would be used. This is because session $S_9$ never visited the page representing the submission of a contact form for contact goal $CG_2$. Thus, all pages of session $S_9$ were usable since they occurred before the non-existent submission.

5.2.2 Metrics

Table 16 summarizes the metrics used to test the hypotheses for the site-centric clickstream model. The name of each metric along with a description of how it was calculated is provided. In addition, the hypothesis which corresponds to the metric is also provided in the table. A more in-depth description of the metrics is given in the following subsections.

To help clarify the notation being used below for the metrics, each session contains a set of $m$ page information tuples $P$, where $P = \{P_0, P_1, \ldots, P_{m-1}\}$. $P_j$ represents information about

Table 16: Site-centric: Model Metrics

Hypothesis #   Metric      Description

INFORMATION PATCH – SITE-PATCH
SC1            SITEDUR     Duration in seconds spent on a Web site.
SC2            SITEPGS     Number of pages viewed on a Web site.
SC3            RETURN      If visitor left the Web site and returned during the same session.
SC4            VISITED     If user had previously visited the Web site before.

INFORMATION PATCH – PAGE-PATCH
SC5a           PATCHMAX    Maximum value of any goal page-patch visited.
SC5b           PATCHLAST   Value of last goal page-patch visited.
SC5c           PATCHSUM    Total value of all goal page-patches visited.
SC6            PATCHDUR    Median duration in seconds spent in all goal page-patches.

STRICT INFORMATION SCENT
SC7            UNIQUE      Percentage of unique pages viewed.
SC8            LINEAR      Linearity of clickstream.

RELAXED INFORMATION SCENT
SC9a           TRAILMAX    Maximum value of any goal trail followed.
SC9b           TRAILLAST   Value of last goal trail followed.
SC9c           TRAILSUM    Total value of all goal trails followed.

OTHER
n/a            GOAL        Whether a goal occurred during the session.

a particular page viewed during the session (see §5.2.1 for more details). $P$ only contains pages which occurred before the contact form was submitted for the contact goal of interest.

Information Patch – Site-Patch

SITEDUR is the total duration in seconds a visitor has spent at a Web site. The total duration for the current visitor is calculated according to equation 5.3, where $\text{time}(i)$ is the time spent on the $i$th page.

\[
\text{SITEDUR} = \sum_{i \in P} \text{time}(i) \tag{5.3}
\]

SITEPGS is the number of pages viewed during a session. The number of pages viewed during the current user's session is simply $\lVert P \rVert$ (equation 5.4).

\[
\text{SITEPGS} = \lVert P \rVert \tag{5.4}
\]

RETURN is a binomial metric which is true if the user left and returned to the Web site during the session and false otherwise. Since the dataset is site-centric, leaving and returning to a Web site cannot always be definitively determined. In many cases, however, the HTTP referer [sic] field (Fielding et al., 1999) contains information on what URL a forager was on before arriving at the current page. Thus, if the referring URL from any page viewed in a session (except for the first page viewed) is from a domain other than the current Web site, it can be concluded the user left the site and returned. The preceding rule does not apply to the first viewed page of a session since a forager cannot leave a Web site and return before a session has actually started.

To illustrate, in table 14 (§5.2.1) the referrer of the third page viewed (P3) was from a different domain than the current Web site (W9 versus W4). Therefore, the forager would have a RETURN value of true since the user left site W4, visited W9, and then returned to site W4. The fact that the first page viewed (P1) had a referring URL of a different Web site (W6 versus W4) has no bearing on the value of RETURN.

VISITED is a binomial metric which is true if the forager had visited the Web site during another session at some point in the past and false otherwise. VISITED is calculated by examining the

prior sessions of a forager and determining if the user ever visited the Web site of interest before.

Information Patch – Page-Patch

Patches at a Web site must already be known in order to calculate the four PATCH visitation metrics: PATCHMAX, PATCHLAST, PATCHSUM, and PATCHDUR. The methodology for learning patches is described in detail in appendix 5.B. In general, learning patches requires a set of goal and non-goal sessions to determine which parts of a Web site (i.e., pages) are better able to distinguish between the two groups. Patches are specific to a single Web site.

As the four PATCH metrics require patches to be learned first in order to quantify a session's patch visitation, the sessions for each Web site were split into two groups: training and testing sets. The training set was used to discover goal patches at a Web site. For each session in the testing set, the PATCH metrics were then calculated from the learned goal patches. However, the PATCH metrics were calculated for a testing-set session if and only if goal patches were found at the Web site. In addition, the PATCHDUR metric was calculated for a testing-set session if and only if that session visited at least one of the Web site's discovered goal patches.

Training and Testing Set

Each Web site contained sessions where either a goal was achieved during a session or not. Sessions were separated according to their achievement and placed into a Web site's goal dataset ($D_G$) or non-goal dataset ($D_N$) [8]. To create a Web site's training set ($R$), the sessions from both the goal and non-goal datasets were sorted in ascending order by their session start date. Then the first 70% of sessions from the goal dataset ($D_G$) were placed into the training set ($R$). The date of the last goal session added to $R$ was noted. Sessions from the non-goal dataset ($D_N$) which occurred at or before the noted date of the last goal session from $R$ were also added to the training set. All sessions from $D_G$ and $D_N$ not added to the training set were put into the testing set ($E$).
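The date-based split into training and testing sets can be sketched as below. Sessions are reduced to hypothetical (session id, start date) pairs, and rounding the 70% cut down to a whole number of goal sessions is an assumption, since the text does not say how fractional counts were handled.

```python
from datetime import date


def split_sessions(goal_sessions, non_goal_sessions, train_percent=70):
    """Split (session_id, start_date) pairs into training and testing sets."""
    goal = sorted(goal_sessions, key=lambda s: s[1])
    non_goal = sorted(non_goal_sessions, key=lambda s: s[1])

    # Assumption: the 70% cut is rounded down (integer arithmetic).
    n_train = len(goal) * train_percent // 100
    train_goal, test_goal = goal[:n_train], goal[n_train:]
    if not train_goal:
        return [], goal + non_goal

    cutoff = train_goal[-1][1]  # date of the last goal session in training
    train = train_goal + [s for s in non_goal if s[1] <= cutoff]
    test = test_goal + [s for s in non_goal if s[1] > cutoff]
    return train, test


# Hypothetical data: ten goal sessions on 1-10 May, three non-goal sessions.
goal_sessions = [(f"G{d}", date(2008, 5, d)) for d in range(1, 11)]
non_goal_sessions = [("N1", date(2008, 5, 2)), ("N2", date(2008, 5, 7)),
                     ("N3", date(2008, 5, 9))]

train, test = split_sessions(goal_sessions, non_goal_sessions)
train_ids = [sid for sid, _ in train]
test_ids = [sid for sid, _ in test]
```

Here the seventh goal session (7 May) sets the cut-off date, so N1 and N2 join the training set while N3 and the last three goal sessions form the testing set.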
Learning Patches

Patches were learned for a Web site using the training dataset ($R$) according to the methodology outlined in appendix 5.B. Patches were learned at $\alpha$ levels of 0.05 and 0.01 and support levels of

[8] To simplify notation, $D_G$ and $D_N$ are used to refer to the current Web site being examined.

0.25 to 1.50 (in 0.25 increments) [9].

Specifically, a set of $n$ valuable patches $A = (A_0, A_1, \ldots, A_{n-1})$ were discovered, where $A_i$ represents a single valuable patch [10]. $A_i$ consists of a set of $m$ unordered and distinct pages $U = (U_0, U_1, \ldots, U_{m-1})$.

Each patch ($A_i$) was also given a value according to equation 5.5 (Yang and Padmanabhan, 2003). $S_{Gi}$ and $S_{Ni}$ represent the number of goal and non-goal sessions from the training dataset that visited patch $A_i$, respectively. $R_G$ and $R_N$ are the total numbers of goal and non-goal sessions from the training dataset. The value of patch $A_i$ could range from zero to two, with higher numbers representing a greater difference in support of the patch in distinguishing between goal and non-goal sessions (i.e., being more valuable).

\[
\text{value}(A_i) = \frac{\left| \dfrac{S_{Gi}}{R_G} - \dfrac{S_{Ni}}{R_N} \right|}{\dfrac{1}{2}\left( \dfrac{S_{Gi}}{R_G} + \dfrac{S_{Ni}}{R_N} \right)} \tag{5.5}
\]

Table 17 provides an example of three valuable patches found at a Web site. Each patch is made up of a set of unique and distinct pages. In addition, the value (as calculated from equation 5.5) is provided for each patch.

Table 17: Site-centric: Example Valuable Patches

Patch   Pages       Value
A1      {A, C}      0.75
A2      {B, C}      1.15
A3      {B, C, E}   1.35

Calculating PATCH Metrics

To calculate the PATCH metrics for a given session from the testing set ($E$), two steps were required. First, it was determined what patches the session visited from the set of valuable patches ($A$). Each session had a set of $l$ visited patches $V = (V_0, V_1, \ldots, V_{l-1})$, where $V_j$ was an individual

[9] The results of the sensitivity analysis can be found in §7.2.3.
[10] $A_i$ is a simplified form of notation which assumes a fixed Web site and significance or support level.

PAGE 113

patchvisitedbythecurrentsession 11 .Asessionwasconsideredtohavevisitedapatchifallpages ofthepatch( U )werevisitedatleastonce(inanyorder)bythecurrentsess ion(asdeterminedby thesetofpages P fromthesession).Formally, A i wasaddedto V if U P .Onceitwasknown whatpatcheswerevisited,thenthefourmeasureswerecalcu lated. Table18providesanexampleoftwopatchesvisitedbyasessi onwiththreepageviews(
). V 1 and V 2 aresimplypatches A 1 and A 2 fromtable17thatwerevisitedbythesession. A 3 wasnotincludedbecausethesessionnevervisitedpage E .Table18isalsousedtocalculate examplesforeachofthemeasuresinthissubsection. Table18:Site-centric:ExampleVisitedPatches PatchPagesValue V1 f A,C g 0.75 V2 f B,C g 1.15 PATCHMAX isthevalueofthemostvaluablepatchvisitedbythecurrent user.Themaximum valueisdeterminedbyiteratingovereveryvisitedpatchto ndtheonewiththehighestvalue (equation5.6).Iftheuserdidnotvisitanypatchesthenthe valueof PATCHMAX wouldbezero. PATCHMAX = 8<: max ( foreach j 2 V ( value ( V j ))) if k V k > 0 0 else (5.6) Toillustrateequation5.6, PATCHMAX wouldbe1.15( max (0 : 75 ; 1 : 15) )assumingauservisited thepatchesintable18. PATCHLAST isthevalueofthelastpatchvisitedbytheuser.Afoursteph euristicwasusedto determinewhichpatchwasvisitedlastduringauser'ssessi on. (1)Foreachpatchvisited,thepositionwithintheuser'sse ssionwhentheforager last visiteda pagefromthatpatchwasnoted 12 PATCHLAST thenequaledthevalueofthepatchwiththe highestendingposition.Ifmorethanonepatchhadthesameh ighestendingpositionthenthe processcontinuedtothesecondstep. 11 V j isasimpliedformofnotationwhichassumesavaluablevisi tedpatchfromaxedWebsiteandsignicance orsupportlevel. 12 Ifauservisitedapagemorethanonce,thenthelasttimethep agewasvisitedwasused. 100


(2) PATCHLAST equaled the value of the largest patch from the tied patches in the first step. Largest was defined as the patch with the highest number of pages (i.e., ‖U‖). If more than one patch tied for the largest patch then the process continued to the third step.

(3) For each of the remaining tied patches from step two, the position within the user's session when the forager first visited a page from that patch was noted [13]. PATCHLAST then equaled the value of the patch with the highest starting position (i.e., started exploring the patch last). If more than one patch had the same highest starting position then the process continued to the fourth step.

(4) PATCHLAST equaled the median value of all tied patches from step three.

Table 19 illustrates the values obtained from following the heuristic on the visited patches from table 18. In this example, PATCHLAST would be 1.15. The steps for the heuristic for this example are provided after the table.

Table 19: Site-centric: Example Last Visited Patches

Patch   Pages    Value   Ending Position   Size   Starting Position
V1      {A, C}   0.75    3 (max(1, 3))     2      1 (min(1, 3))
V2      {B, C}   1.15    3 (max(2, 3))     2      2 (min(2, 3))

(1) The highest ending position for both patches was three. Since more than one patch was tied with the maximal value, a single patch could not be considered last, and thus the process continued to step two.

(2) Both patches also had a patch size of two. Therefore, the patches were tied again since neither of the patches was larger than the other patch.

(3) Patches V_1 and V_2 were first visited during the first and second page of the user's session, respectively. Since patch V_2 had a later starting position it was deemed the last patch. Therefore, the value of PATCHLAST was the value of patch V_2 (1.15).

[13] If a user visited a page more than once, then the first time the page was visited was used.
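The patch-visitation test and the four step PATCHLAST heuristic can be sketched as below. This is an illustrative sketch under stated assumptions: the data layout and function names are hypothetical, the session is assumed to be the page views A, B, C in that order (which matches the positions in table 19), and returning None when no patch was visited is an assumption because the text does not define that case for PATCHLAST.

```python
from statistics import median


def visited_patches(patches, session):
    """Patches whose pages were all viewed at least once, in any order."""
    viewed = set(session)
    return {name: p for name, p in patches.items() if p["pages"] <= viewed}


def patch_max(visited):
    """Equation 5.6: highest patch value, or zero when nothing was visited."""
    return max((p["value"] for p in visited.values()), default=0.0)


def keep_max(candidates, key):
    """Keep only the candidates that tie for the largest key."""
    best = max(key(c) for c in candidates)
    return [c for c in candidates if key(c) == best]


def patch_last(visited, session):
    if not visited:
        return None  # assumption: this case is not defined in the text
    first_pos, last_pos = {}, {}
    for pos, page in enumerate(session, start=1):
        first_pos.setdefault(page, pos)
        last_pos[page] = pos

    tied = list(visited.values())
    # Step 1: latest ending position (last visit to any page of the patch).
    tied = keep_max(tied, lambda p: max(last_pos[g] for g in p["pages"]))
    # Step 2: largest patch among those still tied.
    tied = keep_max(tied, lambda p: len(p["pages"]))
    # Step 3: latest starting position (first visit to any page of the patch).
    tied = keep_max(tied, lambda p: min(first_pos[g] for g in p["pages"]))
    # Step 4: median value of whatever is still tied.
    return median(p["value"] for p in tied)


patches = {
    "A1": {"pages": frozenset("AC"), "value": 0.75},
    "A2": {"pages": frozenset("BC"), "value": 1.15},
    "A3": {"pages": frozenset("BCE"), "value": 1.35},
}
session = ["A", "B", "C"]  # assumed order of the three page views
visited = visited_patches(patches, session)
print(sorted(visited), patch_max(visited), patch_last(visited, session))
```

Applying the three filters in sequence is equivalent to stopping at the first step that breaks the tie, because a single remaining patch passes every later filter unchanged.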


PATCHSUM adds up the value of every patch visited by the current user (equation 5.7). A value of zero is given to any user that did not visit any patches.

\[ PATCHSUM = \begin{cases} \sum_{j \in V} value(V_j) & \text{if } \|V\| > 0 \\ 0 & \text{else} \end{cases} \tag{5.7} \]

PATCHSUM would be 1.90 (0.75 + 1.15) using the patches visited in table 18.

PATCHDUR is the median duration a user spent in all their visited patches. Only sessions which visited at least one patch (i.e., ‖V‖ > 0) would have a value for PATCHDUR. The calculation for PATCHDUR is shown in equation 5.8, where G is taken to be the set of pages of the visited patch V_j. totalTime(k, P) returns the total time a session with pages P spent on page k. If a session visited page k more than once in P, then the sum of the durations from all visits to page k was returned.

\[ PATCHDUR = \operatorname{median}_{j \in V} \left[ \sum_{k \in G} totalTime(k, P) \right] \tag{5.8} \]

PATCHDUR would be 164.50 s (median(125, 204)) for visited patches V_1 and V_2 (table 18) and session S_9 (table 14).

Strict Information Scent

UNIQUE is the percentage of unique pages viewed during a session. The percentage of unique pages viewed for the current visitor is calculated according to equation 5.9, where distinct(P) is the number of distinct pages viewed in the set of page information tuples P.

\[ UNIQUE = \frac{distinct(P)}{\|P\|} \times 100 \tag{5.9} \]

LINEAR is the complexity of a session as calculated via the stratum measure. Complexity is determined via the straightness (i.e., absence of visiting pages repeatedly) of a user's browsing behavior, where higher linearity equates to less complexity. Stratum is a measure of linearity from graph theory (McEneaney, 2001) and details on its calculation may be found in appendix 5.A.
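A small sketch of PATCHSUM, PATCHDUR, and UNIQUE follows. The function names and the (page, seconds) tuple layout are assumptions, and the per-page durations are invented so that the two per-patch totals come out to the 125 s and 204 s quoted above; they are not the values of session S_9.

```python
from statistics import median


def patch_sum(values):
    """Equation 5.7: sum of visited patch values (zero for no patches)."""
    return sum(values)


def total_time(page, page_views):
    """Total seconds spent on a page, summed over repeat visits."""
    return sum(seconds for p, seconds in page_views if p == page)


def patch_dur(visited_patch_pages, page_views):
    """Equation 5.8: median over visited patches of time spent on their pages."""
    if not visited_patch_pages:
        return None  # PATCHDUR is only defined when a patch was visited
    return median(
        sum(total_time(k, page_views) for k in pages)
        for pages in visited_patch_pages
    )


def unique_pct(page_views):
    """Equation 5.9: distinct pages as a percentage of all page views."""
    pages = [p for p, _ in page_views]
    return len(set(pages)) / len(pages) * 100


# Hypothetical (page, seconds) tuples for one session.
page_views = [("A", 40), ("B", 119), ("C", 85)]
visited = [{"A", "C"}, {"B", "C"}]

print(patch_sum([0.75, 1.15]))
print(patch_dur(visited, page_views))
print(unique_pct(page_views))
```

Because page C belongs to both patches, its time counts toward each patch's total, which is what summing totalTime over the pages of each patch implies.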


Relaxed Information Scent

The three TRAIL metrics for the relaxed information scent were calculated in a very similar manner as the PATCH metrics. The same training set used to discover patches was used to learn trails. Sessions from the testing set then used those learned trails to calculate their values for the three TRAIL metrics.

Specifically, a set of n valuable trails T (T_0, T_1, ..., T_{n-1}) were discovered from the training set, where T_i represents a single valuable trail [14]. T_i consists of a set of m ordered pages O (O_0, O_1, ..., O_{m-1}), where the pages may repeat themselves in the ordered set (e.g., <…>). Once discovered, trails were given a value like patches using equation 5.5 (with T_i being used instead of A_i). Table 20 provides an example of three discovered trails.

Table 20: Site-centric: Example Valuable Trails

Trail   Pages   Value
T1      <…>     0.35
T2      <…>     1.25
T3      <…>     1.15

Once the trails were discovered, each session in the testing set (E) required two steps to calculate the TRAIL measures. First, it was determined what trails were followed by the session of interest from the set of valuable trails (T). Each session had a set of l followed trails F (F_0, F_1, ..., F_{l-1}), where F_j was an individual trail followed by the current session [15]. A session was considered to have followed a trail if all pages of the trail (O) were followed in order by the current session (as determined by the set of pages P from the session). Although all pages must have been followed in order, repeat visitation and gaps between pages were allowed (i.e., other pages may be visited in between pages from the trail). More specifically, T_i was added to F if O ⊆ P and the pages of O were found in the same order in P. Once it was known what trails were followed, then the three measures were calculated.

Table 21 provides an example of two trails followed by a session with six page views (<…, A, B, A, D, C>). F_1 and F_2 are simply trails T_1 and T_2 from table 20 that were followed by the session. T_3 was not included because page C was not visited before page D in the session. Table 21 is also used to calculate examples for each of the measures in this subsection.

Table 21: Site-centric: Example Followed Trails

Trail   Pages   Value
F1      <…>     0.35
F2      <…>     1.25

TRAILMAX is the value of the most valuable trail followed by the current user. The maximum value is determined by iterating over every followed trail to find the one with the highest value (equation 5.10). If the user did not follow any trails then the value of TRAILMAX would be zero.

\[ TRAILMAX = \begin{cases} \max_{j \in F} value(F_j) & \text{if } \|F\| > 0 \\ 0 & \text{else} \end{cases} \tag{5.10} \]

To illustrate equation 5.10, TRAILMAX would be 1.25 (max(0.35, 1.25)) assuming a user followed the trails in table 21.

TRAILLAST is the value of the last trail followed by the user. A four step heuristic was used to determine which trail was followed last during a user's session.

(1) For each trail followed, the position within the user's session when the forager last visited the final page of the trail was noted [16]. TRAILLAST then equaled the value of the trail with the highest ending position. If more than one trail had the same highest ending position then the process continued to the second step.

(2) TRAILLAST equaled the value of the longest trail from the tied trails in the first step. Longest was defined as the trail with the highest number of pages (i.e., ‖O‖). If more than one trail tied for the longest trail then the process continued to the third step.

(3) For each of the remaining tied trails from step two, the position within the user's session when the forager first visited the first page of the trail was noted [17]. TRAILLAST then equaled the value of the trail with the highest starting position (i.e., started following the trail last). If more than one trail had the same highest starting position then the process continued to the fourth step.

(4) TRAILLAST equaled the median value of all tied trails from step three.

Table 22 illustrates the values obtained from following the heuristic on the followed trails from table 21. In this example, TRAILLAST would be 1.25. The steps for the heuristic for this example are provided after the table.

Table 22: Site-centric: Example Last Followed Trails

Trail   Pages   Value   Ending Position   Length   Starting Position
F1      <…>     0.35    6                 2        1
F2      <…>     1.25    6                 3        1

(1) The highest ending position for both trails was six. Since more than one trail was tied with the maximal value, a single trail could not be considered last, and thus the process continued to step two.

(2) Trail F_2 had a length of three pages, while F_1 only had two pages in its trail. Therefore, the value of TRAILLAST was the value of trail F_2 (1.25).

TRAILSUM adds up the value of every trail followed by the current user (equation 5.11). A value of zero is given to any user that did not follow any trails.

\[ TRAILSUM = \begin{cases} \sum_{j \in F} value(F_j) & \text{if } \|F\| > 0 \\ 0 & \text{else} \end{cases} \tag{5.11} \]

TRAILSUM would be 1.60 (0.35 + 1.25) using the trails followed in table 21.

[16] If a user visited the final page of the trail more than once, then the last time the page was visited was used.
[17] If a user visited the first page of the trail more than once, then the first time the page was visited was used.
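Trail following is an ordered-subsequence test with gaps allowed, and TRAILLAST reuses the same four tie-breaking steps as PATCHLAST. The sketch below is illustrative only: the trail pages and the first page view of the session are hypothetical placeholders, chosen so that the ending positions, lengths, and starting positions of table 22 are reproduced. Function names are also assumptions.

```python
from statistics import median


def follows_trail(trail, session):
    """True if the trail's pages occur in order in the session (gaps allowed)."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the matching page.
    return all(page in remaining for page in trail)


def keep_max(candidates, key):
    best = max(key(c) for c in candidates)
    return [c for c in candidates if key(c) == best]


def trail_last(followed, session):
    """followed is a list of (pages, value) pairs for the followed trails."""
    if not followed:
        return None  # assumption: this case is not defined in the text
    first_pos, last_pos = {}, {}
    for pos, page in enumerate(session, start=1):
        first_pos.setdefault(page, pos)
        last_pos[page] = pos

    tied = list(followed)
    tied = keep_max(tied, lambda t: last_pos[t[0][-1]])   # step 1: ending position
    tied = keep_max(tied, lambda t: len(t[0]))            # step 2: longest trail
    tied = keep_max(tied, lambda t: first_pos[t[0][0]])   # step 3: starting position
    return median(value for _, value in tied)             # step 4: median of ties


# Hypothetical trails and session (placeholders, not the source's sequences).
trails = [(("A", "C"), 0.35), (("A", "B", "C"), 1.25), (("C", "D"), 1.15)]
session = ["A", "A", "B", "A", "D", "C"]

followed = [t for t in trails if follows_trail(t[0], session)]
trail_max = max((v for _, v in followed), default=0.0)
trail_sum = sum(v for _, v in followed)
print(len(followed), trail_max, round(trail_sum, 2), trail_last(followed, session))
```

With these placeholders the third trail is rejected because page C never occurs before page D, and the tie on ending position is broken at step two by trail length.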


Other

The mutually exclusive binomially distributed metric GOAL specifies whether at some point during the remainder of a session a contact form was submitted for the contact goal of interest. If a goal will be achieved during the session, GOAL will have the value of true. Otherwise, GOAL will have a value of false.

5.3 Metric Testing

Each of the metrics was tested individually to determine if it was able to distinguish between goal and non-goal sessions at any long tail Web site. The metrics were tested at the Web site unit of analysis since the goal was to find metrics which were significant over multiple long tail sites. Since each Web site had numerous goal and non-goal sessions, the median value was separately taken for each group [18]. The median values for the goal and non-goal sessions were then used as each Web site's paired data points.

The binomial metrics RETURN and VISITED did not use median values since the metrics were only flags indicating if someone left the site or had visited the site before. Therefore, each Web site was compared according to the probability of a goal occurring given that the user either left and returned to the site or stayed at the site the entire session [19].

Table 23 illustrates the contingency table constructed for each Web site that was used to calculate the probabilities [20]. Counts of sessions at the Web site were categorized according to two dimensions: goal or non-goal session; and if the session left and returned or stayed on the site. Equations 5.12 and 5.13 detail how each of the probabilities was calculated for each Web site. The probabilities were then used as each Web site's paired data points.

\[ P(Goal \mid Return) = \frac{a}{a + b} \tag{5.12} \]

\[ P(Goal \mid Stayed) = \frac{c}{c + d} \tag{5.13} \]

[18] Median values were used instead of mean values to reduce the impact of outliers on the dataset.
[19] For the VISITED metric the probabilities being compared were for a goal occurring given that the user had visited the site before or that this was the user's first visit to the site.
[20] All notations are stated for the RETURN metric. To be applicable to the VISITED metric, the notation “return” becomes “visited” and “stayed” changes to “new”.


Table 23: Contingency Table for RETURN and VISITED

          Goal    Non-goal   N
Return    a       b          a + b
Stayed    c       d          c + d
N         a + c   b + d      a + b + c + d

A total of three different statistical tests were performed on each metric: paired t-test, exact Wilcoxon signed rank test, and dependent-samples sign test. The paired t-test is a parametric test which assumes the data came from a normal distribution (Conover, 1999). The exact Wilcoxon signed rank test and the dependent-samples sign test are both non-parametric tests which do not make any assumption about the type of underlying distribution (Conover, 1999). Three tests were performed because each test differs in the stringency of its assumptions. When metrics deviate from the assumptions of a test, the other less stringent tests can provide a “worst-case” baseline for the significance of the metric.

Each of the three tests is described in greater detail below, starting with the most stringent test. An example of each test is also provided which illustrates the test statistic being calculated.

Paired t-test

The paired t-test is a parametric test to determine if the mean difference between groups is zero (Conover, 1999). The difference of each pair's measures is calculated and then used to determine the test statistic t. The significance of t is then determined based on the assumption of an underlying normal distribution.

In the data there are n pairs of X and Y observations (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) (Conover, 1999). For each observation pair, the difference D_i is calculated between X_i and Y_i, where D_i = Y_i − X_i.

The test statistic t is calculated according to equation 5.14 (Conover, 1999), where \bar{D} is the mean of all D_i s.


\[ t = \frac{\bar{D}}{\sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} \left( D_i - \bar{D} \right)^2}} \tag{5.14} \]

Assumptions

The five assumptions for the paired t-test are provided below. The most stringent assumption for the test is the requirement of normally distributed random variables.

(1) “The D_i s are identically distributed normal random variables.” (Conover, 1999, pg. 363)
(2) “The distribution of each D_i is symmetric.
(3) The D_i s are mutually independent.
(4) The D_i s all have the same mean.
(5) The measurement scale of the D_i s is at least interval.” (Conover, 1999, pg. 353)

Example

To illustrate the paired t-test, table 24 provides an example of data from five Web sites (A-E). Within each Web site the median value for the metric being investigated is provided separately for the goal (X_i) and non-goal sessions (Y_i). In addition, the final column shows the difference (D_i) between the goal and non-goal sessions; the tabulated values correspond to X_i − Y_i, so a positive difference favors the hypothesis that X is greater than Y.

Table 24: Example T-test Metric Testing Dataset

Web site   Goal Sessions (X_i)   Non-goal Sessions (Y_i)   D_i
A          3.75                  2.15                      1.60
B          7.15                  4.35                      2.80
C          12.20                 13.40                     −1.20
D          4.75                  4.75                      0.00
E          7.50                  5.90                      1.60

Using equation 5.14, the t-statistic for the given data is 1.3720 (with four degrees of freedom), and a p-value of 0.121 (assuming a hypothesis that X is greater than Y).
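As a check on the arithmetic, equation 5.14 can be evaluated directly on the five-site example. This is a minimal sketch: the function name is hypothetical, differences are taken as goal minus non-goal to match the tabulated D_i values, and the p-value is not computed because that would require the t distribution's cumulative function.

```python
from math import sqrt


def paired_t(goal, non_goal):
    """Paired t statistic (equation 5.14) with differences taken as X - Y."""
    d = [x - y for x, y in zip(goal, non_goal)]
    n = len(d)
    mean_d = sum(d) / n
    squared_dev = sum((di - mean_d) ** 2 for di in d)
    return mean_d / sqrt(squared_dev / (n * (n - 1)))


goal_medians = [3.75, 7.15, 12.20, 4.75, 7.50]       # X_i for sites A-E
non_goal_medians = [2.15, 4.35, 13.40, 4.75, 5.90]   # Y_i for sites A-E

t = paired_t(goal_medians, non_goal_medians)
print(round(t, 4), "with", len(goal_medians) - 1, "degrees of freedom")
```

The mean difference is 0.96 and the sum of squared deviations is 9.792, which gives the 1.3720 reported in the example.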


Exact Wilcoxon Signed Rank Test

The exact Wilcoxon Signed Rank Test (Wilcoxon, 1945) is a non-parametric test that determines if paired observations have the same mean as one another (Conover, 1999). Each Web site is ranked according to the absolute difference between its median values for goal and non-goal sessions on the given metric. The ranks of all Web sites with a positive difference are then added up to obtain the test statistic V (Dalgaard, 2008).

In the data there are m pairs of X and Y observations (X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m) (Conover, 1999). For each observation pair, the difference D_i is calculated between X_i and Y_i, where D_i = Y_i − X_i. Any observation pair with a difference of zero is removed from the analysis (i.e., when D_i = 0). A total of n observation pairs then remain, where n ≤ m.

The n remaining observation pairs are then ranked from 1 to n, where R_i is the rank of the i-th observation pair. Observation pairs are ranked from the smallest to the largest value of absolute difference (i.e., |D_i|). In cases where more than one observation pair shares the same absolute difference, the assigned rank is averaged amongst all tied pairs. For example, assume the rank being assigned was three and four observation pairs shared the next smallest absolute difference. The rank for all four pairs would then be 4.5 ((3 + 4 + 5 + 6) / 4).

After ranking, R_i takes the same sign as D_i (e.g., if D_i is negative then R_i is also negative). The ranks of all positive R_i s are then summed to obtain the test statistic V (Dalgaard, 2008). An exact p-value is then computed from V using the Shift-Algorithm (Streitberg and Röhmel, 1986) in R (R Development Core Team, 2008).

Assumptions

The exact Wilcoxon signed rank test has the same assumptions as the t-test, except it does not require identically distributed normal random variables. The four assumptions for the Wilcoxon test are listed below.

(1) “The distribution of each D_i is symmetric.
(2) The D_i s are mutually independent.
(3) The D_i s all have the same mean.
(4) The measurement scale of the D_i s is at least interval.” (Conover, 1999, pg. 353)
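The ranking procedure can be sketched on the five-site values of table 24. Two things here are assumptions rather than the source's method: differences are taken as goal minus non-goal (matching the tabulated values), and the exact p-value is obtained by brute-force enumeration of all sign assignments instead of the Shift-Algorithm, which is only practical for a handful of pairs.

```python
from itertools import product


def signed_ranks(goal, non_goal):
    """Signed average ranks of the non-zero differences (X - Y)."""
    # Rounding guards against float noise so that tied differences compare equal.
    d = [round(x - y, 10) for x, y in zip(goal, non_goal)]
    d = [di for di in d if di != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average = (i + 1 + j + 1) / 2  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return [r if di > 0 else -r for r, di in zip(ranks, d)]


def exact_p_greater(ranks, v):
    """P(V >= v) when every rank is positive or negative with equal chance."""
    magnitudes = [abs(r) for r in ranks]
    outcomes = list(product((0, 1), repeat=len(magnitudes)))
    hits = sum(
        1 for signs in outcomes
        if sum(m for m, s in zip(magnitudes, signs) if s) >= v
    )
    return hits / len(outcomes)


goal_medians = [3.75, 7.15, 12.20, 4.75, 7.50]
non_goal_medians = [2.15, 4.35, 13.40, 4.75, 5.90]

ranks = signed_ranks(goal_medians, non_goal_medians)
V = sum(r for r in ranks if r > 0)
print(ranks, V, exact_p_greater(ranks, V))
```

Of the 16 possible sign assignments for the four remaining sites, only two give a positive rank sum of at least 9, so the one-sided exact p-value is 2/16.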


Example

To illustrate the exact Wilcoxon signed rank test, table 25 provides an example of data from five Web sites (A-E). Within each Web site the median value for the metric being investigated is provided separately for the goal (X_i) and non-goal sessions (Y_i). The fourth column is the calculated difference (D_i) between the goal and non-goal sessions; the tabulated values correspond to X_i − Y_i. Since Web site D had a difference of 0.00, it was removed from any further analysis.

Table 25: Example Wilcoxon Metric Testing Dataset

Web site   Goal Sessions (X_i)   Non-goal Sessions (Y_i)   D_i     |R_i|   R_i
A          3.75                  2.15                      1.60    2.50    2.50
B          7.15                  4.35                      2.80    4.00    4.00
C          12.20                 13.40                     −1.20   1.00    −1.00
D          4.75                  4.75                      0.00    -       -
E          7.50                  5.90                      1.60    2.50    2.50

The fifth column shows the rankings of the four remaining Web sites according to the absolute value of D_i. Web site C was ranked first because it had the smallest value of |D_i| (1.20). The next smallest value of |D_i| was tied between Web sites A and E (1.60). Both Web sites were given a rank of 2.50 ((2 + 3) / 2). Finally, Web site B was given a rank of 4.00 since it had the largest value of |D_i|.

The final column displays the rankings of each Web site after taking into account the sign of D_i. Web site C's sign for R_i was switched to negative since D_i had a value less than zero. After adding all positively ranked Web sites, the test statistic (V) for this example was 9.00, with a p-value of 0.125 (assuming a hypothesis that X is greater than Y).

Dependent-samples Sign Test

The dependent-samples sign test is a non-parametric test that can also be used to test if there are differences between observations. Since the sign test has less stringent assumptions than many other non-parametric tests, it can be used in many more situations. For example, if the differences (D_i s) between observations were not symmetrical in the exact Wilcoxon signed rank test, the sign test could be used as an alternative. However, the sign test is generally less powerful than other non-parametric tests (Conover, 1999).

In the data there are m pairs of X and Y observations (X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m) (Conover, 1999). For each observation pair, a comparison is made. Assuming the purpose of the test is to determine if X > Y, then a pair is classified as “+” if X_i > Y_i, “-” if X_i < Y_i […] P(+) > P(-) for one pair (X_i, Y_i), then P(+) > P(-) for all pairs.” (Conover, 1999, pgs. 157-158)

Example

To illustrate the sign test, table 26 provides an example of data from five Web sites (A-E). Within each Web site the median value for the metric being investigated is provided separately for the goal (X_i) and non-goal sessions (Y_i). The final column provides the classification for each Web site (e.g., “+”, “-”, “0”). Since Web site D was classified as “0”, it was removed from any further analysis.

After counting all the Web sites classified as “+”, the test statistic (S) for this example was 3.00, with a p-value of 0.3125 (assuming a hypothesis that X is greater than Y).
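The sign test example can be reproduced in a few lines. This sketch assumes the standard null distribution for the sign test, in which the number of “+” pairs among the non-tied pairs is binomial with probability 0.5; the function name is hypothetical.

```python
from math import comb


def sign_test_greater(goal, non_goal):
    """One-sided sign test for X > Y; ties are dropped before counting."""
    plus = sum(1 for x, y in zip(goal, non_goal) if x > y)
    minus = sum(1 for x, y in zip(goal, non_goal) if x < y)
    n = plus + minus
    # P(S >= plus) when S follows Binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(plus, n + 1)) / 2 ** n
    return plus, p_value


goal_medians = [3.75, 7.15, 12.20, 4.75, 7.50]
non_goal_medians = [2.15, 4.35, 13.40, 4.75, 5.90]

S, p = sign_test_greater(goal_medians, non_goal_medians)
print(S, p)
```

With three “+” sites out of four non-tied sites, the p-value is (4 + 1) / 16, which matches the 0.3125 in the example.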


Table 26: Example Sign Test Metric Testing Dataset

Web site   Goal Sessions (X_i)   Non-goal Sessions (Y_i)   Classified
A          3.75                  2.15                      +
B          7.15                  4.35                      +
C          12.20                 13.40                     -
D          4.75                  4.75                      0
E          7.50                  5.90                      +

5.4 Conclusion

The preceding subsections presented the methodology for both the user- and site-centric clickstream models of information foraging. First, a description was provided for each model's data sample and how the measures for each hypothesis were calculated. Finally, the three statistical tests used to test each hypothesis were presented.

5.A Clickstream Complexity Appendix

The clickstream complexity metrics compactness and stratum were originally developed by Botafogo et al. (1992) to assist in the design of hypertext document collections (i.e., Web sites). The metrics were meant to quantify the complexity and connectedness of Web pages within a Web site. Compactness dealt with how well connected Web pages were to one another, where high compactness meant most pages had links to most other pages. Stratum was concerned with the degree of linearity in which Web pages must be read. High stratum occurred if a structured order existed in which Web pages must be read one after another.

McEneaney (2001) extended the work of Botafogo et al. (1992) by adapting the compactness and stratum metrics to be useful for quantifying users' paths. This section details how compactness and stratum can be calculated from a user's clickstream. Although only the stratum metric is used in this research, the compactness metric is explained for completeness. First, an example clickstream for two users is presented. Then the steps to convert a user's clickstream to a directed graph, path matrix, distance matrix, and finally converted distance matrix are explained. Finally, equations are presented to calculate compactness and stratum from the converted distance matrix.


5.A.1 Example Clickstreams

Figure 29 illustrates clickstreams for two separate visitors, V1 and V2, using a Web behavior graph. Both foragers visited seven pages with four of those pages being distinct. The path of the first visitor (V1) is <…> whereas the path for the second visitor (V2) is <…>.

Figure 29: Site-centric: Example Clickstream Web Graphs. Panels: (a) Visitor V1, (b) Visitor V2.

Table 27 lists the compactness and stratum values for each of the visitors. Visitor V1 shows a moderately connected clickstream (compactness) because page P2 links to two distinct pages and is linked from three distinct pages. The linearity of V1's clickstream (stratum) is moderate since the path taken does not end where it began. The second visitor (V2) has an even more densely connected clickstream than V1 since many of the pages are linked to more than one other page. In contrast to the first visitor, however, V2 has a much less linear clickstream. Although V2 appears to do very little backtracking to a single page, stratum is low since the path finished where it began [21].

Table 27: Site-centric: Example Visitor Clickstream Complexity Metrics

Visitor   Compactness   Stratum
V1        0.6389        0.6250
V2        0.7500        0.1250

[21] The value for stratum would change dramatically if visitor V2 had visited page P4 instead of P1 as the last page of the path. With a path of <…>, the compactness and stratum values would be 0.5833 and 0.7500, respectively.


5.A.2 Graph Theory

Compactness and stratum are calculated by using concepts from graph theory (McEneaney, 2001). To graphically depict a clickstream, it can be converted into a directed graph. A directed graph consists of a set of nodes and directed links between the nodes. The nodes of a graph are the distinct Web pages viewed by a forager, while the links between nodes represent the transitions of a user from one page to another.

For example, figure 30a is a directed graph created from the first visitor's clickstream. The figure has four nodes representing each of the distinct pages visited. A single-headed arrow means the forager traveled from one node to another. A double-headed arrow represents a user traveling from one page to another and then back again. Of note, the sequence of the clickstream, along with multiple traversals of the same path, is lost when converting a clickstream to a directed graph.

Figure 30: Site-centric: Example Clickstream Graph and Matrices

(a) Directed Graph (nodes P1, P2, P3, P4)

(b) Path Matrix

From \ To   P1   P2   P3   P4
P1          0    1    0    0
P2          0    1    1    1
P3          0    1    0    0
P4          0    1    0    0

(c) Distance Matrix

From \ To   P1   P2   P3   P4
P1          0    1    2    2
P2          ∞    0    1    1
P3          ∞    1    0    2
P4          ∞    1    2    0

(d) Converted Distance Matrix

From \ To   P1   P2   P3   P4
P1          0    1    2    2
P2          4    0    1    1
P3          4    1    0    2
P4          4    1    2    0

A way to represent the same information as the directed graph and allow for calculations is via a path matrix. A path matrix has each of the nodes as column and row headings. Each of the elements of a path matrix represents the number of transitions from one node to another. Initially, all elements in the matrix have values of zero. For each pair of nodes visited, the count at the intersection of those nodes in the matrix is increased. After processing all node pairs, any elements in the matrix with values greater than one are then set to one in order to create a “path adjacency matrix” (McEneaney, 2001, pg. 770).
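The counting-then-capping construction of the path adjacency matrix can be sketched as follows. The clickstream used here is hypothetical (seven page views over four distinct pages) and is not the path of visitor V1; the function name is also an assumption.

```python
def path_adjacency(clickstream):
    """Build a path adjacency matrix (rows = from, columns = to)."""
    nodes = sorted(set(clickstream))
    index = {page: i for i, page in enumerate(nodes)}
    counts = [[0] * len(nodes) for _ in nodes]

    # Count every transition between consecutive page views.
    for src, dst in zip(clickstream, clickstream[1:]):
        counts[index[src]][index[dst]] += 1

    # Cap counts at one so the matrix records only whether a link was used.
    adjacency = [[1 if c > 0 else 0 for c in row] for row in counts]
    return nodes, adjacency


# Hypothetical clickstream, for illustration only.
clicks = ["P1", "P2", "P3", "P2", "P4", "P2", "P3"]
nodes, adjacency = path_adjacency(clicks)
for page, row in zip(nodes, adjacency):
    print(page, row)
```

The repeated P2 to P3 transition is counted twice and then capped at one, which is the step that discards traversal counts and ordering.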


Figure 30b illustrates the path matrix generated from the first visitor's clickstream. The value of one in the first row and second column represents the forager traveling from node P1 to node P2. In the next column over, the zero element means the user never went from node P1 to node P3. Figures 30a and 30b convey the same information in two different formats.

Using the path matrix, a distance matrix can then be created. The elements of a distance matrix represent the minimum distance (in terms of hops) between two nodes. The minimum distance between nodes is determined by using the shortest path algorithm by Floyd (1962). Unreachable paths between nodes are represented by the infinity symbol (∞).

An example of a distance matrix from the first visitor's clickstream is shown in figure 30c. When going from node P1 to P3 there are two hops which must take place (P1 to P2 and P2 to P3), and thus the element has a value of two. Going from node P2 to P1 is set to infinity because only a path from node P1 to P2 exists, not one from node P2 to P1.

Since it is inconvenient to calculate the complexity metrics using infinite values, the distance matrix must be converted (Botafogo et al., 1992). Following Botafogo et al. (1992), all infinite values are replaced with the number of distinct nodes from the path matrix. In figure 30d the infinite element values are all replaced with the number four.

5.A.3 Compactness

Compactness is calculated according to equation 5.15 (McEneaney, 2001). ‖N‖ is the number of distinct nodes in the user's clickstream. C is the converted distance matrix and C_ij refers to the element in the i-th row and j-th column. Σ_i Σ_j C_ij simply sums all the elements from the converted distance matrix. The value of compactness ranges from zero to one, with values closer to one indicating a more densely connected and thus more complex clickstream (McEneaney, 2001).

\[ COMPACTNESS = \frac{\|N\|^2 (\|N\| - 1) - \sum_i \sum_j C_{ij}}{\|N\| (\|N\| - 1)^2} \tag{5.15} \]

5.A.4 Stratum

Stratum is calculated according to equation 5.16 (Botafogo et al., 1992). AP and LAP both refer to equations more fully explained below. Values for stratum can range from zero to one, with values close to one indicating a more linear path and thus a less complex clickstream (McEneaney, 2001).


\[ STRATUM = \frac{AP}{LAP} \tag{5.16} \]

Absolute prestige (AP) is the net status of a node within a hypertext network and is calculated according to equation 5.17 (Botafogo et al., 1992). S_i and CS_i refer to the status and contrastatus of a node. Status and contrastatus were originally developed for use in social network theory (SNT) (Harary, 1959). In SNT, status referred to the number of subordinates assigned to a person, whereas contrastatus was the number of superiors a person had. The same basic idea of status and contrastatus was adopted by Botafogo et al. (1992) for the stratum metric.

\[ AP = \sum_i \left| S_i - CS_i \right| \tag{5.17} \]

The status of a node (S) shown in equation 5.18 is the number of other nodes which link from the node of interest (Senecal et al., 2005). Status is the sum of all non-infinite elements (e.g., C_ij < ‖N‖) in a node's row from the converted distance matrix C.

\[ S_i = \sum_j \begin{cases} C_{ij} & \text{if } C_{ij} < \|N\| \\ 0 & \text{otherwise} \end{cases} \tag{5.18} \]

Contrastatus (CS) is the number of nodes which link to the node of interest and is calculated according to equation 5.19 (Senecal et al., 2005). Contrastatus is the sum of all non-infinite elements (e.g., C_ij < ‖N‖) in a node's column from the converted distance matrix C.

\[ CS_j = \sum_i \begin{cases} C_{ij} & \text{if } C_{ij} < \|N\| \\ 0 & \text{otherwise} \end{cases} \tag{5.19} \]

Finally, equation 5.20 contains the formula for calculating the linear absolute prestige (LAP), which normalizes the size of the network for the stratum metric (Botafogo et al., 1992).

\[ LAP = \begin{cases} \frac{\|N\|^3}{4} & \text{if } \|N\| \text{ is even} \\ \frac{\|N\|^3 - \|N\|}{4} & \text{if } \|N\| \text{ is odd} \end{cases} \tag{5.20} \]
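The chain from path matrix to compactness and stratum can be checked against visitor V1's values in table 27. This is a sketch under stated assumptions: the function names are hypothetical, status and contrastatus are implemented as the row and column sums of the non-infinite elements as the text describes, and the diagonal of the path matrix is ignored because a node is at distance zero from itself.

```python
INF = float("inf")


def distance_matrix(adjacency):
    """All-pairs shortest hop counts (Floyd's algorithm); INF if unreachable."""
    n = len(adjacency)
    dist = [[0 if i == j else (1 if adjacency[i][j] else INF) for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def convert(dist):
    """Replace infinite distances with the number of distinct nodes."""
    n = len(dist)
    return [[n if d == INF else d for d in row] for row in dist]


def compactness(c):
    """Equation 5.15; requires at least two nodes."""
    n = len(c)
    total = sum(sum(row) for row in c)
    return (n * n * (n - 1) - total) / (n * (n - 1) ** 2)


def stratum(c):
    """Equations 5.16 to 5.20; requires at least two nodes."""
    n = len(c)
    status = [sum(v for v in row if v < n) for row in c]
    contrastatus = [sum(c[i][j] for i in range(n) if c[i][j] < n) for j in range(n)]
    ap = sum(abs(s - cs) for s, cs in zip(status, contrastatus))
    lap = n ** 3 / 4 if n % 2 == 0 else (n ** 3 - n) / 4
    return ap / lap


# Path matrix of figure 30b (rows = from, columns = to).
path_matrix = [[0, 1, 0, 0], [0, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]]
C = convert(distance_matrix(path_matrix))
print(C)
print(round(compactness(C), 4), round(stratum(C), 4))
```

For this matrix the elements of C sum to 25, giving a compactness of 23/36, and the absolute prestige is 10 against a linear absolute prestige of 16, giving a stratum of 0.625.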


5.B Learning Patches and Scent Trails Appendix

This appendix is concerned with answering the first two research questions about how to learn information patches and scent trails [22]. Since the methodologies for learning patches and scent trails are very similar to one another, the entire methodology is written in §5.B.1 from the viewpoint of learning patches. §5.B.2 provides a discussion of how the methodology differs for learning scent trails.

Research Question 1: How can information patches be learned from a long tail Web site?

Research Question 2: How can information scent trails be learned from a long tail Web site?

5.B.1 Information Patches

An information patch is defined as an area of the search environment with similar information (Pirolli, 2007). Within a Web context, what constitutes a patch is dependent on the level of analysis being examined. At a high level of analysis, an entire Web site can be considered a patch. When examined from a lower level of analysis, each individual page of a Web site can also be considered a patch. While such conceptualizations of a patch are straightforward, they are effectively being defined by the creator of the content rather than the user.

The Web, however, is a pliable environment where foragers have the choice of what material to view. Effectively, this allows a forager to define their own information patch that is uniquely relevant to their goal. Such patches may consist of a group of Web pages, which individually may mean very little, but when combined provide an area of the search environment that is seen as valuable to the user.

Although each user is free to define patches as they see fit, certain patterns of patches may emerge among foragers with similar information goals. From the viewpoint of the online firm, knowing who values what patch can provide insights into the information goal of the forager. By categorizing a patch as valuable to goal-achievers or non-goal-achievers, the firm may be able to better explain goal achievement at long tail sites dependent on what patches are visited by a user.

[22] Although mentioned together, the learning of information patches and the learning of scent trails are done separately from one another.


Learning Information Patches

An information patch is either a single Web page or a set of Web pages that collectively provide information for an individual [23]. From the perspective of the online firm, a patch is defined as valuable if it can distinguish between visitor sessions which result in a goal being achieved for the firm (e.g., a user purchases, or fills out a contact form), versus those that do not.

This section details how valuable information patches are learned. The first subsection provides the definition of a contrast set, which is used to discover patches. The methodology of learning patches using clickstream data and contrast sets is then outlined in the next subsection. The third subsection describes how patches are deemed to be valuable or not depending on their ability to significantly distinguish between goal-achievers and non-goal-achievers. Finally, an alternative definition of contrast sets is given, which does not require patches to be statistically significant between groups to be considered valuable.

Contrast Sets

From the data mining literature, contrast sets are a way to find differences between groups (Bay and Pazzani, 1999). A contrast set is a combination of attributes and their values which differ in support amongst separate groups (Bay and Pazzani, 1999). Let there be k attributes A (A_1, A_2, ..., A_k), where A_i can have one of m values (V_i1, V_i2, ..., V_im). A contrast set is a conjunction of attributes defined for n groups (G_1, G_2, ..., G_n) (Bay and Pazzani, 1999). For example, a contrast set may be (PageA = 1) ∧ (PageC = 1), where the attributes represent Web pages and a value of “1” signifies a page was visited. Support in a group is defined as the percentage of instances where the contrast set is true within the group (Bay and Pazzani, 1999). The support from the previous example may be 5% for goal sessions and 17% for non-goal sessions.
Apotentialcontrastset(PCS)isonewherethecontrastset( cset )issufcientlylargeinatleast oneofthegroups,wherelargenessishavingasupportgreate rthanorequaltoaspeciedminimumsupport( minSup ).Formally,aPCSbetweentwogroupsisonethatsatisesthe condition: max ( support ( cset;G 1 ) ;support ( cset;G 2 )) minSup (adaptedfromSatsangiandZaiane (2007)).Asignicantcontrastset(SCS)isaPCSthatalsome etsthesignicancecondition.Formally,acontrastsetissignicantbetweentwogroupsif P ( cset j G 1 ) 6 = P ( cset j G 2 ) ataspecied 23 ThisappendixdoesnotexamineanentireWebsiteasapatchan dthusthegeneralterm“patch”onlyreferstoa singleWebpageoragroupofWebpages. 118

PAGE 132

alpha level (adapted from Bay and Pazzani (1999)). A SCS is hence a valuable information patch since it represents the fact that the set of pages tends to be visited more in one group than another. Therefore, the presence or absence of a visitor within such a patch may signal an expected goal outcome for the firm.

Discovering Patches

To discover valuable patches, contrast sets were found where the attributes of the set consisted of the distinct pages visited by a user during their session at a Web site. Certain pages, however, were not included in the analysis since their visitation was a requirement for or consequence of achieving a goal (e.g., contact form, form submission, and thank you pages). In addition, only pages occurring before a form submission were included in the analysis.

Each Web site contained sessions where either a goal was achieved during a session or not. Sessions were separated according to their achievement and placed into Web site $i$'s goal dataset ($D_{Gi}$) or non-goal dataset ($D_{Ni}$). Each Web site had $N_i$ sessions. $N_{Gi}$ and $N_{Ni}$ are denoted to correspond to the sizes of datasets $D_{Gi}$ and $D_{Ni}$ for Web site $i$, respectively.

Frequent itemsets were discovered from each Web site's datasets using the MAFIA (Maximal Frequent Itemsets Algorithm) (Burdick et al., 2001) algorithm [24]. The algorithm was run separately on $D_{Gi}$ and $D_{Ni}$ for each Web site and resulted in a set of frequent itemsets $I_{Gi}$ and $I_{Ni}$ [25]. The minimum support was set to 0.10. A frequent itemset is a potential contrast set.

Figure 31 is an example of frequent itemsets mined from a Web site with three Web pages (A, B, and C) (assuming a minSup of 10%). On the left-hand side of the figure are the itemsets discovered from the goal dataset ($D_{Gi}$), whereas the itemsets from the non-goal dataset ($D_{Ni}$) are on the right-hand side. The itemsets are arranged in a lattice by level according to their size (i.e., how many pages are in the itemset). Lines are drawn between itemsets to show their relation to other itemsets. To the right of each itemset in parentheses is the count of support for the itemset. The empty itemset at level 0 represents the entire dataset.

[24] An implementation of the algorithm can be found at http://himalaya-tools.sourceforge.net/Mafia/. Version 1.4 was used in this research.

[25] The discovery of frequent itemsets in datasets separated by goal is similar to discovering rules following the form $\{pages\} \rightarrow G$ as done in Satsangi and Zaiane (2007), where $\{pages\}$ is a set of distinct pages and G is the group goal or non-goal. However, when the size between groups is imbalanced, finding frequent itemsets in the minority group may become impossible using a combined dataset. For example, the minority group would not be able to find any frequent itemsets (at a minimum support of 10%) if the majority group had 10 times more records than the minority group. Mining frequent itemsets separately does not suffer from this class imbalance limitation.
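The class imbalance limitation described in footnote 25 can be illustrated with a minimal sketch. The brute-force miner below is not MAFIA; it simply enumerates every page combination, which is only workable for a handful of pages. The function name `frequent_itemsets` and the toy sessions are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_sup=0.10):
    """Return {itemset: support count} for itemsets meeting min_sup.

    sessions: list of sets, each holding the distinct pages of one session.
    Brute-force enumeration (an assumption for illustration, not MAFIA).
    """
    n = len(sessions)
    pages = sorted(set().union(*sessions))
    found = {}
    for size in range(1, len(pages) + 1):
        for combo in combinations(pages, size):
            items = frozenset(combo)
            count = sum(1 for s in sessions if items <= s)
            if n and count / n >= min_sup:
                found[items] = count
    return found


# Hypothetical site: 5 goal sessions versus 100 non-goal sessions.
goal = [{"A", "C"}] * 4 + [{"B"}]
non_goal = [{"B"}] * 60 + [{"A", "B"}] * 40

separate = frequent_itemsets(goal)             # mined on the goal dataset only
combined = frequent_itemsets(goal + non_goal)  # mined on one combined dataset

print(frozenset({"A", "C"}) in separate)   # itemset visible in the goal data
print(frozenset({"A", "C"}) in combined)   # hidden by the majority group
```

In this toy data the itemset {A, C} is supported by 4 of the 5 goal sessions, yet it covers only 4 of the 105 combined sessions and so falls below the 10% minimum support when the groups are pooled.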


Figure 31: Site-centric: Example Itemsets by Dataset (support counts in parentheses)

Level   Goal Dataset                      Non-goal Dataset
0       {} (53)                           {} (1,784)
1       {A} (47), {B} (28), {C} (34)      {A} (1,109), {B} (987), {C} (687)
2       {A,B} (25), {A,C} (32)            {A,B} (378), {B,C} (597)
3       {A,B,C} (17)

Potential contrast sets were formed starting with the lowest-level frequent itemsets found in either dataset and then continuing on to higher-level itemsets. To evaluate a PCS, a contingency table was needed that was populated with the amount of support and non-support for the PCS's itemset from each dataset. When the itemset for a PCS was found in both $I_{Gi}$ and $I_{Ni}$, then the contingency table was created according to table 28, where $S_{Gij}$ ($S_{Nij}$) is the count of support for itemset $j$ from Web site $i$ in the goal (non-goal) dataset.

Table 28: Site-centric: Example Contingency Table for a Potential Contrast Set

            Support Count for Itemset j    Non-Support Count for Itemset j
$D_{Gi}$    $S_{Gij}$                      $\bar{S}_{Gij} = N_{Gi} - S_{Gij}$
$D_{Ni}$    $S_{Nij}$                      $\bar{S}_{Nij} = N_{Ni} - S_{Nij}$

When the itemset was missing from one of the datasets (i.e., it was not frequent), then the count of support and non-support was unknown. In such a case the support frequency for the contingency table ($S_{Gij}$ or $S_{Nij}$) was calculated (Satsangi and Zaiane, 2007) according to the supCount formula: $supCount = round(N \times minSup)$, where $N$ is $N_{Gi}$ or $N_{Ni}$ and minSup is minimum support. supCount represents a generous count of support for an itemset, which results in a PCS being conservatively evaluated. When an itemset is not frequent it is unknown how much less than minSup its support really is. Using a support count of zero in place of supCount would underestimate the importance of an itemset, which may lead to a Type I error when evaluating the PCS. Therefore, a support count equivalent to minSup is used which over-estimates the importance of an itemset and hence lowers the chance of a Type I error occurring when the PCS is evaluated. However, this method does increase the probability of a Type II error occurring.

To illustrate, assume minSup was 10% and the support for a PCS's itemset was 9.9% in the goal dataset and 45.0% in the non-goal dataset. Since minimum support for the itemset from the goal dataset was not met, the true support would be unknown. If a support of zero were used when populating the contingency table for the PCS, the importance of the itemset in the goal dataset would be understated by 9.9%. In other words the difference in support between datasets would be 9.9% more than it actually was (45.0% versus 35.1%). Setting the support to minSup would instead over-estimate the importance of the itemset by 0.1%. Here the difference in support would be 0.1% less than it actually was (35.0% versus 35.1%).

Table 29 presents three examples of potential contrast sets from the second level of figure 31. The first column shows the itemset used for the PCS (i.e., the Web pages that make up the patch). Columns two through five are in reference to the goal dataset and list if the itemset was found to be frequent, number of goal sessions, support for the itemset (with support percentage in parentheses), and non-support for the itemset. Columns six through nine have the same meaning as columns two through five except they refer to the non-goal dataset.

Table 29: Site-centric: Example Potential Contrast Sets

         Goal                                                     Non-Goal
PCS      Found?  $N_{Gi}$  $S_{Gij}$    $\bar{S}_{Gij}$           Found?  $N_{Ni}$  $S_{Nij}$     $\bar{S}_{Nij}$
{A,B}    Yes     53        25 (47.2%)   28                        Yes     1,784     378 (21.2%)   1,406
{A,C}    Yes     53        32 (60.4%)   21                        No      1,784     178 (10.0%)   1,606
{B,C}    No      53        5 (9.4%)     48                        Yes     1,784     597 (33.5%)   1,187

The first example shows a PCS for itemset {A,B}. Since the itemset was found in both datasets the support from each dataset is known. The second PCS (itemset {A,C}) illustrates an example where the itemset of interest is found in the goal dataset but not in the non-goal dataset. Therefore, the supCount formula was used (with a minSup of 10%) to calculate the support count ($S_{Nij}$). The final PCS (itemset {B,C}) shows the opposite situation where the itemset was not frequent in the goal dataset but it was in the non-goal dataset. $S_{Gij}$ would therefore be calculated using the supCount formula [26].

[26] The support percentage is 9.4% instead of 10.0% due to rounding in the supCount formula.

Determining Patch Value

The significance of each potential contrast set was then calculated using Fisher's exact test (Conover, 1999). Although prior research has used the chi-square test for independence to determine significance (Bay and Pazzani, 1999), the approximation of $\chi^2$ may suffer when the expected value of at least 20% of the cells in the contingency table are below five or any one expected value is less than one (Cochran, 1954). When considering goal achievement on long tail Web sites, the counts in the contingency table are often too small or too imbalanced in their distribution for the chi-square test to adequately approximate $\chi^2$. Thus, Fisher's exact test, which makes no such approximation, was used instead.

When testing multiple hypotheses, such as in the situation of testing each potential contrast set, the familywise error rate (FWER) should be controlled. The FWER is a measure in statistics that refers to the probability of committing at least one Type I error. A common method of dealing with the FWER is to fix the alpha across all tests.

For example, with an expected familywise error rate of $\alpha$, the alpha level for each individual potential contrast set ($\alpha_{ind}$) would be fixed using a Bonferroni procedure: $\alpha_{ind} = \alpha / NC$, where $NC$ is the number of PCSs being tested. The disadvantage of such an approach is the same alpha level is used regardless of the PCS's itemset size. This results in a loss of power and ability to detect differences in even the most general PCSs which use lower-level itemsets (Bay and Pazzani, 1999).

To combat such a loss of power, a different alpha level was used for each level of the itemset lattice. The purpose of such a change in $\alpha$ was to distribute “...1/2 of the total $\alpha$ to tests at level 1, 1/4 to tests at level 2, and so on” (Bay and Pazzani, 1999, pg. 304). This results in greater power being available to test the most general PCSs (i.e., those with itemsets from the lowest levels).
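The construction of a contingency table with the supCount fallback, followed by Fisher's exact test, can be sketched as below using the rows of table 29. This is a minimal pure-Python illustration under stated assumptions, not the software used in the dissertation: the function names are hypothetical, the two-sided p-value follows the common convention of summing all tables whose probability does not exceed that of the observed table, and Python's `round` resolves exact halves to the nearest even integer, which may differ from the rounding convention behind the supCount formula.

```python
from math import comb


def sup_count(n, min_sup=0.10):
    """Generous support count for an itemset that was not frequent."""
    return round(n * min_sup)  # note: Python rounds exact .5 to even


def contingency(n_g, s_g, n_n, s_n, min_sup=0.10):
    """2x2 table [[S_G, N_G - S_G], [S_N, N_N - S_N]].

    A support count of None means the itemset was not frequent in that
    dataset, so the supCount fallback is applied.
    """
    s_g = sup_count(n_g, min_sup) if s_g is None else s_g
    s_n = sup_count(n_n, min_sup) if s_n is None else s_n
    return [[s_g, n_g - s_g], [s_n, n_n - s_n]]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))  # tolerance for float ties
    return min(1.0, total)


# Rows of table 29: (itemset, goal support, non-goal support).
rows = [("{A,B}", 25, 378), ("{A,C}", 32, None), ("{B,C}", None, 597)]
for name, s_g, s_n in rows:
    table = contingency(53, s_g, 1784, s_n)
    print(name, table, fisher_exact_two_sided(table))
```

The tables produced for {A,C} and {B,C} reproduce the supCount values of 178 and 5 shown in table 29.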


Equation 5.21 (Bay and Pazzani, 1999) was used to determine the alpha level ($\alpha_l$) for testing all PCSs at a specified level. In the equation, $\alpha$ is the expected familywise error rate, $l$ is the level, and $C_l$ is the number of candidate PCSs being tested at level $l$. The purpose of the min function was to ensure that $\alpha$ becomes more stringent with each subsequent level. Using an $\alpha$ of 0.01, a potential contrast set was deemed significant if its p-value from Fisher's exact test was less than or equal to $\alpha_l$.

$$\alpha_l = \min\left(\frac{\alpha / 2^l}{C_l},\ \alpha_{l-1}\right) \qquad (5.21)$$

If a PCS was found to be significant, then the patch it represents (i.e., itemset) was deemed valuable. A valuable patch which was predominately visited by users from the goal group was known as a goal patch and placed in the set $P_{Gi}$, whereas visitation mostly from the non-goal group resulted in a patch being labeled as a non-goal patch and being placed in the set $P_{Ni}$. Formally, a patch was deemed a goal patch if $\frac{S_{Gij}}{N_{Gi}} > \frac{S_{Nij}}{N_{Ni}}$, and a non-goal patch otherwise [27]. By classifying patches in such a manner, a visitor may signal expected goal outcome to the firm via the presence or absence of patch visitation.

[27] Only goal patches were used in the analysis of this dissertation.

Supported Contrast Sets

As an alternative to significant contrast sets, a supported contrast set was a potential contrast set that had a difference in support above a user-defined threshold (thresh). Formally, a supported contrast set between two groups was one that satisfies the condition: $difference(support(cset, G_1), support(cset, G_2)) \geq thresh$, where $difference(S_{Gij}, S_{Nij})$ is defined in equation 5.22 (Yang and Padmanabhan, 2003). If a PCS met or exceeded the threshold support condition then the patch was considered valuable. The classification as either a goal or non-goal patch was done in the same manner as with significant contrast sets.

$$difference(S_{Gij}, S_{Nij}) = \frac{\left|\frac{S_{Gij}}{N_{Gi}} - \frac{S_{Nij}}{N_{Ni}}\right|}{\frac{1}{2}\left(\frac{S_{Gij}}{N_{Gi}} + \frac{S_{Nij}}{N_{Ni}}\right)} \qquad (5.22)$$

The purpose of defining a supported contrast set was because finding statistical significance may be difficult when many PCSs exist. For example, assume 100 potential contrast sets existed on level 1 and the expected familywise error rate was set to 0.01. In order for a potential contrast set to be significant, its p-value would need to be lower than 0.00005. Therefore, the supported definition of a contrast set was used to discover contrast sets which may be important, but fail to reach significance.

5.B.2 Scent Trails

Information scent is the driving force behind why a person makes a navigational selection amongst a group of competing options. As foragers are assumed to be rational, scent is a mechanism by which foragers reduce their search costs by increasing their accuracy on which option leads to the information of value (Pirolli, 2007). Based on the information goal of a forager, each hyperlink on a Web page gives off a scent. The higher the scent the more likely the page that is being linked to may contain the information being sought. Similar to a bloodhound that follows a scent trail over distances to find an item of interest, a forager also follows a scent trail to find the information they seek over multiple Web pages.

Although each user follows a scent trail that fits with their information goal, patterns from fragments of scent trails may exist that emerge among foragers with similar information goals. Like patches, these fragments of scent trails are of value to the online firm in distinguishing between possible goal-achievers and non-goal-achievers. When a user follows these known fragments of scent trails it may provide clues into their information goal and thus help in explaining goal achievement at long tail sites.

Learning Scent Trails

A scent trail is the path a forager travels upon by following the information scent of links. More specifically, a scent trail is a set of pages in a specified order. To discover valuable scent trails, contrast sets were found where the attributes of the set consisted of an ordered set of pages (i.e., a sequential pattern) visited by a user during their session at a Web site.

Frequent sequential patterns were discovered from each Web site's datasets using the SPAM (Sequential PAttern Mining) algorithm (Ayres et al., 2002) [28]. The algorithm was run separately on $D_{Gi}$ and $D_{Ni}$ for each Web site and resulted in a set of frequent sequential patterns $P_{Gi}$ and $P_{Ni}$. A frequent sequential pattern is a potential contrast set.

[28] An implementation of the algorithm can be found at http://himalaya-tools.sourceforge.net/Spam/. Version 1.3.3 was used in this research.

Figure 32 is an example of frequent sequential patterns mined from a Web site with two Web pages (A and B) (assuming a minSup of 10%). The figure only shows patterns found frequent starting with Web page A. On the top part of the figure are the patterns discovered from the goal dataset ($D_{Gi}$), whereas the patterns from the non-goal dataset ($D_{Ni}$) are on the bottom part of the figure. The patterns are arranged in a lattice by level according to their size (i.e., how many Web pages are in the pattern). Lines are drawn between patterns to show their relation to other patterns. To the right of each pattern in parentheses is the count of support for the pattern. The empty pattern at level 0 represents the entire dataset.

An example of a potential contrast set with a sequential pattern from the third level of figure 32 is $\langle A, A, B \rangle$. This pattern means page A was visited two times before page B was visited. However, the visitation of these pages need not be right after one another. There may be many other pages that were visited in between each page. For example, a session with seven page views that visited pages C, D, and E before visiting page A again and then page B would still support the pattern.

If a PCS was found to be significant, then the scent trail it represented (i.e., sequential pattern) was deemed valuable. A valuable scent trail which was predominately followed by users from the goal group was known as a goal scent trail and placed in the set $T_{Gi}$, whereas following mostly from the non-goal group resulted in a scent trail being labeled as a non-goal scent trail and being placed in the set $T_{Ni}$. Formally, a scent trail was deemed a goal scent trail if $\frac{S_{Gij}}{N_{Gi}} > \frac{S_{Nij}}{N_{Ni}}$, and a non-goal scent trail otherwise [29].

For a scent trail to be considered valuable by way of support (i.e., a supported contrast set), then the PCS must have met or exceeded the threshold support condition. The classification as either a goal or non-goal scent trail was done in the same manner as with significant contrast sets.

[29] Only goal scent trails were used in the analysis of this dissertation.
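Whether a session supports a sequential pattern when other pages may intervene can be checked with a short subsequence test. The sketch below is an illustration only and is not the SPAM algorithm; the function names are hypothetical, and the pattern used (page A twice, then page B) is the reading of the third-level example given in the text.

```python
def contains_pattern(session, pattern):
    """True if the pages of `pattern` occur in `session` in order.

    Other pages may appear between the matched pages; each `in` test
    consumes the iterator, so matches are forced to move forward.
    """
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def pattern_support(sessions, pattern):
    """Count of sessions that support the pattern."""
    return sum(contains_pattern(s, pattern) for s in sessions)


# Hypothetical sessions, each an ordered list of page views.
sessions = [
    ["A", "C", "D", "E", "A", "B"],  # A ... A ... B with pages in between
    ["A", "A", "B"],                 # pattern with no gaps
    ["A", "B", "A"],                 # second A comes after B
    ["B", "A", "A"],                 # B comes before both visits to A
]
pattern = ("A", "A", "B")
print(pattern_support(sessions, pattern))
```

Only the first two sessions support the pattern, because the order of the pages matters while the gaps between them do not.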


[Figure 32: Site-centric: Example Patterns by Dataset, adapted from Ayres et al. (2002). Support counts by lattice level; the pattern labels other than the empty pattern <> are not legible.
(a) Frequent Sequential Patterns from Goal Dataset. Level 0: <> (53). Level 1: (47). Level 2: (34), (28). Level 3: (25), (17), (8).
(b) Frequent Sequential Patterns from Non-Goal Dataset. Level 0: <> (1,784). Level 1: (1,109). Level 2: (378), (597). Level 3: (255), (421), (189).]
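The level-wise alpha of equation 5.21, the support difference of equation 5.22, and the goal versus non-goal classification rule can be sketched together as follows. The function names are hypothetical. Two assumptions are made: the first level has no preceding alpha and so uses only the first term of the min function, and the numerator of equation 5.22 is taken as an absolute difference.

```python
def level_alphas(candidates_per_level, alpha=0.01):
    """Equation 5.21: alpha_l = min((alpha / 2**l) / C_l, alpha_{l-1}).

    candidates_per_level maps each level l (1, 2, ...) to C_l, the number
    of candidate PCSs tested at that level (assumed to be at least 1).
    """
    alphas, previous = {}, None
    for level in sorted(candidates_per_level):
        value = (alpha / 2 ** level) / candidates_per_level[level]
        if previous is not None:      # assumption: no alpha_0 at level 1
            value = min(value, previous)
        alphas[level] = value
        previous = value
    return alphas


def difference(s_g, n_g, s_n, n_n):
    """Equation 5.22, with the numerator assumed to be an absolute value."""
    p_g, p_n = s_g / n_g, s_n / n_n
    return abs(p_g - p_n) / (0.5 * (p_g + p_n))


def classify(s_g, n_g, s_n, n_n):
    """Goal if relative support is higher in the goal dataset."""
    return "goal" if s_g / n_g > s_n / n_n else "non-goal"


# 100 candidates on level 1 with alpha = 0.01, as in the text's example.
print(level_alphas({1: 100}))
# Itemset {A,B} from table 29: 25 of 53 goal, 378 of 1,784 non-goal sessions.
print(difference(25, 53, 378, 1784), classify(25, 53, 378, 1784))
```

With 100 candidates on level 1 the sketch gives the 0.00005 threshold quoted in the text, which shows how quickly the per-test alpha shrinks as the number of candidates grows.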


Chapter 6

Datasets

The datasets used to test the user- and site-centric clickstream models of information foraging are presented in this chapter. In particular, details of the preprocessing steps undertaken to arrive at the final dataset used to test each model are shown. After all the preprocessing steps, descriptive statistics of each final dataset are also provided. §6.1 contains information about the user-centric dataset, while §6.2 details the data from the site-centric dataset.

6.1 User-centric Dataset

The data used for the user-centric clickstream model of information foraging was provided by comScore, Inc., a marketing research company. The domain-level data was captured for 100,000 United States-based panelists [1] over a one year period from January 1, 2006 to December 31, 2006 [2] (comScore, Inc., 2007b). Each panelist was randomly selected from comScore's pool of more than two million global Internet users.

Data was collected from panelists using a proprietary methodology that “...enable[d] comScore to passively observe the full details of panelists' Internet activity, including every Web site visited and item purchased” (comScore, Inc., 2005, pg. 1). A panelist's session was defined as any sequence of Web pages on the same Web site by the same visitor with less than a 30 minute time period between page viewings. A 30 minute session timeout has also been used in previous clickstream research (Bucklin and Sismeiro, 2003; Sismeiro and Bucklin, 2004; Van den Poel and Buckinx, 2005).

The remainder of this section provides information regarding the steps taken to arrive at the final dataset used to test the user-centric model, along with general descriptive statistics about the data. The preprocessing steps applied to the data are described in the next section. Descriptive statistics are then provided in the following section.

[1] The actual dataset contained a total of 88,814 panelists. The documentation by comScore did not provide an explanation for the 11,186 missing panelists.

[2] Not all panelists were active during the entire data collection period.

6.1.1 Preprocessing of Original Dataset

The data obtained from comScore, Inc. included many data elements not applicable to the current research. Therefore, a number of processing steps were performed to obtain a final dataset usable for testing the user-centric clickstream model of information foraging. Table 30 lists each step of the process along with the total number of Web sites, sessions, and goal sessions; and how many Web sites, sessions, and goal sessions were removed at that step (if applicable). Table 31 lists the parameters used in each preprocessing step. A discussion of each step and its parameters is provided below.


Table 30: User-centric: Preprocessing of Original Dataset Statistics

Step  Description                                         Web Sites   Sessions (a)   Goals (b)   Δ in Web Sites   Δ in Sessions (a)   Δ in Goals
      Original dataset                                    1,417,745   548,428,562    354,985     n/a              n/a                 n/a
1     Remove non-e-commerce sites                         625         106,686,274    354,985     1,417,120        441,742,288         n/a
2     Remove sites not randomly selected from long tail   58          798,306        13,872      567              105,887,968         341,113
3     Remove single page sessions                         58          616,607        13,870      0                181,699             2
4     Identify user-sessions                              58          511,397        11,100      n/a              105,210             2,770
5     Remove sites with < 50 goal user-sessions           52          502,131        10,834      6                9,266               266
6     Remove outliers                                     52          496,343        10,714      n/a              5,788               120
7     Remove sites with < 50 goal user-sessions           52          496,343        10,714      0                0                   0

(a) Starting at step 4, the “Sessions” and “Δ in Sessions” columns refer to user-sessions.
(b) The original dataset contained a total of 355,064 goals. However, 79 of those goals (0.02%) did not have any session details associated with the purchase. Therefore, the total number of goals for this dataset was listed as 354,985.


Table 31: User-centric: Preprocessing Parameters

Step  Description                                         Parameters
      Original dataset                                    n/a
1     Remove non-e-commerce sites                         goalsAtWebsite == 0
2     Remove sites not randomly selected from long tail   shortHeadWebsites == 80/20 rule
                                                          minGoalsInLongTail == 50
                                                          randomPercent = 20%
3     Remove single page sessions                         sessionLength == 1
4     Identify user-sessions                              sessionsInWindow ≥ 1
5     Remove sites with < 50 goal user-sessions           userSessionGoalsAtWebsite < 50
6     Remove outliers                                     MinPts = 4
                                                          Eps = 0.0266 (goal sessions)
                                                          Eps = 0.0536 (non-goal sessions)
                                                          sample % = 100% (goal sessions)
                                                          sample % = 15% (non-goal sessions)
7     Remove sites with < 50 goal user-sessions           userSessionGoalsAtWebsite < 50

Step 1. Remove Non-E-Commerce Web Sites

The first step of the process removed any non-e-commerce sites. An e-commerce Web site was defined as any site in which a purchase was made by any user at any point within the dataset's time period. A total of 625 Web sites were found where a purchase was made [3]. The remaining 1,417,120 non-e-commerce Web sites were removed along with the 441,742,288 corresponding sessions that took place on those sites.

[3] The methodology by which comScore recognizes and records a purchase was not available. Considering only a total of 625 out of 1,417,745 Web sites had purchases on them, it is likely the dataset did not include all purchases made at all Web sites.

Step 2. Remove Web Sites Not Randomly Selected from the Long Tail

The second step of the process selected a sample of long tail e-commerce Web sites to analyze. Sites within the dataset were defined according to the 80/20 rule (Newman, 2005) as either parts of the short head or long tail of a power law distribution. In general, the 80/20 rule states 80% of some quantifiable object (e.g., wealth) should be held by 20% of the population. Within the context of goal achievement, 80% of achieved goals should have taken place on 20% (125) of the 625 e-commerce Web sites. Short head Web sites were those sites included in the 80/20 group, while all other sites were considered long tail Web sites.

Figures 33a–33b illustrate the separation between short head and long tail Web sites. Figure 33a shows the number of goals achieved at each Web site, while figure 33b shows the cumulative number of goals achieved. The vertical dashed line on the left of each figure represents the boundary between short head and long tail Web sites. The Web sites to the left of the first dashed line represent 79.89% of all goal sessions, while making up 17.12% of the Web site population.

[Figure 33: User-centric: Goal Sessions by Web Sites. (a) Goal Sessions by Web Sites (x-axis: Website; y-axis: Goals at Website). (b) Cumulative Goal Sessions by Web Sites (x-axis: Website; y-axis: Cumulative Goals). The long tail region is marked in both panels.]

The second vertical dashed line in each figure represents a separation between Web sites in the long tail and those in the very long tail. A Web site was considered too far down the long tail to analyze if there were fewer than 50 goal sessions at the Web site.

A total of 107 Web sites (17.12%) in the short head were removed along with the 41,708,093 corresponding sessions (283,609 goals) at those Web sites. In addition, 228 Web sites (36.48%) in the very long tail were removed along with the 38,947,086 corresponding sessions (3,693 goals) at those Web sites.

290 Web sites (46.40%) in the long tail region remained with 26,031,095 sessions (67,683 goals). Since processing over 26 million sessions would be too computationally expensive, a random sample of 20% of the 290 Web sites was taken. 58 Web sites were randomly selected which had a total of 798,306 sessions (3.07%) (13,872 goal (20.50%)).
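The separation of Web sites into short head, long tail, and very long tail, followed by the random sample, can be sketched as below. The exact cut-off rule is an assumption: sites are ranked by goal count and added to the short head while the cumulative share of goals stays at or below 80%, which is consistent with the reported short head holding 79.89% of goals but is not confirmed by the text. The function names, the seed, and the toy goal counts are hypothetical.

```python
import random


def split_sites(goals_by_site, head_share=0.80, min_goals=50):
    """Return (short_head, long_tail, very_long_tail) lists of site names."""
    total = sum(goals_by_site.values())
    ranked = sorted(goals_by_site, key=goals_by_site.get, reverse=True)
    short_head, long_tail, very_long_tail = [], [], []
    cumulative, in_head = 0, True
    for site in ranked:
        goals = goals_by_site[site]
        # Assumed rule: stay in the short head while the share is <= 80%.
        if in_head and (cumulative + goals) / total <= head_share:
            cumulative += goals
            short_head.append(site)
            continue
        in_head = False
        # Fewer than min_goals goal sessions places a site in the very long tail.
        (long_tail if goals >= min_goals else very_long_tail).append(site)
    return short_head, long_tail, very_long_tail


def sample_long_tail(long_tail, percent=0.20, seed=0):
    """Randomly select a share of the long tail sites (seed is arbitrary)."""
    k = round(len(long_tail) * percent)
    return random.Random(seed).sample(long_tail, k)


# Hypothetical goal counts for six sites.
goals = {"s1": 500, "s2": 290, "s3": 100, "s4": 60, "s5": 40, "s6": 10}
head, tail, very_tail = split_sites(goals)
print(head, tail, very_tail)
print(len(sample_long_tail(list(range(290)))))  # 20% of 290 sites
```

Sampling 20% of 290 long tail sites yields the 58 sites reported in step 2.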


Step 3. Remove Single Page Sessions

The third step followed Bucklin and Sismeiro (2003) and removed all sessions which consisted of only a single page-view. A single page-view does not represent “browsing” behavior on a Web site (Bucklin and Sismeiro, 2003) and thus is unlikely to provide interesting visitor patterns. 181,699 single-page sessions were removed.

Step 4. Identify User-sessions

The fourth step of the process identified user-sessions from the remaining 616,607 sessions. A user-session encapsulates the session being analyzed (i.e., the “target”) and all other sessions that met two requirements.

(1) The other session must have taken place at an e-commerce Web site.
(2) The other session must have ended between 30 minutes before the start of the target session and the end of the target session.

A valid user-session was needed to calculate relative measures for the user-centric clickstream model of information foraging. To be considered valid, a user-session must have had at least one session in addition to the target session [4].

Each of the 616,607 sessions from the third step was analyzed to determine if they were part of a valid user-session [5]. A total of 511,397 valid user-sessions (82.94%) were found and retained. The remaining 105,210 sessions (17.06%) did not have a valid user-session and were removed.

Step 5. Remove Sites with < 50 Goal User-sessions

For the fifth step of the process, Web sites without at least 50 goal user-sessions were removed. Although step two initially checked for Web sites having at least 50 goal sessions, the number of goal user-sessions may have been reduced because a goal session may not have represented a valid user-session if no “other” sessions were associated with the target session.

A total of six Web sites (10.34%) were removed from the dataset because they had fewer than 50 goal user-sessions. The 9,266 user-sessions (1.81%) at those six Web sites were also removed.

[4] Specifying at least two sessions for a valid user-session is similar to requiring at least two page views for a valid session (Bucklin and Sismeiro, 2003).

[5] Further detail about how user-sessions were determined can be found in §5.1.1.
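The two requirements for the “other” sessions of a user-session can be sketched as a simple time-window filter. This is an illustration under stated assumptions rather than the procedure of §5.1.1: the `Session` record and function names are hypothetical, the input is assumed to contain only sessions at e-commerce Web sites (requirement 1), and the other sessions are assumed to belong to the same panelist as the target.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Session(NamedTuple):
    user: str
    site: str
    start: datetime
    end: datetime


def other_sessions(target, sessions, window=timedelta(minutes=30)):
    """Sessions that ended between 30 minutes before the target's start
    and the target's end (requirement 2), for the same user (assumed)."""
    earliest = target.start - window
    return [s for s in sessions
            if s is not target
            and s.user == target.user
            and earliest <= s.end <= target.end]


def is_valid_user_session(target, sessions):
    """Valid when at least one other session accompanies the target."""
    return len(other_sessions(target, sessions)) >= 1


def at(hour, minute):
    return datetime(2006, 1, 1, hour, minute)


target = Session("u1", "shop-a", at(10, 0), at(10, 20))
log = [
    target,
    Session("u1", "shop-b", at(9, 30), at(9, 45)),   # ended inside the window
    Session("u1", "shop-c", at(9, 0), at(9, 20)),    # ended too early
    Session("u1", "shop-d", at(10, 5), at(10, 30)),  # ended after the target
    Session("u2", "shop-b", at(10, 0), at(10, 10)),  # different panelist
]
print([s.site for s in other_sessions(target, log)])
print(is_valid_user_session(target, log))
```

Only the session at the hypothetical site `shop-b` by the same panelist falls inside the window, so the target session forms a valid user-session.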


Step 6. Remove Outliers

The dataset was then examined for outliers in the sixth step of the process. An outlier was defined as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” (Barnett and Lewis, 1994, pg. 7). Since the clickstream model of information foraging relies on relative behavior to other sessions, inconsistent sessions were removed from the dataset. Consistency was compared via the combination of the total number of pages viewed and the duration of a session.

An unsupervised density-based clustering algorithm called DBSCAN [6] (Ester et al., 1996) was used to locate outlying sessions [7]. DBSCAN identifies clusters of arbitrary shape, where the number of clusters is automatically determined via the algorithm. A cluster is formed by having a minimum number of neighbor points [8] (MinPts), or density, within a specified radius (Eps). Points not classified to a cluster are labeled as “noise” (i.e., outliers).

DBSCAN requires two user-specified parameters: MinPts and Eps.

MinPts – the minimum number of points within a neighborhood of radius Eps. For two-dimensional datasets, MinPts is commonly set to four (Ester et al., 1996; Hodge and Austin, 2004).

Eps – the Eps-neighborhood or radius of a cluster. The value of Eps is determined visually via a sorted k-dist graph (see point three below) (Ester et al., 1996).

To perform the outlier analysis using DBSCAN, four steps were followed.

(1) Goal and non-goal sessions were separated into two separate datasets. Each of the remaining three steps was performed independently on each dataset.

(a) For the user-centric model the goal and non-goal datasets also included the “other” sessions associated with a user session. All other sessions that also achieved a goal were included in the goal dataset, while the remaining sessions were placed in the non-goal dataset.

[6] The average run time complexity of DBSCAN is $O(n \log(n))$ (Ester et al., 1996).

[7] DBSCAN was chosen over common statistical techniques for removing outliers, such as removing values greater than three standard deviations away, for two reasons: (1) DBSCAN does not require knowledge of an underlying distribution and (2) DBSCAN is capable of finding outliers in multiple dimensions.

[8] The term points will be used to refer to sessions with a unique combination of pages viewed and session duration during the remainder of this subsection.


The goal dataset consisted of 20,121 goal sessions [9]. 10,834 of those sessions (53.84%) were target sessions, while the other 9,287 (46.16%) were other sessions. The non-goal dataset consisted of 1,589,407 non-goal sessions [10]. 491,297 of those sessions (30.91%) were target sessions, while the other 1,098,110 (69.09%) were other sessions.

(2) Values from each dimension were normalized between 0 and 1 according to equation 6.1, where $x$ is a set of distinct values for a dimension, $x_i$ is the $i$th element of the set, and $\min(x)$ and $\max(x)$ are the minimum and maximum values found in set $x$, respectively.

$$norm(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (6.1)$$

Normalization was done because distance calculations were used for generating both the sorted k-dist graph and for determining which points belonged within the same neighborhood. The Euclidean distance (equation 6.2) was used to calculate the distance between two points $P$ and $Q$ with $n$ dimensions, such that $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$.

$$\text{Euclidean distance} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (6.2)$$

If values were not normalized, then distances may differ or not differ simply due to the scale of one dimension. For example, figure 34 illustrates the positions of three points: A, B, and C. Table 32 displays the non-normalized and normalized values for each of the point's two dimensions: number of pages viewed and session duration. Table 33 lists the Euclidean distance between pairs of points using each point's non-normalized and normalized values for both dimensions. Using the non-normalized values from table 32, the distance (as seen in table 33) from A to C (45.00) is the same as the distance from B to C (45.00). However, looking at figure 34 it is apparent that an increase of 45 pages viewed represents a larger change in distance than a decrease of 45 minutes in session duration. The normalized distances for A to C (0.12) and B to C (0.45) better reflect the actual distance between points.

[9] A goal session may be present in more than one user-session.

[10] A non-goal session may be present in more than one user-session.


[Figure 34: Example Outlier Points. Points A, B, and C plotted by Pages Viewed (x-axis) and Session Duration (min) (y-axis).]

Table 32: Example Outlier Points

        Non-normalized                           Normalized
Point   Pages Viewed   Session Duration (min)    Pages Viewed (a)   Session Duration (min) (b)
A       5              170                       0.05               0.47
B       50             125                       0.50               0.35
C       5              125                       0.05               0.35

(a) Assumes minimum and maximum values of 0 and 100 for pages viewed, respectively.
(b) Assumes minimum and maximum values of 0 and 360 for session duration, respectively.

Table 33: Example Outlier Distances

Points   Non-normalized Distance   Normalized Distance
A to B   63.64                     0.47
A to C   45.00                     0.12
B to C   45.00                     0.45


(a) A random sample of 15.00% of the non-goal dataset's normalized points was selected for processing by DBSCAN. A sample was used because the time required to compute clusters from the original dataset using DBSCAN would have been prohibitively high [11]. The sample was used to find the boundary between non-outlying and outlying points. Points not included in the random sample that fell outside the non-outlying region were classified as outliers.

The entire non-goal dataset consisted of 1,589,407 non-goal sessions. A random sample of 238,411 sessions (15.00%) was selected and used in the remaining two steps.

(3) The parameters for DBSCAN were set to define the “thinnest” cluster in the dataset by following a three-step heuristic outlined by Ester et al. (1996). The “thinnest” cluster is the smallest or least dense grouping of points that are not considered noise.

(a) MinPts was set to four since each dataset only had two dimensions (Ester et al., 1996) [12].

(b) The threshold distance, which distinguishes between noise and clusterable points, was located. Points farther away than the threshold distance (i.e., to the left) were considered “noise”, while points closer than the threshold distance (i.e., to the right) were clusterable. To determine the threshold, a sorted k-dist graph was created (k = MinPts), where the distance of each point to its kth neighbor is found, sorted in descending order, and then graphed. The purpose of the sorted k-dist graph was to visually locate the first “valley” in distance values, which represents the threshold distance.

The Approximate Nearest Neighbor (ANN) search library version 1.1.1 (Mount and Arya, 2006) was used to calculate the distance of each point to its fourth nearest neighbor [13]. Figures 35a and 35b show the sorted 4-dist graphs of the first 100 values for the goal and non-goal sessions. Each figure was manually inspected to find the first “valley”, which is shown at the intersection of dashed lines.

(c) Eps was set to the threshold distance found in step (b).

Table 34 lists the parameter values used for each of the datasets. MinPts was set to four according to Ester et al. (1996) and Eps was determined from visually examining the k-dist graph for each dataset (figures 35a and 35b).

[Figure 35: User-centric: Sorted 4-Dist Graphs: Goal and Non-Goal. (a) Goal Sessions. (b) Non-goal Sessions. Both panels plot Normalized Distance (y-axis, 0 to 1) against Session (x-axis, 0 to 100).]

Table 34: User-centric: Parameter Values for DBSCAN

Sessions   MinPts   Eps
Goal       4        0.0266
Non-goal   4        0.0536

(4) The DBSCAN algorithm was run using RapidMiner Community Edition version 4.4 [14] with the specified parameter values from table 34.

DBSCAN labeled 22 goal sessions (0.11%) as noise (i.e., outliers). Of the 22 goal outliers, four (18.18%) were from target sessions and the other 18 (81.82%) were from other sessions [15]. The non-outlying points all had durations of less than 500 minutes (8.33 hours) and viewed fewer than 800 pages.

A total of 7 non-goal outliers (< 0.01%) were also found by DBSCAN in the random sample [16]. None of the outliers were from target sessions. The outliers found in the random sample were going to be used as boundary points to classify sessions from the entire non-goal dataset. However, after examining the results of the DBSCAN run, a number of sessions not flagged as outliers had extremely high values for either session duration or number of pages viewed. For example, a session in the random sample with a duration of 1,980 minutes (33 hours) was not flagged as an outlier, nor was a session with 10,000 pages viewed [17].

Although there were not a large number of sessions with extreme values, there were enough points within the same area to be considered a neighborhood by DBSCAN. In addition, there were enough of these small groups that were within a short distance of one another that they chained together to become part of the non-outlying cluster. Due to the difficulty in finding a clear separation between outlying and non-outlying points in the non-goal dataset, the boundaries found in the goal dataset were used for the non-goal dataset.

Using the cutoff values from the goal dataset (≥ 800 pages viewed or a ≥ 800 minute session duration), 13,799 non-goal sessions (0.87%) were labeled as outliers. Of the 13,799 non-goal outliers, 89 (0.65%) were from target sessions and the other 13,710 (99.36%) were from other sessions [18].

Figures 36a and 36b show plots of the distinct outlier and non-outlier points for both the goal and non-goal sessions, respectively. Since only distinct points are shown in the figures, an accurate representation of the density of points in an area is difficult to determine.

To illustrate density within an area, figures 36c and 36d present heat maps for the goal and non-goal datasets, respectively. The darkness in shade of each point in the heat map illustrates how many other sessions exist within the same area. A black point represents 10 or more sessions in the goal dataset, while 100 or more sessions are represented by the same shade in the non-goal dataset.

Noticeable within the non-goal heat map (figure 36d) is that even though figure 36b shows sessions with durations close to 2,000 minutes, the heat map demonstrates the density of points in those areas is practically non-existent. In addition, the figure also illustrates the use of the goal boundaries on the non-goal dataset retained the densest area of points as non-outlying sessions.

After removing all outliers, a total of 5,788 user-sessions (1.15%) were removed from the dataset.

[11] All points from the goal dataset were used.

[12] In a survey of outlier detection methodologies, Hodge and Austin (2004) also stated MinPts is commonly set to four for DBSCAN.

[13] ANN is available at http://www.cs.umd.edu/~mount/ANN/.

[14] RapidMiner is available at http://www.rapidminer.com. RapidMiner was previously named YALE (Yet Another Learning Environment).

[15] The 22 goal outlier sessions were represented by 22 distinct combinations of points.

[16] The 7 non-goal outlier sessions were represented by 7 distinct combinations of points.

[17] One possible explanation for sessions with extremely high values may be due to automated programs browsing the Web. For example, a program which resides as a background process may make a connection to a Web site to refresh its local cache of information every few minutes. If a user has an always-on Internet connection and does not turn off their computer, then it is feasible a session may last many hours. A similar argument can also be made for spidering programs that visit a large number of pages at a Web site.

[18] The 13,799 non-goal outlier sessions were represented by 12,642 distinct combinations of points (91.62%).
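Returning to the normalization step of the outlier analysis, the values in tables 32 and 33 can be reproduced with a short sketch of equations 6.1 and 6.2. The variable names are hypothetical, and the minimum and maximum values are the ones assumed in the notes to table 32. The unrounded normalized distance from A to C is 0.125; the 0.12 reported in table 33 is consistent with subtracting the two-decimal values 0.47 and 0.35, although the text does not state how the figure was rounded.

```python
from math import dist


def normalize(value, minimum, maximum):
    """Equation 6.1: rescale a value to the range [0, 1]."""
    return (value - minimum) / (maximum - minimum)


# Table 32: (pages viewed, session duration in minutes).
raw = {"A": (5, 170), "B": (50, 125), "C": (5, 125)}
PAGES_MIN, PAGES_MAX = 0, 100        # range assumed in table 32, note (a)
DURATION_MIN, DURATION_MAX = 0, 360  # range assumed in table 32, note (b)

scaled = {
    name: (normalize(pages, PAGES_MIN, PAGES_MAX),
           normalize(minutes, DURATION_MIN, DURATION_MAX))
    for name, (pages, minutes) in raw.items()
}

# Equation 6.2 (Euclidean distance) on raw and on normalized values.
for first, second in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(first, "to", second,
          f"{dist(raw[first], raw[second]):.2f}",
          f"{dist(scaled[first], scaled[second]):.3f}")
```

On the raw values A to C and B to C are both 45 units apart, while on the normalized values the 45-page difference is more than three times the 45-minute difference.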


[Figure 36: User-centric: Outlier Points Plot. (a) Distinct Goal Sessions. (b) Distinct Non-goal Sessions. (c) Goal Heat Map. (d) Non-goal Heat Map. All panels plot Pages Viewed (x-axis) against Session Duration (min) (y-axis); panels (a) and (b) distinguish non-outliers from outliers, and the heat map scales run from 0 to 10 sessions in (c) and from 0 to 100 sessions in (d).]
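The sorted k-dist values and the noise labeling that DBSCAN performs can be sketched in a few lines. This is a brute-force illustration, not the ANN library or the RapidMiner implementation used in the analysis. The function names and toy points are hypothetical, the k-dist calculation is assumed to exclude the point itself, and the Eps-neighborhood is assumed to include the point itself, following the usual reading of Ester et al. (1996).

```python
from math import dist


def sorted_k_dist(points, k=4):
    """Distance from each point to its k-th nearest neighbour, descending.

    Brute force, O(n^2); the point itself is not counted as a neighbour
    (an assumption for this sketch).
    """
    values = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(others[k - 1])
    return sorted(values, reverse=True)


def dbscan_noise(points, eps, min_pts=4):
    """Indices of points DBSCAN would label as noise.

    A point is a core point if its Eps-neighbourhood (itself included)
    holds at least min_pts points. Noise points are those lying in no
    core point's neighbourhood, so full cluster labels are not needed.
    """
    n = len(points)
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    clustered = set()
    for i in range(n):
        if len(neighbours[i]) >= min_pts:   # core point
            clustered.update(neighbours[i])
    return [i for i in range(n) if i not in clustered]


# Hypothetical normalized points: a tight group of five and one far away.
points = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01),
          (0.005, 0.005), (0.9, 0.9)]
k_dist = sorted_k_dist(points, k=4)
print(k_dist)                          # one large value, then a sharp drop
print(dbscan_noise(points, eps=0.05))  # index of the isolated point
```

The sharp drop after the first sorted value plays the role of the “valley” used to choose Eps, and the isolated point is the only one left outside every core neighborhood.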


93 of those 5,788 user-sessions (1.61%) were removed because the target session was classified as an outlier. The remaining 5,695 user-sessions (98.39%) were removed because there was not at least one other session within the user-session (i.e., all the other sessions were classified as outliers and removed). A total of 496,343 user-sessions (10,714 goals) remained after processing all outliers.

Step 7. Remove Sites with < 50 Goal User-sessions

For the final step of the process, Web sites without at least 50 goal user-sessions were removed. Although step five also checked for Web sites having at least 50 goal user-sessions, the number of goal user-sessions may have been further reduced due to the outlier analysis. If a user-session's target session, all “other” sessions, or both were flagged as outliers, then the user-session would become invalid.

No Web sites were removed, as all sites retained at least 50 goal user-sessions.

6.1.2 Final Dataset

The following subsections provide general statistics about the final dataset, along with characteristics of the Web sites and user-sessions in the dataset.

General Statistics

Table 35 displays general statistics for the final dataset. The first row of the table lists the total number of sessions in the dataset [19]. Each row after the first lists the total count for the metric and also its percentage compared to the total number of sessions. The overall conversion rate of the dataset was 2.16%, which is similar to the two percent conversion rate typically found at e-commerce Web sites (Moe, 2003; Sismeiro and Bucklin, 2004). Unique visitors accounted for 11.12% of the sessions whereas 88.88% of the sessions were from repeat visitors. Lastly, 7,366,442 pages were viewed over all 496,343 sessions from the 52 Web sites in the dataset.

[19] Unless otherwise specified all statistics are about the target session of each user-session. The term “session” will be used in place of “target session” for readability purposes.


Table 35: User-centric: Final Dataset Statistics

                     n           %
Sessions             496,343     n/a
Goal sessions        10,714      2.16%
Non-goal sessions    485,629     97.84%
Unique visitors      55,195      11.12%
Repeat visits        441,148     88.88%
Pages viewed         7,366,442   n/a
Web sites            52          n/a

Web Site Characteristics

Table 36 provides the mean, standard deviation, minimum, and maximum values from all 52 Web sites for the number of goal, non-goal, and total sessions visiting each site and the conversion rate from each site.

Table 36: User-centric: Web Site Characteristic Statistics

                     Mean       St. Dev.    Minimum   Maximum
SESSIONS
Total sessions       9,545.06   14,078.48   406       76,138
Goal sessions        206.04     154.91      51        597
Non-goal sessions    9,339.02   14,064.45   290       76,041
OTHER
Conversion           6.70%      7.17%       0.12%     28.57%

On average, each Web site received 9,545.06 sessions, with more than 45 times as many non-goal sessions as goal sessions. Each Web site had, on average, 206.04 goal sessions (2.16%) and 9,339.02 non-goal sessions (97.84%).
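The two conversion figures reported above are computed differently: the 2.16% in Table 35 pools all sessions (10,714 / 496,343), whereas the 6.70% in Table 36 is the mean of the 52 per-site rates. The following minimal sketch illustrates why the two can diverge; the three sites in `example_sites` are hypothetical values chosen for illustration and are not taken from the dataset.

```python
from statistics import mean


def pooled_rate(sites):
    """Conversion rate over all sessions: total goals / total sessions."""
    goals = sum(g for g, _ in sites)
    total = sum(t for _, t in sites)
    return goals / total


def mean_site_rate(sites):
    """Unweighted mean of the per-site conversion rates."""
    return mean(g / t for g, t in sites)


# Hypothetical (goal sessions, total sessions) pairs, for illustration only.
example_sites = [(60, 200), (100, 10_000), (50, 40_000)]

# Pooled rate reported in Table 35, recomputed from its own counts.
table35_rate = round(10_714 / 496_343 * 100, 2)

if __name__ == "__main__":
    print(f"pooled: {pooled_rate(example_sites):.4f}")
    print(f"mean of site rates: {mean_site_rate(example_sites):.4f}")
    print(f"Table 35 pooled rate: {table35_rate}%")
```

Small sites with high conversion rates pull the unweighted mean up, while heavily visited, low-converting sites dominate the pooled figure.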


Figures 37a–37c illustrate the distribution of the number of total, goal, and non-goal sessions for each Web site, respectively. The majority of Web sites (32 out of 52, or 61.54%) had fairly light traffic, with fewer than 8,000 total sessions (figure 37a). However, there were 20 Web sites (38.46%) with more than 8,000 total sessions, with the most heavily visited site having 76,138 total sessions. In terms of goal sessions (figure 37b), 40 Web sites (76.92%) had between 50 and 249 goal sessions, with the remaining 12 Web sites (23.08%) having 250 or more goal sessions.

[Figure 37. User-centric: Website Sessions Histograms. Panels: (a) All Sessions, (b) Goal Sessions, (c) Non-goal Sessions. Axes: Number of Websites by Number of (All / Goal / Non-goal) Sessions Per Website.]

The average conversion rate for the 52 Web sites was 6.70%, with one Web site having the highest rate of 28.57%. Figure 38 illustrates the distribution of conversion rates for each Web site. 22 of the 52 Web sites (42.31%) had less than a 3% conversion rate. 15 of the Web sites (28.85%) had between a 3% and 8% conversion rate. The remaining 15 Web sites (28.85%) had a conversion rate higher than 8%.


[Figure 38. User-centric: Website Conversion Histogram. Axes: Number of Websites by Conversion Rate % Per Website.]

Session Characteristics

Table 37 provides the mean, standard deviation, minimum, and maximum values for the number of pages viewed and duration from all 496,343 sessions in the dataset. For each metric, values are provided for three sets of sessions: goal, non-goal, and all sessions.

Table 37: User-centric: Session Characteristic Statistics

                         Mean    St. Dev.   Minimum   Maximum
PAGES VIEWED
All sessions             14.84   26.07      2         794
Goal sessions            42.33   48.70      2         791
Non-goal sessions        14.24   25.01      2         794
SESSION DURATION (MIN)
All sessions             10.17   16.88      1         495
Goal sessions            27.31   26.56      1         367
Non-goal sessions        9.79    16.40      1         495

Each session consisted, on average, of fewer than 15 page views (14.84), with a maximum of 794 pages viewed by one session. Goal sessions viewed almost three times as many pages per session, on average, compared to non-goal sessions (42.33 versus 14.24). Figures 39a–39c show the distribution of pages viewed by number of sessions.

[Figure 39. User-centric: Session Pages Viewed Histograms. Panels: (a) All Sessions, (b) Goal Sessions, (c) Non-goal Sessions. Axes: Number of Sessions by Pages Viewed.]

The average duration from all 496,343 sessions was 10.17 minutes, with the longest session spending 495 minutes (8.25 hours) on a site. Goal sessions spent almost three times as many minutes on a site compared to non-goal sessions (27.31 min versus 9.79 min). Figures 40a–40c illustrate the distribution of session duration in minutes by number of sessions.

[Figure 40. User-centric: Session Duration Histograms. Panels: (a) All Sessions, (b) Goal Sessions, (c) Non-goal Sessions. Axes: Number of Sessions by Session Duration (min).]

User-session Characteristics

Table 38 provides the mean, standard deviation, minimum, and maximum values for the number of total, goal, and non-goal other sessions for all 496,343 user-sessions in the dataset.

Table 38: User-centric: User-session Characteristic Statistics

                            Mean   St. Dev.   Minimum   Maximum
NUMBER OF OTHER SESSIONS
All sessions                2.37   1.70       1         32
Goal sessions               0.02   0.15       0         4
Non-goal sessions           2.35   1.69       0         32


Each user-session consisted, on average, of fewer than three other sessions (2.37), with a maximum of 32 other sessions by one user-session. Other sessions were mostly comprised of non-goal sessions (2.35 versus 0.02). However, one user-session had four other sessions that were goals. Figures 41a–41c show the distribution of other sessions by number of user-sessions.

[Figure 41. User-centric: User-session Sessions Histograms. Panels: (a) All Sessions, (b) Goal Sessions, (c) Non-goal Sessions. Axes: Number of User-Sessions by Number of Other Sessions.]

6.2 Site-centric Dataset

The data used for the site-centric clickstream model of information foraging was provided by a Web hosting company. The data was captured over a period of about one year, from September 12, 2007 to September 23, 2008. The Web hosting company was unique since it provided a common platform for Web sites of a similar nature. For example, their Web sites all used the same platform that allowed the site owners to add content to their Web site without knowledge of HTML. A beneficial byproduct of having sites on the same platform was a common structure to each Web site. For example, those Web sites with a contact form all submitted their contact information to the same common platform URL. The contents of the contact form were then saved and a result page was displayed to the user [20]. Therefore, it could be determined that a goal was achieved (i.e., a visitor filled out a contact form and submitted it) if a visitor's session “viewed” the contact form submission page.

Since the data provider hosted thousands of Web sites, they created a mechanism to capture traffic from all their sites without relying on the individual traffic logs of each Web site. Whenever a user visited a Web page of a participating Web site, a small transparent image was downloaded via a JavaScript script. The image had parameters unique to the user along with information such as the Web site and Web page being visited, timestamp of visit, and other miscellaneous information. Once the script was deployed on the platform, it was integrated into Web sites once site owners updated their site in some way (e.g., a page was edited). Therefore, even though the script was deployed on September 12, 2007, data collection at a particular Web site only started once the site was changed in some way.

Each piece of data was stored in a data warehouse and linked to the user, Web site, and Web page it referenced. A visitor's session was defined as any sequence of Web pages on the same Web site by the same visitor with less than a 30 minute time period between page viewings. A 30 minute session timeout has also been used in previous clickstream research (Bucklin and Sismeiro, 2003; Sismeiro and Bucklin, 2004; Van den Poel and Buckinx, 2005).
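The session definition above (same visitor, same Web site, less than 30 minutes between consecutive page views) can be sketched as follows. This is an illustrative reconstruction rather than the data provider's implementation; the tuple layout `(visitor, site, timestamp, page)` and the function name `sessionize` are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby

TIMEOUT = timedelta(minutes=30)  # session timeout stated in the text


def sessionize(hits, timeout=TIMEOUT):
    """Split page views into sessions.

    hits: iterable of (visitor, site, timestamp, page) tuples (assumed layout).
    A new session starts whenever the gap between consecutive page views of
    the same visitor on the same site is not less than `timeout`.
    """
    sessions = []
    ordered = sorted(hits, key=lambda h: (h[0], h[1], h[2]))
    for _, group in groupby(ordered, key=lambda h: (h[0], h[1])):
        current = []
        for hit in group:
            if current and hit[2] - current[-1][2] >= timeout:
                sessions.append(current)
                current = []
            current.append(hit)
        sessions.append(current)
    return sessions


# Hypothetical page views, for illustration only.
t0 = datetime(2007, 9, 12, 14, 0)
example_hits = [
    ("v1", "s1", t0, "index.html"),
    ("v1", "s1", t0 + timedelta(minutes=10), "contact.html"),
    ("v1", "s1", t0 + timedelta(minutes=45), "index.html"),  # 35-minute gap
    ("v2", "s1", t0, "index.html"),
]
example_sessions = sessionize(example_hits)
```

In this example the first visitor's third page view follows a 35-minute gap and therefore opens a second session.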
The remainder of this section provides information regarding the steps taken to arrive at the final dataset, along with general descriptive statistics about the data. The preprocessing steps applied to the data are described in the next section. Descriptive statistics are then provided in the following section.

6.2.1 Preprocessing of Original Dataset

The data obtained from the data provider included many data elements not applicable to the current research. Therefore, a number of processing steps were performed to obtain a final dataset usable for testing the site-centric clickstream model of information foraging. Table 39 lists each step of the process along with the total number of Web sites, sessions, and goal sessions, and how many Web sites, sessions, and goal sessions were removed at that step (if applicable). Table 40 lists the parameters used in each preprocessing step. A discussion of each step and its parameters is provided below.

[20] The result page may be a return to the contact form that was submitted, a page thanking the user for submitting their information, or any other page on the Web site (e.g., the index page).


Table 39: Site-centric: Preprocessing of Original Dataset Statistics

                                                    Remaining after step             Removed at step
Step  Description                                   Web Sites  Sessions   Goals      Web Sites  Sessions  Goals
      Original dataset                              6,003      1,968,491  n/a        n/a        n/a       n/a
1     Map valid pages                               6,003      1,968,491  n/a        n/a        n/a       n/a
2     Remove other Web sites                        1,710      1,692,275  n/a        4,293      276,216   n/a
3     Remove spam sessions                          1,504      1,689,159  n/a        206        3,116     n/a
4     Remove single page sessions                   1,483      900,677    n/a        21         788,482   n/a
5     Determine goal sessions                       1,483      900,677    12,441     n/a        n/a       n/a
6     Remove no goal Web sites                      918        790,691    12,441     565        109,986   n/a
7     Remove Web sites with < 50 goal sessions      57         278,463    5,982      861        512,228   6,459
8     Remove outliers                               57         278,437    5,975      n/a        26        7
9     Identify contact goals                        57         278,437    5,975      n/a        n/a       n/a
10    Classify goal sessions                        57         278,437    5,827      n/a        n/a       148
11    Remove Web sites without any contact          47         250,162    5,302      10         28,275    525
      goals having ≥ 50 goal sessions
12    Classify other contact goal sessions as       47         250,162    4,979      n/a        n/a       323
      non-goal sessions


Table 40: Site-centric: Preprocessing Parameters

Step  Description                                   Parameters
      Original dataset                              n/a
1     Map valid pages                               n/a
2     Remove other Web sites                        platform ≠ informational
3     Remove spam sessions                          sessions == spam
4     Remove single page sessions                   sessionLength == 1
5     Determine goal sessions                       formSubmissionPage ≠ visited
6     Remove no goal Web sites                      goalsAtWebsite == 0
7     Remove Web sites with < 50 goal sessions      goalsAtWebsite < 50
8     Remove outliers                               MinPts = 4
                                                    Eps = 0.0636 (goal sessions)
                                                    Eps = 0.0597 (non-goal sessions)
9     Identify contact goals                        countedSupport = 5
                                                    patternSize = 3
                                                    pattern = AXA or AXB
                                                    directMatches = 5
10    Classify goal sessions                        0 gap

in Google's cache, such pages were eliminated from the dataset (i.e., an invalid page).

Rather than examining each page individually to determine whether it was valid or invalid, the domain of each page was examined. First, any page from a domain which was present in more than one Web site was considered invalid [21]. Second, pages from many search engine caches are recorded with an IP address rather than a domain name. Therefore, any pages from a numerical IP address were considered invalid. Finally, a manual inspection of the remaining domains was done to remove known outside services (e.g., Web-based mail domains).

In addition to removing invalid pages, valid pages needed to be mapped together on the same Web site. Web sites with multiple domain names pointing to the same Web site would only show a fragmented picture of the pages being visited. For example, assume domainA.com and domainB.com both point to the same Web site. In the data, domainA.com/mypage.html and domainB.com/mypage.html would be seen as totally separate pages from one another. Instead, a visit to mypage.html should be counted as the same page, as long as the domain was valid. Thus, pages of the same name were mapped to a single valid page.

A total of 43,544 unique pages were present in the entire dataset. 5,702 of those pages (13.09%) were flagged as invalid. Of the remaining 37,842 unique pages, 4,102 (10.84%) were mapped to other existing pages (e.g., domain.com/ mapped to domain.com/index.html). After all processing was done, a total of 33,740 unique valid pages remained.

Step 2. Remove Other Web Sites

After completing the first step, the dataset still retained the original 6,003 Web sites and 1,968,491 sessions. However, the dataset included data from other platforms the data provider hosted (e.g., social networking Web sites) which were not the focus of this research. Therefore, the second step of the process removed all Web sites not using the data provider's informational platform. A total of 4,293 Web sites were removed along with 276,216 corresponding sessions.
[21] The data provider offered a number of services that used the same domain on multiple Web sites. Those domains were flagged as “invalid” even though the origin of the domain was known. However, this did not affect the analysis since Web sites using those shared domains were from other platforms (e.g., social networking) and were not being investigated in this research.
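The domain-level validity rules and page mapping of Step 1 can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure's actual code: URLs are assumed to be recorded without a scheme (e.g., `domainA.com/mypage.html`), the manual inspection is represented by a hypothetical `blocklist` argument, and the helper names are invented for the example.

```python
import re
from collections import defaultdict

# Matches a dotted-quad numeric address such as 10.0.0.7.
IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def invalid_domains(site_domains, blocklist=()):
    """Return the set of domains treated as invalid.

    site_domains: dict mapping a site id to the set of domains seen for it.
    A domain is invalid if it appears on more than one site, is a numeric
    IP address, or is on the manually compiled blocklist (assumed input).
    """
    sites_by_domain = defaultdict(set)
    for site, domains in site_domains.items():
        for domain in domains:
            sites_by_domain[domain].add(site)
    bad = {d for d, sites in sites_by_domain.items() if len(sites) > 1}
    bad |= {d for d in sites_by_domain if IP_RE.match(d)}
    bad |= set(blocklist)
    return bad


def canonical_page(url, bad):
    """Map a URL to a site-relative page name, or None if the domain is invalid."""
    domain, _, path = url.partition("/")
    if domain in bad:
        return None
    # A bare domain is mapped to the index page, as in the example in the text.
    return path or "index.html"


# Hypothetical data, for illustration only.
example_domains = {
    "site1": {"domainA.com", "domainB.com", "shared.example"},
    "site2": {"other.com", "shared.example", "10.0.0.7"},
}
example_bad = invalid_domains(example_domains, blocklist={"webmail.example"})
```

With this mapping, `domainA.com/mypage.html` and `domainB.com/mypage.html` resolve to the same page name, while pages from shared or numeric domains are dropped.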


Step 3. Remove Spam Sessions

The third step removed any sessions designated as spam. The data provider flagged any sessions from robots, spiders, or any other automated browsing mechanisms as spam (e.g., Google's indexing spider). A total of 3,116 sessions and 206 Web sites (which only had spam sessions) were removed.

Step 4. Remove Single-page Sessions

The fourth step followed Bucklin and Sismeiro (2003) and removed all sessions which consisted of only a single page-view [22]. A single page-view does not represent “browsing” behavior on a Web site (Bucklin and Sismeiro, 2003) and thus is unlikely to provide interesting visitor patterns. 788,482 single-page sessions were removed along with 21 Web sites which only had single page sessions.

[22] Only sessions with more than one valid page viewed were retained.

Step 5. Determine Goal Sessions

For the fifth step of the process, sessions were classified as either “goal” or “non-goal” sessions. Within the data, any contact form submission was represented as a visit to a specific URL (e.g., formSubmission.html). A total of 12,441 sessions (1.38%) visited the Web site-unique form submission URL, and thus were classified as goal sessions.

1,857 of the goal sessions (14.93%) visited the form submission URL more than once during a session (i.e., a repeat goal session). 1,563 of the repeat goal sessions (84.17%) were instances where the form submission page was visited multiple times in a row. A potential explanation for this behavior is the user clicked the submit button on a form multiple times.

The remaining 294 repeat goal sessions (15.83%) submitted a form and then visited at least one other page before submitting a form again (i.e., distinct form submissions) [23]. Such repeat behavior may be the result of a user submitting different contact forms on a Web site (e.g., request for information and signing up for a newsletter), or maybe a person simply resubmitted the same form (for whatever reason) after going somewhere else on the site. Figure 42 shows a histogram of the number of distinct form submissions by number of repeat goal sessions. The maximum number of distinct form submissions was 14.

[23] The 294 sessions with distinct form submissions were retained in the analysis. Only browsing behavior occurring before the first form submission was considered in the analysis, and thus having extra form submissions did not impact the analysis for this research.

[Figure 42. Site-centric: Distinct Form Submissions Histogram. Axes: Number of Sessions by Distinct Form Submissions.]

Step 6. Remove No-goal Web Sites

The sixth step removed Web sites which did not have any goals achieved during the data collection period. A total of 565 Web sites were removed along with the 109,986 sessions which occurred on those Web sites.

Step 7. Remove Web Sites with Fewer Than 50 Goal Sessions

In order to ensure a large enough sample size of goal sessions for analysis, the seventh step removed Web sites which had fewer than 50 goal sessions. 861 Web sites were removed along with the 512,228 corresponding sessions at those sites.

Prior to the removal of Web sites in this step, a cutoff point was determined by examining a histogram of the number of Web sites according to the number of goal sessions at their site (figure 43). 98 Web sites (10.68%) with 30 goal sessions or more are displayed in the figure [24]. Of those 98 Web sites shown, 41 (41.84%) had fewer than 50 goal sessions. 31 of those 41 sites (75.61%) only had between 30 and 39 goal sessions. Thus, the selection of 50 goal sessions as a cutoff point appears to be a good selection for including the maximum number of Web sites while ensuring a large enough goal session sample size within each site for the analysis.

[24] To provide a reasonable scale for the y-axis, the figure does not show the 820 Web sites (89.32%) with fewer than 30 goal sessions.
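The distinction drawn in Step 5 between repeated clicks on the submit button and distinct form submissions can be sketched as follows. This is an illustrative sketch only: the page name `formSubmission.html` comes from the example in the text, while the function names and the list-of-page-names session representation are assumptions.

```python
FORM_PAGE = "formSubmission.html"  # example URL used in the text


def distinct_submissions(pages, form_page=FORM_PAGE):
    """Count form submissions, treating consecutive repeats as a single one."""
    count = 0
    previous_was_form = False
    for page in pages:
        is_form = page == form_page
        if is_form and not previous_was_form:
            count += 1
        previous_was_form = is_form
    return count


def is_goal_session(pages, form_page=FORM_PAGE):
    """A session is a goal session if it visited the form submission URL."""
    return form_page in pages


def pages_before_first_submission(pages, form_page=FORM_PAGE):
    """Browsing behavior considered in the analysis: pages before the first submission."""
    if form_page not in pages:
        return list(pages)
    return list(pages[: pages.index(form_page)])


# Hypothetical session, for illustration only.
example_session = [
    "index.html", "contact.html", FORM_PAGE, FORM_PAGE,  # double click
    "products.html", "newsletter.html", FORM_PAGE,       # second, distinct form
]
```

Here the two consecutive visits to the submission page count once, and the later visit after other pages counts as a second, distinct submission.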


[Figure 43. Site-centric: Goal Sessions by Website Histogram. Axes: Number of Websites by Number of Goals Per Website.]

Step 8. Remove Outliers

The dataset was then examined for outliers in the eighth step of the process. An outlier was defined as “an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” (Barnett and Lewis, 1994, pg. 7). Inconsistent sessions were removed from the dataset. Consistency was compared via the combination of the total number of pages viewed and the duration of a session.

An unsupervised density-based clustering algorithm called DBSCAN [25] (Ester et al., 1996) was used to locate outlying sessions [26]. DBSCAN identifies clusters of arbitrary shape, where the number of clusters is automatically determined via the algorithm. A cluster is formed by having a minimum number of neighbor points [27] (MinPts), or density, within a specified radius (Eps). Points not classified to a cluster are labeled as “noise” (i.e., outliers).

DBSCAN requires two user-specified parameters: MinPts and Eps.

MinPts: the minimum number of points within a neighborhood of radius Eps. For two-dimensional datasets, MinPts is commonly set to four (Ester et al., 1996; Hodge and Austin, 2004).

Eps: the Eps-neighborhood or radius of a cluster. The value of Eps is determined visually via a sorted k-dist graph (see point three below) (Ester et al., 1996).

To perform the outlier analysis using DBSCAN, four steps were performed. The process is very similar to the user-centric process except user-sessions were not used for the site-centric dataset and thus “other” sessions were not included in the datasets. In addition, random sampling of the non-goal dataset was not used because of the smaller-sized site-centric dataset.

[25] The average run time complexity of DBSCAN is O(n log(n)) (Ester et al., 1996).

[26] DBSCAN was chosen over common statistical techniques for removing outliers, such as removing values greater than three standard deviations away, for two reasons: (1) DBSCAN does not require knowledge of an underlying distribution and (2) DBSCAN is capable of finding outliers in multiple dimensions.

[27] The term points will be used to refer to sessions with a unique combination of pages viewed and session duration during the remainder of this subsection.

(1) Goal and non-goal sessions were separated into two separate datasets. Each of the remaining three steps was performed independently on each dataset.

(2) Values from each dimension were normalized between 0 and 1 according to equation 6.3, where x is a set of distinct values for a dimension, x_i is the i-th element of the set, and min(x) and max(x) are the minimum and maximum values found in set x, respectively.

\[
\mathrm{norm}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{6.3}
\]

(3) The parameters for DBSCAN were set to define the “thinnest” cluster in the dataset by following a three-step heuristic outlined by Ester et al. (1996). The “thinnest” cluster is the smallest or least dense grouping of points that are not considered noise.

  (a) MinPts was set to four since each dataset only had two dimensions (Ester et al., 1996) [28].

  (b) The threshold distance, which distinguishes between noise and clusterable points, was located. To determine the threshold, a sorted k-dist graph was created (k = MinPts), where the distance of each point to its k-th neighbor is found, sorted in descending order, and then graphed. The purpose of the sorted k-dist graph was to visually locate the first “valley” in distance values, which represents the threshold distance.

      The Approximate Nearest Neighbor (ANN) search library version 1.1.1 (Mount and Arya, 2006) was used to calculate the distance of each point to its fourth nearest neighbor [29]. Figures 44a and 44b show the sorted 4-dist graphs of the first 100 values for the goal and non-goal sessions. Each figure was manually inspected to find the first “valley”, which is shown at the intersection of dashed lines.

  (c) Set Eps to the threshold distance found in step (b).

[28] In a survey of outlier detection methodologies, Hodge and Austin (2004) also stated MinPts is commonly set to four for DBSCAN.

[29] ANN is available at http://www.cs.umd.edu/~mount/ANN/.


[Figure 44. Site-centric: Sorted 4-Dist Graphs: Goal and Non-Goal. Panels: (a) Goal Sessions, (b) Non-goal Sessions. Axes: Normalized Distance by Session.]

Table 41 lists the parameter values used for each of the datasets. MinPts was set to four according to Ester et al. (1996) and Eps was determined from visually examining the k-dist graph for each dataset (figures 44a and 44b).

Table 41: Site-centric: Parameter Values for DBSCAN

Sessions    MinPts   Eps
Goal        4        0.0636
Non-goal    4        0.0597

(4) The DBSCAN algorithm was run using RapidMiner Community Edition version 4.4 [30] with the specified parameter values from table 41.

DBSCAN labeled seven goal sessions (0.12%) and 19 non-goal sessions (0.01%) as noise (i.e., outliers). Figures 45a and 45b show plots of distinct outlier and non-outlier points for both the goal and non-goal sessions, respectively. Roughly speaking, goal sessions with over 100 pages viewed or a duration of 145 minutes or more were considered outliers. For the non-goal sessions, there was not a clear separation between outliers and non-outliers for global values of pages viewed or session duration. Instead, figure 45b illustrates how different combinations of pages viewed and session duration categorized sessions as outliers or not.

[30] RapidMiner is available at http://www.rapidminer.com. RapidMiner was previously named YALE (Yet Another Learning Environment).
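The four-step outlier procedure above (min-max normalization per equation 6.3, a sorted k-dist graph to choose Eps, and DBSCAN noise labeling) can be sketched in plain Python. This is a brute-force O(n²) illustration, not the ANN and RapidMiner tooling used in the study; the function names, the sample points, and the Eps value of 0.2 are assumptions for the example. It also assumes, following the usual DBSCAN definition, that a point's Eps-neighborhood includes the point itself.

```python
from math import dist


def min_max(values):
    """Equation 6.3: scale values to [0, 1]; a constant dimension maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize(points):
    """Normalize each of the two dimensions (pages viewed, duration) independently."""
    xs = min_max([p[0] for p in points])
    ys = min_max([p[1] for p in points])
    return list(zip(xs, ys))


def sorted_k_dist(points, k=4):
    """Distance of every point to its k-th nearest neighbor, in descending order."""
    k_dists = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists, reverse=True)


def dbscan_noise(points, eps, min_pts=4):
    """Indices of points DBSCAN would label as noise.

    A point is a core point if at least `min_pts` points (itself included)
    lie within `eps`. Noise points are non-core points with no core point
    in their eps-neighborhood.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    ]


# Hypothetical (pages viewed, duration in minutes) points, for illustration only.
example_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 40)]
example_norm = normalize(example_points)
example_noise = dbscan_noise(example_norm, eps=0.2, min_pts=4)
```

In practice the Eps value would be read off the first "valley" of `sorted_k_dist(...)`, as the text describes, rather than fixed in advance.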


[Figure 45. Site-centric: Outlier Points Plot. Panels: (a) Goal Sessions, (b) Non-goal Sessions. Axes: Session Duration (min) and Pages Viewed; points are marked as non-outliers or outliers.]

Step 9. Identify Contact Goals

For the ninth step of preprocessing, contact goals at each Web site were identified. A contact goal is the submission of a particular contact form on a Web site. A Web site may have more than one contact goal. For example, a Web site may have one contact form for general inquiries (contact goal A) and another contact form to request quotes (contact goal B). As a forager's information goal may differ drastically depending on the contact form being submitted, simply grouping all goal sessions together may introduce noise into the analysis. Classifying goal sessions by contact goal attempts to reduce noise by only grouping foragers together with similar information goals.

The eventual observable outcomes of this preprocessing step were three-fold:

(1) Identify contact goals having at least 50 goal sessions. The selection of 50 goal sessions was made to balance the need for sufficient sample size of goal sessions within a single contact goal and to include as many Web sites as possible. The actual selection of a single contact goal for a Web site is discussed in preprocessing step 12.

(2) Identify pages which were necessary conditions for the submission of a contact goal (e.g., contact form page, thank you page). Once identified, these necessary condition pages were then excluded from future mining of patches and trails.

(3) Classify goal sessions to an identified contact goal (discussed further in preprocessing step 10) [31].

[31] Not all goal sessions would be classifiable to a contact goal. This preprocessing process was only concerned with discovering moderately-visited contact goals. Thus, goal sessions which submitted forms for non-discovered contact goals would not be classified.


Within the data, any contact form submission was represented as a visit to a specific URL (e.g., formSubmission.html). Therefore, a forager submitting from either contact goal A or contact goal B would both show a visit to page formSubmission.html within their session. The limitation of this approach is not being able to directly classify goal sessions according to their contact goal. Therefore, an indirect manner of discovering contact goals via browsing patterns was used.

The general pattern of a form submission consisted of three pages in sequence: (1) a contact form page, (2) a page representing a form submission, and (3) a thank you page or the same contact form page from (1). To discover these sequences, frequent sequential patterns were mined using the Sequential Pattern Mining (SPAM) algorithm (Ayres et al., 2002) [32].

[32] Another method of discovering contact goals would be to use the referrer field of the form submission page to discover all contact form pages. However, the data provider limited the referrer field in this dataset to the domain level. In addition, after submitting a form, a forager is automatically forwarded to the third page. This forwarding is done server-side and thus the referrer field would not be populated. Therefore, the pages shown as a result of a submitted contact form (e.g., a thank you page) could not be discovered by searching for the URL of the form submission page in the third page's referrer field. Due to these data and mechanism limitations, sequential pattern mining was used.

Potential Contact Goal

To be considered a potential contact goal, a mined sequential pattern must have met five criteria.

(1) Have a counted support of at least five goal sessions. Although the interest in this processing was on contact goals with at least 50 goal sessions, the counted support was set to five for two reasons.

  (a) The first reason was to account for valid, but non-standard browsing behavior. For example, although the general submission pattern consists of three pages, there are occasions where a forager will only complete the first two pages of the sequence. This is because after the form submission, the system automatically forwarded a forager (after a short delay) to the third page. However, due to the delay some foragers may browse elsewhere or leave the site before being automatically forwarded.

  (b) The second reason was to discover as many contact goals as possible so that goal sessions were not incorrectly classified to the wrong contact goal. The browsing patterns of a forager may match, to differing degrees, multiple contact goals. If only highly-visited contact goals were discovered, a session may be classified to that contact goal even though a less-visited contact goal was a better match. Classifying sessions to contact goals is discussed further in preprocessing step 10.

(2) Have a three-page sequence length.

(3) The first page of the sequence must not have been an index page or a form submission page.

(4) The second page of the sequence must have been a form submission page.

(5) The third page of the sequence must not have been a form submission page.

Confirmed Contact Goal

A potential contact goal became a confirmed contact goal if it met the requirements listed above plus one additional requirement.

(1) A minimum of five goal sessions must directly match the pattern. A direct match means a goal session visited the exact same three pages, in order, and without any additional pages in between any of the pages of the sequence. The selection of a value less than 50 was again due to valid, but non-standard browsing behavior. For example, assume a contact goal consists of the pattern: page A, page F, page A. A forager may visit the contact form page (page A), open a new tab for the index page (page I), and then return to the first tab and submit the contact form page. The session of the forager would be recorded as page A, page I, page F, page A. Even though this session does not exactly match the pattern for the contact goal, it would still be considered a submission for the contact goal.

Conflicting Contact Goals

During the process of discovering contact goals, four Web sites were flagged as having conflicting contact goals. A conflicting contact goal is where the same page either before or after the form submission is shared by another contact goal on the same Web site. For example, a conflict would occur if two contact goals on the same Web site have the same third page (e.g., contact.html) but different first pages (e.g., contact.html and product.html).

Table 42 provides information about the four Web sites and their conflicting contact goals. The first three Web sites (A-C) each had two conflicting contact goals while the fourth Web site (D) had three conflicts. For each conflicted contact goal the table lists the contact goal id, sequential pattern for the contact goal, and the number of direct sessions matched to the contact goal. The final column of the table describes the action taken to resolve the conflict for the Web site.
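The five criteria for a potential contact goal and the additional direct-match requirement for a confirmed contact goal can be expressed as simple predicates over a mined pattern. The sketch below is illustrative only: the `Pattern` record, its field names, and the example patterns are assumptions, and the mining itself (the SPAM algorithm) is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    pages: tuple          # mined page sequence, in order
    support: int          # goal sessions counted as supporting the pattern
    direct_matches: int   # goal sessions matching the pattern with no gaps


def is_potential(p, form_page, index_pages=("index.html",), min_support=5):
    """Criteria (1)-(5) for a potential contact goal."""
    return (
        p.support >= min_support                  # (1) counted support
        and len(p.pages) == 3                     # (2) three-page sequence
        and p.pages[0] not in index_pages         # (3) first page not index ...
        and p.pages[0] != form_page               #     ... nor form submission
        and p.pages[1] == form_page               # (4) second page is submission
        and p.pages[2] != form_page               # (5) third page is not submission
    )


def is_confirmed(p, form_page, min_direct=5):
    """A potential contact goal with at least `min_direct` direct matches."""
    return is_potential(p, form_page) and p.direct_matches >= min_direct


# Hypothetical mined patterns, for illustration only.
FORM = "submission.html"
symmetric = Pattern(("contactus.html", FORM, "contactus.html"), 300, 267)
from_index = Pattern(("index.html", FORM, "contactus.html"), 40, 12)
few_direct = Pattern(("products.html", FORM, "contactus.html"), 9, 3)
```

A symmetric pattern such as contact form, submission, contact form passes both checks, while a pattern starting at the index page is rejected by criterion (3).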


Table 42: Site-centric: Conflicting Contact Goals

Web Site  Contact Goal  Page Pattern           Direct Matches  Action
A         CG-1          1. contactus.html      267             Remove CG-2
                        2. submission.html
                        3. contactus.html
          CG-2          1. productABC.html     13
                        2. submission.html
                        3. contactus.html
B         CG-1          1. contactus.html      104             Remove CG-2
                        2. submission.html
                        3. contactus.html
          CG-2          1. products.html       5
                        2. submission.html
                        3. contactus.html
C         CG-1          1. contactus.html      550             Remove CG-2
                        2. submission.html
                        3. contactus.html
          CG-2          1. productABC.html     7
                        2. submission.html
                        3. contactus.html
D         CG-1          1. signup.html         70              Combine all
                        2. submission.html
                        3. thanks.html
          CG-2          1. signup1.html        99
                        2. submission.html
                        3. thanks.html
          CG-3          1. signup1.html        88
                        2. submission.html
                        3. signup1.html


The conflicting contact goals on the first three Web sites shared three common characteristics.

(1) A highly-visited contact goal with a symmetrical page pattern (e.g., contactus.html, submission.html, and then contactus.html again).

(2) A rarely-visited contact goal with an asymmetrical page pattern (e.g., products.html, submission.html, and then contactus.html).

(3) The third page of the sequential pattern was shared between both contact goals.

To resolve the conflict between the contact goals at the first three Web sites, the highly-visited contact goals were retained and the rarely-visited contact goals were flagged as “invalid” and removed. The decision to remove the rarely-visited contact goals was made for two reasons.

(1) Symmetric patterns were the most common pattern found amongst contact goals. This is because the default behavior for Web sites on the informational platform was to return a user back to the original contact form after a form was submitted. Therefore, if one contact goal is symmetric, the other is asymmetric, and they both share the same third page, it is more likely the symmetric contact goal is valid.

(2) The rarely-visited contact goals likely represent indirect matches of the highly-visited contact goal. In other words, the sessions with a direct match to the rarely-visited contact goal were really indirect matches to the highly-visited contact goal. However, enough sessions visited the same page after the contact page, but before submitting the contact form (e.g., contactus.html, products.html, submission.html, and then contactus.html), to be discovered as a contact goal. This rationale is plausible because (1) there were so few direct matches for each rarely-visited contact goal and (2) of the 25 direct matches for rarely-visited contact goals, 24 (96.00%) were indirect matches for the highly-visited contact goal [33].

[33] The single non-match did not visit any other pages except for the pattern for Web site A contact goal CG-2.

For the final Web site (D), none of the conflicting contact goals fully met the two reasons listed above to be considered “invalid” contact goals. In regard to the first point listed, although CG-1 and CG-2 shared the same third page (thanks.html), neither of the contact goals had a symmetric pattern. In addition, unlike the second point mentioned, all three contact goals had a substantial number of direct matches and no indirect matches were found between CG-1 and CG-2 [34]. Therefore, the conflicting contact goals were examined to determine if they represented the evolution of a single contact goal's structure (e.g., changing names of forms, thank you behavior).

[34] Indirect matches were not examined for CG-1 versus CG-3 since they do not share any common pages. Indirect matches were also not done for CG-2 versus CG-3 since the third page differed.

Informally, the site was hypothesized to contain a single contact goal (CG-A) for signing people up for activities that evolved from CG-1 to CG-2, and then to CG-3. Table 43 illustrates the first and third pages used for each contact goal (columns two through four). Initially, CG-A was believed to contain the pages signup.html and thanks.html (CG-1). However, at some point the page signup.html was replaced or renamed on CG-A with signup1.html (CG-2). Later on, CG-A was changed a third time when the thank you page (thanks.html) was dropped and the first page was also used as the thank you page (CG-3).

For the hypothesis to hold there should be no overlap in the dates sessions submitted forms for CG-1, CG-2, and CG-3. In addition, the pages signup.html and thanks.html should not be visited by sessions after CG-2 and CG-3 were active, respectively. In support of the hypothesis, the final column of table 43 shows a clear separation in the dates sessions submitted forms for each of the contact goals [35]. In addition, table 44 illustrates that the date range in which each page was visited by any session falls only within the time period the contact goal was active. Therefore, it was believed that all three contact goals represented an evolution of the same contact goal, and thus they were combined into one contact goal.

[35] Sessions were classified to the three contact goals according to preprocessing step 10.


Table 43: Site-centric: Conflicting Contact Goal Pages – Web Site D

Contact Goal  signup.html  signup1.html  thanks.html  Classified Sessions
CG-1          1                          3            9/12/07 5:50 PM – 1/28/08 7:28 PM
CG-2                       1             3            1/30/08 7:44 PM – 5/15/08 9:47 PM
CG-3                       1 & 3                      5/19/08 3:12 PM – 9/23/08 10:52 PM

Table 44: Site-centric: Web Site D Page Visitations

Page           Active Visitation Range
signup.html    9/12/07 2:47 PM – 1/30/08 7:19 PM
signup1.html   1/30/08 7:22 PM – 9/23/08 11:05 PM
thanks.html    9/12/07 6:03 PM – 5/15/08 10:23 PM (a)

(a) There were two additional visits to thanks.html after 5/15/08, on 7/12/08 and 9/16/08. However, since there were only two views during a four-month period, the page was considered inactive after 5/15/08.
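The no-overlap condition behind the evolution hypothesis for Web site D can be checked mechanically from the date ranges in table 43. The sketch below is a minimal illustration; the dictionary `cg_ranges` copies the ranges from the table, while the function names and the date format string are assumptions made for the example.

```python
from datetime import datetime

FMT = "%m/%d/%y %I:%M %p"  # e.g. "9/12/07 5:50 PM"

# Classified-session date ranges for Web site D, copied from table 43.
cg_ranges = {
    "CG-1": ("9/12/07 5:50 PM", "1/28/08 7:28 PM"),
    "CG-2": ("1/30/08 7:44 PM", "5/15/08 9:47 PM"),
    "CG-3": ("5/19/08 3:12 PM", "9/23/08 10:52 PM"),
}


def parse(stamp):
    """Parse a timestamp written in the style used by tables 43 and 44."""
    return datetime.strptime(stamp, FMT)


def ranges_are_disjoint(ranges):
    """True if no two (start, end) ranges overlap."""
    spans = sorted((parse(start), parse(end)) for start, end in ranges.values())
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))


def evolution_order(ranges):
    """Contact goal ids ordered by the start of their active period."""
    return sorted(ranges, key=lambda name: parse(ranges[name][0]))
```

For the table 43 ranges the check succeeds, which is consistent with the three contact goals being successive versions of one form.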


0 5 10 15 20 25 30 35 40 45 50 0 1 2 3 4 5 6 7 8 9 Number of WebsitesNumber of Contact Goals Per Website Figure46.:Site-centric:ContactGoalsPerWebsite ContactGoalStatistics Atotalof77contactgoalswerefoundonthe57remainingWebs ites inthedataset.Figure46illustrateshowmanycontactgoals werediscoveredateachWebsite.The vastmajorityofWebsites(46)onlyhadasinglecontactgoal .Onaverage,eachWebsitehad1.35 contactgoals(0.99standarddeviation),withonesitehavi ng7contactgoals(themaximumnumberfoundonaWebsite).Step10.ClassifyGoalSessionsAfterdiscoveringallthecontactgoalsforaWebsite,allgo alsessionswerethenclassied.Agoal sessionwasclassiedtoacontactgoalaccordingtotheheur isticoutlinedbelow. Directmatch–anexactpatternmatchwithoutanygapsbetwee npages.Thegoalsessionisclassiedtothecontactgoal.Ifnodirectmatchesexistforanyc ontactgoal,thencontinueonto indirectmatch. Indirectmatch–apatternmatchofatleastthersttwopages ,withgapsbetweenpagesallowed. Thegoalsessionisclassiedtothecontactgoalwiththesma llestgap(i.e.,numberofother pagespresent)between(1)thesecondandthirdpageandthen (2)therstandsecondpage. Thismethodassumedthatbecauseanautomatictransfertake splacefromthesecondtothird pageitislesslikelyanotherpagewouldbevisitedinbetwee nthattransfer.Ifnoindirect matchesexistforanycontactgoal,thencontinueontonomat ch. Nomatch–apatternmatchofatleastthersttwopages(witho rwithoutgaps)isnotfoundfor anyofthecontactgoals.Thegoalsessionisnotclassiedto anycontactgoal 36 36 Asessionwithoutamatchisnotclassiedtoanycontactgoal ,evenonWebsiteswithonlyasinglediscovered 164


contact goal. This is because the Web site may contain other contact goals that were simply too small to detect during the previous preprocessing step.

[Figure 47. Site-centric: Goals Per Contact Goal. Histogram; x-axis: number of goals per contact goal; y-axis: number of contact goals.]

Following the heuristic outlined above, 5,827 of the 5,975 goal sessions (97.52%) were classified to a contact goal. Of the 148 unclassified goal sessions, 124 of them (83.78%) were not classifiable because the first page of their session was the form submission page. The remaining 24 goal sessions (16.22%) may have been unclassifiable due to being a match for an undiscovered contact goal, or the user may have visited the first page of the contact goal sequence during a previous session.

Figure 47 illustrates the number of goal sessions per contact goal. Out of the 77 contact goals, 49 of them (63.64%) had 50 or more goal sessions. Table 45 displays the mean, standard deviation, minimum, and maximum number of goal sessions for all 77 contact goals.

Table 45: Site-centric: All Contact Goals Stats

                         Mean    St. Dev.   Minimum   Maximum
Goals per contact goal   75.68   80.57      5         587

Step 11. Remove Web sites without any contact goals having ≥ 50 goal sessions

For the eleventh step, Web sites without any contact goals having at least 50 goal sessions were removed. A total of 10 Web sites were removed along with the 17 corresponding contact goals for those sites. In addition, 28,275 sessions were removed, with 525 of those being goal sessions.

Figure 48 displays the number of goal sessions per contact goal for the remaining 60 contact goals. 49 of the 60 contact goals (81.67%) had 50 or more goal sessions. Table 46 displays the


mean, standard deviation, minimum, and maximum number of goal sessions for the remaining 60 contact goals.

[Figure 48. Site-centric: Web sites with ≥ 50 Goal Sessions Per Contact Goal. Histogram; x-axis: number of goals per contact goal; y-axis: number of contact goals.]

Table 46: Site-centric: Web sites with ≥ 50 Goal Sessions – Contact Goals Stats

                         Mean    St. Dev.   Minimum   Maximum
Goals per contact goal   88.37   86.89      5         587

Step 12. Classify other contact goal sessions as non-goal sessions

For the twelfth and final step of the process, the contact goal with the highest number of goal sessions was selected for each Web site as the contact goal to be analyzed. The selection of a single contact goal per Web site was done to simplify the analysis. Goal sessions from any other contact goal at the Web site were classified as non-goal sessions [37].

As there were 47 Web sites, a total of 47 contact goals were selected to be analyzed. The goal sessions at the remaining 13 contact goals were classified as non-goal sessions. 4,979 goals were achieved on the 47 selected contact goals (93.91%), while the remaining 323 goals from the 13 not-selected contact goals (6.09%) were classified as non-goal sessions.

[37] Even though it is known these other goal sessions did achieve a goal, the goal was for a different contact form, and thus not the goal being focused on at the Web site being analyzed.
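Steps 11 and 12 can be illustrated with a short sketch. This is not the dissertation's code: the data layout (a mapping from hypothetical `(site, contact_goal)` pairs to goal-session counts, and session dictionaries with `site` and `contact_goal` keys) and the function names are assumptions made only for illustration. Ties between equally sized contact goals are broken arbitrarily here, which the text does not address.

```python
from collections import defaultdict

MIN_GOAL_SESSIONS = 50  # threshold stated in Step 11


def select_contact_goals(goal_counts, min_sessions=MIN_GOAL_SESSIONS):
    """Return {site: selected contact goal}.

    goal_counts maps (site, contact_goal) -> number of goal sessions
    (a hypothetical layout, not taken from the source).
    """
    by_site = defaultdict(dict)
    for (site, goal), count in goal_counts.items():
        by_site[site][goal] = count

    selected = {}
    for site, goals in by_site.items():
        # Step 11: drop sites where no contact goal reaches the threshold.
        if max(goals.values()) < min_sessions:
            continue
        # Step 12: keep only the contact goal with the most goal sessions.
        selected[site] = max(goals, key=goals.get)
    return selected


def relabel_sessions(sessions, selected):
    """Mark each session as goal or non-goal relative to the selected contact goal.

    Sessions from removed sites are dropped; goal sessions for a
    non-selected contact goal are treated as non-goal sessions.
    """
    labeled = []
    for session in sessions:
        site = session["site"]
        if site not in selected:
            continue
        is_goal = session["contact_goal"] == selected[site]
        labeled.append({**session, "is_goal": is_goal})
    return labeled
```

Keeping the selection separate from the relabeling mirrors the two steps in the text: the first function decides which sites and contact goals survive, and the second applies that decision to individual sessions.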


6.2.2 Final Dataset

The following subsections provide general statistics about the final dataset, along with characteristics of the Web sites [38] and sessions in the dataset.

[38] Since there is only a single contact goal at a Web site, the terms “Web site” and “contact goal” will be used interchangeably.

General Statistics

Table 47 displays general statistics for the final dataset. The first row of the table lists the total number of sessions in the dataset. Each row after the first lists the total count for the metric and also its percentage compared to the total number of sessions. The overall conversion rate of the dataset was 1.99%, which is similar to conversion rates found at e-commerce Web sites (Moe, 2003; Sismeiro and Bucklin, 2004). Unique visitors accounted for 80.69% of the sessions whereas 19.31% of the sessions were from repeat visitors. Lastly, 1,229,190 pages were viewed over all 250,162 sessions from the 47 Web sites in the dataset.

Table 47: Site-centric: Final Dataset Statistics

                    n           %
Sessions            250,162     n/a
Goal sessions       4,979       1.99%
Non-goal sessions   245,183     98.01%
Unique visitors     201,845     80.69%
Repeat visits       48,317      19.31%
Pages viewed        1,229,190   n/a
Web sites           47          n/a

Web Site Characteristics

Table 48 provides the mean, standard deviation, minimum, and maximum values from all 47 Web sites for the number of days a site was active in the dataset; the number of valid and excluded Web


pages on a site; the number of goal, non-goal, and total sessions visiting each site; and the conversion rate from each site. Valid pages included all pages flagged as valid from step number two in the preprocessing section. Excluded Web pages were those pages flagged as necessary conditions for achieving a goal. Excluded pages were removed from a session when mining patches and trails.

Table 48: Site-centric: Web Site Characteristic Statistics

                      Mean       St. Dev.   Minimum   Maximum
WEB SITE ACTIVITY
  Days active         308.36     104.37     46        377
PAGES
  Valid pages         16.36      13.00      5         79
  Excluded pages      2.04       0.29       2         4
SESSIONS
  Total sessions      5,322.60   7,473.76   245       44,405
  Goal sessions       105.94     90.13      51        587
  Non-goal sessions   5,216.66   7,427.53   192       44,111
OTHER
  Conversion          5.26%      5.70%      0.51%     24.25%

The entire dataset was collected over a 377 day period (09/12/2007 to 09/23/2008). On average, the 47 Web sites in the final dataset were active for 308.36 days (81.79%) [39]. One Web site was only available for roughly a month and a half (46 days), while a number of Web sites were present throughout the entire data collection period of just over one year (377 days). The actual dates in which a Web site was active are shown in figure 49a [40], where the dashed lines indicate the beginning and ending dates of the data collection period. Figure 49b is a histogram illustrating the number of Web sites with a specified number of active days.

[39] Activity is determined by finding the first and last session visit date at each Web site. There may be periods of time between the first and last session visit dates in which no activity occurred on the Web site.

[40] The Web sites were sorted in ascending order by first session date and then descending order by last session date.


Of the 47 Web sites, 27 of them (57.45%) were active from the first day of data collection. 24 of those 27 Web sites (51.06% of all 47 sites) remained active by the last day of data collection. For the 20 Web sites (42.55%) which were not present at the beginning of data collection, 14 of them (70.00%) were still active by the end of the data collection period.

[Figure 49. Site-centric: Web sites' Activity. (a) Active Duration of Web sites; x-axis: active dates; y-axis: Web sites. (b) Web sites by Active Duration; histogram; x-axis: number of days active; y-axis: number of Web sites.]

Figures 50a and 50b illustrate the distribution of the number of valid and excluded Web pages for each Web site, respectively. As seen in figure 50a, most of the Web sites (31 out of 47 (70.21%)) were fairly small in size, having fewer than 17 valid Web pages (16.36 Web pages on average), with the largest site having 79 pages. In terms of excluded Web pages (figure 50b), 46 of the 47 Web sites (97.87%) excluded only two Web pages. This means that the vast majority of Web sites had symmetrical contact goal patterns (e.g., contact form, form submission, contact form). The Web site which combined three contact goals together (from preprocessing step nine) was the only Web site with more than two excluded pages (the site had the maximum of four excluded pages).

[Figure 50. Site-centric: Web site Pages Histograms. (a) Valid Pages; x-axis: number of pages per Web site; y-axis: number of Web sites. (b) Excluded Pages; x-axis: number of excluded pages per Web site; y-axis: number of Web sites.]


On average, each Web site had 5,322.60 total sessions, with almost 50 times as many non-goal sessions as goal sessions. Each Web site had, on average, 105.94 goal sessions (1.99%) and 5,216.66 non-goal sessions (98.01%). Figures 51a–51c illustrate the distribution of the number of total, goal, and non-goal sessions for each Web site, respectively. The majority of Web sites (37 out of 47 (78.72%)) had fairly light traffic, having fewer than 8,000 total sessions (figure 51a). However, there were 10 Web sites (21.28%) with more than 8,000 total sessions, with the most heavily-visited site having 44,405 total sessions. In terms of goal sessions (figure 51b), 41 Web sites (87.23%) had between 50 and 150 goal sessions, with 32 of those 41 (78.05%) having 50 to 100 goal sessions.

[Figure 51. Site-centric: Web site Sessions Histograms. (a) All Sessions; x-axis: number of sessions per Web site. (b) Goal Sessions; x-axis: number of goal sessions per Web site. (c) Non-goal Sessions; x-axis: number of non-goal sessions per Web site. y-axis in all panels: number of Web sites.]

The average conversion rate for the 47 Web sites was 5.26%, with one site having the highest rate of 24.25%. Figure 52 illustrates the distribution of conversion rates for each Web site. 25 of the 47 Web sites (53.19%) had less than a 3% conversion rate. 14 of the Web sites (29.79%) had between a 3% and 8% conversion rate. The remaining eight Web sites (17.02%) had a conversion rate higher than 8%.
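The 1.99% and 5.26% figures answer different questions: the first pools all sessions before dividing, while the second averages each site's own rate. A minimal sketch of the distinction follows; the two example sites are invented for illustration and do not come from the dataset.

```python
def pooled_conversion(sites):
    """Goal sessions divided by total sessions across all sites."""
    goals = sum(goal for goal, _ in sites)
    total = sum(total for _, total in sites)
    return goals / total


def mean_site_conversion(sites):
    """Unweighted mean of each site's own conversion rate."""
    return sum(goal / total for goal, total in sites) / len(sites)


# Hypothetical (goal sessions, total sessions) pairs: one small site that
# converts well and one large site that converts poorly.
example_sites = [(60, 300), (100, 10_000)]

pooled = pooled_conversion(example_sites)      # dominated by the large site
per_site = mean_site_conversion(example_sites)  # each site counts equally
```

Because low-traffic sites count as much as high-traffic ones in the per-site mean, a handful of small sites with high conversion rates can pull that average well above the pooled rate, which is the pattern the reported figures show.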


[Figure 52. Site-centric: Web site Conversion Histogram. x-axis: conversion rate % per Web site; y-axis: number of Web sites.]

Session Characteristics

Table 49 provides the mean, standard deviation, minimum, and maximum values for the number of pages viewed and duration from all 250,162 sessions in the dataset. For each metric, values are provided for three sets of sessions: goal, non-goal, and all sessions. Since the site-centric clickstream model of information foraging uses measures calculated prior to the submission of a contact form to predict a goal, the number of pages viewed and session duration are also provided for goal sessions at the point right before they submitted a contact form.

Each session consisted, on average, of fewer than five page views (4.91), with a maximum of 152 pages viewed by one session. Goal sessions viewed over twice as many pages per session, on average, compared to non-goal sessions (10.34 versus 4.80). Even when only the pages viewed before a form submission were considered, goal sessions still viewed almost one additional page, on average, than non-goal sessions (5.60 versus 4.80). Goal sessions also viewed a little more than half of all their pages (54.16% of their pages) before submitting a contact form. Figures 53a–53d show the distribution of pages viewed by number of sessions.

The average duration from all 250,162 sessions was 3.78 minutes, with one session spending over 134 minutes on a site. The difference between goal and non-goal session duration was even more pronounced than the number of pages viewed. Goal sessions spent over three times as many minutes on a site compared to non-goal sessions (11.46 min versus 3.62 min). Before submitting a contact form, goal sessions spent over two times as much time as the average non-goal session (8.80 min versus 3.62 min). In addition, goal sessions spent more than three-quarters of their time (76.79% of their time) browsing the site before submitting a contact form. Figures 54a–54d illustrate the distribution of session duration in minutes by number of sessions.


[Figure 53. Site-centric: Session Pages Viewed Histograms. (a) All Sessions, (b) Goal Sessions, (c) Goal Sessions – Before Goal, (d) Non-goal Sessions. x-axis: pages viewed; y-axis: number of sessions.]


[Figure 54. Site-centric: Session Duration Histograms. (a) All Sessions, (b) Goal Sessions, (c) Goal Sessions – Before Goal, (d) Non-goal Sessions. x-axis: session duration (min); y-axis: number of sessions.]


Table 49: Site-centric: Session Characteristic Statistics

                         Mean    St. Dev.   Minimum   Maximum
PAGES VIEWED
  All sessions           4.91    5.05       2         152
  Goal sessions          10.34   7.17       2         87
    Before goal          5.60    5.26       1 [a]     84
  Non-goal sessions      4.80    4.94       2         152
SESSION DURATION (MIN)
  All sessions           3.78    6.99       0.00      134.75
  Goal sessions          11.46   11.96      0.17      120.15
    Before goal          8.80    9.21       0.08      94.17
  Non-goal sessions      3.62    6.76       0.00      134.75

[a] A minimum of one page viewed is valid (when sessions are restricted to at least two pages) because only those pages viewed before the form submission were included. In this situation the contact form page was viewed first and then the form was submitted.

6.3 Conclusion

This chapter provided an overview of the datasets used to test the user- and site-centric clickstream models of information foraging. An explanation was given regarding the process by which the data was captured, along with the preprocessing steps undertaken to arrive at the final dataset for each model. General statistics were then shown for each dataset, along with the Web site and session characteristics from each dataset. Graphical representations of many metrics were also shown to illustrate the distributions of values within the datasets.


Chapter 7 Results

Presented in this chapter are the results for both the user- and site-centric clickstream models of information foraging. Descriptive statistics, checks of the assumptions for the statistical tests used to test the model's hypotheses, and the results for each hypothesis are described individually for both of the models. In addition, the site-centric section provides a sensitivity analysis of eight different mining significance and support levels used to calculate measures for the seven hypotheses that relied on learned patches and trails.

7.1 User-centric Clickstream Model of Information Foraging

The user-centric model consisted of four hypotheses regarding the value of an entire site as a patch. The descriptive statistics of the dataset, metric statistics for each hypothesis, and checks of assumptions for the three statistical tests used to test the hypotheses are presented in §7.1.1. The results from each of the statistical tests performed for all four hypotheses are then detailed in §7.1.2.

7.1.1 Descriptive Statistics

Table 50 details the mean, standard deviation, median, minimum, and maximum number of user-sessions by Web site. Statistics for goal and non-goal user-sessions at each Web site are also shown [1]. On average, each Web site had 9,545.06 user-sessions, with more than 45 times as many non-goal user-sessions as goal user-sessions (9,339.02 versus 206.04). The average conversion rate for each Web site (2.21%) was similar to the two percent conversion rate typically found at e-commerce sites (Moe, 2003; Sismeiro and Bucklin, 2004) [2].

[1] Further descriptive statistics for the dataset can be found in §6.1.2.

[2] The average conversion rate when taking the average from each Web site was 6.70% (see §6.1.2).

Table 51 presents the mean, standard deviation, median, minimum, and maximum values of all 52 Web sites for each of the four metrics in the user-centric model. The statistics for the first two


metrics are displayed in three groups of user-sessions: all, goal, and non-goal. The statistics for the last two metrics are also displayed for all user-sessions, but were also split to show the conversion rate within two groups of sessions: those users that returned during the same session and those foragers who stayed on the Web site during the entire session [3].

[3] For REPEAT, the groups demonstrated the conversion rate of sessions that had visited the Web site before and those sessions that were new visitors to the Web site.

Table 50: User-centric: User-sessions by Site

           Mean       St. Dev.    Median     Min   Max
All        9,545.06   14,078.48   4,214.50   406   76,138
Goal       206.04     154.91      141.00     51    597
Non-goal   9,339.02   14,064.45   3,989.00   290   76,041

On average, users spent 1.27 more minutes at each target Web site than on other sites. The goal target sessions spent, on average, 7.44 more minutes on the target Web site than at other e-commerce sites within their respective user-sessions. The non-goal target sessions spent 4.89 fewer minutes on the target Web site compared to the median time spent on other Web sites. A similar distinction between goal and non-goal target sessions was also seen in the relative number of pages visited. Over 20 more pages (20.27) were visited by goal sessions at their target Web site, while non-goal sessions visited roughly one fewer page (−1.07) on their target site compared to other Web sites within a user-session.

The final two measures demonstrated the conversion rates from two groups of sessions. On average, a 0.91 percentage point increase in conversion rate was found for sessions that stayed on a Web site the entire session versus those that left and returned during the same session (7.32% versus 6.41%). A similar difference was also found between the two groups within the REPEAT measure. A 0.62 percentage point increase in conversion rate, on average, was found for sessions that were previous visitors of the Web site versus new visitors (6.97% versus 6.35%).


Table 51: User-centric: Metric Statistics

                        N    Mean     St. Dev.   Median   Min      Max
INFORMATION PATCH – SITE-PATCH
RELDUR (in minutes)
  All                   52   1.27     9.28       −2.00    −11.00   43.50
  Goal                       7.44     9.59       6.00     −10.75   43.50
  Non-goal                   −4.89    2.09       −5.00    −11.00   0.00
RELPGS
  All                   52   9.60     15.57      2.00     −6.00    77.00
  Goal                       20.27    15.79      17.25    1.00     77.00
  Non-goal                   −1.07    2.77       −1.25    −6.00    12.00
RETURN
  All                   52   6.87%    7.51%      3.88%    0.05%    41.04%
  P(Goal | Return)           6.41%    6.68%      3.78%    0.13%    24.33%
  P(Goal | Stayed)           7.32%    8.30%      4.11%    0.05%    41.04%
REPEAT
  All                   52   6.66%    7.78%      3.64%    0.10%    46.63%
  P(Goal | Repeat)           6.97%    6.84%      4.33%    0.10%    28.72%
  P(Goal | New)              6.35%    8.68%      2.83%    0.20%    46.63%

Note: all values are based on the median values from each Web site's goal and non-goal sessions.
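The relative measures in Table 51 compare behavior at the target Web site with behavior at the other sites visited in the same user-session. The sketch below follows only the description given in this chapter (target value minus the median value at other sites); the formal definitions of RELDUR and RELPGS appear elsewhere in the dissertation, so the function name, the dictionary layout, and the handling of user-sessions with no other sites are assumptions.

```python
from statistics import median


def relative_measure(user_session, target_site):
    """Target-site value minus the median value at the other sites.

    user_session maps site name -> minutes spent (or pages viewed)
    within one user-session; this layout is hypothetical.
    Returns None when no other site was visited, since no comparison
    is possible in that case (an assumption of this sketch).
    """
    others = [value for site, value in user_session.items() if site != target_site]
    if not others:
        return None
    return user_session[target_site] - median(others)


# Invented example: 12 minutes at the target site versus 3, 5, and 10
# minutes at three other sites in the same user-session.
example = {"target": 12, "site_a": 3, "site_b": 5, "site_c": 10}
relative_minutes = relative_measure(example, "target")
```

A positive value means the forager spent more time (or viewed more pages) at the target site than was typical elsewhere in the session, and a negative value means less.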


Assumptions of Statistical Tests

Table 52 lists the assumptions for each of the three statistical tests used to test the model's hypotheses [4]. A ✓ symbol indicates the assumption was met for the statistical test, while a ✗ symbol means the assumption was not met. If both a ✓ and a ✗ symbol are shown then the assumption held for some metrics, but not for all of the metrics. There were a total of five assumptions for the paired t-test (assumptions three through seven); four for the exact Wilcoxon signed rank test (assumptions three through six); and three for the dependent-samples sign test (assumptions one, two, and five).

[4] Assume within the data there are $n$ pairs of $X$ and $Y$ observations $(X_0, Y_0), (X_1, Y_1), \ldots, (X_n, Y_n)$. For each observation pair, the difference $D_i$ is calculated between $X_i$ and $Y_i$, where $D_i = Y_i - X_i$.

Table 52: User-centric: Assumptions of Statistical Tests

#   Assumption                                                                  t-Test   Wilcoxon   Sign Test
1   The pairs (X_i, Y_i) are internally consistent, in that if P(+) > P(-)                          ✓
    for one pair (X_i, Y_i), then P(+) > P(-) for all pairs.
2   The measurement scale is at least ordinal within each pair.                                     ✓
3   The measurement scale of the D_i's is at least interval.                    ✓        ✓
4   The D_i's all have the same mean.                                           ✓        ✓
5   The D_i's (or bivariate random variables (X_i, Y_i)) are mutually           ✓        ✓          ✓
    independent.
6   The distribution of each D_i is symmetric.                                  ✓ ✗      ✓ ✗
7   The D_i's are identically distributed normal random variables.              ✗

(Conover, 1999, pg. 157-158, 353, 363)

Further details about whether assumptions were met or not for each of the statistical tests are provided below. The tests are presented in order from the least to the most stringent assumptions: sign test, Wilcoxon test, and t-test.

Dependent-samples Sign Test

All three assumptions of the sign test were fully met.

Assumption 1 – Each observation pair was internally consistent. If P(+) > P(-), P(+) < P(-), or P(+) = P(-) for a single observation pair, then P(+) > P(-), P(+) < P(-), or P(+) = P(-) was the same across all observation pairs, respectively.


Assumption 2 – Each metric used in this research was a quantitative variable measured on at least an interval scale.

Assumption 5 – Each pair of bivariate random variables (X_i, Y_i) was taken from a different and independent Web site.

Exact Wilcoxon Signed Rank Test

Three of the four assumptions for the exact Wilcoxon signed rank test were met. The fourth assumption dealing with symmetry of the D_i's was met for some of the metrics, but not for all of the metrics.

Assumption 3 – Each metric used in this research was a quantitative variable measured on at least an interval scale.

Assumption 4 – Each of the differences (D_i) was taken from a Web site within the same population. Therefore, the mean of each difference was expected to be the same.

Assumption 5 – Each of the differences (D_i) was taken from a different and independent Web site.

Assumption 6 – The last four columns from table 53 show two different measures of skewness for the four metrics: coefficient of skewness and quartile skew coefficient. Since the Wilcoxon test only considers data points with non-zero differences, only the skew values from the “No zeros” columns were analyzed [5].

[5] None of the 52 Web sites had the same median value for goal and non-goal sessions for any of the four measures. Therefore, all 52 Web sites were included when calculating skew for both the “All” and “No Zeros” columns.


Table 53: User-centric: Metric Normality and Skew

                N                 Lilliefors           Shapiro              Skew              Quartile Skew
Hyp.  Metric    Total  No Zeros   D     p-Value        W     p-Value        All    No Zeros   All     No Zeros
INFORMATION PATCH – SITE-PATCH
SC1   RELDUR    52     52         0.22  < 0.0001 ***   0.75  < 0.0001 ***   −2.23  −2.23      0.00    0.00
SC2   RELPGS    52     52         0.18  0.0002 ***     0.83  < 0.0001 ***   −1.79  −1.79      0.11    0.11
SC3   RETURN    52     52         0.21  < 0.0001 ***   0.69  < 0.0001 ***   2.40   2.40       0.93    0.93
SC4   REPEAT    52     52         0.29  < 0.0001 ***   0.65  < 0.0001 ***   3.32   3.32       −0.29   −0.29

* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01
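The two skew measures reported in Table 53, the coefficient of skewness and the quartile skew coefficient (both following Helsel and Hirsch, 1992), can be sketched with the Python standard library. This is an illustration rather than the code used in the dissertation; in particular, the quartile interpolation rule (`method="inclusive"`) is an assumption, and other quartile conventions can give slightly different values.

```python
from statistics import mean, quantiles, stdev


def coefficient_of_skewness(values):
    """g = n / ((n - 1)(n - 2)) * sum(((x_i - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are required")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


def quartile_skew(values):
    """qs = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1), bounded by -1 and 1."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


# Invented data: a symmetric sample, and the same sample with its
# largest value replaced by an extreme outlier.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]
```

On the invented samples, a single outlier moves the coefficient of skewness far from zero while leaving the quartile skew coefficient unchanged, which is the robustness property that motivates reporting both measures.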


The first two of the skew columns provide the commonly used coefficient of skewness ($g$), as shown in equation 7.1 (Helsel and Hirsch, 1992). In equation 7.1, $n$ is the number of elements, $x_i$ is the value of the $i$-th element, $\bar{X}$ is the sample mean, and $s$ is the sample standard deviation. Although widely used, when using the coefficient of skewness “... an otherwise symmetric distribution having one outlier will produce a large (and possibly misleading) measure of skewness” (Helsel and Hirsch, 1992, pg. 10).

$$g = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \frac{(x_i - \bar{X})^3}{s^3} \qquad (7.1)$$

Due to the sensitivity of the coefficient of skewness to outlying points, a more robust and resilient measure of skew which is not affected by outliers was used (last two columns of table 53). The formula for the quartile skew coefficient is shown in equation 7.2 (Helsel and Hirsch, 1992), where $P_{0.25}$, $P_{0.50}$, and $P_{0.75}$ refer to the lower quartile, median, and upper quartile of the data, respectively. The quartile skew coefficient can range from negative one to one. Since the quartile skew measure only considers the difference between the upper and lower quartiles and the median, outlying points (such as the maximum and minimum) do not impact the value of the skew measure.

$$qs = \frac{(P_{0.75} - P_{0.50}) - (P_{0.50} - P_{0.25})}{P_{0.75} - P_{0.25}} \qquad (7.2)$$

Besides statistics on skew as shown in table 53, figure 55 is also provided to graphically show the distribution of points for each measure.

RELDUR and RELPGS were both negatively skewed (−2.23 and −1.79), having a long tail below the median. However, between the lower and upper quartiles, the distribution of points appears to be mostly symmetric around the median. Examining the quartile skew coefficient values, RELDUR did not demonstrate any skew (0.00), while RELPGS had a slight positive skew (0.11).

RETURN and REPEAT had a skew opposite of the first two measures and were positively skewed (2.40 and 3.32). Both measures had a long tail above the median. The quartile skew coefficient demonstrated a severe positive skew for RETURN (0.93) and a moderate negative skew for REPEAT (−0.29).

[Figure 55. User-centric: Difference Plots. (a) RELDUR, (b) RELPGS, (c) RETURN, (d) REPEAT. Each panel plots non-goals minus goals for the Web sites included (All, No Zeros); panels (c) and (d) are in percent.]

RELDUR was the only metric which met the assumption of no skew (using the quartile skew coefficient). RELPGS and REPEAT both had slight to moderate amounts of skew and thus did not fully meet the assumption. Lastly, RETURN was found to have a severe skew and did not meet the assumption of symmetry.

Paired t-Test

Three of the five assumptions for the paired t-test were met. The fourth assumption dealing with symmetry of the D_i's was met for some of the metrics, but not for all of the metrics. The fifth assumption of normality was not formally met for any of the metrics.

Assumption 3 – Each metric used in this research was a quantitative variable measured on at least an interval scale.

Assumption 4 – Each of the differences (D_i) was taken from a Web site from the same population. Therefore, the mean of each difference was expected to be the same.

Assumption 5 – Each of the differences (D_i) was taken from a different and independent Web site.

Assumption 6 – The symmetry for each measure was determined in the same manner as for the exact Wilcoxon signed rank test [6]. Of the four measures, three of the measures had between zero and moderate amounts of skew: RELDUR did not have any skew, RELPGS had slight skew, and REPEAT had moderate skew. RETURN had severe skew and did not meet the assumption of symmetry.

Assumption 7 – Two formal statistical tests of normality were performed on the differences (D_i's) for each of the metrics [7]: Lilliefors (Kolmogorov-Smirnov) and Shapiro-Wilk normality tests (Conover, 1999) [8]. Each test has a null hypothesis that the data follows a normal distribution with an unspecified mean and variance. Lilliefors is a non-parametric normality test, whereas Shapiro-Wilk has been found to have greater power than other tests (such as Lilliefors) in many situations (Conover, 1999).

The null hypothesis of a normal distribution was rejected for all four measures using both the Lilliefors and Shapiro-Wilk tests. In addition to the tests of normality, the skew values from table 53 and the graphical depiction of each metric's points (figure 55) provided further evidence that the measures did not follow a normal distribution.

[6] The “All” column for values of skew was used (table 53) to determine skew for the t-test because the t-test uses all differences (non-zero and zero). Since all Web sites had a difference between goal and non-goal sessions, the “All” and “No Zeros” columns are identical.

[7] Symmetry is a necessary, but not sufficient, condition for normality. Although RETURN was severely skewed and thus not symmetrical, the tests of normality were still performed on the measure for purposes of completeness.

[8] Although presented, formal tests of normality (such as Lilliefors and Shapiro-Wilk) are known to be sensitive to even slight departures from normality (Mendenhall and Sincich, 2003).

Overall, only the assumptions for the dependent-samples sign test were fully met. Therefore, the sign test was used to test the hypotheses of the user-centric model [9]. The assumptions for the Wilcoxon test and t-test were not completely met and are provided only for comparison purposes. The lack of symmetry (and normality) for some of the measures means the results from the Wilcoxon and t-test must be interpreted with caution.

[9] The sign test is generally the least powerful of the three tests (Conover, 1999). However, as all of the sign test's assumptions were met, greater confidence can be given to the results of the sign test compared to the other two tests.

7.1.2 Hypotheses Testing

Table 54 presents the results for the four user-centric hypotheses. The first two columns of the table list the hypothesis number and name of the metric being tested. The third and fourth columns list the total number of Web sites and the number of sites with a non-zero difference (i.e., $D_i \neq 0$), respectively. The total number of Web sites was used in the t-test, while only Web sites with non-zero differences were used for the Wilcoxon and sign tests [10]. Columns five through seven list the t statistic, degrees of freedom (df), and p-value for the t-test. The eighth and ninth columns display the V statistic and p-value for the Wilcoxon test. The final two columns list the S statistic and p-value for the sign test [11].

[10] Within the user-centric dataset all of the Web sites had non-zero differences.

[11] Results of the sign test are presented below since all three assumptions for the test were met. The results of the t-test and Wilcoxon test are provided in footnotes. Since neither the t-test nor the Wilcoxon test met all of their assumptions, the results of those tests should be interpreted with caution.


Table 54: User-centric: Results

                     N                 t-test                       Wilcoxon               Sign Test
Hyp.          Metric   Total  No Zeros   t      df   p-Value        V      p-Value         S    p-Value
INFORMATION PATCH – SITE-PATCH
UC1           RELDUR   52     52         9.71   51   < 0.0001 ***   1,376  < 0.0001 ***    51   < 0.0001 ***
UC2           RELPGS   52     52         10.37  51   < 0.0001 ***   1,378  < 0.0001 ***    52   < 0.0001 ***
UC3           RETURN   52     52         −1.96  51   0.9724         445    0.9874          22   0.8942
UC3 (opp) [a]          52     52         1.96   51   0.0276 **      933    0.0129 **       30   0.1659
UC4           REPEAT   52     52         0.82   51   0.2075         1,021  0.0010 ***      36   0.0039 ***

[a] Hypothesis tested in opposite direction as original – i.e., leaving and returning will be negatively associated with achieving a goal on this long tail Web site.

* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01
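The sign-test columns of Table 54 can be reproduced approximately with an exact binomial calculation. The sketch below assumes a one-sided (upper-tail) test in which S counts the Web sites whose difference falls in the hypothesized direction out of n non-zero differences, each direction being equally likely under the null hypothesis; the source does not spell out this computation, so treat the function as an illustration.

```python
from math import comb


def sign_test_upper(s, n):
    """One-sided exact sign-test p-value: P(X >= s) for X ~ Binomial(n, 0.5)."""
    if not 0 <= s <= n:
        raise ValueError("s must lie between 0 and n")
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n


# S values from Table 54, each out of n = 52 Web sites.
p_uc1 = sign_test_upper(51, 52)
p_uc3 = sign_test_upper(22, 52)
p_uc3_opposite = sign_test_upper(30, 52)
p_uc4 = sign_test_upper(36, 52)
```

Under these assumptions the computed values agree with the reported p-values to about four decimal places (roughly 0.8942 for UC3, 0.1659 for its opposite, and 0.0039 for UC4), which suggests the table reports upper-tail exact probabilities.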


UC1 – RELDUR

The first hypothesis conjectured that goal achieving foragers would spend relatively more time on the Web site where a purchase was made than on any other site they visited. The rationale behind the hypothesis was that, because foragers are assumed to be rational and are thus looking to reduce their search costs (Pirolli, 2007), they will only spend time on a site as long as they are obtaining value from that site (i.e., satisficing on rate of information gain (Pirolli, 2007; Simon, 1956)). Thus, more information can be assumed to be gathered on a Web site where more time is spent than on other Web sites, which brings a forager one step closer to being at a point to make a decision to purchase or not.

The results of the sign test provide support for the hypothesis at α = 0.01 (S = 51; p-value < 0.0001) [12]. Of the 52 Web sites in the dataset, 51 of them had a higher median relative duration amongst goal sessions than non-goal sessions. Compared to other sites in their user-sessions, goal sessions spent over seven additional minutes on the target Web site while non-goal sessions spent almost five minutes less.

The use of duration to explain choice behavior has not produced consistent results in prior literature. Studies of absolute duration in predicting choice behavior have found mixed associations (Sismeiro and Bucklin, 2004) and also differences in significance (Padmanabhan et al., 2001) depending on the task examined and data used. The results of hypothesis UC1 provide additional support for a positive association between duration and goal achievement. However, the hypothesis only supports the notion of a positive association for relative rather than absolute duration.

UC2 – RELPGS

The second hypothesis was similar to the first hypothesis because it also relied on the concept of satisficing (Pirolli, 2007; Simon, 1956). However, the number of pages viewed by a forager was examined instead of the duration spent at the site. Whenever a forager clicked on a link at a site, that was an implicit signal the user believed other information of value would be obtained from the site. Therefore, more pages (relative to other sites) should be an indication of a greater wealth of information being obtained.

[12] Hypothesis UC1 was also significant at α = 0.01 for both the t-test (t = 9.71; df = 51; p-value < 0.0001) and Wilcoxon test (V = 1,376; p-value < 0.0001).


The results from the sign test supported the hypothesis at α = 0.01 (S = 52; p-value < 0.0001) [13]. All 52 Web sites in the dataset had a higher median relative number of pages viewed for goal sessions versus non-goal sessions. Goal sessions viewed, on average, over twenty additional pages on the target Web site compared to other sites, whereas non-goal sessions viewed one fewer page than at other Web sites.

[13] Hypothesis UC2 was also significant at α = 0.01 for both the t-test (t = 10.37; df = 51; p-value < 0.0001) and Wilcoxon test (V = 1,378; p-value < 0.0001).

Support has been mixed in prior literature for the role the number of pages viewed has on choice behavior. Studies of the absolute number of pages viewed have found positive association (Awad et al., 2006; Moe, 2003), no association (Chatterjee et al., 2003), and mixed association depending on the task (Sismeiro and Bucklin, 2004) or type of pages viewed (Van den Poel and Buckinx, 2005). Like duration, the result of this hypothesis also lends additional support to the notion of a positive association between number of pages viewed and goal achievement. However, the support is restricted to a relative examination of pages viewed rather than the absolute value commonly used in prior research.

Although both UC1 and UC2 were supported at α = 0.01, RELPGS was slightly better at distinguishing between the two groups of sessions (S = 52 versus 51). However, part or all of the difference between the two hypotheses may have been an artifact of measurement constraints, since RELDUR was only measured at the minute-level. Thus, a finer-grained measurement may be better able to tease out differences in duration between sessions than what was shown in the user-centric model [14].

[14] The site-centric model does indicate duration is a better manner of distinguishing between types of sessions than pages viewed (see §7.2.2). However, considering a different dataset was used, the results are not directly comparable. For example, the Web sites in the site-centric dataset may have had fewer pages and thus number of pages viewed would be less able to distinguish between groups of sessions. In addition, the site-centric model does not take into account behavior relative to other Web sites.

UC3 – RETURN

The third hypothesis examined the returning behavior of a forager. In particular, it was hypothesized that foragers who returned during the same session would be more likely to achieve a goal than foragers that did not. The rationale was that users initially left a site because they expected to find another Web site with a higher rate of information gain (i.e., they followed the patch-leaving rule from the marginal value theorem (Pirolli, 2007; Charnov, 1976)). However, after the forager


explored other aspects of their environment, they better recognized the value of the site they initially left. Therefore, a forager that returned to the site they left not only had knowledge of what was on the site, but also an expectation that the Web site would result in additional information gain, which was hypothesized to indicate greater likelihood of goal achieving behavior.

The sign test failed to support the hypothesis at any of the tested α levels (S = 22; p-value = 0.8942) [15]. Only 22 of the 52 Web sites found a greater incidence of goal achievement among sessions that left and returned as opposed to those sessions that stayed on the site for the entire session. Not only was UC3 not supported, but the expected association of returning behavior to goal achievement appeared to be incorrect. Instead of being a positive association, the results pointed toward a strong (but non-significant) negative association, i.e., a forager was less likely to achieve a goal if the user left a Web site and then returned within the same session (the opposite of hypothesis UC3) [16].

Although the results of UC3 were not expected, they did provide additional support highlighting the efficacy of a forager's ability to search with only imperfect information and limited computational resources. For example, foragers appeared to be capable of judging the rate of information gain and value of a Web site relatively well, according to their need. The efficacy of foragers' search behavior was informally backed up because more users who purchased a product from their target Web site did not feel the need to visit other Web sites during their session [17].

As far as can be determined, prior literature has not examined the returning behavior of a user during the same session. Therefore, the results of this hypothesis provide an initial (but non-significant) clue into the relationship between returning behavior during the same session and goal achievement.

[15] Hypothesis UC3 was also not supported at any of the tested α levels for both the t-test (t = −1.96; df = 51; p-value = 0.9724) and Wilcoxon test (V = 445; p-value = 0.9874).

[16] The opposite of hypothesis UC3 was also not supported at any of the tested α levels (S = 30; p-value = 0.1659). However, the opposite of hypothesis UC3 was supported at α = 0.05 for both the t-test (t = 1.96; df = 51; p-value = 0.0276) and Wilcoxon test (V = 933; p-value = 0.0129). The discrepancy of findings may be a symptom of the sign test's lack of power (the rank or actual differences between data points are not used in the sign test). However, another possibility may be the t-test and Wilcoxon test are providing inaccurate results, especially when considering the extreme skew of the RETURN measure (quartile skew of 0.93).

[17] This assumes the action of purchasing a product from the target Web site was a “good” decision.


UC4 – REPEAT

The final hypothesis also examined returning behavior, but did so by looking at how past visitations of a Web site affected the propensity of foragers to achieve a goal. The expectation was prior visitation of valuable sites would stand out more in a person's memory (i.e., be more easily accessible) than less valuable sites. Thus, a repeat visitor would be more likely to achieve a goal because of the expectation that the user was familiar with the site and had an understanding of the available information from the site.

Using the sign test, the final hypothesis was supported at α = 0.01 (S = 36; p-value = 0.0039)¹⁸. 36 of the 52 Web sites had a higher median probability of a purchase amongst goal sessions when a user had visited the site before.

Prior literature has found mixed associations (dependent on the task) between users returning during different sessions and completing a task (Sismeiro and Bucklin, 2004). The results of hypothesis UC4 lend additional support to a positive association between repeat visitation behavior and goal achievement.

Summary of Results

Table 55 summarizes the results of the hypotheses testing. Of the four hypotheses, UC1, UC2, and UC4 were all supported at α = 0.01. Hypothesis UC3 was not supported in either its original or opposite form at any of the tested alpha levels (0.01, 0.05, or 0.10).

¹⁸ Hypothesis UC4 was not significant at α = 0.10 for the t-test (t = 0.82; df = 51; p-value = 0.2075), but was significant at α = 0.01 for the Wilcoxon test (V = 1,021; p-value = 0.0010). The t-test may have failed to reach significance because the actual difference between the goal and non-goal sessions was only 0.62% (6.97% versus 6.35%). The Wilcoxon and sign tests do not consider the absolute difference, but rather the relative difference (i.e., rank) or if one group was higher than the other.


Table 55: User-centric: Hypotheses Results Summary

Hyp.          Metric    Hypothesis Supported?
INFORMATION PATCH – SITE-PATCH
UC1           RELDUR    Yes***
UC2           RELPGS    Yes***
UC3           RETURN    No
UC3 (opp)ᵃ              No
UC4           REPEAT    Yes***

ᵃ Hypothesis tested in opposite direction as original – i.e., leaving and returning will be negatively associated with achieving a goal on this long tail Web site.
*p ≤ 0.10; **p ≤ 0.05; ***p ≤ 0.01

7.2 Site-centric Clickstream Model of Information Foraging

The site-centric model consisted of nine hypotheses that were concerned with both information scent and patches. Descriptive statistics of the dataset and each measure along with checks of assumptions for the three statistical tests used to test the hypotheses are presented in §7.2.1. The results of all nine hypotheses are then detailed in §7.2.2. As three of the hypotheses (seven total measures) used metrics derived from learning patches and trails, a sensitivity analysis was performed at eight different mining levels of significance and support. The descriptive statistics and results of the sensitivity analysis are provided in §7.2.3.

7.2.1 Descriptive Statistics

Table 56 details the mean, standard deviation, median, minimum, and maximum number of sessions by Web site. Statistics for goal and non-goal sessions at each Web site are also shown¹⁹. The data is separated in table 56 into three groups: the entire dataset, training set, and testing set.

The entire dataset was used to test the six hypotheses which did not rely on mining patches or trails (SC1–SC4, and SC7–SC8). The training dataset was used to discover patches and trails that would eventually be used to calculate measures for hypotheses SC5a-c, SC6, and SC9a-c²⁰. The actual calculation of the measures for hypotheses SC5a-c, SC6, and SC9a-c was done using sessions from the testing set of data.

Table 56: Site-centric: Sessions by Site

                Mean        St. Dev.    Median      Min    Max
ENTIRE DATASET
  All           5,322.60    7,473.76    2,637.00    245    44,405
  Goal            105.94       90.13       79.00     51       587
  Non-goal      5,216.66    7,427.53    2,566.00    192    44,111
TRAINING SET
  All           3,744.23    5,418.42    1,696.00    168    31,730
  Goal             74.28       63.07       56.00     36       411
  Non-goal      3,669.96    5,386.00    1,656.00    130    31,525
TESTING SET
  All           1,578.36    2,156.26      901.00     48    12,675
  Goal             31.66       27.07       23.00     15       176
  Non-goal      1,546.70    2,143.14      884.00     29    12,586

On average, each Web site had a total of 5,322.60 sessions with more than 49 times as many non-goal sessions as goal sessions (5,216.66 versus 105.94). The average conversion rate for each Web site (1.99%) was similar to the two percent conversion rate typically found at e-commerce Web sites (Moe, 2003; Sismeiro and Bucklin, 2004)²¹.

Overall, the training set of data represented 70.35% of all sessions. For mining purposes, each Web site had an average of 3,744.23 sessions. The training set had a similar ratio of goal versus non-goal sessions (49 times more non-goal than goal sessions) and a slightly lower conversion rate than what was seen in the entire dataset (1.98% versus 1.99%).

The testing set was also very similar in makeup to the entire dataset. Each Web site had an average of 1,578.36 sessions, with almost 49 times as many non-goal sessions as goal sessions (1,546.70 versus 31.66). The conversion rate was also very similar to the entire dataset (2.01% versus 1.99%).

As seen in table 56, the makeup of the training and testing sets does not appear to differ drastically from the entire dataset. Therefore, the results of mining patches and trails and the calculation of measures using the testing and training datasets are assumed to be generalizable to the entire dataset (i.e., the results are not an artifact of the manner in which the data was split).

Table 57 presents the mean, standard deviation, median, minimum, and maximum values from all 47 Web sites for each of the six metrics that did not rely on mining patches and trails. The statistics for the first and last two metrics are displayed in three groups of sessions: all, goal, and non-goal. The statistics for the middle two metrics are also displayed for all sessions, but were split to show the conversion rate within two groups of sessions: those foragers that returned during the same session and those users who stayed on the Web site during the entire session²².

The average duration of users at each Web site was 3.69 minutes. The goal sessions spent, on average, 4.60 more minutes on a Web site compared to the non-goal sessions (5.99 minutes versus 1.39 minutes). A similar, but smaller, difference was also seen in the number of pages viewed by goal and non-goal sessions. On average, goal sessions viewed 0.43 more pages than non-goal sessions did (4.28 versus 3.85)²³.

The middle two measures (RETURN and REPEAT) demonstrate the conversion rates from two groups of sessions. On average, a 5.53 percentage-point increase in conversion rate was found for sessions that stayed on a Web site the entire session versus those that left and returned during the same session (7.24% versus 1.71%). A similar, but not as severe, difference was also found between the two groups within the REPEAT measure. A 1.02 percentage-point increase in conversion rate, on average, was found for sessions that were previous visitors of the Web site versus new visitors (6.09% versus 5.07%).

The percentage of unique pages viewed, on average, was 79.42%. Goal sessions had a 20.22 percentage-point increase in unique pages viewed over non-goal sessions (89.53% versus 69.31%). The difference in clickstream linearity was also similar to the percentage of unique pages viewed in both direction and difference. A 0.23 increase in clickstream linearity was seen between the goal and non-goal sessions (0.98 versus 0.75).

¹⁹ Further descriptive statistics for the dataset can be found in §6.2.2.
²⁰ The training set consisted of the first 70% of goal sessions (and all non-goal sessions occurring before the last goal session added to the training set).
²¹ The average conversion rate when taking the average from each Web site was 5.26% (see §6.2.2).
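The conversion rates and non-goal to goal ratios quoted for Table 56 can be re-derived from the reported per-site means. The following is a minimal sketch in Python, not part of the original analysis; the function and variable names are hypothetical, and the inputs are simply the mean goal and non-goal session counts copied from Table 56.

```python
# Mean sessions per Web site copied from Table 56: (goal, non-goal).
TABLE_56_MEANS = {
    "entire": (105.94, 5216.66),
    "training": (74.28, 3669.96),
    "testing": (31.66, 1546.70),
}


def conversion_rate(goal: float, non_goal: float) -> float:
    """Goal sessions as a percentage of all sessions."""
    return 100.0 * goal / (goal + non_goal)


def non_goal_ratio(goal: float, non_goal: float) -> float:
    """Number of non-goal sessions per goal session."""
    return non_goal / goal


def summarize(means=TABLE_56_MEANS):
    """Return {dataset: (conversion rate in %, non-goal/goal ratio)}, rounded to 2 places."""
    return {
        name: (round(conversion_rate(g, n), 2), round(non_goal_ratio(g, n), 2))
        for name, (g, n) in means.items()
    }


if __name__ == "__main__":
    for name, (rate, ratio) in summarize().items():
        print(f"{name}: conversion rate {rate}%, {ratio} non-goal sessions per goal session")
```

Under these inputs the rounded conversion rates agree with the 1.99%, 1.98%, and 2.01% reported in the text, and each ratio falls close to 49.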
Table 57: Site-centric: Metric Statistics

                        N     Mean      St. Dev.   Median     Min       Max
INFORMATION PATCH – SITE-PATCH
SITEDUR (in minutes)
  All                   47    3.69      2.87       2.89       0.33      13.32
  Goal                        5.99      2.32       5.90       1.39      13.32
  Non-goal                    1.39      0.66       1.23       0.33      3.15
SITEPGS
  All                   47    4.06      1.26       4.00       2         9
  Goal                        4.28      1.46       4.00       2         9
  Non-goal                    3.85      1.00       4.00       2         7
RETURN
  All                   47    4.47%     6.04%      2.16%      0.00%     31.72%
  P(Goal | Return)            1.71%     2.68%      0.81%      0.00%     14.02%
  P(Goal | Stayed)            7.24%     7.14%      4.71%      0.65%     31.72%
REPEAT
  All                   47    5.58%     5.99%      3.20%      0.35%     27.27%
  P(Goal | Repeat)            6.09%     6.33%      3.55%      0.35%     27.27%
  P(Goal | New)               5.07%     5.66%      2.96%      0.46%     24.89%
STRICT INFORMATION SCENT
UNIQUE
  All                   47    79.42%    15.23%     77.50%     50.00%    100.00%
  Goal                        89.53%    11.94%     100.00%    58.33%    100.00%
  Non-goal                    69.31%    10.85%     66.67%     50.00%    100.00%
LINEAR
  All                   47    0.86      0.29       1.00       0.00      1.00
  Goal                        0.98      0.08       1.00       0.60      1.00
  Non-goal                    0.75      0.37       1.00       0.00      1.00

Note: all values are based on the median values from each Web site's goal and non-goal sessions.

Table 58 lists the mean, standard deviation, median, minimum, and maximum values for the seven metrics (hypotheses SC5a-c, SC6, and SC9a-c) that were calculated from mined patches and trails at the 0.05 significance level²⁴. The statistics for the metrics are displayed in three groups of sessions: all, goal, and non-goal. The statistics for the four metrics regarding page-patches were calculated from the 14 Web sites that discovered patches, while the three trail measures were calculated from the 10 Web sites that discovered trails.

Table 58: Site-centric: Metric Statistics (Significant – 0.05)

                        N     Mean      St. Dev.   Median     Min       Max
INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                   14    0.32      0.40       0.00       0.00      1.31
  Goal                        0.54      0.43       0.60       0.00      1.31
  Non-goal                    0.10      0.21       0.00       0.00      0.57
PATCHLAST
  All                   14    0.29      0.34       0.00       0.00      1.03
  Goal                        0.47      0.36       0.47       0.00      1.03
  Non-goal                    0.10      0.20       0.00       0.00      0.51
PATCHSUM
  All                   14    0.79      1.25       0.00       0.00      4.53
  Goal                        1.40      1.52       1.06       0.00      4.53
  Non-goal                    0.17      0.36       0.00       0.00      1.04
PATCHDUR (in seconds)
  All                   14    68.28     47.02      51.38      19.00     178.00
  Goal                        89.48     53.06      76.63      19.00     178.00
  Non-goal                    47.07     28.43      38.88      20.00     134.00
RELAXED INFORMATION SCENT
TRAILMAX
  All                   10    0.27      0.37       0.00       0.00      1.04
  Goal                        0.50      0.40       0.52       0.00      1.04
  Non-goal                    0.05      0.16       0.00       0.00      0.50
TRAILLAST
  All                   10    0.26      0.35       0.00       0.00      1.04
  Goal                        0.48      0.37       0.52       0.00      1.04
  Non-goal                    0.05      0.16       0.00       0.00      0.50
TRAILSUM
  All                   10    0.33      0.50       0.00       0.00      1.80
  Goal                        0.61      0.58       0.52       0.00      1.80
  Non-goal                    0.05      0.16       0.00       0.00      0.50

Note: all values are based on the median values from each Web site's goal and non-goal sessions.

The first three patch measures (PATCHMAX, PATCHLAST, and PATCHSUM) had average patch values of 0.32, 0.29, and 0.79 among all sessions, respectively. The difference between goal and non-goal sessions was similar for PATCHMAX and PATCHLAST. PATCHMAX had a difference of 0.44 (0.54 versus 0.10) and the difference for PATCHLAST was 0.37 (0.47 versus 0.10). PATCHSUM had the largest difference of the three measures with a value of 1.23 (1.40 versus 0.17), almost three times as great a difference as either PATCHMAX or PATCHLAST.

The average duration users spent within patches was a little over one minute (68.28 seconds). Considering the average user spent 3.69 minutes on an entire site, foragers spent almost a third of their time (30.84%) within patches. Goal sessions foraged within patches, on average, 42.41 more seconds compared to non-goal sessions (89.48 seconds versus 47.07 seconds).

Unlike the patch visitation measures, the trail following measures all had similar means: 0.27, 0.26, and 0.33. Likewise, the difference between goal and non-goal sessions was also relatively similar between the three measures. The difference for TRAILMAX was 0.45 (0.50 versus 0.05), TRAILLAST was 0.43 (0.48 versus 0.05), and TRAILSUM had the largest difference of 0.56 (0.61 versus 0.05).

Patch and Trail Descriptive Statistics

Table 59 provides the mean, standard deviation, median, minimum, and maximum values for a number of descriptive measures about the learned patches and trails: number, size, coverage and value. In addition, statistics are also provided about how many patches were visited and trails were followed by foragers.

Table 59: Site-centric: Patch and Trail Metric Statistics (Significant – 0.05)

                        N     Mean      St. Dev.   Median     Min       Max
PATCHES
Number of patches       14    11.93     28.74      3.50       1         111
Patch size              14    1.82      0.64       2.00       1.00      3.00
Patch coverage          14    28.63%    13.85%     26.79%     10.00%    50.00%
Patch value             14    0.67      0.16       0.67       0.28      0.88
Patch visitation
  All                   14    2.75      2.25       2.00       1.00      11.00
  Goal                        3.50      2.93       2.00       1.00      11.00
  Non-goal                    2.00      0.88       2.00       1.00      4.00
TRAILS
Number of trails        10    4.70      8.04       1.50       1         27
Trail size              10    2.15      0.34       2.00       2.00      3.00
Trail coverage          10    26.47%    14.80%     27.44%     6.90%     50.00%
Trail value             10    0.79      0.19       0.82       0.46      1.05
Trail following
  All                   10    1.60      1.14       1.00       1.00      5.00
  Goal                        1.80      1.48       1.00       1.00      5.00
  Non-goal                    1.40      0.70       1.00       1.00      3.00

Note: all values are based on the median values from each Web site's goal and non-goal sessions.

An average of 11.93 patches was discovered on 14 Web sites using the 0.05 significance mining level. Although almost 12 patches were discovered on average, there was a fairly large spread of discovered patches, with one Web site only finding a single patch and another site discovering 111 patches. In general, discovered patches were fairly small in size, consisting of only 1.82 pages on average. The small patch size indicates a number of valuable patches were simply individual pages on a Web site. However, even though patches were small in size, they represented over a quarter of all Web pages on a Web site (28.63%).

The average value of a patch was relatively high at 0.67, indicating patches were reasonably capable of separating areas predominately visited by goal sessions versus non-goal sessions. However, the value of patches had a moderately large 0.60 point spread between the minimum (0.28) and maximum (0.88) valued patches.

Foragers visited, on average, 2.75 patches during a session, which represented 23.05% of all available patches on a site. Goal sessions visited 1.50 more patches than non-goal sessions did during a session (3.50 versus 2.00). Although non-goal sessions appeared to visit a number of patches, the statistics in table 59 only include sessions that visited at least one patch. Therefore, when considering all sessions, less than half of all non-goal sessions visited patches while over half of the goal sessions did visit patches (see median values for PATCHMAX, PATCHLAST, or PATCHSUM in table 58).

Valuable trails were more difficult to discover than patches, as evidenced by both the lower number of Web sites with trails (10 versus 14) and the mean number of trails discovered at each Web site (4.70 versus 11.93). Discovered trails consisted of either two- or three-page sequences (average of 2.15), which represented 26.47% of all Web pages on a Web site²⁵.

Although more difficult to find, trails were 0.12 points more valuable on average than patches (0.79 versus 0.67). The value of trails had a 0.59 point spread between the minimum (0.46) and the maximum (1.05) valued trails, which was only one hundredth of a point lower than patches.

In terms of usage of valuable trails, foragers followed an average of 1.60 trails during a session, which represented 34.04% of all available trails on a Web site. Like patches, goal sessions also followed more trails than non-goal sessions (1.80 versus 1.40). Although fewer trails were followed in absolute terms compared to the number of patches visited, percentage-wise goal sessions followed a greater proportion of available trails on a Web site than patches (38.30% versus 29.34%).

²² For REPEAT, the groups demonstrated the conversion rate of sessions that had visited the Web site before and those sessions that were new visitors to the site.
²³ The duration and number of pages viewed for goal sessions only includes activity before any form submission.
²⁴ The use of α = 0.05 for learning patches and trails was motivated by prior research on contrast sets (Bay and Pazzani, 1999).
²⁵ Trails may contain the same page being visited multiple times unlike patches which only represent unique pages. Thus, the percentage of trail coverage (26.47%) can still be lower than patch coverage (28.63%) even when the mean trail size (2.15) is higher than the average patch size (1.82). The preceding explanation assumed the difference in coverage was not due to dissimilar Web sites with different number of Web pages being included in the calculation.

Examples of Patches and Trails
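As a minimal sketch of how such examples might be represented, a patch can be modeled as an unordered set of unique pages and a trail as an ordered sequence that may revisit a page, following footnote 25. The Python below is illustrative only. The `coverage` helper, the ten-page site size, and the assumption that coverage is the share of a site's pages appearing in at least one patch or trail are all hypothetical, since the exact calculation is not given in this section; the page names are borrowed from Tables 60 and 61.

```python
def coverage(page_groups, pages_on_site: int) -> float:
    """Share of a site's pages that appear in at least one patch or trail.

    Assumption: coverage counts each page once, however often it recurs.
    """
    unique_pages = set()
    for group in page_groups:
        unique_pages.update(group)
    return len(unique_pages) / pages_on_site


# A patch is unordered and holds unique pages only (Web site 3, Table 60).
patch = frozenset({"Index", "Photo album of puppies", "Available puppies"})

# A trail is ordered and may revisit a page (Web site 6, Table 61).
trail = ("Index", "Getting loans with poor credit", "Index", "Testimonials")

# Order is irrelevant for patches but significant for trails.
same_patch = frozenset({"Available puppies", "Index", "Photo album of puppies"}) == patch
same_trail = tuple(reversed(trail)) == trail

# Hypothetical site with 10 pages: the four-step trail covers no more pages
# than the three-page patch because "Index" is visited twice.
patch_coverage = coverage([patch], pages_on_site=10)
trail_coverage = coverage([trail], pages_on_site=10)
```

Under these assumptions a longer trail can cover the same number of distinct pages as a shorter patch, which is the effect footnote 25 uses to explain why trail coverage can be lower than patch coverage despite the larger mean trail size.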


Tables 60 and 61 each present three examples of discovered patches and trails, respectively. Each table lists an identifier for the Web site along with a short description of what the purpose of the Web site was. The value of the example patch or trail along with the pages that make up the patch or trail is also provided. For patches, the order in which the pages are displayed in table 60 does not matter. For trails, the order of pages in table 61 does matter.

Table 60: Site-centric: Example Patches

Web site Id   Description                                      Support   Patch Pagesᵃ
1             Sell and service light-weight outboard motors    0.31      Index; Outboard motor products; Accessories; Warranties
2             Hair and make-up services for weddings           0.70      Hair prices; Hair style examples; Services offered; Makeup prices
3             Small dog breeder                                0.61      Index; Photo album of puppies; Available puppies

Examples are from a variety of different significance/support levels.
ᵃ Order of pages does not matter.

The first Web site from table 60 demonstrates a four-page patch of relatively low value. The patch may be of interest to a forager who had a question about the warranty coverage of outboard motors and their accessories. The second example represented a higher-valued patch than the first example. The four-page patch dealt with wedding hair and make-up services and may have been visited by an individual interested in booking the Web site owner for their wedding. Finally, the third example illustrates a patch that a forager may visit if they were interested in adopting puppies from a small dog breeder.

Table 61: Site-centric: Example Trails

Web site Id   Description                          Support   Trail Pagesᵃ
4             Teaches professional card dealing    0.41      Index; Testimonials; Calendar of classes
5             Cosmetology school                   0.54      General information; Financial assistance; Courses offered
6             Small financial company              0.31      Index; Getting loans with poor credit; Index; Testimonials

Examples are from a variety of different significance/support levels.
ᵃ Order of pages does matter.

The first trail shown from table 61 demonstrates a moderately-valued three-page sequence. The example illustrates a trail followed when foragers are interested in learning how to deal cards. First the user visited the index page, then read the posted testimonials, and finally viewed when classes were held. The second example demonstrates a likely path a potential student may follow when interested in enrolling in cosmetology school. General information about the school was read first, followed by information about available financial assistance, and finally what courses were offered at the school. The final example shows a trail where a forager retraced their steps. The index page was visited first and then again after the forager read information on how to obtain a loan with poor credit. The reason for the backtracking is not known, although it may be the navigation on the site followed a hub and spoke topology that required backtracking (i.e., all pages linked from the index page, but not to one another).

Assumptions of Statistical Tests

Table 62 lists the assumptions for each of the three statistical tests used to test the site-centric hypotheses²⁶. A ✓ symbol indicates the assumption was met for the statistical test, while a ✗ symbol means the assumption was not met. If both a ✓ and a ✗ symbol are shown then the assumption held for some metrics, but not for all of the metrics. There were a total of five assumptions for the paired t-test (assumptions three through seven); four for the exact Wilcoxon signed rank test (assumptions three through six); and three for the dependent-samples sign test (assumptions one, two, and five).

Table 62: Site-centric: Assumptions of Statistical Tests

#   t-Test   Wilcoxon   Sign Test   Assumption
1                       ✓           The pairs $(X_i, Y_i)$ are internally consistent, in that if P(+) > P(−) for one pair $(X_i, Y_i)$, then P(+) > P(−) for all pairs.
2                       ✓           The measurement scale is at least ordinal within each pair.
3   ✓        ✓                      The measurement scale of the $D_i$s is at least interval.
4   ✓        ✓                      The $D_i$s all have the same mean.
5   ✓        ✓          ✓           The $D_i$s (or bivariate random variables $(X_i, Y_i)$) are mutually independent.
6   ✓ ✗      ✓ ✗                    The distribution of each $D_i$ is symmetric.
7   ✓ ✗                             The $D_i$s are identically distributed normal random variables.

(Conover, 1999, pg. 157-158, 353, 363)

²⁶ Assume within the data there are $n$ pairs of $X$ and $Y$ observations $(X_0, Y_0), (X_1, Y_1), \ldots, (X_n, Y_n)$. For each observation pair, the difference $D_i$ is calculated between $X_i$ and $Y_i$, where $D_i = Y_i - X_i$.

Further details about whether assumptions were met or not for each of the statistical tests are provided below. The tests are presented in order of which test had the least to most stringent assumptions: sign test, Wilcoxon test, and t-test.

Dependent-samples Sign Test

All three assumptions of the sign test were fully met.

Assumption 1 – Each observation pair was internally consistent. If P(+) > P(−), P(+) < P(−), or P(+) = P(−) for a single observation pair, then P(+) > P(−), P(+) < P(−), or P(+) = P(−) was the same across all observation pairs, respectively.
Assumption 2 – Each metric used in this research was a quantitative variable measured on at least an interval scale.
Assumption 5 – Each pair of bivariate random variables $(X_i, Y_i)$ was taken from a different and independent Web site.
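The dependent-samples sign test described above can be sketched as an exact one-sided binomial tail. The Python below is an illustration rather than the code used in the dissertation. It assumes that S counts the Web sites whose difference falls in the hypothesized direction, that sites with a zero difference are dropped, and that the null probability of either sign is 0.5; the function names are hypothetical.

```python
from math import comb


def sign_test_p_value(successes: int, n_nonzero: int) -> float:
    """One-sided exact sign test: P(X >= successes) for X ~ Binomial(n_nonzero, 0.5)."""
    if not 0 <= successes <= n_nonzero:
        raise ValueError("successes must lie between 0 and n_nonzero")
    # Sum the upper tail of the binomial distribution using exact integers.
    tail = sum(comb(n_nonzero, k) for k in range(successes, n_nonzero + 1))
    return tail / 2 ** n_nonzero


def count_signs(goal_values, non_goal_values):
    """Return (S, n): sites where goal exceeds non-goal, and sites with a non-zero difference."""
    diffs = [g - ng for g, ng in zip(goal_values, non_goal_values)]
    nonzero = [d for d in diffs if d != 0]
    return sum(d > 0 for d in nonzero), len(nonzero)
```

With these assumptions, `sign_test_p_value(19, 28)` rounds to the 0.0436 reported for SC2 in Table 65, and `sign_test_p_value(9, 9)` and `sign_test_p_value(6, 6)` round to the 0.0020 and 0.0156 reported in Table 66.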


Exact Wilcoxon Signed Rank Test

Three of the four assumptions for the exact Wilcoxon signed rank test were met. The fourth assumption dealing with symmetry of the $D_i$s was met for some of the metrics, but not for all of the metrics.

Assumption 3 – Each metric used in this research was a quantitative variable measured on at least an interval scale.
Assumption 4 – Each of the differences ($D_i$) was taken from a Web site from the same population. Therefore, the mean of each difference was expected to be the same.
Assumption 5 – Each of the differences ($D_i$) was taken from a different and independent Web site.
Assumption 6 – The last two columns from tables 63 and 64 show the quartile skew coefficient²⁷ for all 13 metrics²⁸. Since the Wilcoxon test only considers non-zero differences, only the skew values from the “No zeros” columns were analyzed.

²⁷ A description of quartile skew coefficient and why it was analyzed over the traditionally used coefficient of skewness can be found in §7.1.1. The values for the coefficient of skewness are provided in tables 63 and 64 for reference purposes.
²⁸ The six measures that did not require mining for their calculation are shown in table 63. The remaining seven metrics that were calculated from mined patches and trails are displayed in table 64.
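The quartile skew coefficient is defined in §7.1.1, which lies outside this section. As a hedged illustration, the sketch below assumes the common Bowley definition, $((Q_3 - Q_2) - (Q_2 - Q_1)) / (Q_3 - Q_1)$, which is bounded between −1 and 1. The function name and the choice of the "inclusive" quantile method are assumptions, so results may differ slightly from the values in tables 63 and 64.

```python
from statistics import quantiles


def quartile_skew(values):
    """Bowley's quartile skew: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("quartile skew is undefined when Q1 equals Q3")
    return ((q3 - q2) - (q2 - q1)) / iqr


# Symmetric differences give a skew of zero.
symmetric = quartile_skew([1, 2, 3, 4, 5])

# When the median equals the upper quartile (for example, many zero
# differences above a negative tail), the coefficient reaches -1.
many_zeros = quartile_skew([-4, -1, 0, 0, 0])
```

Under this definition a coefficient of −1 arises whenever the median coincides with the upper quartile, which is one way a measure with many zero differences could produce an extreme value in the "All" columns.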


Table 63: Site-centric: Metric Normality and Skew

                     N                 Lilliefors             Shapiro                Skew                 Quartile Skew
Hyp.   Metric     Total   No Zeros   D      p-Value        W      p-Value        All      No Zeros    All      No Zeros
INFORMATION PATCH – SITE-PATCH
SC1    SITEDUR    47      47         0.14   0.0184**       0.92   0.0035***      −1.15    −1.15       0.15     0.15
SC2    SITEPGS    47      28         0.23   <0.0001***     0.93   0.0076***      0.34     0.27        −1.00    0.33
SC3    RETURN     47      47         0.22   <0.0001***     0.75   <0.0001***     2.10     2.10        0.18     0.18
SC4    REPEAT     47      47         0.16   0.0035***      0.87   <0.0001***     −1.30    −1.30       −0.18    −0.18
STRICT INFORMATION SCENT
SC7    UNIQUE     47      44         0.09   0.3986         0.96   0.1368         −0.34    −0.37       −0.43    −0.26
SC8    LINEAR     47      18         0.36   <0.0001***     0.67   <0.0001***     1.26     0.21        −1.00    0.07

*p ≤ 0.10; **p ≤ 0.05; ***p ≤ 0.01


Table 64: Site-centric: Metric Normality and Skew (Significant – 0.05)

                      N                 Lilliefors           Shapiro                Skew                 Quartile Skew
Hyp.   Metric      Total   No Zeros   D      p-Value      W      p-Value        All      No Zeros    All      No Zeros
INFORMATION PATCH – PAGE-PATCH
SC5a   PATCHMAX    14      9          0.19   0.1596       0.88   0.0595*        0.64     −0.43       −0.15    0.26
SC5b   PATCHLAST   14      9          0.21   0.0966*      0.89   0.0822*        0.48     −0.56       0.06     −0.53
SC5c   PATCHSUM    14      9          0.21   0.0795*      0.79   0.0033***      1.41     −1.09       0.11     −0.37
SC6    PATCHDUR    14      14         0.15   0.5374       0.96   0.6601         −0.50    −0.50       −0.02    −0.02
RELAXED INFORMATION SCENT
SC9a   TRAILMAX    10      6          0.25   0.0671*      0.85   0.0581*        0.09     0.19        0.08     0.20
SC9b   TRAILLAST   10      6          0.26   0.0616*      0.86   0.0775*        0.08     −0.31       0.22     0.46
SC9c   TRAILSUM    10      6          0.22   0.1860       0.87   0.0977*        0.91     −1.05       −0.08    0.07

*p ≤ 0.10; **p ≤ 0.05; ***p ≤ 0.01


Besides statistics on skew as shown in tables 63 and 64, figures 56 and 57 are also provided to graphically show the distribution of points for each measure.

[Figure 56: Site-centric: Difference Plots. Panels (a) SITEDUR, (b) SITEPGS, (c) RETURN, (d) REPEAT, (e) UNIQUE, and (f) LINEAR each plot the non-goal minus goal difference (vertical axis) for the Web sites included (All versus No Zeros).]

None of the measures exactly met the assumption of no skew (using the quartile skew coefficient). However, PATCHDUR had a very slight negative skew of −0.02 and was considered symmetrical. LINEAR and TRAILSUM also had slight positive skew values of 0.07 and were considered mostly symmetrical. PATCHLAST had a high amount of skew (−0.53) and did not meet the assumption of symmetry. All the other measures did not fully meet the assumption of symmetry since they had slight to moderate amounts of skew ($|0.15|$ to $|0.46|$).

[Figure 57: Site-centric: Patch and Trail Difference Plots (Significant – 0.05). Panels (a) PATCHMAX, (b) PATCHLAST, (c) PATCHSUM, (d) PATCHDUR, (e) TRAILMAX, (f) TRAILLAST, and (g) TRAILSUM each plot the non-goal minus goal difference (vertical axis) for the Web sites included (All versus No Zeros).]

Paired t-Test

Three of the five assumptions for the paired t-test were met. The fourth assumption dealing with symmetry of the $D_i$s was met for some of the metrics, but not for all of the metrics. The fifth assumption of normality was met for only a couple of measures.

Assumption 3 – Each metric used in this research was a quantitative variable measured on at least an interval scale.
Assumption 4 – Each of the differences ($D_i$) was taken from a Web site from the same population. Therefore, the mean of each difference was expected to be the same.
Assumption 5 – Each of the differences ($D_i$) was taken from a different and independent Web site.
Assumption 6 – The symmetry for each measure was determined in the same manner as for the exact Wilcoxon signed rank test, except the “All” columns were used to determine skew since the t-test uses points with either a difference or no difference (i.e., both $D_i = 0$ and $D_i \neq 0$) in its calculation. Of the 13 measures, four of the measures had a very slight to slight amount of skew: PATCHLAST (0.06), PATCHDUR (−0.02), TRAILMAX (0.08), and TRAILSUM (−0.08). Two of the measures had a severe skew of −1.00 (SITEPGS and LINEAR) and did not meet the assumption of symmetry. The other seven measures had a slight to moderate amount of skew ($|0.11|$ to $|0.43|$).
Assumption 7 – Two formal statistical tests of normality were performed on the differences ($D_i$s) for each of the metrics²⁹: Lilliefors (Kolmogorov-Smirnov) and Shapiro-Wilk normality tests (Conover, 1999)³⁰. Each test has a null hypothesis that the data follows a normal distribution with an unspecified mean and variance. Lilliefors is a non-parametric normality test, whereas Shapiro-Wilk has been found to have greater power than other tests (such as Lilliefors) in many situations (Conover, 1999).

Two of the measures failed to reject the null hypothesis of a normal distribution for both of the Lilliefors and Shapiro-Wilk tests at an α level of 0.15 or lower (UNIQUE and PATCHDUR). PATCHMAX and TRAILSUM also failed to reject the null hypothesis using the Lilliefors test, but did reject the null hypothesis using the more powerful Shapiro-Wilk test at the 0.10 significance level. The remaining nine measures rejected the null hypothesis using both tests with at least an α level of 0.10. Thus, only two of the measures met the assumption of normality, while the distributions of the other 11 metrics were considered non-normal.

In addition to the tests of normality, the skew values from tables 63 and 64 and the graphical depiction of each metric's points (figures 56 and 57) provided further evidence that most of the measures did not follow a normal distribution.

Overall, only the assumptions for the dependent-samples sign test were fully met. Therefore, the sign test was used to test the hypotheses of the site-centric model³¹. The assumptions for the Wilcoxon test and t-test were not completely met and are provided only for comparison purposes. The lack of symmetry (and normality) for some of the measures means the results from the Wilcoxon and t-test must be interpreted with caution.

7.2.2 Hypotheses Testing

Tables 65 and 66 present the results for the nine site-centric hypotheses. Table 65 provides results from the six hypotheses whose measure did not rely on mining patches and trails. Table 66 lists the results for the three hypotheses that required mining patches and trails.

The first two columns of each table list the hypothesis number and name of the metric being tested. The third and fourth columns list the total number of Web sites and the number of sites with a non-zero difference (i.e., $D_i \neq 0$), respectively. The total number of Web sites was used in the t-test, while only Web sites with non-zero differences were used for the Wilcoxon and sign tests. Columns five through seven list the t statistic, degrees of freedom (df), and p-value for the t-test. The eighth and ninth columns display the V statistic and p-value for the Wilcoxon test. The final two columns list the S statistic and p-value for the sign test³².

²⁹ Symmetry is a necessary, but not sufficient, condition for normality. Although SITEPGS and LINEAR were severely skewed and thus not symmetrical, the tests of normality were still performed on the measures for purposes of completeness.
³⁰ Although presented, formal tests of normality (such as Lilliefors and Shapiro-Wilk) are known to be sensitive to even slight departures from normality (Mendenhall and Sincich, 2003).
³¹ The sign test is generally the least powerful of the three tests (Conover, 1999). However, as all of the sign test's assumptions were met, greater confidence can be given to the results of the sign test compared to the other two tests.
³² Results of the sign test are presented below since all three assumptions for the test were met. The results of the t-test and Wilcoxon test are provided in footnotes. Since neither the t-test nor the Wilcoxon test met all of their assumptions, the results of those tests should be interpreted with caution.
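The V statistic and its exact p-value can be illustrated with a small dynamic program over rank sums. This Python sketch is not the dissertation's code. It assumes V is the sum of ranks of the non-zero differences that fall in the hypothesized direction, that no ranks are tied (so V is an integer), and that the test is one-sided; the function name is hypothetical.

```python
def signed_rank_p_value(v: int, n_nonzero: int) -> float:
    """One-sided exact p-value P(V >= v) under the null, assuming no tied ranks."""
    max_sum = n_nonzero * (n_nonzero + 1) // 2
    if not 0 <= v <= max_sum:
        raise ValueError("v must lie between 0 and n(n + 1) / 2")
    # counts[s] = number of subsets of the ranks {1, ..., n} that sum to s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n_nonzero + 1):
        # Iterate downwards so each rank is used at most once per subset.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    # Under the null, each of the 2^n sign assignments is equally likely.
    return sum(counts[v:]) / 2 ** n_nonzero
```

Under these assumptions the largest possible V is n(n + 1)/2, which gives 1,128 for 47 sites, 45 for 9 sites, and 21 for 6 sites, the values that appear in Tables 65 and 66 when every non-zero difference has the same sign. `signed_rank_p_value(100, 14)` rounds to the 0.0006 reported for PATCHDUR.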


Table 65: Site-centric: Results

                       N                 t-test                        Wilcoxon                Sign Test
Hyp.         Metric    Total   No Zeros  t       df   p-Value         V       p-Value         S    p-Value
INFORMATION PATCH – SITE-PATCH
SC1          SITEDUR   47      47        14.85   46   <0.0001***      1,128   <0.0001***      47   <0.0001***
SC2          SITEPGS   47      28        2.27    46   0.0140**        295     0.0166**        19   0.0436**
SC3          RETURN    47      47        −6.66   46   1.0000          0       1.0000          0    1.0000
SC3 (opp)ᵃ             47      47        6.66    46   <0.0001***      1,128   <0.0001***      47   <0.0001***
SC4          REPEAT    47      47        1.74    46   0.0447**        736     0.0346**        29   0.0719*
STRICT INFORMATION SCENT
SC7          UNIQUE    47      44        10.19   46   <0.0001***      986     <0.0001***      42   <0.0001***
SC8          LINEAR    47      18        4.41    46   <0.0001***      171     <0.0001***      18   <0.0001***

ᵃ Hypothesis tested in opposite direction as original – i.e., leaving and returning will be negatively associated with achieving a goal on this long tail Web site.
*p ≤ 0.10; **p ≤ 0.05; ***p ≤ 0.01


Table 66: Site-centric: Results (Significant – 0.05)

                     N                 t-test                     Wilcoxon              Sign Test
Hyp.   Metric      Total   No Zeros  t      df   p-Value        V     p-Value         S    p-Value
INFORMATION PATCH – PAGE-PATCH
SC5a   PATCHMAX    14      9         3.68   13   0.0014***      45    0.0020***       9    0.0020***
SC5b   PATCHLAST   14      9         3.92   13   0.0009***      45    0.0020***       9    0.0020***
SC5c   PATCHSUM    14      9         3.00   13   0.0051**       45    0.0020***       9    0.0020***
SC6    PATCHDUR    14      14        4.11   13   0.0006***      100   0.0006***       13   0.0009***
RELAXED INFORMATION SCENT
SC9a   TRAILMAX    10      6         3.33   9    0.0044**       21    0.0156**        6    0.0156**
SC9b   TRAILLAST   10      6         3.36   9    0.0042**       21    0.0156**        6    0.0156**
SC9c   TRAILSUM    10      6         2.89   9    0.0089**       21    0.0156**        6    0.0156**

*p ≤ 0.10; **p ≤ 0.05; ***p ≤ 0.01
Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).
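The note to Table 66 divides each α level by three for the hypotheses that are tested with three measures. A minimal Python sketch of that adjustment follows. The source does not name the correction; dividing α by the number of measures is commonly called a Bonferroni adjustment, and the function name and star mapping here are illustrative assumptions.

```python
ALPHAS = (0.01, 0.05, 0.10)  # strictest level first
STARS = {0.01: "***", 0.05: "**", 0.10: "*"}


def significance_stars(p_value: float, n_measures: int = 1) -> str:
    """Return the marker for the strictest alpha level met after dividing alpha by n_measures."""
    for alpha in ALPHAS:
        if p_value <= alpha / n_measures:
            return STARS[alpha]
    return ""


# Sign test p-values from Table 66, adjusted for three measures per hypothesis.
patch_marker = significance_stars(0.0020, n_measures=3)   # SC5a-c
trail_marker = significance_stars(0.0156, n_measures=3)   # SC9a-c
# SC6 uses a single measure, so no adjustment is applied.
patchdur_marker = significance_stars(0.0009)
```

With these thresholds a p-value of 0.0020 clears 0.01/3 = 0.0033, while 0.0156 clears 0.05/3 = 0.0167 but not 0.0033, which matches the three-star and two-star markers shown in Table 66.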


SC1 – SITEDUR

Similar to UC1, the first hypothesis of the site-centric model also expected goal achieving foragers would spend more time on a Web site than non-goal achieving foragers [33]. However, the site-centric model did not compare the relative behavior of a forager from one site to another. Instead, all comparisons were done relative to an absolute value of zero. Regardless of how comparisons were determined, the rationale for the hypothesis does not change from one model to another: an expectation of higher durations for goal sessions due to greater information gain from the site of interest.

The results of the sign test supported SC1 at α = 0.01 (S = 47; p-value < 0.0001) [34]. All 47 Web sites had a higher median duration amongst goal sessions than non-goal sessions. Goal sessions spent roughly six minutes on a site, while non-goal sessions foraged for fewer than two minutes.

The use of absolute duration in this hypothesis provides additional support to prior research regarding the positive association between duration and goal achievement. In addition, although the results of the user- and site-centric models are not directly comparable due to the relative nature of the user-centric model, the significant support for both hypotheses reinforces the value of duration in predicting goal achievement.

SC2 – SITEPGS

The first and second hypotheses were similar to one another because they both focused on the concept of satisficing (Pirolli, 2007; Simon, 1956). However, SC2 examined the importance of a greater number of pages viewed at a Web site rather than duration. The belief was that every additional page signaled the continued interest of the forager to stay on the site because information of value could still be obtained from the Web site. Thus, more pages equated to more information of value being obtained, which in turn led to a greater probability of the forager achieving a goal.

The second hypothesis was found to be significant at α = 0.05 (S = 19; p-value = 0.0436) [35], supporting hypothesis SC2. 19 out of the 28 non-tied Web sites had a higher median number of pages viewed for goal sessions versus non-goal sessions. Goal sessions viewed 4.28 pages on average, whereas non-goal sessions viewed only 3.85 pages.

[33] The rationale for the first four site-centric hypotheses is explained in greater detail in the user-centric results (§7.1.2).
[34] Hypothesis SC1 was also significant at α = 0.01 for both the t-test (t = 14.85; df = 46; p-value < 0.0001) and Wilcoxon test (V = 1,128; p-value < 0.0001).
[35] Hypothesis SC2 was also significant at α = 0.05 for both the t-test (t = 2.27; df = 46; p-value = 0.0140) and Wilcoxon test (V = 295; p-value = 0.0166).

Hypothesis SC2 provides additional support to prior literature about the positive association between number of pages viewed and goal achievement. When compared to the user-centric model, however, the results for this hypothesis were less significant. One possible reason for the difference may be the use of absolute rather than relative comparisons of foragers' behavior. However, another likely reason may be the structure of the Web sites used in the site-centric dataset.

In general, site-centric Web sites had relatively few pages (16.36 pages on average). While the number of pages on the user-centric Web sites was unknown, the sites may have had more pages than the site-centric Web sites [36]. Therefore, there would be more pages that could have been visited on a Web site from the user-centric dataset, leading to a greater gap between the page visitation behavior of goal and non-goal sessions, and hence a more significant result.

SC3 – RETURN

Hypothesis SC3 conjectured that foragers who left a Web site and returned during the same session would be more likely to achieve a goal. The hypothesis was not supported at any of the tested α levels (S = 0; p-value = 1.0000) [37]. None of the 47 Web sites had a higher percentage of goals achieved amongst foragers who left the site and returned during the same session than amongst visitors that stayed on the site during their entire session. The results indicated a forager was less likely to achieve a goal if the user left a Web site and returned within the same session (the opposite of hypothesis SC3), which was significant at α = 0.01 (S = 47; p-value < 0.0001) [38]. All 47 Web sites had a higher proportion of goal sessions among foragers who stayed than among foragers who left and returned during their session.

[36] On average, foragers from the Web sites in the user-centric dataset viewed 14.84 pages during a session, while visitors to site-centric sites only viewed 4.06 pages. While not definitive proof that user-centric sites had more pages than site-centric Web sites, the higher average number of pages viewed for user-centric foragers does give some indication that those Web sites might have had more pages.
[37] Hypothesis SC3 was also not supported at any of the tested α levels for both the t-test (t = −6.66; df = 46; p-value = 1.0000) and Wilcoxon test (V = 0; p-value = 1.0000).
[38] The opposite of hypothesis SC3 was also supported at α = 0.01 for both the t-test (t = 6.66; df = 46; p-value < 0.0001) and Wilcoxon test (V = 1,128; p-value < 0.0001).
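The sign-test p-values reported for these hypotheses can be reproduced from the counts alone. The sketch below is an illustration, not the procedure actually used in the study: it assumes a one-sided exact binomial test with success probability 0.5 over the non-tied Web sites, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(successes: int, non_tied: int) -> float:
    """One-sided exact sign test: P(X >= successes), X ~ Binomial(non_tied, 0.5).

    Assumption: ties (zero differences) were dropped before counting.
    """
    if not 0 <= successes <= non_tied:
        raise ValueError("successes must lie between 0 and non_tied")
    # Count the outcomes at least as extreme as the one observed.
    tail = sum(comb(non_tied, k) for k in range(successes, non_tied + 1))
    return tail / 2 ** non_tied


# SC2: 19 of the 28 non-tied Web sites favored goal sessions.
print(round(sign_test_p_value(19, 28), 4))  # 0.0436
# SC1: all 47 Web sites favored goal sessions.
print(sign_test_p_value(47, 47) < 0.0001)   # True
```

Under these assumptions the SC2 counts give 0.0436, which agrees with the reported value, and a count of zero out of 47 gives a p-value of 1, as reported for SC3 in its original direction.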


Since prior research has not examined returning behavior of a user during the same session, this hypothesis (in opposite form) provides an initial result of a negative association between returning behavior during the same session and goal achievement. As in the user-centric model, the original hypothesis was not supported. However, unlike the user-centric model, the opposite of the original hypothesis was supported at α = 0.01. The difference between the two models may have been due to the methodology of determining when leaving and returning behavior occurred.

In general, the user-centric model was conservative while the site-centric model was liberal when classifying leaving and returning behavior. For example, the user-centric model only counted visits of at least two pages to other known e-commerce Web sites as valid leaving behavior. Such a precise manner of determining leaving behavior was not possible in the site-centric data. Therefore, a more simplistic manner of determining leaving behavior was used, which examined the referring field for each page viewed. A limitation of using the referring field was that it was not known if true browsing behavior (i.e., more than one page was viewed) took place at the referred Web site. Thus, situations in which a forager left the site of interest, viewed one page on another site, and then returned would still be marked as leaving and returning in the site-centric model [39]. In addition, the referring field was also limited because it was unable to capture whether a forager left to view another Web site in a new Web browser window or tab.

SC4 – REPEAT

The final hypothesis which examined the value of the entire Web site as a patch looked at how prior visitation behavior would affect goal achievement. Hypothesis SC4 expected prior visitation of a Web site would provide a forager with intimate knowledge of what the site had to offer. Therefore, when the forager has some need to be met at a future date, they would more likely return to the Web site of interest if they believed it would satisfy their information goal. Thus, repeat visitation would signal greater likelihood of achieving a goal at the Web site of interest.
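How a session might be flagged as a repeat visit can be sketched as follows. This is an illustration only: the record layout (a visitor identifier paired with a session start time) and the function name are hypothetical, and the visitor identifier is assumed to come from a browser cookie.

```python
from typing import List, Set, Tuple


def flag_repeat_sessions(sessions: List[Tuple[str, int]]) -> List[bool]:
    """Return one flag per session: True if the visitor was seen in an earlier session.

    Each session is a (visitor_id, start_time) pair. visitor_id is assumed to be
    a cookie value, so a visitor who deletes the cookie looks new again.
    """
    # Process sessions chronologically, but report flags in the input order.
    order = sorted(range(len(sessions)), key=lambda i: sessions[i][1])
    seen: Set[str] = set()
    flags = [False] * len(sessions)
    for i in order:
        visitor_id = sessions[i][0]
        flags[i] = visitor_id in seen
        seen.add(visitor_id)
    return flags


sessions = [("cookie-a", 100), ("cookie-b", 150), ("cookie-a", 300), ("cookie-c", 400)]
print(flag_repeat_sessions(sessions))  # [False, False, True, False]
```

Because the flag depends entirely on the identifier, any visitor whose identifier changes between visits is counted as new under this sketch.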
The results of the sign test demonstrated the fourth hypothesis was significant at α = 0.10 (S = 29; p-value = 0.0719) [40], supporting hypothesis SC4. 29 of the 47 Web sites had a higher median probability of a form submission amongst goal sessions when a user had visited the site before. The marginally significant result was due to the small difference between the proportion of goal sessions that had and had not visited before (0.62% difference between groups).

[39] Another example would be a forager that clicked the back button of their browser one too many times and ended up on the search engine page that initially brought them to the site of interest, and then clicked a link to return back to the site of interest.
[40] Hypothesis SC4 was significant at α = 0.05 for both the t-test (t = 1.74; df = 46; p-value = 0.0447) and Wilcoxon test (V = 736; p-value = 0.0346). The reason the t-test and Wilcoxon test found SC4 significant at a lower alpha level than the sign test may be due to the lower power of the sign test.

Prior literature has found mixed associations (dependent on the task) between users returning during different sessions and completing a task (Sismeiro and Bucklin, 2004). The result of this hypothesis lends additional support to a slight positive association between repeat visitation behavior and goal achievement. When compared to the user-centric model, there was a large difference in significance between the two models (α = 0.01 user-centric versus 0.10 site-centric). The difference in significance may be partially explained in two ways.

First, the nature of the goal being examined in the user-centric dataset may better lend itself to repeat visitation than the goal in the site-centric dataset. For example, the user-centric dataset examined product purchases, which may need to be replenished from time to time. In contrast, there is likely little need to resubmit contact information on one of the site-centric Web sites. From another standpoint, a purchase has a defined cost associated with it. Therefore, a forager may return to a site multiple times as they contemplate purchasing a product. Leaving contact information on a Web site has no real monetary cost associated with the action. Therefore, the submission of a contact form may not require the same degree of thought and comparison that purchasing does, which may lower the need for repeat visitations to a site.

The second reason a difference between the hypotheses of the two models was seen may be the mechanism by which site-centric foragers were identified in the dataset. Cookies were used to identify and track users across sessions. If a user deleted their cookie, then they would be seen as a new visitor on any subsequent visit. Thus, repeat visitation of foragers may be underrepresented in the site-centric dataset.

SC5 – PATCHMAX, PATCHLAST, and PATCHSUM

The fifth hypothesis expected that visitation of goal patches would be positively associated with goal achievement. The hypothesis operated under the assumption that certain areas of a Web site were more valuable to goal achieving foragers than other areas of the site. Thus, users that visited those same areas of the Web site were assumed to have similar information goals and should be more likely to achieve a goal. The actual value from a forager's visitation of patches was specified in slightly different ways in three sub-hypotheses: maximum value of a patch, value of last patch visited, and total value of all patches visited.

The three sub-hypotheses of SC5 were all found to be significant at α = 0.01 (S = 9; p-value = 0.0020 for all three measures) [41], supporting hypotheses SC5a-c (table 66). 14 Web sites out of the 47 total Web sites discovered patches at the 0.05 significance level, with only nine of those 14 sites having a non-zero difference. All nine of the non-zero Web sites had goal sessions with higher median values for the most valuable patch visited, last patch visited, and sum of all patches visited.

On average, foragers visited 2.75 patches per session. With so few patches being visited, it was possible that some of the measures did not differ from one another by a great deal. For example, sessions that only visited a single patch would have the same value for all three measures. However, as foragers visited almost three patches per session, the average value of PATCHSUM was at least 0.42 points higher than either of the other measures, making it unlikely the PATCHSUM measure only included the same patches as the other two measures. For the other two measures, though, the average difference between PATCHMAX and PATCHLAST was only 0.03 points, indicating that in many sessions the most valuable patch may also have been the last patch visited. Therefore, even though both sub-hypotheses were supported, the similarity of the two measures means the impacts of the most valuable and last visited patch on goal achievement cannot be reliably separated from one another.

Since prior research has not examined the impact of groups of pages (which may be of different types, e.g., product pages, informational pages), this hypothesis provides an initial result of a positive association between visitation of patches and goal achievement.
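The relationship among the three patch visitation measures can be illustrated with a short sketch. It assumes each visited patch carries a single numeric value and that a session is simply the ordered list of those values; the function name, the zero score for sessions that visited no patch, and the example values are all hypothetical.

```python
from typing import Dict, List


def patch_measures(visited_values: List[float]) -> Dict[str, float]:
    """Compute the three patch visitation measures for one session.

    visited_values holds the value of each patch visited, in visit order.
    A session that visited no patch scores zero on all three measures
    (an assumption made for this sketch).
    """
    if not visited_values:
        return {"PATCHMAX": 0.0, "PATCHLAST": 0.0, "PATCHSUM": 0.0}
    return {
        "PATCHMAX": max(visited_values),     # most valuable patch visited
        "PATCHLAST": visited_values[-1],     # last patch visited
        "PATCHSUM": sum(visited_values),     # total value of all patches visited
    }


single = patch_measures([0.5])
several = patch_measures([0.25, 0.75, 0.5])
print(single)   # all three measures coincide for a single-patch session
print(several)  # the measures diverge once several patches are visited
```

A single-patch session yields identical values for all three measures, which is why the measures become hard to separate when few patches are visited per session.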
SC6 – PATCHDUR

The sixth hypothesis expected that mere visitation of valuable patches would not necessarily mean foragers were obtaining value from those patches. Therefore, similar to hypothesis SC1, this hypothesis also relied on the concept of satisficing (Pirolli, 2007; Simon, 1956), contending that higher amounts of time spent within patches related to more information gained and thus a greater likelihood of goal achievement.

[41] Using the t-test, hypotheses SC5a and SC5b were both significant at α = 0.01 (PATCHMAX (t = 3.68; df = 13; p-value = 0.0014); PATCHLAST (t = 3.92; df = 13; p-value = 0.0009)), while SC5c was significant at α = 0.05 (t = 3.00; df = 13; p-value = 0.0051). Using the Wilcoxon test, all three hypotheses were significant at α = 0.01 (V = 45; p-value = 0.0020 for all three measures). The discrepancy between the significance of PATCHSUM from the t-test versus the Wilcoxon and sign test may be due to a lack of normality of the measure. Without normality, the t-test may not have enough power to detect as significant a difference as the other two tests.

Hypothesis SC6 found a higher median duration within patches for goal sessions than non-goal sessions at α = 0.01 (S = 13; p-value = 0.0009) [42], supporting the hypothesis. 13 of the 14 non-zero Web sites with discovered patches had goal sessions spend a higher median duration of time within patches than non-goal sessions spent in patches. On average, goal sessions spent almost three-quarters of a minute more in patches than non-goal sessions.

Duration on an entire site has been used in prior literature to explain choice behavior, with mixed results. Duration has had mixed associations with purchasing for different tasks (Sismeiro and Bucklin, 2004) along with differences in significance among different datasets (Padmanabhan et al., 2001). However, as prior literature has not examined the concept of patches before, this hypothesis provides an initial result of a positive association between duration within patches and goal achievement.

SC7 – UNIQUE

Hypothesis SC7 was the first of two hypotheses that defined information scent in a strict manner. In addition, the hypothesis viewed a forager's session as a single monolithic piece, and assumed any repeat page viewings (regardless of location) were indicative of poor scent. In turn, the cause of lackluster information scent was believed to be either a poorly defined information goal or a less than optimal Web site design, both of which were less likely to result in a goal being achieved. Stated in a positive direction, the hypothesis proposed that the lower the percentage of duplicate pages viewed, the more likely a goal would be achieved.
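The percentage of unique pages in a session can be sketched as below. The exact operationalization of UNIQUE is not restated here, so this is only one plausible reading: distinct pages divided by total page views. The function name and the example paths are hypothetical.

```python
from typing import List


def unique_page_percentage(pages: List[str]) -> float:
    """Percentage of page views in a session that were to distinct pages."""
    if not pages:
        raise ValueError("a session needs at least one page view")
    return 100.0 * len(set(pages)) / len(pages)


# No page is revisited: every view is unique.
print(unique_page_percentage(["/home", "/products", "/contact"]))           # 100.0
# "/home" is viewed twice: three distinct pages out of four views.
print(unique_page_percentage(["/home", "/products", "/home", "/contact"]))  # 75.0
```

Under this reading, the percentage of duplicate pages is simply 100 minus the value returned, so a lower duplicate percentage and a higher unique percentage are the same statement.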
Hypothesis SC7 (table 65) was significant at α = 0.01 (S = 42; p-value < 0.0001) [43], supporting the hypothesis that goal achieving sessions would visit fewer duplicate pages than non-goal sessions. 42 of the 44 non-zero Web sites had goal sessions with a higher median percentage of unique pages viewed than non-goal sessions, with goal sessions viewing almost 90% unique pages and non-goal sessions only viewing about 70%.

Prior research has examined the relationship between the proportion of unique pages visited and purchasing behavior. Moe (2003) found that the proportion of unique pages differed depending on the type of pages being viewed (e.g., brand pages, product pages, category pages). However, this hypothesis examined the proportion of unique pages across all page types. Therefore, hypothesis SC7 provides support for the general positive association between proportion of unique pages viewed and goal achievement.

[42] Hypothesis SC6 was also significant at α = 0.01 for both the t-test (t = 4.11; df = 13; p-value = 0.0006) and Wilcoxon test (V = 100; p-value = 0.0006).
[43] Hypothesis SC7 was also significant at α = 0.01 for both the t-test (t = 10.19; df = 46; p-value < 0.0001) and Wilcoxon test (V = 986; p-value < 0.0001).

SC8 – LINEAR

Hypothesis SC8 was the second of the two hypotheses that defined information scent in a strict manner, where repeat visitations were viewed as indications of poor scent. However, this hypothesis took a finer-grained conceptualization of scent than the previous hypothesis by examining the complexity of a user's session. Complexity was determined by not only what pages were viewed, but also the order in which they were viewed. Hypothesis SC8 proposed that less complex (i.e., more linear) clickstreams were indicative of higher levels of scent, and thus a greater likelihood of achieving a goal.

The hypothesis was found to be significant at α = 0.01 (S = 18; p-value < 0.0001) [44], supporting hypothesis SC8. All 18 of the non-zero Web sites had higher median linear clickstream values for goal sessions compared to non-goal sessions. The average goal session had a linear clickstream value of 0.98 compared to the average value of 0.75 for non-goal sessions.

Prior research has found success in using the measure of session complexity to distinguish between groups (McEneaney, 2001), in the use of product recommendation agents (Senecal et al., 2005), and in predicting the completion of information and e-commerce tasks (Kalczynski et al., 2006). This hypothesis strengthens the use of session complexity to distinguish between goal and non-goal sessions within the context of goal achievement.
Both of the strict information scent hypotheses were found to be supported at the same α level. Using the results of the sign test, neither measure could be definitively considered better than the other, since the number of non-zero Web sites was different for each hypothesis [45]. In addition, even though the LINEAR measure had a greater percentage of its non-zero Web sites in the positive direction (100% versus 95.45%), the loss of two Web sites in the negative direction for the UNIQUE measure was not enough to raise the p-value precipitously.

[44] Hypothesis SC8 was also significant at α = 0.01 for both the t-test (t = 4.41; df = 46; p-value < 0.0001) and Wilcoxon test (V = 171; p-value < 0.0001).
[45] Examining the t-test shows a clear preference for the UNIQUE measure in being better able to distinguish between goal and non-goal sessions (t value of 10.19 versus 4.41). However, as the assumptions of the t-test were not fully met, the results of the t-test should be interpreted with caution.

SC9 – TRAILMAX, TRAILLAST, and TRAILSUM

The final hypothesis examined information scent from a relaxed viewpoint. The hypothesis assumed that following trails predominately traversed by prior goal sessions would be positively associated with goal achievement. The hypothesis operated under the assumption that certain paths throughout a Web site (with or without “inefficiencies”) were indicators of high scent relevant to a goal-achieving information goal. The value obtained from a forager following a trail was specified in three slightly different sub-hypotheses: maximum value of a trail, value of last trail followed, and total value of all trails followed.

The three sub-hypotheses of SC9 (table 66) regarding the following of valuable trails as a means to explain goal achievement were all found to be significant at α = 0.05 (S = 6; p-value = 0.0156 for all three measures) [46], supporting hypotheses SC9a-c. 10 Web sites out of the 47 total sites discovered trails at the 0.05 significance level, with only six of those 10 sites having a non-zero difference. All six of the non-zero Web sites had goal sessions with higher median values for the most valuable trail followed, last trail followed, and sum of all trails followed.

On average, foragers followed 1.60 trails per session. As many foragers only followed a single trail per session, it was likely the measures did not differ from one another by a great deal. For example, sessions that only followed a single trail would have the same value for all three measures. Examining the difference in value between the three measures (0.26 to 0.33) failed to reveal a clear and distinct difference between them. Therefore, even though all three sub-hypotheses were supported, the similarity of the measures means the impacts of the most valuable, last followed, and total value of all trails followed on goal achievement cannot be reliably separated from one another.
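When every non-zero difference points in the same direction, a Wilcoxon signed-rank statistic takes its largest possible value, n(n+1)/2. The sketch below computes V as the sum of the ranks of the positive differences; it is a generic textbook formulation with zeros dropped and tied ranks averaged, not necessarily the exact routine used for the reported tests, and the function name and example differences are hypothetical.

```python
from typing import List


def wilcoxon_v(differences: List[float]) -> float:
    """Sum of the ranks of the positive differences (zeros dropped, ties averaged)."""
    nonzero = [d for d in differences if d != 0]
    # Indices of the non-zero differences, ordered by absolute size.
    ordered = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(ordered):
        end = start
        # Extend the block while the next absolute difference is tied.
        while (end + 1 < len(ordered)
               and abs(nonzero[ordered[end + 1]]) == abs(nonzero[ordered[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[ordered[position]] = average_rank
        start = end + 1
    return sum(rank for rank, d in zip(ranks, nonzero) if d > 0)


# Six non-zero differences, all positive: V reaches its maximum 6 * 7 / 2 = 21.
print(wilcoxon_v([0.1, 0.4, 0.2, 0.6, 0.3, 0.5]))  # 21.0
```

With six sites all in the positive direction the maximum is 21, and with 47 sites it is 1,128, which is consistent with the V values reported in the footnotes for the hypotheses where every non-zero site favored goal sessions.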
Prior research has examined the use of paths and portions of paths to predict future patch selections (Montgomery et al., 2004; Yang et al., 2004). However, the use of path fragments to segment groups of a Web site population has not been examined in prior literature. Thus, this hypothesis provides an initial result of a positive association between following of trails and goal achievement.

[46] Hypotheses SC9a-c were significant at α = 0.05 for both the t-test (TRAILMAX (t = 3.33; df = 9; p-value = 0.0044); TRAILLAST (t = 3.36; df = 9; p-value = 0.0042); TRAILSUM (t = 2.89; df = 9; p-value = 0.0089)) and Wilcoxon test (V = 21; p-value = 0.0156 for all three measures).

Summary of Results

Table 67 summarizes the results of the hypotheses testing. Of the 13 hypotheses and sub-hypotheses, seven were supported at α = 0.01, four at α = 0.05, and one at α = 0.10. Hypothesis SC3 was not supported in its original form, but the opposite of SC3 was supported at α = 0.01.

Table 67: Site-centric: Hypotheses Results Summary

Hyp.           Metric       Hypothesis Supported?

INFORMATION PATCH – SITE-PATCH
SC1            SITEDUR      Yes ***
SC2            SITEPGS      Yes **
SC3            RETURN       No
SC3 (opp) a                 Yes ***
SC4            REPEAT       Yes *

INFORMATION PATCH – PAGE-PATCH
SC5a           PATCHMAX     Yes ***
SC5b           PATCHLAST    Yes ***
SC5c           PATCHSUM     Yes ***
SC6            PATCHDUR     Yes ***

STRICT INFORMATION SCENT
SC7            UNIQUE       Yes ***
SC8            LINEAR       Yes ***

RELAXED INFORMATION SCENT
SC9a           TRAILMAX     Yes **
SC9b           TRAILLAST    Yes **
SC9c           TRAILSUM     Yes **

a Hypothesis tested in the opposite direction to the original, i.e., leaving and returning will be negatively associated with achieving a goal on this long tail Web site.
* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01
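The star notation used in the summary table can be expressed as a small lookup. This is a sketch under the assumption that each threshold is inclusive and that a result receives the stars of the strictest level it meets; the function name is hypothetical.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the star notation of the summary table.

    Assumption: thresholds are inclusive and the strictest level met wins.
    """
    if p_value <= 0.01:
        return "***"
    if p_value <= 0.05:
        return "**"
    if p_value <= 0.10:
        return "*"
    return ""  # not supported at any tested level


# Sign-test p-values reported for SC2, SC4, and SC3 (original direction).
for label, p in [("SC2", 0.0436), ("SC4", 0.0719), ("SC3", 1.0)]:
    print(label, significance_stars(p) or "not supported")
```

Applied to the sign-test p-values above, SC2 receives two stars, SC4 one star, and SC3 in its original direction none.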

PAGE 233

7.2.3 Sensitivity Analysis

The previous section tested hypotheses SC5a-c, SC6, and SC9a-c from patches and trails mined at the 0.05 significance level. The use of α = 0.05 for learning patches and trails was motivated by prior research that used the same α level when discovering contrast sets (Bay and Pazzani, 1999). However, other significance levels and different means of detecting contrast sets may be used (e.g., support). Therefore, a sensitivity analysis was done to see how the selection of mining criteria used for learning patches and trails may affect the results of the hypotheses.

This section provides descriptive statistics and results for hypotheses SC5a-c, SC6, and SC9a-c at two different significance levels (0.01 and 0.05) and six distinct support levels (0.25–1.50 in 0.25 increments).

Descriptive Statistics

Figure 58 illustrates the number of Web sites that discovered patches (figure 58a) and trails (figure 58b) at all eight significance and support levels mined [47]. Each figure also displays the number of Web sites which did not have a zero difference for the tested measures [48].

[Figure 58: Site-centric: Patch and Trail Sample Size by Significance/Support Levels. Panels: (a) Patches, (b) Trails. Each panel plots the number of sites (Total and No Zeros) against the significance / support levels 0.01, 0.05, 0.25, 0.50, 0.75, 1.00, 1.25, and 1.50.]

Between the two mined significance levels, there was very little change in the number of Web sites for either patches or trails. An increase of only two Web sites for patches (16.67%) and one Web site for trails (11.11%) was seen when moving from the more stringent α = 0.01 to the less stringent α = 0.05.

[47] The actual number of Web sites may be found in tables 69–76, which are introduced later in this subsection.
[48] The number of non-zero Web sites was determined by finding the average number of “no zero” Web sites from hypotheses SC5a-c for patches and SC9a-c for trails.

The difference in sample size between mined support levels was much more dramatic than between mined significance levels. Higher support levels greatly reduced the number of Web sites which discovered patches and trails of the specified value. At the 0.25 support level, 32 Web sites found patches and 35 sites discovered trails. An increase to the 0.50 support level saw a 25.00% drop in patch Web sites (to 24 sites) and a 31.43% decrease in trail Web sites (to 24 sites). An even greater percentage drop in Web sites was seen using the 0.75 support level: a 58.33% decrease in Web sites for both patches and trails (to 10 sites). At the higher support levels there were very few Web sites discovering patches or trails. At the 1.00 support level, only two Web sites discovered patches and five discovered trails; at the 1.25 support level, one Web site discovered patches and none discovered trails.

Figure 59 displays the average value of all sessions for the three patch visitation (figure 59a) and trail following (figure 59b) measures across all eight mining significance and support levels used [49]. In addition, figure 59 also shows the average number of seconds spent within patches for all sessions (figure 59c).

In general, the average values for each of the measures appeared to stay within a relatively narrow range of one another from the 0.01 significance level to the 0.75 support level. Table 68 further reinforces the relative stability of these measures by listing the mean, standard deviation, median, minimum, and maximum values from the average of the first six significance and support levels. The standard deviation for the three patch visitation and three trail following measures ranged from 0.07 to 0.13.

The highest values of PATCHMAX and PATCHLAST were found at the 0.01 significance level (0.36 and 0.34), while TRAILMAX and TRAILLAST were at their highest average values at the 0.05 significance level (0.37 and 0.35). Not surprisingly, both of the sum measures (PATCHSUM and TRAILSUM) had their highest values (0.90 and 0.60) when the support was 0.25 (when the greatest number of patches and trails were discovered [50]).
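The percentage changes in sample size quoted above follow directly from the site counts. The sketch below recomputes them; the helper name is hypothetical, and the counts of 12 and 9 sites at α = 0.01 and 14 and 10 sites at α = 0.05 are taken from the N columns of tables 69 and 70.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return 100.0 * (after - before) / before


# Support 0.25 -> 0.50: patch sites 32 -> 24, trail sites 35 -> 24.
print(round(percent_change(32, 24), 2))  # -25.0
print(round(percent_change(35, 24), 2))  # -31.43
# Support 0.50 -> 0.75: both patch and trail sites 24 -> 10.
print(round(percent_change(24, 10), 2))  # -58.33
# Significance 0.01 -> 0.05: patch sites 12 -> 14, trail sites 9 -> 10.
print(round(percent_change(12, 14), 2))  # 16.67
print(round(percent_change(9, 10), 2))   # 11.11
```

Each change is measured relative to the count at the preceding, less restrictive or more restrictive, level rather than to a fixed baseline.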
The PATCHDUR measure was the only metric that continued to increase through all the significance and support mining levels [51]. The increase may have been due to three inter-related reasons.

First, the average number of patches per site increased from the 0.01 significance level to the 0.25 support level. An increased number of patches generally meant greater coverage of the Web site (e.g., 23.37% to 42.22% coverage between the 0.01 significance level and 0.25 support level) [52]. Therefore, the duration spent in patches may have more closely aligned with the amount of time a forager spent on the Web site as a whole.

The second reason may be the increase in average patch size. For example, the average size of patches went from 1.67 to 2.25 pages when going from the 0.01 significance level to the 0.50 support level. When the size of a patch increased, the total duration within the patch included the duration of more pages. Therefore, an increased total duration within a patch may then lead to higher median patch durations.

The final reason may have been that the average value of a patch increased. For example, between the 0.25 and 0.75 support levels the average patch value increased from 0.42 to 0.88. The assumption was foragers would spend more time within the more valuable patches. Therefore, when a site only had valuable patches (e.g., at support level 0.75), then (1) more time should have been spent within those patches and (2) the median patch duration of the forager was not reduced by the visitation of less valuable patches (where less time within the patch would be expected).

[Figure 59: Site-centric: All Patch and Trail Metrics by Significance/Support Levels. Panels: (a) Patches (PATCHMAX, PATCHLAST, PATCHSUM), (b) Trails (TRAILMAX, TRAILLAST, TRAILSUM), (c) Patch Duration. Panels (a) and (b) plot the average metric statistic per site and panel (c) the average seconds per site against the significance / support levels 0.01, 0.05, 0.25, 0.50, 0.75, 1.00, 1.25, and 1.50.]

Table 68: Site-centric: Sensitivity Analysis Metric Statistics

Metric                     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX                   0.28     0.07      0.28      0.17     0.36
PATCHLAST                  0.24     0.08      0.24      0.12     0.34
PATCHSUM                   0.77     0.09      0.78      0.64     0.90
PATCHDUR (in seconds)     82.75    20.35     81.35     58.46   107.71

RELAXED INFORMATION SCENT
TRAILMAX                   0.24     0.09      0.21      0.14     0.37
TRAILLAST                  0.21     0.10      0.19      0.11     0.35
TRAILSUM                   0.42     0.13      0.36      0.31     0.60

[49] The results from the 1.00 support level and above should be interpreted with caution as the averages were calculated from very few Web sites. In addition, the averages displayed in the figures were calculated by including sessions which did not visit a patch or trail. Therefore, the average metric may be lower than should otherwise be possible. For example, the lowest average of patches learned at the 0.50 support level should be 0.50. However, the average PATCHMAX value was 0.17 for all sessions.
[50] See tables 77 and 78 for more details.
[51] The measure was calculated by finding the median amount of time spent in all patches by a forager.
[52] Statistics about patch characteristics may be found in §7.2.3.


Figure 60 expands upon figure 59 by illustrating the average value of the seven measures for three groups of sessions: all, goal, and non-goal [53].

Tables 69–76 list the mean, standard deviation, median, minimum, and maximum values for the seven metrics that were calculated from mined patches and trails at the eight different significance and support levels. The statistics for the metrics are displayed in three groups of sessions: all, goal, and non-goal.

Patch and Trail Descriptive Statistics

Figure 61 illustrates the differences between the significance and support levels for four different statistics [54]: number of patches and trails (figure 61a), size of patches and trails (figure 61b), percentage of coverage of patches and trails (figure 61c), and the value of patches and trails (figure 61d). Each figure displays the statistic of patches and trails for each of the metrics. Figure 61 also displays the number of patches visited (figure 61e) and trails followed (figure 61f) from three groups of sessions: all, goal, and non-goal.

Tables 77–86 list the mean, standard deviation, median, minimum, and maximum values for the metrics displayed in figure 61 across all eight significance and support mining levels.

The average number of patches and trails found on a Web site followed a similar pattern for the first five levels, with more patches being discovered than trails. Not surprisingly, the greatest average number of patches (50.16) and trails (39.60) was found using the least stringent support level (0.25). In comparing the significance and support levels, there was no direct equivalent of either significance level within the selected support levels. For example, in order to obtain a similar number of patches and trails as found at α = 0.05 (11.93 patches and 4.70 trails), the support level would need to have been between 0.75 and 1.00 (8.00–20.20 patches and 1.60–7.20 trails).

The average size of patches and trails roughly followed a \ shape over the significance and support levels, with trails being larger in size than patches for all but the 0.50 support level (2.63 pages per patch versus 2.56 pages per trail) [55]. Patches and trails discovered using significance were smaller in size than those patches and trails found from the first three support levels. For example, patches were 1.82 pages in size and trails were 2.15 pages long at α = 0.05. The first

[53] The actual values used in the figures may be found in tables 69–76, which are introduced later in this subsection.
[54] The analysis of patch and trail descriptive statistics does not include support levels greater than 0.75, as too few Web sites found patches and trails at those support levels to provide reliable metric averages.
[55] Trails were restricted to a minimum of two pages in sequence, whereas patches could be one page in size.


[Figure 60: Site-centric: Patch and Trail Metrics by Significance/Support Levels. Panels: (a) PATCHMAX, (b) PATCHLAST, (c) PATCHSUM, (d) PATCHDUR, (e) TRAILMAX, (f) TRAILLAST, (g) TRAILSUM. Each panel plots the average metric statistic per site (average patch duration in seconds for panel d) for all, goal, and non-goal sessions against the significance / support levels 0.01, 0.05, 0.25, 0.50, 0.75, 1.00, 1.25, and 1.50.]


Table 69: Site-centric: Metric Statistics (Significant – 0.01)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                      12     0.36     0.41      0.20      0.00     1.31
  Goal                            0.60     0.43      0.74      0.00     1.31
  Non-goal                        0.12     0.21      0.00      0.00     0.53
PATCHLAST
  All                      12     0.34     0.37      0.20      0.00     1.03
  Goal                            0.56     0.37      0.70      0.00     1.03
  Non-goal                        0.12     0.21      0.00      0.00     0.53
PATCHSUM
  All                      12     0.78     1.18      0.20      0.00     4.53
  Goal                            1.40     1.41      1.06      0.00     4.53
  Non-goal                        0.16     0.31      0.00      0.00     0.95
PATCHDUR (in seconds)
  All                      12    58.46    40.41     47.75     13.00   162.75
  Goal                           80.23    46.75     71.25     17.00   162.75
  Non-goal                       36.69    13.97     37.88     13.00    62.50

RELAXED INFORMATION SCENT
TRAILMAX
  All                       9     0.28     0.38      0.00      0.00     1.04
  Goal                            0.50     0.42      0.50      0.00     1.04
  Non-goal                        0.06     0.17      0.00      0.00     0.50
TRAILLAST
  All                       9     0.27     0.37      0.00      0.00     1.04
  Goal                            0.48     0.40      0.50      0.00     1.04
  Non-goal                        0.06     0.17      0.00      0.00     0.50
TRAILSUM
  All                       9     0.33     0.51      0.00      0.00     1.80
  Goal                            0.60     0.60      0.50      0.00     1.80
  Non-goal                        0.06     0.17      0.00      0.00     0.50

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


Table 70: Site-centric: Metric Statistics (Significant – 0.05)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                      14     0.32     0.40      0.00      0.00     1.31
  Goal                            0.54     0.43      0.60      0.00     1.31
  Non-goal                        0.10     0.21      0.00      0.00     0.57
PATCHLAST
  All                      14     0.29     0.34      0.00      0.00     1.03
  Goal                            0.47     0.36      0.47      0.00     1.03
  Non-goal                        0.10     0.20      0.00      0.00     0.51
PATCHSUM
  All                      14     0.79     1.25      0.00      0.00     4.53
  Goal                            1.40     1.52      1.06      0.00     4.53
  Non-goal                        0.17     0.36      0.00      0.00     1.04
PATCHDUR (in seconds)
  All                      14    68.28    47.02     51.38     19.00   178.00
  Goal                           89.48    53.06     76.63     19.00   178.00
  Non-goal                       47.07    28.43     38.88     20.00   134.00

RELAXED INFORMATION SCENT
TRAILMAX
  All                      10     0.27     0.37      0.00      0.00     1.04
  Goal                            0.50     0.40      0.52      0.00     1.04
  Non-goal                        0.05     0.16      0.00      0.00     0.50
TRAILLAST
  All                      10     0.26     0.35      0.00      0.00     1.04
  Goal                            0.48     0.37      0.52      0.00     1.04
  Non-goal                        0.05     0.16      0.00      0.00     0.50
TRAILSUM
  All                      10     0.33     0.50      0.00      0.00     1.80
  Goal                            0.61     0.58      0.52      0.00     1.80
  Non-goal                        0.05     0.16      0.00      0.00     0.50

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


Table 71: Site-centric: Metric Statistics (Supported – 0.25)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                      32     0.27     0.31      0.26      0.00     1.31
  Goal                            0.37     0.36      0.32      0.00     1.31
  Non-goal                        0.17     0.21      0.00      0.00     0.57
PATCHLAST
  All                      32     0.20     0.20      0.26      0.00     0.68
  Goal                            0.26     0.22      0.28      0.00     0.68
  Non-goal                        0.15     0.18      0.00      0.00     0.48
PATCHSUM
  All                      32     0.90     1.68      0.27      0.00     6.99
  Goal                            1.44     2.20      0.60      0.00     6.99
  Non-goal                        0.37     0.56      0.00      0.00     2.05
PATCHDUR (in seconds)
  All                      31    81.35    51.13     71.13     16.50   274.00
  Goal                           98.89    59.29     89.00     26.00   274.00
  Non-goal                       63.82    34.13     51.75     16.50   130.00

RELAXED INFORMATION SCENT
TRAILMAX
  All                      35     0.21     0.27      0.00      0.00     1.04
  Goal                            0.28     0.32      0.25      0.00     1.04
  Non-goal                        0.14     0.20      0.00      0.00     0.58
TRAILLAST
  All                      35     0.15     0.19      0.00      0.00     0.63
  Goal                            0.19     0.20      0.25      0.00     0.63
  Non-goal                        0.11     0.16      0.00      0.00     0.50
TRAILSUM
  All                      35     0.60     1.57      0.00      0.00    10.72
  Goal                            1.00     2.14      0.25      0.00    10.72
  Non-goal                        0.21     0.33      0.00      0.00     1.13

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


Table 72: Site-centric: Metric Statistics (Supported – 0.50)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                      24     0.17     0.35      0.00      0.00     1.31
  Goal                            0.32     0.44      0.00      0.00     1.31
  Non-goal                        0.02     0.12      0.00      0.00     0.57
PATCHLAST
  All                      24     0.12     0.25      0.00      0.00     0.79
  Goal                            0.23     0.31      0.00      0.00     0.79
  Non-goal                        0.02     0.10      0.00      0.00     0.51
PATCHSUM
  All                      24     0.64     1.62      0.00      0.00     6.35
  Goal                            1.22     2.14      0.00      0.00     6.35
  Non-goal                        0.07     0.32      0.00      0.00     1.58
PATCHDUR (in seconds)
  All                      24    97.97    60.06     84.00     13.00   348.25
  Goal                          112.55    67.35     95.25     17.00   348.25
  Non-goal                       83.40    48.90     70.75     13.00   171.75

RELAXED INFORMATION SCENT
TRAILMAX
  All                      24     0.14     0.30      0.00      0.00     1.04
  Goal                            0.24     0.37      0.00      0.00     1.04
  Non-goal                        0.05     0.16      0.00      0.00     0.58
TRAILLAST
  All                      24     0.11     0.22      0.00      0.00     0.68
  Goal                            0.17     0.25      0.00      0.00     0.68
  Non-goal                        0.05     0.16      0.00      0.00     0.58
TRAILSUM
  All                      24     0.36     1.01      0.00      0.00     5.55
  Goal                            0.67     1.35      0.00      0.00     5.55
  Non-goal                        0.05     0.16      0.00      0.00     0.58

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


Table 73: Site-centric: Metric Statistics (Supported – 0.75)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                      10     0.28     0.45      0.00      0.00     1.31
  Goal                            0.56     0.50      0.76      0.00     1.31
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHLAST
  All                      10     0.24     0.39      0.00      0.00     1.03
  Goal                            0.49     0.43      0.76      0.00     1.03
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHSUM
  All                      10     0.75     1.43      0.00      0.00     4.33
  Goal                            1.49     1.75      0.76      0.00     4.33
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHDUR (in seconds)
  All                      10   107.71    90.64     89.38     13.00   398.25
  Goal                          130.60   109.67    105.88     17.00   398.25
  Non-goal                       84.83    64.42     71.00     13.00   240.75

RELAXED INFORMATION SCENT
TRAILMAX
  All                      10     0.21     0.38      0.00      0.00     1.04
  Goal                            0.41     0.46      0.22      0.00     1.04
  Non-goal                        0.00     0.00      0.00      0.00     0.00
TRAILLAST
  All                      10     0.19     0.36      0.00      0.00     1.04
  Goal                            0.38     0.43      0.20      0.00     1.04
  Non-goal                        0.00     0.00      0.00      0.00     0.00
TRAILSUM
  All                      10     0.31     0.61      0.00      0.00     1.81
  Goal                            0.62     0.76      0.22      0.00     1.81
  Non-goal                        0.00     0.00      0.00      0.00     0.00

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


Table 74: Site-centric: Metric Statistics (Supported – 1.00)

Metric / Sessions           N     Mean   St. Dev.   Median      Min      Max

INFORMATION PATCH – PAGE-PATCH
PATCHMAX
  All                       2     0.33     0.65      0.00      0.00     1.31
  Goal                            0.65     0.92      0.65      0.00     1.31
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHLAST
  All                       2     0.26     0.52      0.00      0.00     1.03
  Goal                            0.52     0.73      0.52      0.00     1.03
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHSUM
  All                       2     0.58     1.17      0.00      0.00     2.34
  Goal                            1.17     1.65      1.17      0.00     2.34
  Non-goal                        0.00     0.00      0.00      0.00     0.00
PATCHDUR (in seconds)
  All                       2   105.88    54.06    102.88     43.00   174.75
  Goal                          136.25    54.45    136.25     97.75   174.75
  Non-goal                       75.50    45.96     75.50     43.00   108.00

RELAXED INFORMATION SCENT
TRAILMAX
  All                       5     0.10     0.33      0.00      0.00     1.04
  Goal                            0.21     0.47      0.00      0.00     1.04
  Non-goal                        0.00     0.00      0.00      0.00     0.00
TRAILLAST
  All                       5     0.10     0.33      0.00      0.00     1.04
  Goal                            0.21     0.47      0.00      0.00     1.04
  Non-goal                        0.00     0.00      0.00      0.00     0.00
TRAILSUM
  All                       5     0.10     0.33      0.00      0.00     1.04
  Goal                            0.21     0.47      0.00      0.00     1.04
  Non-goal                        0.00     0.00      0.00      0.00     0.00

Note: all values are based on the median values from each Web site's goal and non-goal sessions.


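Table 75: Site-centric: Metric Statistics (Supported – 1.25)

| Metric | Sessions | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | |
| PATCHMAX | All | 1 | 0.65 | 0.92 | 0.65 | 0.00 | 1.31 |
| | Goal | | 1.31 | 0.00 | 1.31 | 1.31 | 1.31 |
| | Non-goal | | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PATCHLAST | All | 1 | 0.65 | 0.92 | 0.65 | 0.00 | 1.31 |
| | Goal | | 1.31 | 0.00 | 1.31 | 1.31 | 1.31 |
| | Non-goal | | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PATCHSUM | All | 1 | 0.65 | 0.92 | 0.65 | 0.00 | 1.31 |
| | Goal | | 1.31 | 0.00 | 1.31 | 1.31 | 1.31 |
| | Non-goal | | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| PATCHDUR (in seconds) | All | 1 | 39.25 | 21.57 | 39.25 | 24.00 | 54.50 |
| | Goal | | 54.50 | 0.00 | 54.50 | 54.50 | 54.50 |
| | Non-goal | | 24.00 | 0.00 | 24.00 | 24.00 | 24.00 |
| RELAXED INFORMATION SCENT | | | | | | | |
| TRAILMAX | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| TRAILLAST | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| TRAILSUM | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |

Note: all values are based on the median values from each Web site's goal and non-goal sessions.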


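Table 76: Site-centric: Metric Statistics (Supported – 1.50)

| Metric | Sessions | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | |
| PATCHMAX | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| PATCHLAST | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| PATCHSUM | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| PATCHDUR (in seconds) | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| RELAXED INFORMATION SCENT | | | | | | | |
| TRAILMAX | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| TRAILLAST | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |
| TRAILSUM | All | 0 | n/a | n/a | n/a | n/a | n/a |
| | Goal | | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | | n/a | n/a | n/a | n/a | n/a |

Note: all values are based on the median values from each Web site's goal and non-goal sessions.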


[Figure 61: Site-centric: Average Patch and Trail Statistics Per Site. Six panels, each plotted against the significance / support levels (0.01, 0.05, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50): (a) Number of Patches and Trails (avg. patches/trails per site); (b) Size of Patches and Trails (avg. patch/trail size per site); (c) % Coverage of Patches and Trails (avg. % of coverage per site); (d) Value of Patches and Trails (avg. patch/trail value per site); (e) Patches Visited (avg. patches visited per site, for all, goal, and non-goal sessions); (f) Trails Followed (avg. trails followed per site, for all, goal, and non-goal sessions).]


three support levels had patches ranging in size from 2.25 to 2.63 pages and trails from 2.45 to 2.63 pages.

The percentage of coverage for both patches and trails was nearly identical for the first five significance and support levels. The pattern of coverage roughly mirrored that of the number of patches and trails found in figure 61a. As the number of available patches and trails increased, the likelihood that more pages from a Web site would be included in a patch also increased. Thus, Web site coverage rose and fell in a similar direction and degree as the number of discovered patches and trails. For example, the lowest number of patches and trails found along with the smallest coverage percentage was at α = 0.01 (10.58 patches with 23.37% coverage and 3.44 trails with 20.25% coverage). In contrast, the highest number of patches, trails, and coverage percentage was at support level 0.25 (50.16 patches with 42.22% coverage and 39.60 trails with 40.00% coverage).

The average value of patches and trails was relatively constant across the significance levels, but increased steadily with each support level [56]. A noticeable difference between patches and trails was only present for the two significance levels (e.g., 0.70 patch value versus 0.83 trail value at α = 0.01). In comparing the significance and support levels, there was not a direct equivalent of either significance level found within the selected support levels. For example, in order to obtain a similar value for patches and trails as found at α = 0.05 (0.67 patch value and 0.79 trail value), the support level would need to have been between 0.50 and 0.75 (0.60–0.83 patch value and 0.61–0.88 trail value). However, a support level between 0.50 and 0.75 would still not be equivalent, since the value of significant patches was as low as 0.27 and 0.28 for α = 0.01 and 0.05, respectively.

The last measure (figures 61e and 61f) illustrated the number of patches visited and trails followed by foragers. Goal sessions visited more patches and followed more trails across all the different significance and support levels than non-goal sessions. In addition, the general shape of both figures followed the number of patches and trails found on a site. For example, the highest numbers of patches found and visited were both seen at support level 0.25 (50.16 patches discovered with 7.74 patches visited by goal sessions).

[56] The increase of value for each support level was not surprising since the support level created a minimum allowable value for any included patches or trails.


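Table 77: Site-centric: Number of Patches by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 12 | 10.58 | 26.35 | 2.00 | 1 | 94 |
| Significance 0.05 | 14 | 11.93 | 28.74 | 3.50 | 1 | 111 |
| Supported 0.25 | 32 | 50.16 | 132.79 | 14.50 | 2 | 748 |
| Supported 0.50 | 24 | 31.04 | 83.50 | 6.50 | 1 | 412 |
| Supported 0.75 | 10 | 20.20 | 47.69 | 2.00 | 1 | 155 |
| Supported 1.00 | 2 | 8.00 | 8.49 | 8.00 | 2 | 14 |
| Supported 1.25 | 1 | 1.00 | 0.00 | 1.00 | 1 | 1 |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |

Table 78: Site-centric: Number of Trails by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 9 | 3.44 | 6.29 | 1.00 | 1 | 20 |
| Significance 0.05 | 10 | 4.70 | 8.04 | 1.50 | 1 | 27 |
| Supported 0.25 | 35 | 39.60 | 97.97 | 12.00 | 1 | 491 |
| Supported 0.50 | 24 | 18.21 | 40.69 | 5.00 | 1 | 188 |
| Supported 0.75 | 10 | 7.20 | 11.60 | 3.00 | 1 | 39 |
| Supported 1.00 | 5 | 1.60 | 0.89 | 1.00 | 1 | 3 |
| Supported 1.25 | 0 | n/a | n/a | n/a | n/a | n/a |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |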


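Table 79: Site-centric: Patch Size by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 12 | 1.67 | 0.54 | 1.50 | 1.00 | 3.00 |
| Significance 0.05 | 14 | 1.82 | 0.64 | 2.00 | 1.00 | 3.00 |
| Supported 0.25 | 32 | 2.42 | 0.72 | 2.25 | 1.00 | 4.00 |
| Supported 0.50 | 24 | 2.63 | 0.89 | 2.75 | 1.00 | 4.00 |
| Supported 0.75 | 10 | 2.25 | 0.75 | 2.25 | 1.00 | 3.00 |
| Supported 1.00 | 2 | 1.75 | 0.35 | 1.75 | 1.50 | 2.00 |
| Supported 1.25 | 1 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |

Table 80: Site-centric: Trail Size by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 9 | 2.00 | 0.00 | 2.00 | 2.00 | 2.00 |
| Significance 0.05 | 10 | 2.15 | 0.34 | 2.00 | 2.00 | 3.00 |
| Supported 0.25 | 35 | 2.63 | 0.65 | 3.00 | 2.00 | 5.00 |
| Supported 0.50 | 24 | 2.56 | 0.74 | 2.50 | 2.00 | 5.00 |
| Supported 0.75 | 10 | 2.45 | 0.50 | 2.25 | 2.00 | 3.00 |
| Supported 1.00 | 5 | 2.30 | 0.67 | 2.00 | 2.00 | 3.50 |
| Supported 1.25 | 0 | n/a | n/a | n/a | n/a | n/a |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |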


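Table 81: Site-centric: Patch Coverage by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 12 | 23.37% | 13.88% | 19.62% | 7.14% | 50.00% |
| Significance 0.05 | 14 | 28.63% | 13.85% | 26.79% | 10.00% | 50.00% |
| Supported 0.25 | 32 | 42.22% | 16.79% | 43.65% | 7.89% | 70.00% |
| Supported 0.50 | 24 | 36.39% | 16.04% | 35.92% | 5.26% | 70.00% |
| Supported 0.75 | 10 | 28.95% | 11.66% | 30.08% | 12.50% | 47.62% |
| Supported 1.00 | 2 | 32.90% | 20.82% | 32.90% | 18.18% | 47.62% |
| Supported 1.25 | 1 | 9.09% | 0.00% | 9.09% | 9.09% | 9.09% |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |

Table 82: Site-centric: Trail Coverage by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 9 | 20.25% | 12.78% | 18.18% | 3.45% | 47.62% |
| Significance 0.05 | 10 | 26.47% | 14.80% | 27.44% | 6.90% | 50.00% |
| Supported 0.25 | 35 | 40.00% | 17.18% | 42.11% | 2.53% | 70.00% |
| Supported 0.50 | 24 | 33.44% | 13.43% | 33.33% | 6.90% | 55.56% |
| Supported 0.75 | 10 | 29.40% | 14.58% | 27.44% | 8.33% | 50.00% |
| Supported 1.00 | 5 | 23.02% | 15.30% | 18.18% | 12.50% | 50.00% |
| Supported 1.25 | 0 | n/a | n/a | n/a | n/a | n/a |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |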


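Table 83: Site-centric: Patch Value by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 12 | 0.70 | 0.23 | 0.71 | 0.27 | 1.17 |
| Significance 0.05 | 14 | 0.67 | 0.16 | 0.67 | 0.28 | 0.88 |
| Supported 0.25 | 32 | 0.44 | 0.11 | 0.43 | 0.28 | 0.67 |
| Supported 0.50 | 24 | 0.60 | 0.06 | 0.58 | 0.52 | 0.75 |
| Supported 0.75 | 10 | 0.83 | 0.12 | 0.78 | 0.75 | 1.17 |
| Supported 1.00 | 2 | 1.13 | 0.05 | 1.13 | 1.09 | 1.17 |
| Supported 1.25 | 1 | 1.31 | 0.00 | 1.31 | 1.31 | 1.31 |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |

Table 84: Site-centric: Trail Value by Site

| Mining level | N | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | 9 | 0.83 | 0.21 | 0.86 | 0.50 | 1.06 |
| Significance 0.05 | 10 | 0.79 | 0.19 | 0.82 | 0.46 | 1.05 |
| Supported 0.25 | 35 | 0.42 | 0.10 | 0.41 | 0.27 | 0.78 |
| Supported 0.50 | 24 | 0.61 | 0.06 | 0.60 | 0.53 | 0.78 |
| Supported 0.75 | 10 | 0.88 | 0.10 | 0.83 | 0.80 | 1.05 |
| Supported 1.00 | 5 | 1.06 | 0.04 | 1.05 | 1.00 | 1.12 |
| Supported 1.25 | 0 | n/a | n/a | n/a | n/a | n/a |
| Supported 1.50 | 0 | n/a | n/a | n/a | n/a | n/a |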


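Table 85: Site-centric: Patch Visitation by Site

| Mining level | Sessions | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | All | 2.54 | 1.89 | 2.00 | 1.00 | 10.00 |
| | Goal | 3.17 | 2.44 | 2.00 | 1.00 | 10.00 |
| | Non-goal | 1.92 | 0.79 | 2.00 | 1.00 | 4.00 |
| Significance 0.05 | All | 2.75 | 2.25 | 2.00 | 1.00 | 11.00 |
| | Goal | 3.50 | 2.93 | 2.00 | 1.00 | 11.00 |
| | Non-goal | 2.00 | 0.88 | 2.00 | 1.00 | 4.00 |
| Supported 0.25 | All | 5.81 | 5.88 | 4.00 | 1.00 | 26.00 |
| | Goal | 7.74 | 7.58 | 4.00 | 1.00 | 26.00 |
| | Non-goal | 3.87 | 2.23 | 4.00 | 1.00 | 9.00 |
| Supported 0.50 | All | 4.70 | 4.49 | 3.00 | 1.00 | 21.00 |
| | Goal | 5.56 | 5.68 | 3.00 | 1.00 | 21.00 |
| | Non-goal | 3.83 | 2.73 | 3.00 | 1.00 | 11.00 |
| Supported 0.75 | All | 3.65 | 3.59 | 2.00 | 1.00 | 14.00 |
| | Goal | 4.50 | 4.60 | 2.00 | 1.00 | 14.00 |
| | Non-goal | 2.80 | 2.10 | 2.00 | 1.00 | 7.00 |
| Supported 1.00 | All | 3.38 | 1.70 | 3.00 | 2.00 | 5.50 |
| | Goal | 3.75 | 2.47 | 3.75 | 2.00 | 5.50 |
| | Non-goal | 3.00 | 1.41 | 3.00 | 2.00 | 4.00 |
| Supported 1.25 | All | 1.00 | n/a | 1.00 | 1.00 | 1.00 |
| | Goal | 1.00 | n/a | 1.00 | 1.00 | 1.00 |
| | Non-goal | 1.00 | n/a | 1.00 | 1.00 | 1.00 |
| Supported 1.50 | All | n/a | n/a | n/a | n/a | n/a |
| | Goal | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | n/a | n/a | n/a | n/a | n/a |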


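Table 86: Site-centric: Trail Following by Site

| Mining level | Sessions | Mean | St. Dev. | Median | Min | Max |
|---|---|---|---|---|---|---|
| Significance 0.01 | All | 1.36 | 0.94 | 1.00 | 1.00 | 4.50 |
| | Goal | 1.61 | 1.27 | 1.00 | 1.00 | 4.50 |
| | Non-goal | 1.11 | 0.33 | 1.00 | 1.00 | 2.00 |
| Significance 0.05 | All | 1.60 | 1.14 | 1.00 | 1.00 | 5.00 |
| | Goal | 1.80 | 1.48 | 1.00 | 1.00 | 5.00 |
| | Non-goal | 1.40 | 0.70 | 1.00 | 1.00 | 3.00 |
| Supported 0.25 | All | 4.23 | 5.24 | 3.00 | 1.00 | 34.00 |
| | Goal | 5.32 | 6.86 | 3.00 | 1.00 | 34.00 |
| | Non-goal | 3.13 | 2.49 | 3.00 | 1.00 | 10.00 |
| Supported 0.50 | All | 3.18 | 3.05 | 2.00 | 1.00 | 14.00 |
| | Goal | 3.77 | 3.68 | 2.00 | 1.00 | 14.00 |
| | Non-goal | 2.58 | 2.17 | 2.00 | 1.00 | 11.00 |
| Supported 0.75 | All | 2.56 | 1.62 | 2.50 | 1.00 | 6.00 |
| | Goal | 2.78 | 1.86 | 3.00 | 1.00 | 6.00 |
| | Non-goal | 2.33 | 1.41 | 2.00 | 1.00 | 4.00 |
| Supported 1.00 | All | 1.50 | 0.71 | 1.00 | 1.00 | 3.00 |
| | Goal | 1.60 | 0.89 | 1.00 | 1.00 | 3.00 |
| | Non-goal | 1.40 | 0.55 | 1.00 | 1.00 | 2.00 |
| Supported 1.25 | All | n/a | n/a | n/a | n/a | n/a |
| | Goal | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | n/a | n/a | n/a | n/a | n/a |
| Supported 1.50 | All | n/a | n/a | n/a | n/a | n/a |
| | Goal | n/a | n/a | n/a | n/a | n/a |
| | Non-goal | n/a | n/a | n/a | n/a | n/a |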


Hypotheses Testing
Table 87 presents a summary of the results from each of the different significance and support mining levels [57]. The table lists the hypothesis number and metric being tested in the first two columns. Columns three and four present the results when patches and trails were mined using a significance value of 0.01 and 0.05. The final six columns provide the results when the specified support level (0.25 to 1.50 in 0.25 increments) was used to learn patches and trails [58].

Table 87: Site-centric: Patches and Trails Hypotheses Results Summary (Hypothesis Supported?)

| Hyp. | Metric | Significance 0.01 | Significance 0.05 | Support 0.25 | Support 0.50 | Support 0.75 | Support 1.00 | Support 1.25 | Support 1.50 |
|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | |
| SC5a | PATCHMAX | Yes ** | Yes *** | Yes *** | Yes *** | Yes ** | No | No | n/a |
| SC5b | PATCHLAST | Yes ** | Yes *** | Yes *** | Yes *** | Yes ** | No | No | n/a |
| SC5c | PATCHSUM | Yes ** | Yes *** | Yes ** | Yes *** | Yes ** | No | No | n/a |
| SC6 | PATCHDUR | Yes *** | Yes *** | Yes *** | Yes *** | Yes ** | No | No | n/a |
| RELAXED INFORMATION SCENT | | | | | | | | | |
| SC9a | TRAILMAX | Yes * | Yes ** | Yes * | Yes * | Yes * | No | n/a | n/a |
| SC9b | TRAILLAST | Yes * | Yes ** | Yes ** | Yes * | Yes * | No | n/a | n/a |
| SC9c | TRAILSUM | Yes * | Yes ** | Yes * | Yes * | Yes * | No | n/a | n/a |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01

In general, the results for all seven measures appeared to hold fairly steady across the first five significance and support mining levels. The PATCHMAX and PATCHLAST measures both had the same pattern of significant findings. The metrics were significant at α = 0.01 for all but the most stringent significance (0.01) and support mining levels (0.75), where the measures were both significant at α = 0.05. PATCHSUM followed a similar pattern as the other two patch visitation measures. However, unlike PATCHMAX and PATCHLAST, PATCHSUM was only significant at α = 0.05 for patches mined at the 0.25 support level. The drop in significance may be a symptom of the patches found at the 0.25 support mining level covering too much of a Web site (42.22% average coverage) to be as effective at distinguishing between goal and non-goal sessions.

The PATCHDUR metric was significant at α = 0.01 for all levels except the 0.75 support mining level, where the measure was significant at α = 0.05. The decrease in significance may be due to the sign test's lack of power in detecting differences at α = 0.01 with a sample size of only 10 Web sites.

TRAILMAX, TRAILLAST, and TRAILSUM were all significant at α = 0.10 except for trails mined at the 0.05 significance level, where all three measures were significant at α = 0.05. In addition, the TRAILLAST measure was also significant at α = 0.05 at the 0.25 support mining level. A lack of power by the sign test to adequately detect a difference in such small sample sizes (e.g., five to nine Web sites) was the primary suspect for many of the measures only reaching a significance of α = 0.10.

Tables 88–95 present the results of all eight significance and support mining levels for all three statistical tests. Following the tables, figure 62 illustrates the p-values obtained from the statistical tests for each of the seven measures. The graphs show the results of the three tests over the first five significance and support mining levels (0.01–0.75).

[57] The results were determined from the sign test. As the data used for the sensitivity analysis was from the same dataset that was used to test the site-centric model, the assumptions of the sign test still held.
[58] The analysis of results does not include supported levels greater than 0.75. There were too few Web sites at those mined support levels to possibly obtain statistically significant results.
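The power argument above can be made concrete with a short sketch. It assumes the sign test is the one-sided exact binomial test on the non-zero differences, which is an assumption on our part, although it does reproduce several of the sign-test p-values reported in tables 88–92. The function names are illustrative only and do not come from the dissertation.

```python
from math import comb


def sign_test_p(n_positive: int, n_nonzero: int) -> float:
    """One-sided exact sign test: P(X >= n_positive), X ~ Binomial(n_nonzero, 0.5).

    Assumed form of the test; zero differences are dropped beforehand.
    """
    tail = sum(comb(n_nonzero, k) for k in range(n_positive, n_nonzero + 1))
    return tail / 2 ** n_nonzero


def min_nonzero_sites(alpha: float) -> int:
    """Smallest number of non-zero differences for which the best case
    (every difference positive) can reach p <= alpha."""
    n = 1
    while 0.5 ** n > alpha:
        n += 1
    return n


if __name__ == "__main__":
    # Thresholds divided by three, as used for hypotheses SC5a-c and SC9a-c.
    for label, alpha in [("0.10/3", 0.10 / 3), ("0.05/3", 0.05 / 3), ("0.01/3", 0.01 / 3)]:
        print(label, "needs at least", min_nonzero_sites(alpha), "non-zero sites")
    print("17 of 20 positive:", round(sign_test_p(17, 20), 4))
```

Under this assumption, five non-zero Web sites can never do better than p = 0.5^5 = 0.03125, which clears 0.10/3 but not 0.05/3, no matter how consistent the differences are. That is one way to read why several trail measures stop at α = 0.10.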


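Table 88: Site-centric: Results (Significant – 0.01)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 12 | 8 | 3.72 | 11 | 0.0017 *** | 36 | 0.0039 ** | 8 | 0.0039 ** |
| SC5b | PATCHLAST | 12 | 8 | 3.90 | 11 | 0.0012 *** | 36 | 0.0039 ** | 8 | 0.0039 ** |
| SC5c | PATCHSUM | 12 | 8 | 2.92 | 11 | 0.0070 ** | 36 | 0.0039 ** | 8 | 0.0039 ** |
| SC6 | PATCHDUR | 12 | 12 | 3.93 | 11 | 0.0012 *** | 78 | 0.0002 *** | 12 | 0.0002 *** |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 9 | 5 | 2.92 | 8 | 0.0096 ** | 15 | 0.0313 * | 5 | 0.0313 * |
| SC9b | TRAILLAST | 9 | 5 | 2.94 | 8 | 0.0094 ** | 15 | 0.0313 * | 5 | 0.0313 * |
| SC9c | TRAILSUM | 9 | 5 | 2.57 | 8 | 0.0165 ** | 15 | 0.0313 * | 5 | 0.0313 * |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).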


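Table 89: Site-centric: Results (Significant – 0.05)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 14 | 9 | 3.68 | 13 | 0.0014 *** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC5b | PATCHLAST | 14 | 9 | 3.92 | 13 | 0.0009 *** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC5c | PATCHSUM | 14 | 9 | 3.00 | 13 | 0.0051 ** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC6 | PATCHDUR | 14 | 14 | 4.11 | 13 | 0.0006 *** | 100 | 0.0006 *** | 13 | 0.0009 *** |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 10 | 6 | 3.33 | 9 | 0.0044 ** | 21 | 0.0156 ** | 6 | 0.0156 ** |
| SC9b | TRAILLAST | 10 | 6 | 3.36 | 9 | 0.0042 ** | 21 | 0.0156 ** | 6 | 0.0156 ** |
| SC9c | TRAILSUM | 10 | 6 | 2.89 | 9 | 0.0089 ** | 21 | 0.0156 ** | 6 | 0.0156 ** |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).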


Table 90: Site-centric: Results (Supported – 0.25)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 32 | 20 | 3.49 | 31 | 0.0007 *** | 191 | 0.0003 *** | 17 | 0.0013 *** |
| SC5b | PATCHLAST | 32 | 17 | 3.36 | 31 | 0.0011 *** | 141 | 0.0005 *** | 15 | 0.0012 *** |
| SC5c | PATCHSUM | 32 | 21 | 3.04 | 31 | 0.0024 *** | 210 | 0.0002 *** | 17 | 0.0036 ** |
| SC6 | PATCHDUR (a) | 31 | 31 | 4.20 | 30 | 0.0001 *** | 450 | < 0.0001 *** | 27 | < 0.0001 *** |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 35 | 20 | 2.49 | 34 | 0.0089 ** | 167 | 0.0096 ** | 15 | 0.0207 * |
| SC9b | TRAILLAST | 35 | 19 | 2.23 | 34 | 0.0162 ** | 153 | 0.0090 ** | 15 | 0.0096 ** |
| SC9c | TRAILSUM | 35 | 20 | 2.33 | 34 | 0.0128 ** | 181 | 0.0016 *** | 15 | 0.0207 * |

(a) PATCHDUR only had a total of 31 Web sites (versus the 32 sites in PATCHES) because there were not any sessions which visited discovered goal patches at one Web site. All five discovered goal patches at the site of interest contained a page that was no longer available to sessions within the testing set. More specifically, the training set consisted of sessions which existed on or before 05/23/2008 8:29:38 PM. The page in question was last visited by any session on 03/20/2008 8:13:05 PM.

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).


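Table 91: Site-centric: Results (Supported – 0.50)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 24 | 9 | 3.40 | 23 | 0.0012 *** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC5b | PATCHLAST | 24 | 9 | 3.36 | 23 | 0.0014 *** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC5c | PATCHSUM | 24 | 9 | 2.84 | 23 | 0.0047 ** | 45 | 0.0020 *** | 9 | 0.0020 *** |
| SC6 | PATCHDUR | 24 | 23 | 2.80 | 23 | 0.0050 *** | 227 | 0.0027 *** | 18 | 0.0053 *** |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 24 | 9 | 2.45 | 23 | 0.0112 ** | 41 | 0.0137 ** | 8 | 0.0195 * |
| SC9b | TRAILLAST | 24 | 9 | 2.12 | 23 | 0.0226 * | 39 | 0.0273 * | 8 | 0.0195 * |
| SC9c | TRAILSUM | 24 | 9 | 2.29 | 23 | 0.0159 ** | 42 | 0.0098 ** | 8 | 0.0195 * |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).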


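Table 92: Site-centric: Results (Supported – 0.75)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 10 | 6 | 3.50 | 9 | 0.0034 ** | 21 | 0.0156 ** | 6 | 0.0156 ** |
| SC5b | PATCHLAST | 10 | 6 | 3.61 | 9 | 0.0028 *** | 21 | 0.0156 ** | 6 | 0.0156 ** |
| SC5c | PATCHSUM | 10 | 6 | 2.70 | 9 | 0.0123 ** | 21 | 0.0156 ** | 6 | 0.0156 ** |
| SC6 | PATCHDUR | 10 | 10 | 2.62 | 9 | 0.0138 ** | 50 | 0.0098 *** | 9 | 0.0107 ** |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 10 | 5 | 2.83 | 9 | 0.0099 ** | 15 | 0.0313 * | 5 | 0.0313 * |
| SC9b | TRAILLAST | 10 | 5 | 2.80 | 9 | 0.0104 ** | 15 | 0.0313 * | 5 | 0.0313 * |
| SC9c | TRAILSUM | 10 | 5 | 2.58 | 9 | 0.0148 ** | 15 | 0.0313 * | 5 | 0.0313 * |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).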


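Table 93: Site-centric: Results (Supported – 1.00)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 2 | 1 | 1.00 | 1 | 0.2500 | 1 | 0.5000 | 1 | 0.5000 |
| SC5b | PATCHLAST | 2 | 1 | 1.00 | 1 | 0.2500 | 1 | 0.5000 | 1 | 0.5000 |
| SC5c | PATCHSUM | 2 | 1 | 1.00 | 1 | 0.2500 | 1 | 0.5000 | 1 | 0.5000 |
| SC6 | PATCHDUR | 2 | 2 | 10.13 | 1 | 0.0313 ** | 3 | 0.2500 | 2 | 0.2500 |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | 5 | 1 | 1.00 | 4 | 0.1870 | 1 | 0.5000 | 1 | 0.5000 |
| SC9b | TRAILLAST | 5 | 1 | 1.00 | 4 | 0.1870 | 1 | 0.5000 | 1 | 0.5000 |
| SC9c | TRAILSUM | 5 | 1 | 1.00 | 4 | 0.1870 | 1 | 0.5000 | 1 | 0.5000 |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).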


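Table 94: Site-centric: Results (Supported – 1.25)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | 1 | 1 | n/a | n/a | n/a | 1 | 0.5000 | 1 | 0.5000 |
| SC5b | PATCHLAST | 1 | 1 | n/a | n/a | n/a | 1 | 0.5000 | 1 | 0.5000 |
| SC5c | PATCHSUM | 1 | 1 | n/a | n/a | n/a | 1 | 0.5000 | 1 | 0.5000 |
| SC6 | PATCHDUR | 1 | 1 | n/a | n/a | n/a | 1 | 0.5000 | 1 | 0.5000 |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC9b | TRAILLAST | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC9c | TRAILSUM | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).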


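Table 95: Site-centric: Results (Supported – 1.50)

| Hyp. | Metric | N Total | N No Zeros | t | df | t-test p-Value | Wilcoxon V | Wilcoxon p-Value | Sign Test S | Sign Test p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| INFORMATION PATCH – PAGE-PATCH | | | | | | | | | | |
| SC5a | PATCHMAX | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC5b | PATCHLAST | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC5c | PATCHSUM | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC6 | PATCHDUR | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| RELAXED INFORMATION SCENT | | | | | | | | | | |
| SC9a | TRAILMAX | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC9b | TRAILLAST | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| SC9c | TRAILSUM | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Significance: * α = 0.10; ** α = 0.05; *** α = 0.01. Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).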


[Figure 62: Site-centric: Trail and Patch p-values by Significance/Support Levels. Seven panels, each plotting the p-values of the t-test, Wilcoxon test, and sign test against the significance / support levels 0.01, 0.05, 0.25, 0.50, and 0.75: (a) PATCHMAX; (b) PATCHLAST; (c) PATCHSUM; (d) PATCHDUR; (e) TRAILMAX; (f) TRAILLAST; (g) TRAILSUM.]


7.3 Conclusion
This chapter provided the results of both the user- and site-centric models of information foraging. Descriptive statistics, assumption checks of statistical tests, and results from each model's hypotheses were provided. In addition, a sensitivity analysis was done on the seven hypotheses of the site-centric model that relied on mining patches and trails. Overall, three of the four user-centric hypotheses were supported at α = 0.01. Of the 13 site-centric hypotheses and sub-hypotheses, seven were supported at α = 0.01, four at α = 0.05, one at α = 0.10, and one at α = 0.01 in the opposite direction as expected.


Chapter 8
Temporal Aspects of Information Foraging

The site-centric clickstream model of information foraging made an implicit assumption that the structure of the Web sites being examined did not change over the course of the analysis. Thus, it was expected that browsing patterns of goal and non-goal sessions would be roughly constant over time, for both the calculated measures (e.g., duration, number of pages viewed) and learned patches and trails. However, the Web is a dynamic and evolving environment (Warren et al., 1999; Chi et al., 1998) where Web sites add, modify, and remove content (Pitkow and Pirolli, 1997) on a regular basis [1]. In addition, like traditional software, Web sites may also undergo structural maintenance to improve the quality of the browsing experience for visitors (Ricca and Tonella, 2001).

As Web sites can be dynamic, assuming a static representation may not be appropriate when testing the site-centric model [2]. Therefore, this chapter presents a second test of the site-centric hypotheses using temporal aspects to determine if time makes a difference in the results. Instead of comparing browsing behavior to an absolute point of zero, all behavior was compared relative to prior goal sessions at the site of interest. Thus, if content or structural changes occurred, they would be reflected in the relative value of the current session.

Methodologically, relative measures were determined by progressively calculating sessions in order of their session start time. Thus, the currently processed session would be compared relative to all goal sessions that occurred before it. Although the comparison was relatively simple in definition (i.e., all prior goal sessions were used as opposed to a sliding window), the computational complexity of the methodology was still much higher than for the static site-centric version. Therefore, the results of this chapter may also shed light on the value of undertaking the extra complexity of this methodology.

[1] For example, Ricca and Tonella (2000) analyzed 15 Web sites over a three month period and found that each Web site had, on average, 3.4 significant structural changes within that time frame.
[2] The same concern over static Web sites was not an issue in the tested portions of the user-centric model. Comparisons were made relative to browsing behavior at other Web sites within the limited time of the user's session. Thus, the only expectation was that the Web site would remain static while the user was on the site of interest.


The first section of this chapter details the methodology used to test the temporal version of the site-centric model. In particular, the data used to test the model is explained first, followed by the algorithm used to progressively calculate measures, and then finally the formulas used to determine the relative value for each session's measures. The second section presents the results of the temporal version and compares them against the results of the static version of the model. Finally, the conclusion summarizes the usefulness of the temporal methodology given the results obtained.

8.1 Methodology
In the first subsection below, the data elements available in the dataset are described [3]. The second subsection details the progressive manner in which the dataset was processed. The processing was done in order to create measures that were relative to prior sessions. Finally, the last subsection illustrates the equations used to calculate each of the measures for the temporal version of the site-centric model.

8.1.1 Dataset Sample
The data used in the static and temporal versions of the site-centric model were exactly the same. The data contained a set of n sessions S ($S_0, S_1, \ldots, S_{n-1}$), where $S_i$ represents a single session. Each session ($S_i$) contained a set of m page information tuples P ($P_{i0}, P_{i1}, \ldots, P_{i,m-1}$), where $P_{ij}$ represents information about a particular page viewed during a session. Each page information tuple was made up of seven pieces of information: a unique identifier for the session, Web site, referring domain, and page viewed; date and time the page was viewed; how much time was spent on the page; and if the page represented a contact goal being achieved.

The calculation of metrics for each session was only done on those parts of a session occurring before the achievement of a contact goal [4]. This truncation was done because the problem being investigated was the prediction of goal achievement during the remainder of a session. Thus, prediction was done from a point right before a form submission occurred, i.e., P only contained pages which occurred before the contact form was submitted for the contact goal of interest.

[3] Summary statistics about the site-centric dataset can be found in chapter 6.
[4] If a session did not submit a contact form then the entire session was used.
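As a minimal sketch of the data layout and truncation rule described in this subsection, the seven pieces of information can be held in a small record type and a session cut at its first contact-goal page. The field names, the `truncate_before_goal` helper, and the toy values are all assumptions made for illustration; the source only lists what each tuple contains, and the sketch assumes the goal page itself is excluded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class PageView:
    """One page information tuple (field names are illustrative)."""
    session_id: str
    site_id: str
    referring_domain: str
    page_id: str
    viewed_at: datetime
    seconds_on_page: float
    is_contact_goal: bool


def truncate_before_goal(pages: List[PageView]) -> List[PageView]:
    """Keep only the page views that precede the first contact-goal page.

    A session that never reaches the goal is returned in full.
    """
    for index, page in enumerate(pages):
        if page.is_contact_goal:
            return list(pages[:index])
    return list(pages)


def demo_session(page_ids: List[str], goal_page: str) -> List[PageView]:
    """Build a toy session; every value here is made up for illustration."""
    start = datetime(2008, 7, 9, 17, 35, 12)
    return [
        PageView("S2", "site-1", "example.org", page_id,
                 start + timedelta(seconds=30 * position), 30.0,
                 page_id == goal_page)
        for position, page_id in enumerate(page_ids)
    ]


if __name__ == "__main__":
    session = demo_session(["/home", "/services", "/contact-sent", "/home"], "/contact-sent")
    print([page.page_id for page in truncate_before_goal(session)])
```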


8.1.2 Progressive Calculations
The measures in the temporal version of the site-centric model compared the browsing behavior of the current forager against what previous goal-achieving foragers had done [5]. For calculating the measures, previous was defined as any session that started before the current forager's session began. For example, a session which took place a month after data collection began would have had that entire month's worth of goal sessions to compare against. For a session that took place six months after the start of data collection, there would have been even more goal sessions to compare against.

The process used to compare a session against prior sessions is outlined in the processDataset algorithm (figure 63). The algorithm requires two arguments: a set of sessions for a particular Web site and the minimum percentage of goal sessions to bank before the calculation of sessions' measures should begin. The processDataset algorithm operates in six basic steps.

The first step of the processDataset algorithm sorted sessions in ascending order by their start date and time (line 23). The second step (line 25) determined the total number of goal sessions in the entire set of sessions. After setting up the environment in the first two steps, each session was then iterated over (lines 27-45) for the next three steps.

The third step (line 29) determined if the minimum percentage of goal sessions had been added to the set of banked goal sessions. If the minimum percentage had been met, then the measures for the current session were calculated and added to the dataset (line 30). Calculations were performed using all banked goal sessions along with valuable goal patches and trails. The fourth step then added the current session into the appropriate set of banked sessions in lines 34-38. A session was banked regardless of whether measures were calculated for that session or not.
The fifth step handled the mining of goal patches and trails (lines 41-44). Since long tail sites have limited data and an additional goal session may have an impact on the formation of patches or trails, mining was done after each goal session was added to the bank (once the minimum percentage was met) [6]. The final step occurred after all sessions had been processed. In the last step, the set of result data records which contained the calculated measures for each session was returned. Of note, the algorithm did not include in the returned results those sessions that occurred before the minimum percentage of goal sessions was met.

[5] Previous non-goal sessions were also used, but only for learning patches and trails. See §8.1.3 for more details.
[6] Patches and trails were not mined after every non-goal session because there were generally many more non-goal sessions than goal sessions. Thus, the addition of one additional non-goal session, when finding frequent itemsets or sequential patterns, was unlikely to cause drastic differences in patch and trail formation, unlike what may occur with goal sessions. In addition, the computational effort required to mine after every session would be very high at some Web sites (e.g., a site with 40,000 sessions).

     1  /**
     2   * Parameters: (a) Set of user sessions: S = {S_0, S_1, ..., S_(N-1)}
     3   *                 where S_i = set of page information tuples, P.
     4   *             (b) Minimum percentage of goal sessions
     5   *                 to base calculations on: minPercent
     6   * Returns:    Set of result records: R = {R_0, R_1, ..., R_(X-1)}
     7   * Methods:    (a) calculateMeasures(S_i, G, A, T): calculates
     8   *                 all the necessary measures for session S_i using
     9   *                 all goal sessions from set G, goal patches from set A,
    10   *                 and goal trails from set T
    11   *             (b) generatePatches(G, N): returns a set of valuable goal patches
    12   *             (c) generateTrails(G, N): returns a set of valuable goal trails
    13   *             (d) getGoalCount(S): returns number of goal sessions in S
    14   *             (e) isGoal(S_i): true if session achieved goal
    15   *             (f) sort(S): sorts sessions in ascending order by session
    16   *                 start date
    17   */
    18  processDataset(S, minPercent) {
    19    // R = result records; G = banked goal sessions; N = banked non-goal sessions
    20    // A = valuable goal patches; T = valuable goal trails
    21    R = {}; G = {}; N = {}; A = {}; T = {};
    22
    23    sort(S); // sort sessions in ascending order by session start date
    24
    25    goalCount = getGoalCount(S); // determine how many total goal sessions in entire set
    26
    27    foreach (i ∈ S) {
    28      // Only calculate once minimum percentage of goal sessions is met
    29      if (|G| / goalCount >= minPercent) {
    30        R += calculateMeasures(i, G, A, T);
    31      }
    32
    33      // Add session to banked goal or non-goal set
    34      if (isGoal(i)) {
    35        G += i;
    36      } else {
    37        N += i;
    38      }
    39
    40      // Mine patches and trails for each new goal session (if enough goal sessions are banked)
    41      if (isGoal(i) && |G| / goalCount >= minPercent) {
    42        A = generatePatches(G, N);
    43        T = generateTrails(G, N);
    44      }
    45    }
    46    return R;
    47  }

Figure 63: Temporal Site-centric: processDataset Algorithm

Example
Table 96 presents an example of how the algorithm processed a dataset. The table shows the first 11 sessions from the dataset sorted by session start time. Five of the sessions resulted in a goal being achieved. All of the sessions were passed to the algorithm. In addition, the minimum percentage of goal sessions required before calculating measures was set to 80%. Therefore, sessions were not considered part of the result dataset until four goal sessions (80%) were banked.

Table 96: Temporal Site-centric: Example Sessions

| Session | Start Date and Time | Goal Achieved? |
|---|---|---|
| S1 | 7/09/08 11:04:07 | No |
| S2 | 7/09/08 17:35:12 | Yes |
| S3 | 7/11/08 10:10:56 | No |
| S4 | 7/15/08 11:36:18 | Yes |
| S5 | 7/15/08 11:37:08 | Yes |
| S6 | 7/15/08 14:43:23 | No |
| S7 | 7/22/08 12:11:10 | No |
| S8 | 7/23/08 19:44:39 | Yes |
| S9 | 7/23/08 20:23:21 | No |
| S10 | 7/25/08 14:05:09 | Yes |
| S11 | 7/26/08 16:07:25 | No |
| ... | | |

Table 97 illustrates the process using the sessions from table 96. The contents of the result set, goal set, and non-goal set are provided at the end of every iteration of the algorithm (i.e., line 45). Calculations for a session were done before the session was added to either the goal or non-goal set.

After processing the first session the result and goal set remained empty while S1 was added to
the non-goal set. After the eighth session was processed the minimum percentage of goal sessions was met for the goal set. Sessions S2, S4, S5, and S8 were included in the goal set while sessions S1, S3, S6, and S7 were in the non-goal set. Up to the eighth session, no sessions had been added to the result dataset yet (i.e., no calculations had been performed).

Table 97: Temporal Site-centric: Example Dataset Processing

| Step | Result Set | Goal Set | Non-Goal Set |
|---|---|---|---|
| 1 | | | S1 |
| 2 | | S2 | S1 |
| 3 | | S2 | S1; S3 |
| 4 | | S2; S4 | S1; S3 |
| 5 | | S2; S4; S5 | S1; S3 |
| 6 | | S2; S4; S5 | S1; S3; S6 |
| 7 | | S2; S4; S5 | S1; S3; S6; S7 |
| 8 | | S2; S4; S5; S8 | S1; S3; S6; S7 |
| 9 | S9 | S2; S4; S5; S8 | S1; S3; S6; S7; S9 |
| 10 | S9; S10 | S2; S4; S5; S8; S10 | S1; S3; S6; S7; S9 |
| 11 | S9; S10; S11 | S2; S4; S5; S8; S10 | S1; S3; S6; S7; S9; S11 |
| ... | | | |

After the eighth step, however, the minimum percentage of goal sessions had been met. Therefore, all remaining sessions would have their measures calculated and added to the result dataset. Session S9 used the patches and trails mined from the four banked goal (S2, S4, S5, S8) and non-goal sessions (S1, S3, S6, S7), along with just the banked goal sessions to calculate its relative measures. For session S10, the previous session (S9) was added to the non-goal set, but the patches and trails were not re-mined. For the final session, new patches and trails were mined, because S10 was a goal session. If more than eleven sessions existed, then this progressive manner of mining patches and trails and calculating measures would have continued until the final session was processed.

In this research the processDataset algorithm was run with the minimum percentage of goal sessions set to 70%. Thus, measures were only calculated when at least 70% of all goal sessions were banked.
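The worked example can be replayed with a small Python sketch of the processDataset loop. This is an illustrative rendering under stated assumptions, not the dissertation's implementation: the `Session` record, the callables passed in for measure calculation and mining, and the placeholder helpers `record` and `mine` are hypothetical stand-ins, and the guard for a dataset without goal sessions is added only to keep the sketch safe to run.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Session:
    """Minimal stand-in for a session; real sessions carry page tuples."""
    sid: str
    start: datetime
    is_goal: bool


def process_dataset(sessions: Sequence[Session], min_percent: float,
                    calculate_measures: Callable, generate_patches: Callable,
                    generate_trails: Callable) -> list:
    results, goals, non_goals = [], [], []
    patches, trails = [], []
    ordered = sorted(sessions, key=lambda s: s.start)   # step 1: sort by start time
    goal_count = sum(s.is_goal for s in ordered)        # step 2: count goal sessions
    if goal_count == 0:                                 # guard added for this sketch
        return results
    for session in ordered:
        # Step 3: calculate only once enough goal sessions are banked.
        if len(goals) / goal_count >= min_percent:
            results.append(calculate_measures(session, goals, patches, trails))
        # Step 4: bank the session whether or not it was measured.
        (goals if session.is_goal else non_goals).append(session)
        # Step 5: re-mine after each newly banked goal session.
        if session.is_goal and len(goals) / goal_count >= min_percent:
            patches = generate_patches(goals, non_goals)
            trails = generate_trails(goals, non_goals)
    return results                                      # step 6: return the records


def example_sessions() -> List[Session]:
    """The eleven sessions of table 96 (month, day, hour, minute, second in 2008)."""
    rows = [("S1", (7, 9, 11, 4, 7), False), ("S2", (7, 9, 17, 35, 12), True),
            ("S3", (7, 11, 10, 10, 56), False), ("S4", (7, 15, 11, 36, 18), True),
            ("S5", (7, 15, 11, 37, 8), True), ("S6", (7, 15, 14, 43, 23), False),
            ("S7", (7, 22, 12, 11, 10), False), ("S8", (7, 23, 19, 44, 39), True),
            ("S9", (7, 23, 20, 23, 21), False), ("S10", (7, 25, 14, 5, 9), True),
            ("S11", (7, 26, 16, 7, 25), False)]
    return [Session(sid, datetime(2008, *when), goal) for sid, when, goal in rows]


def record(session, goals, patches, trails):
    """Placeholder 'measures': which goal sessions and mined sets were available."""
    return (session.sid, [g.sid for g in goals], patches)


def mine(goals, non_goals):
    """Placeholder miner: reports only how many sessions it was given."""
    return (len(goals), len(non_goals))


if __name__ == "__main__":
    for row in process_dataset(example_sessions(), 0.80, record, mine, mine):
        print(row)
```

With the 80% threshold, the sketch returns records for S9, S10, and S11 only, and S10 is measured against the same mined sets as S9, matching table 97.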


8.1.3 Metrics
Table 98 summarizes the metrics used to test the temporally positioned hypotheses for the site-centric clickstream model (TSC). The name of each metric along with a description of how it was calculated is provided. In addition, the hypothesis which corresponds to the metric is also provided in the table. A more in-depth description of the metrics is given in the following subsections.

Table 98 does not contain the RETURN and VISITED metrics (hypotheses SC3 and SC4) because they were calculated at a Web site as opposed to an individual level of analysis. The temporal version of the model examines relative behavior of a user versus previous sessions. Therefore, measures at a higher level of analysis were not analyzed.

To help clarify the notation being used below for the metrics, C represents the current session being analyzed, G is the set of banked past goal sessions that C will be compared against, and median() is a function that returns the median from a set of values.

Information Patch – Site-Patch
RELDUR is the total duration in seconds a visitor has spent at a Web site relative to the median time prior goal sessions have spent at the same Web site. The relative duration is calculated from equation 8.1, where duration(i) is the duration spent during session i. To obtain RELDUR, the median duration of all banked goal sessions in the goal set G is subtracted from the total duration of the current session C.

\[ \mathrm{RELDUR} = \mathit{duration}(C) - \mathit{median}\bigl(\text{foreach } i \in G\ [\mathit{duration}(i)]\bigr) \tag{8.1} \]

RELPGS is the number of pages a visitor has viewed at a Web site relative to the median number of pages viewed by prior goal sessions at the same Web site. The relative number of pages is calculated as shown in equation 8.2, where pages(i) is the number of pages viewed during session i. To acquire RELPGS, the median number of pages viewed from all goal sessions in goal set G is subtracted from the number of pages viewed during the current session C.

\[ \mathrm{RELPGS} = \mathit{pages}(C) - \mathit{median}\bigl(\text{foreach } i \in G\ [\mathit{pages}(i)]\bigr) \tag{8.2} \]
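Equations 8.1 and 8.2 share one shape: the current session's value minus the median of the same quantity over the banked goal sessions. A minimal sketch follows; the function name and the sample numbers are invented for illustration, and raising an error for an empty goal set is our choice, since the source only calculates measures once goal sessions have been banked.

```python
from statistics import median
from typing import Sequence


def relative_to_prior_goals(current_value: float,
                            prior_goal_values: Sequence[float]) -> float:
    """Current session's value minus the median over banked goal sessions."""
    if not prior_goal_values:
        raise ValueError("no banked goal sessions to compare against")
    return current_value - median(prior_goal_values)


# RELDUR: current session lasted 120 s; prior goal sessions lasted 60, 90, 200 s.
rel_dur = relative_to_prior_goals(120.0, [60.0, 90.0, 200.0])  # 120 - 90
# RELPGS: current session viewed 4 pages; prior goal sessions viewed 5, 7, 9, 11.
rel_pgs = relative_to_prior_goals(4, [5, 7, 9, 11])            # 4 - 8.0

if __name__ == "__main__":
    print(rel_dur, rel_pgs)
```

A positive value means the current session has already exceeded what a typical prior goal session did at the site; a negative value means it has fallen short.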

PAGE 274

Table 98: Temporal Site-centric: Model Metrics

Hypothesis #  Metric      Description

INFORMATION PATCH – SITE-PATCH
TSC1          RELDUR      Duration in seconds spent on a Web site relative to past goal sessions.
TSC2          RELPGS      Number of pages viewed on a Web site relative to past goal sessions.

INFORMATION PATCH – PAGE-PATCH
TSC5a         RELPTCMAX   Maximum value of any goal page-patch visited relative to past goal sessions.
TSC5b         RELPTCLAST  Value of last goal page-patch visited relative to past goal sessions.
TSC5c         RELPTCSUM   Total value of all goal page-patches visited relative to past goal sessions.
TSC6          RELPTCDUR   Median duration in seconds spent in all goal page-patches relative to past goal sessions.

STRICT INFORMATION SCENT
TSC7          RELUNQ      Percentage of unique pages viewed relative to past goal sessions.
TSC8          RELLNR      Linearity of clickstream relative to past goal sessions.

RELAXED INFORMATION SCENT
TSC9a         RELTRLMAX   Maximum value of any goal trail followed relative to past goal sessions.
TSC9b         RELTRLLAST  Value of last goal trail followed relative to past goal sessions.
TSC9c         RELTRLSUM   Total value of all goal trails followed relative to past goal sessions.

OTHER
n/a           GOAL        Whether a goal occurred during the session.

PAGE 275

Information Patch – Page-Patch

Patches at a Web site must already be known in order to calculate the four RELPTC visitation metrics: RELPTCMAX, RELPTCLAST, RELPTCSUM, and RELPTCDUR. The methodology for learning patches is described in detail in appendix 5.B. In general, learning patches requires a set of goal and non-goal sessions to determine which parts of a Web site (i.e., pages) are better able to distinguish between the two groups. Patches are specific to a single Web site.

As the four RELPTC metrics require patches to be learned first in order to quantify a session's patch visitation, the banked goal and non-goal sessions (G and N) were used to discover goal patches at a Web site. The current session then calculated the RELPTC metrics from the learned goal patches. However, the current session would calculate the RELPTC metrics if and only if goal patches were found at the Web site. In addition, the RELPTCDUR metric would be calculated for the current session if and only if that session visited at least one of the goal patches discovered at the Web site of interest. Furthermore, since the measures for the temporal site-centric model are all relative to prior goal sessions, the same goal sessions used to learn the patches also calculated the RELPTC metrics for their own respective sessions so that relative comparisons could be made.

Learning Patches

Patches were learned for a Web site using the training dataset (R), which consisted of banked goal (G) and non-goal (N) sessions, according to the methodology outlined in appendix 5.B. Patches were learned at an α level of 0.05⁷. Specifically, a set of n valuable patches A (A_0, A_1, ..., A_{n-1}) was discovered, where A_i represents a single valuable patch. A_i consists of a set of m unordered and distinct pages U (U_0, U_1, ..., U_{m-1}).

Each patch (A_i) was also given a value according to equation 8.3 (Yang and Padmanabhan, 2003). S_{G_i} and S_{N_i} represent the number of goal and non-goal sessions from the training dataset that visited patch A_i, respectively. R_G and R_N are the total numbers of goal and non-goal sessions from the training dataset. The value of patch A_i could range from zero to two, with higher numbers representing a greater difference in support of the patch in distinguishing between goal and non-goal sessions (i.e., being more valuable).

\[ \mathrm{value}(A_i) = \frac{\left| \dfrac{S_{G_i}}{R_G} - \dfrac{S_{N_i}}{R_N} \right|}{\dfrac{1}{2}\left( \dfrac{S_{G_i}}{R_G} + \dfrac{S_{N_i}}{R_N} \right)} \tag{8.3} \]

Calculating RELPTC Metrics

To calculate the RELPTC metrics for a given session, two steps were required. First, it was determined what patches the session visited from the set of valuable patches (A). Each session had a set of l visited patches V (V_0, V_1, ..., V_{l-1}), where V_j was an individual patch visited by the current session. A session was considered to have visited a patch if all pages of the patch (U) were visited at least once (in any order) by the current session (as determined by the set of pages P from the session). Formally, A_i was added to V if U ⊆ P. Once it was known what patches were visited, the four measures were calculated.

PATCHMAX is the value of the most valuable patch visited by the current user. The maximum value is determined by iterating over every visited patch to find the one with the highest value (equation 8.4). If the user did not visit any patches then the value of PATCHMAX would be zero.

\[ \mathrm{PATCHMAX} = \begin{cases} \max\bigl(\text{foreach } j \in V\ (\mathrm{value}(V_j))\bigr) & \text{if } \lVert V \rVert > 0 \\ 0 & \text{else} \end{cases} \tag{8.4} \]

RELPTCMAX was calculated as shown in equation 8.5. The median PATCHMAX value of all banked goal sessions in the goal set G was subtracted from the current session's (C) value of PATCHMAX in order to calculate RELPTCMAX.

\[ \mathrm{RELPTCMAX} = \mathrm{PATCHMAX}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{PATCHMAX}(i)]\bigr) \tag{8.5} \]

PATCHLAST is the value of the last patch visited by the user⁸. Equation 8.6 illustrates how RELPTCLAST was calculated. The median PATCHLAST value of all banked goal sessions from the goal set G is subtracted from the current session's (C) value of PATCHLAST to arrive at RELPTCLAST.

⁷ A more in-depth description of learning patches may be found in §5.2.2.

⁸ Details on the four-step heuristic used to determine which patch was visited last during a user's sessions may be found in §5.2.2.
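The patch value, the visitation rule (U ⊆ P), and PATCHMAX can be sketched together in Python. This is a rough illustration, not the author's code: it assumes equation 8.3 takes the absolute difference in support (which is consistent with the stated range of zero to two), and the function names, page paths, and session counts are all hypothetical.

```python
def patch_value(s_goal, s_non_goal, r_goal, r_non_goal):
    """Difference in support between groups divided by their mean support.

    Assumes the absolute difference is used, so the result lies in [0, 2].
    """
    support_goal = s_goal / r_goal
    support_non_goal = s_non_goal / r_non_goal
    mean_support = 0.5 * (support_goal + support_non_goal)
    if mean_support == 0:
        return 0.0  # assumption: a patch nobody visited carries no value
    return abs(support_goal - support_non_goal) / mean_support


def visited_patches(patches, session_pages):
    """Names of patches whose pages were all viewed, in any order."""
    viewed = set(session_pages)
    return [name for name, pages in patches.items() if set(pages) <= viewed]


def patch_max(patches, values, session_pages):
    """Value of the most valuable visited patch, or zero if none was visited."""
    visited = visited_patches(patches, session_pages)
    return max((values[name] for name in visited), default=0.0)


# Hypothetical site with two learned patches and 50 goal / 1,000 non-goal sessions.
patches = {"A0": {"/home", "/pricing"}, "A1": {"/contact"}}
values = {
    "A0": patch_value(30, 100, 50, 1000),
    "A1": patch_value(50, 0, 50, 1000),
}
session = ["/home", "/about", "/pricing", "/home"]
print(visited_patches(patches, session))
print(patch_max(patches, values, session))
```

Using a set for the session's pages makes repeat visits irrelevant, which matches the "at least once, in any order" rule.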

PAGE 277

\[ \mathrm{RELPTCLAST} = \mathrm{PATCHLAST}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{PATCHLAST}(i)]\bigr) \tag{8.6} \]

PATCHSUM adds up the value of every patch visited by the current user (equation 8.7). A value of zero is given to any user that did not visit any patches.

\[ \mathrm{PATCHSUM} = \begin{cases} \sum_{j \in V} \mathrm{value}(V_j) & \text{if } \lVert V \rVert > 0 \\ 0 & \text{else} \end{cases} \tag{8.7} \]

Equation 8.8 illustrates how RELPTCSUM was calculated. The metric was determined by subtracting the median PATCHSUM value of all banked goal sessions in the goal set G from PATCHSUM for the current session C.

\[ \mathrm{RELPTCSUM} = \mathrm{PATCHSUM}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{PATCHSUM}(i)]\bigr) \tag{8.8} \]

PATCHDUR is the median duration a user spent in all their visited patches. Only sessions which visited at least one patch (i.e., ‖V‖ > 0) would have a value for PATCHDUR. The calculation for PATCHDUR is shown in equation 8.9. totalTime(k, P) returns the total time a session with pages P spent on page k. If a session visited page k more than once in P, then the summed duration from all visits to page k was returned.

\[ \mathrm{PATCHDUR} = \operatorname{median}\Bigl(\text{foreach } j \in V\ \Bigl[\sum_{k \in V_j} \mathrm{totalTime}(k, P)\Bigr]\Bigr) \tag{8.9} \]

The manner in which RELPTCDUR was calculated is shown in equation 8.10. To obtain RELPTCDUR, the median PATCHDUR value of all banked goal sessions in the goal set G was subtracted from the current session's (C) value of PATCHDUR.

\[ \mathrm{RELPTCDUR} = \mathrm{PATCHDUR}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{PATCHDUR}(i)]\bigr) \tag{8.10} \]

Strict Information Scent

UNIQUE is the percentage of unique pages viewed during a session. The percentage of unique pages viewed for the current visitor is calculated according to equation 8.11, where distinct(P) is

PAGE 278

the number of distinct pages viewed in the set of page information tuples P.

\[ \mathrm{UNIQUE} = \frac{\mathrm{distinct}(P)}{\lVert P \rVert} \times 100 \tag{8.11} \]

The relative percentage of unique pages, RELUNQ, is determined by subtracting the median UNIQUE value of all banked goal sessions (G) from the value of the current session's UNIQUE (equation 8.12).

\[ \mathrm{RELUNQ} = \mathrm{UNIQUE}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{UNIQUE}(i)]\bigr) \tag{8.12} \]

LINEAR is the complexity of a session as calculated via the stratum measure. Complexity is determined via the straightness (i.e., absence of visiting pages repeatedly) of a user's browsing behavior, where higher linearity equates to less complexity. Stratum is a measure of linearity from graph theory (McEneaney, 2001) and details on its calculation may be found in appendix 5.A.

RELLNR was calculated according to equation 8.13, where the median LINEAR value from the banked goal set (G) was subtracted from the current session's value of LINEAR.

\[ \mathrm{RELLNR} = \mathrm{LINEAR}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{LINEAR}(i)]\bigr) \tag{8.13} \]

Relaxed Information Scent

The three RELTRL metrics for the relaxed information scent were calculated in a very similar manner as the RELPTC metrics. The same training set used to discover patches was used to learn trails. Both the current session and the goal sessions from the training set then used those learned trails to calculate their values for the three RELTRL metrics.

Specifically, a set of n valuable trails T (T_0, T_1, ..., T_{n-1}) was discovered from the training set, where T_i represents a single valuable trail. T_i consists of a set of m ordered pages O (O_0, O_1, ..., O_{m-1}), where the pages may repeat themselves in the ordered set (e.g., ⟨A, B, B, A, C⟩). Once discovered, trails were given a value like patches using equation 8.3 (with T_i being used instead of A_i).

Once the trails were discovered, each session required two steps to calculate the RELTRL measures. First, it was determined what trails were followed by the session of interest from the set of

PAGE 279

valuable trails (T). Each session had a set of l followed trails F (F_0, F_1, ..., F_{l-1}), where F_j was an individual trail followed by the current session. A session was considered to have followed a trail if all pages of the trail (O) were followed in order by the current session (as determined by the set of pages P from the session). Although all pages must have been followed in order, repeat visitation and gaps between pages were allowed (i.e., other pages may be visited in between pages from the trail). More specifically, T_i was added to F if O ⊆ P and the pages of O were found in the same order in P. Once it was known what trails were followed, the three measures were calculated.

TRAILMAX is the value of the most valuable trail followed by the current user. The maximum value is determined by iterating over every followed trail to find the one with the highest value (equation 8.14). If the user did not follow any trails then the value of TRAILMAX would be zero.

\[ \mathrm{TRAILMAX} = \begin{cases} \max\bigl(\text{foreach } j \in F\ (\mathrm{value}(F_j))\bigr) & \text{if } \lVert F \rVert > 0 \\ 0 & \text{else} \end{cases} \tag{8.14} \]

RELTRLMAX was calculated as shown in equation 8.15, where the median TRAILMAX value of all banked goal sessions in the goal set G was subtracted from the current session's (C) value of TRAILMAX.

\[ \mathrm{RELTRLMAX} = \mathrm{TRAILMAX}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{TRAILMAX}(i)]\bigr) \tag{8.15} \]

TRAILLAST is the value of the last trail followed by the user⁹. Equation 8.16 illustrates how RELTRLLAST was calculated. The median TRAILLAST value of all banked goal sessions from the goal set G is subtracted from the current session's (C) value of TRAILLAST to arrive at RELTRLLAST.

\[ \mathrm{RELTRLLAST} = \mathrm{TRAILLAST}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{TRAILLAST}(i)]\bigr) \tag{8.16} \]

TRAILSUM adds up the value of every trail followed by the current user (equation 8.17). A value of zero is given to any user that did not follow any trails.

⁹ Details on the four-step heuristic used to determine which trail was followed last during a user's sessions may be found in §5.2.2.
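The trail-following rule (pages in order, with gaps and repeat visits allowed) amounts to an ordered-subsequence check. The Python sketch below illustrates it under that reading; the function names, trail names, and values are hypothetical and not taken from the dissertation.

```python
def follows_trail(trail, session_pages):
    """True if the trail's pages occur in order within the session (gaps allowed)."""
    remaining = iter(session_pages)
    # `page in remaining` consumes the iterator up to the first match,
    # so each later trail page has to appear after the earlier ones.
    return all(page in remaining for page in trail)


def followed_trails(trails, session_pages):
    """Names of the trails the session followed."""
    return [name for name, trail in trails.items()
            if follows_trail(trail, session_pages)]


def trail_max(trails, values, session_pages):
    """Value of the most valuable followed trail, or zero if none was followed."""
    followed = followed_trails(trails, session_pages)
    return max((values[name] for name in followed), default=0.0)


def trail_sum(trails, values, session_pages):
    """Total value of all followed trails (zero if none was followed)."""
    return sum(values[name] for name in followed_trails(trails, session_pages))


# Hypothetical trails and values for one site.
trails = {"T0": ["A", "B", "B", "A", "C"], "T1": ["B", "A", "B"], "T2": ["A", "C"]}
values = {"T0": 1.2, "T1": 0.8, "T2": 0.5}
session = ["A", "X", "B", "B", "A", "C"]
print(followed_trails(trails, session))
print(trail_max(trails, values, session), trail_sum(trails, values, session))
```

Here T1 is not followed because no second B occurs after the final A, even though every page of T1 appears somewhere in the session.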

PAGE 280

\[ \mathrm{TRAILSUM} = \begin{cases} \sum_{j \in F} \mathrm{value}(F_j) & \text{if } \lVert F \rVert > 0 \\ 0 & \text{else} \end{cases} \tag{8.17} \]

Equation 8.18 illustrates how RELTRLSUM was calculated. The metric was determined by subtracting the median TRAILSUM value of all banked goal sessions in the goal set G from TRAILSUM for the current session C.

\[ \mathrm{RELTRLSUM} = \mathrm{TRAILSUM}(C) - \operatorname{median}\bigl(\text{foreach } i \in G\ [\mathrm{TRAILSUM}(i)]\bigr) \tag{8.18} \]

Other

The mutually exclusive, binomially distributed metric GOAL specifies whether at some point during the remainder of a session a contact form was submitted for the contact goal of interest. If a goal will be achieved during the session, GOAL will have the value of true. Otherwise, GOAL will have a value of false.

8.2 Results

The temporal site-centric model consisted of seven hypotheses about information scent and trails. Descriptive statistics of the dataset and each measure are provided in the first subsection below. The results for each of the seven hypotheses are then provided in the next subsection.

8.2.1 Descriptive Statistics

Table 99 presents the mean, standard deviation, median, minimum, and maximum number of sessions per Web site in three categories: all, goal, and non-goal sessions. Statistics for the entire dataset are shown first, followed by the number of sessions initially used in the training set. The training set first contained all sessions occurring before the first 70% of goal sessions. However, since measures were calculated in a progressive manner, the training set increased in size after each processed session.

The training set (or set of banked sessions) was used to calculate the measures for each session after the minimum percent of goal sessions was reached. A total of 3,744.24 sessions (70.35%)

PAGE 281

Table 99: Temporal Site-centric: Sessions by Site

                           Mean   St. Dev.     Median   Min     Max
ENTIRE DATASET
  All                  5,322.60   7,473.76   2,637.00   245  44,405
  Goal                   105.94      90.13      79.00    51     587
  Non-goal             5,216.66   7,427.53   2,566.00   192  44,111
MINIMUM TRAINING SET
  All                  3,744.23   5,418.42   1,696.00   168  31,730
  Goal                    74.28      63.07      56.00    36     411
  Non-goal             3,669.96   5,386.00   1,656.00   130  31,525

per Web site, on average, had their measures calculated in a progressive manner from prior goal sessions. New patches and trails were learned on each Web site over thirty different times (30.66). Each additional mining procedure also meant that all previous goal sessions had to recalculate their RELPTC and RELTRL measures against the new patches and trails.

Table 100 displays the mean, standard deviation, median, minimum, and maximum values for each of the four measures that did not require mining of patches and trails. The statistics are broken down into three groups of sessions: all, goal, and non-goal. The same 47 Web sites used in the site-centric version were also used in the temporal version.

The average relative duration of all users was 2.10 fewer minutes on a site than previous goal sessions. Goal sessions spent 0.27 more minutes than past goal sessions on a site, while non-goal sessions spent 4.46 fewer minutes. A pattern similar to the relative duration of time between the three groups was also seen for the relative number of pages. Amongst all foragers, 0.15 fewer pages were viewed on average compared to prior goal sessions. Goal sessions viewed relatively more pages than non-goal sessions did (0.12 versus −0.41) when compared to prior goal sessions.

All three groups viewed a lower percentage of unique pages, on average, than past goal sessions: −11.67% for all, −2.33% for goal, and −21.02% for non-goal. Although the average was negative for goal sessions, the median value shows goal sessions had exactly the same percentage of unique pages viewed as past sessions (i.e., 0.00%)¹⁰.

¹⁰ The negative relative value for percentage of unique pages may have also been a symptom of the evolution of Web sites. For example, information on a Web site may have been consolidated to only a few pages which caused foragers to

PAGE 282

Table 100: Temporal Site-centric: Metric Statistics

                          N      Mean  St. Dev.    Median       Min      Max
INFORMATION PATCH – SITE-PATCH
RELDUR (in minutes)
  All                    47     −2.10      3.07     −1.75    −11.98     4.45
  Goal                           0.27      1.58      0.23     −3.54     4.45
  Non-goal                      −4.46      2.27     −4.43    −11.98     0.18
RELPGS
  All                    47     −0.15      1.25      0.00     −4.00     2.00
  Goal                           0.12      1.13      0.00     −3.00     2.00
  Non-goal                      −0.41      1.33      0.00     −4.00     2.00
STRICT INFORMATION SCENT
RELUNQ
  All                    47   −11.67%    15.72%    −8.33%   −50.00%   20.00%
  Goal                         −2.33%    11.03%     0.00%   −33.33%   20.00%
  Non-goal                    −21.02%    14.12%   −20.00%   −50.00%    8.93%
RELLNR
  All                    47     −0.14      0.31      0.00     −1.00     0.46
  Goal                           0.00      0.09      0.00     −0.17     0.46
  Non-goal                      −0.28      0.38      0.00     −1.00     0.23

Note: all values are based on the median values from each Web site's goal and non-goal sessions.

PAGE 283

Relative clickstream linearity followed the same basic pattern as both the relative duration and number of pages viewed: values for goal sessions were positive while they were negative for non-goal sessions. On average, goal sessions had exactly the same value of clickstream linearity (0.00) as the previous goal sessions. Non-goal sessions had more than a quarter-of-a-point lower value (−0.28) for clickstream linearity than past goal sessions.

Table 101 lists the mean, standard deviation, median, minimum, and maximum values for the seven measures derived from patches and trails. The patches and trails were learned at the 0.05 significance level from prior goal and non-goal sessions. Each statistic is broken down for all, goal, and non-goal sessions. Each measure also lists the total number of Web sites that found patches or trails at any point during the processing procedure. The number of Web sites differs from the site-centric version (17 versus 14 patch Web sites and 15 versus 10 trail sites¹¹) because of the multiple times patches and trails were mined at each Web site. For example, patches may have been found on a Web site when using 80% of goal sessions, but not when only 70% of goal sessions were used.

The first three patch measures (RELPTCMAX, RELPTCLAST, and RELPTCSUM) had average relative patch values of −0.17, −0.15, and −0.82 among all sessions, respectively. Among goal sessions, the relative patch values for RELPTCMAX and RELPTCLAST both had the same positive value (0.02). RELPTCSUM, however, was negative by almost a third of a point (−0.29). All three of the non-goal values were negative: −0.36, −0.31, and −1.36 for RELPTCMAX, RELPTCLAST, and RELPTCSUM, respectively.

Users spent, on average, 8.03 fewer seconds within patches relative to prior goal sessions. Current goal sessions spent 13.95 more seconds in patches relative to past goal sessions, whereas non-goal sessions spent 30.01 fewer seconds in patches.

Unlike the patch visitation measures, the trail following measures had negative values for all three groups of sessions. The mean for RELTRLMAX, RELTRLLAST, and RELTRLSUM was −0.10, −0.09, and −0.22, respectively. All three measures for the goal sessions were also negative, but were close to having the same values as past goal sessions (−0.01 for RELTRLMAX and RELTRLLAST and −0.04 for RELTRLSUM). The non-goal sessions were much further away from zero than the goal sessions, with values ranging from −0.16 to −0.40.

switch back and forth between the pages.

¹¹ See table 58 in §7.2.1 for statistics on the site-centric version of the model.

PAGE 284

Table 101: Temporal Site-centric: Metric Statistics (Significant – 0.05)

                          N      Mean  St. Dev.    Median       Min      Max
INFORMATION PATCH – PAGE-PATCH
RELPTCMAX
  All                    17     −0.17      0.36      0.00     −1.30     0.30
  Goal                           0.02      0.08      0.00     −0.10     0.30
  Non-goal                      −0.36      0.43     −0.19     −1.30     0.00
RELPTCLAST
  All                    17     −0.15      0.30      0.00     −1.02     0.30
  Goal                           0.02      0.07      0.00     −0.01     0.30
  Non-goal                      −0.31      0.35     −0.19     −1.02     0.00
RELPTCSUM
  All                    17     −0.82      2.44      0.00    −11.44     1.62
  Goal                          −0.29      2.03      0.00     −7.99     1.62
  Non-goal                      −1.36      2.75     −0.43    −11.44     0.00
RELPTCDUR (in seconds)
  All                    17     −8.03     39.61     −2.75   −118.00   102.75
  Goal                          13.95     33.62      5.88    −36.75   102.75
  Non-goal                     −30.01     32.85    −21.00   −118.00     4.63
RELAXED INFORMATION SCENT
RELTRLMAX
  All                    15     −0.10      0.27      0.00     −0.89     0.35
  Goal                          −0.01      0.17      0.00     −0.51     0.35
  Non-goal                      −0.18      0.33      0.00     −0.89     0.00
RELTRLLAST
  All                    15     −0.09      0.24      0.00     −0.70     0.35
  Goal                          −0.01      0.17      0.00     −0.51     0.35
  Non-goal                      −0.16      0.28      0.00     −0.70     0.00
RELTRLSUM
  All                    15     −0.22      0.92      0.00     −3.97     1.02
  Goal                          −0.04      0.80      0.00     −2.57     1.02
  Non-goal                      −0.40      1.03      0.00     −3.97     0.00

Note: all values are based on the median values from each Web site's goal and non-goal sessions.

PAGE 285

8.2.2 Hypotheses Testing

Tables 102 and 103 present the results for the seven temporally-focused site-centric hypotheses. Table 102 provides results from the four hypotheses whose measures were not dependent on knowledge of mined patches and trails. Table 103 lists the results for the three hypotheses that relied on mined patches and trails.

The first two columns of each table list the hypothesis number and name of the metric being tested. The third and fourth columns list the total number of Web sites and the number of Web sites with a non-zero difference (i.e., D_i ≠ 0), respectively. The total number of Web sites was used in the t-test, while only Web sites with non-zero differences were used for the Wilcoxon and sign tests. Columns five through seven list the t statistic, degrees of freedom (df), and p-value for the t-test. The eighth and ninth columns display the V statistic and p-value for the Wilcoxon test. The final two columns list the S statistic and p-value for the sign test¹².

¹² All three assumptions of the sign test were met. Therefore, the following paragraphs focus on the results from the sign test. Unlike the sign test, some assumptions of the Wilcoxon test (symmetry of the D_i values) and t-test (symmetry and normality of the D_i values) were not believed to have been met. Since the same data was used for both the temporal and non-temporal versions of the model, the same general asymmetrical and non-normal distributions of the D_i values were expected. Thus, while results of the Wilcoxon test and t-test are provided in footnotes, the results of those tests should be interpreted with caution.
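As a rough illustration of the sign test, the Python sketch below computes an exact one-sided binomial tail over the non-tied Web sites. Whether the dissertation used the exact one-sided form is an assumption made here, and the function name is hypothetical. Under that assumption, S = 20 with 30 non-tied sites gives roughly 0.0494, in line with the sign-test p-value listed for TSC2 in Table 102.

```python
from math import comb


def sign_test_p_value(successes, n_non_ties):
    """One-sided exact sign test: P(X >= successes) for X ~ Binomial(n, 0.5)."""
    tail = sum(comb(n_non_ties, k) for k in range(successes, n_non_ties + 1))
    return tail / 2 ** n_non_ties


# S and the number of non-tied sites are taken from the "No Ties" and "S" columns.
p_tsc2 = sign_test_p_value(20, 30)
p_all_ten = sign_test_p_value(10, 10)  # every non-tied site in the expected direction
print(round(p_tsc2, 4), round(p_all_ten, 4))
```

Because ties are dropped before the test, a measure with few non-tied sites can only reach a limited level of significance even when every site agrees in direction.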

PAGE 286

Table 102: Temporal Site-centric: Results

                       N               T-test                  Wilcoxon            Sign Test
Hyp.  Metric   Total  No Ties      t   df  p-Value            V  p-Value           S  p-Value
INFORMATION PATCH – SITE-PATCH
TSC1  RELDUR      47       47  13.87   46  <0.0001 ***    1,128  <0.0001 ***      47  <0.0001 ***
TSC2  RELPGS      47       30   2.22   46   0.0155 **       328   0.0243 **       20   0.0494 **
STRICT INFORMATION SCENT
TSC7  RELUNQ      47       46   9.34   46  <0.0001 ***    1,049  <0.0001 ***      43  <0.0001 ***
TSC8  RELLNR      47       24   5.15   46  <0.0001 ***      295  <0.0001 ***      22  <0.0001 ***

* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01

PAGE 287

Table 103: Temporal Site-centric: Results (Significant – 0.05)

                           N               T-test                Wilcoxon           Sign Test
Hyp.   Metric      Total  No Ties     t   df  p-Value          V  p-Value          S  p-Value
INFORMATION PATCH – PAGE-PATCH
TSC5a  RELPTCMAX      17       10  3.74   16  0.0009 ***      55  0.0010 ***      10  0.0010 ***
TSC5b  RELPTCLAST     17       10  3.95   16  0.0006 ***      55  0.0010 ***      10  0.0010 ***
TSC5c  RELPTCSUM      17       10  3.32   16  0.0022 ***      55  0.0010 ***      10  0.0010 ***
TSC6   RELPTCDUR      17       17  3.54   16  0.0014 ***     142  0.0004 ***      16  0.0001 ***
RELAXED INFORMATION SCENT
TSC9a  RELTRLMAX      15        6  2.13   14  0.0255 *        21  0.0156 **        6  0.0156 **
TSC9b  RELTRLLAST     15        5  2.21   14  0.0219 *        15  0.0313 *         5  0.0313 *
TSC9c  RELTRLSUM      15        6  2.37   14  0.0165 **       21  0.0156 **        6  0.0156 **

* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01
Hypotheses SC5a-c and SC9a-c are each significant at α/3 (e.g., 0.10/3 = 0.0333, 0.05/3 = 0.0167, and 0.01/3 = 0.0033).
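The α/3 rule in the table note can be sketched as a small Python helper that divides each significance level by the number of sub-hypotheses before comparing. This is an illustration only: the function name is hypothetical, and treating the cut-offs as inclusive (p ≤ α/3) is an assumption.

```python
def significance_stars(p_value, n_tests=1):
    """Return '***', '**', '*' or '' after dividing each alpha by n_tests."""
    # Levels are checked from strictest to most lenient.
    for alpha, stars in ((0.01, "***"), (0.05, "**"), (0.10, "*")):
        if p_value <= alpha / n_tests:
            return stars
    return ""


# Sign-test p-values reported for sub-hypotheses, each judged against alpha / 3.
print(significance_stars(0.0010, n_tests=3))  # p-value listed for TSC5a-c
print(significance_stars(0.0156, n_tests=3))  # p-value listed for TSC9a and TSC9c
print(significance_stars(0.0313, n_tests=3))  # p-value listed for TSC9b
```

With three sub-hypotheses the cut-offs become 0.0033, 0.0167, and 0.0333, so a p-value of 0.0313 clears only the most lenient level.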

PAGE 288

TSC1 – RELDUR

The first hypothesis expected that goal-achieving foragers would spend more time on a site, relative to prior goal sessions, than non-goal sessions would spend¹³. The results of the sign test supported hypothesis TSC1 at α = 0.01 (S = 47; p-value < 0.0001)¹⁴. All 47 Web sites had a higher median relative duration amongst goal sessions than non-goal sessions. Relative to prior goal sessions, current goal sessions spent roughly 15 additional seconds on a site, while non-goal sessions spent almost five fewer minutes.

The results for this first hypothesis were identical between the two site-centric versions of the model (S = 47; p-value < 0.0001 for both versions). For each of the versions, all of the tested Web sites had goal sessions with a higher median duration than non-goal sessions. Therefore, the use of duration, in either an absolute or relative manner, appears to be consistently useful in distinguishing between goal and non-goal sessions.

TSC2 – RELPGS

The second hypothesis also examined how foragers judged the value of a Web site, but did so by looking at the relative number of pages viewed. The hypothesis that goal sessions would have a higher relative median number of pages viewed than non-goal sessions was supported at α = 0.05 (S = 20; p-value = 0.0494)¹⁵. Of the 30 non-tied Web sites, 20 had a higher median relative number of pages viewed for goal sessions versus non-goal sessions. On average, goal sessions viewed 0.12 more pages relative to past goal sessions, whereas non-goal sessions viewed 0.41 fewer pages.

The second hypothesis had practically identical results between the two site-centric versions of the model: static (S = 19; p-value = 0.0436) and temporal (S = 20; p-value = 0.0494). Roughly two-thirds of all the non-zero Web sites found a higher median number of pages viewed for goal sessions than non-goal sessions (67.86% of Web sites for static and 66.67% for temporal). This hypothesis also demonstrated that either an absolute or relative manner of determining number of pages viewed was useful in distinguishing between goal and non-goal sessions.

¹³ A more in-depth discussion of each of the hypotheses may be found in §4.2.1 and §7.2.2.

¹⁴ Hypothesis TSC1 was also significant at α = 0.01 for both the t-test (t = 13.87; df = 46; p-value < 0.0001) and Wilcoxon test (V = 1,128; p-value < 0.0001).

¹⁵ Hypothesis TSC2 was also significant at α = 0.05 for both the t-test (t = 2.22; df = 46; p-value = 0.0155) and Wilcoxon test (V = 328; p-value = 0.0243).

PAGE 289

TSC5 – RELPTCMAX, RELPTCLAST, and RELPTCSUM

The three sub-hypotheses of TSC5 (table 103) explored how the visitation of valuable patches could help explain goal achievement. All three sub-hypotheses were significant at α = 0.01 (S = 10; p-value = 0.0010 for all three measures)¹⁶, supporting the hypothesized positive association of relative patch value and goal achievement. All 10 of the non-zero Web sites had goal sessions with higher relative patch visitation values than the non-goal sessions, for all three patch measures.

The results for hypothesis TSC5 were found to be significant at the same α level for both versions of the site-centric model: static (S = 9; p-value = 0.0020 for all three measures) and temporal (S = 10; p-value = 0.0010 for all three measures). Both versions of the model had all non-zero Web sites find a higher median patch value amongst goal rather than non-goal sessions. However, the temporal version had more Web sites find patches than the static version (17 versus 14 total Web sites). Thus, the temporal version was better able to utilize the available, but sparse, amount of data to learn patches. Furthermore, as the structure of a Web site may evolve, the temporal version's use of the most recent data would better reflect the changing nature of a site¹⁷.

TSC6 – RELPTCDUR

Hypothesis TSC6 expected that mere visitation of valuable patches did not wholly indicate a forager obtained value from a patch. Thus, the hypothesis conjectured that relatively higher amounts of time within patches were associated with greater information gain and thus with a greater likelihood of achieving a goal. The results of the sign test supported the hypothesis at α = 0.01 (S = 16; p-value = 0.0001)¹⁸, finding a higher relative duration within patches for goal sessions than non-goal sessions. Of the 17 non-zero Web sites with discovered patches, 16 had goal sessions that spent a higher relative duration of time within patches than non-goal sessions. On average, goal sessions spent almost 14 additional seconds within patches and non-goal sessions spent 30 seconds less.

Like many of the other hypotheses, the results between the two versions of the site-centric model for hypothesis TSC6 were also almost the same: static (S = 13; p-value = 0.0009) and temporal (S

¹⁶ All three sub-hypotheses of hypothesis TSC5 were also significant at α = 0.01 for both the t-test (RELPTCMAX (t = 3.74; df = 16; p-value = 0.0009); RELPTCLAST (t = 3.95; df = 16; p-value = 0.0006); and RELPTCSUM (t = 3.32; df = 16; p-value = 0.0022)) and Wilcoxon test (V = 55; p-value = 0.0010 for all three measures).

¹⁷ For a discussion of the limitations of the current incarnation of the temporal model refer to §9.1.

¹⁸ Hypothesis TSC6 was also significant at α = 0.01 for both the t-test (t = 3.54; df = 16; p-value = 0.0014) and Wilcoxon test (V = 142; p-value = 0.0004).

PAGE 290

= 16; p-value = 0.0001). Each version only had one Web site which found a higher median amount of time spent within patches for non-goal sessions than goal sessions (92.86% of positive Web sites for static and 94.12% for temporal). Therefore, the use of duration within patches appears useful in discriminating between groups of sessions, regardless of the measure being absolute or relative.

TSC7 – RELUNQ

The seventh hypothesis (TSC7) examined information scent in a strict manner where any inefficiency was viewed as a poor indicator of scent. A positive association between the relative proportion of unique pages viewed and goal achievement was expected and supported at α = 0.01 (S = 43; p-value < 0.0001)¹⁹. Of the 46 non-zero Web sites, 43 had goal sessions with a higher relative percentage of unique pages viewed than non-goal sessions. Both goal and non-goal sessions viewed a lower percentage of unique pages than past goal sessions (−2.33% versus −21.02%), but goal sessions still visited a greater proportion of unique pages than the non-goal sessions.

For this hypothesis, both the static and temporal versions of the site-centric model were supported at the same α level: static (S = 42; p-value < 0.0001) and temporal (S = 43; p-value < 0.0001). 95.45% and 93.48% of the non-zero static and temporal Web sites found goal sessions with a higher percentage of unique pages, respectively. Between the two versions, the unique percentage of pages viewed was equally successful in differentiating between the two groups of sessions.

TSC8 – RELLNR

The second hypothesis about strict information scent (TSC8) also examined information scent in a strict manner. However, overall scent was determined in a finer-grained manner by using the pages and the order in which those pages were visited. The belief was that less complex (i.e., more linear) clickstreams were indicative of higher levels of scent and thus of a greater likelihood of achieving a goal. This expectation was supported at α = 0.01 (S = 22; p-value < 0.0001)²⁰. 22 of the 24

¹⁹ Hypothesis TSC7 was also significant at α = 0.01 for both the t-test (t = 9.34; df = 46; p-value < 0.0001) and Wilcoxon test (V = 1,049; p-value < 0.0001).

²⁰ Hypothesis TSC8 was also significant at α = 0.01 for both the t-test (t = 5.15; df = 46; p-value < 0.0001) and Wilcoxon test (V = 295; p-value < 0.0001).

PAGE 291

non-zero Web sites had higher relative linear clickstream values for goal sessions compared to non-goal sessions, with goal sessions having, on average, the exact same clickstream linearity as prior goal sessions. Non-goal sessions were over a quarter of a point lower in clickstream linearity (−0.28) than past goal sessions.

The results of the static and temporal versions of the model were almost identical: static (S = 18; p-value < 0.0001) and temporal (S = 22; p-value < 0.0001). None of the non-zero static Web sites (0.00%) and only two of the temporal Web sites (8.33%) had non-goal sessions with a higher median clickstream linearity. Thus, like many of the other measures, both versions were equally capable of separating goal from non-goal sessions using the linearity of a user's session.

TSC9 – RELTRLMAX, RELTRLLAST, and RELTRLSUM

The final three sub-hypotheses of TSC9 (table 103) examined the efficacy that following valuable trails had in explaining goal achievement. Hypotheses TSC9a and TSC9c were both supported at α = 0.05 (S = 6; p-value = 0.0156 for both measures), while hypothesis TSC9b was only supported at α = 0.10 (S = 5; p-value = 0.0313)²¹. The difference in significance between the measures was due to sample size. Both RELTRLMAX and RELTRLSUM had six non-zero Web sites, while RELTRLLAST only had five (all of which supported the hypothesis in a positive direction). Thus, there were simply not enough Web sites for RELTRLLAST to reach significance at α = 0.05.

The results for hypothesis TSC9 were found to be significant at the same α level for all but one measure (RELTRLLAST) in the temporal version of the site-centric model: static (S = 6; p-value = 0.0156 for all three measures), and temporal (S = 6; p-value = 0.0156 for RELTRLMAX and RELTRLSUM and S = 5; p-value = 0.0313 for RELTRLLAST). Both versions were similar in that all non-zero Web sites found higher median trail values within their goal sessions. However, just as

²¹ Hypotheses TSC9a-c were significant at either α = 0.05 or 0.10, depending on the test. For the t-test, two of the three measures were significant at α = 0.10 (RELTRLMAX (t = 2.13; df = 14; p-value = 0.0255) and RELTRLLAST (t = 2.21; df = 14; p-value = 0.0219)), while the third was significant at α = 0.05 (RELTRLSUM (t = 2.37; df = 14; p-value = 0.0165)). For the Wilcoxon test, two of the measures were significant at α = 0.05 (V = 21; p-value = 0.0156 for RELTRLMAX and RELTRLSUM), while the third was only significant at α = 0.10 (V = 15; p-value = 0.0313). The less significant RELTRLMAX measure from the t-test may be due to the degree of difference between the goal and non-goal sessions. For example, both of the measures that were significant at 0.10 had less of an average difference between sessions (RELTRLMAX = 0.17; RELTRLLAST = 0.15) than the measure that was significant at 0.05 (RELTRLSUM = 0.35). The difference in significance between the measures of the Wilcoxon test was due to the same reason as found with the sign test: smaller sample size and thus less power to detect differences between the sessions.

PAGE 292

with learning patches, the temporal version also had more Web sites find valuable trails than the static version (15 versus 10 total Web sites), highlighting the ability of the temporal version to use the extra available data to learn additional trails.

Summary of Results

Table 104 summarizes the results of the hypotheses testing. Of the 11 hypotheses and sub-hypotheses, seven were supported at α = 0.01, three at α = 0.05, and one at α = 0.10. The table also lists the results obtained from the static version of the site-centric CMIF.

Table 104: Temporal Site-centric: Hypotheses Results Summary

                         Hypothesis Supported?
Hyp.   Metric        Temporal   Static
INFORMATION PATCH – SITE-PATCH
TSC1   RELDUR        Yes ***    Yes ***
TSC2   RELPGS        Yes **     Yes **
INFORMATION PATCH – PAGE-PATCH
TSC5a  RELPTCMAX     Yes ***    Yes ***
TSC5b  RELPTCLAST    Yes ***    Yes ***
TSC5c  RELPTCSUM     Yes ***    Yes ***
TSC6   RELPTCDUR     Yes ***    Yes ***
STRICT INFORMATION SCENT
TSC7   RELUNQ        Yes ***    Yes ***
TSC8   RELLNR        Yes ***    Yes ***
RELAXED INFORMATION SCENT
TSC9a  RELTRLMAX     Yes **     Yes **
TSC9b  RELTRLLAST    Yes *      Yes **
TSC9c  RELTRLSUM     Yes **     Yes **

* p ≤ 0.10; ** p ≤ 0.05; *** p ≤ 0.01

PAGE 293

8.3 Conclusion

Overall, the results between the two versions of the site-centric model did not differ in significant ways. Although the results were not significantly better, they were also not worse. Thus, the use of the temporal version provides additional evidence of the efficacy of the selected measures and of the ability of relative measures to distinguish between goal and non-goal sessions. In addition, the temporal version did see an increase in the number of Web sites which were able to learn patches and trails, although the significance of the results did not increase with the larger sample size.

On the surface, the lack of significantly better results than the static version would discourage the undertaking of the temporal model, especially given the computational cost and complexity associated with its methodology. However, the Web sites used to test the model may not have changed dramatically enough over the course of the data collection period to warrant the need for the temporal methodology. Warren et al. (1999) found within their limited examination of Web sites that "... the overall rate of change of a site increased with the size of the site" (pg. 182). Thus, the temporal version may be more appropriate for larger Web sites that are evolving at a faster rate than those seen within the site-centric dataset²².

²² On average, Web sites within the site-centric dataset were small with only 16.36 pages.

PAGE 294

Chapter 9

Conclusion

This dissertation sought to explain goal achievement (i.e., choice behavior) at limited traffic long tail Web sites using Information Foraging Theory (IFT) (Pirolli, 2007; Pirolli and Card, 1999). The thesis of IFT was that individuals are driven by a metaphorical sense of smell that guides them through patches of information in their environment. Having a foundation in both psychology and ecology, IFT drew from both disciplines to explain the mechanisms and the resulting behavior of information foragers.

IFT used a production rule system from the psychological adaptive control of thought-rational (ACT-R) theory to describe the cognitive process of individuals foraging for information (Anderson et al., 2004). The rationalization of why a person would move from one area of their environment to another was explained according to the ecological patch model from optimal foraging theory (OFT) (Stephens and Krebs, 1986).

From ACT-R and OFT, the concepts of information scent and patches were defined for IFT. Information scent was the driving force behind why a person made a navigational selection amongst a group of competing options. As foragers were assumed to be rational, scent was a mechanism by which foragers could reduce their search costs by increasing their accuracy on which option led to the information of value (Pirolli, 2007). An information patch was defined as an area of the search environment with similar information (e.g., single Web page, multiple Web pages, Web site) (Pirolli, 2007).

IFT was originally developed to be used in a "production rule" environment, where a user would perform an action when the conditions of a rule were met. However, the use of IFT in clickstream research required conceptualizing the ideas of IFT in a non-production rule environment. To meet such an end this dissertation asked three research questions regarding how to learn (1) information patches, (2) trails of scent, and finally (3) how to combine both concepts to create a Clickstream Model of Information Foraging (CMIF).

The first two research questions were similar in both concept and execution. In regards to patches, each user was free to define what a patch was as they saw fit. However, certain patterns of patches emerged on a Web site amongst those foragers with similar information goals. Likewise, scent trails were also defined by each user. When combined with other users, patterns from fragments of scent trails also emerged on a site between users with similar information goals. For the online firm, categorizing patches or trails as valuable to goal-achieving or non-goal-achieving foragers helped give an indication of the intent of users according to which patches or trails were visited or followed.

Research Question 1: How can information patches be learned from a long tail Web site?
Research Question 2: How can information scent trails be learned from a long tail Web site?

For research questions 1 and 2, frequent itemsets and sequential patterns were learned on each Web site from goal and non-goal sessions to create contrast sets (Bay and Pazzani, 1999). Contrast sets which were able to significantly distinguish between the two groups of sessions at α = 0.05 were deemed valuable patches or trails. Once discovered, patches and trails were given a value according to how well the patch or trail distinguished between the goal and non-goal sessions. In general, finding valuable patches and trails was successful on roughly a quarter of all tested Web sites (29.79% of sites for patches and 21.28% for trails). On those Web sites which did discover patches and trails, there were multiple instances of patches and trails being found (average of 11.93 patches and 4.70 trails per site).

The previous two research questions examined the concepts of information scent and patches individually. However, the real value of IFT was its ability to combine the search environment (i.e., patches) with the actions of a forager (i.e., scent). Thus the main focus of this dissertation and the final research question was on how these concepts could be combined using clickstream data to infer goal achievement.

Research Question 3: How can information foraging theory and clickstream data be used to explain the achievement of a goal at a long tail Web site?
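As a rough illustration of the significance criterion used for research questions 1 and 2, the sketch below tests whether a single candidate pattern (an itemset or a sequence) occurs at different rates in goal and non-goal sessions. It assumes a Pearson chi-square test on a 2x2 table with one degree of freedom and no correction for multiple comparisons; the function names and the example counts are hypothetical, and the test procedure actually used in this dissertation may differ.

```python
import math


def chi_square_2x2(goal_with, goal_total, nongoal_with, nongoal_total):
    """Return (statistic, p_value) for a 2x2 table of pattern presence
    by session group, using the Pearson chi-square test (1 d.f.)."""
    a, b = goal_with, goal_total - goal_with            # goal: with / without pattern
    c, d = nongoal_with, nongoal_total - nongoal_with   # non-goal: with / without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the groups cannot be contrasted.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def is_contrast_set(goal_with, goal_total, nongoal_with, nongoal_total,
                    alpha=0.05):
    """True if the pattern's support differs significantly between groups.
    The alpha default mirrors the 0.05 level mentioned in the text."""
    _, p_value = chi_square_2x2(goal_with, goal_total,
                                nongoal_with, nongoal_total)
    return p_value < alpha
```

With these made-up counts, a pattern present in 30 of 50 goal sessions but only 40 of 400 non-goal sessions would pass the test at α = 0.05, whereas a pattern present in 10% of both groups would not.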
Two versions of a clickstream model of information foraging were proposed which used clickstream metrics to represent the concepts of information scent and patches. In addition, the models also included measures which extended IFT. For example, hypotheses were introduced which tested the role of memory about a site and how patch value, specific to a group of foragers, could be used to predict goal achievement. The user-centric (UC) model exploited user-centric data (Padmanabhan et al., 2001) about a forager's entire browsing behavior to explain goal achievement at a long tail Web site. This model compared a forager's behavior across multiple Web sites. However, due to user-centric data being aggregated at the session level, the model lacked depth at individual Web sites.

Because a user's entire clickstream over multiple sites is rarely available to an online firm, a site-centric (SC) version of the model employing site-centric data (Padmanabhan et al., 2001) was also developed. Having access to page-level data made the site-centric model capable of analyzing patches at all levels of analysis along with information scent at a Web site. However, since a forager's behavior across sites was unknown with site-centric data, the site-centric model compared a forager's behavior relative to an absolute value of zero [1].

The user-centric model proposed four hypotheses that examined the behavior of a forager within a site-patch (i.e., Web site). Three of the four hypotheses were supported at an α level of 0.01, while the fourth was not supported at any of the tested alpha levels. The site-centric model proposed the same four site-patch hypotheses as the user-centric model, plus two page-patch hypotheses and three information scent hypotheses (nine hypotheses total). Five of the hypotheses were supported at an α level of 0.01, two at α = 0.05, and one at α = 0.10. The remaining hypothesis was found to be highly significant (α = 0.01) in the opposite direction of what was hypothesized.
Overall, both models were able to find measures which successfully distinguished between goal and non-goal sessions. Furthermore, the measures were grounded on a theoretical base that not only guided their selection (or creation), but also provided a reasoning for their existence that helped to explain why users behaved in the manners in which they did. In general, the two concepts of IFT were well supported using both versions of the clickstream model of information foraging.

The remainder of this chapter is organized as follows. First, the limitations of this research are discussed in §9.1. A discussion of the contributions of this dissertation is given in §9.2. Finally, §9.3 provides a brief overview of future research which expands upon this dissertation.

[1] Chapter 8 contains a temporal version of the site-centric model which compared each session relative to prior goal sessions.

9.1 Limitations

As with any research, there were a number of limitations which should be recognized so that future research may improve upon this work. Listed below are nine limitations of this dissertation.

(1) Since IFT is a relatively new and not widely tested theory, basing this entire dissertation on its usage may be considered a limitation. However, even though the theory has not seen widespread usage like other theories commonly used in IS (e.g., Theory of Planned Behavior (Ajzen, 1991)), prior research has successfully used the theory. For example, elements of IFT have been used to inform the design of user-interfaces (Willett et al., 2007; Xie et al., 2006; Olston and Chi, 2003) and to help explain the browsing behavior of foragers (Lawrance et al., 2007; Galletta et al., 2006; Katz and Byrne, 2003). Furthermore, IFT is itself heavily based upon two theories that are well established within their respective disciplines: Optimal Foraging Theory (OFT) (Stephens and Krebs, 1986) and the Adaptive Control of Thought-Rational Theory (ACT-R) (Anderson et al., 2004). Therefore, while IFT is relatively new, its usefulness as a theory should not be discounted on that basis alone. Instead, this dissertation and other research like it are needed to determine, through evaluation, the worth of IFT.
(2) The prediction problems examined by the user- and site-centric models were different. The site-centric model predicted if a goal would be achieved during the remainder of a session. To meet that task, only information that occurred before a form submission was used to calculate the measures and learn patches and trails. This forward-looking prediction was possible because the site-centric dataset contained page-level information, which allowed a session to be segmented such that only browsing behavior before the form submission was used. In contrast, the user-centric model predicted if a goal would have occurred given all information about a session (i.e., backward-looking prediction). The user-centric data was at the site-level and thus constrained the problem that could be analyzed. Since it was unknown where in the session a purchase took place, there was no reliable means with which to segment sessions.

The use of all browsing behavior within the user-centric model introduced two limitations.

First, the measures reflected the browsing behavior of foragers before and after their purchase. While the data only allowed the first four measures to be tested, the change in information goal after the purchase may have introduced a greater amount of noise into some of the other measures (e.g., those dealing with page-patches and scent). The second limitation is that goal sessions by default would likely have a higher number of pages viewed and a longer session duration as a direct consequence of purchasing a product. For example, every goal session would have an increased number of pages viewed and session duration over non-goal sessions simply because they went through the checkout process. Thus, some of the differences seen between the measures of the first two hypotheses may be biased because all behavior from a session was used.

(3) Within the site-centric version of the clickstream model, the Web sites were assumed to remain relatively constant over the course of the data collection period. If the assumption of constant structure or content on a site was not met, then the browsing behavior of sessions may differ depending on when the sessions took place. For example, at one point in time goal sessions at a Web site may have visited 10 pages per session, on average. However, after reorganizing and streamlining the Web site, goal sessions then only viewed five pages on average. Comparing against an absolute value of zero would make distinguishing goal from non-goal sessions difficult because of the drastic change in browsing behavior.

To combat this limitation a temporal version of the site-centric model was introduced in chapter 8. The temporal version compared all browsing behavior relative to all goal sessions which had taken place before the current session. Thus, the relative measures would be better able to reflect changes in the structure or content of a site. Comparing the results of the two versions of the site-centric model failed to find any large differences between the models, indicating the Web sites used in the site-centric dataset were mostly static. However, other datasets containing Web sites which evolve at a much more rapid pace may find better results using the temporal version of the model. Future research will more closely examine the effect time has on explaining goal achievement.
(4) The user-centric dataset contained Web sites of all popularity, but this dissertation was only interested in examining long tail Web sites. The limitation was that a rigorous and quantifiable definition of what constituted a long tail Web site was not known. Thus, the 80/20 rule (Newman, 2005) was used to classify sites as either parts of the short head or long tail of a power law distribution. While the use of the 80/20 rule appears to be reasonable, future research should better explore how to define the long tail.

The user-centric dataset also restricted Web sites that were too far down the long tail. For example, sites with few achieved goals (< 50 purchases) were removed as they were suspected of being abandoned, too new, or representing failed business models. Due to their lack of traffic, these “very long tail” sites were considered too sparse to be usable for the intended analysis (e.g., mining may result in no or spurious patches and trails being found). While the selection of 50 goal sessions appears to be reasonable, the selection was specific to the user-centric dataset. Thus, future research may be better able to segment the long tail by defining generally-applicable rules.

(5) A common limitation faced when dealing with real-world datasets is the element of “noise” in the data. Although both datasets were preprocessed extensively, some elements of noise inevitably remained within the datasets. For example, within the site-centric dataset robots, spiders, and other automated programs may have been present in the data. To deal with these robots, the data provider had initially scrubbed the data for any self-identified robots. Then outlier analysis was performed during preprocessing to remove any other out-of-place sessions.

The actual effect of such noise on the results of the model is unknown. However, it was believed the noise had a minimal impact because the results of both versions of the model generally came out as expected. Thus, the model demonstrates some level of robustness in the face of noisy data. Future work may be better able to quantify the impact noise has on the model by using more in-depth (e.g., categories of sites for the user-centric data) and focused data (e.g., browsing behavior from experimental participants).

(6) The determination of leaving and returning behavior within the same session differed between the two models, making comparisons of their results difficult. The user-centric model was stricter than the site-centric model in determining whether a session left and returned during a session. The user-centric model required a visitation of at least two pages at another e-commerce Web site, whereas the site-centric model counted visitations of any length at any site. Thus, the site-centric model may have inadvertently introduced noise into the analysis by counting Web sites which were not related to the information goal of the forager at the site of interest [2]. Future research using more detailed user-centric datasets may be better able to determine if the distinction between the types of sites being left for makes a difference. For example, a site-centric dataset may be created from a detailed user-centric dataset [3] to determine if viewing more than one page at a similar type of Web site really matters when considering leaving and returning behavior.

(7) The site-centric version of the clickstream model used the first 70% of sessions to calculate patches and trails [4]. The rationale behind the usage of 70% was to have a large enough number of sessions to mine from in order to find valuable patches and trails, without introducing noise into the analysis by discovering spurious patches and trails (e.g., a patch that only one other session visited). Future research should perform a sensitivity analysis to determine if the results of the model change dramatically with different percentages.

(8) The temporal site-centric version of the model used the same set of sessions to perform two tasks. First, a set of goal and non-goal sessions was used to learn patches and trails. Second, the goal sessions from that same set of sessions were then used to calculate measures for patch visitation and trail following. The median value of those measures was then used to determine the relative value of patch visitation and trail following for the current session. This approach was taken because of the limited sample size at each Web site. Ideally, the mining of patches and trails should have used one group of sessions, while the calculation of measures should have used another. Future research that uses Web sites not so far down the long tail may be better capable of having independent groups of sessions accomplish each task.

(9) The path stratum measure was based on concepts from graph theory (McEneaney, 2001).
When used to quantify the linearity of a user's clickstream, two main limitations came to the surface. First, the path stratum measure would be much lower if the user started and ended their session on the same page (e.g., the index page), as opposed to different pages (e.g., index and contact page). This is because the path the user took was a closed walk. Within the context of measuring scent, however, such a closed walk may not necessarily indicate such an extreme drop in scent. For example, a forager may return to the index page at the end of their session to make sure they investigated all links of interest.

The second limitation of the metric is that repeating sequential page views and multiple traversals of the same path are lost when transforming a clickstream to the converted distance matrix that is needed to calculate the measure. Since repeated behavior is lost, the overall scent of a user may be marked as high by the measure even in situations where multiple cycles occur within the clickstream. Given these limitations, future research may further explore if these situations unique to measuring information scent may be incorporated into the path stratum measure.

[2] Within the user-centric model, noise could have been further reduced by restricting e-commerce sites to those within the same product category as the target session. Unfortunately, the Web sites within the user-centric dataset were not categorized.
[3] See Padmanabhan et al. (2001) for an example of creating a site-centric dataset from a user-centric dataset.
[4] To be precise, the first 70% of goal sessions were used. In addition, all non-goal sessions which occurred before the last of the 70% of goal sessions were also used.

9.2 Contributions

Despite the limitations mentioned in the previous section, it is believed this dissertation still makes a number of worthwhile contributions. Listed below are the major contributions of this dissertation.

(1) First, this dissertation demonstrated how IFT could be used as a theoretical basis for clickstream research. Through the creation of two versions of a clickstream model of information foraging, the concepts of IFT were quantified outside of a production rule environment. In addition, the CMIF not only operationalized the core concepts of IFT, but also extended the theory by introducing memory, forager-independent valuation of patches and trails, along with refined definitions of scent (e.g., strict and relaxed scent). Once tested, many of the core aspects of IFT and the theoretical extensions introduced in this dissertation were supported. Thus, this dissertation not only demonstrated the ability of IFT to explain goal achievement, but it also introduced theoretical extensions which provided a more in-depth explanation of goal behavior.

(2) This dissertation also presented a methodology on how to learn patches and scent trails using not only significant, but also supported contrast sets. Measures were also created which quantified a forager's visitation of patches and following of trails. The metrics measured the most valuable, last, and summation of all patches and trails that were visited or followed. For those Web sites within the CMIF that discovered patches and trails, the measures were capable of distinguishing goal from non-goal sessions according to a forager's visitation and following behavior.

(3) The third contribution was a methodology that detailed how to preprocess datasets with long tail Web sites. In particular, a separate user- and site-centric methodology was presented which highlighted the unique challenges associated with preprocessing each dataset. For example, a process was provided for the site-centric dataset about how to locate and select a single definable goal on Web sites which have more than one available goal.

(4) Finally, due to the presence of IFT guiding analysis, traditionally understudied long tail Web sites were able to be examined even in light of their sparse datasets.

9.3 Future Research

This dissertation was meant to provide a well-defined channel through which a stream of future research may flow. Thus, listed below are four future research projects that continue and extend upon the work in this dissertation.

(1) The first research project deals with attempting to answer the question “What is the long tail?” In this dissertation the long tail was defined as those Web sites which only accounted for 20% of achieved goals. A natural extension would be to more precisely define the separation between long tail and short-head Web sites. However, such a distinction may still be too simplistic in light of how much area the long tail portion of a curve may cover. Therefore, further segmentation within the long tail (e.g., the “very long tail”) may also need to be defined.

In addition, there may be other means with which to define long tail Web sites, in general. For example, should sites be defined according to their total amount of traffic or by the number of goals achieved? If the goal being examined is purchases, can a site be a long tail Web site for one type of product, yet reside within the short-head for other product categories? If so, how do browsing patterns of foragers differ in regards to the long tailedness of the Web site's product categories?

The largest contribution of this research would be a clear definition of what “long tail” really means.

(2) The second research project is also a natural extension of this dissertation [5]: “How does the evolution of long tail Web sites affect browsing patterns?” The temporal version of the site-centric model provided an initial, yet somewhat simplistic, glimpse into a time-sensitive relative analysis. In essence, the temporal version used a window consisting of all previous sessions. However, including all previous sessions may be a liability on Web sites that commonly change, since “old” data would limit the ability of new patches and trails to be learned from the newly changed site. Thus this research project would examine how sliding windows may be defined to better meet the needs of long tail sites. For example, windows may be of a certain size (number or percentage), for a particular time period, of a size necessary to stabilize measures, or some combination of the three.

In addition, the burn-in period may also be defined such that measures are not calculated until patch and trail discovery has stabilized [6]. The use of stabilization may also have the added benefit of not “throwing” away extra banked sessions just because the bank had not met the prescribed number (or percentage) of sessions in it. Furthermore, the computationally expensive task of re-learning patches and trails may be restricted to only after those times of destabilization.

The largest contribution of this research would be a thorough analysis of how time impacts the analysis of foraging behavior on long tail Web sites. In addition, a methodology would be introduced that would make the most of the sparseness of data from long tail sites, while still allowing relative comparisons of foraging behavior.

(3) The third research project would provide a test of information foraging theory using a production rule system. In particular, IFT would be examined at long tail Web sites to determine how well the production rules, as specified by Pirolli (2007), are able to explain foraging behavior on long tail Web sites. In addition, production rules which take into account the theoretical extensions tested in this dissertation (e.g., memory, value of page-patches) would also be created and tested. The main contributions of this research would be two-fold. First, IFT would be tested in its original form on a sample of Web sites different from those sites used to create and test the theory. The second contribution would examine the ability and importance of the theoretical extensions outlined in this dissertation to explain goal achievement using IFT.

(4) The final research project would not be as direct an extension of this dissertation as the other three projects; however, it would still employ IFT as a theoretical base to examine searching behavior. In particular, the purpose of this research piece would be to determine how search queries, used to arrive at a Web site, can predict the probability of a goal being achieved. The belief is that search queries are an observable manifestation of the information goal of a forager. Thus, information within a search query may provide clues into not only the goal of the forager, but also how well-defined the goal is. For example, assume one visitor submitted “flat-panel TV” for their search query, while another submitted “Sony Bravia 52”. The first query appears to be more general in nature and thus may be more suited for browsing behavior that occurs during the information gathering stage. In contrast, the second query looks to be much more refined and pointing toward a specific product, which a forager may be interested in purchasing.

Semantic similarity, which is the likeness of concepts between two sets of words (Li et al., 2003), would be used to quantify the textual nature of search queries and then group similar search queries (and their resulting sessions) together. Clustering search queries which are semantically similar to one another may uncover groups of sessions which are more likely to achieve a goal during a session. The expected contributions of this research would be the introduction of semantic similarity to clickstream research and the creation of a methodology on how semantic similarity may be used to predict goal achievement.

[5] A dataset which consists of Web sites that evolve at a more rapid pace than those seen in the site-centric dataset would be used.
[6] Measure calculation would also cease following Web site changes until patch and trail discovery had stabilized again.
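The sliding-window idea raised in future research project (2) can be sketched as follows. This is a minimal illustration, assuming the relative measure is the difference between a session's value and the median of the most recent prior goal sessions. The window size, the function name, and the choice of a difference rather than a ratio are assumptions made here for illustration and are not the method used in chapter 8.

```python
from collections import deque
from statistics import median


def relative_measures(sessions, window_size):
    """Compare each session to the median of recent prior goal sessions.

    sessions: list of (value, is_goal) pairs in chronological order, where
        value is any clickstream measure (e.g., pages viewed).
    window_size: hypothetical cap on how many prior goal sessions are kept.

    Returns one entry per session: value minus the window median, or None
    while no prior goal session is available yet.
    """
    window = deque(maxlen=window_size)  # oldest goal sessions drop out first
    results = []
    for value, is_goal in sessions:
        if window:
            results.append(value - median(window))
        else:
            results.append(None)
        # Only goal sessions feed the baseline, and only after the current
        # session has been scored, so a session never compares to itself.
        if is_goal:
            window.append(value)
    return results
```

A window that keeps every prior goal session (a very large `window_size`) corresponds to the all-previous-sessions baseline described above, while a small window lets the baseline follow a site whose structure changes.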

References

Adamic, L. A. and Huberman, B. A. 2001. The Web's hidden order. Communications of the ACM 44 (9) 55–60.
Ajzen, I. 1991. The theory of planned behavior. Organizational Behavior and Human Decision Processes 50 (2) 179–211.
Alonso, J. C., Alonso, J. A., Bautista, L. M., and Munoz-Pulido, R. 1995. Patch use in cranes: a field test of optimal foraging predictions. Animal Behaviour 49 (5) 1367–1379.
Anderson, C. 2006. The Long Tail: Why the Future of Business is Selling Less of More. Hyperion, New York, NY.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y. 2004. An integrated theory of the mind. Psychological Review 111 (4) 1036–1060.
Anderson, J. R., Budiu, R., and Reder, L. M. 2001. A theory of sentence memory as part of a general theory of memory. Journal of Memory and Language 45 337–367.
Anderson, J. R. and Milson, R. 1989. Human memory: An adaptive perspective. Psychological Review 96 (4) 703–719.
Anderson, J. R. and Pirolli, P. L. 1984. Spread of activation. Journal of Experimental Psychology: Learning, Memory, & Cognition 10 (4) 791–799.
Awad, N. F., Jones, J. L., and Zhang, J. 2006. Does search matter? Using online clickstream data to examine the relationship between online search and purchase behavior. Workshop on Information Systems and Economics. Evanston, IL, 1–5.
Ayres, J., Flannick, J., Gehrke, J., and Yiu, T. 2002. Sequential pattern mining using a bitmap representation. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, 429–435.

Barnett, V. and Lewis, T. 1994. Outliers in statistical data. 3rd ed. John Wiley & Sons, Inc., West Sussex, England.
Bay, S. D. and Pazzani, M. J. 1999. Detecting change in categorical data: mining contrast sets. The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, CA, 302–306.
Bhat, S., Bevans, M., and Sengupta, S. 2002. Measuring users' Web activity to evaluate and enhance advertising effectiveness. Journal of Advertising 31 (3) 97–106.
Bloch, P. H., Sherrell, D. L., and Ridgway, N. M. 1986. Consumer search: An extended framework. The Journal of Consumer Research 13 (1) 119–126.
Botafogo, R. A., Rivlin, E., and Shneiderman, B. 1992. Structural analysis of hypertexts: identifying hierarchies and useful metrics. ACM Transactions on Information Systems 10 (2) 142–180.
Browne, G. J., Pitts, M. G., and Wetherbe, J. C. 2007. Cognitive stopping rules for terminating information search in online tasks. MIS Quarterly 31 (1) 89–104.
Bucklin, R. E., Lattin, J. M., Ansari, A., Gupta, S., Bell, D., Coupey, E., Little, J. D., Mela, C., Montgomery, A., and Steckel, J. 2002. Choice and the Internet: From clickstream to research stream. Marketing Letters 13 (3) 245–258.
Bucklin, R. E. and Sismeiro, C. 2003. A model of Web site browsing behavior estimated on clickstream data. Journal of Marketing Research 40 (3) 249–267.
Burdick, D., Calimlim, M., and Gehrke, J. 2001. MAFIA: a maximal frequent itemset algorithm for transactional databases. Proceedings of the 17th International Conference on Data Engineering. Heidelberg, Germany, 443–452.
Byrne, M. D. 2001. ACT-R/PM and menu selection: applying a cognitive architecture to HCI. International Journal of Human-Computer Studies 55 (1) 41–84.
Card, S. K., Pirolli, P., Wege, M. M. V. D., Morrison, J. B., Reeder, R. W., Schraedley, P. K., and Boshart, J. 2001. Information scent as a driver of Web behavior graphs: results of a protocol analysis method for Web usability. Conference on Human Factors in Computing Systems. Seattle, WA, 498–505.

Catledge, L. D. and Pitkow, J. E. 1995. Characterizing browsing strategies in the World-Wide Web. Computer Network ISDN Systems 27 (6) 1065–1073.
Charnov, E. L. 1976. Optimal foraging, the marginal value theorem. Theoretical Population Biology 9 (2) 129–136.
Charnov, E. L. and Orians, G. H. 1973. Optimal foraging: Some theoretical explorations.
Chatterjee, P., Hoffman, D. L., and Novak, T. P. 2003. Modeling the clickstream: Implications for Web-based advertising efforts. Marketing Science 22 (4) 520–541.
Chi, E. H., Pitkow, J., Mackinlay, J., Pirolli, P., Gossweiler, R., and Card, S. K. 1998. Visualizing the evolution of Web ecologies. Proceedings of the SIGCHI conference on Human factors in computing systems. Los Angeles, CA, 400–407.
Chi, E. H., Rosien, A., Supattanasiri, G., Williams, A., Royer, C., Chow, C., Robles, E., Dalal, B., Chen, J., and Cousins, S. 2003. The bloodhound project: Automating discovery of Web usability issues using the InfoScent Simulator. Conference on Human Factors in Computing Systems. Ft. Lauderdale, FL, 1–8.
Church, K. W. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. Vancouver, B.C., 76–83.
Cochran, W. G. 1954. Some methods for strengthening the common χ² tests. Biometrics 10 (4) 417–451.
Collins, A. M. and Loftus, E. F. 1975. A spreading-activation theory of semantic processing. Psychological Review 82 (6) 407–428.
comScore, Inc. 2005. comScore 2004 disaggregate dataset. URL http://wrds.wharton.upenn.edu/ds/comscore/manuals/comScore_2004_Disaggregate_Dataset_WRDS.pdf
comScore, Inc. 2007a. 61 billion searches conducted worldwide in August. URL http://www.comscore.com/press/release.asp?press=1802

comScore, Inc. 2007b. comScore 2006 Web behavior database. URL http://wrds.wharton.upenn.edu/ds/comscore/manuals/comscore_wrds_2006.pdf
Conover, W. 1999. Practical nonparametric statistics. 3rd ed. John Wiley & Sons, Inc., New York, NY.
Dalgaard, P. 2008. Introductory Statistics with R. 2nd ed. Springer Science+Business Media, New York, NY.
Danaher, P. J., Mullarkey, G. W., and Essegaier, S. 2006. Factors affecting Web site visit duration: A cross-domain analysis. Journal of Marketing Research 43 (2) 182–194.
Davison, B. D. 2000. Topical locality in the Web. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Athens, Greece, 272–279.
Engel, J., Blackwell, R., and Miniard, P. 1990. Consumer Behavior. 6th ed. The Dryden Press, Chicago, IL.
Ester, M., Kriegel, H., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, OR, 226–231.
Fader, P. S., Hardie, B. G. S., and Lee, K. L. 2005. RFM and CLV: Using iso-value curves for customer base analysis. Journal of Marketing Research 42 (4) 415–430.
Fielding, R., Gettys, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T. 1999. Hypertext transfer protocol – HTTP/1.1. URL http://www.ietf.org/rfc/rfc2616.txt
Fischer, J. E., Bachmann, L. M., and Jaeschke, R. 2003. A readers' guide to the interpretation of diagnostic test properties: Clinical example of sepsis. Intensive Care Medicine 29 (7) 1043–1051.
Floyd, R. W. 1962. Algorithm 97: Shortest path. Communications of the ACM 5 (6) 345.
Fu, W. T., Bothell, D., Douglass, S., Haimson, C., Sohn, M. H., and Anderson, J. 2006. Toward a real-time model-based training system. Interacting with Computers 18 (6) 1215–1241.

Galletta, D., Henry, R., McCoy, S., and Polak, P. 2006. When the wait isn't so bad: The interacting effects of Web site delay, familiarity, and breadth. Information Systems Research 17 (1) 20–37.
Garshelis, D. L. 2007. Brown Bear. Oxford University Press, Oxford Reference Online. URL http://www.oxfordreference.com/views/entry.html?subview=Main&entry=t227.e19
Gourley, D. and Totty, B. 2002. HTTP: The Definitive Guide. O'Reilly.
Gray, W. D., Schoelles, M. J., and Sims, C. R. 2005. Adapting to the task environment: Explorations in expected value. Cognitive Systems Research 6 (1) 27–40.
Hames, R. B. and Vickers, W. T. 1982. Optimal diet breadth theory as a model to explain variability in Amazonian hunting. American Ethnologist 9 (2) 358–378.
Harary, F. 1959. Status and contrastatus. Sociometry 22 (1) 23–43.
Hawkes, K., Hill, K., and O'Connell, J. F. 1982. Why hunters gather: Optimal foraging and the Ache of eastern Paraguay. American Ethnologist 9 (2) 379–398.
Helsel, D. R. and Hirsch, R. M. 1992. Statistical Methods in Water Resources. Elsevier Science Publishers B.V., Amsterdam, The Netherlands.
Hodge, V. J. and Austin, J. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22 85–126.
Holling, C. 1959. Some characteristics of simple types of predation and parasitism. The Canadian Entomologist 91 (7) 385–398.
Jaillet, H. F. 2003. Web metrics: Measuring patterns in online shopping. Journal of Consumer Behaviour 2 (4) 369–381.
Janiszewski, C. 1998. The influence of display characteristics on visual exploratory search behavior. The Journal of Consumer Research 25 (3) 290–301.
Johnson, E. J., Bellman, S., and Lohse, G. L. 2003. Cognitive lock-in and the power law of practice. Journal of Marketing 67 (2) 62–75.

Johnson, E. J., Moe, W. W., Fader, P. S., Bellman, S., and Lohse, G. L. 2004. On the depth and dynamics of online search behavior. Management Science 50 (3) 299–308.
Kalczynski, P. J., Senecal, S., and Nantel, J. 2006. Predicting on-line task completion with clickstream complexity measures: A graph-based approach. International Journal of Electronic Commerce 10 (3) 121–141.
Katz, M. A. and Byrne, M. D. 2003. Effects of scent and breadth on use of site-specific search on e-commerce Web sites. ACM Transactions on Computer-Human Interaction 10 (3) 198–220.
Kavassalis, P., Lelis, S., Rafea, M., and Haridi, S. 2004. What makes a Web site popular? Communications of the ACM 47 (2) 50–55.
Kay, A. 2002. Applying optimal foraging theory to assess nutrient availability ratios for ants. Ecology 83 (7) 1935–1944.
Lawrance, J., Bellamy, R., and Burnett, M. 2007. Scents in programs: Does information foraging theory apply to program maintenance? IEEE Symposium on Visual Languages and Human-Centric Computing. Coeur d'Alene, ID, 15–22.
Li, Y., Bandar, Z., and McLean, D. 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15 (4) 871–882.
MacArthur, R. H. and Pianka, E. R. 1966. On optimal use of a patchy environment. The American Naturalist 100 (916) 603–609.
MacCracken, J. G. and Hansen, R. M. 1987. Coyote feeding strategies in southeastern Idaho: Optimal foraging by an opportunistic predator? The Journal of Wildlife Management 51 (2) 278–285.
McEneaney, J. E. 2001. Graphic and numerical methods to assess navigation in hypertext. International Journal of Human-Computer Studies 55 761–786.
Mendenhall, W. and Sincich, T. 2003. A Second Course in Statistics: Regression Analysis. 6th ed. Pearson Education, Inc., Upper Saddle River, NJ.

Miller, G. A. 1956. The magical number seven plus or minus two: Some limits on our capacity for processing information. Psychological Review 63 81–97.
Moe, W. W. 2003. Buying, searching, or browsing: Differentiating between online shoppers using in-store navigational clickstream. Journal of Consumer Psychology 13 (1&2) 29–39.
Moe, W. W. and Fader, P. S. 2004a. Capturing evolving visit behavior in clickstream data. Journal of Interactive Marketing 18 (1) 5–19.
Moe, W. W. and Fader, P. S. 2004b. Dynamic conversion behavior at e-commerce sites. Management Science 50 (3) 326–335.
Montgomery, A. L., Li, S., Srinivasan, K., and Liechty, J. C. 2004. Modeling online browsing and path analysis using clickstream data. Marketing Science 23 (4) 579–595.
Morrison, D. G. 1969. On the interpretation of discriminant analysis. Journal of Marketing Research 6 (2) 156–163.
Mount, D. M. and Arya, S. 2006. ANN: a library for approximate nearest neighbor searching. URL http://www.cs.umd.edu/~mount/ANN/
Nevitt, G. A. 2000. Olfactory foraging by Antarctic procellariiform seabirds: Life at high Reynolds numbers. Biological Bulletin 198 245–253.
Newman, M. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46 (5) 323–351.
Nicholas, D., Huntington, P., Jamali, H. R., and Dobrowolski, T. 2007. Characterising and evaluating information seeking behaviour in a digital environment: Spotlight on the ‘bouncer’. Information Processing and Management 43 (4) 1085–1102.
Olston, C. and Chi, E. H. 2003. ScentTrails: Integrating browsing and searching on the Web. ACM Transactions on Computer-Human Interaction 10 (3) 177–197.
Padmanabhan, B., Zheng, Z., and Kimbrough, S. O. 2001. Personalization from incomplete data: what you don't know can hurt. The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, 154–163.

PAGE 312

Park,Y.H.andFader,P.S.2004.Modelingbrowsingbehavior atmultipleWebsites. Marketing Science 23 (3)280–303. Pirolli,P.L.2007. InformationForagingTheory:AdaptiveInteractionwithIn formation .Oxford UniversityPress,NewYork,NY. Pirolli,P.L.andCard,S.1999.Informationforaging. PsychologicalReview 106 (4)643–675. Pitkow,J.andPirolli,P.1997.Life,death,andlawfulness ontheelectronicfrontier. Proceedings oftheSIGCHIconferenceonHumanfactorsincomputingsyste ms .Atlanta,GA,383–390. Pitts,M.G.andBrowne,G.J.2004.Stoppingbehaviorofsyst emsanalystsduringinformation requirementselicitation. JournalofManagementInformationSystems 21 (1)203–226. RDevelopmentCoreTeam.2008. R:ALanguageandEnvironmentforStatisticalComputing .R FoundationforStatisticalComputing,Vienna,Austria.UR L http://www.R-project. org .ISBN3-900051-07-0. Reader,W.R.andPayne,S.J.2007.Allocatingtimeacrossmu ltipletexts:Samplingandsatiscing. Human-computerInteraction 22 (3)263–298. Ricca,F.andTonella,P.2000.Websiteanalysis:structure andevolution. Proceedingsofthe16th InternationalConferenceonSoftwareMaintenance .SanJose,CA,76–86. Ricca,F.andTonella,P.2001.Understandingandrestructu ringWebsiteswithReWeb. IEEE Multimedia 8 (2)40–51. Rode,C.,Cosmides,L.,Hell,W.,andTooby,J.1999.Whenand whydopeopleavoidunknown probabilitiesindecisionsunderuncertainty?testingsom epredictionsfromoptimalforaging theory. Cognition 72 (3)269–304. Rowley,J.2000.Productsearchine-shopping:areviewandr esearchpropositions. Journalof ConsumerMarketing 17 (1)20–35. Sandstrom,P.E.1994.Anoptimalforagingapproachtoinfor mationseekinganduse. Library Quarterly 64 (4)414–449. 299

PAGE 313

Satsangi,A.andZaiane,O.R.2007.Contrastingthecontras tsets:Analternativeapproach. Proceedingsofthe11thInternationalDatabaseEngineeringan dApplicationsSymposium .Banff, Alberta,Canada,114–119. Senecal,S.,Kalczynski,P.J.,andNantel,J.2005.Consume rs'decision-makingprocessandtheir onlineshoppingbehavior:aclickstreamanalysis. JournalofBusinessResearch 58 (11)1599– 1608. Shaver,D.1996. TheNextStepinDatabaseMarketing .JohnWiley&Sons,Inc.,NewYork,NY. Simon,H.A.1956.Rationalchoiceandthestructureoftheen vironment. PsychologicalReview 63 (2)129–138. Simon,H.A.1974.Howbigisachunk? Science 183 (4124)482–488. Sismeiro,C.andBucklin,R.E.2004.Modelingpurchasebeha vioratane-commerceWebsite:A task-completionapproach. JournalofMarketingResearch 41 (3)306–323. Smith,E.A.1983.Anthropologicalapplicationsofoptimal foragingtheory:Acriticalreview. CurrentAnthropology 24 (5)625–651. Stephens,D.W.andKrebs,J.R.1986. ForagingTheory .PrincetonUniversityPress. Stewart,T.C.andWest,R.L.2007.Deconstructingandrecon structingact-r:Exploringthearchitecturalspace. CognitiveSystemsResearch 8 227–236. Stone,B.andJacobs,R.2001. SuccessfulDirectMarketingMethods .7thed.McGraw-Hill,New York,NY. Streitberg,B.andR¨ohmel,J.1986.Exactdistributionsfo rpermutationsandranktests:Anintroductiontosomerecentlypublishedarticles. StatisticalSoftwareNewsletter 12 (1)10–17. Taatgen,N.A.andAnderson,J.R.2002.Whydochildrenlearn tosay”broke”?amodeloflearningthepasttensewithoutfeedback. Cognition 86 (2)123–155. Turney,P.D.2001.MiningtheWebforsynonyms:PMI–IRversu sLSAonTOEFL. Proceedings oftheTwelfthEuropeanConferenceonMachineLearning .Freiburg,Germany,491–502. 300

PAGE 314

VandenPoel,D.andBuckinx,W.2005.Predictingonline-pur chasingbehaviour. EuropeanJournalofOperationalResearch 166 (2)557–575. Warren,P.,Boldyreff,C.,andMunro,M.1999.Theevolution ofWebsites. SeventhInternational WorkshoponProgramComprehension .Pittsburgh,PA,178–185. Wilcoxon,F.1945.Individualcomparisonsbyrankingmetho ds. BiometricsBulletin 1 (6)80–83. Willett,W.,Heer,J.,andAgrawala,M.2007.Scentedwidget s:Improvingnavigationcueswith embeddedvisualizations. IEEETransactionsonVisualizationandComputerGraphics 13 (6) 1129–1136. Xie,X.,Liu,H.,Ma,W.,andZhang,H.2006.Browsinglargepi cturesunderlimiteddisplaysizes. IEEETransactionsonMultimedia 8 (4)707–715. Yang,Q.,Li,T.,andWang,K.2004.Buildingassociation-ru lebasedsequentialclassiersfor Web-documentprediction. DataMiningandKnowledgeDiscovery 8 (3)253–273. Yang,Y.andPadmanabhan,B.2003.Segmentingcustomertran sactionsusingapattern-based clusteringapproach. ProceedingsoftheThirdIEEEInternationalConferenceonD ataMining Melbourne,FL,411–418. Zauberman,G.2003.Theintertemporaldynamicsofconsumer lock-in. JournalofConsumer Research 30 405–419. Zhang,J.,Fang,X.,andSheng,O.R.L.2006.Onlineconsumer searchdepth:Theoriesandnew ndings. JournalofManagementInformationSystems 23 (3)71–95. 301
About the Author

James A. McCart received a Bachelor of Science in Information Systems from Purdue University in 2002. In 2006, Mr. McCart earned a Master of Science in Management Information Systems from the University of South Florida.

While in the Ph.D. program at the University of South Florida, Mr. McCart was awarded a University Graduate Fellowship in 2005 and a College of Business Research Award in 2009. In addition, Mr. McCart was accepted in 2008 to Doctoral Consortiums for both the International Conference on Information Systems (ICIS) and America's Conference on Information Systems (AMCIS).


Creator:
McCart, James.
Title:
Goal attainment on long tail web sites : an information foraging approach [electronic resource] / by James McCart.
Publisher:
[Tampa, Fla] : University of South Florida, 2009.
Notes
Title from PDF of title page.
Document formatted into pages; contains X pages.
Includes vita.
Dissertation (Ph.D.)--University of South Florida, 2009.
Includes bibliographical references.
Text (Electronic dissertation) in PDF format.
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
This dissertation contributed to the literature by providing a theoretically-grounded model which tested and extended IFT; introducing a methodology for learning patches and trails; detailing a methodology for preprocessing clickstream data for long tail Web sites; and focusing on traditionally under-studied long tail Web sites.
Advisor:
Donald J. Berndt, Ph.D.
Subjects / Keywords:
Clickstream research
Information foraging theory
Web mining
Information scent
Data mining
Dissertations, Academic -- USF -- Information Sys and Decision Sci -- Doctoral.
Part of:
USF Electronic Theses and Dissertations.
URL:
http://digital.lib.usf.edu/?e14.3190