
Mining medical data in a clinical environment


Material Information

Title:
Mining medical data in a clinical environment
Physical Description:
Book
Language:
English
Creator:
Ivanovskiy, Tim V
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Data mining
Rule association
Medical expert system
Apriori
Medical implications
Dissertations, Academic -- Computer Science -- Masters -- USF
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Abstract:
ABSTRACT: The availability of new treatments for a disease depends on the success of clinical trials. In order for a clinical trial to be successful and approved, medical researchers must first recruit patients with a specific set of conditions in order to test the effectiveness of the proposed treatment. In the past, the accrual process was tedious and time-consuming. Since accruals rely heavily on the ability of physicians and their staff to be familiar with the protocol eligibility criteria, candidates tend to be missed. This can result and has resulted in unsuccessful trials. A recent project at the University of South Florida aimed to assist research physicians at H. Lee Moffitt Cancer Center & Research Institute, Tampa, Florida, with a screening process by utilizing a web-based expert system, Moffitt Expedited Accrual Network System (MEANS). This system allows physicians to determine the eligibility of a patient for several clinical trials simultaneously. We have implemented this web-based expert system at the H. Lee Moffitt Cancer Center & Research Institute Gastroenterology (GI) Clinic. Based on our findings and staff feedback, the system has undergone many optimizations. We used data mining techniques to analyze the medical data of current gastrointestinal patients. The use of the Apriori algorithm allowed us to discover new rules (implications) in the patient data. All of the discovered implications were checked for medical validity by a physician, and those that were determined to be valid were entered into the expert system. Additional analysis of the data allowed us to streamline the system and decrease the number of mouse clicks required for screening. We also used a probability-based method to reorder the questions, which decreased the amount of data entry required to determine a patient's ineligibility.
Thesis:
Thesis (M.A.)--University of South Florida, 2006.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
Statement of Responsibility:
by Tim V. Ivanovskiy.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains 59 pages.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001798354
oclc - 159957687
usfldc doi - E14-SFE0001639
usfldc handle - e14.1639
System ID:
SFS0025957:00001




Full Text

PAGE 1

Mining Medical Data in a Clinical Environment

by

Tim V. Ivanovskiy

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Dmitry Goldgof, Ph.D.
Lawrence Hall, Ph.D.
Sudeep Sarkar, Ph.D.

Date of Approval: July 7, 2006

Keywords: data mining, rule association, medical expert system, apriori, medical implications

© Copyright 2006, Tim V. Ivanovskiy

PAGE 2

DEDICATION

This thesis is dedicated to: my wife Angela, who helped me in many ways during my graduate studies while she herself was completing her undergraduate degree; my parents Diana and Merle, for their constant support during my academic years; and the rest of my family, for working around my busy schedule.

PAGE 3

ACKNOWLEDGEMENTS

I would like to thank Dr. Dmitry Goldgof and Dr. Lawrence Hall for giving me the opportunity to work with them and for providing me their input during my work. Thank you Dr. Sudeep Sarkar for being part of the supervisory committee. I would also like to extend my gratitude to Dr. Chris Garrett from Moffitt Cancer Center and his nurse Halina Greenstien for assisting me during the data collection phase of this thesis. Finally, I would like to thank Shibendra Pobi for working with me and all of the nurses and doctors at Moffitt Cancer Center's GI clinic. I would also like to thank my wife Angela for making sure that my native Russian language did not migrate into the proper English structure of this thesis.

PAGE 4

TABLE OF CONTENTS

LIST OF TABLES  iii
LIST OF FIGURES  v
ABSTRACT  vi
CHAPTER 1  INTRODUCTION  1
  1.1  Overview  1
  1.2  Statistics  1
  1.3  Clinical Trials  2
CHAPTER 2  PREVIOUS WORK  3
  2.1  Similar Systems  3
  2.2  About the MEANS System  4
  2.3  MEANS Architecture  5
  2.4  Knowledge Entry  5
    2.4.1  Question Types  7
    2.4.2  Protocol Example  7
  2.5  MEANS Eligibility Algorithm  9
  2.6  Data Mining  9
    2.6.1  Association Rules  11
    2.6.2  Mining Medical Data  15
CHAPTER 3  IMPROVEMENTS THROUGH AUTOMATED DATA ANALYSIS  17
  3.1  Data Collection  17
    3.1.1  Implementation in a Clinical Setting  17
    3.1.2  GI Clinical Trials  17
    3.1.3  Patient Selection  18
  3.2  Experiments and Results  19
    3.2.1  About Patient Data  19
      3.2.1.1  Data Extraction  20
    3.2.2  Mining Medical Implications  21
      3.2.2.1  Subset (Data Set One)  21
      3.2.2.2  Superset (Data Set Two)  26
      3.2.2.3  Rules of Interest (Subset)  27
      3.2.2.4  Rules of Interest (Superset)  30
      3.2.2.5  Medical Validation (Subset)  32
      3.2.2.6  Medical Validation (Superset)  33
    3.2.3  Probability-Based Reordering  35
      3.2.3.1  Probability-Guided Agent  38

PAGE 5

      3.2.3.2  Testing System  42
      3.2.3.3  Probability Guided Experiments  44
    3.2.4  Data Entry Optimization  48
    3.2.5  Interface Changes  49
CHAPTER 4  CONCLUSION  53
  4.1  Implication Discoveries  53
  4.2  Probability-Based Reordering  54
  4.3  Optimization  55
  4.4  Future Work  56
REFERENCES  57

PAGE 6

LIST OF TABLES

Table 2.1  Types of Questions  7
Table 2.2  Sample Protocol: Criteria  8
Table 2.3  Sample Protocol: Encoding  8
Table 2.4  Sample Protocol: Acceptance and Rejection Expressions  8
Table 2.5  Patient's States During Screening  9
Table 2.6  Apriori Example: Database  12
Table 2.7  Apriori Example: One-item Sets  13
Table 2.8  Apriori Example: Two-item Sets  13
Table 2.9  Apriori Example: Three-item Sets  14
Table 2.10  Apriori Example: Rules  14
Table 3.1  Active Phase II GI Clinical Trials  18
Table 3.2  Patient Type in the Clinic  19
Table 3.3  Implication Example  21
Table 3.4  Subset: Patient Data  24
Table 3.5  Subset: Rule Statistics Per Protocol (s = 10%)  24
Table 3.6  Superset: Patient Data  26
Table 3.7  Superset: Rule Statistics Per Protocol (s = 5%)  27
Table 3.8  Subset: Antecedent-Consequent Matrix (s = 20%)  28
Table 3.9  Subset: Rule Analysis (s = 20%)  28
Table 3.10  Subset: Antecedent-Consequent Matrix (s = 5%)  30
Table 3.11  Superset: Antecedent-Consequent Matrix (s = 5%)  31
Table 3.12  Superset: Discovered Rules  33
Table 3.13  Example: Rule Simplification Based on Medical Knowledge  34

PAGE 7

Table 3.14  Analytical Reordering  44
Table 3.15  Probability-Based Reordering -Fold Cross Validation  45
Table 3.16  Probabilistic Thresholding -Fold Cross Validation  46
Table 3.17  Max Number Per Set  47
Table 3.18  Yes/No Question Length  49

PAGE 8

LIST OF FIGURES

Figure 2.1  Expert System Architecture  6
Figure 2.2  MEANS Eligibility Algorithm  10
Figure 3.1  Apriori Run (s = 70%)  22
Figure 3.2  Decoded Implication  23
Figure 3.3  Subset: All Medically Valid Implications  25
Figure 3.4  Subset: Valid Implications (s = 20%)  29
Figure 3.5  Subset: Rule 206 Closer Look  29
Figure 3.6  Superset: All Medically Valid Implications  36
Figure 3.7  Final Rules  37
Figure 3.8  Testing System Algorithm  43
Figure 3.9  Thresholding -Fold Cross Validation  47
Figure 3.10  MEANS Streamlined Initial Questions Page  50
Figure 3.11  Full Text Popup  50
Figure 3.12  Eligibility Color Table  51
Figure 3.13  Protocol Number and Title  51

PAGE 9

MINING MEDICAL DATA IN A CLINICAL ENVIRONMENT

Tim V. Ivanovskiy

ABSTRACT

The availability of new treatments for a disease depends on the success of clinical trials. In order for a clinical trial to be successful and approved, medical researchers must first recruit patients with a specific set of conditions in order to test the effectiveness of the proposed treatment. In the past, the accrual process was tedious and time-consuming. Since accruals rely heavily on the ability of physicians and their staff to be familiar with the protocol eligibility criteria, candidates tend to be missed. This can result and has resulted in unsuccessful trials.

A recent project at the University of South Florida aimed to assist research physicians at H. Lee Moffitt Cancer Center & Research Institute, Tampa, Florida, with a screening process by utilizing a web-based expert system, Moffitt Expedited Accrual Network System (MEANS). This system allows physicians to determine the eligibility of a patient for several clinical trials simultaneously.

We have implemented this web-based expert system at the H. Lee Moffitt Cancer Center & Research Institute Gastroenterology (GI) Clinic. Based on our findings and staff feedback, the system has undergone many optimizations. We used data mining techniques to analyze the medical data of current gastrointestinal patients. The use of the Apriori algorithm allowed us to discover new rules (implications) in the patient data. All of the discovered implications were checked for medical validity by a physician, and those that were determined to be valid were entered into the expert system. Additional analysis of the data allowed us to streamline the system and decrease the number of mouse clicks required for screening. We also used a probability-based method to reorder the questions, which decreased the amount of data entry required to determine a patient's ineligibility.

PAGE 10

CHAPTER 1
INTRODUCTION

1.1 Overview

The application of Artificial Intelligence to real-world issues has produced promising results. The work presented in this thesis is aimed toward improving the existing medical expert system MEANS, which stands for Moffitt Expedited Accrual Network System, to be more physician-friendly. This expert system is designed to be used as a tool for helping physicians screen patients for eligibility for clinical trials. It is our goal to see that the system eventually becomes part of the standard of care at Moffitt Cancer Center, Tampa, Florida, and its affiliates.

Various techniques were used to increase user friendliness and acceptability of the expert system, including changes to the interface and system flow. In addition, data mining and statistical analysis techniques were used in order to decrease the amount of data entry needed from the user.

I will begin with a brief introduction to previous work on medical expert systems in Section 2.1 and continue with an explanation of our medical expert system in Section 2.2. Chapter 3 will take us into current work and the changes that were made to the expert system, including work that was done to increase the system's user friendliness and minimize the amount of time required to determine a patient's eligibility for clinical trials. I will conclude with results and future challenges in Chapter 4.

1.2 Statistics

In the United States, cancer is among the leading causes of death. The American Cancer Society estimated 1,372,910 new cancer cases and 570,280 cancer deaths in 2005. The National Institutes of Health estimates that the cost of cancer in the United States was around $189.8 billion in 2004, of which only $64.4 billion was for direct

PAGE 11

medical costs. The remaining $125.4 billion was the cost of lost productivity due to illness or premature death [4].

1.3 Clinical Trials

Also called clinical studies, clinical trials are defined by the National Cancer Institute as: "A type of research study that tests how well new medical approaches work in people. These studies test new methods of screening, prevention, diagnosis, or treatment of a disease" [16]. Clinical trials can also be defined as "a way to evaluate the effectiveness and safety of medications or medical devices by monitoring their effects on large groups of people" [20].

PAGE 12

CHAPTER 2
PREVIOUS WORK

2.1 Similar Systems

The use of artificial intelligence in medicine is not a new concept. Researchers realized as early as the 1950s that computers could be used to aid physicians with clinical decision making. Since then, physicians and computer scientists have analyzed medical information in a way that could be used by automated decision aids with the help of artificial intelligence for a certain domain. The term "knowledge engineering" refers to the use of computer-based symbolic reasoning, including knowledge representation, acquisition, explanation, and self modification that comes from self awareness [26]. In Section 2.2, I will show that our system is capable of self modification based on Bayes' learning algorithm.

One of the earliest expert systems, called DENDRAL, was developed in early 1965 at Stanford University. Edward Feigenbaum, Joshua Lederberg and Bruce Buchanan were interested in the exploration of the mechanization of scientific reasoning and the formalization of scientific knowledge. Mass spectrometry was emerging as a technology of choice for chemical analysis. Therefore, they decided to apply their idea to the issue of how to properly represent then-existing chemical graph structures and then generate all possible structures in the mass spectral analysis domain [19, 28]. DENDRAL used rule-based if-then reasoning to deduce the molecular structure of organic chemical compounds from known mass spectrometry data and chemical analyses. The project also created the standards for expert systems by separating the internal operations of the system from the explicit rules of the knowledge [28].

The DENDRAL project gave rise to a famous antimicrobial therapy consultation system called MYCIN [25, 27, 28]. Shortliffe et al. set out to create a system that would be compatible with the physician's own decision-making process, and MYCIN was the result of their efforts.

PAGE 13

It was designed to be used as a tool to assist physicians with the selection of antibiotic treatments for patients with bacterial infections. It was actually the first expert system to incorporate in its design a separate and modifiable knowledge base that consisted of if-then, or "PREMISE" and "ACTION", rules as described by Shortliffe et al. [25].

The success of DENDRAL and MYCIN not only signified major achievements in artificial intelligence, but also helped touch off the development of expert systems. EMYCIN, which actually moved into the commercial software market, was developed from MYCIN. EMYCIN was a generic, domain-independent expert system shell that could be used to build a rule-based expert system in any domain [28]. Janice S. Aikins et al. used the EMYCIN framework to build PUFF, a medical expert system for the interpretation of pulmonary function tests for patients with lung disease [3]. Another spin-off from MYCIN was a teaching expert system called NEOMYCIN. It was a combination of the knowledge base of MYCIN and a teaching program called GUIDON [7].

2.2 About the MEANS System

The medical expert system MEANS (Moffitt Expedited Accrual Network System) was originally developed at the University of South Florida around 1998 by Fletcher et al. [5]. The purpose of the system was to automate the selection of patients for cancer clinical trials. Since its conception, MEANS has been improved dramatically with the addition of various optimization techniques. The original version, developed by Fletcher et al., was a qualitative rule-based system. Cost-effectiveness of the selection process was addressed by Kokku et al. In their work, Kokku et al. looked at the cost-optimization problem of ordering related tests and minimizing the pain factor associated with the number of tests needed for answering eligibility criteria questions [18]. Physicians could reduce the overall cost of the screening process by ordering less expensive tests earlier in the process and then using those results to rule out a patient [9]. Savvas Nikiforou created a Knowledge Entry system in 2002 [21]. Until then, protocol expressions had to be coded by a programmer, which took a significant amount of time and effort. Goswami et al. experimented with Bayes' probabilistic reordering agent on retrospective data. Their experiments showed that the application of Bayes' probabilistic reordering agent

PAGE 14

could reduce data entry by more than 20% [9, 13]. Probability-based reordering was added to the expert system by Goswami et al.; however, until now probabilistic knowledge gathering was not properly implemented in MEANS due to conceptual differences in the test system that Goswami et al. used to test the probability-based reordering agent. In Section 3.2.3 I will explain the additional modifications that were made to MEANS.

2.3 MEANS Architecture

MEANS is divided into two main parts: Patient Assignment and Knowledge Entry. The development of the Patient Assignment subsystem was started by Fletcher et al. [5]. Kokku et al. continued development of the subsystem and implemented heuristics for ordering medical tests to minimize overall cost [18]. Goswami et al. added a probabilistic agent to the system [13]. The Knowledge Entry subsystem was developed by Nikiforou. It has a friendly web-based user interface which allows encoding of clinical protocols for use by the Patient Assignment system [21]. To fully complete the system, I refined the implementation of the probabilistic agent, streamlined the system flow and added a reporting subsystem. The detailed system architecture is shown in Figure 2.1.

As seen in Figure 2.1, the knowledge base contains information about clinical trials (stored as general and domain knowledge) together with implications and probabilistic knowledge. The probabilistic knowledge is gathered by the system every time an answer is evaluated by MEANS. Once a sufficient number of patients have been screened, the probabilistic knowledge can be used to reorder the questions. The database contains the information on all previously screened patients, which includes the trials for which patients were screened as well as their current eligibility for screened trials and any previously provided answers. At any time, the user has the ability to change or delete an answer via the web-based user interface.

2.4 Knowledge Entry

The medical knowledge acquisition for MEANS is done via the Knowledge Entry subsystem, as seen in Figure 2.1. A protocol must be encoded using only the types of questions listed in Table 2.1. For a numeric question, a range of numeric values is provided as an answer;

PAGE 15

Figure 2.1 Expert System Architecture

PAGE 16

for a yes/no question, yes or no may be selected as an eligible answer; for a multiple-choice question, all possible values are provided. The Knowledge Entry subsystem permits the use of logical expressions, such as AND and OR, to create combined questions. Once a protocol is encoded, the Knowledge Entry subsystem generates two expressions per question: (1) the Acceptance Expression, and (2) the Rejection Expression.

2.4.1 Question Types

All of the protocols are encoded utilizing three types of questions: (1) Numeric, (2) Yes/No, and (3) Multiple-choice. A sample of how each question appears in MEANS is shown in Table 2.1.

Table 2.1 Types of Questions

Question Type     Example   Acceptable Answer
Numeric           [image]   Numeric Value; Normal AutoFill; Defer For Later
Yes/No            [image]   Yes; No; Defer For Later
Multiple Choice   [image]   Any Value From Dropdown

2.4.2 Protocol Example

This section shows how a simplified protocol can be encoded for use in MEANS. First, the protocol eligibility (inclusion/exclusion) criteria are interpreted. A sample protocol is shown in Table 2.2. Then, the protocol is broken down into separate questions, and the questions and their appropriate answers are entered into the Knowledge Entry system, as shown in Table 2.3. Lastly, the Knowledge Entry subsystem converts the protocol into two types of expressions: (1)

PAGE 17

the Acceptance Expression and (2) the Rejection Expression. Both expressions for our sample protocol are listed in Table 2.4. I will discuss the need for both expressions in Section 2.5.

Table 2.2 Sample Protocol: Criteria

Inclusion Criteria
  1. Patient must be male between the ages of 40 and 55
  2. Have life expectancy of greater than 12 weeks
  3. Able and willing to give written consent
  4. Have pathological diagnosis of adenocarcinoma of the stomach
Exclusion Criteria
  1. Current tobacco user
  2. Participation in other clinical trials

Table 2.3 Sample Protocol: Encoding

#   Question                                                       Eligibility Answer
1.  Age                                                            ≥ 40 and ≤ 55
    AND Sex                                                        Male
2.  Life expectancy of >12 weeks?                                  Yes
3.  Able and willing to give written consent?                      Yes
4.  Pathological diagnosis of adenocarcinoma of the stomach?       Yes
5.  Current tobacco user?                                          No
6.  Patient on any other trial?                                    No

Table 2.4 Sample Protocol: Acceptance and Rejection Expressions

Acceptance Expression
  {Age: ≥ 40 and Age: ≤ 55 and Sex = Male} AND
  Life expectancy of >12 weeks = Yes AND
  Able and willing to give written consent = Yes AND
  Pathological diagnosis of adenocarcinoma of the stomach = Yes AND
  Current tobacco user = No AND
  On any other trial = No

Rejection Expression
  {Age: < 40 or Age: > 55 or Sex = Female} OR
  Life expectancy of >12 weeks = No OR
  Able and willing to give written consent = No OR
  Pathological diagnosis of adenocarcinoma of the stomach = No OR
  Current tobacco user = Yes OR
  On any other trial = Yes
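The relationship between the two expressions in Table 2.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the question keys, the `SAMPLE_PROTOCOL` list and the `acceptance` and `rejection` functions are hypothetical names chosen here, not part of MEANS, and the sketch assumes every question has already been answered.

```python
from typing import Callable, Dict, List, Tuple

# One criterion = (question key, predicate that is True for an eligible answer).
# The keys below are hypothetical labels for the questions in Table 2.3.
Criterion = Tuple[str, Callable[[object], bool]]

SAMPLE_PROTOCOL: List[Criterion] = [
    ("age", lambda v: 40 <= v <= 55),
    ("sex", lambda v: v == "Male"),
    ("life_expectancy_over_12_weeks", lambda v: v == "Yes"),
    ("written_consent", lambda v: v == "Yes"),
    ("adenocarcinoma_of_stomach", lambda v: v == "Yes"),
    ("current_tobacco_user", lambda v: v == "No"),
    ("on_any_other_trial", lambda v: v == "No"),
]


def acceptance(protocol: List[Criterion], answers: Dict[str, object]) -> bool:
    """Acceptance expression: every criterion holds (criteria joined by AND)."""
    return all(is_eligible(answers[q]) for q, is_eligible in protocol)


def rejection(protocol: List[Criterion], answers: Dict[str, object]) -> bool:
    """Rejection expression: at least one criterion fails (joined by OR)."""
    return any(not is_eligible(answers[q]) for q, is_eligible in protocol)


candidate = {
    "age": 47,
    "sex": "Male",
    "life_expectancy_over_12_weeks": "Yes",
    "written_consent": "Yes",
    "adenocarcinoma_of_stomach": "Yes",
    "current_tobacco_user": "No",
    "on_any_other_trial": "No",
}
smoker = dict(candidate, current_tobacco_user="Yes")
```

With complete answers the two expressions are logical complements (De Morgan's law), so `candidate` is accepted and `smoker` is rejected. Keeping both becomes useful once some answers are missing, because a single failing answer is enough to evaluate the rejection expression.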

PAGE 18

2.5 MEANS Eligibility Algorithm

As discussed in Section 2.4, the Knowledge Entry subsystem is used to encode a protocol's eligibility criteria into the form that will be interpreted by MEANS. At any given time, a patient's status for a protocol, which is determined by the assignment agent (Figure 2.1), is in one of three states: (1) Eligible, (2) Ineligible, or (3) More Information Needed. If the acceptance criteria can be evaluated and are TRUE, then the patient is Eligible for the protocol. If the rejection criteria can be evaluated and are TRUE, then the patient status is set to Ineligible. If neither the acceptance nor rejection criteria can be evaluated, then the patient's status is set to More Information Needed. The reason for the determination of each state is listed in Table 2.5.

The MEANS eligibility algorithm is shown in Figure 2.2. The eligibility algorithm is run for each protocol. Also, I should mention that the user is only presented with 10 questions at a time (line 15). Each time a user provides the system with new answers, the eligibility criteria are evaluated (line 18). The discovery of new implications will allow questions that have not been asked to be answered by the system, thus decreasing the amount of data entry (Figure 2.2, line 17).

Table 2.5 Patient's States During Screening

State                     Reason
Eligible                  Acceptance Criteria = TRUE
Ineligible                Rejection Criteria = TRUE
More Information Needed   Acceptance or Rejection can't be determined

2.6 Data Mining

Data mining techniques allow researchers to search through enormous amounts of data in order to discover association rules, emerging patterns and dependency rules. Data mining has been successfully applied to a number of application domains, including telecommunications, commerce, astronomy, geological survey, security, census analysis and text analysis [24, 30]. The identification of sets of items, products, symptoms and characteristics that often occur together in a given database is often seen as a basic task of data mining.

PAGE 19

 1: for each protocol
 2: {
 3:   if NewPatient
 4:   {
 5:     ProtocolStatus = "MoreInfoNeeded";
 6:   }
 7:   while ProtocolStatus == "MoreInfoNeeded" do
 8:   {
 9:     Read Unanswered Questions;
10:     if Unanswered Questions List == 0
11:     {
12:       Read Deferred Questions;
13:     }
14:     Sort Questions By Importance;
15:     Ask Questions From User;
16:     Read NEW Answers;
17:     Apply Implications;
18:     Evaluate Acceptance Criteria;
19:     if AcceptanceCriteria == TRUE
20:     {
21:       Generate Reason For Eligibility;
22:       ProtocolStatus = ELIGIBLE;
23:     }
24:     else
25:     {
26:       Evaluate Rejection Criteria;
27:       if RejectionCriteria == TRUE
28:       {
29:         Generate Reason For Ineligibility;
30:         ProtocolStatus = INELIGIBLE;
31:       }
32:     }
33:     if AcceptanceCriteria ≠ TRUE and RejectionCriteria ≠ TRUE
34:     {
35:       ProtocolStatus = MoreInfoNeeded;
36:     }
37:   }
38: }

Figure 2.2 MEANS Eligibility Algorithm
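Lines 17 to 36 of the algorithm above can be illustrated with a short Python sketch. It is not the MEANS implementation: the function names, the dictionary-based answer store and the sample implication are assumptions made for illustration. Unanswered questions are simply absent from the answers dictionary, which is how the third state arises.

```python
from typing import Callable, Dict, List, Optional, Tuple

ELIGIBLE = "Eligible"
INELIGIBLE = "Ineligible"
MORE_INFO = "More Information Needed"

Answers = Dict[str, str]
Criterion = Tuple[str, Callable[[str], bool]]
# Hypothetical implication format: ({question: answer, ...}, (question, answer))
Implication = Tuple[Dict[str, str], Tuple[str, str]]


def apply_implications(answers: Answers, implications: List[Implication]) -> Answers:
    """Fill in unanswered questions whose answer follows from known answers."""
    derived = dict(answers)
    changed = True
    while changed:  # repeat so that chained implications are also applied
        changed = False
        for antecedent, (question, value) in implications:
            holds = all(derived.get(q) == v for q, v in antecedent.items())
            if holds and question not in derived:
                derived[question] = value
                changed = True
    return derived


def protocol_status(protocol: List[Criterion], answers: Answers) -> str:
    """Return one of the three states listed in Table 2.5."""
    outcomes: List[Optional[bool]] = [
        check(answers[q]) if q in answers else None for q, check in protocol
    ]
    if any(o is False for o in outcomes):  # rejection criteria evaluate to TRUE
        return INELIGIBLE
    if all(o is True for o in outcomes):   # acceptance criteria evaluate to TRUE
        return ELIGIBLE
    return MORE_INFO                       # neither can be determined yet


# Hypothetical two-question protocol and one hypothetical implication.
protocol = [
    ("metastatic", lambda a: a == "No"),
    ("current_tobacco_user", lambda a: a == "No"),
]
implications = [({"biopsy": "Yes"}, ("metastatic", "No"))]

known = apply_implications({"biopsy": "Yes"}, implications)
```

In this example the implication answers the "metastatic" question without asking the user, so only the tobacco question remains before the status can leave More Information Needed.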

PAGE 20

2.6.1 Association Rules

The problem of association rule mining, together with an algorithm for its solution, was originally introduced by Agrawal et al. in 1993 [1]. It was designed for per-transaction analysis of a supermarket database. The goal was to identify associations between sets of items with some minimal confidence. In 1994, Agrawal et al. proposed a faster algorithm called Apriori to solve the association rule problem [2]. The problem can be described as follows:

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of all items. Let $D$ be a database of transactions, where each transaction $T$ consists of a set of items such that $T \subseteq I$. A transaction $T$ contains $X$, a set of items in $I$, if $X \subseteq T$. An association rule is written in the form of an implication $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. An association rule must have a confidence and a support. The rule $X \Rightarrow Y$ holds with confidence $c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$ [1, 2].

The original motivation for the Apriori algorithm came from the need to analyze supermarket data so that customer behavior could be examined in terms of purchased products. How often the products are purchased together is described by a frequent set. An example of such an association rule is: 90% of transactions that had butter and eggs also contained milk. In this example, the confidence $c$ is 0.9 (90%).

Association rules are discovered by looking at itemsets that have a specified minimum support $s$ (coverage). An itemset is a combination of attribute-value pairs. First we generate one-itemsets for all attribute-value pairs whose support is at least the given minimum support value. The next step of the algorithm is the generation of two-itemsets (two attribute-value pairs). We must note that we only keep new itemsets if their support is at least the given minimum support value. Itemsets in which an attribute takes two separate values will not be generated because that is impossible. For example, it is impossible for the temperature to be hot and cold at the same time; therefore, such an itemset will not be generated. The algorithm will generate new itemsets until no more sets can be generated with the given minimum support.
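As a minimal illustration of the two definitions above, the Python sketch below computes support and confidence for the butter, eggs and milk rule. The transaction list is invented for the example (it is not data from the thesis), and the function names are hypothetical.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]

# Invented supermarket data: 20 transactions, 10 of which contain butter and
# eggs, and 9 of those 10 also contain milk.
TRANSACTIONS: List[Transaction] = (
    [frozenset({"butter", "eggs", "milk"})] * 9
    + [frozenset({"butter", "eggs"})]
    + [frozenset({"bread"})] * 10
)


def count(itemset: FrozenSet[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x: FrozenSet[str], y: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of all transactions that contain X union Y."""
    return count(x | y, transactions) / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    return count(x | y, transactions) / count(x, transactions)


X = frozenset({"butter", "eggs"})
Y = frozenset({"milk"})
rule_support = support(X, Y, TRANSACTIONS)        # 9 / 20
rule_confidence = confidence(X, Y, TRANSACTIONS)  # 9 / 10
```

The rule has 90% confidence but only 45% support in this invented data, which shows why the two measures are thresholded separately.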

PAGE 21

Here is an example of how the algorithm works: Suppose we have a database D that consists of five patients (Table 2.6). From this table, we can see that our database consists of 7 attributes and 17 values. Using a minimum support s of 3, we can find all possible association rules. I will discuss the confidence level c after we find the rules from our database D.

Table 2.6 Apriori Example: Database (Sample Patients)

     Attributes                                                        Goal
#    Metastatic   Biopsy   Complaint   Severity   Race      Age        Treatment
1    No           Yes      Some        High       French    10 to 20   Yes
2    Yes          No       Some        Low        Russian   10 to 20   Yes
3    No           Yes      None        Low        German    40 to 70   No
4    No           Yes      Some        Low        German    20 to 40   Yes
5    No           No       None        High       French    >70        No

First we must create one-itemsets from our database. We will scan our database and locate all possible attribute-value pairs that have occurred. We can see that our database D contains 17 one-itemsets (Table 2.7). After all one-itemsets are found, they are thresholded by the given minimum support. If the support (frequency) of an itemset is below the given minimum support, then the itemset is excluded from further investigation. From Table 2.7 we can see that only 5 one-itemsets have support above or equal to the given minimum support of 3.

For the next step we will only use the itemsets that have passed the minimum support test. In this step we will use one-itemsets to compose two-itemsets. The resulting two-itemsets are shown in Table 2.8. We ended up with 10 two-itemsets; however, only 2 out of the 10 pass the minimum support check. Only the first two two-itemsets will be used in the next step of generating three-itemsets. Table 2.9 contains the resulting three-itemsets. Since our minimum support was set at 3, none of the newly generated three-itemsets pass the minimum support check; hence, we will stop the generation of itemsets at this time and will return to our two-itemsets.

By looking at the two-itemsets (Table 2.8) we will generate association rules. The combination of Metastatic = No with Biopsy = Yes occurs in 3 out of the 4 patients with Metastatic = No; however, the

PAGE 22

Table 2.7 Apriori Example: One-item Sets

#    Attribute    Value      Support   Min Sup Check
1    Metastatic   No         4         X
2    Biopsy       Yes        3         X
3    Complaint    Some       3         X
4    Severity     Low        3         X
5    Treatment    Yes        3         X
6    Biopsy       No         2
7    Complaint    None       2
8    Severity     High       2
9    Race         French     2
10   Race         German     2
11   Age          10 to 20   2
12   Treatment    No         2
13   Metastatic   Yes        1
14   Race         Russian    1
15   Age          40 to 70   1
16   Age          20 to 40   1
17   Age          >70        1

Table 2.8 Apriori Example: Two-item Sets

#    Attribute/Value                      Support   Min Sup Check
1    Metastatic=No, Biopsy=Yes            3         X
2    Complaint=Some, Treatment=Yes        3         X
3    Metastatic=No, Complaint=Some        2
4    Metastatic=No, Severity=Low          2
5    Metastatic=No, Treatment=Yes         2
6    Biopsy=Yes, Complaint=Some           2
7    Biopsy=Yes, Severity=Low             2
8    Biopsy=Yes, Treatment=Yes            2
9    Complaint=Some, Severity=Low         2
10   Severity=Low, Treatment=Yes          2

PAGE 23

Table 2.9 Apriori Example: Three-item Sets

#    Attribute/Value                                  Support   Min Sup Check
1    Metastatic=No, Biopsy=Yes, Complaint=Some        2
2    Metastatic=No, Biopsy=Yes, Treatment=Yes         2
3    Complaint=Some, Treatment=Yes, Metastatic=No     2
4    Complaint=Some, Treatment=Yes, Biopsy=Yes        2

combination of Biopsy = Yes with Metastatic = No occurs in 3 out of the 3 patients with Biopsy = Yes. These two rules are interpreted as follows: IF Metastatic = No THEN Biopsy = Yes; IF Biopsy = Yes THEN Metastatic = No.

From our sample database (Table 2.6) we were able to discover 4 rules with a minimum support of 3. I listed all of the discovered rules in Table 2.10. I should also point out that the table contains a confidence column. According to the table, rule #1 holds in 3 out of the 4 cases in which its antecedent occurs in our database. This gives the rule a confidence level of 75% (c = 0.75). The other 3 rules have a confidence level of 100% (c = 1).

Table 2.10 Apriori Example: Rules

#    Rule                               Support      Confidence
1    Metastatic=No ⇒ Biopsy=Yes         3 out of 4   75.00%
2    Biopsy=Yes ⇒ Metastatic=No         3 out of 3   100.00%
3    Complaint=Some ⇒ Treatment=Yes     3 out of 3   100.00%
4    Treatment=Yes ⇒ Complaint=Some     3 out of 3   100.00%

Since we will be mining medical data, we must use a confidence level of 1 (c = 1). Based on that, we would only be interested in rules 2, 3, and 4. To complete our process of medical data mining, rules 2, 3, and 4 will need to be validated by a physician before we can declare them to be valid rules.
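The worked example can be reproduced with a compact level-wise search in Python. This is a simplified sketch rather than the implementation used in the thesis: all names are hypothetical, support is an absolute patient count, and candidates are formed with the standard Apriori join of two frequent itemsets that differ in one item. That join produces no three-item candidates here, so it does not enumerate the rows of Table 2.9, but it reaches the same conclusion that no three-itemset meets the minimum support.

```python
from itertools import combinations

COLUMNS = ["Metastatic", "Biopsy", "Complaint", "Severity", "Race", "Age", "Treatment"]
ROWS = [
    ["No", "Yes", "Some", "High", "French", "10 to 20", "Yes"],
    ["Yes", "No", "Some", "Low", "Russian", "10 to 20", "Yes"],
    ["No", "Yes", "None", "Low", "German", "40 to 70", "No"],
    ["No", "Yes", "Some", "Low", "German", "20 to 40", "Yes"],
    ["No", "No", "None", "High", "French", ">70", "No"],
]
# Each patient becomes a set of (attribute, value) pairs.
PATIENTS = [frozenset(zip(COLUMNS, row)) for row in ROWS]


def support(itemset):
    """Number of patients whose record contains every pair in the itemset."""
    return sum(1 for p in PATIENTS if itemset <= p)


def frequent_itemsets(min_support):
    """Return a list of levels; level k holds the frequent (k+1)-itemsets."""
    items = {item for p in PATIENTS for item in p}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    levels = []
    while level:
        levels.append(level)
        size = len(level[0]) + 1
        # Join step: unions of two frequent itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = [
            c for c in candidates
            if len({attr for attr, _ in c}) == size  # one value per attribute
            and support(c) >= min_support
        ]
    return levels


def rules(levels):
    """Single-consequent rules (antecedent, consequent, confidence)."""
    found = []
    for level in levels[1:]:
        for itemset in level:
            for item in itemset:
                antecedent = itemset - {item}
                found.append((antecedent, item, support(itemset) / support(antecedent)))
    return found


levels = frequent_itemsets(min_support=3)
all_rules = rules(levels)
certain_rules = [r for r in all_rules if r[2] == 1.0]
```

Run on the five sample patients, the sketch finds 5 frequent one-itemsets, 2 frequent two-itemsets and 4 rules, of which 3 have a confidence of 1, matching Tables 2.7, 2.8 and 2.10.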

PAGE 24

2.6.2 Mining Medical Data

As seen in Figure 2.1, implications are part of the knowledge base. In Section 3.2.2 I will explore implications in greater detail, but for now a brief overview will be sufficient in order to convey our purpose. To decrease the amount of data entry needed to determine a patient's eligibility, we explored the use of data mining.

Detection of trends and anomalies in populations has produced significant advances. In August of 1854 Dr. John Snow used statistical observations to find the source of cholera transmission, thus stopping the cholera outbreak in Britain. Edward Jenner used the observation that milkmaids who suffered from the mild disease of cowpox never contracted the more serious smallpox. After he conducted his famous experiment in 1796, he published his work in 1798, in which he coined the term "vaccine" from the Latin word vacca, or "cow" [6].

One of the problems that a data miner is faced with while mining medical data is that, unlike other domains, the medical discipline is in itself diverse and complex. Sensible data mining requires significant domain expertise and, as a result, an active collaboration between the data miner and the domain specialist [24]. Data availability and accuracy pose another issue for the data miner. Gathering medical data is a tedious process, and any lack of completeness or level of detail may render the data useless.

Ethical and legal issues also need to be addressed when mining medical data. If, for example, the following rule was discovered:

Zip Code = 12345; Age = …-25; Gender = Male ⇒ Hepatitis B Status = Yes (…%)

then such a rule may not only be disturbing but could also be considered offensive should the Zip Code 12345 refer to an indigenous community [10, 24].

Researchers at the Pediatric Brain Tumor Research Program at Children's Memorial Hospital in Chicago used data mining when they performed gene expression analysis for pediatric cancers. They were able to isolate pediatric leukemia CD markers (antibodies that bind to proteins on the surfaces of white blood cells and leukemic cells) and hope to use that knowledge to improve existing methods of diagnosis and treatment of the disease [14].

PAGE 25

Our goal was to use data mining techniques, specifically the Apriori algorithm, to discover new association rules from current patient data and encode newly discovered rules into the MEANS implication subsystem [1, 2, 11].

PAGE 26

CHAPTER 3
IMPROVEMENTS THROUGH AUTOMATED DATA ANALYSIS

3.1 Data Collection

3.1.1 Implementation in a Clinical Setting

Before the system could be used by physicians, it had to be properly and securely set up for use. Since the system is a web-based tool, no installation was needed on the physicians' computers and only an Internet Explorer shortcut had to be placed on the desktop. Necessary precautions were taken to comply with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule of 1996. HIPAA is the first Federal protection rule for the privacy of personal health information [22].

The MEANS system was launched on February 7, 2006 at Moffitt Cancer Center's Gastrointestinal (GI) clinic. The decision was made to introduce the system to nurses and physicians at the same time. Because the system was part of the study and had a separate study protocol number, all of the patients that were screened by the system had to sign a five-page HIPAA research authorization. Only after the HIPAA authorization was signed by the patient or guardian could the patient be screened. Since the goal of the system was to screen patients that come to the clinic for the first time, only those who were seeing a gastroenterologist for the first time were considered as potential candidates for screening. As mentioned earlier, only after a patient signed the HIPAA authorization was his or her information entered into the system.

3.1.2 GI Clinical Trials

During April of 2006 the number of all active clinical trials at the Moffitt Cancer Center was 140 [15]. Since our study was aimed only at Phase II clinical trials, only active and open

PAGE 27

for enrollment Phase II clinical trials were entered into MEANS. At the time of this study, Moffitt Cancer Center had 36 active Phase II clinical trials, five of which were conducted in the GI Clinic. The five active Phase II GI clinical trials, listed in Table 3.1, were encoded into MEANS.

Table 3.1 Active Phase II GI Clinical Trials

#   Protocol #   Trial Description
1   13424        A Phase II Study of Capecitabine in Combination with Irinotecan and Oxaliplatin (Eloxatin) in Adult Patients with Advanced Colorectal Cancer
2   13426        A Phase II Study of Cisplatin and Irinotecan Induction Chemotherapy, Followed by ZD1839 (IRESSA) in Adult Patients with Surgically Unresectable and/or Metastatic Esophageal or Gastric Carcinomas
3   13449        Microarray Analysis of Colon Cancer Outcome - A (MACCO-A). For advanced or metastatic non-resectable colon cancer
4   13946        Pharmacogenomic Study of Neoadjuvant Pre-irradiation Docetaxel and Cisplatin, followed by Neoadjuvant Concomitant Docetaxel, Cisplatin and Irradiation, followed by Surgery (DC-DCR-S) in Adult Patients with Operable Adenocarcinomas of the Esophagus or Gastroesophageal Junction
5   14607        Randomized phase II study of AG-013736 and Gemcitabine in Chemo-Naive Pancreatic Cancer Patients

3.1.3 Patient Selection

We collected our data at the GI Clinic of the Moffitt Cancer Center. There are three types of patients in the clinic: New Patient (NP), New Established Patient (NEP) and Established Patient (EP). Table 3.2 provides explanations of each patient type. It was our goal to screen every new patient for eligibility for clinical trials in the GI Clinic; hence, our study focused only on NP and NEP patients.

Table 3.2 Patient Type in the Clinic

  Type                            Explanation
  New Patient (NP)                Never seen the doctor at the clinic
                                  (first-time visit)
  New Established Patient (NEP)   Previously been seen by another doctor
                                  at the clinic
  Established Patient (EP)        Previously seen this doctor
                                  (repeat visit)

3.2 Experiments and Results

This section describes the data mining experiments that we conducted and our findings.

3.2.1 About Patient Data

We collected data on current NP and NEP patients at the GI Clinic of Moffitt Cancer Center. Since patients were going through treatment, not all patient data (such as test results) were available during the initial screening. MEANS tracks each patient via a patient profile. Each patient profile contains the list of protocols for which he or she was screened together with the list of all answered questions. A patient can be determined ineligible at any time during the screening process. If a patient was ruled out after completion of the initial questions page, the number of answered questions could be 10 (if all of the questions on the initial page were answered); however, if a patient was ruled out by the last question in the eligibility criteria of each protocol, the number of answered questions in the patient's profile could be 90. Consequently, each patient profile could contain a different number of questions.

Since we were working on a new system, there was no archival data available for analysis. We decided to start mining the patient data after collecting it for approximately three weeks. Therefore, our data was divided into two sets: Subset (Data Set One), which consisted of 161 GI patients whose data was collected from February 7th to March 1st, and Superset (Data Set Two), which is the extension of the Subset and included GI patient data from February 7th to April 21st.

I must clarify that the data used in all of the experiments did not contain any potentially identifiable patient information. Each anonymous patient profile was assigned a number, and that number was used during the experiments.

3.2.1.1 Data Extraction

Before patient data was mined, some preprocessing was done. We used WEKA, a data mining software toolkit from the University of Waikato, New Zealand, to discover new association rules [29]. To convert the data from the MEANS format to a format usable by WEKA, a suite of tools was developed by the author of this thesis. These tools permitted the necessary pre- and post-processing, such as data extraction, analysis, cleaning, formatting and rule recovery. The suite was written in C and C++. The tools are included on the CD that accompanies this thesis.

The Apriori algorithm does not accept continuous values, so the questions that had continuous values as an answer needed to be excluded from the list of questions. The encoding of the GI protocols into MEANS produced a total of 225 questions. Of the 225 questions, 33 required continuous values. Default questions answered during the streamlining process, such as Signed Informed Consent, would not contribute to the discovery of new association rules since all patients had the same answer; therefore, default questions needed to be excluded from patient profiles.

Using the pre-processing tools, patient data was analyzed. Continuous-valued questions, as well as default questions, were removed from patient profiles. The maximum number of questions that a patient profile could have was 192. Each patient's information could then be viewed as a vector with a size between 0 (zero) and 192. Patient data was retrieved from MEANS and stored in a single .arff file so it could be read by the data mining software package. The details of the .arff format are covered in WEKA's documentation [29].

Because we were looking for implications that always hold, we used a confidence (c) of 1 (100%) for all of our experiments and varied the minimum support (s) from run to run. The data mining software we used starts each run at s = 100% and decreases s by 5% until a given minimum support is reached.
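The pre-processing and the support schedule described above can be illustrated with a short sketch. This is not the C/C++ tool suite from the thesis CD; the function names, the dictionary-based profile layout and the question identifiers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical layout: one dict per patient, question id -> discrete answer.
Profile = Dict[str, str]


def preprocess(profiles: List[Profile],
               continuous: Set[str],
               defaults: Set[str]) -> List[Profile]:
    """Drop continuous-valued and default questions from every profile and
    discard profiles that are left with no answers at all."""
    excluded = continuous | defaults
    cleaned = []
    for profile in profiles:
        kept = {q: a for q, a in profile.items() if q not in excluded}
        if kept:  # profiles holding only excluded questions carry no information
            cleaned.append(kept)
    return cleaned


def support_schedule(min_support_pct: int,
                     start_pct: int = 100,
                     step_pct: int = 5) -> List[int]:
    """Support levels visited in one run: start at 100% and step down by 5%
    until the requested minimum support is reached."""
    return list(range(start_pct, min_support_pct - 1, -step_pct))
```

For example, a profile that contains only the default question is discarded, while a profile with one discrete, non-default answer keeps just that answer.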

3.2.2 Mining Medical Implications

As seen in Figure 2.1, implications are part of the knowledge base. Implications allow us to make inferences based on existing information, thereby decreasing the number of questions that need to be asked for eligibility determination.

Example: Based on the sample questions in Table 3.3, we can construct several implications. If Q1 is answered Male, and since it is known that only females can be pregnant, then the following implication can be made: If Q1 = Male IMPLIES Q2 = No. This implication is always true (confidence = 100%). An implication that has a confidence < 100% cannot be considered a valid implication; for example, Q2 = No IMPLIES Q1 = Male is false in some cases (a female who is not pregnant).

Table 3.3 Implication Example

  Q1 = Patient's Sex (Male/Female)?
  Q2 = Pregnant (Yes/No)?
  Q3 = Patient's Age?

  Valid Implications (Confidence = 100%):
    If Q1 = Male IMPLIES Q2 = No
    If Q1 = Female and Age > 70 IMPLIES Q2 = No

  Invalid Implication (Confidence < 100%):
    If Q2 = No IMPLIES Q1 = Male

Our goal was to use data mining techniques, specifically the Apriori algorithm, for the generation of association rules. The association rules were mined using WEKA [2, 29]. The encoding of the MEANS data (pre-processing), the decoding of the rules, and the discovery of the rules of interest (post-processing) were performed using pre- and post-processing tools that were written by the author.

3.2.2.1 Subset (Data Set One)

During the pre-processing phase we found that of the 175 patients, 14 patients contained only default answers in their profile. These patients would not provide any useful information and so were not considered in this experiment. We also found that there were 33 questions that required continuous values for an answer. Since the Apriori algorithm does not deal with

continuous values, these 33 continuous-valued questions were excluded from the remaining patient profiles. The result of the pre-processing was a single file that could be viewed as 161 patient vectors with 192 possible dimensions.

For this experiment, 161 records (each with a possible 192 dimensions) were mined. Since we were looking for implications that always hold, a confidence of 100% (c = 1) was used. We started with a minimum support (s) of 95% and decreased it by 5% until s = 5%. The first run that produced any result was at a minimum support of 70% (Figure 3.1). With a minimum support of 70%, only 4 rules were generated at c = 100%. After the rules were decoded, the analysis of the rules shown in Figure 3.1 sparked some interest. It was found that rules two, three and four were permutations of rule one:

  000 y1y=No ==> 009 y1y=No

Rule one had a coverage of 128, which is greater than that of the other rules generated during this run.

Figure 3.1 Apriori Run (s = 70%)
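To make the terms coverage, support and confidence concrete, the following sketch computes them for a single candidate rule over a list of patient records. It uses the usual textbook definitions, which is an assumption about how the counts above were obtained; the function names and the toy records (modeled on the sex and pregnancy example of Table 3.3) are illustrative only.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]   # question id -> answer
Items = Dict[str, str]    # required question/answer pairs


def matches(record: Record, items: Items) -> bool:
    """True when the record contains every required question/answer pair."""
    return all(record.get(q) == a for q, a in items.items())


def rule_stats(records: List[Record],
               antecedent: Items,
               consequent: Items) -> Tuple[int, float, float]:
    """Return (coverage, support, confidence) for antecedent ==> consequent.

    coverage   = number of records matching the antecedent
    support    = fraction of all records matching antecedent and consequent
    confidence = fraction of covered records that also match the consequent
    """
    coverage = sum(1 for r in records if matches(r, antecedent))
    both = sum(1 for r in records
               if matches(r, antecedent) and matches(r, consequent))
    support = both / len(records) if records else 0.0
    confidence = both / coverage if coverage else 0.0
    return coverage, support, confidence


toy = [
    {"sex": "Male", "pregnant": "No"},
    {"sex": "Male", "pregnant": "No"},
    {"sex": "Female", "pregnant": "No"},
    {"sex": "Female", "pregnant": "Yes"},
]
# Holds in every covered record, so its confidence is 1.0.
always_true = rule_stats(toy, {"sex": "Male"}, {"pregnant": "No"})
# Fails for the non-pregnant female, so its confidence is below 1.0.
sometimes_false = rule_stats(toy, {"pregnant": "No"}, {"sex": "Male"})
```

Only rules of the first kind survive a run with c = 1; the second rule would be discarded regardless of its support.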

The decoded version of rule one is shown in Figure 3.2. This new association rule was reviewed by a physician and deemed medically valid. I should note that question [000 y1y] belongs to protocol #13426 and question [009 y1y] belongs to protocol #13946, so we had found an association rule between two protocols. This was a promising beginning: a simple, medically valid implication. This implication was added to the implication module of MEANS.

  1. 000 y1y=No 128 ==> 009 y1y=No 128 conf:(1)

  Question: [000 y1y] [Path diagnosis of esophageal SCC or ACA?] = [No]
  Occurred [128] Times:
  IMPLIES
  Question: [009 y1y] [Tissue confirmed esophageal adenocarcinoma or
  adenocarcinoma of the gastroesophageal junction] = [No]
  Occurred [128] Times:
  With conf: (1)

Figure 3.2 Decoded Implication

We continued running experiments by decreasing the minimum support (s). As mentioned earlier, each run starts with s = 100% and decreases s by 5% until a given minimum support is reached; therefore, the rules that were recovered at s = 70% also showed up during subsequent runs when the minimum support was < 70%. Since 4 rules may not have encompassed everything of interest, I continued to decrease the minimum support. When s was decreased to 60%, 14 new rules were generated. Table 3.4 contains the association rule summary for the subset.

We were also curious whether there were any implications within the set of questions for each protocol. During data collection, some of the clinicians chose to screen some patients only for a subset of our trials; therefore, not all patients were screened for all five trials. Table 3.5 shows the results of the per-protocol mining experiment at s = 10% on the subset. Only protocols 13424 and 13946 produced any rules of interest. However, when these rules were examined for medical validity, none of the 13 rules for 13424 and 15 rules for 13946 were medically valid.

Since a very large number of association rules could be found, we stopped the analysis of the rules at s = 5%. At a minimum support of 5% the last rule had a coverage of 8.

Table 3.4 Subset: Patient Data

  Min Support (s) %   Min Coverage #   # Rules   # New Rules
                 70              114         4             4
                 60               97        18            14
                 50               83        30            12
                 40               78        32             2
                 30               48       103            71
                 20               32       362           259
                 10               16     2,537         2,175
                  5                8    49,564        47,027

Table 3.5 Subset: Rule Statistics Per Protocol (s = 10%)

                          13424   13426   13449   13946   14607
  Total Patients            158     157     158     154     157
  Total Answers             686     590     665     444     502
  Rules Recovered            20       0       0      39       0
  Rules Of Interest          13       0       0      15       0
  Medically Valid Rules       0       0       0       0       0

In the end we discovered 2 true rules and 5 conditional rules. The final list of medically valid rules from the subset is given in Figure 3.3 (Subset: All Medically Valid Implications). This figure is explained in Section 3.2.2.3.

Subset: All Medically Valid Implications

2 True Rules

  #1: [Path diagnosis of esophageal SCC or ACA?] = [No]
      IMPLIES
      [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]

  #5471: [Metastatic colon or rectal cancer tissue proven and not suitable for surgery?] = [Yes]
      IMPLIES
      [Histologically confirmed metastatic colorectal cancer?] = [Yes]

5 Conditional Rules

  #3393: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Metastatic colon or rectal cancer tissue proven and not suitable for surgery?] = [No]

  #3394: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]

  #3395: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Histologically confirmed metastatic colorectal cancer?] = [No]

  #3397: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Path diagnosis of esophageal SCC or ACA?] = [No]

  #206 (modified; rule has 1 antecedent and 1 consequent):
      [001 y1y] [Measurable disease (RECIST)?] = [No]
      IMPLIES (if same tissue sample)
      [000 y138y] [Tissue proven pancreas adenocarcinoma?] = [No]

TOTAL: 7 implications

Figure 3.3 Subset: All Medically Valid Implications

3.2.2.2 Superset (Data Set Two)

Our superset was the extension of the subset. As more and more patients were screened, the data available for mining increased. Our superset had to go through some pre-processing before the mining could begin. We started with 393 patient records; however, after pre-processing, we discovered that 26 of the patient records contained only default questions and/or a combination of default questions and continuous values. We ended up with 367 patient records, which meant that our superset contained 206 more patient records than the subset. Since the pool of questions had not changed, each patient record could contain up to 192 possible dimensions. The superset contained a total of 5,055 answers from 367 patients.

As before, we started data mining with a minimum support of 95% and decreased it by 5% until it reached 5%, keeping the confidence at 100%. In contrast to the subset, we did not get any rules at s = 70%. The first result for the superset was produced at s = 30%. As shown in Table 3.6, at s = 30% we found 27 rules. The minimum coverage for the last rule at s = 30% was 111 instances. We continued to decrease the minimum support until 5%. We stopped at s = 5% with the discovery of 7,079 rules.

Table 3.6 Superset: Patient Data

  Min Support (s) %   Min Coverage #   # Rules   # New Rules
                 30              111        27            27
                 20               73        71            44
                 10               37       343           272
                  5               18     7,079         6,736

We also took a look at the data on a per-protocol basis. The result of this experiment on the superset at s = 5% is shown in Table 3.7. As mentioned earlier, clinicians have a choice of selecting any combination of protocols for which to screen a patient. As a result, the number of patients screened differed between protocols. However, protocols 13424, 13449 and 14607 had the same number of patients.

Even though we discovered a significant number of rules, they needed to be analyzed before anything could be said about them. We still needed to validate them to make sure they made

sense and to confirm their medical validity with a physician.

Table 3.7 Superset: Rule Statistics Per Protocol (s = 5%)

                          13424   13426   13449   13946   14607
  Total Patients            362     361     362     355     362
  Total Answers           1,658   1,426   1,588     907   1,303
  Rules Recovered             2       0       0       2       0
  Rules Of Interest           1       0       0       1       0
  Medically Valid Rules       0       0       0       0       0

By comparing the subset rules (Table 3.5) and the superset rules (Table 3.7), we can see that more rules were discovered in the subset; however, the number of medically valid rules was the same for both: zero.

3.2.2.3 Rules of Interest (Subset)

Before we could claim that we had found new implications, we needed to analyze the rules for permutations and medical validity. I will first discuss our findings on the subset data. Later, I will demonstrate how these discoveries helped us come up with a filtering heuristic to narrow our search during the superset analysis. The rule numbers from now on refer to the discovery sequence of the rules and not to the sequence of valid rules.

The run at s = 20% on the subset produced a total of 362 association rules. The rule set was analyzed in terms of the number of antecedents and consequents. The matrix in Table 3.8 shows the breakdown. The rows represent the number of antecedents and the columns represent the number of consequents in a rule. We can see that our rule set contains rules with up to 7 antecedents and 3 consequents.

The rules from the subset run at s = 20% were analyzed for medical validity. We found that of the 362 rules, 201 were medically invalid; the remaining 161 rules were medically valid and were of interest. Further analysis revealed that of the 161 rules of interest, 158 were permutations of valid rules. Only 3 rules (#1, #206, and #213) out of 362 were found to be medically valid and distinct. The rule summary for the subset rule analysis at s = 20% is shown in Table 3.9.

Table 3.8 Subset: Antecedent-Consequent Matrix (s = 20%)
(Rows: number of antecedents; columns: number of consequents)

  Ant \ Cons     1     2     3
           1     1     0     0
           2    26     9     2
           3    78    25     4
           4   102    20     0
           5    66     5     0
           6    20     1     0
           7     3     0     0

Table 3.9 Subset: Rule Analysis (s = 20%)

  Rule Type        #
  Invalid Rules   201
  Permutations    158
  Valid Rules       3

A closer look at the 3 valid rules (Figure 3.4) revealed that rule one, which was actually discovered at s = 70%, was the only simple rule. Rules two and three both had two antecedents, one of which they shared. The results of our experiments on the subset data indicated that we should concentrate on the rules that contained a maximum of two antecedents and only one consequent. For example: A → C and A ∧ B → C.

As mentioned earlier, rules #206 and #213 had the same first antecedent (Figure 3.4). We first examined rule #206, which was the second rule on our list (Figure 3.4). After taking a closer look at the subset data, we found that of 160 patients, 2 patients did not have answers for two questions, [000 y64y] and [000 y138y]. We removed these two patients from our data and ran the experiment again. We found that aside from the existing rule #1 (Figure 3.4), there were two additional simple rules, each with 1 antecedent and 1 consequent (Figure 3.5). So with the additional analysis of our data, we were able to come up with simple rules.
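The antecedent-consequent breakdown and the resulting filter (at most two antecedents, exactly one consequent) are straightforward to express in code. The sketch below assumes each decoded rule is available as a pair of item sets; that representation and the function names are hypothetical, not the format produced by the thesis tools.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]                        # (question id, answer)
Rule = Tuple[FrozenSet[Item], FrozenSet[Item]]  # (antecedents, consequents)


def shape_matrix(rules: List[Rule]) -> Dict[Tuple[int, int], int]:
    """Count rules by (number of antecedents, number of consequents),
    i.e. the cells of an antecedent-consequent matrix."""
    return dict(Counter((len(ant), len(con)) for ant, con in rules))


def rules_of_interest(rules: List[Rule], max_antecedents: int = 2) -> List[Rule]:
    """Keep rules with at most `max_antecedents` antecedents and exactly
    one consequent."""
    return [(ant, con) for ant, con in rules
            if 1 <= len(ant) <= max_antecedents and len(con) == 1]


example_rules: List[Rule] = [
    (frozenset({("A", "No")}), frozenset({("C", "No")})),
    (frozenset({("A", "No"), ("B", "No")}), frozenset({("C", "No")})),
    (frozenset({("A", "No"), ("B", "No"), ("D", "Yes")}), frozenset({("C", "No")})),
    (frozenset({("A", "No")}), frozenset({("C", "No"), ("D", "Yes")})),
]
matrix = shape_matrix(example_rules)
kept = rules_of_interest(example_rules)
```

Applied to the counts in Table 3.8, the same filter would retain the cells (1, 1) and (2, 1), that is 1 + 26 = 27 of the 362 rules.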

Subset: Valid Implications (s = 20%)

  1) Rule #1 has 1 antecedent and 1 consequent:
     [000 y1y] [Path diagnosis of esophageal SCC or ACA?] = [No]
     IMPLIES
     [009 y1y] [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]

  2) Rule #206 has 2 antecedents and 1 consequent:
     [001 y1y] [Measurable disease (RECIST)?] = [No]
     AND
     [000 y64y] [Any histopathologically proven diagnosis of malignant GIST that is not amenable to standard therapy with curative intent?] = [No]
     IMPLIES
     [000 y138y] [Tissue proven pancreas adenocarcinoma?] = [No]

  3) Rule #213 has 2 antecedents and 1 consequent:
     [001 y1y] [Measurable disease (RECIST)?] = [No]
     AND
     [000 y125y] [Metastatic colon or rectal cancer tissue proven not suitable for surgery?] = [No]
     IMPLIES
     [000 y138y] [Tissue proven pancreas adenocarcinoma?] = [No]

Figure 3.4 Subset: Valid Implications (s = 20%)

Subset: Rule 206 Closer Look

  1) Rule has 1 antecedent and 1 consequent:
     [001 y1y] [Measurable disease (RECIST)?] = [No]
     IMPLIES
     [000 y138y] [Tissue proven pancreas adenocarcinoma?] = [No]

  2) Rule has 1 antecedent and 1 consequent:
     [001 y1y] [Measurable disease (RECIST)?] = [No]
     IMPLIES
     [000 y64y] [Any histopathologically proven diagnosis of malignant GIST that is not amenable to standard therapy with curative intent?] = [No]

Figure 3.5 Subset: Rule 206 Closer Look

After the simple rules (Figure 3.5) were reviewed by a physician, it was discovered that rule #2 was not valid; rule #1 was validated, but only under certain assumptions. I will discuss medical validation in detail in Section 3.2.2.5.

When the minimum support was dropped to 5%, mining of the subset produced 49,564 rules (Table 3.4). Table 3.10 shows that only 100 of them were simple rules (one antecedent and one consequent). We will restrict our analysis to these rules.

Table 3.10 Subset: Antecedent-Consequent Matrix (s = 5%)
(Rows: number of antecedents; columns: number of consequents)

  Ant \ Cons      1      2      3      4     5     6    7   8
           1    100    175    241    245   177    85   24   3
           2    997  1,566  1,719  1,307   662   200   27   0
           3  3,205  4,274  3,669  2,035   672   100    0   0
           4  4,858  5,308  3,429  1,274   213     0    0   0
           5  3,938  3,364  1,506    294     0     0    0   0
           6  1,825  1,108    269      0     0     0    0   0
           7    477    160      0      0     0     0    0   0
           8     58      0      0      0     0     0    0   0

When we look at the subset data at a minimum support of 5%, we find that of the 100 simple rules, only 6 were valid. Rule #1 was previously seen at s = 20%. Rules #5471, #3393, #3394, #3395, and #3397 were newly discovered. I should mention that rule #5471 is a true rule, which means that its antecedent and consequent refer to the same tissue type (colorectal).

3.2.2.4 Rules of Interest (Superset)

Let's take a look at the rules discovered in our superset. All rule numbers from now on refer to the superset rules unless otherwise noted. Table 3.11 is a snapshot of the discovered rules from the superset. Again, our rules of interest only include rules with a maximum of two antecedents and exactly one consequent. In the table we can see that there are 13 one-antecedent, one-consequent rules and 241 two-antecedent, one-consequent rules, for a total of 254 rules of interest.

Table 3.11 Superset: Antecedent-Consequent Matrix (s = 5%)
(Rows: number of antecedents; columns: number of consequents)

  Ant \ Cons      1     2     3     4     5    6
           1     13     7     4     1     0    0
           2    241   128    82    40    13    2
           3    903   475   226    82    14    0
           4  1,303   716   303    65     1    0
           5  1,034   509   134     5     0    0
           6    461   159    12     0     0    0
           7    116    17     0     0     0    0
           8     13     0     0     0     0    0

After examining the 254 rules of interest in our superset, we found that 73 rules had some potential; however, closer examination was needed. Although our 254-rule pool consisted of rules with a maximum of two antecedents and only one consequent, it is interesting to note that only 4 of the 73 rules had one antecedent and one consequent. The rest had two antecedents. Out of the 4 rules with one antecedent, only one (rule #998) was a true rule, not dependent on any assumptions. The other three (#1355, #1356, #1357) could only be validated under the assumption that the antecedent and consequent of the rule referred to the same tissue sample.

Now let's take a look at the rules with two antecedents and one consequent. There are five rules that looked promising (rules #1, #343, #1347, #7036, #7076). It was interesting to discover that rule #1 from Figure 3.2, which was found during analysis of the subset, acquired an additional antecedent (Sex = Female). A similar case occurred with rules #343 and #7036. Superset rules #343 and #7036 were the same as the subset rule #5471 except that #343 acquired an additional antecedent (Take Medications By Mouth = YES) while #7036 acquired a different antecedent (Uncontrolled Medical Conditions = NO). I will discuss these findings in greater detail during the Medical Validation discussion (Section 3.2.2.5).

The final count of rules discovered from the superset was 9. Table 3.12 lists the newly discovered rules from the superset together with the number of antecedents and consequents per rule and medical comments. I will expand on the medical comments in the next section.

3.2.2.5 Medical Validation (Subset)

Because we were mining medical data and not supermarket data, we were presented with an additional challenge during the rule-interpretation phase: that of rule validation. The domain knowledge needed for medical validation of the rules was provided by physicians at the Moffitt Cancer Center, Dr. Chris Garrett and Dr. Amit Pathak. All of our findings were checked by these physicians for medical validity.

Rule interpretation from a medical standpoint can, at times, be subjective. I spent a lot of time with Dr. Amit Pathak discussing how the discovered rules should be interpreted for our purpose. Here are some of the issues we faced.

Figure 3.3 lists all 7 valid implications that were discovered from our subset data. We can see that only 2 (#1 and #5471) of the 7 implications were specific enough, because their questions mention the same region of the body in the antecedent and consequent. As seen in Figure 3.3, in implication #5471 both the antecedent and consequent refer to colorectal-type cancer; therefore, this rule is tissue-specific and did not require any additional assumptions. In order for the other 5 implications to be valid, we had to make the assumption that both the antecedent and consequent refer to the same tissue sample. Let's take a closer look at the subset rule #3395 from Figure 3.3:

  [Tissue proven pancreas adenocarcinoma?] = [Yes]
  IMPLIES (with the assumption that the same tissue was examined)
  [Histologically confirmed metastatic colorectal cancer?] = [No]

Explanation: The logic behind this is that if a physician is looking at pancreatic tissue and it is proven that this tissue is cancerous, then it is also known that this tissue is not from the colorectal region and therefore will not contain a colorectal cancer. Since the rest of the questions from the subset were somewhat general, the medical decision about them was based on the assumption that the antecedent and consequent refer to the same tissue sample.

3.2.2.6 Medical Validation (Superset)

Now let's look at the superset data (all of the subsequent rule numbers refer to superset rule numbers). After analyzing the superset data, we found 9 rules of interest (Table 3.12). We see that of the 9 rules there was only 1 true rule (#998), since both the antecedent and consequent of the rule referred to the cardiac region. Rules #1347, #1355, #1356, #1357, and #7076 are tissue-dependent rules. Two of these 5 rules contained 2 antecedents; however, upon medical evaluation it was determined that one of the antecedents was not necessary in order for the implication to remain valid (ANT not needed). I will give an example of such a case later in this section. Superset rule #1 was a variation of simple rule #1 discovered in our subset. Rules #343 and #7036 had two antecedents and were variations of the subset's simple rule #5471.

Table 3.12 Superset: Discovered Rules

  Rule #   Type                        # ANT   # CON   Medical Comments
       1   Variation of Subset #1          2       1   1 ANT not needed
     343   Variation of Subset #5471       2       1   1 ANT not needed
     998   True                            1       1   Superset
    1347   Tissue Dependent                2       1   Specific to General; 1 ANT not needed
    1355   Tissue Dependent                1       1   Must Be Same Tissue
    1356   Tissue Dependent                1       1   Must Be Same Tissue
    1357   Tissue Dependent                1       1   Must Be Same Tissue
    7036   Variation of Subset #5471       2       1   1 ANT not needed
    7076   Tissue Dependent                2       1   1 ANT not needed

If a rule A ∧ B → C was discovered during data mining, it would be incorrect to say that it could be broken down into two rules A → C and B → C and that these new rules would have the same confidence level as A ∧ B → C. If the data could support it, then these two simple rules would have shown up prior to the combination rule A ∧ B → C, but only if they had a confidence level equal to that of A ∧ B → C (we require 1.0). However, in our case, we were able to use medical knowledge to simplify such rules. The example in Table 3.13 demonstrates our reasoning behind simplification based on medical knowledge. If a rule A ∧ B → C

was discovered, and after medical evaluation it was determined that the first antecedent (A) of the rule was not needed, then based on medical knowledge the rule could be reduced to B → C and still maintain the confidence level of 100%. I must emphasize again that such a rule reduction would not be correct if based only on data.

Table 3.13 Example: Rule Simplification Based on Medical Knowledge

  RULE: A ∧ B → C
    A [Age Over 18]† = [Yes]
    B [Sex] = [Male]
    IMPLIES
    C [Pregnant] = [No]

  MEDICAL KNOWLEDGE: Only females can be pregnant.
  REASONING: Since a male cannot be pregnant, question A [Age Over 18] is not necessary for the above rule to be valid.

  SIMPLIFIED RULE: B → C
    B [Sex] = [Male]
    IMPLIES
    C [Pregnant] = [No]

  † Antecedent Not Necessary

Let's take a closer look at why the simple rule #1 from Figure 3.3, which was discovered during the mining of the subset, ended up with 2 antecedents in our superset. Since we used 100% confidence, we only got rules that were correct 100% of the time based on our data. So when the Apriori algorithm was run on the superset with a confidence of 100%, we got our subset rule #1 with an additional antecedent (Sex = Female). However, when the confidence level was dropped to 90%, we got a one-antecedent, one-consequent superset rule that matched the subset rule #1 exactly. A closer look at the patient data revealed that one of the patients had the following answers:

  A [Sex] = [MALE]
  B [Path diagnosis of esophageal SCC or ACA?] = [No]

  C [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [Yes]

Medical examination revealed that if B [Path diagnosis of esophageal SCC or ACA?] is NO, then C [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] must also be NO. Since this male patient's answer for C was Yes, the rule B → C no longer held for every record, and the antecedent A [Sex = Female] surfaced for the rule. As already stated, to be medically valid C must be NO; since it was Yes, this case revealed an error in the data entry.

We see that by using data mining we could find rules of interest. Because we were dealing with the medical field, we needed to recruit the services of a physician in order to validate our rules of interest. By relying on his or her medical knowledge, the physician is able to trim down a rule by dropping an antecedent if the presence of the antecedent is not medically necessary for the rule to be valid (Table 3.13).

Figure 3.6 contains the final result of our data mining from the superset. Even though we discovered 7 rules in the subset and 9 rules in the superset (two of which were previously discovered in the subset), we can only use the rules that do not rely on any assumptions. We call such rules true rules. By listing only the true rules, which are marked in Figure 3.6, we end up with our final list of five implications. Our final list of implications is shown in Figure 3.7. These implications have been entered into the implication module of MEANS.

3.2.3 Probability-Based Reordering

MEANS uses analytical and probabilistic agents to reorder the list of unanswered questions (questions to ask). The analytic heuristic is based on the cost of medical tests together with the structure of the acceptance and rejection expressions. The probability-based reordering is done with the help of the probabilistic agent. The probabilistic agent uses data that was accumulated over time to calculate the probability for a question. In this section, I will discuss the probability-based heuristic in detail.

Bayesian Networks seem to be preferred for probabilistic expert systems; however, there are some drawbacks to using them. They can be very complex and computationally expensive.

Superset: All Medically Valid Implications

  #1: *‡ [Sex]† = [Female]
      AND [Path diagnosis of esophageal SCC or ACA?] = [No]
      IMPLIES
      [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]

  #998: * [Prior cardiac condition in the last 6 months?] = [No]
      IMPLIES
      [Unstable angina?] = [No]

  #1355: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]

  #1356: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Histologically confirmed metastatic colorectal cancer?] = [No]

  #1357: [Tissue proven pancreas adenocarcinoma?] = [Yes]
      IMPLIES (if same tissue sample)
      [Path diagnosis of esophageal SCC or ACA?] = [No]

  #343: [Take medications by mouth?]† = [Yes]
      AND [Histologically confirmed metastatic colorectal cancer?] = [No]
      IMPLIES (if same tissue sample)
      [Metastatic colon or rectal cancer tissue proven AND not suitable for surgery?] = [No]

  #1347: * [Histologically confirmed metastatic colorectal cancer?]† = [No]
      AND [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [Yes]
      IMPLIES
      [Path diagnosis of esophageal SCC or ACA?] = [Yes]

  #7036: *‡ [Metastatic colon or rectal cancer tissue proven and not suitable for surgery?] = [Yes]
      AND [Uncontrolled medical conditions?]† = [No]
      IMPLIES
      [Histologically confirmed metastatic colorectal cancer?] = [Yes]

  #7076: * [Histologically confirmed metastatic colorectal cancer?] = [Yes]
      AND [Prior cardiac condition in the last 6 months?]† = [No]
      IMPLIES
      [Measurable disease (RECIST)?] = [Yes]

  † Antecedent NOT Necessary
  ‡ Rule Previously Found in Subset
  * True rule

TOTAL: 9 implications

Figure 3.6 Superset: All Medically Valid Implications

Final List of Discovered Association Rules

True Rules:

  1. [Path diagnosis of esophageal SCC or ACA?] = [No]
     IMPLIES
     [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [No]
     (Superset Rule #1, Subset Rule #1)

  2. [Metastatic colon or rectal cancer tissue proven and not suitable for surgery?] = [Yes]
     IMPLIES
     [Histologically confirmed metastatic colorectal cancer?] = [Yes]
     (Superset Rule #7036, Subset Rule #5471)

  3. [Prior cardiac condition in the last 6 months?] = [No]
     IMPLIES
     [Unstable angina?] = [No]
     (Superset Rule #998)

  4. [Tissue confirmed esophageal adenocarcinoma or adenocarcinoma of the gastroesophageal junction] = [Yes]
     IMPLIES
     [Path diagnosis of esophageal SCC or ACA?] = [Yes]
     (Superset Rule #1347)

  5. [Histologically confirmed metastatic colorectal cancer?] = [Yes]
     IMPLIES
     [Measurable disease (RECIST)?] = [Yes]
     (Superset Rule #7076)

Figure 3.7 Final Rules
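The five rules of Figure 3.7 can be read as simple "if this answer, then that answer" pairs. The following sketch shows one way an implication module could use them to fill in unanswered questions and to flag answers that contradict a rule (as in the data-entry error described in Section 3.2.2.6). It is an illustration only: the short question keys are hypothetical stand-ins for the full question texts, and nothing here reflects how MEANS actually stores or applies its implications.

```python
from typing import Dict, List, Tuple

Answer = Tuple[str, str]                  # (question key, answer)
Implication = Tuple[Answer, Answer]       # (if, then)

# Hypothetical short keys standing in for the question texts of Figure 3.7.
IMPLICATIONS: List[Implication] = [
    (("esoph_scc_or_aca_path_dx", "No"), ("esoph_or_gej_adenocarcinoma", "No")),
    (("metastatic_crc_unresectable", "Yes"), ("histologic_metastatic_crc", "Yes")),
    (("prior_cardiac_condition_6mo", "No"), ("unstable_angina", "No")),
    (("esoph_or_gej_adenocarcinoma", "Yes"), ("esoph_scc_or_aca_path_dx", "Yes")),
    (("histologic_metastatic_crc", "Yes"), ("measurable_disease_recist", "Yes")),
]


def apply_implications(answers: Dict[str, str],
                       implications: List[Implication] = IMPLICATIONS):
    """Return (answers plus inferred answers, list of violated rules).

    Rules are applied repeatedly until nothing new can be inferred, so an
    inferred answer can trigger a further rule. A rule whose consequent
    contradicts a recorded answer is reported instead of overwriting it."""
    inferred = dict(answers)
    conflicts: List[Tuple[str, str]] = []
    changed = True
    while changed:
        changed = False
        for (q_if, a_if), (q_then, a_then) in implications:
            if inferred.get(q_if) != a_if:
                continue
            if q_then not in inferred:
                inferred[q_then] = a_then
                changed = True
            elif inferred[q_then] != a_then and (q_if, q_then) not in conflicts:
                conflicts.append((q_if, q_then))
    return inferred, conflicts
```

With a single recorded answer, `metastatic_crc_unresectable = Yes`, the sketch infers `histologic_metastatic_crc = Yes` from rule 2 and then `measurable_disease_recist = Yes` from rule 5, so two questions no longer need to be asked.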

Introduction of new evidence requires recalculation of the corresponding numerical probabilities and may require modification of an existing network [8, 23]. However, one of the greatest advantages of Bayesian Networks is that they are able to predict an outcome even in the absence of some information. This ability gives Bayesian Networks an advantage over rule-based systems. Due to this feature, the naive Bayes approach was adapted for the probabilistic agent.

Every time a new set of questions and answers is submitted, the answer to each question is examined and MEANS makes a determination of the patient's eligibility on a per-protocol basis. If, after the evaluation of all newly submitted answers, the eligibility of a patient cannot be determined, MEANS compiles a list of questions to ask and presents the user with the top 10 questions from that list. This cycle continues until the eligibility of the patient is determined for all selected protocols.

We wanted to make the screening process less time-consuming. If a patient is eligible for a trial, then all of the questions in the acceptance expression need to be answered; however, only a small set of questions is required for a rejection expression to be TRUE, thus making a patient ineligible. One way of decreasing the number of questions that needed to be answered was to try to make a patient ineligible as soon as possible during the screening process. We needed to present the user with the questions that would be most likely to make a patient ineligible. This was possible with the use of a probability-based heuristic.

By using a probability-based heuristic we could reorder questions such that the questions with a higher probability-influenced ranking value of determining a patient's ineligibility for a protocol would be closer to the top of the list of questions to ask. Since a question with a higher ranking of making the patient ineligible for a protocol would be asked first, such an ordering had the potential to reduce the number of questions that a user needed to answer.

3.2.3.1 Probability-Guided Agent

The original version of the probability-guided agent was developed by B. Goswami [12]. We modified the original implementation of data collection and introduced a thresholding feature during probability calculation.

Even though it is not entirely true, for the purposes of our research we assume class conditional independence (an assumption that all of the questions are independent of each other). We eliminated known dependencies by using implications; therefore, the dependent (implied) questions were not in the list of questions to ask.

The screening process could be viewed as a 2-class problem: Eligible and Ineligible. We treated the questions of the protocols as our attributes and the values of the questions as an eligibility status: either Favorable to be Eligible or Favorable to be Ruled Out. To use naive Bayes, we needed to know the probabilities of occurrence of each class type. This was accomplished every time a question was evaluated by the system.

During the question evaluation phase, we used the following heuristic to keep track of asked questions and the outcome. Each question had two numbers attached to it together with a protocol number. The first number kept track of how many times the question ruled out a patient (made a patient ineligible for a protocol); the second number was the number of times the question made a patient eligible for a protocol (or did not rule out a patient). Every time an answer was evaluated for a protocol, one of the numbers would increase. If the eligibility of a patient could not be determined even after processing all of the newly submitted answers, a new list of questions to ask was compiled.

When a patient was screened, the user had a choice of answering a question, leaving it blank or deferring the question for later. At this time the system does not capture which question was answered first on the page or the order of submitted data on a per-page basis; therefore, we were unable to determine whether the question that was answered was the first question on the list of presented questions. This could be implemented in future versions of the system by adding JavaScript to track the order in which the questions were answered on a page and capture the data during processing.

We were interested in calculating the likelihood (probability) that a question would make a patient ineligible (ruled out) for a protocol. The calculation of the probability was based on Bayes' rule of conditional probability (Equation 3.1). Pr[A] is the probability of the event A and Pr[A|B] is the probability of the event A conditional on another event B. The evidence

that consists of a particular combination of attribute values is denoted by E [29]:

\[ \Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]} \tag{3.1} \]

In our case, the probability that a question would rule out a patient from a protocol was calculated as follows:

\[ \Pr[\text{Ineligible} \mid E] = \frac{\Pr[E \mid \text{Ineligible}]\,\Pr[\text{Ineligible}]}{\Pr[E]} \]

Example: When we collected the statistics for a question Q1, the line that contained the question was read as "Q1 13424 10 80". This could be interpreted as: question Q1 for protocol 13424, which was asked 90 times, ruled out a patient (made ineligible) 10 times, and did not rule out a patient (left eligible) 80 times. If we denote by A the number of times a patient was ruled out and by B the number of times a patient was eligible, then we have:

\[ \Pr[\text{rule out} \mid E] = \frac{\dfrac{A}{A+B}}{\dfrac{A}{A+B} + \dfrac{B}{A+B}} \]

which in our case simplifies to:

\[ \Pr[\text{rule out} \mid E] = \frac{A}{A+B} \tag{3.2} \]

Substituting the numbers from our example into Equation 3.2, the probability that question Q1 rules out a patient from protocol 13424 is:

\[ \Pr[\text{rule out} \mid E] = \frac{A}{A+B} = \frac{10}{10+80} \approx 0.11 \]

It is not uncommon for some protocols to share the same questions, like age and sex. When a patient is being tested for more than one protocol, a question may appear in the acceptance criteria of each of the protocols. To take this into consideration, we have chosen to use a probability-influenced ranking value for our final ordering of the questions. For a

question that appears in more than one protocol, the probability-influenced ranking value is calculated by taking the average of the individual probabilities and multiplying it by the number of trials in which the question appears. If we have question Q1 in three protocols and the individual probabilities for Q1 have already been calculated as

\[ P_{Q1,13424} = 0.11, \qquad P_{Q1,13426} = 0.72, \qquad P_{Q1,13946} = 0.51 \]

then the probability-influenced ranking value is calculated as follows:

\[ \text{Ranking Value} = \frac{P_{Q1,13424} + P_{Q1,13426} + P_{Q1,13946}}{3} \times 3 = \frac{0.11 + 0.72 + 0.51}{3} \times 3 = 1.34 \]

The resulting probability-influenced ranking value will be between zero and the number of trials in which a question appears. If a question appears in only one trial, then the ranking value is equal to the actual probability of the question. However, if a question appears in more than one trial, the ranking value gives an advantage to such a question because it has the potential of ruling out a patient from multiple trials.

Other ways of calculating a probability-influenced ranking value would be to use a weighted average of the individual probabilities or to use only the maximum probability of a question. If we use the maximum probability of a question for a trial, that will give the rule-out advantage to the trial from which the probability was taken; however, that may not give the overall advantage that we are seeking. Such heuristics still need to be explored.

When the probability-based heuristic is used, the list of questions to ask is ordered with the highest rule-out question ranking at the top. In order to have a meaningful probability for a question, the number of times the question has been asked must be sufficient; that way the questions presented first will have a greater impact. Let's say that a question Q2 concerns some rare condition, was asked a total of 9 times, and ruled out a patient only twice. The probability for this question is

PAGE 51

0.22.IfwecomparedtheprobabilitiesforQ1andQ2,therule-outrankingforQ2isgreater;therefore,itwillbeclosertothetopofthetopofthelistofquestionstoask.IfQ2didnotruleoutanymorepatients,aftersometimetherule-outrankingwoulddecreaseandQ1wouldmoveclosertothetopofthelist;however,thisrequiresadditionaldataentry.Beforeusingequation3.2tocalculatetheprobabilityonaper-protocolbasis,weusedathresholdtoevaluatethetotalnumberoftimesaquestionhadbeenasked.Ifthatnumberwasabovethegiventhreshold,thenwecalculatedtheprobabilityforaquestion.Otherwise,weassignedthequestionaprobabilityofzeroanddidnotusetheittoreorder.3.2.3.2TestingSystemTotesttheprobability-guidedmethodwedevelopedatestingsystem.Thetestingsystem,whichwaswritteninCandisontheenclosedCD,wasabletoscreenthepatientbysubmittingexistinganswersfrompatient'sdataonequestionatatime.Afterananswerwasevaluatedandthequestions-to-asklistcompiled,thetestingsystemattemptedtondananswerforthequestionfromthelistofexistinganswersofapatient.Iftheanswercouldnotbefound,thesystemwouldincreasethenumberofaskedquestions,andwouldtrytondananswerforthenextquestiononthelist.Whenananswerforaquestionwasfound,thenumberofansweredquestionsincreasedandtheanswerwassubmittedintothesystemforevaluation.Ifapatient'seligibilityforaprotocolwasdeterminedeligibleorineligible,nootherquestionsfromthatprotocolwouldappearinthelist.Iftheeligibilityforallprotocolswasdetermined,thequestions-to-asklistwouldthenbeblankandthetestingprogramterminated.ThetestingsystemalgorithmisshowninFigure3.8.Wehavecreatedtwoversionsofthetestingsystem:web-basedandcommandprompt-based.Theweb-basedtestsystempermitsscreeningapatientwithoutloggingintotheserver.Wecantestoneortwopatientsandviewtheresultsviathebrowser.However,thecommandpromptversionismorerobustifscreeningalistofpatients.Thecommand-promptversiontakesinoneparameter,whichistheMEANSIDoftheexistingpatient.Duringourexperiments,thecommand-prompttestingsystemwascalledviashellscripts.42
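The rule-out probability of Equation 3.2, the probability-influenced ranking value, and the threshold from Section 3.2.3.1 can be illustrated with a minimal Python sketch. This is not the thesis code (the testing system was written in C). The function names, the whitespace-separated layout of the statistics line, and the strict reading of "above the given threshold" are assumptions made for illustration.

```python
from collections import defaultdict


def parse_stat_line(line):
    """Parse a statistics line such as "Q1 13424 10 80".

    Assumed layout (whitespace-separated): question, protocol,
    times ruled out (A), times not ruled out (B).
    """
    question, protocol, ruled_out, not_ruled_out = line.split()
    return question, protocol, int(ruled_out), int(not_ruled_out)


def rule_out_probability(ruled_out, not_ruled_out, threshold=0):
    """Equation 3.2, A / (A + B).

    Returns 0.0 unless the question was asked more than `threshold`
    times (the strict comparison is an assumption).
    """
    asked = ruled_out + not_ruled_out
    if asked <= threshold:
        return 0.0
    return ruled_out / asked


def ranking_values(stat_lines, threshold=0):
    """Sum the per-protocol probabilities of each question.

    Averaging the probabilities and multiplying by the number of
    protocols, as the text describes, equals summing them.
    """
    totals = defaultdict(float)
    for line in stat_lines:
        question, _protocol, a, b = parse_stat_line(line)
        totals[question] += rule_out_probability(a, b, threshold)
    return dict(totals)


def order_questions(values):
    """Highest ranking value first; ties broken by question name."""
    return sorted(values, key=lambda q: (-values[q], q))
```

For example, `rule_out_probability(10, 80)` rounds to 0.11 and `rule_out_probability(2, 7)` rounds to 0.22, matching the Q1 and Q2 examples. With `threshold=10`, Q2 (asked only 9 times) is assigned a probability of zero and no longer outranks Q1.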


for each testing patient
{
    Load default answers
    Load first page answers
    Compile questions-to-ask list by using ranking to reorder
    while there are questions to ask
    {
        Take top question from questions-to-ask list ordered by ranking value
        Try to find the answer in patient's profile
        if answer not found
        {
            while answer not found AND there are questions to ask
            {
                Increment number of asked questions
                Take next question from questions-to-ask list
                Try to find the answer in patient's profile
            }
        }
        Increment number of answered questions
        if answer found
        {
            Evaluate patient's eligibility
            Compile questions-to-ask list
        }
    }
    Record patient's statistics
}

Figure 3.8 Testing System Algorithm
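To make the loop of Figure 3.8 concrete, here is a small Python simulation for a single patient. It is a sketch under stated assumptions, not the C testing system: each protocol is modeled as a hypothetical mapping from a question to its disqualifying answer, every presented question counts as "asked" (whether or not an answer is found), and a question whose answer is missing from the profile is not presented again.

```python
def screen_patient(profile, protocols, ranking):
    """Simulate one screening; return (asked, answered, status).

    profile:   question -> recorded answer for this patient
    protocols: protocol name -> {question: disqualifying answer}
               (a hypothetical eligibility model)
    ranking:   question -> probability-influenced ranking value
    """
    status = {name: None for name in protocols}  # None = undetermined
    answers, skipped = {}, set()
    asked = answered = 0
    while True:
        # Questions-to-ask list: open questions of undetermined protocols.
        todo = {q for name, rules in protocols.items()
                if status[name] is None
                for q in rules
                if q not in answers and q not in skipped}
        if not todo:
            break
        # Present the question with the highest ranking value.
        question = min(todo, key=lambda q: (-ranking.get(q, 0.0), q))
        asked += 1
        if question not in profile:
            skipped.add(question)  # no answer on file; try the next one
            continue
        answered += 1
        answers[question] = profile[question]
        # Re-evaluate eligibility for every undetermined protocol.
        for name, rules in protocols.items():
            if status[name] is not None:
                continue
            if any(answers.get(q) == bad for q, bad in rules.items()):
                status[name] = "ineligible"
            elif all(q in answers for q in rules):
                status[name] = "eligible"
    return asked, answered, status
```

With two protocols that share an unanswerable question, ranking that question first costs an extra presented question, while ranking the decisive questions first rules the patient out of both protocols after two questions. This is the effect the reordering experiments measure.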


3.2.3.3 Probability-Guided Experiments

For our experiment, we chose 100 ineligible patients. These were the ineligible patients from the latter part of the data-collection phase. First, each of the 100 test patients was evaluated by the testing system, and the numbers of asked questions and answered questions were tracked together with the statistics for each question. Next, the same 100 patients were evaluated by the testing system again, except this time the questions-to-ask list was sorted by probability-influenced ranking value, with the highest rule-out ranking value at the top of the list.

I should reiterate that the testing system screens a patient one question at a time. A question is considered presented (or asked) when it is at the top of the questions-to-ask list and the testing system has finished searching for the answer in a test patient's profile.

The results of the analytical reordering experiment are shown in Table 3.14. From the Questions Answered column we can see that, on average, the system required 7.89 questions in order to determine a patient's eligibility. The Less Needed column shows that the test profile contained, on average, 4.46 fewer questions per patient than the original profile; these are questions that were answered during data collection but were not needed for eligibility determination. The Questions Asked column refers to the number of questions that were presented (searched for answers in a test patient's profile) before eligibility was determined. The average number of questions presented was 86.7.

Table 3.14 Analytical Reordering

             Questions   Questions   Questions in   Questions in       Less
             Asked       Answered    Test Profile   Original Profile   Needed
Total (Σ)    8,670       789         1,787          2,233              446
Mean (x̄)     86.7        7.89        17.87          22.33              4.46
Median (Md)  106         8           18             19                 2.5
Mode (Mo)    106         8           18             19                 1
St. Dev      52.17       5.19        5.23           8.75               5.73

To compare the analytical reordering against the probability-based heuristic, we used the 10-fold cross validation method. We divided our 100 patients into 10 equal sets of 10 patients per set. We gathered the probability from 9 of the sets and then used that probability on the 10th set, which was left out during the probability gathering. This was repeated 10 times, once per set.

Table 3.15 Probability-Based Reordering 10-Fold Cross Validation

             Questions   Questions   Questions in   Questions in       Less
             Asked       Answered    Test Profile   Original Profile   Needed
Total (Σ)    1,887       511         1,509          2,233              724
Mean (x̄)     18.87       5.11        15.09          22.33              7.24
Median (Md)  12          4           14             19                 5
Mode (Mo)    11          4           14             19                 5
St. Dev      23.13       3.65        3.69           8.75               6.84

The comparison between Table 3.14 and Table 3.15 indicates that using the probability-based heuristic to reorder the questions-to-ask list decreased the number of questions that were needed to rule out a patient from clinical trials. By looking at the mean (x̄) we can see that the average total questions asked for eligibility determination (Questions Asked) decreased from 86.7 to 18.87. That is a drop of 67.83, or about 78%. The average number of questions that were answered (Questions Answered) was 5.11, a decrease of 2.78.

I must note that during the initial screening of the patients, clinicians had a choice of answering any number of the 10 questions on the page in any order before submitting the answers into the system. Since the system compiled the questions-to-ask list only after evaluation of all of the newly-submitted answers, the number of asked questions will differ when multiple answers are submitted at a time instead of one answer at a time.

As I mentioned before, we were also interested in finding out what happens if we threshold the number of times a question was asked before calculating the probability for the question. For our next experiment, we used the same 100 ineligible patients and ran the 10-fold cross validation while varying the threshold value before calculating the probability for each question. First we ran the 10-fold cross validation with a threshold T of zero. Then we ran the 10-fold cross validation for each threshold value from T=10 to T=120 in steps of 5. As seen in Table 3.16, the number of questions asked decreased by an average of 8.03 when T increased from 0 to 10. The lowest average of questions asked was 8.60, which occurred at T=55. This can be seen better in the graph in Figure 3.9. We were interested in the curve with the diamond-shaped data points.

Table 3.16 Probabilistic Thresholding 10-Fold Cross Validation

Threshold   Questions   Questions   Questions in   Questions in       Less
Value       Asked       Answered    Test Profile   Original Profile   Needed
0           17.11       5.02        15.00          22.01              7.01
10          9.08        5.54        15.52          22.01              6.49
15          9.09        5.77        15.75          22.01              6.26
20          9.02        5.74        15.72          22.01              6.29
25          8.87        5.71        15.69          22.01              6.32
30          8.87        5.71        15.69          22.01              6.32
35          8.86        5.71        15.69          22.01              6.32
40          8.77        5.71        15.69          22.01              6.32
45          8.62        5.71        15.69          22.01              6.32
50          8.61        5.71        15.69          22.01              6.32
55          8.60        5.71        15.69          22.01              6.32
60          8.68        5.64        15.62          22.01              6.39
65          8.68        5.64        15.62          22.01              6.39
70          8.68        5.64        15.62          22.01              6.39
75          8.68        5.64        15.62          22.01              6.39
80          8.68        5.64        15.62          22.01              6.39
85          14.24       5.80        15.78          22.01              6.23
90          23.30       6.44        16.42          22.01              5.59
95          23.30       6.44        16.42          22.01              5.59
100         23.28       6.44        16.42          22.01              5.59
105         23.61       6.46        16.44          22.01              5.57
110         23.98       6.60        16.58          22.01              5.43
115         24.36       6.55        16.53          22.01              5.48
120         24.36       6.55        16.53          22.01              5.48

In Figure 3.9 we can see that as T increased from zero, the average number of questions that needed to be asked in order to rule out a patient decreased. However, when T approached 85, the curve started going up, with a noticeable jump at T=90. This can be explained by looking at Table 3.17, which lists for each set the maximum number of times a question was asked for a protocol. We can see that the average maximum number of times a question was asked for a particular protocol was 88. So, when the threshold was set too high, the probability-based reordering was not in effect and the questions were presented at random.

Figure 3.9 Thresholding 10-Fold Cross Validation

Table 3.17 Max Number Per Set

Max Number of Times a Question for a Protocol Was Asked
Set 1   Set 2   Set 3   Set 4   Set 5   Set 6   Set 7   Set 8   Set 9   Set 10
87      87      88      88      88      88      88      88      88      90
Average: 88

When sorting by probability-influenced ranking value, it is useful to know when to pick a starting point in order to ensure that the probability calculation for a question will be significant enough. Counting the number of patients that have been screened is not sufficient. Since we were reordering the questions, we needed to use information about each individual question. We showed that thresholding the number of times a question was asked before computing its probability decreased the number of questions asked, compared to non-thresholding. By decreasing the number of questions needed to rule out a patient, we decreased the amount of data entry required and possibly decreased the amount of time spent on the screening process.
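The 10-fold cross validation and the threshold sweep used in these experiments can be sketched in Python as follows. The fold construction (contiguous, equal-sized sets) and the two callables `gather_stats` and `screen` are hypothetical stand-ins for the probability gathering and the testing system; only the fold count and the threshold values come from the text.

```python
def k_fold_splits(patients, folds=10):
    """Yield (training, held_out) pairs; each patient is held out once.

    Assumes len(patients) is divisible by `folds`, as with
    100 patients and 10 sets of 10.
    """
    size = len(patients) // folds
    for i in range(folds):
        held_out = patients[i * size:(i + 1) * size]
        training = patients[:i * size] + patients[(i + 1) * size:]
        yield training, held_out


def threshold_grid():
    """T = 0, then T = 10 to 120 in steps of 5 (the rows of Table 3.16)."""
    return [0] + list(range(10, 121, 5))


def cross_validate(patients, gather_stats, screen, threshold):
    """Mean number of questions asked over all held-out patients.

    gather_stats(training, threshold) -> per-question statistics
    screen(patient, stats)            -> questions asked for one patient
    Both callables are hypothetical placeholders.
    """
    asked = []
    for training, held_out in k_fold_splits(patients):
        stats = gather_stats(training, threshold)
        asked.extend(screen(patient, stats) for patient in held_out)
    return sum(asked) / len(asked)
```

A sweep would call `cross_validate` once per value in `threshold_grid()` and compare the resulting averages, which is how a curve such as the one in Figure 3.9 could be produced.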


3.2.4 Data Entry Optimization

During the patient data collection phase, we received continuous feedback from physicians and nurses at the Moffitt Cancer Center GI Clinic. To make MEANS more user-friendly, we implemented the clinicians' suggestions. One of the concerns was the number of mouse clicks that clinicians needed to make during the selection of protocols. Based on that, we implemented a protocol bypass option. That option, which is preselected, automatically selects all of the protocols for which a patient will be screened. A user has the option of deselecting the automatic selection of protocols if he or she would like to screen a patient only for a subset of active protocols. If a user decides to deselect it, then the user may manually pick which protocols to screen a patient for. At least one of the protocols must be selected in order for the system to proceed.

Another optimization measure, which was implemented based on suggestions from the physicians, was the entry of normal lab values. Physicians are only interested in whether a patient has abnormal lab values. If the lab values of a patient are all within the normal range, then the actual lab values are not of particular interest. The same can be said about MEANS' screening process. As long as the lab values are within the normal range, a patient will be considered eligible. Based on that, we implemented a Normal checkbox for each question that requires a lab value. Another related option, All Labs Normal, was also requested by the physicians. Now, when a physician checks a patient's labs and is able to determine that all of the lab values fall within the normal range, the physician can check the All Labs Normal checkbox and the system will automatically enter all of the predefined lab values into that patient's profile. This can be seen in Figure 3.10. Such measures significantly reduce the number of mouse clicks that a physician must make during the screening process. Implementation of the above-mentioned changes also had an impact on the system flow.

We also found that there was a significant difference between the verbiage used by the physicians and the legal language in which the protocol inclusion/exclusion criteria were written. The legal version can sometimes be much longer, as seen in Table 3.18. To conserve a physician's time, only the shorter version of a question was displayed on the screen. The full version of the question was viewable by placing the mouse cursor over the question or clicking on the Full Text link.

Table 3.18 Yes/No Question Length

Full Text       Does the patient have histological or cytological confirmed
                esophageal adenocarcinoma or adenocarcinoma of the
                gastroesophageal junction? This refers to tumors at the junction
                of the esophagus and the stomach, where >50% of the
                tumor mass is above the diaphragm
Short Version   Tissue confirmed esophageal adenocarcinoma or
                adenocarcinoma of the gastroesophageal junction?

Automatic selection of all encoded clinical trials and the All Labs Normal option were introduced based on feedback from the physicians. These changes are reflected in Figure 3.10.

3.2.5 Interface Changes

To improve usability of the system, a variety of changes were made to its interface. The initial questions page is shown in Figure 3.10. We can see that beside each question there are three choices. If a question is a numeric question, a physician can enter the actual value, select Normal and have the system enter a value, or select Defer to answer the question later. By clicking on the Full Text link a physician can see the full version of any question. An example of a full-text pop-up can be seen in Figure 3.11.

To help physicians see when a patient becomes eligible, we added a table that lists the status of all of the trials during the screening process. When a patient becomes eligible for a trial, the color of the eligibility slot of the table changes to green. As seen in Figure 3.12, the first cell of the table is highlighted green, signifying that the patient with ID 105 is eligible for a trial.

Most of the physicians do not know the clinical trials by their protocol number because they refer to them by the name of the study. Because of this, we implemented a pop-up that includes both the protocol number and the full title of the clinical trial. By clicking on the # Eligible link, a pop-up with the explanation will appear (Figure 3.13). We received positive feedback on the changes that we have implemented.

Figure 3.10 MEANS Streamlined Initial Questions Page

Figure 3.11 Full Text Popup

Figure 3.12 Eligibility Color Table

Figure 3.13 Protocol Number and Title


CHAPTER 4
CONCLUSION

4.1 Implication Discoveries

We used data mining on acquired medical data to flag implications for further physician evaluation. As a result of our efforts, 14 rules were discovered; however, only 5 of these rules were true rules and could be entered into MEANS. These 5 implications were added to MEANS' implication module. I should note that one reason we did not find more was the way the initial questions in MEANS were set up. Because we wanted to identify ineligible patients as early as possible during the screening process, the initial questions page in MEANS contained a set of questions that were most likely to make a patient ineligible for a protocol. The initial page usually contained two key questions from each protocol, and each question was specifically chosen by a physician as most likely to rule out a patient from a protocol. The majority of the patients were ruled out after the questions on the first page were answered. The patients that were not ruled out after the first page were ruled out soon after, unless they were eligible for a protocol. Since the questions on the first page were not medically related, the number of rules that could be medically valid decreased tremendously.

We found that by using Apriori we could find many rules. We could isolate our rules of interest from the rest of the rules by looking at the number of antecedents and consequents. Our experiments showed that when we were using a confidence level of 100%, it was beneficial to look not only at rules with one antecedent and one consequent but also to include rules with two antecedents and one consequent in our rules-of-interest pool. With the help of physicians, we could determine whether a rule with two antecedents could be simplified by dropping one of the antecedents. Such a simplification was only possible with the support of medical knowledge.
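The rule filtering described above (100% confidence, one or two antecedents, exactly one consequent) can be sketched in Python. The `Rule` type and function names are hypothetical, and the confidence helper uses the standard association-rule definition, support(A and C) / support(A), rather than anything specific to the Apriori implementation used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A mined implication: antecedents -> consequents."""
    antecedents: frozenset
    consequents: frozenset
    confidence: float  # fraction in [0, 1]


def confidence(transactions, antecedents, consequents):
    """Standard confidence: support(A and C) / support(A)."""
    antecedents, consequents = set(antecedents), set(consequents)
    with_a = [t for t in transactions if antecedents <= set(t)]
    if not with_a:
        return 0.0
    hits = sum(1 for t in with_a if consequents <= set(t))
    return hits / len(with_a)


def rules_of_interest(rules, min_confidence=1.0, max_antecedents=2):
    """Keep rules with 1..max_antecedents antecedents, exactly one
    consequent, and confidence at or above min_confidence."""
    return [r for r in rules
            if r.confidence >= min_confidence
            and 1 <= len(r.antecedents) <= max_antecedents
            and len(r.consequents) == 1]
```

Rules that pass this filter would still need review by a physician; the filter only narrows the pool, it does not establish medical validity.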


I must also note that some of the discovered rules were medically validated based on certain assumptions. Because some of the questions did not specify the region of the body to which they were referring, we assumed either that the reference was to the same section of the body or that both the antecedent and the consequent of the rule referred to the same tissue. The data supported our assumption; however, since the rules could only be validated based upon the assumption, it was not a good idea to include such rules in our implication subsystem. Because of this, the rules that required any assumptions were not included in our final list.

Many decisions in medicine can be subjective; therefore, two physicians may evaluate the same condition differently. This presents another challenge for medical validation of newly-discovered rules.

4.2 Probability-Based Reordering

We were interested in minimizing the amount of data entry needed to determine a patient's eligibility for a clinical trial protocol. If a patient is eligible, all of the questions must obviously be answered. However, if a patient is ineligible as a result of the screening, it would be beneficial to ask the question that made the patient ineligible at the beginning of the screening process.

We were interested in calculating the likelihood (probability) that a question would make a patient ineligible (ruled out) for a protocol. We based our calculation of probability on Bayes' rule of conditional probability (Equation 3.1). By calculating the probability for each question and presenting the user with a question that has a higher rule-out probability-influenced ranking value, we showed that the average number of questions that were presented (Questions Asked) by the test system decreased from 86.7 to 18.87 and that the average number of questions that needed to be answered before eligibility was determined decreased from 7.89 to 5.11.

We also showed that by thresholding the number of times a question was asked, it was possible to decrease the number of questions asked during the screening process even more. With thresholding during the probability calculation, the average number of questions asked by the system was lowest when the threshold was between 45 and 80; the optimal threshold was 55. At T=55 the average number of questions asked by the test system to determine eligibility was 8.60 and the average number of answers required for eligibility determination was 5.71. Compared to analytical and non-thresholding probability-based heuristics, this is a significant reduction in the number of questions that were asked (presented) and answered in order to determine eligibility. At this time the probabilistic thresholding is only being implemented in test systems.

Since the probability was gathered every time a question was answered, we could set the threshold value and, as soon as the threshold value was reached, the probability-based reordering for that question would come into play. If the question was below the threshold value, probability-based reordering would have no effect on the question ranking, and the question would be placed in the lower part of the list of questions to ask, after all of the questions with a rule-out ranking value. In this way our system "learns" from submitted answers and becomes "smarter" over time.

4.3 Optimization

Physicians are very reluctant to use new software, which makes it difficult to introduce new systems to them. Following the participatory design method, we used the feedback that was given to us by the clinicians during the patient screening phase to streamline our system [17]. The feedback that was received during the fielding of the system was invaluable to the overall success of MEANS. If the system is not accepted by the users, no matter how well it functions or how great it is, no one will use it and, as a result, it will fail.

We found that the acceptance of MEANS in a clinical environment depends heavily on the amount of time that a physician will spend screening a patient. Physicians are very busy, and if the system takes a long time to learn and use, it is less likely that it will be utilized. We used medical knowledge to select the questions for display on the initial page because most of the rule-out questions were based on cancer site. Based on the feedback, we modified the system to decrease the amount of time a physician spends reading the questions. We also decreased the number of mouse clicks required. With one mouse click a physician can mark all of the lab values as Normal. Since we automated the selection of the protocols for the physician, no mouse clicks are required to select all protocols for screening. During the screening process, optimization occurs prior to the probability-based reordering.

4.4 Future Work

Because of HIPAA regulations, we were unable to interface MEANS with the electronic medical records system at H. Lee Moffitt Cancer Center & Research Institute. Therefore, every time a new patient was screened, all of the patient's information had to be entered manually. Because of this, the amount of information available for data mining was very limited. It would be a great advantage to have some kind of interface through which information from medical records could be imported into MEANS. This would significantly increase the amount of data available for data mining. It would also decrease the amount of data entry required for screening patients.

We did not explore this option, but it is possible to automate the entire process of extracting data from MEANS, running the Apriori algorithm on the data, recovering rules, and filtering rules based on the number of antecedents in the rule. It would be interesting to see, with the addition of new data, what rules could be recovered on a weekly basis.

Mining medical data presents a challenge all its own because of the nature of the medical domain. Recovering rules based on medical data has proved to be a difficult task. Many questions in the protocols are not specific enough, which in turn presents a gray area when trying to use medical knowledge to validate a rule. It would be an advantage to have clearer wording in the inclusion/exclusion criteria of the protocols. This would help alleviate confusion about whether questions are referring to the same tissue.

We only explored Bayes' rule of conditional probability with the thresholding option. Other probabilistic methods should also be explored to see if they may be more effective in reordering questions for rule-out probability.

We used a 10-fold cross validation of 100 patients to conduct our experiments. It would be interesting to see how the system behaves with a larger number of patients. If the number of patients increases, it may be possible that a different threshold value would perform better.


REFERENCES

[1] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. pages 207-216. Proceedings of the 1993 ACM SIGMOD international conference on Management of data, 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. Proceedings of the 20th Very Large Databases Conference, Santiago, Chile, Sept. 1994.
[3] J. S. Aikins, J. C. Kunz, E. H. Shortliffe, and R. J. Fallat. Puff: An expert system for interpretation of pulmonary function data. Computers and Biomedical Research, 16:199-208, July 1982.
[4] American Cancer Society. Cancer facts and figures 2005. www.cancer.org, 2005.
[5] S. Bhanja, L. M. Fletcher-Heath, L. O. Hall, D. B. Goldgof, and J. P. Krischer. A qualitative expert system for clinical trial assignment. pages 84-88. Proceedings of the Eleventh International Florida Artificial Intelligence Research Society Conference, 1998.
[6] British Broadcasting Corporation. Historic figures, http://www.bbc.co.uk/history/historic_figures, 2006.
[7] W. J. Clancey and R. Letsinger. Neomycin: Reconfiguring a rule-based expert system for application to teaching. In IJCAI, pages 829-836, 1981.
[8] F. J. Díez, J. Mira, E. Iturralde, and S. Zubillaga. DIAVAL, a Bayesian expert system for echocardiography. Artificial Intelligence in Medicine, 10:59-73, 1997.
[9] E. Fink, L. O. Hall, D. B. Goldgof, J. P. Krischer, B. Goswami, and M. Boonstra. Experiments on the automated selection of patients for clinical trials. volume 5, pages 4541-4545. IEEE, Systems, Man and Cybernetics, 2003. IEEE International Conference on, Oct 2003.
[10] P. Fule and J. F. Roddick. Detecting privacy and ethical sensitivity in data mining result. volume 26. Proceedings of the Twenty-Seventh Australian Computer Science Conference (ACSC 2004), Australian Computer Society, Inc., January 2004.
[11] B. Goethals. In The Data Mining and Knowledge Discovery Handbook, chapter 17, pages 377-397. Springer, 2005.
[12] B. D. Goswami. Optimizing cost and data entry for assignment of patients to clinical trial using analytical and probabilistic web-based agents. Master's thesis, University of South Florida, 4202 E Fowler Avenue, Tampa, FL 33620, Nov 2003.


[13] B. D. Goswami, L. O. Hall, D. B. Goldgof, E. Fink, and J. P. Krischer. Using probabilistic methods to optimize data entry in accrual of patients to clinical trials. pages 434-438. Computer-Based Medical Systems, 2004. CBMS 2004. Proceedings. 17th IEEE Symposium on, June 2004.
[14] M. Hagland. Stronger computer tools allow deeper analysis of medical research, patient care and insurance data. Healthcare Informatics, page 33, April 2004.
[15] H. Lee Moffitt Cancer Center & Research Institute. Prevention and treatment: Clinical trials., April 2006.
[16] National Cancer Institute. Dictionary of cancer terms - clinical study. http://www.cancer.gov, May 2006.
[17] M. R. John H. Gennari. Participatory design and an eligibility screening tool. In Proceedings of the American Medical Informatics Association Annual Fall Symposium, pages 290-294, 2000.
[18] P. K. Kokku, L. O. Hall, D. B. Goldgof, E. Fink, and J. P. Krischer. A cost-effective agent for clinical trial assignment. volume 1, pages 60-65. IEEE, Oct 2002.
[19] J. Lederberg. Proceedings of the ACM conference on history of medical informatics, Bethesda, Maryland, USA, November 5-6, 1987. In B. I. Blum, editor, History of Medical Informatics. ACM, 1987.
[20] MedicineNet.com. Diagnosis information about clinical trials. www.MedicineNet.com, 1996-2001.
[21] S. Nikiforou. Selection of clinical trials: Knowledge representation and acquisition. Master's thesis, University of South Florida, 4202 E Fowler Avenue, Tampa, FL 33620, May 2002.
[22] National Institute of Health. HIPAA privacy rule and its impact on research. http://privacyruleandresearch.nih.gov/, May 2006.
[23] C. Papaconstantinou, G. Theocharous, and S. Mahadevan. An expert system for assigning patients into clinical trials based on bayesian. Journal of Medical Systems, 22:189-202, 1998.
[24] J. F. Roddick, P. Fule, and W. J. Graco. Exploratory medical knowledge discovery: experiences and issues. ACM SIGKDD Explorations Newsletter, 5:94-99, July 2003.
[25] E. H. Shortliffe, S. G. Axline, B. G. Buchanan, T. C. Merigan, and S. N. Cohen. An artificial intelligence program to advise physicians regarding antimicrobial therapy. Computers and Biomedical Research, 6:544-560, 1973.
[26] E. H. Shortliffe, B. G. Buchanan, and E. A. Feigenbaum. Knowledge engineering for medical decision making: A review of computer-based clinical decision aids. In Proceedings of the IEEE, volume 67, pages 1207-1224. IEEE, September 1979.
[27] T. Shortliffe and R. Davis. Some considerations for the implementation of knowledge-based expert systems: The MYCIN project. In AIM Workshop, pages 9-12. SIGART Newsletter, December 1975.


[28] National Research Council (U.S.). Funding a Revolution: Government Support for Computing Research. National Academy Press, Washington, D.C., 1999.
[29] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
[30] H. Yao and H. J. Hamilton. Mining itemset utilities from transaction databases. Data and Knowledge Engineering, Accepted October 2005, 2005.


xml version 1.0 encoding UTF-8 standalone no
record xmlns http:www.loc.govMARC21slim xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.loc.govstandardsmarcxmlschemaMARC21slim.xsd
leader nam Ka
controlfield tag 001 001798354
003 fts
005 20070731110850.0
006 m||||e|||d||||||||
007 cr mnu|||uuuuu
008 070731s2006 flu sbm 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14-SFE0001639
040
FHM
c FHM
035
(OCoLC)159957687
049
FHMM
090
QA76 (ONLINE)
1 100
Ivanovskiy, Tim V.
0 245
Mining medical data in a clinical environment
h [electronic resource] /
by Tim V. Ivanovskiy.
260
[Tampa, Fla] :
b University of South Florida,
2006.
3 520
ABSTRACT: The availability of new treatments for a disease depends on the success of clinical trials. In order for a clinical trial to be successful and approved, medical researchers must first recruit patients with a specific set of conditions in order to test the effectiveness of the proposed treatment. In the past, the accrual process was tedious and time-consuming. Since accruals rely heavily on the ability of physicians and their staff to be familiar with the protocol eligibility criteria, candidates tend to be missed. This can result and has resulted in unsuccessful trials.A recent project at the University of South Florida aimed to assist research physicians at H. Lee Moffitt Cancer Center & Research Institute, Tampa, Florida, with a screening process by utilizing a web-based expert system, Moffitt Expedited Accrual Network System (MEANS). This system allows physicians to determine the eligibility of a patient for several clinical trials simultaneously.We have implemented this web-based expert system at the H. Lee Moffitt Cancer Center & Research Gastroenterology (GI) Clinic. Based on our findings and staff feedback, the system has undergone many optimizations. We used data mining techniques to analyze the medical data of current gastrointestinal patients. The use of the Apriori algorithm allowed us to discover new rules (implications) in the patient data. All of the discovered implications were checked for medical validity by a physician, and those that were determined to be valid were entered into the expert system. Additional analysis of the data allowed us to streamline the system and decrease the number of mouse clicks required for screening. We also used a probability-based method to reorder the questions, which decreased the amount of data entry required to determine a patient's ineligibility.
502
Thesis (M.A.)--University of South Florida, 2006.
504
Includes bibliographical references.
516
Text (Electronic thesis) in PDF format.
538
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
500
Title from PDF of title page.
Document formatted into pages; contains 59 pages.
590
Adviser: Dmitry Goldgof, Ph.D.
653
Data mining.
Rule association.
Medical expert system.
Apriori.
Medical implications.
690
Dissertations, Academic
z USF
x Computer Science
Masters.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.1639