USF Libraries
USF Digital Collections

A generalized framework for automatic code partitioning and generation in distributed systems


Material Information

Title:
A generalized framework for automatic code partitioning and generation in distributed systems
Physical Description:
Book
Language:
English
Creator:
Sairaman, Viswanath
Publisher:
University of South Florida
Place of Publication:
Tampa, Fla
Publication Date:

Subjects

Subjects / Keywords:
Code Partitioning
Heterogeneous Systems
Distributed Execution
Code Generation
Dissertations, Academic -- Computer Science and Engineering -- Doctoral -- USF   ( lcsh )
Genre:
non-fiction   ( marcgt )

Notes

Abstract:
In distributed heterogeneous systems the partitioning of application software to be executed in a distributed fashion is a challenge by itself. The task of code partitioning for distributed processing involves partitioning the code into clusters and mapping those code clusters to the individual processing elements interconnected through a high speed network. Code generation is the process of converting the code partitions into individually executable code clusters and satisfying the code dependencies by adding communication primitives to send and receive data between dependent code clusters. In this work, we describe a generalized framework for automatic code partitioning and code generation for distributed heterogeneous systems. A model for system level design and synthesis using transaction level models has also been developed and is presented. The application programs along with the partition primitives are converted into independently executable concrete implementations. The process consists of two steps: first translating the primitives of the application program into equivalent code clusters, and then scheduling the implementations of these code clusters according to the inherent data dependencies. Further, the original source code needs to be reverse engineered in order to create a meta-data table describing the program elements and dependency trees. The data gathered is used along with Parallel Virtual Machine (PVM) primitives for enabling the communication between the partitioned programs in the distributed environment. The framework consists of profiling tools, partitioning methodology, architectural exploration and cost analysis tools. The partitioning algorithm is based on clustering, in which the code clusters are created to minimize communication overhead represented as data transfers in the task graph for the code.
The proposed approach has been implemented and tested for different applications and compared with simulated annealing and tabu search based partitioning algorithms. The objective of partitioning is to minimize the communication overhead. While the proposed approach performs comparably with simulated annealing and better than tabu search based approaches in most cases in terms of communication overhead reduction, it is conclusively faster than simulated annealing and tabu search by an order of magnitude as indicated by simulation results. The proposed framework for system level design/synthesis provides an end-to-end rapid prototyping approach for aiding in architectural exploration and design optimization. The level of abstraction in the design phase can be fine-tuned using transaction level models.
Thesis:
Dissertation (Ph.D.)--University of South Florida, 2010.
Bibliography:
Includes bibliographical references.
System Details:
Mode of access: World Wide Web.
System Details:
System requirements: World Wide Web browser and PDF reader.
Statement of Responsibility:
by Viswanath Sairaman.
General Note:
Title from PDF of title page.
General Note:
Document formatted into pages; contains X pages.
General Note:
Includes vita.

Record Information

Source Institution:
University of South Florida Library
Holding Location:
University of South Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
usfldc doi - E14-SFE0003449
usfldc handle - e14.3449
System ID:
SFS0027764:00001




Full Text
MARC Record (summary)

Author: Sairaman, Viswanath.
Title: A generalized framework for automatic code partitioning and generation in distributed systems [electronic resource] / by Viswanath Sairaman.
Imprint: [Tampa, Fla.] : University of South Florida, 2010.
General notes: Title from PDF of title page. Document formatted into pages; contains X pages. Includes vita.
Dissertation note: Dissertation (Ph.D.)--University of South Florida, 2010.
Bibliography note: Includes bibliographical references.
Format: Text (Electronic dissertation) in PDF format.
System details: Mode of access: World Wide Web. System requirements: World Wide Web browser and PDF reader.
Abstract: As given under Material Information above.
Advisor: Nagarajan Ranganathan, Ph.D.
Keywords: Code Partitioning; Heterogeneous Systems; Distributed Execution; Code Generation.
Subject: Dissertations, Academic -- USF -- Computer Science and Engineering -- Doctoral.
Series: USF Electronic Theses and Dissertations.
Electronic access: http://digital.lib.usf.edu/?e14.3449



A Generalized Framework for Automatic Code Partitioning and Generation in Distributed Systems

by

Viswanath Sairaman

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Nagarajan Ranganathan, Ph.D.
Srinivas Katkoori, Ph.D.
Hao Zheng, Ph.D.
Tapas Das, Ph.D.
Manish Agrawal, Ph.D.

Date of Approval: February 5, 2010

Keywords: Code Partitioning, Heterogeneous Systems, Distributed Execution, Code Generation

© Copyright 2010, Viswanath Sairaman

DEDICATION

To Bhavani

ACKNOWLEDGEMENTS

I would like to thank Dr. Nagarajan Ranganathan for guiding me through the process of pursuing my research and writing my dissertation. When things were moving slow, the encouragement that he provided was the only motivating factor that helped me in pursuing my goals. I would like to thank him for being patient with me, helping me to regain focus, and believing that I could complete this research work. This dissertation would not have been possible without his in-depth expertise and infinite patience in the process of helping me towards the final goal of defending my dissertation. I would also like to thank Dr. Srinivas Katkoori for helping on innumerable accounts and always supporting me in all aspects. I would like to thank Dr. Tapas Das, Dr. Manish Agrawal and Dr. Hao Zheng for kindly consenting to be members of my committee and guiding me in my graduation. I would like to thank Daniel Prieto and Peter Schaivo of the technical support for the computer science and engineering department. This dissertation research would have been very difficult without their constant efforts in fixing all the issues with software and hardware. I would also like to thank Ms. Theresa Collins of the graduate program for the long hours and infinite patience in guiding me through all paperwork and the readmission process. I would like to thank Dr. Dmitry Goldgof, Graduate Program Coordinator of the computer science department, for allowing me this opportunity to complete my dissertation. I would like to thank Dr. Sudeep Sarkar for clearing all my requisitions and making sure that I would be given a fair chance to present my work. Finally I would like to thank Catherine Burton for the invaluable amount of time and expertise in guiding me with all the necessary steps for graduation. I'm really thankful for all the support that she has provided in all these years to successfully complete all formalities at USF. I would also like to thank Narender Hanchate, Karthikeyan Lingasubramanyan and Manish Shukla who have been with me in this long journey and have always supported me

TABLE OF CONTENTS

LIST OF TABLES iv
LIST OF FIGURES v
ABSTRACT vii
CHAPTER 1 INTRODUCTION 1
  1.1 Distributed Heterogeneous Systems 1
  1.2 Mapping Application on a Distributed Heterogeneous System 4
    1.2.1 Code Partitioning 5
    1.2.2 Code Generation 7
  1.3 Mapping Implementation Details 8
    1.3.1 Clustering 8
    1.3.2 Partitioning Details 9
  1.4 Contributions of the Dissertation 10
  1.5 Outline 11
CHAPTER 2 RELATED WORK 13
  2.1 Partitioning Problems in Different Application Domains 13
  2.2 Code Partitioning Based Approaches 14
    2.2.1 Heterogeneous Systems 15
    2.2.2 Graph/Automata Approaches 18
    2.2.3 Hardware/Software Partitioning Systems 18
  2.3 Communication Mechanisms in Distributed Applications 21
  2.4 Code Generation for Partitioned Applications 23
  2.5 Context of Dissertation Research 26
CHAPTER 3 CLUSTERING METHODOLOGY 28
  3.1 Partitioning Process 28
    3.1.1 Overview of the Partitioning Metrics 29
  3.2 Motivation for Clustering 30
    3.2.1 Attributes for Closeness 31
    3.2.2 Computing Closeness 32
  3.3 Partitioning Algorithm 32
    3.3.1 Improvising Clusters 35
    3.3.2 Steering Logic 37
  3.4 Architecture and Assumptions for Experimental Setup 37
    3.4.1 Computation of Attribute Values 38

  3.5 Experimental Results 39
    3.5.1 Comparison of Cumulative Results 43
CHAPTER 4 SOFTWARE ARCHITECTURE 46
  4.1 Profiling Setup 47
  4.2 Sparc-Linux-Arm Cross Compilation 53
  4.3 Motorola m68hc11 Cross Compilation 55
  4.4 PowerPC Cross Compilation 55
  4.5 C++ Based Implementation 56
    4.5.1 Motivation for Using C++ 56
    4.5.2 Object Descriptions 58
    4.5.3 Tree Traversal and Cost Computations 59
    4.5.4 Building Blocks 59
CHAPTER 5 CODE GENERATION 61
  5.1 Need for a Data Repository 61
    5.1.1 Reverse Engineering 63
    5.1.2 The "Understand C" Tool 64
      5.1.2.1 Functionality and Utility of Tool 64
  5.2 Creation of Metadata Table 64
  5.3 Partitioning Program into Independent Sub-programs 66
    5.3.1 Clustering 66
    5.3.2 Adding Constructs 67
  5.4 Communication Methodology 68
    5.4.1 Selection of PVM as the Communication Mechanism 69
    5.4.2 PVM Communication Process 69
      5.4.2.1 Asynchronous Messaging Primitives 71
    5.4.3 Implementing PVM Communication 72
    5.4.4 Optimizations 74
      5.4.4.1 Broadcast Communication 74
      5.4.4.2 Optimizing Communication Overhead 76
      5.4.4.3 Use Aggregate Communication for Consecutive Tasks 77
    5.4.5 Code Size Increase 77
  5.5 Summary 78
CHAPTER 6 A FRAMEWORK FOR ARCHITECTURAL EXPLORATION 79
  6.1 Electronic System Level Design 79
  6.2 Classic Design Flow 80
  6.3 SoC Design Flow 82
  6.4 TLM/SystemC Based Exploration for SoC Architectures 84
CHAPTER 7 CONCLUSION 89
REFERENCES 92

APPENDICES 95
  Appendix A Clustering Background 96
ABOUT THE AUTHOR End Page

LIST OF TABLES

Table 3.1. Size of the Programs Used in Simulation 40
Table 3.2. Simulation Results Obtained by Running FFT 41
Table 3.3. Simulation Results Obtained by Running Statistical Tool 1 42
Table 3.4. Simulation Results Obtained by Running Statistical Tool 2 43
Table 3.5. Simulation Results Obtained by Running Laplacian Edge Detection 44
Table 3.6. Simulation Results Obtained by Running Arithmetic Encoding of Files 45
Table 3.7. Constraints Terminology 45
Table 3.8. Comparison of Execution Times for Clustering, Simulated Annealing and Tabu Search Based Approaches 45
Table 5.1. Overhead Increase in Terms of Lines of Code 78

LIST OF FIGURES

Figure 1.1. Conventional Single Line of Execution of an Application 3
Figure 1.2. Illustration of a Distributed Heterogeneous System 4
Figure 1.3. Design Flow for Mapping an Application on a Distributed System 6
Figure 1.4. Detailed View of the Mapping Process 8
Figure 2.1. A Taxonomy of Prior Works in Code Partitioning 19
Figure 2.2. A Taxonomy of Prior Works in Code Generation 24
Figure 3.1. Steps Involved in Code Partitioning 28
Figure 3.2. Snap Shot of a Part of the Task Graph 29
Figure 3.3. Sample Annotated Code 30
Figure 3.4. Flow Chart for the Clustering Algorithm 34
Figure 3.5. Algorithm for Partitioning Based on Clustering 35
Figure 3.6. Illustration of Possible Partitioning Options 36
Figure 3.7. Algorithm for Improvising Clusters 36
Figure 4.1. Application Source Code Profiling Tools 47
Figure 4.2. Illustration of Sample Code at Various Stages 48
Figure 4.3. Illustration of Output from Task Graph Generator 53
Figure 4.4. Overview of the Process of Building a Cross Compiler 54
Figure 4.5. Object Based Design of the Main Entities 57
Figure 5.1. Steps Involved in Code Generation 62
Figure 5.2. Reverse Engineering Process 63
Figure 5.3. Creation of Metadata Table 65

Figure 5.4. Clustering 66
Figure 5.5. Code Cluster Formation 67
Figure 5.6. Adding Constructs 68
Figure 5.7. Creation of Sub-Programs 69
Figure 5.8. Illustration for Partitioning Using Clusters 70
Figure 5.9. Abstraction of Communication Between Nodes from Different Clusters 70
Figure 5.10. Communication Between Sub-Programs 73
Figure 5.11. Interleaving PVM Primitives 74
Figure 5.12. Communication Algorithm 75
Figure 5.13. Local Memory Optimization 76
Figure 5.14. Aggregate Communication Optimization 77
Figure 6.1. Classic Design Flow for Electronic System Level Design 80
Figure 6.2. Design Flow for Modeling SOCs 82
Figure 6.3. Novel Approach for Modeling SOCs 83
Figure 6.4. Sample Communication Code Using TLM 85
Figure 6.5. Software Areas that can be Modeled using TLM 86
Figure 6.6. Optimization using TLM Models 88

A GENERALIZED FRAMEWORK FOR AUTOMATIC CODE PARTITIONING AND GENERATION IN DISTRIBUTED SYSTEMS

Viswanath Sairaman

ABSTRACT

In distributed heterogeneous systems the partitioning of application software to be executed in a distributed fashion is a challenge by itself. The task of code partitioning for distributed processing involves partitioning the code into clusters and mapping those code clusters to the individual processing elements interconnected through a high speed network. Code generation is the process of converting the code partitions into individually executable code clusters and satisfying the code dependencies by adding communication primitives to send and receive data between dependent code clusters. In this work, we describe a generalized framework for automatic code partitioning and code generation for distributed heterogeneous systems. A model for system level design and synthesis using transaction level models has also been developed and is presented. The application programs along with the partition primitives are converted into independently executable concrete implementations. The process consists of two steps: first translating the primitives of the application program into equivalent code clusters, and then scheduling the implementations of these code clusters according to the inherent data dependencies. Further, the original source code needs to be reverse engineered in order to create a meta-data table describing the program elements and dependency trees. The data gathered is used along with Parallel Virtual Machine (PVM) primitives for enabling the communication between the partitioned programs in the distributed environment. The framework consists of profiling tools, partitioning methodology, architectural exploration and cost analysis tools. The partitioning algorithm is based on clustering, in which the code clusters are created to minimize communication overhead represented as data transfers in the task graph for the code. The proposed approach has been implemented and tested for different applications and compared with simulated annealing and tabu search based partitioning algorithms. The objective of partitioning is to minimize the communication overhead. While the proposed approach performs comparably with simulated annealing and better than tabu search based approaches in most cases in terms of communication overhead reduction, it is conclusively faster than simulated annealing and tabu search by an order of magnitude as indicated by simulation results. The proposed framework for system level design/synthesis provides an end-to-end rapid prototyping approach for aiding in architectural exploration and design optimization. The level of abstraction in the design phase can be fine-tuned using transaction level models.

CHAPTER 1
INTRODUCTION

1.1 Distributed Heterogeneous Systems

Distributed computing is the process of executing an application in a parallel manner on multiple processing elements simultaneously. Depending on the architecture of the processing elements, topology and communication methodology employed, distributed systems can be divided into many categories.

Homogeneous System: All processing elements comprising the system have a similar basic architecture.

Heterogeneous System: (Figure 1.2.) Processing elements comprising the system have varying architecture and processing power.

Multiprocessor System: A homogeneous system in which the processing elements are multiple instances of the same basic architecture and are usually connected to each other on a bus based communication network.

Parallel execution is the process of executing different parts of a single application in parallel. Parallel execution does not necessarily constitute distributed execution. Parallelism in the execution of application source code can be achieved by means of different mechanisms at different levels of compilation. The highest level of parallelism is when the application compiler detects parallelism in the source code and provides means of simultaneous execution. The operating system provides the next lower level of parallel execution by employing threads. Multi-threaded operating systems are capable of executing multiple threads of a single application source. Parallelism can also be achieved at the processor

level through the use of processor multi-threading. The CPU is capable of choosing instructions from multiple instruction queues, thus exhibiting parallelism. Distributed execution is the process of executing the application from physically different locations.

Traditional computing systems follow a single line of execution thread through the application source code as indicated by Figure 1.1. Applications containing inherent parallelism suffer slower execution speeds in the case of a single thread of execution. Distributed heterogeneous systems provide an execution environment where multiple lines of execution through the application are possible. Even though compiler/operating system level multi-threaded architecture support can essentially achieve a similar result, the execution environments are clearly different. Multi-threaded architectures can create multiple lines of execution within the application; however, the ability to execute multiple threads is limited by the processing capability of the processing unit. At any given instant there is only one active thread on the processing unit. In the case when a thread in execution is halted on an event like an interrupt, a context switch occurs and the line of execution is transferred to another thread.

Figure 1.2. provides an illustration of a distributed heterogeneous system along with an application with different parts of the application mapped to the different processing elements of the distributed heterogeneous system. In the case of distributed heterogeneous systems, the application code is partitioned and mapped to different processing units in the system. Parallel execution of multiple lines of application source code is thus guaranteed. Heterogeneous systems consist of processing elements with varying processing capabilities, memory architectures, and also include processing elements with entirely different underlying architectures. Different system architectures for processing elements require different compilers, and the final executable version of the application once compiled is not portable and can be executed only on the host processor. The differences in the compilation process are definitely a disadvantage in the use of heterogeneous systems. The advantage of using heterogeneous systems is the wide variety of architectures available to choose from.

[Figure 1.1. Conventional Single Line of Execution of an Application]

Distributed systems provide an environment for high speed parallel computation, provided the inherent parallelism in the application is utilized to the maximum.

The implementation of a distributed heterogeneous system for mapping an application involves a number of design issues which are discussed briefly. The first issue in the design process is the choice of the processing units that would be the components of the distributed system. The processing units can vary from off-the-shelf generic processors to custom designed ASICs performing specific tasks. Partitioning the applications in such a way that some tasks are executed on generic processors and some tasks of the application are executed on custom designed hardware is known as the hardware/software partitioning problem. The combination of the processing units comprising the heterogeneous system required for achieving optimum performance when executing any application is specific for that application. The choice of the off-the-shelf processing units available in the market varies from micro-controllers with small memory to generic embedded processors with large

memory and complex instruction sets. Generic processors are expensive but programmable and support complex instructions. A generic processor possesses the capability to execute any task of the application. Micro-controllers are RISC processors with a lower degree of programmability. The cost of the micro-controllers is much less when compared to generic processors. The second issue involved in the design of distributed systems is the communication methodology used for data transfer between the components of the system. Fast and efficient communication methodologies are essential for reducing network delays and improving overall communication. The architecture for the distributed heterogeneous system is finalized when the processing elements have been chosen and a high speed communication methodology has been identified.

[Figure 1.2. Illustration of a Distributed Heterogeneous System: an application partitioned into parts 1-4, each mapped to a processing element on a network]

1.2 Mapping Application on a Distributed Heterogeneous System

Design and development of heterogeneous distributed computing systems is a challenge in itself. Efficient mapping of an application onto a distributed system is essential for low execution time and high performance. The mapping process must result in an optimized partitioning of the application code in order to promote speedy execution while maintaining a balanced workload among the processing elements of the system. In recent times a large

number of scientific and consumer electronics applications are being designed to execute on distributed heterogeneous systems in order to achieve optimum performance. Consumer electronics companies define the term Time-to-market for any product as the total amount of time taken by the product starting from its inception as an idea on the drawing board to the final product rollout. Time-to-market is vital and needs to be really short if the company needs to maintain a sustained market presence. According to [9] a delay of about four weeks in the time to market can lead to a loss of about twenty percent and a delay of ten weeks would lead to almost a fifty percent loss. The enormous size of the design space leads to difficulty in making faster design level decisions. In the following chapters the word application refers to an application that is intended for a distributed heterogeneous system.

Figure 1.3. shows an overview of the process of mapping an application onto a distributed heterogeneous system. The first step in the process is generating a task graph from the application. The task graph contains detailed information pertaining to each task in the application. Each task in the application is then profiled to obtain further information, including the number of CPU cycles taken by the task for execution, power consumed in order to perform the task, size of a task in terms of the amount of data needed for performing the task, and dependencies that this task shares with other tasks. The collected information is processed and then passed on to the partitioning algorithm to generate clusters of application code. An optimal mapping is identified for each code cluster and each component of the distributed system.

1.2.1 Code Partitioning

Code partitioning can be defined as the process of forming partitions (task clusters) by combining individual tasks, represented as nodes on the task graph, and mapping each partition onto a processing element in the heterogeneous distributed system. There are three inputs required for the code partitioning algorithm: a hardware library containing specific information on the components of the distributed heterogeneous system architecture, the application source code, and the task graph extracted from the application source code. The

output of the code partitioning algorithm is a set of partitions consisting of code clusters along with specific mapping information for each partition with a corresponding processing element (P.E.) on the distributed system. The code partitioning step in the mapping process is vital for the optimal execution of the application. The solution obtained from the partitioning problem can be further optimized in the heterogeneous scheduling process.

[Figure 1.3. Design Flow for Mapping an Application on a Distributed System: application source code → task graph generation → code partitioning → code generation → heterogeneous scheduling]

Code partitioning consists of three important tasks. They are:

Profiling: consists of parsing the task graph and the application source code and constructing a tree data structure in order to compute and hold communication and processing costs for tasks.

Partitioning: consists of using the profiling information and applying a partitioning algorithm on the task graph to generate individual code partitions.

Mapping: consists of identifying the best possible processing element for each individual code partition by the process of evaluating different combinations of partitions and processing elements.

1.2.2 Code Generation

The next step in the mapping process is Code Generation. The definition of the term code generation varies with the context in which it is used. In the broadest sense of the word, it means to generate or create source code. In the object oriented analysis and design (OOAD) domain, code generation commonly refers to synthesis of application source code for objects modeled using a modeling language like the Unified Modeling Language (UML). In the context of this work, the term has been used to indicate the process in which the partially complete code clusters obtained as output from the partitioning process are embedded with communication primitives in order to complete them and convert them to individually executable code segments. The code clusters are analyzed individually to identify and isolate data dependencies between the clusters of source code. Appropriate communication primitives are incorporated into the code clusters in order to permit efficient exchange of data and complete the execution of the application code. The steps involved in the code generation process are as follows.

Identify Dependencies: Analyze individual code clusters in order to identify and isolate data dependencies between partitions.

Add Code: Additional code is added to the original source of each partition in order to convert them to nearly independent executable programs.

Communication Synthesis: Add communication primitives to the source code to resolve inter-partition data dependencies.

The last step of the mapping process is heterogeneous scheduling. The clusters of code on interconnected processing units are scheduled for extracting the additional parallelism in the code.
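The code generation steps above can be sketched in miniature. The following Python fragment is an illustrative sketch only, not the dissertation's actual tool (which emits C source containing PVM send/receive calls): given a task-to-partition mapping and a list of data dependencies, it decides which send and receive stubs each generated sub-program would need for every dependency that crosses a partition boundary. All names here (`comm_stubs`, the stub strings) are hypothetical.

```python
# Sketch: deciding which communication primitives each partition needs.
# Hypothetical stand-in for a code generator that would emit real PVM
# calls (pvm_send/pvm_recv) in the generated C sub-programs.

def comm_stubs(partitions, deps):
    """partitions: {task: partition_id}; deps: [(producer, consumer, var)].
    Returns, per partition, the send/receive stubs required for every
    data dependency that crosses a partition boundary."""
    stubs = {p: [] for p in set(partitions.values())}
    for producer, consumer, var in deps:
        src, dst = partitions[producer], partitions[consumer]
        if src != dst:  # only cross-partition deps need communication
            stubs[src].append(f"send({var!r}, to={dst})")
            stubs[dst].append(f"recv({var!r}, from={src})")
    return stubs

# Tasks t1..t3 mapped onto two partitions; only the t1 -> t3 dependency
# crosses the boundary, so only variable 'y' must be communicated.
parts = {"t1": 0, "t2": 0, "t3": 1}
deps = [("t1", "t2", "x"), ("t1", "t3", "y")]
print(comm_stubs(parts, deps))
# {0: ["send('y', to=1)"], 1: ["recv('y', from=0)"]}
```

Note how the intra-partition dependency on `x` generates no stubs at all: this is the sense in which keeping dependent tasks in the same cluster reduces communication overhead.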

[Figure 1.4. Detailed View of the Mapping Process: identify and isolate parallel lines of execution in the application source code; convert individual program segments to independent programs and add communication primitives; map individual programs to processing elements; collate and combine results from individual partitions]

An elaborate depiction of the mapping process is provided in Figure 1.4. The partitioner isolates individual code clusters from the original application source code and identifies a suitable processing element for mapping. The code generator completes each individual code cluster with communication primitives necessary for sending and receiving data between interdependent tasks. The individual code clusters after the code generation step are independent of one another except for any data dependencies. The results are finally collated and returned.

1.3 Mapping Implementation Details

1.3.1 Clustering

Clustering is a technique that is ideally suited for partitioning problems. The partitioning algorithm used for code partitioning in this work is a variation of the K-means clustering algorithm proposed by MacQueen in 1967. The algorithm is simple, unsupervised and aggressive. The runtime for the algorithm is faster than most of the commonly used partitioning algorithms by orders of magnitude. The disadvantage of the algorithm is that there is no guarantee that the solution obtained is optimal. In order to overcome this shortcoming the algorithm can be executed a large number of times over the same set of data and the

best solution can be selected. The low runtime for the algorithm allows at least about 100 iterations of the algorithm without any significant setback on the overall runtime. Clustering follows a greedy approach. Hence steering the algorithm in the direction of the optimal solution is based on the formulation of the problem and cost factor definitions. The clustering algorithm does not provide the optimum number of partitions the application can be split into. The algorithm is executed for about five different cluster numbers and the three best numbers of partitions are chosen to be ideal.

1.3.2 Partitioning Details

The quality of the mapping generated by the partitioning algorithm is quantified by the amount of reduction in communication overhead and by the amount of workload balance among the processing elements. There is a need for a central automated entity to identify the most suitable processing element for a particular task in the process of mapping the tasks of the application. In order to achieve maximum performance the mapping is performed by considering the maximum number of attributes associated with the task. This section elaborates on finer details involved in making the decision choices that affect the performance of various stages in the mapping process. The application source code is initially divided into smaller tasks and passed through a task graph generator. The output of the process is a task graph. A task graph can be loosely defined as a graph with the set of all nodes representing the tasks of the application and the set of all edges representing control (or data) flow between tasks. A control data flow graph (CDFG) is also a form of task graph. A data flow graph is also a task graph. The task graph generator used in this work [19] parses an application source code and generates an acyclic directed graph with nodes representing tasks along with the time in seconds taken to execute the particular task, and edges representing data dependencies between tasks along with the amount of data that is transferred between the tasks in kilobytes. The application code is profiled statically for obtaining additional information and all the data is normalized and stored in a data reserve. The task graph generator [19] used for generating tasks from the application source

code is targeted specifically for applications written using the C language. Hence this work is targeted towards systems using the procedural language C for application description. The granularity level of a partitioning algorithm is an important factor that affects the performance of the algorithm. The granularity level a partitioning algorithm is implemented for is decided by the approximate size of each task comprising the application. Tasks consisting of loops, blocks and whole functions of code are defined to be coarse in granularity. Advantages of using coarser granularity are reduced execution times for the algorithm and lower communication overhead. The partitioning algorithm is characterized as operating at fine granularity when tasks comprising the application consist of a few instructions (or individual instructions) of source code. Fine granularity results in a huge solution space. The increase in size of the solution search space increases the execution time of the partitioning algorithm and also increases the number of design choices. In this work the granularity of the tasks is assumed to be coarse.

1.4 Contributions of the Dissertation

In this work, we describe a generalized framework for automatic code partitioning and code generation for distributed heterogeneous systems. A model for system level design and synthesis using transaction level models has also been developed and is presented. The framework consists of profiling tools, partitioning methodology, architectural exploration and cost analysis tools. The partitioning algorithm is based on clustering, in which the code clusters are created to minimize communication overhead represented as data transfers in the task graph for the code. The proposed approach has been implemented and tested for different applications and compared with simulated annealing and tabu search based partitioning algorithms. The objective of partitioning is to minimize the communication overhead. While the proposed approach performs comparably with simulated annealing and better than tabu search based approaches in most cases in terms of communication overhead reduction, it is conclusively faster than simulated annealing and tabu search by an order of magnitude as indicated by simulation results. The proposed framework for system


level design/synthesis provides an end to end rapid prototyping approach for aiding in architectural exploration and design optimization. The level of abstraction in the design phase can be fine tuned using transaction level models. The significant contributions of this dissertation are listed as follows.

- A new framework for automatic code partitioning and code generation.
- A new algorithm for code partitioning based on k-means clustering for partitioning application programs.
- A clustering algorithm based on execution time, power consumption and communication overhead as metrics.
- An algorithm and software for code generation to convert code clusters into independent code executables.
- A new TLM based model for architectural level design space exploration.
- A complete software tool that can perform automatic code partitioning and code generation.
- Results on select benchmarks.

1.5 Outline

This chapter introduced and discussed in brief the design of heterogeneous distributed systems and the mapping of applications onto such a system using the powerful tools of code partitioning and code generation. The rest of the dissertation is organized as follows. Chapter two briefly covers relevant research and industry efforts in the areas of code partitioning and code generation. The code partitioning algorithm based on hierarchical clustering is discussed in detail in chapter three. The software architecture of the implementation of the code partitioning algorithms, along with the profiling tools used in the work, is discussed elaborately in chapter four. Chapter five explains the code generation approach used in


this work. A detailed view of the proposed framework for TLM based SoC exploration is presented in chapter six. Chapter seven contains conclusions and scope for future work.


CHAPTER 2

RELATED WORK

This chapter provides a comparative analysis of existing contributions in the related work area. This work is a combination of code partitioning and code generation. The first part of the chapter provides a classification of the partitioning problem for different application environments. A brief overview of the different approaches for partitioning applications with different methods of description is presented next. A classification of the code generation problem, based on the mode of communication between the partitioned clusters of code, is then described briefly. Finally the chapter concludes with an overview of the various approaches to the code generation problem and some of the most significant contributions in that field. Code partitioning for a heterogeneous environment is a challenge by itself. The problem has been investigated at various levels of abstraction. Code partitioning has been studied under different architectures with varied constraints.

2.1 Partitioning Problems in Different Application Domains

Code partitioning is essentially a special case of the partitioning problem. Based on the specific application domain the code partitioning problem is mapped on, the partitioning problem can be classified into categories. The application domain categories include the following.

- HW/SW Partitioning: Partitioning applications modeled as distributed embedded systems using the hardware/software co-design process.
- Graph/Automata: Partitioning applications modeled using a task graph or finite state automaton, essentially for functional verification.


- Heterogeneous Systems: The partitioning problem for applications designed for execution on heterogeneous systems, implemented using high level programming languages.

Based on the finer implementation details of the application and the target platform, the partitioning problem for heterogeneous systems can be further classified into the following categories.

- Multitasking applications designed for mapping onto multiprocessor systems.
- Object oriented approaches requiring a high level of compiler and operating system support.
- Applications implemented using procedural languages like C.

The specific reason for a separate classification for object based applications is that object based applications need to be partitioned in congruence with object definition and behavior. Object based applications are generally implemented to promote modularity of the source code and ease of scalability, with the aid of strong compiler and operating system support. Procedural languages introduce additional challenges to the partitioner. Code density in the case of procedural languages is not uniform; hence optimally partitioning tasks while simultaneously maintaining a uniformly balanced workload is difficult. In the case of multitasking applications for multiprocessor systems, the partitioning problem is essentially a two part problem. The first part is at the level of the operating system and its ability to provide simultaneous execution (multi-threading). The second part of the problem is the actual partitioning for the multiprocessor system. Figure 2.1 represents some of the major contributions in the area of code partitioning.

2.2 Code Partitioning Based Approaches

Relevant work in code partitioning is presented in three categories. Applications implemented using high level programming languages and mapped onto heterogeneous and multiprocessor based systems are discussed first. In heterogeneous systems we present three


subcategories: multitasking based applications, object based applications and procedural language based applications.

2.2.1 Heterogeneous Systems

One of the earliest approaches for application partitioning on heterogeneous architectures was based on the ADA [22] programming language. A call-rendezvous graph (CRG) is constructed from the application using a parser. A CRG can be compared closely to a task graph. The nodes in the CRG represent program units/tasks/subprograms. The edges in the CRG represent call and task interaction relationships. A weighted CRG (WCRG) is obtained by including edge weights representing internode communication. The technique described by the author involves parsing ADA applications through a parser and subsequently employing a partitioner to tightly map the clusters of program units onto the heterogeneous system, with the primary goal of reducing communication overhead. The partitioner isolates tasks based on abstract data types or program units with behavior similar to objects. The technique works well at a small to medium scale. The approach is however not scalable to large or complex applications. The partitioner is limited in capacity to handling a small number of tasks and has a high runtime. Another disadvantage of the approach is the granularity of partitioning. Since the size of each node in the CRG is a program unit and the partitioner is targeted for object based behavior, the partitioning is performed at a very high level and cannot be modified to finer granularity.


Nacul et al [4] have developed an automated code partitioner with a code generator for dynamic multitasking applications. The Phantom system is developed primarily for embedded applications with very little operating system support. The motivation for the work is the idea of serializing compilers. A serializing compiler is a translator that converts a multitasking application (in this case written in the C language) into processor independent source code. Embedded processors are usually shipped along with tool chains that can be used for compiling applications for that specific processor. The output of the Phantom system can be compiled by the tool chains accompanying the embedded processor. The code partitioner is based on a generic clustering algorithm. The partitioner divides the original application source code into code blocks, and a scheduler is synthesized to execute the blocks of code. Each block of code is termed by the authors an Atomic Execution Block (AEB). AEBs are non preemptive and represent a set of nodes on a CFG. The Phantom system uses compile time information to divide the application into tasks and then form the partitions of code blocks. The approach has been tested for dynamic multitasking applications. The major downside of the approach is the generic clustering mechanism used to form the code blocks. The clustering lacks efficiency, as iterative improvements to the solution are achieved by randomly chasing the next move. The clustering mechanism employed does not take into account the performance variation of the code blocks on varied architectures.

The authors of the J-Orchestra approach [26] propose automatic partitioning for real time Java based systems. The J-Orchestra approach is targeted for applications designed for execution in ubiquitous computing environments. Ubiquitous computing environments are heterogeneous computing systems where the processing elements comprising the environment are varied in architecture and processing power, and also include embedded processors interconnected to each other in a wired and/or wireless network. The framework provides tools for rapid prototyping for mapping applications onto ubicomp domains. The automated process explores the design space for optimal partitioning and mapping of applications. The key contribution of this work is automating the design exploration process.
In the case of ubicomp environments, exploring the design space to isolate optimal high performance designs is a challenge. The implementation of the design has been tested to be dependable, sophisticated and scalable. A case study was performed with the Kimura system, comprising about 4400 lines of uncommented source. The automatic partitioner was used in the development of the Kimura ubicomp system. The partitioner used in the J-Orchestra approach takes the original application source code as input and then rewrites the source to reduce inter partition communication in a distributed environment. The implementation also includes a graphical user interface for the front end. The communication

PAGE 28

overhead is reduced by rewriting the source code to convert remote method invocations to direct object referencing in the bytecode. The result of the approach is that automatic partitioning is achieved easily, but the final solution is not necessarily optimal. The model proposed is targeted primarily for Java based object oriented applications. The scope of the partitioning solution is thus restricted.

Partitioning of instructions for wide issue super-scalar processors is dealt with in [31]. The focus of their work is on mapping code to the functional units of a clustered architecture, with the objective of optimizing the tradeoff between workload balancing and reducing communication overhead. The approach is targeted for multitasking applications mapped onto a multiprocessor environment.

The closest comparison to our work is in the area of procedural language based partitioners. In [18] the authors present an approach for partitioning and mapping programs written in ANSI C onto a Custom Computing Machine (CCM). CCMs are usually multi FPGA computing platforms. The application needs to be partitioned for efficient execution on functional units; the original application source code is converted to a data flow representation and mapped onto FPGAs. The next step in the process is scheduling the operations for high speed execution. A simulated annealing based partitioning algorithm is implemented, and the scheduler uses a partial schedule as input to complete the entire schedule. The compiler consists of two parts: an architecture independent front end which handles all the lexical and syntactical analysis in order to generate directed acyclic graphs (DAGs) based on the application source, and an architecture dependent back end to generate the architecture specific code. The authors have used an intermediary form, GDL (graph description language), to represent the architecture specific code. The schedule is tested by converting the source to VHDL, and the validity of the approach is ascertained.
The main setback of the approach is the use of a simulated annealing based partitioning algorithm. Simulated annealing guarantees a solution that is optimal, but at the cost of very high running time. As the complexity of the source code increases, simulated annealing based partitioners take many days to converge. Hence the approach is not scalable.


In 2006, after this work was completed, a case study [35] was published investigating code partitioning for high performance reconfigurable architectures. The author considers the task of partitioning between multiprocessors and FPGAs. In this case study different partitioning schemes are investigated for a custom designed architecture.

2.2.2 Graph/Automata Approaches

The code partitioning problem has also been considered equivalent to graph/automata based partitioning problems. The nodes in the graph/automaton represent components of the application. Hence the problem can be formulated as a graph/automaton partitioning problem and solved. The contribution in [23] is based on extended finite state machine (EFSM) models. The application source code is input in a high level language, in this case ESTEREL. The EFSM is generated from the application source code. The approach works only if the application can be modeled as an EFSM. In most cases translating the given application and modeling it as an EFSM is extremely difficult, and in some cases it may even be impossible. This is a major drawback of this contribution. The partitioning is performed for the domain of the hardware/software partitioning problem; hence this contribution is discussed in detail in the next section.

A genetic algorithm based approach for scheduling and mapping conditional task graphs for the synthesis of low power embedded systems was investigated in [7]. The major restriction of this approach is that it is restricted to CDFGs, and the major focus is on the synthesis of a schedule for the application on an embedded system.

2.2.3 Hardware/Software Partitioning Systems

Hardware/software partitioning is an important step in the hardware/software co-design process. The problem consists of generating a mapping for each of the tasks in the application. The tasks can be executed on a general purpose processor (software) or can be synthesized into independent instances of processing elements performing only specific computations (hardware). This is equivalent to the code partitioning problem


Figure 2.1. A Taxonomy of Prior Works in Code Partitioning

where the processing elements are components of a heterogeneous system with varying architectures, processing speeds and costs. In this section we discuss some contributions in the hardware/software partitioning area that are closely comparable to our work.

In [23], the authors use a finite automaton and constraint based approach for a target architecture consisting of a general purpose core and a reconfigurable functional unit. The reconfigurable part is implemented using programmable FPGA units. The partitioning process is automated and takes a high level description of the application as input (in this particular case the language ESTEREL is used). Open source GNU based compiler GCC toolchains are used to convert the high level description of the application into ANSI C functionally equivalent tagged clusters of code. The code clusters are mapped onto the FPGA units. The partitioning algorithm clusters the individual components of the generated ANSI C code based on individual properties of the components. The authors term this process clubbing. The main contributions include an automated partitioner and a code generator for synthesizing ANSI C code for the FPGA units. The number of cycles needed for each C language instruction is computed using GCC toolchains, as is done in our work. The work does not consider other factors important to the partitioning process


like the power consumption of individual clubs and memory usage. The main limitation of this work is that it can be applied only to extended finite state machines.

Another notable contribution considering power in the process of partitioning was investigated in [16] and [15]. A comparative study of simulated annealing and tabu search based algorithms applied to system level partitioning is provided in [28]. The main contribution of this work is a system level hardware/software partitioning algorithm which takes input in the form of a high level machine independent description of the application in VHDL. Communication between the hardware (co-processor) and the software (microprocessor) is achieved using message passing calls implemented in VHDL. Partitioning algorithms are implemented using simulated annealing and tabu search based approaches, and the authors claim that after testing on benchmarks they find the tabu search based approach to be more effective than the simulated annealing based approach, and that tabu search based approaches converge faster than simulated annealing based approaches. The partitioning algorithm is fine grained, to the level of loops and basic blocks. Including fine implementation details at the system level restricts the number of available design choices. This work is referenced in the section on related work in code partitioning, even though it is system level hardware/software design, because of the similarities between the problems encountered in both areas in the partitioning step.

In [10], the authors use the binary code of the application to extract the information necessary for partitioning. This technique is efficient as binaries provide accurate runtime information, the only drawback being the overhead in applying the technique to different target architectures. The approach of clustering for hardware/software partitioning was attempted in [24], but it does not take into account the effects of power dissipation.
There are other notable contributions partitioning applications implemented in procedural languages designed for hardware/software partitioning environments. Some of the contributions are as follows. In NIMBLE [37] the authors have constructed an entirely retargetable framework for mapping applications onto reconfigurable embedded systems. The target architecture is described using an architecture description language (ADL). The


ADL and the high level description of the application in ANSI C are the inputs to the framework. The partitioning algorithm used in this work is based on a hierarchical clustering approach. Fine grained granularity, at the level of loops and basic blocks, is used by the partitioning algorithm; hence the solution obtained fully utilizes the maximal available instruction level parallelism. The output of the hardware/software partitioner comprises synthesized code for FPGAs and modified application C source code. The partitioner analyzes the source code to identify loops and modifies the source to include loop unrolling for optimal execution on the FPGAs. The downside of this approach is the fine granularity of the tasks comprising the application. The modeling is primarily intended for maximizing parallelism with the loop unrolling technique. The fine level of granularity of the tasks prohibits scalability of the approach to huge applications. The effective execution time even for moderately sized applications would be highly expensive, rendering this approach ineffective.

2.3 Communication Mechanisms in Distributed Applications

Communication methods used in distributed applications depend on the processing elements comprising the target architecture, the application source language, the topology of the distributed system and the methodology used for partitioning. General techniques used for communication in most distributed applications include CORBA, MPI, PVM and Java RMI. Phantom [4] uses a shared memory technique employing semaphores for synchronization. Shared memory systems do not require remote call communication procedures, but they do require synchronization of the shared data in order to avoid deadlock and ensure data coherency. Yang [34] uses message passing for communication in a shared memory environment. In [26], Java RMI is used for communication, which is hence useful only for Java based applications. We briefly discuss the features of some of the communication methods used in distributed systems.

Remote method invocation (RMI) allows Java developers to invoke object methods and have them execute on remote Java Virtual Machines (JVMs) [11]. Under RMI, entire


objects can be passed and returned as parameters, unlike many other remote procedure call based mechanisms which require parameters to be either primitive data types or structures composed of primitive data types. That means that any Java object can be passed as a parameter, even new objects whose class has never been encountered before by the remote virtual machine. RMI is Java specific, and writing a bridge to older systems becomes the responsibility of the programmer. Additionally, it is designed for object based applications only.

The Common Object Request Broker Architecture (CORBA) [11] is a competing distributed systems technology that offers greater portability than remote method invocation. Unlike RMI, CORBA is not tied to a particular language, and as such can integrate with legacy systems written in older languages, as well as future languages that include support for CORBA. CORBA is not tied to a single platform (a property shared by RMI), and shows great potential for use in the future. In contrast, for Java developers, CORBA offers less flexibility, because it does not allow executable code to be sent to remote systems. CORBA services are described by an interface, written in the Interface Definition Language (IDL). IDL mappings to most popular languages are available, and mappings can be written for languages created in the future that require CORBA support. CORBA allows objects to make requests of remote objects (invoking methods), and allows data to be passed between two remote systems. Remote method invocation, on the other hand, allows Java objects to be passed and returned as parameters. This allows new classes to be passed across virtual machines for execution (mobile code). CORBA only allows primitive data types and structures to be passed, not actual code.

The Message Passing Interface (MPI) is a standard for writing message passing programs, allowing efficient communication by avoiding memory-to-memory copying and allowing overlap of computation and communication. MPI [11] is more suitable for large multiprocessor homogeneous systems. In order to utilize the maximum capability of MPI, the target architecture needs to be restricted to homogeneous systems with identical architectures.


PVM is built around the concept of a virtual machine and is useful when the application is executed on a networked collection of hosts, particularly if the hosts are heterogeneous. PVM [11] contains resource management and process control functions that are important for creating portable applications that run on clusters of workstations and MPPs. The larger the cluster of hosts, the more important PVM's fault tolerant features become. The ability to write long running PVM applications that can continue even when hosts or tasks fail, or loads change dynamically due to outside influence, is quite important to heterogeneous distributed computing.

We choose PVM as the message passing technique in our method because it has these salient features. It is best suited for largely heterogeneous distributed systems. Other features like easy portability and robustness make it a better choice. MPI is good but not easily portable, and works better in a homogeneous environment. Even though PVM based communication techniques require minimal operating system support, PVM has been utilized in this work in order to facilitate a rapid prototype design at the system level. This enables easier design space exploration without hard design level decisions on specific implementation details.

2.4 Code Generation for Partitioned Applications

Automatic code generation has been used in a lot of scenarios, in some cases as a mere translator for conversion from one language to different target languages. Figure 2.2 shows some of the most significant contributions in the area of code generation. Pertaining to partitioning problems, code generation can be studied for two distinct methods: a) code generation from Statecharts or automata [6], [14], [17] (mostly seen in HW/SW partitioning problems) and b) code generation for software application partitioning problems. The closest to our approach are Pangaea, Coign and J-Orchestra.

In [34], the problem of parallelization of applications for distributed memory architectures is addressed. A directed acyclic graph (DAG) was used to model parallel computation. The tasks in the DAG were clustered with the help of a scheduling algorithm called DSC


Figure 2.2. A Taxonomy of Prior Works in Code Generation

(Dominant Sequence Clustering). The code generation method proposed by them involves executing a schedule of an arbitrary task graph based on an asynchronous communication model.

The tasks in the DAG access data in a shared memory programming style. Message passing was used for communication. The authors in [17] propose a code generation framework for hybrid automata which deals with continuous and discrete data dependency. The automatic code generation process proposed is decomposed into two phases: one translating each primitive into a piece of code, and the other scheduling the pieces of code consistently with the data dependencies. In [21], a CAD tool, SynDEx, is used for partitioning and code generation. The authors develop a fast automatic prototyping process dedicated to parallel architectures. In [23], the authors introduce an automated hw/sw partitioning and code generation flow for control applications. It uses the EFSM model and derives hw/sw implementations automatically from high-level synchronous specifications. Using the automated synthesis flow, embedded applications can be mapped from high-level language specifications like ESTEREL down to a hw/sw implementation on the reconfigurable architecture platform. However, this approach is limited to applications that can be modeled and programmed by extended finite state machines. Recently, there has been a lot of work in automatic code generation for embedded systems pertaining to software synthesis for hardware/software co-design problems. Most of these methods use conversion of automata into code. This method is more suitable for hw/sw co-design partitioning problems. Though this method can also be used for software applications, there is an overhead in translating the automata into software code.

There has not been much focus on the problem of code generation for distributed software. Mostly, when applications need to be distributed, programmers manually partition the application and assign those subprograms to machines. With the help of middleware such as RPC, CORBA and DCOM, these subprograms are successfully executed on the respective machines. Often the techniques used to choose a distribution are ad hoc and create one-time solutions biased to a specific combination of users, machines, and networks. To address this issue, the authors of Coign [12] proposed an automatic distributed partitioning system for binary software applications based on the COM model. Given an application (in binary form) built from distributable COM components, Coign constructs a graph model of the application's inter-component communication through scenario-based profiling. Later, Coign applies a graph partitioning algorithm to map the application across a network and minimize execution delay due to network communication. Using Coign, even an end user (without access to source code) can transform a non-distributed application into an optimized, distributed application. This method cannot be used as a generalized approach, as very few real world applications are written as collections of COM components.
J-Orchestra operates at the Java byte-code level and rewrites the application code to replace local data exchange (function calls, data sharing through pointers) with remote communication (remote function calls through Java RMI, indirect pointers to mobile objects). The resulting application is guaranteed to have the same behavior as the original


one (with a few, well-identified exceptions). J-Orchestra receives input from the user specifying the network locations of various hardware and software resources and the code using them directly. A separate profiling phase and static analysis are used to automatically compute a partitioning that minimizes network traffic.

The main objective of our program and J-Orchestra is the same: automatic application partitioning for a distributed embedded environment. The J-Orchestra approach is restricted to object based environments and specifically to Java based applications. The approach described in this work is more generalized and is based on applications that can be described using the procedural language C. In the J-Orchestra approach the whole application source is converted to byte code, the byte code is rewritten, and then the partitions are identified. In J-Orchestra all the local data communications are replaced with remote communication calls, thus resulting in additional overhead.

Pangaea [3] is an automatic partitioning system based on the JavaParty infrastructure for application partitioning. JavaParty is designed for manual partitioning and operates at the source code level; Pangaea is also limited in this respect. The source code is analyzed and partitioned, and the primitives generated are specific to the back-end adapters. In this work the adapters tested were based on CORBA, RMI and JavaParty. The work is restricted to a specific set of applications that can be modeled using object based techniques. The technique is directed to take full advantage of the inherent parallelism in the application. The partitioner is not responsible for the back-end distribution of the independent code clusters. Hence the technique requires manual intervention and also substantial programming for developing the back-end adapters for specific execution environments.


2.5 Context of Dissertation Research

Significant research contributions in the areas of code partitioning and code generation were discussed in this chapter. The primary shortcomings of the research works existing at the time of implementation of this work are listed below.

- No automatic code partitioning tools were available.
- Existing works were targeted for specific architectures.
- In most applications, the code partitioning step was done with human intervention.
- Very minimal support was available for procedural languages. Most of the tools were targeted towards object based applications.

This dissertation describes a code partitioning technique for applications which can be described using the procedural language C and are targeted for distributed heterogeneous environments. A new code generation methodology for executing partitioned code clusters in distributed environments is also described. In this chapter, prior work in the areas of code partitioning and code generation was studied, with emphasis on work directly related to this dissertation. An extensive comparison of the existing contributions in both areas was provided. The conclusive differences between the existing works and this dissertation were also noted.


CHAPTER 3

CLUSTERING METHODOLOGY

The code partitioning problem is equivalent to the traditional graph partitioning problem. The edges of the directed graph translate to control flow in the code. The nodes represent centers of computation like tasks or instructions. This chapter describes in detail the methodology applied for partitioning the task graph and thereby partitioning the application source code. The chapter is organized as follows. The partitioning algorithm is explained in detail first. The motivation behind using a clustering based approach is then discussed in detail. The architecture of the environment used and the assumptions are covered next. The chapter concludes with the set of all results of the experiments conducted to estimate the correctness and the quality of the partitioning algorithm on a set of benchmarks.

3.1 Partitioning Process

Figure 3.1. Steps Involved in Code Partitioning


The process of code partitioning has two major subprocesses. The application description, which is in the form of a single C language source code file, is partitioned into code clusters. Each cluster is then mapped onto a PE of the distributed system. Figure 3.1 shows a detailed view of the steps involved in the partitioning process.

3.1.1 Overview of the Partitioning Metrics

The first step in the partitioning process is the generation of a task graph from the application description. The application description, in the form of C language source code, is passed through the task graph generation tool [19]. The nodes of the generated task graph correspond to blocks of code from the actual source of the application. The edges of the task graph represent dependencies in the execution of the application, with the weights corresponding to the amount of data transfer in kilobytes (KB). Figure 3.2 shows a snapshot of the task graph generated from the application source code by the task graph generation tool. Each node in the figure is represented with two numbers. The first number is the task number corresponding to that node, and the second number is the execution time required for the task on the host processor on which the task graph was generated.

Figure 3.2. Snapshot of a Part of the Task Graph

The other output produced by the task graph generation tool is annotated C code. Figure 3.3 shows a snapshot of the annotated source code output from the task graph


generation tool. Comments are introduced in the original source code in order to indicate task boundaries and task numbers. The annotations include the functional block to which the task belongs and the task number. The keyword main indicates that the task belongs to the main functional block of the application. The annotated C code is used as an input to profiling tools to obtain the various attribute values for each of the tasks. This work investigates the partitioning problem in the context of a distributed embedded system environment. An exact fit of a code cluster to a PE yields the most optimal solution. In order to map a task to the most suitable PE, the task is analyzed thoroughly and its attribute values are computed. Considering a larger number of attributes for each task helps model the behavior of tasks in an accurate fashion.

Figure 3.3. Sample Annotated Code

3.2 Motivation for Clustering

Tasks have varying values for the different attributes on each of the different PEs. In this work the entire population is the set of all tasks comprising the application, classification is partitioning the application into clusters, and the metrics are the attribute values of the tasks on each of the PEs. The primary motivation for applying a clustering based approach to partition the application source code is to utilize the inherent tendency of tasks within the application to cluster, thereby achieving maximum optimization of all the factors contributing to the total cost of the solution. Another important characteristic of clustering based approaches is their exceptionally low execution time when compared to other partitioning methodologies [24]. Clustering tasks by considering a large number of attributes associated with each task optimizes the clustering process and improves the quality of the solution. Since the attributes of the tasks are varied in nature, in order to cluster tasks the attributes are taken collectively and modeled to generate a new metric called the closeness metric.

3.2.1 Attributes for Closeness

The closeness metric between any two clusters can be compared to the Euclidean distance between the clusters in a K dimensional space, with K denoting the number of attributes used in the clustering process. The first step is identifying the attributes of the tasks that are vital for modeling their behavior on any PE. The attributes of the task considered for computing the closeness metric in this work are as follows:

- Data communication coming into a task
- Execution time for the task on the PE
- CPU cycles necessary for execution on the PE
- Power consumption of the task on the PE

The attribute values for each task are different in nature, are represented using different units, and vary on widely different scales. The amount of data communication is represented in kilobytes (KB). The execution time for the task is measured in milliseconds. The CPU usage cycle count is a number usually in the range of hundreds of thousands. Power is

PAGE 43

measured in nanojoules. The closeness metric is a combination of all the above attributes. A straightforward combination of the different attributes is not a good solution here, since there are huge differences between the attribute values. In order to combine the attributes properly, the attributes need to be normalized; in this work a global normalization [8] policy is adopted. The normalization process can be tweaked in order to obtain a closeness metric that is biased towards one or more of the task attributes.

3.2.2 Computing Closeness

The values of the attributes for each task are obtained from the profiling information; some attribute values are obtained by mathematical extrapolation. For all the tasks in the application description, the maximum and minimum values of each attribute are computed. The range R_i of the i-th attribute is given by

R_i = Max_i - Min_i

The normalized metric is given by the following formula:

X[i,j] = sum over k of indicator_k * ((attribute_k(i) - attribute_k(j)) / R_k),  for all i != j   (3.1)

where k ranges over the attributes and indicator_k has a value of 1 if a value for attribute k exists, and 0 otherwise. In order to bias the normalization process towards one or more particular attributes, the indicator_k for those attributes can be modified to a different value; the indicator can then take a range of values instead of just zero or one. The closeness value is computed for every pair of clusters at every iteration of the partitioning process. The driver routine uses the closeness value to steer the partitioning algorithm towards the optimal solution.

3.3 Partitioning Algorithm

The partitioning algorithm is implemented with a driver routine that executes the clustering and computes the cost of the solution. The driver routine for partitioning executes
the function improvisecluster() in a tight loop until the termination condition is reached. The termination condition for the driver routine is designed to achieve multiple goals. Clustering is based on a greedy approach, so the solution obtained may be only a local optimum; hence the driver routine executes the clustering in a tight loop. A further risk is that the algorithm might not terminate even after reaching the global optimum, resulting in an infinite loop. To overcome these drawbacks, the termination condition is a two-part test: the algorithm stops when either 1) the number of iterations exceeds a fixed, user-defined maximum, or 2) the cost criterion has been attained. Even then there is still a small probability that the solution obtained is a local optimum. Hence, to gain confidence that the solution is indeed the global optimum, the entire driver routine is executed a hundred times with a varying number of user-defined iterations for each run. The improvisecluster() function contains the clustering algorithm and is explained in detail subsequently.
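The range normalization and pairwise closeness of Eq. 3.1 can be sketched in C++ as follows. This is a minimal illustration, not code from the framework: the names Task, rangeOf and closeness are invented here, and an absolute value is taken so that the combined metric behaves like a distance, which Eq. 3.1 leaves implicit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// A task is reduced to its vector of attribute values (communication,
// execution time, CPU cycles, power).
using Task = std::vector<double>;

// Range R_k = Max_k - Min_k of each attribute over all tasks.
std::vector<double> rangeOf(const std::vector<Task>& tasks) {
    std::vector<double> lo = tasks[0], hi = tasks[0];
    for (const Task& t : tasks)
        for (std::size_t k = 0; k < t.size(); ++k) {
            lo[k] = std::min(lo[k], t[k]);
            hi[k] = std::max(hi[k], t[k]);
        }
    std::vector<double> r(lo.size());
    for (std::size_t k = 0; k < r.size(); ++k) r[k] = hi[k] - lo[k];
    return r;
}

// X[i,j]: indicator-weighted sum of range-normalized attribute differences.
// indicator[k] = 0 drops attribute k; values other than 1 bias the metric.
double closeness(const Task& a, const Task& b,
                 const std::vector<double>& range,
                 const std::vector<double>& indicator) {
    double x = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
        if (range[k] > 0.0)
            x += indicator[k] * std::fabs(a[k] - b[k]) / range[k];
    return x;
}
```

A task would then be assigned to the cluster center for which this value is the least, as described in Section 3.3.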
The low running time of the clustering approach lets the algorithm converge to a solution orders of magnitude faster than simulated annealing and tabu search based approaches. The clustering algorithm can be executed for a large number of iterations to guarantee a solution and still finish much earlier than a single iteration of a simulated annealing based approach. The partitioning algorithm based on clustering returns cluster centers. A cluster center is the central node for a group of nodes. A task is assigned to a cluster center C_i if the task is reachable from the center of C_i and the closeness value for C_i is the least among the closeness values obtained when the task is assigned to any other cluster center. Figure 3.6. gives an illustration of the clustering process. The figure shows two cluster centers at present, namely a and b, each with a set of two nodes attached to it. The node x is currently not associated with any cluster, and there are two options for assigning it: the entire application can now be partitioned with x assigned to cluster center a or to cluster center b. Every
Figure 3.4. Flow Chart for the Clustering Algorithm (initialize values and cost, create initial cluster centers, then repeatedly generate new cluster centers, estimate the new cost and update the centers when the cost improves, until the stopping criteria are met and the mapping is output)
Figure 3.5. Algorithm for Partitioning Based on Clustering

non-cluster-center node in the application task graph is assigned to every cluster center in the graph and the cost is computed. The configuration leading to the optimum total cumulative cost is the ideal mapping. Clustering based algorithms follow a greedy approach: the algorithm proceeds in the direction yielding the lowest cost. Unfavorable or high cost solutions are reverted to an earlier configuration with lower cost and the algorithm proceeds to search in a new direction.

3.3.1 Improvising Clusters

The improvise cluster function takes as input a set of randomly generated cluster centers and works towards the optimal solution. A node (say h) which is not already present in the set of initial cluster centers is chosen from the list of tasks to replace an existing cluster center i if the gain obtained by replacing i with h is the maximum over the set of all non-cluster-center tasks. This process is repeated until the number of new cluster centers selected equals the number of clusters requested by the user. The new set of cluster centers thus obtained is subjected to the improvisation process until the stopping criteria are reached. Clustering is a heuristic
Figure 3.6. Illustration of Possible Partitioning Options

Figure 3.7. Algorithm for Improvising Clusters
based approach and hence does not guarantee a solution. The minimal execution time of the clustering algorithm makes it ideally suited to being executed numerous times.

3.3.2 Steering Logic

The partitioning algorithm is steered by the cost computed after each iteration. The factors that go into measuring the cost of a partition are varied in nature and specific to the user and the objective of the partitioning algorithm. In this work, the two factors taken into consideration for cost measurement are 1) the total amount of data communication between partitions and 2) the amount of workload imbalance. The workload imbalance is used both as a constraint and as a cost factor. A cap value for the imbalance is set by the user and all solutions exceeding the cap are rejected automatically. The solutions satisfying the imbalance constraint use both the amount of communication and the imbalance for cost computation. Combining the factors that contribute towards cost is fundamental to their proper evaluation and usage. In this work the cost factors are combined using the weighted geometric mean model, since it describes the relation between the cost factors more accurately than the traditional sum-of-products model:

Pcost = sqrt(Communication^W1 * Imbalance^W2)   (3.2)

In the above equation, W1 and W2 are weights assigned by the user. The workload balance criterion is used as a constraint to prevent trivial solutions of dense clusters.

3.4 Architecture and Assumptions for Experimental Setup

The experimental setup used to verify the validity of the approach and estimate the quality of the solution is explained in this section. The assumptions about the architecture under consideration are presented first; the actual setup for estimating the various attribute values is then discussed.
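A minimal sketch of the steering cost of Section 3.3.2, Eq. 3.2 with the imbalance cap applied as a hard constraint (the function and parameter names are illustrative, not the framework's API):

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Pcost of Eq. 3.2, or no value at all when the user-set imbalance cap is
// exceeded and the solution is rejected outright.
std::optional<double> partitionCost(double communication, double imbalance,
                                    double w1, double w2, double imbalanceCap) {
    if (imbalance > imbalanceCap)      // constraint: reject automatically
        return std::nullopt;
    // weighted geometric mean of the two cost factors
    return std::sqrt(std::pow(communication, w1) * std::pow(imbalance, w2));
}
```

The driver would treat a rejected solution like an unfavorable move and revert to the previous lower-cost configuration.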


The architecture under consideration is a distributed heterogeneous system with n processing elements (PE_1, PE_2, ..., PE_n). All the components are connected together by means of a high speed network, with data transfer speeds known in advance. The delay encountered in bus arbitration is assumed to be negligible for the sake of simplicity. The framework has been designed with scope for scalability and the inclusion of new parameters; hence bus arbitration protocols and the delays encountered in arbitration can be modeled into the existing framework with minimal modification to the original design, and the partitioning algorithm would still remain unchanged. The communication latency here is the product of the capacity of the communication link and the frequency of usage of the link. Since most of the simulation was carried out on a SPARC V9 Sun Blade system operating at 650 MHz with a built-in SPARC V9 floating point processor, the SPARC processor is referred to as the host processor. Some parts of the profiling were done on a PC with an Intel Pentium 4 1.3 GHz processor running Microsoft Windows XP and on a Mac OS X system with a PowerPC G4 1.1 GHz running OS X 10.4.11 (Tiger), in order to obtain a generalized estimate of the task attributes. The processing elements are assumed to have the capability to execute any of the tasks, although the time taken to execute a particular task might vary from one processor to another. In the set of simulation experiments conducted the memory capacity was fixed. The framework nevertheless supports changes in the total memory usage: the SimpleScalar tool set can be used for profiling memory usage and the results can be incorporated into the existing partitioning algorithm.

3.4.1 Computation of Attribute Values

The execution time on the host processor is obtained using the task graph extraction [19] tool. The assembly language listing of the source code for the SPARC processor is obtained using gcc, the GNU C compiler. The number of CPU cycles that each task would occupy is obtained from the assembly listing and the instruction set architecture information. The assembly listing on other processing elements is obtained using a cross compiler. A
cross compiler tool chain can be constructed for a wide variety of target architectures supported by the gcc compiler. Once a tool chain has been compiled and created for a particular architecture, the cross compiler can be used to compile and generate the assembly listing of any application for that specific target architecture. In this work cross compiler tool chains were set up for the Intel StrongARM processor series, the Motorola M68HC11 microcontroller series and the PowerPC series. The host platform used was Solaris 2.8 and the gcc version used was 2.95.3.

The tool computes the frequency of each type of instruction for every task on all the processing elements. The product of the frequency and the complexity of the instruction is the total amount of CPU usage cycles for a particular instruction type. The sum of all such products taken over all the instruction types then gives the total number of CPU cycles that would be used by the task on that particular processing element. Although assigning a static complexity value is an approximation, it is a faster mechanism than computing the overhead of each instruction individually and obtaining runtime counts. Additionally, since the CPU cycles for all the processing elements are approximated and CPU cycles are used only as one of the metrics in modeling the behavior of the tasks, the approximation introduces no disadvantages. An instruction-level power characterization model for the StrongARM series of processors was developed in [2]. In [32] the authors have characterized the Intel StrongARM series of processors and estimate the amount of energy consumed, in joules, when each instruction is executed on the ARM processor. For the Motorola M68HC11 series of microcontrollers the energy values for the instruction set were obtained from [5]. Having already obtained the assembly listing of the tasks, we can now compute the energy consumed by each task using the energy values provided by [32].

3.5 Experimental Results

The clustering algorithm was tested on five different benchmark programs with varying levels of communication. Table 3.1. gives the number of nodes (tasks) and edges present in each of the five benchmark programs chosen for simulation.

Table 3.1. Size of the Programs Used in Simulation

Program                    #Nodes   #Edges
FFT                          22       26
Statistical Tool 1           44       97
Statistical Tool 2           90      163
Laplacian Edge Detection    105      189
Arithmetic Coding           136      216

In addition to testing the clustering mechanism on five different benchmarks, for each benchmark the number of parameters taken into consideration by the clustering algorithm was varied in order to study the effect of the number of parameters on the effectiveness and performance of the clustering algorithm. The constraints that were varied in order to test the performance of the clustering algorithm include the following: the execution time taken for each task to complete its execution, the amount of data in kilobytes (KB) that needs to be transferred from the task to other tasks, the number of CPU cycles needed by the task to complete execution and finally the amount of energy needed for the task to complete execution. The default version of the clustering algorithm uses only two parameters, the execution time and the communication overhead from the task to the other tasks. When three parameters are used by the clustering algorithm, the number of CPU cycles used for the execution of the task is included in addition to the existing two parameters. Finally, the energy consumed by the task is also taken into account when four parameters are used by the clustering algorithm. The parameters were varied for each of the five benchmarks. The benchmarks chosen are a mix of small/medium to large code size benchmarks. Hence the clustering algorithm was tested for three to seven clusters in the case of small to medium benchmarks and five to nine clusters in the case of large benchmarks. The size of a benchmark is indicated by the number of edges present in its task graph.

Table 3.2. Simulation Results Obtained by Running FFT

Percentage improvement in communication latency for FFT

Clustering Based Approach
#Clusters    3     4     5     6     7
#Pi=2      62.8  41.7  21.2  17.0  14.2
#Pi=3      31.8  30.7  23.1  16.9  16.9
#Pi=4      36.6  34.9  25.4  17.1  15.6
Simulated Annealing Based Approach
#Pi=2      77.3  55.4  34.4  22.7  20.8
#Pi=3      30.6  40.4  60.8  18.9  15.8
#Pi=4      58.6  45.6  32.6  24.5  19.7
Tabu Search Based Approach
#Pi=2      43.2   5.0   7.3   5.7   7.4
#Pi=3      31.5  12.5  13.1   4.6   8.4
#Pi=4      34.5   7.0  13.9  12.2   7.8

Table 3.2. contains the results of executing the clustering algorithm for one of the smaller benchmarks, the fast Fourier transform (FFT). Tables 3.3. and 3.4. contain the results of executing the clustering algorithm for the medium sized benchmarks, statistical tools 1 and 2. Both statistical tools perform various statistical functions, beginning with smaller ones including computing the correlation, variance and regression, and extending to some high level statistical functions containing complex computations. Both statistical tools are computationally intensive applications and also require a large amount of communication bandwidth in order to transfer data and results from one module to another.

Table 3.5. contains the results of executing the clustering algorithm on an image processing algorithm, namely the Laplacian edge detection algorithm. Edge detection is also a computationally intensive benchmark; the communication bandwidth needed, in terms of the kilobytes of data transferred between the nodes, is however smaller than in the case of statistical tools I and II. Table 3.6. is the final table, containing the results of executing the clustering algorithm on the largest benchmark that the algorithm was tested on. Arithmetic encoding of data files is both computationally intensive and transfers a huge amount of data between the nodes; hence the communication between the nodes is an important criterion in this case.

Table 3.3. Simulation Results Obtained by Running Statistical Tool 1

Percentage improvement in communication latency for Statistical Tool 1

Clustering Based Approach
#Clusters    3     4     5     6     7
#Pi=2      45.2  55.0  51.5  41.8  39.0
#Pi=3      50.0  41.4  44.4  42.6  46.2
#Pi=4      56.5  55.7  47.1  40.0  43.6
Simulated Annealing Based Approach
#Pi=2      60.1  67.9  59.8  49.1  44.3
#Pi=3      57.6  45.8  49.2  50.7  53.1
#Pi=4      63.6  65.9  53.3  48.3  51.9
Tabu Search Based Approach
#Pi=2      45.5  55.8  40.1  41.9  31.3
#Pi=3      38.0  36.1  32.7  16.8  18.7
#Pi=4      37.2  36.7  32.4  18.9  18.3

The number of constraints used for each execution run of the clustering algorithm is given in Table 3.7.. The results indicate that, for the chosen benchmarks, as the number of constraints used by the clustering algorithm increases, the savings in terms of the overall optimization of communication overhead decrease.

The results from the above tables also indicate that the number of clusters has a negative effect on the clustering algorithm: increasing the number of clusters in the application increases the communication overhead, which is the reason for the decrease in the improvement in communication bandwidth. The cost evaluation function developed for the clustering algorithm was unidirectional and its primary goal was to optimize the total communication overhead between the clusters. However, the cost function can be modified according to user preferences to alter the objective of the algorithm, and the same partitioning algorithm could be tweaked to accommodate simultaneous optimization of execution time and power consumed. The user has to take care that introducing complementary terms in the cost function might leave the algorithm unable to find an optimal solution, so that the solution obtained is only a local maximum/minimum.

Table 3.4. Simulation Results Obtained by Running Statistical Tool 2

Percentage improvement in communication latency for Statistical Tool 2

Clustering Based Approach
#Clusters    5     6     7     8     9
#Pi=2      65.8  68.8  65.5  48.6  44.6
#Pi=3      61.7  48.8  43.7  35.8  35.8
#Pi=4      31.8  28.5  32.8  30.2  28.5
Simulated Annealing Based Approach
#Pi=2      70.6  69.8  68.7  73.1  71.4
#Pi=3      66.1  62.8  63.8  60.8  59.8
#Pi=4      71.4  65.8  64.1  57.9  55.4
Tabu Search Based Approach
#Pi=2      60.5  53.6  39.6  40.1  35.3
#Pi=3      50.2  42.3  33.4  13.9  13.2
#Pi=4      21.9  19.2  24.9  11.2   9.8

3.5.1 Comparison of Cumulative Results

The code partitioning was also implemented with a simulated annealing based algorithm to estimate the performance, and a tabu search based partitioning algorithm was implemented as well, in order to compare clustering with the other methods. In the computational results in the tables above we have seen that simulated annealing based approaches performed better than clustering based approaches, consistently by a margin of about 10-15%. The initial temperature for the simulated annealer was determined by finding the average change in cost for a set of random moves from the starting configuration and selecting the temperature which leads to an accept probability of 0.95 [29]. The tabu search algorithm implemented was adapted from [28]. We can see from the results that clustering based approaches outperform the tabu search based partitioning algorithm in most of the cases. Considering the overhead involved in executing each of the separate approaches, the first factor compared is the execution time needed for the completion of each approach. In Table 3.8. a comparison between the execution times for the clustering based approach and the simulated annealing based approach is presented.

Table 3.5. Simulation Results Obtained by Running Laplacian Edge Detection

Percentage improvement in communication latency for Laplacian edge detection

Clustering Based Approach
#Clusters    5     6     7     8     9
#Pi=2      67.1  69.9  64.9  50.3  47.1
#Pi=3      62.5  57.7  53.1  40.5  37.9
#Pi=4      35.2  32.1  33.2  31.3  27.8
Simulated Annealing Based Approach
#Pi=2      72.1  73.4  72.2  71.5  66.9
#Pi=3      68.9  65.1  64.2  55.7  54.3
#Pi=4      66.2  60.3  58.7  50.4  47.2
Tabu Search Based Approach
#Pi=2      64.1  59.6  59.6  43.5  40.7
#Pi=3      51.3  50.7  49.8  23.7  20.8
#Pi=4      24.4  23.6  26.1  10.3   8.9

Observing Table 3.8., we see that a direct comparison of execution times between the algorithms is not practical: simulated annealing based approaches explode in time complexity with an increasing number of nodes in the application. The rate of increase of execution time relative to the rate of increase of the number of nodes in the application is very high in the case of simulated annealing based approaches. The execution times of tabu search based partitioning schemes also increase at a high rate when compared to the clustering based algorithm. In some cases simulated annealing based approaches do not converge to an optimal solution within an acceptable time frame.

Table 3.6. Simulation Results Obtained by Running Arithmetic Encoding of Files

Percentage improvement in communication latency for arithmetic encoding of files

Clustering Based Approach
#Clusters    5     6     7     8     9
#Pi=2      68.9  70.3  65.6  51.9  50.5
#Pi=3      63.6  63.4  59.7  44.2  40.3
#Pi=4      38.7  36.3  34.6  30.8  28.1
Simulated Annealing Based Approach
#Pi=2      74.5  76.7  75.8  67.1  66.9
#Pi=3      70.6  69.3  68.5  59.4  54.6
#Pi=4      58.5  55.1  53.8  43.6  40.7
Tabu Search Based Approach
#Pi=2      66.3  64.7  61.1  44.9  42.4
#Pi=3      53.2  52.2  52.7  27.7  24.5
#Pi=4      27.1  25.7  26.8  11.1  10.2

Table 3.7. Constraints Terminology

#Constraints (P_i)   Constraints Used
i=2                  Execution Time + Data Sent Out
i=3                  P_2 + CPU Cycles Used
i=4                  P_3 + Power Dissipated

Table 3.8. Comparison of Execution Times for Clustering, Simulated Annealing and Tabu Search Based Approaches

         Execution time for one iteration
Nodes   Clustering   SA based (hours)   Tabu Search
 22     20 sec       4:49:16.85         14 sec
 44     35 sec       6:38:37.44         82 sec
 90     1:05 min     15:23:7.34         1:25 min
105     2:05 min     17:15:43.8         3:25 min
136     3:22 min     20 hr              5:45 min

CHAPTER 4

SOFTWARE ARCHITECTURE

The software implementation of the partitioning algorithm can be divided into two parts. One part comprises the profiling tools used for gathering information from the application source code, and the other part is the C++ language based object oriented design and implementation of the algorithm. This chapter elaborates on the profiling setup and then discusses the detailed object oriented design of the C++ software architecture. The first step of the software implementation process is the profiling setup. The specific characteristics of the application source code are analyzed and the values for each of the attributes of every node in the task graph are computed in the profiling process. The framework constructed for the profiling process consists of compilers, custom designed CAD tools and third party instruction level simulators. The different tools are tethered together in a tight process flow described in the design of the framework. The entire framework is automated and requires no human intervention. The first input to the profiling process is the application source code containing annotations to demarcate task boundaries. The second input to the profiling process is a hardware library file containing specific details of the target architecture on which the application needs to be profiled. In the case of heterogeneous distributed systems, there are multiple architectures involved. Hence the hardware library file usually contains architecture specific details for all the components in the heterogeneous system in addition to the connectivity information between the components, such as the topology of the distributed architecture and the transfer speeds of the communication links.

Figure 4.1. Application Source Code Profiling Tools (gcc cross compilers for the host and each PE produce annotated assembly listings, from which shell scripts derive CPU cycle counts and instruction level power estimates)

4.1 Profiling Setup

The profiling framework is shown in Figure 4.1.. The tools used in the profiling process can be divided broadly into two categories: cross compilers, and shell scripts that process the cross compiler output.

Cross compilation is the process of compiling an application source code on a host architecture with the object of generating a binary capable of executing on a target architecture. The cross compilation process is especially useful in the case of embedded systems, where the target architectures usually do not have any support for operating systems or compilers. A cross compiler tool chain is built on the host processor for any specific target architecture; the process is elaborated in detail in Figure 4.4.. The basic tools needed for building and installing a cross compiler on any machine consist of the following. gcc, a GNU based C compiler, forms the base for compiling libraries and binaries on the host machine. The gcc on the host machine is used in building and installing the cross compiler
Figure 4.2. Illustration of Sample Code at Various Stages
for the target architecture, which is called the cross-gcc. The other tools required in the process include assemblers, a linker combined with loaders, and C libraries. The last requirement for the construction of the cross compiler is the target architecture specific C language headers. C language header files are readily available for download for most commonly known target architectures in the embedded domain, and they are also available for commonly used operating systems like Linux, Solaris, etc. Most of the third party vendors shipping embedded processors include cross compilers and design tools for development on those architectures along with the processor. GNU is an open source community initiative, hence header files and libraries for most of the existing versions of the compiler can be obtained from freelance programmers and research groups organized into forums. After the build, the cross compiler is tested for correctness using simple benchmarks. The application source code is then compiled on the host processor with the newly built cross compiler. The binary file obtained can be transferred to the target architecture for execution; the binary can also be used for obtaining profiling information. The binary file for the target architecture is passed through a set of instruction set simulators to obtain cycle count information. The instruction set simulators can be used to generate other profiling information, including memory usage, power dissipation, cache usage patterns, etc., without actually executing the binary on the target architecture. This method of rapid prototyping is particularly useful in the design of distributed heterogeneous systems: the system architect has the opportunity to evaluate the performance of various configurations for the distributed system.

Cross compilers provide a convenient method to profile and test, on the host processor, any prototype design intended for any processor (especially embedded processors).
Design issues and performance bottlenecks can be fixed in the prototypes even before physically testing on the intended processor. The primary advantage of this process is the huge amount of time saved in testing the prototype. Once a single cross compiler tool chain has been set up for a target architecture, the tool chain can be used for compiling and generating executable files for the target architecture on the host architecture itself.
This process eliminates the need to transfer the executable onto the target architecture for compilation or simulation purposes. Another advantage of the method is that there is now scope for investigating the behavior of the application with respect to the target architecture; hence the architecture can be fine tuned to optimize the execution of the application. The above advantages proved to be the main motivation for the use of cross compilation techniques.

In this work all the experiments were designed to be conducted on a Sun SPARC IV processor. Cross compilers were constructed for three other architectures: the ARM processor for a Linux based system, the PowerPC processor for a Linux based system and finally the m68hc11 microcontroller. The Intel StrongARM is a powerful embedded processor supporting complex instruction types. The PowerPC is a general purpose processor capable of a high degree of programmability. The M68HC11 is a Motorola embedded microcontroller with a very limited level of programmability that supports only a few instruction types. Hence the combination of processors chosen represents a good mix of processing elements with varying degrees of programmability and performance. The cross compiler is used to compile the application source with specific compilation switches turned on. The gcc compiler can be used for generating debugging information about the application code being compiled.
The combination of runtime options used in invoking the compiler decides which switches are turned on. The gcc compiler can generate assembly code for the target architecture from the application source. Hence, using a cross compiler, the application source code can be compiled with specific switches turned on so that the application is profiled for the target architecture. Application source code compiled with an ARM cross compiler built on a Solaris machine can thus produce ARM assembly language code on the Solaris machine. The generated assembly code can then be used along with instruction set simulators to obtain further information about the application as it would execute on the target architecture platform. The problem with this technique is that instruction set simulators are usually tailored for complete programs and hence do not provide information pertaining to individual tasks. In order to overcome this problem, the original application source
code is passed through the graph generator. The result is a modified application source with comments added to the code demarcating each task in the application source with a specific task number. The cross compiler generates the assembly listing of the application source for each of the target architectures; the application source is thus converted to an assembly listing for each of the target platforms. The output of the cross compilation process is the application source code with annotated assembly language code for each instruction in the application source code. A data sheet provided by the vendor of every target architecture contains information about the instructions supported by that architecture. From this material it is easy to compile a library of all the instructions supported by the target processor and the average number of CPU cycles used for each instruction type.

The process of estimating the number of CPU cycles used by each task uses three inputs: the annotated original application source code, the annotated and cross compiled assembly listing of the application source code, and finally the data sheets for all the target architectures containing the cycle count information for the different categories of instructions supported by each target architecture. The first step in the process of extracting the CPU cycles is a first pass of a parser over the annotated source code to identify the task numbers. The second step is a first pass over the assembly listing of the source code to identify the corresponding numbers in comments. On identifying a numbered task, the profiler cross references the data sheet for the target architecture to identify the number of cycles used by each particular instruction, and this count is maintained in a buffer. Since most applications are written using high level languages supporting high level design techniques like procedures and functions, there may remain undefined references in the assembly listing for which the profiler is unable to find a match in the target architecture data sheet. The function/procedure calls written in high level languages are usually stored as symbols in a function table at the end of the main application program, in order to facilitate a jump to the address of the function
tag on a procedure call. The profiler identifies the function/procedure names at the end of the first pass over the annotated application source code. The CPU cycle count
information for each of the procedures is calculated by a set of secondary passes over the application source code; a unique file is created for each procedure definition and the cycle count information for that procedure is stored in that file. The profiler then makes a second pass over the assembly listing while maintaining the list of undefined references. This time, when a function/procedure name is encountered, the CPU cycles used for that call are substituted from the corresponding per-procedure file. At the end of the second pass the buffer for each numbered task contains the list of all the instructions comprising the task and the average cycle count for each particular type of instruction. The contents of each buffer are cumulatively added to obtain the total number of cycles needed for each task. Similarly, a data sheet is compiled for all candidate target architectures that would be used in the distributed heterogeneous system. Cross compilers are built and installed for all the target architectures, and the above process of computing the CPU cycles for each task is repeated for each target architecture with the corresponding cross compiler tool chain. The results are collated and arranged in text buffers for fast access by the partitioning algorithm.

The key design considerations in the implementation of the profiling module include speed of execution and level of automation. The module should be fast enough to accommodate real time speeds and the amount of human intervention should be kept at the minimum possible. Most high level programming languages are fast in the case of computationally intensive applications, but when it comes to reading and writing huge amounts of data from disc files they are really slow. Hence powerful shell programs, or shell scripts, written using a combination of csh scripting, gawk and sed are used in the static profiling process.
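The per-task cycle estimate described above reduces each task to instruction-type frequencies and multiplies in the per-type cycle counts from the target's data sheet. A minimal sketch, with invented names and made-up placeholder numbers rather than real data sheet values:

```cpp
#include <cassert>
#include <map>
#include <string>

using Histogram = std::map<std::string, long>;  // instruction type -> count in task
using DataSheet = std::map<std::string, long>;  // instruction type -> avg cycles

// cycles(task, PE) = sum over instruction types of frequency * cycle cost.
long taskCycles(const Histogram& freq, const DataSheet& sheet) {
    long total = 0;
    for (const auto& entry : freq) {
        auto it = sheet.find(entry.first);
        if (it != sheet.end())                   // undefined references are
            total += entry.second * it->second;  // resolved in a later pass
    }
    return total;
}
```

Repeating this per target architecture, with the corresponding cross compiled listing and data sheet, yields the CPU cycle attribute used by the partitioning algorithm.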
Gawk is the GNU version of the original UNIX programming language awk, with some advanced functionality added. Gawk contains many of the string manipulation functions supported by the C programming language, and in addition most of the C language constructs can be used within a gawk script. Since gawk is a scripting language, it is faster than high level programming languages for string and data manipulations on streams. A stream in this context can be defined as
a stream of data that is obtained as output from one process or gawk command. The gawk scripting language and shell scripting languages like csh can operate efficiently on stream outputs and inputs: the output from one shell process can be directly streamed as input to another shell process without actually being written onto any temporary disc files. The advantage of using streams is that the input/output wait involved in writing to temporary disc files is avoided. Sed, the stream editor, is ideally suited for operations on huge amounts of streamed data and is also exceptionally fast. The process of compiling and building each of the individual cross compilers is explained in detail in the appendices.

Figure 4.3. Illustration of Output from Task Graph Generator

4.2 Sparc-Linux-Arm Cross Compilation

This section describes in detail the scripts and packages needed for compiling and constructing a cross compiler for an ARM processor running the Linux operating system on a Sparc processor running the Solaris operating system. The building blocks required for constructing the compiler are the following:

binutils - a collection of GNU binary tools containing ld, the linker, and as, the assembler
glibc - a library defining C language system calls for the kernel
linux-headers-arm - ARM architecture specific kernel header files

Build and Install Cross Compiler Target Architecture Specific Headers Install and Build and Build C-Libraries Compile Binutils Compile Acquire Specific Headers Cross compilation C-Libraries Binutils Figure4.4.OverviewoftheProcessofBuildingaCrossCompi ler Theabovetoolscanbedownloadedfromwww.gnu.orgorfromfo rumspromotingresearchintheARM-linuxcrosscompilers.Thetoolswhichare in bzip formatcanbe extractedusinga bunzip command.Sometoolsarebundledin RPM (RedhatlinuxPackageManager)les.InordertoextractRPMlesonasolarisop eratingsystem,asmall PERL script rpm2cpio canbeused.ThePERLscriptconvertstheRPMpackagestosolarisarchiveslearchives.Thelearchivescanbenowbeex tractedusingthecommand cpio -copylearchivesinandout.Thenextstepafterobtainingt hetoolpackagesisxing thedirectorytoinstall.Approximatelyabout45MBoffrees paceisrequiredfortheinstallation.AworkingversionoftheGNUCcompiler gcc isrequired.Theversionusedforthis crosscompilerwasgcc2.95.3.Thebinutilsneedtobeinstal ledrstinordertoconstruct thecompiler.Therststepininstallingthebinutilsiscon guringtheinstallationscript. Thesourcelesdirectorycontainsashellscript congure -thislecontainsplaceholdersfor sourcelelocations,librarylocations,installationloc ationandinstallationoptions.The sourcelescanbeconguretobuildacrosscompilerforalit tle-endianorabig-endian architecture.Theplaceholdersarecompletedandthecong urescriptisexecuted.The scriptperformsacheckontheavailabilityofallsourcele s,readandwritepermissionsfor directoriesthatwouldbeusedintheinstallationprocessa ndamountoffreespaceonthe disk.Ifalltheconditionsneededtobuildthecompilerares atised,thescriptgenerates 54


an installation script file. The installation script can then be executed by using the commands gmake and gmake install. The command gmake compiles the source files and builds the libraries. The command with the install option links the library files, generates binaries as output and copies the binaries to their respective directories. The next step is to build the actual cross compiler using the ld and as tools which have been built.

4.3 Motorola m68hc11 Cross Compilation

The m68hc11 is a Motorola embedded controller with limited functionality. The advantage of using the m68hc11 is that it is easy to program, with a limited set of supported instruction types. The forums available on gnu.org provide open source tools and software for development on the m68hc11 and m68hc12 platforms. The bundled software can be downloaded and installed to get a working version of the gcc compiler, the glibc libraries, binutils and m68hc specific header files. The version of the compiler used in this work was gcc 2.95.3. The software was installed on a PC running Microsoft Windows XP. The cross compiler generated m68hc11 assembly code annotated with task numbers. This assembly listing was transferred to the Sun Sparc machine running Solaris 8. The profiling tools integrated into the framework then parsed the assembly listing to obtain specific information about the number of CPU cycles used for each task and other details including memory usage and power usage.

4.4 PowerPC Cross Compilation

The PowerPC processor is a generic processor capable of executing complex instructions. The cross compiler for PowerPC was set up on the Sun Sparc machine and the assembly listing was generated. The profiling of tasks was done on a PowerBook G4 running Mac OS X 10.4.11. The X-Code framework was used to profile the application. The Shark performance analyzer was used to generate task related information.


4.5 C++ Based Implementation

In this section we describe in detail the C++ programming language based software framework developed for implementing the partitioning algorithm. The software framework can be divided into two major parts. The first part is the entire set of objects that comprise the framework. This part includes object behavior modeling using attributes and methods. The second part contains generic functional routines for partitioning and graph traversal. Since the code partitioning problem has been mapped to a graph partitioning problem, graph traversal and partitioning algorithms are implemented and can be invoked independent of the nature of the objects. The advantage of this development methodology is the flexibility provided to the user in deciding the exact partitioning algorithm that the user prefers to use. Partitioning algorithms based on clustering, simulated annealing and tabu search approaches have been implemented in this work. The framework has been designed in such a manner that any new partitioning algorithm that needs to be utilized can be invoked using the APIs provided in the object framework.

4.5.1 Motivation for Using C++

The C++ programming language was the ideal choice for the development of the entire software architecture framework. The language is object oriented and hence provides data hiding or encapsulation mechanisms. The behavior of all the individual components of the framework is modeled using objects. Hence all the operations pertaining to each of the individual components are encapsulated into the object itself as methods. The various partitioning algorithms were implemented as generic methods. Since the implementation of the objects is transparent, the user can switch from one partitioning algorithm to another without any programming change. In addition, the framework can also accommodate additional partitioning algorithms with minimal changes to the existing framework. Once a new partitioning algorithm is implemented, the addition of another choice will enable it to utilize the existing APIs of the framework in order to integrate into the system.


Figure 4.5. Object Based Design of the Main Entities (the task, task attributes, cluster and processing element objects, with their attributes and methods)


4.5.2 Object Descriptions

The individual entities in the partitioning problem exhibit behavior similar to objects. In this section we describe in brief the main objects that form the core components of the software framework. The task object represents the model of a task in the application or, equivalently, a node in the task graph. The attributes of the task class include the following: the name of the task (usually the node number), the time taken in seconds to complete the execution of the task, the amount of data (in KB) needed by the task for completion, the id of the cluster to which this particular task belongs, the list of all nodes having a directed edge with its endpoint on this particular task (the parent nodes), the list of all nodes to which there are directed edges initiating from this particular task (the child nodes), the total number of CPU cycles that the task would need on each processing element, and the amount of power needed to execute the task. The object encapsulates methods for providing complete access to the task graph. These methods include:

- Parent node list - obtaining a list of all the parent nodes
- Child node list - obtaining a list of all the child nodes
- Distance - computing the edge distance to another task node (if connected)

The object also contains methods to interact with the library of processing elements and the profile information repository to obtain the number of CPU cycles and the power consumption. The information pertaining to each task is stored in the corresponding object. The individual attributes of each task are then normalized, converted to dissimilarities and stored in the task attributes class.

The object cluster is representative of a cluster of tasks. The attributes of the object include the following: the name of the cluster, the total amount of incoming data (in KB), the worst case execution time for the cluster obtained by comparing the execution times for the same cluster on all processing elements, the total number of tasks comprising the cluster, the node numbers of all tasks grouped with this particular cluster, and the node id of the medoid for this particular cluster. The object provides methods for traversing the task graph and


computing the execution time for any sub-graph of the entire graph. Methods are also provided for computing the cost of data transfer between a pair of clusters, represented by the amount of time taken to transfer the data (measured in seconds), and for computing the entire cost of a solution, measured as the total amount of time spent in data transfers.

4.5.3 Tree Traversal and Cost Computations

The intermediate solution at each step of the partitioning process is a set of clusters of code. Each code cluster consists of interconnected tasks. The steering logic of the partitioning approach produces the intermediate solution. If the cost of the new intermediate solution is better than the previous one, the cluster configuration is updated with the new set of medoids. The partitioning algorithm is then steered in the new direction. The cost computation module plays a vital role in deciding the new direction of progress at every intermediate step. The software framework consists of tree traversal algorithms to identify the member nodes for every medoid in the new solution. Once the cluster medoids and the member nodes are identified, the methods of the aggregate object cluster are utilized to compute attributes like the total execution time, total number of CPU cycles and total energy consumption of every cluster on all the processing elements. At the end of this estimation process each cluster object contains all the attributes for the different processing elements. The comparison of the cost of one solution to another is thus simplified by a huge margin, since the implementation of the entire object is transparent to the user.

4.5.4 Building Blocks

The entire software framework is constructed with ample scope for expansion and scalability. Since all the major components of the system are modeled as objects, additional behavior of the components can easily be captured by the addition of new attributes to the object. The programming changes needed to model the behavior are minimal, since the partitioning algorithm and the cost computation modules are isolated from the implementation of the object itself. Further investigation into newer partitioning methodologies or the inclusion of any improvements to the cost computation module is extremely simplified. The algorithms can be implemented separately and the APIs provided with the objects can be invoked in order to switch the algorithms. The entire architecture was designed with scope for further improvements and expansion.


CHAPTER 5

CODE GENERATION

This chapter presents an automatic code generation tool for partitioned procedural language applications that works without making any changes to the original application source code. The tool does not involve any manual intervention. The partitioned code clusters are identified beforehand. Any form of communication between the partitioned code clusters is then achieved by specialized communication primitives inserted into the original source code. This automatic approach to code generation has significant potential: it drastically simplifies the process of partitioning applications for heterogeneous distributed systems. This tool has been implemented as a part of a masters thesis [27].

The design and implementation of the proposed code generation tool is shown in Figure 5.1. The first step involves gathering data and creating a metadata information repository. This is the most crucial stage, as the accuracy of the code generator depends on the data accumulated. The second step is the actual partitioning of the program into individual clusters of code and the addition of programming constructs to convert the clusters into standalone programs. The third step involves the inclusion of a communication mechanism. Data communication and synchronization are implemented via message passing. Optimization methods used to improve the quality and performance of the code, such as using multicast or broadcast communication, using aggregate communication and eliminating unnecessary communication, are also explained in detail.

5.1 Need for a Data Repository

In order to attain synchronization and correct execution of the independent sub-programs of a distributed application, it is critical to maintain accurate communication between the


Figure 5.1. Steps Involved in Code Generation

code clusters. A strong communication methodology is necessary for scheduling the instructions in the independent sub-programs. In order to implement this communication layer, detailed profiling information of the source code is necessary. Some of the factors that influence the scheduling of the individual sub-programs include:

- Control Flow
- Data Dependency
- Software Functionality

These details are not available from the code itself. Consequently, reverse engineering is applied to acquire this data about the program. The importance and resourcefulness of the reverse engineering process is explained in further detail in the following subsection.


5.1.1 Reverse Engineering

Reverse engineering [13] is the process of extracting information from an existing software system using a bottom-up approach. During this process, the source code is not altered, although additional information about it is generated. The subject software system is represented in a form in which many of its structural, functional and behavioral characteristics can be analyzed.

Figure 5.2. Reverse Engineering Process

The source code contains all the information needed about the entities used in it. In order to extract this information, the source code is passed through the reverse engineering process [36]. Relevant information obtained from the process includes details such as:

- Identification of the entities used in the code.
- Modification locations.
- Reference locations.
- What entities do they depend on?
- What other entities depend on them?


5.1.2 The "Understand C" Tool

In order to gather functional and behavioral information about the application that needs to be partitioned, we reverse engineer it using the UNDERSTAND C tool developed by Scientific Toolworks, Inc. [33]. Several other tools like inSight, Essential Metrics and Imagix were tested and found unsuitable. The "Understand C" tool was preferred because of its ease of use, speed of generating reports, and accuracy.

5.1.2.1 Functionality and Utility of the Tool

The UNDERSTAND C tool analyses the application code to identify the program components and their interrelationships, and creates a higher level representation of the software program that is easily understandable. It uses a graph model to visualize the flow of the code. The application to be reverse engineered is given as input; the RE tool profiles the code and creates a completely documented report giving intricate details of the application at hand.

This tool gives two basic reports. The first contains a list of the program components, i.e. a list of all the variables and functions with complete details of their data types. The second report contains a list of the locations where the program entities were referenced or modified. The control flow and the data dependency of the software program are also illustrated by the tool.

5.2 Creation of Metadata Table

The proposed code generation tool requires knowledge about the working of the application before it can be distributed over several processors. This information includes the data dependency, control flow and list of program components. Although the reverse engineering tool provides a multitude of information about the application, there is a need to glean out and store the appropriate information that will aid the distribution. This is done by maintaining a metadata table.

The steps involved in the creation of the metadata table are as follows:


Figure 5.3. Creation of Metadata Table

- Output from the reverse engineering tool is provided as input to the framework.
- Modification locations of program entities are tabulated.
- The annotated source is passed through the task profiler.
- The information from the two previous profiling steps is combined.

This process is illustrated in Figure 5.3.

At the end of the preprocessing step the metadata table contains the following:

- List of program entities
- Reference locations
- List of tasks
- Entities used in each task
- Dependency information

Although the creation of the metadata table accounts for an overhead, this process is required to be performed just once. Thereafter, distributing the application by varying the number of clusters does not require this process to be performed again.


5.3 Partitioning the Program into Independent Sub-programs

Figure 5.4. Clustering

The annotated C-language application is parsed through a separator program and partitions are created using clustering. These partitions or blocks of code are called code clusters. In order to execute the code clusters on specified processing elements, it is necessary to convert the code clusters into independently executable programs. An illustration of creating independent sub-programs is provided in Figure ??. The process of clustering and program completion is explained in the following sections.

5.3.1 Clustering

The annotated application source code is passed through the task profiler. Each cluster is a group of several tasks.

We will refer to the example shown in Figure 5.5. In the example the application is partitioned into three clusters:

- Code Cluster 1 containing Task 1 and Task 7
- Code Cluster 2 containing Task 3
- Code Cluster 3 containing Task 5


Figure 5.5. Code Cluster Formation

When partitioning is dynamic, the user can select the tasks that need to be grouped in the same cluster, as well as the number of clusters into which the application is to be partitioned.

5.3.2 Adding Constructs

The basic goal is to convert the code clusters into independently executable programs. This requires adding constructs like declaration information and functions to the code clusters.

A naive approach to adding constructs would be to simply duplicate the functions in all the partitions. However, as the number of sub-programs increases (the number of partitions into which the program is distributed), the code size would explode.

The approach used in this work is to obtain certain information about the tasks included in each partition. Then, using the metadata table, the corresponding entities and


Figure 5.6. Adding Constructs

functions used in each of these tasks are found. Based on this information, declarations and functions for the entities present in the partition are added. This optimization helps in maintaining the code size and reducing redundant code duplication.

As seen in Figure 5.7, code cluster 1 is converted into a program, Partition 1, by adding constructs. This is accomplished by adding the appropriate header files and functions that are being referenced by the tasks in the code cluster, and later introducing the declaration information of the entities in order to avoid errors on compilation.

5.4 Communication Methodology

Figures 5.8 and 5.9 illustrate the abstraction of the communication between two code clusters. The main function of code generation is to produce code for each distributed processor such that every task is correctly executed. Before a task is executed, the tasks with precedence over it need to complete execution, and the data items referenced by the task must be made available in the local memory. Additionally, after each new data item is computed, it must be sent to the appropriate processors whose tasks need this data for further computation. This requires the implementation of a communication layer.


Figure 5.7. Creation of Sub-Programs

5.4.1 Selection of PVM as the Communication Mechanism

The most common communication technologies include MPI, Java RMI and PVM. Java RMI is more suitable for Java applications and hence is not considered for this approach. We choose PVM [11] over MPI for its robustness and suitability for heterogeneous architectures.

5.4.2 PVM Communication Process

The PVM message passing process is explained in the steps given below:

Initialize send buffer: Before any message can be sent out from one processor to another, we need to allocate and initialize a send buffer.

Place data into send buffer: This second stage of PVM message passing involves placing the data that is to be sent into the newly initialized send buffer.


Figure 5.8. Illustration for Partitioning Using Clusters

Figure 5.9. Abstraction of Communication Between Nodes from Different Clusters


Send out data in send buffer: This step involves sending the data in the send buffer to the destination processor.

Retrieve and place message in receive buffer: The destination processor receives the data from the sending processor and caches it in the receive buffer. There are two distinct types of receive commands: a blocking receive, wherein the destination processor stalls till the data arrives, and a non-blocking receive, which resumes execution if the data has not arrived.

Unpack receive buffer: The final step involves unpacking the data from the receive buffer.

5.4.2.1 Asynchronous Messaging Primitives

Several asynchronous PVM message passing primitives used in this work are mentioned below:

pvm_recv(int tid, int msgtag)
Executing a receive command will receive a message with ID msgtag if it is in the communication buffer of the processor. Otherwise it blocks the processor idle till the message arrives.

pvm_mcast(int *tids, int ntask, int msgtag)
Executing a multicast command will send the data in the active message buffer to ntask tasks within tids. This is useful for broadcasting the data.

pvm_spawn(char *task, char **argv, int flag, char *where, int ntask, int *tids)
A process named task will be initiated on the specified processor. The task ID of the spawned process will be sent to the parent process.

pvm_initsend(PvmDataDefault)
The routine pvm_initsend clears the send buffer and prepares it for packing a new message.


5.4.3 Implementing PVM Communication

PVM communication primitives are automatically interleaved in the sub-programs in a transparent fashion without changing the order of the original source. The technique used is fast, error free and does not involve intervention by the programmer, unlike traditional methods that involved manual coding of the communication layer. However, it is necessary for the architecture to be PVM compliant.

In the transparent mode, tasks are automatically executed on the most appropriate computer. Nevertheless, PVM gives the flexibility to allow the user to specify the processors to which the tasks need to be mapped in the architecture-dependent mode. In low-level mode, the user may specify a particular computer to execute a task. In all of these modes, PVM takes care of the necessary data conversions from computer to computer as well as low-level communication issues.

Once the centralized application is split into equivalent independent subprograms, the parent partition is selected. PVM constructs are then inserted into the parent partition to mobilize spawning of the other partitions on the specified appropriate processors. Then each partition is carefully profiled and message passing constructs are conservatively appended.

The message passing constructs are inserted into the subprograms based on the following:

- For every task that modifies an entity, the metadata table is searched to check if there are other tasks that will be affected. If the tasks being influenced by the modification belong to different partitions, the new value is broadcast to them.
- Similarly, before computing a task, the program checks whether the entities in that task have been modified previously. If the entities have changed, the new values are received from the respective partitions.

As the send command is non-blocking, the processor that sends the message can proceed with the execution. The receive command, however, is a blocking command and will stall till the


message is received. This preserves any inherent parallelism in the program while, on the other hand, ensuring the correct scheduling and execution of the distributed program.

Figure 5.10 shows the need for communication. After the preprocessing and partitioning process is complete, we obtain the partitions shown in Figure 5.10 (Code Partition 1 and Code Partition 3). The tasks have been segregated in such a manner that an entity computed in one partition needs to be printed in another partition. The entity needs to be sent to Code Partition 1 as soon as it is computed for correct execution.

Figure 5.10. Communication Between Sub-Programs

The communication between the two partitions is facilitated by the addition of PVM primitives as shown in Figure 5.11. In this implementation, Code Partition 1 is chosen as the parent process, which spawns another process (Code Partition 3).

The PVM header files are included in all the partitions. The program finds that the entity modified in task 6 of partition 3 is required by task 7 of partition 1. PVM instructions to initialize the send buffer, pack the new value and send it to the parent processor are


Figure 5.11. Interleaving PVM Primitives

appended. Similarly, in Code Partition 1, PVM instructions to receive and unpack the new value are automatically appended.

Figure 5.12 depicts the algorithm for inserting PVM instructions in the partitioned code.

5.4.4 Optimizations

This section deals with the optimizations employed for improving communication time. The techniques of using broadcast communication and eliminating redundant communication are explained in the following sections.

5.4.4.1 Broadcast Communication

A task could send the same message to many other tasks in different processors. The incorrect way to do this would be to use repeated one-to-one sends, which could create


For each task:
  Check the metadata table for all the entities used in the task.
  For each entity in the task:
    - Check if the entity was modified in an earlier task.
    - If the entity is modified, check if the task in which the entity is modified is within the same partition.
        If within the same partition, there is no need to have a receive statement.
        If not, then receive from the other task.
    - Similarly, if an entity is being modified in the task, broadcast the new value if the entity is being used in a different partition.
Repeat for all tasks.

Figure 5.12. Communication Algorithm


communication contention in the processor network. Thus a broadcasting scheme should be used to take advantage of the network topology in order to alleviate network message traffic.

In this implementation, after an entity is modified, the control flow information in the metadata table is searched to check whether all other partitions require the modified entry. If they all require the modified value, instead of having multiple send and receive commands we replace them with a single broadcast command.

5.4.4.2 Optimizing Communication Overhead

Figure 5.13. Local Memory Optimization

Local Memory: If two tasks are assigned to the same processor, there is no need for communication between them, because even if one task modifies an entity, the new value still exists in the local memory. For example, in Figure 5.13 two tasks (task 2 and task 3), assigned to the same processor (processor 2), are each receiving data from another task (task 1) in a different processor (processor 1). Processor 1 issues two sends with the same message to processor 2, and processor 2 issues two receives, for task 2 and task 3.

An optimized approach should eliminate such redundant communication. Processor 2 does not need to receive the same data for task 3 that it has already received for task 2,


as the data can be stored in the local memory. For task 3, the data can be fetched from the local memory itself, reducing the communication expense. Thus a redundant receive can be detected if the data item is already in the local memory.

Dead Messages: Another optimization is to eliminate communication if an object is not going to be used any further: even if the data item is modified, it is not required by any other task.

5.4.4.3 Use Aggregate Communication for Consecutive Tasks

Figure 5.14. Aggregate Communication Optimization

Using aggregate communication for consecutive tasks, when required, improves performance. For example, in Figure 5.14, task 3 in processor 2 requires two entities that are being changed in processor 1. One approach would be to send the entities separately, each time an entity was modified, using two send and receive messages. This approach instead uses an optimized method of packing both entities into a single send command. The basic aim is to send a smaller number of longer messages.

5.4.5 Code Size Increase

This subsection presents the resultant increase in code size as an effect of code generation. A comparison between the commonly used broadcast approach and the optimized


version of sending information is performed. The final simulation results of both approaches are presented in Table 5.1.

Table 5.1. Overhead Increase in Terms of Lines of Code

              % increase in lines of code
#Clusters   naive approach   optimized approach
    3           35.78              18.97
    4           38.63              19.97
    5           42.81              20.88
    6           45.24              21.31
    7           47.22              22.56

5.5 Summary

This chapter covered the implementation of the automatic code generator tool. Important program information was collected and stored in a metadata table to aid in interleaving the communication layer. Additionally, the application is segregated into independent sub-programs by conservatively adding constructs to the code clusters. The rationale behind using PVM as the communication methodology was discussed. The chapter also covered the implementation and design of PVM as the communication methodology in our work. A detailed description of the optimization techniques employed to increase the performance of the distributed code was presented in the last section of the chapter. The techniques include the usage of aggregate communication, broadcasting information instead of using one-to-one communication, and the elimination of redundant communication.


CHAPTER 6

A FRAMEWORK FOR ARCHITECTURAL EXPLORATION

In this chapter a novel approach and framework is described for architectural design space exploration based on Transaction Level Modeling (TLM) for SoCs. A system level design/synthesis framework based on TLM models is proposed.

The main components of ESL design include application specific processors, embedded software and high speed Intellectual Property (IP) cores or third party processing elements. Fast and efficient system-level design and synthesis of distributed heterogeneous architectures is critical for new product development, especially in the fast paced and ever changing consumer electronics segment of the market. Maintaining a strict time to market for a new product is critical for the success of the product launch and the survival of the product line. A brief overview of the classic design flow is now presented.

6.1 Electronic System Level Design

The key components of the system level design process include an accurate model of the system and an implementation tool for verifying the model. System level models are usually described using high level programming languages like C or C++. The models generated are then mapped onto existing processing elements from third party vendors or, in special cases, some parts of the model are implemented in hardware as ASICs. One of the major issues in system level design is the problem of identifying the appropriate processing element for mapping a particular task. Heterogeneous systems have multiple processing elements, and the task of the system architect is further complicated by the numerous choices of architectures. System level models described using high level programming languages lack implementation details. Hence the designer's tasks include making assumptions about


implementation details that were not available in the initial design phase. The models at the system level also do not include cycle time information.

The system level design process has been automated using CAD tools. The design process starts with the requirements specification document. System level design models are implemented using a high level programming language for design verification.

6.2 Classic Design Flow

Figure 6.1. Classic Design Flow for Electronic System Level Design

The steps involved in the classical design flow for system level design [1] are illustrated in Figure 6.1. The process starts with the collection of requirements from the customer. The requirements are then translated into an intermediary form known as the functional specification document or the system specification. Hardware and software development processes are initiated in parallel with the system specification document as the input.


During the entire process of complete hardware and software development, there is no communication between the hardware development and software development teams. Once the development on both tracks has been completed, the entire system is integrated into one module. The system is then validated for functional correctness and then for other performance based issues. Any issues arising after the validation stage introduce another cycle of independent hardware and software development. This independent development continues until all issues have been resolved and the system model has been validated. The next step in the design flow is implementing the prototype and testing the implementation.

The primary reason that this design flow worked initially was the limited number of choices available to system architects. The existing processing elements available for use in system designs usually fell into two categories: general purpose processors and microcontrollers. General purpose processors could not be used in embedded system designs, since most of the systems do not include operating system support, and the cost of a general purpose processor could not be justified against the cost of the entire product after development. In the case of microcontrollers, the degree of programmability was very limited and each microcontroller had a specific function. Eventually an ASIC needed to be custom designed in order to interface with the microcontrollers and achieve the functionality of the system design. The functionality of the ASIC was clear from the start of the design. Hence independent development of the hardware and software was still viable, and the number of iterations required to resolve the issues in the functional verification of the system design was limited to an acceptable number.

Diminishing feature sizes introduced a new complication in the design flow. The choices available to system designers became large, and the system level design decisions played a critical role in the development of an efficient and optimal system. The advent of SoCs (Systems on Chip) introduced new design choices and challenges in the design process.
The traditional design flow could not support the design of the new SoCs. The independent development of hardware and software is clearly an impediment to the fast and efficient


generation of a prototype. The problem can easily be fixed with the integrated development of hardware and software in parallel. The increased communication between the development processes assures a very low volume of issues created in the design. The above observations were the motivation for the hardware/software codesign process. Distributed heterogeneous systems design adopted the hardware/software codesign methodology, and all the steps including partitioning, scheduling, simulation and verification were performed in the hardware and software domains in parallel with effective communication and feedback.

6.3 SoC Design Flow Cycle

Figure 6.2. Design Flow for Modeling SoCs

The steps involved in a SoC design flow are illustrated in Figure 6.2. The process starts with the collection of requirements from the customer. An appropriate algorithm is chosen for modeling the requirements specification and the functional model of the system is developed. Partitioning and scheduling are performed on the functional model in order to obtain the architectural description model of the system. At this stage the functional units in the system design are finalized. The hardware and software processes are identified. The


next step in the design flow is identifying the channels of communication and finalizing the protocol configurations that would be used in the communication synthesis of the model. The communication model is finalized after all communication based requirements have been identified and fulfilled. The final step in the design flow of the SoC is using cycle accurate modeling to implement the prototype. This prototype is simulated and tested for functional requirement completion. It is evident from the steps involved in SoC design that the traditional design flow would not be efficient enough to generate an optimal prototype in a short time frame.

Figure 6.3. Novel Approach for Modeling SoCs

In order to overcome the shortcomings of the traditional design flow in the development of SoCs, a novel design flow was proposed. Figure 6.3 illustrates the steps involved in the novel design flow. The design flow now includes the use of TLM models. The author in [25] provides an elaborate discussion on transaction level modeling and how it interacts with SystemC modules. The novel design flow introduces numerous challenges for the


system design engineer. The design space for architectural exploration is now huge, as can be observed from the following factors:

- The choice of reusable intellectual property (IP) cores available is huge.
- The parameters that affect design performance are numerous.
- The needs of applications are specific, hence fine tuning can be performed only on a case-by-case basis.

From the above reasons we can see that it is difficult to come up with one standard for the different levels of abstraction. Hence most of the research in this area is currently carried out on a case-study basis, particular to one specific scenario.

6.4 TLM/SystemC Based Exploration for SoC Architectures

TLM can be classified, in a broad sense, according to use cases:

- Functional view: requirements/specification of the application
- Architect's view: architectural exploration
- Programmer's view: embedded software design
- Verification view (could also be combined as part of the architect and programmer views): system validation and HW/SW co-verification

A transaction can be defined as an event that involves a data exchange between the entities of the system. The entities in the system include both communication objects and processing elements [20]. Each transaction is considered to be atomic; that is, it has either completed or not occurred at all. Transaction level models can be used at different levels of abstraction. Broadly used levels of abstraction include the following:

- Untimed models: the highest level of abstraction, mostly functional
- Approximately timed: pin-accurate models


- Accurately timed: cycle-accurate models, mostly RTL-generated code for design verification

An illustration of communication architecture modeling using TLM is shown in Figure 6.4. The entities in the system include processing elements and communication elements.

Figure 6.4. Sample Communication Code Using TLM

TLM is used to describe both categories: both processing objects and communication objects can be modeled at any level of abstraction using TLM. In this proposed approach we discuss the scenario of modeling the communication architecture for bus-based systems using TLM in the exploration of SoCs. In the particular case of communication architecture modeling, the abstraction levels of TLM can be further defined as follows:

- Cycle Accurate (CA): a high-level abstraction of register transfer level (RTL) detail; includes cycle-accurate and pin-accurate information.
- Pin Accurate/Bus Cycle Accurate (PA-BCA): an abstraction over the CA approach wherein behavior inside the components is not modeled for every cycle; only the bus is still cycle accurate.


- Transaction-based Bus Cycle Accurate: TLM-based models are used for communication exploration.

We can observe from the above description of the different levels of abstraction for modeling the communication architecture using TLM that there are specific advantages in using a particular level of abstraction. The cycle-accurate models provide detailed and accurate information; they can be modeled using C/C++ or any high-level programming language. The obvious downside of cycle-accurate models is that simulation takes a long time to complete. The pin-accurate models are 1000x faster than RTL and require only one tenth of the modeling effort of RTL. Transaction-based models provide the highest level of abstraction, and most of the smaller implementation details are transparent: most low-level bus signals are replaced by high-level functions, and the simpler interface reduces modeling effort at the transaction level. The speed of simulation is about 1000x that of RTL, and they involve only about a twentieth of the RTL modeling effort.

Figure 6.5. Software Areas that can be Modeled using TLM


The versatility of TLM models extends to all areas of the software description, as illustrated by Figure 6.5. Since TLM operates at different levels of abstraction, the models can be generated at any level of the software architecture hierarchy; depending on the amount of detail required for the model, a different level of abstraction can be chosen by the user. Design exploration and optimization is the main goal of the TLM methodology. The functional specification can be modeled using TLM and the design is verified using generated RTL code. The optimization process can now be performed in iterations, with feedback after every iteration, between the TLM plane and the RTL plane. Figure 6.6 illustrates this process of continuous feedback and optimization of the system design using TLM. Hence TLM is ideally suited for architectural exploration and communication architecture modeling. The proposed framework would take the requirements specification from the customer in a generic format like UML (Unified Modeling Language). Code generation tools like Rational Rose can be employed to generate a C/C++ functional specification from the UML requirements document. The functional-level code is subjected to hardware/software partitioning and then scheduling. TLM is applied with SystemC as a front end to model behavior for simulation. The system designer can then fully utilize the power of abstraction of TLM to verify and synthesize the system design with the least amount of effort involved in design and testing.


Figure 6.6. Optimization using TLM Models


CHAPTER 7
CONCLUSION

In this work we presented a completely automated framework to achieve code partitioning and code generation for distributed heterogeneous systems. The framework consists of profiling tools bundled with the partitioning algorithm and the code generation module. The tool is end-to-end automated and requires no human intervention. The framework can be used for rapid prototyping in mapping applications onto distributed heterogeneous systems. The approach can be used to provide insight into the behavior of applications on distributed systems, and in turn can promote efficient mapping of parts of the application onto the different processing elements of the system in order to decrease execution time, optimize communication latency and even achieve low power dissipation.

The partitioning algorithm developed was based on clustering. Clustering-based methods were found to produce good quality results with an extremely low execution time. The low runtime of the clustering-based approaches makes them preferable over all other partitioning approaches in the case of distributed applications requiring fast response time. The partitioning algorithm simulates the application for a range of cluster numbers and generates the corresponding mapping. The user can conveniently choose the optimal mapping configuration based on the amount of savings in communication latency, or on whichever metric the cost function is tuned for.

The framework includes a completely automated tool for code generation for mapping applications onto distributed heterogeneous systems. The code generation approach performs communication synthesis and completes the different clusters of code into individually executable independent programs. The communication between the individual segments is achieved by synthesizing PVM system calls. Inter-cluster data dependencies are


satisfied by synthesizing PVM calls that pack and send the inter-cluster dependent data. The code generation approach works especially for ANSI C based distributed applications.

The advantages of the partitioning algorithm and the code generation approach are that they are both scalable. Increasing the number of nodes in the application has little effect on the execution time in the case of clustering, whereas both Tabu search and simulated annealing based approaches suffer adversely from an increase in the number of nodes. The framework is also very easily extensible. Support for new processing elements with a different processor architecture in the distributed system can easily be achieved by generating a cross compiler for the new architecture; the system requires no other change to accommodate the new architecture. The framework has been designed with this key idea in mind. The profiling tools of the application have been implemented using powerful shell programming tools. The arrangement of modules in the design flow of the profiling process is such that support for a new set of profiling tools can be implemented easily by including the new profiling module in the flow. The results of the new profiling module are also cumulatively collected along with the results of the other modules, without introducing any major changes in the profiling process flow.

Further scope for this work is in the direction of TLM (Transaction Level Modeling) for ESL (Electronic System-Level) design. The entire framework can then be extended from heterogeneous distributed systems to the domain of embedded systems. Rapid prototyping and advancements in the speed of product life cycle development are the key factors affecting most embedded system designers. TLM tools provide fast and efficient mechanisms to visualize, design, implement and test prototypes for new market ideas within a very short span of time and with very little effort. The level of abstraction in the TLM models controls the amount of implementation-specific information that needs to be an input for the design of the distributed embedded system. The embedded system designers can utilize the short design turnaround time for exploring the numerous choices available in terms of processor architectures, communication topologies and arbitration protocols for synthesized communication architectures.


A TLM based design strategy and communication synthesis approach has been proposed. The approach is yet to be tested for the full design flow. The first step in the new approach would be the translation of the application specification to a system-level functional description using a high level programming language. The second and third steps in the proposed novel design flow are the same partitioning and code generation steps. The final step in the flow would be the communication synthesis of the architecture. Different levels of abstraction can be used to represent the details of the communication architecture. The entire framework would represent a modular design flow for mapping applications to a distributed heterogeneous system, even without operating system support.


REFERENCES

[1] A. Rettberg, M. C. Zanella and F. J. Rammig. From Specification to Embedded System Applications. Springer Publications, 2005.

[2] A. Sinha and A. P. Chandrakasan. JouleTrack - A web based tool for Software Energy Profiling. In IEEE/ACM Design Automation Conference, pages 220-225, 2001.

[3] A. Spiegel. Pangaea: An automatic distribution front-end for Java. In 4th IEEE Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 93-99, 1999.

[4] A. C. Nacul and T. Givargis. Code partitioning for synthesis of embedded applications with Phantom. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pages 190-196, 2004.

[5] C. Chakrabarti and D. Gaitonde. Instruction Level Power Model of Microcontrollers. In International Symposium on Circuits and Systems, pages 76-79, 1999.

[6] D. Chandra, Ch. Fensch, W. K. Hong, L. Wang, E. Yardimci and M. Franz. Code generation at the proxy: An infrastructure-based approach to ubiquitous mobile code. In Proceedings of the Fifth ECOOP Workshop on Object-Orientation and Operating Systems (ECOOP-OOOSWS 2002), Malaga, Spain, 2002.

[7] D. Wu, B. M. Al-Hashimi and P. Eles. Scheduling and mapping of conditional task graphs for the synthesis of low power embedded systems. In IEEE Proceedings - Computers and Digital Techniques, volume 150, pages 262-273, 2003.

[8] F. Vahid and D. Gajski. Closeness metric for system-level functional partitioning. In Proceedings of the IEEE International Conference on European Design Automation 1995, pages 328-333, 1995.

[9] F. Vahid and T. Givargis. Embedded Systems Design - A Unified Hardware/Software Introduction. John Wiley and Sons, 2002.

[10] G. Stitt and F. Vahid. Hardware/Software Partitioning of Software Binaries. In IEEE/ACM International Conference on Computer Aided Design, pages 164-170, 2002.

[11] G. A. Geist, J. A. Kohl and P. M. Papadopoulos. PVM and MPI: A comparison of features. Calculateurs Paralleles, 8(2):137-150, 1996.

PAGE 104

[12] G. C. Hunt and M. L. Scott. The Coign automatic distributed partitioning system. In OSDI '99: Proceedings of the Third Symposium on Operating Systems Design and Implementation, pages 187-200, 1999.

[13] H. A. Muller, S. R. Tilley and K. Wong. Understanding software systems using reverse engineering technology: perspectives from the Rigi project. In CASCON '93: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research, pages 217-226, 1993.

[14] H. Z. Zebrowitz. GEDAE(TM) - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications. Lockheed Martin Corporation.

[15] J. Henkel. A Low Power Hardware/Software Partitioning for Core-based Embedded Systems. In Proceedings of the 36th ACM/IEEE Design Automation Conference, pages 122-127, 1999.

[16] J. Henkel and Y. Li. Energy Conscious Hardware Software Partitioning of Embedded Systems: A case study on an MPEG-2 Encoder. In Proceedings of the Sixth International Workshop on Hardware/Software Codesign, pages 23-27, 1998.

[17] J. Kim and I. Lee. Modular code generation from hybrid automata based on data dependency. In Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2003), pages 160-168, 2003.

[18] J. B. Peterson, R. B. O'Connor and P. M. Athanas. Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures. In K. L. Pocek and J. Arnold, editors, IEEE Symposium on FPGAs for Custom Computing Machines, pages 178-187, 1996.

[19] K. S. Vallerio and N. K. Jha. Task Graph Extraction for Embedded System Synthesis. In Proceedings of the IEEE International Conference on VLSI Design, pages 480-486, 2003.

[20] L. Cai and D. Gajski. Transaction level modeling: an overview. In CODES+ISSS '03: Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 19-24, 2003.

[21] L. Guthier, S. Yoo and A. Jerraya. Automatic generation and targeting of application specific operating systems and embedded systems software. In DATE '01: Proceedings of the Conference on Design, Automation and Test in Europe, pages 679-685, 2001.

[22] L. R. Welch, B. Ravindran, J. Henriques and D. K. Hammer. Metrics and Techniques for Automatic Partitioning and Assignment of Object-based Concurrent Programs. In Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, pages 440-447, 1995.

[23] M. Baleani, F. Gennari, Y. Jiang, Y. Patel, R. K. Brayton and A. Sangiovanni-Vincentelli. HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform. In Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES), Estes Park, Colorado, USA, pages 151-156, 2002.


[24] M. Lopez-Vallejo and J. C. Lopez. On the hardware-software partitioning problem: System modeling and partitioning techniques. ACM Transactions on Design Automation of Electronic Systems, 8(3):269-297, 2003.

[25] M. Moy. Techniques and tools for the verification of Systems-on-a-Chip at the transaction level. PhD thesis, INPG, Grenoble, France, 2005.

[26] N. Liogkas, B. MacIntyre, E. Mynatt, Y. Smaragdakis, E. Tilevich and S. Voida. Automatic Partitioning for Prototyping Ubiquitous Computing Applications. IEEE Pervasive Computing, 3(3):40-47, 2004.

[27] N. S. Singh. An automatic code generation tool for partitioned software in distributed computing. Master's thesis, University of South Florida, 2005.

[28] P. Eles, Z. Peng and K. Krzysztof. System level hardware/software partitioning based on simulated annealing and tabu search. Kluwer Journal on Design Automation for Embedded Systems, 2(1):5-32, 1997.

[29] P. Prabhakaran and P. Banerjee. Parallel algorithms for force directed scheduling of flattened and hierarchical signal flow graphs. IEEE Transactions on Computers, 48(7):762-768, 1999.

[30] P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Science, 1936.

[31] R. Canal, J-M. Parcerisa and A. Gonzalez. Dynamic Code Partitioning for Clustered Architectures. International Journal of Parallel Programming, 29:59-71, 2001.

[32] S. Nikolaidis. Instruction-level energy characterization of an ARM processor. In Marlow Workshop, 2003.

[33] Scientific Toolworks, Inc. Understand C.

[34] T. Yang and A. Gerasoulis. Pyrros: Static task scheduling and code generation for message passing multiprocessors. In ICS '92: Proceedings of the 6th International Conference on Supercomputing, pages 428-437, 1992.

[35] V. Kindratenko. Code Partitioning for Reconfigurable High-Performance Computing: A Case Study. In International Conference on Engineering of Reconfigurable Systems and Algorithms, pages 143-149, 2006.

[36] W. Li and C. McDonald. Full support for textual editing in a syntax-directed editor. Technical Report 94/8, Department of Computer Science, The University of Western Australia, 1994.

[37] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure and J. Stockwood. Hardware-software co-design of embedded reconfigurable architectures. In DAC '00: Proceedings of the 37th Conference on Design Automation, pages 507-512, 2000.


APPENDICES


Appendix A  Clustering Background

Hierarchical clustering mechanisms are used for identifying clusters in data items that need to be compared on multiple dimensions. This work uses a variation of K-means clustering. In this appendix, a brief overview of the different clustering methods is provided, along with a comparison of the methods. The motivation for choosing K-means clustering over other clustering approaches is elaborated. The rest of the appendix provides an overview of the different ways to compute the distance between the nodes of the set and a comparison of those methods. The tradeoffs involved in using the individual methods are explained further.

Clustering algorithms can be categorized as hierarchical or non-hierarchical approaches. Hierarchical clustering algorithms can be further classified as divisive or agglomerative. In divisive approaches the algorithm usually begins with the entire set as one cluster, and multiple clusters are then formed by separating out elements that do not closely match the membership property of the set. In agglomerative approaches all the elements are considered as individual clusters, and the elements with close distances are then combined to form clusters. This work uses K-means clustering, which is a non-hierarchical approach. The major difference between a hierarchical approach and a non-hierarchical approach is that in the case of K-means clustering an initial number of clusters is provided as input to the algorithm. The advantage of the method is that the elements can be divided into any number of clusters, as needed by the user. The disadvantage is that the number of clusters chosen by the user is not necessarily the optimal number. This appendix provides an overview of K-means clustering and the metrics used to compute the distance factor between the nodes in the cluster.

K-means Clustering: The K-means algorithm assigns each point to the cluster whose center (also called centroid or medoid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. Example: if the dataset has three dimensions and the


cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3), then the centroid Z = (z1, z2, z3) is given by z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2. The algorithm steps as formulated by J. MacQueen (1967) are as follows:

- Choose the initial number of clusters, k.
- Randomly generate k new cluster centers.
- Map points to the closest cluster center.
- Compute the cost to verify whether the new cluster centers form a better mapping.
- Continue the process until the stopping criterion is met.

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.

The most common form of the algorithm uses an iterative refinement heuristic known as Lloyd's algorithm. Lloyd's algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or, alternatively, the centroids no longer change). Lloyd's algorithm and k-means are often used synonymously, but in reality Lloyd's algorithm is a heuristic for solving the k-means problem; with certain combinations of starting points and centroids, Lloyd's algorithm can in fact converge to the wrong answer. There are two factors that are important in order to obtain an optimal solution with the K-means


algorithm. The first factor is the distance metric used in deciding the membership of nodes to a given cluster. The second factor is the cost function used to steer the algorithm.

Distance Metric: In this section a brief overview of the function used to compute the distance metric is given. Clustering approaches work by grouping elements with similar attributes into clusters. The similarity (or dissimilarity) between nodes is quantified by the distance metric, whose formulation depends on the data elements themselves. Each element in the entire set can be considered equivalent to a point; the dimension of the space containing the point corresponds to the number of different attributes of the element under consideration. One of the earliest techniques for computing the distance metric was proposed by Mahalanobis [30]:

    D_m = sqrt((x - mu)^T S^-1 (x - mu))    (A.1)

is the distance of the multivariate vector x = (x1, x2, x3, ..., xn)^T from the group of values with mean mu = (mu1, mu2, mu3, ..., mun)^T and covariance matrix S. For any two random vectors x and y, the dissimilarity measure between them is defined as follows:

    d(x, y) = sqrt((x - y)^T S^-1 (x - y))    (A.2)

The distance metric thus computed is very effective in grouping data elements with a large number of attributes and a wide range of values. Hence this work adapts the Mahalanobis distance metric. Instead of applying the metric directly on the data elements themselves, the metric is modified to account for the distance between clusters.

Cost Function and Termination Criteria: The cost function steers the clustering algorithm. Improper formulation of the cost function could lead either to a suboptimal solution or, in the worst case, to an infinite loop in which the algorithm is not able to stop at all. The clustering algorithm is based on a greedy approach, hence one important factor to


take into account when evaluating the solution is that the solution could be a local optimum. The key checkpoints that have been added to ensure that the algorithm does not encounter any of the above mentioned drawbacks are described now. The first change applied to the cost function was to compute the change in the cost of the solution based on the total cost of all clusters, in addition to variations in the cost of each cluster; traditional approaches would include a node in a cluster based entirely on the cost of the individual cluster under consideration. The second important change applied to the cost function, in order to avoid an infinite search for the solution, was the addition of a maximum number of iterations of change for the clusters. There is a probability of an inferior quality solution because of the restriction imposed on the number of iterations. A third change is applied in order to eliminate such a probability: the entire clustering algorithm is repeatedly executed a hundred times, with each run executed under a restriction of a random number of iterations. The final solution is the best of the hundred runs.


ABOUT THE AUTHOR

Viswanath Sairaman completed high school in Chennai, India in 1994. He received his undergraduate degree, a Bachelor of Science majoring in Math, from the University of Madras in 1997, and completed his Masters in Computer Applications from Regional Engineering College, Warangal (currently the National Institute of Technology, Warangal) in the year 2000. He was an intern for Honeywell India in 2000 for about six months, and then joined HCL Technologies in the role of a software engineer in July 2000. He has about one year of experience in software development and testing, of which four months were spent in Fairfield, Connecticut, in a role that involved developing software for hardware devices for IPC Corporation. After successfully gaining invaluable experience, he was admitted into the Ph.D. program at the University of South Florida, Tampa, where his area of research was architectural design space exploration, code partitioning and mapping partitioned applications onto distributed systems.