xml version 1.0 encoding UTF8 standalone no
record xmlns http:www.loc.govMARC21slim xmlns:xsi http:www.w3.org2001XMLSchemainstance xsi:schemaLocation http:www.loc.govstandardsmarcxmlschemaMARC21slim.xsd
leader nam Ka
controlfield tag 001 002005317
003 fts
005 20090729120335.0
006 med
007 cr mnuuuuuu
008 090529s2008 flu s 000 0 eng d
datafield ind1 8 ind2 024
subfield code a E14SFE0002705
035
(OCoLC)362267055
040
FHM
c FHM
049
FHMM
090
QA76 (Online)
1 100
Gutierrez Munoz, Alejandro.
0 245
Analysis of current flows in electrical networks for error-tolerant graph matching
h [electronic resource] /
by Alejandro Gutierrez Munoz.
260
[Tampa, Fla] :
b University of South Florida,
2008.
500
Title from PDF of title page.
Document formatted into pages; contains 66 pages.
502
Thesis (M.S.C.S.)--University of South Florida, 2008.
504
Includes bibliographical references.
516
Text (Electronic thesis) in PDF format.
3 520
ABSTRACT: Information contained in chemical compounds, fingerprint databases, social networks, and interactions between websites all have one thing in common: they can be represented as graphs. The need to analyze, compare, and classify graph datasets has become more evident over the last decade. The graph isomorphism problem is known to belong to the NP class, and the subgraph isomorphism problem is known to be an NP-complete problem. Several error-tolerant graph matching techniques have been developed during the last two decades in order to overcome the computational complexity associated with these problems. Some of these techniques rely upon similarity measures based on the topology of the graphs. Random walks and edit distance kernels are examples of such methods. In conjunction with learning algorithms like back-propagation neural networks, k-nearest neighbor, and support vector machines (SVM), these methods provide a way of classifying graphs based on a training set of labeled instances. This thesis presents a novel approach to error-tolerant graph matching based on current flow analysis. Analysis of current flow in electrical networks is a technique that uses the voltages and currents obtained through nodal analysis of a graph representing an electrical circuit. Current flow analysis in electrical networks shares some interesting connections with the number of random walks along the graph. We propose an algorithm to calculate a similarity measure between two graphs based on the current flows along geodesics of the same degree. This similarity measure can be applied over large graph datasets, allowing these datasets to be compared in a reasonable amount of time. This thesis investigates the classification potential of several data mining algorithms based on the information extracted from a graph dataset and represented as current flow vectors. We describe our operational prototype and evaluate its effectiveness on the NCI-HIV dataset.
538
Mode of access: World Wide Web.
System requirements: World Wide Web browser and PDF reader.
590
Advisor: Lawrence O. Hall, Ph.D.
653
Graph mining
Compound matching
Graph kernel
Graph dataset
Classifier
690
Dissertations, Academic
z USF
x Computer Science
Masters.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.2705
Analysis of Current Flows in Electrical Networks for Error-Tolerant Graph Matching
by
Alejandro Gutierrez Munoz
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Department of Computer Science & Engineering
College of Engineering
University of South Florida
Major Professor: Lawrence O. Hall, Ph.D.
Dmitry B. Goldgof, Ph.D.
Srinivas Katkoori, Ph.D.
Date of Approval: November 10, 2008
Keywords: graph mining, compound matching, graph kernel, graph dataset, classifier
Copyright 2008, Alejandro Gutierrez Munoz
DEDICATION
Every journey begins with a goal, with a hope. This journey was no different for me. What made it special was the support and encouragement of my wife Edna, my family, and friends. To them, I would like to dedicate these pages that mark the start and not the end of another journey for me. What made it possible was the constant advice and academic guidance of my major professor, Dr. Lawrence O. Hall. For him, I have no other words but to say thank you. What made it all worth it was the personal satisfaction of a job well done. But the one who made it all happen was God. Thank you, Lord, for all the good opportunities and people you have placed in my path.
ACKNOWLEDGMENTS
I would especially like to thank the company that I have been working for during the last five years, Unisource Administrators. They supported my education and provided me with all of the necessary tools and time to accomplish this goal. Many thanks to Patrice Say, former VP of Human Resources, who was very supportive of my career during all these years. Special thanks to Andrew Olwert III, President and CEO of Unisource Administrators, who has always believed in me and my work. I would also like to thank Dr. Dmitry B. Goldgof and Dr. Srinivas Katkoori for serving as part of my committee.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND RELATED WORK
CHAPTER 3 CURRENT FLOW ANALYSIS
CHAPTER 4 CURRENT FLOW VECTORS
4.1 GroupN generation
4.2 Current flow along GroupN geodesics
4.3 Current flow under special scenarios
4.4 Current flow with nodal information
4.5 Current flow vectors' similarity measure
CHAPTER 5 IMPLEMENTATION DETAILS
5.1 File formats
5.1.1 Graph dataset file
5.1.2 Current flow vectors dataset file
5.2 CFvectors' implementation
5.2.1 CFvectors' algorithm
5.2.2 CFvectors' computational complexity
5.3 CFcompare's implementation
5.3.1 CFcompare's algorithm
5.3.2 CFcompare's computational complexity
5.4 Additional tools
CHAPTER 6 EXPERIMENTAL RESULTS
6.1 Experimental setup
6.1.1 Graph visual comparison experiments
6.1.2 Graph classification problem on the NCI-HIV dataset
6.2 Result evaluation and comparison
CHAPTER 7 SUMMARY AND FUTURE WORK
7.1 Summary
7.2 Future work
REFERENCES
LIST OF TABLES
Table 3.1 Current flow along paths in Fig. 3.2
Table 4.1 GroupN geodesics for graph in Fig. 4.1
Table 6.1 Average number of compounds within top 30 matches
Table 6.2 Current flow vectors results on HIV00.cfv dataset
Table 6.3 Statistical significance of the results for HIV00
Table 6.4 Statistical significance of the results for HIV10
LIST OF FIGURES
Figure 2.1 Sample graph g
Figure 3.1 A sample graph to calculate current flows
Figure 3.2 A sample graph with voltages and currents
Figure 4.1 A simple graph used in Table 4.1 for GroupN generation
Figure 4.2 Graph with non-connected nodes
Figure 4.3 Current flow along path causing short-circuit
Figure 4.4 Edge weight modification based on vertex label information
Figure 5.1 Graph dataset file abstract grammar
Figure 5.2 Graph dataset file
Figure 5.3 Current flow vectors dataset file abstract grammar
Figure 5.4 Current flow vectors dataset file
Figure 5.5 CFvectors' command line
Figure 5.6 CFvectors' algorithm
Figure 5.7 Current flow calculation algorithm
Figure 5.8 CFcompare's command line
Figure 5.9 CFcompare's algorithm
Figure 5.10 CFcompare results output file
Figure 5.11 Graphs 9168, 58368, 50851, and 50848
Figure 5.12 Class counts file
Figure 5.13 Graphviz's dot file example
Figure 6.1 Graph 1899 matches at 0% and 10%
Figure 6.2 Graph 3417 matches above 0.97
Figure 6.3 Graph 629871 matches above 0.97
Figure 6.4 Graph 633892 matches above 0.97
Figure 6.5 Graphs 642970, 629789, 106563 matches
Figure 6.6 Graphs 26540, 693764, 16086, 121858 matches
Figure 6.7 Graphs 643418, 676606, 676419, 675451 matches
Figure 6.8 Graphs 673997, 671292 matches
ANALYSIS OF CURRENT FLOWS IN ELECTRICAL NETWORKS FOR ERROR-TOLERANT GRAPH MATCHING
Alejandro Gutierrez Munoz
ABSTRACT
Information contained in chemical compounds, fingerprint databases, social networks, and interactions between websites all have one thing in common: they can be represented as graphs. The need to analyze, compare, and classify graph datasets has become more evident over the last decade. The graph isomorphism problem is known to belong to the NP class, and the subgraph isomorphism problem is known to be an NP-complete problem. Several error-tolerant graph matching techniques have been developed during the last two decades in order to overcome the computational complexity associated with these problems. Some of these techniques rely upon similarity measures based on the topology of the graphs. Random walks and edit distance kernels are examples of such methods. In conjunction with learning algorithms like back-propagation neural networks, k-nearest neighbor, and support vector machines (SVM), these methods provide a way of classifying graphs based on a training set of labeled instances. This thesis presents a novel approach to error-tolerant graph matching based on current flow analysis. Analysis of current flow in electrical networks is a technique that uses the voltages and currents obtained through nodal analysis of a graph representing an electrical circuit. Current flow analysis in electrical networks shares some interesting connections with the number of random walks along the graph. We propose an algorithm to calculate a similarity measure between two graphs based on the current flows along geodesics of the same degree. This similarity measure can be applied over large graph datasets, allowing these datasets to be compared in a reasonable amount of time. This thesis investigates the classification potential of several data mining algorithms based on the information extracted from a graph dataset and represented as current flow vectors. We describe our operational prototype and evaluate its effectiveness on the NCI-HIV dataset.
CHAPTER 1 INTRODUCTION
Several error-tolerant graph matching techniques have been developed over the last two decades. Some of these techniques rely upon similarity measures based on the topology of the graphs [1] [2] [3] [4] [5] [6]. Random walks and edit distance kernels are examples of such methods; in conjunction with learning algorithms like k-nearest neighbors, neural networks, and support vector machines (SVM), they provide a way of classifying graphs based on a training set of labeled instances. This thesis investigates an error-tolerant graph matching technique based on analysis of current flows in electrical networks. Error-tolerant graph matching between two graphs is performed using the similarity measure proposed here. The similarity measure is generated from current flow vectors extracted from each graph. Current flow vectors are calculated by applying current flow analysis to the graphs, which are treated as electrical circuits. The current flow vectors capture information about the topology of the graph. This information is later used with the similarity measure to calculate a value between 0 and 1; the greater the value, the more similar the graphs are as defined by the similarity measure.
This thesis is organized as follows. In Chapter 2, we present a brief introduction to graph theory and related approaches for graph comparison. In Chapter 3, we present the concept of electrical nodal analysis for fast discovery of connection subgraphs as
proposed by Faloutsos et al. Nodal analysis applied to undirected graphs is at the heart of our current flow approach to error-tolerant graph matching. As shown by [7] [8], in a graph where the edge weights represent the conductance of the edges and the vertices represent the nodes of the circuit, the electrical current along an edge is proportional to the net number of times that a random walk will traverse that edge. In Chapter 4, we present current flow analysis for the error-tolerant graph matching approach, where several geodesics (shortest paths) are extracted from the graph using nodes of equal degree as starting and ending points. Two sets of geodesics are then evaluated: shortest geodesics and longest geodesics. Current flow analysis is then performed over the two sets of geodesics in order to produce an n-dimensional vectorial representation of the graph. Chapter 4 also introduces a similarity measure based on the n-dimensional vectorial representation of the graphs generated using current flow analysis. This new similarity measure is a function $k: G \times G \to \mathbb{R}$, where two graphs represented by their current flow vectors are compared to each other, and a real number between 0 and 1 is returned as the similarity value between the two graphs. As we will observe, this similarity measure can be used as a kernel in a support vector machine: since $k$ is symmetric and nonnegative, it can make up a positive definite matrix [9]. In Chapter 5, we describe the implementation details of our prototype, which consists of two main programs: CFvectors and CFcompare. CFvectors is the program used to generate the vectorial representation of the graphs using current flow analysis. CFcompare is the program used to compare two sets of graphs and generate a similarity value between each pair of graphs among both sets. Other tools were developed as part of this thesis in order to facilitate the analysis and
visualization of the results. These tools are sdf2gds and gds2dot. Both are used to convert between the file format used by CFvectors and CFcompare and more standard formats. Chapter 6 describes the results obtained on the NCI-HIV dataset [10]. Some examples of different chemical compounds represented as graphs, together with their closest matches, are presented in order to provide a graphical comparison between them. As we will show, the ability to store the graph information as a current flow vector, and later use this representation to find topologically similar graphs, is quite useful. Finding the best match using our similarity measure against a database of more than 40,000 compounds takes a few seconds, and once the current flow vectors have been calculated and stored there is no need to calculate them again, as they remain unchanged for each graph. Chapter 7 presents a summary and ideas for future research using the technique proposed here.
CHAPTER 2 BACKGROUND AND RELATED WORK
Graphs offer a powerful way to represent structured data. Several applications that use graph representations, such as shape analysis, character recognition, and chemical compound matching, take advantage of the benefits of graph databases. The ability to compare two graphs represents an important contribution to the area of graph mining. Graph matching usually refers to comparing the structural similarity between two or more graphs. Graph matching approaches are mainly divided into two classes: exact graph matching and error-tolerant graph matching [11].
Let us define a graph $g$ as a four-tuple $g = (V, E, u, v)$, where $V$ denotes a finite set of nodes or vertices, $E \subseteq V \times V$ denotes a finite set of edges, $u$ denotes a node labeling function, and $v$ denotes an edge labeling function. Fig. 2.1 shows a sample graph $g$ with its corresponding four-tuple.
$$V = \{1, 2, 3, 4\}, \qquad E = \{(1,2), (1,3), (2,3), (3,4), (4,3)\}$$
$$u(x) = \begin{cases} C & x = 1 \\ O & x = 2 \\ H & x = 3 \\ N & x = 4 \end{cases} \qquad v(x,y) = \begin{cases} 1 & x = 1,\ y = 2 \\ 1 & x = 1,\ y = 3 \\ 3 & x = 2,\ y = 3 \\ 1 & x = 3,\ y = 4 \\ 1 & x = 4,\ y = 3 \end{cases}$$
Figure 2.1 Sample graph g
Graphs can be classified into two main categories: directed and undirected. Undirected graphs are those where, for every edge, $(x, y) \in E \Rightarrow (y, x) \in E$. Directed graphs, on the contrary, are those where there is at least one edge $(x, y) \in E$ such that $(y, x) \notin E$. Given graphs $g_1 = (V_1, E_1, u_1, v_1)$ and $g_2 = (V_2, E_2, u_2, v_2)$, $g_1$ is a subgraph of $g_2$, written $g_1 \subseteq g_2$, if it possesses the following characteristics:
1. $V_1 \subseteq V_2$
2. $E_1 = E_2 \cap (V_1 \times V_1)$
3. $u_1(x) = u_2(x)$ for all $x \in V_1$
4. $v_1(x, y) = v_2(x, y)$ for all $(x, y) \in E_1$
Based on the definitions of graph and subgraph we can now define exact and error-tolerant graph matching. In exact graph matching the objective is to identify whether all vertices, edges, node labels, and edge labels are identical between two graphs. This is called graph isomorphism. The most common approach to check graph isomorphism is to traverse a search tree checking all node-to-node correspondences. The computational complexity of the search tree procedure is exponential in the number of nodes of both graphs [11]. A problem closely related to graph isomorphism is subgraph isomorphism, that is, detecting whether a smaller graph is part of a bigger graph. A subgraph isomorphism between graphs $g_1$ and $g_2$ exists if the larger graph can be transformed into the smaller graph by removing some nodes and edges. The subgraph isomorphism problem belongs to the NP-complete class of problems.
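The four subgraph conditions above can be checked mechanically when the node correspondence is already fixed (this is not a subgraph isomorphism search, which would also have to find the correspondence). The following is a minimal sketch; the function name `is_subgraph`, the tuple layout, and the small example graph with its labels and weights are illustrative assumptions rather than part of the thesis prototype.

```python
def is_subgraph(g1, g2):
    """Check conditions 1-4 for g1 being a subgraph of g2.

    Each graph is a tuple (V, E, u, v): a set of nodes, a set of directed
    edge pairs, a dict of node labels, and a dict of edge labels.
    """
    V1, E1, u1, v1 = g1
    V2, E2, u2, v2 = g2
    return (
        V1 <= V2                                                      # condition 1
        and E1 == {(x, y) for (x, y) in E2 if x in V1 and y in V1}    # condition 2
        and all(u1[x] == u2[x] for x in V1)                           # condition 3
        and all(v1[e] == v2[e] for e in E1)                           # condition 4
    )


# Illustrative labeled graph (hypothetical labels and weights).
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 3)}
u = {1: "C", 2: "O", 3: "H", 4: "N"}
v = {(1, 2): 1, (1, 3): 1, (2, 3): 3, (3, 4): 1, (4, 3): 1}
g = (V, E, u, v)

# Restriction of g to nodes {1, 2, 3}: satisfies all four conditions.
sub = ({1, 2, 3}, {(1, 2), (1, 3), (2, 3)},
       {1: "C", 2: "O", 3: "H"}, {(1, 2): 1, (1, 3): 1, (2, 3): 3})

# Same nodes but edge (1, 3) dropped: violates condition 2.
not_sub = ({1, 2, 3}, {(1, 2), (2, 3)},
           {1: "C", 2: "O", 3: "H"}, {(1, 2): 1, (2, 3): 3})
```

Condition 2 makes this an induced subgraph test: every edge of the larger graph between retained nodes must be present, which is why `not_sub` fails.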
Due to its computational complexity, exact graph matching is generally not applied in real scenarios; error-tolerant graph matching is more suitable for larger graph databases and graphs with a high number of vertices. Several approaches have been proposed for the error-tolerant graph matching problem [1] [2] [3] [4] [5] [6]. Some of these approaches are based on similarity measures like the graph edit distance, where a list of edit operations is performed in order to transform one graph into another. Edit operations can be edge edit operations or vertex edit operations. An example of a vertex edit operation is to add or remove a node from one graph in order to find a correspondence on the other graph. The graph edit distance is then defined as the cost of the edit path. Each edit operation can be assigned a specific cost. For example, removing a vertex could be more costly than changing an edge label.
Other error-tolerant graph matching approaches include walk-based graph kernels [20]. Walk-based kernels are defined for directed labeled graphs. The process of mapping a graph to multisets of label sequences, or walks, is what is known as a walk kernel. Cycle pattern kernels (CPK) [20] are based on the idea of mapping graphs into a selected group of cycles and trees. Another approach is to map a graph to a set of frequent subgraphs (FSG) previously identified from the training dataset [17] [18].
The following chapters introduce current flow analysis for error-tolerant graph matching, an approach in which each graph is transformed into a vector of current flows over particular paths of the graph, so that graphs in a graph database can be compared through their current flow vectors.
CHAPTER 3 CURRENT FLOW ANALYSIS
The flow of electrical currents in a network of resistors can be measured between any two nodes in an electrical network. Current flow analysis combines Ohm's law and Kirchhoff's current law to determine the voltage values at each node of the electric circuit. Nodal voltage analysis of electrical circuits is performed by solving a system of equations in which the unknowns are the voltages at different nodes in the circuit. Current flow along the various branches of the circuit can be determined based on the voltages at each node in the circuit. The analysis of the current flows between specific pairs of nodes in a graph is at the heart of our error-tolerant graph matching technique.
In [7] an approach related to electrical currents in a network of resistors was proposed. This approach tries to solve the problem of finding a connection subgraph that can deliver as many units of electrical current as possible. For this purpose, a graph $G = (V, E)$ is treated as an electrical network, where edge weights represent conductance ($C(u,v)$ represents the conductance between nodes $u$ and $v$), and the vertices represent the nodes of the electrical circuit ($V(u)$ represents the voltage at node $u$). The voltages at each node of the circuit are calculated by combining Ohm's law and Kirchhoff's current law.
$$\forall u, v: \quad I(u,v) = C(u,v)\,\bigl(V(u) - V(v)\bigr) \tag{1}$$
$$\forall v \neq s, t: \quad \sum_{u} I(u,v) = 0 \tag{2}$$
With $s$ as the source node and $t$ as the target node, equations (1) and (2) determine the voltages and currents as the solution to the following linear system:
$$V(u) = \sum_{v} \frac{V(v)\,C(u,v)}{C(u)} \quad \forall u \neq s, t \tag{3}$$
$$V(s) = 1, \quad V(t) = 0 \tag{4}$$
$$C(u) = \sum_{v} C(u,v) \tag{5}$$
Solving (3) with boundary conditions (4) determines the voltage at each node. $C(u)$ represents the total conductance of node $u$, that is, the sum of the weights of all edges incident to $u$. Once the current values $I(u,v)$ are available, the current along a particular path, $\hat{I}(P)$ with $P = \{s, \ldots, t\}$, is defined as the prorated current along that path from source to target:
$$\hat{I}(s, u) = I(s, u) \tag{6}$$
$$\hat{I}(s = u_1, \ldots, u_i) = \hat{I}(s = u_1, \ldots, u_{i-1})\,\frac{I(u_{i-1}, u_i)}{I_{out}(u_{i-1})} \tag{7}$$
$$I_{out}(u) = \sum_{\{v \,\mid\, u \to v\}} I(u, v) \tag{8}$$
where $I_{out}(u)$ represents the total current leaving a node, which is equal to the sum of all currents leaving the node in a downhill stream. A downhill stream from node $u$ to node $v$ means that the voltage at node $u$ is higher than the voltage at node $v$, $V(u) > V(v)$. Since the idea of the approach presented in [7] is to find the best connection subgraph, the concept of captured flow $CF(H)$ is introduced. $CF(H)$ of a subgraph $H$ of $G$ is the total delivered current, summed over all paths from source $s$ to target $t$ that belong to $H$. For the purpose of this thesis, we only consider single paths in $G$, not subgraphs of it. The concept of delivered current over a path, $\hat{I}(P)$, is very important in the calculation of the current flow vectors in the next chapter.
The following example illustrates the process of calculating the voltages, currents, and current flows of the graph in Fig. 3.1, which can be treated as an electrical circuit.
Figure 3.1 A sample graph to calculate current flows (nodes s, a, b, c, t; edge conductances C(s,a) = 1, C(s,b) = 1, C(a,b) = 2, C(b,c) = 1, C(b,t) = 3, C(c,t) = 2)
Voltages at the source and target nodes are fixed to 1 and 0 respectively: $V(s) = 1$, $V(t) = 0$. Any vertex of degree 1 (that is, with only one edge connected to it), excluding the source and target nodes, will have a voltage of 0. This is equivalent to a ground node in an electrical circuit; therefore a voltage of 0 is assigned to represent a sink node in the circuit. Once the voltages for the source node and ground nodes have been specified we can proceed to solve a system of linear equations with $n$ variables, where $n$
is equal to the total number of vertices in the graph excluding the source and ground nodes; in this case $n$ is 3. This system can be reduced to an eigenvector calculation of the form:
$$\begin{bmatrix} A & B \\ 0 & I \end{bmatrix} V = V \tag{9}$$
$$A = \bigl[a_{ij}\bigr], \qquad a_{ij} = \frac{w_{ij}}{\sum_{j} w_{ij}} \tag{10}$$
where matrix $A$ represents the relationship between the nodes based on their connection weights, $B$ represents the boundary conditions for $s$, $t$, and other ground nodes, and $I$ is the identity matrix of size $n \times n$. The solution to the system of equations represents the voltages at each of the $n$ nodes.
$$\begin{bmatrix} \sum_i w_{ai} & -w_{ab} & -w_{ac} \\ -w_{ba} & \sum_i w_{bi} & -w_{bc} \\ -w_{ca} & -w_{cb} & \sum_i w_{ci} \end{bmatrix} V = \begin{bmatrix} w_{sa} \\ w_{sb} \\ w_{sc} \end{bmatrix} \quad\Longrightarrow\quad \begin{bmatrix} 3 & -2 & 0 \\ -2 & 7 & -1 \\ 0 & -1 & 3 \end{bmatrix} V = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} \tag{11}$$
$$V = \begin{bmatrix} 0.54 \\ 0.31 \\ 0.10 \end{bmatrix}, \qquad V(a) = 0.54, \quad V(b) = 0.31, \quad V(c) = 0.10 \tag{12}$$
With the voltages for each node we can now calculate the current for each edge using equation (1). Once we have calculated the current along each edge, we need to calculate the current flow along each possible path between source and target. This is done using equations (6) and (7). The current flow is a prorated amount from source to
target based on the total current along each node in the path and the total current leaving each of the nodes along the path. Now we proceed to calculate the current along each path between $s$ and $t$. Because current flows from higher to lower voltage, only downhill paths can be calculated. A downhill path from node $u$ to node $v$ means that the voltage at node $u$ is higher than the voltage at node $v$, $V(u) > V(v)$:
$$\begin{aligned} I(s,a) &= C(s,a)\,(V(s) - V(a)) = 1\,(1 - 0.54) = 0.46 \\ I(s,b) &= C(s,b)\,(V(s) - V(b)) = 1\,(1 - 0.31) = 0.69 \\ I(a,b) &= C(a,b)\,(V(a) - V(b)) = 2\,(0.54 - 0.31) = 0.46 \\ I(b,c) &= C(b,c)\,(V(b) - V(c)) = 1\,(0.31 - 0.10) = 0.21 \\ I(b,t) &= C(b,t)\,(V(b) - V(t)) = 3\,(0.31 - 0) = 0.93 \\ I(c,t) &= C(c,t)\,(V(c) - V(t)) = 2\,(0.10 - 0) = 0.20 \end{aligned} \tag{13}$$
The following figure shows all voltages and currents, as well as the direction of the current flow based on the voltages.
Figure 3.2 A sample graph with voltages and currents (V(s) = 1, V(a) = 0.54, V(b) = 0.31, V(c) = 0.10, V(t) = 0; edge currents as in equation (13))
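The nodal analysis of this example can be reproduced with a few lines of code. The sketch below is only an illustration under stated assumptions: the names `solve` and `nodal_analysis` are hypothetical (they are not taken from the CFvectors prototype), the linear system is solved with a plain Gauss-Jordan elimination instead of the eigenvector formulation of equation (9), and the caller supplies the set of ground nodes. The exact solution is V(a) = 13/24, V(b) = 5/16, V(c) = 5/48, which rounds to the values in equation (12). Because the thesis rounds the voltages to two decimals before applying equation (1), the currents computed here from unrounded voltages can differ from equation (13) in the second decimal.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def nodal_analysis(edges, source, grounds):
    """Return (voltages, downhill currents) for an undirected conductance graph.

    edges: dict mapping (u, v) to the conductance C(u, v).
    The source is fixed at 1 V and every node in grounds at 0 V (equation (4)).
    """
    adj = {}
    for (u, v), c in edges.items():
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    volt = {source: 1.0, **{g: 0.0 for g in grounds}}
    free = [n for n in adj if n not in volt]          # nodes with unknown voltage
    idx = {n: i for i, n in enumerate(free)}
    a = [[0.0] * len(free) for _ in free]
    b = [0.0] * len(free)
    for u in free:                                    # C(u) V(u) - sum C(u,v) V(v) = 0
        for v, c in adj[u].items():
            a[idx[u]][idx[u]] += c
            if v in idx:
                a[idx[u]][idx[v]] -= c
            else:
                b[idx[u]] += c * volt[v]              # known boundary voltage
    for n, x in zip(free, solve(a, b)):
        volt[n] = x
    # Equation (1), keeping only downhill edges (V(u) > V(v)).
    currents = {(u, v): c * (volt[u] - volt[v])
                for u in adj for v, c in adj[u].items() if volt[u] > volt[v]}
    return volt, currents


# Conductances of the sample graph, as used in equation (13).
edges = {("s", "a"): 1, ("s", "b"): 1, ("a", "b"): 2,
         ("b", "c"): 1, ("b", "t"): 3, ("c", "t"): 2}
volt, currents = nodal_analysis(edges, "s", {"t"})
```

A quick sanity check is Kirchhoff's current law at node b: the currents arriving from s and a should equal the currents leaving towards c and t.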
Having the values of all currents along each edge, we can proceed to calculate the current along a particular path, $\hat{I}(P)$ with $P = \{s, \ldots, t\}$. Using equations (7) and (8), the prorated amount of current that flows from node $s$ to node $t$ is calculated. Table 3.1 shows the values of the current flow along different paths from $s$ to $t$:
Table 3.1 Current flow along paths in Fig. 3.2
Path                     Prorated current
s -> b -> t              $0.69 \cdot \frac{0.93}{0.93 + 0.21} = 0.56$
s -> b -> c -> t         $0.69 \cdot \frac{0.21}{0.93 + 0.21} \cdot \frac{0.20}{0.20} = 0.13$
s -> a -> b -> t         $0.46 \cdot \frac{0.46}{0.46} \cdot \frac{0.93}{0.93 + 0.21} = 0.38$
s -> a -> b -> c -> t    $0.46 \cdot \frac{0.46}{0.46} \cdot \frac{0.21}{0.93 + 0.21} \cdot \frac{0.20}{0.20} = 0.08$
As we can observe from Table 3.1, path s -> b -> t is the one that delivers the most current from $s$ to $t$. As mentioned before, in [7] the concept of captured flow is introduced to denote the current flow along selected paths of graph $G$ forming subgraph $H$. $CF(H)$ denotes the sum of all the current flows along each path in $H$. The idea behind the captured flow in subgraph $H$ was to identify the subgraph that delivers the most current relative to the number of nodes being added to $H$. For our purposes we will not consider this concept, as we are interested in current flows along several single paths depending on the characteristics of the source and target nodes.
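The prorated path currents of Table 3.1 follow directly from equations (6) to (8). The sketch below is a minimal illustration, assuming the rounded downhill edge currents of equation (13) as input; the function name `path_current` and the dictionary layout are hypothetical and not taken from the thesis prototype.

```python
def path_current(currents, path):
    """Prorated current delivered along path, following equations (6)-(8).

    currents: dict mapping a downhill edge (u, v) to I(u, v).
    path: sequence of nodes from source to target.
    """
    # I_out(u): total current leaving u over its downhill edges (equation (8)).
    out = {}
    for (u, _), i in currents.items():
        out[u] = out.get(u, 0.0) + i
    flow = currents[(path[0], path[1])]              # equation (6)
    for u, v in zip(path[1:], path[2:]):             # equation (7)
        flow *= currents[(u, v)] / out[u]
    return flow


# Rounded edge currents from equation (13).
currents = {("s", "a"): 0.46, ("s", "b"): 0.69, ("a", "b"): 0.46,
            ("b", "c"): 0.21, ("b", "t"): 0.93, ("c", "t"): 0.20}

paths = [("s", "b", "t"), ("s", "b", "c", "t"),
         ("s", "a", "b", "t"), ("s", "a", "b", "c", "t")]
table = {p: round(path_current(currents, p), 2) for p in paths}
# table maps the four paths to 0.56, 0.13, 0.38 and 0.08, as in Table 3.1.
```

Only downhill edges appear in `currents`, so the sum in `out` matches the definition of the total current leaving a node.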
CHAPTER 4 CURRENT FLOW VECTORS
Current flow analysis along the shortest and longest geodesics (shortest paths) of a graph, as presented in [12], provides a method to describe the graph structure such that the information needed to represent the graph is reduced significantly compared to the original size of the graph representation. Once a graph has been described using current flow analysis, its new representation is an n-dimensional vector that stores the current flow along geodesics of different node degrees.
$$G = (V, E), \qquad \Delta(G) = \max_{v \in V} \deg(v) \tag{14}$$
$$f: G \to \mathbb{R}^{2\Delta(G)}, \quad \text{where } 2\Delta(G) \text{ is an upper bound} \tag{15}$$
As we can observe from (15), the current flow vector is produced by function $f$, which transforms the input graph $G$ into a $2\Delta(G)$-dimensional vector, where $\Delta(G)$ represents the highest node degree among all vertices in the graph. The dimension of the vector is twice the maximum degree because the current flow is analyzed along both the shortest and the longest geodesics of the graph. The size based on the highest vertex degree is an upper bound on the final vector size: for a given graph, some geodesics for a specific node degree may not exist, resulting in a current flow vector of lower dimensionality.
In the following sections we describe the steps needed to perform the transformation from a graph representation, $G = (V, E)$, to a vectorial representation, $\vec{G} \in \mathbb{R}^n$. Section 4.1 describes the process of geodesic selection, also called GroupN generation. In Section 4.2, we describe the steps needed to calculate the current flow along each of the selected geodesics. Nodal analysis is used as shown in Chapter 3 in order to calculate the prorated current along geodesics.
4.1 GroupN generation
Current flow analysis requires the selection of a voltage source node $s$ and a target ground node $t$. Once these nodes have been selected, boundary conditions as described in (4) can be applied to solve the linear system in (3) to find the voltages and currents of the circuit. Path selection is made using shortest paths (geodesics) along the graph. Different source and target nodes will provide different paths, hence different current flows along each path. Each current flow along a particular path will capture different characteristics of the topology of the graph, as the connections and flows along each path differ from each other. The idea then is to select a representative number of paths that will capture as much information as possible about the topology of the graph using current flows. To this end, two different sets of paths are defined: shortest geodesics and longest geodesics. Since different graphs will render different geodesics, we need a way to pair them when comparing them. A good way to describe the characteristics of a geodesic is based on the node degree of its source and target nodes. In order to provide a standard framework of comparison between current flows along geodesics of different graphs, the
selection of geodesics is limited to those in which the source and target nodes have the same node degree. The concept GroupN encompasses the group of geodesics that share the same degree N in their source and target nodes. Each GroupN will have two sets of geodesics: shortest geodesics and longest geodesics. As noted before, by selecting a representative number of paths along the graph we are providing a way to capture as much information as possible about the topology of the graph. Fig. 4.1 and Table 4.1 provide an example of a graph and its corresponding GroupN shortest and longest geodesics.
Figure 4.1 A simple graph used in Table 4.1 for GroupN generation
Geodesics in Group1 are those whose source and target nodes are of degree 1, in this case nodes 8 and 10. The same applies for the other GroupNs. As we can see, multiple geodesics of the same length in the same GroupN can be generated using different source and target nodes. Current flows along these geodesics are averaged to produce a single current flow value for each GroupN set. For example, the Group2 shortest geodesics are (1,2) and (1,3), both with the same path length; when calculating the
current flow for Group2, both current flows are calculated: the current flow between node 1 and node 2 and the current flow between node 1 and node 3. The resulting current flow values are then averaged to produce a single value for Group2.
Table 4.1 GroupN geodesics for graph in Fig. 4.1
GroupN    Shortest Geodesic        Longest Geodesic
Group1    (8,10)                   (8,10)
Group2    (1,2), (1,3)             (1,7)
Group3    (4,5), (4,9), (5,6)      (4,6), (5,9), (6,9)
As mentioned before, the selection of the geodesics is done using a single-source shortest path algorithm from source to target. In this case we are using Dijkstra's algorithm [13]. It is important to note that, for geodesic selection, edge weights in the graph represent the cost (or resistance) of going from one node to another. The distinction matters because, when calculating the current flows using equations (1), (2), (3), and (4), the edge values must represent the conductance between the nodes rather than the resistance. Therefore a conductance equal to the reciprocal of the edge weight is used while performing the nodal analysis.
4.2 Current flow along GroupN geodesics
Now that we have described how to generate the GroupN sets for a given graph, we can proceed with the current flow calculation along each of the geodesics. Calculating currents is done using nodal analysis as described in Chapter 3. Each geodesic can be described as a path from source to target, $P = (s, \ldots, t)$. As noted by equation (4) voltage
values for source and target nodes are initialized to $V(s) = 1$, $V(t) = 0$. In this scenario, the source acts as the voltage source and the target as a ground. In the event the graph has some nodes of degree 1 that are neither the source nor the target, these nodes are considered to be grounds as well, hence the voltage at these nodes is $V(u) = 0$. Once the voltages for source nodes and ground nodes have been specified we can proceed to solve a system of linear equations with $n$ variables, where $n$ is equal to the total number of vertices of the graph minus source and ground nodes. The solution to the system of equations represents the voltages at each of the $n$ nodes. With the voltages for each node, we can now calculate the current for each edge using equation (1). Once we have calculated the current along each edge, we need to calculate the current flow along the geodesic. This is done using equations (6) and (7). The current flow is a prorated amount from source to target based on the total current along each node in the path and the total current leaving each of the nodes along the path.
Having calculated the current flow along each geodesic we can calculate the current flow for each GroupN set. For each set (shortest and longest) we average the current of all geodesics of the same degree. That is, for GroupNs with more than one geodesic of the same length we calculate an average current flow. Having calculated all of the GroupN current flows for shortest and longest geodesics, we can produce our vectorial representation of graph $G$:
$$\vec{G} \in \mathbb{R}^{n}, \qquad n = 2D, \qquad D = \max_{GroupN} \deg(GroupN) \tag{16}$$
The vectorial representation of the graph is defined by:
$$\vec{G} = [\,SN_1, \ldots, SN_D, LN_1, \ldots, LN_D\,] \tag{17}$$
where $SN_i$ is the average current flow value for the shortest geodesic(s) from Group i. Similarly, $LN_i$ is the average current flow value for the longest geodesic(s) from Group i. If a particular degree is not represented in the GroupNs, a value of 0 is assigned to the current flow for that group.

4.3 Current flow under special scenarios

Certain considerations need to be taken into account to ensure that nodal analysis yields useful results. For disjoint graph representations, where certain sections of the graph are not connected to each other, we need to exclude the nodes for which there is no path from the source. This can be accomplished by using Breadth-First Search (BFS) [13], which prevents calculating currents along non-existing connections in the circuit. Fig. 4.2 shows a graph where nodes 0S, 1O, 2O, and 4O are not connected to the rest of the graph. By running BFS to determine whether there is a path from the source to any of these nodes, the algorithm can decide whether or not to calculate the current flow.

Figure 4.2 Graph with nonconnected nodes
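The reachability check described above can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the adjacency-dictionary layout, the function name `reachable_from`, and the six-vertex example graph are assumptions made for the sketch.

```python
from collections import deque


def reachable_from(adjacency, source):
    """Return the set of vertices reachable from `source` (breadth-first search)."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


# Hypothetical graph: vertices 0-3 form one component, vertices 4-5 another.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1], 4: [5], 5: [4]}
component = reachable_from(adjacency, 0)
```

Vertices outside `component` would simply be left out of the system of equations when vertex 0 is the source.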
On the other hand, certain graph topologies can produce an odd distribution of voltages along the geodesic. This happens in particular when many ground nodes (nodes of degree 1) are present and closed rings (cycles) provide alternative routes for the current to flow from source to target while avoiding the extra ground nodes. In such cases the voltages along the nodes of the path are not always in descending order from source to target, which causes the current flow calculation to yield negative results. To avoid this scenario, we opted to exclude any node pair that causes this behavior from the prorated calculation of the current flow as presented in equation (7). This situation is analogous to short-circuiting an electrical network.

Figure 4.3 Current flow along path causing short-circuit
Fig. 4.3 shows an example of a scenario where the current along a path from source node s = 21O to target node t = 13N flows in a downhill stream (as defined in Chapter 3) until it reaches node 9C. Since node 9C is connected to a ground node (the degree of node 12O is 1), the current flows down to this node. Current also flows through a closed ring to reach target node 13N through node 8C. Since the current flows from 8C to 9C, calculating the current along the path (grayed out nodes) would give negative results. As noted, when a scenario like this one arises, we opted to ignore the portion of the current between nodes 8C and 9C and to short-circuit the network from 4C to 8C.

4.4 Current flow with nodal information

So far the information about the graph captured by current flow analysis has been limited to the shortest and longest geodesics between nodes of the same degree. Current flow analysis has only used edge weight information, as a conductance equal to the reciprocal of the edge weight, to generate the current flow vectors. No information about the vertex labels has been included in any of the calculations. Here we propose an extension to the technique investigated by Faloutsos et al. in [7]. In order to include nodal information, that is, to take into account the label associated with each vertex, we modify the weight of each edge of the graph based on the vertices that the edge connects. Suppose that for the graph on the left of Fig. 4.4 we want to modify the edge weights based on the vertex labels. We can arbitrarily assign a different value to each label. For example, label C = 1, label O = 0.5, label H = 0.8, and label N = 0.3.
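A minimal sketch of this edge weight modification is given below, assuming that the new weight of an edge is its original weight plus the values of the two labels it connects (for example 1 + 1 + 0.8 for an edge between a C vertex and an H vertex). The function name, the data layout, and the small four-vertex graph are hypothetical and only serve the illustration.

```python
def add_label_values(edges, labels, label_value):
    """Return new edge weights: original weight + value of each endpoint's label.

    edges:       {(u, v): weight}
    labels:      {vertex: label}
    label_value: {label: numeric value}
    """
    return {
        (u, v): w + label_value[labels[u]] + label_value[labels[v]]
        for (u, v), w in edges.items()
    }


# Label values from the text; the graph itself is a made-up example.
label_value = {"C": 1.0, "O": 0.5, "H": 0.8, "N": 0.3}
labels = {1: "C", 2: "H", 3: "O", 4: "N"}
edges = {(1, 2): 1.0, (1, 3): 1.0, (3, 4): 1.0}
modified = add_label_values(edges, labels, label_value)
```

After the call, edges that originally shared the same weight carry different weights depending on the labels of their endpoints.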
Figure 4.4 Edge weight modification based on vertex label information (left: original graph with vertices labeled C, H, C, O, N and all edge weights equal to 1; right: modified edge weights 1+1+0.8, 1+1+1, 1+1+0.5, 1+0.3+0.5, 1+1+0.5, 1+0.8+0.5, and 1+0.8+0.3)

As we can observe for the graph on the right of Fig. 4.4, all edge weights are now different. Each edge weight is modified depending on which vertices it connects. By modifying the edge weights based on the vertex information, we incorporate some vertex information into the current flow calculation. Since the current flow calculation is based entirely on the edge weights, treated as a network of resistors, modifying each edge weight modifies the current flow along each path.

The question now is how to assign a numeric value to each vertex label. What happens if there are a large number of vertex labels? How big or how small should the values added to the edge weights be compared to the original edge weights? All these questions are best answered based on the characteristics of the graphs to be compared. In our case we will be working with the NCI-HIV dataset. This dataset contains 42,689 chemical compounds that are represented as graphs. The vertex labels are elements of the periodic table; in other words, the number of labels is relatively small. For our purposes we opted to assign a value to each vertex based on the frequency of the element in the whole dataset. Elements such as carbon (C), oxygen (O), and nitrogen (N) were the most common; elements such as aluminum (Al) were less common in the whole dataset. For elements with high frequency the value assigned was lower compared to those with lower
frequency. The idea here is that the most common elements will not provide characteristics about the graph topology that are as distinctive as those provided by rarer elements. The proportion of the added value to the smallest and/or largest original edge weight is also important, as we do not want the information about the vertices to make the original edge weight less important. A percentage of the smallest original edge weight is recommended. For example, if the original edge weights are 1, 2, and 3, the least common vertex label will be assigned a percentage of the smallest original edge weight, in this case 1. The percentage can be 10%, so the value to be added to edges that connect vertices with the least common label will be 0.1. This value will be smaller for the next least common label, and so on, up to the point where the addition to the edge weight is 0 (the most common label).

Other approaches to including vertex label information in the current flow calculation are also valid. For example, in the case of the NCI-HIV dataset, instead of using the frequency of the labels in the dataset we could have decided to assign similar values to elements with similar chemical characteristics. Another example is computer images represented as graphs: a color segmentation algorithm can be applied to the image to divide it into larger sections of similar color, which are then connected with each other to construct a graph of color sections. The vertex label for each section is the color associated with it. Similar colors are then assigned similar numeric label values; this allows the current flow analysis to incorporate color information while comparing images represented as graphs.
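The frequency-based assignment can be sketched as follows. The sketch assumes that the increments are spaced linearly between 0 for the most common label and the chosen fraction of the smallest edge weight for the least common one; the text does not fix the spacing, so this is an assumption, as are the function name and the label counts used in the example.

```python
from collections import Counter


def label_increments(label_counts, min_edge_weight, fraction=0.1):
    """Map each label to the amount added to edge weights touching it.

    The most common label gets 0, the least common gets
    fraction * min_edge_weight, and the rest are spaced linearly (assumed).
    """
    ranked = [label for label, _ in Counter(label_counts).most_common()]
    if len(ranked) == 1:
        return {ranked[0]: 0.0}
    step = fraction * min_edge_weight / (len(ranked) - 1)
    return {label: step * rank for rank, label in enumerate(ranked)}


# Hypothetical label counts for a small dataset whose smallest edge weight is 1.
counts = {"C": 10, "O": 4, "N": 3, "Al": 1}
increments = label_increments(counts, min_edge_weight=1.0)
```

With these counts, carbon receives no increment and aluminum receives the full 10% of the smallest edge weight.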
4.5 Current flow vectors' similarity measure

The representation of the graph topology as an n-dimensional vector allows us to define a similarity measure between two graphs as a numeric value in the range [0..1], $k: G \times G \to \mathbb{R}$. We define the similarity measure k between graphs $G_1$ and $G_2$ as:

$$S = \sum_{d=1}^{D} \frac{\left| G_1[SN_d] - G_2[SN_d] \right|}{G_1[SN_d] + G_2[SN_d]} \qquad (18)$$

$$L = \sum_{d=1}^{D} \frac{\left| G_1[LN_d] - G_2[LN_d] \right|}{G_1[LN_d] + G_2[LN_d]} \qquad (19)$$

$$k(G_1, G_2) = 1 - \frac{S + L}{2D} \qquad (20)$$

$$D = \max\left(\max \deg(G_1), \max \deg(G_2)\right) \qquad (21)$$

The value of k is a real number in [0..1]; the closer the value is to 1, the more similar the current flow vectors of both graphs are. Equation (20) can be described as computing the differences between each pair of GroupN geodesics from graphs $G_1$ and $G_2$. The first summation, S, compares the shortest geodesics from both graphs, while the second summation, L, compares the longest geodesics from both graphs.

As mentioned before, function k can be used as a graph kernel. A positive definite Gram matrix K can be constructed from function k; given that k is always positive and symmetric, k can be referred to as a positive definite (pd) kernel [9]. In the following chapter we present the implementation details of the similarity measure k, from the current flow vector creation process to the error-tolerant graph matching approach using the current flow vectors.
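Equations (18) to (21) translate almost directly into code. The sketch below is illustrative only: it assumes each graph is stored as a dictionary from node degree to a (shortest, longest) pair of current flows, treats missing degrees as 0, takes a term as 0 when both values are equal (which also covers the 0/0 case), and returns 1 for two empty vectors. The names `similarity` and `relative_difference` are not from the thesis.

```python
def relative_difference(a, b):
    """One term of the sums in equations (18) and (19)."""
    if a == b:
        return 0.0  # also avoids 0/0 when a degree is absent in both graphs
    return abs(a - b) / (a + b)


def similarity(flows_1, flows_2):
    """Similarity k of equation (20) for two sparse current flow vectors.

    flows_1, flows_2: {degree: (shortest_flow, longest_flow)}
    """
    degrees = set(flows_1) | set(flows_2)
    if not degrees:
        return 1.0  # assumption: two empty vectors count as identical
    max_degree = max(degrees)  # D in equation (21)
    s_sum = 0.0
    l_sum = 0.0
    for d in range(1, max_degree + 1):
        short_1, long_1 = flows_1.get(d, (0.0, 0.0))
        short_2, long_2 = flows_2.get(d, (0.0, 0.0))
        s_sum += relative_difference(short_1, short_2)
        l_sum += relative_difference(long_1, long_2)
    return 1.0 - (s_sum + l_sum) / (2.0 * max_degree)


# Made-up current flow vectors for two small graphs.
g1 = {1: (0.85, 0.85), 3: (0.45, 0.78)}
g2 = {2: (0.35, 0.57), 3: (0.35, 0.98)}
k12 = similarity(g1, g2)
```

A degree present in only one of the two graphs contributes a full difference of 1 to both sums, so graphs that share few GroupN degrees score close to 0.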
CHAPTER 5 IMPLEMENTATION DETAILS

During the implementation of our prototype we developed two main programs: CFvectors and CFcompare. CFvectors is the program used to generate the vectorial representation of the graphs using current flow analysis as described in Chapter 4; CFcompare is the program used to compare two sets of graphs and generate a similarity value between each pair of graphs among both sets as described in Section 4.5.

5.1 File formats

During the course of our development we defined two file formats to be used by our programs: the graph dataset file (.gds) and the current flow vectors dataset file (.cfv).

5.1.1 Graph dataset file

The graph dataset file stores a set of directed graphs as described in the following abstract grammar:

    Graph Dataset: Graph+
    Graph:         BEGIN graph_name graph_class vertices edges END
    vertices:      {v vertex_id vertex_label}+
    edges:         {e from to edge_weight}*

Figure 5.1 Graph dataset file abstract grammar

An example of a .gds file representing graph G1 is presented in Fig. 5.2.
Graph G1 – Class: CA

    BEGIN G1 CA
    v 1 C
    v 2 O
    v 3 H
    v 4 N
    v 5 C
    v 6 H
    v 7 Z
    e 1 2 1
    e 1 3 1
    e 2 3 3
    e 3 4 1
    e 3 5 2
    e 4 3 1
    e 4 5 4
    e 4 6 2
    e 6 7 5
    e 7 4 3
    e 7 6 5
    END

Figure 5.2 Graph dataset file

5.1.2 Current flow vectors dataset file

The current flow vectors dataset file stores the vectorial representation of the graphs as described in the following abstract grammar:

    Graph Dataset: Graph+
    Graph:         graph_name graph_class ([S+ L*] | [S L+])
    S:             S:degree:current_flow_value
    L:             L:degree:current_flow_value

Figure 5.3 Current flow vectors dataset file abstract grammar

The current flow vectors dataset format allows for a sparse representation of the current flows. As noted before, not all GroupN degrees will be present in a graph. Only those degrees present in the graph need to be stored in the .cfv files. Fig. 5.4 shows a .cfv file.
    Graph  Class  GroupN  Shortest Geodesic  Longest Geodesic
    G1     CM     Group1  0.85467            0.85467
    G1     CM     Group3  0.44655            0.78462
    G2     CA     Group2  0.34677            0.56677
    G2     CA     Group3  0.35477            0.97887
    G3     CI     Group5  0.67878            0.00779

GroupNs for graphs G1, G2, and G3

    G1 CM S:1:0.85467 S:3:0.44655 L:1:0.85467 L:3:0.78462
    G2 CA S:2:0.34677 S:3:0.35477 L:2:0.56677 L:3:0.97887
    G3 CI S:5:0.67878 L:5:0.00779

Figure 5.4 Current flow vectors dataset file

5.2 CFvectors' implementation

CFvectors was implemented in ANSI C++ using the Template Numerical Toolkit (TNT), which is a collection of interfaces and reference implementations of numerical objects useful for scientific computing in C++ [14]. CFvectors receives a .gds file as a parameter and returns a .cfv file as output:

    $> cfvectors –help
    usage: cfvectors

Figure 5.5 CFvectors' command line

Once the translation from the graph representation to the vectorial representation has been performed using CFvectors, there is no need to perform this step again on the same dataset. The following section shows a pseudocode version of the CFvectors program.
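As a side note on the .cfv format of Section 5.1.2, one line of such a file can be read back into a sparse vector with very little code. The sketch below is an assumption about how a reader might be written, not part of the prototype; the function name `parse_cfv_line` and the returned layout (degree mapped to a shortest/longest pair, with 0 for a missing entry) are chosen only for illustration.

```python
def parse_cfv_line(line):
    """Parse 'name class S:deg:val ... L:deg:val ...' into (name, class, flows).

    flows maps degree -> (shortest_flow, longest_flow); an entry that is
    missing for a stored degree defaults to 0.0.
    """
    name, graph_class, *entries = line.split()
    flows = {}
    for entry in entries:
        kind, degree_text, value_text = entry.split(":")
        degree = int(degree_text)
        shortest, longest = flows.get(degree, (0.0, 0.0))
        if kind == "S":
            shortest = float(value_text)
        elif kind == "L":
            longest = float(value_text)
        else:
            raise ValueError("unknown entry type: " + kind)
        flows[degree] = (shortest, longest)
    return name, graph_class, flows


# First line of the file shown in Fig. 5.4.
name, graph_class, flows = parse_cfv_line(
    "G1 CM S:1:0.85467 S:3:0.44655 L:1:0.85467 L:3:0.78462"
)
```

Degrees that never appear on the line are simply absent from `flows`, which mirrors the sparse representation the format is designed for.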
5.2.1 CFvectors' algorithm

     1  CFvectors(gdsfile)
     2  {
     3    // array used to store all graphs
     4    array GraphDataset;
     5
     6    // array that stores current flows vectors
     7    array TempCurrents;
     8
     9    // associative array that holds the counts for each vertex label
    10    array LabelCounts;
    11
    12    // global variable that stores the minimum resistance value in the
    13    // whole dataset
    14    double MinResistance = infinity;
    15
    16    // output file
    17    file cfvfile;
    18
    19    For each graph in gdsfile
    20    {
    21      // extract each graph from gdsfile and add it to the dataset
    22      GraphDataset.add(graph);
    23
    24      // count the vertex labels and store the values
    25      // for example, if graph has 3 vertices with label "C"
    26      // and 1 with label "N", LabelCounts will add to the
    27      // overall count of C, 3, and to overall N count, 1.
    28      LabelCounts.count_labels(graph);
    29
    30      // find the minimum resistance value in the graph
    31      // and keep it if it is lower than the current
    32      // MinResistance value for the whole dataset
    33      if graph.min_edge_weight() < MinResistance {
    34        MinResistance = graph.min_edge_weight()
    35      } // end if
    36
    37    } // end For each
    38
    39    // Order the label counts from most common to least common.
    40    LabelCounts.ReverseSort();
    41
    42    // Find the amount to be added per each vertex label to the
    43    // edge weights as described in section 4.4.
    44    double ResistanceIncrement;
    45    ResistanceIncrement = MinResistance * 0.1 / LabelCounts.size();
    46
    47    // the more common the label, the less resistance increment
    48    For i = 1 to LabelCounts.size()
    49    {
    50      LabelCounts[i].resistance_increment = ResistanceIncrement * i;
    51    }
    52
    53    // Get the current flow vector for each graph
    54    For each graph in GraphDataset
    55    {
    56

Figure 5.6 CFvectors' algorithm
    57      TempCurrents = graph.GetCurrentFlows();
    58
    59      // write current flow vector to output
    60      cfvfile.write(TempCurrents);
    61
    62    }
    63
    64    return cfvfile;
    65
    66  }

Figure 5.6 (continued)

The CFvectors algorithm receives the graph dataset file as a parameter (line 1). Starting in line 19, each graph inside the graph dataset is processed in order to extract the current flow vector and store it in the output file. In line 28, a global variable used to store the frequency of each vertex label is updated; this section relates to the inclusion of nodal information in the current flow calculation. In lines 33-35 another global variable is modified: the MinResistance variable stores the lowest edge weight value in the whole dataset. As described in Section 4.4, the approach to including nodal information in the current flow calculation is to add a percentage of the lowest edge weight value to each edge weight depending on the vertex labels it connects. In this case we use 10% of the minimum resistance value. Lines 40-51 calculate the appropriate resistance increment value for each vertex label depending on its frequency. The most common label will be at the top of the LabelCounts array after it has been sorted in reverse order (line 40). Starting with the most common label, the resistance increment value increases in equal steps whose size is 10% of the minimum resistance divided by the number of labels in the dataset. For example, if there are only 20 different labels in the whole dataset, and the minimum edge weight is 1, then each increment will be 0.1/20 greater than the previous one, with the least
common label getting a resistance increment of 0.1. The following algorithm shows the current flow calculation that is done in lines 55-62.

     1  graph::GetCurrentFlows()
     2  {
     3    array Geodesics;
     4    vertex source;
     5    vertex target;
     6    array CachedCF;
     7    array GroupNcurrents;
     8    array I;
     9    array Iout;
    10    double Itemp;
    11
    12    // Using Dijkstra's algorithm to find shortest path between all
    13    // GroupN pairs of the graph. The Geodesics variable stores both
    14    // shortest and longest geodesics of the graph for each
    15    // GroupN.
    16    Geodesics = this.GetGeodesics();
    17
    18    // Using the global LabelCounts values modify the edge weights prior to
    19    // calculating the current flows
    20    For each v in this.vertices {
    21      For each e in v.adjency_list {
    22        e.weight += LabelCounts[v.label].resistance_increment;
    23      }
    24    }
    25
    26    // Calculate the current flows for all geodesics
    27    For each x (shortest or longest) geodesic in Geodesics {
    28      For each group_i in Geodesics.GroupN {
    29        For each g in group_i.Geodesics {
    30
    31          // set source and target nodes
    32          source = g.first_vertex;
    33          target = g.last_vertex;
    34
    35          // check to see if the current flow between
    36          // (source,target) has not been calculated
    37          if CachedCF[(source,target)] is NULL {
    38
    39            // Find the voltages for the circuit having source and target
    40            // nodes the first and last vertices of the geodesic.
    41            // This function solves the system of equations as described
    42            // on Chapter 3.
    43            voltages = FindVoltages(source,target);
    44
    45            // calculate the currents using the voltages ONLY for downhill
    46            // current flows as defined in Chapter 3.
    47            // Current is equal to I = ( V(u) - V(v) ) / R(u,v)
    48
    49
    50            For each e(u,v) in this.adjency_list {
    51              I[u,v] = ( voltages[u] - voltages[v] ) / edge(u,v).weight;
    52              // Add each current that goes out of u
    53              // to the Iout figure

Figure 5.7 Current flow calculation algorithm
    54              Iout[u] += I[u,v];
    55            }
    56
    57
    58            // calculate the prorated current through the geodesic (s,...,t)
    59            Itemp = 1;
    60            For each nodepair(u,v) in g {
    61              Itemp *= I[(u,v)]/Iout[u];
    62            }
    63
    64            // store current flow value between source and target
    65            CachedCF[(source,target)] = Itemp;
    66          } // endif
    67
    68          // add the CF to the result groupN (shortest or longest)
    69          GroupNcurrents[x,group_i].add(CachedCF[(source,target)]);
    70
    71        } // end For each g
    72
    73        // Now that all geodesics currents in the i-th groupN
    74        // have been calculated create an average in case there is
    75        // more than one.
    76        if (GroupNcurrents[x,group_i].size() > 1 ) {
    77          GroupNcurrents[x,group_i] /= GroupNcurrents[x,group_i].size();
    78        }
    79
    80      } // end For each group_i
    81
    82    } // end For x (shortest or longest)
    83
    84    return GroupNcurrents;
    85  }

Figure 5.7 (continued)

The GetCurrentFlows() function is in charge of generating the current flow vector for the current graph. In line 16, the shortest and longest geodesics of the graph are found using Dijkstra's algorithm. Each geodesic belongs to a particular GroupN. For example, if a graph has 2 vertices, u and v, of the same degree, there could be more than one path between those 2 vertices. The GetGeodesics() function in line 16 finds all the shortest paths between u and v; it then keeps one shortest path with the minimum path length (shortest geodesic) and the shortest path with the maximum path length (longest geodesic). This process is applied to all GroupN pairs. In the event that more than 1 pair of the same degree (same GroupN) exists, an average for all the shortest geodesics and an average for all the longest geodesics will be
calculated (lines 76-78). Lines 26-85 go through all GroupNs, shortest and longest, and all geodesics of the current graph, and calculate the current flow for each of the GroupN groups. In lines 32-33 the source and target nodes are selected; these are the start and end nodes of each geodesic. Since a geodesic could be both the shortest and the longest geodesic at the same time, a cache vector is used to avoid calculating the same current flow twice for the same node pair. Line 37 verifies whether the current flow for a given pair has already been calculated. In line 43, the function FindVoltages(source,target) calculates the voltage of each node in the graph. This function solves the system of equations using an LU decomposition after the initial voltage values for source, target, and ground nodes have been specified. To avoid trying to calculate voltages for nonconnected vertices (in the case of disjoint graphs), breadth-first search (BFS) is used to determine whether a vertex is connected through any path to the source node of the circuit.

Once all voltages have been calculated, we can proceed to calculate the current value for each edge. Lines 50-56 calculate the current using Ohm's law. The value of Iout(u), that is, the total current that exits from node u, is calculated in line 54. Once all currents have been calculated, we can find the prorated current along the geodesic; this is done in lines 59-62. In line 69, the GroupNcurrents variable is updated with the prorated current amount calculated in lines 59-62; each current flow belongs to a particular GroupN and to either the shortest or the longest set. In lines 76-78, once all current flows have been calculated, an average of the current flows for each GroupN shortest set and each GroupN longest set is calculated if there is more than 1 current flow per set. The output of the function is the current flows for all GroupN sets of the current graph.
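The core of this procedure (fixing the source, target, and ground voltages, solving for the remaining voltages, and multiplying the ratios I(u,v)/Iout(u) along the geodesic) can be sketched in plain Python. This is a simplified illustration under several assumptions: the graph is undirected and connected, a small Gaussian elimination stands in for the LU decomposition, and the special cases of Section 4.3 (unreachable vertices, node pairs with negative current) are not handled. All function names and the example circuit are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def prorated_flow(weights, path):
    """Prorated current along `path` = [s, ..., t].

    weights: {(u, v): resistance} describing an undirected, connected graph.
    """
    neighbours = {}
    for (u, v), w in weights.items():
        neighbours.setdefault(u, {})[v] = 1.0 / w  # conductance = 1 / weight
        neighbours.setdefault(v, {})[u] = 1.0 / w
    source, target = path[0], path[-1]
    volt = {source: 1.0, target: 0.0}
    for u, nbrs in neighbours.items():
        if len(nbrs) == 1 and u not in volt:
            volt[u] = 0.0  # other degree-1 nodes act as grounds
    unknown = [u for u in neighbours if u not in volt]
    index = {u: i for i, u in enumerate(unknown)}
    a = [[0.0] * len(unknown) for _ in unknown]
    b = [0.0] * len(unknown)
    for u in unknown:  # Kirchhoff's current law at every unknown node
        i = index[u]
        for v, g in neighbours[u].items():
            a[i][i] += g
            if v in index:
                a[i][index[v]] -= g
            else:
                b[i] += g * volt[v]
    for u, value in zip(unknown, solve_linear(a, b)):
        volt[u] = value
    flow = 1.0
    for u, v in zip(path, path[1:]):
        # total downhill current leaving u (Ohm's law on each edge)
        out = sum(g * (volt[u] - volt[x])
                  for x, g in neighbours[u].items() if volt[u] > volt[x])
        flow *= neighbours[u][v] * (volt[u] - volt[v]) / out
    return flow


# Hypothetical circuit: two parallel routes from 0 to 3, all weights 1.
weights = {(0, 1): 1.0, (1, 3): 1.0, (0, 2): 1.0, (2, 3): 1.0}
flow = prorated_flow(weights, [0, 1, 3])
```

In this symmetric circuit the current leaving the source splits evenly between the two routes, so half of it is attributed to the path 0, 1, 3.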
5.2.2 CFvectors' computational complexity

In order to analyze the computational complexity of the CFvectors algorithm we will assume that the graph dataset file contains only one graph. The label count and resistance increment sections, lines 28-51, are dominated by the sorting of the labels by frequency. The label count in line 28 is done in O(V), while the minimum resistance search in lines 33-35 is done in O(E). The sorting of the labels based on their frequency is done in O(V log(V)). The modification of the resistance increment values is done in O(V). We can say that the section prior to the calculation of the current flows is done in O((2V + E) + V log(V)).

The current flow calculation is much more computationally expensive than the prior section, starting with the discovery of the geodesics and GroupNs in line 16 of the GetCurrentFlows() function. As described in the previous section, in order to find all geodesics, GroupNs must be identified first. The process of identifying GroupNs requires evaluating all possible paths between node pairs of the same degree. For example, for a graph with 6 nodes of degree 3, the number of paths that can be formed between two nodes of degree 3, one as the source and the other as the target, is:

$$\binom{n}{2} = \frac{n!}{2!\,(n-2)!} = \frac{n(n-1)}{2} \qquad (22)$$

$$\binom{6}{2} = \frac{6!}{2!\,(6-2)!} = \frac{6(6-1)}{2} = 15 \qquad (23)$$

The number of paths to be evaluated for each node degree is on the order of $n^2/2$, where n is the number of nodes of a particular degree. The worst case scenario given a particular graph structure is a fully connected graph where a path exists between every single
pair of nodes. In this case, the number of paths to be evaluated is on the order of $V^2/2$. For each of the GroupN node pairs both the shortest and the longest geodesics must be found. This is done using Dijkstra's algorithm, which runs in O((E+V) log(V)) using a priority queue [13]. The whole process of finding all geodesics takes approximately $O\left(V^2 \, (E+V)\log(V) \, / \, 2\right)$. Once all geodesics have been found, the process of calculating voltages and currents is on the order of $O(V^3)$ due to the LU decomposition used to find the voltages. For sparse graphs, the voltage calculation can be improved to O(E) operations per iteration, where the number of iterations depends on the gap between the largest and second largest eigenvalues [7]. The section that calculates the current flows can be said to be on the order of $O\left(V^2 \, (E+V)\log(V) \, / \, 2\right)$, and the whole CFvectors algorithm is on the order of $O\left((2V + E) + V\log(V) + V^2 \, (E+V)\log(V) \, / \, 2 + V^3\right)$.

Since we assumed that the graph dataset has only 1 graph, the total computational complexity of processing a full graph dataset increases in proportion to the number of graphs in the dataset. The computational complexity of the algorithm is relatively high, but we must keep in mind that this step must be done only once for each graph. Once a graph has been transformed from its graph representation to a vectorial representation, the current flow vector that represents the graph never changes, and it can be used in any future comparison of the graph against another current flow vector representing another graph. It is worth mentioning that the graphs in the NCI-HIV dataset are relatively small, with the largest graph having only 214 nodes.
5.3 CFcompare's implementation

CFcompare was implemented in ANSI C++. CFcompare provides several options to compare two .cfv files. The two datasets to be compared are called the query dataset and the base dataset. The query dataset is usually a smaller dataset that we want to compare against our base dataset. Since the number of results that can be obtained from comparing the query dataset to the base dataset is equal to the number of graphs in the query dataset multiplied by the number of graphs in the base dataset, CFcompare provides the ability to limit the number of results to avoid generating huge output files. These options are -n and -t. Option -n allows the user to define the top N results to be generated. Option -t allows the user to define a value from 0 to 1 that acts as a threshold for the similarity measure, meaning that only graphs whose similarity measure is equal to or greater than the provided threshold are returned.

Results are stored in a text file that shows the name of the graph being compared, followed by the graphs that met the criteria provided by the user (either top N, above or equal to the threshold, or all base graphs) in descending order of similarity value (closest matches are listed first); this makes it easier to identify the closest matches. In case the results are needed for classification purposes, CFcompare can provide counts based on the class labels in the base dataset. Option -c allows the user to request that class counts be included. Class counts are generated in a separate file from the results, showing the total number of graphs from each class that met the criteria provided by the user.
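The effect of the threshold and top N options can be illustrated with a short sketch. It is not the CFcompare code: the function name `rank_matches` and its keyword arguments are assumptions, and the similarity values are made up for the example.

```python
def rank_matches(scores, top_n=None, threshold=None):
    """Filter and order matches the way the threshold and top N options are described.

    scores: iterable of (graph_name, similarity) pairs for one query graph.
    """
    kept = [
        (name, mv) for name, mv in scores
        if threshold is None or mv >= threshold
    ]
    kept.sort(key=lambda item: item[1], reverse=True)  # closest matches first
    return kept if top_n is None else kept[:top_n]


# Hypothetical similarity values of four base graphs against one query graph.
scores = [("g3", 0.51), ("g1", 0.99), ("g4", 0.40), ("g2", 0.53)]
top_two = rank_matches(scores, top_n=2)
above_half = rank_matches(scores, threshold=0.5)
```

Applying the threshold before sorting keeps the sort small when the base dataset is large and most graphs fall below the threshold.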
    $> cfcompare –help
    Usage: cfcompare [options] query_set_file [base_set_file]
    If base set is not provided, it compares the query set to itself.
    options:
      -t (0..1): Match value threshold
      -n (1..n): Top n best matches
      -c output class counts

Figure 5.8 CFcompare's command line

5.3.1 CFcompare's algorithm

     1  CFcompare(QueryFile,BaseFile,topN,threshold,produceClassCounts)
     2  {
     3    array QueryDataset;
     4    array BaseDataset;
     5    array ClassCounts;
     6    array topN_matches;
     7    int TotalDegrees;
     8    double g1_shortest,g1_longest, g2_shortest, g2_longest;
     9    double mv,S,L;
    10    file ResultsFile, ClassCountsFile;
    11
    12    // Load files
    13    For each graph in BaseFile {
    14      BaseDataset.add(graph);
    15    }
    16
    17    if QueryFile == BaseFile {
    18      QueryDataset = BaseDataset;
    19    }
    20    else {
    21      For each graph in QueryFile {
    22        QueryDataset.add(graph);
    23      }
    24    }
    25
    26    For each q in QueryDataset {
    27      For each b in BaseDataset {
    28        // Select the max degree between the two graphs
    29        TotalDegrees = max( q.CurrentFlows.size(),
    30                            b.CurrentFlows.size());
    31
    32        // calculate the differences between the each current flow of the
    33        // same degree
    34        For d=0 to TotalDegrees {
    35          g1_shortest = q.CurrentFlows[d].shortest();
    36          g1_longest = q.CurrentFlows[d].longest();
    37          g2_shortest = b.CurrentFlows[d].shortest();
    38          g2_longest = b.CurrentFlows[d].longest();
    39        }
    40
    41        if g1_shortest == g2_shortest
    42        {
    43          // No difference
    44          S += 0.0;
    45        }
    46

Figure 5.9 CFcompare's algorithm
    47        else
    48        {
    49          // Partial Difference
    50          S += abs(g1_shortest - g2_shortest) / (g1_shortest + g2_shortest);
    51        }
    52        if (g1_longest == g2_longest){
    53          // No difference
    54          L += 0.0;
    55        }
    56        else {
    57          // Partial Difference
    58          L += abs(g1_longest - g2_longest) / (g1_longest + g2_longest);
    59        }
    60
    61        // TotalDegrees * 2 is to account for d shortest
    62        // and d longest geodesics
    63        mv = 1.0 - ( (S + L) / (TotalDegrees * 2.0) );
    64
    65        // If the match value is greater or equal than
    66        // the threshold set by the user then store the value
    67        // for pair (q,b)
    68        if (mv >= threshold) {
    69          topN_matches.add((q,b),mv);
    70        }
    71
    72      } // end For each b
    73
    74
    75      topN_matches.ReverseSort();
    76
    77      // Output first top N matches;
    78      For i=0 to topN {
    79        ResultsFile.write(topN_matches[i]);
    80        // Output class counts if requested by user
    81        if ProduceClassCounts = True {
    82          // count per each class how many graphs in the top N
    83          ClassCounts[q,topN_matches[i].class]++;
    84        }
    85      } // end For i to topN
    86
    87      if ProduceClassCounts = True
    88      {
    89        ClassCountsFile.write(ClassCounts[q]);
    90      }
    91
    92
    93    } // end For each q
    94
    95    return;
    96  }

Figure 5.9 (continued)

CFcompare first loads the query and base files into datasets (lines 12-24). If the query set and the base set are the same, the load takes place only once, as both datasets are the same. Once both datasets are loaded, each pair of graphs from the
query set and the base set are compared to one another (lines 26-93). First, the TotalDegrees variable is calculated; this is equivalent to equation (21) in Section 4.5. For each set of shortest and longest current flow values for each node degree, the difference between graph q from the query dataset and graph b from the base dataset is calculated (lines 34-59). Note that if one of the graphs, either q or b, does not have a current flow value for a given node degree, a value of 0 is assigned. With all the current flow values for a particular node degree, that is, the current flow values for the shortest and longest geodesics between vertices of the selected node degree, the match value is then calculated as per equation (20) in line 63. In lines 68-70, the match value is compared to the threshold set by the user. If the match value is above or equal to the threshold, the match is stored. In line 75, all matches that were above or equal to the threshold are ordered from largest to smallest. Recall that the closer the value is to 1, the more similar the current flow vectors of both graphs are. In lines 78-85 we write only the first top N matches to the output, based on their match values. If the user requested class counts, a file is also written containing the count of how many graphs of each class were in the top N matches for each graph. Fig. 5.10 shows a sample result output file.

    9168,CA 58368:CI:0.99474 50851:CI:0.52679 50848:CA:0.51256
    50848,CA 50851:CI:0.95929 64052:CI:0.93493 9168:CA:0.51256
    50851,CI 50848:CA:0.95929 64052:CI:0.90196 9168:CA:0.52679
    58368,CI 9168:CA:0.99474 50851:CI:0.52153 50848:CA:0.50731
    64052,CI 50848:CA:0.93493 50851:CI:0.90196 58368:CI:0.49733

Figure 5.10 CFcompare results output file

This file compares five graphs to each other, storing only the top 3 matches. Both query and base files were the same.
Since the query and base datasets are the same for this experiment, each graph is compared to all other graphs in the dataset excluding
itself. As we can observe, the file first shows the name of the graph followed by the class of the graph (if available). For example, graph 9168 belongs to the CA class. The first match for graph 9168 is graph 58368; the class of the matched graph is also shown, followed by the match value. In this case graph 58368 belongs to class CI and the match value between graph 9168 and graph 58368 was 0.99474. Fig. 5.11 shows graph 9168 and its three closest matches. As we can observe, the closest match, graph 58368, is similar to 9168, while the other matches are clearly different.

Figure 5.11 Graphs 9168, 58368 (MV = 0.99474), 50851 (MV = 0.52679), and 50848 (MV = 0.51256)
    graph, class, top, CA, MaxMV_CA, CI, MaxMV_CI
    9168, CA, 3, 1, 0.51256, 2, 0.99474
    50848, CA, 3, 1, 0.51256, 2, 0.95929
    50851, CI, 3, 2, 0.95929, 1, 0.90196
    58368, CI, 3, 2, 0.99474, 1, 0.52153
    64052, CI, 3, 1, 0.93493, 2, 0.90196

Figure 5.12 Class counts file

Fig. 5.12 shows the class counts file for the graphs in Fig. 5.11. Class counts files also store the maximum match value for each class. For example, graph 50848 has, in its top 3, 1 graph from class CA with a value of 0.51256 and 2 from class CI, of which the maximum match value was 0.95929. In this particular example class CM is not represented, as the dataset file only contained two classes, CA and CI.

5.3.2 CFcompare's computational complexity

CFcompare's computational complexity is linear in the dimension of the current flow vector representing the graph, O(D). The process of comparing one graph to another boils down to solving equation (20); when comparing one graph to a base dataset the complexity increases to O(D log D), which is caused by the ordering of the top N results. CFcompare's computational complexity highlights the benefit of the proposed approach: the process of converting the graph to a vectorial representation, albeit costly in time, is needed only once per graph; any future comparison of such a graph to a database of current flows representing graphs will be almost linear in time.

5.4 Additional tools

Other tools were developed as part of this thesis in order to facilitate the analysis and visualization of the results. These tools are SDF2GDS and GDS2DOT. SDF2GDS
converts an .sdf file (Structures Data File, a common file format developed by Molecular Design Limited to handle a list of molecular structures with associated properties [15]) into a .gds file, which is the format expected by CFvectors. GDS2DOT exports each graph in the graph dataset file to a separate .dot file. The .dot format, as defined by [16], is the input format used by Graphviz, a popular open source suite of tools developed by AT&T research labs for graph visualization. GDS2DOT performs a special ordering of the vertices in order to prepare the graph for a better rendering using Graphviz's neato layout engine. Vertices with a larger number of edges are defined first in the .dot file; this tells Graphviz to position those vertices first, producing a better graphical representation of the graph.

graph CA50848{
  node[shape="circle"]
  "v14" [label="14 C"]
  "v7" [label="7 C"]
  "v4" [label="4 C"]
  "v3" [label="3 N"]
  "v2" [label="2 C"]
  "v1" [label="1 C"]
  "v19" [label="19 C"]
  "v18" [label="18 C"]
  "v17" [label="17 C"]
  "v16" [label="16 C"]
  "v15" [label="15 C"]
  "v13" [label="13 S"]
  "v12" [label="12 C"]
  "v11" [label="11 C"]
  "v10" [label="10 C"]
  "v9" [label="9 C"]
  "v8" [label="8 N"]
  "v5" [label="5 C"]
  "v6" [label="6 O"]
  "v0" [label="0 C"]
  "v0" -- "v1" [label="1"]
  "v1" -- "v2" [label="2"]
  "v1" -- "v3" [label="1"]
  "v2" -- "v4" [label="1"]
  "v2" -- "v5" [label="1"]
  "v3" -- "v6" [label="2"]
  "v3" -- "v7" [label="1"]
  "v4" -- "v8" [label="2"]
  "v4" -- "v9" [label="1"]
  "v5" -- "v10" [label="2"]
  "v7" -- "v11" [label="1"]
  "v7" -- "v8" [label="1"]
  "v9" -- "v12" [label="2"]
  "v10" -- "v12" [label="1"]
  "v11" -- "v13" [label="1"]
  "v13" -- "v14" [label="1"]
  "v14" -- "v15" [label="2"]
  "v14" -- "v16" [label="1"]
  "v15" -- "v17" [label="1"]
  "v16" -- "v18" [label="2"]
  "v17" -- "v19" [label="2"]
  "v18" -- "v19" [label="1"]
}
Figure 5.13 Graphviz's dot file example
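The vertex ordering described for GDS2DOT can be illustrated with a short Python sketch. This is not the thesis implementation: the function name `to_dot`, its arguments, and the tie-breaking rule (descending vertex index among vertices of equal degree, chosen only because it resembles the ordering visible in Fig. 5.13) are assumptions made for illustration.

```python
from collections import Counter


def to_dot(name, labels, edges):
    """Render an undirected labeled graph as Graphviz dot text.

    labels: dict mapping vertex index -> vertex label (e.g. an atom symbol)
    edges:  list of (u, v, weight) tuples
    """
    # Count how many edges touch each vertex.
    degree = Counter()
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1

    # Highest-degree vertices first; ties broken by descending index
    # (hypothetical rule, picked to resemble Fig. 5.13).
    order = sorted(labels, key=lambda v: (-degree[v], -v))

    lines = [f"graph {name} {{", '  node[shape="circle"]']
    for v in order:
        lines.append(f'  "v{v}" [label="{v} {labels[v]}"]')
    for u, v, w in edges:
        lines.append(f'  "v{u}" -- "v{v}" [label="{w}"]')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    labels = {0: "C", 1: "C", 2: "O", 3: "N"}
    edges = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
    print(to_dot("example", labels, edges))
```

Declaring the most connected vertices first only changes the order of the node statements; the edge list is written unchanged, so the graph itself is the same regardless of the ordering.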
CHAPTER 6
EXPERIMENTAL RESULTS

6.1 Experimental setup

During the testing of our prototype we applied our algorithms to the NCI-HIV dataset of chemical compounds [10]. This dataset contains 42,689 chemical compounds, of which 423 are active (CA), 1,081 are moderately active (CM), and 41,185 are inactive (CI). The NCI-HIV dataset has been used in the empirical evaluation of several graph mining techniques [17] [18] [19] [20].

The first step of our experiments was to convert the NCI-HIV dataset from .sdf to .gds. We used SDF2GDS for this purpose. Once we had our graph dataset file, the next step was creating the current flow vectors file. Running CFvectors on the NCIHIV.gds file took a little over 10 minutes on a 2.4 GHz Pentium 4 with 512 MB of RAM.

With the NCIHIV.cfv file at hand we were ready to test the graph matching potential of our algorithm. The first set of experiments, described in Section 6.1.1, shows the results of several comparisons of multiple graphs against the whole NCI-HIV dataset. The second set of experiments, described in Section 6.1.2, shows the results on the graph classification problem for the NCI-HIV dataset, using the class counts obtained from the CFcompare algorithm as the input to several classification models such as neural networks, k-nearest neighbors, and rule-based systems. The results are compared to those reported in [17].
6.1.1 Graph visual comparison experiments

After converting the NCI-HIV dataset from an .sdf file to a .gds file and generating the current flow vectors file (NCIHIV.cfv), the next step in our research was to find out how well the similarity measure works for finding isomorphisms. We compared the NCI-HIV dataset against itself with a threshold of 1.0. As described before, CFcompare compares each graph in the query dataset against all other graphs in the base dataset. Since the query and base datasets are the same for this experiment, each graph is compared to all other graphs in the dataset, excluding itself.

For this experiment, we used two current flow dataset files (.cfv) extracted from the same NCI-HIV dataset. Each .cfv file was produced using a slightly different version of CFvectors. The versions differ in the percentage used during the nodal information integration step described in Section 4.4. Our default implementation, as described in the pseudocode in Fig. 5.7, used only 10% of the lowest resistance value in the dataset. The NCI-HIV dataset contains only three possible values for the edge weights (bonds) between its nodes (atoms): single, double, and triple bonds, represented with weights of 1, 2, and 3 respectively. The other implementation of CFvectors uses a different percentage of the lowest edge weight value; in this case the percentage is 0%, which is equivalent to excluding nodal information when calculating the current flow.

The first current flow vectors dataset file to be evaluated was produced using the 0% percentage; we will refer to this current flow vectors dataset as HIV00.cfv. The second file was produced using the 10% percentage; we will refer to this dataset as HIV10.cfv.
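The self-comparison procedure described above (each graph against every other graph, keeping only matches at or above a threshold) can be sketched in Python as follows. This is only an illustration under stated assumptions: the thesis's similarity measure, equation (20), is not reproduced here, so `cosine_similarity` is a stand-in, and the names `compare`, `query`, `base`, and `top_n` are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Stand-in similarity for equal-length vectors.

    This is NOT the similarity measure of equation (20); it is used
    only so that the sketch is runnable.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare(query, base, threshold, top_n, similarity=cosine_similarity):
    """Compare every query vector against every base vector.

    query, base: dicts mapping graph id -> current flow vector
    Returns a dict mapping each query id to at most top_n
    (match_value, graph_id) pairs with match_value >= threshold,
    best match first. A graph is never compared to itself.
    """
    results = {}
    for qid, qvec in query.items():
        scored = (
            (similarity(qvec, bvec), bid)
            for bid, bvec in base.items()
            if bid != qid
        )
        kept = (pair for pair in scored if pair[0] >= threshold)
        results[qid] = heapq.nlargest(top_n, kept)
    return results


if __name__ == "__main__":
    vectors = {
        "g1": [1.0, 2.0, 3.0],
        "g2": [1.0, 2.0, 3.1],
        "g3": [3.0, 1.0, 0.5],
    }
    for gid, matches in compare(vectors, vectors, 0.97, 3).items():
        print(gid, matches)
```

Passing the same dictionary as both `query` and `base` corresponds to comparing a dataset against itself; with floating-point match values, an exact threshold of 1.0 may need a small tolerance in practice.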
Graph     Vertex labels (0 to 6)    HIV00 M.V.   HIV10 M.V.
1899      O C N C C S S             (query graph)
6745      O C N C C S O             1.0          1.0
2858      O C N C C S N             1.0          1.0
45956     O C N C C S N             1.0          1.0
1895      N C N N C N N             1.0          0.99970
227159    O C N S C S O             1.0          0.99970
65248     N C N S C S S             1.0          0.99960
4645      S C S N C N S             1.0          0.99950
Figure 6.1 Graph 1899 matches at 0% and 10%
For the HIV00.cfv dataset we found 4,371 compounds with matches having a 1.0 matching value. For the HIV10.cfv dataset we found 2,216 compounds with matches having a 1.0 matching value, indicating that the match criterion was stricter. Fig. 6.1 shows the matches for graph 1899 and their corresponding match values at both 0% and 10%. As we can observe from the figure, all graphs are nearly identical. The only difference between graphs 1899, 2858, 6745, and 45956 is at vertex 6: for graph 1899 it is a sulfur atom, S; for graphs 2858 and 45956 it is a nitrogen atom, N; and for graph 6745 it is an oxygen atom, O. Current flow analysis allows for error-tolerant graph matching where the matches will not always be perfect, so that a graph can be matched to closely related graphs.

From the two datasets, HIV00 and HIV10, we can observe the impact of excluding the nodal information during the current flow calculation. For HIV00, where no nodal information is included, more matches were obtained than for HIV10: the number of compounds with matches at a 1.0 threshold for the HIV00 dataset is roughly double the number for the HIV10 dataset. This highlights the benefit of including nodal information in the current flow analysis, as not only the structure of the graph but also the label information could be important during the graph matching process.

Due to the sheer size of the dataset we cannot visually verify each of the results, but after a random verification of several compounds, every one of those that were visually verified was a perfect isomorphism (excluding the same graph compared to itself). This, of course, by no means allows us to present a sound statement about the
efficiency of our prototype, but it is encouraging to see the excellent results achieved when finding isomorphisms.

The next experiment in our research was to perform graph matching with a lower matching value. Instead of applying a threshold of 1.0, we tested with different thresholds. We compared the HIV10.cfv dataset to itself with a threshold of 0.97. Applying the lower threshold we obtained many more compounds with matches; in this case 27,684. Fig. 6.1 shows the match values for the HIV10.cfv dataset that were above the 0.97 threshold for compound 1899. As we can observe, when excluding the vertex label information from the current flow analysis in the HIV00.cfv dataset, the algorithm selected all graphs with identical structure, regardless of the vertex labels. When using HIV10.cfv, which incorporates the vertex label information into the current flow analysis, only those graphs whose labels have very similar values were selected. We can observe that graphs 2858 and 45956 are identical in both structure and vertex labels. Graph 6745 differs in only one vertex, and given that both oxygen (O) and nitrogen (N) are very common labels in the dataset, both should have very similar values. The other graphs, which returned a lower match value after comparison, differ in more than one vertex label.

Fig. 6.2 shows results for compound 3417 at the 0.97 threshold. Only three matches were found for this graph with a match value higher than 0.97. Two of the matches, graphs 629861 and 629864, display a very similar structure, but as we can see from their match values, their current flows are different when compared to graph 3417.
Graph     M.V.
3417      (query graph)
629861    0.98596
56450     0.98231
629864    0.97140
Figure 6.2 Graph 3417 matches above 0.97

From Fig. 6.2 we can observe that all the graphs share a structure with a main body and two long appendages. The main idea behind current flow analysis for error-tolerant graph matching is that by calculating the current flows along specific paths of the graph, the algorithm will capture information not only about the vertices along the
path, but also about vertices alongside the path, as electric current flows through them. The characteristics of each graph's structure and vertex information provide a very singular footprint that allows similar graphs to be matched, as the current flow values should be similar for similar structures.

Our next figure shows one graph that only returned two matches within the 0.97 threshold. One match value is very high and the other one is much closer to the threshold.

Graph     M.V.
629871    (query graph)
629870    0.99991
694620    0.97398
Figure 6.3 Graph 629871 matches above 0.97
Graphs 629871 and 629870 are nearly identical, as illustrated by their match value. On the other hand, compound 694620 shares certain characteristics with compound 629871, especially at the end of the graph, which is made up of two oxygen atoms and one nitrogen atom. We expect that during the current flow analysis these characteristics are the ones producing a particular current flow for specific geodesics that, when compared to those of another graph, provides the similarities needed to obtain a high match value.

Figure 6.4 shows compound 633892, which only had one match within the 0.97 threshold. As we can observe, it is hard to discern common characteristics between these two graphs, other than that their current flow vectors are similar.

Graph     M.V.
633892    (query graph)
639234    0.97450
Figure 6.4 Graph 633892 matches above 0.97
Figures 6.5 to 6.8 show different graphs and their closest two matches with their corresponding match values for the HIV10.cfv dataset. The graphs displayed here were chosen at random from the 42,689 compounds in the NCI-HIV dataset.

GRAPH (class)    MATCH 1 (class, M.V.)     MATCH 2 (class, M.V.)
642970 (CI)      333711 (CI, 0.99296)      119076 (CI, 0.98010)
629789 (CI)      670310 (CI, 0.97940)      637419 (CI, 0.97739)
106563 (CI)      131300 (CI, 0.97703)      148201 (CI, 0.97109)
Figure 6.5 Graphs 642970, 629789, 106563 matches
GRAPH (class)    MATCH 1 (class, M.V.)     MATCH 2 (class, M.V.)
26540 (CI)       26542 (CI, 1.00000)       4971 (CI, 0.99973)
693764 (CI)      641523 (CI, 0.98283)      645311 (CI, 0.97586)
16086 (CM)       79050 (CI, 0.97385)       84096 (CI, 0.97365)
121858 (CM)      76061 (CI, 0.99992)       94547 (CI, 0.98027)
Figure 6.6 Graphs 26540, 693764, 16086, 121858 matches
GRAPH (class)    MATCH 1 (class, M.V.)     MATCH 2 (class, M.V.)
643418 (CM)      70804 (CA, 0.97309)       659624 (CI, 0.97285)
676606 (CM)      639749 (CI, 0.97875)      639734 (CI, 0.97875)
676419 (CA)      661186 (CA, 0.99984)      335755 (CI, 0.98643)
675451 (CA)      675450 (CM, 1.00000)      675449 (CM, 1.00000)
Figure 6.7 Graphs 643418, 676606, 676419, 675451 matches
GRAPH (class)    MATCH 1 (class, M.V.)     MATCH 2 (class, M.V.)
673997 (CA)      686774 (CM, 0.97777)      696894 (CM, 0.97552)
671292 (CA)      671291 (CA, 1.00000)      662767 (CM, 0.97861)
Figure 6.8 Graphs 673997, 671292 matches

As we can observe from Figs. 6.5 to 6.8, current flow analysis for error-tolerant graph matching allows for a fast comparison of a graph against a dataset of graphs of considerable size. In some cases the results are quite good, as for graphs 26540 and 106563, while in other cases, such as graph 642970, it is hard to tell the similarities between the selected graph and its matches. Overall, selecting matches based on their current flow vectors should yield graphs with very similar structures, because calculating the current flow along the shortest and longest geodesics of each GroupN pair of nodes should produce a very distinctive current flow vector for each graph, and similar current flow vectors for graphs with similar structure.
In the next section, we will explore the predictive power of current flow vectors on the NCI-HIV dataset. As mentioned before, the NCI-HIV dataset classifies its 42,689 compounds as active (CA), 423 compounds; moderately active (CM), 1,081 compounds; and inactive (CI), 41,185 compounds. By using the results of current flow analysis for graph classification we will show that for the NCI-HIV dataset the similarity of the compounds is an indicator of the class they belong to. We will also show that current flow analysis for graph comparison allows us to determine which compounds are similar to each other; therefore, when predicting the class of a compound, we will rely on its closest matches as obtained from current flow analysis.

6.1.2 Graph classification problem on the NCI-HIV dataset

In order to evaluate the predictive power of current flow vectors on the NCI-HIV dataset we compare the results against the frequent subgraph discovery algorithm (FSG). In [18] [19], M. Deshpande et al. investigated the predictive power of the FSG algorithm using support vector machines (SVM) as the classification technique. The use of SVM enabled them to associate a higher cost with the misclassification of positive instances. Three different classification problems were defined in [18] and [19]:
1. CA vs. CM
2. CA+CM vs. CI
3. CA vs. CI
We compared the results of current flow vectors for each of these classification problems.

The first step in our experiment was to generate the current flow vectors for all 42,689 compounds. We calculated the current flow vectors for both 0% and 10% of the lowest
edge weight to incorporate nodal information into the current flow analysis. This step took approximately 10 minutes for each dataset, HIV00.cfv and HIV10.cfv. The second step, performed for each of the datasets, was to compare each of the compounds against all other compounds in the dataset using our similarity measure (excluding the comparison of a compound to itself). The comparison took a little over 6 hours for each dataset. The results of the second step were the similarity values (match values) for each of the compounds in the dataset compared to all the other compounds, as well as the class counts files. Given the size of the dataset we only stored the 100 closest matches for each compound. The following table shows the average number of compounds of a particular class within the top 30 closest matches for the HIV00.cfv dataset and for the HIV10.cfv dataset.

Table 6.1 Average number of compounds within top 30 matches (standard deviations in parentheses)
Class   Dataset   Average number of CA   Average number of CI   Average number of CM
CA      HIV00     4.012 (4.538)          23.281 (6.105)         2.707 (2.755)
CA      HIV10     4.019 (4.553)          23.300 (6.094)         2.681 (2.746)
CI      HIV00     0.241 (0.836)          29.034 (1.506)         0.725 (1.037)
CI      HIV10     0.242 (0.840)          29.033 (1.507)         0.725 (1.037)
CM      HIV00     0.901 (2.039)          27.441 (3.899)         1.658 (2.294)
CM      HIV10     0.899 (2.038)          27.439 (3.901)         1.661 (2.296)

From the results in Table 6.1 we can observe several facts. First, for a compound that is active, the average number of active compounds within its top 30 matches (for the HIV00 dataset) is 4.012 with a standard deviation of 4.538; the same top 30 matches contain on average 23.281 inactive compounds, with a standard deviation of 6.105, and 2.707 moderately active compounds, with a standard deviation of 2.755. By comparison, inactive and moderately active compounds have on average only 0.241 and 0.901 active compounds, respectively, within their top 30 matches. This is a good
indicator that the classes of the closest matches of a compound could help to determine its class. If we observe the results for the inactive compounds, we notice that within the top 30 matches the average number of inactive compounds is 29.034 with a standard deviation of 1.506. This indicates that an inactive compound typically has about 29 inactive compounds within its first 30 matches. If we compare this to the number of inactive compounds within the top 30 for the active compounds, 23.281 versus 29.034, we can clearly see the difference.

Another fact to notice from Table 6.1 is how close the results are for the HIV00.cfv and HIV10.cfv datasets. This could indicate several things: first, that the nodal information did not have enough influence on the current flow calculation; second, that the nodal information does not play a pivotal role in the determination of the class for the NCI-HIV dataset; or third, that even after including the nodal information based on the vertex labels (atoms) in the current flow calculation, the matches returned were very similar due to the structure of the compounds. Based on this fact, we will show the results for the HIV00.cfv dataset when using the class counts obtained for the top 100 matches.

As mentioned before, class count files store the number of compounds and the maximum match value for each class within a particular top N. Therefore, after obtaining the top 100 matches for each graph in the dataset, we proceeded to extract the feature vectors from the class counts files for each top N/experiment combination. The idea behind creating one training dataset from each class count file is to evaluate different classification algorithms with different top Ns: it could be that the best classification is obtained by only looking at the first 30 matches, or it could be that better results are obtained when looking at the first 80 matches.
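As a rough illustration of how per-class counts and maximum match values could be derived from a ranked match list for a given top N, consider the Python sketch below. The function name `feature_vector`, the input layout, and the use of 0.0 as the maximum match value of an absent class are assumptions made for this example, not details taken from the prototype.

```python
CLASSES = ("CA", "CM", "CI")


def feature_vector(ranked_matches, top_n):
    """Summarize the top_n matches of one graph.

    ranked_matches: list of (class_label, match_value) pairs, already
    sorted by descending match value.
    Returns [count_CA, count_CM, count_CI, max_CA, max_CM, max_CI].
    A class with no match in the top_n gets a maximum of 0.0
    (an assumption of this sketch).
    """
    counts = {c: 0 for c in CLASSES}
    max_mv = {c: 0.0 for c in CLASSES}
    for label, mv in ranked_matches[:top_n]:
        counts[label] += 1
        max_mv[label] = max(max_mv[label], mv)
    return [counts[c] for c in CLASSES] + [max_mv[c] for c in CLASSES]


if __name__ == "__main__":
    # Matches of graph 9168 as listed in Fig. 5.11.
    matches_9168 = [("CI", 0.99474), ("CI", 0.52679), ("CA", 0.51256)]
    print(feature_vector(matches_9168, 3))
```

For the three matches of graph 9168, the counts (1 CA, 2 CI) and maxima (0.51256 for CA, 0.99474 for CI) agree with the first row of the class counts file in Fig. 5.12. Calling the function with different values of `top_n` on the same ranked list yields one feature vector per top N.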
Let us start with experiment 1, CA vs. CM. Each training dataset for this experiment contains 1,504 feature vectors, one feature vector for each graph that belongs to class CA or CM. We created one training dataset for each top N. Since each top N generates a particular class count file, we have 100 class count files, from which we extracted 100 training datasets, each with 1,504 feature vectors. Each feature vector contains 6 attribute values and 1 class label. The attributes are:
1. Number of CA compounds within the top N matches
2. Number of CM compounds within the top N matches
3. Number of CI compounds within the top N matches
4. Maximum match value for a CA compound
5. Maximum match value for a CM compound
6. Maximum match value for a CI compound
The class label indicates the real class that the compound being compared belongs to. For attributes 4, 5, and 6 the maximum match value is determined by the highest ranking compound of the particular class.

In order to identify the different training datasets we used the following naming convention: Ei_topN, where i is the experiment number and N is the top N class count file used to extract the feature vectors. For example, one training dataset for experiment 1 is the one extracted from the class count file produced from the top 10 results. We will refer to this particular training dataset as E1_top10, where E1 indicates that it is for experiment 1 and top10 indicates that it was extracted from the class count file for the top 10 results. The reason for identifying the particular experiment in the name of the dataset is that we are evaluating a 2-class classification problem. For experiment 2
when we combine CA+CM, any compound belonging to class CA or CM will be class 1, while CI will be class 2.

In addition to the training dataset E1_top10, we have 99 more training datasets for experiment 1, covering E1_top1 to E1_top100. Analogously, we have 100 training datasets for experiment 2, CA+CM vs. CI, each with 42,689 training instances (one for each graph in the NCI-HIV dataset). For experiment 3, CA vs. CI, we have another 100 training datasets, each with 41,608 training instances (one for each graph belonging to class CA or CI).

On the 300 training datasets, we applied a variety of classification algorithms, including naive Bayes, backpropagation neural networks, and support vector machines, in order to attempt to classify the compounds based on the six attributes defined. Each of the algorithms was tested with different parameters in order to find the best set of settings for each algorithm. The idea behind testing several classification algorithms was to find the classifier that best captures the underlying patterns stored in the feature vectors extracted from the class count files. This was done using Weka 3.5.6 [21]. In order to determine the best combination of dataset/algorithm/settings we performed a 5x2 cross-validation with an F-test. The most accurate classifier on all three classification problems was a backpropagation neural network.

6.2 Result evaluation and comparison

After determining the best training dataset/classifier combination using a 5x2 cross-validation with an F-test, we performed a five-fold cross-validation on the best datasets. The reason to perform a five-fold cross-validation is to be able to compare the
results against those presented in [18] and [19], where a five-fold cross-validation was also performed. The results for dataset HIV00.cfv are shown in Table 6.2.

Table 6.2 Current flow vectors results on HIV00.cfv dataset
Weka options: L: learning rate; M: momentum; N: epochs; H: neurons on hidden layer; K: kernel type (2 = RBF); C: cost parameter C for C-SVC; G: gamma value for RBF kernel. The last column is the area under a receiver operating characteristic curve (ROC-AUC), with the standard deviation in parentheses.
Dataset     Classifier       Weka options                  ROC-AUC
E1_Top36    Neural network   -L 0.3 -M 0.2 -N 150 -H 2     0.781 (0.022)
E2_Top57    Neural network   -L 0.3 -M 0.2 -N 500 -H 4     0.715 (0.013)
E3_Top10    Neural network   -L 0.3 -M 0.2 -N 150 -H 4     0.865 (0.022)
E1_Top20    Naive Bayes                                    0.754 (0.012)
E2_Top38    Naive Bayes                                    0.711 (0.020)
E3_Top53    Naive Bayes                                    0.859 (0.015)
E1_Top22    SVM              -K 2 -C 1.0 -G 0.125          0.764 (0.019)
E2_Top57    SVM              -K 2 -C 1.1 -G 0.200          0.710 (0.032)
E3_Top32    SVM              -K 2 -C 0.9 -G 0.175          0.861 (0.027)

As we can observe from Table 6.2, for experiment 1, CA vs. CM, the best dataset/classifier combination was E1_top36 with a neural network. This means that by using a backpropagation neural network with a learning rate of 0.3 (-L 0.3), a momentum of 0.2 (-M 0.2), 150 epochs (-N 150), and 2 units in the hidden layer (-H 2) to classify the class counts within the top 36 matches, we obtained an area under the curve (AUC) of 0.781 with a standard deviation of 0.022. The AUC was calculated using the default method provided by Weka, which for the multilayer perceptron (backpropagation neural network) produces a receiver operating characteristic (ROC) curve [21] by modifying the
threshold of the output unit to determine what class the instance belongs to. Once the ROC curve has been determined, the area under the curve is calculated.

In [18] [19], each classification problem was evaluated using a five-fold cross-validation and ROC curves. In order to determine statistical significance when comparing the results of current flow vectors against the frequent subgraph kernel (FSG), we obtained an AUC average over a five-fold cross-validation. Since we do not have the variance of the AUC for the FSG results, we assume the same pooled sample variance as the one for current flow vectors. Table 6.3 shows the results for each of the three classification problems, comparing current flow vectors to FSG.

Table 6.3 Statistical significance of the results for HIV00
Class. problem      AUC (CF)   AUC (FSG, cost 1.0)   Mean diff   STDev   Non-paired t-test   Confidence level
(1) CA vs CM        0.781      0.774                 0.007       0.021   0.50309             68% (win)
(2) CA+CM vs CI     0.715      0.742                 0.027       0.013   3.2839              98% (loss)
(3) CA vs CI        0.865      0.839                 0.026       0.020   1.868619            93% (win)

As we can observe from the results, current flow vectors performed better on classification problems (1) and (3). In classification problem (2) FSG outperformed current flow vectors. Based on the analysis of statistical significance, we can see that in classification problem (1) current flow vectors performed better, but the difference is significant only at a 68% level of confidence. On the other hand, performance on classification problem (3) showed statistical significance at 93%, favoring the results obtained when using current flow vectors. In classification problem (2) FSG outperformed current flow vectors, and given the high level of confidence, 98%, it is clear that the FSG results for this particular problem were better than those of current flow vectors.
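The non-paired t statistic reported in Table 6.3 can be approximated with the short Python sketch below. It assumes the usual two-sample formula with a shared standard deviation s and n = 5 folds per method, t = |difference| / (s * sqrt(2/n)); both the formula and the fold count are assumptions about how the table was computed, although they do reproduce the tabulated value for classification problem (2). The function name `pooled_t_statistic` is hypothetical.

```python
import math


def pooled_t_statistic(auc_a, auc_b, std, folds):
    """Two-sample t statistic for two mean AUCs.

    Each mean is assumed to be an average over `folds` folds, and both
    samples are assumed to share the same standard deviation `std`.
    """
    standard_error = std * math.sqrt(2.0 / folds)
    return abs(auc_a - auc_b) / standard_error


if __name__ == "__main__":
    # Classification problem (2) in Table 6.3: CA+CM vs CI.
    t = pooled_t_statistic(0.715, 0.742, 0.013, 5)
    print(f"t = {t:.4f}")
```

Converting the statistic into a confidence level additionally requires the t distribution and a choice of degrees of freedom, which this sketch leaves out.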
Similar results were obtained when applying the most accurate classifiers to the class count results for the HIV10.cfv dataset. As noted previously from the averages for the HIV00.cfv and HIV10.cfv datasets, the class counts for both datasets are nearly identical. Table 6.4 shows the results for the HIV10.cfv dataset.

Table 6.4 Statistical significance of the results for HIV10

Class. Problem     AUC CF   AUC FSG (cost 1.0)   Mean Diff   STDev   Non-Paired T-test   Confidence Level
(1) CA vs CM       0.779    0.774                0.005       0.025   0.31623             62% (win)
(2) CA+CM vs CI    0.717    0.742                0.025       0.014   2.82346             98% (loss)
(3) CA vs CI       0.867    0.839                0.028       0.019   2.33010             97% (win)
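The T-test column of Table 6.4 can be reproduced with a short sketch of a non-paired t statistic. It assumes that both methods share the same standard deviation (as stated in the text for the FSG results) and that each mean comes from n = 5 folds. The function name `unpaired_t` is hypothetical, and the formula is a reconstruction from the tabulated values rather than a documented procedure.

```python
import math


def unpaired_t(mean_a, mean_b, shared_std, n_a=5, n_b=5):
    """Two-sample t statistic when both samples are assumed to share one std."""
    standard_error = shared_std * math.sqrt(1.0 / n_a + 1.0 / n_b)
    return (mean_a - mean_b) / standard_error


# Rows of Table 6.4: (classification problem, AUC CF, AUC FSG, STDev).
table_6_4 = [
    ("CA vs CM", 0.779, 0.774, 0.025),
    ("CA+CM vs CI", 0.717, 0.742, 0.014),
    ("CA vs CI", 0.867, 0.839, 0.019),
]

# Positive values favor current flow vectors, negative values favor FSG.
t_values = {name: unpaired_t(cf, fsg, sd) for name, cf, fsg, sd in table_6_4}
```

Under these assumptions the magnitudes agree with the tabulated 0.31623, 2.82346, and 2.33010; the sign for problem (2) is negative because FSG has the higher AUC there, whereas the table lists magnitudes and marks the direction as win or loss.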
CHAPTER 7 SUMMARY AND FUTURE WORK

7.1 Summary

Current flow analysis in electrical networks, used as a tool for error-tolerant graph matching, has the potential to be a very powerful approach to structural graph similarity. The technique can prove very valuable for datasets in which the topology of the graphs carries most of the information. In addition, by including nodal information in the current flow calculation, the information stored in the vertex labels is also taken into account when generating the current flow vector that represents a particular graph. Examples of such graph datasets are chemical compound, fingerprint matching, and handwriting recognition datasets.

As shown in our empirical results, current flow analysis emerges as a promising technique to detect graph isomorphisms, even on datasets where the vertex label information is important, as seems likely in the case of chemical compounds. Similar graph structures could provide hints about the chemical composition, and current flow similarity could yield good results when searching for similar compounds.

The potential of current flow analysis for graph classification is very promising, as demonstrated by the results obtained on the NCI-HIV dataset. Comparing the results obtained using current flow vectors (CFV) against the frequent subgraph kernel (FSG), we observed that for experiment number 1, CA vs. CM, the results were about the same. For
experiment number 2, CA+CM vs. CI, FSG is better than our approach; and for experiment number 3, CA vs. CI, current flow vectors produced better results. Based on these results, it is encouraging to see somewhat competitive performance, given that this was the first set of experiments for a new technique.

The usage of class counts is only one of many options available to utilize the results provided by current flow vector analysis. A voting mechanism between the different classifiers created by using different top-N matches could be another avenue to investigate, with the hope of obtaining better results.

Another use of the results produced by comparing the current flow vectors of a graph database is to classify graph structures with kernel methods. As mentioned before, for a function $k : G \times G \rightarrow \mathbb{R}$ to be referred to as a graph kernel, it must be a valid positive definite kernel. Since $k$ is symmetric and nonnegative, it can make up a positive definite matrix.

7.2 Future work

Future work analyzing other datasets, together with further exploration of the classification capabilities of the current flow similarity measure, is needed in order to develop the full potential of this promising technique. Match values could be used to make up a graph kernel that is incorporated into a support vector machine or another kernel-based algorithm. Such a kernel isolates the learning algorithm from the instances; in other words, the learning algorithm does not need to access any of the information contained in the graph directly.

Further work in the area of visual comparison is also needed in order to consolidate current flow analysis as a viable technique for graph comparison. As mentioned before, experiments on image databases in order to find similar images would
be a great area of research in which to apply our technique. Different methods of incorporating nodal information into the current flow calculation can be adopted depending on the dataset to be evaluated. In the case of images, similar color information could add similar resistance values to the edges connecting specific nodes.
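Regarding the graph kernel discussed in Section 7.1, one practical step before passing a matrix of match values to a kernel-based learner is to check that the matrix is in fact positive definite. The sketch below attempts a Cholesky factorization, which succeeds exactly when a symmetric matrix is positive definite. The function name `is_positive_definite` and both example matrices are hypothetical and are not taken from the experiments.

```python
import math


def is_positive_definite(matrix, tol=1e-12):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False  # not symmetric, so not a valid kernel matrix
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False  # non-positive pivot: not positive definite
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


# Hypothetical match-value matrices (illustrative numbers only).
gram_ok = [[1.0, 0.5, 0.2],
           [0.5, 1.0, 0.3],
           [0.2, 0.3, 1.0]]
gram_bad = [[1.0, 2.0],
            [2.0, 1.0]]  # symmetric and nonnegative, but eigenvalues are 3 and -1

ok_result = is_positive_definite(gram_ok)
bad_result = is_positive_definite(gram_bad)
```

The second matrix illustrates that symmetric, nonnegative entries do not by themselves guarantee positive definiteness, so a check of this kind on the actual match matrix would be a reasonable safeguard.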
REFERENCES

[1] B. T. Messmer and H. Bunke, "A new algorithm for error-tolerant subgraph isomorphism detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 5, pp. 493–504, 1998.
[2] M. Neuhaus and H. Bunke, "Edit distance-based kernel functions for structural pattern classification," Pattern Recogn., vol. 39, no. 10, pp. 1852–1863, 2006.
[3] H. Bunke and K. Shearer, "A graph distance metric based on the maximal common subgraph," Pattern Recogn. Lett., vol. 19, no. 3-4, pp. 255–259, 1998.
[4] D. Justice and A. Hero, "A binary linear programming formulation of the graph edit distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1200–1214, 2006.
[5] M.-L. Fernández and G. Valiente, "A graph distance metric combining maximum common subgraph and minimum common supergraph," Pattern Recogn. Lett., vol. 22, no. 6-7, pp. 753–758, 2001.
[6] W. D. Wallis, P. Shoubridge, M. Kraetz, and D. Ray, "Graph distances using graph union," Pattern Recogn. Lett., vol. 22, no. 6-7, pp. 701–704, 2001.
[7] C. Faloutsos, K. S. McCurley, and A. Tomkins, "Fast discovery of connection subgraphs," in KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM Press, 2004, pp. 118–127.
[8] P. G. Doyle and J. L. Snell, "Random walks and electric networks," Mathematical Association of America, vol. 22, 1984.
[9] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[10] DTP, "AID2DA99 42,689 2d structures with aids test data as of october 1999, in sdf format." Downloaded from http://cactus.nci.nih.gov/ncidb/download.html, Oct. 1999.
[11] D. J. Cook and L. B. Holder, Mining Graph Data. John Wiley & Sons, 2007.
[12] L. O. Hall and A. Hildoer, "Compound matching using current flows," 2006.
[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. The MIT Press, 2001.
[14] R. Pozo, "TNT Home Page," 2004. [Online]. Available: http://math.nist.gov/tnt/
[15] A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland, and J. Laufer, "Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited," Journal of Chemical Information and Computer Sciences, vol. 32, no. 3, pp. 244–255, 1992.
[16] E. Gansner, E. Koutsofios, and S. North, "Drawing graphs with dot," AT&T Bell Laboratories, Murray Hill, NJ, USA, Technical Report, Feb. 2002. [Online]. Available: http://www.research.att.com/sw/tools/graphviz/dotguide.pdf
[17] C. Borgelt and M. R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules," in ICDM '02: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02). Washington, DC, USA: IEEE Computer Society, 2002, p. 51.
[18] M. Deshpande, M. Kuramochi, and G. Karypis, "Automated approaches for classifying structures," in BIOKDD, 2002, pp. 11–18.
[19] _____, "Frequent substructure-based approaches for classifying chemical compounds," in ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2003, p. 35.
[20] S. Kramer, L. D. Raedt, and C. Helma, "Molecular feature mining in HIV data," in KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2001, pp. 136–143.
[21] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann Publishers, 2005.
