
Graph-theoretic techniques for web content mining
Citation 
 Permanent Link:
 http://digital.lib.usf.edu/SFS0024839/00001
Material Information
 Title:
 Graph-theoretic techniques for web content mining
 Creator:
 Schenker, Adam
 Place of Publication:
 [Tampa, Fla.]
 Publisher:
 University of South Florida
 Publication Date:
 2003
 Language:
 English
Subjects
 Subjects / Keywords:
 graph similarity
 graph distance
 clustering
 machine learning
 classification
 Dissertations, Academic -- Computer Science and Engineering -- Doctoral -- USF ( lcsh )
 Genre:
 government publication (state, provincial, territorial, dependent) ( marcgt )
 bibliography ( marcgt )
 theses ( marcgt )
 nonfiction ( marcgt )
Notes
 Summary:
 ABSTRACT: In this dissertation we introduce several novel techniques for performing data mining on web documents which utilize graph representations of document content. Graphs are more robust than typical vector representations as they can model structural information that is usually lost when converting the original web document content to a vector representation. For example, we can capture information such as the location, order and proximity of term occurrence, which is discarded under the standard document vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available. We introduce the novel Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled using graphs. The system we created around this new algorithm and its prior version is compared with similar web search clustering systems to gauge its usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and more easily browsed by users. Next we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, which allow the use of graphs as fundamental data items instead of vectors. We perform experiments comparing the performance of the new graph-based methods to the traditional vector-based methods for three web document collections. Our experimental results show an improvement for the graph approaches over the vector approaches for both clustering and classification of web documents.
An important advantage of the graph representations we propose is that they allow the computation of graph similarity in polynomial time; usually the determination of graph similarity with the techniques we use is an NP-Complete problem. In fact, there are some cases where the execution time of the graph-oriented approach was faster than that of the vector approaches.
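The idea in the abstract, using graphs in place of vectors in distance-based methods, can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes that each unique term becomes exactly one node and that a directed edge links each term to the term that immediately follows it; the abstract does not spell out this construction. Under that assumption the maximum common subgraph of two graphs is just the intersection of their node and edge sets, so a distance based on it can be computed in polynomial time. The function names (`document_graph`, `mcs_distance`, `knn_classify`), the size measure (nodes plus edges), and the sample documents are all hypothetical choices made for illustration.

```python
from collections import Counter


def document_graph(text):
    """Build a simple term graph (assumed representation, not the thesis's exact one).

    Nodes are the unique lowercase terms; a directed edge (a, b) means
    term b immediately follows term a somewhere in the text.
    """
    terms = text.lower().split()
    nodes = set(terms)
    edges = set(zip(terms, terms[1:]))
    return nodes, edges


def graph_size(graph):
    """Size of a graph, taken here (as an assumption) to be nodes plus edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    Because node labels are unique, the maximum common subgraph is the
    intersection of the node sets and of the edge sets, so no
    subgraph-isomorphism search is needed.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    mcs_size = len(nodes1 & nodes2) + len(edges1 & edges2)
    largest = max(graph_size(g1), graph_size(g2))
    return 1.0 - mcs_size / largest if largest else 0.0


def knn_classify(query, labeled_graphs, k=3):
    """k-Nearest Neighbors vote, with mcs_distance replacing vector distance."""
    ranked = sorted(labeled_graphs, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (document_graph("cheap flights and hotel deals"), "travel"),
        (document_graph("hotel deals and cheap flights"), "travel"),
        (document_graph("graph clustering of web pages"), "mining"),
        (document_graph("web pages and graph distance"), "mining"),
    ]
    query = document_graph("graph distance for web pages")
    print(knn_classify(query, training, k=3))
```

Unlike a bag-of-words vector, the edge sets keep term order, so two documents with the same vocabulary in a different order are not at distance zero.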
 Thesis:
 Thesis (Ph.D.)--University of South Florida, 2003.
 Bibliography:
 Includes bibliographical references.
 System Details:
 System requirements: World Wide Web browser and PDF reader.
 System Details:
 Mode of access: World Wide Web.
 General Note:
 Includes vita.
 General Note:
 Title from PDF of title page.
 General Note:
 Document formatted into pages; contains 145 pages.
 Statement of Responsibility:
 by Adam Schenker.
Record Information
 Source Institution:
 University of South Florida Library
 Holding Location:
 University of South Florida
 Rights Management:
 All applicable rights reserved by the source institution and holding location.
 Resource Identifier:
 001441470 ( ALEPH )
 53961827 ( OCLC )
 AJM5910 ( NOTIS )
 E14SFE0000143 ( USFLDC DOI )
 e14.143 ( USFLDC Handle )
Postcard Information
 Format:
 Book

Full Text 
xml version 1.0 encoding UTF8 standalone no
record xmlns http:www.loc.govMARC21slim xmlns:xsi http:www.w3.org2001XMLSchemainstance xsi:schemaLocation http:www.loc.govstandardsmarcxmlschemaMARC21slim.xsd
leader nam Ka
controlfield tag 001 001441470
003 fts
006 med
007 cr mnuuuuuu
008 031203s2003 flua sbm s0000 eng d
datafield ind1 8 ind2 024
subfield code a E14SFE0000143
035
(OCoLC)53961827
9
AJM5910
b SE
SFE0000143
040
FHM
c FHM
090
TK7885
1 100
Schenker, Adam.
0 245
Graph-theoretic techniques for web content mining
h [electronic resource] /
by Adam Schenker.
260
[Tampa, Fla.] :
University of South Florida,
2003.
502
Thesis (Ph.D.)--University of South Florida, 2003.
500
Includes vita.
504
Includes bibliographical references.
516
Text (Electronic thesis) in PDF format.
538
System requirements: World Wide Web browser and PDF reader.
Mode of access: World Wide Web.
Title from PDF of title page.
Document formatted into pages; contains 145 pages.
590
Adviser: Kandel, Abraham
653
graph similarity.
graph distance.
clustering.
machine learning.
classification.
690
Dissertations, Academic
z USF
x Computer Science and Engineering
Doctoral.
773
t USF Electronic Theses and Dissertations.
4 856
u http://digital.lib.usf.edu/?e14.143

