
Intelligent Web Topics Search Using Early Detection and Data Analysis

Intelligent Web Topics Search Using Early Detection and Data Analysis. by Yixin Yang. Presented by Yixin Yang (Advisor Dr. C.C. Lee). July 30, 2003. Outline. Introduction and Background Related Work Our Approach System Architecture Crawl Algorithms and Implementation


Presentation Transcript


  1. Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) July 30, 2003

  2. Outline • Introduction and Background • Related Work • Our Approach • System Architecture • Crawl Algorithms and Implementation • Experimental Results • Conclusions and Future Work

  3. Introduction and Background

  4. General-purpose Search Engine General-purpose search engine: designed to crawl and index the web and collect as many pages as possible. Problem with general-purpose search engines: they lack the ability to find the web sites relevant to a given specific topic.

  5. Topic-specific Search Engine Topic-specific search engine: focuses on one or a limited number of topics and retrieves topic-related pages as fast as possible without deviating to unrelated pages. To help a topic-specific search engine crawl the related hyperlinks, the relevant topics should be provided to it in advance.

  6. Topic-specific Search Engine Sample • Internet directories like Yahoo! and topic-specific search engines like MathSearch offer higher quality by constructing a topic hierarchy, but they require intensive human effort (the hierarchy is maintained and updated by humans). • Google rates sites based on how frequently others link to a page: the more links, the more relevant. Essentially, it harnesses human judgment.

  7. Topic-specific Search Engine Sample (cont.) • Focused Crawler utilizes both web link structure information and content similarity (based on document classification), but this system visits too many irrelevant pages unnecessarily.

  8. Metadata in Web Documents • The metadata of a page x is the description of x furnished by other pages y that hyperlink to x. • Analogy: citations in a research paper. (Diagram: pages y1, y2, and y3 each link to page x.)

  9. Metadata in HTML Four kinds of hyperlinks in an HTML document: • Anchor (<a>) tags • Image (<img>) tags • Map and Area tags • Frame tags Each of these tags has attributes associated with it. Anchor and Area: name, title, alt, href… Image: alt, src, dynsrc, lowsrc… • Anchor text (<a ..>text</a>)

  10. Metadata Experiment by J. Yi and N. Sundaresan • Studied a sample set of 20,000 HTML pages and 206,000 hyperlink references • Showed that anchor text is the most frequently used and most reliable metadata

  11. Target Topics, Candidate Topics and Relevant Topics Target Topics: topics that may consist of many sub-topics. Relevant Topics: topics that are related to a target topic. Candidate Topics: topics that are potentially relevant to a target topic.

  12. Related Work

  13. Recent Research in Topic-specific Search Engines: Topic Expansion Algorithm • Presented by Jeonghee Yi and Neel Sundaresan • Discovers relevant topics of a given topic • Does not need to visit unnecessary web pages and does not need intensive human effort.

  14. Recent Research in Topic-specific Search Engines: Topic Expansion Algorithm (cont.) Four steps in the Topic Expansion Algorithm: • Collects a large number of Web pages • Extracts words from the text contained inside HTML document tags • Selects the words that are potentially relevant to the target topic • Uses a formula and a relationship-based architecture to find the relations between words, then refines and returns the relevant topics.

  15. What is confidence? • A good way to explain confidence is with association rules. • An association rule is an expression X => Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions in the database that contain the items in X tend to also contain the items in Y.
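
A minimal Java sketch of how the confidence of a rule X => Y is usually computed: the fraction of transactions containing X that also contain Y. The class name `RuleConfidence`, the helper `setOf`, and the representation of a transaction as a set of strings are assumptions made for illustration; the slides do not show code for this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the presented system.
final class RuleConfidence {
    /** Fraction of transactions containing all of x that also contain all of y. */
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        int withX = 0;     // transactions that contain every item of x
        int withBoth = 0;  // of those, the ones that also contain every item of y
        for (Set<String> t : transactions) {
            if (t.containsAll(x)) {
                withX++;
                if (t.containsAll(y)) {
                    withBoth++;
                }
            }
        }
        // Define confidence as 0 when x never occurs, to avoid dividing by zero.
        return withX == 0 ? 0.0 : (double) withBoth / withX;
    }

    /** Convenience constructor for a small item set. */
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }
}
```

With the transactions {xml, xsl}, {xml, dtd}, {xml, xsl, java} and {java}, three transactions contain "xml" and two of those also contain "xsl", so the confidence of xml => xsl is 2/3.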

  16. Formula By J. Yi and N. Sundaresan

  17. Recent Research in Topic-specific Search Engines: Topic Expansion Algorithm (cont.) • The Topic Expansion Algorithm does not need to visit unnecessary web pages and does not need intensive human effort. But it still needs: • much human involvement to update the architecture • crawling of many web pages

  18. Our Research

  19. Our Approach • Uses early detection and data analysis techniques to detect and analyze candidate topics • Adds a Stop Word Filter, a Candidate Topic Selector, and a Candidate Topic Filter to the typical web crawler • Simplifies the formula used by J. Yi and N. Sundaresan

  20. Formula we used:

  21. System Architecture

  22. System Architecture

  23. Components in Our System • Web Crawler • Page Parser • Stop Word Filter • Candidate Topic Selector • Candidate Topic Filter • Relevant Term Database

  24. Crawl Algorithms and Implementation

  25. Crawl Algorithms: just like the typical Web Crawler • Starts with a (set of) predefined web URLs and downloads them • Breadth-first search • Uses recursion
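
The breadth-first visiting order can be sketched in Java as follows. To keep the example runnable without network access, an in-memory map `linkGraph` stands in for "download the page and extract its hyperlinks"; that map, the class name `BreadthFirstCrawl`, and the `maxPages` limit are assumptions. The sketch uses an explicit queue, which is the usual way to get breadth-first order; the slide mentions recursion, and how the original system combined the two is not shown.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch, not the presented WebCrawler component.
final class BreadthFirstCrawl {
    /** Returns URLs in the order a breadth-first crawl from the seeds would visit them. */
    static List<String> crawlOrder(Map<String, List<String>> linkGraph,
                                   List<String> seeds, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        Queue<String> frontier = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url); // a real crawler would download and parse the page here
            for (String out : linkGraph.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(out)) { // enqueue each URL at most once
                    frontier.add(out);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps the crawl from looping when pages link back to each other.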

  26. This system: adds components inside the typical web crawler

  27. Implementation : Language and Technology • Java Programming Language • Java HTTP Request • Java AWT and Swing • Java Database Connectivity (JDBC)

  28. Databases Used for this system:

  29. 10 components implemented for this system: • WebCrawler • HttpPage • HTMLparser • CandidateTopicFilter • CandidateTopicSelector • StopWordFilter • StopWordsTable • Tokens.DoStat • DBAccess

  30. Important Components • HTMLparser Parses HTML pages and extracts the metadata. Uses the javax.swing.text.html.parser package to parse the HTML page. Gets the text inside <A> tags. • StopWordsTable When the application starts, reads the stop words from the database “stop_words” table and puts every word into a hash table.
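
A minimal sketch of collecting anchor text with the JDK's Swing HTML parser (`ParserDelegator` with an `HTMLEditorKit.ParserCallback`). The class name `AnchorTextExtractor` and the choice to trim each anchor string are assumptions; the original HTMLparser component may be structured differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch, not the presented HTMLparser component.
final class AnchorTextExtractor {
    /** Returns the text found between each <a> start tag and its matching end tag. */
    static List<String> extract(String html) {
        final List<String> anchors = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside <a>...</a>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && current != null) {
                    anchors.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return anchors;
    }
}
```

Calling `AnchorTextExtractor.extract(pageSource)` on a downloaded page yields one string per anchor, which is the input the stop word filter works on.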

  31. Important Components (cont.) • StopWordFilter Reads the words extracted from the metadata one by one; if a token is also found in the “stop_words” hash table, the word is removed.
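
The stop word lookup can be sketched with a `HashSet`. In the presented system the words come from the “stop_words” database table; here they are passed in directly so the example runs on its own. The class name `StopWordFilterSketch`, the lowercasing, and the tokenization on non-letter, non-digit characters are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the presented StopWordFilter component.
final class StopWordFilterSketch {
    private final Set<String> stopWords = new HashSet<>();

    /** Loads the stop words once, mirroring the table that is read at application start. */
    StopWordFilterSketch(Iterable<String> words) {
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Splits the text into lowercase tokens and drops every token that is a stop word. */
    List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A hash-based set keeps each lookup at roughly constant time, which matters when every token of every crawled page is checked.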

  32. Important Components (cont.) • CandidateTopicSelector Reads one string of tokens at a time, as processed by the stop word filter. For every word in the string except the target topic, the attribute “total_num” increases by one. If the target topic is found in the string, the attribute “co_occur_num” for every word in the string (except the target topic) also increases by one.
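
The two counters can be sketched with in-memory maps in place of the “candidate_topics” table. The class name `CandidateCounts` is an assumption, and so is the choice to count a word at most once per string even if it appears there several times; the slide does not say which way the original system counts.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the presented CandidateTopicSelector component.
final class CandidateCounts {
    final Map<String, Integer> totalNum = new HashMap<>();   // strings that contain the word
    final Map<String, Integer> coOccurNum = new HashMap<>(); // of those, strings that also contain the target

    /** Updates both counters for one filtered token string. */
    void observe(List<String> tokens, String targetTopic) {
        boolean hasTarget = tokens.contains(targetTopic);
        // Assumption: a word repeated inside one string is counted once for that string.
        for (String token : new LinkedHashSet<>(tokens)) {
            if (token.equals(targetTopic)) {
                continue; // the target topic itself is never a candidate
            }
            totalNum.merge(token, 1, Integer::sum);
            if (hasTarget) {
                coOccurNum.merge(token, 1, Integer::sum);
            }
        }
    }
}
```

For example, after observing ["xml", "schema"] and ["java", "schema"] with target topic "xml", the word "schema" has total_num 2 and co_occur_num 1.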

  33. Important Components (cont.) • CandidateTopicFilter Checks every word in the “candidate_topics” table and evaluates each word with the formula; words that meet the requirement are put into the “relevant_terms” table.

  34. Formula we used:
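
The formula itself is not preserved in this transcript. As an illustration only, the sketch below assumes a confidence-style ratio, co_occur_num / total_num, compared against a threshold, in the spirit of the association-rule confidence introduced earlier. The ratio, the thresholds `minConfidence` and `minTotal`, and the class name `CandidateTopicFilterSketch` are all assumptions and may differ from the formula the author actually used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the scoring rule is an assumption, not the author's formula.
final class CandidateTopicFilterSketch {
    /** Returns, in alphabetical order, the candidates whose assumed score passes both thresholds. */
    static List<String> relevantTerms(Map<String, Integer> totalNum,
                                      Map<String, Integer> coOccurNum,
                                      double minConfidence, int minTotal) {
        List<String> relevant = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : totalNum.entrySet()) {
            int total = entry.getValue();
            int coOccur = coOccurNum.getOrDefault(entry.getKey(), 0);
            // Assumed score: share of the word's occurrences that co-occur with the target topic.
            if (total > 0 && total >= minTotal && (double) coOccur / total >= minConfidence) {
                relevant.add(entry.getKey());
            }
        }
        Collections.sort(relevant);
        return relevant;
    }
}
```

The `minTotal` threshold is there so that a word seen only once or twice does not qualify on a lucky ratio; whether the original system applied such a floor is not stated.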

  35. Experimental Results

  36. Run Application • This system runs from the MS-DOS Prompt on any Windows system, such as Windows XP or Windows ME. • You must install and set up a Microsoft Access database before running this system.

  37. Steps: • Step 1: Set up ODBC Data Source

  38. Steps (cont.) • Step 2: Start the Java application in MS-DOS Prompt console

  39. Steps (cont.) • Step 3: Enter the start URL and target topic in java application

  40. Experimental Results: Target Topic XML

  41. Compared to the Topic Expansion Algorithm, this system has: • A lower number of web pages crawled. Other system: crawled 34,000 web pages to get 49 relevant topics out of 54 actual relevant topics. Our system: crawled 17,000 web pages to get 51 relevant topics out of 57 actual relevant topics. • No need for a relationship-based architecture. Other systems: most of them need a relationship-based hierarchy and must update the hierarchy every time. Our system: uses a stop word table instead.
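
The figures above can be put side by side as two derived ratios. These ratios are computed here from the reported counts and do not appear on the slide: the share of actual relevant topics that were found, and the number of pages crawled per relevant topic found.

```latex
\begin{align*}
\text{Topic Expansion Algorithm:}\quad & \frac{49}{54} \approx 90.7\%, & \frac{34{,}000}{49} &\approx 694 \text{ pages per relevant topic found} \\
\text{This system:}\quad & \frac{51}{57} \approx 89.5\%, & \frac{17{,}000}{51} &\approx 333 \text{ pages per relevant topic found}
\end{align*}
```

On these numbers the share of relevant topics found is about the same for both systems, while this system crawls roughly half as many pages per relevant topic.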

  42. Conclusions • Uses early detection and data analysis techniques to detect and analyze candidate topics. • Improves crawl performance: visiting fewer web pages makes the system more efficient. • Less human involvement: no need to create a relationship-based hierarchy.

  43. Future Work • Adapt to other character sets, such as Chinese, Korean, and Japanese. • Find a better way to detect newly coined words and determine their relevance to a specific topic.

  44. Questions?
