An Internet Forum Index
Curtis Spencer, Ezra Burgoyne
The Problem
- Forums provide a wealth of information that search engines often overlook.
- Popular search software (e.g. Google, Yahoo) does not take advantage of the semi-structured data forums provide.
- Many useful, information-rich posts are crawled but never appear in results because of low PageRank.
- Discovering the best forums for a given topic is difficult, even with the help of a search engine.
- Forum users are often unaware of related information found on rival forums.
- A forum's own search software is often slow and returns poor results.
Quick summary of solution
- Forum-detection crawlers continually find new forums with the help of a web search engine (e.g. Dogpile).
- Discovered forums are eventually wrapped in their entirety by a distributed crawler.
- Forum content collected in the database is indexed with MySQL's full-text natural-language index.
- The search ranking algorithm uses data ignored by traditional search engines, such as number of replies, number of views, and popularity of the poster.
Forums supported by Forum Looter
- vBulletin
- phpBB
Discovering forums
- Using WordNet, a program serves dictionary words and their synonyms to a set of distributed crawlers.
- Every link returned by Dogpile is subjected to a detection algorithm that checks URL formations as well as common patterns in the markup.
- The algorithm detects the three most popular forum engines on the internet: vBulletin, phpBB, and Invision.
- To be good netizens, the Dogpile website is accessed at most once every two minutes, and robots.txt is respected.
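The detection step above can be sketched as follows. The report does not list the actual URL formations or markup patterns used, so the regular expressions below are illustrative assumptions (common thread URLs and "Powered by" footer credits for each engine):

```java
import java.util.regex.Pattern;

// Illustrative sketch of the forum-type detector. The real patterns used by
// the crawler are not given in the report; these are plausible stand-ins.
class ForumDetector {
    // Common URL formations for the three supported engines (assumed).
    private static final Pattern VBULLETIN_URL = Pattern.compile("showthread\\.php|vbulletin", Pattern.CASE_INSENSITIVE);
    private static final Pattern PHPBB_URL     = Pattern.compile("viewtopic\\.php|phpbb", Pattern.CASE_INSENSITIVE);
    private static final Pattern INVISION_URL  = Pattern.compile("index\\.php\\?showtopic|invision", Pattern.CASE_INSENSITIVE);

    // Markup fingerprints such as footer credits (assumed).
    private static final Pattern VBULLETIN_HTML = Pattern.compile("Powered by vBulletin", Pattern.CASE_INSENSITIVE);
    private static final Pattern PHPBB_HTML     = Pattern.compile("Powered by phpBB", Pattern.CASE_INSENSITIVE);
    private static final Pattern INVISION_HTML  = Pattern.compile("Invision Power Board", Pattern.CASE_INSENSITIVE);

    // Returns the detected engine name, or null for non-forum pages.
    static String detect(String url, String html) {
        if (VBULLETIN_URL.matcher(url).find() || VBULLETIN_HTML.matcher(html).find()) return "vBulletin";
        if (PHPBB_URL.matcher(url).find()     || PHPBB_HTML.matcher(html).find())     return "phpBB";
        if (INVISION_URL.matcher(url).find()  || INVISION_HTML.matcher(html).find())  return "Invision";
        return null;
    }
}
```

Checking the URL first keeps the common case cheap; the markup fingerprint is the fallback for forums hosted at unrevealing URLs.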
Distributed forum wrapping (architecture)
- A synchronized Java RMI server stores a queue of jobs.
- Distributed crawlers retrieve jobs from the central RMI server.
- Each crawler wraps whatever page its fetched job points to and saves the results to the database.
- Crawlers can also schedule new jobs on the RMI server.
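A minimal sketch of this architecture, assuming hypothetical names (`CrawlJobQueue`, `CrawlJobQueueImpl`) since the report only states that a synchronized RMI server holds a job queue that crawlers both take from and add to:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayDeque;
import java.util.Deque;

// Remote interface the distributed crawlers talk to. In a real deployment
// the implementation would be exported with UnicastRemoteObject and bound
// in an RMI registry.
interface CrawlJobQueue extends Remote {
    String takeJob() throws RemoteException;        // crawler pulls the next page URL to wrap
    void scheduleJob(String url) throws RemoteException; // crawler pushes a newly discovered page
}

// Server-side implementation; methods are synchronized so concurrent
// crawler clients see a consistent FIFO queue.
class CrawlJobQueueImpl implements CrawlJobQueue {
    private final Deque<String> jobs = new ArrayDeque<>();

    public synchronized String takeJob() {
        return jobs.pollFirst(); // null when the queue is drained
    }

    public synchronized void scheduleJob(String url) {
        jobs.addLast(url);
    }
}
```

Letting crawlers schedule jobs through the same interface is what allows the wrap to fan out: wrapping a forum index page yields thread URLs, which become new jobs.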
More on distributed forum wrapping (access)
- To be good netizens, each forum website is accessed at most once in any 20-second period.
- At that rate, completely wrapping some of the largest forums would take two months.
- Each site's last-request time is set by the individual client crawlers and tracked in the RMI server.
- An exponential back-off algorithm is used for slow sites.
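The politeness policy can be sketched as below. The class and method names are assumptions, as is the exact back-off rule (the report says only that the delay backs off exponentially for slow sites; here it doubles per consecutive slow response, with a cap):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-site politeness tracker kept in the RMI
// server: at most one request per host every 20 seconds, with exponential
// back-off for slow hosts.
class PolitenessTracker {
    private static final long BASE_DELAY_MS = 20_000; // 20-second politeness window
    private final Map<String, Long> lastRequest = new HashMap<>();
    private final Map<String, Integer> slowStrikes = new HashMap<>();

    // Delay doubles for each consecutive slow response, capped at 2^6 * 20s.
    synchronized long requiredDelayMs(String host) {
        int strikes = slowStrikes.getOrDefault(host, 0);
        return BASE_DELAY_MS << Math.min(strikes, 6);
    }

    synchronized boolean mayFetch(String host, long nowMs) {
        Long last = lastRequest.get(host);
        return last == null || nowMs - last >= requiredDelayMs(host);
    }

    synchronized void recordFetch(String host, long nowMs, boolean wasSlow) {
        lastRequest.put(host, nowMs);
        if (wasSlow) slowStrikes.merge(host, 1, Integer::sum);
        else slowStrikes.remove(host); // a fast response resets the back-off
    }
}
```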
More on distributed forum wrapping (performance)
- The Java RMI server performed very well, using only 10% of an AMD Athlon CPU.
- Individual crawlers were rather memory intensive; their memory use is attributable to Hibernate and JTidy.
- Crawlers used JTidy to parse pages, extracting data through DOM manipulation in addition to regular expressions.
- Database access was the main bottleneck in the distributed system.
Indexing of data
- A shadow database is periodically updated with forum data from the crawl, and an incremental MySQL full-text index is built over it.
- MySQL's full-text natural-language mode has built-in support for stop words and document-frequency scaling.
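As a sketch of the indexing and query step: the report names only MySQL full-text natural-language search, so the table and column names (`posts`, `title`, `body`) below are assumptions, shown here as the SQL strings a Java client might build:

```java
// Illustrative only: table/column names are assumed, not from the report.
class FulltextSearch {
    // DDL for the index built on the shadow database.
    static final String CREATE_INDEX =
        "CREATE FULLTEXT INDEX idx_post_text ON posts (title, body)";

    // MATCH ... AGAINST in natural-language mode applies MySQL's built-in
    // stop-word list and document-frequency scaling automatically.
    static String searchSql(String terms) {
        String q = escape(terms);
        return "SELECT post_id, MATCH(title, body) AGAINST ('" + q + "') AS relevance "
             + "FROM posts WHERE MATCH(title, body) AGAINST ('" + q + "') "
             + "ORDER BY relevance DESC";
    }

    // Minimal quoting for the sketch; real code would use a PreparedStatement.
    static String escape(String s) {
        return s.replace("'", "''");
    }
}
```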
Ranking algorithm for search results
- Forum software leaves many missed opportunities for metadata analysis.
- We compute a hierarchical weighted value for each post once it matches a natural-language query from MySQL.
- An approximation of this calculation is: Value = w(apc)*apc + w(numViews)*numViews + w(numReplies)*numReplies + w(isThread)*isThread()
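The weighted sum above can be written out directly. The weight values here are placeholders (the report gives the formula only approximately), and "apc" is kept as the report's abbreviation, likely the author's post count given the earlier mention of poster popularity:

```java
// Sketch of the ranking calculation; weights are illustrative placeholders.
class PostRanker {
    double wApc = 1.0, wViews = 0.5, wReplies = 2.0, wThread = 10.0;

    // Weighted value of a matched post: poster popularity (apc), views,
    // replies, and a bonus when the match is a thread starter.
    double value(double apc, int numViews, int numReplies, boolean isThread) {
        return wApc * apc
             + wViews * numViews
             + wReplies * numReplies
             + wThread * (isThread ? 1 : 0);
    }
}
```

Candidate posts from the MySQL full-text match would be re-scored with this value and sorted before being returned.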
Future changes
- Natural-language parsing of the forum post corpus, so that ranking can be influenced by replies such as "thank you" or "great post".
- Collaborative filtering of results per query by tracking user clicks.
- Improved resource usage of the crawlers.