Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana (ANU)
Why Depression? • Leading cause of disability burden in Australia • One in five people suffer from a mental disorder in any one year • The Web is a good way to deliver information and treatments, but ... • A lot of depression information on the Web is of poor quality
Bluepages Search (BPS) • Indexes approximately 200 sites, e.g. • Whole server: suicidal.com/ • Directory: www.healingwell.com/depression/ • Individual page: www.mcmanweb.com/article-226.htm • Approximately 2 weeks of manual effort to create / update the seed list and include patterns • Experiments showed that Google (with ‘depression’) had better relevance but more bad advice • Relevance: only 17% of the relevant pages returned by Google were contained in the BPS crawl
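To make the boundary setting concrete, here is a minimal sketch of how include patterns like those above could be applied, assuming simple prefix matching on scheme-stripped URLs; the pattern list and function names are illustrative, not the actual BPS rules.

```python
# Minimal sketch of prefix-based include rules, restricting a crawl to listed
# servers, directories, or individual pages. The matching scheme here is an
# illustrative assumption, not the authors' exact implementation.

INCLUDE_PATTERNS = [
    "suicidal.com/",                      # whole server
    "www.healingwell.com/depression/",    # directory
    "www.mcmanweb.com/article-226.htm",   # individual page
]

def normalize(url: str) -> str:
    """Strip the scheme so patterns can be plain host/path prefixes."""
    return url.removeprefix("http://").removeprefix("https://")

def is_included(url: str) -> bool:
    """A URL is in scope if it starts with any include pattern."""
    u = normalize(url)
    return any(u.startswith(p) for p in INCLUDE_PATTERNS)

print(is_included("http://www.healingwell.com/depression/intro.html"))  # True
print(is_included("http://www.healingwell.com/cancer/"))                # False
```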
Approach • BPS: higher quality but much lower coverage, and … • It is time consuming to identify and maintain the list of sites to be included • Is it worth it? Can it be done more cheaply? • How to increase coverage but still maintain high quality? • Can we automate the process? => • Seed list: Using an existing directory, e.g.: DMOZ, Yahoo! Directory • Crawling: • Use general crawler with inclusion/exclusion rules • Use focused crawler with mechanisms to predict relevant/high quality links from source pages
DMOZ Depression Directory • DMOZ is “the most comprehensive human-edited directory of the web” • The depression directory contains: • Links to a few other DMOZ pages • Links to servers, directories, and individual pages about depression [Diagram: the DMOZ depression directory links to other pages in DMOZ and to servers, directories & individual pages]
DMOZ Seed List • How to generate • Start from the depression directory • Decide whether to include links to other pages within the DMOZ site (little manual effort) • Automatically generate most of the seed URLs • Seed URLs are the same as the listed URLs, except that default page suffixes are removed, e.g. www.depression.com/default.asp has the seed pattern www.depression.com
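As a rough illustration of the automatic seed-generation step, the sketch below strips a default page suffix to obtain the seed pattern, as in the www.depression.com example; the list of default page names is an assumption.

```python
# Sketch of seed-pattern generation: a seed pattern is the listed URL with any
# default-page suffix removed. The suffix list below is an assumption.

DEFAULT_PAGES = ("index.html", "index.htm", "index.php", "default.asp", "default.htm")

def seed_pattern(url: str) -> str:
    u = url.rstrip("/")
    for page in DEFAULT_PAGES:
        if u.lower().endswith("/" + page):
            u = u[: -(len(page) + 1)]   # drop "/<default page>"
            break
    return u

print(seed_pattern("www.depression.com/default.asp"))  # -> www.depression.com
```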
Should DMOZ be used? • Requires very little effort in boundary setting • Provides a large seed list of URLs spread across heterogeneous locations on the Web (three times bigger than the BPS list) • Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%) => DMOZ provided a good set of relevant pages with little effort, but… can we find more relevant pages elsewhere?
Focused Crawler • Seeks, acquires, indexes and maintains pages on a specific set of topics • Requires small investment in hardware and network resources • Starts with a seed list of URLs relevant to the topics of interest • Follows links from seed pages to identify the most promising links to crawl Is focused crawling a promising technique for building a depression portal?
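A focused crawler of this kind is often implemented as a best-first crawl over a priority frontier. The following is a minimal sketch under that assumption, with a toy link graph and a keyword heuristic standing in for a real fetcher and a trained link classifier.

```python
# Minimal sketch of a focused-crawl loop: keep a priority frontier of
# unvisited links ordered by predicted relevance, and repeatedly expand the
# most promising one. The toy web graph and scoring function are stand-ins.

import heapq

TOY_WEB = {  # hypothetical link graph standing in for real fetched pages
    "dmoz/depression": ["site-a/depression-help", "site-b/sports"],
    "site-a/depression-help": ["site-c/therapy"],
    "site-b/sports": [],
    "site-c/therapy": [],
}

def score_link(url: str) -> float:
    """Stand-in for a link classifier; here just a keyword heuristic."""
    return 1.0 if any(w in url for w in ("depression", "therapy")) else 0.1

def focused_crawl(seeds, max_pages=10):
    frontier = [(-score_link(u), u) for u in seeds]   # max-heap via negation
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in TOY_WEB.get(url, []):             # "fetch" outgoing links
            if link not in visited:
                heapq.heappush(frontier, (-score_link(link), link))
    return order

print(focused_crawl(["dmoz/depression"]))
```

Note how the low-scoring sports page is visited last, even though it was discovered early: the classifier's score, not discovery order, drives the crawl.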
One Link Away • Illustration of the one-link-away collection [Diagram: the DMOZ crawl, one-link-away URLs, and additional link-accessible relevant information] • If pages in the current crawl have no links to additional relevant content, the prospect of successful focused crawling is very low
Additional Link Experiments • Experiment: relevance of outgoing links from a crawled collection • An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages (quite a high proportion) in a single step from the current crawled pages • Experiment: linking patterns between relevant pages • Out of 196 new relevant URLs, 158 were linked to by known relevant pages
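For concreteness, the one-link-away measurement can be expressed as a small set computation over the crawl's outgoing links; the tiny data structures below are hypothetical stand-ins for the real crawl and relevance judgements.

```python
# Sketch of the one-link-away measurement: how many known relevant URLs that
# are *not* in the current crawl are linked to directly from crawled pages.
# The dictionaries are illustrative, not the study's data.

crawled_outlinks = {               # crawled URL -> its outgoing links
    "a": {"x", "y"},
    "b": {"y", "z"},
}
known_relevant = {"a", "x", "z", "w"}   # judged relevant URLs

crawled = set(crawled_outlinks)
one_step = set().union(*crawled_outlinks.values()) - crawled
new_relevant = one_step & known_relevant

# Fraction of the not-yet-crawled relevant URLs reachable in a single step
print(len(new_relevant) / len(known_relevant - crawled))
```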
Findings for Additional Links • Relevant pages tend to link to each other • The outgoing link set of a good collection contains quite a large number of additional relevant pages • These findings support the idea of focused crawling, but … • How can a crawler tell which links lead to relevant content?
Hypertext Classification • Traditional text classification only looks at the text in each document • Hypertext classification uses link information • We experimented with anchor text, text around the link and URL words • Here is an example
Features • URL: http://www.depression.com/psychotherapy.html => URL words: depression, com, psychotherapy • Anchor text: psychotherapy • Text around the link: • 50 bytes before: section, learn • 50 bytes after: talk, therapy, standard, treatment
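A minimal sketch of extracting these link features (URL words, anchor text, and a 50-byte window of text on either side of the link), using only the Python standard library; the exact tokenization is an assumption rather than the paper's implementation.

```python
# Sketch of link-feature extraction: URL words, anchor text, and a window of
# surrounding text. Tokenization details are illustrative assumptions.

import re
from urllib.parse import urlparse

def url_words(url: str) -> list[str]:
    parsed = urlparse(url)
    return [w for w in re.split(r"[^a-zA-Z]+", parsed.netloc + parsed.path)
            if w and w != "www"]

def link_features(page_text: str, anchor: str, url: str, window: int = 50) -> list[str]:
    start = page_text.find(anchor)
    before = page_text[max(0, start - window):start] if start >= 0 else ""
    after = page_text[start + len(anchor):start + len(anchor) + window] if start >= 0 else ""
    tokens = re.findall(r"[a-zA-Z]+", (before + " " + anchor + " " + after).lower())
    return tokens + url_words(url)

print(link_features(
    "In this section, learn about psychotherapy: talk therapy is a standard treatment.",
    "psychotherapy",
    "http://www.depression.com/psychotherapy.html",
))
```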
Input Data & Measures • Calculate tf.idf for all the features appearing in each URL • 10-fold cross validation on 295 relevant and 251 irrelevant URLs • Classifiers: IBK, ZeroR, Naïve Bayes, C4.5, Bagging and AdaboostM1, etc. • Measures: Accuracy, precision and recall.
Hypertext Classification - Results
Classifier                 Accuracy (%)   Precision (%)   Recall (%)
ZeroR                      54.02          54.02           100
Complement Naïve Bayes     71.06          77.51           65.42
Naïve Bayes                73.07          78.03           69.83
J48                        77.83          88.15           68.13
=> Overall, J48 is the best classifier
Hypertext Classification - Others • Bagging and boosting showed little improvement in recall • No applicable results in the literature on the depression topic were available for comparison • A classifier looking at the content of the target pages showed similar results => Hypertext classification is quite effective
Findings • Web pages about depression are strongly interlinked • The DMOZ depression category seems to provide a good seed list for a focused crawl • Predictive classification of outgoing links using link features achieves promising results => A cheap, high-coverage depression portal might be built & maintained using focused crawling techniques starting with the DMOZ seed list
Future Work • Build a domain-specific search portal: • Rank URLs in order of their degree of relevance • Design data structures to hold accumulated information for unvisited URLs • Determine how to use the focused crawler operationally: • No include/exclude rules, but appropriate stopping conditions • What to do if none of the outgoing links are classified as relevant?
Future Work • Incorporate site quality into the focused crawler, or filter high quality pages after crawling • Extend the techniques to other domains, such as other health-related domains: are they applicable there?