Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana (ANU)
Why Depression? • Leading cause of disability burden in Australia • One in five people suffer from a mental disorder in any one year • The Web is a good way to deliver information and treatments, but ... • A lot of depression information on the Web is of poor quality
Bluepages Search (BPS) • Indexes approximately 200 sites, e.g. • Whole server: suicidal.com/ • Directory: www.healingwell.com/depression/ • Individual page: www.mcmanweb.com/article-226.htm • Approximately 2 weeks of manual effort to create / update the seed list and include patterns • Experiments showed that Google (with ‘depression’) had better relevance but more bad advice • Relevance: only 17% of the relevant pages returned by Google were contained in the BPS crawl
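To make the boundary setting concrete, here is a minimal sketch of how include patterns like those above could be applied, assuming simple prefix matching on scheme-stripped URLs; the pattern list and function names are illustrative, not the actual BPS rules.

```python
# Minimal sketch of prefix-based include rules, restricting a crawl to listed
# servers, directories, or individual pages. The matching scheme here is an
# illustrative assumption, not the authors' exact implementation.

INCLUDE_PATTERNS = [
    "suicidal.com/",                      # whole server
    "www.healingwell.com/depression/",    # directory
    "www.mcmanweb.com/article-226.htm",   # individual page
]

def normalize(url: str) -> str:
    """Strip the scheme so patterns can be plain host/path prefixes."""
    return url.removeprefix("http://").removeprefix("https://")

def is_included(url: str) -> bool:
    """A URL is in scope if it starts with any include pattern."""
    u = normalize(url)
    return any(u.startswith(p) for p in INCLUDE_PATTERNS)

print(is_included("http://www.healingwell.com/depression/intro.html"))  # True
print(is_included("http://www.healingwell.com/cancer/"))                # False
```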
Approach • BPS: higher quality but much lower coverage, and … • It is time consuming to identify and maintain the list of sites to be included • Is it worth it? Can it be done more cheaply? • How to increase coverage but still maintain high quality? • Can we automate the process? => • Seed list: Using an existing directory, e.g.: DMOZ, Yahoo! Directory • Crawling: • Use general crawler with inclusion/exclusion rules • Use focused crawler with mechanisms to predict relevant/high quality links from source pages
DMOZ Depression Directory • DMOZ is “the most comprehensive human-edited directory of the web” • The depression directory contains: • Links to a few other DMOZ pages • Links to servers, directories, and individual pages about depression [Diagram: the DMOZ depression directory links to other pages in DMOZ and to servers, directories & individual pages]
DMOZ Seed List • How to generate • Start from the depression directory • Decide whether to include links to other pages within the DMOZ site (little manual effort) • Automatically generate most of the seed URLs • Seed URLs are the same as the listed URLs, except that default page suffixes are removed, e.g. www.depression.com/default.asp has the seed pattern www.depression.com
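As a rough illustration of the automatic seed-generation step, the sketch below strips a default page suffix to obtain the seed pattern, as in the www.depression.com example; the list of default page names is an assumption.

```python
# Sketch of seed-pattern generation: a seed pattern is the listed URL with any
# default-page suffix removed. The suffix list below is an assumption.

DEFAULT_PAGES = ("index.html", "index.htm", "index.php", "default.asp", "default.htm")

def seed_pattern(url: str) -> str:
    u = url.rstrip("/")
    for page in DEFAULT_PAGES:
        if u.lower().endswith("/" + page):
            u = u[: -(len(page) + 1)]   # drop "/<default page>"
            break
    return u

print(seed_pattern("www.depression.com/default.asp"))  # -> www.depression.com
```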
Should DMOZ be used? • Requires very little effort in boundary setting • Provides a large seed list of URLs spread across heterogeneous locations on the Web (three times bigger than the BPS list) • Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%) => DMOZ provided a good set of relevant pages with little effort, but… can we find more relevant pages elsewhere?
Focused Crawler • Seeks, acquires, indexes and maintains pages on a specific set of topics • Requires small investment in hardware and network resources • Starts with a seed list of URLs relevant to the topics of interest • Follows links from seed pages to identify the most promising links to crawl Is focused crawling a promising technique for building a depression portal?
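A focused crawler of this kind is often implemented as a best-first crawl over a priority frontier. The following is a minimal sketch under that assumption, with a toy link graph and a keyword heuristic standing in for a real fetcher and a trained link classifier.

```python
# Minimal sketch of a focused-crawl loop: keep a priority frontier of
# unvisited links ordered by predicted relevance, and repeatedly expand the
# most promising one. The toy web graph and scoring function are stand-ins.

import heapq

TOY_WEB = {  # hypothetical link graph standing in for real fetched pages
    "dmoz/depression": ["site-a/depression-help", "site-b/sports"],
    "site-a/depression-help": ["site-c/therapy"],
    "site-b/sports": [],
    "site-c/therapy": [],
}

def score_link(url: str) -> float:
    """Stand-in for a link classifier; here just a keyword heuristic."""
    return 1.0 if any(w in url for w in ("depression", "therapy")) else 0.1

def focused_crawl(seeds, max_pages=10):
    frontier = [(-score_link(u), u) for u in seeds]   # max-heap via negation
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in TOY_WEB.get(url, []):             # "fetch" outgoing links
            if link not in visited:
                heapq.heappush(frontier, (-score_link(link), link))
    return order

print(focused_crawl(["dmoz/depression"]))
```

Note how the low-scoring sports page is visited last, even though it was discovered early: the classifier's score, not discovery order, drives the crawl.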
One Link Away • Illustration of the one-link-away collection [Diagram: the DMOZ crawl, one-link-away URLs, and additional link-accessible relevant information] • If pages in the current crawl have no links to additional relevant content, the prospect of successful focused crawling is very low
Additional Link Experiments • Experiment: relevance of outgoing links from a crawled collection • An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages (quite a high proportion) in a single step from the current crawled pages • Experiment: linking patterns between relevant pages • Out of 196 new relevant URLs, 158 were linked to by known relevant pages
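For concreteness, the one-link-away measurement can be expressed as a small set computation over the crawl's outgoing links; the tiny data structures below are hypothetical stand-ins for the real crawl and relevance judgements.

```python
# Sketch of the one-link-away measurement: how many known relevant URLs that
# are *not* in the current crawl are linked to directly from crawled pages.
# The dictionaries are illustrative, not the study's data.

crawled_outlinks = {               # crawled URL -> its outgoing links
    "a": {"x", "y"},
    "b": {"y", "z"},
}
known_relevant = {"a", "x", "z", "w"}   # judged relevant URLs

crawled = set(crawled_outlinks)
one_step = set().union(*crawled_outlinks.values()) - crawled
new_relevant = one_step & known_relevant

# Fraction of the not-yet-crawled relevant URLs reachable in a single step
print(len(new_relevant) / len(known_relevant - crawled))
```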
Findings for Additional Links • Relevant pages tend to link to each other • The outgoing link set of a good collection contains quite a large number of additional relevant pages • These findings support the idea of focused crawling, but … • How can a crawler tell which links lead to relevant content?
Hypertext Classification • Traditional text classification only looks at the text in each document • Hypertext classification uses link information • We experimented with anchor text, text around the link and URL words • Here is an example
Features • URL: http://www.depression.com/psychotherapy.html => URL words: depression, com, psychotherapy • Anchor text: psychotherapy • Text around the link: • 50 bytes before: section, learn • 50 bytes after: talk, therapy, standard, treatment
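A minimal sketch of extracting these link features (URL words, anchor text, and a 50-byte window of text on either side of the link), using only the Python standard library; the exact tokenization is an assumption rather than the paper's implementation.

```python
# Sketch of link-feature extraction: URL words, anchor text, and a window of
# surrounding text. Tokenization details are illustrative assumptions.

import re
from urllib.parse import urlparse

def url_words(url: str) -> list[str]:
    parsed = urlparse(url)
    return [w for w in re.split(r"[^a-zA-Z]+", parsed.netloc + parsed.path)
            if w and w != "www"]

def link_features(page_text: str, anchor: str, url: str, window: int = 50) -> list[str]:
    start = page_text.find(anchor)
    before = page_text[max(0, start - window):start] if start >= 0 else ""
    after = page_text[start + len(anchor):start + len(anchor) + window] if start >= 0 else ""
    tokens = re.findall(r"[a-zA-Z]+", (before + " " + anchor + " " + after).lower())
    return tokens + url_words(url)

print(link_features(
    "In this section, learn about psychotherapy: talk therapy is a standard treatment.",
    "psychotherapy",
    "http://www.depression.com/psychotherapy.html",
))
```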
Input Data & Measures • Calculate tf.idf for all the features appearing in each URL • 10-fold cross validation on 295 relevant and 251 irrelevant URLs • Classifiers: IBK, ZeroR, Naïve Bayes, C4.5, Bagging and AdaboostM1, etc. • Measures: Accuracy, precision and recall.
Hypertext Classification - Results
Classifier                 Accuracy (%)   Precision (%)   Recall (%)
ZeroR                      54.02          54.02           100
Complement Naïve Bayes     71.06          77.51           65.42
Naïve Bayes                73.07          78.03           69.83
J48                        77.83          88.15           68.13
=> Overall, J48 is the best classifier
Hypertext Classification - Others • Bagging and boosting showed little improvement in recall • No applicable results in the literature on the depression topic were available for comparison • A classifier looking at the content of the target pages showed similar results => Hypertext classification is quite effective
Findings • Web pages about depression are strongly interlinked • The DMOZ depression category seems to provide a good seed list for a focused crawl • Predictive classification of outgoing links using link features achieves promising results => A cheap, high-coverage depression portal might be built & maintained using focused crawling techniques starting with the DMOZ seed list
Future Work • Build a domain-specific search portal: • Rank URLs in order of their degree of relevance • Design data structures to hold accumulated information for unvisited URLs • Determine how to use the focused crawler operationally: • No include/exclude rules, but appropriate stopping conditions • What to do if none of the outgoing links are classified as relevant?
Future Work • Incorporate site quality into the focused crawler, or filter high quality pages after crawling • Extend the techniques to other domains, such as other health-related domains: are they applicable there?