
Focused Crawling in Depression Portal Search: A Feasibility Study


Presentation Transcript


  1. Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana (ANU)

  2. Why Depression? • Leading cause of disability burden in Australia • One in five people suffer from a mental disorder in any one year • The Web is a good way to deliver information and treatments, but ... • A lot of depression information on the Web is of poor quality

  3. BluePages Search (BPS)

  4. BluePages Search

  5. BluePages Search • Indexes approximately 200 sites, e.g. • Whole server: suicidal.com/ • Directory: www.healingwell.com/depression/ • Individual page: www.mcmanweb.com/article-226.htm • Approximately 2 weeks of manual effort to create / update the seed list and include patterns • Experiments showed that Google (with ‘depression’) had better relevance but more bad advice • Relevance: Only 17% of relevant pages returned by Google were contained in the BPS crawl
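The include patterns above cover three scope levels (whole server, directory, individual page). A minimal sketch of how such prefix-based include rules might be applied is given below; the function name, the scheme stripping and the pattern list are illustrative assumptions, not the actual BPS code.

```python
# Hypothetical prefix-based include rules (not the actual BPS implementation).
INCLUDE_PATTERNS = [
    "suicidal.com/",                     # whole server
    "www.healingwell.com/depression/",   # directory
    "www.mcmanweb.com/article-226.htm",  # individual page
]

def in_scope(url: str) -> bool:
    """Return True if the URL falls under any include pattern."""
    bare = url.replace("https://", "").replace("http://", "")
    return any(bare.startswith(pattern) for pattern in INCLUDE_PATTERNS)

print(in_scope("http://www.healingwell.com/depression/intro.html"))  # True
print(in_scope("http://www.healingwell.com/arthritis/"))             # False
```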

  6. Approach • BPS: higher quality but much lower coverage, and … • It is time-consuming to identify and maintain the list of sites to be included • Is it worth it? Can it be done more cheaply? • How to increase coverage but still maintain high quality? • Can we automate the process? => • Seed list: Using an existing directory, e.g.: DMOZ, Yahoo! Directory • Crawling: • Use a general crawler with inclusion/exclusion rules • Use a focused crawler with mechanisms to predict relevant/high quality links from source pages

  7. DMOZ Depression Directory • DMOZ is “the most comprehensive human-edited directory of the web” • The Depression directory contains: links to a few other DMOZ pages, and links to servers, directories, and individual pages about depression • [Diagram: the DMOZ Depression directory pointing to other pages in DMOZ and to servers, directories & individual pages]

  8. DMOZ Seed List • How to generate • Start from the depression directory • Decide whether to include links to other pages within the DMOZ site (little manual effort) • Automatically generate most of the seed URLs • Seed URLs are the same as the listed URLs, except that default page suffixes are removed. E.g. www.depression.com/default.asp yields the pattern www.depression.com (see the sketch below)
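A minimal sketch of that seed-pattern generation, assuming a small hand-maintained list of default page names (the exact suffix list is not given in the slides):

```python
# Turn a DMOZ-listed URL into a seed pattern by stripping default page suffixes.
# The DEFAULT_PAGES list is an assumption for illustration.
from urllib.parse import urlsplit

DEFAULT_PAGES = ("default.asp", "default.htm", "index.html", "index.htm", "index.php")

def seed_pattern(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path
    for page in DEFAULT_PAGES:
        if path.lower().endswith(page):
            path = path[: -len(page)]
            break
    # Drop a trailing slash so www.depression.com/default.asp -> www.depression.com
    return (parts.netloc + path).rstrip("/")

print(seed_pattern("http://www.depression.com/default.asp"))   # www.depression.com
print(seed_pattern("http://www.healingwell.com/depression/"))  # www.healingwell.com/depression
```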

  9. Should DMOZ be used? • Requires very little effort in boundary setting • Provides a large seed list of URLs located heterogeneously across the Web (three times bigger than the BPS list) • Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%) => DMOZ provided a good set of relevant pages with little effort, but…can we find more relevant pages elsewhere?

  10. Focused Crawler • Seeks, acquires, indexes and maintains pages on a specific set of topics • Requires small investment in hardware and network resources • Starts with a seed list of URLs relevant to the topics of interest • Follows links from seed pages to identify the most promising links to crawl Is focused crawling a promising technique for building a depression portal?
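A focused crawler of this kind is usually driven by a best-first frontier. The loop below is a generic sketch of that idea, not the authors' implementation; fetch, extract_links and score_link stand in for a real fetcher, a link extractor and the link classifier described on the following slides.

```python
# Generic best-first focused-crawl loop (illustrative sketch).
import heapq

def focused_crawl(seeds, fetch, extract_links, score_link, budget=1000):
    """Crawl up to `budget` pages, always expanding the most promising link first."""
    frontier = [(-1.0, url) for url in seeds]      # seed URLs get top priority
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                          # page text, or None on failure
        if page is None:
            continue
        for link_url, link_features in extract_links(page):
            if link_url not in visited:
                # Links predicted to lead to relevant content are crawled sooner.
                heapq.heappush(frontier, (-score_link(link_features), link_url))
    return visited
```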

  11. One Link Away URLs • [Diagram: the DMOZ crawl plus additional link-accessible relevant information one link away] • Illustration of the one-link-away collection • If pages in the current crawl have no links to additional relevant content, the prospect of successful focused crawling is very low

  12. Additional Link Experiments • Experiment: Relevance of outgoing links from a crawled collection • An unrestricted crawler starting from the BPS crawl can reach 25.3% (quite high) more known relevant pages in a single step from the currently crawled pages. • Experiment: Linking patterns between relevant pages • Out of 196 new relevant URLs, 158 were linked to by known relevant pages.
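A sketch of how the first measurement could be computed, assuming an outlink map for the crawled pages and a set of judged-relevant URLs (both names are illustrative):

```python
# Which known relevant URLs are reachable in exactly one step from the crawl?
def one_step_reachable(crawled, outlinks, relevant):
    reachable = set()
    for url in crawled:
        for target in outlinks.get(url, ()):
            if target in relevant and target not in crawled:
                reachable.add(target)
    return reachable

# extra = one_step_reachable(bps_crawl, outlinks, judged_relevant)
# coverage_gain = len(extra) / len(judged_relevant - bps_crawl)
```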

  13. Findings for Additional Links • Relevant pages tend to link to each other • Outgoing link set of a good collection contains quite a large number of additional relevant pages • These support the idea of focused crawling, but … • How can a crawler tell which links lead to relevant content?

  14. Hypertext Classification • Traditional text classification only looks at the text in each document • Hypertext classification uses link information • We experimented with anchor text, text around the link and URL words • Here is an example

  15. Features • URL: http://www.depression.com/psychotherapy.html => URL words: depression, com, psychotherapy • Anchor text: psychotherapy • Text around the link: • 50 bytes before: section, learn • 50 bytes after: talk, therapy, standard, treatment
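A sketch of extracting these three feature sources (URL words, anchor text, and a 50-byte window on either side of the link). The tokenisation, stop-word filtering and window handling are assumptions rather than the authors' exact procedure.

```python
# Extract URL words, anchor words, and words within a +/- 50-byte window.
import re
from urllib.parse import urlsplit

def link_features(url, anchor_text, page_text, anchor_pos, window=50):
    parts = urlsplit(url)
    url_words = [w for w in re.findall(r"[a-z]+", (parts.netloc + parts.path).lower())
                 if w not in ("www", "http", "html", "htm", "php", "asp")]
    before = page_text[max(0, anchor_pos - window):anchor_pos]
    after = page_text[anchor_pos + len(anchor_text):anchor_pos + len(anchor_text) + window]
    return {
        "url_words": url_words,
        "anchor": anchor_text.lower().split(),
        "before": re.findall(r"[a-z]+", before.lower()),
        "after": re.findall(r"[a-z]+", after.lower()),
    }

page = "... in this section you can learn about psychotherapy (talk therapy), a standard treatment ..."
feats = link_features("http://www.depression.com/psychotherapy.html", "psychotherapy",
                      page, anchor_pos=page.index("psychotherapy"))
print(feats["url_words"])  # ['depression', 'com', 'psychotherapy']
```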

  16. Input Data & Measures • Calculate tf.idf for all the features appearing in each URL • 10-fold cross-validation on 295 relevant and 251 irrelevant URLs • Classifiers: IBk, ZeroR, Naïve Bayes, C4.5 (J48), Bagging and AdaBoostM1, etc. • Measures: Accuracy, precision and recall.
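A sketch of this evaluation set-up using scikit-learn as a stand-in for Weka (the toolkit these classifier names come from): DummyClassifier plays the role of ZeroR, and DecisionTreeClassifier approximates C4.5/J48. Here docs is assumed to hold one space-joined feature string per URL, and labels a 0/1 relevance judgement.

```python
# tf.idf features + 10-fold cross-validation over several classifiers (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

def evaluate(docs, labels):
    X = TfidfVectorizer().fit_transform(docs)   # tf.idf weights for the link features
    classifiers = {
        "ZeroR (majority class)":  DummyClassifier(strategy="most_frequent"),
        "Naive Bayes":             MultinomialNB(),
        "Complement Naive Bayes":  ComplementNB(),
        "C4.5-style tree (J48)":   DecisionTreeClassifier(),
    }
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, labels, cv=10,
                                scoring=("accuracy", "precision", "recall"))
        print(f"{name:25s} "
              f"acc={scores['test_accuracy'].mean():.3f} "
              f"prec={scores['test_precision'].mean():.3f} "
              f"rec={scores['test_recall'].mean():.3f}")
```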

  17. Hypertext Classification - Results

  Classifier                Accuracy (%)   Precision (%)   Recall (%)
  ZeroR                     54.02          54.02           100
  Complement Naïve Bayes    71.06          77.51           65.42
  Naïve Bayes               73.07          78.03           69.83
  J48                       77.83          88.15           68.13

  => Overall, J48 is the best classifier

  18. Hypertext Classification - Others • Bagging and boosting showed little improvement for recall • No comparable results in the literature on the depression domain • A classifier looking at the content of the target pages showed similar results => Hypertext classification is quite effective

  19. Findings • Web pages about depression are strongly interlinked • The DMOZ depression category seems to provide a good seed list for a focused crawl • Predictive classification of outgoing links using link features achieves promising results => A cheap, high-coverage depression portal might be built and maintained using focused crawling techniques starting from the DMOZ seed list

  20. Future Work • Build a domain-specific search portal: • URL ranking in order of degree of relevance • Data structures to hold accumulated information for unvisited URLs • Determine how to use the focused crawler operationally: • No include/exclude rules, but appropriate stopping conditions • What to do if none of the outgoing links are classified as relevant?

  21. Future Work • Incorporate site quality into the focused crawler, or filter high-quality pages after crawling • Extend the techniques to other domains, such as other health-related domains: are they applicable?
