Practical Issues for Automated Categorization of Web Sites

Presentation Transcript


  1. Practical Issues for Automated Categorization of Web Sites
     John M. Pierre, jpierre@metacode.com
     Metacode Technologies, Inc., 139 Townsend Street, San Francisco, CA 94107
     (Collaborators: B. Wohler, R. Daniel, M. Butler, R. Avedon)

  2. Outline
     • Project overview
     • Web content
       • Automated Categorization
       • Feature Selection
       • Metadata
     • Experimental Setup
       • Data
       • Targeted Spidering
       • System Architecture
     • Results
     • Conclusions

  3. Project Overview
     • Specific:
       • Categorize a large number of domain names by industry category
       • NAICS classification scheme
       • ~30,000 domain names for testing (.com)
       • Text categorization approach
     • General:
       • Domain specific classification
       • Metadata
       • Targeted spidering
       • Feature selection
       • Classifier training

  4. Web Content: Automated Categorization
     • Challenges:
       • Vast (over 1 billion pages)
       • Heterogeneous (content, formats, not just HTML)
       • Dynamic (growing, changing)
     • Benefits:
       • Good source of information
       • Accessible!
       • Machine readable (vs. machine understandable)
       • Semi-structured
     • Tools:
       • Classification
       • Automated classification
       • Text Categorization/Machine Learning
       • Intelligent agents
     • Related Work
       • Manual:
         • Yahoo!
         • Open Directory Project
         • Looksmart
       • Automatic:
         • Northern Light
         • Thunderstone/Texis
         • Inktomi
       • Other:
         • EU Project DESIRE II
         • Pharos
         • Attardi, Sebastiani et al.
         • L. Page et al.
         • McCallum et al.

  5. Web Content: Feature Selection
     • Text Features: (D. Lewis)
       • Relatively few in number
       • Moderate in frequency of assignment
       • Low in redundancy
       • Low in noise
       • Related in semantic scope to the classes to be assigned
       • Relatively unambiguous in meaning
     • Preliminary Experiment
       • 1125 web domains
       • SEC+NAICS training set
     Use metadata if possible; use body text as a last resort!
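The advice on this slide (prefer metadata, fall back to body text) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the presentation: the names `FeatureTextParser` and `extract_features`, the choice of the `description` and `keywords` metatags, and the fallback order metatags, then title, then body.

```python
from html.parser import HTMLParser


class FeatureTextParser(HTMLParser):
    """Collect <title> text, selected <meta> content, and visible body text."""

    META_NAMES = {"description", "keywords"}  # assumed set of useful metatags
    SKIP_TAGS = {"script", "style"}           # text inside these is ignored

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.meta_parts = []
        self.body_parts = []
        self._stack = []  # open tags that decide where text goes

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            name = (attr_map.get("name") or "").lower()
            content = (attr_map.get("content") or "").strip()
            if name in self.META_NAMES and content:
                self.meta_parts.append(content)
        elif tag in ("title", "body") or tag in self.SKIP_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.title_parts.append(text)
        elif top == "body":
            self.body_parts.append(text)


def extract_features(html):
    """Return (source, text), preferring metadata over body text."""
    parser = FeatureTextParser()
    parser.feed(html)
    parser.close()
    for source, parts in (("metatags", parser.meta_parts),
                          ("title", parser.title_parts),
                          ("body", parser.body_parts)):
        if parts:
            return source, " ".join(parts)
    return "none", ""
```

Returning the source label alongside the text makes it easy to count how often the body-text fallback is actually needed on a given set of pages.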

  6. Web Content: Metadata

  7. Experimental Setup: Targeted Spidering
     [Flowchart. Recoverable labels: Domain name; HTTP Get; live? (Yes/No); Try www.; Frames? (Yes/No); Metatags? (Yes/No); Use <body>; <a href=?; link keywords: prod, service, about, info, press, news; Pages; Send Query; ‘Query’]
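The flowchart on this slide names a "Try www." fallback and a set of link keywords (prod, service, about, info, press, news). The Python sketch below shows one way those two pieces could work. It is an illustration under stated assumptions, not the authors' spider: the helper names `candidate_urls` and `targeted_links`, the same-host restriction, and matching keywords against both the link path and the anchor text are all hypothetical choices. No network access is performed; the caller is expected to fetch the pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Keywords listed on the slide for choosing which links to follow.
TARGET_KEYWORDS = ("prod", "service", "about", "info", "press", "news")


class LinkParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(t for t in self._text if t)
            self.links.append((self._href, text))
            self._href = None


def candidate_urls(domain):
    """Bare domain first, then the www. variant (the 'Try www.' step)."""
    urls = ["http://" + domain + "/"]
    if not domain.startswith("www."):
        urls.append("http://www." + domain + "/")
    return urls


def targeted_links(base_url, html):
    """Return same-host links whose path or anchor text contains a keyword."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    selected = []
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != host:
            continue  # assumption: the spider stays on the queried site
        haystack = (urlparse(url).path + " " + text).lower()
        if any(k in haystack for k in TARGET_KEYWORDS) and url not in selected:
            selected.append(url)
    return selected
```

Substring matching is deliberately loose, so "prod" also catches "products" and "info" also catches "information"; a stricter spider might match on whole path segments instead.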

  8. Experimental Setup: Data
     Classification scheme: NAICS
       11     Agriculture, Forestry, Fishing and Hunting
       21     Mining
       23     Construction
       31-33  Manufacturing
       42     Wholesale Trade
       44-45  Retail Trade
       48-49  Transportation and Warehousing
       51     Information
       52     Finance and Insurance
       53     Real Estate and Rental and Leasing
       54     Professional, Scientific and Technical Services
       55     Management of Companies and Enterprises
       56     Admin. Support, Waste Mgmt and Remediation Srvcs
       61     Educational Services
       62     Health Care and Social Assistance
       71     Arts, Entertainment & Recreation
       72     Accommodation and Food Services
       81     Other services (except 92)
       92     Public Administration
       99     Unclassified Establishments
     • Test Data
       • ~30,000 domain names (SIC)
       • ~13,500 pre-classified/content
     • Training Data
       • “SEC-NAICS”:
         • 1504 SEC 10-K filings (SIC)
         • 426 NAICS labels/descriptions
       • “Web pages”:
         • 3618 pre-classified domains
     • Crosswalk
       • SIC <-> NAICS
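Because the scheme on this slide mixes single two-digit codes with ranges such as 31-33, a small lookup helper is useful when reducing longer NAICS codes to their top-level sector. The Python sketch below copies the sector table from the slide as listed; the function name `naics_sector` and the decision to return `None` for prefixes not on the slide are assumptions for illustration.

```python
# Sector table as listed on the slide (two-digit NAICS codes and ranges).
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental and Leasing",
    "54": "Professional, Scientific and Technical Services",
    "55": "Management of Companies and Enterprises",
    "56": "Admin. Support, Waste Mgmt and Remediation Srvcs",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment & Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other services (except 92)",
    "92": "Public Administration",
    "99": "Unclassified Establishments",
}


def _build_prefix_map(sectors):
    """Expand ranges like '31-33' into one entry per two-digit prefix."""
    prefix_map = {}
    for key, title in sectors.items():
        low, _, high = key.partition("-")
        for code in range(int(low), int(high or low) + 1):
            prefix_map[f"{code:02d}"] = (key, title)
    return prefix_map


_PREFIXES = _build_prefix_map(NAICS_SECTORS)


def naics_sector(code):
    """Return (sector key, title) for a NAICS code of any length, or None."""
    return _PREFIXES.get(str(code).strip()[:2])
```

For example, `naics_sector("334111")` resolves through the prefix `33` to the `31-33` Manufacturing sector.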

  9. Experimental Setup: System Architecture
     [Architecture diagram. Recoverable labels: Domain Names; Spider; The Web; Text Query; IR Engine; SEC-NAICS; Web pages; Matching documents; Decision; example output: Foo.com 11, 21, 23]
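The diagram's flow from a text query through an IR engine to matching documents and a decision can be sketched as retrieval-based classification. The slide does not say how the IR engine scores documents or how the decision is made, so the Python sketch below swaps in plain TF-IDF cosine similarity and a score-weighted vote over the top matches; the names `TinyIndex` and `classify`, and the toy training texts in the usage note, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyIndex:
    """Minimal TF-IDF index over (category_code, text) training documents."""

    def __init__(self, labeled_docs):
        self.labels = [label for label, _ in labeled_docs]
        counts = [Counter(tokenize(text)) for _, text in labeled_docs]
        doc_freq = Counter(term for c in counts for term in c)
        n = len(labeled_docs)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0
                    for t, d in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        """Unit-length TF-IDF vector; terms unseen in training are dropped."""
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def search(self, query, k=3):
        """Return up to k (score, label) pairs, best match first."""
        q = self._weigh(Counter(tokenize(query)))
        scored = []
        for label, vec in zip(self.labels, self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, label))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def classify(index, query, k=3, max_categories=3):
    """Sum the scores of the top-k matches per category and rank categories."""
    votes = defaultdict(float)
    for score, label in index.search(query, k):
        votes[label] += score
    ranked = sorted(votes, key=lambda c: (-votes[c], c))
    return ranked[:max_categories]
```

A usage example with made-up training snippets: `classify(TinyIndex([("21", "coal mining and oil and gas extraction"), ("23", "building construction contractors")]), "offshore oil and gas drilling")` ranks category `21` first. Returning a short ranked list rather than a single code mirrors the multi-code example output shown in the diagram.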

  10. Results
     P = Precision = # correctly assigned / # assigned
     R = Recall = # correctly assigned / # total correct
     F1 = 2 P R / (P + R)
     micro-averaged = computed over all categories
     macro-averaged = computed per category, then averaged
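The definitions on this slide translate directly into code. The Python sketch below computes precision, recall, and F1 both micro-averaged (counts pooled over all categories) and macro-averaged (per category, then averaged). The function names `prf` and `evaluate`, the dict-of-sets input format, and the convention of scoring an undefined ratio as 0.0 are assumptions for illustration; the macro F1 here is the mean of per-category F1 values, which is one common convention.

```python
from collections import Counter


def prf(correct, assigned, relevant):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = correct / assigned if assigned else 0.0
    r = correct / relevant if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(predicted, actual):
    """predicted, actual: dicts mapping an item to a set of category codes.

    Returns (micro, macro), each a (P, R, F1) tuple.
    """
    correct, assigned, relevant = Counter(), Counter(), Counter()
    for item, truth in actual.items():
        guess = predicted.get(item, set())
        for c in guess:
            assigned[c] += 1
        for c in truth:
            relevant[c] += 1
        for c in guess & truth:
            correct[c] += 1

    # Micro-average: pool the counts over all categories, then compute once.
    micro = prf(sum(correct.values()), sum(assigned.values()),
                sum(relevant.values()))

    # Macro-average: compute per category, then take the unweighted mean.
    categories = sorted(set(assigned) | set(relevant))
    per_cat = [prf(correct[c], assigned[c], relevant[c]) for c in categories]
    if per_cat:
        macro = tuple(sum(vals) / len(per_cat) for vals in zip(*per_cat))
    else:
        macro = (0.0, 0.0, 0.0)
    return micro, macro
```

Micro-averaging is dominated by the most frequent categories, while macro-averaging weights every category equally, so reporting both shows how a classifier behaves on rare classes.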

  11. Conclusions
     • Domain Specific Classification
       • Knowledge Gathering
         • Use of specialized knowledge
       • Targeted Spidering
         • Efficient use of resources
         • Extract key features, Metadata
       • Training
         • Prior knowledge
         • Bootstrapping
       • Classification
         • Robust, tolerant of noisy data
     • Benefits of Semantic Web
       • Better Metadata
       • Semantic linking & intelligent spidering