Chapter 7 Web Content Mining Xxxxxx


Presentation Transcript


  1. Chapter 7 Web Content Mining Xxxxxx

  2. Introduction • Web-content mining techniques are used to discover useful information from content on the web, including: • textual • audio • video • still images • metadata • hyperlinks

  3. Introduction • Some of the web content is generated dynamically using queries to database management systems • Other web content may be hidden from general users

  4. Introduction • Problems with web data • Distributed data • Large volume • Unstructured data • Redundant data • Quality of data • A high percentage of volatile data • Varied data

  5. Introduction • Two approaches to web-content mining: • agent-based • software agents perform the content mining • database-oriented • view the Web data as belonging to a database

  6. Web Crawler • A computer program that navigates the hypertext structure of the web • Crawlers are used to ease the formation of indexes used by search engines • The page(s) that the crawler begins with are called the seed URLs • Every link from the first page is recorded and saved in a queue • A crawler that builds a new index by visiting a number of pages and then replaces the current index is known as a periodic crawler, because it is activated periodically
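The seed-and-queue behaviour described above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `LINKS` dictionary is a hypothetical in-memory stand-in for the web (no network access), and the names `crawl` and `max_pages` are assumptions introduced for the example.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": [],
}


def crawl(seeds, links, max_pages=100):
    """Breadth-first crawl; returns pages in the order they were visited."""
    queue = deque(seeds)   # pages waiting to be visited
    seen = set(seeds)      # pages already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        # Record every link found on the page and save new ones in the queue.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


order = crawl(["seed"], LINKS)
print(order)
```

A periodic crawler would rerun `crawl` on a schedule and swap the freshly built index in for the old one.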

  7. Web Crawler • Another type is a Focused Crawler • Generally recommended because of the large size of the Web • Visits pages related to topics of interest • If a page is not pertinent, the entire set of possible pages below it is pruned
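The pruning rule of a focused crawler can be sketched as follows. The page collection, the keyword-based `is_pertinent` test, and the topic terms are all hypothetical; a real focused crawler would typically use a trained classifier to judge relevance.

```python
from collections import deque

# Hypothetical pages: each has some text and a list of outgoing links.
PAGES = {
    "seed": {"text": "web mining overview", "links": ["sports", "crawlers"]},
    "sports": {"text": "football scores", "links": ["mining-in-sports"]},
    "crawlers": {"text": "web crawlers for mining", "links": ["indexing"]},
    "indexing": {"text": "index construction for web mining", "links": []},
    "mining-in-sports": {"text": "web data mining for sports", "links": []},
}


def is_pertinent(text, topic_terms):
    """Assumed relevance test: the page mentions at least one topic term."""
    words = set(text.lower().split())
    return any(term in words for term in topic_terms)


def focused_crawl(seeds, pages, topic_terms):
    """Crawl breadth-first, but never follow links from off-topic pages."""
    queue = deque(seeds)
    seen = set(seeds)
    kept = []
    while queue:
        url = queue.popleft()
        page = pages[url]
        if not is_pertinent(page["text"], topic_terms):
            continue  # prune: everything reachable only through this page is skipped
        kept.append(url)
        for target in page["links"]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return kept


kept = focused_crawl(["seed"], PAGES, {"web"})
print(kept)
```

Note the trade-off: `mining-in-sports` is on topic but is never reached, because the only path to it runs through a page that was pruned.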

  8. Multiple Layered Database • Every layer of the database is more generalized than the layer below it • Unlike the lowest level, the upper levels are structured and can be mined by an SQL-like query language

  9. Multiple Layered Database • Provides an abstracted view of a fraction of the web • A Virtual Web View (VWV) can be constructed

  10. Search Engine • Basic components of a search engine: • The spider gathers new or updated information on Internet websites • The index stores information about several websites • The search software searches through the huge index in an effort to generate an ordered list of useful search results

  11. Types of Queries • Boolean Queries: • Boolean logic queries connect words in the search using operators such as AND or OR • Natural Language Queries: • In natural language queries the user frames the query as a question or a statement • Thesaurus Queries: • In a thesaurus query the user selects the term from a set of terms predetermined by the retrieval system
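As an illustration of Boolean queries, the sketch below evaluates AND and OR over a small inverted index using set intersection and union. The documents and the helper names (`build_index`, `boolean_and`, `boolean_or`) are made up for the example and do not come from any particular search engine.

```python
# Hypothetical document collection: id -> text.
DOCS = {
    1: "web content mining techniques",
    2: "web crawler and search engine",
    3: "multimedia content retrieval",
}


def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every term (intersection of posting sets)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one term (union of posting sets)."""
    return set().union(*(index.get(term, set()) for term in terms))


INDEX = build_index(DOCS)
and_hits = boolean_and(INDEX, "web", "content")
or_hits = boolean_or(INDEX, "crawler", "multimedia")
print(and_hits, or_hits)
```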

  12. Types of Queries • Fuzzy Queries: • Fuzzy queries reflect no specificity • Term Searches: • The most common type of query on the Web is the term search, in which a user provides a few words or phrases for the search • Probabilistic Queries: • Probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy

  13. The Robot Exclusion • Why would developers prefer to exclude robots from parts of their websites? • The robot exclusion protocol is used: • to indicate restricted parts of a website to the robots that visit it • to give spiders (“robots”) limited access to a website

  14. The Robot Exclusion • Website administrators and content providers can limit robot activity through two mechanisms: • The Robots Exclusion Protocol is used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site. • The Robots META tag is a special HTML META tag that can be used in any Web page to indicate whether that page should be indexed, or parsed for links.
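A well-behaved robot checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module on an inline example file; the rules, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration, and a real crawler would download the file from the site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all robots are barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

The second mechanism lives in the page itself, for example `<meta name="robots" content="noindex, nofollow">` asks robots neither to index the page nor to follow its links.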

  15. Personalization of Web Content • Used to modify the contents of a web page as per the needs of a user • Essentially, this involves building web pages exclusively for each user

  16. Types of Web Page Personalization • Collaborative filtering: • Achieves personalization by suggesting Web pages that have previously been rated highly by similar users • Manual techniques: • Perform personalization through rules that classify individuals based on profiles or demographics • Content-based filtering: • Retrieves pages based on the similarity between them and user profiles
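Collaborative filtering can be sketched as: find the most similar user, then suggest pages that user rated highly. The ratings, the choice of cosine similarity over co-rated pages, and the `min_rating` threshold are all assumptions made for this example; real systems use many neighbours and more robust similarity measures.

```python
import math

# Hypothetical page ratings (1 to 5) per user.
RATINGS = {
    "alice": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 5, "p2": 5, "p4": 5},
    "carol": {"p1": 1, "p3": 5, "p5": 4},
}


def cosine(u, v):
    """Cosine similarity computed over the pages both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = math.sqrt(sum(u[p] ** 2 for p in shared))
    norm_v = math.sqrt(sum(v[p] ** 2 for p in shared))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, min_rating=4):
    """Suggest pages the most similar user rated highly and `user` has not rated."""
    others = [
        (cosine(ratings[user], their), name)
        for name, their in ratings.items()
        if name != user
    ]
    _, nearest = max(others)  # most similar other user
    return sorted(
        page
        for page, score in ratings[nearest].items()
        if score >= min_rating and page not in ratings[user]
    )


suggestions = recommend("alice", RATINGS)
print(suggestions)
```

Content-based filtering would instead compare page features with the user's own profile, without consulting other users' ratings.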

  17. Multimedia Information Retrieval • Perspective of images and videos • An example of a content-based retrieval system for images is the Query by Image Content (QBIC) system, whose features include: • A three-dimensional color feature vector, where the distance measure is simple Euclidean distance. • k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm. • A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast. Distance is computed as a weighted Euclidean distance measure, where the default weights are inverse variances of the individual features.
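The two distance measures named above can be sketched as follows. This is not QBIC's actual code: the feature values are invented, and using the population variance over the collection for the default weights is an assumption.

```python
import math
from statistics import pvariance


def euclidean(a, b):
    """Simple Euclidean distance, as used for the 3-D color feature vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inverse_variance_weights(vectors):
    """One weight per feature: 1 / variance of that feature over the collection.

    Assumes every feature has non-zero variance.
    """
    return [1.0 / pvariance(column) for column in zip(*vectors)]


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance, as used for the 3-D texture vector."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical color vectors for two images.
color_distance = euclidean((0.0, 0.0, 1.0), (0.0, 3.0, 5.0))

# Hypothetical texture vectors (scale, directionality, contrast) for a tiny collection.
TEXTURES = [(1.0, 10.0, 0.0), (3.0, 30.0, 4.0)]
weights = inverse_variance_weights(TEXTURES)
texture_distance = weighted_euclidean(TEXTURES[0], TEXTURES[1], weights)
print(color_distance, texture_distance)
```

Weighting by inverse variance keeps a feature with a large numeric range (here the second one) from dominating the distance.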

  18. Multimedia Information Retrieval • The query can be expressed directly in terms of the feature representation itself • For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property

  19. Multimedia Information Retrieval • MIR System: www.hermitagemuseum.org/html_En/index.html • A QBIC Layout Search demo that walks step by step through the search described in the text can be found at: www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English.

  20. Multimedia Information Retrieval • As multimedia becomes a more widely used data format, it is vital to address the following issues in order to develop and deploy efficient and effective multimedia information retrieval systems: • metadata standards • classification • query matching • presentation • evaluation
