Chapter 7 Web Content Mining
Introduction • Web-content mining techniques are used to discover useful information from content on the web • textual • audio • video • still images • metadata • hyperlinks
Introduction • Some of the web content is generated dynamically using queries to database management systems • Other web content may be hidden from general users
Introduction • Problems with web data • Distributed data • Large volume • Unstructured data • Redundant data • Quality of data • High percentage of volatile data • Varied data
Introduction • Two approaches of web-content mining: • agent-based • software agents perform the content mining • database oriented • view the Web data as belonging to a database
Web Crawler • A computer program that navigates the hypertext structure of the web • Crawlers are used to ease the formation of indexes used by search engines • The page(s) that the crawler begins with are called the seed URLs • Every link from the first page is recorded and saved in a queue • A periodic crawler is activated periodically: it builds a new index by visiting a number of pages and then replaces the current index
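The seed-and-queue behaviour described on this slide can be sketched as follows. This is a minimal illustration, not a production crawler: the toy link graph `TOY_WEB`, the `get_links` helper, and the `max_pages` limit are assumptions introduced here so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> outgoing links (assumption for the demo).
TOY_WEB = {
    "a.example/": ["a.example/x", "a.example/y"],
    "a.example/x": ["a.example/y", "a.example/z"],
    "a.example/y": ["a.example/"],
    "a.example/z": [],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns pages in visit order."""
    queue = deque(seed_urls)   # links waiting to be visited
    seen = set(seed_urls)      # avoids queueing the same URL twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A periodic crawler would rerun this and swap the result in for the old index.
fresh_index = crawl(["a.example/"], get_links)
```

Using a queue gives breadth-first order, so pages close to the seeds are visited first; `max_pages` stands in for "visiting a number of pages" before the index is replaced.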
Web Crawler • Another type is a focused crawler • Generally recommended for use due to the large size of the Web • Visits pages related to topics of interest • If a page is not pertinent, the entire set of possible pages below it is pruned
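A focused crawler can be sketched by adding a relevance test to the queue-based crawl: a page that fails the test is skipped and its links are never followed. The `PAGES` data and the keyword-overlap test `is_relevant` are assumptions for illustration; real systems typically use a trained classifier.

```python
from collections import deque

# Hypothetical pages: text content plus outgoing links (assumption for the demo).
PAGES = {
    "root": {"text": "data mining overview", "links": ["mining", "sports"]},
    "mining": {"text": "web mining and text mining", "links": ["crawler"]},
    "sports": {"text": "football scores", "links": ["mining-news"]},
    "crawler": {"text": "crawler design for web mining", "links": []},
    "mining-news": {"text": "mining news", "links": []},
}


def is_relevant(text, topic_terms):
    """Simplistic relevance test: the page shares at least one word with the topic."""
    return bool(set(text.lower().split()) & topic_terms)


def focused_crawl(seed, topic_terms):
    """Crawl only pertinent pages; links on non-pertinent pages are pruned."""
    queue = deque([seed])
    seen = {seed}
    kept = []
    while queue:
        url = queue.popleft()
        page = PAGES[url]
        if not is_relevant(page["text"], topic_terms):
            continue  # prune: nothing below this page is queued
        kept.append(url)
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

Here `focused_crawl("root", {"mining"})` never reaches `mining-news`, even though that page is on topic, because it is only linked from the pruned `sports` page. That is the trade-off pruning makes.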
Multiple Layered Database • Every layer of the database is more generalized than the layer below it • Unlike the lowest level, the upper levels are structured and can be mined by an SQL-like query language
Multiple Layered Database • Provides an abstracted view of a fraction of the web • A Virtual Web View (VWV) can be constructed
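As a rough illustration of the layering idea, the sketch below stores structured page descriptors in one table and derives a more general per-topic summary from it with an SQL query. The schema (`layer1`, `url`, `topic`, `size_kb`) and the use of SQLite are assumptions made for this example, not the design described in the chapter.

```python
import sqlite3


def build_layer1(rows):
    """Layer 1: structured descriptors extracted from the raw pages below it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE layer1 (url TEXT, topic TEXT, size_kb INTEGER)")
    conn.executemany("INSERT INTO layer1 VALUES (?, ?, ?)", rows)
    return conn


def topic_summary(conn):
    """A more general layer: one row per topic, summarizing layer 1."""
    cur = conn.execute(
        "SELECT topic, COUNT(*), SUM(size_kb) FROM layer1 "
        "GROUP BY topic ORDER BY topic"
    )
    return cur.fetchall()


conn = build_layer1([
    ("u1", "mining", 10),
    ("u2", "mining", 30),
    ("u3", "sports", 5),
])
summary = topic_summary(conn)  # [('mining', 2, 40), ('sports', 1, 5)]
```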
Search Engine • Basic components of a search engine: • The spider gathers new or updated information on Internet websites • The index stores information about the websites visited • The search software searches through the huge index in an effort to generate an ordered list of useful search results
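The index and search-software components can be illustrated with a small inverted index. This is a sketch under simple assumptions: documents are plain strings split on whitespace, and results are ordered by how many query terms each document contains, which is only one of many possible ranking schemes. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Inverted index: word -> set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return document ids ordered by number of matching query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "web content mining",
    "d2": "web crawler",
    "d3": "image retrieval",
}
index = build_index(docs)
results = search(index, "web mining")  # ['d1', 'd2']
```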
Types of Queries • Boolean Queries: • Boolean logic queries connect words in the search using operators such as AND or OR • Natural Language Queries: • In natural language queries the user frames the query as a question or a statement • Thesaurus Queries: • In a thesaurus query the user selects the term from a set of terms predetermined by the retrieval system
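Boolean queries map naturally onto set operations over an inverted index: AND is intersection and OR is union. The index contents and function names below are hypothetical, chosen only to make the sketch runnable.

```python
# Hypothetical inverted index: term -> set of document ids.
INDEX = {
    "web": {"d1", "d2"},
    "mining": {"d1", "d4"},
    "crawler": {"d2"},
}


def boolean_and(index, a, b):
    """Documents that contain both terms."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents that contain at least one of the terms."""
    return index.get(a, set()) | index.get(b, set())


both = boolean_and(INDEX, "web", "mining")   # {'d1'}
either = boolean_or(INDEX, "web", "mining")  # {'d1', 'd2', 'd4'}
```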
Types of Queries • Fuzzy Queries: • Fuzzy queries reflect no specificity • Term Searches: • The most common type of query on the Web is when a user provides a few words or phrases for the search • Probabilistic Queries: • Probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy
The Robot Exclusion • Why would developers prefer to exclude robots from parts of their websites? • The robot exclusion protocol is used: • to indicate restricted parts of the website to robots that visit the site • to give spiders (“robots”) limited access to a website
The Robot Exclusion • Website administrators and content providers can limit robot activity through two mechanisms: • The Robots Exclusion Protocol is used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site. • The Robots META Tag is a special HTML META tag that can be used in any Web page to indicate whether that page should be indexed, or parsed for links.
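A well-behaved crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module on an inline robots.txt; the file contents, the `MyBot` user-agent name, and the example.com URLs are assumptions for the demo. For the second mechanism, a page would carry a tag such as `<meta name="robots" content="noindex, nofollow">`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = allowed(ROBOTS_TXT, "MyBot", "http://www.example.com/private/page.html")
permitted = allowed(ROBOTS_TXT, "MyBot", "http://www.example.com/public/index.html")
```

In a live crawler the file would be fetched from the site root (for example with `RobotFileParser.set_url` and `read`) rather than supplied as a string.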
Personalization of Web Content • Used to modify the contents of a web page as per the needs of a user • Essentially, this involves building web pages exclusively for each user
Types of Web Page Personalization • Collaborative filtering: • Achieves personalization by suggesting Web pages that have earlier been given high ratings from similar users • Manual techniques: • Perform personalization via the use of rules that are used to classify individuals based on profiles or demographics • Content-based filtering: • Retrieves pages based on the similarity between them and user profiles
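Collaborative filtering can be sketched as: find the user whose ratings are most similar, then suggest pages that user rated highly. The ratings data, the cosine similarity measure, and the `min_rating` threshold are assumptions chosen for illustration; the slides do not prescribe a particular similarity function.

```python
from math import sqrt

# Hypothetical page ratings per user on a 1 to 5 scale.
RATINGS = {
    "alice": {"p1": 5, "p2": 4},
    "bob": {"p1": 5, "p2": 5, "p3": 4},
    "carol": {"p1": 1, "p4": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    dot = sum(u[p] * v[p] for p in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings, min_rating=4):
    """Suggest unseen pages that the most similar user rated highly."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(
        page
        for page, rating in ratings[nearest].items()
        if rating >= min_rating and page not in ratings[user]
    )


suggestions = recommend("alice", RATINGS)  # ['p3']
```

Content-based filtering would instead compare a page's own features with the user's profile, so it needs no ratings from other users.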
Multimedia Information Retrieval • Perspective of images and videos • A content-based retrieval system for images is the Query by Image Content (QBIC) system, which uses: • A three-dimensional color feature vector, where the distance measure is simple Euclidean distance. • k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm. • A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast. Distance is computed as a weighted Euclidean distance measure, where the default weights are inverse variances of the individual features.
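The two distance measures named on this slide can be sketched as follows. This is an illustration of the formulas only, not QBIC's actual code: the function names are hypothetical, the population variance is assumed for the weights, and every feature is assumed to have non-zero variance across the collection.

```python
from math import sqrt
from statistics import pvariance


def euclidean(a, b):
    """Simple Euclidean distance, as used for the color feature vector."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inverse_variance_weights(vectors):
    """One weight per feature: 1 / variance of that feature over the collection.

    Assumes each feature varies; a constant feature would divide by zero.
    """
    columns = zip(*vectors)
    return [1.0 / pvariance(column) for column in columns]


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance, as used for the texture vector."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical texture vectors (scale, directionality, contrast) for two images.
textures = [(0.0, 0.0, 0.0), (2.0, 4.0, 8.0)]
weights = inverse_variance_weights(textures)  # [1.0, 0.25, 0.0625]
distance = weighted_euclidean(textures[0], textures[1], weights)
```

Weighting by inverse variance keeps a feature with a large numeric range, such as contrast in this toy data, from dominating the distance.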
Multimedia Information Retrieval • The query can be expressed directly in terms of the feature representation itself • For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property
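A feature-level query like the color part of this example can be sketched over a toy collection. Several assumptions are made here: images are lists of RGB tuples, a pixel counts as blue when its blue channel is the strictly largest, "40% blue" is read as "at least 40%", and the texture condition is left out. None of this reflects how QBIC itself computes color features.

```python
def blue_fraction(pixels):
    """Fraction of pixels whose blue channel dominates (assumed definition of 'blue')."""
    blue = sum(1 for r, g, b in pixels if b > r and b > g)
    return blue / len(pixels)


def find_images(images, min_blue=0.4):
    """Names of images whose blue fraction is at least min_blue."""
    return sorted(
        name for name, pixels in images.items() if blue_fraction(pixels) >= min_blue
    )


# Hypothetical five-pixel "images".
IMAGES = {
    "sky": [(0, 0, 255)] * 3 + [(255, 255, 255)] * 2,  # 60% blue
    "grass": [(0, 255, 0)] * 5,                        # 0% blue
}
matches = find_images(IMAGES)  # ['sky']
```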
Multimedia Information Retrieval • MIR System www.hermitagemuseum.org/html_En/index.html • A QBIC Layout Search Demo that illustrates a step by step demonstration of the search described in the text can be found at: www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English.
Multimedia Information Retrieval • As multimedia becomes a more widely used data format, it is vital to deal with the issues of: • metadata standards • classification • query matching • presentation • evaluation • This is needed to guarantee the development and deployment of efficient and effective multimedia information retrieval systems