Chapter 6: Web Content Mining
Web Mining • Data mining techniques applied to the Web • Three areas: • web-usage mining • web-structure mining • web-content mining
Web Usage Mining • Does not deal with the contents of web documents • Goals: - to determine how a website’s visitors use web resources - to study their navigational patterns • The data used for web-usage mining is essentially secondary
Web Structure Mining • Web-structure mining is concerned with the topology of the Web • Focuses on data that organizes the content and facilitates navigation • The principal source of information is hyperlinks, connecting one page to another • Chapter 8 presents web-structure mining
Web Content Mining • Web-content mining deals with primary data on the Web: the actual content of web documents • Its goal is to extract information and to help users locate and extract information relevant to their needs • Web-content mining covers multiple data types: text, images, audio, and video • It also deals with crawling the Web and searching for information
Web Content Mining • Web-content mining techniques are used to discover useful information from content on the web • textual • audio • video • images • metadata
Origin of web data • Some of the web content is generated dynamically using queries to database management systems • Other web content may be hidden from general users
Problems with Web Data • Distributed data • Large volume • Unstructured data • Redundant data • Quality of data • High percentage of volatile data • Varied data types
Web Crawler A computer program that navigates the hypertext structure of the web • Crawlers are used to ease the formation of indexes used by search engines • The page(s) that the crawler begins with are called the seed URLs. • Every link from the first page is recorded and saved in a queue
Periodic Web Crawler • Builds a new index by visiting a number of pages and then replaces the current index • Known as a periodic crawler because it is activated periodically
Focused Web Crawlers • Generally recommended because of the large size of the Web • Visit only pages related to topics of interest • If a page is not pertinent, the entire set of possible pages below it is pruned
Web Crawler Crawling process (see the sketch below) • Begin with a group of seed URLs • Submitted by users • Commonly used URLs • Traverse links breadth-first or depth-first • Extract more URLs from each visited page Numerous crawlers • Problem of redundancy • Partition the Web, with one robot (crawler) per partition
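As a concrete illustration, here is a minimal sketch of the queue-based, breadth-first crawling process described above; the seed list, page limit, and helper names are illustrative, not part of the chapter.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)   # frontier of URLs still to visit (FIFO = breadth-first)
    visited = set()            # avoids fetching the same page twice (redundancy)
    pages = {}                 # url -> raw HTML, kept for later indexing
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue               # unreachable or non-text page: skip it
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:           # record every link on the page
            queue.append(urljoin(url, link))   # resolve relative links
    return pages
```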
Focused Crawler The focused crawler structure consists of two major parts: • The distiller • The classifier
The Distiller • A distiller verifies which pages contain links to other relevant pages, which are called hub pages. • Identifies hypertext nodes which are considered as good access points to more relevant pages (HITS algorithm).
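For illustration, a minimal sketch of the HITS hub/authority iteration that the distiller relies on; the `links` adjacency mapping and the iteration count are assumptions made for the example.

```python
def hits(links, iterations=20):
    """links: dict mapping a page to the set of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p: sum of the hub scores of pages that point to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub of p: sum of the authority scores of pages p points to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalize so scores stay bounded across iterations
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth   # pages with high hub scores are good access points
```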
The Hypertext Classifier • A hypertext classifier establishes a resource rating that estimates how advantageous it would be for the crawler to pursue the links out of a page. • The classifier assigns each document a relevance score with respect to the crawl topic. • It evaluates the relevance of hypertext documents with respect to the given topic.
Focused Crawler The pages that the crawler visits are selected using a priority-based structure, ordered by the priorities that the classifier and the distiller assign to pages
Focused Crawler- how it works • User identifies sample documents that are of interest. • Sample documents are classified based on a hierarchical classification tree. • Documents are used as the seed documents to begin the focused crawling
Focused Crawler Each document is classified into a leaf node of the taxonomy tree • One approach, hard focus, follows links only if some ancestor of this node has been marked as good • Another approach, soft focus, computes the probability that a page d is relevant as R(d) = Σ P(c | d), where the sum is over the nodes c in the tree (each corresponding to a page class) that have been labeled as being of interest (good) • The priority of visiting a page not yet visited is the maximum of the relevance of the visited pages that point to it
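A minimal sketch of the soft-focus relevance and visit-priority computations just described; the `classifier.probability` interface and the `links_to` helper are hypothetical stand-ins for whatever classifier and link store the crawler actually uses.

```python
def soft_focus_relevance(page, good_nodes, classifier):
    """R(d): sum of P(c | d) over the taxonomy nodes c marked as good.
    classifier.probability(c, page) is a hypothetical interface."""
    return sum(classifier.probability(c, page) for c in good_nodes)

def visit_priority(unvisited_url, visited_pages, relevance, links_to):
    """Priority of an unvisited URL: the maximum relevance among the
    already-visited pages that link to it (links_to is a hypothetical helper)."""
    parents = [p for p in visited_pages if unvisited_url in links_to(p)]
    return max((relevance[p] for p in parents), default=0.0)
```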
Context Graph • A variation of focused crawling uses context graphs, which led to the context focused crawler (CFC) • The CFC performs crawling in two steps: 1. Context graphs and classifiers are constructed using a set of seed documents as a training set 2. Crawling is performed using the classifiers to guide it. • How is it different from the focused crawler? Context graphs are updated during the crawl.
Search Engine Uses a ‘spider’ or ‘crawler’ that crawls the Web hunting for new or updated Web pages to store in an index
Search Engine Basic components of a search engine: • The crawler/spider: gathers new or updated information from Internet websites • The index: used to store information about the websites • The search software: searches through the huge index to generate an ordered list of useful search results
Search Engines • Generic structure of all search engines is basically the same • However, the search results for the same search terms differ from search engine to search engine. Why?
Responsibilities of Search Engines • Document collection • choose the documents to be indexed • Document indexing • represent the content of the selected documents • Searching • translate the user's information need into a query • retrieval (search algorithms, ranking of web pages) • Results • present the outcome to the user
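To make the indexing/searching split concrete, here is a minimal sketch of an inverted index with a simple term-overlap ranking; the whitespace tokenization and the scoring scheme are simplifications chosen for the example, not a real engine's algorithms.

```python
from collections import defaultdict

def build_index(documents):
    """documents: dict mapping doc_id -> text; returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many of the query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # ordered list of doc_ids, most matching terms first
    return sorted(scores, key=scores.get, reverse=True)
```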
Phases of Query Binding Query binding is the process of translating a user need into a search engine query
Phases of Query Binding Three-tier process: 1. The user formulates the information need into a question or a list of terms, drawing on personal experience and vocabulary, and enters it into the search engine. 2. The search engine must translate the words, with possible spelling errors, into processing tokens. 3. The search engine must use the processing tokens to search the document database and retrieve the appropriate documents.
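A minimal sketch of the second phase, turning the raw query string into processing tokens; the vocabulary lookup and the difflib-based spelling repair are illustrative choices, not the method prescribed by the text.

```python
import difflib
import string

def to_processing_tokens(raw_query, vocabulary):
    """Turn the user's raw query into processing tokens: lowercase, strip
    punctuation, and map likely misspellings onto known index terms."""
    tokens = []
    for word in raw_query.lower().split():
        word = word.strip(string.punctuation)
        if not word:
            continue
        if word in vocabulary:
            tokens.append(word)
        else:
            close = difflib.get_close_matches(word, vocabulary, n=1)
            tokens.append(close[0] if close else word)
    return tokens

# to_processing_tokens("web minng!", {"web", "mining", "crawler"})
# -> ["web", "mining"]
```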
Types of Queries • Boolean Queries: Boolean logic queries connect words in the search using operators such as AND or OR. • Natural Language Queries: In natural language queries the user frames the information need as a question or a statement • Thesaurus Queries: In a thesaurus query the user selects terms from a set predetermined by the retrieval system
Types of Queries cont. • Fuzzy Queries: Fuzzy queries are imprecise; they handle misspellings and variations of the same word • Term Searches: The most common type of query on the Web, in which a user provides a few words or phrases for the search • Probabilistic Queries: Probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy.
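As an example of Boolean query evaluation against an inverted index such as the one sketched earlier; the (operator, term) query format is an assumption made for the example.

```python
def boolean_query(index, all_doc_ids, query):
    """Evaluate a Boolean query such as [('AND', 'web'), ('AND', 'mining'),
    ('NOT', 'video')] against an inverted index (term -> set of doc_ids)."""
    result = set(all_doc_ids)
    for op, term in query:
        postings = index.get(term, set())
        if op == 'AND':
            result &= postings   # intersection
        elif op == 'OR':
            result |= postings   # union
        elif op == 'NOT':
            result -= postings   # difference
    return result
```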
The Robot Exclusion Why would developers prefer to exclude robots from parts of their websites? • The robot exclusion protocol • indicates restricted parts of a website to robots that visit the site • gives crawlers/spiders ("robots") limited access to a website
The Robot Exclusion Website administrators and content providers can limit robot activity through two mechanisms: • The Robots Exclusion Protocol is used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site. • The Robots META Tag is a special HTML META tag that can be used in any web page to indicate whether that page should be indexed, or parsed for links.
Example of the Robots META Tag <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> If a web page contains the above tag, a robot should not index this document (indicated by the word NOINDEX), nor parse it for links (specified using NOFOLLOW).
Robots.txt • The "User-agent: *" means this section applies to all robots. • The "Disallow: /" tells the robot that it should not visit any pages on the site.
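The record being described here, the most restrictive robots.txt, which keeps all robots away from the entire site, is:
User-agent: *
Disallow: /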
Example-1
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
Example-1 • In this example, three directories are excluded. • Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. • Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Example-2
This record excludes a single robot (here, Google) from the entire site:
User-agent: Google
Disallow: /
What modifications would robots.txt need if we wanted to exclude the Bing robot instead?
Important Considerations when using robots.txt • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it. • The /robots.txt file is publicly available: anyone can see which sections of your server you do not want robots to use. • It is therefore not advisable to use /robots.txt to hide information.
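A well-behaved crawler can honor robots.txt using Python's standard urllib.robotparser; a minimal sketch (the site URL and the user-agent name are hypothetical):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # hypothetical site
rp.read()                                           # fetch and parse the file

# May the crawler "MyCrawler" fetch this URL?
if rp.can_fetch("MyCrawler", "https://www.example.com/tmp/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```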
Robots META tag • Robots.txt can only be created or edited by website administrators. • The META tag can be used by individual web page authors. • The robots META tag is placed in the <HEAD> section of the HTML page.
Robot META tag
<html>
<head>
<meta name="robots" content="noindex, nofollow">
...
<title>...</title>
</head>
Content terms • ALL, • NONE, • INDEX, • NOINDEX, • FOLLOW, • NOFOLLOW • ALL= INDEX, FOLLOW • NONE=NOINDEX, NOFOLLOW
Content combinations
<meta name="robots" content="index,follow"> == <meta name="robots" content="all">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow"> == <meta name="robots" content="none">
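A minimal sketch of normalizing these content values into explicit (index, follow) flags; the function name and the defaulting behavior when a directive is missing are assumptions made for the example.

```python
def parse_robots_content(content):
    """Map a robots META content value to explicit (index, follow) flags."""
    tokens = {t.strip().lower() for t in content.split(",")}
    if "all" in tokens:
        return True, True
    if "none" in tokens:
        return False, False
    index = "noindex" not in tokens    # INDEX unless NOINDEX is present
    follow = "nofollow" not in tokens  # FOLLOW unless NOFOLLOW is present
    return index, follow

# parse_robots_content("noindex, follow") -> (False, True)
# parse_robots_content("none")            -> (False, False)
```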
Exercise Check if the KSU website has a robot exclusion file robots.txt.
Multimedia Information Retrieval • Covers retrieval of images and videos • A content-based retrieval system for images is the Query by Image Content (QBIC) system, which represents each image by: • A three-dimensional color feature vector, where the distance measure is simple Euclidean distance. • k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm. • A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast. Distance is computed as a weighted Euclidean distance measure, where the default weights are the inverse variances of the individual features.
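A minimal sketch of the two distance computations mentioned for QBIC: plain Euclidean distance for the 3-D color vector and weighted Euclidean distance (weights = inverse variances) for the texture vector; the feature values are made up for the example.

```python
import math

def euclidean(a, b):
    """Simple Euclidean distance, as used for the 3-D color feature vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_euclidean(a, b, variances):
    """Weighted Euclidean distance for the texture vector; by default the
    weights are the inverse variances of the individual features."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

# Hypothetical feature values:
d_color = euclidean((0.20, 0.30, 0.90), (0.25, 0.35, 0.80))        # avg (R, G, B)
d_texture = weighted_euclidean((1.2, 0.4, 0.7), (1.0, 0.5, 0.6),   # (scale, directionality,
                               (0.5, 0.1, 0.2))                    #  contrast) with variances
```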
Multimedia Information Retrieval The query can be expressed directly in terms of the feature representation itself • For instance, find images that are 40% blue in color and contain a texture with a specific coarseness property • Or images with a specific layout
Multimedia Information Retrieval • MIR System www.hermitagemuseum.org/html_En/index.html • A step-by-step QBIC Layout Search demo of the search described in the text can be found at: www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English.
Multimedia Information Retrieval • As multimedia becomes a more extensively used data format, it is vital to deal with the issues of: • metadata standards • classification • query matching • presentation • evaluation • in order to guarantee the development and deployment of efficient and effective multimedia information retrieval systems