Web Mining

Web Mining • G. Anuradha • References from Dunham

Objective • What is web mining? • Taxonomy of web mining • Web content mining • Web structure mining • Web usage mining

What is web mining? • Mining of data related to the WWW • The data may be present in Web pages or related to Web activity • Web data is classified into: • Content of web pages • Intrapage structure: the HTML or XML code of a page • Interpage structure: the actual linkage between pages • Usage data: how pages are used by visitors • User profiles

Taxonomy of Web Mining

Web Content Mining • Extension of basic search engines • Search engines are keyword-based • Traditional search engines use: • crawlers to search the Web and gather information • indexing techniques to store the information • query processing to provide fast and accurate information to users

Taxonomy of Web content mining • Agent-based approach: uses software systems (agents) to perform the content mining, e.g. search engines • Database approach: views Web data as belonging to a database; the Web is treated as a multilevel database and query languages are used to query the data • Content mining is a type of text mining

Text mining hierarchy • [Figure: text mining functions arranged from simple to complex]

Crawlers

How do crawlers work? • A robot, spider, or crawler is a program that traverses the hypertext structure of the Web • The page from which the crawler starts is referred to as the seed URL • All links from that page are recorded and saved in a queue • The new pages are searched in turn and their links are saved • The crawler collects information about each page, extracts keywords, and stores indices for users
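The queue-based traversal described above can be sketched in Python. This is a minimal illustration, not a production crawler: the in-memory `TOY_WEB` dictionary, its page names, and the function `crawl` are all hypothetical stand-ins for real HTTP fetching and HTML parsing, and "keyword extraction" is reduced to splitting the page text into words.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("web mining overview", ["b.html", "c.html"]),
    "b.html": ("crawler basics", ["c.html", "d.html"]),
    "c.html": ("page rank notes", ["a.html"]),
    "d.html": ("usage logs", []),
}

def crawl(seed_url, web):
    """Breadth-first crawl from the seed URL.

    Returns the visit order and a keyword index (word -> set of URLs).
    """
    queue = deque([seed_url])   # links waiting to be visited
    seen = {seed_url}           # avoids queueing a page twice
    order = []
    index = {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        order.append(url)
        # Toy keyword extraction: every word on the page is a keyword.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Record the links of this page in the queue.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return order, index

order, index = crawl("a.html", TOY_WEB)
```

Here `order` lists the pages in the order the crawler reached them from the seed, and `index` maps each keyword to the pages that contain it, which is the structure a query processor would consult.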

Types of crawlers • Periodic crawler: activated periodically; each time it runs, it replaces the existing index • Incremental crawler: updates the index incrementally instead of replacing it • Focused crawler: visits only pages related to topics of interest

Focused crawling

Architecture of focused crawler • Has 3 components: • Crawler: performs the actual crawling on the Web. The pages it visits are chosen through a priority-based structure, with priorities assigned to pages by the classifier and distiller • Classifier: associates a relevance score with each document with respect to the crawl topic and determines the resource rating • Distiller: determines which pages contain links to many relevant pages. These are called hub pages.

Harvest Rate • Harvest rate (the fraction of crawled pages that are relevant) is the performance objective of the focused crawler • Seed documents are used to begin the focused crawl • Relevant documents are found using one of two techniques • Hard focus: follows links from a node if there is an ancestor of that node which is marked as good • Soft focus: estimates the probability that a page d is relevant as $R(d) = \sum_{good(c)} P(c \mid d)$, where c is a page and good(c) indicates that c is a relevant page
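A minimal Python sketch of a priority-driven focused crawl and its harvest rate follows. Everything here is an illustrative assumption rather than the architecture from the slides: the toy web, the `TOPIC` keyword set, the keyword-overlap `relevance` function standing in for the classifier, and the simplification that links are followed only out of pages the classifier rates as relevant. The distiller is omitted.

```python
import heapq

TOPIC = {"mining", "crawler"}  # hypothetical crawl topic keywords

# Hypothetical toy web: URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("web mining portal", ["sports", "crawl", "usage"]),
    "sports": ("football scores", ["news"]),
    "crawl": ("focused crawler mining", ["usage"]),
    "usage": ("usage mining logs", []),
    "news": ("daily headlines", []),
}

def relevance(text):
    """Toy classifier: fraction of topic keywords present in the text."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed, web, threshold=0.5):
    """Visit pages by priority; expand only pages rated relevant."""
    frontier = [(-1.0, seed)]   # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []                # (url, relevance score) in visit order
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text)
        visited.append((url, score))
        if score < threshold:
            continue            # simplification: do not expand irrelevant pages
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link's priority is the score of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited

def harvest_rate(visited, threshold=0.5):
    """Fraction of visited pages whose score reaches the threshold."""
    relevant = sum(1 for _, score in visited if score >= threshold)
    return relevant / len(visited)

visited = focused_crawl("seed", TOY_WEB)
rate = harvest_rate(visited)
```

In this toy run the page `news` is never fetched because its only citing page is rated irrelevant, which is how a focused crawler saves work compared with a plain breadth-first crawl.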

Context focused crawler • Crawling takes place in two phases • Training phase: context graphs and classifiers are constructed using a set of seed documents as the training set • Crawling phase: the classifiers are used to guide the crawl and the context graphs are updated • The context focused crawler overcomes a problem of the focused crawler: it also follows links from pages that are not relevant themselves but point to relevant pages • Helps in backward crawling

Context graph • Rooted graph in which the root represents a seed document and the nodes at each level represent pages that have links to a node at the next higher level • The context graphs created for all seed documents are merged into a single merged context graph
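One way to picture a context graph is as a set of levels built by following links backward from the seed. The Python sketch below makes that assumption explicit: `context_graph`, the `depth` limit, and the page names are hypothetical, and the link structure is taken as already known, whereas a real system would have to obtain backlinks from a search engine or an index.

```python
def context_graph(seed, links, depth=2):
    """Return levels[i]: pages whose shortest link path to the seed has length i.

    links maps each page to the set of pages it points to.
    """
    # Invert the link structure: page -> pages that point to it.
    back = {}
    for page, targets in links.items():
        for target in targets:
            back.setdefault(target, set()).add(page)

    levels = [{seed}]           # level 0 is the seed (root) itself
    seen = {seed}
    for _ in range(depth):
        next_level = set()
        for page in levels[-1]:
            next_level |= back.get(page, set()) - seen
        if not next_level:
            break
        seen |= next_level
        levels.append(next_level)
    return levels

# Hypothetical links: h1 and h2 cite the seed, g1 cites h1, x is unrelated.
links = {"h1": {"seed"}, "h2": {"seed", "h1"}, "g1": {"h1"}, "x": {"y"}}
levels = context_graph("seed", links)
```

Under this representation, merging the context graphs of several seed documents amounts to taking the union of their level sets, level by level.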

Harvest system • Based on the use of caching, indexing, and crawling • Harvest is centered around the use of • Gatherers: obtain information for indexing from an Internet Service Provider • Brokers: provide the index and query interface • Brokers may interface with gatherers directly or indirectly

Virtual Web View • Large amounts of unstructured data can be handled using a multiple layered database (MLDB) built on top of the Web data • Every layer of this database is more generalized than the preceding layer • The upper layers are structured and can be accessed using SQL • A view of the MLDB is called a Virtual Web View (VWV)

WebML • Query language that supports data mining operations on the MLDB • The four primitive operations in WebML are • COVERS • COVERED BY • LIKE • CLOSE TO • Example query:
SELECT *
FROM document in “www.engr.smu.edu”
WHERE ONE OF keywords COVERS “cat”

Personalization • The contents of a web page are modified to fit the desires of the user • Advertisements sent to a potential customer are chosen based on specific knowledge about that customer • Personalization is performed on the target web page • Targeting is different from personalization • In targeting, businesses display advertisements at other sites visited by their users • In personalization, when a person visits a Web site, the advertising can be designed specifically for that person

Personalization Contd… • Personalization combines clustering, classification and prediction • Types of personalization: • Manual techniques, based on user registration details • Collaborative filtering • Content-based filtering • E.g. My Yahoo

Web Structure Mining • Creating a model of the web organization • Used to classify Web pages or to create similarity measures between documents

Page Rank • Designed to increase the effectiveness of search engines and improve their efficiency • Used to • Measure the importance of a page • Prioritize the pages returned from a traditional search engine using keyword searching • The Page Rank of a page is calculated based on the number of pages that point to it

Page Rank Contd… • $PR(p) = c \sum_{q \in B_p} \frac{PR(q)}{N_q}$ • where c, between 0 and 1, is used for normalization; $B_p$ = set of pages that point to p; $F_p$ = set of links out of p; $N_q = |F_q|$
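The computation can be sketched iteratively in Python, assuming the usual form of the formula in which each page q in B_p contributes its own rank divided by N_q to p. In this sketch the normalization constant c is realized by rescaling the ranks to sum to 1 after every pass; the function name `page_rank`, the iteration count, and the three-page graph are illustrative assumptions, and every page is assumed to have at least one outgoing link.

```python
def page_rank(out_links, iterations=100):
    """Iterate PR(p) = c * sum over q in B_p of PR(q) / N_q.

    out_links maps each page to the list of pages it links to (F_p).
    """
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    # B_p: the pages that point to p.
    back = {p: [q for q in pages if p in out_links[q]] for p in pages}
    for _ in range(iterations):
        raw = {
            p: sum(rank[q] / len(out_links[q]) for q in back[p])
            for p in pages
        }
        total = sum(raw.values())
        # Rescaling so the ranks sum to 1 plays the role of c.
        rank = {p: raw[p] / total for p in pages}
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(graph)
```

For this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4: B receives only half of A's rank, while C collects the other half plus all of B's.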

Rank Sink • A rank sink problem occurs when there is a cyclic reference, for example page A points to page B and page B points to page A • It is eliminated by adding a term cE(v) to the Page Rank formula • E(v) is a vector that adds an artificial link

Hyperlink-induced topic search (HITS) • Finds hubs and authoritative pages • HITS has two components • Relevant pages are found based on a given set of keywords • Hub and authority measures are associated with these pages, and the pages with the highest values are returned

Authorities and hubs • The algorithm produces two types of pages: • Authority: pages that provide important, trustworthy information on a given topic • Hub: pages that contain links to authorities • Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs Selime Işık-Büşra İpek

Authorities and hubs (2) • [Figure: pages 2, 3 and 4 link to page 1; page 1 links to pages 5, 6 and 7] • a(1) = h(2) + h(3) + h(4) • h(1) = a(5) + a(6) + a(7)

Definitions • Authority: pages that provide important, trustworthy information on a given topic • Hub: pages that contain links to authorities • Indegree: number of incoming links to a given node, used to measure authoritativeness • Outdegree: number of outgoing links from a given node, used here to measure hubness

HITS Algorithm • Hubs point to lots of authorities • Authorities are pointed to by lots of hubs • Together they form a bipartite graph • [Figure: bipartite graph with hubs on one side pointing to authorities on the other]

Step By Step HITS - 1 • HITS first determines a base set S • Let the set of documents returned by a standard search engine be called the root set R • Initialize S to R

Step By Step HITS - 2 • Add to S all pages pointed to by any page in R • Add to S all pages that point to any page in R • Maintain for each page p in S: • Authority score $a_p$ (vector a) • Hub score $h_p$ (vector h)

Step By Step HITS - 3 • For each node, initialize $a_p$ and $h_p$ to 1/n • In each iteration, calculate the authority weight of each node in S as the sum of the hub weights of the pages that point to it: $a_p = \sum_{q \to p} h_q$

Step By Step HITS - 4 • In each iteration, calculate the hub weight of each node in S as the sum of the authority weights of the pages it points to: $h_p = \sum_{p \to q} a_q$ • Note: the hub weights are computed from the current authority weights, which were computed from the previous hub weights.

Step By Step HITS - 5 • After new weights are computed for all nodes, the weights are normalized

The Pseudocode of HITS

HITS Example • Root set R = {1, 2, 3, 4} • Extend it to form the base set S

HITS Example Results • [Figure: authority and hubness weights for nodes 1 to 15]

HITS vs PageRank • HITS emphasizes mutual reinforcement between authority and hub pages, while PageRank does not attempt to capture the distinction between hubs and authorities; it ranks pages by authority only • HITS is applied to the local neighborhood of pages surrounding the results of a query, whereas PageRank is applied to the entire Web • HITS is query-dependent, whereas PageRank is query-independent

HITS vs PageRank (2) • Both HITS and PageRank correspond to matrix computations. • Both can be unstable: changing a few links can lead to quite different rankings. • PageRank doesn't handle pages with no outedges very well, because they decrease the PageRank overall

Conclusion • HITS is a general algorithm for calculating authorities and hubs in order to rank the retrieved data • The basic idea of the algorithm is to induce a subgraph of the Web from the set of pages found by a search on a given topic (query)

HITS Algorithm
INPUT:
  W // WWW viewed as a directed graph
  q // Query
  s // Support
OUTPUT:
  A // Set of authority pages
  H // Set of hub pages
ALGORITHM:
  R = SE(W, q); // Search engine SE is used to find a small root set R
  B = R ∪ {pages linked to from R} ∪ {pages that link to pages in R};
  G(B, L) = subgraph of W induced by B; // B: vertices (pages) in G; L: links
  G(B, L1) = G with links within the same site deleted;
  x_p = Σ y_q over all q with <q, p> in L1; // authority weights
  y_p = Σ x_q over all q with <p, q> in L1; // hub weights
  A = {p | p has one of the highest x_p};
  H = {p | p has one of the highest y_p};
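A runnable Python sketch of the iteration is given here under several stated assumptions: the search engine step is replaced by a root set passed in directly, same-site links are not removed, weights are initialized to 1/n, and normalization divides each weight vector by its sum (other norms are also used in practice). The function name `hits` and the iteration count are illustrative; the sample graph reuses the earlier example in which pages 2, 3 and 4 point to page 1 and page 1 points to pages 5, 6 and 7.

```python
def hits(links, root, iterations=20):
    """links: page -> set of pages it points to; root: the root set R."""
    # Base set: R, plus pages linked to from R, plus pages linking into R.
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base |= targets
        if targets & root:
            base.add(page)
    # Links of the subgraph induced by the base set.
    edges = [(p, q) for p in base for q in links.get(p, set()) if q in base]

    n = len(base)
    auth = {p: 1.0 / n for p in base}
    hub = {p: 1.0 / n for p in base}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to the page.
        new_auth = {p: 0.0 for p in base}
        for p, q in edges:
            new_auth[q] += hub[p]
        # Hub weight: sum of the current authority weights of pages pointed to.
        new_hub = {p: 0.0 for p in base}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize each vector so that it sums to 1 (an assumed choice).
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}
    return auth, hub

links = {1: {5, 6, 7}, 2: {1}, 3: {1}, 4: {1}}
auth, hub = hits(links, root={1})
best_authority = max(auth, key=auth.get)
```

Sorting `auth` and `hub` by value and keeping the top entries would give the sets A and H of the pseudocode.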

Web usage mining • Mining of web usage data, or web logs • A web log is a listing of page reference data (clickstream data) • Logs can be examined from a client or a server perspective • Server perspective: mining uncovers information about the sites where the server resides • Client perspective: information about a user is detected • Aids in personalization

Web usage mining applications • Personalization for a user • Overall performance can be improved by identifying the frequent access behavior of users • Frequently accessed pages can be cached • Common access behavior can guide modifications to the linkage structure of a site • Business intelligence can be gathered to improve sales and advertisements

Issues related to web logs • The exact user cannot be identified from the log • With web client caching, the sequence of pages a user visits is difficult to uncover from the server site • Legal, privacy and security issues remain to be resolved

Preprocessing • The preprocessing phase includes • Cleansing • User identification • Session identification • Path completion • Formatting

What is a log? • Log = {(u1, p1, t1), …, (un, pn, tn)} • P is the set of pages and U is the set of users, with ui ∈ U and pi ∈ P; ti is a timestamp

What is a session? • An ordered list of pages accessed by a user: <p1, t1>, <p2, t2>, …, <pn, tn> • Each session has a unique identifier called the session ID • The length of a session is the number of pages in it, denoted by len(S) • Let D be a database of all sessions; the length of D is the sum of len(S) over all sessions S in D
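The step from a log of (user, page, time) triples to sessions can be sketched in Python. The rule used to split sessions is an assumption of this sketch, not something stated in the slides: a new session starts when the gap between two consecutive requests of the same user exceeds a timeout (30 minutes here). The function name `sessionize` and the sample log are hypothetical, and timestamps are plain seconds.

```python
from collections import defaultdict

def sessionize(log, timeout=1800):
    """Split (user, page, time) triples into per-user sessions.

    Returns a list of (user, [(page, time), ...]); the position of a
    session in the list can serve as its session ID.
    """
    # Group clicks by user, ordered by timestamp.
    by_user = defaultdict(list)
    for user, page, t in sorted(log, key=lambda rec: rec[2]):
        by_user[user].append((page, t))

    sessions = []
    for user, clicks in by_user.items():
        current = [clicks[0]]
        for prev, click in zip(clicks, clicks[1:]):
            if click[1] - prev[1] > timeout:
                # Gap too long: close this session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(click)
        sessions.append((user, current))
    return sessions

log = [
    ("u1", "A", 0), ("u2", "A", 10), ("u1", "B", 60),
    ("u1", "C", 4000), ("u2", "D", 100),
]
sessions = sessionize(log)
len_D = sum(len(pages) for _, pages in sessions)  # total length of D
```

In the sample log, user u1 produces two sessions because of the long pause before page C, and `len_D` adds up the lengths of all sessions.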

Recap of networking • What is an ISP? • Internet Service Provider • What are cookies? • Cookies are used to identify a single user regardless of the machine used to access the Web

Trie • Data structure used to keep track of patterns during web usage mining • A path from the root to a leaf represents a sequence • Tries are used to store strings for pattern-matching applications • Each character of a string is stored on the edge to a node, and common prefixes of strings are shared

Sample tries • [Figure: a trie and a suffix trie; the strings shown include CAR, CART and ANY, with $ marking the end of a string]
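A trie for the sample strings can be sketched with nested Python dictionaries, where each key plays the role of the character on an edge and `$` marks the end of a stored string. The function names `build_trie` and `contains` are illustrative choices, and the suffix trie is not covered.

```python
def build_trie(words):
    """Build a trie as nested dicts; '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word + "$":
            # Reuse the existing edge when a prefix is shared.
            node = node.setdefault(ch, {})
    return root

def contains(trie, word):
    """Return True only if the whole word was stored in the trie."""
    node = trie
    for ch in word + "$":
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["CAR", "CART", "ANY"])
has_car = contains(trie, "CAR")
has_ca = contains(trie, "CA")   # a prefix only, never stored as a word
```

CAR and CART share the path C, A, R; the node reached after R has two outgoing edges, `$` (end of CAR) and `T` (continuing to CART).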
