Web Mining and Link Analysis

Sources: Programming Collective Intelligence (Segaran); Padhraic Smyth notes; KDNuggets course notes
Data Mining - Volinsky - Fall 2011 - Columbia University

Web Mining v. Data Mining
Web Mining is: discovering useful information from the World-Wide Web and its usage patterns
• Structure (or lack of it)
  • Textual information and linkage structure: unstructured data
• Scale
  • Data generated per day is comparable to the largest conventional data warehouses
• Speed
  • Often need to react to evolving usage patterns in real-time (e.g., merchandising, web security)

What is “the Web”?
• The WWW is a huge, widely distributed, global information service center for
  • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
• Hyper-link structure is what makes it so useful
  • provides rich sources for data mining
• Essentially infinite size (>20B pages)
  • With lots of duplication

Why Web Mining?
• Useful to study human digital behavior, e.g. search engine data can be used for
  • Exploration, e.g. # of queries per session?
  • Modeling, e.g. any time of day dependence?
  • Prediction, e.g. which pages are relevant?
• Applications
  • Understand social implications of Web usage
  • Design of better tools for information access
  • E-commerce applications
  • Advertising is a key driver of online business

Advertising Applications
• Revenue of many internet companies is driven by advertising
• Key problem:
  • Given user data:
    • Pages browsed
    • Keywords used in search
    • Demographics
  • Determine the most relevant ads (in real-time)
  • Includes bidding/pricing of ads
• Another major problem: “click fraud”
  • AdSense: place Google ads on your web site
  • AdWords: buy “keywords” to put on Google search
  • Determine fraudulent usage through data mining
• Understanding the user is key to these types of applications

Data Sources for Web Mining
• Web content
  • Text and HTML content on Web pages
  • User generated content: blogs, microblogs (Twitter), social networks
• Web connectivity
  • Hyperlink/directed-graph structure of the Web
• Web user data
  • Data on how users interact with the Web
  • Navigation data, aka “clickstream” data
  • Search query data (keywords for users)
  • Online transaction data
• Who has this data?

Accessing data for web mining
• Scripting languages like Perl or Python make web scraping easy.
• “User-generated” content is meant to be consumed!
• Many websites have APIs for access to data
  • If there is an API, please follow it!
  • Can be open: wikipedia, imdb
  • Can be restricted: facebook, ebay, amazon
• If you are interested, a good book is

Examples of Web Mining Viz
• Volume of data along with useful APIs makes a lot of data available for visualization and analysis.
  • Twitter happiness metric
  • Blogpulse.com

Analyzing User navigation
• Web logs
  • Record activity between client browser and a specific Web server
  • Easily available
  • Collected by server, ISP
• Search engine records
  • Text in queries, which pages were viewed, which snippets were clicked on, etc.
• Client-side browsing records
  • Automatically recorded by client-side software
  • Harder to obtain, but much more accurate than server-side logs
• Other sources
  • Cookies: collected on client/browser, readable by server
  • Web site registration, purchases, email, etc.
  • ISP recording of Web browsing

Example of Web Log entries
Apache web log:
207.237.112.68 - - [25/Oct/2009:06:13:30 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
207.237.112.68 - - [25/Oct/2009:06:13:35 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 304 - "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
69.86.67.231 - - [25/Oct/2009:10:10:36 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042700 SUSE/3.0.10-1.1.1 Firefox/3.0.10"
66.234.60.140 - - [25/Oct/2009:10:21:29 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)"
66.234.60.140 - - [25/Oct/2009:10:21:33 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 200 4584 "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)"
68.239.18.39 - - [25/Oct/2009:11:03:11 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"
68.239.18.39 - - [25/Oct/2009:12:09:47 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"
66.65.114.97 - - [25/Oct/2009:12:17:15 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3"
128.59.154.126 - - [25/Oct/2009:13:14:18 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)"
128.59.154.126 - - [25/Oct/2009:13:14:21 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 200 4584 "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729
66.249.65.210 - - [01/Oct/2009:05:52:03 -0400] "GET /~volinsky/myrefs.html HTTP/1.1" 200 21698 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
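Entries like these can be split into fields before any analysis. The following is a minimal Python sketch, assuming the lines follow the Apache "combined" log layout (host, identity, user, timestamp, request, status, size, referer, user agent). The names LOG_PATTERN and parse_log_line are illustrative and not part of the course material.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# host ident user [time] "method path protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent (e.g. a 304 response).
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('66.249.65.210 - - [01/Oct/2009:05:52:03 -0400] '
          '"GET /~volinsky/myrefs.html HTTP/1.1" 200 21698 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Lines that do not match the assumed layout (for example, a truncated entry) come back as None, so they can be counted or inspected separately rather than silently dropped.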

Routine Server Log Analysis
• Typical statistics/histograms that are computed
  • Most and least visited web pages
  • Entry and exit pages
  • Referrals from other sites or search engines
  • What are the searched keywords
  • How many clicks/page views a page received
  • Error reports, like broken links
• Many software products produce standard reports of this type of data
• e.g., are there clusters/groups of users that use the site in different ways?
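Several of these statistics are simple counts over parsed log records. A small Python sketch, assuming each record is a dict with hypothetical keys path, status and referer; the five records are hand-copied from the Apache log example and are only illustrative.

```python
from collections import Counter

COLUMBIA = "/~volinsky/DataMining/Columbia.html"
HW4 = "/~volinsky/DataMining/HW/HW4.html"
REFERER = "http://www.research.att.com/~volinsky/DataMining/Columbia.html"

# Hypothetical parsed records (only the fields used below).
records = [
    {"host": "66.234.60.140", "path": COLUMBIA, "status": 200, "referer": "-"},
    {"host": "66.234.60.140", "path": HW4, "status": 200, "referer": REFERER},
    {"host": "68.239.18.39", "path": COLUMBIA, "status": 304, "referer": "-"},
    {"host": "128.59.154.126", "path": COLUMBIA, "status": 200, "referer": "-"},
    {"host": "128.59.154.126", "path": HW4, "status": 200, "referer": REFERER},
]

# Page views per page, response codes, and referring pages ("-" means none).
page_views = Counter(r["path"] for r in records)
status_counts = Counter(r["status"] for r in records)
referrers = Counter(r["referer"] for r in records if r["referer"] != "-")

print(page_views.most_common())
print(status_counts)
print(referrers.most_common(1))
```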

Descriptive Summary Statistics

Web data measurement issues
• Important to understand how data is collected
• Web data is collected automatically via software logging tools
  • Advantage:
    • No manual supervision required
  • Disadvantage:
    • Data can be skewed (e.g. due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)

Robot / human identification
• Removal of robot data is an important preprocessing step before any clickstream analysis
• Robots come in all shapes and sizes
  • Good: Google wants to map the net to provide good search
  • Bad: Competitor is scraping your web site to see what you are up to (or steal data)
• Robot page-requests are often identified using a variety of heuristics
  • e.g. some robots identify themselves in the server logs
    • Requests for robots.txt
    • Also, robots should identify themselves via the User Agent field in page requests
  • Patterns of access
• How would you detect robots? How would you escape detection?
• Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots
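The two self-identification heuristics can be sketched in a few lines of Python. This is only an illustration: the token list ROBOT_TOKENS and the function looks_like_robot are assumptions made here, and a robot that hides its identity passes both checks, which is why access-pattern features and classifiers are used as well.

```python
# Hypothetical substrings that self-identifying crawlers often put in the User Agent.
ROBOT_TOKENS = ("bot", "crawler", "spider")

def looks_like_robot(user_agent, requested_paths=()):
    """Heuristic check: User Agent self-identification or a robots.txt request."""
    agent = user_agent.lower()
    if any(token in agent for token in ROBOT_TOKENS):
        return True
    # Well-behaved robots fetch robots.txt before crawling a site.
    return any(path.endswith("/robots.txt") for path in requested_paths)

googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
firefox = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3"

print(looks_like_robot(googlebot))
print(looks_like_robot(firefox))
print(looks_like_robot(firefox, ["/robots.txt", "/index.html"]))
```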

A time-series plot of UCI Website data
[Figure: number of page requests per hour as a function of time, from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.]

From Tan and Kumar, 2002
• Overall accuracies of around 90% were obtained using decision tree classifiers
• Like spam detection, identifying bots is a constant arms race

Sessionizing
• Aggregating clicks into sessions can be useful
  • e.g., what did you do when you sat at the computer?
  • how might you determine this?
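One common answer is an inactivity timeout: a new session starts whenever the gap since a user's previous click exceeds some threshold. The sketch below assumes a 30-minute threshold and identifies users by host address; both are illustrative choices, not something the slides prescribe. The timestamps are taken from the Apache log example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(clicks, timeout=timedelta(minutes=30)):
    """Group (user, timestamp) clicks into per-user sessions split by inactivity."""
    sessions_by_user = defaultdict(list)
    for user, ts in sorted(clicks, key=lambda c: (c[0], c[1])):
        sessions = sessions_by_user[user]
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # continue the current session
        else:
            sessions.append([ts])     # gap too long (or first click): new session
    return dict(sessions_by_user)

clicks = [
    ("66.234.60.140", datetime(2009, 10, 25, 10, 21, 29)),
    ("66.234.60.140", datetime(2009, 10, 25, 10, 21, 33)),
    ("68.239.18.39", datetime(2009, 10, 25, 11, 3, 11)),
    ("68.239.18.39", datetime(2009, 10, 25, 12, 9, 47)),
]
sessions = sessionize(clicks)
for user, user_sessions in sessions.items():
    print(user, [len(s) for s in user_sessions])
```

Note that a host address is a weak user identifier (proxies, shared machines), which is one reason client-side data is considered more reliable.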

Client-side data
• Advantages of collecting data at the client side:
  • Direct recording of page requests (eliminates ‘masking’ due to caching)
  • Recording of all browser-related actions by a user (including visits to multiple websites)
  • More-reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
• Preferred mode of data collection for studies of navigation behavior on the Web
• Companies like ComScore and Nielsen use client-side software to track home computer users
  • but with what biases?

comScore Report 2008
• 185 million U.S. people age 2+ online in a month, spending an average of 29 hours online per person*
• 80% of 824 million global Internet users now outside of U.S.
• 99% of online population search in a month, conducting 22 searches per searcher**
• 75% of online population stream a video, viewing an average of 70 videos per viewer per month***
  • Up 36% vs. year ago (YA)
• 66% of online population visit a social networking site, spending 4 hours per month per visitor*
• 40% of online population visit a blog site in a month*
* February 2008, U.S., comScore Media Metrix
** February 2008, U.S., comScore qSearch 2.0
*** January 2008, U.S., comScore Video Metrix

Modeling Clickrate Data
• Data
• Goal is to build a time-series model that characterizes user click rates
• Usually: cluster data into user types

Markov models for page prediction
• Why would we want to predict where a user is surfing?
  • pre-cached web pages save time
• General approach is to use a finite-state Markov chain
  • Each state can be a specific Web page or a category of Web pages
  • If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states
• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states

Markov models for page prediction
• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C. $s_t$ is the state at position t ($1 \le t \le L$). In general,
  $P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1}, \ldots, s_1)$
• First-order Markov assumption:
  $P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$
• This provides a simple generative model:
  $P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})$

Markov models for page prediction
• If we denote $T_{ij} = P(s_t = j \mid s_{t-1} = i)$, we can define a P x P transition matrix T
• Each page is a “state”: P can be of the order $10^5$ to $10^6$
• If P is large, we might cluster the P pages into M clusters, which now become the states in the Markov model

Markov models for page prediction
• $T_{ij} = P(s_t = j \mid s_{t-1} = i)$ represents the probability that an individual user’s next request will be from state j, given they were in state i
• We can add E, an end-state, to the model
• E.g. for three categories with end state:
  [4 x 4 transition matrix over the three categories and E]
  • Rows sum to 1
  • E denotes the end of a sequence, and start of a new sequence

Markov models for page prediction
• First-order Markov model assumes that the next state is based only on the current state
  • This is a strong assumption!
  • Doesn’t consider ‘long-term memory’
• We can try to capture more memory with a kth-order Markov chain (increased complexity)

Transition probability estimates for Markov model
• Maximum likelihood estimates: $\hat{T}_{ij} = n_{ij} / n_i$
  • where $n_{ij}$ is the number of cases that go from state i to state j, and $n_i$ is the number of cases starting in state i.
• Smoothed parameter estimates: $\hat{T}_{ij} = (n_{ij} + \alpha q_{ij}) / (n_i + \alpha)$
  • $q_{ij}$ is a prior transition matrix
  • If $n_{ij} = 0$ for some transition (i, j), the smoothed version allows prior knowledge to be incorporated, instead of having a parameter estimate of 0.
  • If $n_{ij} > 0$, we get a smooth combination of the data-driven information ($n_{ij}$) and the prior. $\alpha$ determines how much the prior ($q_{ij}$) matters
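As a worked illustration, the sketch below estimates a transition matrix from the example sequence s = ABBCAABBCCBBAA used earlier. It assumes the usual count-based estimate n_ij / n_i and a smoothed form (n_ij + alpha * q_ij) / (n_i + alpha); the uniform prior, the value alpha = 1 and the function names are assumptions made for the example.

```python
from collections import Counter

def estimate_transitions(seq, states, alpha=0.0, prior=None):
    """Estimate T[i][j] = P(next = j | current = i) from one observed sequence.

    alpha = 0 gives the count-based estimate n_ij / n_i; alpha > 0 mixes in
    the prior q_ij (uniform over states if no prior is given).
    """
    pair_counts = Counter(zip(seq, seq[1:]))          # n_ij
    T = {}
    for i in states:
        n_i = sum(pair_counts[(i, j)] for j in states)
        T[i] = {}
        for j in states:
            q_ij = prior[i][j] if prior else 1.0 / len(states)
            denom = n_i + alpha
            # With no data and no smoothing, fall back to the prior.
            T[i][j] = (pair_counts[(i, j)] + alpha * q_ij) / denom if denom else q_ij
    return T

def predict_next(T, current):
    """Most probable next state under the first-order model."""
    return max(T[current], key=T[current].get)

states = ["A", "B", "C"]
seq = "ABBCAABBCCBBAA"
mle = estimate_transitions(seq, states)
smoothed = estimate_transitions(seq, states, alpha=1.0)

print(mle["A"])        # A -> C never occurs, so its estimate is 0
print(smoothed["A"])   # smoothing gives A -> C a small positive probability
print(predict_next(mle, "B"))
```

In this sequence the transition A to C is never observed, so the unsmoothed estimate is exactly 0, while the smoothed estimate is (0 + 1/3) / (4 + 1) = 1/15.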

Ranking Web Pages

Ranking web pages
• Web pages are not equally “important”
• How do you determine the “importance” of a web page?
• Big Idea: Inlinks are a measure of importance.
  • Virtualstapler.com: 178 inlinks
  • Nytimes.com: 13,000 inlinks
• Are all inlinks equal?
  • A page is important if it is linked to by many important sites
  • Recursive question!

Simple recursive formulation
• Each link’s vote is proportional to the importance of its source page
  • if important pages link to me, my links count more
• If page P with importance x has n outlinks, each link gets x/n votes

Simple “flow” model (courtesy Rajaraman, Ullman)
[Figure: link graph over three pages, Yahoo (y), Amazon (a) and M’soft (m). Yahoo links to itself and to Amazon, Amazon links to Yahoo and to M’soft, M’soft links to Amazon.]
• y = y/2 + a/2
• a = y/2 + m
• m = a/2

Solving the flow equations
• 3 equations, 3 unknowns, but any scalar multiple of a solution is also a solution
• Adding the constraint y + a + m = 1 makes the solution unique
• Solution: y = 2/5, a = 2/5, m = 1/5
• Nice for a small example, but need something more general

Matrix formulation
• Matrix M has one row and one column for each web page
• Suppose page j has n outlinks
  • If j links to i, then $M_{ij} = 1/n$
  • Else $M_{ij} = 0$
• Columns sum to 1
• Suppose r is a vector with one entry per web page
  • $r_i$ is the importance score of page i
  • Call it the rank vector
• Then the flow equations can be written r = Mr
• So the rank vector is an eigenvector of the web matrix M (with eigenvalue 1)

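Example: r = Mr
• Pages: Yahoo (y), Amazon (a), M’soft (m)
• Matrix M (columns are source pages, rows are destination pages):
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0
• In matrix form:
  $\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$
• Equivalent flow equations:
  • y = y/2 + a/2
  • a = y/2 + m
  • m = a/2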
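The matrix can be built directly from the outlink lists, and the solution y = 2/5, a = 2/5, m = 1/5 can be checked against r = Mr. A short Python sketch using exact fractions; the dictionary layout and the function names build_matrix and multiply are illustrative assumptions.

```python
from fractions import Fraction

# Outlinks of the three-page example: Yahoo (y), Amazon (a), M'soft (m).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]

def build_matrix(links, pages):
    """M[i][j] = 1/n if page j (with n outlinks) links to page i, else 0."""
    M = {i: {j: Fraction(0) for j in pages} for i in pages}
    for j, outlinks in links.items():
        for i in outlinks:
            M[i][j] = Fraction(1, len(outlinks))
    return M

def multiply(M, r, pages):
    """Matrix-vector product Mr."""
    return {i: sum(M[i][j] * r[j] for j in pages) for i in pages}

M = build_matrix(links, pages)
r = {"y": Fraction(2, 5), "a": Fraction(2, 5), "m": Fraction(1, 5)}

column_sums = {j: sum(M[i][j] for i in pages) for j in pages}
print(column_sums)                  # every column sums to 1
print(multiply(M, r, pages) == r)   # r is unchanged by M
```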

Power Iteration method
• Simple iterative scheme
• Suppose there are N web pages
• Initialize: $r_0 = [1/N, \ldots, 1/N]^T$
• Iterate: $r_{k+1} = M r_k$
• Stop when $\|r_{k+1} - r_k\|_1 < \epsilon$

Power Iteration Example
• Pages: Yahoo (y), Amazon (a), M’soft (m)
• Matrix M:
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0
• Iterates:
         r0    r1    r2     r3     ...   limit
    y   1/3   1/3   5/12   3/8     ...   2/5
    a   1/3   1/2   1/3    11/24   ...   2/5
    m   1/3   1/6   1/4    1/6     ...   1/5
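The iteration is short enough to write out in full. The following Python sketch runs power iteration on the three-page example, starting from the uniform vector and stopping when the L1 change falls below a tolerance; the tolerance value and the max_iter safeguard are assumptions added for the example.

```python
def power_iteration(M, epsilon=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below epsilon."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix for Yahoo (y), Amazon (a), M'soft (m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
rank = power_iteration(M)
print([round(x, 4) for x in rank])
```

The result agrees with the solution of the flow equations, y = 2/5, a = 2/5, m = 1/5, up to the chosen tolerance.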

Random Walk Interpretation
• Imagine a random web surfer
  • At any time t, surfer is on some page P
  • At time t+1, the surfer follows an outlink from P uniformly at random
  • Ends up on some page Q linked from P
  • Process repeats indefinitely
• Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  • p(t) is a probability distribution on pages

The stationary distribution
• Where is the surfer at time t+1?
  • Follows a link uniformly at random
  • p(t+1) = Mp(t)
• Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
  • Then p(t) is called a stationary distribution for the random walk
• Our rank vector r satisfies r = Mr
  • So it is a stationary distribution for the random surfer
  • (also, r is an eigenvector of M)

Existence and Uniqueness
• A central result from the theory of random walks (aka Markov processes):
  • For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0.

Spider traps
• A group of pages is a spider trap if there are no links from within the group to outside the group
  • Random surfer gets trapped
• Spider traps violate the conditions needed for the random walk theorem
• Solution for traps:
  • At every step, with probability $\beta$, follow a link at random
  • With probability $1-\beta$, jump to some page uniformly at random (teleport)
  • Common values for $\beta$ are in the range 0.8 to 0.9
• This is the essence of Google’s PageRank algorithm

Matrix formulation
• Suppose there are N pages
• Consider a page j, with set of outlinks O(j)
• We have $M_{ij} = 1/|O(j)|$ when j links to i and $M_{ij} = 0$ otherwise
• The random teleport is equivalent to
  • adding a teleport link from j to every page with probability $(1-\beta)/N$
  • reducing the probability of following each outlink from $1/|O(j)|$ to $\beta/|O(j)|$

Previous example with traps
• Pages: Yahoo (y), Amazon (a), M’soft (m); M’soft now links only to itself
• Matrix M:
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    1
• Power iteration:
         r0    r1    r2     r3     ...   limit
    y   1/3   1/3   1/4    5/24    ...   0
    a   1/3   1/6   1/6    1/8     ...   0
    m   1/3   1/2   7/12   2/3     ...   1

Previous example with $\beta$ = 0.8
• Pages: Yahoo (y), Amazon (a), M’soft (m)
• Matrix with teleport:
    0.8 x | 1/2  1/2   0 |  +  0.2 x | 1/3  1/3  1/3 |  =  | 7/15  7/15   1/15 |
          | 1/2   0    0 |           | 1/3  1/3  1/3 |     | 7/15  1/15   1/15 |
          |  0   1/2   1 |           | 1/3  1/3  1/3 |     | 1/15  7/15  13/15 |
• Power iteration (selected iterates, as given on the slide):
         start                          ...   limit
    y    1/3    0.33   0.28   0.24      ...   0.212
    a    1/3    0.20   0.2    0.17      ...   0.152
    m    1/3    0.46   0.52   0.58      ...   0.636
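The teleport step only changes the update rule slightly. A minimal Python sketch of the iteration on the trap example, assuming beta = 0.8 and a column-stochastic M with no dead-end pages; the function name, tolerance and max_iter are illustrative choices, and this is a sketch of the idea rather than Google's production algorithm.

```python
def pagerank(M, beta=0.8, epsilon=1e-10, max_iter=1000):
    """Power iteration with teleport: r <- beta * M r + (1 - beta) / N.

    Assumes every column of M sums to 1 (no dead-end pages), so r stays a
    probability distribution and the teleport term is the same for every page.
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1.0 - beta) / n
            for i in range(n)
        ]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r

# Trap example: M'soft (third page) links only to itself.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
rank = pagerank(M_trap, beta=0.8)
print([round(x, 3) for x in rank])
```

With teleport the trap no longer absorbs all of the importance: the limit is (7/33, 5/33, 21/33), roughly (0.212, 0.152, 0.636), instead of (0, 0, 1).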

The Google Model
• Google uses a combination of tools:
  • TFIDF from query to page retrieval
  • PageRank to upweight important pages
  • Link text info
• Problems:
  • Biased against topic-specific authorities
  • Ambiguous queries, e.g., jaguar, spears
  • Susceptible to link spam
    • Artificial links created in order to boost page rank
    • Called Google Bombing, e.g. “miserable failure”

Other measures of importance
• Hubs and Authorities (Kleinberg)
  • PageRank model works on the assumption that important pages link to important pages
  • Kleinberg notes that important sites might not link to each other
• Authorities
  • pages which are prominent for a given topic
• Hubs
  • assemble high-quality guides and direct users to authorities
• A good hub page is one that points to many good authority pages; a good authority page is one that is pointed to by many good hub pages
• Each page gets a hub score and an authority score; this also helps in defining web communities

HITS: Hubs and Authorities
• The HITS algorithm has two basic steps:
  • Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it.
  • Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to.
• Let a be the vector of authority scores and h be the vector of hub scores, and let M be the link matrix with $M_{ij} = 1$ if page i links to page j and 0 otherwise (not normalized by outlinks, unlike the PageRank matrix)
  • Initialize a = [1, 1, ..., 1], h = [1, 1, ..., 1]
  • Do $a = M^T h$; $h = M a$
  • Normalize a and h
  • Repeat until a and h converge
• The converged vectors a* and h* represent the authority and hub weights
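The two updates can be sketched in Python as follows. The example assumes a 0/1 link matrix adj with adj[i][j] = 1 when page i links to page j, normalizes each vector to sum to 1 (other norms are also used), and runs on a made-up four-page graph; all of these are illustrative assumptions.

```python
def normalize(v):
    """Scale a vector so its entries sum to 1 (left unchanged if all zero)."""
    total = sum(v)
    return [x / total for x in v] if total else v

def hits(adj, tol=1e-10, max_iter=1000):
    """Alternate authority and hub updates until the scores stop changing."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(max_iter):
        # Authority update: sum of hub scores of pages pointing to j (a = M^T h).
        new_a = normalize([sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)])
        # Hub update: sum of authority scores of pages that i points to (h = M a).
        new_h = normalize([sum(adj[i][j] * new_a[j] for j in range(n)) for i in range(n)])
        change = sum(abs(x - y) for x, y in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adj)
print([round(x, 3) for x in authority])
print([round(x, 3) for x in hub])
```

In this toy graph page 2 ends up as the strongest authority (both hubs point to it) and page 0 as the strongest hub (it points to both authorities).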

Web Advertising and Auction Models

History of web advertising
• Banner ads (1995-2001)
  • Initial form of web advertising
  • Popular websites charged $X for every 1000 “impressions” of an ad
    • Called “CPM” rate (cost per thousand impressions)
    • Modeled similar to TV, magazine ads
  • Untargeted to demographically targeted
  • Low clickthrough rates
    • low ROI for advertisers

Performance-based advertising
• Introduced by Overture around 2000
  • Advertisers “bid” on search keywords
    • “second price” auction (why?)
  • When someone searches for that keyword, the highest bidder’s ad is shown
  • Advertiser is charged only if the ad is clicked on
• Google’s version came out in 2000
  • Called “Adwords”
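A single-slot second-price auction is easy to sketch: the highest bidder wins but pays the second-highest bid. One answer to the "why?" is that the winner's price does not depend on their own bid, so there is little to gain from bidding below what a click is really worth. The Python below is a simplified illustration with made-up advertisers and bids; it ignores ties, reserve prices, ad quality and multiple ad slots, all of which matter in real ad auctions.

```python
def second_price_auction(bids):
    """Return (winner, price) for a single-slot second-price auction.

    bids maps advertiser name to bid per click. With only one bidder the
    winner pays their own bid here (a simplifying assumption).
    """
    if not bids:
        return None
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

# Hypothetical bids (dollars per click) on one keyword.
bids = {"advertiser_a": 2.50, "advertiser_b": 1.75, "advertiser_c": 0.90}
winner, price = second_price_auction(bids)
print(winner, price)
```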

Ads vs. search results

Web 2.0
• Performance-based advertising works!
  • Multi-billion-dollar industry
  • Ad server has an incentive to provide the best ads for a given search: they only get paid if successful!
  • Auction model is sensible: what are you willing to pay?
• Top words:
  • http://www.cwire.org/highest-paying-search-terms/
• Interesting problems
  • What ads to show for a search?
  • If I’m an advertiser, which search terms should I bid on and how much should I bid?
