Link Mining in the Blogosphere • Workshop on Community-based Web Service Computing and Mining • NCKU CSIE IKM Lab. Hung-Yu Kao 2008. 12
Outline • Motivation • Link mining • Basis • Random walking, Mutual Reinforcement • Social metrics • Some Related Work on blog links • Our Work • Link Extraction • Blog Ranking • Blog match finding
Social relationship mining hype • Google Trends for “social network”, “information retrieval”, “data mining”, “semantic web” and “PageRank” (http://www.google.com/trends?q=pagerank%2C+social+network%2C+information+retrieval%2C+data+mining%2C+semantic+web&ctab=0&geo=all&date=all&sort=0)
[Figure: new interactions between users and information]
Differences for researchers • Throng of pages • With complicated but rule-based styles • Informational vs. emotional • Orz, ^^, 冏, 凸, … • Throng of links • Physical vs. virtual links • Simple vs. diverse / clustered • Throng of machine-understandable human knowledge • Collaborative tagging / filtering / bookmarking
Ranking in Web 2.0 • Rank pages, rank people • Blog ranking • More interaction, much impact from capitalism • PageRank prediction • More knowledge repositories, more latent ontologies • Wikipedia, Del.icio.us • Information extraction / understanding becomes essential and realizable • Visual / semantic block extraction
Link analysis -- Motivation • For one query, which pages are the answer set? • Results of search engines • Rank manually • Rank by similarity • Rank by hit rate (needs a usage log) • Rank by link analysis (Google) • Relevant vs. authoritative • Intra-page vs. inter-page • Users need authoritative pages among relevant pages.
Link analysis -- Motivation • Human knowledge is real, convincing and trustworthy information • E.g., classification by humans in Yahoo • Hyperlinks contain information about human judgment • Social sciences • Nodes: persons, organizations • Edges: social interaction • Easy job? Counting in-links for popularity
HITS - Kleinberg’s Algorithm • HITS: Hypertext Induced Topic Selection • For each vertex v ∈ V in a subgraph of interest: a(v) - the authority of v; h(v) - the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites
Authority and Hubness • [Figure: nodes 2, 3 and 4 link to node 1; node 1 links to nodes 5, 6 and 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4)
HITS Example Results • [Figure: authority and hubness weights for nodes 1 to 15]
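The mutual reinforcement between a(v) and h(v) can be sketched as a short iteration. This is a minimal sketch, not Kleinberg's reference implementation: the function names, the fixed iteration count and the L2 normalization step are assumptions, and the edge list encodes the seven-node example implied by the equations h(1) = a(5) + a(6) + a(7) and a(1) = h(2) + h(3) + h(4).

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Iterate a(v) = sum of h(u) over u -> v, then h(v) = sum of a(w) over v -> w."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: collect the hub scores of all pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: collect the fresh authority scores of all pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Example graph: 2, 3, 4 -> 1 and 1 -> 5, 6, 7.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
auth, hub = hits(edges)
```

In this graph node 1 collects authority from three hubs, while nodes 5, 6 and 7 are each cited only by node 1, so node 1 ends up with the largest authority weight.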
PageRank • Introduced by Page et al. (1998, WWW) • The weight of a page is assigned by the rank of its parents • Difference from HITS • HITS uses hubness and authority weights • The rank of a page is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree • Query independent
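The rule "proportional to the parents' rank, inversely proportional to the parents' outdegree" can be illustrated with a power iteration. This is a minimal sketch under stated assumptions: the damping factor of 0.85, the uniform handling of pages without out-links, and the three-page example graph are illustrative choices, not values taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration: each page passes rank / outdegree to every page it links to."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page starts with the random-jump share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Page C receives rank from both A and B, whereas B receives only half of A's rank, so C ranks above B. Because the computation never looks at a query, the result is query independent.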
PageRank example • Confirm the result • # of in-links from highly ranked pages • Hard to explain for 5 & 2, 6 & 7 • How do you make your homepage highly ranked? • How can this be detected?
Limits of Link Analysis • “When the multitude likes something (spam), it must be examined; when the multitude dislikes something (new page), it must be examined.” (Analects, Wei Ling Gong) • Stability • Adding even a small number of nodes/edges to the graph can have a significant impact • Topic drift, similar to the TKC (tightly knit community) effect • A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page • Content evolution • Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks • Incremental link analysis
Link analysis in a social network • Node: entity • Edge: relationship • We want to know in this social network • Which (group of) nodes / edges are influential • Which (group of) nodes / edges are important • Which node is an outlier • Information flow / tracking
Centrality • Degree centrality • In-degree, out-degree • Localization, isolation • Closeness centrality • Geodesic distance between the entity and all other entities • Betweenness centrality • Geodesic path • Eigenvector centrality • Central entity receiving many communications from other well-connected entities (central entities) • Power centrality
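Two of these metrics are easy to compute by hand on a small graph. The sketch below assumes an undirected, connected graph stored as an adjacency dictionary; the normalization by n - 1 and the four-node path example are illustrative assumptions.

```python
from collections import deque


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def closeness_centrality(adj):
    """Number of reachable nodes divided by the sum of geodesic distances to them."""
    result = {}
    for source in adj:
        # Breadth-first search yields geodesic (shortest-path) distances.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical path graph: a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
degree = degree_centrality(path)
closeness = closeness_centrality(path)
```

The inner nodes b and c are closer to everyone else than the end nodes a and d, which both metrics reflect.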
Network centralization • Summary of the centralization of a network • E.g., [Figure: one network with centralization = 1 and one with centralization = 0]
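The two extreme values can be reproduced with Freeman's degree centralization. This is a sketch under assumptions: the slide does not say which centralization measure or which example networks it uses, so the degree-based formula and the star and ring graphs below are illustrative choices.

```python
def degree_centralization(adj):
    """Freeman degree centralization for an undirected graph with at least 3 nodes."""
    n = len(adj)
    degrees = [len(neighbors) for neighbors in adj.values()]
    max_degree = max(degrees)
    # Sum of gaps to the most central node, divided by the largest possible sum,
    # which is reached by a star graph: (n - 1) * (n - 2).
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical examples: a five-node star and a five-node ring.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
ring = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

The star concentrates every link on one node and scores 1, while the ring treats all nodes alike and scores 0.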
9/11 Hijackers Graph • Reference: “The Text Mining Handbook”, Ronen Feldman and James Sanger, p. 257.
Some related work in blog ranking (with link information) • Technorati (technorati.com/) • Real-time blog search engine which watches over 100 million blogs • Multiple lists • Number of fans • Blog authority: counts the number of blogs linking to a blog • BlogLook (look.urs.tw/) • 60,000+ bloggers • Ranking from many features • #Inlink / #post in Google (general SE, blogger SE) and Yahoo • Scores in delicious, furl, Hemidemi, Myshare • Index factor, impact score, Page score, Technorati score, Bloginference score
Some related work in blog ranking • EigenRumor (Fujimura, 2005) is based on eigenvector calculation of the adjacency matrix of links • BlogRank (Apostolos, 2006) is a generalized form of PageRank which uses similarity features to make the link graph denser • Identifying influential bloggers (Nitin, WSDM 2008)
Influential Properties (Nitin, WSDM 2008) • Recognition: Citations (incoming links) • The more influential the referring posts are, the more influential the referred post becomes. • Activity Generation: Volume of discussion (comments) • Large number of comments indicates that the blog post affects many such that they care to write comments, hence influential. • Novelty: Referring to (outgoing links) • Novel ideas exert more influence. Large number of outlinks suggests that the blog post refers to several other blog posts, hence less novel. • Eloquence: “goodness” of a blog post (length) • Short spam message • Copy message
EigenRumor • Scores each blog entry by weighting the hub and authority scores of the bloggers, based on eigenvector calculations • Similar to HITS • Focuses on the behaviors of bloggers on blog posts • The adjacency matrix is constructed from agent-to-object links, not page-to-page (or object-to-object) links • Agent: • Represents an aspect of a human being, such as a blogger • Object: • Represents any object, such as a blog entry
EigenRumor • Two matrices • Provisioning matrix • P = [p_ij] (i = 1…m, j = 1…n) • p_ij denotes a provisioning link • In this notation, p_ij = 1 if agent i provides object j and zero otherwise • Evaluation matrix • E = [e_ij] (i = 1…m, j = 1…n) • e_ij denotes an evaluation link • The evaluation link is assigned weight e_ij based on the strength of the support given by agent i to object j • e_ij is assumed to lie in the range [0, 1], with higher values indicating stronger support
Algorithm: Scores (1) • The EigenRumor algorithm scores agents in two aspects: • Information evaluation (hub score) • Information provisioning (authority score) • To implement this idea, two scores for each agent and one score for each object are introduced in the algorithm • Agent property • Authority score • Hub score • Object property • Reputation score
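The interplay of the three scores can be sketched with the matrices P and E defined earlier. The slides do not spell out the update rules, so this sketch assumes a HITS-like scheme: authority = P · reputation, hub = E · reputation, and reputation = α · Pᵀ · authority + (1 − α) · Eᵀ · hub, each rescaled to unit length. The mixing weight `alpha`, the helper `unit` and the three-blogger example are hypothetical.

```python
import math


def unit(vec):
    """Rescale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def eigenrumor(P, E, alpha=0.5, iterations=50):
    """Assumed updates: a = P r, h = E r, r = alpha * P^T a + (1 - alpha) * E^T h."""
    m, n = len(P), len(P[0])
    reputation = unit([1.0] * n)
    for _ in range(iterations):
        # Agents gain authority from the reputation of the objects they provide.
        authority = unit([sum(P[i][j] * reputation[j] for j in range(n)) for i in range(m)])
        # Agents gain hub score from the reputation of the objects they evaluate.
        hub = unit([sum(E[i][j] * reputation[j] for j in range(n)) for i in range(m)])
        # Objects gain reputation from their provider and from their evaluators.
        reputation = unit([
            alpha * sum(P[i][j] * authority[i] for i in range(m))
            + (1 - alpha) * sum(E[i][j] * hub[i] for i in range(m))
            for j in range(n)
        ])
    return authority, hub, reputation


# Hypothetical community: blogger i wrote entry i; bloggers 1 and 2 support entry 0.
P = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
E = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]
authority, hub, reputation = eigenrumor(P, E)
```

Entry 0 is the only one that receives support, so it gets the highest reputation, its author gets the highest authority score, and the two supporters get the highest hub scores.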
Algorithm: Mapping to blog community • [Figure: mapping of agents and objects to the blog community]
Motivation • An informative block (IB), presented as a block on a Web page, is meaningful data for an extractor in page analysis • Blogs are hot! There are many studies of them • E.g., social networks and trend analysis • Blog pages differ from general pages in IB scoring and ranking • The DOM tree is no longer a flat tree
Motivation • Related work on block extraction • MDR [1], IKM [3], IEPAD [2] • They have some limitations on CSS Web pages • More <DIV> tags for page layout, but fewer <TABLE> tags • Tree ambiguity • CSS is used to design the Web page style • Data presentation does not correspond to the DOM tree structure • Cannot extract a single presentation block • Our objective • Extract all blocks on a CSS Web page by CSS properties • Visual attributes and attribute entropy facilitate block extraction
The properties of CSS Web pages • CSS selector • HTML tag name, CLASS attribute and ID attribute • A block has high information content if it contains many varied selectors • CSS definition • A CSS definition comprises a selector, a property and a value • CSS definitions provide visual information for tree modification
Block tag analysis • Content page
CSS tag analysis • Content page
The properties of CSS Web pages • [Figure: example blocks A, B and C] • Layer containment • Structural containment • DOM tree structure • Visual containment • Block presentation structure • Structural containment is not equal to visual containment on a CSS Web page
System architecture • Three processes for block extraction • Tree Generation (TG), Entropy Evaluation Model (EEM) and Block Identification (BI)
System architecture • Tree Generation • The DOM parser transforms a Web page into a DOM tree • The Tree Constructor uses the Tag Filter and the Visual Information Module to modify the DOM tree according to node attributes • Entropy Evaluation Model • Uses Partial Path Entropy Evaluation (PPEE) to calculate attribute entropy • An aggregation function automatically provides thresholds to BI for block type notation
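The idea that a block with many varied selectors carries more information can be illustrated with plain Shannon entropy. This is not the PPEE method itself, which the slides do not define; the function name and the selector strings below are hypothetical.

```python
import math
from collections import Counter


def selector_entropy(selectors):
    """Shannon entropy (in bits) of the selector distribution inside one block."""
    counts = Counter(selectors)
    total = len(selectors)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical blocks: one repeats a single selector, the other mixes four.
uniform_block = ["div.post", "div.post", "div.post", "div.post"]
varied_block = ["div.post", "h2.title", "p.body", "a#more"]
uniform_entropy = selector_entropy(uniform_block)
varied_entropy = selector_entropy(varied_block)
```

The block with four distinct selectors reaches 2 bits, while the block that repeats one selector has zero entropy. A threshold between such values is the kind of cut-off an aggregation function would have to supply.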
[Figure: CSS tag entropy, tree structure and informative block]
The performance of CB Extraction • CB Evaluation
Visual Tree • Visual Tree
Motivation • Among this large number of blogs, people need to know which blogs are more informative • Google uses PageRank to rank web pages and provides a successful service for searching web pages • A blog is not only a set of web pages; it also has many particular characteristics and interactive behaviors • A ranking method based on the characteristics of blogs is needed
Informative Blogs • An informative blog post normally receives comments from many bloggers • Users may cite informative posts or send a trackback while writing posts on related topics • A blog with informative posts is an informative blog • We use these blog features and relationships to design a modified PageRank algorithm for blog ranking
Idea • To quantify the quality of blogs, the interactive behaviors or links between blogs are good indicators • Comments • Trackbacks • Blogrolls • Hyperlinks in the content
Linking relationship • [Figure: original vs. proposed linking relationships]
Blog Network • Network structure • Each node represents a blog • Each edge between two nodes represents a relationship between the two blogs • There are three general types of edges in the blog network • Support Edge (support relationships): comments and trackbacks between blogs • Similarity Edge (similarity relationships): common links in contents or common users between blogs (a virtual edge with lower weight) • Hyperlink Edge: links in contents between a blog and a web page
Local Blog Rank Algorithm • Based on the original PageRank, we adjust the probability that a blog surfer follows a link in blog A to another blog B • PageRank: [formula not preserved in the source] • We combine several blog relationships with different weights; the probability is given by a new formula [symbols not preserved in the source] • Besides the support relationships that construct the Blog Network, similarity relationships are used in this formula, because the similarity between blogs may give the surfer more reason to stay on the blog
Local Blog Rank Algorithm • The probability of moving from Blog A to Blog B is determined by the following three factors • Blog Relationship Type (e.g., similarity, comment, trackback, …) • Different blog relationships are given different weights • Shows the relationships with other blogs • Blog Relationship Number (e.g., number of comments) • The number of occurrences of the corresponding relationship • Blog Quality Score (BQ) • Normalized blog features • Shows the general activity of a blog • It is assumed that if users know the quality of the blog features for each blog, they are more likely to move to a blog with higher activity and attention than to others
Local Blog Rank Algorithm • The probability formula: [formula not preserved in the source] • X is the set of blogs to which Blog A links • The Relationship Score combines all kinds of relationships between blog A and blog K; it is calculated from the weight and number of each corresponding relationship type, multiplied by the blog quality score of K
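One reading of this description can be sketched in code. The exact formula is not preserved, so the sketch assumes that the Relationship Score of K is the weighted sum of relationship counts times BQ(K), and that the probability of moving from A to B is B's score divided by the sum of scores over all blogs in X. The weights, counts and quality scores are made-up illustrative values.

```python
def relationship_score(counts, weights, blog_quality):
    """(Sum over relationship types of weight * count) * blog quality score of the target."""
    return sum(weights[kind] * number for kind, number in counts.items()) * blog_quality


def transition_probabilities(relations, weights, quality):
    """Assumed normalization: score of each linked blog divided by the total over X."""
    scores = {
        blog: relationship_score(counts, weights, quality[blog])
        for blog, counts in relations.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {blog: 0.0 for blog in scores}
    return {blog: score / total for blog, score in scores.items()}


# Hypothetical weights per relationship type, with similarity weighted lowest.
weights = {"comment": 1.0, "trackback": 2.0, "similarity": 0.5}
# Hypothetical relationships from blog A to the blogs it links to (the set X).
relations_from_a = {
    "B": {"comment": 3, "trackback": 1},
    "C": {"similarity": 2},
}
quality = {"B": 0.8, "C": 0.5}
probabilities = transition_probabilities(relations_from_a, weights, quality)
```

With these values blog B scores (3 × 1.0 + 1 × 2.0) × 0.8 = 4.0 and blog C scores (2 × 0.5) × 0.5 = 0.5, so the surfer moves from A to B with probability 4.0 / 4.5 and to C with probability 0.5 / 4.5.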