Web Knowledge Web Content Mining 17.3.2016 Najlaa Gali and Pasi Fränti
Web Mining
Goals:
• Automatic extraction of useful information from the web
• Challenging task: heterogeneity and lack of structure
Three types:
• Web usage mining: discovery of user patterns from web usage logs
• Web structure mining: discovery from the structure of links
• Web content mining: discovery from content (text and images)
Applications 1
• Gather, categorize, organize and provide the best available information on the WWW to the user requesting it.
• Keyword (or term) based association analysis.
• Topic classification.
• Similarity detection.
• Cluster pages by a common author.
• Cluster pages containing information from a common source.
Applications 2
• Sequence analysis: predicting a recurring event, discovering trends.
• Event detection and tracking.
• Helping to understand users' behavior.
• Anomaly detection: finding information that violates usual patterns.
• Discovery of frequent phrases.
• Text segmentation.
• Producing higher-quality information for the user's requests by examining images, content, formats and web structure (improving the quality of search results).
• Businesses can use text mining to improve the marketing of their sites and of the products they offer.
Content of Web Page
Hypertext Markup Language (HTML, XHTML)
[Figure: annotated web page showing logo image, navigation bar, title, keywords, images and text]
Components of Web page
• A web page is created from Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS).
• HTML: describes the structure of a web page.
• CSS: defines the look and layout of text and other material.
• Scripts such as JavaScript: affect the behavior of HTML web pages.
HTML
• A standard markup language used to create web pages.
• Its elements consist of tags enclosed in angle brackets (like <html>).
• The elements form the building blocks of all websites.
• HTML allows images and objects to be embedded and can be used to create interactive forms.
• Can embed scripts and style sheets.
Example:
    <!DOCTYPE html>
    <html>
      <head>
        <title>This is a title</title>
      </head>
      <body>
        <p>Welcome to DAA++</p>
      </body>
    </html>
Example of HTML elements
• Title: <title>The Title</title>
• Heading: <hx>Heading level</hx>, x = 1…6
• Paragraph: <p>Paragraph</p>
• Link: <a href="https://www.daa.com/">A link to DAA++!</a>
• Line break: <p>This <br> is a paragraph <br> with <br> line breaks</p>
• Image: <img src="/documents/PasiFranti.jpg" style="width: 80px; height: 104px;">
CSS
• Designed primarily to separate document content from document presentation, including aspects such as layout, colors and fonts.
• Often used to set the visual style of web pages and user interfaces written in HTML and XHTML.
• Enables multiple HTML pages to share formatting by specifying the relevant CSS in a separate .css file, which avoids repeating style information on every node of the structural content.
How to style using CSS
• HTML presentational attributes (without CSS):
    <h1><font color="red">DAA++</font></h1>
• CSS style properties:
    <h1 style="color:red">DAA++</h1>
• Link to an external style sheet:
    <link href="path/to/file.css" rel="stylesheet">
• Internal styling:
    <html>
      <head>
        <style> #xyz { color: red } </style>
      </head>
      <body>
        <p id="xyz">Hello DAA++</p>
      </body>
    </html>
JavaScript
• One of the three essential technologies (HTML, CSS, JS) of WWW content production.
• Adds client-side behavior to HTML pages, such as animation of page elements, interactive content (games), and playing audio and video.
• Validates the input values of a web form to make sure they are acceptable before they are submitted to the server.
Example:
    <script>
      document.body.appendChild(document.createTextNode('Hello World!'));
      var h1 = document.getElementById('header');   // holds a reference to the <h1> tag
      h1 = document.getElementsByTagName('h1')[0];  // accessing the same <h1>
    </script>
    <noscript>Your browser either does not support JavaScript, or has it turned off.</noscript>
DOM Concept
• DOM makes all components of a web page accessible:
• HTML elements
• their attributes
• text
• They can be created, modified and removed with JavaScript.
[Figure: DOM tree with nodes html, head, body, meta, meta, h1, title, p, ul, a, li, li]
DOM objects
• DOM components are accessible as objects or collections of objects
• DOM components form a tree of nodes
• Relationship: parent node – children nodes
• Attributes of elements are accessible as text
• Browsers can show the DOM visually as an expandable tree
Example of DOM
Rendered text:
    INTERSPORT DW SPORTS
    Unit 49a The Circus, Cabot Circus
    BS1 3BD BRISTOL, United Kingdom
HTML:
    <div>
      <h3>INTERSPORT DW SPORTS</h3>
      <p>
        <span>Unit 49a The Circus, Cabot Circus</span><br>
        <span>BS1 3BD</span>
        <span>BRISTOL, United Kingdom</span>
      </p>
    </div>
DOM tree:
    div
      h3
        INTERSPORT DW SPORTS
      p
        span: Unit 49a…
        span: BS1…
        span: BRISTOL…
Text nodes in DOM
• Text node
• Can only be a leaf in the DOM tree
• Its nodeValue property holds the text
• innerHTML of the parent element can be used to access the text
HTML:
    <p> This is text <a href="/path/page.html">link in it</a>. </p>
DOM tree:
    p
      This is text
      a
        link in it
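The text-node idea can be tried outside a browser. The following is a minimal sketch using Python's standard `html.parser`; the class name `TextNodeCollector` and the helper `text_nodes` are illustrative names, not part of the slides. It records each non-empty text node together with the tag of its parent element.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects (parent_tag, text) pairs for every non-empty text node."""

    # Elements that never have children, so they are not pushed on the stack.
    VOID = {"br", "img", "meta", "link", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.nodes = []   # collected (parent_tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed children).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, text))


def text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


example = '<p> This is text <a href="/path/page.html">link in it</a>. </p>'
print(text_nodes(example))
# [('p', 'This is text'), ('a', 'link in it'), ('p', '.')]
```

Note that the trailing "." is a third text node whose parent is again the p element.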
Summary Extraction
• Title: "Rosso restaurant", "City pharmacy"
• Keywords: "restaurant, food, lunch, dinner"
• Representative image
• Short description: "ma-pe: 16.00-22.00 la: 12.00-22.00 puh. 013 227 874" (Finnish: Mon-Fri 16.00-22.00, Sat 12.00-22.00, tel. 013 227 874)
Web Page Title
<title>Wentworth House Hotel Bath Hotels - Cheap Hotels in Bath, Somerset, UK</title>
Where the title can be found (share of web pages):
• Title tag (91 %)
• Logo image (89 %)
• Web page body (93 %)
Title and Meta Tags
• The obvious source for the title
• But they also include additional information:
• <title> Piato Restaurant– 123 Blues Point Road, McMahons Point, Sydney | Visit Piato and experience the life & flavour of Europe. North Sydney Functions. North Sydney Restaurants.</title>
• <title> Joensuu Keskusta | Intersport - Sport to the people </title>
• Segmentation is needed! Example segments: Joensuu Keskusta / Intersport / Sport to the people
Work flow
Input: https://www.jdwetherspoon.com/pubs/all-pubs/england/london/the-coronet-holloway
Output: The Coronet
Extract Content of Title and Meta tags
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
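A minimal sketch of this extraction step with Python's standard `html.parser`. The class name `TitleMetaParser` and the helper `extract_title_and_meta` are illustrative assumptions; the slides do not say which parser was used.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the content of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}   # meta name/property -> content

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title_and_meta(html):
    parser = TitleMetaParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta


page = (
    "<html><head>"
    "<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>"
    '<meta name="keywords" content="The Coronet" />'
    "</head><body></body></html>"
)
title, meta = extract_title_and_meta(page)
print(title)             # The Coronet, Holloway | Our Pubs | J D Wetherspoon
print(meta["keywords"])  # The Coronet
```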
Segmentation by delimiters
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point -Aqua Dining</title>
<title>SIGNORELLI GASTRONOMIA - Pyrmont Italian Restaurant - EAT • DRINK • SHOP • COOKItalian Restaurant Pyrmont Sydney –Signorelli Gastronomia</title>
<title>Neutral Bay Club | Tennis, Bowls, Bistro & Functions | Sydney</title>
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
Pre-defined delimiter patterns:
Candidate Segments
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
Candidates:
• The Coronet
• Holloway
• Our Pubs
• J D Wetherspoon
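The candidate list above can be reproduced with a simple regular-expression split. This is only a sketch: the actual pre-defined delimiter patterns are not listed on the slides, so the set used here (pipe, bullet, comma followed by a space, and a hyphen or dash surrounded by spaces) is an assumption, as are the names `DELIMITERS`, `segment` and `candidates`.

```python
import re

# Assumed delimiter set; the slides do not list the real patterns.
DELIMITERS = re.compile(r"\s*[|•]\s*|,\s+|\s+[-–]\s+")


def segment(text):
    """Split a title or meta string into trimmed, non-empty segments."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]


def candidates(title, meta_keywords):
    """Merge title and meta segments, dropping duplicates but keeping order."""
    seen = []
    for part in segment(title) + segment(meta_keywords):
        if part not in seen:
            seen.append(part)
    return seen


title = "The Coronet, Holloway | Our Pubs | J D Wetherspoon"
print(candidates(title, "The Coronet"))
# ['The Coronet', 'Holloway', 'Our Pubs', 'J D Wetherspoon']
```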
Scoring: Position in Title and Meta Tags
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
• A candidate that appears first or last in either the title or the meta tag gets 0.1
Candidates:
• The Coronet: 0.1
• Holloway: 0.0
• Our Pubs: 0.0
• J D Wetherspoon: 0.1
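A minimal sketch of the position criterion, assuming the title and meta strings have already been split into segments. The function name `position_scores` and the `bonus` parameter are illustrative, and a candidate that is first or last in both sources still gets a single 0.1, as on the slide.

```python
def position_scores(candidates, title_segments, meta_segments, bonus=0.1):
    """Give `bonus` to candidates that are first or last in the title or meta tag."""
    edge_segments = set()
    for segments in (title_segments, meta_segments):
        if segments:
            edge_segments.add(segments[0])
            edge_segments.add(segments[-1])
    return {c: (bonus if c in edge_segments else 0.0) for c in candidates}


title_segments = ["The Coronet", "Holloway", "Our Pubs", "J D Wetherspoon"]
meta_segments = ["The Coronet"]
scores = position_scores(title_segments, title_segments, meta_segments)
print(scores)
# {'The Coronet': 0.1, 'Holloway': 0.0, 'Our Pubs': 0.0, 'J D Wetherspoon': 0.1}
```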
Popularity Among Header Tags
<h1 class="banner-inner__title">The Coronet</h1>
<h2 class="venue-finder__title-text">Find a pub or hotel</h2>
<h2 class="venue-finder__title-text">Our Pubs</h2>
<h2 class="venue-finder__title-text" ng-hide="isPubName">Check out your nearest pub or hotel</h2>
<h3 class="feature-panel__title">Discover our food menu</h3>
<h3 class="feature-panel__title">Our drinks selection</h3>
<h4 class="tab__title">Nearby J D Wetherspoons</h4>
Candidate          Frequency × Weight = Score
The Coronet        1 × 6 = 6
Holloway           -
Our Pubs           1 × 5 = 5
J D Wetherspoon    1 × 3 = 3
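A minimal sketch of the heading criterion. It assumes that the weight of heading level x is 7 − x (consistent with the weights 6, 5 and 3 shown for h1, h2 and h4) and that a candidate counts as present when it occurs as a case-insensitive substring of the heading text. Both are assumptions, and `heading_scores` is an illustrative name.

```python
def heading_scores(candidates, headings):
    """Sum the weights of all headings that contain each candidate.

    `headings` is a list of (level, text) pairs, e.g. (1, "The Coronet").
    With all matches on one level this equals frequency × weight.
    """
    scores = {}
    for candidate in candidates:
        needle = candidate.lower()
        scores[candidate] = sum(
            7 - level for level, text in headings if needle in text.lower()
        )
    return scores


headings = [
    (1, "The Coronet"),
    (2, "Find a pub or hotel"),
    (2, "Our Pubs"),
    (2, "Check out your nearest pub or hotel"),
    (3, "Discover our food menu"),
    (3, "Our drinks selection"),
    (4, "Nearby J D Wetherspoons"),
]
names = ["The Coronet", "Holloway", "Our Pubs", "J D Wetherspoon"]
print(heading_scores(names, headings))
# {'The Coronet': 6, 'Holloway': 0, 'Our Pubs': 5, 'J D Wetherspoon': 3}
```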
Scoring: Position in Web Link
https://www.jdwetherspoon.com/pubs/all-pubs/england/london/the-coronet-holloway
URL part     Text                             Weight
Domain       www.jdwetherspoon.com/           1
Path         pubs/all-pubs/england/london/    1.5
File name    the-coronet-holloway             3
Similarity is computed with the Dice similarity measure.
Candidate          Weight × Dice = Score
The Coronet        3 × 0.70 = 2.1
Holloway           3 × 0.58 = 1.74
Our Pubs           1.5 × 0.00 = 0.00
J D Wetherspoon    1 × 1.00 = 1.00
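A minimal sketch of a Dice similarity on character bigrams, with letters lower-cased and separators removed. The slides name the Dice measure but not the tokenization, so the bigram variant and the names `bigrams` and `dice` are assumptions. This variant gives 0.58 for Holloway, matching the slide, and about 0.69 for The Coronet (the slide reports 0.70), so the original computation may differ in detail.

```python
import re
from collections import Counter


def bigrams(text):
    """Multiset of character bigrams after lower-casing and dropping non-alphanumerics."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def dice(a, b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())
    return 2 * overlap / total


file_name = "the-coronet-holloway"
print(round(dice("Holloway", file_name), 2))            # 0.58
print(round(dice("The Coronet", file_name), 2))         # 0.69
print(dice("J D Wetherspoon", "jdwetherspoon"))         # 1.0
print(round(3 * dice("Holloway", file_name), 2))        # weighted score: 1.75
```

The last line multiplies by the file-name weight 3; the slide's 1.74 comes from multiplying the already rounded 0.58.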
Rank Segments Normalizing
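The slides do not spell out how the scores are normalized and combined. As a hedged sketch only, the code below assumes that each criterion is divided by its maximum value and that the normalized scores are then summed; both choices, and the names `normalize` and `rank`, are assumptions. The input numbers are the scores from the three preceding slides.

```python
def normalize(scores):
    """Scale one criterion so that its best candidate gets 1.0 (assumed scheme)."""
    top = max(scores.values(), default=0.0)
    if top == 0:
        return {name: 0.0 for name in scores}
    return {name: value / top for name, value in scores.items()}


def rank(criteria):
    """Sum the normalized criteria and sort candidates by total, best first."""
    totals = {}
    for scores in criteria:
        for name, value in normalize(scores).items():
            totals[name] = totals.get(name, 0.0) + value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


position = {"The Coronet": 0.1, "Holloway": 0.0, "Our Pubs": 0.0, "J D Wetherspoon": 0.1}
heading = {"The Coronet": 6, "Holloway": 0, "Our Pubs": 5, "J D Wetherspoon": 3}
link = {"The Coronet": 2.1, "Holloway": 1.74, "Our Pubs": 0.0, "J D Wetherspoon": 1.0}

ranked = rank([position, heading, link])
print(ranked[0][0])  # The Coronet
```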
Impact of criteria
• Criterion 1 has the lowest impact (0.65)
• More generic words such as "home" and "welcome" are often placed at the beginning of the title
• The slogan, the address or general information about the web page is often placed at the end of the title
• Criterion 2 has a slightly higher impact (0.68)
• Heading tags are not always used, and even when they exist, the correct title is not always in them
• The impact of criterion 3 is statistically significant in comparison with criteria 1 and 2
Web Page Body
• Content of text nodes
• N-grams (n = 1…6)
• Filter by part-of-speech (POS) patterns
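A minimal sketch of the n-gram step, assuming simple whitespace tokenization (the slides do not specify the tokenizer). The function name `word_ngrams` is illustrative.

```python
def word_ngrams(text, max_n=6):
    """Return all word n-grams of the text for n = 1 … max_n."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


print(word_ngrams("Sydney Waterfront Restaurant"))
# ['Sydney', 'Waterfront', 'Restaurant', 'Sydney Waterfront',
#  'Waterfront Restaurant', 'Sydney Waterfront Restaurant']
```

A text node with k words yields at most 6k candidate phrases, which is why the POS filter in the next step is needed.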
Construct DOM tree
[Figure: DOM tree of the page body. Element nodes: body, div, div, h2, h1, p, div, div, h3, h5, a, a. Text nodes: "Aqua Dining", "Sydney Waterfront Restaurant |…", "Aqua Dining offers a…", "Feeling social?..", "facebook", "Navigation"]
Extract text nodes
Navigation Feeling Social? Find us on Facebook Sydney Waterfront Restaurant Restaurant Milsons Point Aqua Dining offers a quintessential Sydney dining experience with unrivalled harbour views that sweep from Luna Park to the world famous Sydney Harbour Bridge and the Sydney Opera House.
Apply POS tagging
Text: Navigation Feeling Social? Find us on Facebook Sydney Waterfront Restaurant Restaurant Milsons Point Aqua Dining offers a quintessential Sydney dining experience with unrivalled harbour views that sweep from Luna Park to the world famous Sydney Harbour Bridge and the Sydney Opera House.
POS tags: NNP = Proper noun, singular; NNPS = Proper noun, plural; NN = Noun, singular or mass; VBG = Verb, gerund; VB = Verb, base form; PRP = Personal pronoun; DT = Determiner; CC = Coordinating conjunction; JJ = Adjective
Tags: NNP VBG VB NNP PRP IN NNP NNP NN NNP NNPS NNP NNP JJ NN NN NNP NNP VBZ DT NNP IN NNS IN NN NN DT NNP JJ WDT IN NNP NN JJ NNP CC DT NNP NNP NNP NNP NNP
Extract potential phrases
Text: Navigation Feeling Social? Find us on Facebook Sydney Waterfront Restaurant Restaurant Milsons Point Aqua Dining offers a quintessential Sydney dining experience with unrivalled harbour views that sweep from Luna Park to the world famous Sydney Harbour Bridge and the Sydney Opera House.
POS tags: NNP = Proper noun, singular; NNPS = Proper noun, plural; NN = Noun, singular or mass; VBG = Verb, gerund; VB = Verb, base form; PRP = Personal pronoun; DT = Determiner; CC = Coordinating conjunction; JJ = Adjective
Tags: NNP VBG VB NNP PRP IN NNP NNP NN NNP NNPS NNP NNP JJ NN NN NNP NNP VBZ DT NNP IN NNS NN NN IN NNP JJ DT WDT IN NNP NN JJ NNP CC DT NNP NNP NNP NNP NNP
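A minimal sketch of filtering n-grams by a POS pattern. The slides do not state which patterns are accepted, so the rule used here (keep an n-gram only if every word is tagged as a noun: NN, NNS, NNP or NNPS) is an assumption, and the tags in the example are assigned by hand rather than produced by a tagger. `NOUN_TAGS` and `candidate_phrases` are illustrative names.

```python
# Assumed pattern: phrases consisting only of nouns and proper nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def candidate_phrases(tagged, max_n=6):
    """Return word n-grams (n = 1 … max_n) whose tags are all noun tags.

    `tagged` is a list of (word, pos_tag) pairs in text order.
    """
    phrases = []
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(tag in NOUN_TAGS for _, tag in window):
                phrases.append(" ".join(word for word, _ in window))
    return phrases


# Hand-tagged start of the example sentence (tags are an illustration).
tagged = [
    ("Aqua", "NNP"), ("Dining", "NNP"), ("offers", "VBZ"), ("a", "DT"),
    ("quintessential", "JJ"), ("Sydney", "NNP"), ("dining", "NN"),
    ("experience", "NN"),
]
phrases = candidate_phrases(tagged)
print(phrases)
```

With this rule, "Aqua Dining" survives as a candidate while "offers a" and "Dining offers" are discarded.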
Feature extraction
• Similarity with the link of the web page
• Appearance in title tag
• Appearance in meta tag
• Popularity on the web page (frequency)
• Appearance in heading (h1, h2…h6) tags
• Capitalization
• Capitalization frequency
• Independent appearance
• Phrase length
Similarity with web link
https://www.aquadining.com.au/
Appearance in Title tag
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
Appearance in Meta title tag
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining" />
Appearance in Header tags
<h1 class="site-title">Aqua Dining Sydney Restaurant</h1>   (weight = 6)
<h2>Aqua Dining</h2>   (weight = 5)
Popularity on the web page
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant milsons point - Aqua Dining" />
<h1 class="site-title">Aqua Dining Sydney Restaurant</h1>
<h2>Aqua Dining</h2>
Capitalization frequency
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant milsons point - Aqua Dining" />
<h1 class="site-title">aqua Dining Sydney Restaurant</h1>
<h2>aqua Dining</h2>
Independent appearance
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant milsons point - Aqua Dining" />
<h1 class="site-title">aqua Dining Sydney Restaurant</h1>
<h2>aqua Dining</h2>
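Several of the listed features can be sketched for one candidate phrase as follows. The slides do not give exact definitions, so the ones used here are assumptions: frequency counts case-insensitive occurrences in the title, meta title and headings; capitalization frequency counts the occurrences in which every word starts with an upper-case letter; independent appearance means that some tag contains the phrase and nothing else. The function name `phrase_features` is illustrative.

```python
import re


def phrase_features(phrase, title, meta_title, headings):
    """Compute a few hand-defined features of a candidate phrase (assumed definitions)."""
    texts = [title, meta_title] + list(headings)
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    matches = [m.group(0) for text in texts for m in pattern.finditer(text)]
    return {
        "in_title": bool(pattern.search(title)),
        "in_meta": bool(pattern.search(meta_title)),
        "in_heading": any(pattern.search(h) for h in headings),
        "frequency": len(matches),
        "capitalization_frequency": sum(
            all(word[0].isupper() for word in match.split()) for match in matches
        ),
        "independent": any(t.strip().lower() == phrase.lower() for t in texts),
        "length": len(phrase.split()),
    }


features = phrase_features(
    "Aqua Dining",
    title="Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining",
    meta_title="Sydney Waterfront Restaurant | Restaurant milsons point - Aqua Dining",
    headings=["aqua Dining Sydney Restaurant", "aqua Dining"],
)
print(features)
```

For the example tags above, "Aqua Dining" occurs four times, twice fully capitalized, and once on its own in the h2 tag.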
Classifiers
• Naive Bayes
• Support Vector Machines (SVM)
• Clustering
• K-nearest neighbors (k-NN)
Results with Titler corpus Extracted titles
Results with Mopsi Services Annotated titles
Logo Image
• ~89 % of web pages have their title within a logo image
• The logo image needs to be detected first
• Then OCR is applied to it
• Challenging!