
Creating Operational Redundancy for Effective Web Data Mining

In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.

jcleblanc

Presentation Transcript


  1. Effective Web Data Mining Using Operational Redundancy Jonathan LeBlanc Head of Developer Evangelism N.A. (PayPal) Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc

  2. Premise: The interactions of a user can be used to personalize their experience.

  3. Elements of Mining Redundancy: the user, website data mining, interaction mining, and emotional state mining.

  4. Our Subject Material: HTML content is poorly structured. You can’t trust that anything semantically valid will be present. There are some pretty bad web practices on the interwebz.

  5. How We’ll Capture This Data: Start with base linguistics. Extend with available extras.

  6. The Basic Pieces. Keywords: without all the fluff. Weighting: word diets FTW. Page data: scrapey scrapey.

  7. Capture Raw Page Data: Semantic data on the web is sucktastic. Assume 5-year-olds built the sites. Language is the key.

  8. Extract Keywords: We now have a big jumble of words. Let’s extract. Why is “and” a top word? Stop words = sad panda.

  9. Weight Keywords: All content is not created equal. Meta and headers and semantics, oh my! This is where we leech off the work of others.

  10. Questions to Keep in Mind: Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?

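  11. Fetching the Data: cURL

          //$url is the page to fetch; $header (true / false) controls whether
          //response headers are included in the returned content
          $req = curl_init($url);
          $options = array(
              CURLOPT_URL => $url,
              CURLOPT_HEADER => $header,
              CURLOPT_RETURNTRANSFER => true,
              CURLOPT_FOLLOWLOCATION => true,
              CURLOPT_AUTOREFERER => true,
              CURLOPT_TIMEOUT => 15,
              CURLOPT_MAXREDIRS => 10
          );
          curl_setopt_array($req, $options);

          //run the request; the later steps read the result from $page_content
          $page_content = curl_exec($req);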

  12. (continued)

          //list of findable / replaceable string characters
          $find = array('/\r/', '/\n/', '/\s\s+/');
          $replace = array(' ', ' ', ' ');

          //perform page content modification
          $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
          $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
          $mod_content = strip_tags($mod_content);
          $mod_content = strtolower($mod_content);
          $mod_content = preg_replace($find, $replace, $mod_content);
          $mod_content = trim($mod_content);
          $mod_content = explode(' ', $mod_content);
          natcasesort($mod_content);
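
The cleanup steps on this slide can be wrapped in a single function and tried on a small inline page. This is a minimal sketch, not the speaker's code: the function name `clean_page_words` and the sample HTML are assumptions made for illustration, and the sort step is left out so the words stay in page order.

```php
<?php
// Hypothetical helper (not from the slides): raw HTML in, lowercase words out.
function clean_page_words(string $page_content): array
{
    // drop <script> and <style> blocks, including their contents
    $text = preg_replace('#<script(.*?)>(.*?)</script>#is', ' ', $page_content);
    $text = preg_replace('#<style(.*?)>(.*?)</style>#is', ' ', $text);

    // strip the remaining tags, lowercase, and collapse every whitespace run
    $text = strtolower(strip_tags($text));
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // an empty page yields an empty list rather than array('')
    return $text === '' ? array() : explode(' ', $text);
}

// Sample input, made up for this example.
$html = "<html><head><style>p { color: red; }</style></head>"
      . "<body><p>Web  Data\nMining</p><script>var x = 1;</script></body></html>";

print_r(clean_page_words($html));
```

One thing to watch: `strip_tags` does not insert spaces where tags used to be, so text in adjacent elements such as `<p>a</p><p>b</p>` is joined into one word unless the markup already contains whitespace.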

  13. (continued)

          //set up list of stop words and the final found stopped list
          $common_words = array('a', ..., 'zero');
          $searched_words = array();

          //extract list of keywords with number of occurrences
          foreach($mod_content as $word) {
              $word = trim($word);
              if(strlen($word) > 2 && !in_array($word, $common_words)){
                  //start from 0 the first time a word is seen to avoid an undefined-key warning
                  $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
              }
          }
          arsort($searched_words, SORT_NUMERIC);
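
To see the counting step in isolation, here is a small self-contained sketch. The function name `count_keywords`, the three-word stop list, and the sample words are all assumptions for illustration; a real stop list would be much longer, as the elided one on the slide suggests.

```php
<?php
// Hypothetical helper: counts words longer than two characters that are not stop words.
function count_keywords(array $words, array $stop_words): array
{
    // flip the stop list so each lookup is a key check instead of a linear scan
    $stop = array_flip($stop_words);
    $counts = array();

    foreach ($words as $word) {
        $word = trim($word);
        if (strlen($word) > 2 && !isset($stop[$word])) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // most frequent words first
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Tiny illustrative inputs, made up for this example.
$stop_words = array('and', 'the', 'for');
$words = array('data', 'and', 'mining', 'the', 'data', 'web', 'and', 'data', 'mining');

print_r(count_keywords($words, $stop_words));
```

Flipping the stop list is a design choice rather than a requirement: `in_array` works too, but it rescans the whole list for every word on the page.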

  14. Scraping Site Meta Data

          //load scraped page data as a valid DOM document
          $dom = new DOMDocument();
          @$dom->loadHTML($page_content);

          //scrape title (guarding against pages that have no <title> tag)
          $titles = $dom->getElementsByTagName("title");
          $title = ($titles->length > 0) ? $titles->item(0)->nodeValue : "";

  15. (continued)

          //collected meta data
          $dataReturn = array();

          //loop through all found meta tags
          $metas = $dom->getElementsByTagName("meta");
          for ($i = 0; $i < $metas->length; $i++){
              $meta = $metas->item($i);
              if($meta->getAttribute("property")){
                  if ($meta->getAttribute("property") == "og:description"){
                      $dataReturn["description"] = $meta->getAttribute("content");
                  }
              } else {
                  if($meta->getAttribute("name") == "description"){
                      $dataReturn["description"] = $meta->getAttribute("content");
                  } else if($meta->getAttribute("name") == "keywords"){
                      $dataReturn["keywords"] = $meta->getAttribute("content");
                  }
              }
          }
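
The title and meta scraping from the last two slides can be combined into one function. The sketch below assumes PHP's DOM extension is available; the function name `scrape_meta`, the returned array shape, and the sample markup are illustrative assumptions, not part of the talk.

```php
<?php
// Hypothetical helper: collects title, description, and keywords from raw HTML.
function scrape_meta(string $page_content): array
{
    $data = array('title' => '', 'description' => '', 'keywords' => '');
    if (trim($page_content) === '') {
        return $data; // loadHTML() rejects empty input
    }

    // collect parser warnings quietly; real-world markup is rarely valid
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($page_content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $property = strtolower($meta->getAttribute('property'));
        $content = $meta->getAttribute('content');

        if ($property === 'og:description' || $name === 'description') {
            $data['description'] = $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = $content;
        }
    }
    return $data;
}

// Sample page, made up for this example.
$html = '<html><head><title>Demo Page</title>'
      . '<meta name="keywords" content="web, data, mining">'
      . '<meta property="og:description" content="A demo page">'
      . '</head><body></body></html>';

print_r(scrape_meta($html));
```

Using `libxml_use_internal_errors` instead of the `@` operator is a substitution on my part; it silences the same parser warnings without hiding unrelated errors.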

  16. Weighting Important Data. Tags you should care about: meta (including OG), title, description, h1+, header. Bonus points for adding in content location modifiers.

  17. Weighting Important Tags

          //our keyword weights
          $weights = array("keywords" => "3.0",
                           "meta"     => "2.0",
                           "header1"  => "1.5",
                           "header2"  => "1.2");

          //add modifier here
          if(strlen($word) > 2 && !in_array($word, $common_words)){
              $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
          }
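
The slide leaves the modifier itself open. One possible reading, sketched below, is to add the weight of the tag a word came from instead of adding 1. The function name `weighted_keywords`, the `$words_by_source` structure, and the default weight of 1.0 for unlisted sources such as body text are assumptions for illustration, not the speaker's implementation.

```php
<?php
// Hypothetical helper: scores each word by the weight of the place it was found.
function weighted_keywords(array $words_by_source, array $weights, array $stop_words): array
{
    $scores = array();
    foreach ($words_by_source as $source => $words) {
        // sources without an explicit weight (e.g. body text) count as 1.0
        $weight = (float) ($weights[$source] ?? 1.0);
        foreach ($words as $word) {
            $word = strtolower(trim($word));
            if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    // highest weighted score first
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Weights as on the slide; the word lists are made up for this example.
$weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2");
$words_by_source = array(
    "keywords" => array("mining"),
    "header1"  => array("data", "mining"),
    "body"     => array("data", "and", "web", "data"),
);

print_r(weighted_keywords($words_by_source, $weights, array("and", "the")));
```

Here "mining" appears only twice but outranks "data", which appears three times, because one of its occurrences is in the keywords meta tag.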

  18. Expanding to Phrases: 2-3 adjacent words, making up a direct relevant callout. Seems easy, right? Just like single words. Language gets wonky without stop words.
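
A rough sketch of phrase extraction follows. It assumes the word list is still in page order (that is, taken before any sorting step), and it handles the stop-word problem in one particular way: stop words are kept inside a phrase, but phrases that start or end with one are skipped. That rule and the function name `count_phrases` are my assumptions, not something the slides specify.

```php
<?php
// Hypothetical helper: counts phrases made of $size adjacent words.
function count_phrases(array $words, int $size, array $stop_words): array
{
    if ($size < 1) {
        return array();
    }
    $words = array_values($words); // make sure indexes run 0..n-1
    $counts = array();

    for ($i = 0; $i + $size <= count($words); $i++) {
        $phrase = array_slice($words, $i, $size);

        // keep inner stop words, but skip phrases that begin or end with one
        if (in_array($phrase[0], $stop_words, true)
            || in_array($phrase[$size - 1], $stop_words, true)) {
            continue;
        }

        $key = implode(' ', $phrase);
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }

    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Sample sentence, made up for this example.
$words = explode(' ', 'web data mining and web data mining for the web');

print_r(count_phrases($words, 2, array('and', 'for', 'the')));
```

With a size of 3, a phrase such as "mining and web" survives because the stop word sits in the middle, which keeps the phrase readable.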

  19. Adding in Time Interactions: Interaction with a site does not necessarily mean interest in it. Time needs to also include an interaction component. Gift-buying seasons see interest variations.

  20. Grouping Using Commonality (Venn diagram): Interests of User A and Interests of User B overlap in their Common Interests.
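
If each user's interests are stored as a keyword => count array like the ones built earlier, the overlap in the diagram can be computed directly. This is only a sketch: the function names, the choice of the lower count as the shared score, and the use of Jaccard similarity as a grouping measure are assumptions for illustration.

```php
<?php
// Hypothetical helper: keywords both users share, scored by the lower of the two counts.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();
    foreach (array_intersect_key($user_a, $user_b) as $word => $count) {
        $common[$word] = min($count, $user_b[$word]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Hypothetical helper: Jaccard similarity of the two keyword sets (0.0 to 1.0).
function interest_similarity(array $user_a, array $user_b): float
{
    $union = count($user_a + $user_b); // array union keeps each key once
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect_key($user_a, $user_b)) / $union;
}

// Sample profiles, made up for this example.
$user_a = array('cycling' => 5, 'php' => 3, 'coffee' => 1);
$user_b = array('php' => 4, 'coffee' => 2, 'hiking' => 6);

print_r(common_interests($user_a, $user_b));
echo interest_similarity($user_a, $user_b), "\n";
```

A similarity threshold could then decide whether two users belong in the same group; where to set it would need tuning against real data.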

  21. Using Color Theory. Products with a feel-good message: happiness, energy, encouragement. Health care (but not food!): relatable, calm, friendly, peace, security. Startups / innovative products: creativity, imagination. Auction sites (but not sales sites!): passion, stimulation, excitement, power.

  22. What We’re Talking About

  23. The CSS Service Engine: lesscss.org, sass-lang.com, learnboost.github.com/stylus

  24. Design Engine Foundation: LESSPHP (http://leafo.net/lessphp/)

  25. The Basics of a Design Engine

          //load lessphp (adjust the path to your install)
          require "lessc.inc.php";

          //create new LESS object
          $less = new lessc();

          //compile LESS code to CSS
          $less->checkedCompile('/path/styles.less', 'path/styles.css');

          //create new CSS file and return new file link
          echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";

  26. Passing Variables into LESSPHP

          //create a new LESS object
          $less = new lessc();

          //set the variables
          $less->setVariables(array(
              'color' => 'red',
              'base' => '960px'
          ));

          //compile LESS into CSS and unset variables
          echo $less->compile(".magic { color: @color; width: @base - 200; }");
          $less->unsetVariable('color');

  27. Implementing Color Functions: lighten / darken, saturate / desaturate, mix colors, adjust hue.

  28. Managing Irrelevant Content: Remove / hide content based on user profile and state.

  29. Managing Irrelevant Content

          //variables passed into LESS compilation
          $less->setVariables(array(
              "percent" => "80%",
          ));

          //LESS template (colors are left unquoted so fade() receives color values)
          .highlight {
              @bg-color: #464646;
              @font-color: #eee;
              background-color: fade(@bg-color, @percent);
              color: fade(@font-color, @percent);
          }

  30. Acting on Disinterest / Boredom. Traits of the bored: distraction, repetition, tiredness. Reasons for boredom: lack of interest, readiness.

  31. Highlighting on Agitated Behavior: Highlight relevant content to reduce agitated behavior.

  32. Acting Upon User Cues: Variables passed into the LESS script

          $less->setVariables(array(
              "percent" => "100%",
              "size-mod" => "2"
          ));

  33. Acting Upon User Cues: LESS script logic for color / size variations

          //values are left unquoted so mix() and the font-size addition receive colors and units
          .highlight {
              @bg-calm: blue;
              @bg-action: red;
              @base-font: 14px;
              background-color: mix(@bg-calm, @bg-action, @percent);
              font-size: @size-mod + @base-font;
          }
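
The slides do not show how the `percent` and `size-mod` values are chosen. As a purely illustrative sketch, the function below derives them from a hypothetical agitation score between 0.0 and 1.0. The score, the function name `cue_variables`, and the scaling are all assumptions; they are picked so that a fully agitated user gets the "100%" and "2" values shown on the previous slide.

```php
<?php
// Hypothetical helper: turns an agitation score (0.0 calm .. 1.0 agitated)
// into the variable array that would be handed to setVariables().
function cue_variables(float $agitation): array
{
    // clamp so out-of-range scores cannot produce invalid percentages
    $agitation = max(0.0, min(1.0, $agitation));

    return array(
        // weight of the first color passed to mix(), i.e. the calm color
        "percent"  => (int) round($agitation * 100) . "%",
        // amount added to the base font size
        "size-mod" => (string) (int) round($agitation * 2),
    );
}

print_r(cue_variables(1.0));
print_r(cue_variables(0.5));
```

In LESS, the third argument of `mix()` is the share of the first color, so under this mapping a more agitated user sees a calmer, larger highlight. Whether that is the right direction is a design decision the talk leaves open.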

  34. Interaction and Emotion Plugin: jQuery Behavior Miner by Cedric Dugas, https://github.com/posabsolute/jquery-behavior-miner

  35. In the End… What a person is interested in. What a person is doing. What their emotional state is.

  36. Thank You! Questions? http://slideshare.com/jcleblanc Jonathan LeBlanc Head of Developer Evangelism N.A. (PayPal) Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc
