BINGO!: Bookmark-Induced Gathering of Information

Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum. University of the Saarland, Germany

Part I System Overview

Motivation • Web search engines • The vector space model • Link analysis & authority ranking • Information demands • Mass queries (“madonna tour”) • Needle-in-a-haystack queries (“solidarity eisler”)

Overview (II) [Figure: topic taxonomy below ROOT, populated from the WWW, with the topics Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Data Mining, XML, Web Retrieval]

Focused Crawling [Figure: crawl pipeline with the components Queue, Crawler, Classifier, Results]
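The queue, crawler, classifier and results components named on this slide can be read as a simple loop. The following is a minimal sketch under stated assumptions, not the BINGO! implementation: `fetch` and `classify` are hypothetical caller-supplied functions, the toy web is invented for illustration, and links are expanded only from pages the classifier accepts.

```python
from collections import deque


def focused_crawl(seeds, fetch, classify, max_pages=100):
    """Crawl from `seeds`, expanding links only from accepted pages.

    fetch(url) -> (text, links) and classify(text) -> (accepted, confidence)
    are hypothetical stand-ins supplied by the caller.
    """
    queue = deque(seeds)
    seen = set(seeds)
    results = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        text, links = fetch(url)
        accepted, confidence = classify(text)
        if not accepted:
            continue  # off-topic page: its links are not followed
        results.append((url, confidence))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Toy in-memory "web", invented purely for demonstration.
TOY_WEB = {
    "a": ("database transaction recovery", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("database index", ["a"]),
    "d": ("database deadlock", []),
}


def toy_fetch(url):
    return TOY_WEB[url]


def toy_classify(text):
    accepted = "database" in text
    return accepted, 1.0 if accepted else -1.0


crawl_results = focused_crawl(["a"], toy_fetch, toy_classify)
print(crawl_results)
```

In this toy run page "d" is never reached, because the only link to it sits on the rejected page "b"; that is the behaviour the tunneling strategy mentioned later is meant to soften.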

Focused Crawling (2) Key aspects: • the mathematical model and algorithm that are used for the classifier (e.g., Naive Bayes vs. SVM) • the feature set upon which the classifier makes its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms) • the quality of the training data

Focused Crawling (3) [Figure: crawl pipeline with Queue, Crawler and SVM Classifier; HITS link analysis yielding Hubs and Authorities; Re-Training of the SVM from Archetypes]

System Overview [Figure: architecture with the components WWW, Crawler, URL Queue, Docs, Document Analyzer, Feature Vectors, Feature Selection, Classifier, Ontology Index, Link Analyzer, Hubs & Authorities, Adaptive Re-Training, Training Docs, Bookmarks]

Part II System Components

Focus Manager Focusing strategies • Depth-first (df): • Breadth-first (bf): • Strong focus (learning phase) • Soft focus (harvesting phase) • Tunneling
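The slide names the strategies without defining them. The sketch below encodes one plausible reading and should be treated as an assumption throughout: strong focus accepts a page only if it is classified into the same topic as the page that linked to it, soft focus accepts a page classified into any topic, and tunneling lets the crawler pass through a bounded number of consecutive rejected pages. The function name and the `max_tunnel` parameter are hypothetical.

```python
def should_follow(parent_topic, doc_topic, mode, tunnel_depth=0, max_tunnel=0):
    """Decide whether to expand the links of a classified page.

    doc_topic is None when the classifier rejected the page.
    tunnel_depth counts consecutive rejected pages on the path so far.
    All rules here are assumptions about the strategies named on the slide.
    """
    if doc_topic is None:
        # Tunneling: tolerate a limited run of rejected pages.
        return tunnel_depth < max_tunnel
    if mode == "strong":
        return doc_topic == parent_topic
    if mode == "soft":
        return True
    raise ValueError(f"unknown focus mode: {mode!r}")


print(should_follow("A", "B", "strong"))  # topic changed under strong focus
print(should_follow("A", "B", "soft"))    # any accepted topic is fine
```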

Focus Manager (2) • Sample URL Prioritization
[Figure: link graph with nodes 1–10; documents annotated with confidence = 0.4 topic=A, confidence = 0.3 topic=A, confidence = 0.85 topic=A, confidence = 0.6 topic=B]
• DF strong order: 1–2–5–3–6–4–9–10 ..
• BF strong order: 1–2–5–3–4–6–9–10 ..
• DF soft order: 1–2–5–6–3–7–8–4–9–10 ..
• BF soft order: 1–2–5–3–6–4–7–8–9–10 ..

Feature Selection • Mutual Information (MI) criterion for a feature Xi and a topic Vj, computed from the following counts:
• A is the number of documents in Vj containing Xi
• B is the number of documents with Xi in "competitive" topics
• C is the number of documents in Vj without Xi
• N is the overall number of documents in Vj and its competitive topics
• Time complexity: O(n) + O(mk) for n documents, m terms and k competitive topics
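The MI formula itself did not survive in the transcript. As an illustration only, the counts A, B, C and N defined on the slide plug directly into the widely used estimate MI ≈ log(A·N / ((A+C)(A+B))); whether BINGO! uses exactly this form or a weighted variant is an assumption. The function names and the toy counts are hypothetical.

```python
import math


def mi_weight(a, b, c, n):
    """MI estimate log(A*N / ((A+C)*(A+B))) from the slide's counts.

    a: docs in the topic containing the term
    b: docs in competitive topics containing the term
    c: docs in the topic without the term
    n: all docs in the topic and its competitive topics
    The formula is an assumed, commonly used estimate.
    """
    if a == 0:
        return float("-inf")  # term never occurs in the topic
    return math.log((a * n) / ((a + c) * (a + b)))


def top_features(counts, n, k):
    """Return the k terms with the highest MI weight.

    counts maps term -> (A, B, C).
    """
    ranked = sorted(counts, key=lambda t: mi_weight(*counts[t], n), reverse=True)
    return ranked[:k]


# Invented counts: 10 docs in the topic, 10 in competitive topics.
toy_counts = {
    "sql": (10, 0, 0),   # occurs in every topic doc, nowhere else
    "the": (5, 5, 5),    # spread evenly, so it carries no signal
}
print(top_features(toy_counts, n=20, k=1))
```

A term that is evenly spread over the topic and its competitors gets weight 0, while a term concentrated in the topic gets a positive weight.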

Feature Selection (2) • Top features for the topic “DB Core Technology” with regard to tf*idf (left) and MI (right)

tf*idf term   score     MI term     weight
below         1.4927    storag      0.1428
et            1.2778    modifi      0.1258
graph         1.2446    sql         0.1209
involv        1.0406    disk        0.1179
accomplish    0.9491    pointer     0.1150
backup        0.8613    deadlock    0.1001
command       0.8567    redo        0.1001
exactli       0.8112    implement   0.0963
feder         0.7764    correctli   0.0911
histor        0.6822    size        0.0911

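Classifier • Input: n training vectors with components (x1, ..., xm, C) and C = +1 or C = -1 • Training: compute the separating hyperplane w·x + b = 0 with maximum margin δ • Classification: check whether w·y + b > 0 [Figure: hyperplane in the (x1, x2) plane separating the training vectors of V from those of ¬V, with margin δ on either side]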

Hierarchical Classification • Recursive classification along the taxonomy tree • Decisions are based on topic-specific feature spaces [Figure: taxonomy tree with the nodes ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Data Mining, XML, Web Retrieval; classifier confidences 0.8, -0.5, 0.2, 0.1 below ROOT and 0.2, -0.7, 0.4 on the next level]
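A top-down descent through the taxonomy can be sketched as follows. This is an illustration under assumptions, not the BINGO! code: at each node the document goes to the child with the highest positive confidence and stops where every child rejects it. The `score` callback stands in for the per-topic classifiers, and although the numbers echo those on the slide, their assignment to particular nodes is a guess.

```python
def classify_hierarchical(doc, children, score, root="ROOT"):
    """Return the path from the root to the deepest accepting topic.

    children maps a node to its child topics; score(topic, doc) returns a
    signed confidence, positive meaning "accepted" (assumed convention).
    """
    path = [root]
    node = root
    while True:
        best, best_conf = None, 0.0
        for child in children.get(node, []):
            conf = score(child, doc)
            if conf > best_conf:
                best, best_conf = child, conf
        if best is None:
            return path  # leaf reached, or all children rejected the doc
        path.append(best)
        node = best


# Hypothetical taxonomy and confidences for a single document.
TAXONOMY = {
    "ROOT": ["Semistructured Data", "DB Core Technology",
             "Networking", "Workflow and E-Services"],
    "Semistructured Data": ["Data Mining", "XML", "Web Retrieval"],
}
CONFIDENCE = {
    "Semistructured Data": 0.8, "DB Core Technology": -0.5,
    "Networking": 0.2, "Workflow and E-Services": 0.1,
    "Data Mining": 0.2, "XML": -0.7, "Web Retrieval": 0.4,
}

topic_path = classify_hierarchical("some document", TAXONOMY,
                                   lambda topic, doc: CONFIDENCE[topic])
print(topic_path)
```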

Link Analysis • Web graph G = (S, E) with adjacency matrix A • The HITS algorithm: iterative approximation of the dominant eigenvectors of A^T A (authority scores) and A A^T (hub scores)
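The iterative approximation can be illustrated with the usual HITS power iteration: authority scores sum the hub scores of in-linking pages, hub scores sum the authority scores of linked pages, and both vectors are renormalized each round. This is a minimal sketch on an edge list; the iteration count and the toy graph are assumptions.

```python
import math


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op for all zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum of authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Invented toy graph: h1 and h2 both point to a1, h1 also points to a2.
toy_edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
hub_scores, auth_scores = hits(toy_edges)
print(max(hub_scores, key=hub_scores.get), max(auth_scores, key=auth_scores.get))
```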

Retraining based on Archetypes • Two sources of potential archetypes: • Link analysis → the Nauth best authorities • SVM classifier → the Nconf best-rated docs • To avoid the "topic drift" phenomenon, the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.

Retraining (2)
if {at least one topic has more than Nmax positive documents
    or all topics have more than Nmin positive documents} {
  for each topic Vi {
    link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents;
    authorities(Vi) = top Nauth documents;
    sort docs of Vi in descending order of confidence;
    archetypes(Vi) = top Nconf from confidence ranking ∪ authorities(Vi);
    remove from archetypes(Vi) all docs with confidence < mean of the previous iteration;
    archetypes(Vi) = archetypes(Vi) ∪ bookmarks(Vi)
  };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi
  };
  re-initialize the URL queue using hubs(Vi)
}
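The archetype-selection step for a single topic can be sketched in a few lines. This is an illustration only: the function and parameter names are hypothetical, documents are plain string IDs, and the threshold follows the pseudocode (documents with confidence below the previous mean are dropped).

```python
def select_archetypes(confidence, authorities, bookmarks, prev_mean, n_conf):
    """Pick the training documents for the next iteration of one topic.

    confidence:  dict doc -> SVM confidence for this topic
    authorities: docs ranked best by link analysis (already cut to Nauth)
    bookmarks:   the user's original training docs, always kept
    prev_mean:   mean confidence of the previous iteration's training docs
    n_conf:      how many top-confidence docs to consider (Nconf)
    """
    by_conf = sorted(confidence, key=confidence.get, reverse=True)
    candidates = set(by_conf[:n_conf]) | set(authorities)
    # Guard against topic drift: drop candidates below the previous mean.
    kept = {d for d in candidates
            if confidence.get(d, float("-inf")) >= prev_mean}
    return kept | set(bookmarks)


# Invented example values.
toy_conf = {"d1": 1.3, "d2": 0.9, "d3": 0.4, "d4": 0.2}
archetypes = select_archetypes(toy_conf, authorities=["d3", "d4"],
                               bookmarks=["b1"], prev_mean=0.5, n_conf=2)
print(sorted(archetypes))
```

Here the two authorities fall below the previous mean and are filtered out, while the bookmark is retained regardless of confidence.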

Part III Evaluation

Testbed
• Bookmarks: homepages of researchers in the various areas
  • Leaf nodes were filled with 9–15 bookmarks
  • The total training data comprised 81 documents
• Focused crawl:
  • Crawling time: 6 h
  • Visited: 11,000 pages (1,800 hosts), link distances 1–7
  • 4,230 positively classified (675 different hosts)
  • Entire crawl: 7 iterations with re-training
• Parameters:
  • Nmin = 50, Nmax = 200
  • Nhub = 50, Nauth = 20, Nconf = 20
  • Feature selection: MI criterion, best 300 for each topic
  • Authority ranking: HITS algorithm

Crawling Precision

Crawling Precision (2)

Crawling Recall

Archetype Selection • Topic “Data Mining”:

URL                                                      SVM confidence
http://www.it.iitb.ernet.in/~sunita/it642/               1.35
http://www.research.microsoft.com/research/datamine/     1.31
http://www.acm.org/sigs/sigkdd/explorations/             1.28
http://robotics.stanford.edu/users/ronnyk/               1.24
http://www.kdnuggets.com/index.html                      1.18
http://www.wizsoft.com/                                  1.16
http://www.almaden.ibm.com/cs/people/ragrawal/           1.14
http://www.cs.sfu.ca/~han/DM_Book.html                   1.14
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html    1.14
http://www.cs.cornell.edu/johannes/publications.html     0.78

Archetype Selection (2)

Feature Selection • Topic “Data Mining”:

Feature     MI weight
mine        0.178
knowledg    0.122
olap        0.106
frame       0.086
pattern     0.066
genet       0.061
discov      0.053
miner       0.053
cluster     0.049
dataset     0.044

Future Work • Large-scale experiments (portal generator) • Annotation and semantic classification of HTML sources (e.g. transformation of HTML to XML for improved data management, detection of “information units”) • Advanced feature construction and feature selection algorithms • Fault tolerance on document collections with wrong samples, adaptive re-training • ... ?

Crawler • Key features: • asynchronous DNS lookups with caching • multiple download attempts • advanced duplicate recognition • following multiple redirects • advanced topic-balanced URL-queue • document filters for common datatypes • focusing strategies

Classifier (II) • Training: find the hyperplane w·x + b = 0 that separates the samples with maximum margin (a quadratic optimization task) • Classification: test an unlabeled vector y for w·y + b > 0 • Very efficient at runtime: O(m) per document
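The O(m) classification cost follows from the decision being a single dot product over the m features. The sketch below assumes a linear decision function of the form w·y + b with an already trained weight vector; `w`, `b` and the sample values are hypothetical, and using the distance from the hyperplane as the confidence is an assumption.

```python
import math


def svm_decide(w, b, y):
    """Classify feature vector y with a trained linear SVM (w, b).

    Returns (accepted, confidence), where confidence is the signed
    distance of y from the hyperplane w.x + b = 0.
    One pass over the m components, hence O(m).
    """
    score = sum(wi * yi for wi, yi in zip(w, y)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    confidence = score / norm if norm else 0.0
    return score > 0, confidence


# Invented weights and document vector.
accepted, confidence = svm_decide(w=[3.0, -4.0], b=0.5, y=[2.0, 1.0])
print(accepted, confidence)
```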

Related Work • General-purpose crawling • Focused crawling • Authority ranking • Classification of Web documents • Web ontologies
