Do Not Crawl In The DUST: Different URLs Similar Text Uri Schonfeld Department of Electrical Engineering Technion Joint Work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar
Talk Outline • Problem statement and motivation • Related work • Our contribution • The DustBuster algorithm • Experimental results • Concluding remarks
Even the WWW Gets Dusty • DUST – Different URLs Similar Text • Examples: • Standard canonization: • “http://domain.name/index.html” → “http://domain.name” • Domain names and virtual hosts: • “http://news.google.com” → “http://google.com/news” • Aliases and symbolic links: • “http://domain.name/~shuri” → “http://domain.name/people/shuri” • Parameters with little effect on content: • Print=1 • URL transformations: • “http://domain.name/story_” → “http://domain.name/story?id=”
DUST Rules! • DUST rule: transforms one URL to another • Example: “index.html” → “” (sketch below) • Valid DUST rule: r is a valid DUST rule w.r.t. site S if for every URL u ∈ S, • r(u) is a valid URL • r(u) and u have “similar” contents • Why similar and not identical? • Comments, news, text ads, counters
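In its simplest (alias) form, a DUST rule is just a substring substitution on the URL string. A minimal Python sketch, not part of the talk, of how such a rule might be represented and applied (the function name apply_rule is illustrative):

def apply_rule(url, alpha, beta):
    # Apply the alias DUST rule alpha -> beta to a URL.
    # Returns the transformed URL, or None if the rule is not applicable.
    if alpha not in url:
        return None
    # Substitute the first occurrence of alpha with beta
    return url.replace(alpha, beta, 1)

# Example: the rule "index.html" -> "" (the empty string)
print(apply_rule("http://domain.name/index.html", "index.html", ""))   # http://domain.name/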
DUST is Bad • Expensive to crawl • Access the same document via multiple URLs • Forces us to shingle • An expensive technique used to discover similar documents • Ranking algorithms suffer • References to a document split among its aliases • Multiple identical results • The same document is returned several times in the search results • Any algorithm based on URLs suffers
We Want To • Given: a list of URLs from a site S • Crawl log • Web server log • Want: to find valid DUST rules w.r.t. S • As many as possible • Including site-specific ones • Minimize number of fetches • Applications: • Site-specific canonization • More efficient crawling
How do we Fight DUST Today? (1) Standard Canonization • Domain name aliases • Standard extensions • Default file names: index.html, default.htm • File path canonizations: “dirname/../” → “”, “//” → “/” • Escape sequences: “%7E” → “~”
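As a concrete illustration, a few of the standard canonizations listed above can be expressed with Python's standard library. This is a simplified sketch (the rule set and the function name standard_canonize are illustrative, not a complete canonizer):

from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath
import re

def standard_canonize(url):
    # Apply a handful of standard, site-independent canonizations
    scheme, netloc, path, query, frag = urlsplit(url)
    netloc = netloc.lower()                                    # host names are case-insensitive
    path = unquote(path)                                       # escape sequences: "%7E" -> "~"
    path = re.sub(r"/+", "/", path)                            # "//" -> "/"
    path = posixpath.normpath(path) if path else "/"           # "dirname/../" -> ""
    path = re.sub(r"/(index\.html|default\.htm)$", "/", path)  # default file names
    return urlunsplit((scheme, netloc, path, query, frag))

print(standard_canonize("http://Domain.Name//a/../%7Eshuri/index.html"))   # http://domain.name/~shuri/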
Standard Canonization is not Enough • Site-specific DUST: • “story_” → “story?id=” • “news.google.com” → “google.com/news” • “labs” → “laboratories” • This DUST is harder to find
How do we Fight DUST Today? (2) Shingles • Shingles are document sketches [Broder, Glassman, Manasse 97] • Used to compare documents for similarity • Pr(shingles are equal) = document similarity • Compare documents by comparing shingles • Calculating a shingle (sketch below): • Take all m-word sequences • Hash each with a hash function hi • Choose the minimum • That's your shingle
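A minimal Python sketch of the min-hash shingling idea just described: hash every m-word window with a seeded hash function and keep the minimum. The choice of MD5, m = 4, 84 hash functions, and the 0.9 threshold are illustrative, not the talk's exact parameters:

import hashlib

def shingle(text, m=4, seed=0):
    # Min-hash over all m-word windows of the document
    words = text.split()
    if len(words) < m:
        return None
    best = None
    for i in range(len(words) - m + 1):
        window = " ".join(words[i:i + m])
        h = int(hashlib.md5(f"{seed}:{window}".encode()).hexdigest(), 16)
        best = h if best is None else min(best, h)
    return best

def similar(doc_a, doc_b, num_hashes=84, threshold=0.9):
    # Estimate similarity as the fraction of seeds on which the min-hashes agree
    matches = sum(shingle(doc_a, seed=s) == shingle(doc_b, seed=s) for s in range(num_hashes))
    return matches / num_hashes >= threshold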
Shingles are Not Perfect • Shingles are expensive: • Require a fetch • Parsing • Hashing • Shingles do not find rules • Therefore, not applicable to new pages
More Related Work • Mirror detection [Bharat,Broder 99], [Bharat,Broder,Dean,Henzinger 00], [Cho,Shivakumar,Garcia-Molina 00], [Liang 01] • Identifying plagiarized documents [Hoad,Zobel 03] • Finding near-replicas [Shivakumar,Garcia-Molina 98], [Di Iorio,Diligenti,Gori,Maggini,Pucci 03] • Copy detection [Brin,Davis,Garcia-Molina 95], [Garcia-Molina,Gravano,Shivakumar 96], [Shivakumar,Garcia-Molina 96]
Our Contributions • An algorithm that • finds site-specific valid DUST rules • requires a minimal number of fetches • Convincing results in experiments • Benefits to crawling
Types of DUST • Alias DUST: simple substring substitutions • “story_1259” → “story?id=1259” • “news.google.com” → “google.com/news” • “/index.html” → “” • Parameter DUST: • Standard URL structure: protocol://domain.name/path/name?para=val&pa=va • Some parameters do not affect content: • Can be removed • Can be changed to a default value (sketch below)
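Parameter DUST rules operate on the query string rather than on raw substrings. A small Python sketch, under the assumption that the parameter is already known not to affect content (the function names drop_param and pin_param are illustrative):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    # Remove a parameter that does not affect the content (e.g. Print=1)
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(query)))

def pin_param(url, name, default):
    # Replace every value of a parameter with a single default value
    parts = urlsplit(url)
    query = [(k, default if k == name else v) for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(drop_param("http://domain.name/story?id=1259&Print=1", "Print"))   # http://domain.name/story?id=1259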
Our Basic Framework • Input: URL list • Detect likely DUST rules • Eliminate redundant rules • Validate DUST rules using samples: • Eliminate DUST rules that are “wrong” • Further eliminate duplicate DUST rules • (No Fetch Required)
How to detect likely DUST rules? • Large support principle: Likely DUST rules have lots of “evidence” supporting them • Small buckets principle: Ignore evidence that supports many different rules
Large Support Principle • A pair of URLs (u,v) is an instance of rule r if r(u) = v • Support(r) = the set of all instances (u,v) of r • Large Support Principle: The support of a valid DUST rule is large
Rule Support: An Equivalent View • α: a string • Ex: α = “story_” • u: a URL that contains α as a substring • Ex: u = “http://www.sitename.com/story_2659” • Envelope of α in u: • A pair of strings (p,s) • p: prefix of u preceding α • s: suffix of u succeeding α • Example: p = “http://www.sitename.com/”, s = “2659” • E(α): all envelopes of α in URLs that appear in the input URL list
Rule Support: An Equivalent View • α → β: an alias DUST rule • Ex: α = “story_”, β = “story?id=” • Lemma: |Support(α → β)| = |E(α) ∩ E(β)| • Proof: • bucket(p,s) = { α | (p,s) ∈ E(α) } • Observation: (u,v) is an instance of α → β if and only if u = p∘α∘s and v = p∘β∘s for some (p,s) • Hence, (u,v) is an instance of α → β iff (p,s) ∈ E(α) ∩ E(β)
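To make the envelope/bucket view concrete, here is a Python sketch of building buckets from a URL list and computing support via the lemma. It scans raw character substrings up to a bounded length; DustBuster itself works with tokenized URLs and additional filters, so this is only an illustration:

from collections import defaultdict

def build_buckets(urls, max_len=20):
    # Map each envelope (p, s) to the set of substrings alpha that appear inside it
    buckets = defaultdict(set)
    for u in urls:
        for i in range(len(u)):
            for j in range(i + 1, min(i + 1 + max_len, len(u) + 1)):
                alpha = u[i:j]
                envelope = (u[:i], u[j:])   # (p, s) such that u = p + alpha + s
                buckets[envelope].add(alpha)
    return buckets

def support(buckets, alpha, beta):
    # |Support(alpha -> beta)| = number of envelopes containing both alpha and beta
    return sum(1 for subs in buckets.values() if alpha in subs and beta in subs)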
Large Buckets • Often there is a large set of substrings that are interchangeable within a given URL while not being DUST: • page=1, page=2, … • lecture-1.pdf, lecture-2.pdf • This gives rise to large buckets.
Small Buckets Principle (“I am a DUCK, not DUST”) • Big buckets: • A popular prefix and suffix • Often do not contain similar content • Big buckets are expensive to process • Small Buckets Principle: Most of the support of valid alias DUST rules is likely to belong to small buckets
Algorithm – Detecting Likely DUST Rules (No Fetch here!) • Scan the log and form buckets • Ignore big buckets • For each small bucket: • For every two substrings α, β in the bucket • print (α, β) • Sort by (α, β) • For every pair (α, β): • Count its occurrences • If (count > threshold) print α → β (see the sketch below)
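A Python sketch of this detection pass, using the buckets from the previous sketch; the bucket-size and support thresholds are illustrative placeholders:

from collections import Counter

def detect_likely_rules(buckets, max_bucket_size=6, min_support=3):
    # Small-buckets detection: count, for each ordered pair (alpha, beta),
    # how many small buckets contain both, and keep pairs with enough support
    pair_counts = Counter()
    for substrings in buckets.values():
        if len(substrings) > max_bucket_size:      # ignore big buckets
            continue
        for a in substrings:
            for b in substrings:
                if a != b:
                    pair_counts[(a, b)] += 1
    return [(a, b, c) for (a, b), c in pair_counts.items() if c >= min_support]

The in-memory Counter plays the role of the print/sort/count steps on the slide; sorting the printed pairs on disk achieves the same tally without holding all pairs in memory.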
Size and Comments • Consider only instances of rules whose documents' sizes “match” • Use ranges of sizes • Running time: O(|L| log |L|), where L is the URL list • Process only short substrings • Tokenize URLs
Our Basic Framework • Input: URL list • Detect likely DUST rules • Eliminate redundant rules • Validate DUST rules using samples: • Eliminate DUST rules that are “wrong” • Further eliminate duplicate DUST rules • (No Fetch Required)
Eliminating Redundant Rules • Rule φ refines rule ψ if SUPPORT(φ) ⊆ SUPPORT(ψ) • Example: “/vlsi/” → “/labs/vlsi/” refines “/vlsi” → “/labs/vlsi” • Lemma: A substitution rule α’ → β’ refines rule α → β if and only if there exists an envelope (γ,δ) such that α’ = γ∘α∘δ and β’ = γ∘β∘δ • The lemma helps us identify refinements easily • φ refines ψ? Remove ψ if their supports match • (No Fetch here!)
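The lemma makes refinement a purely syntactic test. A small Python sketch (the function name refines is illustrative):

def refines(alpha2, beta2, alpha, beta):
    # alpha2 -> beta2 refines alpha -> beta iff there exists an envelope (gamma, delta)
    # with alpha2 = gamma + alpha + delta and beta2 = gamma + beta + delta
    for i in range(len(alpha2) - len(alpha) + 1):
        if alpha2[i:i + len(alpha)] != alpha:
            continue
        gamma, delta = alpha2[:i], alpha2[i + len(alpha):]
        if beta2 == gamma + beta + delta:
            return True
    return False

print(refines("/vlsi/", "/labs/vlsi/", "/vlsi", "/labs/vlsi"))   # True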
Validating Likely Rules • For each likely rule r, in both directions: • Find sample URLs from the list to which r is applicable • For each URL u in the sample: • v = r(u) • Fetch u and v • Check whether content(u) is similar to content(v) • If the fraction of similar pairs > threshold: • Declare rule r valid (sketch below)
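A sketch of that validation loop in Python. Here fetch and similar stand in for an HTTP fetch and a shingle-based comparison, rule is assumed to expose applies_to and apply methods, and the sample size and threshold are placeholders:

import random

def validate_rule(rule, urls, fetch, similar, sample_size=100, threshold=0.95):
    # Validate a likely DUST rule on a random sample of applicable URLs
    applicable = [u for u in urls if rule.applies_to(u)]
    sample = random.sample(applicable, min(sample_size, len(applicable)))
    positive = tested = 0
    for u in sample:
        v = rule.apply(u)
        doc_u, doc_v = fetch(u), fetch(v)
        if doc_u is None:          # could not fetch u: not evidence either way
            continue
        tested += 1
        if doc_v is not None and similar(doc_u, doc_v):
            positive += 1
    return tested > 0 and positive / tested >= threshold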
Comments About Validation • Assumption: • If a rule passes the threshold on a sample of 100 URLs, it would also pass on any larger sample • Why isn’t the threshold 100%? • A 95% valid rule may still be worth it • Dynamic pages change often
Experimental Setup • We experiment on logs of two web sites: • Dynamic forum • Academic site • Rules were detected from a log of about 20,000 unique URLs • On each site we used four logs from different time periods
Recall • How much of the DUST do we find? • What other duplicates are there? • Soft errors • True copies: • Last semester's course • All authors of a paper • Frames • Image galleries
DUST Distribution • In a crawl we examined, 18% of the crawl was reduced. • Duplicate breakdown: 47.1% DUST, 25.7% Images, 17.9% Exact Copy, 7.6% Soft Errors, 1.8% misc.
Conclusions DustBuster is an efficient algorithm • Finds DUST rules • Can reduce a crawl • Can benefit ranking algorithms
So Far, Rules are Non-Directional • Prefer shrinking rules • Prefer lexicographically decreasing rules • Check those directions first
Parametric DUST • Parameter name and possible values • What rules: • Remove the parameter • Substitute one value with another • Substitute all values with a single value • These rules are validated the same way the alias rules are • We will not discuss them further
False Rules • Unfortunately, we see a lot of “wrong” rules • Substituting “1” with “2” • Just wrong: • One domain name substituted with another that runs similar software • Examples of false rules: • /YoninaEldar/ != /DavidMalah/ • /labs/vlsi/oldsite != /labs/vlsi • -2. != -3.
Filtering out False Rules • Getting rid of the big buckets • Using the size field: • False DUST rules: • May give valid URLs • But the content is not similar • The size is probably different • Size ranges are used (sketch below) • Tokenization helps
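One way the size filter might look in code: map each recorded document size to a coarse range and count a (u, v) pair as evidence only when both fall in the same range. The range boundaries below are purely illustrative:

def size_range(size_bytes, bounds=(1024, 4096, 16384, 65536)):
    # Map a document size to a coarse range index
    for i, bound in enumerate(bounds):
        if size_bytes <= bound:
            return i
    return len(bounds)

def sizes_match(size_u, size_v):
    # Count (u, v) as supporting evidence only if the sizes fall in the same range
    return size_range(size_u) == size_range(size_v)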
DustBuster – Cleaning Up the Rules • Go over the list with a window • If rule a refines rule b and their support sizes are close: • Keep only rule a
DustBuster – Validation • Validation is per rule • Get sample URLs • URLs to which the rule can be applied • Apply the rule: URL → applied URL • Get the content • Compare using shingles
DustBuster – Validation • Stop fetching when: • #failures > 100 * (1 - threshold) • A page that doesn't exist is not similar to anything else • Why use a threshold < 100%? • Shingles are not perfect • Dynamic pages may change a lot, and quickly
Detect Alias DUST – Take 2 • Tokenize, of course • Form buckets • Ignore big buckets • Count support only if sizes match • Don't count long substrings • Results are cleaner
Eliminate Redundancies (No Fetch here!)

EliminateRedundancies(pairs_list R):
  for i = 1 to |R| do
    if (R[i] already eliminated) continue
    to_eliminate_current := false
    /* Go over a window */
    for j = 1 to min(MW, |R| - i) do
      /* Support not close? Stop checking */
      if (R[i].size - R[i+j].size > max(MRD * R[i].size, MAD)) break
      /* a refines b? Remove b */
      if (R[i] refines R[i+j])
        eliminate R[i+j]
      else if (R[i+j] refines R[i])
        to_eliminate_current := true
        break
    if (to_eliminate_current)
      eliminate R[i]
  return R
Validate a Single Rule

ValidateRule(R, L):
  positive := 0
  negative := 0
  /* Stop when you are sure you either succeeded or failed */
  while (positive < (1 - ε)N AND negative < εN) do
    u := a random URL from L to which R is applicable
    v := outcome of applying R to u
    fetch u and v
    if (fetch of u failed) continue
    /* Something went wrong: negative sample */
    if (fetch of v failed) OR (shingling(u) ≠ shingling(v))
      negative := negative + 1
    /* Another positive sample */
    else
      positive := positive + 1
  if (negative ≥ εN)
    return FALSE
  return TRUE
Validate Rules

Validate(rules_list R, test_log L):
  create an empty list of rules LR
  for i = 1 to |R| do
    /* Check against rules that already survived (valid rules) */
    for j = 1 to i - 1 do
      if (R[j] was not eliminated AND R[i] refines R[j])
        eliminate R[i] from the list
        break
    if (R[i] was eliminated)
      continue
    /* Test one direction */
    if (ValidateRule(R[i].alpha → R[i].beta, L))
      add R[i].alpha → R[i].beta to LR
    /* Test the other direction only if the first direction failed */
    else if (ValidateRule(R[i].beta → R[i].alpha, L))
      add R[i].beta → R[i].alpha to LR
    else
      eliminate R[i] from the list
  return LR