Information retrieval 2019/2020
crawler • also called web crawler, Web spider, Web robot • Starts from one/several source URLs • Stores documents in a cache / retrieved data • Looks for new URLs within the documents • Stores new URLs on the stack • Visits the next URL (recursively / from the stack)
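The steps above can be sketched as a small loop. This is a minimal illustration, not production code: `fetch` and `extract_links` are hypothetical callables standing in for the HTTP client and HTML parser.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch, extract_links, max_pages=100):
    """Crawl sketch: fetch a page, store it, enqueue its new URLs."""
    frontier = deque(seeds)          # the stack/queue of URLs to visit
    seen = set(seeds)
    cache = {}                       # url -> retrieved document
    while frontier and len(cache) < max_pages:
        url = frontier.popleft()     # popleft() = breadth-first; pop() = depth-first
        doc = fetch(url)             # hypothetical fetch function (assumption)
        cache[url] = doc             # store the document / retrieved data
        for link in extract_links(doc):
            absolute = urljoin(url, link)
            if absolute not in seen: # only *new* URLs go on the stack
                seen.add(absolute)
                frontier.append(absolute)
    return cache
```

Swapping `popleft()` for `pop()` turns the breadth-first crawl into a depth-first one, matching the two strategies on the next slides.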
example Hyperlinks are underlined Depth-first: 1,3,2,4,5,6 Breadth-first: 1,3,6,4,2,5
strategies • Breadth-first • Depth-first • Partial PageRank • Restrictions: • Max number of downloaded pages • Max depth • Max time • Document types • Selected domains • Restricted URLs – based on regexps • Download only static documents
crawling policies • selection policy • Which pages should be downloaded • re-visit policy • When to visit a page again • politeness policy • Do not irritate your colleagues • parallelization policy • How to perform a parallel crawl
selection policy • breadth-first • Most used? • Pages with high PageRank tend to be visited early • Can be improved by partial PageRank • backlink count • Number of links pointing to the page • partial PageRank • Computed from the already collected URLs • OPIC (On-line Page Importance Computation) • each page is given an initial sum of "cash" which is distributed equally among the pages it points to
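The OPIC cash distribution can be sketched in a few lines. This is a simplified illustration of one iteration only: pages without out-links keep their cash here, whereas the original algorithm redistributes it (e.g., via a virtual page).

```python
def opic_step(cash, out_links):
    """One OPIC iteration sketch: every page hands its current "cash"
    in equal shares to the pages it links to. Pages with no out-links
    keep their cash (a simplification of the original scheme)."""
    new_cash = dict.fromkeys(cash, 0.0)
    for page, amount in cash.items():
        targets = out_links.get(page, [])
        if targets:
            share = amount / len(targets)    # equal share per out-link
            for t in targets:
                new_cash[t] = new_cash.get(t, 0.0) + share
        else:
            new_cash[page] += amount         # sink page keeps its cash
    return new_cash
```

Note that the total amount of cash is conserved across iterations; the crawler prioritizes pages that have accumulated the most cash.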
deep web • Sometimes dynamic pages (?&…) • Sometimes reachable only through search: • No links pointing to the site • Sitemaps • …
re-visit policy • uniform • we synchronize all elements at the same rate, regardless of how often they change; that is, all elements are synchronized at the same frequency • proportional • we synchronize element e with a frequency f that is proportional to its change frequency λ • freshness of copy • freshness is the fraction of the local database that is up-to-date • "Best strategy" – based on the domain • (weighted) proportional + ignoring highly dynamic pages
re-visit policy • Junghoo Cho and Hector Garcia-Molina. 2003. Effective page refresh policies for Web crawlers. ACM Trans. Database Syst. 28, 4 (December 2003), 390-426. • "we prove that the uniform policy is better than the proportional policy under any distribution of λ values" • more than 20% of pages had changed whenever we visited them • more than 40% of pages in the com domain changed every day • pages in the edu and gov domains are very static
politeness policy • Network resources – crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time • Server overload – especially if the frequency of accesses to a given server is too high • Poorly written crawlers can crash servers or routers • Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers
politeness policy • Time interval between requests • Identification – User-Agent HTTP request header • Crawler trap • "a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash." https://fleiner.com/bots/
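The two practical points above (a per-host time interval, and identification via User-Agent) can be sketched as follows. The delay value and the `CourseCrawler` identity string are illustrative assumptions, not prescribed values.

```python
import time
from urllib.parse import urlparse
from urllib.request import Request, urlopen

CRAWL_DELAY = 1.0     # seconds between requests to one host (assumed value)
USER_AGENT = "CourseCrawler/0.1"   # hypothetical crawler identification

_last_visit = {}      # host -> time of our last request to that host

def polite_fetch(url, opener=urlopen):
    """Sleep until the per-host interval has passed, then fetch the URL
    with an identifying User-Agent header."""
    host = urlparse(url).netloc
    wait = _last_visit.get(host, float("-inf")) + CRAWL_DELAY - time.monotonic()
    if wait > 0:
        time.sleep(wait)                 # do not hammer the same server
    _last_visit[host] = time.monotonic()
    request = Request(url, headers={"User-Agent": USER_AGENT})
    return opener(request)
```

In a real crawler the delay would typically also respect the site's robots directives and be enforced per worker.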
crawler trap • http://example.com/bar/foo/bar/foo/bar/foo/bar/... • dynamic pages with an infinite number of pages (e.g., a calendar) • http://www.example.org/calendar/events?&page=1&mini=2015-09&mode=week&date=2021-12-04 • extremely long pages (a lot of text causing the lexical analyzer to crash) • …
parallelization policy • Dynamic assignment • A central server balances the load and assigns URLs • A small crawler configuration: a central DNS resolver and central queues per Web site, with distributed downloaders • A large crawler configuration: the DNS resolver and the queues are also distributed • Static assignment • Nodes inform the others which pages have been downloaded • A hash of the URL/website determines the node
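Static assignment by hashing can be sketched in a few lines. Hashing the host (rather than the full URL) is one common choice, since it keeps every page of a site on the same node, which also makes per-site politeness easy; the function name is illustrative.

```python
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Static assignment sketch: hash the URL's host so that all pages
    of one website are crawled by the same node."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes
```

A stable hash like MD5 is used instead of Python's built-in `hash()`, which is randomized per process and would assign differently on each node.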
problem of similar sources • URL normalization, hashes, page fingerprints • Exactly identical content is rare • The crawler tries to detect site differences and makes a decision
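URL normalization can be sketched as follows. Which transformations are safe is site-dependent; the ones below (lowercasing scheme and host, dropping the default port and fragment, sorting query parameters, trimming a trailing slash) are common but are an assumed selection, not a complete canonicalization.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """URL normalization sketch: map syntactic variants of one URL to a
    single canonical form so duplicates hash identically."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                       # drop the default port
    query = urlencode(sorted(parse_qsl(parts.query)))  # order-independent query
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, netloc, path, query, ""))  # drop fragment
```

Two URLs that normalize to the same string can then share one hash/fingerprint entry in the crawler's seen-set.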
crawler vs. scraper https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping
parsing complications • What format is it in? • pdf/word/excel/html? • What language is it in? • What character set encoding is in use? • Each of these is a classification problem, which we will study later in the course • But these tasks are often done heuristically: • The classification is predicted with simple rules • Example: "if there are many occurrences of 'the', then it is English"
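The "many 'the' → English" rule from the slide can be written out as a toy classifier. The word list and the 5% threshold are arbitrary assumptions chosen only to illustrate the heuristic.

```python
def looks_english(text):
    """Toy heuristic from the slide: if common English function words
    are frequent in the text, predict English."""
    words = text.lower().split()
    if not words:
        return False
    stopwords = {"the", "and", "of", "to", "in"}   # assumed word list
    hits = sum(1 for w in words if w in stopwords)
    return hits / len(words) > 0.05                # assumed threshold
```

Such rules are fast but brittle; the n-gram approach discussed later in these slides is the more robust alternative.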
parsing complications • Documents being indexed can include docs from many different languages • A single index may have to contain terms of several languages • Sometimes a document or its components can contain multiple languages/formats • French email with a German pdf attachment
segmentation • Header, • Footer, • Menu and navigation, • Main content. • Sentences, • Paragraphs, • Bullets, • Chapters with headline.
emails segmentation • Header, • Email text, • Replied or forwarded text, • Attachments, • Signature.
segmentation approaches • Statistical approaches • No. of words and links compared to other segments • Machine learning • Supervised learning • Feature engineering • Patterns • Regexps, trees, graphs… • Visual approaches
segmentation approaches https://www.ics.uci.edu/~lopes/teaching/cs221W15/slides/WebCrawling.pdf
to text conversion • HTML: NekoHTML • http://nekohtml.sourceforge.net/ • DOC: MS Word – Apache POI • http://poi.apache.org/ • PDF: on Linux – pdftotext; Java – PDFBox • http://pdfbox.apache.org/ • Emails: eml format, mail server, Thunderbird (not MS Outlook), JavaMail library • http://www.oracle.com/technetwork/java/javamail/index.html • Apache Tika • Unified API
tokenization • (Garabík et al., 2004): A token is an arbitrary unit of text that extends the linguistic notion of a word. In automatic text segmentation, a token is any string of characters between two whitespace characters, including individual punctuation marks, which need not be separated by whitespace from the preceding or following token. Formally, a text thus consists of tokens and whitespace.
tokenization • Input: "Friends, Romans and Countrymen" • Output: Tokens • Friends • Romans • and • Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit?
tokenization • Issues in tokenization: • Finland's capital → • Finland? Finlands? Finland's? • Hewlett-Packard → Hewlett and Packard • as two tokens? • state-of-the-art: break up the hyphenated sequence? • co-education • lowercase, lower-case, lower case? • San Francisco: one token or two? • How do you decide it is one token?
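One way to sidestep the hyphen dilemma above is to index both the compound and its parts, so that queries for either form match. The sketch below shows that policy; the ASCII-only pattern and the function name are simplifying assumptions.

```python
import re

def tokenize(text, split_hyphens=True):
    """Tokenizer sketch: lowercase words, keep hyphenated compounds,
    and optionally also emit their parts (one possible policy for the
    Hewlett-Packard / co-education problem)."""
    tokens = []
    # ASCII letters only, for simplicity; real tokenizers handle Unicode
    for raw in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        word = raw.lower()
        tokens.append(word)                  # the compound itself
        if split_hyphens and "-" in word:
            tokens.extend(word.split("-"))   # plus its parts
    return tokens
```

Indexing both forms trades index size for recall, which is exactly the trade-off the next slide discusses.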
general idea • If you produce 2 tokens (e.g., splitting words with hyphens), then queries containing only one of the two tokens will match • Ex1. Hewlett-Packard – a query for "packard" will retrieve documents about "Hewlett-Packard". OK? • Ex2. San Francisco – a query for "francisco" will match docs about "San Francisco". OK? • If you produce 1 token, then a query containing only one of the two possible tokens will not match • Ex3. co-education – a query for "education" will not match docs about "co-education"
numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces (but we should not split the token) • Older IR systems may not index numbers • But often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately • Creation date, format, etc.
Lucene analysis / tokenization http://lucene.apache.org/ • XY&Z Corporation – xyz@example.com • WhitespaceAnalyzer • [XY&Z] [Corporation] [–] [xyz@example.com] • SimpleAnalyzer – kills numbers • [XY] [Z] [corporation] [xyz] [example] [com] • StopAnalyzer • [XY] [Z] [corporation] [xyz] [example] [com] • StandardAnalyzer • [XY&Z] [corporation] [xyz@example.com]
Elastic analysis / tokenization https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html • "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." • Standard analyzer • the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone • Simple analyzer • the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone • Stop analyzer • quick, brown, foxes, jumped, over, lazy, dog, s, bone • Pattern analyzer • regexp
lexical analysis • [cesta ~ WORD]; • [9 ~ NUMBER]; • [, ~ COMMA]; • [1.2.2005 ~ DATE]; • [www.fiit.stuba.sk ~ LINK] • CIT je ... pracovisko ... zriadené k 1.2.2005 (Slovak: "CIT is ... a department ... established on 1 Feb 2005") • [cit ~ WORD]; [je ~ WORD]; [pracovisko ~ WORD]; [zriadené ~ WORD]; [k ~ WORD]; [1.2.2005 ~ DATE]
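Tagging tokens with lexical classes like the ones above is commonly done with an ordered list of regular expressions. This is a sketch under the assumption that the slide's classes (DATE, LINK, NUMBER, WORD, punctuation) are the full inventory; real lexers have many more.

```python
import re

# Order matters: more specific patterns (DATE, LINK) must come before
# the generic NUMBER and WORD patterns.
TOKEN_SPEC = [
    ("DATE",   r"\d{1,2}\.\d{1,2}\.\d{4}"),        # 1.2.2005
    ("LINK",   r"(?:www\.)?[\w-]+(?:\.[\w-]+){2,}"),  # www.fiit.stuba.sk
    ("NUMBER", r"\d+"),
    ("WORD",   r"[^\W\d_]+"),                       # Unicode letters
    ("PUNCT",  r"[.,;:!?]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Return (token, class) pairs, as in the slide's examples."""
    return [(m.group(), m.lastgroup) for m in LEXER.finditer(text)]
```

Because Python's alternation tries patterns left to right, `1.2.2005` is tagged DATE even though it would also match the LINK pattern.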
lexical tags to terms • compound words (one term or several) • synonym insertion (notebook, laptop) • spell correction • not applied to documents • applied when users interact • necessary when queries are free text • documents without punctuation (sms, chat, emails)
language issues • French • L'ensemble – one token or two? • L? L'? Le? • Want l'ensemble to match with un ensemble • Until at least 2003, it didn't on Google • Internationalization!
language issues • German noun compounds are not segmented • Lebensversicherungsgesellschaftsangestellter • 'life insurance company employee' • German retrieval systems benefit greatly from a compound splitter module • Can give a 15% performance boost for German
language issues • Chinese and Japanese have no spaces between words: • 莎拉波娃现在居住在美国东南部的佛罗里达。 • Not always guaranteed a unique tokenization • Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji) • Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right, so the reading direction alternates (← → ← → ←, starting from the right) • Words are separated, but letter forms within a word form complex ligatures • "Algeria achieved its independence in 1962 after 132 years of French occupation." • With Unicode, the surface presentation is complex, but the stored form is straightforward
language detection • statistical approaches • N-grams
Sample text (Slovak): Na Slovensko mieri obrie lietadlo. Keď sme ho objednávali, panovala vo svete úplne iná situácia. „Predpokladáme, že budeme potrebovať premiestňovať jednotky na väčšie vzdialenosti. (...) Viete, že SR je aktívna v niekoľkých misiách, operáciách už aj v súčasnosti. Tieto potrebujeme neustále zásobovať, prepravovať ľudí, rotovať.“ Výrok zaznel z úst niekdajšieho ministra obrany Martina Fedora koncom mája 2006. Práve vtedy predvádzali zahraniční výrobcovia na vojenskom letisku Kuchyňa na Záhorí veľké dopravné lietadlá, z ktorých si malo Slovensko vybrať náhradu za dosluhujúce stroje Antonov. Z ponuky sme si napokon vybrali dve lietadlá Spartan C-27J. Prvé z nich by malo prísť nasledujúci mesiac – viac než 11 rokov od propagačnej akcie v Kuchyni. Medzičasom sa zmenila situácia vo vzdialenom Iraku a aj v ešte vzdialenejšom Afganistane. Využijeme ešte vôbec objednané lietadlá? Ministerstvo má jasnú odpoveď. Milióny a miliardy. Za prvé lietadlo Spartan talianskej firmy Alenia Aermacchi sme mali podľa dohody zaplatiť 34,5 milióna eur, ďalších 25 miliónov eur si mala vyžiadať podpora a výcvik. Keďže výrobca s dodávkou mešká, môžeme žiadať kompenzácie. Lietadlo by malo doraziť v čase, keď sa u nás diskutuje o omnoho väčších nákupných plánoch v armáde. Na obnovu vojenskej techniky by chcel rezort miliardy eur. Do akej miery sú plány reálne, by sa malo ukázať už onedlho pri predstavovaní verejného rozpočtu na nasledujúce roky. K dodávke Spartanu sa nedávno vyjadril náčelník generálneho štábu ozbrojených síl Milan Maxim. „Určite neostane bez využitia,“ ubezpečoval na stretnutí s novinármi.
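N-gram language detection of texts like the Slovak sample above can be sketched with character n-gram rank profiles and an out-of-place distance, in the style of Cavnar and Trenkle. The profile size, n=3, and the tiny reference texts in the example are illustrative assumptions; real systems train profiles on large corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=200):
    """Rank-ordered character n-gram profile of a text."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(doc_profile, ref_profile):
    """Out-of-place distance: sum of rank differences; n-grams missing
    from the reference profile get a maximum penalty."""
    penalty = len(ref_profile)
    return sum(abs(i - ref_profile.index(g)) if g in ref_profile else penalty
               for i, g in enumerate(doc_profile))

def detect_language(text, references):
    """Pick the reference language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(references, key=lambda lang: profile_distance(doc, references[lang]))
```

Because the distance is computed over short character sequences, this works even for brief inputs where word-based heuristics fail.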
normalization to terms • We need to “normalize” words in indexed text as well as query words into the same form • We want to match U.S.A. and USA • Result is a term: a term is a (normalized) word type, which is an entry in our IR system dictionary • We define equivalence classes of terms by, e.g., • deleting periods to form a term • U.S.A., USA ∈ [USA] • deleting hyphens to form a term • anti-discriminatory, antidiscriminatory ∈[antidiscriminatory]
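The two equivalence-classing rules above (deleting periods and hyphens) can be sketched as one small function; lowercasing is added here as an assumption so that U.S.A. and usa land in the same class.

```python
def normalize_term(word):
    """Equivalence-classing sketch from the slide: delete periods
    (U.S.A. -> usa) and hyphens (anti-discriminatory ->
    antidiscriminatory), then lowercase."""
    return word.replace(".", "").replace("-", "").lower()
```

The same function must be applied to both indexed text and query words, otherwise the classes of documents and queries drift apart.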
other languages • Accents: e.g., French résumé vs. resume • Umlauts: e.g., German: Tuebingen vs. Tübingen • Should be equivalent • Most important criterion: • How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them • Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen ∈ [Tubingen]
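Normalizing to a de-accented term can be sketched with Unicode decomposition. Note this only handles the accent case (Tübingen → tubingen); collapsing the transliteration Tuebingen → tubingen is language-specific and deliberately left out of this sketch.

```python
import unicodedata

def deaccent(term):
    """De-accenting sketch: decompose to NFD, drop combining marks
    (résumé -> resume, Tübingen -> tubingen), and lowercase. German
    digraph transliterations such as ue -> u are NOT handled here."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()
```

As with the slide's criterion, whether to apply this depends on how users actually type their queries.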
other languages • Tokenization and normalization may depend on the language and so are intertwined with language detection • Is this German "mit"? – Morgen will ich in MIT … ("Tomorrow I want [to go] to MIT …") • Crucial: need to "normalize" indexed text as well as query terms identically
case folding • Reduce all letters to lower case • exception: upper case in mid-sentence? • e.g., General Motors • Fed vs. fed • SAIL vs. sail • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… • Longstanding Google example: [fixed in 2011…] • Query C.A.T. • #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
normalization to terms • How do we handle synonyms and homonyms? • E.g., by hand-constructed equivalence classes • car = automobile, color = colour • We can rewrite to form equivalence-class terms • When the document contains automobile, index it under car-automobile (and vice versa) • Or we can expand a query • When the query contains automobile, look under car as well • What about spelling mistakes? • One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
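The classic Soundex heuristic mentioned above is concrete enough to sketch directly: keep the first letter, encode the remaining consonants as digits, collapse adjacent duplicates, drop vowels, and pad to four characters. The sketch assumes alphabetic ASCII input.

```python
# Consonant-to-digit table of the classic Soundex algorithm.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4",
    **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex(word):
    """Classic Soundex: first letter + consonant codes with adjacent
    duplicates collapsed, vowels dropped, padded/truncated to length 4."""
    word = word.lower()
    first = word[0].upper()
    codes = []
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            codes.append(code)
        if ch not in "hw":         # h and w do not reset the previous code
            prev = code
    return (first + "".join(codes) + "000")[:4]
```

Words with the same code (e.g., Robert and Rupert) fall into one equivalence class, so a misspelled query can still reach the intended term.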