From the Inside Out Michael Hunter Reference Librarian Hobart and William Smith Colleges
Google from the Inside Out • Hardware and Database Creation • Relevance Ranking and Link Analysis • Advanced and “Hidden” Search Features • Hands-on Session • Pay-for-Placement and Revenue Issues • Our Google “Wish List” • Other Services to Keep Our Eyes On
Google’s Beginnings • 1996: Sergey Brin and Larry Page of Stanford develop “BackRub”, based on analysis of links TO a page from other sites • Sept. 7, 1998, Menlo Park, CA: Google launches in beta with over 10,000 queries a day • December 1998: Listed in PC Magazine’s Top 100 Websites
What’s in a name? • “Google” is a play on “googol”, a term coined by Milton Sirotta, nephew of mathematician Edward Kasner, to refer to the number one followed by 100 zeros
Google’s Hardware • Over 10,000 servers in two locations containing “hundreds of copies of the database” • Index of more than 3 billion web documents • Handles thousands of queries on a sub-second basis • Interviews in MP3 format with Chief Operations Engineer Jim Reese • //technetcast.com/tnc_play_stream.html?stream_id=420 (1 hr. 13 min.) • //technetcast.com/tnc_play_stream.html?stream_id=421 (15 min.)
Google’s Multi-faceted Database • Indexed html pages • Unindexed html pages • Other file types • Html pages that are re-indexed daily
What types of pages are unindexed? (25%) • Dead or inaccurate links • Duplicate pages • Database-generated URLs • Pages with robots.txt or noindex meta tags • Pages on an intranet • Pages “waiting” to be indexed fully
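As a minimal sketch of how a crawler might honor the robots.txt exclusions listed above, Python's standard `urllib.robotparser` can evaluate such rules. The robots.txt content, the `ExampleBot` name, and the example.com URLs are invented for illustration; this is not Google's crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the sample robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable("http://example.com/index.html"))
print(is_crawlable("http://example.com/private/notes.html"))
```

A page blocked this way can still surface as an unindexed URL if other crawled pages link to it.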
How did they get into Google? • Google crawls and downloads links in the documents it encounters • Some of these links are dead, inaccurate, or cannot be crawled for other reasons (intranets, robots.txt) • The URLs are in the database, but the documents are not
Why does Google leave them in? • They are not COMPLETELY unindexed • Indexed elements include • Words in the URL http://members.home.net/gourdeaud/ • Words in the anchor text on indexed pages that link to the unindexed URL <a href="http://members.home.net/gourdeaud/">Gourdeaud’s biography</a> • Can be useful in URL searches, unique term queries, and PageRank
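The two indexed elements named above (words in the URL and words in the anchor text) can be illustrated with a short sketch using Python's standard `html.parser`. The class and function names are hypothetical, and the tokenization is a simplifying assumption, not a description of Google's indexer.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchors(html):
    """Return every (href, anchor text) pair found in the HTML."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def url_words(url):
    """Split a URL into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", url.lower())


page = '<p>See <a href="http://members.home.net/gourdeaud/">Gourdeaud\'s biography</a>.</p>'
for href, text in collect_anchors(page):
    print(href, "|", text, "|", url_words(href))
```

Even if the target document is never downloaded, both the URL words and the anchor text remain searchable terms attached to that URL.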
How can I distinguish unindexed pages in search results? • No extract • No page size • No cached copy of the page
Deep Web Components: Non-html filetypes (1.75%) • Adobe Portable Document Format (pdf) • Adobe PostScript (ps) • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk...) • Lotus WordPro (lwp) • MacWrite (mw) • Microsoft Excel (xls) • Microsoft PowerPoint (ppt) • Microsoft Word (doc) • Microsoft Works (wks, wps, wdb) • Microsoft Write (wri) • Rich Text Format (rtf) • Text (ans, txt) • SEARCH SYNTAX: “california power shortage” filetype:pdf
Google Non-html Filetypes: Warning! • FOR NON-HTML FILES • Clicking on a title in the results list also opens the associated application, with the risk of a virus or worm attached to the file • INSTEAD, click the “View as HTML” option; no application is opened, so there is no risk of a virus or worm • NOTE: Titles for non-html files are frequently not descriptive of content
Non-html filetypes in Google: Notess Study, March 6, 2002 – 25 One-Word Searches
Deep Web Components: Daily re-indexed pages (0.15%) • Over 3 million • Regular html pages that Google has noticed are frequently updated • Google re-indexes these “every day or so” • Date of Google’s last visit to the page appears in the results listing
Google’s Database • Freshness • Breadth • Depth
Database Freshness • Refreshes its entire web index “on a roughly monthly basis, about every 28 days”. • On-going process • Some segments fresher than others
Notess Study, April 6, 2002: Pages that are updated daily and report that date
Database Breadth (Size) • About 3 billion documents (indexed and unindexed) • Daily figure on the homepage: 3,083,324,652 on March 8, 2003 (not including Images or Usenet) • FAST (alltheweb.com) claimed 2.1 billion indexed documents, March 8, 2003
Database Depth • Google “typically” downloads the first 110 K of a web document • Download includes URLs of outgoing links
Database “Blending” • Results from Google’s News vertical engine are included in results for all searches • Blending is increasingly common among search services • News • Shopping • Directory
Relevance Ranking and Link Analysis Google’s “PageRank” Demystified
Relevance Ranking • Processing and presenting retrieved results • Proprietary information • Search Engine Optimization Industry has made it even more so • “How can I make my site rank high in Google?”
What happens when I enter a search at Google? • Check of search syntax and spelling • Query routed to the appropriate server “based on the [database] segment on which the answer is likely to be found”
What happens when I enter a search at Google? • Processing of Visible text • Search term(s) position – title, heading, text • Search term(s) frequency • Search term(s) proximity • Processing of Invisible text • Meta tags • Anchor text (the text inside an <a href> tag) <a href="http://www.hws.edu">Hobart and William Smith Colleges</a>
What happens when I enter a search at Google? • PageRank link analysis applied • Click popularity (Google Toolbar voting data) • Link context (Proximity of links to your search term(s) within the document) • Final dynamic mix of “about 25 factors”
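To make the on-page signals from these slides concrete (term position, frequency, and proximity, combined into one final mix), here is a toy scoring sketch. The function names, the weights, and the whitespace tokenization are all assumptions made for illustration; Google's actual factors and weights are proprietary.

```python
def min_gap(tokens, a, b):
    """Smallest token distance between any occurrence of a and any of b.

    Returns None when either term is absent.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def toy_score(title, body, terms, w_title=3.0, w_freq=1.0, w_prox=2.0):
    """Combine three on-page signals with hypothetical weights."""
    title_tokens = title.lower().split()
    body_tokens = body.lower().split()
    terms = [t.lower() for t in terms]

    # Position: how many search terms appear in the title.
    title_hits = sum(1 for t in terms if t in title_tokens)

    # Frequency: total occurrences of the search terms in the body.
    freq = sum(body_tokens.count(t) for t in terms)

    # Proximity: inverse distance between the first two terms in the body.
    prox = 0.0
    if len(terms) >= 2:
        gap = min_gap(body_tokens, terms[0], terms[1])
        if gap:  # skips None (term missing) and 0 (identical terms)
            prox = 1.0 / gap

    return w_title * title_hits + w_freq * freq + w_prox * prox


focused = toy_score("California power shortage",
                    "the power shortage in california continues",
                    ["power", "shortage"])
scattered = toy_score("Weather report",
                      "power was out and later a shortage of water",
                      ["power", "shortage"])
print(focused > scattered)
```

In a real engine these on-page scores would be only one input, mixed with link-based signals such as PageRank.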
PageRank Demystified • Patented link analysis program • Part of Google since its beginnings • Objective – To make ranking more of a “human process” • Assigns each page in Google a PageRank score, which is dynamic (changeable) • Weighs heavily in final ranking of results
PageRank’s Multi-layered processing • Layer I • Do others think your site is of value as demonstrated by linking to you? IF SO … • Layer II • Are these “others” in turn linked to by sites recognized through linkage within “web communities”?
PageRank’s Multi-layered processing • A Favorable Ranking Scenario A .com site selling prosthetics linked TO by A local orthopedic association in turn linked TO by A national orthopedic group in turn linked TO by The National Institutes of Health
Visualizing Linkage in Google’s Database with TouchGraph • Browser: http://www.touchgraph.com/TGGoogleBrowser.html • Instructions: http://www.touchgraph.com/TGGB_FullInstructions.html
How Does Google Identify “Web Communities”? • Mutual linkage patterns • Metadata elements and keywords found in common • Human examination/verification of the quality of key sites within the community • Other proprietary factors ???????
PageRank Nitty Gritty • Every page of a site can have a PageRank score, not just the main page • The value of a link from Site B to Site A is decreased with each additional link from Site B to any other site • Rationale: If Site B has only a few links, each one could be more important than if Site B has hundreds of outgoing links
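The dilution effect described above can be sketched with the simplified PageRank iteration from Brin and Page's published research. This is an illustration only: the damping factor of 0.85, the even spreading of rank from pages without outgoing links, and the four-page graphs are assumptions, and the production system is proprietary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute simplified PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link carries an equal share of the page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Site B links only to A.
few_links = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["B"]}
# Site B links to A, C, and D, so each link carries a smaller share.
many_links = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}

print(pagerank(few_links)["A"] > pagerank(many_links)["A"])
```

In the first graph Site A receives all of the rank that Site B passes on; in the second it receives only a third of it.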
PageRank Nitty Gritty • Requires human adjustment in the case of large subject directories and quality lists of links • PageRank scoring is a dynamic process always in flux • To find a page’s PageRank score, go to the Toolbar and click on the green meter
PageRank Feedback • Site A has NO outgoing links, but is linked TO by Site B • Site A decides to create a single link to Site B • This increases Site B’s PageRank score • Site B’s increased score in turn automatically increases Site A’s score
Sounds easy to manipulate… • Possibilities include • Spam • Link “farms” • Cloaking (sneaky re-directs) • Google is vigilant • If Google detects any manipulation of PageRank, it eliminates the domain from its database and never crawls there again.
PageRank Processing • How does Google know who has linked to Site A, for example? • By searching its database for all sites with links to Site A • No way to do this by examining Site A, as there is no physical change to a document when it is linked TO
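Because a document does not change when it is linked TO, incoming links have to be derived from the crawl data. A minimal sketch, assuming the crawl is available as a mapping from each site to its outgoing links (the site names are hypothetical):

```python
from collections import defaultdict


def build_backlink_index(outgoing):
    """Invert a {source: [targets]} mapping into {target: {sources}}."""
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks


# Hypothetical crawl results: each site and the sites it links to.
crawl = {
    "siteA.example": [],
    "siteB.example": ["siteA.example"],
    "siteC.example": ["siteA.example", "siteB.example"],
}

index = build_backlink_index(crawl)
print(sorted(index["siteA.example"]))
```

The inverted mapping answers "who links to Site A?" with a single lookup, but it is only as complete as the crawl that produced it.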
Implications of PageRank • PageRank is entirely dependent on linkage data derived from the Google database • Breadth, depth, and freshness of the crawl are critical to accurate and current data for PageRank scoring
A Different Perspective on PR: Anti-Google • Daniel Brandt claims • “PageRank discriminates against new web sites” (which may not yet be linked to by other sites) • “Careless custodian of private information” (Google associates each search with a cookie, set to last 36 years) • Maintains googlewatch.org
PageRank: A Summary • All links are not created equal • Is this site linked TO by “good” web pages associated with this topic? • EXAMPLE: If a page is linked to by a subject directory (Yahoo, OD, LII), its rank will be higher than another page with many links from personal web pages, link “farms”, etc. • NOTE: Link Analysis (PageRank) is not the same as Link Popularity (number of links)
Searching Google: Touring the Known and the Unknown Please share your discoveries with us!
Command Searching with Google’s Fields (aka Search Operators) • Field Searches that cannot be combined with other search elements: • NOTE: No space allowed between operator and following text • cache: retrieves cached version of the specified URL • link: retrieves pages that have links to the specified URL • related: retrieves pages that are “similar” to the specified URL (same as Similar Pages feature in results listing)
Command Searching with Google’s Fields (aka Search Operators) • Field Searches that cannot be combined with other search elements: • info: retrieves information that Google has about the specified URL • stocks: retrieves stock information about the companies whose ticker symbols follow the stocks: operator stocks:intc (Intel)
Command Searching with Google’s Fields (aka Search Operators) • Field Searches that can be combined with other search elements: • site: restrict results to those from the specified domain site:www.google.com PageRank NOTE: retrieves all pages from www.google.com that contain PageRank anywhere
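As a small sketch of combining a field operator with ordinary search terms, the helper below assembles a query string and a search URL with Python's standard `urllib.parse`. The function names are hypothetical, and the URL pattern assumes the familiar `q` parameter of Google's web search.

```python
from urllib.parse import urlencode


def build_query(terms="", site=None, filetype=None):
    """Join optional field operators and free-text terms into one query."""
    parts = []
    if site:
        parts.append(f"site:{site}")          # no space after the operator
    if filetype:
        parts.append(f"filetype:{filetype}")
    if terms:
        parts.append(terms)
    return " ".join(parts)


def search_url(query):
    """Return a search URL for the query (assumed 'q' parameter)."""
    return "http://www.google.com/search?" + urlencode({"q": query})


query = build_query("PageRank", site="www.google.com")
print(query)
print(search_url(query))
```

The same helper covers the filetype example from the earlier slide: `build_query('"california power shortage"', filetype="pdf")`.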