Chapter 5: Look up things Introduction to Computers CS1100.01 Dr. Zhizhang Shen
What about it? • We are now in a situation of information explosion: too much stuff. • Given all the information out there, how do we look for the stuff that we really want? • We will try to gain a basic understanding of how information is structured. • What should we do with all the information that we have found?
Where should I look for it? • The obvious and familiar • To find tax information, ask the IRS • Libraries online • Many college and public libraries let you access their online catalogs and other information resources • Libraries provide online facilities that are well organized and trustworthy • Our own Lamson library puts lots of good stuff online • Remember that many pre-1985 documents are not yet available online • The reference librarians are real experts
Did we talk about this before? • All information is organized hierarchically. • Information is grouped into a small number of categories, each of which is easily described (top-level classification) • Each category is divided into subcategories (second-level classifications) • This process is repeated several times, until…
So on and so forth… • Eventually the classification becomes simple enough for you to look through the whole category to find the information you need • This is a process of elimination as much as of choosing appropriate subcategories: you consider not only what you want, but also what you don’t want. • It is similar to the folder structures that we went through to organize our files.
How is a Web site organized? • A site consists of many pages organized into various folders. • Each folder contains web pages and various supporting material: audio, video clips, images, etc. • Each page contains sets of related links • For example, sidebar and top-of-page navigation links • Content information often fills the rest of a page
How does a search engine work? • A search engine works in two parts: • Crawler: Visits sites on the Internet, discovering Web pages and building an index to the Web's content, much like the index at the back of a book, to facilitate later searches. • Query processor: Looks up user-submitted keywords in the index and reports back which Web pages the crawler has found containing those words • Popular search engines: Google, Yahoo!, MSN, AOL, Ask, Bing, …
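The two parts can be illustrated with a minimal sketch. It assumes the crawler has already collected the text of a few pages; the page names and contents below are made up for illustration, and real search engines are far more elaborate.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def lookup(index, query):
    """Query processor: return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical pages, standing in for what a crawler has discovered.
PAGES = {
    "a.html": "Chinese restaurants in Plymouth",
    "b.html": "Thai restaurants and Chinese tea",
    "c.html": "Sailboard rentals in Oregon",
}

INDEX = build_index(PAGES)
print(lookup(INDEX, "chinese restaurants"))  # ['a.html', 'b.html']
```

The index is built once, ahead of time; each query is then a quick lookup instead of a scan of every page.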
What does a crawler do to a page? • It first identifies all the links to other Web pages on that page • Checks its records to see if it has visited those pages recently • If not, adds them to the list of pages to be crawled • Records in an index the keywords used on the page (those that appear in the title, the body, or in anchor text)
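The crawling steps can be sketched as a loop over a list of pages still to be visited. This is a simplified illustration, not how any particular search engine does it: the tiny "web" below is a made-up dictionary instead of the real Internet, and "visited recently" is reduced to "visited at all".

```python
from collections import deque

# Hypothetical tiny "web": page name -> (page text, links found on that page)
WEB = {
    "home": ("welcome to the bike shop", ["bikes", "contact"]),
    "bikes": ("mountain bike rentals", ["home"]),
    "contact": ("call the shop", []),
    "orphan": ("nobody links here", []),
}

def crawl(web, start):
    """Visit every page reachable from start; return (visited pages, keyword index)."""
    to_crawl = deque([start])   # the list of pages to be crawled
    visited = set()             # the crawler's records of pages already seen
    index = {}                  # keyword -> set of pages using that keyword
    while to_crawl:
        url = to_crawl.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in visited:
                to_crawl.append(link)
    return visited, index

visited, index = crawl(WEB, "home")
print(sorted(visited))       # ['bikes', 'contact', 'home']
print("orphan" in visited)   # False
```

The page named "orphan" is never reached because no other page links to it, so none of its words end up in the index.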
Oops, I missed it… Crawlers can miss pages, and thus fail to include them in the index, because: • No page points to it, i.e., it is an isolated page • The page is dynamically created on the fly, such as the pages we created earlier this week, and thus has not been added to any index yet. • The page contains only images, or Tim’s bike clip, but no keywords. • The type of the page is not recognized (not HTML, PDF, etc.)
Now what? Once an index is developed, a query processor, on getting keywords from users, will look them up in its index • Even if a page has not yet been crawled, it might be reported because it is linked from a page that has been crawled, and the keywords appear in the anchor text on the crawled page • Thus, it is important to give the right words to look up the right things.
Is this page really important? • Google's idea: PageRank gives preference • Orders links by relevance to the user • Relevance is computed by counting the links to a page (the more pages link to a page, the more relevant/important that page must be) • Each page that links to another page is considered a "vote" for the value of that page • Google also considers whether the "voting page" is itself highly ranked, or valued.
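The "voting" idea can be sketched with a simplified textbook version of PageRank. This is only an illustration under stated assumptions, not Google's actual ranking: the four-page link graph is made up, the damping value 0.85 and the iteration count are conventional choices, and every linked page is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a highly ranked page is worth more.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)
print(best)  # C
```

Page C comes out on top because three pages vote for it. Page A ranks above B and D even though each is linked from at most one page, because A's single vote comes from the highly ranked C.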
Ask the right question(s) • Google is perhaps one of the most used search engines. • Choose the right words; it won’t hurt to know a bit about how the search engine will use them • Words or phrases? • Search engines generally consider each word separately • Ask for an exact phrase by placing quotation marks around it, e.g., “Chinese restaurants”
Logical Operators • AND, OR, NOT • AND: Tells the search engine to return only pages containing both terms (default) Chinese AND restaurants • OR: Tells the search engine to find pages containing either word, including pages where they both appear Chinese OR Thai • NOT/-: Excludes pages with the given word -Definitely not French • AND and OR are infix operators; they go between the terms • NOT/- is a prefix operator; it precedes the term to be excluded • Google Help: Cheat Sheet • http://www.google.com/help/cheatsheet.html
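One way to picture the three operators is as set operations on the lists of pages stored in an index. The sketch below assumes a tiny made-up index (page names p1 to p4 are hypothetical) and is not a description of how Google implements its operators.

```python
# Hypothetical index: keyword -> set of pages containing that keyword
INDEX = {
    "chinese": {"p1", "p2"},
    "thai": {"p2", "p3"},
    "restaurants": {"p1", "p3", "p4"},
    "french": {"p4"},
}

def pages_with(word):
    """Return the set of pages containing the word (empty set if none)."""
    return INDEX.get(word.lower(), set())

# AND: pages containing both terms (set intersection)
both = pages_with("Chinese") & pages_with("restaurants")

# OR: pages containing either term, or both (set union)
either = pages_with("Chinese") | pages_with("Thai")

# NOT/-: restaurant pages, minus those mentioning French (set difference)
no_french = pages_with("restaurants") - pages_with("French")

print(sorted(both))       # ['p1']
print(sorted(either))     # ['p1', 'p2', 'p3']
print(sorted(no_french))  # ['p1', 'p3']
```

AND and NOT can only shrink a result set, while OR can only grow it, which is why adding AND terms to a query lowers the number of hits.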
Results as of October 6, 2009 • Sailboard: 302,000 • Sailboard AND Rentals: 13,400 • Sailboard AND Rentals AND Oregon: 2,730 • Sailboard AND Rentals AND Oregon AND hood river: 1,230 • Sailboard AND Rentals AND Oregon AND hood river -car: 903
Search and sort • About 50% of all computer execution time is spent on these two operations: looking for things and putting the results into order. • Google searches for stuff and then puts the results in order of relevance.
Question: • How do we figure out the relevant words so that the search engine will find the stuff fast? • Some tips: • Be clear about where you want it to look (company, business, military, sports teams, etc.) • Think about what type of organization might publish the stuff that you are looking for (clothing, cars, planes, …)
More tips • List words that are likely to appear on the pages you are looking for (candy, shoes, Red Sox, financial aid, ice hockey, restaurant, Huntington, bike, …) • Assess the results • Before looking at each returned page, check the results to see if this is really the stuff you are looking for. (Is this the book, the show, the event, …?) • If not, what other words should or could I look for, and where should I look?
More tips (Cont’d) 3. Consider a two-pass strategy (focused searches) • Do a broad topic search, and then search within your results • When looking for spots for fall vacations, type “Fall vacations” first • Out of all the stuff popping up, select the one that appeals the most to you, for example, “Fall vacation ideas”.
More things to try… • What I don’t want, … • “fall vacations -New England” Anywhere but New England. • “fall vacations site:npr.org” I only want to go to the places discussed on the NPR site because I only trust NPR. • Question: What does the search engine do?
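One plausible first step for the search engine is to split the query into words to include, words to exclude, and an optional site restriction. The sketch below is an assumption for illustration only: the function name parse_query is made up, real engines may parse queries differently, and quotation marks are used here to keep the two-word phrase "New England" together as a single excluded term.

```python
import shlex

def parse_query(query):
    """Split a query into (words to include, words to exclude, site restriction)."""
    include, exclude, site = [], [], None
    # shlex.split keeps a quoted phrase such as "New England" in one token.
    for token in shlex.split(query):
        if token.startswith("site:"):
            site = token[len("site:"):]      # only report pages from this site
        elif token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])        # prefix operator: exclude this term
        else:
            include.append(token)            # ordinary keyword
    return include, exclude, site

print(parse_query('fall vacations -"New England" site:npr.org'))
# (['fall', 'vacations'], ['New England'], 'npr.org')
```

After parsing, the engine would look up the included words in its index, drop pages containing the excluded terms, and keep only the pages whose address belongs to the requested site.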
Web information: truth or fiction? • Anyone can publish anything on the web • Blogs and Wikipedia: The latter contains over seven million articles in over 200 languages. Anyone can edit an article with the edit button. • Recently, this changed a bit: you can no longer edit the pages of people who are still alive. • Some of what gets published is false, misleading, deceptive, self-serving, slanderous, or disgusting • If it is on the web it must be true. – NOT!
Another point of confusion? • Registered domain names may be misleading or deliberate hoaxes • www.whitehouse.gov vs. www.whitehouse.org vs. www.whitehouse.com • While whitehouse.gov is authentic, whitehouse.com has nothing to do with 1600 Pennsylvania Avenue. (Have a look at the disclaimer at the top right corner) • Anyone can register a domain name, paying as little as $8.95. • Here is how to order
Question: • How do we know if the pages we find in our search are reliable? • Look for who or what organization publishes the Web page • Respected organizations publish the best information available, e.g., www.nytimes.com, http://www.npr.org/
Who is behind this address? • InterNIC (www.internic.net/whois.html) provides the name of the company that assigned the site's IP address, and a link to the WhoIs server maintained by that company • Go to the WhoIs Server site and type the domain name or IP address again. • Information returned is the owner's name and physical address.
The good ones • Web sites are most believable if they have these features: • Physical Existence—Site provides a street address, phone number, e-mail address • Expertise—Site includes references, citations or credentials, related links • Clarity—Site is well organized, easy to use, and has site-searching facilities • Currency—Site was recently updated • Professionalism—Site's grammar, spelling, and punctuation are correct; all links work
Remember that a site can have all these features and still not be legitimate. • When in doubt, check it out (including cross-checking). Ask a librarian. • Example: http://www.dhmo.org/ (Hoax about dangers of Dihydrogen monoxide – H2O) • Is it really a hoax? • Check it out yourself.
Homework • Multiple choice: odd numbered • Short answers: even numbered • Complete Exercises 5, 6 and 7.