
Searching through the Internet

Dr. Eslam Al Maghayreh, Computer Science Department, Yarmouk University



  1. Searching through the Internet • Dr. Eslam Al Maghayreh, Computer Science Department, Yarmouk University

  2. Outline • Introduction • Information Retrieval • Indexing • Smarter Internet Searching • Examples

  3. Introduction • The Internet holds an enormous quantity of information: • billions of web pages • thousands of newsgroups • Two questions face any information seeker: • (1) How can I find what I want? • (2) How can I know that what I find is any good?

  4. Information Retrieval • Goal = find documents relevant to an information need from a large document set • [Diagram: information need → query → IR system (retrieval over the document collection) → answer list]

  5. Example • Google Web search

  6. Search Engine • Consists of: • the interface you use to type in a query • an index of Web sites that the query is matched with • and a software program (called a spider or bot) that goes out on the Web and gets new sites for the index
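
The three parts listed on this slide can be sketched in a few lines. This is a toy illustration only: the `TOY_WEB` dictionary, the `*.example` addresses, and the function names `crawl`, `build_index`, and `search` are hypothetical stand-ins, and a real spider would fetch pages over the network rather than read them from memory.

```python
from collections import deque

# Hypothetical in-memory "web": address -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("yarmouk university home", ["b.example", "c.example"]),
    "b.example": ("computer science department", ["a.example"]),
    "c.example": ("information retrieval course", ["b.example", "d.example"]),
    "d.example": ("course slides", []),
}


def crawl(seed, web):
    """Spider: breadth-first visit of every page reachable from the seed."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            # Follow each link once, and only if the page exists.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Index: map each word to the set of addresses whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Interface: return the addresses that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


pages = crawl("a.example", TOY_WEB)
index = build_index(pages)
print(sorted(search(index, "course")))  # ['c.example', 'd.example']
```

The query never touches the pages themselves; it is answered entirely from the index that the spider filled in beforehand.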

  7. IR problem • First applications: in libraries (1950s) • Example catalog record: ISBN: 0-201-12227-8; Author: Salton, Gerard; Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer; Publisher: Addison-Wesley; Date: 1989; Content: <Text> • A record has external attributes (ISBN, author, title, and so on) and an internal attribute (the content) • Search by external attributes = search in a database • IR: search by content

  8. Possible approaches • (1) String matching (linear search in documents): slow • (2) Indexing: fast, and flexible for further improvement
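
The difference between the two approaches can be sketched as follows. The three sample documents and the function names are made up for illustration; the index shown is a plain inverted index, which is one common way to realize "indexing" and is assumed here rather than taken from the slide.

```python
# Hypothetical sample collection: document id -> text.
docs = {
    1: "information retrieval finds relevant documents",
    2: "an index makes retrieval fast",
    3: "string matching scans every document",
}


def linear_search(docs, term):
    """String matching: re-read the full text of every document per query."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


def build_inverted_index(docs):
    """Indexing: read every document once and record where each word occurs."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def indexed_search(index, term):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(term, set()))


index = build_inverted_index(docs)
print(linear_search(docs, "retrieval"))    # [1, 2]
print(indexed_search(index, "retrieval"))  # [1, 2]
```

Both functions return the same answer. The linear scan costs time proportional to the size of the whole collection on every query, while the index pays that cost once, up front.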

  9. [Diagram: documents → indexing → document representation (index); query → indexing → query representation; both representations feed a comparison function → results]

  10. Main problems in IR • Query evaluation (or retrieval process) • To what extent does a document correspond to a query? • System evaluation • How good is a system? • Are the retrieved documents relevant? (precision) • Are all the relevant documents retrieved? (recall)
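
The two system-evaluation questions correspond to the standard definitions of precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that were retrieved). A minimal sketch, with made-up document ids and a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned four documents; three documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p)  # 0.5 (2 of the 4 retrieved documents are relevant)
print(round(r, 3))  # 0.667 (2 of the 3 relevant documents were retrieved)
```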

  11. Document indexing • Goal = find the important meanings and create an internal representation • Factors to consider: • Accuracy in representing meanings (semantics) • Exhaustiveness (cover all the contents) • Ease of manipulation by computer • What is the best representation of contents? • Word: good coverage, not precise • Phrase: poor coverage, more precise • Concept: poor coverage, precise • [Figure: accuracy (precision) plotted against coverage (recall) for the word, phrase, and concept representations]

  12. Keyword selection and weighting • How to select important keywords? • Simple method: use middle-frequency words • Search engines usually disregard minor words (stop words) such as "the", "and", and "to"
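
A minimal sketch of the simple method above. The stop-word list and the frequency band (`min_count`, `max_count`) are illustrative assumptions; the slide does not give actual values, and real systems tune them to the collection.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is"}


def select_keywords(text, min_count=2, max_count=3):
    """Drop stop words, then keep only words in the middle-frequency band."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: c for w, c in counts.items() if min_count <= c <= max_count}


text = (
    "the spider feeds the index and the index answers the query "
    "and the query hits the index"
)
print(select_keywords(text))  # {'index': 3, 'query': 2}
```

In the sample text "the" is by far the most frequent word, yet it never becomes a keyword, while words that occur only once fall below the band.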

  13. Result of indexing • Each document is represented by a set of weighted keywords (terms): D1 → {(t1, w1), (t2, w2), …} • e.g. D1 → {(comput, 0.2), (architect, 0.3), …}, D2 → {(comput, 0.1), (network, 0.5), …}
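
One way such a weighted representation could be produced is sketched below. The slides do not say which weighting scheme is used, so normalized term frequency is assumed here. The terms on the slide are stems ("comput", "architect"); stemming is left out of this sketch, so whole words appear instead.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "and", "to", "of", "a", "in"}


def index_document(text):
    """Return {term: weight}, where weight is the term's share of the kept words."""
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


d1 = index_document("computer architecture and computer networks")
print(d1)  # {'computer': 0.5, 'architecture': 0.25, 'networks': 0.25}
```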

  14. Retrieval • The problems underlying retrieval • Retrieval model • How is a document represented with the selected keywords? • How are document and query representations compared to calculate a score?

  15. Vector space model • Vector space = all the keywords encountered: <t1, t2, t3, …, tn> • Document D = <a1, a2, a3, …, an>, where ai = weight of ti in D • Query Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q • R(D,Q) = Sim(D,Q)
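
To make the model concrete, the sketch below turns weighted keyword sets into vectors over a shared vocabulary. The document weights are the ones shown on slide 13; the query weights of 1.0 and the helper name `to_vector` are assumptions for illustration.

```python
def to_vector(weights, vocabulary):
    """Place each term's weight at that term's position; absent terms get 0."""
    return [weights.get(term, 0.0) for term in vocabulary]


d1 = {"comput": 0.2, "architect": 0.3}
d2 = {"comput": 0.1, "network": 0.5}
q = {"comput": 1.0, "network": 1.0}  # assumed query weights

# The vector space is spanned by every keyword encountered.
vocabulary = sorted(set(d1) | set(d2) | set(q))
print(vocabulary)                 # ['architect', 'comput', 'network']
print(to_vector(d1, vocabulary))  # [0.3, 0.2, 0.0]
print(to_vector(d2, vocabulary))  # [0.0, 0.1, 0.5]
print(to_vector(q, vocabulary))   # [0.0, 1.0, 1.0]
```

Because documents and the query share the same axes, they can be compared position by position.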

  16. Matrix representation • Rows are documents (document space); columns are terms (term vector space):

            t1    t2    t3    …   tn
      D1    a11   a12   a13   …   a1n
      D2    a21   a22   a23   …   a2n
      D3    a31   a32   a33   …   a3n
      …
      Dm    am1   am2   am3   …   amn
      Q     b1    b2    b3    …   bn

  17. Some formulas for Sim • Dot product • Cosine • Dice • Jaccard • [Figure: vectors D and Q drawn on term axes t1 and t2]
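
The slide names four measures, but the formulas themselves did not survive in the transcript. The sketch below uses the standard textbook definitions for weighted vectors, which are assumed here rather than copied from the slide. D and Q are vectors of equal length, as in the vector space model.

```python
import math


def dot(d, q):
    """Dot product: sum of a_i * b_i."""
    return sum(a * b for a, b in zip(d, q))


def cosine(d, q):
    """Cosine: dot(D, Q) / (|D| * |Q|)."""
    denom = math.sqrt(dot(d, d) * dot(q, q))
    return dot(d, q) / denom if denom else 0.0


def dice(d, q):
    """Dice: 2 * dot(D, Q) / (|D|^2 + |Q|^2)."""
    denom = dot(d, d) + dot(q, q)
    return 2 * dot(d, q) / denom if denom else 0.0


def jaccard(d, q):
    """Jaccard: dot(D, Q) / (|D|^2 + |Q|^2 - dot(D, Q))."""
    denom = dot(d, d) + dot(q, q) - dot(d, q)
    return dot(d, q) / denom if denom else 0.0


D = [1.0, 0.0, 1.0]
Q = [1.0, 1.0, 0.0]
print(dot(D, Q))     # 1.0
print(cosine(D, Q))  # 0.5
print(dice(D, Q))    # 0.5
print(jaccard(D, Q))
```

The dot product grows with vector length, so long documents score higher; the other three measures normalize by the vector lengths, which is why they are often preferred for ranking.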

  18. (Classic) Presentation of results • The query evaluation result is a list of documents, sorted by their similarity to the query • e.g. doc1 0.67, doc2 0.65, doc3 0.54, …
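
Producing such a list amounts to scoring every document against the query and sorting by score. A minimal sketch, assuming cosine similarity as the measure (the slide does not say which one is used) and using made-up vectors:

```python
import math


def cosine(d, q):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    denom = math.sqrt(sum(a * a for a in d) * sum(b * b for b in q))
    return dot / denom if denom else 0.0


def rank(documents, query):
    """Return (name, score) pairs, most similar document first."""
    scored = [(name, cosine(vec, query)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vectors over the vocabulary <architect, comput, network>.
documents = {
    "doc1": [0.3, 0.2, 0.0],
    "doc2": [0.0, 0.1, 0.5],
    "doc3": [0.0, 0.0, 0.0],
}
query = [0.0, 1.0, 1.0]

for name, score in rank(documents, query):
    print(name, round(score, 2))
# Order: doc2, then doc1, then doc3
```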

  19. IR on the Web • No stable document collection (spider, crawler) • Duplication • Huge number of documents • Multimedia documents • Multilingual problem • …

  20. Tips for smarter Internet searching • Use unique, specific terms • Use the minus operator (-) to narrow the search, e.g. yarmouk -university • Use quotation marks to match the consecutive words of a phrase, such as "flower arrangement" • Enter a short question or calculation, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", or "what is the distance between the sun and earth"

  21. Smarter Internet Searching • inurl:test results • Only test must be found in the web address (URL); results may appear anywhere on the page • allinurl:test results • Both test AND results must be found in the web address • define: • Provides definitions of the word or phrase, gathered from various online sources • Example: define: search engine

  22. Smarter Internet Searching • allintext: • Sometimes you get pages that do not contain your search term or phrase • Why? Because Google also matches the text of links that point to a page, not only the page's own text • Use allintext: to get only those pages that have your search terms in their text

  23. Smarter Internet Searching • allinanchor: • Returns pages for which the text of links pointing to them contains your search terms, whether or not the terms appear in the pages themselves • This is the opposite of allintext: • site: • Limits your search to a specific web site • Examples: • students site:yu.edu.jo • students site:yu.edu.jo filetype:pdf
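
The operators covered so far (quoted phrases, the minus operator, site:, and filetype:) are plain text inside the query, so a query can be assembled as a string. The helper names `build_query` and `search_url` below are hypothetical; the only library call used is `urllib.parse.urlencode` from the Python standard library, and passing the query in the `q` parameter of the search address is an assumption about how the search page is addressed.

```python
from urllib.parse import urlencode


def build_query(terms, phrase=None, exclude=(), site=None, filetype=None):
    """Assemble a query string from plain terms and optional operators."""
    parts = []
    if terms:
        parts.append(terms)
    if phrase:
        parts.append(f'"{phrase}"')          # exact consecutive words
    parts.extend(f"-{word}" for word in exclude)  # minus operator
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Percent-encode the query into a search address."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("students", site="yu.edu.jo", filetype="pdf")
print(query)  # students site:yu.edu.jo filetype:pdf
print(search_url(query))
print(build_query("yarmouk", exclude=["university"]))  # yarmouk -university
```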

  24. Smarter Internet Searching • Don't use common words and punctuation • Exception: keep common words and punctuation marks when searching for a specific phrase inside quotes • Most search engines do not distinguish between uppercase and lowercase • Make the most of AutoComplete suggestions

  25. Smarter Internet Searching • The wildcard operator (*): Google calls it the fill-in-the-blank operator. For example, amusement * will return pages with amusement and any other term(s) the Google search engine deems relevant. • Using the wildcard (*) to stand for characters within a word does not work in Google: cat* returns the same results as cat.

  26. Smarter Internet Searching • Related sites: • For example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site • Specific file type: • For example, information retrieval filetype:ppt

  27. Examples • Searching for papers • YU library • Google scholar • Searching for instructor resources • Morgan Kaufmann • Pearson

  28. Examples • Searching for books to buy • Amazon.com • Ebay.com • Searching for items to buy • Electronics: Bestbuy.com • Searching for hotels • Expedia.com • Priceline.com • Booking.com

  29. Examples • Regional search • Google Jordan (google.jo) • Searching for images • Google Images • Searching for a job • Jobsinacademia.net • Academickeys.com

  30. The End.
