Online Database vs. Web Search Engines 571-Information Access and Retrieval
Overview of Online Databases over 30 Years (William, 2006) • From 1975 to 2005, the number of databases increased considerably, from 301 to 17,539 • Database records grew from 52 million to 21.02 billion, and database entries from 301 to 16,532 • The number of producers has not grown as fast as the number of databases because one producer might publish multiple databases • The number of producers increased from 200 to 3,208 between 1975 and 2005 • In 2005, the average producer produced 5.13 databases • Since each vendor might provide services from multiple databases, the number of vendors grew at a slower pace, from 105 to 2,811
Types of search • Known item search • Specific-information search • Subject search • Exploring/Browsing information • Others
General search steps • Search plan • System access • Database selection (Optional) • Search query formulation • Preliminary results evaluation • Search query reformulation (Optional) • Final results evaluation (Optional)
Some Search Strategies • Building blocks: combine sub-searches • Citation pearl growing: use the index terms of a relevant citation to retrieve further similar citations • Successive fractions: reduce the set using narrower index terms • Most specific facet first: start with the most specific concept
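The building-blocks and successive-fractions strategies can be sketched with plain set operations. This is only an illustration: the posting lists, terms, and record IDs below are made up, and the function names are hypothetical rather than commands of any real retrieval system.

```python
# Toy posting lists: each index term maps to the set of record IDs it retrieves.
# All terms and IDs are invented for illustration.
postings = {
    "dog": {1, 2, 3, 5},
    "canine": {2, 4},
    "training": {1, 2, 4, 6},
    "obedience": {2, 5, 6},
    "puppy": {2, 3},
}

def any_of(terms):
    """OR together the synonyms that make up one concept block."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

def building_blocks(blocks):
    """Build each concept block with OR, then AND the blocks together."""
    sets = [any_of(block) for block in blocks]
    return set.intersection(*sets) if sets else set()

def successive_fractions(start_terms, narrowing_terms):
    """Start with a broad set and intersect it with one narrower term at a time."""
    current = any_of(start_terms)
    steps = [set(current)]
    for term in narrowing_terms:
        current &= postings.get(term, set())
        steps.append(set(current))
    return steps

hits = building_blocks([["dog", "canine"], ["training", "obedience"]])
steps = successive_fractions(["dog", "canine"], ["training", "puppy"])
print(hits)
print([len(s) for s in steps])
```

Each entry in `steps` is the result set after one more narrowing term, so the shrinking sizes show how the set is cut down fraction by fraction.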
Search Strategy Formulation • Imagine the title and keywords of relevant documents • Boolean operators: and, or, not • Proximity operators: adj, near, freq, atleast • Search fields/segments: au, co, ti, de • Use controlled vocabulary to identify context • Truncation: string, plural, single character
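Truncation can be illustrated by translating a truncated term into a regular expression. The symbols are an assumption made for this sketch: `*` stands for any run of word characters and `?` for exactly one character. Real online systems use their own truncation symbols, so check the system's documentation.

```python
import re

def truncation_to_regex(pattern):
    """Translate a truncated term into a compiled regex.

    Assumed symbols: '*' matches any run of word characters,
    '?' matches exactly one word character.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def matches(pattern, text):
    """Return every word in text that the truncated term retrieves."""
    return [m.group(0) for m in truncation_to_regex(pattern).finditer(text)]

text = "Librarians and a library user: women, a woman, and computing on computers."
print(matches("librar*", text))
print(matches("wom?n", text))
```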
How to find related Words? • Personal knowledge • terminology • relevant document • Term mapping provided by system • Feedback from search results • title, descriptor, text • Others
Search Strategy Reformulation • System • search fields • vocabulary • more like this • refine search • Limit/focus search • User • relevance feedback
Narrow search • Find the right database • Add another word or phrase • Negative feedback (exclude one aspect of the search statement) • Exclude related terminology • Restrict to certain field • title, descriptor, frequency, etc. • Restrict to certain types of publication • Restrict to certain time range • Restrict to certain language
Evaluate Search Results • Known item: title, author, publication, date • Specific information: Key Word In Context (KWIC) • Subject information: title, abstract, descriptor, full text
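A Key Word In Context display shows each hit with a few words on either side, so a searcher can judge relevance without opening the full record. The sketch below is a minimal version that assumes whitespace tokenization and a fixed context window; the sample sentence is invented.

```python
def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context per side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and case when comparing.
        if word.strip('.,;:!?"()').lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

sample = "Online databases index records. A search engine builds its index from crawled pages."
for line in kwic(sample, "index", width=2):
    print(line)
```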
Tutorials for Online Databases • http://www.uwm.edu/Libraries/ris/courses/sois510/ • http://training.dialog.com/onlinecourses/recorded/ • http://www.sois.uwm.edu/DE_Info/cahansen/WT3/WT3.html
Characteristics of Web IR • Web documents: stored in a distributed fashion, growing in size, deep and surface documents, multiple formats, varying in quality, frequently changing, others • Users: various user groups, others • Systems
What Is a Search Engine? • [Figure: Users, Search Engine, Internet]
Key components • Data collection • Web spider or crawler • Data processing • Ranking • Indexing • Query formulating • Interface • Matching • Result displaying
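The indexing and matching components listed above can be sketched with a tiny inverted index. The page texts and URLs are placeholders standing in for what a crawler would have collected, and real engines add much more (stemming, stop words, ranking).

```python
from collections import defaultdict

# Stand-in for pages a crawler has already fetched (URLs and text are made up).
pages = {
    "http://example.org/a": "online database search",
    "http://example.org/b": "web search engine ranking",
    "http://example.org/c": "database indexing and ranking",
}

def build_index(docs):
    """Data processing: map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def match(index, query):
    """Matching: return pages containing every query word (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(pages)
print(sorted(match(index, "database ranking")))
print(sorted(match(index, "search")))
```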
How Does Ranking Work? • Literal matching • Measure of word significance: the frequency of word occurrence (term frequency) • Location: the relative position of a word • Examples • http://www.searchenginewatch.com/webmasters/work.html • http://www.searchenginewatch.com/webmasters/rank.html
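The two signals on this slide, term frequency and word location, can be combined in a toy scoring function. This is only a sketch: treating "location" as "appears in the title" and the boost value of 2.0 are assumptions for illustration, not how any particular engine weights them.

```python
def score(document, term, title_boost=2.0):
    """Toy score: relative term frequency in the body, plus a boost for a title hit.

    The title boost value is an arbitrary assumption.
    """
    title_words = document["title"].lower().split()
    body_words = document["body"].lower().split()
    tf = body_words.count(term.lower()) / len(body_words) if body_words else 0.0
    boost = title_boost if term.lower() in title_words else 0.0
    return tf + boost

docs = [
    {"title": "Cooking", "body": "search for recipes online"},
    {"title": "Search engines", "body": "a search engine ranks pages for each search"},
]
ranked = sorted(docs, key=lambda d: score(d, "search"), reverse=True)
for doc in ranked:
    print(doc["title"], score(doc, "search"))
```

Both bodies have the same relative frequency for "search" here, so the title hit is what separates them.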
How Does Ranking Work? (Cont’) • Hyperlinks (Brin & Page, 1998) • PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) * • PR(A): PageRank of document A • T1 … Tn: documents that link to document A • C(T): number of outgoing links from document T • d: damping factor between 0 and 1, usually set to 0.85 * http://infolab.stanford.edu/~backrub/google.html
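The PageRank formula can be evaluated by repeated substitution: start every page at 1.0 and reapply the formula until the values settle. The sketch below assumes a three-page link graph invented for illustration, a fixed iteration count, and that every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the share of rank passed on by every page that links here.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this graph C collects links from both A and B, so it ends up with the highest value, and A benefits from being the only page C links to.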
Other Types of Search Engines • Directories: hierarchically organized indexes that let you browse lists of web sites by category or subject • Meta-search engines: query multiple search engines simultaneously and return a combined set of hits • Specialized search engines: create a database of sites on a specific topic using robots or spiders • For specific user groups • Visualization
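A meta-search engine has to merge the ranked lists it gets back and remove duplicates. The sketch below uses hard-coded result lists in place of real engine calls, and the reciprocal-rank scoring rule is an assumption chosen for simplicity, not something the slide specifies.

```python
def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one deduplicated list.

    Each URL scores the sum of 1/rank over the engines that returned it,
    so URLs ranked highly by several engines rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Placeholder results from two hypothetical engines.
engine_a = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
engine_b = ["http://example.org/2", "http://example.org/4"]
merged = merge_results([engine_a, engine_b])
print(merged)
```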
Examples of Directories • Yahoo Directory http://dir.yahoo.com/ • The Internet Public Library http://www.ipl.org/ • Librarians’ Index to the Internet http://sunsite.berkeley.edu/InternetIndex • INFOMINE, from the University of California, is a good example of an academic subject directory
Examples of Meta-Search Engines • MetaCrawler www.metacrawler.com • Ixquick http://ixquick.com/ • Clusty http://clusty.com/ • Mamma www.mamma.com
More examples of Specialized Search Engines • Career Mosaic www.careermosaic.com • Diseases, Disorders and related topics www.mic.ki.se/Diseases/index.html • The Day in History www.historychannel.com/today • Shareware.com www.shareware.com
User Behaviors • Web queries are short, rarely modified, and very simple in structure • Users rarely use advanced search features; when they do, about half of the attempts contain mistakes • Users view only the first one or two pages of results • Users show little interest in relevance feedback
User search patterns in different environments (Jansen & Pooch, 2001)
Appendix A: Tips • Most search engines employ the principles of Boolean logic in the formulation of search queries. If you take the time to understand the basics of Boolean logic, you will have a better chance of search success. • Search engines tend to have a default Boolean logic. This means that the space between multiple search terms defaults to either OR logic or AND logic. This has become a de facto standard. It is imperative that you know which logical operator is the default. Nowadays, the default logic tends to be AND, but you should always check the site's Help file to make sure. • Another de facto standard is the requirement to search for phrases within quotations, e.g., "death penalty".
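The difference between default AND, default OR, and a quoted phrase can be seen on a handful of toy documents. This is a sketch under simple assumptions (whitespace tokenization, invented documents); it does not reproduce the behavior of any specific search engine.

```python
# Invented mini-collection for illustration.
docs = {
    1: "the death penalty debate",
    2: "penalty kicks in football",
    3: "death valley national park",
    4: "penalty for causing death",
}

def search(query, default="AND"):
    """Match space-separated terms using the given default operator."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        found = [term in words for term in terms]
        if (default == "AND" and all(found)) or (default == "OR" and any(found)):
            hits.add(doc_id)
    return hits

def phrase_search(phrase):
    """Match the words only as an exact adjacent sequence (quoted-phrase behavior)."""
    target = phrase.lower().split()
    n = len(target)
    hits = set()
    for doc_id, text in docs.items():
        words = text.lower().split()
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits.add(doc_id)
    return hits

print(search("death penalty", default="OR"))
print(search("death penalty", default="AND"))
print(phrase_search("death penalty"))
```

OR retrieves the most records, AND fewer, and the phrase search the fewest, which is why it matters to know the default.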
Appendix A (Cont’) • If proximity operators (e.g., NEAR) are available, use them rather than specifying an AND relationship between your keywords. This ensures that your search terms are located near each other in the full-text document. The closer together your terms appear, the more likely the document is to be relevant. Google does proximity searching by default. • Field searching is another extremely important way of limiting your search results in large search engines that contain millions of full-text files. For example, TITLE:slavery in a search engine such as AltaVista will bring you more relevant hits than merely searching on the keyword slavery. • To enhance subject searches, try the URL field to narrow your results. The URL field offers a good way to search for certain subject terms because directory and file names in a URL often contain them.
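A proximity operator such as NEAR can be sketched as a check on word positions. The distance threshold and the tokenization here are assumptions; each search system defines its own window size and word-counting rules.

```python
def near(text, term_a, term_b, max_distance=5):
    """Return True if term_a and term_b occur within max_distance words of each other."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

sentence = "Slavery was abolished in the United States in 1865."
print(near(sentence, "slavery", "abolished", max_distance=3))
print(near(sentence, "slavery", "1865", max_distance=3))
```

An AND search would accept both pairs of terms in this sentence, while the proximity check accepts only the pair that sits close together.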
Appendix A (Cont’) • The Internet is a self-publishing medium. It is not a library of evaluated publications selected by professionals. Rather, the Internet is a bulletin board containing everything from the definitive to the spurious. Everything, everything must be analyzed for its appropriateness for research use. • Before you select a search tool, always think about your topic and what you are trying to find. Once you begin your research, be sure to try out a handful of sites. Don't rely on a single site. • Don't just Google everything! Google is great, but there are other useful tools on the Web, too. Google has become so popular that many people use this tool exclusively, and miss out on others that might be more useful for their particular search. • Others?
Appendix B: Anatomy of a URL
This is a URL on the CNN home page: http://www.cnn.com/feedback/comments.html
This URL is typical of addresses hosted in domains in the United States:
• Protocol: http
• Host computer name: www
• Second-level domain name: cnn
• Top-level domain name: com
• Directory name: feedback
• File name: comments.html
The directory name and file name often contain subject terms. These can be searched with the URL field. For example, URL:slavery will give you more relevant results than the keyword slavery by searching for this term as a directory name or a file name.
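The same breakdown can be reproduced with Python's standard `urllib.parse` module. The helper name `url_anatomy` and its dictionary keys are hypothetical, and the host-splitting logic assumes a simple three-part host name like the one in the example; it would mislabel hosts under multi-part suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse

def url_anatomy(url):
    """Split a URL into the parts named in Appendix B (simple host names only)."""
    parsed = urlparse(url)
    labels = (parsed.hostname or "").split(".")
    path_parts = [part for part in parsed.path.split("/") if part]
    return {
        "protocol": parsed.scheme,
        "host": labels[0] if len(labels) > 2 else "",
        "second_level_domain": labels[-2] if len(labels) >= 2 else "",
        "top_level_domain": labels[-1],
        "directories": path_parts[:-1],
        "file": path_parts[-1] if path_parts else "",
    }

parts = url_anatomy("http://www.cnn.com/feedback/comments.html")
for name, value in parts.items():
    print(f"{name}: {value}")
```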
Appendix C • Search engine comparison chart • http://www.infopeople.org/search/chart.html • http://www.searchengineshowdown.com/features/ • Tutorials • Google Tutorial