1.33k likes | 2.66k Views
Search Engine. Search Engine. A Web search engine is a search engine designed to search for information on the World Wide Web. Information may consist of web pages, images and other types of files. Eg: Google.com, Yahoo!, Altavista.com, Excite.com. Search Engine.
E N D
Search Engine • A Web search engine is a search engine designed to search for information on the World Wide Web. • Information may consist of web pages, images and other types of files. • Eg: Google.com, Yahoo!, Altavista.com, Excite.com
Search Engine Basically, a search engine is a software program that searches for sites based on the words that you designate as search terms. Search engines look through their own databases of information in order to find what it is that you are looking for.
Web Search Engine Web search engine is a tool designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. The information may consist of web pages, images, information and other types of files..
Web Search Engine Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input
Web Search Engine It is a prograWeb Search Enginem that searches documents for specified keywords and returns a list of the documents where the keywords were found. Although search engine is really a general class of programs, the term is often used to specifically describe systems like Google, Alta Vista and Excite that enable users to search for documents on the World Wide Web and USENET newsgroups.
Web Search Engine Search engines and directories are not the same thing; although the term "search engine" often is used interchangeably. Search engines automatically create web site listings by using spiders that "crawl" web pages, index their information, and optimally follows that site's links to other pages.
Web Search Engine Spiders return to already-crawled sites on a pretty regular basis in order to check for updates or changes, and everything that these spiders find goes into the search engine database.
How it works? Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. Each search engine uses a proprietaryalgorithm to create its indices such that, ideally, only meaningful results are returned for each query.
How it works? They include incredibly detailed processes and methodologies, and are updated all the time. This is a bare bones look at how search engines work to retrieve your search results. All search engines go by this basic process when conducting search processes, but because there are differences in search engines, there are bound to be different results depending on which engine you use.
How it works? The searcher types a query into a search engine. Search engine software quickly sorts through literally millions of pages in its database to find matches to this query. The search engine's results are ranked in order of relevancy.
Example of Search Engines Google is always a safe bet for most search queries, and most of the time your search will be successful on the very first page of search results. Yahoo is also a great choice, and finds a lot of stuff that Google does not necessarily pick up.
Example of Search Engines There are some search engines out there that are able to answer factual questions, among these are Answers.com, BrainBoost, Factbites, and Ask Jeeves. There are quite a few search engines that will help you do this with clustered results or search suggestions. Some of these include Clusty, WiseNut, AOL Search, and Teoma, in addition to Gigablast, AllTheWeb, and SurfWax.
Example of Search Engines There are lots of great search engines that deal primarily in academic and research oriented results. Included among there are Scirus, Yahoo Reference, National Geographic Map Machine, MagPortal, CompletePlanet, FirstGov, and EducationWorld.
Example of Search Engines Images on the Web are easy to find, especially with targeted image search engines such as Picsearch, Ditto, and of course, Google has some fantastic image search capabilities. You can also check out my list of Image Search Engines-Directories-Collections, or Clip Art-Buttons-Graphics- Icons-Images on the Web.
Example of Search Engines There's so much multimedia on the Web that your main problem will be finding enough time to look at it all. Here are a few places you can use to search for sounds, movies, and music on the Web: Loomia, Torrent Typhoon, The Internet Movie Database, SingingFish, and Podscope. For even more multimedia search engines and sites
Example of Search Engines Finding someone with similar interests on the Web via a blog or online community is simple. Use LjSeek, Technorati, and Daypop to search for blogs; find people with ZoomInfo, Pretrieve, or Zabasearch, and search for discussion groups and message boards with BoardTracker.
Operation of Search Engine • A search engine operates, in the following order • Web crawling • Indexing • Searching
Web Crawling A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, and worms or Web spider, Web robot, or—especially in the FOAF(an acronym of Friend of a friend) community—Web scutter. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
Web Crawling Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
Internet bots, also known as web robots, WWW robots or simply bots, are software applications that run automated tasks over the Internet. • There are important characteristics of the Web that make crawling it very difficult: • its large volume, • its fast rate of change, and • dynamic page generation. • The behavior of a Web crawler is the outcome of a combination of policies: • a selection policy that states which pages to download, • a re-visit policy that states when to check for changes to the pages, • a politeness policy that states how to avoid overloading Web sites, and • a parallelization policy that states how to coordinate distributed Web crawlers.
Indexing Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is Web indexing.
Indexing Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video and audio and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus.
Search A web search query is a query that a user enters into web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they vary greatly from standard query languages which are governed by strict syntax rules.
GOOGLE • "Googol" is the mathematical term for a 1 followed by 100 zeros. The term was coined by Milton Sirotta, nephew of American mathematician Edward Kasner, and was popularized in the book, "Mathematics and the Imagination" by Kasner and James Newman.
PURPOSE • The purpose of inventing Google's is to organize the world's information and make it universally accessible and useful
Founded by Larry Page and Sergey Brin • September 7, 1998 • Begun as a research project • They hypothesized that a search engine that analyzed the relationships between websites would produce better ranking of results than existing techniques, which ranked results according to the number of times the search term appeared on a page • Nicknamed – ‘Backrub’
Originally, the search engine used the Stanford University website with the domain google.stanford.edu. • The domain google.com was registered on September 15, 1997 and the company was incorporated as Google Inc. on September 7, 1998 at a friend's garage in Menlo Park, California. • Originally it was Googol.com but the name has already registered to another site.
Introduction • Google's founders Larry Page and Sergey Brin developed a new approach to online search that took root in a Stanford University dorm room and quickly spread to information seekers around the globe. Named for the mathematical term "googol," • Google operates websites at many international domains, with the most trafficked being Google.com. Google is widely recognized as the world's best search engine because it is fast, accurate and easy to use. The company also serves corporate clients, including advertisers, content publishers and site managers with cost-effective advertising and a wide range of revenue-generating search services.
Google History • According to Google lore, company founders Larry Page and Sergey Brin were not terribly fond of each other when they first met as Stanford University graduate students in computer science in 1995. Larry was a 24-year-old University of Michigan alumnus on a weekend visit; Sergey, 23, was among a group of students assigned to show him around. By January of 1996, Larry and Sergey had begun collaboration on a search engine called BackRub, named for its unique ability to analyze the "back links" pointing to a given website. A year later, their unique approach to link analysis was earning BackRub a growing reputation among those who had seen it. In September 1998, Google Inc. opened its door in Menlo Park, California. Already Google.com, still in beta, was answering 10,000 search queries each day. The press began to take notice of the upstart website with the relevant search results, and articles extolling Google appeared in USA TODAY and Le Monde. That December, PC Magazine named Google one of its Top 100 Web Sites and Search Engines for 1998. Google was moving up in the world.
Objectives and Goals • To push more development and understanding into the academic realm. • To build system that reasonable numbers of people can actually use. Usage are important to Google because they think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. • To build an architecture that can support novel research activities on large-scale web data. • To set up an environment where other researches can come in quickly, process large chunks of the web, and produce interesting results that have been very difficult to produce otherwise. • To set up a Space-lab like environment where researches or even students can propose and do interesting experiments on our large-scale web data.
Features Overview • The Google Toolbar enables you to conduct a Google search from anywhere on the web • Google AdWords program to promote their products and services on the web with targeted advertising, and they believe AdWords is the largest program of its kind. • Google AdSense program to deliver ads relevant to the content on their sites, improving their ability to generate revenue and enhancing the experience for their users
Technology Overview • PageRank Technology: PageRank reflects Google's view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that Google believes are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. Important pages receive a higher PageRank and appear at the top of the search results. Google's technology uses the collective intelligence of the web to determine a page's importance. There is no human involvement or manipulation of results, which is why users have come to trust Google as a source of objective information untainted by paid placement.
Hypertext-Matching Analysis: Google's search engine also analyzes page content. However, instead of simply scanning for page-based text (which can be manipulated by site publishers through meta-tags), Google's technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. Google also analyzes the content of neighboring web pages to ensure the results returned are the most relevant to a user's query.
Anchor Text • The text of links is treated in a special way in Google search engine. Most search engines associate the text of a link with the page that the link is on. In addition, they associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled.
System Anatomy • BigFiles BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
Repository The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. They chose zlib’s speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib’s 3 to 1 compression. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; they can rebuild all the other data structures from only the repository and a file which lists crawler errors.
Document Index The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seeks during a search.
Lexicon The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation Google can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers.
Hit List A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible.They chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding
Forward Index The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID’s. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID’s with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter.
Inverted Index The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID’s together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
Crawling The Web In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers.
Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
Indexing The Web - Parsing Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone’s imagination to come up with equally creative ones. • Indexing Documents into Barrels After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID’s, their occurrences in the current document are translated into hit lists and are written into the forward barrels.
- Sorting In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage.
Searching • The goal of searching is to provide quality search results efficiently. made great progress in terms of efficiency. Therefore, they have focused more on quality of search in our research, although they believe their solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is show in Figure 4. • To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. They are currently investigating other ways to solve this problem. In the past, they sorted the hits according to PageRank, which seemed to improve the situation.