A Cross Platform Application for Searching the Web
Linux Bangalore/2001
Manu Konchady
December 12th, 2001
Problem
- The Web consists of more than 1.5 billion pages as of June 2001 and grows at over a million pages a day (excluding the ‘hidden web’).
- 99% of these pages may not be of interest to any single individual.
- How do we locate the valuable (relevant) pages in the least time?
What does the web look like?
Terminology:
- In-links: links to a page
- Out-links: links from a page
- Hub: a page with many out-links
- Authority: a page with many in-links
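The terminology above can be made concrete with a small sketch. It is written in Python for illustration (not the Perl used by the application), and the page names in the toy link list are hypothetical.

```python
from collections import defaultdict

def link_degrees(links):
    """Count in-links and out-links per page from (source, target) pairs."""
    in_links = defaultdict(int)
    out_links = defaultdict(int)
    for source, target in links:
        out_links[source] += 1
        in_links[target] += 1
    return dict(in_links), dict(out_links)

# Hypothetical toy link graph: "portal" links out to three pages,
# and two pages link to "manual".
links = [
    ("portal", "manual"),
    ("portal", "news"),
    ("portal", "faq"),
    ("news", "manual"),
]
in_links, out_links = link_degrees(links)
hub = max(out_links, key=out_links.get)        # page with the most out-links
authority = max(in_links, key=in_links.get)    # page with the most in-links
```

Here "portal" behaves like a hub and "manual" like an authority; on real data the counts would come from the crawled link table.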
What does the web look like? Contd.
An experiment by AltaVista and IBM analysed 200 million web pages and 1.5 billion links:
- SCC: a Strongly Connected Core is the heart of the Web (56 million pages)
- IN: pages that can reach the SCC but cannot be reached from it (43 million)
- OUT: pages that can be reached from the SCC but do not link back (44 million)
- Tendrils: pages not accessible from the SCC (44 million)
- Disconnected pages (16 million)
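A minimal sketch of how pages could be sorted into these groups, assuming one page already known to lie in the core. It is written in Python for illustration; the function names and the toy graph are hypothetical, and tendrils and disconnected pages are lumped together as "OTHER" to keep it short.

```python
from collections import defaultdict

def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bow_tie(links, core_page):
    """Classify pages relative to the strongly connected core of core_page."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    nodes = set()
    for source, target in links:
        forward[source].add(target)
        backward[target].add(source)
        nodes.update((source, target))
    down = reachable(core_page, forward)    # pages the core can reach
    up = reachable(core_page, backward)     # pages that can reach the core
    scc = down & up
    return {
        "SCC": scc,
        "IN": up - scc,
        "OUT": down - scc,
        "OTHER": nodes - up - down,         # tendrils and disconnected pages
    }

# Toy graph: a and b link to each other (core), i links in, o is linked out to,
# t hangs off i, and x -> y is disconnected.
links = [("a", "b"), ("b", "a"), ("i", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
groups = bow_tie(links, "a")
```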
What does the web look like? Contd.
Observations:
- The fraction of web pages with x in-links is proportional to 1 / (x ** 2.1) (a power law)
- A similar observation holds for pages with out-links
- Hub pages are useful to navigate the web and increase the connectivity of the web
- Authority pages should be easier to find than other pages (multi-topic or single topic)
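The power law for in-links can be written out as follows. The constant C is a normalising factor introduced here for illustration; the ratios follow directly from the exponent 2.1.

```latex
P(x) = C\,x^{-2.1},
\qquad
\frac{P(2x)}{P(x)} = 2^{-2.1} \approx 0.23,
\qquad
\frac{P(10x)}{P(x)} = 10^{-2.1} \approx 0.008
```

Under this model, pages with twice as many in-links are roughly 23% as common, and pages with ten times as many in-links are under 1% as common, which is why authorities are rare.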
Current Products to Search the Web
- These products are also known as Agents, Bots, or Spiders.
- Most of the products available are Windows based: BullsEye, Copernicus, Lexibot, and others.
- These products collect results from hundreds of search engines and perform some limited organization and analysis.
Cross Platform Tools
Why cross platform?
- While Linux grows in popularity, a majority of apps are written for the Windows platform
Which development tools?
- MySQL, Perl, and Java
- All 3 tools are Open Source
- MySQL tables store and manage the information, Perl collects and processes the information, and Java displays the information
MySQL
MySQL evolved from a database written in 1979 and today runs on multiple platforms including Linux, Windows, and Solaris. It was developed at TcX, a Swedish company, and recently made Open Source.
- It is lightweight and fast compared to other relational databases
- Comes with extensive online documentation (an O’Reilly book on MySQL is available as well)
- Installed on over 0.5 million servers worldwide
- Works with several gigabytes of data
- Supports interfaces to a variety of languages including Perl, Java, C, C++, and Python
- Available for free download from www.mysql.com
Perl
Perl (Practical Extraction and Report Language) was first developed by Larry Wall in 1987. It was created to overcome problems with awk, shell scripts, and C.
- It is a scripting language (interpreted rather than precompiled)
- Easy to include many of the UNIX tools without shelling out, i.e. it combines tools such as sed, tr, awk, grep and others
- Originally designed for fast text manipulation
- Available on Linux and other UNIX platforms as well as Windows (over 29 ports of Perl)
- Data structures are not bounded by prebuilt limitations
- Some of the applications of Perl include database access, file management, CGI scripts, client/server processing, data formatting, disk management, process management, and a cross platform GUI based on the Tk toolkit
- Hundreds of public domain modules to perform a variety of functions
- Very popular with system administrators and some developers
- It is famous for being difficult to read, since there are many ways to implement the same function
- Leaves programming discipline to the discretion of the developer
Java
Java was developed by Sun in 1995. Major additions of APIs and other functions to the language have made it a popular language for developers. Our interest in the language is to build a sophisticated user interface for the applications. The Swing API (part of the Java Foundation Classes, JFC) was an enhancement to the existing Abstract Window Toolkit (AWT) API.
- A variety of components such as buttons, check boxes, radio buttons, scroll bars, text panes, slider bars, and other complex widgets
- Supports dynamic tables and periodic updates of tables without user intervention
- Many options for setting boundaries, colors, scrolling, and controlling behaviour of components
How do they work together?
- JDBC (for Java) and DBI (for Perl) are standards for issuing database calls
- The use of JDBC and DBI makes it easier to change databases, if necessary
Parallel processing differs across platforms:
- Linux uses threads or processes to run in parallel. A process has its own memory space, while a thread runs in the same memory space and uses fewer resources
- Windows also provides threads and processes
- It is harder to implement parallel code using threads than processes in Perl
- The fork or system call is the easiest way to start an independent process in Linux
- In Windows, the Win32 API can be used to start independent processes
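As a rough illustration of running independent processes on both platforms, the sketch below uses Python's standard subprocess module, which works on Linux and Windows alike. It is not the talk's Perl fork or Win32 code; the worker body and function names are placeholders.

```python
import subprocess
import sys

def start_workers(count):
    """Start `count` independent child processes and return their handles."""
    workers = []
    for worker_id in range(count):
        # Each child is a separate interpreter with its own memory space.
        child_code = f"print('spider {worker_id} done')"
        proc = subprocess.Popen(
            [sys.executable, "-c", child_code],
            stdout=subprocess.PIPE,
            text=True,
        )
        workers.append(proc)
    return workers

def collect(workers):
    """Wait for every child and return (exit code, output) pairs in start order."""
    outputs = []
    for proc in workers:
        out, _ = proc.communicate()
        outputs.append((proc.returncode, out.strip()))
    return outputs

results = collect(start_workers(3))
```

Because the children share no memory with the parent, any coordination between them has to go through an external resource such as the database.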
A Perl Spider - Intelligent pruning
A Perl Spider - Architecture
A Perl Spider contd.
- Create multiple independent spider processes
- Query search engines such as Google or Wisenut
- Select URLs to process from a common table
- Use DB locks to synchronize access to the table
- Use fork or system calls in Linux and the Win32 API in Windows
A Perl Spider contd.
Assign relevancy to a web page based on user queries and additional keywords:
- Frequency of occurrence of keywords
- Location of keywords
- Word distance between keywords
Prioritize domains and process URLs from high priority domains before other URLs.
Block certain sites or restrict access to a few sites.
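One way the three relevancy signals could be combined is sketched below in Python (for illustration, not the application's Perl). The weights, the use of the title as the "location" signal, and the function names are all assumptions rather than values from the talk.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevancy(title, body, keywords):
    """Score a page from keyword frequency, location, and proximity.

    Assumed weights: 1 per body hit, 3 per keyword found in the title,
    and a proximity bonus of 5 divided by the smallest keyword gap.
    """
    keywords = [k.lower() for k in keywords]
    title_words = tokenize(title)
    body_words = tokenize(body)

    frequency = sum(body_words.count(k) for k in keywords)
    location = 3 * sum(1 for k in keywords if k in title_words)

    # Proximity: smallest word distance between two different keywords.
    positions = [(i, w) for i, w in enumerate(body_words) if w in keywords]
    gaps = [
        j - i
        for (i, a), (j, b) in zip(positions, positions[1:])
        if a != b
    ]
    proximity = 5 / min(gaps) if gaps else 0
    return frequency + location + proximity

score = relevancy(
    "Linux search tools",
    "Linux spiders search the web; a search spider on Linux",
    ["linux", "search"],
)
unrelated = relevancy("Cooking", "bread and butter", ["linux", "search"])
```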
A Perl Spider contd.
Evaluating links to follow:
- Rank a link to an external site higher than a link to the same site
- Check if any of the query keywords occur in the anchor text or the link itself
- Assign a higher weight to links from a very relevant page
- Follow the link if its score exceeds a threshold (low, medium, or high)
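The link rules above might be combined into a single score as in this Python sketch (illustrative only, not the application's Perl). The point values, the cap on page relevancy, and the numbers behind the low, medium, and high thresholds are assumptions.

```python
from urllib.parse import urlparse

# Assumed numeric values for the three threshold settings.
THRESHOLDS = {"low": 1.0, "medium": 2.0, "high": 3.0}

def link_score(page_url, link_url, anchor_text, keywords, page_relevancy):
    """Score a link found on page_url using the three rules from the slide."""
    score = 0.0
    # External links rank above links that stay on the same site.
    if urlparse(link_url).netloc != urlparse(page_url).netloc:
        score += 1.0
    # Query keywords in the anchor text or in the link itself.
    haystack = (anchor_text + " " + link_url).lower()
    score += sum(1.0 for k in keywords if k.lower() in haystack)
    # Links from a very relevant page carry extra weight (capped at 1.0).
    score += min(page_relevancy, 10.0) / 10.0
    return score

def should_follow(score, level="medium"):
    """Follow the link only if its score exceeds the chosen threshold."""
    return score > THRESHOLDS[level]

external = link_score(
    "http://a.example/index.html",
    "http://b.example/linux-search.html",
    "Linux search tools",
    ["linux", "search"],
    5.0,
)
internal = link_score(
    "http://a.example/index.html",
    "http://a.example/about.html",
    "About us",
    ["linux", "search"],
    5.0,
)
```

In this example the external, keyword-rich link passes even the high threshold, while the internal "About us" link fails the low one.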
A Perl Spider contd.
A spider terminates when:
- No more URLs can be processed
- Time limit exceeded
- URL limit exceeded
- User decides to stop
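The four stopping conditions can be checked in one place at the top of each crawl iteration. The sketch below is in Python for illustration; the parameter names and the order in which the conditions are tested are assumptions.

```python
import time

def should_stop(pending_urls, processed_count, started_at,
                url_limit, time_limit_seconds, stop_requested, now=None):
    """Return the reason to stop as a string, or None to keep crawling."""
    now = time.monotonic() if now is None else now
    if stop_requested:
        return "user stop"
    if not pending_urls:
        return "no more URLs"
    if now - started_at > time_limit_seconds:
        return "time limit exceeded"
    if processed_count >= url_limit:
        return "URL limit reached"
    return None

# A spider 30 seconds into a 60 second budget with work left keeps going.
reason = should_stop(["http://a.example/"], 5, 0.0, 10, 60, False, now=30.0)
```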
Results
- List of Hubs
- List of Authorities
- List of pages ordered by relevance
- List of sites with highest average relevancy
- Export the link structure to a link analysis tool
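Two of these result lists, pages ordered by relevance and sites by average relevancy, could be derived from per-page scores as in this Python sketch. The scores and URLs are made up for illustration, and grouping by host name is an assumption about what "site" means.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rank_sites(page_scores):
    """Return (site, average relevancy) pairs, highest average first."""
    totals = defaultdict(lambda: [0.0, 0])
    for url, score in page_scores.items():
        site = urlparse(url).netloc
        totals[site][0] += score
        totals[site][1] += 1
    averages = {site: total / count for site, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical relevancy scores for three crawled pages.
page_scores = {
    "http://a.example/1": 8.0,
    "http://a.example/2": 4.0,
    "http://b.example/1": 7.0,
}
pages_by_relevance = sorted(page_scores, key=page_scores.get, reverse=True)
sites_by_average = rank_sites(page_scores)
```

Note that the site with the single best page (a.example) is not the site with the best average, which is why the two lists are reported separately.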
Summary
- Application to address the searching problem on the Web
- Use of cross platform tools (MySQL, Perl, and Java) to build the application
- Architecture of the solution (parallel processing, user interface, and NLP)