Programming for WWW (ICE 1338)
Lecture #4
July 2, 2004
In-Young Ko
iko .AT. icu.ac.kr
Information and Communications University (ICU)
Announcements
• Our TA
  • Name: Mr. Trinh Minh Cuong
  • Email: minhcuong .AT. icu.ac.kr
  • Office: F641
  • Office Hours: Tuesday 11 AM-12 PM, Thursday 2-4 PM
• Please send the instructor your team information
• Please send the instructor your information for creating a Unix account
• Submit your homework #1 (a URL or HTML source) by tomorrow
Programming for WWW (Lecture #4), In-Young Ko, Information and Communications University
Review of the Previous Lecture
• Cascading Style Sheets
• Web-based Information Integration
  • Examples
  • Information Mediators
  • Information Wrappers (Web Wrappers)
Contents of Today's Lecture
• Basic UNIX Commands
• More on Web-based Information Integration
• JavaScript
UNIX Operating System
• A multi-user, multi-tasking operating system
• Developed by Ken Thompson and Dennis Ritchie at Bell Labs in the early 1970s
• Success factors of UNIX
  • Written in a high-level language (C): improves readability and portability
  • Support for primitives (system calls): permits complex programs to be built efficiently
  • A hierarchical file system: easy maintenance
  • Hides the machine architecture from the user: allows programs to run on different machines
• http://www.unix-systems.org/
Architecture of UNIX Systems
[Figure: layered diagram with the hardware at the center, surrounded by the kernel, then by programs such as sh, who, date, wc, grep, vi, ed, nroff, a.out, and cc (with cpp, comp, as, ld), and with other application programs in the outermost layer]
Basic UNIX Shell Commands
• cd - Changes to the named directory
• pwd - Displays the current working directory
• ls - Lists the contents of the current directory
• ls -l - Same as above, but lists more information for each entry
• mkdir - Makes a directory
• rmdir - Removes a directory
• cat - Concatenates files or shows a file's contents
• cp - Copies a file
• mv - Renames a file or moves it to a different directory
• rm - Removes a file
• logout - Terminates a Unix shell session
• man - Displays manual pages
http://infohost.nmt.edu/tcc/help/unix/unix_cmd.html
Publishing Web Pages on the Server
• Copy your files to the 'public_html' directory under your home directory on the server
• Use FTP to copy your files from a local directory to the server directory:
    ftp vega.icu.ac.kr        (log in with your user ID)
    cd public_html
    lcd d:\myweb
    put index.html            (or: mput *.html)
    quit
• Your homepage is now accessible at http://vega.icu.ac.kr/~yourid
Connections Between Web Clients and Servers
[Figure: a Web browser (Connect, Write, Read) and a Web server (Listen, Accept, Process, Return) communicating over port 80]
A Web server is a daemon process that executes in the background, waiting for some event to occur
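The server side of this exchange (Listen, Accept, Process, Return) can be sketched in Java roughly as follows. This is a minimal illustration, not a real Web server. The class name TinyServer, the fixed response body, and taking the port from the command line (binding port 80 usually requires administrator rights) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical single-threaded server: one request per connection.
class TinyServer {
    // Process: build a fixed HTTP/1.0 response that echoes the request line.
    static String respond(String requestLine) {
        String body = "<html><body>You asked for: " + requestLine + "</body></html>";
        int length = body.getBytes(StandardCharsets.US_ASCII).length;
        return "HTTP/1.0 200 OK\r\n"
             + "Content-Type: text/html\r\n"
             + "Content-Length: " + length + "\r\n"
             + "\r\n"
             + body;
    }

    // Accept one connection, read its request line, write the response, close.
    static void serveOnce(ServerSocket listener) throws IOException {
        try (Socket client = listener.accept();                         // Accept
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     client.getInputStream(), StandardCharsets.US_ASCII));
             OutputStream out = client.getOutputStream()) {
            String requestLine = in.readLine();                         // Read
            out.write(respond(requestLine).getBytes(StandardCharsets.US_ASCII)); // Return
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No port given: print a sample response instead of opening a socket.
            System.out.println(respond("GET /index.html"));
            return;
        }
        try (ServerSocket listener = new ServerSocket(Integer.parseInt(args[0]))) { // Listen
            while (true) {
                serveOnce(listener);   // keep waiting for the next event, like a daemon
            }
        }
    }
}
```

Running `java TinyServer 8080` and opening http://localhost:8080/ in a browser should exercise the whole cycle. A production server would also parse headers and handle connections concurrently.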
Sockets
• A socket is an end point for communication between two machines
• A socket is an association of a protocol, address and process to an end point of communication
[Figure: the same Web browser and Web server diagram (Listen, Connect, Accept, Write, Process, Read, Return over port 80), with the connection end points labeled as sockets]
Accessing Web Contents from Java Programs via Sockets
    import java.net.*;
    import java.io.*;
    …
    // Socket creation (the host name must be a quoted string)
    Socket sk = new Socket("www.icu.ac.kr", 80);

    // Write request
    OutputStream os = sk.getOutputStream();
    PrintWriter pw = new PrintWriter(os);
    pw.print("GET /index.html HTTP/1.0\r\n");   // HTTP lines end with CR LF
    pw.print("Host: www.icu.ac.kr\r\n");
    pw.print("\r\n");                           // a blank line ends the request
    pw.flush();

    // Read results
    InputStream is = sk.getInputStream();
    InputStreamReader ips = new InputStreamReader(is);
    BufferedReader in = new BufferedReader(ips);
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    sk.close();
Accessing Web Contents from Java Programs via URL Connections
    import java.net.*;
    import java.io.*;
    …
    // URL object creation
    URL url = new URL("http://www.icu.ac.kr");

    // URL connection creation
    URLConnection urlc = url.openConnection();

    // Read results
    InputStream is = urlc.getInputStream();
    InputStreamReader ips = new InputStreamReader(is);
    BufferedReader in = new BufferedReader(ips);
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
Java String Manipulation Methods for Result Parsing
• int indexOf(String str, int fromIndex)
• int lastIndexOf(String str, int fromIndex)
• boolean startsWith(String prefix)
• boolean endsWith(String suffix)
• boolean matches(String regex)
• String[] split(String regex)
• String substring(int beginIndex, int endIndex)
• String toLowerCase()
• String toUpperCase()
http://java.sun.com/j2se/1.4.2/docs/api/index.html
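As a quick illustration of how several of these methods behave, here is a small sketch. The class name StringMethodsDemo, its helper methods, and the sample strings are made up for this example and are not part of the lecture.

```java
// Hypothetical helpers, each built on one or two of the String methods listed above.
class StringMethodsDemo {
    // lastIndexOf + substring: the part of a URL after the final '/'.
    static String fileName(String url) {
        int slash = url.lastIndexOf("/", url.length() - 1);
        return url.substring(slash + 1, url.length());
    }

    // toLowerCase + startsWith + endsWith: case-insensitive check for an HTTP link to an HTML page.
    static boolean isHtmlPage(String url) {
        String u = url.toLowerCase();
        return u.startsWith("http://") && u.endsWith(".html");
    }

    // split: break a summary into words on runs of whitespace.
    static String[] words(String summary) {
        return summary.trim().split("\\s+");
    }

    // matches: true only if the whole string consists of one or more digits.
    static boolean isNumber(String s) {
        return s.matches("\\d+");
    }

    public static void main(String[] args) {
        String url = "HTTP://vega.icu.ac.kr/~yourid/Index.HTML";
        System.out.println(fileName(url));                        // Index.HTML
        System.out.println(isHtmlPage(url));                      // true
        System.out.println(words("  Programming for   WWW ").length); // 3
        System.out.println(isNumber("1338") + " " + isNumber("ICE 1338")); // true false
    }
}
```

Note that matches tests the entire string against the regular expression, which is why "ICE 1338" is rejected even though it contains digits.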
Web Wrapper for Naver.com
[Figure: a Naver.com search result page with the title, summary, and URL of a result item marked]
Result Parsing Strategies
• Structure-based Parsing
  • Analyzes Web pages based on tag hierarchies
  • Cannot be used for ill-formed HTML documents
• Pattern-based Parsing
  • Searches for a unique string pattern to locate a result item
  • Needs such unique string patterns to be identified first
Structure-based Result Parsing
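A minimal sketch of structure-based parsing, assuming the page is well-formed XHTML so that the standard Java XML parser can build its tag hierarchy. The class name StructureParser and the sample markup are hypothetical. An ill-formed page makes the parser fail, which is the limitation of this strategy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: walk the tag hierarchy of a well-formed page.
class StructureParser {
    // Returns the href attribute of every <a> element in document order.
    // Throws IllegalArgumentException if the markup is not well-formed.
    static List<String> links(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<String>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            // The parser rejects ill-formed input (for example an unclosed <br>).
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String good = "<html><body><table><tr><td>"
                    + "<a href=\"http://www.icu.ac.kr/\">ICU</a>"
                    + "</td></tr></table></body></html>";
        System.out.println(links(good));
        try {
            links("<html><body><p>unclosed<br></body></html>");
        } catch (IllegalArgumentException e) {
            System.out.println("parse failed: ill-formed markup");
        }
    }
}
```

Real-world HTML is often ill-formed, so a wrapper built this way typically needs a clean-up step first or falls back to pattern-based parsing.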
Pattern-based Result Parsing
• Find a unique pattern that locates a result item
  • e.g., "<tr><td><font" in the Naver result pages
• Find the prefix and suffix patterns needed to extract a piece of information (e.g., URL, title, summary) from the result item
  • e.g., "a href=" to extract a URL from a result line
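The two steps above can be sketched with indexOf and substring. The class name PatternParser, the sample page, and the assumption that href values are enclosed in double quotes are illustrative only. The actual Naver markup may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pattern-based extractor.
class PatternParser {
    // Returns the text between the first `prefix` found at or after `from`
    // and the next `suffix`, or null if either pattern is missing.
    static String between(String text, String prefix, String suffix, int from) {
        int start = text.indexOf(prefix, from);
        if (start < 0) return null;
        start += prefix.length();
        int end = text.indexOf(suffix, start);
        if (end < 0) return null;
        return text.substring(start, end);
    }

    // Step 1: locate each result item by its unique marker pattern.
    // Step 2: extract the URL from the item using prefix and suffix patterns.
    static List<String> urls(String page, String itemMarker) {
        List<String> found = new ArrayList<String>();
        int pos = page.indexOf(itemMarker);
        while (pos >= 0) {
            int next = page.indexOf(itemMarker, pos + itemMarker.length());
            String item = page.substring(pos, next < 0 ? page.length() : next);
            String url = between(item, "a href=\"", "\"", 0);  // assumes quoted href values
            if (url != null) found.add(url);
            pos = next;
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up result page with two items.
        String page =
              "<tr><td><font size=2><a href=\"http://www.icu.ac.kr/\">ICU</a></font></td></tr>"
            + "<tr><td><font size=2><a href=\"http://vega.icu.ac.kr/\">Vega</a></font></td></tr>";
        System.out.println(urls(page, "<tr><td><font"));
    }
}
```

Restricting the search to one item at a time keeps a missing link in one result from picking up the URL of the next result.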
Java Implementation of Web Wrapper
    public void WebWrapper(String host, String path, String query,
                           int startIndex, int pageSize) {
        try {
            // Query translation: build the search URL from the arguments
            // (the query string must already be URL-encoded)
            String address = "http://" + host + path + "?where=webkr"
                + "&query=" + query
                + "&start=" + startIndex + "1"   // appends the digit 1: startIndex 2 gives start=21
                + "&display=" + pageSize;
            URL url = new URL(address);
            URLConnection urlc = url.openConnection();
            urlc.setRequestProperty("Accept", "*/*");
            urlc.setRequestProperty("User-Agent", "Mozilla/4.0");
            InputStream is = urlc.getInputStream();
            InputStreamReader ips = new InputStreamReader(is);
            BufferedReader in = new BufferedReader(ips);
            String line;
            while ((line = in.readLine()) != null) {
                // Parsing results: extract the result items from each line here
                // System.out.println(line);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
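The query-translation step can be isolated and made safe for queries that contain spaces or other special characters. This is a sketch under assumptions: the class name QueryTranslator is hypothetical, and UTF-8 is chosen only for illustration, since the search site may expect a different character encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that mirrors the URL layout used in the wrapper code.
class QueryTranslator {
    static String buildAddress(String host, String path, String query,
                               int startIndex, int pageSize) {
        try {
            return "http://" + host + path
                 + "?where=webkr"
                 + "&query=" + URLEncoder.encode(query, "UTF-8") // assumed encoding
                 + "&start=" + startIndex + "1"                  // same digit-append rule as the wrapper
                 + "&display=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // search.example.com and /search are placeholder values.
        System.out.println(buildAddress("search.example.com", "/search", "web robot", 2, 10));
    }
}
```

URLEncoder turns the space in "web robot" into "+", so the resulting address is a valid URL.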
Web Robots
• A Web robot is a program (agent) that collects information while following all the links on a Web page
• Web Robots = Crawlers = Spiders
• Web search engines use Web robots to collect and index Web documents
• A tag that tells Web robots not to index a page or follow its links:
    <meta name="robots" content="noindex,nofollow"/>
• Crawling methods:
  • Breadth-first crawling
  • Depth-first crawling
Breadth First Crawlers
http://ibook.ics.uci.edu/Slides/39
Depth First Crawlers
http://ibook.ics.uci.edu/Slides/39
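The difference between the two crawling methods shows up in the order in which pages are visited. The sketch below uses a small in-memory link graph instead of real HTTP requests. The class name CrawlOrder and the page names A to D are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical crawler over a map from each page to the pages it links to.
class CrawlOrder {
    // Breadth-first: visit every page one link away before any page two links away.
    static List<String> breadthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.addLast(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.removeFirst();
            order.add(page);
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (seen.add(next)) queue.addLast(next);   // enqueue each page only once
            }
        }
        return order;
    }

    // Depth-first: follow the first unvisited link as far as possible, then backtrack.
    static List<String> depthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<String>();
        visit(links, seed, new HashSet<String>(), order);
        return order;
    }

    private static void visit(Map<String, List<String>> links, String page,
                              Set<String> seen, List<String> order) {
        if (!seen.add(page)) return;                       // already visited
        order.add(page);
        for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
            visit(links, next, seen, order);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<String, List<String>>();
        web.put("A", Arrays.asList("B", "C"));
        web.put("B", Arrays.asList("D"));
        web.put("C", Arrays.asList("D", "A"));
        System.out.println(breadthFirst(web, "A")); // [A, B, C, D]
        System.out.println(depthFirst(web, "A"));   // [A, B, D, C]
    }
}
```

The set of already-seen pages is what keeps either crawler from looping forever on cyclic links such as C linking back to A.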
Web-based Information Management Applications (Example Scenario)
Identify recurring disaster areas in China, e.g., locations of floods
• Cross-product between place names and the disaster-type categories
• A Web document collection about 'China disasters'
• Classify documents based on the disaster types mentioned
• For each map layer displayed, get the set of place names and classify the documents based on the place names
• Plot the document clusters on the map to figure out the major flooding areas
Web-based Information Management Applications (Example App. Design)
[Figure: component diagram with a legend distinguishing sequential connections from pipelined connections. Labels in the diagram: Keyword Editor; Keyword Extractor; Product Categories Mapping Clusters; Search Engines; Place Name Extractor; Place Name Generator. Annotations: "Pipelined components" and "Generate multiple sets of place names"]
Problems in Composing Large-scale Information Management Applications
• Time-consuming to explore and test a large number of options
• Hard to choose appropriate services for collections
• Hard to quickly substitute and test a service within a sequence of steps
• Difficulties of capturing and reusing shared patterns of information management steps
• Difficult to record and recurrently perform information management steps
• Necessity of extracting abstract patterns of information management steps and reusing them
• Hard to cope with dynamic aspects of Web resources
Characteristics of Large-scale Information Management Tasks
• Incremental development of information management steps for an abstract task goal
• Recurrent executions of the steps
• Evolving requirements of users
• Shared patterns of management steps
• Collection-based information processing
• Dynamic aspects of information sources and services
• Large and growing number of component services
Improvement Goals
• Significantly reduce construction time, keeping costs low
• Enable very rapid construction/adaptation of new applications
• Provide static and run-time diagnostic tools, facilitating debugging and performance tuning tasks
Rapid Composition and Reconfiguration of Large-scale Custom Applications
JavaScript
• The goal of JavaScript is to provide programming capability at both the client and server ends of a Web connection
• Originally developed by Netscape, as LiveScript
• Became a joint venture of Netscape and Sun in 1995, renamed JavaScript
• Now standardized by the European Computer Manufacturers Association as ECMA-262 (also ISO 16262)
• User interactions with HTML documents in JavaScript use the event-driven model of computation
A Popup Window
    <html>
    <head><title>ICE1338</title>
    <style type = "text/css">
    <!--
      p { font-size: 12pt; color: blue; background-color: yellow }
      h2, h3 { font-size: 16pt; color: red; font-style: oblique }
    -->
    </style>
    <script language = "JavaScript">
      function displayDate() {
        alert("Today's date is: " + new Date() + "!!");
      }
    </script>
    </head>
    <body onLoad="displayDate()">
    <br/>
    <h2>Programming for WWW</h2>
    </body>
    </html>
JavaScript vs. Java
• Both share similar syntax
• JavaScript is a scripting language that runs inside a host environment such as a Web browser, whereas Java is a general-purpose compiled language
• JavaScript is interpreted
• JavaScript is dynamically typed
• JavaScript does not support class-based inheritance (it uses prototype-based inheritance instead)
• JavaScript code is usually embedded in HTML documents
General Syntax of JavaScript
• Direct embedding of JavaScript code:
    <script language = "JavaScript">
      -- JavaScript script --
    </script>
• Indirect JavaScript specification (the script element needs an explicit closing tag):
    <script language = "JavaScript" src = "myScript.js"></script>
• Identifier form: begins with a letter, underscore, or dollar sign, followed by any number of letters, underscores, dollar signs, and digits
• Case sensitive
• 25 reserved words, plus future reserved words
• Comments: both // and /* … */
Document Object Model
"A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents"
HTML:
    <html>
      <head>
        <title>My Document</title>
      </head>
      <body>
        <h1>Header</h1>
        <p>Paragraph</p>
      </body>
    </html>
Script that changes the header text:
    var header = document.getElementsByTagName("H1").item(0);
    header.firstChild.data = "A dynamic document";
http://www.mozilla.org/docs/dom/technote/intro/
DOM Specification
• http://www.w3.org/TR/DOM-Level-2-HTML/html.html
Screen Outputs
• The model for the browser display window is the Window object
• Properties:
  • window.document
  • window.screenLeft
  • window.screenTop
  • …
• Methods:
  • alert
  • confirm
  • prompt
http://devedge.netscape.com/central/javascript/