Wrappers

josef

Presentation Transcript


  1. Wrappers Kapowtech RoboSuite 6.0 Team number 10 – Vampires, meeting number 2

  2. Documents • Lixto WhitePaper • Wrapper Development Tools • Piggy Bank • WebVCR • Kapow RoboSuite Documentation

  3. Why wrappers? • HTML is used to display data • the data is stored inside the HTML itself • the Web is designed for human consumption, even when its content is derived from a well-defined database • wrapper – a robot that browses the Web and extracts data

  4. Applications • online price comparisons • automatic stock market surveillance • personalized online news • flight tickets • job search • competitive advantage • research of a new technology • …

  5. Lixto WhitePaper • presented by Duri • table on the next slide • Comparison of wrappers, programming languages and by-hand conversion • Criteria such as learning time, expressive power, user friendliness, …

  6. Comparison of wrappers, programming languages and by-hand conversion

  7. Wrapper Development Tools • 3 main functions: • download HTML pages from a website • search for, recognize and extract data • save the extracted data in a suitable format, such as XML, XLS or a database, for later import into other applications
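
The three functions above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how any of the tools discussed here is implemented. The page content, the `name`/`price` cell classes and the `products` XML layout are all made up for the example, and the download step is replaced by a literal string so the sketch runs offline.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Step 1 (download) is stubbed with a literal page so the sketch runs offline;
# a real wrapper would fetch it, e.g. with urllib.request.urlopen(url).read().
PAGE = """
<html><body><table>
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table></body></html>
"""


class PriceTableParser(HTMLParser):
    """Step 2: recognize and extract one record per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # extracted records, one dict per <tr>
        self._row = None     # record being filled, None outside a <tr>
        self._field = None   # class attribute of the <td> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._row is not None and self._field and data.strip():
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract(page):
    parser = PriceTableParser()
    parser.feed(page)
    return parser.rows


def to_xml(rows):
    """Step 3: save the extracted records in an exchange format (XML here)."""
    root = ET.Element("products")
    for row in rows:
        item = ET.SubElement(root, "product")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


rows = extract(PAGE)
xml_output = to_xml(rows)
print(xml_output)
```

The extraction rule here is hard-coded against one page layout, which is exactly the fragility that dedicated wrapper tools try to reduce with visual editors, regular expressions or ontologies.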

  8. Wrapper Development Tools • Non-commercial tools: • most of them developed at universities • output data: mainly text and XML • most of them offer an API • most of them are implemented in Java and are open source • most of them offer Web crawling • some of them offer a GUI • just a few offer an editor – regular expressions, ontologies

  9. Wrapper Development Tools • Commercial tools: • most of them developed by commercial companies • output data: mainly XML, tables and text • most of them offer database connectivity • most of them offer Web crawling • most of them offer an API • all of them offer a GUI • most of them offer an editor – regular expressions, Perl, VBScript, …

  10. Piggy Bank • extension for the Firefox Web browser • turns it into a Semantic Web browser • lets users: • combine information from several web sites and browse it all together • save information they have found on the Web • tag each item they save • share saved information • browse and search through an existing web site

  11. Piggy Bank – Applications • You are meeting friends and want to locate a restaurant with Chinese cuisine that is close to your favorite coffee shop with a wireless network • You are moving to a new place and are looking for an apartment close to a school and a subway station, away from crime hotspots, near a hospital, …

  12. Piggy Bank – How it works • semantic web • RDF model • XML information • screen scraper
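
As a rough illustration of the screen scraper to RDF step, the sketch below turns one scraped record into RDF statements in N-Triples syntax. It is not Piggy Bank's code: the subject URI, the `http://example.org/vocab#` vocabulary and the restaurant record are invented for the example, and only plain string literals are handled.

```python
def to_ntriples(subject_uri, properties, vocab="http://example.org/vocab#"):
    """Serialize a scraped record as N-Triples, one statement per property.

    `vocab` is a hypothetical namespace; a real scraper would use an
    agreed vocabulary so items from different sites can be combined.
    """
    lines = []
    for name, value in properties.items():
        # Escape backslashes and double quotes inside the literal.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s%s> "%s" .' % (subject_uri, vocab, name, literal))
    return "\n".join(lines)


# A record as a screen scraper might produce it (made-up data).
scraped = {"name": "Golden Dragon", "cuisine": "Chinese"}
triples = to_ntriples("http://example.org/restaurant/1", scraped)
print(triples)
```

Once every site's data is expressed as statements about URIs, items scraped from different sites can be stored and queried together, which is what the restaurant and apartment scenarios rely on.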

  13. Piggy Bank Example

  14. Piggy Bank - Architecture • consists of 3 primary parts: • chrome additions to the browser, including menu commands, toolbars, etc. • back-end Java code that manages collected information in databases and serves it up through an HTTP interface • XPCOM components written in JavaScript that bridge the chrome part and the Java part

  15. Piggy Bank - Technologies • Firefox, as the application platform • XUL, as the extension’s user interface language • HTML, as the client side user interface language • JavaScript, as the client side and extension’s scripting language • Java, as the server side core programming language • Batik, for encoding PNG files • Informa, for parsing RSS feeds • Jetty, as the embedded web server • JTidy and JDOM, for applying XSLT on HTML • Log4j, as the logging framework • Lucene, as the text indexer • Sesame, as the RDF access and storage API • Velocity, as the templating engine for generating HTML

  16. WebVCR • smart bookmarks – shortcuts to hard-to-reach Web content, i.e. content that requires multiple steps to be retrieved • VCR style – record, replay and eventually browse the steps of the user's actions • no programming required from the user, just usual browsing

  17. WebVCR - application • navigating travelocity.com: Juliana plans to attend the WWW9 conference and is looking for flights from Newark to Amsterdam that leave Newark on May 14th and return from Amsterdam on May 20th. She must take the following steps: • go to http://www.travelocity.com • choose the Find/Book a Flight option • log in • specify the details of the itinerary • produced address: http://dps1.travelocity.com:80/airgchoice.ctl?SEQ=94312

  18. WebVCR – 3 main steps • Notification – tracking the user's actions; possible approaches: • modifying the browser to provide a notification for each action performed • using a proxy to rewrite each page and replace all hrefs with calls to a well-known script that provides the notification • using a proxy to monitor all HTTP commands sent to/from the browser • attaching JavaScript event handlers to all active objects in the page • Recording – storing the user's browsing information • Playback – replaying the user's actions
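
The proxy-rewriting option for notification can be sketched as follows. This is a minimal illustration, not WebVCR's implementation: the `http://proxy.example/notify` script address, the `target` parameter name and the `/flights` path are assumptions made for the example, and the regular expression only handles double-quoted `href` attributes.

```python
import re
from urllib.parse import quote, unquote, urljoin

# Hypothetical address of the "well-known script" that records each click
# and then redirects the browser to the real destination.
NOTIFY_SCRIPT = "http://proxy.example/notify"

# Deliberately simple: matches only href="..." with double quotes.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html, base_url):
    """Replace every href with a call to the notification script."""

    def replace(match):
        # Resolve relative links so the script receives an absolute URL.
        target = urljoin(base_url, match.group(1))
        return 'href="%s?target=%s"' % (NOTIFY_SCRIPT, quote(target, safe=""))

    return HREF_RE.sub(replace, html)


page = '<a href="/flights">Find/Book a Flight</a>'
rewritten = rewrite_links(page, "http://www.travelocity.com/")
print(rewritten)
```

A production proxy would parse the HTML properly and would also have to deal with forms and script-generated navigation, which a pattern over `href` attributes cannot see.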

  19. WebVCR – how to cope with changes • changes do not pose a problem to a user browsing the Web, since the user can easily determine which link to follow, but they do present a challenge to a system that performs automatic navigation • Attempt to locate a link in the last retrieved page corresponding to the DOM location stored in the current smart bookmark step. If the link exists, the target of the link matches the bookmark, and either the URL or the text of the retrieved link matches the step, then use that link. • Otherwise, if there is a unique link in the page whose target, URL, and text match those of the stored link, use that link. • Otherwise, if there is a unique link in the page whose target and URL match those of the stored link, use that link. • Otherwise, if there is a unique link in the page whose target and text match those of the stored link, use that link.

  20. WebVCR – how to cope with changes • Otherwise, if the link corresponds to a CGI-bin script (e.g., contains "?" in it), then find all links that match the stored URL up to the first occurrence of a "?" and store them in a set of candidate links, which we denote L. • Eliminate any elements of L whose parameter names do not match the stored version. For instance, if the stored URL is http://xyz.com/script?x=10&y=12 then http://xyz.com/script?x=20&y=32 matches, but http://xyz.com/script?x=10&z=12 does not, since it has a parameter named z that does not appear in the stored version. • For each parameter in the stored version whose value matches the corresponding parameter value in at least one element of L, eliminate all elements of L with a non-matching value for the same parameter.

  21. WebVCR – how to cope with changes • If L is a singleton set, use that element. • Otherwise, the playback can either be aborted, or the link present at the recorded DOM location can be used to try and proceed through the playback (our implementation uses the latter). However, the playback might fail later in the sequence, or the sequence might traverse pages different from what the user had recorded.  
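
The fallback cascade from the last three slides can be sketched as one function. This is an illustrative reading of the steps above, not the WebVCR source: the `Link` record, the string form of the DOM location and the choice to treat "parameter names match" as equality of the name sets are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import parse_qsl


@dataclass(frozen=True)
class Link:
    dom_path: str  # hypothetical string encoding of the DOM location
    target: str    # target frame or window of the link
    url: str
    text: str


def _unique(links):
    return links[0] if len(links) == 1 else None


def _split(url):
    """Split a URL at the first '?' into (base, {parameter: value})."""
    base, _, query = url.partition("?")
    return base, dict(parse_qsl(query, keep_blank_values=True))


def find_link(stored: Link, page_links: List[Link]) -> Optional[Link]:
    at_location = next(
        (l for l in page_links if l.dom_path == stored.dom_path), None)
    # Link at the recorded DOM location, same target, URL or text matches.
    if (at_location and at_location.target == stored.target
            and (at_location.url == stored.url
                 or at_location.text == stored.text)):
        return at_location
    # Unique link matching target+URL+text, then target+URL, then target+text.
    same_target = [l for l in page_links if l.target == stored.target]
    for keep in (
        lambda l: l.url == stored.url and l.text == stored.text,
        lambda l: l.url == stored.url,
        lambda l: l.text == stored.text,
    ):
        hit = _unique([l for l in same_target if keep(l)])
        if hit:
            return hit
    # CGI-style URL: build the candidate set L and narrow it down.
    if "?" in stored.url:
        base, params = _split(stored.url)
        candidates = [l for l in page_links
                      if "?" in l.url and _split(l.url)[0] == base]
        # Drop candidates whose parameter names differ from the stored ones.
        candidates = [l for l in candidates
                      if set(_split(l.url)[1]) == set(params)]
        # If some candidate agrees on a parameter value, require agreement.
        for name, value in params.items():
            if any(_split(l.url)[1][name] == value for l in candidates):
                candidates = [l for l in candidates
                              if _split(l.url)[1][name] == value]
        hit = _unique(candidates)
        if hit:
            return hit
    # Last resort: whatever sits at the recorded DOM location (None = abort).
    return at_location


stored = Link("/body/a[3]", "_self", "http://xyz.com/script?x=10&y=12", "Search")
page = [
    Link("/body/a[5]", "_self", "http://xyz.com/script?x=20&y=32", "Find"),
    Link("/body/a[6]", "_self", "http://xyz.com/script?x=10&z=12", "Other"),
]
chosen = find_link(stored, page)
print(chosen.url if chosen else "playback aborted")
```

In the usage at the bottom, the page layout and link texts have changed, so the earlier steps find nothing; the parameter-name filter then discards the URL containing `z`, leaving a single candidate.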

  22. WebVCR – problems • HTTP authentication – some user actions cannot be recorded in the client: it is not possible to detect when HTTP authentication takes place, and the values entered by the user are not available through the DOM API • State information – cookies: the login and password are entered only the first time, after that the site lets the user straight through using cookies • Signed applets • Automatic refresh – they assume that auto refresh takes place • Microsoft IE limitations
