Automation and Customization of Rendered Web Pages • Michael Bolin, Greg Little, Marcos Ojeda, Matt Webber, Philip Rha, Tom Wilson, Rob Miller • MIT CSAIL • http://uid.csail.mit.edu/chickenfoot • Supported by NSF IIS-0447800
Web Applications • The Web has become a major application platform
Automating Repetitive Operations • Bookmark my latest bank statement • Download many links at once • Fill in defaults for forms
Transforming Appearance • Change color scheme for better contrast • Concatenate multiple pages
Integrating Multiple Web Sites • Bookstore has links for New Books, Used Books, Auction… but not for my local library • Realtor has lots of data about houses for sale… but not length of my commute
Web Apps Are Wonderfully Open • Web apps have automatic hooks for scripting • Display: machine-readable HTML • Commands: generic HTTP requests • Presentation: editable HTML, stylesheets • Web “screen scraping” is already common, mainly behind the scenes (e.g., pricescan.com) • But most users don’t do it
Problem: Many Web Apps Require A Browser • Many web apps depend on the rich browser environment • Cookies, authentication, SSL, session IDs, plugins, user-agents, client-side scripting, proxies • Perl/Python scripts run outside the browser, so they can’t easily access these web apps • Solution: do customization in the browser • Greasemonkey for Firefox • User Javascript for Opera
Problem: Web Apps Are Scary Under the Hood • HTML source of most sites is complex • This complexity is a real barrier to automation & customization
Solution: Use Rendered View • Chickenfoot: user shouldn’t have to look at HTML source to customize the Web
Outline • Demo • Language • Commands • Keyword patterns • Implementation • Pattern matching algorithm • Evaluation
Chickenfoot Language • Chickenscratch = Javascript + runtime library • Javascript syntax • Standard browser objects: document.links[], window.open() • Document Object Model (DOM): Node, Element, Text, Range • Chickenfoot-specific objects and commands
Commands • Page navigation: go(url), openTab(url), fetch(url) • Clicking and form manipulation: click(button-or-link), check(checkbox-or-radio), enter([textbox], value), pick([listbox], choice) • Pattern matching: find(pattern) • Page modification: insert(pattern, html), replace(pattern, html), remove(pattern) • Widgets & input handling: new Link(html, action), onClick(pattern, action)
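A short script shows how these commands compose. The sketch below is illustrative and not taken from the talk: the URL and the keyword patterns are made up, and the commands are passed in as a `cf` parameter (in Chickenfoot they are globals) so the function can also be tried against a stub outside the browser. `makeRecorder` is a hypothetical helper that only records calls.

```javascript
// Hypothetical script using the commands listed above.
// `cf` stands in for Chickenfoot's global commands (an assumption made so
// the sketch is self-contained); the URL and patterns are illustrative.
function runSearch(cf, query) {
  cf.go("http://www.google.com");      // page navigation
  cf.enter("search textbox", query);   // fill in a textbox named by keyword pattern
  cf.click("search button");           // click a button named by keyword pattern
  return cf.find("link");              // pattern matching on the resulting page
}

// Minimal recording stub for trying the script without a browser.
// It does not touch any page; it only logs which command was called.
function makeRecorder() {
  const calls = [];
  const record = (name) => (...args) => {
    calls.push([name, ...args]);
    return calls.length;
  };
  return {
    calls,
    go: record("go"),
    enter: record("enter"),
    click: record("click"),
    find: record("find"),
  };
}
```

Inside Chickenfoot itself the body of `runSearch` would be written without the `cf.` prefix.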
Keyword Patterns • Keywords + component type • Examples: "feeling lucky button", "depart textbox", "search web form" • Component type is optional for click(), enter(), check(), pick() • Nested pattern matching: find("start address form").find("city textbox")
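One way to read "keywords + component type" is as a split on the last word of the pattern. The following is a minimal sketch under that assumption; the list of component types and the function name `parsePattern` are hypothetical, not Chickenfoot's actual parser.

```javascript
// Assumed set of component-type words; the real system may recognize others.
const COMPONENT_TYPES = new Set([
  "button", "link", "checkbox", "radio", "textbox", "listbox", "form",
]);

// Split a keyword pattern into its keywords and an optional trailing
// component type. If the last word is not a known type, type is null.
function parsePattern(pattern) {
  const words = pattern.trim().toLowerCase().split(/\s+/).filter(Boolean);
  const last = words[words.length - 1];
  if (words.length > 0 && COMPONENT_TYPES.has(last)) {
    return { keywords: words.slice(0, -1), type: last };
  }
  return { keywords: words, type: null };
}
```

With this reading, "feeling lucky button" yields the keywords "feeling" and "lucky" with type "button", while a bare "depart" has no type, which matches the note that the type is optional for click(), enter(), check(), and pick().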
Keyword Patterns vs. Other Names • Keyword pattern: "all words textbox" • Javascript: document.f.as_q • XPath: //body/form/table[1]/tbody/tr/td/table/tbody/tr[0]/td/table/tbody/tr/td[1]/table/tbody/tr[0]/td[1]/input • HTML source: …<td>with <b>all</b> of the words</font></td> <td><input value="" name="as_q" size="25" type="text">…
Pattern Matching Algorithm • Find labels matching the keywords • Find components matching each label • Rank & choose best • [Figure: the matcher takes a pattern ("google search button") and a web page, and outputs a ranked list of components with scores 1.0, 0.5, 0.5]
1. Find Labels Matching Keywords • Label = visible chunk of text: text nodes; button labels, listbox items; ALT attributes on images • Tolerant matching: capitalization, word ordering, punctuation, typos • Example label: with <b>all</b> of the words
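The tolerant matching described here can be sketched as follows. This is an assumption-based illustration, not the matcher used by Chickenfoot: the scoring rule (fraction of keywords found), the typo threshold (one edit, only for keywords longer than three letters), and the function names are all hypothetical.

```javascript
// Lowercase, replace punctuation with spaces, and split into words, so
// capitalization and punctuation do not affect matching.
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean);
}

// Standard Levenshtein edit distance, computed with a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Fraction of keywords that appear somewhere in the label's visible text.
// Word order is ignored; longer keywords tolerate a one-character typo.
function labelScore(keywords, labelText) {
  const kws = tokenize(keywords);
  const words = tokenize(labelText);
  if (kws.length === 0) return 0;
  const hits = kws.filter((k) =>
    words.some((w) => editDistance(k, w) <= (k.length > 3 ? 1 : 0)));
  return hits.length / kws.length;
}
```

The label text passed in is the rendered text, so the markup in "with <b>all</b> of the words" would already be reduced to "with all of the words" before scoring.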
2. Find Component Matching Label • Search in rendered view • Component must be aligned with label • Degree of match given by: • pixel distance • relative position • HTML path length
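A rough sketch of how alignment, pixel distance, relative position, and HTML path length could be folded into one score is given below. The formula, the weights (100 pixels, 10 path steps, the 0.5 penalty), and the preference for labels to the left of or above a component are assumptions for illustration; the talk does not specify them.

```javascript
// Rectangles are { left, top, right, bottom } in rendered-page pixels.
function overlaps(a1, a2, b1, b2) {
  return a1 < b2 && b1 < a2;
}

// Hypothetical degree of match between a label and a component.
// pathLength is the number of DOM steps between the two nodes.
function componentScore(label, comp, pathLength) {
  const sameRow = overlaps(label.top, label.bottom, comp.top, comp.bottom);
  const sameCol = overlaps(label.left, label.right, comp.left, comp.right);
  if (!sameRow && !sameCol) return 0; // not aligned: no match

  // Pixel distance: gap between nearest edges along the alignment axis.
  const gap = sameRow
    ? Math.max(0, comp.left - label.right, label.left - comp.right)
    : Math.max(0, comp.top - label.bottom, label.top - comp.bottom);

  // Relative position: assume captions usually sit to the left of or above
  // their component, or inside it (as with a button's own text).
  const inside = sameRow && sameCol;
  const labelFirst = sameRow ? label.right <= comp.left : label.bottom <= comp.top;
  const positionFactor = inside || labelFirst ? 1 : 0.5;

  // Larger gaps and longer HTML paths both lower the score.
  return positionFactor / (1 + gap / 100 + pathLength / 10);
}
```

Because the score works on rendered rectangles, a caption and textbox that sit in different table cells in the HTML can still match strongly when they appear side by side on screen.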
3. Rank the Matching Components • Rank score for each <label,component> pair is computed from: • Match between keywords and label • Match between label and component • Highest-ranked component is returned • If there’s a tie, find() returns the ambiguous matches, but click/enter/pick/check() throw an error
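The ranking and tie behavior can be sketched as below. Combining the two match scores by multiplication is an assumption (the talk only says the rank is computed from both), and `findSketch` and `clickSketch` are hypothetical stand-ins for find() and click(), operating on precomputed scores rather than a real page.

```javascript
// pairs: [{ component, labelScore, componentScore }]
// Combine the two scores (here by product, an assumption) and sort descending.
function rank(pairs) {
  return pairs
    .map((p) => ({ component: p.component, score: p.labelScore * p.componentScore }))
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score);
}

// All components that share the top score. Exact equality is a
// simplification; a real matcher might use a tolerance.
function bestMatches(pairs) {
  const ranked = rank(pairs);
  if (ranked.length === 0) return [];
  return ranked.filter((r) => r.score === ranked[0].score).map((r) => r.component);
}

// find()-like behavior: ambiguous matches are returned to the caller.
function findSketch(pairs) {
  return bestMatches(pairs);
}

// click()-like behavior: acting on an ambiguous or empty match is an error.
function clickSketch(pairs) {
  const best = bestMatches(pairs);
  if (best.length !== 1) {
    throw new Error("expected exactly one match, found " + best.length);
  }
  return best[0];
}
```

The asymmetry reflects the slide: a query can safely report ambiguity, but an action such as a click needs exactly one target.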
Evaluation • Web-based survey of textbox naming • 40 respondents (24 programmers, rest not) • Comprehension: which textbox on the page is identified by this pattern? • Generation: how would you identify this textbox uniquely using only words visible on the page?
Results of Generation Task • Patterns for which algorithm found: right match, wrong match, multiple matches • [Chart values as extracted, in groups of three: 40 0 0 · 40 0 0 · 38 2 0 · 40 0 0 · 37 2 1 · 0 26 14]
Disambiguation Strategies • Keywords from section heading: "above person not available Mi" • Counting: "second mi" • [Figure annotation: same caption]
Future Work • More component types for patterns: box, table, image • Programming by demonstration • Pointing at page to generate patterns • Clicking & form filling to generate scripts • Javascript syntax extensions
Conclusion • Chickenfoot automates and customizes web applications without looking under the hood • Simple language • Keyword patterns • Development environment in web browser • http://uid.csail.mit.edu/chickenfoot