Maximizing Web Search Efficiency for Language Research

Concordancing the Web with KWiCFinder • William H. Fletcher, United States Naval Academy • American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001

How Big is the Web? • Now 2-4 billion webpages accessible via public links (Cyveillance estimates & projection July 2000; Inktomi estimates are more modest.) • “Invisible web” / restricted sites several times larger • Estimated 80%-95% content in English, but… • Since mid-2000, non-Anglophones outnumber English speakers online • Anglophones projected to be < 30% of 850 million users in 2005 • Percentage of new users fluent in English decreasing • For many regions / languages, still no data available

Search Purposes • General users typically seek… • a specific site • any well-stocked site meeting their needs • Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources • Educators want to foster similar online research behavior in their students

Typical Search Behaviors • Marked preference for directories with pre-selected links organized by topic over full-text search engines • Simple queries – single word or phrase – predominate (80%-90%) • 10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed • Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link

Typical Search Outcomes • Users follow up only first few links, then settle on a page after browsing from these • Usual outcome is a match, not best match

Ways to Use the Web for Instruction and Research • Micro level • Discover eloquent examples • Verify current / possible usage, with rough indication of prevalence • Acquire vocabulary not (yet) in dictionaries • Timeliness is essential -- “off-the-shelf corpora” often cannot help here! • Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)

Ways to Use the Web for Instruction and Research (2) • Macro level • Find authentic texts accessible to students • Locate relevant online resources for research projects • Student reports • Scholarly research

Impediments to Finding Relevant Resources Online • Reliance on commercial search engines (SEs) essential due to Web’s size • SEs’ priorities match ours only by coincidence • Link rot • Pages move or disappear • Page content changes

Challenges to Responsible Research • Online there is too much ephemeral content of unknown reliability • Preponderance of journalistic, commercial and personal texts of unknown authorship and authority • Details of sources and research methodology are haphazard • Even student papers (gasp) and machine-translated texts (groan choke)

Challenges to Responsible Research (2) • Representativity of Web as Corpus • Much ill-formed or fragmentary language • Domain only a rough clue to provenance • Numbers vs. Statistics • Search engines report number of pages matching a query, not actual citations • One page may contain alternate usages • Narrower filters may eliminate some pages
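The "Numbers vs. Statistics" point can be made concrete with a small sketch. It assumes the pages have already been downloaded into a dictionary of URL to text; the function name `page_and_citation_counts` and the sample pages are hypothetical and only illustrate how a page count can diverge from a citation count.

```python
import re

def page_and_citation_counts(pages, variants):
    """For each variant, count matching pages and total occurrences (citations)."""
    counts = {}
    for variant in variants:
        # Whole-word, case-insensitive match of the literal variant.
        pattern = re.compile(rf"\b{re.escape(variant)}\b", re.IGNORECASE)
        hits = [len(pattern.findall(text)) for text in pages.values()]
        counts[variant] = {
            "pages": sum(1 for h in hits if h),  # what a search engine reports
            "citations": sum(hits),              # what a corpus linguist wants
        }
    return counts

# Hypothetical sample data: page "b" contains both alternate usages.
pages = {
    "a": "We are online. Stay online, shop online.",
    "b": "Go on-line or stay online.",
    "c": "No match here.",
}
counts = page_and_citation_counts(pages, ["online", "on-line"])
```

Here "online" occurs on two pages but four times in all, so comparing raw page counts for the two spellings would understate how much more frequent one form is.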

Webidence as Evidence Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.

The Web is not a corpus in the classical sense… …but it does offer an inexhaustible body of linguistic and cultural information for research and use.

Why KWiCFinder? • Automate process of search and retrieval • Expedite evaluation of webpages • Provide specific enhancements for foreign language users and linguists • Encourage students and colleagues to take full advantage of online resources

Why AltaVista? • All words are indexed, including "stopwords" • Distinguishes case and "special characters" • Supports Boolean operators, bracketing, and wildcards • True world-wide coverage, with search by language • No limits to length or complexity of the query • Literal text search, without "second-guessing"

KWiCFinder Enhances AltaVista with… • Intuitive input for foreign characters, bracketing, operators, dates • Inclusion / exclusion criteria to focus search (not shown in KWiC report) • Automatic search and retrieval in the background returning KWiC abstracts
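For readers unfamiliar with the format, a KWiC (keyword in context) abstract lines up each occurrence of the search term with a window of text on either side. The following is a minimal sketch of that idea, not KWiCFinder's own implementation; the function name `kwic` and the `width` parameter are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword, with `width` chars of context."""
    flat = " ".join(text.split())  # collapse line breaks and runs of whitespace
    lines = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for m in re.finditer(pattern, flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords form a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sample = "The cat sat. The dog ran."
lines = kwic(sample, "the", width=5)
```

Printing `lines` one per row gives the familiar concordance layout, with the keyword in a fixed column.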

KWiCFinder Enhances AltaVista with… (2) • Restricted wildcards ? (exactly 1 char) and % (0-1 char) vs. AltaVista * (0-5 chars) • “Sic” option so “plain” or lower-case char does not match “special” or upper-case variants: • By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
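The wildcard and "sic" behavior on this slide can be approximated with regular expressions. This is a sketch under stated assumptions rather than KWiCFinder's actual matching code: the names `WILDCARDS`, `fold` and `find_matches` are hypothetical, wildcards are assumed to stand for word characters only, and the default (non-sic) mode is simplified to folding both the term and the text.

```python
import re
import unicodedata

# Assumed translation of the slide's wildcards into regex fragments.
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

def fold(text):
    """Lower-case and strip combining accents. Ligatures such as æ have no
    decomposition and are left alone, unlike the search engine default."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_matches(term, text, sic=False):
    """Find whole-word matches of term in text; with sic=True, case and
    accents must match exactly. Non-sic results are returned in folded form."""
    if sic:
        term = unicodedata.normalize("NFC", term)
        text = unicodedata.normalize("NFC", text)
    else:
        term, text = fold(term), fold(text)
    body = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in term)
    return re.findall(rf"\b{body}\b", text)

optional = find_matches("colo%r", "color colour colouur")
single = find_matches("wom?n", "woman women womn")
broad = find_matches("walk*", "walk walked walking walkathons")
loose = find_matches("papa", "Papa pap\u00e1 papa")
exact = find_matches("papa", "Papa pap\u00e1 papa", sic=True)
```

The restricted wildcards keep the candidate set small: `%` admits "colour" but not a longer stray form, whereas `*` would also accept up to five further characters.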

KWiCFinder Enhances AltaVista with… (3) “Tamecards” -- User inputs pattern, KF generates variants: • on-line matches on-line, on line, online • s[iau]ng matches sing, sang, sung • {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
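The tamecard examples above suggest a simple expansion procedure, sketched below. The rules are inferred from the three examples on the slide and are assumptions, not KWiCFinder's documented grammar: a hyphen yields hyphenated, spaced and solid forms; square brackets list alternatives (single characters, or comma-separated items where an empty item means "nothing"); curly-brace groups are paired position by position. The name `expand_tamecard` is hypothetical.

```python
import itertools
import re

# One token per match: {a,b}, [abc] or [a,], a hyphen, or a literal run.
TOKEN = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|(-)|([^{}\[\]-]+)")

def expand_tamecard(pattern):
    """Expand a tamecard pattern into the list of literal search variants."""
    segments = []
    for m in TOKEN.finditer(pattern):
        brace, bracket, hyphen, literal = m.groups()
        if brace is not None:
            segments.append(("parallel", brace.split(",")))
        elif bracket is not None:
            alts = bracket.split(",") if "," in bracket else list(bracket)
            segments.append(("free", alts))
        elif hyphen is not None:
            segments.append(("free", ["-", " ", ""]))
        else:
            segments.append(("free", [literal]))
    # All brace groups are read in parallel, so they must be the same length.
    widths = {len(alts) for kind, alts in segments if kind == "parallel"}
    if len(widths) > 1:
        raise ValueError("brace groups must list the same number of forms")
    rows = widths.pop() if widths else 1
    variants = []
    for row in range(rows):
        choices = [[alts[row]] if kind == "parallel" else alts
                   for kind, alts in segments]
        for combo in itertools.product(*choices):
            variant = "".join(combo)
            if variant not in variants:
                variants.append(variant)
    return variants

hyphenated = expand_tamecard("on-line")
ablaut = expand_tamecard("s[iau]ng")
reflexive = expand_tamecard(
    "{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an}")
```

Under this reading the Spanish pattern generates twelve candidate strings, including non-forms such as "me desperto"; those simply find nothing, while pairing the braces rules out mismatches such as "me despiertas".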

How Does XML Enhance KWiCFinder? • Search results become a dynamic database for end user to manipulate: • categorize, annotate, delete, merge / split searches, citations and documents • Free tools permit developer or end-user to restyle and add interactivity to reports • Layouts • Languages • Data format
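To illustrate what treating search results as a manipulable database might look like, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`search`, `document`, `citation`, `url`, `note`) and the function `annotate_and_prune` are invented for this example and should not be taken as KWiCFinder's actual report schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout, for illustration only.
SAMPLE = """<search query="on-line">
  <document url="http://example.org/a">
    <citation>shop online today</citation>
    <citation>stay on line</citation>
  </document>
  <document url="http://example.org/b">
    <citation>an on-line course</citation>
  </document>
</search>"""

def annotate_and_prune(xml_text, drop_url, note):
    """Delete the document with drop_url and annotate the remaining ones."""
    root = ET.fromstring(xml_text)
    for doc in list(root.findall("document")):  # copy: we mutate root below
        if doc.get("url") == drop_url:
            root.remove(doc)
        else:
            doc.set("note", note)
    return root

root = annotate_and_prune(SAMPLE, "http://example.org/b", "relevant")
remaining = root.findall("document")
```

Because the result is still a tree, it can be serialized again with `ET.tostring(root)` and restyled by whatever stylesheet or tool the user prefers.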

Why WebKWiC? • Original hope: cross-platform, cross-browser solution • Minimal entry threshold: small download of HTML pages + JavaScript • Support for non-Western European languages

Why Google? • Link popularity ranking puts relevant sites at or near top of list • Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently • Largest number of pages analyzed • Matching pages always* available in cache with KWiC markup

How Does WebKWiC Complement Google? • Focuses and enhances interface for language learners • Provides tools to navigate among citations and documents • Simplifies management of multiple windows

Future of Web Concordancing • Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites • Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues

Pleas(e) Visit http://miniappolis.com/ • Download and try KWiCFinder and WebKWiC • View bibliography as well as this and related presentations • Use these tools with your students • Send feedback and suggestions to fletcher@miniappolis.com
