
Maximizing Web Search Efficiency for Language Research

Explore the challenges and opportunities of conducting linguistic research on the vast web, emphasizing search behaviors, outcomes, and strategies for educators. Discover how KWiCFinder boosts search capabilities for non-English users and linguists, providing insights into web content and research sources.


Presentation Transcript


  1. Concordancing the Web with KWiCFinder • William H. Fletcher, United States Naval Academy • American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001

  2. How Big is the Web? • Now 2-4 billion webpages accessible via public links (Cyveillance estimates & projection, July 2000; Inktomi estimates are more modest.) • “Invisible web” / restricted sites several times larger • Estimated 80%-95% content in English, but… • Since mid 2000, non-Anglophones outnumber English speakers online • Anglophones projected to be < 30% of 850 million users in 2005 • Percentage of new users fluent in English decreasing • For many regions / languages, still no data available

  3. Search Purposes • General users typically seek… • a specific site • any well-stocked site meeting their needs • Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources • Educators want to foster similar online research behavior in their students

  4. Typical Search Behaviors • Marked preference for directories with pre-selected links organized by topic over full-text search engines • Simple queries – single word or phrase – predominate (80%-90%) • 10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed • Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link

  5. Typical Search Outcomes • Users follow up only first few links, then settle on a page after browsing from these • Usual outcome is a match, not best match

  6. Ways to Use the Web for Instruction and Research • Micro level • Discover eloquent examples • Verify current / possible usage, with rough indication of prevalence • Acquire vocabulary not (yet) in dictionaries • Timeliness is essential -- “off-the-shelf corpora” often cannot help here! • Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)

  7. Ways to Use the Web for Instruction and Research (2) • Macro level • Find authentic texts accessible to students • Locate relevant online resources for research projects • Student reports • Scholarly research

  8. Impediments to Finding Relevant Resources Online • Reliance on commercial search engines (SEs) essential due to Web’s size • SEs’ priorities match ours only by coincidence • Link rot • Pages move or disappear • Page content changes

  9. Challenges to Responsible Research • Online there is too much ephemeral content of unknown reliability • Preponderance of journalistic, commercial and personal texts of unknown authorship and authority • Details of sources and research methodology haphazard • Even student papers (gasp) and machine-translated texts (groan choke)

  10. Challenges to Responsible Research (2) • Representativity of Web as Corpus • Much ill-formed or fragmentary language • Domain only a rough clue to provenance • Numbers vs. Statistics • Search engines report number of pages matching a query, not actual citations • One page may contain alternate usages • Narrower filters may eliminate some pages
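
The gap between page counts and citation counts can be made concrete with a small sketch. It assumes the pages have already been retrieved as plain strings; the function name `page_and_citation_counts` and the sample pages are illustrative and do not come from KWiCFinder.

```python
import re

def page_and_citation_counts(pages, term):
    """Return (pages containing the term, total occurrences of the term)."""
    # Whole-word, case-insensitive match; a real search engine may tokenize differently.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits_per_page = [len(pattern.findall(page)) for page in pages]
    pages_matching = sum(1 for hits in hits_per_page if hits > 0)
    return pages_matching, sum(hits_per_page)

# Hypothetical sample data: one page uses both "online" and "on-line".
pages = [
    "online and on-line and online",
    "offline only",
    "Online shopping",
]
print(page_and_citation_counts(pages, "online"))
```

A search engine would report only the first number. The first sample page also shows how one page can contain alternate usages, which a page count alone hides.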

  11. Webidence as Evidence Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.

  12. The Web is not a corpus in the classical sense… …but it does offer an inexhaustible body of linguistic and cultural information for research and use.

  13. Why KWiCFinder? • Automate process of search and retrieval • Expedite evaluation of webpages • Provide specific enhancements for foreign language users and linguists • Encourage students and colleagues to take full advantage of online resources

  14. Why AltaVista? • All words are indexed, including "stopwords" • Distinguishes case and "special characters" • Supports Boolean operators, bracketing, and wildcards • True world-wide coverage, with search by language • No limits to length or complexity of the query • Literal text search, without "second-guessing"

  15. KWiCFinder Enhances AltaVista with… • Intuitive input for foreign characters, bracketing, operators, dates • Inclusion / exclusion criteria that focus the search without appearing in the KWiC report • Automatic search and retrieval in the background returning KWiC abstracts
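
For readers unfamiliar with the format, a KWiC (keyword in context) abstract can be sketched in a few lines. This is a minimal illustration of the general technique, not KWiCFinder's implementation; the function name `kwic`, the `width` parameter, and the bracket markup are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one concordance line per whole-word occurrence of keyword."""
    # Collapse line breaks and runs of spaces so the context reads as one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

sample = "The web is big.\nThe Web is not a corpus in the classical sense."
for line in kwic(sample, "web", width=20):
    print(line)
```

Aligning the keyword in a fixed column is what lets a reader scan many citations quickly and judge usage at a glance.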

  16. KWiCFinder Enhances AltaVista with… (2) • Restricted wildcards ? (exactly 1 char) and % (0-1 char) vs. AltaVista * (0-5 chars) • “Sic” option so “plain” or lower-case char does not match “special” or upper-case variants: • By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
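
The wildcard and “Sic” behavior described on this slide can be approximated with regular expressions. The sketch below rests on several assumptions: a wildcard "character" is treated as a word character (`\w`), the variant set for `a` is copied from the slide, and every other plain lower-case letter falls back to its upper-case form only. The function name `query_to_regex` is hypothetical, and neither KWiCFinder nor AltaVista is claimed to work this way internally.

```python
import re

# Wildcard widths as listed on the slide; treating a "char" as \w is an assumption.
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

# Variant set for "a" copied from the slide; other letters only get case variants here.
VARIANTS = {"a": "aáâäàãæåAÁÂÄÀÃÆÅ"}

def query_to_regex(query, sic=False):
    """Translate a single-word query into a compiled whole-word regex."""
    parts = []
    for ch in query:
        if ch in WILDCARDS:
            parts.append(WILDCARDS[ch])
        elif not sic and ch.isascii() and ch.islower():
            # Default behavior: a plain lower-case letter matches its variants.
            variants = VARIANTS.get(ch, ch + ch.upper())
            parts.append("[" + re.escape(variants) + "]")
        else:
            # "Sic" behavior, or an upper-case / special char: match literally.
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")

print(bool(query_to_regex("colo%r").search("colour")))    # % allows 0-1 chars
print(bool(query_to_regex("mas").search("más")))          # default: a matches á
print(bool(query_to_regex("mas", sic=True).search("más")))  # sic: a matches only a
```

The narrow wildcards matter for precision: `colo%r` covers color and colour, whereas a 0-5 character wildcard in the same position would admit many unrelated words.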

  17. KWiCFinder Enhances AltaVista with… (3) “Tamecards” -- User inputs pattern, KF generates variants: • on-line matches on-line, on line, online • s[iau]ng matches sing, sang, sung • {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
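
The three tamecard examples above suggest an expansion procedure, sketched here under assumptions inferred from the slide rather than from KWiCFinder documentation: a hyphen yields hyphen, space, or nothing; a bracket list offers free alternatives (single characters, or comma-separated strings when a comma is present); brace lists are paired positionally, so the nth pronoun goes only with the nth ending. The function name `expand_tamecard` is hypothetical.

```python
import itertools
import re

# Tokens: {brace list}, [bracket list], a hyphen, or a run of literal text.
TOKEN = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|(-)|([^{}\[\]-]+)")

def expand_tamecard(pattern):
    """Expand a tamecard-style pattern into a list of literal search strings."""
    parts = []  # entries: ("lit", str), ("free", [alts]) or ("paired", [alts])
    for match in TOKEN.finditer(pattern):
        brace, bracket, hyphen, literal = match.groups()
        if brace is not None:
            parts.append(("paired", brace.split(",")))
        elif bracket is not None:
            alts = bracket.split(",") if "," in bracket else list(bracket)
            parts.append(("free", alts))
        elif hyphen is not None:
            parts.append(("free", ["-", " ", ""]))
        else:
            parts.append(("lit", literal))

    # Assumption: all brace lists advance together, so they must be equally long.
    lengths = {len(alts) for kind, alts in parts if kind == "paired"}
    if len(lengths) > 1:
        raise ValueError("brace lists must have the same number of alternatives")
    rows = lengths.pop() if lengths else 1

    free_lists = [alts for kind, alts in parts if kind == "free"]
    variants = []
    for row in range(rows):
        for combo in itertools.product(*free_lists):
            free_choice = iter(combo)
            pieces = []
            for kind, value in parts:
                if kind == "lit":
                    pieces.append(value)
                elif kind == "paired":
                    pieces.append(value[row])
                else:
                    pieces.append(next(free_choice))
            variant = "".join(pieces)
            if variant not in variants:  # e.g. "se" appears twice in the pronoun list
                variants.append(variant)
    return variants

print(expand_tamecard("on-line"))
print(expand_tamecard("s[iau]ng"))
print(expand_tamecard("{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an}"))
```

Under these assumptions the Spanish pattern also generates ill-formed strings such as "me desperto"; they are harmless in a search because they should rarely match real text, while mismatched pairs such as "me despiertas" are never generated.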

  18. How Does XML Enhance KWiCFinder? • Search results become a dynamic database for end user to manipulate: • categorize, annotate, delete, merge / split searches, citations and documents • Free tools permit developer or end-user to restyle and add interactivity to reports • Layouts • Languages • Data format

  19. Why WebKWiC? • Original hope: cross-platform, cross-browser solution • Minimal entry threshold: small download of HTML pages + JavaScript • Support for non-Western European languages

  20. Why Google? • Link popularity ranking puts relevant sites at or near top of list • Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently • Largest number of pages analyzed • Matching pages always* available in cache with KWiC markup

  21. How Does WebKWiC Complement Google? • Focuses and enhances interface for language learners • Provides tools to navigate among citations and documents • Simplifies management of multiple windows

  22. Future of Web Concordancing • Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites • Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues

  23. Pleas(e) Visit http://miniappolis.com/ • Download and try KWiCFinder and WebKWiC • View bibliography as well as this and related presentations • Use these tools with your students • Send feedback and suggestions to fletcher@miniappolis.com
