
A Look at the Technology Under the Hood

Explore the technology behind ScienceEducation.gov's content integration capabilities, including crawling and indexing, federated search, and learning level stratification. Learn how the experience of the E-Print Network has contributed to the development of these tools.



Presentation Transcript


  1. A Look at the Technology Under the Hood. Abe Lederman, President and CTO, Deep Web Technologies, Inc. ScienceEducation.gov Meeting, National Academy of Sciences, March 18, 2009

  2. Content Integration Technologies for ScienceEducation.gov • Crawling and Indexing (part of Science.gov, E-Print Network) • Federated Search (Science.gov, WorldWideScience.org) ScienceEducation.gov needs to integrate content from a variety of websites and databases, which requires custom tools that other search engines are unable to provide.

  3. Drawing on the Experience of the E-Print Network Gateway to 30,000 websites and databases worldwide, containing over 5 million e-prints in basic and applied sciences.

  4. Drawing on the Experience of the E-Print Network • Initially developed in 2001 • Crawls and indexes 30,000 websites • Uses sophisticated filters to ensure that only quality e-prints are included in the Network • Contains full-text index of over 1.5 million e-prints • Uses an Admin Tool to manage websites in the E-Print Network

  5. What is Federated Search? Federated Search is an application or service that allows a user to submit a search in parallel to multiple, distributed information sources and retrieve aggregated, ranked and de-duped results.

  6. In Other Words… One Search, Many Sources. [Diagram: a single search sent to DOD, NSF, DOE, EPA, NIH, FDA, NASA, and other agencies]
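The definition on slide 5 (parallel submission, then aggregation, ranking, and de-duplication) can be illustrated with a minimal Python sketch. This is not the Deep Web Technologies implementation: the `Result` type, the `federated_search` function, and the two stand-in sources are all hypothetical, and the sketch assumes every source reports a comparable relevance score, which real sources generally do not.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, NamedTuple


class Result(NamedTuple):
    url: str
    title: str
    score: float  # relevance reported by the source; higher is better (assumed comparable)


# A "source" is any callable that takes a query string and returns results.
Source = Callable[[str], List[Result]]


def federated_search(query: str, sources: Dict[str, Source]) -> List[Result]:
    """Send one query to every source in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(search, query) for search in sources.values()]
        batches = [future.result() for future in futures]

    # De-duplicate by URL, keeping the highest-scoring copy of each result.
    best: Dict[str, Result] = {}
    for batch in batches:
        for result in batch:
            seen = best.get(result.url)
            if seen is None or result.score > seen.score:
                best[result.url] = result

    # Rank the merged list by score, best first.
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


# Hypothetical stand-ins for real agency search endpoints.
def fake_doe(query: str) -> List[Result]:
    return [Result("http://example.gov/a", "Energy basics", 0.9),
            Result("http://example.gov/b", "Solar lesson", 0.4)]


def fake_nasa(query: str) -> List[Result]:
    return [Result("http://example.gov/b", "Solar lesson", 0.7),
            Result("http://example.gov/c", "Sun facts", 0.5)]


hits = federated_search("solar energy", {"DOE": fake_doe, "NASA": fake_nasa})
```

Threads are used here because waiting on remote sources is I/O-bound. A production system would also need per-source timeouts and some form of score normalization before ranking.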

  7. Assembling the ScienceEducation.gov Search Engine, Part I: Education Experts Assemble Starting URLs

  8. Assembling the ScienceEducation.gov Search Engine, Part II: Starting URLs → Crawl Websites → Filter Bad URLs and Remove Duplicates → Assign Learning Levels → Build Index → ScienceEducation.gov Index

  9. Challenges Ahead • Determining what sites to crawl • Filtering undesirable URLs • Assigning appropriate learning level to content • Categorizing content

  10. To Crawl or Not To Crawl? [Diagram labels: "Don’t crawl these pages"; "Would miss these"; "Will crawl these"]

  11. Filtering Undesirable URLs: All Crawled URLs → Filter (Calendar, Contact, Feedback, Housing, . . ., Registration, Survey) → Good URLs
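One plausible reading of the filter on slide 11 is a keyword blocklist applied to each crawled URL. The sketch below assumes exactly that; the slide does not say how the filter actually matches, and the names `UNDESIRABLE_TERMS` and `filter_good_urls` are hypothetical. Only the terms named on the slide are listed, and the elided entries are not filled in.

```python
import re
from typing import Iterable, List
from urllib.parse import urlparse

# Terms named on the slide; the entries elided there (". . .") are left out.
UNDESIRABLE_TERMS = ["calendar", "contact", "feedback", "housing",
                     "registration", "survey"]

_PATTERN = re.compile("|".join(map(re.escape, UNDESIRABLE_TERMS)), re.IGNORECASE)


def filter_good_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs whose path and query contain none of the blocked terms."""
    good = []
    for url in urls:
        parsed = urlparse(url)
        # Check path and query only, so that a host name containing a blocked
        # term does not exclude an entire site.
        target = parsed.path + "?" + parsed.query
        if not _PATTERN.search(target):
            good.append(url)
    return good


crawled = [
    "http://example.gov/lessons/volcanoes.html",
    "http://example.gov/about/Contact.html",
    "http://example.gov/events/registration?year=2009",
    "http://survey.example.gov/rocks/index.html",
]
good_urls = filter_good_urls(crawled)
```

Substring matching is deliberately crude: it would also drop a legitimate page whose path happens to contain a blocked word, so a real filter would likely need per-site exceptions.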

  12. Removing Duplicate Web Pages URL: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/education_threats.html DUP: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/ocean_planet_book_threats.html TITLE: Ocean Planet: Threats SNIPPET: Threats to the health of the oceans Oil spills account for only about five percent of the oil entering the oceans The Coast Guard estimates that for United States waters sewage treatment plants discharge twice as much oil each year as tanker spills Each year industrial household cleaning gardening and automotive products pollute water About 65 000 chemicals are used commercially in the United States today with about 1 000 new ones added each year Only about 300 have been extensively tested for toxicity It is estimated that medical waste that washed up onto Long Island and New Jersey beaches in the summer of 1988 cost as much as 3 billion in lost revenue from tourism and recreation.
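The slide shows two URLs serving the same page but does not say how duplicates are detected. A minimal sketch, assuming duplicates are pages whose text is identical after normalizing case and punctuation, hashes the normalized text and groups URLs by hash. The function names are hypothetical, and the page text in the example is abbreviated from the snippet above.

```python
import hashlib
import re
from typing import Dict, Iterable, List, Tuple


def fingerprint(text: str) -> str:
    """Hash the page text after collapsing punctuation, whitespace, and case."""
    normalized = re.sub(r"\W+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each kept URL to the URLs found to duplicate it (first seen wins)."""
    kept_by_fingerprint: Dict[str, str] = {}
    duplicates: Dict[str, List[str]] = {}
    for url, text in pages:
        fp = fingerprint(text)
        if fp in kept_by_fingerprint:
            duplicates[kept_by_fingerprint[fp]].append(url)
        else:
            kept_by_fingerprint[fp] = url
            duplicates[url] = []
    return duplicates


BASE = "http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/"
pages = [
    (BASE + "education_threats.html",
     "Threats to the health of the oceans. Oil spills..."),
    (BASE + "ocean_planet_book_threats.html",
     "threats to the health of the oceans -- oil spills"),
    ("http://example.gov/other.html", "Something else entirely"),
]
kept = remove_duplicates(pages)
```

Exact hashing only catches pages that match completely after normalization. Pages that differ in navigation or boilerplate would need a near-duplicate technique such as shingling, which this sketch does not attempt.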

  13. Learning Level Stratification

  14. Categorizing Content • Audience: Student or Teacher • Grade Level: K-3, 4-6, 7-9, 10-12, College • Content Type: Interactive Activities, Lesson Plans, Reference Materials, Science Fair Projects, Videos • Subject Area: Chemistry, Computer Science, Energy, Life Sciences, Mathematics, Physics
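The four facets on slide 14 can be written down as a small controlled vocabulary. The sketch below is only an illustration of that data structure: the `FACETS` table and `Category` class are hypothetical names, and it assumes each resource carries exactly one value per facet, which the slide does not state.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Allowed values per facet, copied from the slide.
FACETS: Dict[str, FrozenSet[str]] = {
    "audience": frozenset({"Student", "Teacher"}),
    "grade_level": frozenset({"K-3", "4-6", "7-9", "10-12", "College"}),
    "content_type": frozenset({
        "Interactive Activities", "Lesson Plans", "Reference Materials",
        "Science Fair Projects", "Videos",
    }),
    "subject_area": frozenset({
        "Chemistry", "Computer Science", "Energy", "Life Sciences",
        "Mathematics", "Physics",
    }),
}


@dataclass(frozen=True)
class Category:
    """One value per facet (an assumption; a resource might fit several)."""
    audience: str
    grade_level: str
    content_type: str
    subject_area: str

    def __post_init__(self) -> None:
        # Reject any value that is not in the controlled vocabulary.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {facet}")


example = Category("Student", "4-6", "Videos", "Physics")
```

Validating at construction time keeps misspelled or out-of-vocabulary labels from reaching the index, where they would silently fragment the search facets.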

  15. A Look at the Technology Under the Hood. Thank you! Abe Lederman, abe@deepwebtech.com, www.deepwebtech.com
