
Web-based Information Architectures MSEC 20-760 – Mini II –Fall 2003

Web-based Information Architectures MSEC 20-760 – Mini II –Fall 2003. Location: GSIA 152 Time: 10:30-12:20pm, Tues. & Thurs. Instructor: Prof. Jaime Carbonell Office: NSH 4519 Email: jgc@cs.cmu.edu Tel: 8-7279 [Augmented with expert guest lectures]





Presentation Transcript


  1. Web-based Information Architectures MSEC 20-760 – Mini II – Fall 2003 Location: GSIA 152 Time: 10:30-12:20pm, Tues. & Thurs. Instructor: Prof. Jaime Carbonell Office: NSH 4519 Email: jgc@cs.cmu.edu Tel: 8-7279 [Augmented with expert guest lectures] Teaching assistant: Yan Liu Office: NSH 4506 Email: yanliu@cs.cmu.edu Tel: 8-492 Office Hours: TBD Administrative assistant: Eleanor Cambridge Office: NSH 4517 Email: eleanorc@cs.cmu.edu Tel: 8-4788

  2. Administrative Issues Prerequisites •Basic programming skills (Java) •Familiarity with the web (HTML, browsing, etc.) •Fundamentals of Web Programming (20-753). Grading 30% homework (2 programming assignments) 30% miniproject (student teams will propose) 15% midterm (5 pages of notes, calculator OK, no laptops) 25% final (10 pages of notes, calculator OK, no laptops) Information that will be posted Schedule/syllabus Lecture notes (in PowerPoint) & handouts Homework assignments Announcements & other info

  3. Textbook and Reference Materials (1) Required: Class notes (slides on web site) and handouts (to be provided) Required: "Understanding Search Engines: Mathematical Modeling and Text Retrieval" by Michael W. Berry, Murray Browne Available at http://www.siam.org (tel: 1-800-447-7426) Optional: Background reading material provided

  4. Textbook and Reference Materials (2) Optional: "Advances in Information Retrieval" Edited by Croft, Kluwer Academic Pub., 2000 [more detailed state-of-the-art IR book] Optional: "Machine Learning" by Tom M. Mitchell, WCB McGraw-Hill [Tools for text categorization and data mining.]

  5. Information Retrieval: The Challenge (1) Text DB includes: (1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements in 1961. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers... (2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited... (3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...

  6. Information Retrieval: The Challenge (2) User query states: "Decline in rainfall and impact on farms near Sahara" Challenges •How to retrieve (1) and (3) and not (2)? •How to rank (3) as best? •How to cope with no shared words?

  7. Information Retrieval in eCommerce (1) Bringing in Customers How do Web-search engines work? How to maximize hits on my eCommerce pages? How to maximize preselection of customers who will transact?

  8. Information Retrieval in eCommerce (2) Analyzing the Competition •How do we find the competition? •How will customers find the competition? •Can we do preemptive information strikes? Text Mining •How to learn what customers want most? •How to find out what they missed, but wanted? •How to discover customer search/browsing patterns?

  9. Information Retrieval Assumption (1) Basic IR task •There exists a document collection {Dj} •User enters an ad hoc query Q •Q correctly states user’s interest •User wants {Di} ⊂ {Dj} most relevant to Q

  10. Information Retrieval Assumption (2) "Shared Bag of Words" assumption Every query = {wi } Every document = {wk } ...where wi & wk in same Σ All syntax is irrelevant (e.g. word order) All document structure is irrelevant All meta-information is irrelevant (e.g. author, source, genre) => Words suffice for relevance assessment
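
The "bag of words" assumption on this slide can be illustrated with a minimal Python sketch. The tokenizer (lowercase, letters only) and the use of word counts rather than a plain set are assumptions made for illustration; the slide does not specify either.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for the text; word order is discarded."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Two strings with the same words in a different order give the same bag.
original = bag_of_words("Decline in rainfall near Sahara")
shuffled = bag_of_words("near Sahara rainfall in decline")
print(original == shuffled)
```

Because syntax and structure are thrown away, any two texts with the same word counts are indistinguishable to a system built on this assumption.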

  11. Information Retrieval Assumption (3) Retrieval by shared words If Q and Dj share some wi , then Relevant(Q, Dj ) If Q and Dj share all wi , then Relevant(Q, Dj ) If Q and Dj share over K% of wi , then Relevant(Q, Dj)
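
The three shared-word criteria on this slide can be sketched as follows. This is an illustrative sketch, not the course's implementation: the function names, the tokenizer, and the choice to measure K% against the query's distinct words are all assumptions. The sample texts are fragments of the query and documents from slides 5 and 6.

```python
import re


def words(text):
    """Assumed tokenizer: the set of lowercase letter-only words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def shares_some(query, doc):
    """Relevant if the query and document share at least one word."""
    return bool(words(query) & words(doc))


def shares_all(query, doc):
    """Relevant if every query word appears in the document."""
    return words(query) <= words(doc)


def shares_over(query, doc, k_percent):
    """Relevant if more than k_percent of the query's words appear in the document."""
    q = words(query)
    if not q:
        return False
    return 100.0 * len(q & words(doc)) / len(q) > k_percent


QUERY = "Decline in rainfall and impact on farms near Sahara"
DOC_RAIN = "Rainfall measurements in the Sahara continue to show a steady decline"
DOC_FOOTBALL = "Declines in TV audience ratings are cited"

print(shares_some(QUERY, DOC_RAIN), shares_some(QUERY, DOC_FOOTBALL))
print(shares_all(QUERY, DOC_RAIN))
print(shares_over(QUERY, DOC_RAIN, 40), shares_over(QUERY, DOC_FOOTBALL, 40))
```

Under the weakest criterion the football fragment also counts as relevant, because it shares the word "in" with the query. The percentage threshold separates the two fragments in this small example.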

  12. Boolean Queries (1) Industrial use of Silver Q: silver R: "The Count’s silver anniversary..." "Even the crash of ’87 had a silver lining..." "The Lone Ranger lived on in syndication..." "Silver dropped to a new low in London..." ... Q: silver AND photography R: "Posters of Tonto and the Lone Ranger..." "The Queen’s Silver Anniversary photos..." ...

  13. Boolean Queries (2) Q: ((silver AND (NOT anniversary) AND (NOT lining) AND emulsion) OR (AgI AND crystal AND photography)) R: "Silver Iodide Crystals in Photography..." "The emulsion was worth its weight in silver..." ...
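
A Boolean query like the one on this slide can be evaluated with a small recursive matcher. The sketch below is hypothetical: the nested-tuple query representation, the function names, and the exact-word tokenizer are assumptions for illustration only.

```python
import re


def words(text):
    """Assumed tokenizer: the set of lowercase letter-only words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(query, doc_words):
    """Evaluate a nested Boolean query against a document's word set.

    A query is either a term (str) or a tuple ("AND" | "OR" | "NOT", *operands).
    """
    if isinstance(query, str):
        return query.lower() in doc_words
    op, *args = query
    if op == "AND":
        return all(matches(a, doc_words) for a in args)
    if op == "OR":
        return any(matches(a, doc_words) for a in args)
    if op == "NOT":
        return not matches(args[0], doc_words)
    raise ValueError(f"unknown operator: {op}")


# The query from the slide, written in the assumed tuple form.
QUERY = (
    "OR",
    ("AND", "silver", ("NOT", "anniversary"), ("NOT", "lining"), "emulsion"),
    ("AND", "AgI", "crystal", "photography"),
)

for title in [
    "The emulsion was worth its weight in silver",
    "The Count's silver anniversary",
    "Silver Iodide Crystals in Photography",
]:
    print(title, "->", matches(QUERY, words(title)))
```

With exact word matching, the title fragment "Silver Iodide Crystals in Photography" does not satisfy the query on its own: "Crystals" is not "crystal" and "AgI" does not appear in the title. The slide's result presumably matches on text elided after the title.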

  14. Boolean Queries (3) Boolean queries are: a) easy to implement b) confusing to compose c) seldom used (except by librarians) d) prone to low recall e) all of the above

  15. Beyond the Boolean Boondoggle (1) Desiderata (1) •Query must be natural for all users •Sentence, phrase, or word(s) •No AND’s, OR’s, NOT’s, ... •No parentheses (no structure) •System focus on important words •Q: I want laser printers now

  16. Beyond the Boolean Boondoggle (2) Desiderata (2) • Find what I mean, not just what I say Q: cheap car insurance (pAND (pOR "cheap" [1.0] "inexpensive" [0.9] "discount" [0.5]) (pOR "car" [1.0] "auto" [0.8] "automobile" [0.9] "vehicle" [0.5]) (pOR "insurance" [1.0] "policy" [0.3]))
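
The slide does not define how pAND and pOR combine their weights, so the following is only one possible reading, sketched under stated assumptions: pOR returns the highest weight among its alternatives that occur in the document (0 if none occur), and pAND multiplies the pOR scores. The function names and tokenizer are hypothetical.

```python
import re

# The expanded query from the slide: one dict of weighted alternatives per pOR group.
QUERY = [
    {"cheap": 1.0, "inexpensive": 0.9, "discount": 0.5},
    {"car": 1.0, "auto": 0.8, "automobile": 0.9, "vehicle": 0.5},
    {"insurance": 1.0, "policy": 0.3},
]


def words(text):
    """Assumed tokenizer: the set of lowercase letter-only words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def p_or(weights, doc_words):
    """Assumed pOR: best weight among the alternatives present in the document."""
    return max((w for term, w in weights.items() if term in doc_words), default=0.0)


def p_and(groups, doc_words):
    """Assumed pAND: product of the pOR scores, so a missing group zeroes the score."""
    score = 1.0
    for group in groups:
        score *= p_or(group, doc_words)
    return score


for text in [
    "Cheap car insurance",
    "Inexpensive auto insurance quotes",
    "Discount vehicle policy",
    "Cheap flights",
]:
    print(text, "->", p_and(QUERY, words(text)))
```

Under these assumptions, a document that uses synonyms still receives a nonzero score, ranked below an exact match, which is the "find what I mean" behavior the slide asks for.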

  17. Beyond the Boolean Boondoggle (3) Desiderata (3) •Speech-recognized queries •Coming soon, to a system near you •longer queries •more fluff words to filter •acoustic recognition errors

  18. INFORMATION RETRIEVAL [Diagram: User, The Web, Spider, Search Engine, Inverted Index, Library, etc.]
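
The inverted index named on this slide maps each word to the documents that contain it, so the search engine can answer a query without scanning every document. The sketch below is a minimal illustration under assumed names and an assumed tokenizer; the sample texts are fragments of the documents on slide 5.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase letter-only words, in order."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every query word (set intersection)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


DOCS = {
    1: "Rainfall measurements in the Sahara continue to show a steady decline",
    2: "Declines in TV audience ratings are cited",
    3: "Alarming reductions in precipitation in desert regions",
}

INDEX = build_inverted_index(DOCS)
print(lookup(INDEX, "rainfall sahara"))
print(lookup(INDEX, "precipitation desert"))
```

The index is built once, for example from pages a spider has collected, and each query then touches only the posting sets for its own words.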

  19. INFORMATION RETRIEVAL: APPLICATIONS • Searching Document Archives • Libraries (title, subject, full-text) • Databases of patents and applications • DBs of legal cases (e.g. Lexis, Westlaw) • Searching the Web • Pure search engines (Google, Inktomi, …) • Browsing + Search (Yahoo, Terra-Lycos, …) • Meta-search (Metacrawler, Vivisimo, …) • Corporate or Government Intranets • Non-traditional (e.g. Software DBs, News)

  20. INFORMATION RETRIEVAL (IR) EVOLUTION • IR in the 1980s: • Single collection with < 10^6 documents (archive) • Boolean queries with unordered-set answer • IR circa 2000: • Single collection with > 10^9 documents (web) • Free-form queries with ranked-list answer • IR circa 2010: • Multiple collections > 10^12 docs (invisible web) • “Find what I mean” queries with clustering, summarization and customization.

  21. Content for Rest of the Course (1) [See the website for the latest updates to the course schedule.] Under the Hood •The vector space model for retrieval •Building an inverted index •Term weighting and selection •Web spidering •Automated text categorization

  22. Content for Rest of the Course (2) IR Uses in eCommerce •How to make search engines work for you •How to build optimal search-attractive web sites •The business(es) of web-based information Beyond Web Search Engines •Speech processing primer •Information extraction from web pages •Data mining primer •Multi-media applications •Business models

  23. Optional Quick Review of Linear Algebra If you know n-dimensional vectors, matrices, computing inner products, etc., then you do not need this review. You may take a break. If you learned this material but do not remember it, please stay and listen to refresh your knowledge. If you never learned linear algebra, stay, listen and (optionally) read either: • G. Hadley. Linear Algebra. Addison-Wesley, 1961. Ch. 3. • Or, Stephen W. Goode. An Introduction to Differential Equations and Linear Algebra. Prentice Hall, 1991. Ch. 3.
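
As a quick reminder of the inner product mentioned on this slide, here is a minimal sketch. The four-word vocabulary and the count vectors are invented for illustration and are not taken from the course materials.

```python
def inner_product(u, v):
    """Sum of pairwise products of two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


# Hypothetical word-count vectors over the vocabulary
# ["decline", "rainfall", "sahara", "football"].
query_vec = [1, 1, 1, 0]
rain_doc_vec = [1, 1, 1, 0]
football_doc_vec = [0, 0, 0, 1]

print(inner_product(query_vec, rain_doc_vec))
print(inner_product(query_vec, football_doc_vec))
```

When queries and documents are written as word-count vectors like these, the inner product is larger for texts that share more words with the query.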
