
CS336

CS336. Lecture 8: Indexing Languages.


Presentation Transcript


  1. CS336 Lecture 8: Indexing Languages

  2. File organizations or indexes are used to increase the performance of the system
     • Inverted files, signature files, bitmaps
     • Text indexing is the process of deciding what terms will be used to represent a given document
     • Index terms are then used to build indexes for the documents
     • A retrieval model describes how the indexed terms are incorporated into a model
     • Relationship between retrieval model and indexing model
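     A minimal sketch of the first file organization named above, an inverted file, assuming whitespace tokenization and lowercasing as the only text processing. The function name and the three toy documents are illustrative and are not part of the lecture.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are lowercase, whitespace-separated tokens.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection, for illustration only.
docs = {
    1: "ebola virus infection",
    2: "virus vaccines",
    3: "dna vaccines for ebola",
}
index = build_inverted_index(docs)
print(index["ebola"])
print(index["vaccines"])
```

     Looking up a term is then a single dictionary access, which is the performance gain that an index gives over scanning every document.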

  3. Generating Document Representations
     • Want to use significant terms to build representations
     • Manual indexing: professional indexers
        • Manually assign terms from a controlled vocabulary
        • Typically phrases
     • Automatic indexing: machine selects
        • Terms can be single words, phrases, or other features from the text of documents
        • Takes ~1 hour to index 10 GB

  4. Index Languages
     • Language used to describe docs and queries
     • Exhaustivity: number of different topics indexed; completeness or breadth
        • increased exhaustivity => higher recall / lower precision
     • Specificity: accuracy of indexing; detail
        • increased specificity => higher precision / lower recall
     • Pre-coordinate indexing
        • combinations of terms (e.g. phrases) used as an indexing label
     • Post-coordinate indexing
        • combinations generated at search time
        • Most common
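     The difference between pre-coordinate and post-coordinate indexing can be sketched as follows. This is only an illustration: the toy documents, the assigned phrase labels, and the Boolean AND used to combine terms at search time are all assumptions, not something the slide prescribes.

```python
def index_post_coordinate(docs):
    """Index single words; combinations are formed only at search time."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search_all(index, query):
    """Combine the query terms at search time with a Boolean AND."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


def index_pre_coordinate(labels):
    """Index whole phrase labels that were chosen at indexing time."""
    index = {}
    for doc_id, phrases in labels.items():
        for phrase in phrases:
            index.setdefault(phrase.lower(), set()).add(doc_id)
    return index


# Hypothetical documents and hypothetical indexer-assigned phrase labels.
docs = {
    1: "ebola virus infection",
    2: "viral vaccines",
    3: "dna vaccines for ebola virus",
}
labels = {
    1: ["ebola virus"],
    2: ["viral vaccines"],
    3: ["dna vaccines", "ebola virus"],
}

post = index_post_coordinate(docs)
pre = index_pre_coordinate(labels)

print(search_all(post, "virus vaccines"))   # combination built at search time
print(pre.get("virus vaccines", set()))     # never assigned as a label
print(pre.get("ebola virus", set()))        # assigned as a label
```

     The post-coordinate index can answer any combination of indexed words, while the pre-coordinate index can only answer combinations that an indexer anticipated.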

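  5. The Trade Off
     [Figure: precision (vertical axis, 0.5 marked) against recall (horizontal axis, 0.5 marked); narrow terms at the high-precision end, broad terms at the high-recall end]
     • Students want high precision: narrow terms.
     • Lawyers want high recall: broad terms.
     • For an unknown population, use terms in the middle.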
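     For reference, precision and recall follow the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A small sketch with made-up document ids (the "narrow" and "broad" result sets are hypothetical) shows the trade-off:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}                 # hypothetical relevance judgments
narrow = {1, 2}                         # hypothetical result of a narrow-term query
broad = {1, 2, 3, 4, 5, 6, 7, 8}        # hypothetical result of a broad-term query

print(precision_recall(narrow, relevant))
print(precision_recall(broad, relevant))
```

     In this toy case the narrow query is fully precise but finds only half of the relevant documents, and the broad query finds all of them at half the precision.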

  6. MeSH: Medical Subject Headings
     • Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html

  7. Disadvantages of Manual Indexing
     • Considerable human effort
     • Controlled vocabulary per collection
     • Subjective
        • overlap between the terms assigned by different indexers is only about 40%
     • But …
        • Human experts who use indexing aids describing allowable vocabulary and usage (e.g. “scope notes”) achieve good indexing uniformity
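     The slide does not say how the roughly 40% figure is measured. One common way to quantify agreement between two indexers is the ratio of shared terms to all terms either of them assigned (Jaccard overlap); the sketch below assumes that measure, and the two term sets are invented for illustration.

```python
def indexer_overlap(terms_a, terms_b):
    """Jaccard overlap: shared terms divided by all terms either indexer used."""
    terms_a, terms_b = set(terms_a), set(terms_b)
    union = terms_a | terms_b
    if not union:
        return 1.0  # two empty term sets agree trivially
    return len(terms_a & terms_b) / len(union)


# Hypothetical term assignments by two indexers for the same document.
indexer_a = {"Ebola Virus", "Viral Vaccines", "Guinea Pigs", "Plasmids"}
indexer_b = {"Ebola Virus", "Viral Vaccines", "Vaccines, DNA"}

print(indexer_overlap(indexer_a, indexer_b))
```

     Here the indexers share two of five distinct terms, so the overlap is 0.4.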

  8. Development of Automatic Methods
     • 60’s: search services relied on manual approaches
        • automatic methods were sometimes an add-on
        • focus remained the use of intermediaries (specialists)
        • strong belief that manual must be better than natural language
     • What caused focus to shift?
        • sheer volume of text: very costly to maintain vocabulary and indexing
        • full text of documents became more readily available … less reliance on abstracts and titles
        • computing power and access increased
     • The Web!
        • Encouraged direct searching by user
        • reduced dependence on professional searchers

  9. Which is better?
     • Salton claims the results of automatic indexing are comparable to manual
        • Based on small databases
        • Can depend upon task and environment
     • Experiments have shown that using both manual and automatic indexing improves performance
        • “combination of evidence”
     • Typically, manual indexing is not a practical option. Why?

  10. Automatic Indexing with Full Text
      • more flexible: no decisions about doc content are made at the time of indexing
         • no a priori assumptions about future search needs
         • indexing effort not devoted to docs outside search scope
         • document left open to a variety of index descriptions
      • post-coordination indexing lets user define representation
      • but, no effort given to explain document content
         • pressures user to think more carefully about search
         • pressures system designer to develop tools to aid user

  11. Manual vs Automatic Indexing

  12. MeSH: Medical Subject Headings
      • Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html

  13. Category C. Diseases
      C1. Bacterial Infections and Mycoses
      C2. Virus Diseases
      C3. Parasitic Diseases
      C4. Neoplasms
      C5. Musculoskeletal Diseases
      C6. Digestive System Diseases
      C7. Stomatognathic Diseases
      C8. Respiratory Tract Diseases
      C9. Otorhinolaryngologic Diseases
      C10. Nervous System Diseases
      C11. Eye Diseases
      C12. Urologic and Male Genital Diseases
      C13. Female Genital Diseases and Pregnancy Complications
      C14. Cardiovascular Diseases
      C15. Hemic and Lymphatic Diseases
      C16. Neonatal Diseases and Abnormalities
      C17. Skin and Connective Tissue Diseases
      C18. Nutritional and Metabolic Diseases
      C19. Endocrine Diseases
      C20. Immunologic Diseases
      C21. Injuries, Poisonings, and Occupational Diseases
      C22. Animal Diseases
      C23. Symptoms and General Pathology

      Category C2. Virus Diseases
      ---------------------------
      Arbovirus Infections
      African Horse Sickness
      Bluetongue
      Dengue
      Dengue Hemorrhagic Fever
      Encephalitis, Epidemic
      Encephalitis, California
      Encephalitis, Japanese
      Encephalitis, St. Louis
      Encephalitis, Tick-Borne
      West Nile Fever
      Encephalomyelitis, Equine
      Encephalomyelitis, Venezuelan Equine
      Phlebotomus Fever
      Rift Valley Fever
      Tick-Borne Diseases
      African Swine Fever
      Colorado Tick Fever
      Encephalitis, Tick-Borne
      Hemorrhagic Fever, Crimean
      Hemorrhagic Fever, Omsk
      Kyasanur Forest Disease
      Nairobi Sheep Disease
      West Nile Fever
      Yellow Fever

  14. Example “Ebola” document

      Nat Med 1998 Jan;4(1):37-42
      Immunization for Ebola virus infection.
      Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
      Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.

      Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

  15. Indexing
      • If you were to look for documents about immunization against the Ebola virus, what might your query look like?

  16. Example “Ebola” document

      Nat Med 1998 Jan;4(1):37-42
      Immunization for Ebola virus infection.
      Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
      Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.

      Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

      MH - Animal
      MH - Antibody Formation
      MH - Disease Models, Animal
      MH - Ebola Virus/*immunology
      MH - Female
      MH - Guinea Pigs
      MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control
      MH - Human
      MH - Male
      MH - Mice
      MH - Mice, Inbred BALB C
      MH - Nucleocapsid Proteins/immunology
      MH - Plasmids
      MH - T-Lymphocytes/immunology
      MH - Transfection
      MH - *Vaccines, DNA
      MH - Viral Proteins/biosynthesis/immunology
      MH - *Viral Vaccines
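      The manually assigned MH lines above can be turned into index terms with a small parser. This is a sketch under stated assumptions: it follows the MEDLINE convention that "/" separates a heading from its qualifiers and that a leading "*" marks a major topic, and it assumes no heading itself contains "/". The function name `parse_mh` and the record layout are illustrative.

```python
def parse_mh(line):
    """Parse one 'MH - ...' line into heading, qualifiers, and a major-topic flag."""
    tag, sep, value = line.partition("-")
    if tag.strip() != "MH" or not sep:
        raise ValueError(f"not an MH line: {line!r}")
    # Assumption: '/' only ever separates the heading from its qualifiers.
    heading, *qualifiers = value.strip().split("/")
    return {
        "heading": heading.lstrip("*"),
        # Major if the heading or any of its qualifiers carries an asterisk.
        "major": heading.startswith("*") or any(q.startswith("*") for q in qualifiers),
        "qualifiers": [q.lstrip("*") for q in qualifiers],
    }


# A few of the MH lines from the example document.
lines = [
    "MH - Ebola Virus/*immunology",
    "MH - Guinea Pigs",
    "MH - *Vaccines, DNA",
    "MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control",
]
records = [parse_mh(line) for line in lines]
major_headings = [r["heading"] for r in records if r["major"]]
print(major_headings)
```

      The major headings are the terms a manual index would weight most heavily for this document, which makes them a useful point of comparison with the words a searcher would type for the query on the previous slide.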
