CS336 Lecture 8: Indexing Languages
File organizations or indexes are used to increase system performance • Inverted files, signature files, bitmaps • Text indexing is the process of deciding which terms will be used to represent a given document • Index terms are then used to build indexes for the documents • A retrieval model describes how the indexed terms are incorporated into a model • Relationship between retrieval model and indexing model
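As an illustration of the first file organization named above, here is a minimal sketch of an inverted file in Python. The three tiny documents and the function name `build_inverted_index` are made up for this example, and whitespace tokenization is an assumption; a real system would use a proper tokenizer and store postings on disk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection, not taken from the lecture.
docs = {
    1: "ebola virus infection",
    2: "dengue virus vaccine",
    3: "ebola vaccine trial",
}
index = build_inverted_index(docs)
print(index["virus"])  # [1, 2]
print(index["ebola"])  # [1, 3]
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the performance gain the slide refers to.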
Generating Document Representations • Want to use significant terms to build representations • Manual indexing: professional indexers • Manually assign terms from a controlled vocabulary • Typically phrases • Automatic indexing: machine selects • Terms can be single words, phrases, or other features from the text of documents • Takes ~1 hour to index 10 GB
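A rough sketch of what "machine selects" could look like for single words and two-word phrases. The stopword list, the regular-expression tokenizer, and the name `extract_terms` are assumptions for illustration only; the lecture does not prescribe a particular selection method.

```python
import re

# Tiny illustrative stopword list (assumption, not from the lecture).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def extract_terms(text):
    """Return (single-word terms, two-word phrases) selected from the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    words = [t for t in tokens if t not in STOPWORDS]
    # Keep adjacent word pairs only when neither word is a stopword.
    phrases = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
               if a not in STOPWORDS and b not in STOPWORDS]
    return words, phrases

words, phrases = extract_terms("Immunization for Ebola virus infection")
print(words)    # ['immunization', 'ebola', 'virus', 'infection']
print(phrases)  # ['ebola virus', 'virus infection']
```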
Index Languages • Language used to describe docs and queries • Exhaustivity: number of different topics indexed; completeness or breadth • increased exhaustivity => higher recall / lower precision • Specificity: accuracy of indexing; level of detail • increased specificity => higher precision / lower recall • Pre-coordinate indexing • combinations of terms (e.g. phrases) used as an indexing label • Post-coordinate indexing • combinations generated at search time • the most common approach
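The difference between pre- and post-coordinate indexing can be sketched as follows. Both toy indexes, their document ids, and the helper `post_coordinate_search` are hypothetical; combining terms with a Boolean AND is one simple way to generate combinations at search time, not the only one.

```python
# Pre-coordinate: the phrase itself is the indexing label, fixed at indexing time.
pre_coordinate_index = {
    "viral vaccines": {2, 3},
    "ebola virus": {1, 3},
}

# Post-coordinate: only single terms are indexed; combinations are formed at search time.
post_coordinate_index = {
    "viral": {2, 3, 4},
    "vaccines": {2, 3, 5},
    "ebola": {1, 3},
    "virus": {1, 3, 4},
}

def post_coordinate_search(index, terms):
    """Intersect the postings of every query term (Boolean AND)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# A combination nobody anticipated at indexing time:
print(pre_coordinate_index.get("ebola vaccines", set()))                     # set()
print(post_coordinate_search(post_coordinate_index, ["ebola", "vaccines"]))  # {3}
```

The pre-coordinate index can only answer for labels chosen in advance, while the post-coordinate index lets the searcher build the combination on the fly.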
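The Trade Off • [Figure: precision versus recall curve, both axes marked at 0.5; narrow terms at the high-precision end, broad terms at the high-recall end] • Students want high precision: narrow terms • Lawyers want high recall: broad terms • For an unknown population, use terms in the middle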
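To make the trade-off concrete, here is a small sketch using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document ids and the two result sets are invented to mimic a narrow and a broad query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}                    # hypothetical relevance judgments
narrow_result = {1, 2}                     # narrow terms: few documents, all relevant
broad_result = {1, 2, 3, 4, 5, 6, 7, 8}    # broad terms: everything relevant plus noise

print(precision_recall(narrow_result, relevant))  # (1.0, 0.5)
print(precision_recall(broad_result, relevant))   # (0.5, 1.0)
```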
MeSH Medical Subject Headings Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html
Disadvantages of Manual Indexing • Considerable human effort • Controlled vocabulary per collection • Subjective • intersection between the terms assigned by different indexers is only about 40% • But … • Human experts who use indexing aids describing allowable vocabulary and usage (e.g. “scope notes”) achieve good indexing uniformity
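One way to quantify how much two indexers agree is to compare the sets of terms they assign to the same document. The slide does not say which measure lies behind the 40% figure; the sketch below assumes intersection over union (the Jaccard coefficient), and both term sets are hypothetical.

```python
def indexer_overlap(terms_a, terms_b):
    """Shared terms divided by all distinct terms assigned (Jaccard coefficient)."""
    union = terms_a | terms_b
    if not union:
        return 1.0  # two empty assignments agree trivially
    return len(terms_a & terms_b) / len(union)

# Hypothetical assignments by two indexers for the same document.
indexer_a = {"Ebola Virus", "Viral Vaccines", "Guinea Pigs", "Plasmids"}
indexer_b = {"Ebola Virus", "Viral Vaccines", "Transfection"}

print(indexer_overlap(indexer_a, indexer_b))  # 0.4
```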
Development of Automatic Methods • 60’s: search services relied on manual approaches • automatic methods were sometimes an add-on • focus remained the use of intermediaries (specialists) • strong belief that manual must be better than natural language • What caused focus to shift? • sheer volume of text: very costly to maintain vocabulary and indexing • full text of documents became more readily available … less reliance on abstracts and titles • computing power and access increased • The Web! • Encouraged direct searching by user • reduced dependence on professional searchers
Which is better? • Salton claims results of automatic indexing are comparable to manual • Based on small databases • Can depend upon task and environment • Experiments have shown that using both manual and automatic indexing improves performance • “combination of evidence” • Typically, manual indexing is not a practical option. Why?
Automatic Indexing with Full Text • more flexible: no decisions about doc content are made at the time of indexing • no a priori assumptions about future search needs • indexing effort not devoted to docs outside search scope • document left open to a variety of index descriptions • post-coordination indexing lets user define representation • but, no effort given to explain document content • pressures user to think more carefully about search • pressures system designer to develop tools to aid user
MeSH Medical Subject Headings Faceted classification: http://www.nlm.nih.gov/mesh/2006/MeSHtree.html
Category C. Diseases
C1. Bacterial Infections and Mycoses
C2. Virus Diseases
C3. Parasitic Diseases
C4. Neoplasms
C5. Musculoskeletal Diseases
C6. Digestive System Diseases
C7. Stomatognathic Diseases
C8. Respiratory Tract Diseases
C9. Otorhinolaryngologic Diseases
C10. Nervous System Diseases
C11. Eye Diseases
C12. Urologic and Male Genital Diseases
C13. Female Genital Diseases and Pregnancy Complications
C14. Cardiovascular Diseases
C15. Hemic and Lymphatic Diseases
C16. Neonatal Diseases and Abnormalities
C17. Skin and Connective Tissue Diseases
C18. Nutritional and Metabolic Diseases
C19. Endocrine Diseases
C20. Immunologic Diseases
C21. Injuries, Poisonings, and Occupational Diseases
C22. Animal Diseases
C23. Symptoms and General Pathology

Category C2. Virus Diseases
---------------------------
Arbovirus Infections
African Horse Sickness
Bluetongue
Dengue
Dengue Hemorrhagic Fever
Encephalitis, Epidemic
Encephalitis, California
Encephalitis, Japanese
Encephalitis, St. Louis
Encephalitis, Tick-Borne
West Nile Fever
Encephalomyelitis, Equine
Encephalomyelitis, Venezuelan Equine
Phlebotomus Fever
Rift Valley Fever
Tick-Borne Diseases
African Swine Fever
Colorado Tick Fever
Encephalitis, Tick-Borne
Hemorrhagic Fever, Crimean
Hemorrhagic Fever, Omsk
Kyasanur Forest Disease
Nairobi Sheep Disease
West Nile Fever
Yellow Fever
Example “Ebola” document
Nat Med 1998 Jan;4(1):37-42
Immunization for Ebola virus infection.
Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.
Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.
Indexing • If you were to look for documents about immunization against the Ebola virus, what might your query look like?
Example “Ebola” document
Nat Med 1998 Jan;4(1):37-42
Immunization for Ebola virus infection.
Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ
Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.
Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.
MH - Animal
MH - Antibody Formation
MH - Disease Models, Animal
MH - Ebola Virus/*immunology
MH - Female
MH - Guinea Pigs
MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control
MH - Human
MH - Male
MH - Mice
MH - Mice, Inbred BALB C
MH - Nucleocapsid Proteins/immunology
MH - Plasmids
MH - T-Lymphocytes/immunology
MH - Transfection
MH - *Vaccines, DNA
MH - Viral Proteins/biosynthesis/immunology
MH - *Viral Vaccines
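To see how the full text and the manually assigned MH terms of this record complement each other, the sketch below checks which words of one possible query each representation matches. The query "immunization ebola vaccines" is only an example, the text is shortened to the title and final sentence of the abstract, only four of the MH lines are used, and exact word matching without stemming is an assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Title and last abstract sentence from the record above.
full_text = ("Immunization for Ebola virus infection. "
             "Immunity to Ebola virus can therefore be developed through "
             "genetic vaccination and may facilitate efforts to limit the "
             "spread of this disease.")

# A subset of the MH lines from the record above.
mesh_headings = [
    "Ebola Virus/*immunology",
    "Hemorrhagic Fever, Ebola/*immunology/*prevention & control",
    "*Vaccines, DNA",
    "*Viral Vaccines",
]

query = tokenize("immunization ebola vaccines")  # example query (assumption)
text_hits = query & tokenize(full_text)
mesh_hits = query & tokenize(" ".join(mesh_headings))

print(sorted(text_hits))              # ['ebola', 'immunization']
print(sorted(mesh_hits))              # ['ebola', 'vaccines']
print(sorted(text_hits | mesh_hits))  # ['ebola', 'immunization', 'vaccines']
```

Neither representation alone matches every query word here, which is one small illustration of why combining manual and automatic indexing can help.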