Will XML and Information Retrieval Make Society Transparent?
Gregory B. Newby
School of Information and Library Science
University of North Carolina at Chapel Hill
http://ils.unc.edu/gbnewby
Basic Premise
• Information retrieval will be facilitated by XML because of the additional structure that XML adds.
• This will result in better IR abilities compared to plain text or HTML.
IR is Not Database Retrieval
• Bibliographic retrieval: controlled vocabulary
• Database query: structured data
• Natural language: (semi-)structured or unstructured
Information Retrieval in One Slide
• IR is about matching information to information needs
• Information may be contained in documents, extracts, document surrogates, or newly created documents
• Information needs may be poorly defined, changeable, and context-specific
• We evaluate IR systems by the numbers of relevant documents they identify
  • Recall: proportion of all relevant documents that are retrieved
  • Precision: proportion of retrieved documents that are judged relevant
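The two evaluation measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming relevance judgments are available as a set of document IDs; the function name and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the system returned
    relevant:  document IDs judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

In this example the system is fairly precise (three of four results are relevant) but finds only half of what exists, which is the usual trade-off between the two measures.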
Why IR Sucks
• Human language is ambiguous
  • Polysemy: the same word can mean different things
  • Synonymy: different words can mean the same thing
• The topic or aboutness of a document is hard to assess
• Queries are short and ambiguous
• Information needs are moving and vague targets
Things that help IR
• Structure: matching based on known types of content (e.g., a list vs. discourse)
• Relationships: knowing how groups of documents are related
• Metadata: terms or phrases that are of assuredly high importance
• User knowledge: context, user models, history…
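As a rough illustration of how structure helps, the sketch below compares a plain-text match with a match restricted to one XML element. The records and element names (`doc`, `author`, `body`) are invented for the example and do not come from any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; the element names are illustrative only.
DOCS = """
<collection>
  <doc id="1"><author>Rose Baker</author><body>Notes on XML retrieval.</body></doc>
  <doc id="2"><author>Sam Lee</author><body>How to train as a baker.</body></doc>
  <doc id="3"><author>Ann Cho</author><body>Interview with Rose Baker.</body></doc>
</collection>
"""


def plain_text_match(root, term):
    """Match the term anywhere in the document, as with unstructured text."""
    term = term.lower()
    return [d.get("id") for d in root.iter("doc")
            if term in "".join(d.itertext()).lower()]


def field_match(root, field, term):
    """Match the term only inside the named child element."""
    term = term.lower()
    return [d.get("id") for d in root.iter("doc")
            if term in (d.findtext(field) or "").lower()]


root = ET.fromstring(DOCS)
print(plain_text_match(root, "baker"))       # ['1', '2', '3']
print(field_match(root, "author", "baker"))  # ['1']
```

The plain-text search cannot tell a person named Baker from the occupation, or an author from someone merely mentioned; the element-restricted search returns only the document written by Baker.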
Transparency through Information Access (utopian view)
• What if organizations (government, corporations, etc.) are less able to hide their actions?
• What if individuals’ information is readily accessible to all?
• What if nearly all information that is generated is available to all seekers?
Inequity through Information Access (dystopian view)
• Organizations share their data only when and with whom they choose
• Individuals’ information is hoarded by businesses, government and the people themselves
• Information is available on a fee and authority basis
XML can’t make societal decisions…
• But XML brings about the opportunity for such decisions to be made
• If information is readily available to all, XML will help make it more searchable
• If information is only available to the privileged, XML will make them more powerful
XML Uncertainties
• Will XML be used for markup? Or only at the back end?
• Will standards such as Z39.50 or EDI make it easier to share XML data? Or will translation and mapping be difficult?
• What sort of variety will exist in DTDs? How difficult will it be for IR and database systems to map between DTDs?
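To make the DTD-mapping question concrete, here is a minimal sketch that renames elements from one vocabulary to another and reports what could not be mapped. It assumes the simplest possible case, a one-to-one rename; the element names and the `TAG_MAP` table are hypothetical, and real mappings between DTDs may also need to restructure or split content.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one organization's element names to another's.
TAG_MAP = {"record": "doc", "creator": "author", "heading": "title"}


def map_tags(elem, tag_map):
    """Rename elements in place; return the tags that had no mapping."""
    unmapped = set()
    for node in elem.iter():
        if node.tag in tag_map:
            node.tag = tag_map[node.tag]
        else:
            unmapped.add(node.tag)
    return unmapped


source = ET.fromstring(
    "<record><creator>A. Author</creator><heading>A Title</heading>"
    "<shelfmark>X100</shelfmark></record>"
)
leftover = map_tags(source, TAG_MAP)
print(ET.tostring(source, encoding="unicode"))
print(leftover)  # {'shelfmark'}
```

The leftover set is where the difficulty shows up: every element without an agreed counterpart either has to be dropped, passed through untouched, or mapped by hand.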
XML stakeholders: Big organizations
• Organizations with lots of internal data
  • (The IRS; Time-Warner; others big and small)
• These organizations will benefit from XML + IR by being able to match database-type items with IR-type information needs.
• E.g., “For people who purchase these products, what email and chat messages have they exchanged?”
XML stakeholders: Organizations that share
• Organizations that broker, repackage or resell information will benefit from XML + IR
  • (Credit bureaus, investigative services…)
• XML will make it easier to submit IR queries against multiple datasets and merge the results
• E.g., “See what this person’s public Web pages say before deciding whether to hire him or her.”
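Querying several datasets and merging the results might look like the sketch below. It assumes each dataset has already been reduced to plain-text records keyed by ID, and it uses a deliberately crude score (the fraction of query terms found); the dataset names, records, and scoring rule are all hypothetical.

```python
def score(query, text):
    """Fraction of query terms that appear in the text (a crude score)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def search_all(query, datasets):
    """Run the query against every dataset and merge hits into one ranked list."""
    merged = []
    for source, records in datasets.items():
        for rec_id, text in records.items():
            s = score(query, text)
            if s > 0:
                merged.append((s, source, rec_id))
    # Highest score first; ties broken by source and record ID for stable output.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))


# Hypothetical datasets standing in for separately held collections.
DATASETS = {
    "web_pages": {"w1": "jane doe personal home page", "w2": "cooking blog"},
    "news": {"n1": "jane doe wins award", "n2": "doe family reunion"},
}
for s, source, rec_id in search_all("jane doe", DATASETS):
    print(f"{s:.1f} {source}/{rec_id}")
```

Merging is easy here only because both datasets share one record shape and one scoring function; the point made above is that shared XML structure could provide that common ground across organizations.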
XML stakeholders: Individuals
• Ultimately, much of the most valuable information is by or about individuals
  • (Lifestyle, health, purchasing, travel…)
• IR systems that understand us better will be able to serve us better
• E.g., “Recommend a book based on my past reading, movies and available time to read.”
What we know, revisited
• IR sucks, but is better to the extent that language is disambiguated and structure is present
• People have information needs, but have trouble expressing those needs
• Documents can address some needs, but often real-world information needs are better met by assembling answers from diverse sources
What we don’t know, revisited
• XML: In the background or the foreground?
• How will organizations share XML data (will they?)
• What external forces might make data in all forms more accessible across organizations and to individuals?
XML + IR
• Despite problems, IR has continued to make good progress
• Despite problems, XML appears to be making a strong contribution to storing, organizing and presenting data of all types
• With IR, XML will be more searchable for a variety of purposes
• With XML, IR will gain better precision and ability to serve the needs of individuals and organizations