630 likes | 775 Views
Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/. Lecture 12: Metadata and Markup. SIMS 202: Information Organization and Retrieval. Lecture Overview. Review
E N D
Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/ Lecture 12: Metadata and Markup SIMS 202: Information Organization and Retrieval
Lecture Overview • Review • Thesaurus Design And Development • Thesaurus Design • Steps In Thesaurus Development • Metadata And Markup • XML As A Metadata Lingua Franca • XML DTD Construction • XML For Protocols And Metadata Languages
Lecture Overview • Review • Thesaurus Design And Development • Thesaurus Design • Steps In Thesaurus Development • Metadata And Markup • XML As A Metadata Lingua Franca • XML DTD Construction • XML For Protocols And Metadata Languages
Structure of an IR System Storage Line Interest profiles & Queries Documents & data Search Line Information Storage and Retrieval System Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language Formulating query in terms of descriptors Indexing (Descriptive and Subject) Storage of profiles Storage of Documents Store1: Profiles/ Search requests Store2: Document representations Comparison/ Matching Potentially Relevant Documents Adapted from Soergel, p. 19
Thesauri • A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms
Thesauri (cont.) • Examples • The ERIC Thesaurus of Descriptors • The Medical Subject Headings (MESH) of the National Library of Medicine • The Art and Architecture Thesaurus
Why Develop a Thesaurus? • To provide a conceptual structure or “space” for a body of information • To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity • To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
Development of a Thesaurus • Term selection • Merging and development of concept classes • Definition of broad subject fields and subfields • Development of classificatory structure • Review, testing, application, revision
Flow of Work in Thesaurus Construction Select Sources Define Broad Subject Fields Improve Class Structure Assign codes Sort Terms into Broad Subject Fields Print Classified Index and review Select Terms Define Subfields within one Subject Field Discuss with Experts and Users Record Selected Terms Work out detailed structure of the Subject Field Select descriptors and checklist items Revise as needed Many Modifications? Select Preferred Terms Sort Terms Yes No All Subfields of Broad Subject finished? No Merge identical Terms Assign Notation Yes Merge Terms in Same Concept class Produce Full Thesaurus and Check references All Broad Subjects finished? No Review and Test Based on Soergel, pp 327-333 Yes
The Indexing Process • Concept identification • Term selection (via thesaurus) • Term assignment
Application: The Indexing Process (Manual) Select Alternative term to represent Concept NO Does Thesaurus contain term for Concept Can Concept be expressed combining terms? Start Examine Document and Identify Significant Concepts Consider First Concept NO Is Term suitable Establish Term Denoting Concept YES YES NO Preferred Term? Select Preferred Term NO Is There Another Concept End NO YES YES YES Consider Preferred Term Consider Each of These Terms Admit New Term Into Thesaurus Assign Terms to Document NO Would Concept be better represented by one of these terms Prefer Alternative Term(s) Consider any associated terms in Thesaurus (NT,BT) YES Adapted from ISO 5963, p.5
Lecture Overview • Review • Thesaurus Design And Development • Thesaurus Design • Steps In Thesaurus Development • Metadata And Markup • XML As A Metadata Lingua Franca • XML DTD Construction • XML For Protocols And Metadata Languages
What is SGML/XML? • SGML stands for Standard Generalized Markup Language • XML stands for eXtended Markup Language • What it is NOT: • Not a visual document description • Not an application specific markup • Not proprietary
What is SGML/XML? • What it is: • An international standard (SGML- ISO 8879:1986) • A generic language for describing the structure of documents, and markup that can be used for those documents • Intended for generating markup for content rather than form elements • XML is a simplified subset of SGML (established by W3C)
Customer profiles Vendor profiles Catalogs Datasheets Price lists Purchase orders Invoices Inventory reports Bill of materials Contracts Credit reports Bank statements Proposals Directories Transportation schedules Receipts The Documents of Commerce Source Dr. Robert J. Glushko
Alternatives for Exchanging Documents Publish information for a universalclient Batch and high-volume exchange between trading partners Application Integration HTML CORBA / COM EDI Formatbased APIbased Source Dr. Robert J. Glushko
Limitations of Each Exchange Model Pre-wired Heavyweight to implement Not native to the web Must be pre-arranged High cost Rigid and inflexible Formatting markup “for eyes” “Scrape and hope” integration HTML CORBA / COM EDI Formatbased APIbased Source Dr. Robert J. Glushko
Having Our Cake And Eating It Too • We need: • The precision of APIs • The simplicity of HTML Source Dr. Robert J. Glushko
XML to the Rescue (SGML and HTML++) • Extensible Markup Language • A simplification of SGML, the Standard Generalized Markup Language • Instead of a fixed set of format-oriented tags like HTML, XML allows you to create the schema— whatever set of tags are needed—for your information type or application • This makes any XML instance “self-describing” and easily understood by computers and people • Version 1.0 ratified by W3C in 2/98 • Backed by Microsoft, Sun, Netscape, many others Source Dr. Robert J. Glushko
Why XML is Revolutionary • XML enables a business to preserve any “document type” or “database schema” when it publishes on the Web • XML enables a business to send self-describing “business messages” that can be understood by programs, not just “by eye” • This information cannot be encoded in HTML • XML-encoded information is smart enough to support new classes of Web applications Source Dr. Robert J. Glushko
XML Enables New Web Applications • Data interchange between Web clients • Use Web for application integration without information loss (example: product information in supply chain, EDI) • Moving processing from server to client • Reduce network traffic and server load (example: download airline schedule, find best flights without “back-and-forth” thrashing) Source Dr. Robert J. Glushko
XML Enables New Web Applications • Multiple client-side views of same data • Expert and novice versions • Manager and worker versions • Localization (currency or measurement conversions) • “Information push” from personalized applications • Selecting information based on user preferences (example: custom news feed by matching article keywords against user profile) Source Dr. Robert J. Glushko
The First Generation Web Computers Browsers .. making information accessible through browsers HTML scripts Eyeballs only No automation Limited integration Source Dr. Robert J. Glushko
Airline Schedule Flight Information United Airlines #200 San Francisco 9:30 AM Honolulu 12:30 PM $368.50 HTML Airline Schedule Seen “By Eye” Source Dr. Robert J. Glushko
<Title>Airline Schedule</Title> <Body> <H2>Flight Information</H2> <H3>United Airlines #200</H3> <UL><LI>San Francisco <LI>9:30 AM <LI>Honolulu <LI>12:30 PM <LI>$368.50 </UL></Body> HTML Airline Schedule Seen “By Computer” Source Dr. Robert J. Glushko
Next Generation Web Computers Computers .. making information and services accessible to computers (and people) XML Java Structured searches Agents New models Source Dr. Robert J. Glushko
Airline Schedule in XML <TransportSchedule Type=“Airline”> <Segment Id=“United Airlines #200”> <Origin>San Francisco</Origin> <DepartTime>9:30 AM</DepartTime> <Destination>Honolulu</Destination> <ArriveTime>12:30 PM</ArriveTime> <Price Currency=“USD”>368.50</Price> </Segment> </TransportSchedule> Source Dr. Robert J. Glushko
Shared Semantics for Time and Location • Shared semantics for location and time in all schemas that need them enables richer “commerce networks” of services: • <TransportSchedule Type=“Airline”> ... • <Destination>Honolulu</Destination> • <Accommodation Type=“Hotel”>... • <Destination>Honolulu</Destination> • <Event Type=“Concert”>… • <Destination>Honolulu</Destination> Source Dr. Robert J. Glushko
Automated Vacation Planning Service • Book me the cheapest flight to Honolulu the first week of January • Find a hotel room for the day I arrive • What concerts are taking place the next day? Source Dr. Robert J. Glushko
The Common Business Language • Specifies common semantics, common syntax, and message packaging for information held by and exchanged among transaction partners and market participants • These documents are the interfaces among the commerce components envisioned in the overall eCo architecture being realized in a current ATP project being carried out by CNgroup, CommerceNet, BusinessBots, and Tesserae • CBL’s focus is on the functions and information that are common to all business domains Source Dr. Robert J. Glushko
CBL and XML • CBL documents are described by XML DTDs to make them “self-descriptive” and validatable • CBL builds on existing standard or industry semantics where possible • Complex descriptions and messages can be composed from primitives • Domain-specific XML applications can be implemented in “native” form or as “hybrids” for maximal interoperability Source Dr. Robert J. Glushko
CBL Building Blocks CBL Documents Business Descriptions Business Forms Vendor Catalog core Services Purchase Order core Products Invoice Measurements Locale Classification Time Address SIC core Currency Country NAICS core Weight Language FSC core Source Dr. Robert J. Glushko
CBL Building Blocks CBL Documents Business Descriptions Business Forms Vendor Catalog core Services Purchase Order core Products Invoice Measurements Locale Classification Time Address SIC core Currency Country NAICS core Weight Language FSC core Source Dr. Robert J. Glushko
If Interested In CBL • Visit: • http://www.xcbl.org/ • And for e-commerce applications using CBL, visit: • http://www.commerceone.com/
Lecture Overview • Review • Thesaurus Design And Development • Thesaurus Design • Steps In Thesaurus Development • Metadata And Markup • XML As A Metadata Lingua Franca • XML DTD Construction • XML For Protocols And Metadata Languages
SGML/XML Structure • An SGML document consists of three parts: • The SGML Declaration • The Document Type Definition (DTD) • The Document Instance • An XML document REQUIRES only the document instance, but for effective processing a DTD is very important • XML Schema provides an alternative to DTDs for XML applications
Document Type Definitions • The DTD describes the structural elements and "shorthand" markup for a particular document type and defines: • Names of "legal" elements • How many times elements can appear • The order of elements in a document • Whether markup can be omitted (SGML only) • Contents of elements (i.e., nested structures) • Attributes associated with elements • Names of "entities" • Short-hand conventions for element tags (SGML only)
DTD Components • The major components of a DTD are: • Entity Declarations • Element Declarations • Attribute Declarations
Document Type Definitions • Entity Declarations are a "macro" definition facility for both DTD and Document instance parts • General Internal Entity Definitions<!ENTITY name "substitute string">referenced by &name; • General External Entity Definitions<!ENTITY name SYSTEM "file path">referenced by &name; • Parameter Entity Definitions (used only inside DTDs)<!ENTITY %name "substitute string">or<!ENTITY %name SYSTEM "file path">referenced by %name; or %name
Document Type Definitions • Element Declarations define the structural elements of a document and its associated markup<!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) > • Omitted tag minimization indicates whether start-tags or end-tags can be omitted in the markup (o) or (-) are required in SGML but can NOT be used in XML
Document Type Definitions • Content model provides a nested structural description of the elements that make up this element, e.g.: <!ELEMENT memo - - ((to & from), body, close?)> <!ELEMENT body - O (p)* > <!ELEMENT p - O (#PCDATA | q)*> <!ELEMENT q - - (#PCDATA)>... • ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order
Document Type Definitions • Same content model in XML <?xml version = “1.0”?> <!DOCTYPE memo [ <!ELEMENT memo ((to | from)+, body, close?)><!ELEMENT body (p)* ><!ELEMENT p (#PCDATA | q)* ><!ELEMENT q (#PCDATA)>… ]> • Note the XML processing instruction “Prolog” • Note that & in previous page is not legal XML
Document Type Definitions • Declared content can be:PCDATA, CDATA, RCDATA, EMPTY • Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model (NOT in XML), e.g.: <!ELEMENT memo -- ((to & from), body close?) +(fn)> • Says that element fn can appear anyplace in the memo
Document Type Definitions • Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes
Attributes Example • <!ATTLIST associate_element attribute_name declared_value default_value > • <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC> • In markup of a document: <memo status="CONFIDENTIAL">also, because of the default set:<memo>would be the same as <memo status="PUBLIC">There are a variety of special defaults and data types that can be given in attribute definitions
Sample SGML DTD <!doctype ELIB-TEXTS [ <!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. --> <!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)> <!-- We allow most elements to occur any number of times in any order --> <!-- this is because there is little consistency in the actual usage. --> <!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*, (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT AUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*, (TEXT-REF | PAGED-REF)* )> <!-- We won't make any assumptions about content... all PCDATA --> <!ELEMENT ID - o (#PCDATA)> <!ELEMENT ABSTRACT - o (#PCDATA)> <!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)> <!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)> <!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)> … etc… ]>
XML Version <!doctype ELIB-TEXTS [ <!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. --> <!ELEMENT ELIB-TEXTS(ELIB-BIB*)> <!-- We allow most elements to occur any number of times in any order --> <!-- this is because there is little consistency in the actual usage. --> <!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*, (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT AUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*, (TEXT-REF | PAGED-REF)* )> <!-- We won't make any assumptions about content... all PCDATA --> <!ELEMENT ID (#PCDATA)> <!ELEMENT ABSTRACT (#PCDATA)> <!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)> <!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)> <!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)> … etc… ]>
Document Using That DTD <ELIB-BIB> <BIB-VERSION>ELIB-v1.0 </BIB-VERSION> <ID>6</ID> <ENTRY>February 13 1995</ENTRY> <DATE>March 1, 1993</DATE> <TITLE>Water Conditions in California Report 2</TITLE> <ORGANIZATION>California Department of Water Resources</ORGANIZATION> <SERIES>120-93</SERIES> <TYPE>bulletin</TYPE> <AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL> <PAGES>17</PAGES> <TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html </TEXT-REF> <PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE </PAGED-REF> </ELIB-BIB>
A More Complex DTD <!DOCTYPE USMARC [ <!-- USMARC DTD. UCB-SLIS v.0.08 --> <!-- By Jerome P. McDonough, April 1, 1994 --> <!ELEMENT USMARC - - (Leader, Directry, VarFlds)> <!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK" id CDATA #IMPLIED> <!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records --> <!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)> <!ELEMENT Directry - O (#PCDATA)> <!ELEMENT VarFlds - O (VarCFlds, VarDFlds)> <!-- Component parts of Leader --> <!-- Logical Record Length --> <!ELEMENT LRL - O (#PCDATA)> …etc…
More Complex DTD (cont.) <!-- Variable Data Fields --> <!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)> <!-- Component Parts of Variable Data Fields --> <!-- Numbers & Codes --> <!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?, Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*, Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*, Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?, Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*, Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*, Fld071*, Fld072*, Fld074*, Fld080?, Fld082*, Fld084*, Fld086*, Fld088*, Fld090*, Fld096*)> <!-- Main Entries --> <!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)> <!-- Titles --> <!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*, Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)> <!-- Edition, Imprint, etc. --> <!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)> <!-- Physical Description, etc. --> <!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?, Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)> …etc…