Leveraging XLT: (Web-Enabled) Validation of Terminology Collections • Lee Gillam, University of Surrey • SALT Workshop, Antwerp, 31 January 2001
Surrey-EU project history • Terminology Extraction and Management Projects: TWB, TWBII • Management of Text Collections: TRANSTERM • Term Resources: POINTER • Terminology Validation: INTERVAL • Convergence in SALT?
XLT ‘opportunities’ • Complete terminology collections available in XML – enhancement/reuse of other collections • Large number of (multilingual) terms – difficult for humans to appraise • Terminology relates to usage – document collections highly relevant • Quantity of terms – no guarantee of quality
(Web-Enabled) Validation • Relevant documents on the web – contextual information • Relevant documents on the ‘corporate internet’ – contextual information • Term usage in other organisations (glossaries)/as understood by Joe E.C. Taxpayer • Resource enrichment
System Description • For a given (D)XLT collection of terminology: • Partition collection by specific criteria • Collect documents relevant to criteria • Analyse documents against the partitioned collection • Report results
System Description • Partition collection by specific criteria: • Use of XPath • “Give me all terms in English” • //dxlt/text/body/termEntry/langSet[@lang='en']/ntig/termGrp/term/text() • Alternative example: “Give me all subjectFields” • //dxlt/text/body/termEntry/descrip[@type='subjectField']/text() [check!]
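The two XPath queries above can be tried out with a minimal Python sketch. It assumes a `dxlt` root element as in the paths on the slide, and a `lang` attribute as shown there. The miniature collection, the French placeholder term and the function names `terms_in_language` and `subject_fields` are hypothetical. The standard-library `xml.etree.ElementTree` supports only a subset of XPath (no `text()` step), so the text is read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature collection; element names follow the XPaths on the slide.
SAMPLE = """<dxlt><text><body>
  <termEntry id="HR-7">
    <descrip type="subjectField">200</descrip>
    <langSet lang="fr">
      <ntig><termGrp><term>placeholder-fr</term></termGrp></ntig>
    </langSet>
    <langSet lang="en">
      <ntig><termGrp><term>aberration of light</term></termGrp></ntig>
    </langSet>
  </termEntry>
</body></text></dxlt>"""


def terms_in_language(root, lang):
    """'Give me all terms in <lang>': one partition of the collection."""
    path = f"./text/body/termEntry/langSet[@lang='{lang}']/ntig/termGrp/term"
    return [el.text for el in root.findall(path)]


def subject_fields(root):
    """'Give me all subjectFields' attached directly to term entries."""
    path = "./text/body/termEntry/descrip[@type='subjectField']"
    return [el.text for el in root.findall(path)]


root = ET.fromstring(SAMPLE)          # root is the <dxlt> element
english_terms = terms_in_language(root, "en")
fields = subject_fields(root)
```

A full XPath 1.0 engine would accept the slide's expressions unchanged, including the trailing `text()` step. The subset used here is enough to illustrate partitioning by language or by subject field.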
System Description • Collect documents relevant to criteria • For terms, try internet/intranet searching • For subject field classifications, classification documents will be relevant • For definitions, comparisons with other glossaries may provide useful validating information • …..
System Description • Analyse documents against the partitioned collection • Are the terms contained in the documents? • Are the terms in the documents now used as parts of compounds? • What are the contexts in which the terms are used? • Are there a number of potential other definitions for a particular term? • Does this fit in with a specific classification? • ….
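Three of the questions above (is the term present, is it used in compounds, what are its contexts) can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's implementation: the function names, the sample documents, the stopword list and the "term plus one following word" compound heuristic are all hypothetical. Compound candidates produced this way still need human appraisal.

```python
import re

# Hypothetical stopword list used to discard obvious non-compounds.
STOPWORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "or", "the", "to"}


def _core(term):
    """Regex for the term as a phrase; any run of whitespace may separate its words."""
    words = [re.escape(w) for w in term.split()]
    if not words:
        raise ValueError("empty term")
    return r"\b" + r"\s+".join(words)


def term_frequency(term, documents):
    """Case-insensitive count of whole-phrase occurrences across all documents."""
    pattern = re.compile(_core(term) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(doc)) for doc in documents)


def contexts(term, documents, width=30):
    """Each occurrence with up to `width` characters of text on either side."""
    pattern = re.compile(_core(term) + r"\b", re.IGNORECASE)
    found = []
    for doc in documents:
        for m in pattern.finditer(doc):
            start = max(0, m.start() - width)
            found.append(doc[start:m.end() + width].strip())
    return found


def potential_compounds(term, documents):
    """Crude heuristic: the term immediately followed by one non-stopword."""
    pattern = re.compile(_core(term) + r"\s+([A-Za-z][A-Za-z-]*)", re.IGNORECASE)
    candidates = set()
    for doc in documents:
        for m in pattern.finditer(doc):
            if m.group(1).lower() not in STOPWORDS:
                candidates.add(" ".join(m.group(0).lower().split()))
    return sorted(candidates)


# Hypothetical collected documents.
DOCS = [
    "Circuit switching is used here. A circuit switching network "
    "differs from packet switching.",
    "Packet switching is common.",
]
```

A frequency of zero for a term in the collection is itself a result worth reporting, since it may indicate an obsolete term or an unrepresentative document set.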
System Description • Report Results • Term frequency – Zero? • Potential compounds • Contexts • Definitions • Correctly classified • …..
Prototype • [Screenshot of the prototype; labelled regions: ‘Results Area’, XML attributes, Indicative Actions, ‘XML’]
Prototype • [Screenshot of the prototype: Indicative XPaths]
Prototype • [Screenshot of the prototype] Recall this term…
Prototype • [Screenshot of the prototype] CIRCUIT SWITCHING: Found in collected texts 43 times. Valid term? PACKET SWITCHING also exists in this resource.
DHydro Sample
<termEntry id="HR-7">
  + <transacGrp>
  <descrip type="subjectField">200</descrip>
  + <langSet lang="fr">
  <langSet lang="en">
    <descripGrp>
      <descrip type="definition">The apparent displacement in position of a heavenly body caused by the combination of the velocity of light and that of an observer on the surface of the earth. Aberration of light due to the rotation of the earth on its axis is termed diurnal aberration. That due to the revolution of the earth around the sun is termed annual aberration.</descrip>
    </descripGrp>
    <ntig>
      <termGrp>
        <term id="HR-7-en-1">aberration of light</term>
        <termNote type="termType">main entry</termNote>
        <termNote type="partOfSpeech">n</termNote>
      </termGrp>
    </ntig>
  </langSet>
  + <langSet lang="es">
</termEntry>
Lenoch (GMT)
<struct type="classification">
  <feat type="name">AD2</feat>
  <feat type="documentation">public and private organisations</feat>
  <feat type="subclass-of">AD</feat>
</struct>
<struct type="classification">
  <feat type="name">AD3</feat>
  <feat type="documentation">publications and documentary search</feat>
  <feat type="subclass-of">AD</feat>
</struct>
<struct type="classification">
  <feat type="name">AD31</feat>
  <feat type="documentation">documentation and information systems</feat>
  <feat type="subclass-of">AD3</feat>
</struct>
Lenoch (XOL)
<class>
  <name>AD2</name>
  <documentation>public and private organisations</documentation>
  <subclass-of>AD</subclass-of>
</class>
<class>
  <name>AD3</name>
  <documentation>publications and documentary search</documentation>
  <subclass-of>AD</subclass-of>
</class>
<class>
  <name>AD31</name>
  <documentation>documentation and information systems</documentation>
  <subclass-of>AD3</subclass-of>
</class>
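The GMT and XOL samples carry the same classification data, so one can be derived mechanically from the other. The sketch below, in Python, shows one possible mapping: each `feat` element's `type` attribute becomes the element name in the XOL-style output. The wrapper elements `classifications` and `classes` and the function name `gmt_to_xol` are hypothetical, since the slides show fragments only.

```python
import xml.etree.ElementTree as ET

# Two of the GMT fragments from the slide, inside a hypothetical wrapper element.
GMT_SAMPLE = """<classifications>
  <struct type="classification">
    <feat type="name">AD3</feat>
    <feat type="documentation">publications and documentary search</feat>
    <feat type="subclass-of">AD</feat>
  </struct>
  <struct type="classification">
    <feat type="name">AD31</feat>
    <feat type="documentation">documentation and information systems</feat>
    <feat type="subclass-of">AD3</feat>
  </struct>
</classifications>"""


def gmt_to_xol(gmt_root):
    """Map each GMT classification struct to an XOL-style <class> element."""
    xol_root = ET.Element("classes")  # hypothetical wrapper element
    for struct in gmt_root.findall("struct[@type='classification']"):
        cls = ET.SubElement(xol_root, "class")
        for feat in struct.findall("feat"):
            feat_type = feat.get("type")
            if feat_type is None:
                continue  # skip features without a type; no element name to use
            child = ET.SubElement(cls, feat_type)
            child.text = feat.text
    return xol_root


xol = gmt_to_xol(ET.fromstring(GMT_SAMPLE))
xol_text = ET.tostring(xol, encoding="unicode")
```

Because `subclass-of` survives the conversion, the class hierarchy (AD31 under AD3, AD3 under AD) stays available for checking whether a term's subject field fits a specific classification.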
Outlook • Initial results show promise for the validation of terminological resources, but significant development work is still required • XPath generation needs tailoring to specific formats (DXLT), but provides useful power • Development to merge ‘Web glossaries’ as a pre-terminological validation stage • Aim: provide a powerful prototype of the capabilities for the (Web-Enabled) Validation of Terminology Collections with DXLT-related formats • DXLT as the de facto standard format for Terminology Validation?