Information Retrieval and Recommendation Techniques, Midterm Report
Ch9. Thesaurus Construction
Advisor: Prof. 黃三益
Group: Group 3
Members: 周桂穗, 孫繡紋, 莊士民
Date: 19 Nov. 2002
Outline
• 1. Introduction
  • 1.1 Definition of a thesaurus
  • 1.2 Thesaurus structure
  • 1.3 INSPEC thesaurus
  • 1.4 Explanation of thesaurus cross-reference entries
• 2. Features of thesauri
  • 2.1 Coordination level
  • 2.2 Term relationships
  • 2.3 Number of entries for each term
  • 2.4 Specificity of vocabulary
  • 2.5 Control on term frequency of class members
  • 2.6 Normalization of vocabulary
• 3. Thesaurus Construction
  • 3.1 Manual Thesaurus Construction
  • 3.2 Automatic Thesaurus Construction
    • From a Collection of Document Items
    • By Merging Existing Thesauri
    • User Generated Thesaurus
  • 3.2.1 Thesaurus Construction from Text
    • Construction of Vocabulary
  • 3.2.2 Merging existing thesauri
• 4. Conclusion
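1.1 Definition of a thesaurus
• In information storage and retrieval, a thesaurus is a collection of words or phrases that adequately represent concepts of knowledge, arranged in a specific structure. The vocabulary controls synonyms, distinguishes homographs, and shows the hierarchical and semantic relationships among related terms. This lets indexers analyzing material and readers searching for it choose from one consistent, controlled vocabulary. In short, a thesaurus provides standardized terminology for information storage and retrieval.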
1.2 Thesaurus structure
• Thesaurus vocabulary is divided into two kinds of entries: headings and cross-reference entries.
• Headings are terms approved for use, called descriptors. Cross-reference entries are terms that may not be used, called non-descriptors or use references; they correspond to the "see" references used in library bibliographic processing.
1.3 INSPEC thesaurus
• This thesaurus is designed for the INSPEC domain, which covers physics, electrical engineering, electronics, computers, and control.
• The thesaurus is logically organized as a set of hierarchies.
• It also includes an alphabetical listing of thesaural terms.
• Each hierarchy is built from a root term representing a high-level concept in the domain.
A short extract from the 1979 INSPEC thesaurus
Cesium
  USE caesium
Computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education teaching
  CC C7810C
  FC c7810cf
1.4 Explanation of thesaurus cross-reference entries
• USE: the term is replaced by the descriptor that follows.
• see also: leads to cross-referenced thesaural terms.
• NT (narrower term): suggests a more specific thesaural term.
• BT (broader term): provides a more general thesaural term.
• TT (top term): the broadest term above the entry in its hierarchy.
• RT (related term): signifies a related term.
• UF (used for): indicates the chosen form from a set of alternatives.
• CC: classification code.
2. Features of thesauri
• Coordination level
• Term relationships
• Number of entries for each term
• Specificity of vocabulary
• Control on term frequency of class members
• Normalization of vocabulary
2.1 Coordination level
• Coordination is the construction of phrases from individual terms.
• There are two coordination options: pre-coordination and post-coordination.
• A pre-coordinated thesaurus can contain phrases.
  • Advantage: the vocabulary is very precise.
  • Disadvantage: the searcher has to be aware of the phrase construction rules employed.
  • Pre-coordination is more common in manually constructed thesauri.
• A post-coordinated thesaurus does not allow phrases. Instead, phrases are constructed while searching.
  • Advantage: the user need not worry about the exact ordering of the words in a phrase.
  • Disadvantage: search precision may fall.
  • Automatic thesaurus construction usually implies post-coordination.
• Pre-coordination example: the descriptor concept is treated as a single compound noun, e.g. diesel locomotive.
• Post-coordination example: two or more existing terms are combined in its place, e.g. diesel engines AND locomotive.
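The difference between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inverted index `INDEX`, its document ids, and both function names are hypothetical, not part of any real system.

```python
# Toy inverted index: vocabulary entry -> ids of documents indexed under it.
# All entries and document ids are hypothetical.
INDEX = {
    "diesel locomotive": {1},      # phrase entry (pre-coordinated vocabulary)
    "diesel engines": {1, 2, 4},   # single terms (post-coordinated vocabulary)
    "locomotive": {1, 3, 4},
}


def pre_coordinated_search(phrase):
    """Look the phrase up as one vocabulary entry."""
    return INDEX.get(phrase, set())


def post_coordinated_search(*terms):
    """Build the phrase at search time by ANDing the single-term postings."""
    postings = [INDEX.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


print(pre_coordinated_search("diesel locomotive"))
print(post_coordinated_search("diesel engines", "locomotive"))
```

In this toy data, document 4 is indexed under both single terms but not under the phrase, so the post-coordinated query also returns it. That is the loss of precision mentioned above.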
2.2 Term relationships
Three categories of term relationships:
• (a) Equivalence relationships
• (b) Hierarchical relationships
• (c) Nonhierarchical relationships
2.2a) Equivalence relationships
• Equivalence relations include both synonymy and quasi-synonymy.
• For example: genetics and heredity; harshness and tenderness.
• Synonyms: when one concept can be expressed by more than one term, the thesaurus usually selects the more widely used or more current term as the descriptor, and the others become cross-reference entries. For example:
  • storage batteries UF secondary batteries
  • secondary batteries USE storage batteries
  • UF (Used For) marks the descriptor that replaces other terms; USE marks a term that is replaced.
• Quasi-synonyms: sometimes two antonyms represent two sides of the same concept. One is chosen as the descriptor and the other becomes a cross-reference entry. For example: stability UF instability, with the corresponding cross-reference entry instability USE stability.
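As a minimal sketch of how USE and UF references pair up, the USE entries from the examples above can be stored as a mapping and the reciprocal UF lists derived from it. The names `USE`, `build_uf`, and `preferred_term` are hypothetical.

```python
# Non-descriptor -> descriptor (the USE references from the examples above).
USE = {
    "secondary batteries": "storage batteries",
    "instability": "stability",
}


def build_uf(use_refs):
    """Derive the reciprocal UF (used for) lists from the USE references."""
    uf = {}
    for non_descriptor, descriptor in use_refs.items():
        uf.setdefault(descriptor, []).append(non_descriptor)
    return uf


def preferred_term(term, use_refs=USE):
    """Map a searcher's term to its descriptor; descriptors map to themselves."""
    return use_refs.get(term, term)


print(preferred_term("instability"))
print(build_uf(USE))
```

Deriving UF from USE, rather than storing both by hand, keeps the two sets of references from drifting apart.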
2.2b) Hierarchical relationships
• A typical example of a hierarchical relation is genus-species, such as "dog" and "German shepherd".
• For terms that stand in a hierarchical relationship, a thesaurus usually uses two reference symbols, BT and NT.
• BT points to the broader term above a term, e.g. oak tree BT tree
• NT points to the narrower term below a term, e.g. tree NT oak tree
• BT and NT are reciprocal reference symbols.
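Because BT and NT are reciprocal, one way to keep them consistent is to record both directions in a single step. The sketch below assumes an acyclic hierarchy; the class `Hierarchy`, its method names, and the extra term "plant" are hypothetical.

```python
from collections import defaultdict


class Hierarchy:
    """Stores BT/NT links so that each link always has its reciprocal."""

    def __init__(self):
        self.bt = defaultdict(set)  # term -> its broader terms
        self.nt = defaultdict(set)  # term -> its narrower terms

    def add_bt(self, term, broader):
        # Recording "term BT broader" also records "broader NT term".
        self.bt[term].add(broader)
        self.nt[broader].add(term)

    def top_terms(self, term):
        """Follow BT links upward until only terms with no broader term remain.

        Assumes the hierarchy contains no cycles.
        """
        if not self.bt.get(term):
            return {term}
        tops = set()
        for broader in self.bt[term]:
            tops |= self.top_terms(broader)
        return tops


h = Hierarchy()
h.add_bt("oak tree", "tree")
h.add_bt("tree", "plant")  # "plant" is a made-up broader term
print(h.nt["tree"], h.top_terms("oak tree"))
```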
2.2c) Nonhierarchical relationships
• Nonhierarchical relationships also identify conceptually related terms.
• Examples include thing-part, such as "bus" and "seat", and thing-attribute, such as "rose" and "fragrance".
• They express an association between approved descriptors and are generally linked with the RT reference symbol, for example:
  • A thing and its parts: windows RT houses
  • A thing and the process applied to it: skates RT skating
  • A thing and its application: railway construction RT railway
  • A thing and its properties: seawater RT corrosion
Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term relationships consisting of:
• (1) Parts-wholes
• (2) Collocation relations
• (3) Paradigmatic relations
• (4) Taxonomy and synonymy
• (5) Antonymy relations
(1) Parts-wholes
• Parts and wholes include examples such as set-element and count-mass.
(2) Collocation relations
• Collocation relates words that frequently co-occur in the same phrase or sentence.
(3) Paradigmatic relations
• Paradigmatic relations relate words that have the same semantic core, like "moon" and "lunar", and are somewhat similar to Aitchison and Gilchrist's quasi-synonymy relationship.
(4) Taxonomy and synonymy
• Taxonomy and synonymy are self-explanatory and refer to the classical relations between terms.
2.3 Number of entries for each term
• It is in general preferable to have a single entry for each thesaurus term. However, this is seldom achieved because of homographs: words with multiple meanings.
• In a manually constructed thesaurus such as INSPEC, this problem is resolved with parenthetical qualifiers, as in the pair of homographs bonds (chemical) and bonds (adhesive).
• However, this is hard to achieve automatically.
Homographs
• When several terms are spelled identically but mean different things, a qualifier in parentheses distinguishes them, e.g. Mercury (metal) and Mercury (planet). The parenthetical qualifier is part of the descriptor.
2.4 Specificity of vocabulary
• Specificity is a function of the precision associated with the component terms.
• A highly specific vocabulary is able to express the subject in great depth and detail. This promotes precision in retrieval.
• The disadvantage is that the size of the vocabulary grows. Also, specific terms tend to change more rapidly than general terms.
• Therefore, such vocabularies tend to require more regular maintenance.
• High specificity implies a high coordination level, so the user has to be more concerned with the rules for phrase construction.
2.5 Control on term frequency of class members
• Salton and McGill have stated that in order to maintain a good match between documents and queries, it is necessary to ensure that terms included in the same thesaurus class have roughly equal frequencies.
• The total frequency in each class should also be roughly similar.
• These constraints are imposed to ensure that the probability of a match between a query and a document is the same across classes.
• Terms within the same class should be equally specific, and the specificity across classes should also be the same.
2.6 Normalization of vocabulary
• Terms should preferably be expressed as nouns.
• Noun phrases should avoid acronyms unless they are widely known.
• Adjectives may be used.
• There are other rules covering issues such as the singularity of terms, the ordering of terms within phrases, spelling, capitalization, transliteration, abbreviations, initials, acronyms, and punctuation.
• The advantage is that variant forms are mapped into base expressions, thereby bringing consistency to the vocabulary.
• The disadvantage is that, in order to use the vocabulary effectively, the user has to be well aware of the normalization rules used.
3.1 Manual Thesaurus Construction
• Define the boundaries of the subject area
  • Identify central subject areas and peripheral ones
• Partition the domain into divisions or subareas
• Identify the desired characteristics
• Collect terms for each subarea
  • Sources include indexes, encyclopedias, handbooks, textbooks, journals, abstracts, catalogs, and existing thesauri or vocabulary systems
  • Sources also include subject experts and potential users
3.1 Manual Thesaurus Construction (continued)
• Analyze each term for its related vocabulary
  • Including synonyms, broader and narrower terms, definitions, and scope notes
• Organize the terms and relationships into a hierarchical structure
• Review and refine for consistency
• Invert the structured thesaurus to produce an alphabetical arrangement of entries
• Test the thesaurus
3.1 Manual Thesaurus Construction (continued)
• Conclusion:
  • Involves a group of individuals and a variety of resources
  • Needs to be maintained to ensure viability and effectiveness
  • Must reflect any changes in the terminology of the area
=> An art and a science
3.2 Automatic Thesaurus Construction
• From document collections
  • Use a collection of documents as the source for thesaurus construction
  • Apply statistical procedures to identify important terms as well as relationships
  • Use computationally simpler methods to identify the more important semantic knowledge
3.2 Automatic Thesaurus Construction (continued)
• Merging existing thesauri
  • Merge two or more thesauri into a single unit
  • The merger should not violate the integrity of any component thesaurus
  • e.g. augment MeSH from SNOMED
3.2 Automatic Thesaurus Construction (continued)
• User generated thesaurus
  • Uses the term relationships that appear in search strategies
  • Captures knowledge from the user's search
  • e.g. TEGEN (Thesaurus Generating system), which considers:
    • The types of Boolean operators between terms
    • The type of query modification
    • User feedback
3.2.1 Thesaurus Construction from Texts
• Process
  • 1 Construction of vocabulary
    • Normalization
    • Selection of terms
    • Phrase construction
  • 2 Similarity computations
    • Identify the statistical associations between terms
  • 3 Organization of vocabulary
    • Organize the selected vocabulary into a hierarchy
1 Construction of Vocabulary
• Objective: identify the most informative terms (words, phrases)
• Identify an appropriate document collection, which should be sizable and representative of the subject area
• Determine the required specificity
• Normalize the vocabulary
  • Eliminate trivial words by constructing a stoplist
  • Stem the vocabulary
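The normalization step can be sketched as follows. This is a rough illustration, not the procedure from the source: the stoplist is a small hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Hypothetical stoplist; a real one would be built for the collection.
STOPLIST = {"the", "of", "a", "an", "and", "in", "is", "are", "to", "for"}


def naive_stem(word):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stoplist words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOPLIST]


print(normalize("The indexing of documents"))
```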
Selection by Frequency of Occurrence
• Each term is placed in one frequency category: high, medium, or low
• Medium: best for indexing and abstracting
• Low: minimal impact on retrieval
• High: too general, with a negative impact on search precision
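A minimal sketch of frequency-based selection is shown below. The two cutoffs are assumptions that would have to be tuned for each collection, since the slides give no values, and the function names are hypothetical.

```python
from collections import Counter


def frequency_categories(documents, low_cutoff, high_cutoff):
    """Place each term in 'low', 'medium', or 'high' by collection frequency.

    documents is a list of token lists. The cutoffs are assumptions.
    """
    counts = Counter(term for doc in documents for term in doc)
    categories = {}
    for term, freq in counts.items():
        if freq <= low_cutoff:
            categories[term] = "low"
        elif freq >= high_cutoff:
            categories[term] = "high"
        else:
            categories[term] = "medium"
    return categories


def select_medium(documents, low_cutoff, high_cutoff):
    """Keep only the medium-frequency terms as vocabulary candidates."""
    categories = frequency_categories(documents, low_cutoff, high_cutoff)
    return sorted(t for t, c in categories.items() if c == "medium")


docs = [
    ["data", "thesaurus", "index"],
    ["data", "thesaurus"],
    ["data", "index"],
    ["data", "rare"],
]
print(select_medium(docs, low_cutoff=1, high_cutoff=4))
```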
Selection by Discrimination Value (DV)
• DV measures the degree to which a term is able to discriminate or distinguish between the documents
• The more discriminating a term, the higher its value as an index term
• A similarity function is used to compute the average inter-document similarity in the collection
• DV(k) = (Average similarity without k) - (Average similarity with k)
Selection by Discrimination Value (DV) (continued)
• Good discriminators decrease the average similarity by their presence, so their DV is positive
• Poor discriminators have negative DV
• Neutral discriminators have no effect on average similarity
• Terms that are positive discriminators can be included in the vocabulary and the rest rejected
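The DV formula can be checked on a toy collection. The sketch below assumes cosine similarity averaged over all document pairs; the slides do not fix the similarity function, so that choice and the sample documents are assumptions.

```python
import math
from itertools import combinations


def cosine(d1, d2):
    """Cosine similarity between two documents given as term -> weight dicts."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def average_similarity(docs):
    """Mean similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def discrimination_value(docs, k):
    """DV(k) = (average similarity without k) - (average similarity with k)."""
    without_k = [{t: w for t, w in d.items() if t != k} for d in docs]
    return average_similarity(without_k) - average_similarity(docs)


docs = [
    {"the": 1, "cat": 1},
    {"the": 1, "dog": 1},
    {"the": 1, "fish": 1},
]
print(discrimination_value(docs, "the"))  # negative: poor discriminator
print(discrimination_value(docs, "cat"))  # positive: good discriminator
```

Comparing every pair of documents costs quadratic time in the collection size, so this form is only practical for small collections.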
Selection by Poisson Method
• The Poisson distribution is a discrete random distribution that can be used to model a variety of random phenomena
• The occurrences of trivial words follow a single Poisson distribution
• The distribution of nontrivial words deviates significantly from a Poisson distribution
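One simple way to look for such a deviation is sketched below. It is an assumption about how the idea might be operationalized, not the test used in the source: it compares the number of documents that actually contain a term with the number expected if the term's occurrences were scattered by a single Poisson process. Both function names are hypothetical.

```python
import math


def poisson_expected_df(total_occurrences, n_docs):
    """Documents expected to contain the term under a single Poisson model
    with rate total_occurrences / n_docs."""
    rate = total_occurrences / n_docs
    return n_docs * (1.0 - math.exp(-rate))


def poisson_deviation(doc_counts):
    """Ratio of observed to Poisson-expected document frequency.

    doc_counts holds the term's count in every document, zeros included.
    A ratio well below 1 means the term is concentrated in few documents,
    which this sketch treats as a sign of a nontrivial word.
    """
    n_docs = len(doc_counts)
    total = sum(doc_counts)
    if total == 0:
        return 1.0
    observed_df = sum(1 for count in doc_counts if count > 0)
    return observed_df / poisson_expected_df(total, n_docs)


spread = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]     # occurrences scattered widely
clustered = [5, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # same total, one document
print(poisson_deviation(spread), poisson_deviation(clustered))
```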
Phrase Construction
• Phrase construction decreases the frequency of high-frequency terms and increases their value for retrieval
• Salton and McGill Procedure: a statistical alternative to syntactic and/or semantic methods for identifying and constructing phrases
  • The component words of a phrase should occur frequently in a common context
  • The component words should represent broad concepts, and their frequency of occurrence should be sufficiently high
Phrase Construction (continued)
• Procedure
  • Compute pairwise co-occurrence for high-frequency words
  • If this co-occurrence is lower than a threshold, do not consider the pair any further
  • For pairs that qualify, compute the cohesion value using either of two forms:
    • $\mathrm{cohesion}(t_i, t_j) = \dfrac{\text{co-occurrence-frequency}}{\sqrt{\mathrm{frequency}(t_i) \cdot \mathrm{frequency}(t_j)}}$
    • $\mathrm{cohesion}(t_i, t_j) = \text{size-factor} \cdot \dfrac{\text{co-occurrence-frequency}}{\text{total-frequency}(t_i) \cdot \text{total-frequency}(t_j)}$
  • If the cohesion is above a second threshold, retain the phrase as a valid vocabulary phrase
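The two cohesion forms and the two-threshold filter can be sketched as below. The thresholds, the counts in `stats`, and the function names are all hypothetical; the filter uses the square-root form, but either form could be plugged in.

```python
import math


def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))."""
    return cooccurrence / math.sqrt(freq_i * freq_j)


def cohesion_size_factor(cooccurrence, total_freq_i, total_freq_j, size_factor):
    """size-factor * co-occurrence-frequency
    / (total-frequency(ti) * total-frequency(tj))."""
    return size_factor * cooccurrence / (total_freq_i * total_freq_j)


def select_phrases(pair_stats, cooc_threshold, cohesion_threshold):
    """Apply both thresholds.

    pair_stats maps (ti, tj) -> (co-occurrence, frequency(ti), frequency(tj)).
    """
    kept = []
    for pair, (cooc, freq_i, freq_j) in pair_stats.items():
        if cooc < cooc_threshold:
            continue  # co-occurrence too low: drop the pair
        if cohesion_sqrt(cooc, freq_i, freq_j) > cohesion_threshold:
            kept.append(pair)
    return kept


stats = {  # made-up counts for illustration
    ("information", "retrieval"): (40, 50, 50),
    ("information", "system"): (40, 50, 800),
    ("data", "model"): (2, 50, 50),
}
print(select_phrases(stats, cooc_threshold=10, cohesion_threshold=0.5))
```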
Choueka Procedure
• Identifies collocational expressions: phrases whose meaning cannot be derived in a simple way from that of the component words (e.g. artificial intelligence)
1. Select the range of lengths allowed for each collocational expression
2. Build a list of all potential expressions of the prescribed lengths from the collection that have a minimum frequency
3. Delete sequences that begin or end with a trivial word
Choueka Procedure (continued)
4. Delete expressions that contain high-frequency nontrivial words
5. Given an expression such as a b c d, evaluate any potential subexpressions for relevance. Discard any that are not sufficiently relevant
6. Try to merge small expressions into larger and more meaningful ones
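Steps 1 to 3 of the procedure can be sketched as an n-gram count followed by two filters. Steps 4 to 6 need relevance judgments that the slides do not specify, so they are left out. The function name, parameters, and sample sentences are hypothetical.

```python
from collections import Counter


def candidate_expressions(sentences, min_len, max_len, min_freq, trivial):
    """Steps 1-3: collect word sequences of the allowed lengths, keep those
    that meet the minimum frequency, and drop any that begin or end with a
    trivial word."""
    counts = Counter()
    for words in sentences:
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {
        seq: freq
        for seq, freq in counts.items()
        if freq >= min_freq and seq[0] not in trivial and seq[-1] not in trivial
    }


sentences = [
    ["history", "of", "artificial", "intelligence"],
    ["artificial", "intelligence", "is", "the", "topic"],
    ["study", "of", "artificial", "intelligence"],
]
print(candidate_expressions(sentences, 2, 3, 2, {"the", "of", "is"}))
```

In this sample, "of artificial" and "of artificial intelligence" reach the minimum frequency but are removed because they begin with a trivial word.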
2 Similarity computations between terms
• The goal is to determine the statistical similarity between pairs of terms.
• Dice: $2 \cdot \mathit{common} / (l_1 + l_2)$; if $l_1 = 0$ or $l_2 = 0$, return 0
• Cosine: $\mathit{common} / \sqrt{l_1 \cdot l_2}$; if $l_1 = 0$ or $l_2 = 0$, return 0
• $l_1$: number of terms associated with document 1
• $l_2$: number of terms associated with document 2
• $\mathit{common}$: number of terms in common between them
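Both coefficients are short to write down. The sketch below uses the standard set-based definitions of Dice and cosine and takes the two items being compared as Python sets; the function names are my own.

```python
import math


def dice(set1, set2):
    """Dice coefficient: 2 * common / (l1 + l2); 0 if either set is empty."""
    l1, l2 = len(set1), len(set2)
    if l1 == 0 or l2 == 0:
        return 0.0
    return 2.0 * len(set1 & set2) / (l1 + l2)


def cosine(set1, set2):
    """Cosine coefficient: common / sqrt(l1 * l2); 0 if either set is empty."""
    l1, l2 = len(set1), len(set2)
    if l1 == 0 or l2 == 0:
        return 0.0
    return len(set1 & set2) / math.sqrt(l1 * l2)


a = {"index", "term", "query", "class"}
b = {"query", "class", "phrase", "stem"}
print(dice(a, b), cosine(a, b))
```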
3 Organization of vocabulary
• Two assumptions:
  • High-frequency words have broad meaning
  • If the density functions of two terms, p and q (of varying frequencies), have the same shape, then the two words have similar meaning
• Given these two assumptions, if p is the term with the higher frequency, then q becomes a child of p.
3 Organization of vocabulary (continued)
• Identify a set of frequency ranges.
• Group the vocabulary terms into different classes based on their frequencies.
• The highest frequency class is assigned level 0, the next level 1, and so on.
• Determine the parent-child links.
• Create "dummy" terms.
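The steps above can be sketched as follows. The slides do not say how parents are chosen or when dummy terms are created, so two rules here are assumptions: each term is linked to its most similar term one level up, and a term left without children gets a dummy copy one level down. The function name, the frequencies, and the similarity table are hypothetical.

```python
def build_hierarchy(term_freq, ranges, similarity):
    """Group terms into levels by frequency range, then link adjacent levels.

    term_freq:  term -> frequency
    ranges:     list of (low, high) frequency bounds, highest range first
    similarity: function (term_a, term_b) -> float
    Returns (levels, parent), where parent maps each child to its parent.
    """
    levels = [[t for t, f in term_freq.items() if low <= f <= high]
              for low, high in ranges]
    parent = {}
    origin = {}  # dummy term -> the real term it stands for
    for i in range(1, len(levels)):
        if not levels[i - 1]:
            continue
        for child in levels[i]:
            # Assumption: the parent is the most similar term one level up.
            parent[child] = max(
                levels[i - 1],
                key=lambda p: similarity(child, origin.get(p, p)),
            )
        # Assumption: a childless term gets a dummy copy one level down so
        # that it can still act as a parent for deeper levels.
        with_children = {parent[c] for c in levels[i]}
        for p in list(levels[i - 1]):
            if p not in with_children:
                dummy = p + " (dummy)"
                origin[dummy] = origin.get(p, p)
                levels[i].append(dummy)
                parent[dummy] = p
    return levels, parent


freqs = {"computer": 100, "software": 40, "hardware": 35, "compiler": 5}
sims = {("software", "computer"): 0.9, ("hardware", "computer"): 0.8,
        ("compiler", "software"): 0.7, ("compiler", "hardware"): 0.1}
levels, parent = build_hierarchy(
    freqs, [(50, 1000), (20, 49), (1, 19)],
    lambda a, b: sims.get((a, b), 0.0),
)
print(levels)
print(parent)
```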
3.2.2 Merging existing thesauri
• Simple-merge
  • Two terms in different hierarchies are merged if they are identical.
• Complex-merge
  • Any two terms in different hierarchies are merged if they have "similar" parents and children.
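A rough sketch of both merges is given below, with each hierarchy stored as a mapping from a term to the set of its children. The slides leave "similar" undefined; the Jaccard overlap with a threshold used here is purely an assumption, as are the function names and sample hierarchies.

```python
def simple_merge(h1, h2):
    """Union two hierarchies (term -> set of children); identical terms merge."""
    merged = {}
    for hierarchy in (h1, h2):
        for term, children in hierarchy.items():
            merged.setdefault(term, set()).update(children)
    return merged


def parents_of(hierarchy, term):
    """All terms that list the given term among their children."""
    return {p for p, children in hierarchy.items() if term in children}


def overlap(a, b):
    """Jaccard overlap of two sets (an assumed notion of 'similar')."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def complex_merge_candidates(h1, h2, threshold):
    """Pairs of differently named terms whose parents and children overlap."""
    pairs = []
    for t1 in h1:
        for t2 in h2:
            if t1 == t2:
                continue
            if (overlap(parents_of(h1, t1), parents_of(h2, t2)) >= threshold
                    and overlap(h1[t1], h2[t2]) >= threshold):
                pairs.append((t1, t2))
    return pairs


h1 = {"vehicles": {"cars"}, "cars": {"sedans", "coupes"}}
h2 = {"vehicles": {"automobiles"}, "automobiles": {"sedans", "coupes"}}
print(simple_merge(h1, h2))
print(complex_merge_candidates(h1, h2, threshold=0.5))
```

Here the simple merge joins the two "vehicles" entries, while the complex merge flags "cars" and "automobiles" as candidates because they share a parent and the same children.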
4. Conclusion
• This chapter began with an introduction to thesauri.
• Two major automatic thesaurus construction methods have been detailed.
• A few issues related to thesauri have not been considered here:
  • Evaluation of thesauri
  • Maintenance of thesauri
  • How to automate the usage of thesauri