Semantic Indexing with Typed Terms using Rapid Annotation. 16th of August 2005 TKE-05 Workshop on Semantic Indexing, Copenhagen. Chris Biemann University of Leipzig. Outline. The benefits of typed terms and relations Alleviating the ontology bottleneck Rapid annotation
Semantic Indexing with Typed Terms using Rapid Annotation 16th of August 2005 TKE-05 Workshop on Semantic Indexing, Copenhagen Chris Biemann University of Leipzig
Outline • The benefits of typed terms and relations • Alleviating the ontology bottleneck • Rapid annotation • Sources for annotation candidates • Annotation tools • Case study: Annotation of „Deutscher Wortschatz“ • Conclusion
Typed terms and relations The bag of words model treats all terms equally • Document similarity based on all terms • No views on data possible Typed terms and relations: • Multiple views on documents w.r.t. types • Document similarity restricted to types and augmented by relations • Enables some tasks of Question Answering
Motivating example: untyped Documents: • The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller. • "Weapon sales increased", a government official stated, "especially tanks sell well". • A holiday cruise on a yacht invites you to take photos of seagulls. • The photos show A. Smith on a cruise with B. Miller's yacht. [Figure: similarity of terms and the resulting clustering of documents 1, 2, 4, 3]
Motivating example: type PERSON Documents: • The government official A. Smith signed a contract over the purchase of 100 tanks from weapon manufacturer B. Miller. • "Weapon sales increased", a government official stated, "especially tanks sell well". • A holiday cruise on a yacht invites you to take photos of seagulls. • The photos show A. Smith on a cruise with B. Miller's yacht. [Figure: similarity of terms and the resulting clustering of documents 1, 2, 4, 3]
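The difference between the two views can be sketched in a few lines of Python. This is only an illustration: the terms are typed by hand, Jaccard overlap stands in for whatever similarity measure the slides used, and every type label other than PERSON (ROLE, OBJECT) is a hypothetical name introduced here.

```python
def typed_similarity(doc_a, doc_b, term_type=None):
    """Jaccard overlap of two documents' terms, optionally restricted to one type."""
    def terms(doc):
        return {term for term, ty in doc if term_type is None or ty == term_type}

    a, b = terms(doc_a), terms(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hand-typed (term, type) pairs for the four example documents.
# ROLE and OBJECT are hypothetical type names, not taken from the slides.
DOCS = {
    1: {("A. Smith", "PERSON"), ("B. Miller", "PERSON"),
        ("government official", "ROLE"), ("contract", "OBJECT"),
        ("tanks", "OBJECT"), ("weapon", "OBJECT")},
    2: {("government official", "ROLE"), ("weapon", "OBJECT"),
        ("tanks", "OBJECT"), ("sales", "OBJECT")},
    3: {("holiday", "OBJECT"), ("cruise", "OBJECT"), ("yacht", "OBJECT"),
        ("photos", "OBJECT"), ("seagulls", "OBJECT")},
    4: {("A. Smith", "PERSON"), ("B. Miller", "PERSON"),
        ("photos", "OBJECT"), ("cruise", "OBJECT"), ("yacht", "OBJECT")},
}

untyped_1_2 = typed_similarity(DOCS[1], DOCS[2])            # all terms
untyped_1_4 = typed_similarity(DOCS[1], DOCS[4])            # all terms
person_1_4 = typed_similarity(DOCS[1], DOCS[4], "PERSON")   # PERSON view only
```

In this sketch, documents 1 and 2 are closer than 1 and 4 when all terms count, while the PERSON view makes documents 1 and 4 identical and leaves documents 2 and 3 with nothing to compare.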
The ontology bottleneck • Semantic Web people believe that annotation with ontology relations will enable semantic search, ... • Annotation: Choose an ontology, label all instances in the document Problems: • New documents have to be annotated all over again • Merging of ontologies • Despite tools, users are reluctant to annotate their documents [Figure: Doc 1 ... Doc n, each with its own annotation Anno 1 ... Anno n, combined into a merged ontology behind an interface]
Centralized annotation • Types and relations for terms are assigned globally, once and for all. • No (logically grounded, consistent) ontology, but a free collection of types and relations suited to the problem • Annotation is done for document collections [Figure: a single annotation behind an interface, covering the document collection Doc 1 ... Doc n]
Generating Candidates for Annotation • Given N terms from the collection, it is not feasible to present N² pairs to an annotator. Most of the pairs will not be related. • Needed: a method that produces terms with similar types and related pairs at a high rate Method here: • Co-occurrence statistics: pairs of terms that occur significantly often together in sentences/documents. • Co-occurrences of higher orders: pairs of terms that have similar co-occurrence statistics. Co-occurrences reflect syntagmatic and paradigmatic relations; the former are ruled out in higher orders.
The cats and dogs example cat co-occurrences: dog, her, food, pet, litter, she, burglar, animal, my, mouse, feline, Garfield, like, Cat, bag cat order 2: cats, pet, dog, animals, animal, dogs, pets, neutered, her, she, Synindex, like, tabbie, pigs, shelter cat order 4: pet, pets, cats, dog, pigs, animals, dogs, animal, owners, zoo, wild, birds, rabbits, puppies, tiger
Specifying types and relations • Clicking on a node or edge opens a context menu restricted by part of speech (POS)
Rule-based candidate generation • If some annotation is already present, then rules can be specified to obtain candidates at an even higher rate. • It is possible to guess the type of candidates Example: Rule 1: If IS-A(A,B) and PROPERTY(B), then PROPERTY(A); yields LIVING(dog) as candidate. Rule 2: If IS-A(A,B) and CO-HYPONYM(A,C), then IS-A(C,B); yields IS-A(cat, animal) as candidate. [Figure: graph with nodes animal (LIVING), dog, and cat (LIVING), connected by IS-A and CO-HYPONYM edges]
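The two rules above can be sketched as a small Python function over sets of facts. The data layout, the function name `rule_candidates`, and the choice to treat co-hyponymy as symmetric are assumptions made for this illustration; the starting facts IS-A(dog, animal) and CO-HYPONYM(dog, cat) are inferred from the candidates the slide reports, not read off the lost figure.

```python
def rule_candidates(is_a, cohyponym, properties):
    """Propose new annotation candidates from existing annotation.

    is_a:       set of (A, B) pairs meaning IS-A(A, B)
    cohyponym:  set of (A, C) pairs, treated as symmetric (assumption)
    properties: dict mapping a term to the set of its properties/types
    """
    symmetric = cohyponym | {(c, a) for a, c in cohyponym}
    candidates = set()
    for a, b in is_a:
        # Rule 1: IS-A(A, B) and PROPERTY(B)  =>  PROPERTY(A)
        for prop in properties.get(b, set()):
            if prop not in properties.get(a, set()):
                candidates.add((prop, a))
        # Rule 2: IS-A(A, B) and CO-HYPONYM(A, C)  =>  IS-A(C, B)
        for x, c in symmetric:
            if x == a and (c, b) not in is_a:
                candidates.add(("IS-A", c, b))
    return candidates


# Facts assumed from the slide's example (see the note above).
is_a = {("dog", "animal")}
cohyponym = {("dog", "cat")}
properties = {"animal": {"LIVING"}, "cat": {"LIVING"}}

proposed = rule_candidates(is_a, cohyponym, properties)
```

The output is a set of candidates only; in the workflow described on the slide, an annotator would still accept or reject each of them.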
Case study: Annotating Deutscher Wortschatz (www.wortschatz.uni-leipzig.de) In terms of numbers: • In 1,000 hours, annotators could choose between • 46 semantic types and • 57 relations, and produced • 150,000 type instances and • 150,000 relation instances for over • 80,000 distinct terms, that is, a text coverage of • 90%, at a speed of • 5 units per minute
Example: Query resolution with types and relations Query: "Find documents mentioning at least two heads of computer companies!" 1. Translate into formal query: Qset = {B | IS-A(A, computer company), HEAD-OF(B,A)}; b1 ∈ Qset, b2 ∈ Qset, b1 ≠ b2 2. Access search engine with possible b1, b2
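The two steps can be sketched in Python as follows. All company names, person names, and documents below are invented for illustration, the function name `resolve_query` is hypothetical, and a plain substring check stands in for the search engine.

```python
def resolve_query(is_a, head_of, documents,
                  company_type="computer company", min_heads=2):
    """Build Qset from typed relations, then find documents naming enough members.

    is_a:      set of (A, B) pairs meaning IS-A(A, B)
    head_of:   set of (B, A) pairs meaning HEAD-OF(B, A)
    documents: dict mapping a document id to its text
    """
    # Step 1: Qset = {B | IS-A(A, computer company), HEAD-OF(B, A)}
    companies = {a for a, b in is_a if b == company_type}
    qset = {person for person, company in head_of if company in companies}

    # Step 2: keep documents mentioning at least two distinct members of Qset.
    # Substring matching is a stand-in for a real search engine (assumption).
    hits = []
    for doc_id, text in documents.items():
        mentioned = {person for person in qset if person in text}
        if len(mentioned) >= min_heads:
            hits.append(doc_id)
    return qset, sorted(hits)


# Entirely fictional example data.
is_a = {("Acme Computing", "computer company"),
        ("Globex Systems", "computer company"),
        ("Initech Bakery", "bakery")}
head_of = {("J. Doe", "Acme Computing"),
           ("R. Roe", "Globex Systems"),
           ("P. Poe", "Initech Bakery")}
documents = {
    "d1": "J. Doe met R. Roe at a trade fair.",
    "d2": "J. Doe gave a keynote.",
    "d3": "R. Roe and P. Poe had lunch.",
}

qset, hits = resolve_query(is_a, head_of, documents)
```

Because b1 and b2 are drawn from a set, the condition b1 ≠ b2 is covered by counting distinct members rather than by an explicit comparison.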
What Google found for "Find documents mentioning at least two heads of computer companies!" [Figure: #1 hit, 14.08.2005, www.google.com]
Conclusion • Typed terms and relations can facilitate processing of electronic documents for a wide range of applications • Rapid annotation alleviates the acquisition bottleneck by - globally annotating - local dependencies • Intuitive tools for annotation are highly important for producing large amounts of annotation in a short time
QUESTIONS?!? THANK YOU
Bonus material • Co-occurrences • Co-occurrences of higher orders
Statistical Co-occurrences • Occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors) • Significant co-occurrences reflect relations between words • Significance measure (log-likelihood): - k is the number of sentences containing a and b together - ab is (number of sentences with a) * (number of sentences with b) - n is the total number of sentences in the corpus
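The formula itself is not present in this transcript. As a hedged illustration, the sketch below uses a Poisson-based significance of the form (λ − k·log λ + log k!) / log n with λ = ab/n, which is consistent with the three quantities defined above; that this is the exact measure shown on the slide is an assumption, and the function name is hypothetical.

```python
import math


def cooccurrence_significance(k, freq_a, freq_b, n):
    """Poisson-style significance of seeing a and b together in k sentences.

    k:      number of sentences containing both a and b
    freq_a: number of sentences containing a
    freq_b: number of sentences containing b
    n:      total number of sentences in the corpus

    Assumed form (not confirmed by the slide):
        (lam - k * log(lam) + log(k!)) / log(n),  with lam = freq_a * freq_b / n
    """
    if n <= 1 or freq_a <= 0 or freq_b <= 0:
        raise ValueError("need n > 1 and positive word frequencies")
    lam = freq_a * freq_b / n               # expected joint count under independence
    log_k_factorial = math.lgamma(k + 1)    # log(k!) without overflow
    return (lam - k * math.log(lam) + log_k_factorial) / math.log(n)


# Two words with 100 sentences each in a corpus of 100,000 sentences:
# under independence they would be expected together in 0.1 sentences.
rare = cooccurrence_significance(1, 100, 100, 100_000)
frequent = cooccurrence_significance(10, 100, 100, 100_000)
```

With these inputs the value grows as the observed joint count k moves further above the expected count, which is the behaviour a significance ranking needs.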
Iterating Co-occurrences • (sentence-based) co-occurrences of first order: words that co-occur significantly often together in sentences • co-occurrences of second order: words that co-occur significantly often in collocation sets of first order • co-occurrences of n-th order: words that co-occur significantly often in collocation sets of (n-1)th order When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest ranked co-occurrences of a word.
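The iteration can be sketched in Python as below. To stay short, a raw joint count with a minimum threshold stands in for the significance test, which is a simplification and not the method of the slides; the function names and the toy sentences are likewise made up for this illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_sets(units, top_n, min_count=2):
    """Map each word to its top_n co-occurring words across the given units.

    units: iterable of word collections (sentences, or co-occurrence sets
           of the preceding order). Joint count replaces significance here.
    """
    pair_counts = Counter()
    for unit in units:
        for a, b in combinations(sorted(set(unit)), 2):
            pair_counts[(a, b)] += 1

    ranked = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            ranked.setdefault(a, []).append((count, b))
            ranked.setdefault(b, []).append((count, a))
    # Keep only the N highest ranked words; the counts themselves are dropped,
    # so values of one order do not carry over into the next.
    return {
        word: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_n]]
        for word, pairs in ranked.items()
    }


def next_order(previous_sets, top_n, min_count=2):
    """Treat the preceding order's co-occurrence sets as the new units."""
    return cooccurrence_sets((set(s) for s in previous_sets.values()), top_n, min_count)


# Toy corpus: "cat" and "dog" never share a sentence but share contexts.
sentences = [
    {"cat", "mouse"}, {"cat", "mouse"}, {"dog", "mouse"}, {"dog", "mouse"},
    {"cat", "pet"}, {"cat", "pet"}, {"dog", "pet"}, {"dog", "pet"},
]
first_order = cooccurrence_sets(sentences, top_n=5)
second_order = next_order(first_order, top_n=5)
```

In this toy corpus "dog" is absent from the first-order set of "cat" but appears in its second-order set, because both words have the same first-order neighbours.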