Ontology learning and population from text • Ch8 Population
Population • Population of ontology: • Finding instances of relations as well as of concepts • Requires full understanding of natural language • More modest target: • The extraction of a set of predefined relations • In this chapter: • No acquisition of instances of relations • The detection of instances of concepts
Population • Common Approaches • Corpus-based Population • A standard similarity-based approach • Learning by Googling • Semi-supervised approach • PANKOW • C-PANKOW
Common Approaches • Lexico-syntactic Patterns • Hearst patterns • Similarity-based Classification • Algorithm 12 • Data sparseness problem • Supervised Approaches • Predict the category of a certain instance with a model • Requires thousands of training examples to train the model • Not feasible when hundreds of concepts are possible tags
Similarity-based Classification of Named Entities • Using different similarity measures • Cosine, Jaccard, L1 norm, Jensen-Shannon, Skew • Using different feature weighting measures • Conditional, PMI, Resnik
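The slide only names the similarity measures, so the following is a minimal Python sketch of their usual textbook definitions over sparse context vectors, not the exact formulation used in the chapter. The function names, the dictionary representation and the skew parameter `alpha=0.99` are assumptions made for illustration. Jensen-Shannon and skew are divergences (lower means more similar) and expect vectors normalised to probability distributions.

```python
import math

def cosine(p, q):
    """Cosine of the angle between two sparse vectors (dict: feature -> weight)."""
    dot = sum(v * q.get(f, 0.0) for f, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def jaccard(p, q):
    """Overlap of the two feature sets, ignoring the weights."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) if a | b else 0.0

def l1_distance(p, q):
    """L1 norm of the difference (a distance: 0 means identical)."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))

def kl(p, q):
    """Kullback-Leibler divergence; q must be non-zero wherever p is."""
    return sum(v * math.log(v / q[f]) for f, v in p.items() if v > 0)

def jensen_shannon(p, q):
    """Average KL divergence of p and q from their mean distribution."""
    feats = set(p) | set(q)
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in feats}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def skew_divergence(p, q, alpha=0.99):
    """KL divergence of p from a mixture of q and p (alpha is an assumed value)."""
    feats = set(p) | set(q)
    mix = {f: alpha * q.get(f, 0.0) + (1 - alpha) * p.get(f, 0.0) for f in feats}
    return kl(p, mix)
```

An unseen named entity would then be assigned the concept whose context vector is closest under the chosen measure.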
Evaluation • Goal: learn a function fs • fa and fb: specified by two annotators • Functions as sets: • Measurement • Precision, Recall, F-measure, learning accuracy
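The formulas on this slide did not survive, so the sketch below uses the standard set-based definitions of precision, recall and F-measure, treating each function as a set of (instance, concept) pairs. This is an assumption about what the slide showed; the function name and the example pairs are made up for illustration, and learning accuracy is left out because it needs the taxonomy.

```python
def precision_recall_f1(system, reference):
    """Compare the system's (instance, concept) pairs with one annotator's pairs."""
    system, reference = set(system), set(reference)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical system output f_s and one annotator's assignments f_a
f_s = {("Mopti", "city"), ("Niger", "city")}
f_a = {("Mopti", "city"), ("Niger", "river"), ("Gao", "city")}
p, r, f = precision_recall_f1(f_s, f_a)
```

Running the same comparison against the second annotator and averaging gives a single score per measure.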
Experiments • Using Word Windows • n words to the left and right of a word of interest • Stopwords are excluded and windows do not cross sentence boundaries • Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta. • Mopti: traditional(1), biggest(1) • Niger: city(1), delta(1), view(1) • Gao: San(1), offer(1), town(1), junction(1) • San: offer(1), view(1), Gao(1), nice(1)
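A rough Python sketch of word-window feature extraction as described on this slide. The stopword list, the regex tokeniser and the function name are assumptions for illustration; the slide's own example also appears to lemmatise words (for instance "views" to "view"), which this sketch does not do.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the chapter)
STOPWORDS = {"the", "is", "a", "of", "and", "with", "one", "most", "along",
             "that", "to", "have", "has", "it", "also", "from", "over", "s"}

def window_features(text, target, n=1, stopwords=STOPWORDS):
    """Count the n non-stopword neighbours on each side of every occurrence of target."""
    counts = Counter()
    # Splitting into sentences first keeps windows inside sentence boundaries
    for sentence in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"[A-Za-z]+", sentence)
                  if t.lower() not in stopwords]
        for i, tok in enumerate(tokens):
            if tok == target:
                neighbours = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
                counts.update(w.lower() for w in neighbours)
    return counts

text = ("Mopti is the biggest city along the Niger. "
        "Mopti has a traditional ambience.")
mopti = window_features(text, "Mopti", n=1)
```

With `n=1` on these two sentences, "Mopti" receives the features "biggest" and "traditional", each with count 1.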
Experiments • Result:
Experiments • Result:
Experiments • Using Pseudo-syntactic Dependencies • Object-attribute pair • Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta. • Mopti: is_city(1), has_ambience(1) • Niger: has_delta(1) • Gao: junction_of(1) • San: offer_subj(1) • Result:
Experiments • Dealing with Data Sparseness • Using Conjunctions • Applies when two named entities are linked by a conjunction • Result:
Experiments • Dealing with Data Sparseness • Exploiting the Taxonomy • Compute the context vector of a certain term by considering the context vectors of its subconcepts • Take into account only the context vectors of direct subconcepts • Normalizing aggregated vectors: • Standard normalization of the vector • Calculating its centroid
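A minimal Python sketch of the aggregation step, under stated assumptions: the concept's own vector is included in the sum, "standard normalization" is read as scaling to unit length, and the centroid is the per-feature average over the aggregated vectors. The concept names and feature counts are hypothetical.

```python
import math
from collections import Counter

def sum_with_direct_subconcepts(concept, vectors, subconcepts):
    """Add the concept's own vector and those of its direct subconcepts only."""
    members = [concept] + list(subconcepts.get(concept, []))
    total = Counter()
    for member in members:
        total.update(vectors.get(member, {}))  # Counter.update adds counts
    return total, len(members)

def unit_length(vec):
    """Scale the vector to Euclidean length 1 (assumed meaning of 'standard')."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {f: v / norm for f, v in vec.items()} if norm else dict(vec)

def centroid(total, n):
    """Average the summed vector over the n vectors that were aggregated."""
    return {f: v / n for f, v in total.items()}

# Hypothetical taxonomy fragment and context vectors
vectors = {"city": {"has_port": 1}, "capital": {"has_port": 1, "has_parliament": 2}}
subconcepts = {"city": ["capital"]}
total, n = sum_with_direct_subconcepts("city", vectors, subconcepts)
city_centroid = centroid(total, n)
city_unit = unit_length(total)
```

Either normalisation keeps a concept with many subconcepts from dominating the similarity computation simply because its aggregated counts are larger.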
Experiments • Dealing with Data Sparseness • Exploiting the Taxonomy • Result:
Experiments • Dealing with Data Sparseness • Anaphora Resolution • Replace each anaphoric reference with the corresponding antecedent • The port capital of Vathy is dominated by its fortified Venetian harbor. • The port capital of Vathy is dominated by Vathy's fortified Venetian harbor. • Result:
Experiments • Dealing with Data Sparseness • Downloading Documents from the Web • Download 20 additional documents Di for each named entity i • Keep a document d only if its similarity is above a threshold of 0.2 • Result:
Experiments • Dealing with Data Sparseness • Post-processing • The k best answers of the system are checked for their statistical plausibility on the web • Result:
PANKOW • Pattern-based Annotation through Knowledge on the Web • Certain lexico-syntactic patterns as defined by Hearst can be matched in corpus AND World Wide Web
PANKOW • The Process of PANKOW • Step 1: Iterate over the set of entities to be classified and generate instances of patterns, one for each concept in the ontology. • For example: for the instance "South Africa" and the concepts "country" and "hotel", the resulting pattern instances are "South Africa is a country" and "South Africa is a hotel", or "countries such as South Africa" and "hotels such as South Africa". • Result 1: A set of pattern instances • Step 2: Google is queried for the pattern instances through its Web service API • Result 2: The counts for each pattern instance • Step 3: Sum up the query results to a total for each concept. • Result: The statistical web fingerprint for each entity, that is, the result of aggregating for each entity the number of Google counts for all pattern instances conveying the relation of interest.
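The three steps above can be sketched in Python as follows. This is an illustration under assumptions, not the PANKOW implementation: only the two patterns from the slide's example are used, plural forms are supplied by hand, the web query is replaced by a caller-supplied `hit_count` function, and the hit counts are invented. Picking the concept with the highest total as the final answer is likewise an assumption about how the fingerprint is used.

```python
def pattern_instances(entity, concept, plural):
    """Step 1: instantiate the two example patterns for one entity and one concept."""
    return [f"{entity} is a {concept}", f"{plural} such as {entity}"]

def web_fingerprint(entity, concepts, hit_count):
    """Steps 2 and 3: look up a count per pattern instance and sum per concept.

    concepts maps each concept to its plural form; hit_count is any callable
    returning a number of hits for a query string (a stand-in for the web query).
    """
    return {
        concept: sum(hit_count(query)
                     for query in pattern_instances(entity, concept, plural))
        for concept, plural in concepts.items()
    }

def classify(entity, concepts, hit_count):
    """Pick the concept with the largest aggregated count."""
    fingerprint = web_fingerprint(entity, concepts, hit_count)
    return max(fingerprint, key=fingerprint.get)

# Invented counts standing in for search results
FAKE_COUNTS = {
    "South Africa is a country": 1200,
    "countries such as South Africa": 800,
    "South Africa is a hotel": 3,
    "hotels such as South Africa": 0,
}
concepts = {"country": "countries", "hotel": "hotels"}
fingerprint = web_fingerprint("South Africa", concepts, lambda q: FAKE_COUNTS.get(q, 0))
best = classify("South Africa", concepts, lambda q: FAKE_COUNTS.get(q, 0))
```

Note that the number of queries grows with entities × concepts × patterns, which is the scaling problem C-PANKOW addresses.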
PANKOW • The Process of PANKOW
PANKOW • Evaluation • From the two annotators • Reference standards for subject A and B • Measurement: • Precision, recall, and F-measure
PANKOW • Evaluation • Measurement: • Average the results for both annotators
PANKOW • Result:
C-PANKOW • Shortcomings of PANKOW • A lot of actual instances of the pattern schema are not found • Large number of queries sent to the Google Web API • Does not scale to larger ontologies
C-PANKOW • C-PANKOW Process • The web page to be annotated is scanned for candidate instances. • For each instance i discovered and for each clue-pattern pair in the pattern library P, an automatically generated query is issued to Google and the abstracts or snippets of the first n hits are downloaded. • Then the similarity between the document to be annotated and each downloaded abstract is calculated. If the similarity is above a given threshold t, the actual pattern found in the abstract reveals a phrase which may describe the concept that the instance belongs to in the context in question. • The pattern matched in a certain Google abstract is only considered if the similarity between the original page and this abstract is above the threshold. In this way the pattern-matching process is contextualized. • Finally, the instance i is annotated with the concept c that has the largest number of hits as well as the most contextually relevant ones.
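A simplified Python sketch of the contextualised matching step. The snippets are passed in directly rather than downloaded, similarity is a bag-of-words cosine, a single "is a" pattern is matched with a regular expression, and each match is weighted by its snippet's similarity to the page. All of these are assumptions for illustration; the slides do not specify the similarity measure, the pattern library or the weighting.

```python
import math
import re
from collections import Counter

def bag(text):
    """Lower-cased bag of words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def annotate(page, instance, snippets, threshold=0.4):
    """Return the concept with the highest similarity-weighted number of matches."""
    pattern = re.compile(re.escape(instance) + r" is an? (\w+)", re.IGNORECASE)
    page_bag = bag(page)
    scores = Counter()
    for snippet in snippets:
        similarity = cosine(page_bag, bag(snippet))
        if similarity <= threshold:
            continue  # snippet is not about the same context as the page
        for match in pattern.finditer(snippet):
            scores[match.group(1).lower()] += similarity
    return scores.most_common(1)[0][0] if scores else None

# Hypothetical page text and search snippets
page = "Niger river delta views near Mopti and the Niger"
snippets = [
    "The Niger is a river with a large delta near Mopti",
    "Niger is a country with cheap flights and hotels booking",
]
concept = annotate(page, "Niger", snippets, threshold=0.4)
```

Here the second snippet falls below the threshold and is ignored, so "Niger" is annotated as a river in the context of this page.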
C-PANKOW • C-PANKOW Process
C-PANKOW • Evaluation • Same dataset and evaluation measures as PANKOW • However, C-PANKOW uses the 682 concepts of the pruned Tourism ontology as possible tags • Learning accuracy added as a measure
C-PANKOW • Result:
C-PANKOW • Result:
C-PANKOW • Result: