1 / 9

Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min

Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min. Overview. Introduce a method which mines a collection of Web search queries a collection of Web documents

lavonn
Download Presentation

Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min

  2. Overview • Introduce a method which mines • a collection of Web search queries • a collection of Web documents • to acquire • open-domain classes in the form of instance sets • e.g.,whales, seals, dolphins, sea lions • associated with class labels • e.g., marine animals • as well as large sets of open-domain attributes for each class • e.g., circulatory system, life cycle, evolution, food chain and scientific name for the class marine animals

  3. Acquire Labeled sets of Instances • two conditions must be met: • The class label must be a non-recursive noun phrase whose last component is a plural-form noun (e.g., zoonotic diseases). • The instance must also occur as a complete query somewhere in the query logs. Used to filter out inaccurate paris To emphasize precision or recall. However, this seems imply that labeled classes don’t overlap

  4. Mining Open-Domain Class Attributes • four stages • identification of a noisy pool of candidate attributes, as remainders of queries that also contain one of the class instances. • “cast jay and silent bob strike back” • construction of internal search-signature vector representations for each candidate attribute, based on queries that contain a candidate attribute and a class instance. • These vectors consist of counts tied to the frequency with which an attribute occurs with a given “templatized” query. e.g., “cast for kill bill” , feature “X for Y” • construction of a reference internal search-signature vector representation for a small set of seed attributes provided as input. A reference vector is the normalized sum of the individual vectors corresponding to the seed attributes • the amount of supervision is limited to seed attributes being provided for only one of the classes. High precision but low recall? • ranking of candidate attributes with respect to each class, by computing similarity scores between their individual vector representations and the reference vector of the seed attributes.

  5. Evaluation • Data set • 50 million unique queries submitted to Google in 2006 • The set of instances that can be potentially acquired by the extraction algorithm is heuristically limited to the top five million queries with the highest frequency within the input query logs. • 100 million Web documents in English, as available in a Web repository snapshot from 2006 • Extraction results • After discarding classes with fewer than 25 instances, the extracted set of classes consists of 4,583 class labels, each of them associated with 25 to 7,967 instances, with an average of 189 instances per class.

  6. Accuracy of Class Labels • A class label is: • correct, if it captures a relevant concept although it could not be found in WordNet • subjectively correct, if it is relevant not in general but only in a particular context, either from a subjective viewpoint (e.g., modern appliances), or relative to a particular temporal anchor (e.g., current players), or in connection to a particular geographical area (e.g., area hospitals); • incorrect, if it does not capture any useful concept (e.g., multiple languages). • The manual analysis of the sample of 200 class labels indicates that 154 (77%) are relevant concepts and 27 (13.5%) are subjectively relevant concepts, for a total of 181 (90.5%) relevant concepts, whereas 19 (9.5%) of the labels are incorrect.

  7. Accuracy of Class Instances • the manual inspection of the automatically-extracted instances sets indicates an average accuracy of 79.3% over the 37 gold-standard classes retained in the experiments. • They also claim 90% accuracy for class labels (37 out of 40 labels successfully matched with manual labels)

  8. Evaluation of Class Attributes

  9. Contribution • enables the simultaneous extraction of class instances, associated labels and attributes • Acquire thousands of open-domain classes covering a wide range of topics and domains • The accuracy exceeds 80% for both instance sets and class labels • the extraction of classes only a few commonly-used Is-A extraction patterns. • Extract attributes for thousands of open-domain, automatically-acquired classes • The amount of supervision is limited to five seed attributes provided for only one reference class. • The first approach to information extraction from a combination of both Web documents and search query logs, to extract open- domain knowledge that is expected to be suitable for later use

More Related