Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.06.21

IJCAI 2003 Workshop on Learning Statistical Models from Relational Data: First-Order Probabilistic Models for Information Extraction
NIPS 15th, 2003: Identity Uncertainty and Citation Matching

Presentation Transcript


  1. IJCAI 2003 Workshop on Learning Statistical Models from Relational Data: First-Order Probabilistic Models for Information Extraction
     NIPS 15th, 2003: Identity Uncertainty and Citation Matching
     Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.06.21

  2. Outline
     • Introduction
     • Related works
     • Models for the bibliography domain
     • Experiment on model A
     • Desiderata for a FOPL
     • Conclusions

  3. Introduction – Citation Matching Problem
     • Citation matching: the problem of deciding which citations correspond to the same publication
     • Difficulties
       • Different citation styles
       • An imperfect copy of the book’s title
       • Different ways to refer to an object (identity)
       • Ambiguity
         • “Wauchope, K. Eucalyptus: Integrating Natural language Input with a Graphical User Interface”
         • Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.”?
     • Tasks
       • Parsing
       • Disambiguation
       • Matching
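The ambiguity on this slide can be made concrete with a small sketch that enumerates candidate author/title splits of the raw citation string. It is only an illustration: the function name `candidate_parses` and the rule of splitting after a period or colon are assumptions, not the parsing model used in the papers.

```python
import re

def candidate_parses(citation):
    """Return every (author, title) split at a '.' or ':' followed by whitespace.

    Hypothetical helper: a real parser would score these candidates with a
    probabilistic model instead of merely listing them.
    """
    parses = []
    for match in re.finditer(r"[.:]\s+", citation):
        # Keep a trailing period (it belongs to an initial), drop a trailing colon.
        author = citation[: match.start() + 1].rstrip(":")
        title = citation[match.end():]
        parses.append((author, title))
    return parses

citation = ("Wauchope, K. Eucalyptus: Integrating Natural language Input "
            "with a Graphical User Interface")
for author, title in candidate_parses(citation):
    print(f"author={author!r} title={title!r}")
```

Both readings from the slide appear among the candidates, which is why parsing has to be resolved jointly with disambiguation and matching rather than by a fixed rule.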

  4. Introduction – Citation Matching Problem: Examples
     Journal of Artificial Intelligence Research, or Artificial Intelligence Journal?

  5. Introduction – First-Order Probabilistic Models

  6. Introduction – Result of Model B

  7. Related Works
     • Information extraction (IE)
       • the Message Understanding Conferences [DARPA, 1998]
     • Bayesian modeling
       • finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003]
       • robot localization [Anguelov et al., 2002]
     • FOPL/RPM (Relational Probabilistic Model)
       • A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.

  8. Models for the Bibliography Domain – Model A
     • [Pasula et al. 2003]

  9. Models for the Bibliography Domain – Model A (Cont.)
     • Suggests a declarative approach to identity uncertainty using a formal language
     • Algorithm steps
       • Generate objects/instances
       • Parse and fill attributes
       • Inference (approximate, using MCMC)
       • Cluster the identities (publications)
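The inference and clustering steps can be illustrated with a generic Metropolis sampler over cluster assignments. This is a minimal sketch under stated assumptions, not the sampler from the paper: the function name `metropolis_clusterings`, the pairwise score (similarity minus 0.5), and the single-citation move are all hypothetical choices made for illustration.

```python
import math
import random

def metropolis_clusterings(similarity, n_items, steps=2000, seed=0):
    """Sample cluster labels for n_items citations; return the best state seen.

    `similarity(i, j)` is an assumed pairwise score in [0, 1]. The target
    distribution is proportional to exp(score), where score rewards
    co-clustering pairs with similarity above 0.5 (an arbitrary offset).
    """
    rng = random.Random(seed)

    def score(labels):
        total = 0.0
        for i in range(n_items):
            for j in range(i + 1, n_items):
                if labels[i] == labels[j]:
                    total += similarity(i, j) - 0.5
        return total

    labels = list(range(n_items))  # start with every citation in its own cluster
    current = score(labels)
    best_labels, best_score = labels[:], current
    for _ in range(steps):
        # Symmetric proposal: give one random citation a random label.
        proposal = labels[:]
        proposal[rng.randrange(n_items)] = rng.randrange(n_items)
        proposed = score(proposal)
        # Metropolis rule: always accept improvements, sometimes accept worse states.
        if proposed >= current or rng.random() < math.exp(proposed - current):
            labels, current = proposal, proposed
            if current > best_score:
                best_labels, best_score = labels[:], current
    return best_labels

# Toy similarities: citations 0 and 1 look alike, citation 2 does not.
table = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.1}
labels = metropolis_clusterings(lambda i, j: table[(min(i, j), max(i, j))], 3)
print(labels)
```

Citations that share a label in the returned list are treated as references to the same publication. A practical sampler would use the model's likelihood as the score and larger moves such as merging or splitting whole clusters.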

  10. Models for the Bibliography Domain – Model A (Cont.)
      • Attributes using unconditional probability
        • learn several bigram models
          • letter-based models of first names, surnames, and title words
          • using the following resources
            • the 2000 Census data on US names
            • a large A.I. BibTeX bibliography
            • a hand-parsed collection of 500 citations
      • Attributes using conditional probability
        • Using noisy channels for some attributes
          • the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames
          • The parameters of the corruption models are learnt online, using stochastic EM
        • Citation.parse
          • It keeps track of the segmentation of Citation.text
          • An author segment, a title segment, and three filler segments (one before, one after, and one in between)
        • Citation.text
          • Is constrained by Citation.parse, Paper.pubType, …
          • These models were learned using our pre-segmented file.
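A letter-based bigram model of the kind listed above might look like the following. It is a minimal sketch: the class name `LetterBigramModel`, the boundary markers, add-one smoothing, and the three-name training list are assumptions for illustration; the slides do not say how the original models were smoothed.

```python
import math
from collections import defaultdict

class LetterBigramModel:
    """Letter bigram model over words, with add-one smoothing (an assumption)."""

    START, END = "^", "$"

    def __init__(self, words):
        # counts[prev][cur] = how often letter `cur` followed letter `prev`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = {self.END}
        for word in words:
            chars = [self.START] + list(word.lower()) + [self.END]
            self.alphabet.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.counts[prev][cur] += 1

    def log_prob(self, word):
        """Smoothed log-probability of the word, including the end marker."""
        chars = [self.START] + list(word.lower()) + [self.END]
        vocab = len(self.alphabet)
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            row = self.counts.get(prev, {})
            total += math.log((row.get(cur, 0) + 1) / (sum(row.values()) + vocab))
        return total

# Toy training data standing in for the census name lists.
surnames = LetterBigramModel(["smith", "smyth", "jones"])
print(surnames.log_prob("smith"), surnames.log_prob("xqzvk"))
```

A name that resembles the training data receives a higher log-probability than an arbitrary letter string, which is what lets such a model serve as a prior over first names, surnames, and title words.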

  11. Models for the Bibliography Domain – Model B

  12. Models for the Bibliography Domain – Model B (Cont.)
      • Generating objects
        • The set of Author objects and the set of Collection objects are generated independently.
        • The set of Publication objects is generated conditional on the Authors and Collections.
        • CitationGroup objects are generated conditional on the Authors and Collections.
        • Citation objects are generated from the CitationGroups.

  13. Models for the Bibliography Domain – Model B (Cont.)
      • Fill attributes
        • Author.Name
          • is chosen from a mixture of a letter bigram distribution with a distribution that chooses from a set of commonly occurring names
        • Publications.Title
          • is generated from an n-gram model, conditioned on Publications.area
        • More specific relations and conditions between attributes
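The mixture used for Author.Name can be written as P(name) = w · P_common(name) + (1 − w) · P_bigram(name). The sketch below shows that computation; the function names, the mixture weight of 0.9, and the uniform stand-in for the letter bigram distribution are assumptions, since the slides give no parameter values.

```python
def uniform_letter_prob(name, alphabet_size=26):
    """Stand-in for a letter bigram model: every letter and the stop symbol
    are equally likely. Hypothetical, for illustration only."""
    return (1.0 / (alphabet_size + 1)) ** (len(name) + 1)

def mixture_name_prob(name, common_names, letter_prob, weight=0.9):
    """Mix a distribution over commonly occurring names with a letter model.

    `common_names` maps each name to its count; `weight` is the assumed
    probability of drawing from the common-name list.
    """
    total = sum(common_names.values())
    p_common = common_names.get(name, 0) / total if total else 0.0
    return weight * p_common + (1.0 - weight) * letter_prob(name)

common = {"smith": 3, "jones": 1}
print(mixture_name_prob("smith", common, uniform_letter_prob))
print(mixture_name_prob("zzyzx", common, uniform_letter_prob))
```

The letter model keeps the probability of an unseen name above zero, while the common-name component concentrates mass on names that actually occur often.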

  14. Experiment on Model A – Experiment Setting
      • Dataset
        • Citeseer’s hand-matched datasets
        • Each of these datasets contains several hundred citations of machine learning papers, half of them in clusters ranging in size from two to twenty-one citations
      • Citeseer’s phrase matching algorithm
        • a greedy agglomerative clustering method
        • based on a metric that measures the degree to which the words and phrases of any two citations overlap
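A greedy agglomerative clustering driven by word overlap can be sketched as follows. This is not Citeseer's implementation: Jaccard similarity over word sets, single-link merging, and the 0.5 threshold are stand-in assumptions, and the function names are hypothetical.

```python
import re

def word_set(citation):
    """Lower-cased alphanumeric tokens of a citation string."""
    return set(re.findall(r"[a-z0-9]+", citation.lower()))

def jaccard(a, b):
    """Word-overlap score: shared words divided by all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerative_cluster(citations, threshold=0.5):
    """Repeatedly merge the most similar pair of clusters (single link)
    until no pair reaches `threshold`. Returns lists of citation indices."""
    words = [word_set(c) for c in citations]
    clusters = [[i] for i in range(len(citations))]

    def similarity(c1, c2):
        return max(jaccard(words[i], words[j]) for i in c1 for j in c2)

    while True:
        best_score, best_pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = similarity(clusters[a], clusters[b])
                if score >= best_score:
                    best_score, best_pair = score, (a, b)
        if best_pair is None:
            return clusters
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

citations = [
    "Pasula H. Identity uncertainty and citation matching. NIPS 2003",
    "H. Pasula et al. Identity Uncertainty and Citation Matching, NIPS",
    "Pfeffer A. Probabilistic reasoning for complex systems. PhD thesis",
]
print(agglomerative_cluster(citations))
```

The two variants of the first citation share most of their words and are merged, while the thesis citation stays in its own cluster. A method like this relies on surface overlap only, which is the baseline the probabilistic model is compared against.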

  15. Experiment on Model A – Experiment Result

  16. Desiderata for a FOPL
      • Contains
        • A probability distribution over possible worlds
        • The expressive power to model the relational structure of the world
        • An efficient inference algorithm
        • A learning procedure which allows priors over the parameters
      • Has the ability
        • to answer queries
        • to make inferences about the existence or nonexistence of objects having particular properties
        • to represent common types of compound objects
        • to represent probabilistic dependencies
        • to incorporate domain knowledge into the inference algorithms

  17. Conclusions
      • First-order probabilistic models
        • a useful, probably necessary, component of any system that extracts complex relational information from unstructured text data
      • Some of the directions we plan to pursue in the future
        • defining a representation language that allows such models to be specified declaratively
        • scaling up the inference procedure to handle large knowledge bases

  18. Thanks!!
