1 / 41

Identity uncertainty and citation matching

This research overview explores the topics of identity uncertainty, citation matching, and information extraction. It discusses key contributions, generative model, experimental results, and implications for information extraction.

tdarr
Download Presentation

Identity uncertainty and citation matching

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Identity uncertainty and citation matching

  2. Some relevant news… • Facebook rolled out Graph Search on Tuesday. The natural-language search engine  can, for example, search for "music that people who like Mitt Romney like," or "photos of my friends in 1989." Although it's limited to four topics at the moment - people, places, interests and photos - the queries that users can run against the millions of photos and connections stored among Facebook's billion users are powerful, powerful tools. … Bing - like Google - has begun to try and provide answers to questions, rather than lists of links. But that's the same territory Zuckerberg and Co. have staked out, too.

  3. Outline • Background of research • Key contributions • Citation matching and information extraction • Identity uncertainty • Generative model • Experimental results • Implications for information extraction more generally

  4. Background of research • Record linkage (Felegi & Sunter 1969): • Naïve Bayes model for record-pair match/mismatch vector given entity match/mismatch • Trained on matched and unmatched pairs • Sensitive to population sizes in train/test • Bayesian analysis of identity: • Data association literature (multitarget tracking) • Huang and R 97 (freeway surveillance) • Previous work on RPMs (Koller and Pfeffer) • Previous work on MCMC for RPM++ (Pasula & R 01) • CiteSeer not working too well

  5. CiteSeer02: Russell w/4 Norvig • Russell S, Norvig P (1995) Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence. Englewood Cliffs, New Jersey • Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. • Russell S.; Norvig, P. Articial Intelligence - A Modern Approach. Prentice-Hall International Editions, 1995. • Russell S.J., Norvig P., (1995) Artificial Intelligence, A Modern Approach. Prentice Hall. • S. Russell and P. Norvig. Articial Intelligence, a Modern Approach. Prentice Hall, New Jersey, NJ, 1995.

  6. Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. Prentice-Hall Series on Artificial Intelligence. Prentice-Hall, Englewood Cliffs, New Jersey, 1995. • S. Russell and P Norvig. Artifical Intelligence: a Modern Approach. Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble • Stuart Russell and Peter Norvig. Articial Intelligence: A Modern Approach. Prentice Hall, 1995. • S. J. Russell and P. Norvig. Artificial Intelligence, a modern approach. Prentice Hall, Upper Saddle River, New Jersey 07458, 1995. • Stuart Russell and Peter Norvig. Artificial Intelligence. A modern approach. Prentice-Hall, 1995. • S. J. Russell and P. Norvig. Articial Intelligence: A Modern Approach. Prentice Hall. 1995. • S. Russell and P. Norvig, Artificial Intelligence A Modern Approach Prentice Hall 1995. • S. Russell and P. Norvig. Introduction to Artificial Intelligence. Prentice Hall, 1995.

  7. Stuart Russell and Peter Norvig. Artficial Intelligence: A Modern Approach. Prentice-Hall, Saddle River, NJ, 1995. • Stuart Russell and Peter Norvig. Articial Intelligence a modern approach. Prentice Hall series in articial intelligence. Prentice Hall, Upper Saddle River, New Jersey, 1995. • Chapter 18 Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig, Prentice-Hall, 2000. • Dynamics of computational ecosystems. Physical Review A 40:404--421. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall. • S. Russell, P. Norvig: Artificial Intelligence -- A Modern Approach, Prentice Hall, 1995. • Russell, S. \& Norvig, P. (1995) Artificial Intelligence: A Modern Appraoch (Englewood Cliffs, NJ: Prentice-Hall). Book Details from Amazon or Barnes \& Noble • Stuart Russell and Peter Norvig. AI: A Modern Approach. Prentice Hall, NJ, 1995. • S. Russell, P. Norvig. Artificial Intelligence: A Modem Approach. Prentice- Hall, Inc., 1995.

  8. 391-414. Russell SJ, Norvig P ( • Russell and Peter Norvig, "Artificial Intelligence - A Modern Approach (AIMA)", pp. 33 • Stuart Russell and Peter Norvig: Artificial Intelligence: A Modern Approach, Prentice-Hall, 1994. • Russell, S. \& Norvig, P., An Introduction to Artificial Intelligence: A Modern Approach, Prentice Hall International, 1996. • S. Russell, P. Norvig. Artician Intelligence. A modern approach. Prentice Hall, 1995. • Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. Contributing writers: John F. Canny, Jitendra M. Malik, Douglas D. Edwards. ISBN 0-13-103805-2. • Stuart Russell and Peter Norvig. Artificial Intelligence: A Mordern Approach. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1995.

  9. In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice Hall. • John Wiley. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc. • Stuart Russell and Peter Norvig: Artifcial Intelligence A Modern Approach, Englewood Clioes, NJ: Prentice Hall, 1995. • In Scherer, K.R. \& Ekman, P. Approaches to Emotion, 13--38. Hillsdale, NJ: Lawrence Erlbaum. Russell, S.J. and Norvig, P. 1995. Artificial Intelligent: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall. • Rosales E, Forthcoming Masters dissertation, Department of Computer Science, University of Essex, Colchester UK Russell S and Norvig P (1995) Artificial Intelligence: A Modern Approach. Prentice Hall: Englewood Cliffs, New Jersey. • S. Russell and P. Norvig (1995) Artificial Intelligence; A Modern Approach, Prentice Hall, New Jersey. • S. Russell and P. Norvig. Articial Intelligence. A Modern Approach. Prentice-Hall, 1995. ISBN 0-13-360124-2.

  10. Stuart J. Russell and Peter Norvig. Articial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995. • Stuart J. Russell and Peter Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, New Jersey, USA, 1995. 32 • Morgan Kaufmann Publishers. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall. • Stuart J. Russell and Peter Norvig. Articial Intelligence: AModern Approach,chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995. • W. Shavlik and T. G. Dietterich, eds., Morgan Kaufmann, San Mateo, CA. Russell, S. and Norvig, P. (1995). Artificial Intelligence - A Morden Approach. Englewood Cliffs, NJ: Prentice-Hall. • KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Advanced Digital Library Conference. to appear. Russell, S., and Norvig, P. 1995. Artificial Intelligence --A Modern Approach--. • Prentice-Hall. • Formal derivation of rule-based programs. IEEE Transactions on Software Engineering 19(3):277--296. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

  11. Russell, Stuart and Peter Norvig, Artificial Intelligence, A Modern Approach, New Jersey, Prentice Hall, 1995. • S. Russell, P. Norvig: Articial Intelligence: A modern approach; Prentice Hall (1995). • Rechenberg, I. (89). Artificial evolution and artificial intelligence. In Forsyth, R. (Ed.), Machine Learning, pp. 83--103 London. Chapman. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall. • Russell, S and Norvig, P. 1995. Articial Intelligence: A Modern Approach Prentice-Hall, Englewood Cli s, New Jersey, 1995. • Russell, S., \& Norvig, P. (1995) . Artificial intelligence: A modern monitoring methods for information retrieval systems: From search approach. Prentice-Hall series on artificial intelligence. Upper Saddle product to search process. Journal • of the American Society for Information Science, 47, 568 -- 583. River, NJ: Prentice-Hall. • Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Artificial Intelligence. Prentice Hall, 1995. • S. Russell and P. Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, 1995.

  12. Russell, Stuart and Norvig, Peter: Artificial Intelligence: A Modern Approach, Prentice Hall, Englewood Cliffs NJ, 1995 • S. Russell and P. Norvig. ????????? ????????????? ? ?????? ????????. Prentice Hall, Englewood Cli s, NJ, 1995. • S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach - The Intelligent Agent Book, Prentice Hall, NY, 1995. • S. Russell and P. Norvig. Artificial Intelligence-aModern Approach. Prentice Hall International, Englewood Cliffs, NJ,USA,1995. • S.J.Russell, P.Norvig: Arti cial intelligence. A modern approach", Prentice-Hall International, 1995. • In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice • Hall. • In Working Notes of the IJCAI-95 Workshop on Entertainment and AI/ALife, 19--24. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

  13. Stuart J. Russell and Peter Norvig. Artiilcial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, N J, 1995. • Academic Press. 359--380. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall. • Stuart J. Russell, Peter Norvig, Artifical Intelligence: A Modern Appraoch, Prentice-Hall, Englewood Cliffs, New Jersey. 1994. • Cambridge, MA: MIT Press. Russell, S. J., and Norvig, P. (1994). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall. • Morgan Kauffman. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall. • Fast Plan Generation Through Heuristic Search Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ. • Hoffmann \& Nebel Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.

  14. Stuart Russel and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 12.1 - 12.3, pages 367--380. Prentice Hall, 1995. • Stuart Russel and Peter Norvig. Artificial Intelligence, A Modern Approach. PrenticeHall, 1996. 2 • Stuart Russel, Peter Norvig, Articial Intelligence: A Modern Approach, Prentice Hall, New Jersey, US, 1995 • Russel, S., and Norvig, P. Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence. 1995. • S. Russel and P. Norvig. Artificial Intelligence, A Modern Approach, Prentice Hall: 1995. Book Details from Amazon or Barnes \& Noble • S. J. Russel and P. Norvig. Articial Intelligence A Modern Approach, chapter 14, pages 426-435. Prentice Hall Series in Articial Intelligence. Prentice Hall International, Inc., London, UK, rst edition, 1995. Exercise 14.3. • Russel, S. and P. Norvig. Articial intelligence: A modern approach, Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble

  15. S. Russel and P. Norvig Artificial Intelligence: A Modern Approach, MIT Press 1995. • Russel, S. and Norvig, P., "Artificial Intelligence: A Modern Approch," p. 111-114, Prentice-Hall. • J. Russel and P. Norvig. Artificial Intelligence, A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995. 71 • Stuart Russel and Peter Norvig. A Modern, Agent-Oriented Approach to Introductory Artificial Intelligence. 1995. • Stuart J. Russel and Peter Norvig. Artificial Intelligence---A Modern Approach, chapter 14, pages 426--435. Prentice Hall Series in Artificial Intelligence. Prentice Hall Internationall, Inc., London, UK, first edition, 1995. Excersice 14.3. • Russel S. and Norvig P. (1995). Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence. • S. Russel, P. Norvig Articial Intelligence - A Modern Approach Prentice Hall, 1995 • Russel, S., P. Norvig. Artificial Intelligence: A Modern Approach Prentice Hall 1995.

  16. Artificial Intelligence, S Russel \& P Norvig, Prentice Hall, 1995 21 • Russel, S.J, Norvig P: Artificial Intelligence. A Modern Approach, Prentice Hall Inc. 1995 • Russel, S., Norvig, P. (1995) Artificial Intellience - A modern approach. (Englewood Cliffs: Prentice Hall International).

  17. Key contributions • A “vertically integrated,” declarative, generative model from objects, relations, and interpretations all the way down to text strings • Clear distinction between entities and strings • Probability of identity calculated as a direct consequence of the model (no “similarity heuristics”) • Semantically driven “collective disambiguation” during parsing occurs automatically • MCMC over possible worlds (object existence and relational structures), scaled linearly with #citations

  18. Citation information extraction • Given a set of citation strings scraped from online papers • Determine • What distinct papers exist (including offline papers) • What authors (venues, publishers, etc.) exist • What the titles and authors (etc.) of the papers are • Which paper cites which paper

  19. Example [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. • Are these descriptions of the same object? Who are the authors? What is the venue? • Core task in CiteSeer, Google Scholar, over 300 record linkage companies • CiteSeer (2001) asserted that Russell and Norvig wrote over 100 distinct books

  20. Relational probability models • Taxonomic hierarchy of object classes (fixed) • Named instances in each class (fixed) • Complex attributesA denote typed relations between objects (e.g., AuthorOf(author,paper)) • Paper allows for nonfunctional relations and uncertainty over the (bounded) number #[A] of values for A for each object • Simple attributesB denote typed fixed-range functions • Probability models P(B | Parents(B)) where Parents(B) are attribute chains A1.….An.B’ • E.g., height ~ mother.height, father.height

  21. RPM for citations

  22. Priors for object attributes Names: 0.9 x (~ US census) + 0.1 x letter bigram Titles: Word+letter bigrams from AI BibTeX database #authors, #fnames, pubType: Counts in hand-parsed training set of 500 citations

  23. Observation models Corruption of names: Letter deletion/insertion/change First name(s) replaced by initials Omission/reordering of authors Corruption of titles: Letter and word deletion/insertion/replacement Parameters learned online by EM without requiring ground truth

  24. Models for Citation.parse and Citation.text • Range of parse attribute has two parts: • Style specifies order between title and authors plus how to write the authors (J. Smith vs Smith, J.). • Prior estimated from hand-segmented training set. • Segmentation (tiny grammar) specifies cut points: • <filler1>|<title>|<filler2>|<authors>|<filler3> [A50] “|I, Robot|.” By |I. Asimov|. Gnome Press, 1950. • Prior for cut points is uniform • Filler: letter + word bigrams estimated from training set • Filler 3 depends on PubType • Style, segmentation, obsTitle, names of obsAuthors, and fillers jointly determine text

  25. Comment • The model contains no “features” or other heuristics related to determining if two citations match each other

  26. Identity uncertainty • We name papers P1, P2, … cited by C1, C2, … • Uncertainty as to whether P1 = P2 etc. • If P1 = P2, just one title, author list; otherwise two • Possible world includes an assignmentι that specifies equivalence classes of papers: • {{P1},{P2}} vs. {{P1,P2}} • Probability model factors as P(ι)P(world | ι) • P(ι) = ΠCP(ιC); “known” classes (e.g., citations) are fixed sets of singletons (e.g., {{C1},…,{Ck}})

  27. Identity uncertainty contd. • Specify prior on number of objects |ιC| • E.g., number of authors ~ lognormal[6.9,2.3] • Make appropriate a priori uniformity assumption • Each paper equally likely to be cited • Each author equally likely to write a paper • Suppose ιC,k,m maps k named instances to m distinct objects in C out of n total; then • P(ιC,k,m) = n! / (n-m)!nk

  28. Identity uncertainty contd. • Why does the number of objects matter? • Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000] • Prior distribution for values is uniform • Observed objects A and B both have value 526881 • What is the probability that A=B?

  29. Identity uncertainty contd. • Why does the number of objects matter? • Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000] • Prior distribution for values is uniform • Observed objects A and B both have value 526881 • What is the probability that A=B? • If there are a thousand objects, 0.999

  30. Identity uncertainty contd. • Why does the number of objects matter? • Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000] • Prior distribution for values is uniform • Observed objects A and B both have value 526881 • What is the probability that A=B? • If there are a thousand objects, 0.999 • If there a billion objects, 0.001

  31. Identity uncertainty contd. • Why does the number of objects matter? • Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000] • Prior distribution for values is uniform • Observed objects A and B both have value 526881 • What is the probability that A=B? • If there are a thousand objects, 0.999 • If there a billion objects, 0.001 • Probability of identity is affected by • Number of objects, size of measurement space • Measurement accuracy, similarity between objects

  32. Inference • Given • a fixed assignment ι • a fixed “relational skeleton” (complex attributes) • RPM can be grounded as a Bayes net

  33. Inference contd. • Metropolis-Hastings MCMC • First part of proposal is a split/merge on ιC : • Drop two clusters of citations (i.e., papers) Pa, Pb • Add two empty clusters P1, P2 • Put citations from Pa, Pb u.a.r. into P1, P2 • Second part fills in attribute values: • Names, title: apply perturbation model in reverse (and occasionally pick a nearby census name) • Parse: sample directly from precomputed list

  34. MCMC inference

  35. Scaling • Naïve M-H merge proposals fail more frequently for larger worlds • Canopies (McCallum et al KDD-00; see also blocking methods in data deduplication): sets of entities that have >ε chance of matching according to a simple distance metric (i.e., don’t try merging “Smith” and “Baeza-Yates”). • Pick a canopy first, then a pair of papers in it • Runtime scales

  36. Citation Matching Results Four data sets of ~300-500 citations, referring to ~150-300 papers

  37. Cross-Citation Disambiguation Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994). Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?

  38. Cross-Citation Disambiguation Kenneth Wauchope (1994). Eucalyptus: Integrating natural language input with a graphical user interface. NRL Report NRL/FR/5510-94-9711, Naval Research Laboratory, Washington, DC, 39pp. Second citation makes it clear how to parse the first one Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994). Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?

  39. Cross-citation disambiguation • Later experiments (Milch) showed • 33% of singleton citations are parsed correctly • 64% of non-singleton citations are parsed correctly

  40. What about Russell and Norvig? • Still generates several clusters: • Correct version • Introduction to Artificial Intelligence • S. Russel and P. Norvig, Artificial Intelligence • Etc. • Copying of erroneous citations produces data whose best explanation has multiple clusters • Could add a copying model, plus times of citing and cited papers

  41. Summary • A somewhat nontrivial application of a declarative, relational, open-universe language • Relatively simple model for existence – did not allow for dependence on other objects • Inference results suggest scalability is not the huge barrier for model-based methods • Later work by Pasula raised accuracy to ~99% (comparable to humans); errors were mainly on partial or concatenated citation strings • Building the model was initially difficult

More Related