BibPro: A Citation Parser System
Introduction • Integrating bibliographical information • Metadata • author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc. • Citation string • Thousands of variations • More than 2,000 formats in Endnote • Citation Parsing Problem • automatically recognize individual fields from a given citation string • A template citation parser
Machine Learning • Conditional Random Field • F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336. • Support Vector Machine • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 37-48. • Hidden Markov Model • K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42. • Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 49-60. • Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
Knowledge Base • A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion • Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006. • A knowledge base automatically constructed from an existing set of sample metadata records of a given area • E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
Template Base • Keep the citation style as a template • ParaCite http://paracite.eprints.org/ • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548. • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, 1175-1180.
Key Idea • Encode a citation string into a template for BLAST • Encode the citation style information as a protein sequence • Utilize bioinformatics sequence tools • BLAST • Use domain knowledge • Reserved words • Knowledge database (optional) • Blocking rules (common-sense knowledge)
Questions • How many symbols can be used in a protein sequence? BLAST uses 23 symbols • How many fields should be extracted from a citation? Choose the most commonly used fields • Which punctuation marks are treated as partition marks? Decide based on domain knowledge • How do we transform a citation into a protein sequence and retain its structural features? Define an encoding table
Encoding Knowledge • A [AUTHOR] • Name abbreviation • T [TITLE] • Length of Blocking • L [VENUE] • booktitle(conference): Proceedings Proc Workshop Conf Conference Symposium Sympos Symp International Intern Annual Annu • journaltitle: Transactions Trans Journal • techtitle(thesis): Tech rep Rpt TR Master Masters Ph PhD Thesis thesis Dissertation dissertation • V [VOLUME] • volume: Volume volume Vol vol Vo vo • issue: Number number Nr nr No no NO Nos • P [PAGE] • page: pp page pages PP Page Pages pg PG
Encoding Knowledge • Y [DATE] • month: January February March April May June July August September October November December Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept • year: 1900-2010 • F [EDITOR] • editor: eds Eds editors Editors editor ED Ed ed edited • S [INSTITUTION] • institution: University Univ Department Dept Corporation • M [PUBLISHER] • publisher: Press Pub Publishers Inc Publications
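The reserved-word lists on the two Encoding Knowledge slides can be read as a lookup table from keyword to field symbol. The following is a minimal Python sketch of that idea, not BibPro's actual code. It assumes case-sensitive exact matching (the lists already carry case variants) and uses only a subset of the listed words. The names `RESERVED`, `WORD_TO_SYMBOL`, and `encode_token` are hypothetical. The A [AUTHOR] and T [TITLE] rules (name abbreviation, block length) are not keyword-based and are left out.

```python
# Subset of the reserved words from the slides, grouped by field symbol.
RESERVED = {
    "L": "Proceedings Proc Workshop Conf Conference Symposium "
         "Transactions Trans Journal Thesis Dissertation",
    "V": "Volume volume Vol vol Number number No no",
    "P": "pp page pages PP Page Pages pg PG",
    "Y": "January February March April May June July August September "
         "October November December Jan Feb Mar Apr Jun Jul Aug Sep Sept "
         "Oct Nov Dec",
    "F": "eds Eds editors Editors editor Ed ed edited",
    "S": "University Univ Department Dept Corporation",
    "M": "Press Pub Publishers Inc Publications",
}

# Invert into word -> symbol for constant-time lookup.
WORD_TO_SYMBOL = {
    word: symbol for symbol, words in RESERVED.items() for word in words.split()
}


def encode_token(token):
    """Return the field symbol for a reserved word or a year, else None."""
    if token in WORD_TO_SYMBOL:
        return WORD_TO_SYMBOL[token]
    # The slides give 1900-2010 as the recognized year range.
    if token.isdecimal() and len(token) == 4 and 1900 <= int(token) <= 2010:
        return "Y"
    return None  # not a reserved word; left for later rules to classify
```

Tokens that return `None` here (names, title words, page numbers) would have to be classified by other rules, which these slides do not spell out.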
Tokenizing and Encoding Citation M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 -515 , March 1995 .
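The tokenization shown on this slide separates every punctuation mark from the surrounding words. A minimal Python sketch of such a tokenizer follows. The regular expression and the name `tokenize` are assumptions made for illustration. Unlike the slide, which shows `-515` as a single token, this sketch emits the hyphen as its own token.

```python
import re


def tokenize(citation):
    """Split a citation into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", citation)


citation = (
    'M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
    'networks used as autoassociators," IEEE Transactions on Neural '
    'Networks, vol. 6, pp. 512-515, March 1995.'
)
tokens = tokenize(citation)
print(tokens[:8])  # ['M', '.', 'Bianchini', ',', 'P', '.', 'Frasconi', ',']
```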
Blocking Mechanism • After encoding the citation, we can exploit the semi-structured nature of citations through special patterns in the sequence • Use blocking rules to merge each special pattern into a single unit • e.g. ADXRA → A
Blocking (2/2) • Index Form: ARGBRGLRVRPRYD • Keep each block's position information (start position and end position), e.g. “A” start: 0, end: 11
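The bookkeeping described here, one symbol per block plus its start and end token positions, can be sketched in Python. The actual blocking rules (such as the ADXRA pattern) are not given in the slides. This sketch therefore assumes a rule pass has already assigned every token a block label, and it only shows the merge and the position tracking. The function name `merge_blocks` is hypothetical. Reading R, G, and D as comma, quotation mark, and period is inferred from the example citation rather than stated in the source.

```python
from itertools import groupby


def merge_blocks(labels):
    """Collapse runs of equal labels into (symbol, start, end) blocks.

    `labels` holds one symbol per token. Positions are token indices and
    `end` is inclusive. Note that adjacent identical partition marks would
    also be merged by this simplified sketch.
    """
    blocks = []
    position = 0
    for symbol, run in groupby(labels):
        length = len(list(run))
        blocks.append((symbol, position, position + length - 1))
        position += length
    index_form = "".join(symbol for symbol, _, _ in blocks)
    return index_form, blocks


# Per-token labels for the example citation as tokenized on the slide:
# 12 author tokens, comma, quote, 7 title tokens (B = not yet identified), ...
runs = [("A", 12), ("R", 1), ("G", 1), ("B", 7), ("R", 1), ("G", 1),
        ("L", 5), ("R", 1), ("V", 3), ("R", 1), ("P", 4), ("R", 1),
        ("Y", 2), ("D", 1)]
labels = [symbol for symbol, count in runs for _ in range(count)]

index_form, blocks = merge_blocks(labels)
print(index_form)  # ARGBRGLRVRPRYD
print(blocks[0])   # ('A', 0, 11)
```

Keeping the positions is what later allows a field symbol to be mapped back to the original tokens it covers.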
Template Database • A record in the Template Database • A citation item with both citation string and metadata • Style Form • Index Form • Once the template database has been constructed, BibPro can provide the citation parsing service on-the-fly
Citation Style Template • Index Form (Unknown Answer) • Style Form (Known Answer)
Finding Citation Style Templates • Use a scoring mechanism • Find similar citation style templates • BLAST score matrix • Which fields exist in the query citation • The order of the partition marks represents a citation style • Choose by Index Form • Align the query citation with the citation style template • Score matrix (dynamic programming) • Content symbols map to content symbols • Partition marks map to partition marks • Choose the most suitable citation style template according to the alignment between Index Form and Style Form
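The lookup step can be illustrated with a small Python sketch. BibPro uses BLAST for this search. The sketch swaps in `difflib.SequenceMatcher` from the standard library as a simple stand-in similarity measure, so it shows the shape of the step rather than the real scoring. The `TemplateRecord` class, its field names, the `best_template` function, and the second template's forms are all hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class TemplateRecord:
    """One template database entry (field names are assumptions)."""
    citation: str                 # the original citation string
    style_form: str               # known answer: every block labeled
    index_form: str               # unknown answer: used for lookup
    metadata: dict = field(default_factory=dict)


def best_template(query_index_form, templates):
    """Return the template whose index form is most similar to the query."""
    def similarity(template):
        matcher = SequenceMatcher(None, query_index_form, template.index_form)
        return matcher.ratio()    # 1.0 means identical sequences
    return max(templates, key=similarity)


templates = [
    TemplateRecord("(IEEE-style example)", "ARGTRGLRVRPRYD", "ARGBRGLRVRPRYD"),
    TemplateRecord("(made-up second style)", "ADYDTDLD", "ADYDBDLD"),
]
chosen = best_template("ARGBRGLRVRPRYD", templates)
print(chosen.style_form)  # ARGTRGLRVRPRYD
```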
Parsing (Alignment Extraction)
ARGBRGLRVRPRYD (Query) Index Form
|||:||||||||||
ARGTRGLRVRPRYD (Template) Style Form
ARGTRGLRVRPRYD
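The alignment above can be reproduced with a small global alignment in the style of Needleman and Wunsch. The Python sketch below is not BibPro's implementation. The score values, the gap penalty, and the choice of R, G, and D as partition marks are illustrative assumptions. They follow only the slide's principle that content symbols should pair with content symbols and partition marks with partition marks. The names `score`, `align`, and `label_blocks` are hypothetical.

```python
PARTITION = set("RGD")  # assumed partition-mark symbols (comma, quote, period)
GAP = -1                # assumed gap penalty


def score(a, b):
    """Illustrative substitution scores; the real matrix is not given."""
    if a == b:
        return 2
    a_is_mark, b_is_mark = a in PARTITION, b in PARTITION
    if a_is_mark != b_is_mark:
        return -2                   # content symbol against partition mark
    return -1 if a_is_mark else 1   # two different marks / two content symbols


def align(query, template):
    """Global alignment; returns (score, list of aligned index pairs)."""
    n, m = len(query), len(template)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]),
                dp[i - 1][j] + GAP,   # gap in the template
                dp[i][j - 1] + GAP,   # gap in the query
            )
    # Trace back from the bottom-right cell, preferring the diagonal move.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


def label_blocks(query, template):
    """Map each query block index to the template field symbol it aligns with."""
    _, pairs = align(query, template)
    return {i: template[j] for i, j in pairs if template[j] not in PARTITION}


labels = label_blocks("ARGBRGLRVRPRYD", "ARGTRGLRVRPRYD")
print(labels[3])  # T
```

Here the unidentified block `B` at position 3 of the query aligns with `T` in the template, so the tokens covered by that block would be extracted as the title.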
Experiment • Dataset • INFOMAP Dataset • A total of 160,000 citation records were collected from digital libraries on the Web • Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR) • Cora Dataset • 500 records with diverse citation styles • FLUX-CiM Dataset • 2000 HS-domain records • 300 CS-domain records
Experiment • Evaluation • Token-Level • A is the number of true positive tokens • B is the number of false negative tokens • C is the number of false positive tokens • D is the number of true negative tokens • Field-Level
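The formulas that used the counts A, B, C, and D did not survive in this transcript. Assuming the conventional definitions of precision, recall, F1, and accuracy, which may differ from what the original slide showed, the token-level measures can be sketched in Python as follows. The function name `token_metrics` and the sample counts are made up for illustration.

```python
def token_metrics(A, B, C, D):
    """Conventional metrics from token counts (assumed definitions).

    A: true positives, B: false negatives,
    C: false positives, D: true negatives.
    Assumes A > 0 so that no denominator is zero.
    """
    precision = A / (A + C)
    recall = A / (A + B)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (A + D) / (A + B + C + D)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts, only to show the call.
metrics = token_metrics(A=8, B=2, C=2, D=88)
```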
Conclusion • Parsing citations is a challenging problem • Diversity in citation formats • We present a template-based parser • Parser System http://csclws.iis.sinica.edu.tw:8080/input.jsp • Template Generator System http://csclws.iis.sinica.edu.tw:8080/tpin.jsp
References • F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336. • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 37-48. • K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42. • Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 49-60. • Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006. • E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM. • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548. • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, 1175-1180. • Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008. • S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410. • Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.