ChemXSeer: A Data Repository and Digital Library for Chemical Kinetics

Prasenjit Mitra, The Pennsylvania State University, University Park, PA 16803. In collaboration with Y. Liu, B. Sun, and C.L. Giles.


Presentation Transcript


  1. ChemXSeer: A Data Repository and Digital Library for Chemical Kinetics. Prasenjit Mitra, The Pennsylvania State University, University Park, PA 16803. In collaboration with Y. Liu, B. Sun, and C.L. Giles.

  2. Challenges
     • Formula and Name Search
     • Structure Search
     • Table Data Extraction & Search
     • Figure Data Extraction & Search

  3. Chemical Formula and Name Search
     • Extraction
       • Disambiguation from acronyms, abbreviations, etc., e.g., OH (hydroxyl versus Ohio)
       • Functional group identification by automatic segmentation
     • Indexing
       • Which sub-formulae should we index?
       • Index pruning
     • Search
       • Query semantics for partial/fuzzy search of chemical formulae
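
A minimal sketch of one possible query semantics for partial formula search, assuming flat formulae without parentheses or charges. The helper names `parse_formula` and `matches_partial` are hypothetical and are not taken from ChemXSeer. The sketch does not check element symbols against the periodic table and does not attempt the acronym disambiguation (OH versus Ohio) mentioned on the slide, which needs document context.

```python
import re
from collections import Counter

# One element symbol (capital letter, optional lowercase letter) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'CH3COOH' (assumption: no parentheses)."""
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # Some characters were skipped, so the string is not a flat formula.
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"cannot parse {formula!r} at position {pos}")
    return counts


def matches_partial(query: str, candidate: str) -> bool:
    """Partial match: the candidate has at least as many atoms of every query element."""
    wanted = parse_formula(query)
    present = parse_formula(candidate)
    return all(present[element] >= n for element, n in wanted.items())


if __name__ == "__main__":
    print(dict(parse_formula("CH3COOH")))
    print(matches_partial("OH", "CH3COOH"))
    print(matches_partial("Cl", "CH3COOH"))
```

Under this semantics a query is a lower bound on atom counts; a fuzzy variant could instead tolerate a bounded difference per element.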

  4. Formula/Name Search

  5. TableSeer: Table Data Extraction and Search
     • Automatically identify tables in digital documents
       • Varying formats, irregular formats
       • Table boundary detection [CIKM’08]
       • Separate column headers from data [JCDL’08]
       • Some tables have multiple rows of column headers
       • Units and other metadata extraction
     • Index data and metadata extracted from tables
       • Metadata: captions, references to the table in the text
       • Automatically identify the synopsis, i.e., the part of the paper discussing the table [in submission]
     • TableRank
       • Depends on table data, metadata, and document-level metadata [AAAI’08]
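
To illustrate the header/data separation step, here is a simple heuristic sketch; it is not the method of the cited JCDL’08 paper. It assumes a table has already been extracted as a list of rows of cell strings, and it treats the leading rows that contain mostly non-numeric cells as column headers. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import re

# Plain integers, decimals, and scientific notation such as 1.0e-10.
_NUMERIC = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def numeric_fraction(row):
    """Fraction of the non-empty cells in a row that look like numbers."""
    cells = [cell.strip() for cell in row if cell.strip()]
    if not cells:
        return 0.0
    return sum(bool(_NUMERIC.match(cell)) for cell in cells) / len(cells)


def split_header(rows, threshold=0.5):
    """Return (header_rows, data_rows).

    Leading rows whose numeric fraction is below `threshold` are treated as
    header rows, which allows for multi-row column headers.
    """
    n_header = 0
    while n_header < len(rows) and numeric_fraction(rows[n_header]) < threshold:
        n_header += 1
    return rows[:n_header], rows[n_header:]


if __name__ == "__main__":
    # Made-up placeholder values, not real kinetics data.
    table = [
        ["Reaction", "Rate constant", "Temperature"],
        ["", "cm3/s", "K"],
        ["A + B", "1.0e-10", "300"],
        ["A + C", "2.0e-10", "300"],
    ]
    header, data = split_header(table)
    print(len(header), "header rows;", len(data), "data rows")
```

A heuristic like this fails on tables whose data cells are mostly text, which is one reason a learned or layout-aware method may be preferable.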

  6. TableSeer: beta online working design of a table search engine

  7. Other Challenges
     • Crawl using limited resources
       • Clustering-based algorithm [CIKM’07]
     • Extract data from figures [AAAI’08]
     • Chemical Structure Search [in submission]
       • Graph database
       • What substructures to index?
       • What is a good ranking function?
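
The question of which substructures to index can be made concrete with a small filter-style index. The sketch below is an assumption for illustration, not the approach in the submitted paper: it indexes only the smallest possible substructures, single bonds labeled by their two element symbols, and uses them to discard molecules that cannot contain the query. The class and function names are hypothetical, and molecules are given as heavy-atom graphs (hydrogens omitted).

```python
from collections import defaultdict


def edge_features(atoms, bonds):
    """Return the set of bond labels such as 'C-O'.

    atoms: dict mapping atom id -> element symbol
    bonds: iterable of (atom id, atom id) pairs
    """
    return {"-".join(sorted((atoms[a], atoms[b]))) for a, b in bonds}


class FeatureIndex:
    """Inverted index from bond label to the names of molecules containing it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._names = set()

    def add(self, name, atoms, bonds):
        self._names.add(name)
        for feature in edge_features(atoms, bonds):
            self._postings[feature].add(name)

    def candidates(self, atoms, bonds):
        """Molecules containing every bond label of the query.

        This is a necessary condition only; each candidate would still need
        a subgraph isomorphism check, which is not implemented here.
        """
        result = set(self._names)
        for feature in edge_features(atoms, bonds):
            result &= self._postings.get(feature, set())
        return result


if __name__ == "__main__":
    index = FeatureIndex()
    index.add("methanol", {0: "C", 1: "O"}, [(0, 1)])
    index.add("ethane", {0: "C", 1: "C"}, [(0, 1)])
    index.add("ethanol", {0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
    print(sorted(index.candidates({0: "C", 1: "O"}, [(0, 1)])))
```

Larger indexed substructures prune more candidates but enlarge the index, which is the trade-off behind the slide's question.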

  8. Future Work
     • Information Extraction
       • From other types of figures: bar graphs, pie charts, etc.
     • Data Quality
       • How do we assign scores representing the fidelity of the data?
         • Depends upon the data source
         • Depends upon the accuracy of our online extractors
     • Make Extracted Table Data Queryable
       • Schema mapping: columns in extracted tables to similar columns in other tables
       • Automatic unit detection and normalization
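
As a rough illustration of automatic unit detection and normalization, the sketch below assumes that a column header carries its unit in parentheses or square brackets, for example "Ea (kcal/mol)", and converts molar energies to kJ/mol. The function names and the small unit table are assumptions made for this example; a real system would need a much broader unit vocabulary.

```python
import re

# Conversion factors to the canonical unit kJ/mol (1 cal = 4.184 J, thermochemical calorie).
_TO_KJ_PER_MOL = {
    "kJ/mol": 1.0,
    "J/mol": 1e-3,
    "kcal/mol": 4.184,
    "cal/mol": 4.184e-3,
}

# Header of the form "name (unit)" or "name [unit]".
_HEADER = re.compile(r"^(?P<name>.*?)\s*[\(\[](?P<unit>[^\)\]]+)[\)\]]\s*$")


def detect_unit(header):
    """Split a column header into (name, unit); unit is None if none is found."""
    match = _HEADER.match(header)
    if not match:
        return header.strip(), None
    return match.group("name").strip(), match.group("unit").strip()


def normalize_column(header, values):
    """Convert a column of molar energies to kJ/mol and rewrite its header."""
    name, unit = detect_unit(header)
    if unit not in _TO_KJ_PER_MOL:
        raise ValueError(f"unknown or missing unit in header {header!r}")
    factor = _TO_KJ_PER_MOL[unit]
    return f"{name} (kJ/mol)", [value * factor for value in values]


if __name__ == "__main__":
    print(detect_unit("Ea (kcal/mol)"))
    print(normalize_column("Ea (kcal/mol)", [1.0, 10.0]))
```

Once columns share a canonical unit, values from different extracted tables become directly comparable, which supports the schema mapping goal.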

  9. pmitra@psu.edu
     • chemxseer.ist.psu.edu
     • archseer.ist.psu.edu [Map Search]
     • => SeerSuite [SourceForge]
