
Automatic Indexing from Machine Readable Text

The KeyGraph algorithm extracts the foundations, columns, and roofs of a document and uses them to index the document automatically for information retrieval.

lmanning

Presentation Transcript


  1. KeyGraph • September 19, 2006 • Kyung-Joong Kim (김경중)

  2. Introduction • Explosion in the amount of machine readable text • Dealing with complete documents : Limited space and time considerations • Representing a document using a small set of terms • Keywords : The terms assigned • Indexing : Assigning representative terms to a document • Terms representing author’s point • “Omni-directional robot vision for looking at vertical edges all around the robot with a single lens” • Camera? (frequent but…) • Use of a conical mirror for making a robot look at vertical edges all around itself

  3. Previous Works • Human experts for indexing? • Authors for indexing? • Automatic indexing • Statistical indexing • Frequencies above a certain threshold • Average frequencies of terms in a large-scale database • Not appropriate for extracting the main point • Indexing by sections and titles • Indexing by natural language analysis • Requires a significant amount of background knowledge (syntactic, semantic, and pragmatic knowledge, fonts for emphasis, and so on)

  4. Our Aim • Automatic indexing that … • Expresses the main point of the author, not the frequent terms he/she uses • Uses only information in the text of the document (i.e., no external knowledge such as a corpus, section structure, or natural language processing) • Is simple and fast

  5. KeyGraph Algorithm (1) • Indexing by a co-occurrence graph • 1) A building construction metaphor • Assumption • Technical or academic documents • A document is written to carry a few original points, and the terms in that document are related for expressing these points • Metaphor • Written document = Constructed building • Building • Foundations (statements for preparing basic concepts), walls, doors, and windows (ornamentation), roofs (main ideas in the document) • Roofs : the most important things • Roofs are supported by columns

  6. KeyGraph Algorithm (2) • KeyGraph • 1) Extracting foundations : Basic and preparatory concepts • 2) Extracting columns : Relationships between terms in the document and the basic concepts extracted in 1) • 3) Extracting roofs : Terms at the crossing of strong columns

  7. Document Preparation • A document : D is composed of sentences • Stop words list: A list of non-significant words that have little meaning (“a”, “and”, “here”, etc.) • Word stem (“run”, “running”, and “runs”) • Phrase candidates: Sequences of words bound by non-significant words and stems • a, b, c, d : {abcd}, {abc}, {ab}, {bcd}, {bc} and {cd} • Select longer phrase with high frequency • A term : A word or a phrase (unique)
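The phrase-candidate bullet above can be sketched in a few lines. This is only an illustration, assuming that a candidate is any contiguous sequence of two or more words inside a run delimited by stop words, which matches the six sequences listed for a, b, c, d. The function name `phrase_candidates` and the `min_len` parameter are hypothetical, and the frequency-based selection of longer phrases is not shown.

```python
def phrase_candidates(words, min_len=2):
    """Return every contiguous run of at least `min_len` words, longest first."""
    n = len(words)
    candidates = []
    for start in range(n):
        for end in range(start + min_len, n + 1):
            candidates.append(tuple(words[start:end]))
    # Stable sort: longer phrases first, equal lengths keep left-to-right order.
    candidates.sort(key=len, reverse=True)
    return candidates


# The run "a b c d" from the slide yields six candidates.
for phrase in phrase_candidates(["a", "b", "c", "d"]):
    print("".join(phrase))
```

Listing longer candidates first reflects the slide's preference for longer phrases; a real implementation would still need to count how often each candidate occurs before choosing one.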

  8. Extracting Foundations • Graph G for document D • Nodes: Representing terms (the 30 most frequent terms) • Terms representing fundamental concepts : Appear frequently • Links: Representing co-occurrence (term pairs that frequently occur in the same sentences) • Underlying concepts: Association among terms • Co-occurrence for indexing < TFIDF • Least number of edges : (number of nodes in G) - 1 • Cluster : Maximal connected subgraph • Notation: |x|_s = number of occurrences of x in sentence s
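The foundation step can be sketched as follows. The association formula itself is missing from the transcript, so this sketch assumes that the association of two terms is the sum over sentences of the smaller of their two counts, which is consistent with the |x|_s notation on the slide. The function name `extract_foundations` and the parameter `n_nodes` are hypothetical, input sentences are assumed to be lists of already stemmed, stop-word-filtered terms, and the pruning of weak links shown on later slides is left out.

```python
from collections import Counter
from itertools import combinations


def extract_foundations(sentences, n_nodes=30):
    """Return (nodes, links, clusters) for a list of tokenized sentences."""
    # Nodes: the n_nodes most frequent terms in the document.
    freq = Counter(t for s in sentences for t in s)
    nodes = [t for t, _ in freq.most_common(n_nodes)]
    node_set = set(nodes)

    # Assumed measure: assoc(a, b) = sum over sentences of min(|a|_s, |b|_s).
    assoc = Counter()
    for s in sentences:
        counts = Counter(t for t in s if t in node_set)
        for a, b in combinations(sorted(counts), 2):
            assoc[(a, b)] += min(counts[a], counts[b])

    # Links: the (number of nodes - 1) most strongly associated pairs.
    links = [pair for pair, _ in assoc.most_common(max(len(nodes) - 1, 0))]

    # Clusters: connected components of the linked graph, via union-find.
    parent = {t: t for t in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for t in nodes:
        clusters.setdefault(find(t), set()).add(t)
    return nodes, links, list(clusters.values())


doc = [["robot", "mirror", "edge"], ["robot", "mirror"],
       ["camera", "lens"], ["camera", "lens"], ["robot", "edge"]]
nodes, links, clusters = extract_foundations(doc, n_nodes=5)
print(clusters)
```

Capping the number of links at one fewer than the number of nodes keeps the graph sparse, so that terms fall into a few separate clusters instead of one large connected component.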

  9. Extracted Foundations

  10. Extracting Columns • 12 top keys
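The scoring formula for this slide is not in the transcript. As a sketch only, the code below assumes one possible formulation: a term's link to a cluster is measured by how often the term co-occurs with the cluster's terms in the same sentence, normalized by the cluster's total co-occurrence with all terms, and a term scores highly when it is tied to several clusters at once. The names `key_scores`, `based`, and `neighbors`, and the formulas in the docstring, are assumptions and may differ from what the slide showed.

```python
from collections import Counter


def key_scores(sentences, clusters):
    """Score each term by how strongly it co-occurs with the clusters.

    Assumed formulation (not taken from the slide):
      based(w, g)  = sum_s |w|_s * |g - w|_s
      neighbors(g) = sum_s sum_{w in s} |w|_s * |g - w|_s
      key(w)       = 1 - prod_g (1 - based(w, g) / neighbors(g))
    where |g - w|_s counts the terms of cluster g in sentence s, excluding w.
    """
    terms = {t for s in sentences for t in s}
    based = {w: [0] * len(clusters) for w in terms}
    neighbors = [0] * len(clusters)

    for s in sentences:
        counts = Counter(s)
        for gi, g in enumerate(clusters):
            g_count = sum(counts[t] for t in g)
            for w, c in counts.items():
                rest = g_count - c if w in g else g_count
                based[w][gi] += c * rest
                neighbors[gi] += c * rest

    scores = {}
    for w in terms:
        miss = 1.0  # probability-like chance that w touches no cluster
        for gi in range(len(clusters)):
            if neighbors[gi] > 0:
                miss *= 1.0 - based[w][gi] / neighbors[gi]
        scores[w] = 1.0 - miss
    return scores


doc = [["mirror", "robot", "edge"], ["mirror", "camera", "lens"],
       ["camera", "lens"]]
scores = key_scores(doc, [{"robot", "edge"}, {"camera", "lens"}])
top = sorted(scores, key=scores.get, reverse=True)[:12]  # e.g. the 12 top keys
print(top)
```

In this toy document "mirror" appears with both clusters and therefore ranks first, even though it is not the most frequent term, which is the behavior the building metaphor asks of a roof supported by several columns.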

  11. Extracting Roofs (Keywords Extraction) • 12 top keywords • Figure legend: HF (high-frequency term), key

  12. A Weak Link

  13. Results After Pruning

  14. Performance • 5900 documents (AI) • 9 users (AI) • 130 queries • Average of 5 queries = 26 points

  15. Earthquake Prediction • Predict where earthquakes will occur using earthquake sequence data • Search for potential (previously unknown) hypocenters in order to predict large, widespread earthquakes in advance

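  16. Earthquake Prediction (2) • Map of the Kansai region and the generated KeyGraph • Red areas are potential epicenters (chances) • Area 39 is the epicenter of the Great Kobe earthquake (1995, 7,000 deaths) • This KeyGraph was built from data up to 1992, before the great earthquake occurred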
