Automatic Indexing from Machine Readable Text

KeyGraph, September 19, 2006, 김경중 (Kim Kyung-Joong)

Introduction • Explosion in the amount of machine readable text • Dealing with complete documents : Limited space and time considerations • Representing a document using a small set of terms • Keywords : The terms assigned • Indexing : Assigning representative terms to a document • Terms representing author’s point • “Omni-directional robot vision for looking at vertical edges all around the robot with a single lens” • Camera? (frequent but…) • Use of a conical mirror for making a robot look at vertical edges all around itself

Previous Works • Human experts for indexing? • Authors for indexing? • Automatic indexing • Statistical indexing • Frequencies above a certain threshold • Average frequencies of terms in a large-scale database • Not appropriate for extracting the main point • Indexing by sections and titles • Indexing by natural language analysis • Requires a significant amount of background knowledge (syntactic, semantic, pragmatic knowledge, fonts for emphasis, and so on)

Our Aim • Automatic Indexing with … • Expresses the main point of the author, not the frequent terms he/she uses • Uses only information in the text of the document (i.e., no external knowledge such as a corpus, section structure, or natural language processing) • Achieves this with a simple and fast algorithm

KeyGraph Algorithm (1) • Indexing by a co-occurrence graph • 1) A building construction metaphor • Assumption • Technical or academic documents • A document is written to carry a few original points, and the terms in that document are related for expressing these points • Metaphor • Written document = Constructed building • Building • Foundations (statements for preparing basic concepts), walls, doors, and windows (ornamentation), roofs (main ideas in the document) • Roofs : the most important things • Roofs are supported by columns

KeyGraph Algorithm (2) • KeyGraph • Extracting foundations : Basic and preparatory concepts • Extracting columns : Relationships between terms in the document and the basic concepts extracted in the foundations step • Extracting roofs : Terms at the intersection of strong columns

Document Preparation • A document D is composed of sentences • Stop word list: A list of non-significant words that carry little meaning (“a”, “and”, “here”, etc.) • Word stems: Variants such as “run”, “running”, and “runs” are reduced to one stem • Phrase candidates: Sequences of words bounded by non-significant words, after stemming • a, b, c, d : {abcd}, {abc}, {ab}, {bcd}, {bc} and {cd} • Prefer longer phrases with high frequency • A term : A word or a phrase (unique)
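The phrase-candidate step can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names `content_runs` and `phrase_candidates` and the tiny stop word set are hypothetical, and stemming is left out.

```python
# Illustrative stop word list (an assumption; the slide lists only a few examples).
STOP_WORDS = {"a", "and", "here", "the", "of"}


def content_runs(tokens, stop_words=STOP_WORDS):
    """Split a token list into runs of words bounded by stop words."""
    runs, current = [], []
    for tok in tokens:
        word = tok.lower()
        if word in stop_words:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(word)
    if current:
        runs.append(current)
    return runs


def phrase_candidates(words):
    """Return every contiguous run of two or more words, longest first."""
    n = len(words)
    candidates = []
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidates.append(tuple(words[start:start + length]))
    return candidates


# The slide's example: the run a, b, c, d yields six candidates.
candidates = phrase_candidates(["a", "b", "c", "d"])
```

For the run a, b, c, d this produces the same six candidates as the slide: {abcd}, {abc}, {bcd}, {ab}, {bc} and {cd}. Choosing among them by length and frequency is a separate step not shown here.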

Extracting Foundations • Graph G for document D • Nodes: Representing terms (the 30 most frequent terms) • Terms representing fundamental concepts : Appear frequently • Links: Representing co-occurrence (term pairs which frequently occur in the same sentences) • Underlying concepts: Association among terms • Co-occurrence for indexing < TFIDF • Least number of edges : (number of nodes in G) - 1 • Cluster : Maximal connected subgraph • Notation: |x|_s is the number of occurrences of x in sentence s
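A rough sketch of the foundation step is given below. The association formula did not survive on this slide, so the sketch assumes the sum over sentences of min(|wi|_s, |wj|_s), which uses the slide's |x|_s notation; the function names `build_foundations` and `clusters` are hypothetical, and keeping exactly (number of nodes - 1) links is one reading of the "least number of edges" bullet.

```python
from collections import Counter
from itertools import combinations


def build_foundations(sentences, max_nodes=30):
    """Pick the most frequent terms as nodes and link the strongest pairs."""
    counts = [Counter(s) for s in sentences]
    total = Counter()
    for c in counts:
        total.update(c)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    nodes = [term for term, _ in ranked[:max_nodes]]

    # Assumed association: sum over sentences of min(|a|_s, |b|_s).
    assoc = {}
    for a, b in combinations(sorted(nodes), 2):
        score = sum(min(c[a], c[b]) for c in counts)
        if score > 0:
            assoc[(a, b)] = score

    # Keep at most (number of nodes - 1) of the strongest links.
    limit = max(len(nodes) - 1, 0)
    links = sorted(assoc, key=lambda pair: (-assoc[pair], pair))[:limit]
    return nodes, links


def clusters(nodes, links):
    """Return the maximal connected subgraphs as sets of terms (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


sentences = [
    ["robot", "mirror"],
    ["robot", "mirror", "edge"],
    ["lens", "camera"],
    ["lens", "camera"],
]
nodes, links = build_foundations(sentences)
foundation_clusters = clusters(nodes, links)
```

On this toy input the terms fall into two clusters, one around robot, mirror and edge and one around lens and camera. The pruning of weak links shown on later slides is not part of this sketch.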

Extracted Foundations

Extracting Columns • 12 top keys
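The scoring formula for keys is not preserved on this slide. As a hedged sketch, the code below assumes the form usually given in published descriptions of KeyGraph: based(w, g) sums |w|_s times the count of the other terms of cluster g in s, neighbors(g) sums based(w, g) over all terms, and key(w) = 1 - product over clusters of (1 - based(w, g) / neighbors(g)). The function name `key_scores` and the toy data are hypothetical.

```python
from collections import Counter


def key_scores(sentences, clusters):
    """Score each term by how strongly it co-occurs with every cluster.

    Assumed form: key(w) = 1 - prod_g (1 - based(w, g) / neighbors(g)).
    """
    counts = [Counter(s) for s in sentences]
    vocab = set().union(*sentences)

    def others_in_cluster(c, g, w):
        # Occurrences of cluster terms in this sentence, excluding w itself.
        total = sum(c[t] for t in g)
        return total - c[w] if w in g else total

    remaining = {w: 1.0 for w in vocab}  # running product of (1 - ratio)
    for g in clusters:
        based = {
            w: sum(c[w] * others_in_cluster(c, g, w) for c in counts)
            for w in vocab
        }
        neighbors = sum(based.values())
        if neighbors == 0:
            continue  # no term co-occurs with this cluster
        for w in vocab:
            remaining[w] *= 1 - based[w] / neighbors
    return {w: 1 - p for w, p in remaining.items()}


sentences = [
    ["robot", "mirror", "vision"],
    ["lens", "camera", "vision"],
    ["robot", "mirror"],
    ["lens", "camera"],
]
scores = key_scores(sentences, [{"robot", "mirror"}, {"lens", "camera"}])
top_keys = sorted(scores, key=scores.get, reverse=True)[:12]
```

In the toy data, "vision" is less frequent than the cluster terms but co-occurs with both clusters, so under the assumed formula it receives the highest key score. This matches the deck's point that the main idea need not be the most frequent term.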

Extracting Roofs (Keyword Extraction) • Top 12 keywords • Figure legend: HF (high-frequency term), key

A Weak Link

Results After Pruning

Performance • 5900 documents (AI) • 9 users (AI) • 130 queries • Average of 5 queries = 26 points

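Earthquake Prediction • Predict where earthquakes will occur using earthquake sequence data • Search for latent (unknown) hypocenters in order to predict large-scale, widespread earthquakes in advance

Earthquake Prediction (2) • Map of the Kansai region and the generated KeyGraph; the red areas are latent epicenters (chance) • Region 39 is the epicenter of the Great Kobe Earthquake (1995, 7,000 deaths) • This KeyGraph was built from data up to 1992, before the great earthquake occurred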
