Develop tools that help users access and understand large quantities of multimodal information, extract implicit information from data, and apply KDT (Knowledge Discovery from Text). Covers the FACT system architecture, sources of information, GUI, association tasks, query language, constraints, BNF grammar, query execution, presentation of results, and the application of FACT to newswire data. Improves the discovery process over database queries.
Finding Associations in Collections of Text 99419-511 김유환
Introduction • The need to develop tools to help users access and understand large quantities of multimodal information • Knowledge discovery: the nontrivial extraction of implicit, previously unknown, and potentially useful information from data • KDT (Knowledge Discovery from Text)
The FACT System Architecture • Three sources of information • Knowledge Sources • Background Knowledge • unary and binary predicates over the keywords labeling the documents • Synonym dictionary • GUI • Text Collections • Must either already be labeled with a set of keywords • Or must be fed through a text categorization system that augments documents with such keywords
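One way to picture these inputs is the minimal Python sketch below. It is only an illustration, not FACT's actual data model: the document ids, keyword labels, export commodities, the partial G7 and Arab League sets, and the helper names `holds_unary` and `related` are all assumptions introduced here.

```python
# Hypothetical toy data: a keyword-labeled text collection.
documents = {
    "d1": {"USA", "Iraq", "crude"},
    "d2": {"Japan", "Egypt", "grain"},
}

# Unary predicates: each name maps to the keywords for which it holds.
# The sets are deliberately partial and only illustrative.
unary = {
    "G7": {"USA", "Japan", "Germany"},
    "ArabLeague": {"Iraq", "Egypt"},
}

# Binary predicates: each name maps a keyword to the keywords related to it.
# The commodity values are invented for the example.
binary = {
    "ExportCommodities": {"USA": {"grain"}, "Japan": {"cars"}},
}


def holds_unary(name, keyword):
    """True if the unary predicate `name` holds for `keyword`."""
    return keyword in unary.get(name, set())


def related(name, left, right):
    """True if the binary predicate `name` relates `left` to `right`."""
    return right in binary.get(name, {}).get(left, set())
```

With this layout, `holds_unary("G7", "USA")` is true, and `related("ExportCommodities", "USA", "crude")` is false for the toy data.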
Associations • FACT focuses on the task of finding associations in collections of text. • r = {t1,…,tn} : collection of documents • R = {I1,…,Im} : set of keywords • t(A) = 1 : A is one of the keywords labeling t • (X) : the set of all documents ti that are labeled (at least) with all the keywords in X • X is called a σ-covering if |(X)| >= σ • W => B : association over r • of all documents that are labeled with the keywords in W, at least a proportion γ of them are also labeled with the keywords in B
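The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration on invented data, not FACT's code. The names `sigma` (support threshold) and `gamma` (confidence threshold) are chosen here, and applying the support test to W ∪ B is an assumption based on the usual association-rule convention; the slide does not spell it out.

```python
def covering(docs, keywords):
    """Ids of all documents labeled with (at least) every keyword in `keywords`."""
    wanted = set(keywords)
    return {doc_id for doc_id, labels in docs.items() if wanted <= labels}


def is_sigma_covering(docs, keywords, sigma):
    """True if at least `sigma` documents carry all of `keywords`."""
    return len(covering(docs, keywords)) >= sigma


def confidence(docs, lhs, rhs):
    """Fraction of the documents labeled with `lhs` that are also labeled with `rhs`."""
    lhs_docs = covering(docs, lhs)
    if not lhs_docs:
        return 0.0
    both = covering(docs, set(lhs) | set(rhs))
    return len(both) / len(lhs_docs)


def satisfies(docs, lhs, rhs, sigma, gamma):
    """Assumed test for W => B: enough support for W ∪ B and enough confidence."""
    enough_support = is_sigma_covering(docs, set(lhs) | set(rhs), sigma)
    return enough_support and confidence(docs, lhs, rhs) >= gamma


# Hypothetical toy collection: document id -> set of keyword labels.
docs = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b"}}
```

Here `{"a"} => {"b"}` has support 2 (documents 1 and 2) and confidence 2/3, so it passes with `sigma=2, gamma=0.6` but fails with `gamma=0.7`.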
The Query Language • Association-discovery query • What types of keywords are desired on the left-hand and right-hand sides of any found associations • Predicates that any found association must satisfy • unary predicates • binary predicates : define relationships between keywords • Constraints on the size of the various components of the association • BNF grammar
The Query Language (2) Find : (5/0.5) c1:country, c2:country => t:topic Where : c1 ∈ G7, c2 ∈ {Arab League}, t ∉ ExportCommodities(c1) • At least half of the time, whenever a G7 country and an Arab League country label a document, the document is labeled by some topic that is not an export commodity of the G7 country, and this occurs at least 5 times in the collection
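The Where clause and the (5/0.5) thresholds of this query can be read as two simple checks. The sketch below is a hedged illustration rather than FACT's query evaluator: `matches_where` and `meets_thresholds` are hypothetical helpers, the Arab League set is deliberately partial, and the export-commodity table is invented for the example.

```python
G7 = {"Canada", "France", "Germany", "Italy", "Japan", "UK", "USA"}
ARAB_LEAGUE = {"Egypt", "Iraq", "Saudi Arabia"}  # partial list, for illustration only
# Hypothetical background knowledge: country -> export commodities.
EXPORT_COMMODITIES = {"USA": {"grain"}, "Japan": {"cars"}}


def matches_where(c1, c2, t):
    """Where clause: c1 in G7, c2 in the Arab League, t not an export commodity of c1."""
    return (
        c1 in G7
        and c2 in ARAB_LEAGUE
        and t not in EXPORT_COMMODITIES.get(c1, set())
    )


def meets_thresholds(n_lhs, n_both, support=5, conf=0.5):
    """(5/0.5): the association occurs at least 5 times, at least half of the time.

    n_lhs  = documents labeled with both countries
    n_both = those documents that are also labeled with the topic
    """
    return n_lhs > 0 and n_both >= support and n_both / n_lhs >= conf
```

For the toy background knowledge, `matches_where("USA", "Iraq", "crude")` holds, while `matches_where("USA", "Iraq", "grain")` does not because grain is listed as a USA export.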
Query Execution • Preliminaries • Every subset of a σ-cover is itself a σ-cover. • The set of candidate σ-covers is built incrementally, starting from singleton σ-covers and adding elements to a set so long as the set stays a σ-cover • Finding associations in the presence of constraints
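The incremental construction described above resembles a level-wise (Apriori-style) search. The sketch below shows that general technique on invented data; it is not FACT's implementation and leaves out the handling of query constraints. The function names and the `sigma` parameter are assumptions made for the example.

```python
from itertools import combinations


def support(docs, keywords):
    """Number of documents labeled with every keyword in `keywords`."""
    return sum(1 for labels in docs.values() if keywords <= labels)


def find_covers(docs, sigma):
    """Level-wise search for all keyword sets carried by at least `sigma` documents."""
    all_keywords = sorted(set().union(*docs.values()))
    # Level 1: singleton covers.
    level = [frozenset([k]) for k in all_keywords
             if support(docs, frozenset([k])) >= sigma]
    found = list(level)
    while level:
        current = set(level)
        size = len(level[0])
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) != size + 1:
                    continue
                # Pruning: every subset of a cover must itself be a cover.
                if all(frozenset(s) in current for s in combinations(union, size)):
                    candidates.add(union)
        # Keep only the candidates that really have enough support.
        level = [c for c in candidates if support(docs, c) >= sigma]
        found.extend(level)
    return found


# Hypothetical toy collection: document id -> set of keyword labels.
docs = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"b"}}
covers = find_covers(docs, sigma=2)
```

On the toy data, `{"b", "c"}` appears in only one document, so the three-keyword set is pruned before its support is ever counted.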
Presentation of Associations • Provide a browsing tool that helps the user easily focus on the subset of results that are potentially relevant
Applying FACT to Newswire Data • Reuters data • Background Knowledge : CIA World Factbook • Ran a series of queries using FACT and compared the CPU time and the number of associations found for each query • Results • The specification of background-knowledge constraints actually provides information that is exploited by FACT's discovery algorithm, speeding up the association-discovery process
Final Remarks • Improves the discovery process compared with plain database queries • Presents the user with an easy-to-use graphical interface in which discovery tasks can be specified