Concept Space Construction Todd Littell 27SEP06
Roadmap • Last Year: • High quality concept space construction performed offline. • First version of inference engine via graph operations. • This Year: • High quality concept space construction performed online. • Second implementation of reasoning engine or graph mining application? • Future: • Online semantic net construction. • Sophisticated reasoning engine.
Semantic Net • Goal: Efficient, high-quality learning of a semantic network from text. • A Semantic Net is a kind of Knowledge Representation (KR) structure that is typically represented by a graph model. • What does a Semantic Net consist of? • Concepts • Entities • Types • Relationships: associative, categorical, functional, structural, mechanical, temporal, spatial… • A Concept Space or Association Network is a simplified SN that captures only concepts and concept-associations. Ref: www.jfsowa.com: “A cat is on the mat”.
Applications • Uses of Semantic Nets: • Knowledge Base for Reasoning and Inference: Fuzzy ER Model, Markov Net, Bayesian Net, Causal Net, etc. • Information Browsing: Navigation through the space, drill-up, drill-down, drill-across. • Domain Modeling: Communication, Information System construction, etc. • Query Formulation for Retrieval: Feedback, User Modeling, etc.
Related Knowledge Models • Other kinds of knowledge models: • Concept Graphs: see John Sowa’s site, http://www.jfsowa.com/cg/index.htm; defines rules for assertions, composition and limited inference. Also OWL/RDF. • Graphical Models: Belief Nets, Causal Diagrams, Signed Directed Graphs, Lattices. • Concept Lattices: FCA – mathematical theory for objects, attributes & mappings. • UML: formal modeling notation & semantics defined specifically for the software industry. • Express-G: formal modeling notation & semantics defined for engineering. • Moore/Mealy/Petri Nets: actionable semantics for modeling & simulation. • Aspects to compare: domain, expressiveness, ad hoc vs. well-defined semantics, discrete vs. continuous, typing, purpose/application, underlying theory, etc.
Associative Net Ref: Mark Steyvers, Joshua B. Tenenbaum, “The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth”, Cognitive Science 29, 2005.
Basic Algorithm • Calculate a measure of association between each pair of terms using a similarity measure and output the best associations. Let T = set of terms, D = set of documents, f(t,d) = frequency of term t in document d. Let adj(t) := { d | f(t,d) > 0 } be the term adjacency list, and adj(d) := { t | f(t,d) > 0 } be the document adjacency list.

For each t1 in T:
  For each d in adj(t1):
    For each t2 in adj(d):
      s1[t2] += g(f(t1,d), f(t2,d))
  For each t2 with s1[t2] > 0:
    s := h(s1[t2], params)
    if (s >= thresh) output (t1, t2, s)

• Notes: • Term vectors are sparse; there is no need to iterate through all dimensions. • Many similarity/distance measures exist, as well as other kinds of measures. • What counts as an “ideal similarity metric” is tied to the application. • All per-term calculations are independent, hence easily parallelizable.
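The sparse traversal above can be sketched in runnable Python. Here g is taken to be the product of frequencies and h the identity, which yields raw dot-product scores; both are placeholders for whatever measure the application actually uses.

```python
from collections import defaultdict

def term_associations(freq, thresh=0.0):
    """Sparse term-term association scores.

    freq: dict mapping (term, doc) -> frequency.
    g = product of frequencies, h = identity (illustrative choices).
    """
    # Build the two adjacency lists from the sparse frequency table.
    adj_t = defaultdict(dict)   # term -> {doc: freq}
    adj_d = defaultdict(dict)   # doc  -> {term: freq}
    for (t, d), f in freq.items():
        if f > 0:
            adj_t[t][d] = f
            adj_d[d][t] = f

    results = []
    for t1, docs in adj_t.items():
        scores = defaultdict(float)          # s1, indexed by t2
        for d, f1 in docs.items():
            for t2, f2 in adj_d[d].items():
                if t2 != t1:
                    scores[t2] += f1 * f2    # g(f(t1,d), f(t2,d))
        for t2, s in scores.items():         # h = identity here
            if s >= thresh:
                results.append((t1, t2, s))
    return results

# Toy corpus: "bee" and "hive" co-occur only in doc 0.
freq = {("bee", 0): 2, ("hive", 0): 1, ("bee", 1): 1, ("wing", 2): 3}
print(sorted(term_associations(freq, thresh=1.0)))
```

Only documents shared by both terms contribute, so the triple loop touches each nonzero matrix entry a bounded number of times rather than iterating over all |T| × |D| cells.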
BeeSpace Variations • Only interested in calculating for representative terms with thresh1 < freq < thresh2. • In some cases, only interested in calculating for user-specified documents. This implies working with a restricted matrix F' = D_R F D_C, where D_R and D_C are 0/1 selection matrices and F is the term-by-document frequency matrix. • Only need to output the top K similar terms. • Only need to output terms with sim > thresh3.
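The top-K variation can be handled with a bounded heap per source term. A minimal sketch (the helper name and interface are illustrative; `scores` stands for one term's already-computed association list):

```python
import heapq

def top_k(scores, k=3, thresh=0.0):
    """Keep only the K highest-scoring associations above a threshold.

    scores: iterable of (term, score) pairs for one source term.
    """
    filtered = ((s, t) for t, s in scores if s > thresh)
    return [(t, s) for s, t in heapq.nlargest(k, filtered)]

scores = [("hive", 2.0), ("wing", 0.5), ("queen", 1.5), ("dance", 1.0)]
print(top_k(scores, k=2, thresh=0.4))
```

`heapq.nlargest` keeps a heap of size K, so the cost is O(n log K) per term instead of a full sort.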
Co-occurrence Metrics • Many extensions are possible by incorporating weighting functions, or features such as POS tags, context, word distance, window size, etc. • Ref: Frequency Estimates for Statistical Word Measures by Terra & Clarke. • Similarity requirements: s(x,y) >= 0; s(x,y) > s(x,z) implies y is more similar to x than z; optionally s(x,y) = s(y,x).
MI & PWMI • Assuming MLE of the probabilities, the joint document frequency is f(x,y) := |adj(x) ∩ adj(y)| = Σ_d 1[f(x,d) > 0 ∧ f(y,d) > 0], i.e. the number of documents in which x and y co-occur.
MI & PWMI Generalizations • Obvious generalizations: utilize document weights p(d), e.g. p(x) = Σ_{d ∈ adj(x)} p(d) and p(x,y) = Σ_{d ∈ adj(x) ∩ adj(y)} p(d), which reduce to the MLE counts when p(d) = 1/|D|. • Non-obvious generalizations: Ref: Barry Robson, “Clinical and Pharmacogenomic Data Mining…”, Journal of Proteome Research, 2003. Ref: Jonathan Wren, “Extending the mutual information measure to rank inferred literature relationships”, BMC Bioinformatics, 2004.
Results • Look at results in spreadsheet…
“Make a Faster Wheel” • Optimized I/O • Parallelize • Better Algorithm • Better Code • Smarter Data Structures
Optimize I/O • Only use formatted I/O for human consumption; use binary I/O for all other cases. • Use buffered I/O if reading/writing small chunks at a time. • See handout.
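A small Python illustration of the binary, buffered style suggested above (the record layout is invented for the example):

```python
import os
import struct
import tempfile

# Write term-pair scores as fixed-size packed binary records instead of
# formatted text; a large buffer amortizes syscall cost over many writes.
path = os.path.join(tempfile.mkdtemp(), "scores.bin")
record = struct.Struct("<IIf")            # (term1_id, term2_id, score)

with open(path, "wb", buffering=1 << 16) as out:   # 64 KiB buffer
    for i in range(1000):
        out.write(record.pack(i, i + 1, 0.5))

# Read everything back in one pass and decode.
with open(path, "rb", buffering=1 << 16) as inp:
    data = inp.read()
rows = [record.unpack_from(data, i * record.size) for i in range(1000)]
print(len(rows), rows[0])
```

Each record is 12 bytes regardless of the magnitudes involved, and parsing is a fixed-offset unpack rather than text tokenization and float conversion.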
Parallelize • Do the problems split naturally? • Divide-n-conquer apply? • Level of parallelization: • Very coarse grained: distributed agents. • Coarse grained: parallel jobs. • Medium grained: forked processes. • Fine grained: multi-threaded.
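Since each term's association scores are independent, coarse-grained data parallelism fits naturally. A sketch with a process pool (the per-term work here is a stand-in for the real scoring loop):

```python
from multiprocessing import Pool

def score_term(t1):
    # Stand-in for the per-term association computation; each term's
    # scores depend only on shared read-only data, so terms can be
    # processed in parallel with no coordination between workers.
    return (t1, t1 * t1)

if __name__ == "__main__":
    terms = list(range(8))
    with Pool(processes=4) as pool:
        # Coarse-grained: one task per term, no shared mutable state.
        results = pool.map(score_term, terms)
    print(results)
```

The same split works at every granularity listed above: hand each worker a partition of T, whether the worker is a thread, a forked process, a cluster job, or a distributed agent.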
Better Algorithm • How to compare algorithms? • Time complexity. • Space complexity. • Parallelizability. • Time-to-develop.
Better Code • Know your language. • Factor invariant expressions outside of loops. • Pre-compute whenever possible: cache results. • Sacrifice OO-ness. • Customized data structures. • Optimized I/O. • Avoid “long” calls (e.g. network, disk, etc.). • Tune to memory hierarchy.
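Two of the points above, factoring invariants out of loops and caching results, in a minimal Python sketch (the IDF computation is just an example workload):

```python
import math
from functools import lru_cache

# Hoisting a loop-invariant expression: the naive version recomputes
# math.log(n_docs) on every iteration of the comprehension.
def idf_naive(doc_freqs, n_docs):
    return [math.log(n_docs) - math.log(df) for df in doc_freqs]

def idf_hoisted(doc_freqs, n_docs):
    log_n = math.log(n_docs)               # invariant: computed once
    return [log_n - math.log(df) for df in doc_freqs]

# Caching a pure function so repeated arguments are computed once.
@lru_cache(maxsize=None)
def log_df(df):
    return math.log(df)

print(idf_naive([1, 10], 100) == idf_hoisted([1, 10], 100))
```

The hoisted form does the same arithmetic with one log call fewer per element; `lru_cache` trades memory for time, which pays off when term frequencies repeat heavily, as they do under Zipfian distributions.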
Smarter Data Structures • Understand your language’s built-in collections library. • Roll-your-own data structures can often outperform generic libraries. Why? • Hybrid techniques.
To-Do/Unresolved • Decide what the complete set of applications will be for this component: browsing, inference, retrieval, etc. • Evaluate the metrics using SME (subject-matter expert) judgments. • Decide what set of mined relations are significant for those applications. • Investigate more advanced methods and compare trade-offs.