Graphical Representations of Knowledge and Its Distribution

Cliff Behrens
Information Analysis Applied Research
Telcordia Technologies, Inc.
973.829.5198
cliff@research.telcordia.com

Workshop on Statistical Inference, Computing and Visualization for Graphs
Stanford University, August 1-2, 2003

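Knowledge, Consensus and Information Sharing

[Figure: diagram of information sharing among individuals in a single COI. Labels recovered from the slide: "Cultural Knowledge Derived from Consensus", "Consensus", "Knowledge", "Individual Knowledge", "Information Sharing Among Individuals in a Single COI".]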

Schemer Knowledge Validation Services
• Issues with CSCW technology
  • CSCW research focuses on new tools, less on motivating their use
  • Collaborative model building often lacks scientific rigor and quality control
• Schemer
  • Web-based technology that derives knowledge from consensus among Subject Matter Experts (SMEs)
  • Knowledge-based collaboration reveals the distribution of domain expertise among panelists
  • Metrics for qualifying panelists and validating the models they produce
    • validates saliency of domain to SMEs
    • estimates competency of SMEs
    • yields best answers based on responses of SMEs weighted by their respective competencies
  • Generic service, but first tried on SIAM® influence networks

SIAM® Influence Net Example

Mathematics of Consensus Analysis (Romney et al. 1986)
• The formal model consists of a data matrix $X$ containing the responses $X_{ik}$ of SMEs $i = 1, \dots, N$ on items $k = 1, \dots, M$
  • From this matrix a symmetric matrix $M^*$ is estimated. Its off-diagonal elements hold the empirical point estimates $M^*_{ij}$, the proportion of matching responses on all items between SMEs $i$ and $j$, corrected for guessing (if appropriate).
• Obtain an approximate solution yielding estimates of the individual SME competencies (the $D^*_i$) by applying Maximum Likelihood Factor Analysis to fit the equation below and solve for the main diagonal values

  $$M^* = D^* D^{*\prime}$$

  • The relative magnitude of the eigenvalues ($\lambda_1 > 3\lambda_2$) implies a single-factor solution
  • The $D^*_i$ are the loadings of the SMEs on the first factor

  $$D^*_i = v_{1i}\sqrt{\lambda_1}$$

• The estimated competency values ($D^*_i$) and the profile of responses for item $k$ ($X_{ik,l}$) are used to compute Bayesian a posteriori probabilities for each possible answer. Here $L$ is the number of possible answers per item, $Z_k$ is the correct answer to item $k$, and $X_{ik,l} = 1$ if SME $i$ gave answer $l$ to item $k$ and $0$ otherwise. The formula for the probability that an answer is the best or “correct” one follows:

  $$\Pr\left(\langle X_{ik}\rangle_{i=1}^{N} \mid Z_k = l\right) = \prod_{i=1}^{N} \left[D^*_i + \frac{1 - D^*_i}{L}\right]^{X_{ik,l}} \left[\frac{(1 - D^*_i)(L - 1)}{L}\right]^{1 - X_{ik,l}}$$
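The consensus-analysis steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Schemer implementation: responses are assumed to be coded as integers 0..L-1; the guessing correction is assumed to take the form (L·m − 1)/(L − 1); an iterated principal-axis fit stands in for the Maximum Likelihood Factor Analysis named on the slide; and the posterior assumes a uniform prior over the L answers. All function names are hypothetical.

```python
import numpy as np

def corrected_match_matrix(X, L):
    """M*: pairwise proportion of matching responses, corrected for guessing."""
    X = np.asarray(X)
    raw = (X[:, None, :] == X[None, :, :]).mean(axis=2)  # N x N match proportions
    return (L * raw - 1.0) / (L - 1.0)                    # assumed correction form

def estimate_competencies(M_star, n_iter=200):
    """Fit M* ~ D D' on the off-diagonal by iterated principal-axis factoring
    (a simple stand-in for maximum likelihood factor analysis)."""
    A = np.array(M_star, dtype=float)
    off_diag = A - np.diag(np.diag(A))
    np.fill_diagonal(A, off_diag.max(axis=1))     # initial guess for the diagonal
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(A)      # eigenvalues in ascending order
        lam1, v1 = eigvals[-1], eigvecs[:, -1]
        if v1.sum() < 0:                          # fix the arbitrary sign
            v1 = -v1
        D = v1 * np.sqrt(max(lam1, 0.0))          # D_i = v_1i * sqrt(lambda_1)
        np.fill_diagonal(A, D ** 2)               # re-estimate the main diagonal
    return D, eigvals[::-1]                       # eigenvalues, largest first

def answer_posteriors(X, D, L):
    """Posterior over answers per item from the slide's product formula,
    normalized under an assumed uniform prior."""
    X = np.asarray(X)
    D = np.clip(D, 1e-6, 1 - 1e-6)
    log_hit = np.log(D + (1 - D) / L)             # term used when X_ik,l = 1
    log_miss = np.log((1 - D) * (L - 1) / L)      # term used when X_ik,l = 0
    loglik = np.empty((X.shape[1], L))
    for l in range(L):
        match = (X == l)                          # X_ik,l indicators, N x M
        loglik[:, l] = (match * log_hit[:, None]
                        + (~match) * log_miss[:, None]).sum(axis=0)
    loglik -= loglik.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)

# Simulated panel (illustrative data only): each SME knows an answer with
# probability D_true[i], otherwise guesses uniformly among the L answers.
rng = np.random.default_rng(0)
N, M, L = 6, 200, 4
D_true = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
key = rng.integers(0, L, size=M)
knows = rng.random((N, M)) < D_true[:, None]
X = np.where(knows, key[None, :], rng.integers(0, L, size=(N, M)))

D_est, eigvals = estimate_competencies(corrected_match_matrix(X, L))
post = answer_posteriors(X, D_est, L)
best = post.argmax(axis=1)                        # consensus answer per item
```

Comparing `eigvals[0]` with `3 * eigvals[1]` gives the single-factor check from the slide, and `best` can be compared against `key` in the simulation to see how well the weighted consensus recovers the answer key.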

Schemer Knowledge Validation Services

Knowledge-Based Communications Interface
• SME Contact Data
  • Email services
  • Meeting services
  • Other plug-ins
• Structured Collaboration and Advice Network
  • User’s relation to other SMEs
    • Most similar point-of-view
    • Most different point-of-view
    • Someone a bit more knowledgeable
    • Gurus
    • Novel thinkers
• Information Routing
  • Supports/challenges one’s point-of-view
  • Supports/challenges the consensus point-of-view

Latent Semantic Indexing (LSI): What is it?

[Figure: two panels, "Standard Vector Space Model (ndims = nterms)" and "Reduced LSI Vector Space Model (ndims << nterms)". Labels recovered from the slide: the terms "chip", "memory" and "computer"; the documents Doc 1, Doc 2 and Doc 3; and the axes "LSI Dimension 1" and "LSI Dimension 2".]

LSI: How Does It Work?
• Analyze training collection of documents
  • throw out stop words and mark-up
  • count frequencies of words in each document
• Compute term × document matrix
  • store word counts as entries in a matrix
  • apply appropriate weighting, e.g., log-entropy, to entries
• Compute LSI vector space
  • reduce term × document matrix with Singular Value Decomposition
• Fold new documents into LSI vector space
  • document vector computed from weighted sum of its term vectors
• Compute vector for query (“pseudo-document”)
  • query vector computed from weighted sum of its term vectors
• Search vector space for semantically-close term/document vectors
  • compute cosine of angle between query and other vectors
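The pipeline on this slide can be sketched with NumPy as follows. This is a toy illustration under assumptions, not the system described in the talk: the five example documents and the stop-word list are invented, tokenization is a plain whitespace split, the log-entropy weighting uses the common form log(1 + tf) times (1 + Σ p log p / log n), and the query is compared with documents after scaling by the singular values. All function names are hypothetical.

```python
import numpy as np

STOP_WORDS = {"the", "and", "for", "of", "a"}   # illustrative list only

DOCS = [                                         # invented toy collection
    "the computer chip stores memory",
    "computer memory and chip design",
    "silicon chip wafer for the computer",
    "potato chip and corn chip snack",
    "sugar wafer and corn snack",
]

def tokenize(text):
    """Lower-case, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def term_document_counts(docs):
    """Raw word counts as a terms x documents matrix."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in tokenize(d):
            counts[index[w], j] += 1
    return index, counts

def entropy_weights(counts):
    """Global entropy weight per term: 1 + sum_j p_ij log p_ij / log n_docs."""
    p = counts / counts.sum(axis=1, keepdims=True)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])

def build_lsi(docs, k):
    """Weight the term x document matrix and keep the top-k SVD dimensions."""
    index, counts = term_document_counts(docs)
    g = entropy_weights(counts)
    A = g[:, None] * np.log1p(counts)            # log-entropy weighted matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return {"index": index, "g": g, "A": A,
            "U": U[:, :k], "doc_vecs": Vt[:k].T * s[:k]}

def fold_in(model, weighted_terms):
    """Project a weighted term vector: a weighted sum of the rows of U_k."""
    return weighted_terms @ model["U"]

def query_vector(model, text):
    """Treat the query as a pseudo-document and fold it in."""
    tf = np.zeros(len(model["index"]))
    for w in tokenize(text):
        if w in model["index"]:                  # unseen terms are ignored
            tf[model["index"][w]] += 1
    return fold_in(model, model["g"] * np.log1p(tf))

def cosine_scores(model, q):
    """Cosine of the angle between q and every document vector."""
    dv = model["doc_vecs"]
    denom = np.linalg.norm(dv, axis=1) * np.linalg.norm(q)
    return (dv @ q) / np.where(denom > 0, denom, 1.0)

model = build_lsi(DOCS, k=2)
scores = cosine_scores(model, query_vector(model, "memory silicon"))
ranking = np.argsort(-scores)                    # document indices, best first
```

Folding a training document back in reproduces its own coordinates, which is a convenient sanity check on the projection: `fold_in(model, model["A"][:, j])` has cosine 1 with `model["doc_vecs"][j]`.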

Scalability: Large Document Collections and Polysemy

[Figure: terms plotted on axes "Dimension 1" and "Dimension 2", captioned "Many Undifferentiated Conceptual Domains/COIs". Term labels recovered from the slide: potato, corn, chip, silicon, sugar, wafer, valley, copper. The polysemous terms "chip" and "wafer" are called out in quotation marks.]

LSI: Ongoing Work
• Distributed LSI
  • Needed for LSI to scale to massive document collections
  • Adopts “divide and conquer” approach
• Sort documents by conceptual domain
  • recognizes documents created for different COIs
  • create more semantically homogeneous subcollections
  • apply cluster analysis, e.g., bisecting K-means
• Compute independent LSI vector spaces for each subcollection
  • more parsimonious representations of concept domains or contexts
• Compute similarity measures between spaces
  • construct graphs from terms shared by two vector spaces
  • compute similarity between these two graphs
• Discover appropriate search vector spaces for a query
  • cosine calculations (as before)
  • relevance feedback (as before)
  • query expansion
• Visualizations to explore semantic context for a query in different LSI vector spaces
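The slide does not say how the graphs built from shared terms are compared, so the following is only one possible reading, with every choice an assumption: each space is represented as a hypothetical dict mapping a term to its LSI vector, the graph for a space is the matrix of pairwise cosines between the shared terms, and two graphs are compared by the Pearson correlation of their edge weights. The term vectors are invented for illustration.

```python
import numpy as np

def term_graph(space, terms):
    """Weighted graph over `terms`: pairwise cosines of their vectors in `space`."""
    V = np.array([space[t] for t in terms], dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

def space_similarity(space_a, space_b):
    """Correlation of edge weights between the shared-term graphs of two spaces
    (an assumed similarity measure, not one stated in the talk)."""
    shared = sorted(set(space_a) & set(space_b))
    if len(shared) < 3:
        raise ValueError("need at least three shared terms to compare graphs")
    iu = np.triu_indices(len(shared), k=1)       # each edge counted once
    edges_a = term_graph(space_a, shared)[iu]
    edges_b = term_graph(space_b, shared)[iu]
    return float(np.corrcoef(edges_a, edges_b)[0, 1])

# Invented 2-D term vectors for three hypothetical spaces.
space_a = {"chip": [1.0, 0.2], "wafer": [0.9, 0.4],
           "silicon": [0.8, 0.1], "memory": [0.7, -0.3]}

theta = 0.7                                      # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
space_b = {t: Q @ np.array(v) for t, v in space_a.items()}   # same geometry, new basis

space_c = {"chip": [0.1, 1.0], "wafer": [0.2, 0.9],
           "silicon": [1.0, 0.0], "potato": [0.0, 1.0]}      # "chip" used differently

sim_ab = space_similarity(space_a, space_b)
sim_ac = space_similarity(space_a, space_c)
```

Comparing cosine graphs rather than raw coordinates is a deliberate choice here: independently computed LSI spaces have unrelated axes, and pairwise cosines are unchanged by a rotation of the basis, so `sim_ab` stays at 1 up to rounding while `sim_ac` drops when shared terms relate to each other differently.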

DLSI: Experiments with NSF-Movie Review Corpus

Vector Space       Dimensions   Non-stop Terms   Documents
NSF-Geology               298           25,963       3,255
NSF-Engineering           229           30,247       3,057
NSF-Biology               224           38,176       3,645
Movie Reviews             239           70,411       3,557
All Documents             282          122,685      13,514

DLSI: The Context of Term Meaning

[Figure: three graphs of semantic relationships between the top five terms retrieved for the query {travel, center, earth}. Node labels recovered from the slide: university, center/center’s, cooperative, earth, center, reports, travel, research, earth, science-fiction/sci-fi, travel, alien, earth.]
• Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only NSF geology abstracts.
• Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing only Ebert movie reviews.
• Graph of semantic relationships between top five terms retrieved for the query {travel, center, earth} from the vector space containing all documents.
