Probabilistic Latent Semantic Analysis as a Potential Method for Integrating Spatial Data Concepts
R.A. Wadsworth¹, A.J. Comber², P.F. Fisher²
• ¹ Centre for Ecology and Hydrology, Lancaster, UK
• ² Dept of Geography, Leicester University, UK
Motivation
We want to understand how the environment is changing, but natural resource inventories constantly develop new baselines. We therefore want some way to know how similar two categories are, so that we can decide whether inconsistencies between inventories represent change or error.
Earlier approaches
First we simply asked domain experts: "are 'a' and 'b' similar, dissimilar, or are you not sure?" But the expert has to make a great many judgements, experts are not always available, and you do not learn why they consider two concepts similar (or not). So we turned to (very) simple text mining: the more words two category descriptions have in common, the more similar the categories are taken to be (see the sketch below).
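A minimal sketch of that word-overlap idea; the category descriptions and the Jaccard-style score are illustrative assumptions, not the exact measure used in the proceedings paper:

```python
# Word-overlap similarity between two category descriptions
# (illustrative sketch; not necessarily the measure used in the paper).
def word_overlap_similarity(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not (words_a or words_b):
        return 0.0
    # Jaccard-style score: shared words relative to all distinct words
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical land-cover category descriptions
cat_a = "broadleaved deciduous woodland with a closed canopy"
cat_b = "deciduous forest and woodland with canopy cover above twenty percent"
print(word_overlap_similarity(cat_a, cat_b))
```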
Case Study
In the proceedings we use land-cover categories, but we are all here because of Andrew. So, what does his writing tell us about the underlying concepts behind his work?
Case Study – the data
We used the English-language abstracts from the papers provided on his web site. This is a biased sample: do the other papers contain concepts not covered by the English-language work? Do they contain collaborations I have missed? However, the aim here is simply to illustrate the process.
Case Study – the data
[Figure: co-authorship network. Red dots – collaborators; blue squares – papers in this analysis]
Text Mining Andrew's Abstracts
Example titles:
• "Object orientated modelling in GIS"
• "Processes in cadastre"
• "A formal model of correctness in cadastre"
• "Surveying mapping and LIS education in the USA"
• "Surveying education for the future"
• "Expert systems for GIS"
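One way to turn abstracts like these into the document-word frequencies that the analysis works on; the abstract texts below are placeholders, and the use of scikit-learn's CountVectorizer is an assumption rather than the tool actually used:

```python
# Sketch: build a document-word count matrix from a set of abstracts.
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "object orientated modelling in GIS ...",        # placeholder text
    "a formal model of correctness in cadastre ...",  # placeholder text
    "surveying education for the future ...",         # placeholder text
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)   # shape: (documents, words)
print(counts.shape)
print(vectorizer.get_feature_names_out()[:5])
```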
Why latent analysis?
If we knew what the underlying (hidden, latent) concepts were, we might be able to understand why two categories are considered to be similar.
Probabilistic Latent Semantic Analysis
PLSA is a "generative model". It assumes that documents describe themes and that words are associated with themes; what we observe is the frequency of words in documents:
P(d, w) = P(d) Σ_{z∈Z} P(w|z) P(z|d)
We therefore try to model what latent variables (the z's) exist.
Probabilistic Latent Semantic Analysis
In practice it is similar to clustering, but:
"Documents are not assigned to clusters, they are characterized by a specific mixture of factors with weights P(z|d). These mixing weights offer more modelling power and are conceptually very different from posterior probabilities in clustering models and (unsupervised) naive Bayes models." — Thomas Hofmann, 1999
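A compact numpy sketch of the standard EM updates for the PLSA model P(d, w) = P(d) Σ_z P(w|z) P(z|d); the random initialisation, iteration count, and dense-matrix input are arbitrary assumptions for illustration:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for PLSA on a (documents x words) count matrix (dense numpy array)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_given_z = rng.random((n_topics, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_docs, n_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility P(z|d,w) for every document-word pair
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # (d, z, w)
        p_z_given_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = counts[:, None, :] * p_z_given_dw                # (d, z, w)
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_given_z, p_z_given_d
```

Applied to a count matrix like the one sketched earlier (e.g. plsa(counts.toarray(), n_topics=9)), the returned P(z|d) weights characterise each abstract as a mixture of themes, which is exactly the point Hofmann makes above, and the top-weighted words in each row of P(w|z) are what suggest labels for the themes on the following slides.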
Nine Latent Themes in Andrew's Work
[Figure: themes "A", "B" and "C" – cadastral systems, metadata and cartography?]
Latent Themes in Andrew's Work
[Figure: themes "D" and "E" – education and technology?]
Latent Themes in Andrew's Work
[Figure: themes "F" and "G" – decisions and directions?]
Latent Themes in Andrew's Work
[Figure: themes "H" and "I" – data?]
Conclusions
Simple text mining allows you to relate categories to each other, but it is not always easy to say why they are related. PLSA gives some indication of the underlying (fundamental?) themes, but how stable or useful are the results?