250 likes | 494 Views
A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering. Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D. Applied Software Engineering Research Group Oak Ridge National Laboratory. Outline.
E N D
A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D. Applied Software Engineering Research Group Oak Ridge National Laboratory
Outline • Introduction of Dynamic Information Stream and the issues • Bio-inspired Clustering • MSF Clustering Model Based on Bird Flock Collective Behavior • TFIDF not practical for dynamic data • MSF Document Clustering Algorithm • Multi-Agent Document Clustering Implementation • Future works and Conclusion
Text Challenge • Problem • How to effectively reduce the size of a large, streaming set of documents • “Give me the 10 documents that I need to read, out of the 1000 I received today?” • Characteristics • A steady flow of simple documents • Need to rapidly organize the documents into subsets • Select representative documents from the subsets
Approach • Use standard IR techniques to convert text to vectors • Use unsupervised learning/text clustering to organize the documents • Look for improvements in term weighting approaches
Document 1 Terms The Army needs senor technology to help find improvised explosive devices Army Sensor Technology Help Find Improvise Explosive device Document 2 ORNL has developed sensor technology for homeland defense ORNL develop sensor technology homeland defense Document 3 Mitre won contract develop homeland defense sensor explosive devices Mitre has won a contract to develop homeland defense sensors for explosive devices Standard Information Retrieval Vector Space Model Term List Army Sensor Technology Help Find Improvise Explosive Device ORNL develop homeland Defense Mitre won contract
Standard Textual Clustering Vector Space Model Cluster Analysis Dissimilarity Matrix D1 D2 D3 Documents to Documents Most similar documents TFIDF Euclidean distance Time Complexity O(n2Log n)
Issues (1) • Analysts are currently overwhelmed with the amount of information streams generated everyday. • Researches in clustering analysis mainly focus on how to quickly and accurately cluster static data collection. • Research on clustering the dynamic information stream is limited.
Solution: Bio-inspired Clustering • New computational algorithms inspired from biological models, such as ant colonies, bird flocks, and swarm of bees etc., can solve problems in dynamical environment. • These algorithms are characterized by the interaction of a large number of agents that follow the same rules. • The bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects for organizing of dynamical changed data.
Data Clustering by Ant Clustering Algorithm Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991. Agent (ant) action rule: agent move randomly in the grid. Agents only recognize objects immediately in front of them. Picking up or dropping item based on pickup probability and drop probability. • The movement of data objects has to be implemented through the movements of a small number of ant agents, which will slow down the clustering speed.
A New Clustering Algorithm Based on Bird Flock Collective Behavior
Flocking Model Flocking model, one of the first bio-inspired computational collective behavior models, was first proposed by Craig Reynolds in 1987. Alignment : steer towards the average heading of the local flock mates Separation : steer to avoid crowding flock mates Cohesion : steer towards the average position of local flock mates Alignment Separation Cohesion
Multiple Species Flocking (MSF) Model Feature similarity rule: Steer away from other birds that have dissimilar features and stay close to these birds that have similar features.
Every added or removed document from the set requires recalculation of the entire VSM TFIDF not practical for dynamic data Requires sequential processing Not good for a distributed agent approach Issues (2) Document Set must be known before VSM can be calculated
Inverse Corpus Frequency • Look at the forest, not the trees • We analyzed near 1 million documents from 6 major research corpora • We found 229,023 unique terms (A large dictionary contains around 70,000 terms) • We use this term frequency distribution as our “global” term frequency Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006) to appear Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland
Why this matters • We can now generate an accurate vector directly from a text document • That vector can be generated where ever the document resides • We can now use agents to create vectors from documents over a broad range of computers
Multiple Species Flocking (MSF) Document Clustering • Each document is projected as a bird in a 2D virtual space. • The birds that have similar document vector feature (same as the bird’s species and colony in nature) will automatically group together and became a bird flock. • Other birds that have different document vector features will stay away from this flock.
MSF Document Clustering Demo The Document collection Dataset
Performance Results of MSF, K-means and Ant Clustering Algorithm The clustering results of K-means, Ant clustering and MSF clustering Algorithm on synthetic* and document** datasets after 300 iterations * Four data types and each includes 200 two dimensional (x, y) data objects. x and y are distributed according to Normal distribution. ** 112 news article dataset, 11 categories *** The k-means algorithm has pre-knowledge of the cluster number. Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9 , pp. 505-515, August 2006, ISSN: 1318-7621
MSF Clustering Algorithm for Information Stream • The MSF clustering algorithm can achieve better performance in document clustering than the K-means and the Ant clustering algorithm. • This algorithm can continually refine the clustering result and quickly react to the change of individual data. This character enables the algorithm suitable for clustering dynamic changed document information, such as the text information stream.
Multi-Agent Document Clustering Implementation • JADE platform. (http://jade.tilab.com/) • Linux Cluster Machine. • One main node and three client nodes, which are connected with a Gigabit Ethernet switch. Each node contains a single 2.4G Intel Pentium IV processor and 512M memory. • Document datasets are derived from TREC collections. TREC: Text REtrieval Conference (http://trec.nist.gov/)
Current and Future Works • Switched agent platform from JADE to our light agent platform (ORMAC). • Built a control agent for automatically generating and deploying flock agents on all available cluster nodes of 135 node cluster. • Built agents to monitor the news update on several popular Internet news websites and collect news and feed into the system in real-time. • Building a better GUI interface
Conclusion • The heuristic searching mechanism of flocking model helps document agents to quickly form flocks and react to the change of any individual documents. • TFIDF enhancement, the TFICF vector space model, allows for parallel or distributed algorithms for information stream clustering • Agent architecture provides analysis approach that can run on cluster computers.
Location proxy agents Boid agents Node2 Node3 Node1 JADE system agents Head Node JADE Container JADE main Container The architectures the central model and distributed model JADE system agents JADE main Container Head Node Location proxy agent Node1 … Boid agents JADE Container the Single Processor model the distributed model