Text Classification of USENET messages for a Conversation Visualisation System
Final Year Project: Final Presentation
Jolyon Hunter, cs91jh@surrey.ac.uk, www.jrth.co.uk
Tuesday 6th May 2003
Introduction
• Aim
  • “To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.”
• Objectives
  • To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
  • To analyse a 250,000+ word corpus of text to identify potential cues for classification
  • To specify and design a system for automatic classification of messages/conversations
  • To implement, test and evaluate this system
TEXT CLASSIFICATION OF USENET MESSAGES FOR A CONVERSATION VISUALISATION SYSTEM
Conversation Visualisation Systems? For example… “PeopleGarden”
Others include: “Loom” (Donath et al), “Netscan” (Smith), “Conversation Map” (Sack) and “CodeZebra” (Diamond et al)
Xiong, Rebecca & Donath, Judith 1999 “PeopleGarden: Creating Data Portraits for Users” MIT Media Laboratory http://smg.media.mit.edu/~becca/
Phenomena to Visualise… …and how to do it!
• Emotion (“Happy”, “Sad”)
• Agreement/Disagreement (“Argument”)
• Involvement – Sense of Community
• Character traits of users
and many more…
How to Classify? Automated Text Analysis
• “Smokey” (Spertus)
• “WebSOM” (Kohonen)
• “CLUTO” (Karypis)
Analysis Overview
• HOW?
  • Initial observations – phenomena + features
  • In-depth corpus analysis
• WHAT? 6000+ messages from various newsgroups (4 million+ words)
• UniS/CodeZebra Workshop – features (words)
• Using System Quirk to extract words; frequency counting (Kontext) >> Relative Frequencies
• Using gCLUTO to visualise data for interpretation
• WHY? To formulate programmable rules to code into a system
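The frequency-counting step can be sketched in Perl, the language used for the project's rule processor. This is a minimal illustration only, assuming that "relative frequency" means a word's count divided by the total number of words in the text being analysed. The helper name `relative_frequencies` and the tokenising pattern are hypothetical; they do not come from System Quirk or Kontext.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of System Quirk or Kontext):
# returns a hash ref mapping each lower-cased word to its relative
# frequency, assumed here to be count / total words in the text.
sub relative_frequencies {
    my ($text) = @_;

    # Very simple tokeniser: runs of letters and apostrophes.
    my @words = map { lc $_ } ($text =~ /[A-Za-z']+/g);
    my $total = @words;
    return +{} unless $total;

    my %count;
    $count{$_}++ for @words;

    my %relative = map { $_ => $count{$_} / $total } keys %count;
    return \%relative;
}

# Example: list every word of a short message with its relative frequency.
my $relative = relative_frequencies("I agree. I really do agree with you.");
printf "%-8s %.3f\n", $_, $relative->{$_} for sort keys %$relative;
```

Working with proportions rather than raw counts lets messages of different lengths be compared against the same rule thresholds.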
gCLUTO Visualisations
• Visualise clusters and the relationships between clusters
• Possible to see patterns or heuristics to help derive rules
• CLUTO has potential for future use within a system to automatically classify text, e.g. real-time clustering
Analysis: Creating Rules
• Possible to derive example rules from analysis
• More analysis – random sample using 6 classes:
• Similar patterns emerge
• Example rules also >>> SYSTEM!
System Development
• Process Model of Software Engineering: Requirements, Design, Implementation, Testing and Evaluation
• “System”: System Quirk > Rules > Program > CLASSIFICATION
• Rule-Based Processor: IF..THEN.. rules coded into a Perl program to produce classifications
Generic Conversation Visualisation System
“Message Text Analysis” Module
Perl Code: Key points
IF…THEN… RULES (as seen earlier)
CLASS COUNTER:
    # Count one hit for AGREEMENT when the cue word "agree"
    # exceeds the relative-frequency threshold.
    if (($word eq "agree") && ($relative{$word} > 0.003)) {
        $AGREEMENT++;
    }
CLASSIFICATIONS…
    # A class is assigned once its counter reaches two hits.
    if ($AGREEMENT >= 2) {
        $classification = "AGREEMENT";
    }
    if ($ARGUMENT >= 2) {
        $classification = "ARGUMENT";
    }
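For illustration, the class counter and the classification step can be combined into one self-contained routine. This is a sketch under stated assumptions, not the project's actual program: only the "agree" rule, the 0.003 threshold and the "two hits" cut-off come from the slide. The rule table `%RULES`, the other cue words, the function name `classify` and the "INCONCLUSIVE" fallback are hypothetical. Unlike the excerpt, where a later IF overwrites an earlier one, this sketch reports "INCONCLUSIVE" when no class, or more than one class, reaches the cut-off.

```perl
use strict;
use warnings;

# Hypothetical rule table: cue word => [class, minimum relative frequency].
# Only the "agree" rule and its 0.003 threshold appear on the slide;
# the remaining cue words are placeholders for illustration.
my %RULES = (
    agree    => [ 'AGREEMENT', 0.003 ],
    exactly  => [ 'AGREEMENT', 0.003 ],
    disagree => [ 'ARGUMENT',  0.003 ],
    wrong    => [ 'ARGUMENT',  0.003 ],
);

# Takes a hash ref of word => relative frequency and returns a class name.
sub classify {
    my ($relative) = @_;
    my %score;

    # IF the word is a cue word AND its relative frequency exceeds the
    # rule's threshold THEN add one hit to that rule's class.
    for my $word (keys %$relative) {
        my $rule = $RULES{$word} or next;
        my ($class, $threshold) = @$rule;
        $score{$class}++ if $relative->{$word} > $threshold;
    }

    # Classes whose counter reached the cut-off of two hits.
    my @hits = grep { $score{$_} >= 2 } keys %score;

    # Assumption: exactly one qualifying class is needed for a decision.
    return @hits == 1 ? $hits[0] : 'INCONCLUSIVE';
}

# Two AGREEMENT cue words fire, no ARGUMENT cue words do.
print classify({ agree => 0.02, exactly => 0.01, the => 0.05 }), "\n";   # prints AGREEMENT
```

Keeping the rules in a table rather than in separate IF statements means new cue words from further corpus analysis can be added without changing the counting logic.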
Testing & Evaluation
• Ten sample messages, each either “Agreement” or “Disagreement”
• Small sample
• Key excerpts given to human testers (ten people), who were asked to rate them
• System vs. Humans!
  • System correct 3 times; most results inconclusive
  • Human responses correlate with the system, but ambiguities also exist
• Conclusions? Results not conclusive but show promise > larger sample; more research
Recap: Mission Accomplished?
• Aim
  • “To investigate how messages and conversations on USENET newsgroups can be classified automatically as part of a system to visually represent online discussions.”
• Objectives
  • To review systems which visualise online discussions, enabling the identification of phenomena to be visualised
  • To analyse a 250,000+ word corpus of text to identify potential cues for classification
  • To specify and design a system for automatic classification of messages/conversations
  • To implement, test and evaluate this system
Text Classification of USENET messages for a Conversation Visualisation System
Thanks for listening… Any Questions?
Final Report
The Final Report for this project is also available online at: www.jrth.co.uk
REFERENCES
• “Loom”: Donath, Judith. 2002. “A Semantic Approach to Visualising Online Conversation”. Communications of the ACM 45(4): 45-49. http://web.media.mit.edu/~kkarahal/loom/index.html
• “Conversation Map”: Sack, Warren. 2000. “Design for Very Large-Scale Conversations”. Ph.D. Thesis, February 2000, MIT Media Laboratory. http://www.sims.berkeley.edu/~sack/cm/
• “Netscan”: Smith, Marc. 2001. “Netscan: A tool for measuring and mapping social cyberspaces”. http://netscan.research.microsoft.com
• “PeopleGarden”: Xiong, Rebecca & Donath, Judith. 1999. “PeopleGarden: Creating Data Portraits for Users”. MIT Media Laboratory. http://smg.media.mit.edu/~becca/
• “CodeZebra”: Diamond, Sara (Project Leader), Banff New Media Institute, Canada, plus many others (inc. Dr. A. Salway, University of Surrey). http://www.codezebra.net
• “Smokey”: Spertus, Ellen. 1997. “Smokey: Automatic Recognition of Hostile Messages”. Innovative Applications of Artificial Intelligence ’97. http://www.spertus.com/ellen/
• “WebSOM”: Kohonen, T. 1996 onwards. More details at http://websom.hut.fi/websom/
• “CLUTO”: Karypis, George. 2002. “CLUTO”, “gCLUTO” and “wCLUTO”. University of Minnesota, MN, USA. Software available from http://www-users.cs.umn.edu/~karypis/cluto/