
Data Mining Email Discussion Lists for Scholarly Discourse in the Digital Age

Explore a new approach to data mining email discussion lists, such as H-Net, to analyze scholarly discourse in the digital age. Discover the opportunity and challenge of searching large text archives and the application of semantic-augmented consensus clustering.



Presentation Transcript


  1. H-Net and Scholarly Discourse in the Digital Age: A New Approach to Data Mining Email Discussion Lists. William Punch, Mark Lawrence Kornbluh, Wayne Dyksen. Michigan State University. November 5, 2006

  2. H-Net and Scholarly Discourse in the Digital Age • Opportunity & Challenge: Searching Large Text Archives • New Approach: Semantic-Augmented Consensus Clustering • Application: H-Net Discussion Lists

  3. IT Communication Revolution • People: Few → Many, Experts → Everyone • Speed: Slow → Instant • Quantity: Small → Vast • Style: Long → Short • Location: Limited → Everywhere • Lifetime: Short → Forever

  4. New… • Forms of Interactivity • Trans-Disciplinary Communities • Participants: Producers and Consumers • Levels of Democratization of Information • Impact on Scholarly Communication

  5. Electronic Archives • Mostly Text Based • Exponential Growth • Not Catalogued or Catalog-able • Little or No Metadata • Untapped Value: Current Users, Future Scholars

  6. Opportunity & Challenge [Diagram: Large Text Archive → Information → Knowledge]

  7. Typical Document Search • Data: Words and Phrases, Boolean Combinations; Automatic (“Unsupervised”); Not Sufficient: Too Little or Too Much • Metadata: Keywords & Annotations, Classifications; By Hand (“Supervised”); Not Scalable: 1M Messages, 3GB Text
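
To make the “too little / too much” problem concrete, here is a minimal sketch of Boolean keyword search over an inverted index. The sample messages and the search_and helper are invented for illustration; a real H-Net archive would hold on the order of a million texts.

```python
# Minimal Boolean keyword search over an inverted index (illustrative only).
from collections import defaultdict

# Hypothetical sample messages; a real archive would hold ~1M of these.
messages = [
    "Amistad and the historiography of the Atlantic slave trade",
    "Teaching Amistad in an undergraduate film course",
    "Call for papers: the economics of the antebellum South",
]

# Inverted index: word -> set of message ids containing it.
index = defaultdict(set)
for mid, text in enumerate(messages):
    for word in text.lower().split():
        index[word].add(mid)

def search_and(*terms):
    """Return ids of messages containing ALL of the given terms."""
    sets = [index[t.lower()] for t in terms]
    return set.intersection(*sets) if sets else set()

print(search_and("amistad"))          # {0, 1}: broad terms return too much
print(search_and("amistad", "film"))  # {1}: narrow combinations return too little
```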

  8. Our Research • On Large Text Archives: Organization and Exploration • Develop and Test: New Techniques, New Tools • Interdisciplinary [Diagram: Large Text Archive → Information → Knowledge]

  9. H-Net and Scholarly Discourse in the Digital Age • Opportunity & Challenge: Searching Large Text Archives • New Approach: Semantic-Augmented Consensus Clustering • Application: H-Net Discussion Lists

  10. Two Approaches Very broadly, there are two approaches we could use to help a user find documents in a large set: • Classification • Clustering

  11. The Two-Spiral Problem Our little example: how to discriminate between two intertwined spirals. [Figure: the combined point set is the sum of the two spiral classes.]
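
For readers who want to reproduce the toy problem, here is a sketch that generates two intertwined spirals with NumPy; the parameter values are arbitrary choices for illustration, not taken from the talk.

```python
# Generate the classic two-spiral toy data set (parameters are illustrative).
import numpy as np

def two_spirals(n=200, noise=0.05, turns=2.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.25, turns * np.pi, n)               # angle along the spiral
    s1 = np.column_stack((t * np.cos(t), t * np.sin(t)))  # first spiral
    s2 = -s1                                              # second spiral, rotated 180 degrees
    X = np.vstack((s1, s2)) + rng.normal(scale=noise, size=(2 * n, 2))
    y = np.array([0] * n + [1] * n)                       # true class labels
    return X, y

X, y = two_spirals()
```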

  12. Classification Given k classes, find the best class in which to place a particular example. Typically two stages: • Train the algorithm on examples from the k classes • See how well the algorithm does at placing an unknown into the correct class
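
As a concrete illustration of the two stages, here is a sketch using scikit-learn's k-nearest-neighbors classifier on the spiral data generated above. k-NN is just one of many algorithms that could fill this role, not necessarily what the talk used.

```python
# Two-stage classification: train on labeled examples, test on unknowns.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X, y come from the two_spirals() sketch above.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)              # stage 1: learn from the k known classes
accuracy = clf.score(X_test, y_test)   # stage 2: place unknowns, check correctness
print(f"test accuracy: {accuracy:.2f}")
```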

  13. Classification Example [Diagram: labeled training examples from Class 1 and Class 2 train the algorithm; the trained algorithm then assigns an unknown test example to one of the two classes.]

  14. Supervised Classification is a supervised process. We know the k classes (or have a good idea of them), so we train the algorithm on labeled examples, then test how well it learned by presenting it with unknowns.

  15. Clustering Clustering is slightly different: given a set of examples, find the “best” partitioning of those examples into k sets. Also two stages: • Cluster the examples (we provide k) • Measure how well separated the resulting clusters are
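
A sketch of the same two stages with scikit-learn: partition with k-means (we supply k), then measure separation with the silhouette score. Note that plain k-means cannot untangle the intertwined spirals, which is part of what motivates the consensus approach later.

```python
# Two-stage clustering: partition into k sets, then measure separation.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is any (n_samples, n_features) array, e.g. from two_spirals() above.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))  # nearer 1 = better separated
```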

  16. Example Algorithm

  17. Unsupervised There is typically no training in clustering. We choose where to put a point based on some criterion of “closeness”, and as the spiral example shows, that can be hard to measure.

  18. Document Clustering Our approach is to cluster documents (instead of points in a spiral), grouping documents that are “close” to each other in meaning. The result should be sets of documents that have something in common, especially if the process is user-influenced. (A baseline notion of closeness is sketched below.)
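
For reference, the standard baseline notion of document closeness is cosine similarity of TF-IDF vectors, sketched below with invented snippets. Note that the first two snippets discuss the same topic yet share no content words, so TF-IDF scores them as unrelated; this gap is exactly what a semantic distance measure is meant to close.

```python
# Baseline document "closeness": cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # invented snippets for illustration
    "The drive to reform the Church frustrated medieval rulers.",
    "Kings sought control over the appointment of bishops and abbots.",
    "The film Amistad was discussed across several H-Net lists.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(cosine_similarity(tfidf).round(2))  # docs 0 and 1 score 0.0 despite related topics
```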

  19. Three general problems we will address: • Consensus clustering • A semantic distance measure • Semi-supervised user influence on the clustering process

  20. One: Consensus Clustering Two basic problems: • No single measure of “closeness” is usually sufficient to get good clusters; a combination of many such measures is needed • On large document sets, any single algorithm is likely to be expensive; run on smaller subsets of the overall set, it is much cheaper

  21. Example Simplest clustering algorithm ever invented! Draw a random line through the cluster space. One side is cluster 1, the other side is cluster 2. And the results… [figure]

  22. Um, so why? • The algorithm is cheap, very cheap! Draw a line through the “space”. Cheap is good when you are worried about large numbers. • It turns out that multiple applications, each poor on its own, give very good results when taken together in consensus! • Multiple “measures” can be accommodated this way.
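
Here is a minimal sketch of the random-line idea combined with consensus: each random line is a weak two-way clusterer, co-association over many lines yields a robust pairwise similarity, and a standard hierarchical step extracts the final clusters. This illustrates the general technique, not the talk's actual implementation.

```python
# Consensus clustering from many cheap random-line partitions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def random_line_labels(X, rng):
    """One weak clustering: which side of a random line each point falls on."""
    w = rng.normal(size=X.shape[1])          # random direction
    proj = X @ w
    b = rng.uniform(proj.min(), proj.max())  # random offset within the data
    return (proj > b).astype(float)

def consensus_cluster(X, k=2, n_lines=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for _ in range(n_lines):                 # each line alone is a poor clusterer
        lab = random_line_labels(X, rng)
        co += lab[:, None] == lab[None, :]   # count how often each pair co-clusters
    dist = 1.0 - co / n_lines                # consensus distance matrix
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

labels = consensus_cluster(X, k=2)           # X e.g. from two_spirals() above
```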

  23. Two: Semantic Distance One distance measure we would like to add to the consensus is semantic distance. How close semantically are two documents? How to do this cheaply?

  24. WordNet • Started by George Miller at Princeton (of “The Magical Number Seven, Plus or Minus Two”) in 1985, funded to study machine translation. • WordNet is much more than just a dictionary: it is an ontology (in CS, that means a data model) of English. • It includes relationships such as hypernym, hyponym, meronym, holonym, synonym, and antonym.

  25. Use WordNet to find semantic distance How close are “dog” and “cat”? dog has senses 1: domestic dog, 2: unattractive girl, 3: lucky man, 4: a cad, 5: hot dog, 6: hinged catch, 7: andiron; cat has senses 1: true cat, 2: guy, 3: spiteful woman, 4: tea, 5: whip, 6: truck, 7: lions, 8: tomography. The two meet in WordNet: dog (sense 1) has hypernym canine (sense 2: family Canidae), whose hypernym is carnivore (sense 1: meat eater); carnivore has hyponym feline (sense 1: felid), whose hyponym is cat (sense 1).
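
The dog/cat lookup can be reproduced with NLTK's WordNet interface, one standard way to query WordNet from Python (not necessarily the tooling behind the talk):

```python
# Querying WordNet for the dog/cat relationship via NLTK.
# Requires a one-time: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")   # sense 1: the domestic dog
cat = wn.synset("cat.n.01")   # sense 1: the true cat

print(dog.hypernyms())                   # includes canine
print(dog.lowest_common_hypernyms(cat))  # carnivore, where the two paths meet
print(dog.path_similarity(cat))          # similarity from hypernym path length
```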

  26. Semantic Relationship Graphs • Ultimately we will find graphs of “close word senses” and use them to represent a document (see the sketch below)
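
A sketch of what such a graph might look like in code, using networkx: nodes are word senses drawn from a document, and edges link pairs whose WordNet similarity clears a threshold. The word list and the 0.15 threshold are arbitrary illustrative choices.

```python
# Build a graph of "close word senses" to represent a document (illustrative).
import networkx as nx
from nltk.corpus import wordnet as wn

words = ["dog", "cat", "carnivore", "church", "bishop"]  # sample document vocabulary
senses = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]  # first noun sense of each

G = nx.Graph()
for i, s1 in enumerate(senses):
    for s2 in senses[i + 1:]:
        sim = s1.path_similarity(s2)
        if sim is not None and sim > 0.15:     # keep only semantically close pairs
            G.add_edge(s1.name(), s2.name(), weight=sim)

print(G.edges(data=True))  # the animal senses link up; distant senses stay apart
```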

  27. The Text Another problem was to make governments strong enough to prevent internal disorder. In pursuit of this goal, however, rulers were frustrated by one of the strongest movements of the eleventh and twelfth centuries: the drive to reform the Church. No government could operate without the participation of the clergy; members of the clergy were better educated, more competent as administrators, and usually more reliable than laymen. Understandably, the kings and the greater feudal lords wanted to control the appointment of bishops and abbots in order to create a corps of capable and loyal public servants. But the reformers wanted a church that was completely independent of secular power, a church that would instruct and admonish rulers rather than serve them. The resulting struggle lasted half a century, from 1075 to 1122. [6a]

  28. Three: User Interaction We want the user to be able to interact with the clustering process in a natural way (that is, without modifying the algorithm). We do this by allowing the user to establish relationships between documents: • must-link (these docs go together) • must-not-link (keep these docs apart)
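
To show how such constraints can steer assignment without the user touching the algorithm's internals, here is a sketch in the spirit of constrained (COP-style) k-means; the distance matrix and constraint lists are toy values invented for illustration.

```python
# Assign documents to the nearest cluster that respects user constraints.
import numpy as np

def violates(doc, cluster, assign, must_link, must_not_link):
    """True if putting `doc` in `cluster` breaks any user constraint."""
    for a, b in must_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in must_not_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and assign.get(other) == cluster:
            return True
    return False

def assign_with_constraints(dists, must_link, must_not_link):
    """Greedy constrained assignment: nearest non-violating cluster wins."""
    assign = {}
    for doc in range(len(dists)):
        for cluster in np.argsort(dists[doc]):   # try nearest cluster first
            if not violates(doc, int(cluster), assign, must_link, must_not_link):
                assign[doc] = int(cluster)
                break
    return assign

# dists[i][j] = distance from document i to cluster centroid j (toy numbers).
dists = np.array([[0.1, 0.9], [0.2, 0.8], [0.6, 0.4]])
# Without constraints doc 2 would join cluster 1; the must-link pulls it to 0.
print(assign_with_constraints(dists, must_link=[(0, 2)], must_not_link=[]))
```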

  29. Changing the Algorithm By changing the way documents cluster together, the user changes the algorithm (because the constraints he or she establishes must be respected across all the documents), but in a way they can understand.

  30. H-Net and Scholarly Discourse in the Digital Age • Opportunity & Challenge: Searching Large Text Archives • New Approach: Semantic-Augmented Consensus Clustering • Application: H-Net Discussion Lists

  31. H-Net • Humanities and Social Sciences OnLine • Pioneering Peer-Edited Discussion Lists • 160 Networks • 600+ Editors • 150,000 Participants • Global

  32. H-Net Archives • Scholarly Value • Current Users • Future Scholars • Scale • 1,000,000+ Messages • 3GB of Text

  33. Current Search Capabilities Search by: • Date • Author • Subject • Words in Text. What’s missing? • Multi-Thread • Multi-List • Cross-Temporal • Etc.

  34. Example in H-Net The movie Amistad was discussed across H-Net networks (History, Literature, Film, Teaching, Economics), from different perspectives, over time.

  35. Value to H-Net • Locate related content across time and across scholarly communities • Facilitate interdisciplinary scholarship and teaching • Synthesize new knowledge in new forms

  36. Unlocking the Potential of Scholarly Communication • Email and Forums: Popularity and Limitations • Adding depth and breadth while maintaining immediacy

  37. Value of Humanities Technology Research • A fundamental challenge in computer science • Humanities research: new insights, new connections • H-Net provides a testbed and testers • Truly interdisciplinary research

  38. H-Net and Scholarly Discourse in the Digital Age Contact Information: MATRIX: Center for the Humane Arts, Letters, and Social Sciences On-Line www.matrix.msu.edu
