Mining Academic Community

Presentation Transcript


  1. Mining Academic Community • Jan-Ming Ho • hoho@iis.sinica.edu.tw • Computer System and Communication Lab, Institute of Information Science, Academia Sinica

  2. What is Community? • In Graph Theory • densely connected groups of vertices, with sparser connection between groups • In Social Network Analysis • groups of entities that share similar properties or connect to each other via certain relations • A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked with different types of relations
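
The graph-theoretic definition on this slide can be illustrated with a small sketch. The toy graph, the group labels, and the helper name `edge_density` are made up for illustration and do not come from the presentation; the sketch only compares how densely vertices are connected inside a group versus between groups.

```python
from itertools import combinations

# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = {("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")}
groups = {"g1": {"a", "b", "c"}, "g2": {"d", "e", "f"}}


def has_edge(u, v):
    """Undirected edge lookup."""
    return (u, v) in edges or (v, u) in edges


def edge_density(nodes_a, nodes_b=None):
    """Fraction of possible vertex pairs that are actually connected.

    With one argument: pairs inside the group.
    With two arguments: pairs that cross between the two groups.
    """
    if nodes_b is None:
        pairs = list(combinations(sorted(nodes_a), 2))
    else:
        pairs = [(u, v) for u in sorted(nodes_a) for v in sorted(nodes_b)]
    return sum(has_edge(u, v) for u, v in pairs) / len(pairs)


intra_g1 = edge_density(groups["g1"])
intra_g2 = edge_density(groups["g2"])
inter = edge_density(groups["g1"], groups["g2"])
print(intra_g1, intra_g2, inter)
```

In this toy case each group is fully connected internally, while only one of the nine possible cross-group pairs is linked, which is the "dense inside, sparse between" pattern the slide describes.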

  3. Why is Community Important? • Interesting data with community structure • researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications • Groups of web pages that link to more pages inside the community than outside it correspond to web pages on related topics • Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

  4. Motivation • Understand the research network between authors, conferences and topics (rank entities by relevance for given entities) • Find and justifiably recommend research collaborators for given authors • Explore the academic social network • Find the most important papers, researchers and venues for a given topic

  5. Related Systems • Many digital library systems exist • ACM Digital Library • IEEE Xplore • DBLP • CiteSeer • Libra • DBconnect • Problems • Dataset coverage is not large enough • The name ambiguity problem exists in • Web pages • Citation records

  6. Libra Academic Search • http://libra.msra.cn • Free computer science bibliography search engine • A test-bed for object-level vertical search research • Currently the following types of paper-related objects can be searched: • Papers, Authors, Conferences, Journals, Research Communities

  7. DBconnect: Conference

  8. DBconnect: Topic

  9. DBconnect: Author

  10. ZoomInfo • (1) People directory • (2) Developer tools • (3) Social network, profile statistics, employment history • (4) Ability to distinguish ambiguous names? E.g., a search returns 21 different people named “Bing Liu”

  11. ArnetMiner

  12. Our goal • Develop an automatic system to • Explore the academic social network • Find the most important papers, researchers and venues for a given topic • Provide solutions for existing problems • Collecting larger citation datasets • Retrieving data from web pages • Publication list finder • Extracting citation strings from web pages • Citation parser • Multilingual data sources • Chinese and English corpora • Name disambiguation mechanism in • Web pages • Citation records

  13. Our contributions • Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006 • Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007 • Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007 • Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov 16-17, 2007 • Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," to appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08)

  14. PLF: A Publication List Web Page Finder for Researchers

  15. Agenda • Introduction • Publication List Web Page Finder (PLF) • Performance Evaluation • Conclusion and Future Work

  16. Overview of a Publication List Web Page • Helps readers keep abreast of state-of-the-art research • Contains citations not found elsewhere • May provide reference materials, such as slides and talks • Challenges • How to find publication list web pages given only a name • Various versions or multiple copies • An author may have many affiliations • Name ambiguity problem • E.g., for Dr. Bing Liu, a query to ZoomInfo (a people search engine) showed that 26 people share the same name

  17. Problem “Publication List Web Page?”

  18. Definition of Publication List • Affiliated Personal Publication List Web Page (APPL): a web page that belongs to the affiliated web site of a specific person with the given name • Figure labels: [Affiliation] Institute of Information Science, Academia Sinica; citation string

  19. Agenda • Introduction • Publication List Web Page Finder (PLF) • Performance Evaluation • Conclusion and Future Work

  20. Process Flow

  21. Basic Concept A publication list web page may contain many citation strings
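
The basic concept on this slide suggests scoring a candidate page by how many citation-like strings it contains. The sketch below is only one way to approximate that idea: the three regular-expression cues, the "at least two cues" rule, and the names `looks_like_citation` and `citation_count` are assumptions for illustration, not the PLF system's actual features.

```python
import re

# Hypothetical cues that a line of text is a citation string.
YEAR = re.compile(r"\b(19|20)\d{2}\b")          # a four-digit year
PAGES = re.compile(r"\bpp?\.\s*\d+")            # "p. 12" or "pp. 1-8"
VENUE = re.compile(r"\b(Proceedings|Conference|Journal|Transactions)\b",
                   re.IGNORECASE)


def looks_like_citation(line):
    """Assumed heuristic: at least two of the three cues are present."""
    cues = [YEAR.search(line), PAGES.search(line), VENUE.search(line)]
    return sum(cue is not None for cue in cues) >= 2


def citation_count(page_text):
    """Number of lines in the page text that look like citation strings."""
    return sum(looks_like_citation(line) for line in page_text.splitlines())


# Made-up page text with two citation-like lines.
sample_page = (
    "Home | Research | Contact\n"
    'A. Author and B. Writer, "A Made-Up Paper," in Proceedings of Some Workshop, 2006, pp. 1-8.\n'
    'C. Scholar, "Another Made-Up Paper," Journal of Examples, vol. 3, 2007.\n'
    "Last updated 2008\n"
)
print(citation_count(sample_page))
```

A page with a high count would be a stronger candidate for a publication list page than one where only a footer line happens to mention a year.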

  22. Agenda • Introduction • Publication List Web Page Finder (PLF) • Performance Evaluation • Conclusion and Future Work

  23. Dataset • Scenario • Seminar members have usually published major research works • We randomly collected 200 names from the WWW ’06 Conference Committee website

  24. Experiment Evaluation • Evaluation metrics • We consider the top-5 results derived by each link and focus on the top-5 recall metric, which is calculated by: [formula not captured in this transcript]
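
One plausible reading of top-5 recall, assumed here rather than taken from the slide, is the fraction of researcher names for which at least one correct publication list page appears among the first five results. The function name `top_k_recall` and all sample data below are hypothetical.

```python
def top_k_recall(ranked_results, relevant, k=5):
    """Fraction of names with at least one relevant URL in the top k results.

    This definition is an assumption; the presentation's own formula
    is not available in the transcript.
    """
    hits = 0
    for name, ranking in ranked_results.items():
        if any(url in relevant.get(name, set()) for url in ranking[:k]):
            hits += 1
    return hits / len(ranked_results)


# Made-up rankings and ground truth for four names.
ranked = {
    "name1": ["u1", "u2", "u3"],
    "name2": ["v1", "v2", "v3", "v4", "v5", "v6"],
    "name3": ["w1"],
    "name4": ["x1", "x2"],
}
truth = {
    "name1": {"u2"},   # found at rank 2
    "name2": {"v6"},   # found only at rank 6, outside the top 5
    "name3": {"w1"},   # found at rank 1
    "name4": {"x9"},   # never returned
}
print(top_k_recall(ranked, truth))
```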

  25. Parameter Analysis for Single-Group • Figure (a): fixed n with different scales of m • When m increases, the recall rate also increases. • Figure (b): fixed m with different scales of n • System performance may be constrained by m.

  26. Parameter Analysis for Multi-Group • Figure (a): fixed n with different scales of m • It is clear that the performance when m = 40 is always better than the other settings. • Figure (b): fixed m with different scales of n • The best performance (top-5 recall is 70%) occurs when n = 75.

  27. Performance Evaluations (given name + keyword) • Figure (a): performance of the approaches on the single-group dataset • Figure (b): performance of the approaches on the multi-group dataset • The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance. • The parameter n has little influence on the system’s performance. • The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.

  28. Conclusion • We have defined the problem of finding the publication list web pages of a researcher and proposed the “PLF” system • Ongoing work • Name ambiguity problem • How to merge the multiple publication list web pages for a specific person into a single page

  29. Discussion – Name Ambiguity Problem • Scenario • We take the name “Bing Liu” as an example • Analyze manually • Observation • Citation count • Name translation problem • Partial matching problem

  30. Extracting Citation Strings from Web Pages

  31. Extract Citation Records • Figure: Web Page → Extract → Structured Data

  32. Challenges • The formats of publication list web pages vary • There are no fixed syntactic rules for parsing citation records • Hence, we cannot apply simple rules to extract citation records automatically

  33. Challenges: Complex Layouts of Publication List Pages

  34. Ideas • The semantic structure of web pages is organized by visual arrangement. • We can use the semi-structured (visual) information of web pages to help the extraction task. • With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing web pages, but also very helpful for visual pattern analysis.

  35. DOM Tree Representation of a Web Page

  36. Architecture of Citation Extraction System

  37. Modules of Citation Extraction System • Common Style Finder • find all common style patterns for each level of granularity in web pages • Citation Extractor • explore data regions with common style patterns • distill extraction rules from those data regions • rank extraction patterns based on a normal word count distribution probability
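
As a rough illustration of what a "common style pattern" could look like, the sketch below counts repeated root-to-element tag paths in a page's DOM and keeps the ones that occur several times, since repeated structures often mark a list of records. Treating a tag path as the style pattern, the threshold `min_repeats`, and the names `TagPathCounter` and `common_style_paths` are all assumptions; the presentation does not describe the Common Style Finder at this level of detail.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def common_style_paths(html, min_repeats=3):
    """Tag paths that repeat at least min_repeats times (assumed threshold)."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return {path: n for path, n in parser.paths.items() if n >= min_repeats}


# Made-up page: three list items share the same structure.
sample_html = (
    "<html><body><h1>Pubs</h1><ul>"
    "<li><b>T1</b> x</li><li><b>T2</b> y</li><li><b>T3</b> z</li>"
    "</ul><p>Contact</p></body></html>"
)
print(common_style_paths(sample_html))
```

The repeated `li` path marks a candidate data region; a real system would also need the visual and geometric cues mentioned on the earlier slides, which this sketch ignores.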

  38. BibPro: A Citation Parser based on Sequence Alignment Techniques

  39. System Goal

  40. Basic Idea (1/2) • Encode a citation as a protein sequence • Keep only the citation style information • order of fields • field separators • Figure: citation fields (Author, Title, Journal, …, year, page, …) mapped to the protein sequence A D T D L D Y R P H S

  41. Basic Idea (2/2) • Determine the citation style from the order of punctuation marks and reserved words

  42. How to encode a citation as a protein sequence? • Keep the citation style information • Which fields should be included? (only 23 symbols can be used) • Which punctuation marks are used to separate fields? • By observing different citation styles, we define an encode table to translate each token of a citation to an amino acid symbol
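
The encoding step can be sketched as a token-to-symbol lookup. The table below is hypothetical: the real encode table (slide 43) is not in this transcript, so the symbol assignments, the tokenizer, and the names `encode_token` and `encode_citation` are assumptions. The sketch keeps only style information (punctuation, reserved words, and token types) and collapses runs of the same symbol, so two citations written in the same style map to the same sequence.

```python
import re
from itertools import groupby

# Hypothetical encode table; NOT the table used by BibPro.
PUNCT = {",": "D", ".": "E", '"': "Q", ":": "C", "(": "G", ")": "H"}
RESERVED = {"pp": "P", "pages": "P", "vol": "V", "no": "N",
            "in": "I", "proceedings": "R"}

# A token is a word, a number, or a single punctuation character.
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def encode_token(token):
    """Map one token to a single amino-acid-style letter."""
    if token in PUNCT:
        return PUNCT[token]
    if token.lower() in RESERVED:
        return RESERVED[token.lower()]
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "Y"          # looks like a year
    if token.isdigit():
        return "M"          # any other number
    if token.isalpha():
        return "W"          # ordinary word
    return "X"              # any other symbol


def encode_citation(citation):
    """Encode a citation string, collapsing runs of identical symbols."""
    symbols = [encode_token(t) for t in TOKEN.findall(citation)]
    return "".join(sym for sym, _ in groupby(symbols))


# Two made-up citations in the same style.
c1 = 'A. Author, "Some Title," in Proceedings of Venue, 2007, pp. 1-8.'
c2 = 'B. Writer, "Another Longer Title Here," in Proceedings of Place, 1999, pp. 10-20.'
print(encode_citation(c1))
print(encode_citation(c1) == encode_citation(c2))
```

Collapsing repeated symbols is a design choice of this sketch: it makes the encoding insensitive to title length, which is what lets the two sample citations share one style sequence.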

  43. Encode Table

  44. How to use protein sequences to extract metadata? • Transform the extraction problem into a sequence alignment problem • Form translation • Unknown Answer • BASE FORM • ALIGN FORM • INDEX FORM • Known Answer • RESULT FORM • STYLE FORM • INDEX FORM
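
To make the "extraction as sequence alignment" idea concrete, the sketch below scores an encoded citation against a few encoded templates and picks the closest one. It uses a textbook Needleman-Wunsch global alignment score as a stand-in; the transcript does not say which alignment method or scoring scheme BibPro uses, and the template strings and names `alignment_score` and `best_template` are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two symbol strings.

    Only the score is computed (no traceback), using two rows of the
    dynamic-programming table.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_template(encoded, templates):
    """Name of the template whose sequence aligns best with the input."""
    return max(templates,
               key=lambda name: alignment_score(encoded, templates[name]))


# Made-up style templates (known answers) and one unknown citation.
templates = {"style_a": "WDQWQDRDY", "style_b": "WEYEWEV"}
unknown = "WDQWQDRY"
print(best_template(unknown, templates))
```

Once the closest known-answer template is found, its field positions could be mapped back onto the unknown citation through the alignment; that traceback step is omitted here.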

  45. RESULT FORM (Known Answer)

  46. BASE FORM (Unknown Answer)

  47. System Structure • System PreProcess (Template Generating System) • Citation Crawler • Template Builder • Online Parsing (Parsing System) • Template Matching • Metadata Extraction

  48. Citation Crawler
