CoBib a collaborative bibliographic database Beth Trushkowsky
Objectives • Create a web repository of research papers • Allow users to obtain literature reviews of particular subjects • Allow users to group and download citations • Suggest citations that are • recommended explicitly • currently “hot” (implicit) • Limit scope of community to research groups • Use open standards for interoperability
Presentation Outline • Motivation • Methodology • Related work • Architecture • Design • Tagging • Recommending • Reference Reconciliation • String-distance • K-means clustering • DPMM • Reflection and Other Ideas • User Studies
Motivation • Example: someone new to database systems wants to learn about prominent ideas and people in that area • Many resources online to locate research papers • Missing the interaction between users • Academic peers have the potential to help one another find relevant information • Want to promote easy exchange of ideas
Methodology • Database functions • Searching: a direct query • Browsing: an indirect survey • Information integration • Merge users’ personal repositories • Improve utility of citations for community • Web 2.0 • User involvement and interaction • Immediate results via asynchronous processing
Related Work Also: CiteULike, Bibsonomy, etc…
Related Work • Rexa • Similar project at UMass Amherst • Introduces tagging of citations for personal bookmarking • Tackles object coreference problems • How CoBib is different • Users share tags • Users contribute to system’s integrity via corrections • Citation recommendation • Small academic community
Design: Tagging • Folksonomy: user generated taxonomy for classifying web content with tags • Self-bookmarking is an obvious application • Social tagging systems connect users’ tagging activities to rest of the community • Increases success in citation browsing • Vocabulary problem • Want vocabulary to be consistent • Ex: databases and database systems • Improve consistency with suggestions
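One way to picture "improve consistency with suggestions" is to show the user existing tags that resemble what they are typing. The sketch below is only an illustration, not CoBib's implementation: the function name `suggest_tags`, the tag list, and the 0.6 cutoff are all assumptions, and Python's standard `difflib` stands in for whatever matching the system actually uses.

```python
import difflib

def suggest_tags(typed, existing_tags, limit=3):
    """Return existing tags that resemble what the user has typed so far.

    Hypothetical helper: prefix matches come first, then fuzzy matches.
    """
    typed = typed.strip().lower()
    # Prefix matches are what a completion box would normally show.
    prefix = [t for t in existing_tags if t.startswith(typed)]
    # Fuzzy matches catch near-variants such as plural forms.
    fuzzy = difflib.get_close_matches(typed, existing_tags, n=limit, cutoff=0.6)
    ordered = []
    for tag in prefix + fuzzy:
        if tag not in ordered:
            ordered.append(tag)
    return ordered[:limit]

# Made-up community vocabulary, for illustration only.
tags = ["database systems", "machine learning", "databases", "clustering"]
suggestions = suggest_tags("database", tags)
```

Offering both "database systems" and "databases" at typing time nudges users toward a tag that already exists instead of coining a third variant.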
Design: Recommendation • Explicit • Recommend citation or citation groups to particular individuals or research areas • Share bibliography for particular project • Implicit: questions to answer • Track user interaction: how do citations viewed indicate interests? • How can the community’s actions be used to recommend new citations to particular users? • Collaborative filtering • Try to find a neighborhood metric • Have advantage of direct input from users about interests
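A neighborhood metric for collaborative filtering could look roughly like the following. This is a minimal sketch under stated assumptions: the slide does not say which metric was chosen, so Jaccard overlap of viewed citations is used as a stand-in, and the user names, citation ids, and the `recommend` function are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of citation ids, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, views, k=2):
    """Recommend citations viewed by the k users most similar to `target`."""
    mine = views[target]
    neighbours = sorted(
        (u for u in views if u != target),
        key=lambda u: jaccard(mine, views[u]),
        reverse=True,
    )[:k]
    scores = {}
    for u in neighbours:
        weight = jaccard(mine, views[u])
        # Only citations the target has not seen yet are candidates.
        for citation in views[u] - mine:
            scores[citation] = scores.get(citation, 0.0) + weight
    # Highest score first; ties broken by citation id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))

# Made-up viewing history: user -> set of citation ids.
views = {
    "ann": {"c1", "c2", "c3"},
    "bob": {"c1", "c2", "c4"},
    "cat": {"c5", "c6"},
    "dan": {"c2", "c3", "c7"},
}
recs = recommend("ann", views)
```

Direct input from users (tags, stated research areas) could replace or supplement the viewing sets used here.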
Reference Reconciliation • Multiple citations, single entity • Aha, D. W. & Kibler D. Noise-Tolerant Instance-Based Learning Algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. • D. Aha and D. Kibler. Noise-tolerant instance-based learning algorithms. In Proceedings of IJCAI-89. • Need single thread of discussion
String Distance Metrics • String matching algorithms: Cohen et al. • Edit-distance methods • Ex: F O R B E S vs. F O _ V E S • Token-based methods • Ex: “hands on labs without computers” vs. “use computers with your hands” • Hybrid methods
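The two families of metrics can be illustrated on the slide's own examples. This is a sketch, not the library used in the project: Levenshtein distance represents the edit-distance family and Jaccard overlap of word sets represents the token-based family. It assumes the underscore in the slide marks an alignment gap, so the second string is taken to be "FOVES".

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def token_jaccard(a, b):
    """Share of distinct words the two strings have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

edit = levenshtein("FORBES", "FOVES")  # one deletion plus one substitution
overlap = token_jaccard("hands on labs without computers",
                        "use computers with your hands")
```

Edit distance is sensitive to character order, while the token measure ignores word order entirely, which is why the second pair still overlaps on "hands" and "computers". Hybrid methods combine the two, for example by comparing tokens with an edit-based measure.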
K-Means Clustering • Represent set as mixture model • X is set of observed citations • Z is latent, true citation entity • Group into k clusters using string distance • Want to discern number k • Allow number to change while processing • Use McCallum’s canopies to avoid too many pair-wise operations • Complete pass to generate canopies • Compute distance only to points in same canopy
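The canopy step can be sketched as follows, assuming a cheap token-overlap distance and two thresholds (a loose one for membership, a tight one for removing points from further consideration). The threshold values, the sample strings, and the function names are illustrative assumptions, not the settings used in CoBib.

```python
from itertools import combinations

def token_distance(a, b):
    """Cheap distance: 1 minus the Jaccard overlap of the word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 - (len(ta & tb) / len(union) if union else 1.0)

def make_canopies(items, loose, tight, distance=token_distance):
    """One pass over the data; returns canopies as lists of item indices."""
    assert loose >= tight
    remaining = list(range(len(items)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Every item within the loose threshold joins this canopy.
        canopy = [i for i in range(len(items))
                  if distance(items[center], items[i]) <= loose]
        canopies.append(canopy)
        # Items within the tight threshold can no longer become centers.
        remaining = [i for i in remaining
                     if distance(items[center], items[i]) > tight]
    return canopies

def candidate_pairs(canopies):
    """Only pairs sharing a canopy need the expensive distance."""
    return {pair for c in canopies for pair in combinations(sorted(c), 2)}

# Simplified, lowercased citation strings for illustration.
citations = [
    "aha kibler noise-tolerant instance-based learning algorithms ijcai",
    "d aha d kibler noise-tolerant instance-based learning algorithms",
    "cohen string distance metrics name matching",
    "comparison of string distance metrics for name matching tasks",
]
canopies = make_canopies(citations, loose=0.6, tight=0.5)
pairs = candidate_pairs(canopies)
```

With four citations there are six possible pairs, but only the two pairs that share a canopy would be passed to the more expensive string distance.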
Dirichlet Process Mixture Model • Chinese restaurant process • Represent each citation as vector of counts on vocabulary • Word distribution ~ Dirichlet prior over vocabulary of size v • Hyperparameters (α1, α2, …, αv) and α0 • Cluster weights ~ “stick-breaking” Dirichlet prior over number of clusters • Gibbs sampling • Choose a citation • Compute probability that it belongs to each cluster • Nonzero probability of creating new cluster
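A single Gibbs step under the Chinese restaurant process can be sketched like this. It is a simplified illustration: the concentration parameter is called `alpha` here (an assumed name), and the per-cluster likelihoods are passed in as plain numbers rather than computed from a Dirichlet-multinomial model over word counts.

```python
import random

def crp_prior(cluster_sizes, alpha):
    """Prior over existing clusters plus one new cluster (last entry)."""
    total = sum(cluster_sizes) + alpha
    return [n / total for n in cluster_sizes] + [alpha / total]

def assignment_probs(cluster_sizes, likelihoods, alpha):
    """Posterior over clusters for one citation.

    `likelihoods` has one entry per existing cluster and a final entry
    for a brand-new cluster; how they are computed is left open here.
    """
    prior = crp_prior(cluster_sizes, alpha)
    weights = [p * l for p, l in zip(prior, likelihoods)]
    norm = sum(weights)
    return [w / norm for w in weights]

def sample_cluster(probs, rng):
    """Draw a cluster index; the last index means 'create a new cluster'."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# The chosen citation has been removed from its cluster; sizes are the rest.
sizes = [3, 1]
prior = crp_prior(sizes, alpha=1.0)
probs = assignment_probs(sizes, likelihoods=[0.1, 0.9, 0.5], alpha=1.0)
choice = sample_cluster(probs, random.Random(0))
```

Because the final entry of the prior is always positive when `alpha` is positive, the sampler can open a new cluster at any step, so the number of clusters does not have to be fixed in advance.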
Reflection and Other Ideas • Both methods performed poorly • Information loss • String-distance: between groups of citations • Bag-of-words: between citation fields • Other ideas • Explore different feature spaces of citations • Active learning • User feedback for reconciling citations • Order of reconciliation tasks • Determine users most capable of reconciling
User Studies • Currently in data-gathering stage • Want to measure effectiveness of CoBib • Did users find new citations in their research areas of interest? • Were users more likely to interact with citations in their areas? • How often did searching turn into browsing? • Which types of recommendations were more relevant?
Thank you! Suggestions?
Design Principles • Adhere to information standards • Z39.50 protocol from Library of Congress • Future ability to query other university databases • Established profiles for searching bibliographic data • Metadata Object Description Schema (MODS) • BibTeX and EndNote • Portable Document Format for Archives (PDF/A) • CoBib’s documents will be self-contained
User Interface Issues • User-Centered Design • Give attention to needs and wants of end user during the design process • Decrease processing time • Increase productivity • Use AJAX to give users immediate feedback on possible word completions • Java paper submission • Making tasks easier increases likelihood that users will perform them… thus more data • Give a lot, take a little: users may be more inclined to help with citation modifying if the rest of the process is very easy
Object Coreference • What we’ve done • Take advantage of structured XML data: treat different types of data differently • Author fields: Soft Jaro-Winkler • Title fields: Soft Monge-Elkan • What needs to be improved • Most effective weight for respective fields • Appropriate thresholds for the above algorithms
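The open questions on this slide (field weights and thresholds) are easier to see in code. In the sketch below, Python's `difflib.SequenceMatcher` is only a stand-in for the per-field metrics named on the slide, and the weights, the 0.6 threshold, and the third record are made-up values for illustration.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Stand-in string similarity in [0, 1]; not Jaro-Winkler or Monge-Elkan."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def citation_similarity(c1, c2, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(c1[f], c2[f])
               for f, w in weights.items()) / total

def same_entity(c1, c2, weights, threshold):
    """Treat two citations as coreferent when the score reaches the threshold."""
    return citation_similarity(c1, c2, weights) >= threshold

# Hypothetical weights: the slide lists the best values as an open problem.
weights = {"author": 0.4, "title": 0.6}

a = {"author": "Aha, D. W. & Kibler D.",
     "title": "Noise-Tolerant Instance-Based Learning Algorithms"}
b = {"author": "D. Aha and D. Kibler",
     "title": "Noise-tolerant instance-based learning algorithms"}
c = {"author": "J. Smith",  # made-up unrelated record
     "title": "An unrelated paper about query optimization"}

score_ab = citation_similarity(a, b, weights)
score_ac = citation_similarity(a, c, weights)
```

Swapping in a different metric per field only requires changing `field_similarity` to dispatch on the field name; tuning then reduces to choosing `weights` and `threshold`.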
Storing Citations with Zebra • Zebra • Structured text indexing and retrieval engine • Supports various portable file formats • Scalable to large datasets • Automates indexing • Zebra uses Z39.50 protocol • Client-server protocol for retrieving information from remote databases • Syntax of the client’s search query is independent of the server’s database structure • Future possibility to incorporate other libraries’ data into CoBib
Searching Citations with Zebra • Attribute sets • “Characteristics of a query” • Maps constraints to numeric values • Attribute types in the Bib-1 attribute set • Use: access point in search query (e.g., author) • Relation: relationship of search terms to those in database (less than, greater than, etc.) • Position: search terms’ position in field • Structure: format of the query (word, numeric, etc.) • Truncation: where in query search terms can match • Completeness: how “full” is the query in its field • Example • @attr 1=1003 @attr 3=3 @attr 5=1 forbes
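The example query reads as: Use attribute 1003 (author), Position 3 (any position in field), Truncation 1 (right truncation), search term "forbes". A small helper that assembles such prefix-notation strings from named attributes might look like the sketch below; the helper `pqf` and its keyword names are hypothetical and not part of Zebra or any Z39.50 toolkit.

```python
# Bib-1 attribute types, numbered in the order the slide lists them.
BIB1_TYPES = {
    "use": 1,
    "relation": 2,
    "position": 3,
    "structure": 4,
    "truncation": 5,
    "completeness": 6,
}

def pqf(term, **attrs):
    """Build a query string such as '@attr 1=1003 ... term'.

    Hypothetical helper: keyword names map to Bib-1 attribute types.
    """
    parts = []
    # Emit attributes in numeric type order for a stable result.
    for name, value in sorted(attrs.items(), key=lambda kv: BIB1_TYPES[kv[0]]):
        parts.append(f"@attr {BIB1_TYPES[name]}={value}")
    # Multi-word terms are quoted so they stay a single term.
    if " " in term:
        term = f'"{term}"'
    return " ".join(parts + [term])

query = pqf("forbes", use=1003, position=3, truncation=1)
```

Because the attributes describe the query rather than the storage layout, the same string could be sent to any server that supports the Bib-1 attribute set.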
Storing Other Information • Relational database captures the connections between users and citations • MySQL database • Most tables describe what user x did with citation y • Ex. Tags, UserActions, Recommends
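The "user x did something with citation y" shape of these tables can be sketched as follows. Only the table names come from the slide; every column name is an assumption, and an in-memory SQLite database stands in for MySQL so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
    -- Column names below are hypothetical.
    CREATE TABLE Tags        (user_id INTEGER, citation_id TEXT, tag TEXT);
    CREATE TABLE UserActions (user_id INTEGER, citation_id TEXT, action TEXT);
    CREATE TABLE Recommends  (from_user INTEGER, to_user INTEGER, citation_id TEXT);
""")

# Made-up rows: three users tagging two citations.
conn.executemany(
    "INSERT INTO Tags VALUES (?, ?, ?)",
    [(1, "c1", "databases"),
     (2, "c1", "databases"),
     (2, "c1", "clustering"),
     (3, "c2", "databases")],
)

# Which tags has the community applied to citation c1, and by how many users?
rows = conn.execute(
    """SELECT tag, COUNT(DISTINCT user_id) AS n
       FROM Tags
       WHERE citation_id = ?
       GROUP BY tag
       ORDER BY n DESC, tag""",
    ("c1",),
).fetchall()
```

The citation records themselves would stay in Zebra; the relational side only needs the citation identifier to join user activity back to them.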
Object Coreference • Reference Reconciliation: Dong et al. • Partition into sets of classes with atomic and association attributes • Goal: each instance of a class will represent a single real-world entity and vice versa • Objectives • Use associations between references to assist reconciliation decisions • Propagate reconciliations for different pairs of references • Enrich references after reconciliation • Construct dependency graph • Node represents similarity between two references • Edge represents dependency between similarities
Object Coreference • Identity uncertainty: Pasula et al. • Proposes a declarative approach using a formal language • Relational Probability Models (RPMs) • Sets of classes, named instances, attributes, conditional probability models, and instance statements • For each citation, construct Bayesian network • Use Markov chain Monte Carlo to make approximate inferences • Results: doesn’t scale well • Use cheap distance metric to make canopies