CS511: Entity Search and Visualization Nikolay Kojuharov Lauren Massa-Lochridge Hoa Nguyen Quoc Le
Roadmap • Motivation • Problem Definition • Architecture • Implementation • Demo • Conclusion
Background • DBLife: • Back-end: Given a set of documents (researchers' home pages, conference websites, mailing lists), identify entities and relationships and monitor them over time. • Front-end: Search for entities, show relationships, and display useful findings. • DBLife needs improvements: • Automatic source discovery process • Better mention extraction • Inferring more metadata and relationships • Front-end improvement (search functionality, display).
Benefits • Front-end improvement: • Testing, evaluation. • Immediate use. • User feedback. • Goal: • Search: Given an ER graph as the logical view, search for entities using keyword search. • Make navigation easier and clearer through a novel way of displaying entities and relationships.
Search Problems • Cannot find partial matches, e.g. no results for “AnHai”. • Cannot search for entities in context, e.g. no results for “data integration”. • Cannot use more advanced queries, e.g. boolean, regular expressions, proximity, approximate match, etc. • Returns multiple results for the same mention. • …
Search Solution • Integrated entity and text search • Contextual entity search over ER model • Advanced query syntax • Example: search for “name:Fu data integration” • Suggested authors: 100% LiMin Fu; 59% Jack Fu; 59% Guangrui Fu • Suggested web pages: 85% http://anhai.cs.uiuc.edu\home\projects\aida.html; 22% www.cs.wisc.edu\~chenl\Stream_Data_Processing.html; 21% http://anhai.cs.uiuc.edu\home\thesis.html
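The example query above mixes a field-scoped term (name:Fu) with plain keywords. A minimal sketch of how such a query could be split into terms is shown here; the class FieldedQuerySketch and its Term type are hypothetical names used for illustration, not the project's code or Lucene's parser, and the sketch assumes whitespace tokenization with no quoted phrases or operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the project or of Lucene: splits a query
// such as  name:Fu data integration  into field-scoped and default terms.
final class FieldedQuerySketch {
    /** One query term; field is null when the term applies to all fields. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return (field == null ? "*" : field) + ":" + text;
        }
    }

    // Simplification (assumption): whitespace tokenization only,
    // no quoted phrases, boolean operators, or grouping.
    static List<Term> parse(String query) {
        List<Term> terms = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                // "field:text" restricts the term to one field.
                terms.add(new Term(token.substring(0, colon),
                                   token.substring(colon + 1).toLowerCase()));
            } else {
                // A bare keyword is searched across all fields.
                terms.add(new Term(null, token.toLowerCase()));
            }
        }
        return terms;
    }
}
```

Calling FieldedQuerySketch.parse("name:Fu data integration") yields one term restricted to the name field and two terms that apply to every field, which is the split the integrated entity and text search needs before ranking.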
Problems (cont.) • User Interface • Relationships are unclear. • Relationship types and weights are not shown. • Browsing is inconvenient. • Solution: ER-Visualization • Display entities and relationships in 2-D. • Easy to navigate and keep focus.
Problem Definition • Entity-focused search: integrated, ranked keyword search over entities and web pages. • ER visualization: 2-D graph-style interface with entities as nodes and relationships as edges.
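The node-and-edge view described above can be sketched as a small adjacency structure. This is an illustration only: ErGraphSketch and its members are hypothetical names, not the project's EntityGraph, and the type and weight fields on each edge are assumed from the earlier remark that relationship types and weights should be shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names are hypothetical, not the project's EntityGraph.
final class ErGraphSketch {
    /** A relationship between two entities, carrying a type and a weight. */
    static final class Edge {
        final String source;
        final String target;
        final String type;
        final double weight;

        Edge(String source, String target, String type, double weight) {
            this.source = source;
            this.target = target;
            this.type = type;
            this.weight = weight;
        }
    }

    // entity name -> relationships that start at that entity
    private final Map<String, List<Edge>> edgesByEntity = new LinkedHashMap<>();

    void addEntity(String name) {
        edgesByEntity.computeIfAbsent(name, k -> new ArrayList<>());
    }

    void addRelationship(String a, String b, String type, double weight) {
        addEntity(a);
        addEntity(b);
        // Store the edge in both directions so either endpoint can be the focus.
        edgesByEntity.get(a).add(new Edge(a, b, type, weight));
        edgesByEntity.get(b).add(new Edge(b, a, type, weight));
    }

    List<Edge> relationshipsOf(String entity) {
        return edgesByEntity.getOrDefault(entity, new ArrayList<>());
    }

    int entityCount() {
        return edgesByEntity.size();
    }
}
```

Because each entity keeps a list of edges, two entities may be linked by several relationships of different types, each with its own weight.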
Architecture • [Architecture diagram; component labels: Indexing, Document Index, Entities, Relationship, External Sources, Search, GUI, Visualization]
Implementation • XmlDBLPReader, XmlEntityReader: DBLP, DBLife XML → ER Model • EntityGraph: ER Model → Graph • AuthorIndexer, TextIndexer: ER Model → index files • AuthorSearch, TextSearch: index files → search results • SearchGUI: UI, graph layout, etc.
Indexing • Two inverted file indexes – for documents and for entities • Document index – title*, URL, contents, etc. • Author index – name*, publications, co-authors * Highest importance
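The idea of an inverted index in which some fields count more than others can be sketched in plain Java. This is not how Lucene stores or scores documents; BoostedIndexSketch is a hypothetical class, the boost values passed by the caller are assumptions, and the score is a simple sum of boosts over matching term occurrences.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a field-boosted inverted index (not Lucene's scoring).
final class BoostedIndexSketch {
    // term -> (document id -> accumulated weight of that term in the document)
    private final Map<String, Map<String, Double>> postings = new HashMap<>();

    /** Indexes one field of a document; boost is an assumed per-field weight. */
    void addField(String docId, String text, double boost) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new HashMap<>())
                    .merge(docId, boost, Double::sum);
        }
    }

    /** Returns ids of documents containing any query term, best score first. */
    List<String> search(String query) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("\\W+")) {
            Map<String, Double> hits = postings.get(term);
            if (hits == null) {
                continue;
            }
            hits.forEach((doc, weight) -> scores.merge(doc, weight, Double::sum));
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Sort by descending score; break ties by document id for stable output.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }
}
```

Indexing a title with a boost of 2.0 and the page contents with 1.0, for example, makes a title match outrank the same number of matches in the body, which mirrors the "highest importance" marking on title and name.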
Query Parsing & Analysis • Users already familiar with keyword search are not required to learn special syntax or commands. • Queries are analyzed using StandardAnalyzer(). • Lucene offers a variety of query parsers; the AuthorSearch demo queries across all fields in an index: • Query query = MultiFieldQueryParser.parse(queryString, AuthorIndexer.FIELDS, analyzer);
Query Syntax • Boolean operators over terms or phrases: “data integration” OR “schemas” • Field-specific search: title:homepage AND “semantic web” • Wildcard search: dat? • Fuzzy search: roam~0.8 • Proximity search: “object relational”~10 • Range search: date:[anhai TO john] • Term boosting: professor^4 AND mining • Grouping: (anhai OR “an hai”) AND Doan
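To make the wildcard and fuzzy operators above concrete, here is a rough stand-alone sketch of what each one tests. TermMatchSketch is a hypothetical class; the wildcard translation follows the usual meaning of ? (one character) and * (any run of characters), and the similarity normalization (one minus edit distance over the longer length) is an assumption rather than Lucene's exact formula.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of wildcard and fuzzy term matching (not Lucene code).
final class TermMatchSketch {
    /** Translates a wildcard term (? = one character, * = any run) into a regex. */
    static Pattern wildcardToPattern(String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every literal so regex metacharacters are matched as-is.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean wildcardMatches(String term, String candidate) {
        return wildcardToPattern(term).matcher(candidate).matches();
    }

    /** Classic Levenshtein edit distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; the normalization is an assumption, not Lucene's formula. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }
}
```

Under this sketch a fuzzy query such as roam~0.8 would accept a candidate only when similarity("roam", candidate) is at least 0.8; a real engine may normalize differently and so accept a different set of terms.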
Visualization • Built in Java using the Java2D graphics library. • Graph data structures: Node, Edge, TreeNode, Graph, Tree. • Loading and saving graph data: GraphReader and GraphWriter interfaces, XMLGraphReader, etc. • Filtering: select the parts of the graph to display. • ItemRegistry: mapping between VisualItems and the original graph data. • Display: renders VisualItems to the screen, with graphic transforms, animation, and activity listeners.
Tools • The system is implemented in Java with Java Swing (GUI), SAXBuilder (XML), Tomcat (Web server), and: • Lucene: a high-performance, full-featured text search engine library written entirely in Java, suitable for nearly any application, especially cross-platform ones. • Prefuse: a user interface toolkit for building highly interactive visualizations of structured and unstructured entity-relationship data.
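One way to picture the filtering step is a breadth-first walk that keeps only entities within a few hops of the focused entity. The sketch below is an assumption about what such a filter might do, not the Prefuse API or the project's code; FocusFilterSketch, within, and the adjacency-map input are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical distance filter: keeps only entities near the focus node.
final class FocusFilterSketch {
    /** Returns each entity reachable within maxHops of focus, mapped to its hop distance. */
    static Map<String, Integer> within(Map<String, List<String>> graph,
                                       String focus, int maxHops) {
        Map<String, Integer> distance = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(focus, 0);
        queue.add(focus);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int hops = distance.get(current);
            if (hops == maxHops) {
                continue; // do not expand beyond the hop limit
            }
            List<String> neighbors =
                    graph.getOrDefault(current, Collections.<String>emptyList());
            for (String neighbor : neighbors) {
                if (!distance.containsKey(neighbor)) {
                    distance.put(neighbor, hops + 1);
                    queue.add(neighbor);
                }
            }
        }
        return distance;
    }
}
```

The returned map holds the entities that would be rendered; the hop distance could also drive visual emphasis, for instance drawing nearer entities larger than distant ones.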
Project Demo • Evaluation: not formally done at this point. Let users judge its coolness themselves. • Examples: • “anhai” • “a* schemas” • “name:schemas” • “name:a* co-author:a*”
Limitations • Explanation is not quite intuitive. • Lacks a dedicated query language. • Multiple relationships between two entities are not handled. • No formal evaluation of the system yet. • Not yet integrated with Web Services.
Future Work • Experiment with and compare hit accuracy and quality for alternative indexing, query analysis, and query parsing methods: • Span-term queries for 'mentions'. • Query filtering: sequences of one or more QueryParser-based filters.