Indexing and searching heterogeneous information • LLNL – Nov. 3, 2006 • Edward A. Fox, Virginia Tech • fox@vt.edu • http://fox.cs.vt.edu
Outline • Acknowledgements, Publications • Introduction: Problem, Digital Libraries • New Efforts: Personalization, Superimposed Info • 5S, ETANA, Structure • Hybrid Partitioned Inverted Indices • Discovering Ranking Functions • Text + CBIR + Metadata + GIS • Meta-search, Union DLs • LinkFusion, SimFusion • Summary
Acknowledgements: Students • Pavel Calado, William Cameron, Yuxin Chen, Fernando Das Neves, Robert France, Marcos Gonçalves, S.H. Kim, Aaron Krowne, Ming Luo, Paul Mather, Sanghee Oh, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Manas Tungare, Wensi Xi, Seungwon Yang, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …
Acknowledgements: Faculty, Staff • Lillian Cassel, Lois Delcambre, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, David Maier, Gail McMillan, Claudia Medeiros, Manuel Perez-Quinones, Jeffrey Pomerantz, Naren Ramakrishnan, Layne Watson, Barbara Wildemuth, …
Other Collaborators (Selected) • Brazil: FUA, UFMG, UNICAMP • Case Western Reserve University • Emory, Notre Dame, Oregon State • Germany: Univ. Oldenburg • Mexico: UDLA (Puebla), Monterrey • College of NJ, Hofstra, Penn State, Villanova • University of Arizona • University of Florida, Univ. of Illinois • University of Virginia
Acknowledgements: Support • ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601, 0435059, 0532825), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS
Publications – 1 of 2
• N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing & Management, 31(3), 431-448, May-June 1995.
• Weiguo Fan, Ming Luo, Li Wang, Wensi Xi, and Edward A. Fox. Tuning before feedback: Combining ranking discovery and blind feedback for robust retrieval. In Proc. SIGIR 2004, 27th Annual Int’l ACM SIGIR Conf. on R&D in Information Retrieval, Sheffield, England, 25-29 July 2004.
• Weiguo Fan, M. D. Gordon, P. Pathak, Wensi Xi, and Edward A. Fox. Ranking function optimization for effective web search by genetic programming: an empirical study. In Proc. 37th Hawaii International Conf. on System Sciences (HICSS), 5-8 Jan. 2004, 105-112.
• Edward A. Fox, Fernando Das Neves, Xiaoyan Yu, Rao Shen, Seonho Kim, and Weiguo Fan. Exploring the computing literature with visualization and stepping stones & pathways. CACM 49(4): 52-58, April 2006.
• Edward A. Fox and Paul Mather. Scalable Storage for Digital Libraries. Chapter 12 in Multimedia Information Retrieval and Management: Technological Fundamentals and Applications, eds. D. Feng, W.C. Siu, and H.J. Zhang, Berlin: Springer, 2003, pp. 265-288.
• E. Fox and J. Shaw. Combination of Multiple Searches. In Proc. of The Second Text REtrieval Conference (TREC-2) (Aug. 30 - Sept. 1, 1993, NIST, Gaithersburg, MD), NIST Special Pub. 500-215, 1994, ed. D. K. Harman, 243-252.
• Marcos Andre Goncalves, Robert K. France, and Edward A. Fox. MARIAN: Flexible Interoperability for Federated Digital Libraries. In Proc. 5th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2001, September 4-8, 2001, Darmstadt, Germany, Springer, LNCS 2163, pp. 173-186.
• Ananth Raghavan, Naga Srinivas Vemuri, Rao Shen, Marcos Andre Goncalves, Weiguo Fan, and Edward A. Fox. Incremental, Semi-automatic, Mapping-Based Integration of Heterogeneous Collections into Archaeological Digital Libraries: Megiddo Case Study. In Proc. ECDL 2005, Vienna, Sept. 18-23, 2005, 139-150.
Publications – 2 of 2
• Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, Ricardo da S. Torres, and Edward A. Fox. Exploring Digital Libraries: Integrating Browsing, Searching, and Visualization. In Proc. JCDL 2006, June 11-15, 2006, Chapel Hill, NC, 1-10.
• Ricardo da Silva Torres, Alexandre X. Falcao, Baoping Zhang, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, and Pavel Calado. A new framework to combine descriptors for content-based image retrieval. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005, Bremen, Germany, 335-336.
• Li Wang, Weiguo Fan, Rui Yang, Wensi Xi, Ming Luo, Ye Zhou, and Edward A. Fox. Ranking Function Discovery by Genetic Programming for Robust Retrieval. Text Retrieval Evaluation Conference-2003, Nov. 17-23, NIST, Washington DC, 9 pages.
• Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, and Dong Zhuang. SimFusion: Measuring Similarity using Unified Relationship Matrix. In Proc. SIGIR 2005, 28th Annual International ACM SIGIR Conf., Salvador, Brazil, August 15-19, 2005, 130-137, http://doi.acm.org/10.1145/1076034.1076059
• W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.Y. Ma, and E.A. Fox. Link Fusion: A Unified Link Analysis Framework for Multi-type Inter-related Data Objects. In Proc. Thirteenth International World Wide Web Conf., WWW2004, NY, U.S.A., 19-22 May 2004, 10 pages.
• Wensi Xi, Ohm Sornil, Ming Luo, and Edward A. Fox. Hybrid Partition Inverted Files: Experimental Validation. In Research and Advanced Technology for Digital Libraries, 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings, eds. Maristella Agosti and Constantino Thanos, LNCS 2458, Springer, pp. 422-431.
• Wensi Xi, Ohm Sornil, and Edward A. Fox. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. In Proc. Digital Library: IT Opportunities and Challenges in the New Millennium, July 9-11, 2002, Beijing Library Press, Beijing, China, 404-418.
• Baoping Zhang, Yuxin Chen, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Marco Cristo, and Pavel Calado. Intelligent GP Fusion from Multiple Sources for Text Classification. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005, Bremen, Germany, 477-484.
Problem Characterization • Distributed (space) • Content (streams) • Indexing (space, structure) • Features • Type/sub-type: Image, texture; link, citation • Descriptors: words or phrases or concepts • High dimensionality • Searching (scenario)
Efficiency / Effectiveness • Effectiveness • Very common measures: Precision, Recall, F1, Precision at 10 (P@10), R-Precision • Usefulness, usability, task support, … • Efficiency • Time • Space • Performance, resource use, …
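The effectiveness measures named on this slide can be sketched in a few lines. This is a minimal illustration using the usual textbook definitions, not code from the talk; the document IDs and the relevance judgments are made up for the example.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f1(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one query."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k positions filled by relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant))


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p, r, f1 = precision_recall_f1(ranked, relevant)
p_at_10 = precision_at_k(ranked, relevant, 10)
r_prec = r_precision(ranked, relevant)
```

Note that `precision_at_k` divides by k even when fewer than k documents were returned, which is one common convention; some evaluations divide by the number actually returned instead.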
Personalizing A Course Website Using the NSDL • William Cameron (2), Boots Cassel (2), Edward Fox (1), Manuel Perez-Quinones (1), Manas Tungare (1), Xiaoyan Yu (1) • (1) Virginia Tech, (2) Villanova
Syllabus Collection: Towards an intelligent educational system • [Figure: system architecture. Components: Crawler, Syllabus Classifier, Resource Classifier, Extractor, Classification Scheme, Syllabus Ontology. Data: Potential Syllabus Text, Unstructured Syllabus Text, Structured Syllabus Text, Other NSDL Resources. Services: Publisher, Recommender, Searcher, Editor]
Search • With the collection in place, we offer full-text search • Results point to the local copy in our collection as well as to the original document • Try it out: http://doc.cs.vt.edu/search/
Syllabus Ontology • Standard, machine understandable • Ontology Editor: Protégé • Syllabus Schema: SylVia • http://doc.cs.vt.edu/ontologies/
Creating a new syllabus • Web-based application to support entry of syllabi into the collection • Moodle plug-in in the works • Uses CC 2001 to select topics for a course
Information Extraction • Plan: automatically extract information from the collected syllabus documents • Rule-based approach • Statistics-based approach • Apply the best extractor to the unstructured syllabi
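As an illustration of the rule-based approach only, here is a minimal sketch that pulls a few fields out of unstructured syllabus text with regular expressions. The field names, the patterns, and the sample text are all assumptions made for this example; they are not the project's actual rules or schema.

```python
import re
from typing import Dict, Optional

# Hypothetical extraction rules: field name -> pattern with one capture group.
RULES = {
    "course_code": re.compile(r"\b([A-Z]{2,4}\s?\d{3,4})\b"),
    "instructor": re.compile(r"^\s*instructor\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE),
    "textbook": re.compile(r"^\s*(?:required\s+)?textbook\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Apply each rule once; a field is None when its rule does not fire."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up syllabus text for illustration.
sample = """CS 1234 Introduction to Example Topics
Instructor: Jane Doe
Textbook: An Example Textbook
"""

fields = extract_fields(sample)
```

A statistics-based extractor would replace `RULES` with a trained model; comparing the two on a labeled sample is one way to pick "the best extractor" the slide refers to.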
Superimposed Tools for VT Uma Murthy and Edward A. Fox Department of Computer Science, Virginia Tech 18 October 2006
Origin of SI (superimposed information) • This basic need has been addressed in diverse ways, with varying degrees of success, for many years: • concordances, annotations, comments • bookmarks, concept maps, digital annotations, … • The term “SI” was coined in 1999 by researchers who now collaborate with us and are now at Portland State University • Lois Delcambre • David Maier
Layers in an SI system • [Figure] • Source: ICDE04 presentation by Murthy et al.
Summary • [Figure] • Source: ICDE04 presentation by Murthy et al.
Informal 5S & DL Definitions • DLs are complex systems that • help satisfy info needs of users (societies) • provide info services (scenarios) • organize info in usable ways (structures) • present info in usable ways (spaces) • communicate info with users (streams)
5S and DL formal definitions and compositions (April 2004 TOIS)
A Minimal DL in the 5S Framework • [Figure: concept diagram relating Streams, Structures, Spaces, Scenarios, and Societies to the parts of a Minimal DL. Labels: hypertext, indexing, searching, browsing, services, Collection, Repository, Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, Metadata Catalog, Digital Object]
ETANA-DL • Archaeological DL • Integrated DL • Heterogeneous data handling • Applies and extends the OAI-PMH • Open Archives Initiative Protocol for Metadata Harvesting • Design considerations • Componentized • Extensible • Portable
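For readers unfamiliar with OAI-PMH, a harvest is an HTTP request whose query string names a verb and its arguments. The sketch below builds a standard `ListRecords` request URL; the base URL and set name are hypothetical, and the code shows only the plain protocol, not the extensions ETANA-DL adds.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    verb, metadataPrefix, set, and from are standard OAI-PMH arguments;
    oai_dc is the Dublin Core format every repository must support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint and set name, for illustration only.
url = list_records_url("http://example.org/oai", set_spec="pottery", from_date="2006-01-01")
```

A harvester would fetch this URL, parse the XML response, and repeat with the returned `resumptionToken` until the repository reports no more records.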
ETANA Spaces • Geographic distribution of found artifacts • Temporal dimension (as inferred by archaeologists) • Metric or vector spaces • used to support retrieval operations, and to calculate distance (and similarity) • used to browse / constrain searches spatially • 3D models of the past, used to reconstruct and visualize archaeological ruins • 2D interfaces for human-computer interaction
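The two retrieval uses of spaces on this slide, similarity in a vector space and spatially constrained search, can be combined in a small sketch. Cosine similarity and a latitude/longitude bounding box are assumed here as stand-ins; the talk does not say which measures ETANA-DL uses, and the artifact records below are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def in_bounding_box(lat: float, lon: float, south: float, west: float,
                    north: float, east: float) -> bool:
    """True when the point lies inside the (south, west, north, east) box."""
    return south <= lat <= north and west <= lon <= east


def spatial_search(artifacts: List[Dict], query_vector: Sequence[float],
                   bbox: Tuple[float, float, float, float]) -> List[Tuple[str, float]]:
    """Keep artifacts found inside the box, then rank them by similarity."""
    candidates = [a for a in artifacts if in_bounding_box(a["lat"], a["lon"], *bbox)]
    scored = [(a["id"], cosine_similarity(a["features"], query_vector)) for a in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented artifact records: find spot plus a small feature vector.
artifacts = [
    {"id": "a1", "lat": 32.58, "lon": 35.18, "features": [1.0, 0.0, 1.0]},
    {"id": "a2", "lat": 32.59, "lon": 35.19, "features": [0.0, 1.0, 0.0]},
    {"id": "a3", "lat": 31.00, "lon": 35.50, "features": [1.0, 0.0, 1.0]},
]
results = spatial_search(artifacts, [1.0, 0.0, 1.0], (32.0, 35.0, 33.0, 36.0))
```

Artifact `a3` matches the query vector exactly but is excluded because it lies outside the box, which is the "constrain searches spatially" behavior the slide describes.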
ETANA Structures • Site Organization • Region, site, partition, sub-partition, locus, … • Temporal orderings (ages, periods) • Taxonomies • for bones, seeds, building materials, … • Stratigraphic relationships • above, beneath, coexistent
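Stratigraphic relationships lend themselves to a directed graph. The sketch below stores direct "above" facts and answers "above" and "beneath" questions by following chains of them; treating "above" as transitive is an assumption of this example, the "coexistent" relation is left out, and the locus names are hypothetical.

```python
from typing import Dict, Set


class Stratigraphy:
    """Directed graph of 'directly above' relationships between loci."""

    def __init__(self) -> None:
        self._directly_above: Dict[str, Set[str]] = {}

    def add_above(self, upper: str, lower: str) -> None:
        """Record that locus `upper` lies directly above locus `lower`."""
        self._directly_above.setdefault(upper, set()).add(lower)

    def is_above(self, upper: str, lower: str) -> bool:
        """True if `lower` is reachable from `upper` through 'above' links."""
        stack = list(self._directly_above.get(upper, ()))
        seen: Set[str] = set()
        while stack:
            current = stack.pop()
            if current == lower:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._directly_above.get(current, ()))
        return False

    def is_beneath(self, lower: str, upper: str) -> bool:
        """'Beneath' is the inverse of 'above'."""
        return self.is_above(upper, lower)


# Hypothetical loci from a single excavation square.
strat = Stratigraphy()
strat.add_above("locus-1", "locus-2")
strat.add_above("locus-2", "locus-3")
```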
ETANA Streams • Successive photos and drawings of excavation sites, loci, unearthed artifacts • Audio and video recordings of excavation activities and discussions • Textual reports • 3D models used to reconstruct and visualize archaeological ruins
Degree of Structure • Web: chaotic • DLs: organized • DBs: structured
Digital Objects (DOs) • Born digital • Digitized version of “real” object • Is the DO version the same, better, or worse? • Decision for ETDs: structured + rendered • Surrogate for “real” object • Not covered explicitly in metamodel for a minimal DL • Crucial in metamodel for archaeology DL
Metadata Objects (MDOs) • MARC • Dublin Core • RDF • IMS • OAI (Open Archives Initiative) • Crosswalks, mappings • Ontologies • Topic maps, concept maps
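A crosswalk is, at its simplest, a lookup table from one scheme's fields to another's. The sketch below maps a handful of MARC tag/subfield pairs to Dublin Core elements; it is a small illustrative subset rather than a complete crosswalk, and the flat `(tag, subfield, value)` record layout and the sample record are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Illustrative subset of a MARC -> Dublin Core mapping, keyed by (tag, subfield).
MARC_TO_DC: Dict[Tuple[str, str], str] = {
    ("100", "a"): "creator",    # main entry, personal name
    ("245", "a"): "title",      # title statement
    ("260", "b"): "publisher",  # name of publisher
    ("260", "c"): "date",       # date of publication
    ("650", "a"): "subject",    # topical subject heading
}


def crosswalk(marc_fields: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Translate (tag, subfield, value) triples into Dublin Core elements.

    Unmapped fields are dropped; repeated elements accumulate in order.
    """
    dc: Dict[str, List[str]] = {}
    for tag, subfield, value in marc_fields:
        element = MARC_TO_DC.get((tag, subfield))
        if element is not None:
            dc.setdefault(element, []).append(value)
    return dc


# Made-up record for illustration.
record = [
    ("100", "a", "Doe, Jane"),
    ("245", "a", "An Example Title"),
    ("260", "b", "Example Press"),
    ("260", "c", "2006"),
    ("650", "a", "Digital libraries"),
    ("650", "a", "Information retrieval"),
    ("999", "z", "local field with no mapping"),
]
dc_record = crosswalk(record)
```

Dropping unmapped fields, as done here, is lossy; that information loss is one reason crosswalks and mappings get their own bullet on the slide.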
Also Important: Epub, SGML, XML • 5S perspective: streams, structures, scenarios • Authoring • Rendering, presenting • Tagging, Markup, DOM • Semi-structured information • Dual-publishing, eBooks • Styles (XSL, XSLT) • Structured queries
XML-based DL Log Standard • Log analysis • is a source of information on: • How patrons really use DL services • How systems behave while supporting user information seeking activities • Used to: • Evaluate and enhance services • Guide allocation of resources • Common practice in the web setting • Supported by web servers, proxy caches • DL Logging can be more detailed
DL Logging Features • Captures high-level user and system behaviors • Organized according to the 5S framework • Hierarchical organization (XML-based) • Centered on the notion of events • Records only events related to initial user inputs and final system outputs • Helps to understand user interactions and the perceived value of responses
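To make the event-centered, hierarchical XML idea concrete, here is a minimal sketch that records one search event. The element and attribute names (`log`, `event`, `search`, `query`, `results`, `timestamp`, `session`) are invented for illustration and are not the schema of the actual log standard.

```python
import xml.etree.ElementTree as ET


def make_event(timestamp: str, session_id: str, kind: str, **details) -> ET.Element:
    """Build one <event> element wrapping an action element of the given kind.

    Each keyword argument becomes a child element of the action.
    """
    event = ET.Element("event", {"timestamp": timestamp, "session": session_id})
    action = ET.SubElement(event, kind)
    for name, value in details.items():
        ET.SubElement(action, name).text = str(value)
    return event


# One user input (the query) and one system output (the result count).
log = ET.Element("log")
log.append(make_event("2006-11-03T10:15:00", "s42", "search",
                      query="bronze age pottery", results=17))
xml_text = ET.tostring(log, encoding="unicode")

# Reading the log back for analysis.
parsed = ET.fromstring(xml_text)
logged_query = parsed.find("event/search/query").text
```

Because each event nests its details, a log analyzer can select all events of one kind with a single path expression such as `event/search`.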