Improving the ETD Landscape • ETD 2014: 17th Int’l Symposium on ETDs, Leicester, England • Edward A. Fox, Executive Director, NDLTD, www.ndltd.org • fox@vt.edu, http://fox.cs.vt.edu/talks/2014 • Virginia Tech, Blacksburg, VA 24061 USA
Outline • Acknowledgments • Why, what, who, how • Improving, quality • Related technical contributions • DLs and DL curriculum
Acknowledgments • Family, mentors, teachers, students • Dissertations: Sung Hee Park, Venkat Srinivasan, Seungwon Yang • NSF: IIS-0535057, 0916733, 1319578 • All those working with ETDs • NDLTD, including its Members, Board, Committees, and Working Groups
Why, What, Who? • Why? • enhance graduate education • expand global research collaboration • What? • help students communicate more effectively • get ETDs for all TDs: next goal 5 million • help make ETDs open, accessible, preserved • Who? • levels: students, faculty, staff, (grad) administrators • professions: CS, IT, LIS, librarians, archivists
How? • Authoring systems, tools, methods • Data and auxiliary information management aids • Metadata creation software and techniques • Submission, approval, refinement workflows • Local access and information management • Sharing, disseminating, discovering • OAI, data providers, harvesting • Regional/national, global institutions • Services: access, preservation, adding value • Add back files
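The "OAI, data providers, harvesting" bullet can be illustrated with a minimal sketch of one OAI-PMH harvesting step: build a ListRecords request and read Dublin Core titles from a response. The base URL in the usage note, the function names, and the sample response are assumptions made for illustration; no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (helper name is hypothetical)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument alongside the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_titles(xml_text):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter("{%s}title" % DC_NS)]
    token_el = root.find(".//{%s}resumptionToken" % OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token

# Hand-written sample response, not taken from any real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Dissertation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would call `list_records_url("https://example.org/oai")` (a placeholder address), fetch the result, pass the body to `parse_titles`, and repeat with the returned token until no token remains.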
Improving – 1 of 2 • Context: Quality frameworks, references on quality • Guidelines and documentation for all of this • Works • XML + PDF + raw/original representations • Multimedia, software, simulations, websites, dynamic content • Data, auxiliary information, references/bibliographies • Reproducibility • Metadata • Completeness: subject classification, faculty by role • Authority info
Improving – 2 of 2 • Local services • Training, assistance • IR, archives, archival consortia • Global services • Browse, faceted search, full-text search • Recommend, CLIR, CBIR, summaries, topics • Linked data, hyperlinks, citation linking • Alerts, notifications, RSS feeds, filtering
Information Life Cycle (adapted) • [Figure: information life cycle diagram with the labels Creation, Searching, Utilization; Active, Semi-Active, Inactive; Social Context; and the activities Authoring, Modifying, Classifying, Tagging, Recommending, Indexing, Using, Citing, Retention / Mining, Downloading, Storing, Retrieving, Discovering, Filtering, Distributing, Networking] • Source: Borgman et al. 1996, http://is.gseis.ucla.edu/research/dig_libraries/
Improve related movements • Make related efforts work for graduate researchers, ETDs, and university ETD activities: • Open access, institutional repositories • Sharing references and citations: Zotero, … • Sharing data, datasets, workflows; reproducible science: reproducibleresearch.net, … • Building author profiles: ORCID, ISNI, … • Digital libraries and DL education (DL2014)
Related technical contributions • Broadly: new/better systems, user/usage studies, added services, improved practices • Automatically assign topics or categories to ETDs or to portions (e.g., chapters) to aid browsing and (faceted) searching • Build a union reference collection: by aiding authors (e.g., Hiberlink) and/or by automatic ETD text mining • Enhanced information retrieval: cross language IR, content based IR (image/video/music) …
Topic determination • Given a document, extract or generate generalized description of its topics • Statistical approaches, e.g., LDA • Knowledge based approaches, e.g., Xpantrac • Take a webpage or document • Use portions of it to build queries to a knowledge source (Web, Wikipedia, and ETD collection) • Combine, analyze, and summarize the results • Seungwon Yang, "Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach", Jan. 2014, Ph.D. dissertation, http://hdl.handle.net/10919/25111
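The three knowledge-based steps (build queries from portions of a document, send them to a knowledge source, combine and summarize the results) can be sketched as a toy example. This is not the Xpantrac algorithm from the dissertation; the in-memory `KNOWLEDGE_SOURCE`, the stopword list, the chunking rule, and all function names are assumptions chosen only to make the flow concrete.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with", "by"}

# Hypothetical stand-in for a knowledge source (Web, Wikipedia, ETD collection).
KNOWLEDGE_SOURCE = [
    "digital libraries support search and browsing of electronic theses",
    "metadata harvesting lets digital libraries share electronic theses",
    "cooking pasta requires boiling water and salt",
]

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_queries(document, size=3):
    """Split the document's tokens into consecutive chunks used as queries."""
    tokens = tokenize(document)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def search(query, source=KNOWLEDGE_SOURCE):
    """Return source entries that share at least one term with the query."""
    terms = set(query)
    return [entry for entry in source if terms & set(tokenize(entry))]

def topic_tags(document, k=3):
    """Expand via search hits, then keep the k most frequent terms as tags."""
    counts = Counter()
    for query in build_queries(document):
        for hit in search(query):
            counts.update(tokenize(hit))
    return [term for term, _ in counts.most_common(k)]
```

Calling `topic_tags("Harvesting metadata for theses in digital libraries")` draws tags from the two related entries and ignores the unrelated one, because no query shares a term with it.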
ETD Classification: Venkat Srinivasan • Enhance metadata by adding subject categories • Hierarchical classification of ETDs (and chapters thereof) using Library of Congress categories • Training data • OCLC’s WorldCat: records from 1M books have good labels but little metadata; labels on ETDs not usable • Results coming from queries each designed to describe a category • Need to balance negative and positive examples throughout the LoC taxonomy
ETD Classification: Algorithm Pipeline • [Figure: pipeline diagram. Training: the category label for each node of the Category Tree is used as a Google query → top 50 webpages (for each node in the tree) → Document Sets → Cleanup (stemming, stopword removal, etc.) → Training Sets → Naïve Bayes Classifiers. Categorization: ETD Collection, with ETD metadata used for categorization → level-wise categorization → Categorized ETDs (each ETD categorized into a node of the category tree after classification) → Browsing through a Web Interface]
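Level-wise categorization with Naïve Bayes classifiers can be sketched as follows: at each node of the category tree, a small multinomial Naïve Bayes model chooses among that node's children, and the descent stops at a leaf. The two-level tree, the training strings (standing in for the cleaned webpages retrieved per node), and every function name here are hypothetical; this is a sketch under those assumptions, not the dissertation's implementation.

```python
import math
from collections import Counter

# Hypothetical category tree: parent -> children.
TREE = {"ROOT": ["Science", "Arts"], "Science": ["Physics", "Biology"], "Arts": []}

# Hypothetical training text per node (the slides use top webpages per node).
TRAINING = {
    "Science": ["experiment data theory energy cell", "laboratory measurement theory"],
    "Arts": ["painting music poetry", "sculpture music theatre"],
    "Physics": ["quantum energy particle", "energy force particle"],
    "Biology": ["cell gene protein", "gene organism cell"],
}

def train(docs_by_label):
    """Collect per-label word counts, document counts, and the vocabulary."""
    word_counts = {lab: Counter(" ".join(docs).split()) for lab, docs in docs_by_label.items()}
    doc_counts = {lab: len(docs) for lab, docs in docs_by_label.items()}
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words unseen at this level are ignored
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def classify_levelwise(text, tree=TREE, training=TRAINING):
    """Descend the tree one level at a time, classifying among the children."""
    node, path = "ROOT", []
    while tree.get(node):
        model = train({child: training[child] for child in tree[node]})
        node = classify(text, model)
        path.append(node)
    return path
```

The returned path records the node chosen at each level, so an ETD (or a chapter) can be attached to every category along the way for browsing.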
Reference Extraction and Databasing • How can we implement a metadata schema for bibliographic information? • What machine learning methods are effective for extracting reference sections, including footnotes and chapter references? • Sung Hee Park, "Discipline-Independent Text Information Extraction from Heterogeneous Styled References Using Knowledge from the Web", June 2013, VT CS Ph.D. dissertation
Dataflow of Reference Section Extraction • [Figure: dataflow diagram. Training data → Feature Extraction → Learning. ETD in PDF → Pdf2txt → Feature Extraction → Reference Section Extraction → Tagged data]
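The feature extraction step in this dataflow can be illustrated with line-level features computed on text already converted from PDF. In the sketch below, a hand-set weighted score stands in for the learned model; the feature names, weights, threshold, and sample text are all assumptions and are not taken from the dissertation.

```python
import re

def line_features(line):
    """Binary features hinting that a line belongs to a reference entry."""
    return {
        "starts_with_marker": bool(re.match(r"\s*(\[\d+\]|\d+\.)\s", line)),
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "has_initials": bool(re.search(r"\b[A-Z]\.", line)),
        "has_pages": bool(re.search(r"\b\d+\s*[-–]\s*\d+\b", line)),
    }

# Hypothetical weights; a trained model would learn these from tagged data.
WEIGHTS = {"starts_with_marker": 2.0, "has_year": 1.0, "has_initials": 1.0, "has_pages": 1.0}
THRESHOLD = 2.0

def is_reference_line(line):
    """Score a line by summing the weights of the features that fire."""
    feats = line_features(line)
    return sum(WEIGHTS[name] for name, on in feats.items() if on) >= THRESHOLD

def extract_reference_lines(text):
    """Keep the non-empty lines that score as reference entries."""
    return [ln for ln in text.splitlines() if ln.strip() and is_reference_line(ln)]

# Hand-written sample standing in for Pdf2txt output.
SAMPLE_TEXT = """Chapter 5 Conclusions
We summarized the approach and its limits.
References
[1] C. L. Borgman. Social aspects of digital libraries. 1996.
[2] E. A. Fox and R. Shen. Digital library foundations. 2012, 1-180.
"""
```

Replacing `is_reference_line` with a classifier trained on tagged lines would match the "Learning" box in the figure more closely; the feature dictionary can be passed to such a model unchanged.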
ETD References: System Architecture • [Figure: architecture diagram. ETD Repository → Extracting Reference Sections → Metadata with References; Users perform Searching, Browsing, Manipulating through a Web App (e.g., ETD-db, https://github.com/VTUL/etddb2); Union ETD References?]
Discovery, Search Engines, Info. Retrieval (to be extended for images, etc.) • [Figure: Query Q and Documents D feed Search; Ranking selects the best matches (Q with D) as Results] • Quality of many systems is low, with recall and precision both only around 0.5, rather than the ideal of 1 for each.
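For reference, recall and precision for a single query can be computed from the set of retrieved documents and the set of relevant documents. The function name and the document identifiers below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, four relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])
```

In this example half of what was retrieved is relevant and half of what is relevant was retrieved, which is the 0.5 level the slide describes.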
Search Module Detail (features can be about text, images, …) • [Figure: Query Q → feature vector Q; Document D1 → feature vector D1; Similarity Function computes S = Sim(Q,D1)] • In CBIR (Content Based Image Retrieval), search is based on visual content of images • Color • Shape • Texture …
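The search module can be sketched with term-count feature vectors and cosine similarity as the similarity function. Cosine similarity is an assumption here, since the slide does not name a specific Sim; for CBIR the same structure would apply with color, shape, or texture features in place of term counts.

```python
import math
from collections import Counter

def feature_vector(text):
    """Term-count feature vector for a piece of text."""
    return Counter(text.lower().split())

def sim(q, d):
    """Cosine similarity between two feature vectors."""
    dot = sum(q[t] * d[t] for t in q)  # a Counter returns 0 for missing terms
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if not norm_q or not norm_d:
        return 0.0
    return dot / (norm_q * norm_d)

def rank(query, documents):
    """Score every document against the query, best match first."""
    q = feature_vector(query)
    scored = [(sim(q, feature_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical two-document collection.
DOCS = [
    "electronic theses and dissertations",
    "image retrieval by color and texture",
]
results = rank("color texture image", DOCS)
```

`results` holds `(score, document)` pairs in descending score order, so the first entry is the best match of Q with D.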
DL Definitions: Informal 5S DLs are complex systems that • help satisfy info needs of users (societies) • provide info services (scenarios) • organize info in usable ways (structures) • present info in usable ways (spaces) • communicate info with users (streams) • Use this as: checklist, design guidelines, basis for formal description, specification for software implementation; e.g., Spaces help re GIS, VR
Digital Library Books • Edward A. Fox and Jonathan P. Leidig, eds. Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. Morgan & Claypool Publishers, 2014, 175 p., http://dx.doi.org/10.2200/S00565ED1V01Y201401ICR032 • Edward A. Fox and Ricardo da Silva Torres, eds. Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. Morgan & Claypool, 2014, 205 p., http://dx.doi.org/10.2200/S00566ED1V01Y201401ICR033 • Rao Shen, Marcos Andre Goncalves, and Edward A. Fox. Key Issues Regarding Digital Libraries: Evaluation and Integration. Morgan & Claypool, 2013, 110 p., http://dx.doi.org/10.2200/S00474ED1V01Y201301ICR026 • Edward A. Fox, Marcos Andre Goncalves, and Rao Shen. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Morgan & Claypool, 2012, 180 p., http://dx.doi.org/10.2200/S00434ED1V01Y201207ICR022, supplementary website https://sites.google.com/a/morganclaypool.com/dlibrary/
DL Curriculum Project • NSF awards to VT and UNC-CH: CS and LIS • Project server: http://curric.dlib.vt.edu/ • Wikiversity: http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries • Table 1: Core DL Curriculum • Table 2: Information Retrieval Packages • Table 3: LucidWorks Big Data Software • Table 4: Multimedia Software
DL Curriculum Module Template 1. Module name 2. Scope 3. Learning objectives 4. 5S characteristics of the module (streams, structures, spaces, scenarios, society) 5. Level of effort required (in-class and out-of-class time required for students) 6. Relationships with other modules (flow between modules) 7. Prerequisite knowledge/skills required (what the students need to know prior to beginning the module; completion optional; complete only if prerequisite knowledge/skills are not included in other modules) 8. Introductory remedial instruction (the body of knowledge to be taught for the prerequisite knowledge/skills required; completion optional) 9. Body of knowledge (theory + practice; an outline that could be used as the basis for class lectures) 10. Resources (required readings for students; additional suggested readings for instructor and students) 11. Exercises / Learning activities 12. Evaluation of learning objective achievement (graded exercises or assignments) 13. Glossary 14. Additional useful links 15. Contributors (authors of module, reviewers of module)
DL Curriculum Modules - examples • Module 1-b: History of digital libraries and library automation • Module 2-c: File Formats, Transformation, and Migration • Module 3-b: Digitization • Module 4-b: Metadata • Module 5-a: Architecture overviews • …
Conclusion: Improving together • Who will help? • What can we do? • What knowledge and education is needed? • What connections, integrations, collaborations can help with ETDs? • Please comment and share! – Ed Fox (fox@vt.edu, http://fox.cs.vt.edu/talks/2014)