Research ExperT Paul Varcholik Joshua Thompson EEL 6883 – Software Engineering II Spring 2009
Background • Academic Research • Literature Reviews • Conferences • Journals • Material collected from the Internet • Google Scholar • How do researchers organize the papers they find? • Hard copies • On-Disk Directory Structures
Background (cont.) • Needs • Storage and quick retrieval of research papers • Collaboration with colleagues • User-provided reviews • Annotated references • Existing Tools • 2collab.com • Mendeley • Zotero • Papers (Mac-only) • Wikipedia comparison
High-Level Architecture • 5 Assemblies • 1 Common • 1 Data Layer • 1 Unit Test • 2 UI: 1 Web and 1 Windows Forms (WinForms)
First Iteration • Requirements gathering, initial design, and implementation • Web-based system • Foundation set, key features available • Large scope forced some features to be pulled back • UI lacked polish
Second Iteration • Windows Forms (WinForms) UI • Same code base: database and data layer, with some extensions • Attempts at automatic extraction of metadata
Iteration Metrics Comparison
Metric | First Iteration | Second Iteration
Files | 180 | (count not given)
ELOC | ~4,500 | ~9,650
Classes and enumerations | 57 | 92
Database tables | 15 | 16
Stored procedures | 88 | 100
Unit tests | 87 | 96
UI Comparison • Web UI (screenshot) • Windows UI (screenshot)
Discussion (cont.) • Low complexity
Discussion (cont.) • High maintainability • The score can be read like a percentage grade: numbers closer to 100 are better. * The formula for average complexity is logarithmic, so the per-item numbers do not add up like simple sums.
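The slides do not name the tool that produced the maintainability score. Assuming it is the Maintainability Index reported by Visual Studio's code metrics (an assumption, not something the slides confirm), the logarithmic shape of the formula can be sketched as follows. The function and parameter names are illustrative only.

```python
import math

def maintainability_index(halstead_volume, cyclomatic_complexity, lines_of_code):
    """Sketch of a 0-100 maintainability score (assumed Visual Studio variant).

    The Halstead volume and line count enter through natural logarithms,
    which is why scores for separate pieces of code cannot simply be summed.
    """
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))
    # Rescale from the 0-171 range to 0-100 and clamp negative values to 0.
    return max(0.0, raw * 100 / 171)
```

Under this formula, doubling the line count lowers the score by a fixed amount rather than halving it, which matches the note that the numbers do not add up like sums.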
PDF Parsing • Metadata • Issue Heading • Title • Authors • Abstract • Keywords
PDF Parsing (cont.) • Using PDFBox libraries for PDF reading and manipulation • Three methods for parsing PDFs • Automatic • XML based • User-driven image based
PDF Parsing (cont.) • Automatic parsing • Uses heuristics to determine metadata • Font sizes • Relative positioning • Specific tokens • Pros • No user input required • Can provide reasonable guesses • Cons • Makes assumptions about the layout • Results are not always correct • Extracting text from the PDF is difficult
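The three heuristics listed above (font sizes, relative positioning, specific tokens) can be illustrated with a minimal Python sketch. It is not the project's implementation and does not use PDFBox: it assumes the first page has already been turned into text spans with a font size and a vertical position, and the `Span` type and `guess_metadata` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    font_size: float
    y: float  # distance from the top of the page; smaller means higher up

def guess_metadata(spans):
    """Guess title, authors, abstract, and keywords from first-page spans."""
    result = {"title": None, "authors": None, "abstract": None, "keywords": None}
    ordered = sorted(spans, key=lambda s: s.y)
    if not ordered:
        return result

    # Font size heuristic: the largest text on the first page is the title.
    largest = max(s.font_size for s in ordered)
    title_spans = [s for s in ordered if s.font_size == largest]
    result["title"] = " ".join(s.text for s in title_spans)

    # Relative positioning heuristic: the first span below the title
    # is assumed to hold the authors.
    title_bottom = max(s.y for s in title_spans)
    below = [s for s in ordered if s.y > title_bottom]
    if below:
        result["authors"] = below[0].text

    # Token heuristic: look for spans that start with a known label.
    for s in below:
        lowered = s.text.lower()
        if lowered.startswith("abstract"):
            result["abstract"] = s.text[len("abstract"):].lstrip(" :.-").strip()
        elif lowered.startswith("keywords"):
            body = s.text[len("keywords"):].lstrip(" :.-")
            result["keywords"] = [k.strip() for k in body.split(",") if k.strip()]
    return result
```

Each rule bakes in an assumption about the layout (for example, that nothing on the page is set larger than the title), which is the weakness the cons above point to.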
PDF Parsing (cont.) • XML Parsing • Paper formats are specified • Order of metadata • Relative font sizes • Token delimiters • Pros • More effective than automatic parsing • No direct user input required • Cons • Requires manual input for each publication source
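A publication-source template of this kind might look like the following Python sketch. The XML schema (the `template` and `field` elements and the `order`, `relativeFontSize`, `token`, and `delimiter` attributes) is invented for illustration; the slides do not show the real format. Input blocks are assumed to be `(text, font_size)` pairs in reading order.

```python
import xml.etree.ElementTree as ET

# Hypothetical template for one publication source.
TEMPLATE_XML = """
<template source="Example Proceedings">
  <field name="title" order="1" relativeFontSize="largest"/>
  <field name="authors" order="2"/>
  <field name="abstract" order="3" token="Abstract"/>
  <field name="keywords" order="4" token="Keywords" delimiter=";"/>
</template>
"""

def load_template(xml_text):
    """Return the field specifications sorted by their declared order."""
    root = ET.fromstring(xml_text.strip())
    fields = [dict(f.attrib) for f in root.findall("field")]
    fields.sort(key=lambda f: int(f["order"]))
    return fields

def apply_template(fields, blocks):
    """Match each template field against the remaining text blocks."""
    result = {}
    remaining = list(blocks)
    for field in fields:
        token = field.get("token")
        match = None
        if field.get("relativeFontSize") == "largest" and remaining:
            match = max(remaining, key=lambda b: b[1])
        elif token:
            match = next((b for b in remaining
                          if b[0].lower().startswith(token.lower())), None)
        elif remaining:
            match = remaining[0]  # fall back to the declared order
        if match is None:
            continue
        remaining.remove(match)
        text = match[0]
        if token:
            text = text[len(token):].lstrip(" :.-")
        delimiter = field.get("delimiter")
        if delimiter:
            result[field["name"]] = [p.strip() for p in text.split(delimiter) if p.strip()]
        else:
            result[field["name"]] = text.strip()
    return result
```

Because the layout knowledge lives in the template rather than in code, supporting a new venue means writing one more XML file, which is the manual cost noted above.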
PDF Parsing (cont.) • User-Driven Image Based Parsing • Display Page 1 • User draws rectangles around metadata • Uses automatic parsing as an initial guess • User can review/modify the results • Pros • Uses automatic and user-driven methods • Cons • Requires user input
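The rectangle step can be sketched in Python as below. This is an illustration under assumptions, not the project's code: it supposes that the words on page 1 are available with bounding boxes in the same coordinate system as the rectangles the user draws, and `Word`, `Region`, and `extract_regions` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

@dataclass
class Region:
    field: str     # metadata field the user assigned to this rectangle
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word):
        # A word belongs to the region if its center point falls inside it.
        cx = word.x + word.width / 2
        cy = word.y + word.height / 2
        return self.left <= cx <= self.right and self.top <= cy <= self.bottom

def extract_regions(words, regions):
    """Collect the text under each user-drawn rectangle, in reading order."""
    result = {}
    for region in regions:
        inside = [w for w in words if region.contains(w)]
        # Simplification: words on one line are assumed to share the same y.
        inside.sort(key=lambda w: (w.y, w.x))
        result[region.field] = " ".join(w.text for w in inside)
    return result
```

In the workflow described above, the regions could be pre-filled from the automatic guess, and the user would only move or resize the ones that are wrong.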
Discussion • Interesting uses of .NET Reflection • Object Registry • Difficulties of PDF Parsing • Approaches to resolving these difficulties • Publication source templates • User input • Cut-and-paste
Future Work • Integrated metadata parsing • Group-User-Repository access roles • Author ranking • Advanced searching • Annotated references • Additional document types (e.g., MS Word) • More UI polish • Server selection • Review attachment improvements • Administration features
Questions? Research ExperT Paul Varcholik Joshua Thompson EEL 6883 – Software Engineering II Spring 2009