Web Archiving Challenges and Opportunities: Presentation for Web Archiving Engineering Position. Ahmed AlSum, PhD Candidate, Old Dominion University
Outline • Engineer • What I did • Web Archive • What I know • What I did • What I can do for SUL
CCSP Project • An internal IBM support portal that gives client-facing audiences a holistic, per-client view of client situations. • Technologies: The project is built on IBM technologies (WebSphere Portal, DB2) and is deployed on zLinux machines.
Responsibilities: • Software Engineer. • Administrator on production and staging. • Customer support team lead. • Software engineer team leader.
Developing enterprise applications with J2EE platform technologies for the front end (Servlets, JSP, Portlet APIs), with EJB-based support for back-end tasks. • Lotus Sametime development for both plugins and bots. • Developing front-end components based on Web 2.0 technologies (AJAX with dojo 1.0, and JavaScript). • Developing and deploying Portal solutions on WebSphere Portal. • WebSphere Portal administration for standalone and clustered environments. • Administration of Linux and Windows OS. • DB2 server administration for single-instance and multiple-instance setups with HADR support. • Leading the customer support activities. • Supporting some project quality activities. • Code review and static analysis activities.
Certifications: • IBM Certified System Administrator, IBM WebSphere Portal V6.0 (May 2008) • IBM Certified Solution Developer, XML and Related Technologies (since Mar. 2008) • IBM Certified Solution Developer, IBM WebSphere Portal V6.0 (since Feb. 2008) • Sun Certified Web Component Developer for the Java 2 Platform, Enterprise Edition 1.4 (since Jan. 2008) • Sun Certified Programmer for the Java 2 Platform, SE 5.0 (since Mar. 2007) • IBM Rational Software Certified, RAD 6.0 Associate Developer (since Apr. 2006) • Microsoft Certified Professional in Designing and Implementing Desktop Applications with Microsoft® Visual C++® 6.0 (since Sep. 2002)
Memento • Memento is an extension to HTTP that lets the user browse the past web the same way as the current web. • [Figure: timeline of a resource's states at T1, T2, T3, and Now] • I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/.
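To make the idea concrete, here is a minimal Python sketch of the datetime negotiation that Memento adds to HTTP: the client names a desired datetime in an Accept-Datetime header, and the server side picks the archived copy (memento) closest to it. The capture list and the function names are hypothetical and exist only for illustration; no real archive is contacted.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical capture times for one URI (think of T1, T2, T3 on a timeline).
CAPTURES = [
    datetime(2009, 5, 1, tzinfo=timezone.utc),
    datetime(2011, 3, 15, tzinfo=timezone.utc),
    datetime(2013, 1, 20, tzinfo=timezone.utc),
]

def accept_datetime_header(when):
    """Format a datetime as an HTTP date for the Accept-Datetime header."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)

def closest_memento(header_value, captures):
    """Return the capture time nearest to the requested datetime."""
    wanted = parsedate_to_datetime(header_value)
    return min(captures, key=lambda t: abs(t - wanted))

header = accept_datetime_header(datetime(2011, 1, 1, tzinfo=timezone.utc))
chosen = closest_memento(header, CAPTURES)
print("Accept-Datetime:", header)
print("Selected memento:", chosen.isoformat())
```

Picking the nearest capture is only one possible selection policy; an archive could also prefer the closest capture before the requested datetime.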
Memento • Memento Aggregator • Developer and administrator of the Aggregator
Memento • Memento Client • MementoFox: Firefox add-on • mcurl: command-line client in Perl • Both are implemented against Memento Internet Draft 8.0.
WAT Extraction • Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls. • Technologies: Hadoop, Pig Latin, Java.
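As a rough illustration of the kind of structured metadata involved, the Python sketch below pulls the outgoing links from one WAT-style JSON payload. The sample record is invented, and its field layout is an assumption modeled on commonly seen WAT files, not a copy of the specification; the actual extraction work used Hadoop and Pig Latin rather than a single script.

```python
import json

# A hypothetical, heavily trimmed WAT-style JSON payload. The nesting shown
# here is an assumption for illustration, not a copy of the specification.
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://example.org/",
      "WARC-Date": "2013-05-01T12:00:00Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://example.org/about"},
            {"path": "IMG@/src", "url": "http://example.org/logo.png"}
          ]
        }
      }
    }
  }
}
"""

def extract_links(wat_json):
    """Return (source URI, [outgoing link URLs]) from one WAT-style record."""
    envelope = json.loads(wat_json)["Envelope"]
    source = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html = (envelope.get("Payload-Metadata", {})
                    .get("HTTP-Response-Metadata", {})
                    .get("HTML-Metadata", {}))
    return source, [link["url"] for link in html.get("Links", [])]

source, links = extract_links(SAMPLE_WAT_JSON)
print(source, "->", links)
```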
Web Archiving: Challenges and Opportunities
Web Archive Life Cycle Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.
Selection • Decide what to capture • We studied what is already captured.
How Much of the Web Is Archived? • It depends: tell me what your URI source is! • S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How Much of the Web Is Archived?” In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), Ottawa, Canada, 2011.
Selection • Curator • Twitter crowdsourcing: • UK Web Archive: Twittervane. • Internet Memory: collects URIs from the Twitter APIs. • VA Tech: CTRNET project.
Harvesting • Services • Archive-It • WAS @ CDLib • Dedicated server • Heritrix
Harvesting • Challenges • Ajax and Web 2.0/3.0 • Streaming media • URI challenges (e.g., Twitter hash-bang URIs) • Mobile
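The hash-bang challenge can be illustrated with a short Python sketch: everything after "#" is a fragment that an HTTP client does not send to the server, so two different hash-bang pages look like the same URI to a crawler that only issues HTTP requests. The account names below are made up, and the helper name is hypothetical.

```python
from urllib.parse import urldefrag

def uri_sent_to_server(uri):
    """Return the URI as an HTTP client requests it, plus the dropped fragment."""
    requested, fragment = urldefrag(uri)
    return requested, fragment

# Two distinct pages from the user's point of view (hypothetical accounts).
for uri in ("https://twitter.com/#!/user_a", "https://twitter.com/#!/user_b"):
    requested, fragment = uri_sent_to_server(uri)
    print(uri, "->", requested, "(fragment kept client-side: " + fragment + ")")
```

Both URIs reduce to the same request, which is why the page content has to be produced by running the site's JavaScript rather than by fetching the URI alone.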
Harvesting • SiteStory: Transactional Archive • Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, and Herbert Van de Sompel. “Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool.” In Proceedings of TPDL 2013.
Storage • Flat files: • WARC files (ISO standard) • NoSQL database: • Internet Memory
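For a sense of what the flat-file option looks like, here is a minimal Python sketch that serializes a single WARC "resource" record: a version line, named header fields, a blank line, the payload, and two trailing line breaks. It is a simplified illustration with a hypothetical helper name; a production crawler such as Heritrix writes richer records (request, response, and metadata) and usually compresses them.

```python
import uuid
from datetime import datetime, timezone

def build_warc_record(target_uri, payload, content_type="text/html"):
    """Serialize one WARC 'resource' record as bytes (simplified sketch)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates headers from payload.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLF pairs.
    return head + payload + b"\r\n\r\n"

record = build_warc_record("http://example.org/", b"<html>hello</html>")
print(record.decode("utf-8"))
```

Because records are simply appended one after another, a WARC file can be written sequentially and indexed afterwards by byte offset.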
Storage • Choosing the wrong solution could be a disaster