An XML Log Standard and Tool for Digital Library Logging Analysis
Marcos André Gonçalves, Ming Luo, Rao Shen, Mir Farooq Ali, and Edward A. Fox
Virginia Tech
Outline
• Motivation
• Related Work
• Problems with existing DL logs
• The Digital Library Standardized Log Format
  • DL log standard design
  • DL log format structure
• DL log tool and its implementation
• Conclusions and future work
Motivation
• Log analysis
  • Source of information about:
    • How patrons really use DL services
    • How systems behave while supporting users' information-seeking activities
    • Examples: patterns
  • Used to:
    • Evaluate
    • Enhance services
    • Help design user interfaces
    • Allocate resources better
• Common practice in the web setting
  • Supported by web servers, proxy caching
Motivation (cont.)
• DLs differ from the web
  • DL collections are explicitly organized, described, managed, and preserved
  • Users have more specific tasks and needs
  • Digital objects and collections are more structured
• DL logging should offer much richer information and opportunities
  • Tradeoff: user privacy
• Current DL logs
  • Differences in formats and recorded information
  • Problems:
    • Lack of interoperability
    • No reuse of analysis tools
    • Limited comparability of log analysis results
Related Work
• Web servers (Common Log Format)
  • Focused on browsing; stateless

bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:21 -0400] "GET /~harley/courses.html HTTP/1.0" 200 1734
bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:22 -0400] "GET /~harley/clip_art/word_icon.gif HTTP/1.0" 200 1050
www4.e-softinc.com - - [22/Oct/1998:00:20:27 -0400] "HEAD / HTTP/1.0" 200 0
user-38ldbam.dialup.mindspring.com - - [22/Oct/1998:00:20:48 -0400] "GET /~lhuang/junior/capehatteras.html HTTP/1.0" 200 328
user-38ldbam.dialup.mindspring.com - - [22/Oct/1998:00:20:48 -0400] "GET /~lhuang/junior/PB2panforringed.mirror.gif HTTP/1.0" 200 20222
eger-dl01.agria.hu - - [22/Oct/1998:00:20:51 -0400] "GET /~tjohnson/pinouts/ HTTP/1.0" 200 26994
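To make concrete how little a Common Log Format line records (host, timestamp, request, status, size, and no session or query semantics), here is a minimal parsing sketch. The class name ClfLine, its fields, and the regular expression are assumptions made for illustration; they are not part of the log standard or tool presented in these slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative Common Log Format parser; names are hypothetical, not from the DL log tool. */
final class ClfLine {
    // Layout: host ident authuser [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    final String host;
    final String timestamp;
    final String request;
    final int status;
    final long bytes; // -1 when the server logged "-" instead of a size

    private ClfLine(String host, String timestamp, String request, int status, long bytes) {
        this.host = host;
        this.timestamp = timestamp;
        this.request = request;
        this.status = status;
        this.bytes = bytes;
    }

    /** Parses one CLF line or throws IllegalArgumentException if it does not match. */
    static ClfLine parse(String line) {
        Matcher m = CLF.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a CLF line: " + line);
        }
        long size = m.group(7).equals("-") ? -1L : Long.parseLong(m.group(7));
        return new ClfLine(m.group(1), m.group(4), m.group(5),
                           Integer.parseInt(m.group(6)), size);
    }

    public static void main(String[] args) {
        ClfLine entry = parse("bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:21 -0400] "
            + "\"GET /~harley/courses.html HTTP/1.0\" 200 1734");
        System.out.println(entry.host + " | " + entry.request + " | "
            + entry.status + " | " + entry.bytes);
    }
}
```

Every field here describes a single HTTP request; nothing ties two requests from the same user together, which is the statelessness the slide points to.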
Related Work (cont.)
• DL: Greenstone

/fast-cgi-bin/niupepalibrary (a) its-www1.massey.ac.nz (b) [Thu Dec 07 23:47:00 NZDT 2000] (c) (a=p, b=0, bcp=, beu=, c=niupepa, cc=, ccp=0, ccs=0, cl=, cm=, cq2=, d=, e=, er=, f=0, fc=1, gc=0, gg=text, gt=0, h=, h2=, hl=1, hp=, il=l, j=, j2=, k=1, ky=, l=en, m=50, n=, n2=, o=20, p=home, pw=, q=, q2=, r=1, s=0, sp=frameset, t=1, ua=, uan=, ug=, uma=listusers, umc=, umnpw1=, umnpw2=, umpw=, umug=, umun=, umus=, un=, us=invalid, v=0, w=w, x=0, z=130.123.128.4-950647871) (d) "Mozilla/4.08 [en] (Win95; I ;Nav)"
Related Work (cont.)
• Search engine: OpenText

Mon Sep 28 17:48:42 1998 ----- Starting Search -----
Mon Sep 28 17:48:42 1998 {Transaction Begin}
Mon Sep 28 17:48:42 1998 {RankMode Relevance1}
Mon Sep 28 17:48:42 1998 "Bacillus thuringiensis "
Mon Sep 28 17:48:42 1998 P0 = "Bacillus thuringiensis "
Mon Sep 28 17:48:42 1998 R = (*D including (*P0))
Mon Sep 28 17:48:42 1998 R = (((*R rankedby *P0)))
Mon Sep 28 17:48:42 1998 S = (subset.1.10 (*R))
Mon Sep 28 17:48:42 1998 SL0 = (region "OTSummary" within.1 (*S))
Mon Sep 28 17:48:42 1998 (*SL0 within.1 ( subset.1.1 *S ))
Mon Sep 28 17:48:42 1998 (*SL0 within.1 ( subset.2.1 *S ))
Mon Sep 28 17:48:42 1998 {Transaction End}
Related Work (cont.)
• Problems with existing DL logs
  • Incompatibility
  • Incompleteness
  • Complexity of analysis
  • Lack of organization
  • Ambiguity
  • Inflexibility
  • Verboseness
The Digital Library Standardized Log Format
• Comprehensive
• Reflective of the actual DL system behavior
• Easily readable
• Precise
• Flexible enough to accommodate varying systems
• Succinct enough to be implemented
• Concern: user privacy
The Digital Library Standardized Log Format - Design (cont.)
• Capture high-level user and system behaviors
• Hierarchical organization
• Encapsulated in transactions
  • Interactions between the users and the system, or among the system components
• Log format designed to record a number of different kinds of transactions
  • Examples:
    • Login to the system
    • Submission of a search query
    • Browsing a result list
    • Recording of a user failure
The Digital Library Standardized Log Format - Design (cont.)
• Design reflective of DL behavior
• Based on the 5S formal theory
  • Unifying, mathematical theory to formally describe the semantics of DL components
  • Guidance for how to organize the log structure
The Digital Library Standardized Log Format (cont.)
• Specification
  • Collection of an extensive, flat set of attributes
[Figure: unordered cloud of attribute names: update, catalog, session, event, help, query, collection, machine information, result cutoff, transaction, search, timestamp, action, response, registering, sorting rule, error, browse]
The Digital Library Standardized Log Format - Specification
• Organization in a structured, logical way
• XML and XML Schema
  • Standard syntax
  • Guarantee quality and correctness
  • Rich set of basic types helps standardization
  • Abundance of XML parsers helps construction of analysis tools
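As a rough illustration of how XML Schema can enforce correctness through its basic types, the sketch below validates a log entry with the standard javax.xml.validation API. The inline schema is a deliberately tiny, hypothetical stand-in (only SessionId, TimeStamp, and an ID attribute); it is not the schema of the DL log standard, and the class name LogSchemaCheck is likewise an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Validates a log entry against a toy schema; the schema is hypothetical, not the real standard. */
final class LogSchemaCheck {
    // Simplified stand-in schema: typed timestamp and a required numeric ID attribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Transaction'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='SessionId' type='xs:string'/>"
      + "<xs:element name='TimeStamp' type='xs:dateTime'/>"
      + "</xs:sequence>"
      + "<xs:attribute name='ID' type='xs:positiveInteger' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Returns true if the XML text is well formed and valid against XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or schema-invalid input
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<Transaction ID='3452'><SessionId>987654usr3</SessionId>"
            + "<TimeStamp>2002-05-31T20:10:55.000-05:00</TimeStamp></Transaction>";
        String bad = "<Transaction ID='3452'><SessionId>987654usr3</SessionId>"
            + "<TimeStamp>Mon Sep 28 17:48:42 1998</TimeStamp></Transaction>";
        System.out.println("ISO timestamp accepted: " + isValid(good));
        System.out.println("Free-form timestamp accepted: " + isValid(bad));
    }
}
```

Declaring TimeStamp as xs:dateTime means a free-form date string is rejected at validation time rather than discovered later by an analysis tool.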
The Digital Library Standardized Log Format - Structure
• Top-level hierarchy
[Figure: tree diagram. Log contains a sequence of Log Entry nodes; below these, Transaction with its components Statement, SessionId, TimeStamp, and MachineInfo]
The Digital Library Standardized Log Format - Structure (cont.)
• Decomposition of statement into different types
[Figure: tree diagram. Statement: ErrorInfo, SessionInfo, HelpInfo, RegisterInfo, Event, AdmInfo]
The Digital Library Standardized Log Format - Structure (cont.)
• Decomposition of event
[Figure: tree diagram. Statement: ErrorInfo, SessionInfo, Event, HelpInfo, AdmInfo, RegisterInfo. Event: Action, StatusInfo. Lowest level: Update, Search, Browse, StoreSysInfo]
The Digital Library Standardized Log Format - Structure (cont.)
• Search attributes
[Figure: tree diagram. Search: TimeFrame, Collection, PresentationInfo, Catalog, SearchBy, QueryString. PresentationInfo: Format, SortBy, NumberOfResults, CutOff]
DL Log Tool and Implementation
• Java classes
  • XMLLogData: stores data
  • XMLLogManager: methods to read and write log information according to the format
  • Synchronized reads and writes: avoid conflicts and inconsistencies
• Middleware to plug the DL log tool into a target system
  • Events based on target system architecture and implementation
• Implemented in the MARIAN DL system
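The synchronized read/write idea can be sketched as follows. Only the class names XMLLogData and XMLLogManager come from the slides; the method signatures, the in-memory list, and the transaction counter are assumptions made for illustration and should not be read as the actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a manager that serializes access to the log data. */
final class XMLLogManager {
    private final XMLLogData data = new XMLLogData();
    private long nextTransactionId = 1;

    /**
     * Appends one entry. The method is synchronized so that concurrent writers
     * cannot reuse an ID or corrupt the underlying list.
     * Assumes sessionId and statementXml are already XML-safe.
     */
    synchronized long writeLogEntry(String sessionId, String statementXml) {
        long id = nextTransactionId++;
        data.add("<Transaction ID=\"" + id + "\"><SessionId>" + sessionId
            + "</SessionId>" + statementXml + "</Transaction>");
        return id;
    }

    /** Returns a read-only copy, taken under the same lock as the writes. */
    synchronized List<String> getLogData() {
        return data.snapshot();
    }

    /** Runs several writer threads against one manager; returns the stored entry count. */
    static int runConcurrentWriters(int threads, int entriesPerThread) {
        final XMLLogManager manager = new XMLLogManager();
        Thread[] writers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final String sessionId = "session-" + i;
            writers[i] = new Thread(() -> {
                for (int j = 0; j < entriesPerThread; j++) {
                    manager.writeLogEntry(sessionId, "<Statement/>");
                }
            });
            writers[i].start();
        }
        for (Thread writer : writers) {
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        return manager.getLogData().size();
    }

    public static void main(String[] args) {
        System.out.println("entries stored: " + runConcurrentWriters(4, 250));
    }
}

/** In-memory stand-in for the data holder; not thread-safe on its own. */
final class XMLLogData {
    private final List<String> entries = new ArrayList<>();

    void add(String xmlEntry) {
        entries.add(xmlEntry);
    }

    List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(entries));
    }
}
```

Without the synchronized keyword, two threads could read the same nextTransactionId or append to the list at the same time, which is the kind of conflict and inconsistency the slide mentions.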
DL Log Tool and Implementation (cont.): the MARIAN DL system
[Figure: layered architecture of the MARIAN DL system]
• User Interaction Layer: Webgate, distributed client communication, customization and personalization, query history, structured logging
• Search Layer: searcher community, fusion modules, semantic network management API, multilingual support
• Database Layer: semantic networks persistent storage, generalized inverted index interfaces, database management API
• Data Analysis, Collection Builders & Loading Tools: tailored DL infrastructure generation; DL information networks characterization, indexing, and loading
DL Log Tool and Implementation (cont.)
[Figure: interaction diagram of the log tool]
• Participants: DL patron, MARIAN User Layer, Log middleware, XMLLogManager, XMLLogData, Analysis tool, DL analyst
• Messages: user event (c1), system event (c2), writeLogEntry(parameters), storelogData(parameters), analysis request, getLogData(parameters), logData, result
DL Log Tool and Implementation (cont.)
• Example 1: Login to the system

<Transaction ID="3452">
  <SessionId>987654usr3</SessionId>
  <SessionInfo>
    <SessionStart>Start</SessionStart>
    <LoginInfo>
      <UserId>mhabib</UserId>
    </LoginInfo>
  </SessionInfo>
  <TimeStamp>2002-05-31T20:10:55.000-05:00</TimeStamp>
  <MachineInfo>
    <IPAddress>128.173.244.56</IPAddress>
    <Port>8000</Port>
  </MachineInfo>
</Transaction>
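Because the format is plain XML, an analysis tool can read a login transaction with any standard parser. The sketch below uses the JDK's DOM API on a well-formed version of Example 1 (every element properly closed); the class name LoginEntryReader and its helper methods are hypothetical and not part of the tool described in these slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Reads fields from a login transaction with the JDK DOM parser (illustrative only). */
final class LoginEntryReader {
    // Well-formed rendering of Example 1.
    static final String ENTRY =
        "<Transaction ID=\"3452\">"
      + "<SessionId> 987654usr3 </SessionId>"
      + "<SessionInfo><SessionStart> Start </SessionStart>"
      + "<LoginInfo><UserId> mhabib </UserId></LoginInfo></SessionInfo>"
      + "<TimeStamp> 2002-05-31T20:10:55.000-05:00 </TimeStamp>"
      + "<MachineInfo><IPAddress> 128.173.244.56 </IPAddress><Port> 8000 </Port></MachineInfo>"
      + "</Transaction>";

    /** Parses the XML text and returns its root element; rejects malformed input. */
    static Element parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse log entry", e);
        }
    }

    /** Returns the trimmed text of the first descendant with this tag, or null if absent. */
    static String text(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Element tx = parse(ENTRY);
        System.out.println("transaction " + tx.getAttribute("ID")
            + ": user " + text(tx, "UserId")
            + " from " + text(tx, "IPAddress") + ":" + text(tx, "Port")
            + " at " + text(tx, "TimeStamp"));
    }
}
```

A parser of this kind rejects entries with unclosed elements, so log writers need to emit strictly well-formed XML.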
DL Log Tool and Implementation (cont.)
• Example 2: Query all Dirline records about “low back pain”

...
<Event>
  <Action>
    <Search>
      <Collection>Dirline</Collection>
      <ObjectType>CommunityRecord</ObjectType>
      <SearchBy>SearchByAnyParts</SearchBy>
      <SearchType>NonPersistant</SearchType>
      <QueryString>low back pain</QueryString>
      <TimeFrame>
        <StartTime>2002-05-31T20:11:07.000-05:00</StartTime>
        <EndTime>2002-05-31T20:11:09.000-05:00</EndTime>
      </TimeFrame>
      <PresentationInfo>
        <Format>List</Format>
        <SortBy>ByRank</SortBy>
        <NumberOfResults>217</NumberOfResults>
        <Cutoff>20</Cutoff>
      </PresentationInfo>
...
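One simple analysis that the TimeFrame element makes possible is measuring how long a search took. The sketch below parses the StartTime and EndTime values from Example 2 with java.time; the class and method names are hypothetical, and the only assumption about the format is that timestamps are ISO 8601 date-times with an offset, as in the example.

```java
import java.time.Duration;
import java.time.OffsetDateTime;

/** Computes elapsed search time from TimeFrame values (illustrative helper, not part of the tool). */
final class SearchTiming {
    /** Both arguments are ISO 8601 offset date-times such as 2002-05-31T20:11:07.000-05:00. */
    static Duration elapsed(String startTime, String endTime) {
        OffsetDateTime start = OffsetDateTime.parse(startTime.trim());
        OffsetDateTime end = OffsetDateTime.parse(endTime.trim());
        return Duration.between(start, end);
    }

    public static void main(String[] args) {
        Duration d = elapsed("2002-05-31T20:11:07.000-05:00", "2002-05-31T20:11:09.000-05:00");
        System.out.println("search took " + d.toMillis() + " ms");
    }
}
```

For the values in Example 2 this gives 2000 ms. Because each timestamp carries its own offset, the calculation also works when start and end are logged by machines in different time zones.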
DL Log Tool and Implementation (cont.)
• Example 3: Browse an item of the ranked list returned as an answer for the previous search

<Transaction ID="3456">
  <SessionId>987654usr3</SessionId>
  ...
  <Statement>
    <Event>
      <Action>
        <Browse>
          <DocID>5114</DocID>
          <DocName>University of Washington School of Medicine Multidisciplinary Pain Center (UWPC)</DocName>
...
In conclusion
• Analysis of current DL log formats
  • Need for standardization, common practices, interoperable tools
• Designed an XML-based log format standard for DL logging analysis
  • Captures a rich, detailed set of system and user behaviors
• Implemented format in a log component tool
  • Connected to the MARIAN DL system
Future Work
• Build a suite of components for evaluation
• Use log format and tools to evaluate several projects
  • Networked Digital Library of Theses and Dissertations (NDLTD)
  • CITIDEL
  • Broaden the scope of use to other NSDL projects
• Extend and use log tool with other DL systems and architectures
• Consider user privacy issues
• Explore info for personalization
Future Work (cont.)
• Crosswalks to other standards (e.g., CLF)
  • “Not yet other standard”
• More challenges
  • Distributed logs
  • Large settings
• Investigate compression issues to deal with XML verboseness
• Promote discussions:
  • Listserv: dl-log-l@listserv.vt.edu
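On the compression point, a quick way to get a feel for how well repetitive XML log markup compresses is to gzip a batch of entries with the JDK's java.util.zip classes. This is only an exploratory sketch: the class name, the synthetic sample entries, and the choice of gzip are assumptions, not a technique the authors report using.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Gzip round trip for XML log text (exploratory sketch; sample data is synthetic). */
final class LogCompression {
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a synthetic log in which only the IDs differ between entries. */
    static String sampleLog(int entries) {
        StringBuilder log = new StringBuilder("<Log>");
        for (int i = 1; i <= entries; i++) {
            log.append("<Transaction ID=\"").append(i).append("\">")
               .append("<SessionId>session").append(i % 7).append("</SessionId>")
               .append("<MachineInfo><Port>8000</Port></MachineInfo>")
               .append("</Transaction>");
        }
        return log.append("</Log>").toString();
    }

    public static void main(String[] args) {
        String log = sampleLog(500);
        byte[] packed = gzip(log);
        System.out.println("raw bytes: " + log.getBytes(StandardCharsets.UTF_8).length
            + ", gzipped bytes: " + packed.length);
    }
}
```

Real logs carry more varied content than this synthetic sample, so the ratio observed here says little about production data; the sketch only shows how such a measurement could be set up.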