Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis Jean-Pierre Norguet
Web Communication • Web transaction = request + response • Metadata in Web logs: • Request date and time • Page reference (URI) • Referral URI • Client machine information
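As an illustration of the metadata listed above, here is a minimal sketch that extracts those fields from one Web log line. It assumes the Apache/NCSA "combined" log format, which the slides do not specify; the regular expression, the function name `parse_log_line`, and the sample entry are all hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: Apache/NCSA "combined" log format (not specified in the slides).
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the transaction metadata of one log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "uri": m.group("uri"),            # page reference
        "referrer": m.group("referrer"),  # referral URI
        "client": (m.group("host"), m.group("agent")),  # client machine information
    }

# Hypothetical log entry, for illustration only.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] "GET /cgi/search HTTP/1.1" '
          '200 2326 "http://www.ulb.ac.be/" "Mozilla/5.0"')
record = parse_log_line(sample)
```

Note that the log line only carries the page reference, not the page content, which is why the same URI can stand for very different pages.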
Web Analytics Tools • Results • Page views • Number of visitors • Bandwidth • Traffic • Exploitation • Self-promotion • Sales planning • Technical resizing • Structure optimization → Low semantics, low-level decisions
Organization Structure Web analytics tools
Web Analytics Results • Low semantics → low intuitiveness • Too many results
Page Ref. Ambiguity (1) Address: http://www.ulb.ac.be/cgi/search
Page Ref. Ambiguity (2) Address: http://www.ulb.ac.be/cgi/search
Page Volatility Address: http://www.ulb.ac.be/cgi/search
Problems Summary • Low semantics → low intuitiveness • Too many results • Page reference ambiguity • Page synonymy • Page polysemy • Page temporality • Page volatility
Our solution • Summarized and conceptual results for: • Chief editors • Organization managers • Generic solution, independent from: • Web site content • Web site language • Web site technology → Analyze output text content
Output Page Collection • Mining points in Web environment: • Web logs (+ content journal) • Web server • Network wire • On-screen Web page
Lexical Analysis • Output page mining → Web pages • Unformatting → text • Tokenization → terms • Stopword removal • Stemming • Term selection → index terms • Occurrence counting → audience metrics
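The lexical analysis steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `naive_stem` is a crude stand-in for a real stemmer (such as Porter's), and all function names are hypothetical rather than taken from the WASA prototype.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}  # tiny illustrative list

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def unformat(html):
    """Unformatting: strip the markup and keep the text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def naive_stem(token):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def term_counts(html, index_terms=None):
    """Tokenize, drop stopwords, stem, optionally select index terms, then count."""
    tokens = re.findall(r"[a-z]+", unformat(html).lower())
    stems = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    if index_terms is not None:
        stems = [s for s in stems if s in index_terms]
    return Counter(stems)
```

Passing the ontology terms as `index_terms` keeps only the terms that can later be grouped into concepts.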
Term-Based Metrics • Term occurrence counting in pages: • Output pages → Consultation • Online pages → Presence • Interest
Term-Based Metrics • Term-based metrics: • Consultation • Presence • Interest • Limitations: • Too many terms • Term synonymy • Term polysemy → Ontology-based term grouping
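A minimal sketch of the three term-based metrics follows. Two points are assumptions, not statements from the slides: interest is taken to be the ratio of consultation to presence, and pages are treated as static, so each view of a page delivers exactly its online content. The function and parameter names are hypothetical.

```python
from collections import Counter

def term_metrics(online_pages, page_views):
    """online_pages: {uri: Counter(term -> occurrences)}; page_views: {uri: views}.

    Returns (consultation, presence, interest) keyed by term.
    """
    presence, consultation = Counter(), Counter()
    for uri, counts in online_pages.items():
        views = page_views.get(uri, 0)
        for term, n in counts.items():
            presence[term] += n              # occurrences in online pages
            consultation[term] += n * views  # occurrences in output (served) pages
    # Assumption: interest = consultation / presence.
    interest = {t: consultation[t] / presence[t] for t in presence if presence[t]}
    return consultation, presence, interest
```

Under this assumption, a term that is rarely present but often consulted gets a high interest value, which is what makes the ratio useful next to the two raw counts.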
Hierarchical Aggregation • Consultation • Presence
Hierarchical Aggregation • Consultation • Presence • Interest (x2)
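Hierarchical aggregation can be sketched as rolling each term's count up to all of its ancestors in the ontology. The child-to-parent dictionary and the example hierarchy in the usage note are hypothetical, and the ontology is assumed to be a tree (no cycles).

```python
from collections import Counter

def aggregate(term_counts, parent):
    """Add each term's count to the term itself and to all of its ancestors.

    parent maps a term to its parent term; root terms are absent or map to None.
    """
    totals = Counter()
    for term, n in term_counts.items():
        node = term
        while node is not None:
            totals[node] += n
            node = parent.get(node)
    return totals
```

For example, with `parent = {"sql": "databases", "databases": "information systems"}` and counts `{"sql": 3, "databases": 2}`, both "databases" and "information systems" total 5. The same roll-up applies to consultation and presence; a concept-level interest would then presumably be recomputed from the aggregated values rather than summed.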
Data model • Ontology term hierarchy • Number of occurrences: by day, by term • List of days (possibly aggregated)
OLAP Model • Parent-child ontology dimension • Time dimension • Measures
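To make the cube structure concrete, here is a small in-memory sketch of the two dimensions and one measure. It is not the SQL Server implementation: the fact layout `(day, term, occurrences)`, the monthly time level, and the name `roll_up` are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date

def roll_up(facts, parent, level=lambda d: (d.year, d.month)):
    """facts: iterable of (day, term, occurrences).

    Returns {(period, concept): total}, aggregating along the time dimension
    (via `level`) and along the parent-child ontology dimension (via `parent`).
    """
    cube = defaultdict(int)
    for day, term, n in facts:
        period = level(day)
        node = term
        while node is not None:
            cube[(period, node)] += n
            node = parent.get(node)
    return dict(cube)

# Hypothetical facts and hierarchy.
facts = [
    (date(2005, 3, 1), "sql", 2),
    (date(2005, 3, 15), "sql", 1),
    (date(2005, 4, 2), "databases", 4),
]
cube = roll_up(facts, {"sql": "databases"})
```

Changing `level` to `lambda d: d.year` gives the yearly view of the same measure.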
Case Study • Web site: cs.ulb.ac.be • 1,500 pages • 100 page views/day • Knowledge domain: computer science • Ontology: ACM classification • Knowledge domain: computer science • 11 top domains • 3 levels • 1230 terms
Experimental setting • WASA prototype • SQL Server OLAP Analysis Service
Concept-Based Metrics • Y: top ontology domains • X: consultation, presence, interest
Summary • Web analytics • Output page mining • Lexical analysis • Concept-based metrics with OLAP • Experiments • Conclusion & future work
Conclusion • Most Web sites supported • Approach validated by experiments • Topic-based metrics are intuitive • Exploitation at higher decision levels • Limitation: ontology availability • Future work: ontology enrichment, integration into Web analytics tools
Content Journaling • Web logs + content journal • (+) Easy to set up • (+) Minimal storage and computation • (-) No support for dynamic pages
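One way to read content journaling is: keep a timestamped journal of page versions, then look up which version was online when a logged request arrived. The sketch below follows that reading; the journal layout and the name `content_at` are assumptions, not the WASA-CJ design.

```python
from bisect import bisect_right

def content_at(journal, uri, when):
    """Return the content of `uri` that was online at time `when`, or None.

    journal: {uri: list of (timestamp, content) sorted by timestamp}.
    Timestamps only need to be mutually comparable (numbers, datetimes, ...).
    """
    versions = journal.get(uri, [])
    stamps = [t for t, _ in versions]
    i = bisect_right(stamps, when)  # number of versions published at or before `when`
    return versions[i - 1][1] if i else None
```

Only page changes are stored, which fits the "minimal storage" point; a page generated per request has no journal entry to look up, which is the dynamic-page limitation.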
Server Monitoring • Web server plugin • (+) Dynamic pages • (+) Fast • (-) Risky
Network Monitoring • TCP/IP packet sniffing • (+) Independent from Web server • (-) Ethernet only • (-) Encrypted content • (-) CPU-intensive
Client-Side Collection • Page-embedded program • Parses page • Sends content to mining server • (+) Distributed workload • (+) Supports client-side XML/XSL • (-) Visibility and vulnerability
Output Page Collection • Collection methods alone or in combination → any Web site output is collectable • Implemented: WASA-CJ • Implemented: SourceForge mod_trace_output
Experiments • Experimental settings • Visualization • Ontology coverage • Validation • Scalability
Experimental setting • WASA prototype • SQL Server OLAP Analysis Service
EUROVOC Thesaurus • European Commission thesaurus • Knowledge domain: EC-related domains • 21 top domains • 8 levels • 6650 terms
Eurovoc Example
Top domains: 04 Politics • 08 International Relations • 10 European Communities • 12 Law • 16 Economics • 20 Trade • 24 Finance • 28 Social Questions • 32 Education and Communications • 36 Science • 40 Business and Competition • 44 Employment and Working Conditions • 48 Transport • 52 Environment • 56 Agriculture, Forestry and Fisheries • 60 Agri-Foodstuffs • 64 Production, Technology and Research • 66 Energy • 68 Industry • 72 Geography • 76 International Organisations
28 SOCIAL QUESTIONS: 2806 family • 2811 migration • 2816 demography and population • 2821 social framework • 2826 social affairs • 2831 culture and religion • 2836 social protection • 2841 health • 2846 construction and town planning
2831 culture and religion: arts • cultural policy • culture (acculturation, civilization, cultural difference, cultural identity [RT: protection of minorities (1236); RT: socio-cultural group (2821)], cultural pluralism, popular culture, regional culture) • religion
Ontology Coverage • Definition: the percentage of ontology terms that appear in the Web site • ACM classification: 15% • Eurovoc: 0.75% • Characterizes how meaningful the metrics are → ontology enrichment with terms of the Web site
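The coverage definition translates directly into a set computation. This is a minimal sketch; the function name is hypothetical, and it assumes that ontology terms and site terms have gone through the same normalization (case, stemming) before being compared.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear among the Web site's terms."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    found = ontology_terms & set(site_terms)
    return 100.0 * len(found) / len(ontology_terms)
```

Adding site terms to the ontology raises the share of site vocabulary that is counted, which is the motivation for the enrichment step.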
Methodology Steps • Editor browses his or her pages • Editor selects new terms • Editor finds the enrichment point in the ontology • Editor inserts the terms into the ontology • Editor sends the ontology to the chief editor • Chief editor commits the inserts
Validation • Comparison with WebTrends • Personal Web site • Optimized custom ontology of 1250 terms • Top concepts match the page directories → results should be comparable
Urchin WASA Results
Scalability: Case Study • Web site: www.ulb.ac.be • 800,000 pages • 100,000 page views • Knowledge domain: broad • Ontology: Eurovoc • Knowledge domain: broad (EC’s interests) • 21 top domains • 8 levels • 6650 terms • Run = 15 hours, linear dependency → reasonable and applicable to any Web site