
Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Jean-Pierre Norguet. Web Communication. Web transaction = request + response. Meta-data in Web logs: request date and time, page reference (URI), referral URI, client machine information.



  1. Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis Jean-Pierre Norguet

  2. Web Communication • Web transaction = request + response • Meta-data in Web logs: • Request date and time • Page reference (URI) • Referral URI • Client machine information
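
The meta-data listed on this slide can be illustrated by parsing one Web log entry. The sketch below assumes the Apache/NCSA "combined" log format, which the slides do not name; the sample line, the `LOG_PATTERN` regular expression, and the `parse_log_line` helper are all illustrative.

```python
import re
from datetime import datetime

# Assumption: Apache/NCSA "combined" log format (not stated in the slides).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the four meta-data items of one log entry, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        # Request date and time
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        # Page reference (URI)
        "uri": m.group("uri"),
        # Referral URI
        "referrer": m.group("referrer"),
        # Client machine information
        "client": {"host": m.group("host"), "agent": m.group("agent")},
    }

# Hypothetical log line, made up for this example.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /cgi/search HTTP/1.1" 200 2326 '
          '"http://www.ulb.ac.be/" "Mozilla/5.0"')
entry = parse_log_line(sample)
```

Note that the parsed entry only carries the page reference `/cgi/search`, not the text the visitor actually saw.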

  3. Web Analytics Process

  4. Web Analytics Tools • Results • Page views • Number of visitors • Throughput • Traffic • Exploitation • Self-promotion • Sales planning • Technical resizing • Structure optimization → Low semantics → Low-level decisions

  5. Organization Structure Web analytics tools

  6. Web Analytics Results • Low semantics → low intuitiveness • Too many results

  7. Page Ref. Ambiguity (1) Address: http://www.ulb.ac.be/cgi/search

  8. Page Ref. Ambiguity (2) Address: http://www.ulb.ac.be/cgi/search

  9. Page Volatility Address: http://www.ulb.ac.be/cgi/search

  10. Page Synonymy (1)

  11. Page Synonymy (2)

  12. Page Polysemy

  13. Page Temporality (1)

  14. Page Temporality (2)

  15. Problems Summary • Low semantics → low intuitiveness • Too many results • Page reference ambiguity • Page synonymy • Page polysemy • Page temporality • Page volatility

  16. Our solution • Summarized and conceptual results for: • Chief editors • Organization managers • Generic solution, independent from: • Web site content • Web site language • Web site technology → analyze output text content

  17. Output Page Collection • Mining points in Web environment: • Web logs (+ content journal) • Web server • Network wire • On-screen Web page

  18. Lexical Analysis • Output page mining → Web pages • Unformatting → text • Tokenization → terms • Stopword removal • Stemming • Term selection → index terms • Occurrence counting → audience metrics
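
The lexical analysis steps can be sketched end to end with the Python standard library. This is a minimal illustration, not the WASA implementation: the stopword list is a tiny made-up sample, `stem` is a toy suffix stripper standing in for a real stemmer, and term selection is reduced to an optional vocabulary filter.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

class TextExtractor(HTMLParser):
    """Unformatting: collect text nodes, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def unformat(html):
    """Web page -> plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def stem(token):
    """Toy suffix stripper (assumption); a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(html, vocabulary=None):
    """Tokenize, remove stopwords, stem, optionally select terms, then count occurrences."""
    tokens = re.findall(r"[a-z]+", unformat(html).lower())
    stems = [stem(t) for t in tokens if t not in STOPWORDS]
    if vocabulary is not None:
        stems = [s for s in stems if s in vocabulary]
    return Counter(stems)

page = "<html><body><h1>Databases</h1><p>Mining the database logs</p></body></html>"
counts = index_terms(page)
```

Here `counts` maps each index term to its number of occurrences in one output page; summing such counters over many pages gives the audience metrics.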

  19. Term-Based Metrics • Term occurrence counting in pages: • Consultation: output pages • Presence: online pages • Interest

  20. Term-Based Metrics • Term-based metrics: • Consultation • Presence • Interest • Limitations: • Too many terms • Term synonymy • Term polysemy → Ontology-based term grouping
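
A small worked example may help to separate the three metrics. The sketch below counts consultation over the pages served to visitors and presence over the pages published online. The slides do not give a formula for interest, so the ratio of consultation to presence used here is an assumption; the page contents and view counts are invented.

```python
from collections import Counter

def total_occurrences(pages):
    """Sum per-page term counters into one counter."""
    total = Counter()
    for page_terms in pages:
        total.update(page_terms)
    return total

def interest(consultation, presence):
    """Assumption: interest = consultation / presence, for terms present online."""
    return {term: consultation[term] / presence[term]
            for term in presence if presence[term] > 0}

# Hypothetical site with two online pages (term counts per page).
page_a = Counter({"database": 2, "network": 1})
page_b = Counter({"network": 3})
online_pages = [page_a, page_b]

# Hypothetical traffic: page A was served three times, page B once.
output_pages = [page_a, page_a, page_a, page_b]

presence_counts = total_occurrences(online_pages)
consultation_counts = total_occurrences(output_pages)
interest_scores = interest(consultation_counts, presence_counts)
```

Under this assumed definition, a term scores high when it is consulted often relative to how much of the site is devoted to it.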

  21. Hierarchical Aggregation • Consultation • Presence

  22. Hierarchical Aggregation • Consultation • Presence • Interest (x2)

  23. Hierarchical Aggregation • Consultation • Presence • Interest (x2)
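
Hierarchical aggregation can be sketched as a roll-up of term counts along a child-to-parent map: each term's count is added to the term itself and to every ancestor. The `PARENT` excerpt below is invented for illustration and is not the actual ACM classification.

```python
from collections import Counter

# Hypothetical ontology excerpt: term -> parent (None marks a top domain).
PARENT = {
    "computer science": None,
    "software": "computer science",
    "information systems": "computer science",
    "programming languages": "software",
    "databases": "information systems",
}

def aggregate(term_counts, parent):
    """Add each term's count to the term and to all of its ancestors."""
    totals = Counter()
    for term, count in term_counts.items():
        node = term
        while node is not None:
            totals[node] += count
            node = parent.get(node)
    return totals

# Hypothetical consultation counts for three terms.
consultation = {"databases": 5, "programming languages": 2, "software": 1}
rolled_up = aggregate(consultation, PARENT)
```

The same roll-up applies unchanged to presence counts.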

  24. Data model • Ontology term hierarchy • Number of occurrences: by day, by term • List of days (possibly aggregated)

  25. OLAP Model • Parent-child ontology dimension • Time dimension • Measures
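
The OLAP model can be approximated with a small relational schema. The slides use SQL Server OLAP Analysis Service; the sketch below swaps in SQLite through Python's `sqlite3` module, with a recursive query standing in for the parent-child dimension roll-up. Table names, column names, and all rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Parent-child ontology dimension
CREATE TABLE ontology (
    term_id   INTEGER PRIMARY KEY,
    name      TEXT,
    parent_id INTEGER REFERENCES ontology(term_id)
);
-- Time dimension
CREATE TABLE day (day_id INTEGER PRIMARY KEY, date TEXT);
-- Measures: number of occurrences by term and by day
CREATE TABLE fact (term_id INTEGER, day_id INTEGER,
                   consultation INTEGER, presence INTEGER);

INSERT INTO ontology VALUES
    (1, 'computer science', NULL), (2, 'software', 1), (3, 'databases', 1);
INSERT INTO day VALUES (1, '2005-10-10'), (2, '2005-10-11');
INSERT INTO fact VALUES (2, 1, 4, 2), (3, 1, 6, 3), (3, 2, 1, 3);
""")

# Pair every term with itself and each of its ancestors, then sum the measures.
ROLLUP = """
WITH RECURSIVE closure(ancestor_id, term_id) AS (
    SELECT term_id, term_id FROM ontology
    UNION ALL
    SELECT o.parent_id, c.term_id
    FROM closure c JOIN ontology o ON o.term_id = c.ancestor_id
    WHERE o.parent_id IS NOT NULL
)
SELECT a.name, d.date, SUM(f.consultation), SUM(f.presence)
FROM closure c
JOIN ontology a ON a.term_id = c.ancestor_id
JOIN fact f ON f.term_id = c.term_id
JOIN day d ON d.day_id = f.day_id
GROUP BY a.name, d.date
ORDER BY a.name, d.date
"""
rows = conn.execute(ROLLUP).fetchall()
```

Each row of `rows` is one cell of the cube: an ontology concept, a day, and the aggregated measures for that concept and its descendants.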

  26. Case Study • Web site: cs.ulb.ac.be • 1,500 pages • 100 page views/day • Knowledge domain: computer science • Ontology: ACM classification • Knowledge domain: computer science • 11 top domains • 3 levels • 1230 terms

  27. Experimental setting • WASA prototype • SQL Server OLAP Analysis Service

  28. Concept-Based Metrics • Y: top ontology domains • X: consultation, presence, interest

  29. Results

  30. Exploitation Process

  31. Summary • Web analytics • Output page mining • Lexical analysis • Concept-based metrics with OLAP • Experiments • Conclusion & future work

  32. Conclusion • Most Web sites supported • Approach validated by experiments • Topic-based metrics are intuitive • Exploitation at higher decision levels • Limitation: ontology availability • Future work: ontology enrichment → Integration into Web analytics tools

  33. Thank you for your attention

  34. Q & A

  35. Content Journaling • Web logs + content journal • (+) Easy to set up • (+) Minimal storage and computation • (-) Dynamic pages
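
One way to read content journaling is as a lookup: the Web log says which page was requested and when, and the content journal says which version of that page was online at that time. The sketch below illustrates that reading only; the journal layout, the integer timestamps, and the `content_at` helper are assumptions and do not describe WASA-CJ itself.

```python
from bisect import bisect_right

# Hypothetical content journal: URI -> list of (timestamp, content), sorted by timestamp.
journal = {
    "/index.html": [
        (100, "Welcome to the department"),
        (200, "Welcome to the faculty"),
    ],
}

def content_at(journal, uri, request_time):
    """Return the page version that was online at request_time, or None if unknown."""
    versions = journal.get(uri, [])
    times = [timestamp for timestamp, _ in versions]
    # Index of the last version published at or before the request.
    i = bisect_right(times, request_time)
    return versions[i - 1][1] if i else None

# Hypothetical log entries: (request time, page reference).
log = [(150, "/index.html"), (250, "/index.html")]
served = [content_at(journal, uri, t) for t, uri in log]
```

A journal of this kind only records content that exists before the request, which is consistent with the limitation on dynamic pages noted on the slide.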

  36. Server Monitoring • Web server plugin • (+) Dynamic pages • (+) Fast • (-) Risky

  37. Network Monitoring • TCP/IP packet sniffing • (+) Independent from Web server • (-) Ethernet only • (-) Encrypted content • (-) CPU-intensive

  38. Client-Side Collection • Page-embedded program • Parses page • Sends content to mining server • (+) Distributed workload • (+) Supports client-side XML/XSL • (-) Visibility and vulnerability

  39. Output Page Collection • Collection methods alone or in combination • Any Web site output is collectable • Implemented: WASA-CJ • Implemented: SourceForge mod_trace_output

  40. Experiments • Experimental settings • Visualization • Ontology coverage • Validation • Scalability

  41. Experimental setting • WASA prototype • SQL Server OLAP Analysis Service

  42. EUROVOC Thesaurus • European Commission thesaurus • Knowledge domain: EC-related domains • 21 top domains • 8 levels • 6650 terms

  43. Eurovoc Example • Top domains: 04 Politics; 08 International Relations; 10 European Communities; 12 Law; 16 Economics; 20 Trade; 24 Finance; 28 Social Questions; 32 Education and Communications; 36 Science; 40 Business and Competition; 44 Employment and Working Conditions; 48 Transport; 52 Environment; 56 Agriculture, Forestry and Fisheries; 60 Agri-Foodstuffs; 64 Production, Technology and Research; 66 Energy; 68 Industry; 72 Geography; 76 International Organisations • Domain 28 Social Questions: 2806 family; 2811 migration; 2816 demography and population; 2821 social framework; 2826 social affairs; 2831 culture and religion; 2836 social protection; 2841 health; 2846 construction and town planning • Terms listed under 2831 culture and religion: arts; cultural policy; culture; acculturation; civilization; cultural difference; cultural identity; RT: protection of minorities (1236); RT: socio-cultural group (2821); cultural pluralism; popular culture; regional culture; religion

  44. Ontology Coverage • Definition: the percentage of ontology terms that appear in the Web site • ACM classification: 15% • Eurovoc: 0.75% • Characterizes the meaning of the metrics → ontology enrichment with terms of the Web site
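
The coverage definition translates directly into a set intersection. The sketch below is a minimal illustration; the `ontology_coverage` helper and the sample terms are made up for the example.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear among the Web site's terms."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    found = ontology_terms & set(site_terms)
    return 100.0 * len(found) / len(ontology_terms)

# Hypothetical four-term ontology and the terms extracted from a site.
ontology = {"databases", "networks", "compilers", "graphics"}
site = {"databases", "students", "courses"}
coverage = ontology_coverage(ontology, site)
```

Applied to the figures on the slide, 15% of the 1230 ACM terms is roughly 185 terms, and 0.75% of the 6650 Eurovoc terms is roughly 50 terms.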

  45. Collaborative Enrichment

  46. Methodology Steps • Editor browses his pages • Select new terms • Find enrichment point in the ontology • Insert terms into ontology • Editor sends ontology to chief editor • Chief editor commits the inserts

  47. Results

  48. Validation • Comparison with WebTrends • Personal Web site • Optimized custom ontology of 1250 terms • Top concepts match the page directories → results should be comparable

  49. Results (figure panels: Urchin, WASA)

  50. Scalability: Case Study • Web site: www.ulb.ac.be • 800,000 pages • 100,000 page views • Knowledge domain: broad • Ontology: Eurovoc • Knowledge domain: broad (EC’s interests) • 21 top domains • 8 levels • 6650 terms • Run = 15 hours, linear dependency → reasonable and applicable to any Web site
