Explore how data mining can optimize content creation for a tech journal, focusing on reader engagement, content personalization, and taxonomy improvement. Discover potential issues and current results in targeting IT professionals.
Agenda
• Background
• Questions of Interest
• Data Overview
• Selected Approach
• Potential Issues
• Current Status
• First Results
The Company
• A US company (“TechJournal”) publishes an on-line journal (“TechPub”) with content aimed specifically at IT professionals
• TechJournal is 15 years old; TechPub is 5 years old
• Content for TechPub comes from three sources: aggregated content from public sources, TechJournal-created content, and peer-contributed content
• TechJournal’s core business is producing a high-end list product for the marketing departments of IT manufacturers
The Journal
• Content on the publication website is available to both anonymous and registered users; registered users also get access to some premium services
• Most content is free; some whitepapers are for sale
• Three unique features of the site: peer-contributed content, an auction system through which readers get paid to contribute content, and (new) personalized content for each reader
The Readers
• Target: IT professionals involved in their organization’s technology purchasing decisions
• Different levels of “readership”, from broadest to narrowest by number of individuals: e-mail recipients with anonymous visits, e-mail recipients who visited the site, e-mail recipients who are repeat visitors, registered light readers, and registered heavy readers
• The company continuously tries to stimulate new readership through e-mail campaigns
The Business Model
The model is driven by four loops: the “Active Readers Produce Better Lists” loop, the “Success Breeds Success” loop, the “Known Readers Make For Better Journal” loop, and the “Buzz Marketing” loop.
Focal Areas For Data Mining
The questions below tie back to the “Active Readers Produce Better Lists”, “Success Breeds Success”, and “Known Readers Make For Better Journal” loops:
• Given e-mail recipient attributes, what is the likelihood of a visit to the website?
• Which content headlines would maximize that visit likelihood?
• Given registered readers’ attributes, which stories will they be interested in?
• Given past stories read, what is a registered reader most likely to also read?
• Given registered readers’ attributes, which readers will be most active?
• Is TechJournal’s current content taxonomy effective, or would some other content taxonomy be more useful?
The Data
My “chunk of data” to mine:
• Issues Table: 713,110 records
• Issues-Content Linker Table: 2,185,664 records
• Content Items Table: 590 records
• Page Visit Table: 43,580 records
• Recipients Table: 195,455 records
• Taxonomy Click Table: 9,385 records
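To be mined together, these tables have to be joined into a single view. The sketch below shows one way that could look in pandas; it is only an illustration, and the table names, column names (recipient_id, content_id, etc.), and toy rows are assumptions rather than TechJournal’s actual schema.

```python
# Hypothetical sketch: join the source tables into one mining view with pandas.
# Column names are placeholders, not TechJournal's real schema.
import pandas as pd

# Tiny stand-ins for the real extracts (Page Visit, Recipients, Content Items)
visits = pd.DataFrame({"visit_id": [1, 2], "recipient_id": [10, 11], "content_id": [100, 101]})
recipients = pd.DataFrame({"recipient_id": [10, 11], "title_code": [3, 38], "loc_grp_id": [1, 2]})
content = pd.DataFrame({"content_id": [100, 101], "taxonomy_path": ["IT/Security", "IT/Storage"]})

# One row per page visit, enriched with reader attributes and content metadata
mining_view = (
    visits
    .merge(recipients, on="recipient_id", how="left")
    .merge(content, on="content_id", how="left")
)
print(mining_view)
```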
Attributes to Work With: features that can be used for classification directly, or from which classification features can be derived.
Creating Content Classes
• TechJournal’s current taxonomy for classifying content is manually derived, an aggregation of other credible taxonomy fragments, and built from a content-provider point of view
• It goes out to 21 levels in some cases; other branches are as shallow as three levels
• Classes per taxonomy level: level 1 has 1 class, level 2 has 5, level 3 has 46, level 4 has 798, level 5 has 1,909, and by level 21 there are over 5,000
• The working data set has 9,750 visits spread over 31 classes
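One way to get from a taxonomy this deep to a manageable number of target classes is to truncate each content item’s taxonomy path to a fixed depth. The snippet below is a minimal sketch of that idea; the slash-separated path format and example values are assumptions, not the taxonomy TechJournal actually uses.

```python
# Hypothetical sketch: collapse deep taxonomy paths to a fixed depth so the
# number of target classes stays manageable. Path format is illustrative.
def collapse_taxonomy(path: str, max_depth: int = 2) -> str:
    """Truncate a slash-separated taxonomy path to at most `max_depth` levels."""
    return "/".join(path.split("/")[:max_depth])

print(collapse_taxonomy("IT/Security/Firewalls/Vendor-X", max_depth=2))
# -> "IT/Security"
```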
A Variety of Approaches
PREDICTIVE MODELING
• Given e-mail recipient attributes, what is the likelihood of a visit to the website?
• Which content headline would maximize that visit likelihood?
• Given registered readers’ attributes, which readers will be most active?
• Given registered readers’ attributes, which types of content will they read?
CLUSTER ANALYSIS
• Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?
ASSOCIATION ANALYSIS
• Given past stories read, what is a registered reader most likely to also read? (a minimal co-read sketch follows below)
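As a rough illustration of the association-analysis idea, the sketch below computes pairwise “readers who read A also read B” confidences with pandas. The reader/story matrix and story names are made up for illustration; this is not TechJournal’s data or the method actually applied.

```python
# Hypothetical sketch of association analysis as simple co-read confidences.
import pandas as pd

# One row per registered reader, one boolean column per story read (toy data)
reads = pd.DataFrame(
    {
        "story_cloud":    [1, 1, 0, 1, 0],
        "story_security": [1, 1, 1, 0, 1],
        "story_storage":  [0, 1, 1, 0, 1],
    }
).astype(bool)

for a in reads.columns:
    for b in reads.columns:
        if a != b:
            support_a = reads[a].sum()                 # readers who read story a
            both = (reads[a] & reads[b]).sum()         # readers who read a and b
            confidence = both / support_a if support_a else 0.0
            print(f"read {a} -> read {b}: confidence {confidence:.2f}")
```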
Potential Issues
• Database evolution produces noisy, dirty, unevenly populated data
• Data comes from multiple sources, and producing consistent data has been a challenge
• It is still not clear whether we will end up with enough data to see anything meaningful
• The content taxonomy is relatively new and most likely has real problems with how it is structured
• The taxonomy measures article subject matter, but the behavior-stimulating content may be in the headlines
• Features are somewhat related to one another
• Features have a high number of discrete values and need to be put into meaningful groupings
• Several feature and class values are under-represented
Feature Grouping - Location
Raw location values are collapsed into 12 groups: regions 1 through 11 plus an “Other” bucket.
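A grouping like this usually reduces to a lookup table with a fallback bucket. The snippet below is a minimal sketch of that pattern; the region names and group numbers are placeholders, not TechJournal’s actual mapping.

```python
# Hypothetical sketch of the location grouping: many raw locations collapsed
# into 11 regional groups plus an "Other" bucket. Mapping values are made up.
LOCATION_GROUPS = {
    "US-Northeast": 1,
    "US-Southeast": 2,
    "US-Midwest": 3,
    # ... remaining regions 4 through 11 would come from the real mapping
}

def location_group(raw_location):
    """Return the region group id, or 'Other' for anything unmapped."""
    return LOCATION_GROUPS.get(raw_location, "Other")

print(location_group("US-Midwest"))   # -> 3
print(location_group("Antarctica"))   # -> 'Other'
```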
Feature Grouping - Title
• Start with ~1,000 distinct self-reported titles in the database
• Title is of most interest insofar as it correlates with impact and influence on IT buying decisions
• Titles are reclassified based on three concepts: seniority, function, and number of employees in the company
• Seniority runs from Owner/Chairman/CEO, through manager of managers and manager of doers, down to doer and assistant, crossed with functional areas
• Result: 24 categories
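The actual reclassification described above was manual, but the seniority part of it can be sketched as simple keyword matching on the free-text title. The keyword lists and numeric codes below are illustrative assumptions, not the project’s real rules.

```python
# Hypothetical sketch: map a free-text job title to a coarse seniority code.
def seniority_level(title: str) -> int:
    """Return an illustrative seniority code based on keyword matching."""
    t = title.lower()
    if any(k in t for k in ("owner", "chairman", "ceo", "president")):
        return 1  # Owner / Chairman / CEO
    if "assistant" in t:
        return 5  # Assistant (checked early so "Assistant Manager" lands here)
    if any(k in t for k in ("vp", "vice president", "director")):
        return 2  # Manager of managers
    if "manager" in t or "supervisor" in t:
        return 3  # Manager of doers
    return 4      # Doer

print(seniority_level("VP of Infrastructure"))   # -> 2
print(seniority_level("Network Administrator"))  # -> 4
```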
Where I Am In The Process
Problem Definition → Data Gathering → Data Prep → Data Mining → Results Analysis → Visualization → Sum Up Insights
First Results
Q: Given registered readers’ attributes, which readers will be most active?
Method: Decision Tree Induction – Training Set 599 Records, Test Set 187 Records
MSE on Training Set = .1313
MSE on Test Set = .1451
(Tree figure: example leaf nodes with predicted values 0.1429, n = 7 and 0.7037, n = 27)
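For readers who want to reproduce the shape of this experiment, the sketch below fits a regression tree and reports training and test MSE on the same 599/187 split sizes. The data is synthetic and the feature names are only borrowed from the output shown later; this is not the project’s actual model.

```python
# Hypothetical sketch: regression tree on reader attributes with train/test MSE.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((786, 3))   # placeholder attributes (e.g., LocGrpID, Title_Code, RIC)
y = rng.random(786)        # placeholder activity measure

# Hold out 187 records for testing, leaving 599 for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=187, random_state=0)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, tree.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```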
First Results
Q: Given the attributes of a registered reader, which content types will they read?
Method: Decision Tree Induction, n = 786
node), split, n, deviance, yval
* denotes terminal node
 1) root 786 223508.000 29.44402
   2) LocGrpID< 1.5 96 23784.990 24.01042
     4) RIC>=70.5 53 10433.890 19.66038 *
     5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
   3) LocGrpID>=1.5 690 196494.400 30.20000
     6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
     7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *
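The listing above appears to follow R’s rpart/tree print format: node number, split condition, record count, deviance, and fitted value, with * marking terminal nodes. For anyone working in Python instead, a comparable human-readable split listing can be produced with scikit-learn’s export_text; the sketch below runs on synthetic data and only illustrates the mechanics, not the project’s fitted tree.

```python
# Hypothetical sketch: print a readable split listing for a fitted tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.random((786, 3)) * 100   # placeholder attribute values
y = rng.random(786) * 60         # placeholder target

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["LocGrpID", "RIC", "Title_Code"]))
```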
First Results
Q: Given registered reader attributes, which types of content will they read?
Method: Kernel SVM with Gaussian Kernel
Overall Training Error = .569975
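The sketch below shows the general shape of such a model in scikit-learn: an SVM with a Gaussian (RBF) kernel, scored by overall training error. The synthetic data, 31-class target, and hyperparameters are assumptions for illustration, not the settings used in the project.

```python
# Hypothetical sketch: Gaussian-kernel SVM classifier with training error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((786, 3))             # placeholder reader attributes
y = rng.integers(0, 31, size=786)    # placeholder labels for 31 content classes

svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
training_error = 1.0 - svm.score(X, y)   # score() returns training accuracy here
print(f"overall training error: {training_error:.4f}")
```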
Defining Project Success
Success for this project could come in different forms:
• Insights gained on any of the six questions within the project’s scope; and/or
• Insight into how TechJournal should modify its data capture policies to facilitate data mining for the answers to these questions in the future