Giga-Mining Corinna Cortes and Daryl Pregibon AT&T Labs-Research Presented by: Kevin R. Gee 28 October 1999
Case Study • Statistical modeling • Processing of multi-GB databases • Data warehousing • Prediction and classification • User interfaces
Three Goals • Perform meaningful mining on multiple GB of data every day • Classify telephone numbers (TNs) as business or residential (pattern deviation, etc.) • Maintain operational data for each phone number
Quantity of data • 1997: 275 million phone calls per weekday -- total of 76 billion for the whole year • 65M unique TNs per weekday • 350M unique TNs over a 40-day period • “Universe list”: Set of all TNs observed on the network, each with a 7-byte profile
Contents of each profile • Inactivity -- number of days since TN used • Minutes of use -- average daily minutes TN is observed on network • Frequency -- estimated number of days between observing a TN • “Bizocity” -- Business-like behavior of TN • Stored for inbound/outbound, toll/toll-free
Calculation of each variable • Inactivity: Set to 0 if the TN is observed, incremented by 1 (Inactivity++) if not observed. • Other variables are calculated via an exponentially weighted average: • X(TN)new = λ X(TN)today + (1-λ) X(TN)old, 0 < λ < 1
Provides an estimate as a weighted sum of all previous daily values, where the weights decrease smoothly over time. The most recent day’s activity is weighted more heavily than activity from 2 weeks ago. The weight of activity from k days ago is w_k = λ(1-λ)^k. Old data is “aged out” as new data is “blended in”. λ is the aging factor.
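The update rules on these two slides can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the starting value of 0, and any value chosen for λ are assumptions made for the example. The `replay` helper applies the recursive update day by day; with a starting value of 0 it gives the same result as the explicit weighted sum with weights λ(1-λ)^k.

```python
from typing import Iterable


def update_inactivity(inactivity: int, observed_today: bool) -> int:
    """Reset to 0 when the TN is observed today, otherwise add one day."""
    return 0 if observed_today else inactivity + 1


def ewma_update(old: float, today: float, lam: float) -> float:
    """X_new = lam * X_today + (1 - lam) * X_old, with 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return lam * today + (1.0 - lam) * old


def ewma_weight(k: int, lam: float) -> float:
    """Weight given to the daily value observed k days ago."""
    return lam * (1.0 - lam) ** k


def replay(daily_values: Iterable[float], lam: float, start: float = 0.0) -> float:
    """Apply the recursive update to a sequence of daily values, oldest first.

    The starting value of 0 is an assumption for illustration.
    """
    estimate = start
    for value in daily_values:
        estimate = ewma_update(estimate, value, lam)
    return estimate
```

Only the previous estimate has to be stored for each TN, which is what lets a profile stay a few bytes long no matter how much history it summarizes.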
“Bizocity” • Concerns over whether a TN is residential or business. • Different operations for residences and businesses for customer care, billing, collections, fraud detection, etc.
“Bizocity” continued • AT&T has confirmed residential/business status for 30% of 350M TNs. • Incomplete data is due to lack of communication with local companies, additional lines, out of date information. • Behavioral estimate is generated by observing behavior of all 350M TNs, generating a bizocity score, and combining it with previous days’ totals.
Generating “Bizocity” • When a call completes, data such as originating TN, dialed TN, connect time, and call duration are recorded (note that callers are not identified, just phone numbers). • TNs with known biz/res status are flagged, and training sets are generated from them. • Noise and outliers are usually washed out by the sheer volume of data.
Generating “Bizocity” -- examples • Example: Long calls originating at night are usually residential, not business. • Example: Residential calls peak in the evening; business calls peak between 9am and 5pm. • Example: Business calls are generally shorter and tend to go to other businesses or to 800 services.
Processed every 24 hours • Provides better aggregate data for each TN • Reduces I/O by 75% • Have to store all call details and sort them. • Each call is reduced to a 32-byte binary record, resulting in 8GB daily. • Sorting takes 30 min. (3GB RAM, 1 processor)
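To make the 32-byte record and the daily sort concrete, here is a small sketch. The field layout is hypothetical: the talk gives only the record size, so the choice of fields, their widths, and the sort key (originating TN) are assumptions for illustration.

```python
import struct

# Hypothetical layout, NOT from the talk: originating TN (8 bytes),
# dialed TN (8 bytes), connect time in epoch seconds (4 bytes),
# duration in seconds (4 bytes), flags/padding (8 bytes).
CALL_RECORD = struct.Struct("<QQIIQ")
assert CALL_RECORD.size == 32


def pack_call(orig_tn: int, dialed_tn: int, connect_time: int,
              duration_s: int, flags: int = 0) -> bytes:
    """Reduce one call to a fixed-size 32-byte binary record."""
    return CALL_RECORD.pack(orig_tn, dialed_tn, connect_time, duration_s, flags)


def unpack_call(record: bytes) -> tuple:
    """Recover the five fields from a 32-byte record."""
    return CALL_RECORD.unpack(record)


def sort_by_origin(records: list) -> list:
    """Sort records by originating TN (assumed sort key)."""
    return sorted(records, key=lambda r: CALL_RECORD.unpack(r)[0])


def daily_volume_gb(calls_per_day: int, record_bytes: int = 32) -> float:
    """Raw size of one day's records in decimal gigabytes."""
    return calls_per_day * record_bytes / 1e9
```

With the 275 million weekday calls quoted earlier, `daily_volume_gb(275_000_000)` comes to 8.8, in line with the roughly 8GB per day on the slide.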
Processing -- continued • A 4-D data cube is generated • Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7x6x5x3) • Logistic regression models were previously developed for scoring TNs based on each profile (to estimate “Bizocity”) • Biz(TN)new = λ Biz(TN)today + (1-λ) Biz(TN)old, 0 < λ < 1
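A 7x6x5x3 cube has 630 cells, and each call falls into exactly one of them. The sketch below shows one way to bin calls into such a cube. The bin boundaries are assumptions (six 4-hour time blocks and the duration edges shown), since the talk gives only the number of bins per dimension; the function names are hypothetical as well.

```python
from bisect import bisect_right

DAYS, TIME_BINS, DURATION_BINS, STATUSES = 7, 6, 5, 3

# Assumed bin edges in seconds; the talk does not give the real ones.
DURATION_EDGES_S = (60, 300, 900, 3600)
STATUS_INDEX = {"biz": 0, "res": 1, "800": 2}


def cell_index(day_of_week: int, hour: int, duration_s: int, status: str) -> int:
    """Map one call to its cell in the flattened 7x6x5x3 cube."""
    t = hour // 4                                    # six 4-hour blocks (assumed)
    d = bisect_right(DURATION_EDGES_S, duration_s)   # 0..4
    s = STATUS_INDEX[status]
    return ((day_of_week * TIME_BINS + t) * DURATION_BINS + d) * STATUSES + s


def build_cube(calls) -> list:
    """Count calls per cell; each call is (day_of_week, hour, duration_s, status)."""
    cube = [0] * (DAYS * TIME_BINS * DURATION_BINS * STATUSES)
    for day_of_week, hour, duration_s, status in calls:
        cube[cell_index(day_of_week, hour, duration_s, status)] += 1
    return cube
```

The resulting counts would then serve as inputs to the scoring model; the model itself is not sketched here because its coefficients are not given.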
Processing -- continued • Training set is used to classify TNs with unknown status based on probabilities • Inactive TNs are not updated • “Bizocity” scores for unknown TNs are generated using probabilities
Accuracy • Predicted status is correct for 75% of TNs • Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones, etc.)
Data Structures • Exploit the “exchange” concept (1st 6 digits form an exchange) • Only about 150,000 of 1M exchanges are in use • All 10,000 TNs for each exchange are stored sequentially, whether used or not • Each data structure is 2GB for each variable (lower bound is 1.5GB)
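The exchange-based layout can be sketched as a flat byte array in which each active exchange owns a contiguous block of 10,000 slots. This is an illustration under assumptions: one byte per TN per variable (which reproduces the 1.5GB lower bound, since 150,000 x 10,000 bytes = 1.5GB), a dictionary for the exchange-to-slot lookup, and hypothetical class and method names.

```python
class ExchangeIndexedStore:
    """One byte per TN for a single variable, grouped by 6-digit exchange."""

    TNS_PER_EXCHANGE = 10_000

    def __init__(self, active_exchanges):
        # Assign each active exchange a block number (lookup method is assumed).
        self._slot = {ex: i for i, ex in enumerate(sorted(active_exchanges))}
        # All 10,000 line numbers are allocated, whether used or not.
        self._data = bytearray(len(self._slot) * self.TNS_PER_EXCHANGE)

    def _offset(self, tn: str) -> int:
        exchange, line = tn[:6], int(tn[6:])
        return self._slot[exchange] * self.TNS_PER_EXCHANGE + line

    def get(self, tn: str) -> int:
        return self._data[self._offset(tn)]

    def set(self, tn: str, value: int) -> None:
        self._data[self._offset(tn)] = value  # value must fit in one byte

    def size_bytes(self) -> int:
        return len(self._data)


def lower_bound_bytes(active_exchanges: int, bytes_per_tn: int = 1) -> int:
    """Storage for one variable when only active exchanges are allocated."""
    return active_exchanges * ExchangeIndexedStore.TNS_PER_EXCHANGE * bytes_per_tn
```

Because the offset is computed directly from the digits of the TN, no per-TN key needs to be stored, and a day's records sorted by TN touch the array sequentially.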
Interface • Variety of visualization tools (start at the top, then drill down) • Web interface with password protection • Images are computed on the fly • C code directly computes images in GIF format
Toll Fraud Detection • Same methodology, but event-driven • Only have to track about 15M TNs. • Profiles are about 512 bytes each (7.5GB)