Techniques for Collecting and Harmonizing External Data

Techniques for Collecting and Harmonizing External Data, by Neil Hepburn. Speaker bio: Data Architect for Call Genie Inc. Has worked in telecom for the past 7 years, and in IS/IT for the past 15 years. Is GM of marketing for TUN3R.com, an Internet radio aggregator.

Presentation Transcript


  1. Techniques for Collecting and Harmonizing External Data By Neil Hepburn

  2. Speaker Bio • Data Architect for Call Genie Inc. • Has worked in telecom for the past 7 years, and in IS/IT for the past 15 years • Is GM of marketing for TUN3R.com, an Internet radio aggregator • Does part-time consulting, including workshops on external data collection, for the consultancy Hepburn Data Inc.

  3. Overview • Fact-based decision making has begun to capture the popular imagination – beyond the traditional data warehouse. The demand for fact-based approaches is real • The first goal of this presentation is to trace the lineage of the fact-based management approach, and point out the pitfalls when it is used the wrong way • Members of IRMAC and DAMA live in a fact-based world. However, the data is mainly confined to data produced internally. • External data is collected and managed inconsistently. • Internal BI and external market data are the respective “peanut butter” and “chocolate” of market analytics, but the greatest value comes when they are blended together • My second goal is to explain why the DW/BI group should play a central role, and form closer bonds with marketing • The techniques external marketing firms employ can be more easily adopted internally than vice versa • My third goal is to provide pragmatic and concrete guidance that marketing and the BI/DW group can act on in collaboration • The bulk of this presentation will be about this

  4. Presentation Roadmap • Modern history of analytics-driven enterprises • Rationale for internalizing external data collection • What consumer marketing departments usually look for: • Competitive Intelligence explained • Location Intelligence explained • Householding explained • The ethics of external data collection • First steps: Integrating the Canadian Census and Elections Canada data • Locating and evaluating data sets for purchase • Primary Field Research • Web Harvesting • Integrating external data sets to derive new insights • First and Last Name analysis • Lessons learned from The Whiz Kids

  5. Background & History Part 1 • The current wave of “Cultures of Analytics” has begun to capture the popular imagination. In the last three years we have seen the following books released: • The Numerati (by Stephen Baker) • Competing on Analytics (by Thomas Davenport & Jeanne Harris) • Super Crunchers (by Ian Ayres) • Data Driven: Profiting from Your Most Important Business Asset (by Thomas C. Redman) • Much of the inspiration behind these books originates from “Moneyball: The Art of Winning an Unfair Game” (by Michael M. Lewis), which documents the success of the Oakland A’s through “Sabermetrics” – taking an analytical approach to team picks and real-time game strategy • It’s all good stuff, but really nothing new…

  6. Part 2 The Whiz Kids • In the early forties ‘Tex’ Thornton recruited Jack Reith, Robert McNamara and 7 other “Whiz Kids” from the Harvard Business School to form a new department within the US Air Force. • This department was called Statistical Control, and its mandate was to base all management decisions on numbers, eschewing emotion and intuition • The Whiz Kids quickly saved the Air Force billions of dollars. • They moved on as a group to work for The Ford Motor Co., with mixed results • We will return to this last point at the end of this presentation…

  7. Why, What, and Who? External Data for Marketing • External data is collected primarily to support Sales & Marketing decisions in the following subject areas: • Advertising • What communities should I advertise in? • What media should I use (e.g. television, Internet, print, radio, direct mail, out-of-home, etc.) and how should I be targeting within each medium? • Properly managed external data sets can significantly cut down on “spray and pray” approaches to advertising • Pricing • How much are my competitors charging? • How have their prices changed over time? • How do I assign value to individual product attributes? • Competitive Positions • How do I stack up against my competitors across… • Product Pricing (as discussed already) • Retail/Branch/Kiosk locations • Customer Service (e.g. average pick-up time, IVR complexity, problem resolution times, etc.) • Market Insights. For example: • Where are my early adopters? • How am I performing across gender/age/ethnic segments? • Store openings and closings (i.e. retail network optimization) • This presentation is primarily focused on data to support Sales & Marketing initiatives

  8. Why, What, and Who? External Data for Risk Awareness • External data is heavily used in finance and insurance for risk awareness. The most common examples include: • Mortgage Limits, Approvals, Property Assessments • Loan Limits, Approvals, and Rates • Insurance Premiums • Other common forms of risk awareness include: • Background checks (e.g. for new hires) • Store/branch location risk • Risk of natural disaster. Typically flooding, but also hurricanes, avalanches, and earthquakes • Risk of crime and theft, including warehouse or retail inventory shrinkage • Since these external data and usages are mature within the finance and insurance industry, this presentation does not spend as much time on risk awareness

  9. Why, What, and Who? Practically Anybody • Many companies rely on upstream data sources as part of their business. Most organizations (with the exception of the government) are not transparent about their data collection methods and operations • Evaluating upstream data quality is not difficult and will often provide results that may surprise (and even explain quite a lot) • Decision makers are apt to make decisions with the information they have, and make “reasonable assumptions” where there are gaps. • The more information you can access to close off these assumptions, the better. The old saw about “assuming” is as true as ever

  10. Rationale for Internalizing External Data Collection Part 1 • There exists a chasm between internal domain knowledge and external data set knowledge • Marketers may realize this, but are not equipped to deal with it. • They need to support decisions with market facts, and will not wait for IT to validate those facts • Warning warning! Danger danger! • External vendors selling raw or packaged data sets are geared towards selling to marketing departments • The average marketer does not know how to evaluate data quality • Many data products are not transparent, and come with many “gotchas”, especially when analyzing micro-markets. • It is easier to comprehend both data sets going from the internal to the external than the other way around. • You probably have a better understanding of StatCan data than StatCan has of your data. • The juiciest stories come when we derive new facts through the “JOINing” of existing facts • Derived facts often provide the most valuable insights. • E.g. identify customer characteristics with respect to regional income levels and property values. Very interesting stories will quickly surface.
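
The “JOINing” idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's method: the customer records, the income figures, the `threshold` value and the function name `spend_by_income_band` are all hypothetical, and the join key is assumed to be a three-character regional code already attached to each customer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical internal records: (customer_id, region_code, monthly_spend)
customers = [
    ("C001", "M5V", 120.0),
    ("C002", "M5V", 80.0),
    ("C003", "K1A", 45.0),
    ("C004", "K1A", 55.0),
]

# Hypothetical external data set: median household income by region (made-up figures)
income_by_region = {"M5V": 98000, "K1A": 72000}


def spend_by_income_band(customers, income_by_region, threshold=85000):
    """Join customers to regional income, then average spend per income band."""
    bands = defaultdict(list)
    for _customer_id, region, spend in customers:
        income = income_by_region.get(region)
        if income is None:
            band = "unknown"  # keep unmatched customers visible rather than dropping them
        elif income >= threshold:
            band = "higher income"
        else:
            band = "lower income"
        bands[band].append(spend)
    return {band: mean(values) for band, values in bands.items()}


derived = spend_by_income_band(customers, income_by_region)
print(derived)
```

The derived fact (average spend per regional income band) exists in neither data set on its own, which is the point the slide makes about joined facts.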

  11. Rationale for Internalizing External Data Collection Part 2 • By creating internal programmes you can be more agile at collecting data, especially during crucial times of the year and crucial events. • You can react instantaneously to business information needs • You are in a better position to quantify and measure Competitive Intelligence (as opposed to ad hoc reports) • It is often cheaper to collect the data on your own • It is a nice break from routine for many employees • Going outside to collect information can be surprisingly enjoyable • Many people perceive external data collection as being risky, when in fact it is not. Naturally: • Some people will resist external data collection, often due to perceived ethical concerns • Other people will appreciate the learning opportunity • It is recommended to check with HR first to understand insurance and WSIB risks

  12. Competitive Intelligence explained • Competitive Intelligence (CI) is the legal and ethical practice of obtaining public domain information about one’s competitors • Competitive Intelligence can be either: • Qualitative • Is the most common form of CI, but its value degrades quickly over time • E.g. press releases, Google Alerts • OR Quantitative • Methodically collected and managed, time seriesed data increases in value over time, but is less common • E.g. competitor pricing across product lines
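
As a rough sketch of what “time seriesed” quantitative CI might look like in practice, the Python below stores dated competitor price observations and derives the percentage change between consecutive observations. The competitor name, product, dates and prices are all made up for illustration, and `price_changes` is a hypothetical helper, not something from the presentation.

```python
from datetime import date

# Hypothetical observations: (competitor, product, observed_on, price)
observations = [
    ("CompetitorA", "Basic plan", date(2009, 4, 6), 33.00),
    ("CompetitorA", "Basic plan", date(2009, 1, 5), 30.00),
    ("CompetitorA", "Basic plan", date(2009, 7, 6), 29.70),
]


def price_changes(observations):
    """Return rows in date order per (competitor, product), with % change vs. the prior row."""
    rows = []
    last_price = {}
    ordered = sorted(observations, key=lambda o: (o[0], o[1], o[2]))
    for competitor, product, observed_on, price in ordered:
        key = (competitor, product)
        previous = last_price.get(key)
        # The first observation in a series has nothing to compare against.
        pct = None if previous is None else round((price - previous) / previous * 100, 1)
        rows.append((competitor, product, observed_on, price, pct))
        last_price[key] = price
    return rows


changes = price_changes(observations)
for row in changes:
    print(row)
```

Each new observation extends the series rather than overwriting the last value, which is what lets the data grow in value over time.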

  13. Location Intelligence explained • Location Intelligence (LI) is the discipline of managing regional attributes • Similar to geomatics, but different in that LI is not solely focused on physical attributes, but rather on all attributes. • The most granular unit of LI tends to be a single parcel of land (i.e. an address), but the unit can be as large as a province or country • Most LI tends to be focussed on the address level (householding), but these other units are commonly used: • Postal code (recommended) • Forward Sortation Area [FSA] (the first three characters of the postal code) • Dissemination Area [DA] (the most granular level of the Census where qualitative attributes are revealed) • Block (the most granular level of the Census, where only population figures are revealed)
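
Since the postal code and its first three characters (the FSA) are the suggested working units, a small normalisation helper is usually the first thing needed. The sketch below is one possible approach, not a reference implementation: the function names are hypothetical, and the pattern only checks the general letter-digit shape of a Canadian postal code (excluding the letters Canada Post does not use), not whether the code has actually been issued.

```python
import re

# Shape check only: letter-digit-letter digit-letter-digit.
# D, F, I, O, Q and U are never used; W and Z are not used in the first position.
_POSTAL_RE = re.compile(
    r"^[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z][0-9][ABCEGHJ-NPRSTV-Z][0-9]$"
)


def normalize_postal_code(raw):
    """Return the postal code as 'A1A 1A1', or None if it does not have a valid shape."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not _POSTAL_RE.match(compact):
        return None
    return compact[:3] + " " + compact[3:]


def fsa(raw):
    """Return the FSA (first three characters), or None for an invalid postal code."""
    normalized = normalize_postal_code(raw)
    return None if normalized is None else normalized[:3]


print(normalize_postal_code("m5v2t6"))
print(fsa("K1A-0B1"))
print(normalize_postal_code("D1A 1A1"))
```

Returning `None` for malformed input, rather than guessing, keeps bad codes from silently joining to the wrong region.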

  14. Householding • The most granular level of LI data, and the most valuable • Most householding initiatives are focussed on data quality. • E.g. cleaning up addresses, removing duplicates, and confirming occupants’ identities (e.g. Trillium Software, IBM InfoSphere QualityStage, etc.) • Specific household details can be had, but be careful…

  15. The ethics of data collection • General rules of thumb: • Only utilize public domain information • Ask yourself if you would accept your competitor doing the same • For web sites, read the terms of service. Most web sites do not prohibit web harvesting • Most call centres record your call already • Tell the person or computer that you are recording the call too. • StatCan policies (be aware of these rules if you ever publish or externally exchange information): • Rule of three • Round by fives • Some useful resources • Personal Information Protection and Electronic Documents Act (PIPEDA) • Society of Competitive Intelligence Professionals (SCIP)

  16. First steps: Integrating the Canadian Census and Elections Canada data • The Canadian Census is probably the most valuable external data set you are likely to find for Customer Segmentation. It provides • Population figures • Income levels, and earning population • Age, Gender, and Ethnic population counts • Elections Canada provides party voting counts by riding, as well as party contributions • The Census and Elections Canada data each have a Postal Code Conversion File (PCCF), which can be purchased from StatCan • Once you have integrated the Census, other StatCan data sets (e.g. Uniform Crime Reporting Survey or General Social Survey) can be integrated using the same Census <-> Postal Code mapping table • StatCan data is transparent, as they document: sourcing methodology, the original questionnaire, gaps in data, sources of error, and other data quality indicators • First Nations data (i.e. data pertaining to reserves) is of poor quality; this is currently being addressed through the First Nations Statistical Institute • Many vendors sell data sets derived largely from the Census. Beware, as these vendors tend not to be as transparent as StatCan, and often you would be better off with the original Census

  17. Locating and Evaluating Data Sets for Purchase • When attempting to locate a new data set, ask yourself the question: “Who is in a position to acquire these data?” • Many data sets cannot be purchased, but can be obtained through sharing. Some loyalty programs have been known to do this. • Some data sets can only be obtained by certain types of businesses (e.g. Ontario Land Registry data) • Questions to ask about new data: • How and when was the data sourced? • What are the data definitions? • What are the sources of error? • What are the gaps in data? • What distribution model does the data align to? E.g. gamma-Poisson distribution. • Evaluating new data: • Always obtain sample data • If possible, validate data against your internal data • If possible, obtain a sample that is large enough to be representative of the whole within +/- 5% nineteen times out of twenty • You may want to tap a statistician or actuary to help you here
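
For the “+/- 5% nineteen times out of twenty” target, a common back-of-envelope calculation is the standard sample-size formula for estimating a proportion, n = z² p(1 − p) / e², with z ≈ 1.96 for 95% confidence and the worst case p = 0.5. The sketch below assumes simple random sampling, which a vendor's sample may not satisfy; the function name and the optional finite-population correction are additions for illustration, and a statistician should still review anything that matters.

```python
import math


def sample_size(margin=0.05, z=1.96, proportion=0.5, population=None):
    """Minimum simple-random-sample size for estimating a proportion.

    margin:     half-width of the interval (0.05 means +/- 5%)
    z:          z-score for the confidence level (1.96 is roughly 19 times out of 20)
    proportion: assumed true proportion; 0.5 is the most conservative choice
    population: if given, apply the finite population correction
    """
    n = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size())                   # large or unknown population
print(sample_size(population=10000))   # data set of 10,000 records
```

With these defaults the required sample is 385 records for a very large population, and 370 when the whole data set holds 10,000 records.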

  18. Primary Field Research • Can be qualitative (i.e. focus groups). This can be tricky as it’s often difficult and expensive to build the right focus group, and it requires some experience to run the group and interpret the results. • Often best to outsource Focus Group research, since finding the right mix of candidates requires maintaining a network of contacts with the right mix of demographic attributes (e.g. a middle income teenager, an upper income woman, etc.) • Quantitative approaches are easier to run, and require less experience. Plus the data can be time seriesed for trend analysis • Areas of focus often include: • Survey Polling. Either in person, over the phone, or through the web (e.g. Facebook polls) • Retail trade area traffic counts • Call Centre responses • Very little practice is required, and once trained up, you will have a newfound agility when obtaining concrete answers to tough questions

  19. Web Harvesting • Fastest growing area of external data collection. A virtual cottage industry already exists… • You can easily outsource this to a company like fetch.com, but there are drawbacks to this • Doing it in-house gives you greater control over data quality and the data model, and costs go down over time • Many tools already exist to greatly assist in harvesting • As with any software development platform, the “quick and dirty” use of the tool is unsustainable, so you’re best to use something that can be controlled by a programming language (e.g. a COM component) • Some newer sites can be challenging due to their reliance on the Document Object Model [DOM] (as opposed to explicit HTML), and may require more sophisticated tools to interrogate the DOM. • However, the real challenge is figuring out what to harvest. Here are some suggestions to consider: • Competitor pricing • Competitor branch/store locations • Competitor hiring • Competitor press releases • Data grows in value if it is well structured (normalized), and time seriesed
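
To make the “controlled by a programming language” point concrete, here is a minimal harvesting sketch that uses only Python's standard library. It is an illustration under several assumptions: the HTML snippet stands in for a page you are permitted to harvest, the `name` and `price` class names are invented, and real pages (especially DOM-heavy ones) will usually need more robust tooling. Each harvested row is stamped with the observation date so the result can be appended to a time series.

```python
from datetime import date
from html.parser import HTMLParser

# Hypothetical snippet standing in for a downloaded competitor pricing page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Basic plan</span><span class="price">$30.00</span></li>
  <li class="product"><span class="name">Premium plan</span><span class="price">$55.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect {'name': ..., 'price': ...} pairs from <span class="name|price"> tags."""

    def __init__(self):
        super().__init__()
        self._field = None      # which field the next text chunk belongs to
        self._current = {}
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the spans we care about
        self._current[self._field] = data.strip()
        self._field = None
        if "name" in self._current and "price" in self._current:
            self.products.append(self._current)
            self._current = {}


def harvest(html, competitor, observed_on):
    """Return normalized rows: (competitor, product, ISO date, price as float)."""
    parser = PriceParser()
    parser.feed(html)
    return [
        (competitor, p["name"], observed_on.isoformat(), float(p["price"].lstrip("$")))
        for p in parser.products
    ]


rows = harvest(SAMPLE_HTML, "CompetitorA", date(2009, 6, 1))
for row in rows:
    print(row)
```

Keeping the parsing rules in code, rather than in a one-off tool session, is what makes the harvest repeatable on a schedule.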

  20. Integrating External Data Sets • For Location Intelligence, your best bet is to align data to a postal code or FSA level • It is highly recommended to purchase the postal code <-> census conversion file and the postal code <-> riding conversion file. Be aware that there are imperfections in these files, so if you are working at a municipal level, these files should be scrutinized • For householding, data can be aligned by either: • Civic address • E-mail address • Telephone number
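
Aligning householding data by e-mail address or telephone number only works if both sides are normalized the same way first. The sketch below shows one simple way to do that; the record layout, the sample values and the helper names are hypothetical, and the phone rule assumes ten-digit North American numbers with an optional leading 1.

```python
import re
from collections import defaultdict


def email_key(email):
    """Lower-case and trim so 'Pat@Example.com ' matches 'pat@example.com'."""
    return email.strip().lower()


def phone_key(phone):
    """Keep digits only and drop a leading country code 1 from 11-digit numbers."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def group_by_key(records, field, key_func):
    """Return {normalized key: [record ids]} for keys shared by more than one record."""
    groups = defaultdict(list)
    for record in records:
        value = record.get(field)
        if value:
            groups[key_func(value)].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical internal and external records describing the same household.
records = [
    {"id": "int-1", "phone": "(416) 555-0100", "email": "Pat@Example.com "},
    {"id": "ext-9", "phone": "1-416-555-0100", "email": "pat@example.com"},
    {"id": "ext-10", "phone": "613 555 0199"},
]

phone_matches = group_by_key(records, "phone", phone_key)
email_matches = group_by_key(records, "email", email_key)
print(phone_matches)
print(email_matches)
```

Civic addresses are harder to normalize than e-mail addresses or phone numbers, which is one reason dedicated address-cleansing tools exist.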

  21. First and Last Name proxy analysis • If you have a large enough customer base, it is possible to achieve sensitive insights into your business through name analysis. • Certain first names, last names, and first name/last name combinations are highly correlated with the following: • Age group • Gender • Ethnicity, including the distinction of • Immigrant • Born in Canada • Religion • Language • It is possible to approach householding using a combination of name analysis and other attributes (e.g. the census profile of the region the household belongs to). • If you are going to take this approach to householding, be sure that a hit has a positive impact, and a miss has little or no impact. • E.g. attaching the appropriate language(s) to a bill or statement insert is an excellent opportunity to connect with non-native English speakers • Another way to extract value from name analysis is to take the most highly correlated names and use them as proxies for customer segments. With a large enough customer base this can yield some very powerful insights

  22. Lessons Learned from The Whiz Kids • The Whiz Kids’ careers ended on many sour notes • At Ford, they were fixated on driving costs down, and neglected to invest in production and innovation. • While not a direct attack, Theodore Levitt’s famous “Marketing Myopia” is as true now as it was back in 1960 when it was first published in the Harvard Business Review. • Robert McNamara went on to “architect” the Vietnam War. While he knew (from his own numbers) that it was militarily unwinnable as early as 1965, McNamara put too much focus on kill ratios and neglected to learn about the social dynamics in Asia, which resulted in the war dragging on for nearly a decade longer. • He also reported to Lyndon B. Johnson, who made it clear that he was not going to be the first US president to surrender in a war. • Tex Thornton had a falling out with Henry Ford II. • Again, politics and relationships can trump results • Jack Reith became a huge advocate of Mergers and Acquisitions and coined the term “synergies” • Beyond the consolidation of HR departments, he failed to achieve the “synergies” he had hoped for • He saw in the numbers what he wanted to see, and took huge risks which failed • He was also behind a couple of failed Ford cars (i.e. Mercury Comet and Ford Edsel) • The morals of the story, from Neil’s perspective: • Numbers are better than no numbers, and we should try to get as many as we can to support key decisions • Facts live in history. How facts are interpreted to predict the future is highly subjective • Knowing what questions to ask should always be central to decision making. • There is no substitute for the judgement skills that come with experience. • Do not underestimate the power of politics
