Data Mining Status and Risks Dr. Gregory Newby UNC-Chapel Hill http://ils.unc.edu/gbnewby
Overview • What are data mining and related concepts? • Fundamentals of the science and practice of data mining • What data sources are available? • Causality and correlation • Risks of data mining • Future moves
Data Mining • “An information extraction activity whose goal is to discover hidden facts contained in databases. …[D]ata mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.” (Via http://www.twocrows.com/glossary.htm)
Data Mining • Is: Seeking new information from relations among data, possibly from different sources • Is: An important area of academic, corporate and government research • Is: Important from a security standpoint, because data mining might yield emergent information that would otherwise remain unknown
The Bigger Picture • Information retrieval • Data mining • Data fusion • The Data Universe (all data, all sources)
The Data Universe • All data • All topics • All sources • Numeric, textual • Discrete, longitudinal • Lots and lots of data! • The data universe is growing constantly, and many new data sources are being created as a result of security concerns & technological progress
Challenges of the Data Universe • Scale: too much data to deal with • Format: many different formats which are difficult to merge or query • Access: most data (over 90%?) are not Web-accessible • Databases • Proprietary or internal data • Formatting problems or issues
Solutions • Figure out how to get data from one format to another. Standards such as XML and EDI help • Develop cooperative relationships among data holders for data exchange. This is happening much more in government • Develop tools to identify relationships among data. This is the focus of data mining
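The first solution above, moving data from one format to another, can be sketched in a few lines. This is only an illustration under stated assumptions: the XML record, its element names (`records`, `person`, `name`, `city`) and the function `xml_to_rows` are hypothetical and do not come from the talk. It uses only the Python standard library to flatten an XML document into rows and re-emit them as JSON.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML record; the element and attribute names are illustrative only.
XML_RECORD = """
<records>
  <person id="1"><name>Ann Example</name><city>Chapel Hill</city></person>
  <person id="2"><name>Bob Sample</name><city>Fairbanks</city></person>
</records>
""".strip()


def xml_to_rows(xml_text):
    """Flatten each <person> element into a plain dict (one row per person)."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root.findall("person"):
        row = {"id": person.get("id")}
        for child in person:
            # Use the child element's tag as the column name.
            row[child.tag] = child.text
        rows.append(row)
    return rows


rows = xml_to_rows(XML_RECORD)
json_text = json.dumps(rows, indent=2)  # same data, now in JSON
print(json_text)
```

Once two sources have been flattened into the same row shape, merging or querying them becomes a much smaller problem than reconciling the original formats directly.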
Data Mining != Web Searching • On the Web, we’re doing high-precision information retrieval • We want the first-ranked documents to be relevant • We don’t want to see irrelevant documents • The data universe for Web search engines is vast, so there are usually many relevant documents to choose from; that makes this a relatively straightforward problem (though a big engineering challenge!)
Data Mining != Web Searching • Data mining is all about recall, not precision • Recall means we find all the relevant documents, regardless of how many irrelevant documents come back with them • This is a tougher problem, since the set of responses to a given inquiry can be huge • It’s also tougher because of data formats, data merging, access, etc. • The data miner’s goal is to set a threshold above which relationships are “interesting” • Data miners can also search for particular patterns, e.g. those related to an individual or group
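The precision/recall contrast on the two slides above can be made concrete with a small calculation. This is a minimal sketch; the document IDs and the function name `precision_recall` are made up for illustration. Precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}

# Web-search style: a short result list with nothing irrelevant in it.
web_style = ["d1", "d2"]

# Data-mining style: cast a wide net so that nothing relevant is missed.
mining_style = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

web_p, web_r = precision_recall(web_style, relevant)
mine_p, mine_r = precision_recall(mining_style, relevant)
print("web search:  precision", web_p, "recall", web_r)
print("data mining: precision", mine_p, "recall", mine_r)
```

In this toy case the short list is all relevant but misses half of what exists, while the wide net finds everything at the cost of returning as many irrelevant items as relevant ones.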
Today • Law enforcement, industry and government are making their data sources more open to each other (these data sources are not generally publicly available) • Data integrity issues are a major concern • Data mining is still tough. “False positive” relationships are easy to turn up • Correlation vs. causality • Seek and ye shall find • Lots of data yields lots of matches
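The point that lots of data yields lots of matches can be illustrated with a small simulation. This is a sketch under stated assumptions, not a method from the talk: the data are pure random noise, and the sample size, the "interesting" threshold of 0.5 and the function names are arbitrary choices. It counts how many pairs of unrelated variables still clear the threshold by chance.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def count_chance_matches(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count variable pairs whose |correlation| clears the threshold.

    Every variable is independent random noise, so each "match"
    counted here is a false positive by construction.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for a, b in combinations(columns, 2)
        if abs(pearson(a, b)) >= threshold
    )


for n_vars in (10, 50, 200):
    n_pairs = n_vars * (n_vars - 1) // 2
    print(n_vars, "variables,", n_pairs, "pairs,",
          count_chance_matches(n_vars), "chance matches")
```

Because the number of pairs grows roughly with the square of the number of variables, the count of chance matches grows quickly too, even though no real relationship exists anywhere in the data.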
Today’s Data Sources • Credit and other financials • Law enforcement records • Travel history • Health data • Whatever you put on the Internet If you are targeted: • Wiretap data (’net, phone, etc.) • Surveillance data • HUMINT, etc., etc.
Tomorrow • Decreased barriers among different data sources (this is a main impact of PATRIOT, but more is coming) • Increased data collection (via PATRIOT plus technological trends) • Better tools for data mining, and new technologies making data sharing and integration easier
Contact Info • Greg Newby is moving from UNC to UAF • New position: • Research Faculty at the Arctic Region Supercomputing Center, University of Alaska, Fairbanks • newby@arsc.edu