Making Good Use of Data at Hand: Open Source Tools
Mark C. Cooke, Ph.D.
Tax Management Associates, Inc.
Overview
• Open Data concept – Data is produced for various purposes but can be used to derive novel insights, i.e., “Business Intelligence (BI)”
• Open Source tools exist for making good use of existing data sets
• ETL (“Extract, Transform, Load”) + Analytics
• Knime and the R language are two of the most powerful resources for leveraging data
Open Source Tools for Data Analysis Mark C Cooke - Tax Management Associates, Inc.
Open Data
• Open Data concept – governments collect, through existing management systems, enormous quantities of data that can be leveraged in alternative and novel ways to find solutions.
• The goal is often to leverage the broader community to develop solutions that governments may not have previously conceived.
• Open Data and Business Intelligence should be used by internal consumers as well.
Open Data
“Data Scientist”
Doing Data the Old Way
• Data is locked inside systems :-(
• Software systems are designed to wrap a Graphical User Interface (GUI) around data.
• The GUI functionality, historically, has to be programmed to produce reports, views, and analysis.
• The GUI is driven by the sole purpose of the software. But the data has many purposes…
Open Data – Way Forward
• Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shapefiles (ESRI), email archives, web scraping, APIs from social media, etc.
• Connecting data across multiple platforms
• Using data for novel insight
• Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages
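To make "connecting data across multiple platforms" concrete, here is a minimal sketch in Python using only the standard library. It is not Knime or R, and it is not the presenter's workflow. The three inline samples, the field names (account_id, owner, value, permit type), and the helper functions are all hypothetical, chosen only to show CSV, JSON, and XML sources being read and joined on a shared key.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for three separate source systems.
CSV_TEXT = "account_id,owner\n1001,Acme Tools\n1002,Bay Bakery\n"
JSON_TEXT = '[{"account_id": "1001", "value": 25000}, {"account_id": "1002", "value": 8000}]'
XML_TEXT = "<permits><permit account_id='1001' type='electrical'/></permits>"


def read_csv(text):
    """Index CSV rows by account_id."""
    return {row["account_id"]: dict(row) for row in csv.DictReader(io.StringIO(text))}


def read_json(text):
    """Index JSON records by account_id (coerced to a string key)."""
    return {str(rec["account_id"]): rec for rec in json.loads(text)}


def read_xml(text):
    """Index <permit> elements by their account_id attribute."""
    root = ET.fromstring(text)
    return {p.get("account_id"): {"permit_type": p.get("type")} for p in root.iter("permit")}


def merge_sources(*sources):
    """Combine per-source dictionaries into one record per account_id."""
    merged = {}
    for source in sources:
        for key, fields in source.items():
            merged.setdefault(key, {"account_id": key}).update(fields)
    return merged


records = merge_sources(read_csv(CSV_TEXT), read_json(JSON_TEXT), read_xml(XML_TEXT))
```

Each reader normalizes its source into the same shape (a dictionary keyed by account), so the merge step does not need to know where a field came from. Accounts missing from one source simply lack that source's fields.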
Open Data
• These systems are known as “data agnostic.”
• Database agnostic: Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.
• http://searchdatamanagement.techtarget.com/definition/database-agnostic
Data Science
• What is the breadth of the tool base?
• Reading in data from various resources
• Transforming data to merge various resources, translate data into a usable format, or add new data elements
• Analyzing data, from basic logical and statistical functions to higher-level machine learning tools and algorithms
“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.” http://en.wikipedia.org/wiki/Machine_learning
Data Science
• What is the output?
• “Business Intelligence,” or actionable information that drives business decisions through insight
• Creating new insights from existing data
• Visualizations - representation of that BI in ways that make it consumable to a non-specialist audience
“According to Friedman (2008) the ‘main goal of data visualization is to communicate information clearly and effectively through graphical means.’” http://en.wikipedia.org/wiki/Data_visualization
Knime is a GUI-based, data-agnostic tool for ETL, analytics, and visualization.
• Knime is an open source platform for the desktop, with commercial enterprise server layers that include collaboration tools and web services (web portal).
• Knime supports other analytics languages, including the R language for statistical computing
www.Knime.org
The advantages of Knime:
• Rapid development environment
• Powerful processing that handles large datasets on commodity hardware
• Allows for 100% data samples, up to millions of rows
• Workflows can be saved, shared, and duplicated
• Nodes are stepwise, allowing for quick revisions
• Nodes provide access to complex algorithms
What is Knime?
The Knime Workbench
Knime Nodes
• Nodes are the workers inside a workflow
• Every node serves at least one function
• Nodes can also be built as Meta-Nodes, which are a collection of nodes performing common functions
• A collection of nodes is called a “workflow”
• You can develop nodes with Java and the node development support
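The node, meta-node, and workflow ideas above can be illustrated by analogy in plain Python. This is only a conceptual sketch, not the Knime node API (real Knime nodes are developed in Java). The functions drop_incomplete, add_total, run_workflow, and make_meta_node, and the taxable/exempt fields, are hypothetical names introduced for illustration.

```python
from functools import reduce


def drop_incomplete(rows):
    """Node: keep only rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_total(rows):
    """Node: append a 'total' column computed from two existing columns."""
    return [{**r, "total": r["taxable"] + r["exempt"]} for r in rows]


def run_workflow(rows, nodes):
    """Workflow: pass the table through each node in order."""
    return reduce(lambda data, node: node(data), nodes, rows)


def make_meta_node(*nodes):
    """Meta-node: bundle several nodes so they can be reused as one step."""
    def meta_node(rows):
        return run_workflow(rows, nodes)
    return meta_node


table = [
    {"taxable": 100, "exempt": 20},
    {"taxable": None, "exempt": None},
    {"taxable": 50, "exempt": 0},
]
clean_and_total = make_meta_node(drop_incomplete, add_total)
result = run_workflow(table, [clean_and_total])
```

Because each step takes a table and returns a table, a step can be swapped or revised without touching the others, which is the property the stepwise node model relies on.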
Knime Nodes
• For example, the file reader node is an intelligent file reader that can determine the type of file
• However, it also allows for the end user to adjust parameters
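As a rough analogy for a reader that guesses the file layout but still lets the user override it, the sketch below uses csv.Sniffer from the Python standard library. It does not reproduce how Knime's file reader node works internally. The read_table function, its parameters, and the sample text are assumptions made for illustration.

```python
import csv
import io


def read_table(text, delimiter=None, candidates=",;\t|"):
    """Read delimited text, guessing the delimiter unless one is supplied."""
    if delimiter is None:
        # Heuristic detection, restricted to a set of plausible delimiters.
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
        delimiter = dialect.delimiter
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return delimiter, list(reader)


SAMPLE = "id;name;value\n1;Acme;10\n2;Bay;20\n"

# Automatic detection.
detected, rows = read_table(SAMPLE)

# Manual override: forcing a comma leaves each line as a single column.
_, forced = read_table(SAMPLE, delimiter=",")
```

Detection of this kind is a heuristic and can guess wrong on small or irregular samples, which is why exposing the override parameter matters.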
Knime Nodes
• The Column Filter node allows users to filter columns from a table (conveniently named…)
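What a column filter does to a table can be shown in a few lines of Python. This is a sketch of the operation only, not the Knime node itself. The column_filter function, its include/exclude parameters, and the sample columns are hypothetical.

```python
def column_filter(rows, include=None, exclude=None):
    """Keep only the `include` columns, or drop the `exclude` columns."""
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    if include is not None:
        return [{k: row[k] for k in include if k in row} for row in rows]
    excluded = set(exclude or ())
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


table = [
    {"account_id": "1001", "owner": "Acme Tools", "phone": "555-0100"},
    {"account_id": "1002", "owner": "Bay Bakery", "phone": "555-0101"},
]
kept = column_filter(table, include=["account_id", "owner"])
dropped = column_filter(table, exclude=["phone"])
```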
Knime Nodes (sample)
Knime Integrates with R
• R integration is key to expanding the data analysis and visualization capabilities of Knime
• R supports data ingestion of complex files (including ESRI shapefiles)
• R supports complex data manipulation and statistical analysis
• R supports a wide variety of highly customizable visualizations
• So, what is R, exactly?
R Project for Statistical Computing
www.r-project.org
• R is an open source scripting language that can be run inside Knime or independently from the command line
• Several GUIs for R exist, such as RStudio, from a group that provides software for using R as well as training and extension packages (www.rstudio.com)
• Community contributions make up the bulk of R packages, which now total more than 4,700
R Project for Statistical Computing
www.r-project.org
• The R base package (standard software) provides methods for reading data, ETL, analysis, and visualization
• The community-provided packages build on this base according to the interests of their authors
• Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)
Popular R Packages
• A (very) brief overview of popular packages:
• plyr – for advanced data manipulation
• maps – for mapping datasets onto georeferenced outputs
• ggplot2 – for advanced data visualizations
• RCurl – for reading data from webpages and repositories
• tm – for text mining applications
• sna – for social network analysis
R Inside Knime
Basic Data Manipulation:
R Inside Knime
Basic Visual using Maps:
Knime + R + TPP
• Case examples for working with tangible personal property (TPP):
• Look at the distribution of TPP accounts across a county, state, or region
• Map entities or create a heatmap (choropleth) of the distribution of personal property values
• Compare personal property reporting across schedules and industry sectors (e.g., machinery and equipment across manufacturing types)
• Compare like-kind entity reporting (franchises, big-box) for consistency in values
• Compare personal property accounts with other data resources (real property accounts, permits, etc.)
Brief Demonstration
• Data:
• Florida
• 67 Counties
• More than 1.24 million personal property accounts
• Goals:
• Group all data by industry to illustrate the taxable value and exempted value by type
• Subset the data to include only a particular industry
• Map the state-wide exempt value in a choropleth
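The three demonstration goals can be sketched in plain Python as a stand-in for the Knime and R workflow shown in the talk. The four account rows and their values are invented for illustration and are not the Florida data. The field names and the totals_by helper are likewise hypothetical. The last step only prepares the per-county figures a choropleth would be colored by; drawing the map itself is left to a mapping package.

```python
from collections import defaultdict

# Invented sample rows; real input would be the personal property accounts.
accounts = [
    {"county": "Alachua", "industry": "Retail", "taxable_value": 120000, "exempt_value": 25000},
    {"county": "Alachua", "industry": "Manufacturing", "taxable_value": 900000, "exempt_value": 25000},
    {"county": "Duval", "industry": "Retail", "taxable_value": 60000, "exempt_value": 25000},
    {"county": "Duval", "industry": "Retail", "taxable_value": 0, "exempt_value": 18000},
]


def totals_by(rows, key):
    """Sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable_value": 0, "exempt_value": 0})
    for row in rows:
        totals[row[key]]["taxable_value"] += row["taxable_value"]
        totals[row[key]]["exempt_value"] += row["exempt_value"]
    return dict(totals)


# Goal 1: taxable and exempt value grouped by industry.
by_industry = totals_by(accounts, "industry")

# Goal 2: subset to a single industry.
retail_only = [r for r in accounts if r["industry"] == "Retail"]

# Goal 3: exempt value per county, the quantity a choropleth would shade.
exempt_by_county = {
    county: totals["exempt_value"]
    for county, totals in totals_by(accounts, "county").items()
}
```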
Questions?
Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the information below.
Mark C Cooke
Mark.Cooke@tma1.com
704.847.1234 (office)
704.953.6349 (cell)
www.linkedin.com/in/markccooke