Analyzing data with python. Sarah Guido @ sarah_guido Reonomy OSCON 2014. About me. Data scientist at Reonomy University of Michigan graduate NYC Python organizer PyGotham organizer. About this talk. Bird’s-eye overview: not comprehensive explanation of these tools!
Analyzing data with python Sarah Guido @sarah_guido Reonomy OSCON 2014
About me • Data scientist at Reonomy • University of Michigan graduate • NYC Python organizer • PyGotham organizer
About this talk • Bird’s-eye overview: not a comprehensive explanation of these tools! • Take data from start to finish • Preprocessing: Pandas • Analysis: scikit-learn • Analysis: NLTK • Data pipeline: MRJob • Visualization: matplotlib • What next?
Why python? • So many tools • Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability • Community support • “Easy” language to learn • Both a scripting and production-ready language
From point A to point…x? • How to find the best tool(s)? • The 90/10 rule • Simple is better than complex
Why I chose these tools • Available resources • Documentation, tutorials, books, videos • Ease of use (with a grain of salt) • Community support and continuous development • Widely used
Preprocessing • The importance of data preprocessing • AKA wrangling, munging, manipulating, and so on • Preprocessing is also getting to know your data • Missing values? Categorical/continuous? Distribution?
Pandas • Data analysis and modeling • Similar to R and Excel • Easy-to-use data structures • DataFrame • Data wrangling tools • Merging, pivoting, etc
Pandas • Keep everything in Python • Community support/resources • Use for preprocessing • File I/O, cleaning, manipulation, etc • Combinable with other modules • NumPy, SciPy, statsmodels, matplotlib
Pandas • File I/O
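The code from this slide is not part of the transcript. A minimal sketch of Pandas file I/O might look like the following, assuming a small hypothetical CSV (the column names and values are invented for illustration) read from an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,city
Ann,34,New York
Bob,,Detroit
Cara,29,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Write the DataFrame back out; index=False leaves out the row index.
out = io.StringIO()
df.to_csv(out, index=False)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by a path such as `"data.csv"`.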
Pandas • Finding missing values
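One way to find missing values, sketched on a hypothetical DataFrame (the data is invented; the talk's own example is not in the transcript):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "age": [34, np.nan, 29],
    "city": ["New York", "Detroit", None],
})

# Boolean mask: True wherever a value is missing.
missing_mask = df.isnull()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```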
Pandas • Removing missing values
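A sketch of the usual options for removing (or filling) missing values, again on hypothetical data; which option fits depends on the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "age": [34, np.nan, 29],
    "city": ["New York", "Detroit", None],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
has_age = df.dropna(subset=["age"])

# Alternative to removal: fill the gaps with a placeholder.
filled = df.fillna({"city": "unknown"})

print(len(df), len(complete_rows), len(has_age))
```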
Pandas • Pivoting
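Pivoting reshapes long data into a wide table. A minimal sketch, assuming a hypothetical sales table (cities, years, and numbers are invented):

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year).
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Detroit", "Detroit"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 12, 7, 9],
})

# One row per city, one column per year.
pivoted = df.pivot(index="city", columns="year", values="sales")
print(pivoted)

# pivot_table also aggregates, here summing sales per city.
totals = df.pivot_table(index="city", values="sales", aggfunc="sum")
print(totals)
```

`pivot` requires each index/column pair to be unique; `pivot_table` is the safer choice when duplicates need to be aggregated.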
Pandas • Other things • Statistical methods • Merge/join like SQL • Time series • Has some visualization functionality
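As an illustration of the SQL-like merge/join and the statistical methods mentioned above, here is a short sketch on two hypothetical tables (all names and values are invented):

```python
import pandas as pd

# Two hypothetical tables sharing an "id" key.
people = pd.DataFrame({"id": [1, 2, 3], "age": [34, 41, 29]})
cities = pd.DataFrame({"id": [2, 3, 4], "city": ["Detroit", "NYC", "Boston"]})

# Inner join: only ids present in both tables.
inner = people.merge(cities, on="id", how="inner")

# Left join: every row of `people`, with NaN where no city matches.
left = people.merge(cities, on="id", how="left")

# Summary statistics for a numeric column.
mean_age = people["age"].mean()
print(inner)
print(left)
print(people["age"].describe())
```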
Machine Learning • Application of algorithms that learn from examples • Representation and generalization • Useful in everyday life • Especially useful in data analysis
Machine learning • Supervised learning • Classification and regression • Unsupervised learning • Clustering and dimensionality reduction
Scikit-learn • Machine learning module • Open-source • Built-in datasets • Good resources for learning
Scikit-learn • Scikit-learn: your data has to be numeric (continuous values, or categories encoded as numbers) • Here’s what one observation/label looks like:
Scikit-learn • Transform categorical values/labels
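The transcript does not show which transformer the talk used. One common option for turning string labels into integers is `LabelEncoder`, sketched here on hypothetical labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels.
labels = ["cat", "dog", "bird", "dog"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted alphabetically: bird=0, cat=1, dog=2.
print(list(encoder.classes_))
print(list(encoded))

# The mapping is reversible.
decoded = encoder.inverse_transform(encoded)
```

`LabelEncoder` is meant for target labels; for categorical input features, one-hot encoding is usually the better fit because it does not imply an ordering.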
Scikit-learn • Classification
Scikit-learn • Classification
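The classifier and dataset used on these slides are not in the transcript. As a stand-in, this sketch trains a k-nearest-neighbors classifier on scikit-learn's built-in iris dataset; both choices are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training split, predict on the held-out split.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Fraction of held-out examples classified correctly.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Most scikit-learn estimators follow the same `fit` / `predict` / `score` pattern, so swapping in another classifier usually means changing only the import and the constructor line.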
Scikit-learn • Other things • Very comprehensive set of machine learning algorithms • Preprocessing tools • Methods for testing the accuracy of your model
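One of the accuracy-testing methods scikit-learn provides is k-fold cross-validation. A minimal sketch, assuming the built-in iris dataset and a k-nearest-neighbors model (neither is confirmed by the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

# Train and evaluate 5 times, each time holding out a different fifth.
scores = cross_val_score(clf, X, y, cv=5)

print(scores)
print(scores.mean())
```

Averaging over folds gives a steadier estimate than a single train/test split, at the cost of fitting the model several times.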
Natural Language Processing • Concerned with interactions between computers and human languages • Derive meaning from text • Many NLP algorithms are based on machine learning
nltk • Natural Language ToolKit • Access to over 50 corpora • Corpus: body of text • NLP tools • Stemming, tokenizing, etc • Resources for learning
NLTK • Stopword removal
NLTK • Stopword removal
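A hedged sketch of stopword removal. NLTK ships an English stopword list in `nltk.corpus.stopwords`, which needs a one-time `nltk.download("stopwords")`. So that the example still runs without that download, it falls back to a tiny hand-written set; the fallback set and the sample sentence are assumptions, not taken from the talk.

```python
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback when NLTK or its stopwords corpus is unavailable:
    # a tiny hand-written list, for illustration only.
    stop_words = {"a", "an", "and", "the", "was", "to", "of", "in"}

# Hypothetical sentence; split on whitespace to avoid extra downloads.
text = "Miss Taylor was the first to wed"
tokens = text.lower().split()

# Keep only the tokens that are not stopwords.
filtered = [word for word in tokens if word not in stop_words]
print(filtered)
```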
NLTK • Stemming
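Stemming reduces words to a common root. The transcript does not say which stemmer the talk used; this sketch assumes NLTK's `PorterStemmer`, which needs no corpus download, and a hypothetical word list:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to stem.
words = ["wedding", "running", "father", "taylor"]
stems = [stemmer.stem(word) for word in words]

# Pair each word with its stem for inspection.
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems are not always dictionary words; the goal is only that related forms collapse to the same token before counting.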
NLTK • Other things • Lemmatizing, tokenization, tagging, parse trees • Classification • Chunking • Sentence structure
Processing Large Data • Data that takes too long to process on your machine • Not “big data” but larger data • Solution: MapReduce! • Processing large datasets with a parallel, distributed algorithm • Map step • Reduce step
Processing Large Data • Map step • Takes series of key/value pairs • Ex. Word counts: break line into words, return word and count within line • Reduce step • Once for each unique key: iterates through values associated with that key • Ex. Word counts: returns word and sum of all counts
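The two steps can be sketched in plain Python, without any MapReduce framework. The input lines are hypothetical, and the explicit grouping function stands in for the shuffle that a real framework performs between map and reduce:

```python
from collections import defaultdict


def map_line(line):
    """Map step: return (word, count within this line) pairs."""
    counts = defaultdict(int)
    for word in line.lower().split():
        counts[word] += 1
    return list(counts.items())


def group_by_key(pairs):
    """Collect all values emitted for each unique key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word(word, counts):
    """Reduce step: called once per unique word, sums its counts."""
    return word, sum(counts)


# Hypothetical (already stemmed) input lines.
lines = ["miss taylor miss", "taylor first wed", "first wed", "father", "father"]

mapped = [pair for line in lines for pair in map_line(line)]
grouped = group_by_key(mapped)
totals = dict(reduce_word(word, counts) for word, counts in grouped.items())
print(totals)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines.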
MRJob • Write MapReduce jobs in Python • Test code locally without installing Hadoop • Lots of thorough documentation • A few things to know • Keep everything in one class • MRJob program in a separate file • Output to new file if doing something like word counts
mrjob • Stemmed file • Line 1: (‘miss’, 2), (‘taylor’, 1) • Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) • And so on…
MRJob • Map • Line 1: (‘miss’, 2), (‘taylor’, 1) • Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) • Line 3: (‘first’, 1), (‘wed’, 1) • Line 4: (‘father’, 1) • Line 5: (‘father’, 1) • Reduce • (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2)
MRJob • Let’s count all words in the Gutenberg file • Map step
MRJob • Reduce (and run) step
MRJob • Results • Mapped counts reduced • Key/val pairs
MRJob • Other things • Run on Hadoop clusters • Can write highly complex jobs • Works with Elasticsearch
Data Visualization • The “final step” • Conveying your results in a meaningful way • Literally see what’s going on
Matplotlib • 2D visualization library • Very VERY widely used • Wide variety of plots • Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc)
Matplotlib • Remember this?
Matplotlib • Bar chart of distribution
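The chart itself is not in the transcript. A minimal bar-chart sketch, assuming hypothetical category counts and the non-interactive `Agg` backend so that it runs without a display:

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless environment, no GUI window
import matplotlib.pyplot as plt

# Hypothetical distribution: number of observations per category.
categories = ["A", "B", "C"]
counts = [12, 30, 7]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Category")
ax.set_ylabel("Number of observations")
ax.set_title("Distribution of categories")

# Render to an in-memory PNG; pass a filename instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` displays the figure instead.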
Matplotlib • Let’s graph our word count frequencies • (Hint: It’s a power law distribution!)
matplotlib • High frequency of low numbers, low frequency of high numbers
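To see that shape, one option is to plot how many words share each count, on log-log axes. This sketch uses hypothetical word counts (invented, not the talk's results) and assumes the `Agg` backend for headless use:

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # assumption: headless environment, no GUI window
import matplotlib.pyplot as plt

# Hypothetical output of a word-count job: word -> total count.
word_counts = {
    "the": 50, "of": 30, "and": 20,
    "miss": 2, "taylor": 2, "first": 2,
    "wed": 1, "father": 1, "house": 1, "garden": 1,
}

# For each count value, how many distinct words have that count.
count_frequency = Counter(word_counts.values())
xs = sorted(count_frequency)
ys = [count_frequency[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Word count")
ax.set_ylabel("Number of words with that count")
fig.savefig("word_count_distribution.png")
plt.close(fig)
```

On log-log axes, a power law distribution shows up as points falling roughly along a straight line.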
Matplotlib • Other things • Many different kinds of graphs • Customizable • Time series
What next? • Phew! • Which tool to choose depends on your needs • Workflow: • Preprocess • Analyze • Visualize
Resources • Pandas • http://pandas.pydata.org/ • scikit-learn • http://scikit-learn.org/ • NLTK • http://www.nltk.org/ • MRJob • http://mrjob.readthedocs.org/ • matplotlib • http://matplotlib.org/
Contact Me! • Twitter • @sarah_guido • LinkedIn • https://www.linkedin.com/in/sarahguido • NYC Python • http://www.meetup.com/nycpython/
The End! Questions?