Python Pandas


Presentation Transcript


  1. Python Pandas K. Anvesh

  2. Introduction • Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. • It has the broader goal of becoming the most powerful and flexible open-source data analysis / manipulation tool available in any language.

  3. The name Pandas is derived from the term “Panel Data”, an econometrics term for multidimensional data. • The Pandas library provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.

  4. Prior to Pandas, Python was mainly used for data munging and preparation; it contributed very little to data analysis itself. Pandas solved this problem. • Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: • load, • prepare, • manipulate, • model, and • analyze.

  5. Pandas Features • Fast and efficient DataFrame object with default and customized indexing. • Tools for loading data into in-memory data objects from different file formats. • Data alignment and integrated handling of missing data. • Reshaping and pivoting of data sets. • Label-based slicing, indexing and subsetting of large data sets. • Columns from a data structure can be deleted or inserted. • Group by data for aggregation and transformations. • High performance merging and joining of data. • Time series functionality.

  6. Installation of Pandas • Anaconda is a free Python distribution with the SciPy stack (which includes Pandas) and the Spyder IDE for Windows OS. • It is also available for Linux and Mac. • The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip. C:\Users\Sony>pip install pandas
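
A quick way to confirm the installation succeeded is to import the module and print its version (the exact number will vary by install):

    import pandas as pd
    print(pd.__version__)   # e.g. '1.5.3'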

  7. Highlights of Pandas • A fast and efficient DataFrame object for data manipulation with integrated indexing; • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format; • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form; • Flexible reshaping and pivoting of data sets; • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets; • Columns can be inserted and deleted from data structures for size mutability; • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

  8. High performance merging and joining of data sets; • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure; • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data; • Highly optimized for performance, with critical code paths written in C. • Python with pandas is in use in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.

  9. Dataset in Pandas • Pandas deals with the following three data structures − • Series • DataFrame • Panel • These data structures are built on top of the NumPy array. • All Pandas data structures are value mutable (their contents can be changed). Except for Series, all are size mutable; a Series is size immutable. • DataFrame is the most widely used and one of the most important data structures. Panel is used much less (and has been removed from recent versions of Pandas).

  10. Different Dimensions of Datasets

  11. Series • Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, … Panel • Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be illustrated as a container of DataFrames.

  12. DataFrame • DataFrame is a two-dimensional structure with heterogeneous data. For example, consider a table representing the data of a sales team of an organization with their overall performance ratings. The data is represented in rows and columns: each column represents an attribute and each row represents a person.
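
The table from the slide is not reproduced in the transcript; a hypothetical frame of the same shape (names, ages and ratings are made up) might look like this:

    import pandas as pd

    # Each column is an attribute; each row is one team member
    df = pd.DataFrame({
        'Name':   ['Steve', 'Lia', 'Vin', 'Katie'],
        'Age':    [32, 28, 45, 38],
        'Rating': [3.45, 4.60, 3.90, 2.78],
    })
    print(df)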

  13. Series • Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index. • A series can be created using various inputs like − • Array • Dict • Scalar value or constant

  14. Create a Series from ndarray • If data is an ndarray, then the index passed must be of the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., the labels are [0, 1, 2, …, n-1]. • Ex: series_1.py

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    print(s)

  15. Giving a manual index to the data • Pass an index list along with the values. • Ex: series_index.py • series_index_2.py • series_3.py
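
The referenced example files are not included in the transcript; a minimal sketch of manual indexing (the labels chosen here are arbitrary):

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    # An explicit index list replaces the default 0..n-1 labels
    s = pd.Series(data, index=[100, 101, 102, 103])
    print(s)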

  16. Create a Series from dict • A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index (older versions of Pandas sorted the keys; modern versions preserve insertion order). • Ex: 1. series_dict.py • 2. dic_panda.py
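
Again the example files are not shown; a small sketch of dict input (keys and values are made up):

    import pandas as pd

    data = {'a': 0.0, 'b': 1.0, 'c': 2.0}
    s = pd.Series(data)   # the keys become the index
    print(s)

    # With an explicit index, values are matched by key;
    # labels with no matching key get NaN
    s2 = pd.Series(data, index=['b', 'c', 'd', 'a'])
    print(s2)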

  17. Data Frames • A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Features of DataFrame • Columns are potentially of different types • Size – mutable • Labeled axes (rows and columns) • Can perform arithmetic operations on rows and columns • A pandas DataFrame can be created using the following constructor − pandas.DataFrame(data, index, columns, dtype, copy)

  18. Create DataFrame Pandas DataFrame can be created using various inputs like: • Lists • dict • Series • NumPy ndarrays • Another DataFrame

  19. Syntax:

    import pandas as pd

    df = pd.DataFrame()
    print(df)

The above will generate an empty DataFrame with no columns and no index.

  20. DataFrame using Lists • The DataFrame can be created using a single list or a list of lists. Ex: df_2.py • Giving column names and a list of values. Ex: df_3.py
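
The files df_2.py and df_3.py are not shown; a plausible sketch of both cases (the sample data is made up):

    import pandas as pd

    # From a single list
    df = pd.DataFrame([1, 2, 3, 4, 5])
    print(df)

    # From a list of lists, with column names
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)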

  21. DataFrame using Dict of ndarrays & Lists • All the ndarrays must be of the same length. If an index is passed, then the length of the index should equal the length of the arrays. • If no index is passed, then by default the index will be range(n), where n is the array length. • Ex: df_4.py • A list of dictionaries can also be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default. • Ex: df_5.py
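
A sketch of what df_4.py and df_5.py likely demonstrate (sample data made up):

    import pandas as pd

    # Dict of equal-length lists/ndarrays: keys become column names
    data = {'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]}
    df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3'])
    print(df)

    # List of dicts: keys become column names; missing keys become NaN
    records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(records)
    print(df)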

  22. We can also create a DataFrame with a list of dictionaries, row indices, and column indices. • Ex: df_6.py • Note: Here the df2 DataFrame is created with a column index that is not among the dictionary keys, so that column is filled with NaN’s. Whereas df1 is created with column indices matching the dictionary keys, so no NaN’s are appended.
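
A sketch of the df1/df2 contrast the note describes (df_6.py itself is not shown):

    import pandas as pd

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

    # df1: column labels match the dictionary keys, so no NaN's are introduced
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

    # df2: 'b1' appears in no dictionary, so that column is filled with NaN
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

    print(df1)
    print(df2)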

  23. DataFrame from Dict of Series • Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed. • Ex: df_7.py
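
A sketch of df_7.py's likely content, showing the union of the series indexes:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    # The index is the union {'a','b','c','d'}; 'one' has no 'd', so it gets NaN
    df = pd.DataFrame(d)
    print(df)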

  24. Dataset Manipulations • Column-wise manipulations in a DataFrame • We can perform DataFrame manipulations like: • Selecting required columns for display • Adding new columns • Deleting columns Example: column wise manipulations.py
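
The example file is not included; a minimal sketch of the three column operations:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print(df['one'])                       # select a column
    df['three'] = df['one'] + df['two']    # add a new column
    del df['one']                          # delete a column
    df.pop('two')                          # delete and return a column
    print(df)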

  25. Row-wise manipulations in a DataFrame • We can do the following: • Row selection • Selecting using a label • Selecting using integer location • Selecting using slicing • Addition of rows, and • Deletion of rows • Example: row wise manipulations.py
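
A sketch of the row operations listed above (sample data made up):

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print(df.loc['b'])    # selection by label
    print(df.iloc[2])     # selection by integer location
    print(df[2:4])        # selection by slicing

    new_row = pd.DataFrame([[5, 6]], columns=['one', 'two'], index=['e'])
    df = pd.concat([df, new_row])   # add a row
    df = df.drop('a')               # delete a row by label
    print(df)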

  26. Dataset Concatenating Ex: concatinating_df.py • Dataset Merging Ex: merge_df.py • Dataset Joining Ex: join_df.py
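
The three example files are not shown; a minimal sketch of each operation (key and column names made up):

    import pandas as pd

    left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alex', 'Amy', 'Allen']})
    right = pd.DataFrame({'id': [1, 2, 3], 'score': [98, 90, 87]})

    print(pd.concat([left, right]))        # stack frames on top of each other
    print(pd.merge(left, right, on='id'))  # SQL-style merge on a key column
    print(left.join(right.set_index('id'), on='id'))  # join on an index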

  27. Data Preprocessing • In the real world, we usually come across lots of raw data that is not fit to be readily processed by machine learning algorithms. • In other words, before providing the data to machine learning algorithms, we need to preprocess it.

  28. Why preprocessing ? • Real world data are generally • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • Noisy: containing errors or outliers • Inconsistent: containing discrepancies in codes or names

  29. Tasks in data preprocessing • Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. • Data integration: using multiple databases, data cubes, or files. • Data transformation: normalization and aggregation. • Data reduction: reducing the volume but producing the same or similar analytical results. • Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

  30. Data cleaning • Fill in missing values (attribute or class value): • Ignore the tuple: usually done when the class label is missing. • Use the attribute mean (or majority nominal value) to fill in the missing value. • Use the attribute mean (or majority nominal value) for all samples belonging to the same class. • Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value. • Identify outliers and smooth out noisy data: • Binning • Sort the attribute values and partition them into bins (see "Unsupervised discretization" below); • Then smooth by bin means, bin medians, or bin boundaries. • Clustering: group values in clusters and then detect and remove outliers (automatic or manual) • Regression: smooth by fitting the data to regression functions. • Correct inconsistent data: use domain knowledge or expert decision.

  31. Data transformation • Normalization: • Scaling attribute values to fall within a specified range. • Example: to transform V in [min, max] to V' in [0,1], apply V'=(V-Min)/(Max-Min) • Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V'=(V-Mean)/StDev • Aggregation: moving up in the concept hierarchy on numeric attributes. • Generalization: moving up in the concept hierarchy on nominal attributes. • Attribute construction: replacing or adding new attributes inferred by existing attributes.
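
Both normalization formulas are one-liners in pandas; a quick sketch with made-up values:

    import pandas as pd

    v = pd.Series([10, 20, 30, 40, 50])

    minmax = (v - v.min()) / (v.max() - v.min())   # V' = (V - Min) / (Max - Min), in [0, 1]
    zscore = (v - v.mean()) / v.std()              # V' = (V - Mean) / StDev
    print(minmax)
    print(zscore)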

  32. Data reduction • Reducing the number of attributes • Data cube aggregation: applying roll-up, slice or dice operations. • Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space. • Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data. • Reducing the number of attribute values • Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins). • Clustering: grouping values in clusters. • Aggregation or generalization • Reducing the number of tuples • Sampling

  33. Discretization and generating concept hierarchies • Unsupervised discretization - the class variable is not used. • Equal-interval (equiwidth) binning: split the whole range of numbers into intervals of equal size. • Equal-frequency (equidepth) binning: use intervals containing an equal number of values. • Supervised discretization - uses the values of the class variable. • Using class boundaries. Three steps: • Sort values. • Place breakpoints between values belonging to different classes. • If there are too many intervals, merge intervals with equal or similar class distributions. • Entropy (information)-based discretization. Generating concept hierarchies: recursively applying partitioning or discretization methods.
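
Pandas provides both unsupervised binning schemes directly; a small sketch with made-up values:

    import pandas as pd

    values = pd.Series([1, 7, 5, 4, 6, 3, 9, 8, 2, 10])

    equal_width = pd.cut(values, bins=3)   # equal-interval (equiwidth) binning
    equal_freq = pd.qcut(values, q=3)      # equal-frequency (equidepth) binning
    print(equal_width.value_counts())
    print(equal_freq.value_counts())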

  34. Missing Values in the array set or the Dataset • Identifying the number of missing values in a dataset. • Function: data.isna() or data.isnull() • These functions return a boolean DataFrame that is True wherever a value is missing. • We can also count the number of null values per column or per row. • Function: data.isnull().sum() or data.isna().sum() • data.isnull().sum(axis=0) [column level] / data.isnull().sum(axis=1) [row level]

  35. Null values or missing values can also be filled. Function: data.fillna(60) • We can drop the rows with missing values in a DataFrame or dataset using: Function: data.dropna() Mean We can fill the missing values by using the mean. Syntax: data.fillna(data.mean())
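
A minimal sketch tying these functions together (the sample marks are made up):

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({'marks': [55, np.nan, 72, np.nan, 60]})

    print(data.isnull())          # boolean mask of missing values
    print(data.isnull().sum())    # count of missing values per column

    print(data.fillna(60))            # fill with a constant
    print(data.fillna(data.mean()))   # fill with the column mean
    print(data.dropna())              # drop rows containing missing values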

  36. Let's discuss various techniques for preprocessing data in Python machine learning. • Data preprocessing steps Step 1 − Import the useful packages (numpy, pandas, sklearn.preprocessing − this package provides many common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for machine learning algorithms.) Step 2 − Define sample data Step 3 − Apply the preprocessing technique

  37. Preprocessing Techniques • Data can be preprocessed using several techniques • Mean removal: it involves removing the mean from each feature so that it is centered on zero. Mean removal helps in removing any bias from the features. • Example: meanremoval.py • Note: Here we can observe that the mean is almost 0 and the standard deviation is 1
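
The file meanremoval.py is not shown; a likely equivalent using scikit-learn (sample data made up):

    import numpy as np
    from sklearn import preprocessing

    data = np.array([[3.0, -1.5,  2.0],
                     [0.0,  4.0, -0.3],
                     [1.0,  3.3, -1.9]])

    scaled = preprocessing.scale(data)  # center each feature to mean 0, std 1
    print(scaled.mean(axis=0))   # approximately 0 for every feature
    print(scaled.std(axis=0))    # 1 for every feature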

  38. Scaling • The values of different features in a data point can span very different ranges, so it is important to scale them to a common, specified range. • Scaling of feature vectors is needed because we do not want any feature to be synthetically large or small. • Example: scaling.py • Note: Here we can see that all the values have been scaled to the given range
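
A sketch of what scaling.py presumably does, using scikit-learn's MinMaxScaler:

    import numpy as np
    from sklearn import preprocessing

    data = np.array([[3.0, -1.5,  2.0],
                     [0.0,  4.0, -0.3],
                     [1.0,  3.3, -1.9]])

    scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
    print(scaler.fit_transform(data))   # every feature now lies in [0, 1]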

  39. Normalization • Normalization involves adjusting the values in the feature vector so as to measure them on a common scale. • Here, the values of a feature vector are adjusted so that they sum up to 1. • Example: normalization.py • Note: Here Normalization is used to ensure that data points do not get boosted due to the nature of their features.

  40. L1 Normalization It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the sum of the absolute values in each row is always 1. L2 Normalization It is also referred to as least squares. This kind of normalization modifies the values so that the sum of the squares in each row is always 1.
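
A sketch of both norms using scikit-learn (sample data made up); note that normalize() works row by row:

    import numpy as np
    from sklearn import preprocessing

    data = np.array([[3.0, -1.5,  2.0],
                     [0.0,  4.0, -0.3],
                     [1.0,  3.3, -1.9]])

    l1 = preprocessing.normalize(data, norm='l1')  # absolute values of each row sum to 1
    l2 = preprocessing.normalize(data, norm='l2')  # squares of each row sum to 1
    print(np.abs(l1).sum(axis=1))   # all 1.0
    print((l2 ** 2).sum(axis=1))    # all 1.0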

  41. Data Analysis • This concept deals with data, DataFrames or datasets by applying functions/methods for analyzing the data. Loading the Dataset • Here we are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization. • Methods: read_csv or read_excel • Example: csv file load.py
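
The example file is not shown; loading is a single call (the file name here is a placeholder):

    import pandas as pd

    df = pd.read_csv('data.csv')        # 'data.csv' is a placeholder path
    # df = pd.read_excel('data.xlsx')   # Excel files load the same way
    print(df.head())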

  42. Summarizing the Dataset Summarizing the data can be done in many ways as follows − • Check dimensions of the dataset • List the entire data • View the statistical summary of all attributes • Breakdown of the data by the class variable

  43. Dimensions of Dataset You can check how many instances (rows) and attributes (columns) the data contains with the shape property. Property: x.shape • List out the entire dataset (head/tail is also used) • View the statistical summary You can view the statistical summary of each attribute with the following command; for numeric columns it includes the count, mean, standard deviation and quartiles, and for object columns the count, unique, top and freq. Method: x.describe()
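
A short sketch of all three summaries, reusing the placeholder dataset from above:

    import pandas as pd

    df = pd.read_csv('data.csv')   # placeholder path
    print(df.shape)       # (number of rows, number of columns)
    print(df.head(5))     # first rows; df.tail(5) shows the last rows
    print(df.describe())  # per-column statistical summary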

  44. Breakdown the Data by Class Variable • You can also look at the number of instances (rows) that belong to each outcome as an absolute count, using the command shown here: print(x.groupby('dept').size())

  45. Data Visualization • Basic Plotting: plot • This functionality on Series and DataFrame is just a simple wrapper around the matplotlib library's plot() method. • If the index consists of dates, it calls gcf().autofmt_xdate() to format the x-axis. • We can plot one column versus another using the x and y keywords. • Example: basicplotting.py
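
The file basicplotting.py is not shown; a plausible sketch of both points (random data):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # A date-indexed series: plot() draws a line and auto-formats the x-axis
    ts = pd.Series(np.random.randn(100).cumsum(),
                   index=pd.date_range('2020-01-01', periods=100))
    ts.plot()

    # Plotting one column versus another with the x and y keywords
    df = pd.DataFrame({'x': range(10), 'y': np.random.randn(10)})
    df.plot(x='x', y='y')
    plt.show()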

  46. Plotting methods allow a handful of plot styles other than the default line plot. These methods can be provided as the kind keyword argument to plot(). These include − • 'bar' or 'barh' for bar plots • 'hist' for histograms • 'box' for boxplots • 'area' for area plots • 'scatter' for scatter plots

  47. Bar Plot • Let us now see what a Bar Plot is by creating one. A bar plot can be created using plot.bar() • Example: vertical_barplot.py • To get horizontal bar plots, use the barh method • Example: horizontal_barplot.py
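
A sketch of both bar-plot examples (the quarterly figures are made up):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({'sales': [5, 3, 8, 6]}, index=['Q1', 'Q2', 'Q3', 'Q4'])
    df.plot.bar()    # vertical bars
    df.plot.barh()   # horizontal bars
    plt.show()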

  48. Histograms • Histograms can be plotted using the plot.hist() method. We can specify the number of bins. • Example: histogram_plot.py • To plot different histograms for each column • Example: histogram_diff.py
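
A sketch of both histogram examples, using random data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({'a': np.random.randn(1000),
                       'b': np.random.randn(1000)})
    df.plot.hist(bins=20)   # one axes, both columns overlaid
    df.hist(bins=20)        # a separate subplot per column
    plt.show()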

  49. Box Plots • Boxplots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column. • For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).
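
A sketch reproducing the example described above:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Five trials (columns) of 10 observations of a uniform variable on [0, 1)
    df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
    df.plot.box()
    plt.show()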

  50. Scatter Plot • A scatter plot can be created using the DataFrame.plot.scatter() method.
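
A minimal sketch with random data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(50, 2), columns=['a', 'b'])
    df.plot.scatter(x='a', y='b')
    plt.show()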
