Pandas is an open-source Python library used for data manipulation and analysis. It allows users to extract data from files like CSVs into DataFrames and perform statistical analysis on the data. DataFrames are the primary data structure and allow storage of heterogeneous data in tabular form with labeled rows and columns. Pandas can clean data by removing missing values, filter rows/columns, and visualize data using Matplotlib. It supports Series, DataFrames, and Panels for 1D, 2D, and 3D labeled data structures.
Pandas is an open-source Python library for data manipulation and analysis, introduced by Wes McKinney in 2008. It supports DataFrames for statistical operations and data visualization.
Pandas can be installed via command prompt using 'pip install pandas' or in Jupyter Notebooks with '!pip install pandas'. Import with 'import pandas as pd'.
Pandas features three main data structures: Series (1D), DataFrame (2D), and Panel (3D), each with unique properties for handling data.
A Series is a 1D labeled array, while a DataFrame is a 2D structure with rows/columns. Examples illustrate data organization and types in DataFrames.
DataFrame properties: heterogeneous and mutable data. Panel is a 3D data structure useful as a DataFrame container.
A Series is a one-dimensional array that can hold various data types. Key constructor parameters are introduced for creating Series.
Series can be created from arrays, dictionaries, or constants. The importance of matching index length to data length is highlighted.
DataFrame is a 2D table-like structure with different data types per column. Key features include mutability and arithmetic operations.
A DataFrame can be constructed using various inputs such as lists, dictionaries, Series, Numpy ndarrays, and other DataFrames.
Pandas
•Open-source Python Library
•Data manipulation and analysis tool built on powerful data structures.
•Developed by Wes McKinney, starting in 2008
•Pandas can read the data from a CSV file into a:
•DataFrame
•Table
•Perform statistical operations on it, such as:
•What's the average, median, max, or min of each column?
•Does column A correlate with column B?
•What does the distribution of data in column C look like?
•Clean the data by doing things like removing missing values and
filtering rows or columns by some criteria
•Visualize the data with help from Matplotlib. Plot bars, lines,
histograms, bubbles, and more.
•Store the cleaned, transformed data back into a CSV, other file or
database
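The workflow above can be sketched in a few lines. This is a minimal illustration, not a canonical recipe: the CSV text, the column names A, B and C, and the filter threshold are all made up for the example, and an in-memory buffer stands in for a real file. Plotting is left out so the sketch runs without Matplotlib installed.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real script would pass a file path to read_csv.
csv_text = """A,B,C
1,2.0,10
2,4.1,
3,5.9,30
4,8.2,40
"""

# Extract the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Average, median, max and min of each column.
summary = df.agg(["mean", "median", "max", "min"])

# Does column A correlate with column B?
corr_ab = df["A"].corr(df["B"])

# A numeric summary of the distribution of column C.
c_distribution = df["C"].describe()

# Clean: drop rows with missing values, then filter rows by a criterion.
cleaned = df.dropna()
filtered = cleaned[cleaned["C"] > 10]

# Store the result back as CSV (here into a buffer instead of a file).
out = io.StringIO()
filtered.to_csv(out, index=False)

print(summary)
print(corr_ab)
print(out.getvalue())
```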
2.
Pandas
• From the command prompt
•pip install pandas
• From a Jupyter notebook
• !pip install pandas
•Then, to use pandas in the notebook, import it first
• import pandas as pd
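A quick way to confirm the installation worked is to import the library and print its version. This is only a sanity check; the version string printed depends on what is installed.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string varies by environment.
installed_version = pd.__version__
print(installed_version)
```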
3.
Pandas
•Pandas deals with the following three data structures −
•Series
•DataFrame
•Panel
Data Structure | Description
Series | 1D labeled homogeneous array, size-immutable.
DataFrame | 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel | 3D labeled, size-mutable array.
4.
Pandas
Series
Homogeneous data
Size Immutable
Values of Data Mutable
DataFrame is a two-dimensional array with heterogeneous data.
For example,
Name | Age | Gender | Rating
Steve | 32 | Male | 3.45
Lia | 28 | Female | 2.40
The data is represented in rows and columns. Each column represents an attribute
and each row represents a person.
Data Type of Columns
The data types of the four columns are as follows −
Column | Type
Name | String
Age | Integer
Gender | String
Rating | Float
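The example table can be built as a DataFrame to inspect the column types directly. This is a small sketch using the values from the slide; the variable name `people` is chosen for illustration. How the string columns are reported (typically `object`, or a dedicated string type in newer releases) depends on the pandas version.

```python
import pandas as pd

# Each dictionary key becomes a column; each list position becomes a row.
people = pd.DataFrame(
    {
        "Name": ["Steve", "Lia"],
        "Age": [32, 28],
        "Gender": ["Male", "Female"],
        "Rating": [3.45, 2.40],
    }
)

print(people)

# One dtype per column: integer for Age, float for Rating,
# and a string-like dtype for Name and Gender.
print(people.dtypes)
```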
5.
Pandas
DataFrame:
•Key Points
•Heterogeneous data
•Size Mutable
•Data Mutable
•Panel
A Panel is a three-dimensional data structure with heterogeneous data. It is hard to
depict graphically, but it can be pictured as a container of DataFrames. Note that
Panel was deprecated in pandas 0.20 and removed in 0.25, so it exists only in older
versions.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
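Because Panel is no longer available in current pandas releases, the "container of DataFrames" idea is usually expressed in another way. The sketch below is one possible approach, not the original Panel API: it stacks two DataFrames under a two-level row index with `pd.concat`. The names `q1`, `q2` and the sales figures are invented for the example.

```python
import pandas as pd

# Two hypothetical 2D tables with the same layout.
q1 = pd.DataFrame({"sales": [1, 2]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [3, 4]}, index=["north", "south"])

# Passing a dict to concat uses its keys as an outer index level,
# so the result behaves like a labeled collection of DataFrames.
stacked = pd.concat({"Q1": q1, "Q2": q2})

# Selecting an outer label returns the corresponding 2D table.
print(stacked.loc["Q2"])

# A full (outer, inner) label plus a column name selects a single value.
print(stacked.loc[("Q2", "south"), "sales"])
```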
6.
Pandas
•A Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.).
•The axis labels are collectively called index.
A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)
The parameters of the constructor are as follows −
Sr.No | Parameter | Description
1 | data | Data in various forms, such as an ndarray, list, dict, or constant.
2 | index | Index values must be hashable and the same length as data; they need not be unique. Defaults to 0, 1, ..., n-1 (as from np.arange(n)) if no index is passed.
3 | dtype | The data type. If None, the data type is inferred.
4 | copy | Whether to copy the input data. Default False.
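The constructor parameters can be tried out as follows. This is a short sketch with made-up values; the labels "a", "b", "c" and the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([10, 20, 30])

# index supplies one label per value; dtype forces the element type.
s = pd.Series(data, index=["a", "b", "c"], dtype="float64")
print(s)

# Without an index, the labels default to 0, 1, ..., n-1.
default_index = pd.Series(data)
print(list(default_index.index))

# copy=True gives the Series its own data, so editing it leaves `data` untouched.
independent = pd.Series(data, copy=True)
independent.iloc[0] = 99
print(data[0])
```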
7.
Pandas
A Series can be created using various inputs like −
Array
Dict
Scalar value or constant
Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is
passed, the default index is range(n), where n is the array length, i.e.,
[0, 1, 2, ..., len(array)-1].
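The three kinds of input, and the length rule for ndarrays, can be illustrated like this. The values and labels are invented for the example.

```python
import numpy as np
import pandas as pd

# From an ndarray with an explicit index of matching length.
from_array = pd.Series(np.array(["a", "b", "c"]), index=[100, 101, 102])

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 1.0, "y": 2.0})

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=["p", "q", "r"])

print(from_array)
print(from_dict)
print(from_scalar)

# An index whose length differs from the array's length is rejected.
mismatch_rejected = False
try:
    pd.Series(np.array([1, 2, 3]), index=["a", "b"])
except ValueError as exc:
    mismatch_rejected = True
    print("length mismatch:", exc)
```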
8.
Pandas
•A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular
fashion in rows and columns.
•Features of DataFrame
•Columns can be of different types
•Size-mutable
•Labeled axes (rows and columns)
•Supports arithmetic operations on rows and columns
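The features listed above can be seen in a small example. This is a sketch with invented labels and numbers: it adds a column (size mutability) and sums along each axis (arithmetic on rows and columns).

```python
import pandas as pd

# Labeled axes: row labels r1..r3, column labels a and b.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["r1", "r2", "r3"])

# Arithmetic between columns is element-wise and aligned on the row labels.
row_sums = df["a"] + df["b"]

# Summing down each column (axis=0) gives one value per column.
col_sums = df.sum(axis=0)

# Size-mutable: a new column can be added to an existing DataFrame.
df["total"] = row_sums

print(df)
print(col_sums)
```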
9.
Pandas
•pandas.DataFrame
•A pandas DataFrame can be created using the following constructor −
•pandas.DataFrame( data, index, columns, dtype, copy)
•Create DataFrame
•A pandas DataFrame can be created using various inputs like −
•Lists
•Dicts
•Series
•NumPy ndarrays
•Another DataFrame
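Each of the listed inputs can be passed to the constructor, as sketched below. The names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# From a list of rows, with column labels given separately.
from_lists = pd.DataFrame([["Steve", 32], ["Lia", 28]], columns=["Name", "Age"])

# From a dict of columns.
from_dict = pd.DataFrame({"Name": ["Steve", "Lia"], "Age": [32, 28]})

# From a dict of Series: the Series index becomes the row labels.
ages = pd.Series([32, 28], index=["Steve", "Lia"])
from_series = pd.DataFrame({"Age": ages})

# From a 2D NumPy ndarray.
from_ndarray = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])

# From another DataFrame; copy=True makes the new one independent.
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_lists, from_dict, from_series, from_ndarray, from_frame):
    print(frame, end="\n\n")
```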