Data manipulation:
Data Files, and
Data Cleaning & Preparation
AAA-Python Edition
Plan
● 1- Data Files: Reading and Writing
● 2- Missing data
● 3- Data transformation
● 4- String Manipulation
1- Data Files: Reading and Writing
[By Amina Delali]
pandas
● Using pandas, we can easily read (and write) different types of data from:
  – On disk files, like csv, txt, json, html, xml, and Excel files
  – Web interaction, like the GitHub website
  – Database interaction, like an SQLite database
On disk Files
● You just have to choose the right function to use, with the right arguments:
  – The file has a header, it is a csv file, and it is delimited with ‘,’: no need to specify a header or a separator.
  – Specifying that the data file has no header: a default header is added. Otherwise, the real header would be considered as a row value. In this case, there is still no need to specify a separator.
  – In the case where the delimiter is not a ‘,’, you can specify the one used (you can also use a regular expression like ‘\s+’ == one or more spaces).
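The three cases above can be sketched as follows (a minimal sketch; the in-memory file contents are hypothetical stand-ins for real files on disk):

```python
import io
import pandas as pd

# A csv file with a header, delimited with ',': the defaults are enough
csv_with_header = io.StringIO("a,b,c\n1,2,3\n4,5,6")
df = pd.read_csv(csv_with_header)

# The same data without a header: header=None adds a default 0..n-1 header
csv_no_header = io.StringIO("1,2,3\n4,5,6")
df2 = pd.read_csv(csv_no_header, header=None)

# A file delimited with spaces: sep can be a regular expression
txt_spaces = io.StringIO("1  2  3\n4  5  6")
df3 = pd.read_csv(txt_spaces, sep=r"\s+", header=None)
```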
On disk Files
● Specifying a header: a list of 5 values.
● Specifying an index column: the fifth column is no longer a value column but an index column.
● Some files may contain row values + other text; you can skip this text with the skiprows argument: skiprows=[0, 2] will not include the first and third rows.
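A sketch of the three arguments together (the column names and file content are assumptions for illustration):

```python
import io
import pandas as pd

data = "# a comment line\n1,2,3,4,A\n# another comment\n5,6,7,8,B"

# names= gives the 5 columns explicit labels; index_col= makes the
# fifth column an index instead of a value column; skiprows=[0, 2]
# drops the first and third lines (the comments)
df = pd.read_csv(io.StringIO(data),
                 names=["c1", "c2", "c3", "c4", "key"],
                 index_col="key",
                 skiprows=[0, 2])
```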
On disk Files
● By default, the missing and NA values are considered to be null when reading a file (here, the content of the file A3P-w2-ex5.csv).
● We can specify the null values with the na_values argument: as a dictionary, to specify the corresponding columns as keys; or as a list, to select from all the values of the file.
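A minimal sketch of both forms of na_values (the column names and sentinel values are hypothetical):

```python
import io
import pandas as pd

data = "name,score\nfoo,1\nmissing,2\nbar,-1"

# As a list: -1 is treated as null wherever it appears in the file
df_list = pd.read_csv(io.StringIO(data), na_values=[-1])

# As a dictionary: "missing" is null only in the 'name' column,
# -1 only in the 'score' column
df_dict = pd.read_csv(io.StringIO(data),
                      na_values={"name": ["missing"], "score": [-1]})
```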
On disk Files
● With chunksize == 2 rows, reading only 10 rows (from 10000) gives a total of 5 chunks (2 * 5 == 10 rows). You can then combine the argument values to create tuples.
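The chunked reading can be sketched as follows (a small in-memory csv stands in for the large file on disk):

```python
import io
import pandas as pd

# Build a small csv in memory: a header line plus 10 data rows
data = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# chunksize=2 returns an iterator of DataFrames, 2 rows each;
# 10 rows / 2 rows per chunk == 5 chunks
chunks = list(pd.read_csv(io.StringIO(data), chunksize=2))
```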
On disk Files
● With read_csv or read_table, you can read other text file formats (such as .txt files) containing columns separated by delimiters.
● You can use read_json to read json files.
● You can use read_html to read tabular data in an html file.
● You can use read_excel to read excel files (you will have to install the xlrd and openpyxl libraries).
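A minimal sketch of read_json (the json content is hypothetical; the read_excel call is shown commented out because it needs the openpyxl library installed):

```python
import io
import pandas as pd

# read_json accepts a path or a file-like object
json_data = io.StringIO('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')
df = pd.read_json(json_data)

# read_excel works the same way on .xlsx files:
# df_xl = pd.read_excel("file1.xlsx")
```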
On disk Files
● Only the displayed number of rows will be limited to 5 (the DataFrame still contains all the rows).
● By default, read_html will read all the tables of the page. The required libraries (in addition to pandas) are: lxml, beautifulsoup4, and html5lib.
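A sketch of the display limit, assuming it is set through pandas' display options (the read_html call is shown commented out since it needs the extra libraries and a real page):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Limit only the *display* to 5 rows; the DataFrame keeps all 100 rows
pd.set_option("display.max_rows", 5)
shown = repr(df)  # truncated output, with an ellipsis in the middle

# read_html would then return a list with every table found on the page:
# tables = pd.read_html("https://example.com/page.html")
```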
● To write the data to a file, you can use the corresponding methods: to_csv, to_json, and to_excel.
● After creating a DataFrame, you can save it to different file formats, for example file1.txt, file1.csv, file1.json, and file1.xlsx, then download file1.xlsx to check its content.
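The round trip can be sketched as follows (the file names follow the slide; a temporary directory stands in for the working directory, and to_excel is commented out because it needs openpyxl):

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Save the same DataFrame to several formats
out = Path(tempfile.mkdtemp())
df.to_csv(out / "file1.csv", index=False)
df.to_csv(out / "file1.txt", sep="\t", index=False)
df.to_json(out / "file1.json")
# df.to_excel(out / "file1.xlsx")

# Reading file1.csv back gives the original data
csv_back = pd.read_csv(out / "file1.csv")
```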
Web Interaction
● It is possible to interact with websites’ APIs to retrieve data via a predefined format.
● For example, with GitHub’s issues API: by default, it will get only the last 30 issues. From the returned json data, we selected two columns. With extra query parameters, we selected only closed issues, the second page, and 100 issues per page.
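A sketch of this interaction: the actual request is shown commented out (it needs the requests library and network access), a small sample payload stands in for the API's json response, and the two selected columns (number and title) are assumptions:

```python
import pandas as pd

# A real request would look like:
# import requests
# resp = requests.get(
#     "https://api.github.com/repos/pandas-dev/pandas/issues",
#     params={"state": "closed", "page": 2, "per_page": 100})
# data = resp.json()

# Sample payload standing in for the API's json response
data = [
    {"number": 101, "title": "First issue", "state": "closed"},
    {"number": 102, "title": "Second issue", "state": "closed"},
]

# Keep only the two columns we are interested in
issues = pd.DataFrame(data, columns=["number", "title"])
```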
DataBase Interaction
● In the following example, we will use sqlalchemy and pandas to interact with an sqlite database.
● There are various ways to connect, create, and extract data from a database using sqlalchemy. We selected one of them:
  – give the name of the database;
  – link “meta” with the created database (engine);
  – the table will have 2 columns: id and value;
  – the table is linked with the DB through the name of the table.
DataBase Interaction
● In the created database, each value to insert is specified with its corresponding column.
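A minimal sketch of these steps with SQLAlchemy Core (the table name and inserted values are hypothetical; an in-memory database stands in for a file on disk):

```python
import pandas as pd
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

# The name of the database ('sqlite:///mydb.sqlite' would persist it on disk)
engine = create_engine("sqlite:///:memory:")

# Link "meta" with the created database (engine)
meta = MetaData()

# The table will have 2 columns: id and value
mytable = Table("mytable", meta,
                Column("id", Integer, primary_key=True),
                Column("value", String))
meta.create_all(engine)

# Each value to insert is given with its corresponding column
with engine.begin() as conn:
    conn.execute(mytable.insert(), [{"id": 1, "value": "A"},
                                    {"id": 2, "value": "B"}])

# pandas can then read the table back into a DataFrame
df = pd.read_sql("SELECT * FROM mytable", engine)
```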
2- Missing data
[By Amina Delali]
● Filtering out
● Sometimes, data may have missing or NA values. So, with pandas we can filter out those values using the dropna method.
● On a Series, the new Series will contain only the non-null values.
● On a DataFrame df1, dropna by default will drop all rows containing at least one NaN value. To drop all columns containing at least one NaN value, you should specify axis=1.
● how=”all” means that dropna will drop rows only if all their values are NA. For example, column 3 is kept because it has two values different from NaN.
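The dropna variants can be sketched as follows (the data values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])
clean = ser.dropna()  # keeps only the two non-null values

df1 = pd.DataFrame([[1.0, np.nan, 2.0],
                    [np.nan, np.nan, np.nan],
                    [3.0, 4.0, 5.0]])

rows_any = df1.dropna()           # drops rows with at least one NaN
cols_any = df1.dropna(axis=1)     # drops columns with at least one NaN
rows_all = df1.dropna(how="all")  # drops rows only when ALL values are NaN
```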
● Filling in
● Instead of dropping missing data, we can produce new values using the fillna method of pandas.
● By default, fillna will fill along axis=0 with:
  – a given value: in this case, limit=2 signifies the maximum number of NaN values to be replaced in each column (this is our case);
  – a given method: in this case, limit=2 signifies the maximum number of consecutive NaN values to be replaced in a column.
● If axis=1 was specified, the filling would be done along the columns instead.
● With inplace=True, df1 itself will be modified.
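The two filling modes can be sketched as follows (the data is hypothetical; note that fillna(method=...) is deprecated in recent pandas, so the method case is written with ffill):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [np.nan, np.nan, np.nan, 4.0],
                    "b": [1.0, np.nan, 3.0, np.nan]})

# Fill with a given value: at most 2 NaN values replaced per column
filled_value = df1.fillna(0, limit=2)

# Fill with a method (forward fill): at most 2 *consecutive* NaN
# values replaced per column
filled_ffill = df1.ffill(limit=2)
```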
3- Data Transformation
● Some other types of transformations are necessary, such as: dropping duplicated data, transforming and creating new data using mapping, replacing values, renaming indexes, discretization, permutation, and random sampling.
Dropping duplicates
● In df1, row1 and row2 have the same values considering columns 0, 1 and 2. So, by default, only the first (top) one is kept.
● If keep=’last’ is specified, the last (bottom) line is kept instead.
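A minimal sketch of both behaviors (the duplicated values are hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3],
                    [1, 2, 3]],
                   index=["row1", "row2"])

first_kept = df1.drop_duplicates()            # keeps row1 (the top one)
last_kept = df1.drop_duplicates(keep="last")  # keeps row2 (the bottom one)
```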
Transforming data
● We added the new column “order” by mapping the values from “chars” using the dictionary myMap.
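A sketch of the mapping (the "chars" values and the contents of myMap are assumptions; only the column names follow the slide):

```python
import pandas as pd

df = pd.DataFrame({"chars": ["a", "b", "c"]})
myMap = {"a": 1, "b": 2, "c": 3}

# map looks each value of "chars" up in the dictionary
df["order"] = df["chars"].map(myMap)
```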
Replacing values and Renaming indexes
● Replacing values in df2:
  – to modify only one value: df1.replace(0.99, 1);
  – using inplace=True will modify the original DataFrame.
● Renaming indexes:
  – to modify column labels, use columns=;
  – if indexes or columns were strings, we could use for example index=str.lower;
  – renaming a label that doesn’t exist (here, 3) has no effect.
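Both operations can be sketched as follows (the DataFrame contents are hypothetical; only the value 0.99 and the non-existent label 3 follow the slide):

```python
import pandas as pd

df2 = pd.DataFrame({"a": [0.99, 2.0]}, index=["X", "Y"])

# Replace a single value (returns a copy unless inplace=True)
replaced = df2.replace(0.99, 1)

# Rename column labels with columns=, string indexes with a function
renamed = df2.rename(columns={"a": "A"}, index=str.lower)

# Renaming a label that doesn't exist (e.g. 3) has no effect
unchanged = df2.rename(index={3: "Z"})
```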
Discretization
● The values of ser2 are grouped in 3 categories: 0→4, 5→7, 8→9 == (0,4], (4,7], (7,9]. “(” means the value is out; “]” means the value is in. So 0 doesn’t belong to any category.
● The values can also be grouped in 4 categories of the same length, using the minimum and maximum values. In that case, all the values are included.
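Both groupings can be sketched with pd.cut (the values of ser2 are assumptions chosen to match the bin edges on the slide):

```python
import pandas as pd

ser2 = pd.Series([0, 1, 4, 5, 7, 8, 9])

# Explicit bin edges: (0,4], (4,7], (7,9].
# 0 falls outside every category, so its entry is NaN
cats = pd.cut(ser2, [0, 4, 7, 9])

# 4 equal-length categories built from the min and max values;
# here every value (including 0) is included
cats4 = pd.cut(ser2, 4)
```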
Permutation
● A permutation gives a new row order; the length of the permutation array must be == to the number of rows.
Random sampling
● If you select n greater than the length of ser2, you will have to specify the replace=True argument (to fill the remaining needed values).
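A sketch of both operations, assuming the permutation array comes from np.random.permutation (the DataFrame and ser2 contents are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
ser2 = pd.Series([10, 20, 30])

# Permutation: the array's length must equal the number of rows
order = np.random.permutation(len(df))
shuffled = df.take(order)

# Random sampling: n greater than len(ser2) requires replace=True,
# so values can be drawn again to fill the remaining slots
sampled = ser2.sample(n=5, replace=True)
```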
4- String Manipulation
[By Amina Delali]
String methods
● String objects have useful methods that can be used. For example, with index, if “e” doesn’t exist it will raise an exception; with find, if “e” doesn’t exist it will return -1.
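The two behaviors can be sketched with a plain Python string (the string itself is an arbitrary example):

```python
s = "python"

pos = s.find("e")    # "e" doesn't exist: find returns -1

try:
    s.index("e")     # "e" doesn't exist: index raises ValueError
    outcome = "no error"
except ValueError:
    outcome = "raised"
```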
References
● [1] SQLAlchemy authors and contributors. SQLAlchemy 1.2 documentation. On-line at https://docs.sqlalchemy.org/en/latest/core/dml.html. Accessed on 19-10-2018.
● [2] GitHub. REST API v3. On-line at https://developer.github.com/v3/. Accessed on 15-10-2018.
● [3] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc, 2018.
● [4] pydata.org. Pandas documentation. On-line at https://pandas.pydata.org/. Accessed on 19-10-2018.
● [5] pysheeet. Python SQLAlchemy cheatsheet. On-line at https://pysheeet.readthedocs.io/en/latest/notes/python-sqlalchemy.html. Accessed on 19-10-2018.
● [6] Wes McKinney. pydata-book. On-line at https://github.com/wesm/pydata-book.git. Accessed on 14-10-2018.
Thank
you!
FOR ALL YOUR TIME
