Data manipulation:
Data Files, and
Data Cleaning & Preparation
AAA-Python Edition
Plan
● 1- Data Files: Reading and Writing
● 2- Missing data
● 3- Data transformation
● 4- String Manipulation
1- Data Files: Reading and Writing
[By Amina Delali]
pandas
● Using pandas, we can easily read (and write) different types of data from:
  – On disk files, like csv, txt, json, html, xml, and Excel files
  – Web interaction, like the GitHub website
  – Database interaction, like an SQLite database
On disk Files
● You just have to choose the right function to use, with the right arguments:
  – The file has a header, it is a csv file, and it is delimited with ‘,’: no need to specify a header or a separator.
  – Specifying that the data file has no header: a default header is added. Otherwise, the real header would be considered as a row value. In this case, there is still no need to specify a separator.
  – In the case where the delimiter is not a ‘,’, you can specify the one used (you can also use a regular expression like ‘\s+’ == one or more spaces).
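The three cases above can be sketched as follows (a minimal sketch; the in-memory file contents are hypothetical stand-ins for real files on disk):

```python
import io
import pandas as pd

# A csv file with a header, delimited with ',': the defaults are enough
csv_with_header = io.StringIO("a,b,c\n1,2,3\n4,5,6")
df = pd.read_csv(csv_with_header)

# The same data without a header: header=None adds a default 0..n-1 header
csv_no_header = io.StringIO("1,2,3\n4,5,6")
df2 = pd.read_csv(csv_no_header, header=None)

# A file delimited with spaces: sep can be a regular expression
txt_spaces = io.StringIO("1  2  3\n4  5  6")
df3 = pd.read_csv(txt_spaces, sep=r"\s+", header=None)
```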
On disk Files
● Specifying a header: a list of 5 values.
● Specifying an index column: the fifth column is no longer a value column but an index column.
● Some files may contain row values + other text; you can skip this text with the skiprows argument: skiprows=[0, 2] will not include the first and third rows.
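A sketch of the three arguments together (the column names and file content are assumptions for illustration):

```python
import io
import pandas as pd

data = "# a comment line\n1,2,3,4,A\n# another comment\n5,6,7,8,B"

# names= gives the 5 columns explicit labels; index_col= makes the
# fifth column an index instead of a value column; skiprows=[0, 2]
# drops the first and third lines (the comments)
df = pd.read_csv(io.StringIO(data),
                 names=["c1", "c2", "c3", "c4", "key"],
                 index_col="key",
                 skiprows=[0, 2])
```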
On disk Files
● By default, the missing and NA values are considered to be null when reading a file (here, the content of the file A3P-w2-ex5.csv).
● We can specify the null values with the na_values argument: as a dictionary, to specify the corresponding columns as keys; or as a list, to select from all the values of the file.
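A minimal sketch of both forms of na_values (the column names and sentinel values are hypothetical):

```python
import io
import pandas as pd

data = "name,score\nfoo,1\nmissing,2\nbar,-1"

# As a list: -1 is treated as null wherever it appears in the file
df_list = pd.read_csv(io.StringIO(data), na_values=[-1])

# As a dictionary: "missing" is null only in the 'name' column,
# -1 only in the 'score' column
df_dict = pd.read_csv(io.StringIO(data),
                      na_values={"name": ["missing"], "score": [-1]})
```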
On disk Files
● With chunksize == 2 rows, reading only 10 rows (from 10000) gives a total of 5 chunks (2 * 5 == 10 rows). You can then combine the argument values to create tuples.
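The chunked reading can be sketched as follows (a small in-memory csv stands in for the large file on disk):

```python
import io
import pandas as pd

# Build a small csv in memory: a header line plus 10 data rows
data = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# chunksize=2 returns an iterator of DataFrames, 2 rows each;
# 10 rows / 2 rows per chunk == 5 chunks
chunks = list(pd.read_csv(io.StringIO(data), chunksize=2))
```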
On disk Files
● With read_csv or read_table, you can read other text file formats (such as .txt files) containing columns separated by delimiters.
● You can use read_json to read json files.
● You can use read_html to read tabular data in an html file.
● You can use read_excel to read excel files (you will have to install the xlrd and openpyxl libraries).
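A minimal sketch of read_json (the json content is hypothetical; the read_excel call is shown commented out because it needs the openpyxl library installed):

```python
import io
import pandas as pd

# read_json accepts a path or a file-like object
json_data = io.StringIO('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')
df = pd.read_json(json_data)

# read_excel works the same way on .xlsx files:
# df_xl = pd.read_excel("file1.xlsx")
```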
On disk Files
● Only the displayed number of rows will be limited to 5 (the DataFrame still contains all the rows).
● By default, read_html will read all the tables of the page. The required libraries (in addition to pandas) are: lxml, beautifulsoup4, and html5lib.
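A sketch of the display limit, assuming it is set through pandas' display options (the read_html call is shown commented out since it needs the extra libraries and a real page):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Limit only the *display* to 5 rows; the DataFrame keeps all 100 rows
pd.set_option("display.max_rows", 5)
shown = repr(df)  # truncated output, with an ellipsis in the middle

# read_html would then return a list with every table found on the page:
# tables = pd.read_html("https://example.com/page.html")
```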
● To write the data to a file, you can use the corresponding methods: to_csv, to_json, and to_excel.
● After creating a DataFrame, you can save it to different file formats, for example file1.txt, file1.csv, file1.json, and file1.xlsx, then download file1.xlsx to check its content.
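The round trip can be sketched as follows (the file names follow the slide; a temporary directory stands in for the working directory, and to_excel is commented out because it needs openpyxl):

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Save the same DataFrame to several formats
out = Path(tempfile.mkdtemp())
df.to_csv(out / "file1.csv", index=False)
df.to_csv(out / "file1.txt", sep="\t", index=False)
df.to_json(out / "file1.json")
# df.to_excel(out / "file1.xlsx")

# Reading file1.csv back gives the original data
csv_back = pd.read_csv(out / "file1.csv")
```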
Web Interaction
● It is possible to interact with websites’ APIs to retrieve data via a predefined format.
● For example, with GitHub’s issues API: by default, it will get only the last 30 issues. From the returned json data, we selected two columns. With extra query parameters, we selected only closed issues, the second page, and 100 issues per page.
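A sketch of this interaction: the actual request is shown commented out (it needs the requests library and network access), a small sample payload stands in for the API's json response, and the two selected columns (number and title) are assumptions:

```python
import pandas as pd

# A real request would look like:
# import requests
# resp = requests.get(
#     "https://api.github.com/repos/pandas-dev/pandas/issues",
#     params={"state": "closed", "page": 2, "per_page": 100})
# data = resp.json()

# Sample payload standing in for the API's json response
data = [
    {"number": 101, "title": "First issue", "state": "closed"},
    {"number": 102, "title": "Second issue", "state": "closed"},
]

# Keep only the two columns we are interested in
issues = pd.DataFrame(data, columns=["number", "title"])
```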
DataBase Interaction
● In the following example, we will use sqlalchemy and pandas to interact with an sqlite database.
● There are various ways to connect, create, and extract data from a database using sqlalchemy. We selected one of them:
  – give the name of the database;
  – link “meta” with the created database (engine);
  – the table will have 2 columns: id and value;
  – the table is linked with the DB through the name of the table.
DataBase Interaction
● In the created database, each value to insert is specified with its corresponding column.
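A minimal sketch of these steps with SQLAlchemy Core (the table name and inserted values are hypothetical; an in-memory database stands in for a file on disk):

```python
import pandas as pd
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

# The name of the database ('sqlite:///mydb.sqlite' would persist it on disk)
engine = create_engine("sqlite:///:memory:")

# Link "meta" with the created database (engine)
meta = MetaData()

# The table will have 2 columns: id and value
mytable = Table("mytable", meta,
                Column("id", Integer, primary_key=True),
                Column("value", String))
meta.create_all(engine)

# Each value to insert is given with its corresponding column
with engine.begin() as conn:
    conn.execute(mytable.insert(), [{"id": 1, "value": "A"},
                                    {"id": 2, "value": "B"}])

# pandas can then read the table back into a DataFrame
df = pd.read_sql("SELECT * FROM mytable", engine)
```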
2- Missing data
[By Amina Delali]
● Filtering out
● Sometimes, data may have missing or NA values. So, with pandas we can filter out those values using the dropna method.
● On a Series, the new Series will contain only the non-null values.
● On a DataFrame df1, dropna by default will drop all rows containing at least one NaN value. To drop all columns containing at least one NaN value, you should specify axis=1.
● how=”all” means that dropna will drop rows only if all their values are NA. For example, column 3 is kept because it has two values different from NaN.
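The dropna variants can be sketched as follows (the data values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])
clean = ser.dropna()  # keeps only the two non-null values

df1 = pd.DataFrame([[1.0, np.nan, 2.0],
                    [np.nan, np.nan, np.nan],
                    [3.0, 4.0, 5.0]])

rows_any = df1.dropna()           # drops rows with at least one NaN
cols_any = df1.dropna(axis=1)     # drops columns with at least one NaN
rows_all = df1.dropna(how="all")  # drops rows only when ALL values are NaN
```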
● Filling in
● Instead of dropping missing data, we can produce new values using the fillna method of pandas.
● By default, fillna will fill along axis=0 with:
  – a given value: in this case, limit=2 signifies the maximum number of NaN values to be replaced in each column (this is our case);
  – a given method: in this case, limit=2 signifies the maximum number of consecutive NaN values to be replaced in a column.
● If axis=1 was specified, the filling would be done along the columns instead.
● With inplace=True, df1 itself will be modified.
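The two filling modes can be sketched as follows (the data is hypothetical; note that fillna(method=...) is deprecated in recent pandas, so the method case is written with ffill):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [np.nan, np.nan, np.nan, 4.0],
                    "b": [1.0, np.nan, 3.0, np.nan]})

# Fill with a given value: at most 2 NaN values replaced per column
filled_value = df1.fillna(0, limit=2)

# Fill with a method (forward fill): at most 2 *consecutive* NaN
# values replaced per column
filled_ffill = df1.ffill(limit=2)
```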
3- Data Transformation
● Some other types of transformations are necessary, such as: dropping duplicated data, transforming and creating new data using mapping, replacing values, renaming indexes, discretization, permutation, and random sampling.
Dropping duplicates
● In df1, row1 and row2 have the same values considering columns 0, 1 and 2. So, by default, only the first (top) one is kept.
● If keep=’last’ is specified, the last (bottom) line is kept instead.
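A minimal sketch of both behaviors (the duplicated values are hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3],
                    [1, 2, 3]],
                   index=["row1", "row2"])

first_kept = df1.drop_duplicates()            # keeps row1 (the top one)
last_kept = df1.drop_duplicates(keep="last")  # keeps row2 (the bottom one)
```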
Transforming data
● We added the new column “order” by mapping the values from “chars” using the dictionary myMap.
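A sketch of the mapping (the "chars" values and the contents of myMap are assumptions; only the column names follow the slide):

```python
import pandas as pd

df = pd.DataFrame({"chars": ["a", "b", "c"]})
myMap = {"a": 1, "b": 2, "c": 3}

# map looks each value of "chars" up in the dictionary
df["order"] = df["chars"].map(myMap)
```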
Replacing values and Renaming indexes
● Replacing values in df2:
  – to modify only one value: df1.replace(0.99, 1);
  – using inplace=True will modify the original DataFrame.
● Renaming indexes:
  – to modify column labels, use columns=;
  – if indexes or columns were strings, we could use for example index=str.lower;
  – renaming a label that doesn’t exist (here, 3) has no effect.
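Both operations can be sketched as follows (the DataFrame contents are hypothetical; only the value 0.99 and the non-existent label 3 follow the slide):

```python
import pandas as pd

df2 = pd.DataFrame({"a": [0.99, 2.0]}, index=["X", "Y"])

# Replace a single value (returns a copy unless inplace=True)
replaced = df2.replace(0.99, 1)

# Rename column labels with columns=, string indexes with a function
renamed = df2.rename(columns={"a": "A"}, index=str.lower)

# Renaming a label that doesn't exist (e.g. 3) has no effect
unchanged = df2.rename(index={3: "Z"})
```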
Discretization
● The values of ser2 are grouped in 3 categories: 0→4, 5→7, 8→9 == (0,4], (4,7], (7,9]. “(” means the value is out; “]” means the value is in. So 0 doesn’t belong to any category.
● The values can also be grouped in 4 categories of the same length, using the minimum and maximum values. In that case, all the values are included.
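Both groupings can be sketched with pd.cut (the values of ser2 are assumptions chosen to match the bin edges on the slide):

```python
import pandas as pd

ser2 = pd.Series([0, 1, 4, 5, 7, 8, 9])

# Explicit bin edges: (0,4], (4,7], (7,9].
# 0 falls outside every category, so its entry is NaN
cats = pd.cut(ser2, [0, 4, 7, 9])

# 4 equal-length categories built from the min and max values;
# here every value (including 0) is included
cats4 = pd.cut(ser2, 4)
```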
Permutation
● A permutation gives a new row order; the length of the permutation array must be == to the number of rows.
Random sampling
● If you select n greater than the length of ser2, you will have to specify the replace=True argument (to fill the remaining needed values).
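A sketch of both operations, assuming the permutation array comes from np.random.permutation (the DataFrame and ser2 contents are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
ser2 = pd.Series([10, 20, 30])

# Permutation: the array's length must equal the number of rows
order = np.random.permutation(len(df))
shuffled = df.take(order)

# Random sampling: n greater than len(ser2) requires replace=True,
# so values can be drawn again to fill the remaining slots
sampled = ser2.sample(n=5, replace=True)
```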
4- String Manipulation
[By Amina Delali]
String methods
● String objects have useful methods that can be used. For example, with index, if “e” doesn’t exist it will raise an exception; with find, if “e” doesn’t exist it will return -1.
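The two behaviors can be sketched with a plain Python string (the string itself is an arbitrary example):

```python
s = "python"

pos = s.find("e")    # "e" doesn't exist: find returns -1

try:
    s.index("e")     # "e" doesn't exist: index raises ValueError
    outcome = "no error"
except ValueError:
    outcome = "raised"
```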
References
● [1] SQLAlchemy authors and contributors. SQLAlchemy 1.2 documentation. On-line at https://docs.sqlalchemy.org/en/latest/core/dml.html. Accessed on 19-10-2018.
● [2] GitHub. REST API v3. On-line at https://developer.github.com/v3/. Accessed on 15-10-2018.
● [3] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc, 2018.
● [4] pydata.org. Pandas documentation. On-line at https://pandas.pydata.org/. Accessed on 19-10-2018.
● [5] pysheeet. Python SQLAlchemy cheatsheet. On-line at https://pysheeet.readthedocs.io/en/latest/notes/python-sqlalchemy.html. Accessed on 19-10-2018.
● [6] Wes McKinney. pydata-book. On-line at https://github.com/wesm/pydata-book.git. Accessed on 14-10-2018.
Thank
you!
FOR ALL YOUR TIME
