
Python novice here, looking for a concise way to write my program. I want to read multiple CSV files, clean them of outliers, normalize the columns, and then build a combined dataset from the normalized columns. Each input CSV has many columns and I want to normalize all of them; in the code below I have written an example for just 2 columns.

The code I wrote works, but it is tedious and cumbersome. I have it written for 3 datasets; in reality I could be looking at a lot more. Any help on how to loop this and make it concise? Thanks.

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Target percentile values used for normalization
    gr_P10 = 40
    gr_P50 = 65
    gr_P90 = 90
    rt_P10 = 10
    rt_P50 = 25
    rt_P90 = 50

    def get_quantiles(input_log):
        # 10th, 50th and 90th percentiles of the raw log
        p10_log = np.percentile(input_log, 10)
        p50_log = np.percentile(input_log, 50)
        p90_log = np.percentile(input_log, 90)
        return p10_log, p50_log, p90_log

    def normalize(input_log, x_90, x_50, x_10, p90_log, p50_log, p10_log):
        # Piecewise linear rescaling around the median
        mmin = (x_50 - x_10) / (p50_log - p10_log)
        mmax = (x_90 - x_50) / (p90_log - p50_log)
        if input_log < p50_log:
            output_log = x_50 + mmin * (input_log - p50_log)
        else:
            output_log = x_50 + mmax * (input_log - p50_log)
        return output_log

    # Read data and remove outliers
    # Data1
    a = pd.read_csv('Data1.csv')
    zscore = np.abs(stats.zscore(a))
    a = a[(zscore < 3).all(axis=1)]
    # Data2
    b = pd.read_csv('Data2.csv')
    zscore = np.abs(stats.zscore(b))
    b = b[(zscore < 3).all(axis=1)]
    # Data3
    c = pd.read_csv('Data3.csv')
    zscore = np.abs(stats.zscore(c))
    c = c[(zscore < 3).all(axis=1)]

    # Normalizing Data1
    p10_log, p50_log, p90_log = get_quantiles(a['GR'])
    a['GR_NORM'] = a.apply(lambda x: normalize(x['GR'], gr_P90, gr_P50, gr_P10, p90_log, p50_log, p10_log), axis=1)
    p10_log, p50_log, p90_log = get_quantiles(a['RT'])
    a['RT_NORM'] = a.apply(lambda x: normalize(x['RT'], rt_P90, rt_P50, rt_P10, p90_log, p50_log, p10_log), axis=1)

    # Normalizing Data2
    p10_log, p50_log, p90_log = get_quantiles(b['GR'])
    b['GR_NORM'] = b.apply(lambda x: normalize(x['GR'], gr_P90, gr_P50, gr_P10, p90_log, p50_log, p10_log), axis=1)
    p10_log, p50_log, p90_log = get_quantiles(b['RT'])
    b['RT_NORM'] = b.apply(lambda x: normalize(x['RT'], rt_P90, rt_P50, rt_P10, p90_log, p50_log, p10_log), axis=1)

    # Normalizing Data3
    p10_log, p50_log, p90_log = get_quantiles(c['GR'])
    c['GR_NORM'] = c.apply(lambda x: normalize(x['GR'], gr_P90, gr_P50, gr_P10, p90_log, p50_log, p10_log), axis=1)
    p10_log, p50_log, p90_log = get_quantiles(c['RT'])
    c['RT_NORM'] = c.apply(lambda x: normalize(x['RT'], rt_P90, rt_P50, rt_P10, p90_log, p50_log, p10_log), axis=1)

    # Forming the combined dataset from the normalized columns
    new_a = a[['GR_NORM', 'RT_NORM']].copy()
    new_b = b[['GR_NORM', 'RT_NORM']].copy()
    new_c = c[['GR_NORM', 'RT_NORM']].copy()
    new_dataset = pd.concat([new_a, new_b, new_c], ignore_index=True)

3 Answers


You just need to make more use of functions to get rid of the duplicated code. Try replacing the second half with something like this:

    def read_data(data):
        # Read one file and remove outliers
        a = pd.read_csv(data)
        zscore = np.abs(stats.zscore(a))
        a = a[(zscore < 3).all(axis=1)]

        # Normalize the columns
        p10_log, p50_log, p90_log = get_quantiles(a['GR'])
        a['GR_NORM'] = a.apply(lambda x: normalize(x['GR'], gr_P90, gr_P50, gr_P10, p90_log, p50_log, p10_log), axis=1)

        p10_log, p50_log, p90_log = get_quantiles(a['RT'])
        a['RT_NORM'] = a.apply(lambda x: normalize(x['RT'], rt_P90, rt_P50, rt_P10, p90_log, p50_log, p10_log), axis=1)

        return a[['GR_NORM', 'RT_NORM']].copy()


    data = ['Data1.csv', 'Data2.csv', 'Data3.csv']

    new_dataset = pd.DataFrame()

    for x in data:
        new_dataset = new_dataset.append(read_data(x))
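
Side note: `DataFrame.append` was deprecated and then removed in pandas 2.0, so on a recent pandas the loop above will fail. An equivalent sketch that collects the per-file frames in a list and concatenates them once:

    frames = [read_data(x) for x in data]
    new_dataset = pd.concat(frames, ignore_index=True)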



Unless I'm overlooking something, you can just write a function for that.
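
For illustration, a minimal sketch of such a function, reusing the asker's `get_quantiles` and `normalize` and a hypothetical `targets` dict that maps each column to its (P10, P50, P90) target values (numbers taken from the question). This also covers the "many columns" case, since handling another column only means adding an entry to the dict:

    # Hypothetical targets dict: column name -> (P10, P50, P90) target values
    targets = {'GR': (40, 65, 90), 'RT': (10, 25, 50)}

    def clean_and_normalize(path, targets):
        df = pd.read_csv(path)
        # Drop rows more than 3 standard deviations from the mean in any column
        df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
        out = pd.DataFrame()
        for col, (x_10, x_50, x_90) in targets.items():
            p10_log, p50_log, p90_log = get_quantiles(df[col])
            out[col + '_NORM'] = df[col].apply(
                normalize, args=(x_90, x_50, x_10, p90_log, p50_log, p10_log))
        return out

    files = ['Data1.csv', 'Data2.csv', 'Data3.csv']
    new_dataset = pd.concat(
        [clean_and_normalize(f, targets) for f in files], ignore_index=True)

Passing the extra arguments through `args=` lets you apply `normalize` column by column instead of the row-wise `apply(..., axis=1)` lambda, which tends to be slower on large files.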


    N_files = 3
    for i in range(1, N_files + 1):      # range's end is exclusive, so go one past N_files
        a = pd.read_csv(f"Data{i}.csv")  # this will loop through and open all your files
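
If the files don't follow a numbered naming scheme, one option (a sketch, assuming the CSVs live in the working directory and all match `Data*.csv`) is to discover them with `glob`:

    import glob

    for path in sorted(glob.glob('Data*.csv')):
        a = pd.read_csv(path)  # process each file here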

