I have a file formatted as below:
S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1
I'd like to split it at each "S_A_" header line and compute the correlation coefficient of the two columns of data below each header. So far, I have:
import re
import pandas as pd
test = pd.read_csv("predict.csv",sep=('S\d+A\d+'))
print test
but that only gives me:
   Unnamed: 0     ,
0   0.01,0.01  None
1   0.02,0.02  None
2   0.03,0.03  None
3         NaN     ,
4   0.05,0.06  None
5   0.07,0.08  None
6         NaN     ,
7   1000,0.04  None
8   2000,0.08  None
9    3000,0.1  None

[10 rows x 2 columns]
Ideally, I'd like to keep the text matched by the regex delimiter as a label, and get something like:
S1A23: 1.0
S25A123: 0.86
S3034A1: 0.75
Is this possible?
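To make the goal concrete, here is a minimal sketch of the behaviour I'm after, written with only the standard library instead of pandas. It assumes every header line matches `S\d+A\d+` exactly and every other line is an `x,y` pair. The names `split_sections` and `pearson` are made up for this illustration.

```python
import math
import re
from collections import OrderedDict

# Assumed header shape, e.g. "S1A23"; adjust if the real headers differ.
HEADER = re.compile(r'^S\d+A\d+$')


def split_sections(lines):
    """Group 'x,y' rows under the most recent header line."""
    sections = OrderedDict()
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if HEADER.match(line):
            current = line
            sections[current] = []
        elif current is not None:
            x, y = line.split(',')
            sections[current].append((float(x), float(y)))
    return sections


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; NaN when it is undefined."""
    n = len(pairs)
    if n < 2:
        return float('nan')
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float('nan')
    return cov / math.sqrt(var_x * var_y)


# The sample file from above, inlined so the sketch runs on its own.
sample = """S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1"""

for label, pairs in split_sections(sample.splitlines()).items():
    print('{}: {:.2f}'.format(label, pearson(pairs)))
```

This prints one "label: coefficient" line per header, which is the shape of output I want. I'd still prefer a pandas way to do it.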
EDIT
When I run this on large files (~250k lines), I get the following error. The data itself does not seem to be the problem: when I split the ~250k lines into smaller chunks, every chunk runs fine.
Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
My exact code is:
import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'keyStrokeFourgram/TESTING1'
df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match('^S\d+_A\d+')
df['S/A'] = (df['pause']
             .where(mask, np.nan)
             .fillna(method='ffill'))
df = df.loc[~mask]
result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)
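The traceback fails at `max_rows - 4` with `max_rows` being `None`, which seems consistent with the `pd.options.display.max_rows = None` line above, so the failure may come from printing the Series rather than from the computation. One possible workaround, sketched below, is to skip the Series repr and print one line per entry. `format_correlations` is a name made up for this sketch, and an `OrderedDict` with illustrative values stands in for `result` so that the snippet runs without pandas. I'm assuming a Series exposes `.items()` yielding (label, value) pairs, as it does in recent pandas versions.

```python
from collections import OrderedDict


def format_correlations(result):
    """Build 'label: r' lines from any object exposing .items()."""
    return ['{}: {:.2f}'.format(label, r) for label, r in result.items()]


# Stand-in for the groupby result; labels and values are illustrative only.
result = OrderedDict([('S1A23', 1.0), ('S2A123', 0.86)])
for line in format_correlations(result):
    print(line)
```

Setting `max_rows` to a large integer instead of `None` might also avoid the error, but I have not confirmed that on this pandas version.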