
I have a file formatted as below:

S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1

I'd like to split it at each "S_A_" header and compute the correlation coefficient of the data below each one. So far, I have:

import re
import pandas as pd

test = pd.read_csv("predict.csv",sep=('S\d+A\d+'))

print test

but that only gives me:

  Unnamed: 0     ,
0  0.01,0.01  None
1  0.02,0.02  None
2  0.03,0.03  None
3        NaN     ,
4  0.05,0.06  None
5  0.07,0.08  None
6        NaN     ,
7  1000,0.04  None
8  2000,0.08  None
9   3000,0.1  None

[10 rows x 2 columns]

I'd, ideally, like to keep the regex delimiter, and have something like:

S1A23: 1.0
S25A123: 0.86
S3034A1: 0.75

Is this possible?

EDIT
When running this on large files (~250k lines), I receive the following error. It is not a problem with the data itself: when I break the ~250k lines into smaller chunks, each piece runs fine.

Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

My exact code is:

import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'keyStrokeFourgram/TESTING1'

df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match(r'^S\d+_A\d+')
df['S/A'] = (df['pause']
              .where(mask, np.nan)
              .fillna(method='ffill'))
df = df.loc[~mask]

result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)

1 Answer

The sep parameter specifies the pattern that separates values on the same line. It cannot be used to split the rows of a csv into separate DataFrames.
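To illustrate the point, here is a minimal sketch (using io.StringIO in place of a file): sep only ever splits a single line into fields, never lines into groups.

```python
from io import StringIO

import pandas as pd

# sep splits fields *within* each line into columns; it never
# groups lines into separate tables.
df = pd.read_csv(StringIO("x;y\n0.01;0.02\n0.03;0.04"), sep=";")
print(df.columns.tolist())  # ['x', 'y']
print(df.shape)             # (2, 2)
```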

Edit: There is a way to read the csv into a DataFrame using read_csv. This is preferable to using a Python loop (as done in my original answer) since read_csv should be faster. This could be important -- particularly for large csv files.

import numpy as np
import pandas as pd
df = pd.read_csv("data", names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # 1
df['type'] = (df['x']
              .where(mask, np.nan)            # 2
              .fillna(method='ffill'))        # 3
df = df.loc[~mask]                            # 4

result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

yields

type
S1A23      1.000000
S25A123    1.000000
S3034A1    0.981981
dtype: float64

  1. The mask is True on the rows that have a "type" in the 'x' column.

    In [139]: mask
    Out[139]: 
    0      True
    1     False
    2     False
    3     False
    4      True
    5     False
    6     False
    7      True
    8     False
    9     False
    10    False
    Name: x, dtype: bool
    
  2. df['x'].where(mask, np.nan) returns a Series, equal to df['x'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the "type" values:

    In [141]: df['x'].where(mask, np.nan).fillna(method='ffill')
    Out[141]: 
    0       S1A23
    1       S1A23
    2       S1A23
    3       S1A23
    4     S25A123
    5     S25A123
    6     S25A123
    7     S3034A1
    8     S3034A1
    9     S3034A1
    10    S3034A1
    Name: x, dtype: object
    
  4. Select only those rows where the mask is False
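Putting the four steps together end to end, here is a self-contained sketch that feeds the sample data through io.StringIO instead of a file. Note two additions for recent pandas versions: .ffill() in place of fillna(method='ffill'), and an astype cast, since the csv values arrive as strings in the mixed 'x' column.

```python
from io import StringIO

import pandas as pd

data = """S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1"""

df = pd.read_csv(StringIO(data), names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # True on the header rows
df['type'] = df['x'].where(mask).ffill()      # broadcast each header downward
df = df.loc[~mask]                            # keep only the data rows
df = df.astype({'x': float, 'y': float})      # csv values arrive as strings here

result = df.groupby('type')[['x', 'y']].apply(lambda grp: grp['x'].corr(grp['y']))
print(result)
```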

Original answer:

Unfortunately, I don't see a way to read your data file directly into an appropriate DataFrame. You'll need to do some massaging of the rows to get it into the right form, using a Python loop.

import pandas as pd
import csv

def to_columns(f):
    val = None
    for row in csv.reader(f):
        if len(row) == 1:
            val = row[0]
        else:
            yield [val] + row

with open('data') as f:
    df = pd.DataFrame.from_records(to_columns(f), columns=['type', 'x', 'y'])
df[['x', 'y']] = df[['x', 'y']].astype(float)  # csv.reader yields strings

print(df)
result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

6 Comments

Thanks for the update. I do have a large amount of data, though, and it's not printing out all of the data. I've tried it in PyCharm and the Terminal, but in both cases it only prints a couple dozen datapoints, with a "..." in the middle. Any idea how to get all of the data?
Put pd.options.display.max_rows = None in your script to see all the rows. Type help(pd.set_option) to see the available options. Another alternative to print all the rows is to use print(df.to_string()).
Two followup questions about this, if you don't mind: 1) How can I change it to use Spearman instead of Pearson correlation? 2) I'm getting an error when I try to feed in a lot of chunks (~5k). I can copy the error if you'd like.
1) Change corr(grp['y']) to corr(grp['y'], method='spearman'). 2) Please post the full traceback error message, and the code you are running too.
I added a suggestion for how to fix the error here.
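For reference, the Spearman change from the comment above looks like this on made-up toy data (the series here are invented for illustration); it also shows why the two methods differ: Spearman uses only the rank order, so a single outlier does not affect it.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 100.0])  # made-up data with one outlier
y = pd.Series([2.0, 4.0, 6.0, 7.0])

print(x.corr(y))                      # Pearson: pulled around by the outlier
print(x.corr(y, method='spearman'))   # Spearman: identical rank order -> 1.0
```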