
I have a file formatted as below:

S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1

I'd like to split it at each "S_A_" header and compute the correlation coefficient of the data below each one. So far, I have:

import re
import pandas as pd

test = pd.read_csv("predict.csv",sep=('S\d+A\d+'))

print test

but that only gives me:

  Unnamed: 0     ,
0  0.01,0.01  None
1  0.02,0.02  None
2  0.03,0.03  None
3        NaN     ,
4  0.05,0.06  None
5  0.07,0.08  None
6        NaN     ,
7  1000,0.04  None
8  2000,0.08  None
9   3000,0.1  None

[10 rows x 2 columns]

I'd, ideally, like to keep the regex delimiter, and have something like:

S1A23: 1.0
S25A123: 0.86
S3034A1: 0.75

Is this possible?

EDIT
When running this on large files (~250k lines), I receive the following error. It is not a problem with the data itself: when I break the ~250k lines into smaller chunks, each piece runs fine.

Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

My exact code is:

import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'keyStrokeFourgram/TESTING1'

df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match(r'^S\d+_A\d+')
df['S/A'] = (df['pause']
              .where(mask, np.nan)
              .fillna(method='ffill'))
df = df.loc[~mask]

result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)

1 Answer

The sep parameter specifies the pattern that separates values on the same line. It cannot be used to split the rows of a csv into separate DataFrames.
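To illustrate the point, here is a minimal sketch (using io.StringIO in place of a file): sep only ever splits a single line into fields, never lines into groups.

```python
from io import StringIO

import pandas as pd

# sep splits fields *within* each line into columns; it never
# groups lines into separate tables.
df = pd.read_csv(StringIO("x;y\n0.01;0.02\n0.03;0.04"), sep=";")
print(df.columns.tolist())  # ['x', 'y']
print(df.shape)             # (2, 2)
```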

Edit: There is a way to read the csv into a DataFrame using read_csv. This is preferable to using a Python loop (as done in my original answer) since read_csv should be faster. This could be important -- particularly for large csv files.

import numpy as np
import pandas as pd
df = pd.read_csv("data", names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # 1
df['type'] = (df['x']
              .where(mask, np.nan)            # 2
              .fillna(method='ffill'))        # 3
df = df.loc[~mask]                            # 4

result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

yields

type
S1A23      1.000000
S25A123    1.000000
S3034A1    0.981981
dtype: float64

  1. The mask is True on the rows that have a "type" in the 'x' column.

    In [139]: mask
    Out[139]: 
    0      True
    1     False
    2     False
    3     False
    4      True
    5     False
    6     False
    7      True
    8     False
    9     False
    10    False
    Name: x, dtype: bool
    
  2. df['x'].where(mask, np.nan) returns a Series, equal to df['x'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the "type" values:

    In [141]: df['x'].where(mask, np.nan).fillna(method='ffill')
    Out[141]: 
    0       S1A23
    1       S1A23
    2       S1A23
    3       S1A23
    4     S25A123
    5     S25A123
    6     S25A123
    7     S3034A1
    8     S3034A1
    9     S3034A1
    10    S3034A1
    Name: x, dtype: object
    
  4. Select only those rows where the mask is False
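Putting the four steps together end to end, here is a self-contained sketch that feeds the sample data through io.StringIO instead of a file. Note two additions for recent pandas versions: .ffill() in place of fillna(method='ffill'), and an astype cast, since the csv values arrive as strings in the mixed 'x' column.

```python
from io import StringIO

import pandas as pd

data = """S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1"""

df = pd.read_csv(StringIO(data), names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # True on the header rows
df['type'] = df['x'].where(mask).ffill()      # broadcast each header downward
df = df.loc[~mask]                            # keep only the data rows
df = df.astype({'x': float, 'y': float})      # csv values arrive as strings here

result = df.groupby('type')[['x', 'y']].apply(lambda grp: grp['x'].corr(grp['y']))
print(result)
```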

Original answer:

Unfortunately, I don't see a way to read your data file directly into an appropriate DataFrame. You'll need to do some massaging of the rows to get it into the right form, using a Python loop.

import pandas as pd
import csv

def to_columns(f):
    val = None
    for row in csv.reader(f):
        if len(row) == 1:
            val = row[0]
        else:
            yield [val] + row

with open('data') as f:
    df = pd.DataFrame.from_records(to_columns(f), columns=['type', 'x', 'y'])
df[['x', 'y']] = df[['x', 'y']].astype(float)  # csv.reader yields strings

print(df)
result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

6 Comments

Thanks for the update. I do have a large amount of data, though, and it's not printing out all of the data. I've tried it in PyCharm and the Terminal, but in both cases it only prints a couple dozen datapoints, with a "..." in the middle. Any idea how to get all of the data?
Put pd.options.display.max_rows = None in your script to see all the rows. Type help(pd.set_option) to see the available options. Another alternative to print all the rows is to use print(df.to_string()).
Two followup questions about this, if you don't mind: 1) How can I change it to use Spearman instead of Pearson correlation? 2) I'm getting an error when I try to feed in a lot of chunks (~5k). I can copy the error if you'd like.
1) Change corr(grp['y']) to corr(grp['y'], method='spearman'). 2) Please post the full traceback error message, and the code you are running too.
I added a suggestion for how to fix the error here.
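For reference, the Spearman change from the comment above looks like this on made-up toy data (the series here are invented for illustration); it also shows why the two methods differ: Spearman uses only the rank order, so a single outlier does not affect it.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 100.0])  # made-up data with one outlier
y = pd.Series([2.0, 4.0, 6.0, 7.0])

print(x.corr(y))                      # Pearson: pulled around by the outlier
print(x.corr(y, method='spearman'))   # Spearman: identical rank order -> 1.0
```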