Pandas Dataframe to Numpy Array with '\t' delimiter

Question

I am trying to load some data that has several thousand rows and 4 columns, where the columns are separated by tabs, and convert every item of every row to an int.

When I create the DataFrame like this:

my_data = pd.read_csv('filename', sep='\t')

I get output where each row looks like this:

col1\tcol2\tcol3\tcol4

I then need to transform this into a NumPy array, so I do this:

arr_data = np.array(my_data)

This is my output now:

array([['col1\tcol2\tcol3\tcol4'],
       ['col1\tcol2\tcol3\tcol4'],
       ['col1\tcol2\tcol3\tcol4'], 
       .....
       .....

So basically each row is now a single string. I'd like to turn everything into an int instead of a string, but when I try this:

arr_data = np.array(my_data, dtype=int)

I get a ValueError.

Do I need to write a nested for loop that goes through every row, and then every column in each row, to convert each item to an int?

Edit: I've also just noticed that when I create the DataFrame, the data has shape (rows, 1) instead of (rows, 4), which I guess means the delimiter didn't work. Here are the first few rows:

1   1   5   874965758
1   2   3   876893171
1   3   4   878542960
1   4   3   876893119
1   5   3   889751712
1   7   4   875071561

Thanks.

What you say is somewhat hard to believe. Could you include the first couple of rows of the file?
– DYZ, Commented Aug 2, 2018 at 21:19
Please do not paste screenshots; paste the actual data. Screenshots cannot be copied if that becomes necessary. I suspect that your columns are space-separated. Check if setting the separator to '\s+' helps.
– DYZ, Commented Aug 2, 2018 at 21:24

rafaelc · Accepted Answer · 2018-08-02 21:27:47Z

2

Use the delim_whitespace flag:

my_data = pd.read_csv('filename', delim_whitespace=True)
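To see why the original call produced a frame of shape (rows, 1), it can help to compare the two separators on a few rows from the question. This is only a minimal sketch: `raw`, `tab_df`, and `ws_df` are illustrative names, `header=None` assumes the file has no header row (the sample rows suggest this), and it uses `sep=r'\s+'` because, as far as I know, `delim_whitespace` is deprecated in recent pandas releases (2.2 and later).

```python
import io

import pandas as pd

# Sample rows copied from the question; the columns are separated
# by spaces, not tabs.
raw = (
    "1   1   5   874965758\n"
    "1   2   3   876893171\n"
    "1   3   4   878542960\n"
)

# With sep='\t' no tab is ever found, so each line stays in one column.
tab_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# A whitespace regex splits on runs of spaces (and tabs).
ws_df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(tab_df.shape)  # (3, 1)
print(ws_df.shape)   # (3, 4)
```

Checking `.shape` right after loading is a quick way to confirm that the delimiter matched the file.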


2 Comments

JAbrams Over a year ago

Thank you, this worked as well. Is there any difference between delim_whitespace=True and using the sep='\s+' argument?

rafaelc Over a year ago

@JessiAbrams It is the same thing, but delim_whitespace=True is easier to read and maintain than sep='\s+'.

DYZ · Accepted Answer · 2018-08-02 21:26:27Z

1

Your columns are not TAB-separated. They are space-separated. Use sep='\s+' to parse them. This separator incidentally also covers tabs.
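Once the columns are split correctly, no nested loop should be needed to get integers. The sketch below is one possible way to go from the file to an integer NumPy array; it reads from an in-memory string instead of a real file, and `header=None` is an assumption based on the sample rows, which do not appear to include a header line.

```python
import io

import numpy as np
import pandas as pd

# The six sample rows from the question (space-separated).
raw = (
    "1   1   5   874965758\n"
    "1   2   3   876893171\n"
    "1   3   4   878542960\n"
    "1   4   3   876893119\n"
    "1   5   3   889751712\n"
    "1   7   4   875071561\n"
)

# sep=r'\s+' treats any run of whitespace as one delimiter.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# pandas already infers integer columns here; to_numpy converts the
# whole frame to a 2-D integer array in a single call.
arr = df.to_numpy(dtype=int)

print(arr.shape)  # (6, 4)
print(arr[0])
```

For a real file, the same call would take the file path in place of `io.StringIO(raw)`.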


3 Comments

JAbrams Over a year ago

Thank you! I will accept this as the correct answer as soon as SO allows.

JAbrams Over a year ago

May I ask how you figured this out just by looking at the data?

DYZ Over a year ago

That was the only plausible explanation.
