
I am trying to load a file with several thousand rows and 4 columns, where the columns are separated by tabs, and convert every item in every row to an int.

When I create the dataframe like this:

my_data = pd.read_csv('filename', sep='\t')

I get an output where each row looks like this:

col1\tcol2\tcol3\tcol4

I then need to transform this into a numpy array, so I do this:

arr_data = np.array(my_data)

This is my output now:

array([['col1\tcol2\tcol3\tcol4'],
       ['col1\tcol2\tcol3\tcol4'],
       ['col1\tcol2\tcol3\tcol4'], 
       .....
       .....

So basically each row is now a single string. What I'd like to do is turn everything into an int instead of a string, but when I try to do this:

arr_data = np.array(my_data, dtype=int) 

I get a ValueError.

Do I need to write a nested for loop that goes through every row, and then every column in every row, to convert each item to an int?

Edit: I've also just noticed that when I create the dataframe, the data has shape (rows, 1) instead of (rows, 4), which I guess means the delimiter didn't work? Here are the first few rows:

1   1   5   874965758
1   2   3   876893171
1   3   4   878542960
1   4   3   876893119
1   5   3   889751712
1   7   4   875071561

Thanks

  • What you say is somewhat hard to believe. Could you include the first couple of rows of the file? Commented Aug 2, 2018 at 21:19
  • Please paste the actual data, not screenshots; screenshots cannot be copied. I suspect that your columns are space-separated. Check whether setting the separator to '\s+' helps. Commented Aug 2, 2018 at 21:24

2 Answers


Use the delim_whitespace flag:

my_data = pd.read_csv('filename', delim_whitespace=True)

2 Comments

Thank you, this worked as well. Is there any difference between delim_whitespace=True and using the sep='\s+' argument?
@JessiAbrams They do the same thing, but delim_whitespace=True is easier to read and maintain than sep='\s+'.

Your columns are not tab-separated; they are space-separated. Use sep='\s+' to parse them. This separator matches any run of whitespace, so it also covers tabs.
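For completeness, here is a minimal sketch of the whole round trip (file to dataframe to int array), using the sample rows from the question in place of the real file. The `io.StringIO` stand-in and `header=None` are my assumptions: the sample shows no header row, so adjust if your file has one. Also note that, as far as I know, recent pandas releases (2.2 onward) deprecate `delim_whitespace` in favor of `sep='\s+'`, so the `sep` form is probably the safer one to rely on.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; a stand-in for the real file.
raw = """1   1   5   874965758
1   2   3   876893171
1   3   4   878542960
"""

# sep=r'\s+' splits on any run of whitespace (spaces or tabs).
# header=None is an assumption: the sample data has no header row.
my_data = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None)

# Once the columns are parsed correctly, no loop is needed:
# the whole frame converts to an integer array in one call.
arr_data = my_data.to_numpy(dtype=int)

print(my_data.shape)
print(arr_data.dtype)
```

With the real file, pass its path instead of `io.StringIO(raw)`. No nested loop is needed once the delimiter is right, because pandas already infers integer columns.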

3 Comments

Thank you! I will accept this as the correct answer as soon as SO allows.
May I ask how you figured this out just by looking at the data?
That was the only plausible explanation.
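If you would rather check than guess, one way (a rough sketch, not the answerer's method) is to look at the `repr()` of a raw line, which shows tabs as `\t` and spaces as spaces, and then compare the shapes you get with each separator. The sample string below stands in for the first lines of the real file.

```python
import io

import pandas as pd

# Two sample rows from the question; a stand-in for the real file contents.
raw = "1   1   5   874965758\n1   2   3   876893171\n"

# repr() makes invisible characters visible: a tab would print as \t.
first_line = raw.splitlines()[0]
print(repr(first_line))
has_tab = '\t' in first_line

# With sep='\t' and no tabs in the data, every line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw), sep='\t', header=None)

# With sep=r'\s+', runs of spaces are treated as one delimiter.
right = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None)

print(wrong.shape, right.shape)
```

A shape of (rows, 1) after parsing is the giveaway that the separator never matched.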
