
I would like to sort my data by a given column, specifically p-values. However, I am not able to load my entire dataset into memory, so the following doesn't work, or rather works only for small datasets.

data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)

Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

  • where is your data stored? how big? what's the memory constraint? Commented Jan 22, 2014 at 1:09
  • It's a couple-of-TB file and the maximum available memory is about 250 GB on the cluster. Commented Jan 22, 2014 at 2:51
  • and how do you store it; assume hdf? Commented Jan 22, 2014 at 3:30
  • see pandas.pydata.org/pandas-docs/dev/io.html, and pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore; hdf5 is an extremely efficient format for fast and space-efficient storage and retrieval. sorting is non-trivial in a chunked manner but certainly possible Commented Jan 22, 2014 at 4:30
  • Me and some co-workers came across this same problem. What we ended up doing was using a parallel process to split the file into smaller 1 million row chunks. Then, depending on how you are sorting, you can come up with some sort of directory scheme to "sort" the files into. If it's transaction data, you could use AWK or pandas to parse each 1 million row chunk into a relative year_quarter directory/file, and then you can sort on these aggregated files. If you need the data in one file, then at the end you can just stack them back together in order. Good luck! Commented Jan 22, 2014 at 20:06

5 Answers


In the past, I've used Linux's pair of venerable sort and split utilities, to sort massive files that choked pandas.

I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think converting it into other formats (HDF, SQL, etc.) is a tremendous complication for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.


Say your file is called stuff.csv, and looks like this:

4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2

Then the following command will sort it by the 3rd column:

sort --parallel=8 -t , -nrk3 stuff.csv

Note that -t , sets the field separator to a comma, -nrk3 sorts numerically and in reverse on the 3rd field, and --parallel=8 lets sort run up to 8 concurrent sorts.


The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So

split -l 100000 stuff.csv stuff

would split the file into files of length at most 100000 lines.

Now you would sort each file individually, as above. Finally, you would use mergesort, again through (wait for it...) sort:

sort -m sorted_stuff_* > final_sorted_stuff.csv

Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
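The same split/sort/merge workflow can also be sketched in pure Python, using heapq.merge to stream the pre-sorted chunks so that only one row per chunk is in memory during the merge. This is a minimal sketch, not the answer's exact method; the function name, key column, and chunk size are illustrative:

```python
import csv
import heapq
import tempfile

def external_sort_csv(in_path, out_path, key_col, chunk_lines=100_000):
    """Sort a large CSV by one numeric column with bounded memory:
    sort fixed-size chunks, spill them to temp files, then merge."""
    chunk_paths = []
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        while True:
            # read at most chunk_lines rows
            chunk = [row for _, row in zip(range(chunk_lines), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda r: float(r[key_col]))  # in-memory sort of one chunk
            tmp = tempfile.NamedTemporaryFile(
                "w", delete=False, newline="", suffix=".csv")
            csv.writer(tmp).writerows(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)

    # heapq.merge lazily merges the already-sorted chunk files
    files = [open(p, newline="") for p in chunk_paths]
    try:
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            for row in heapq.merge(*(csv.reader(fh) for fh in files),
                                   key=lambda r: float(r[key_col])):
                writer.writerow(row)
    finally:
        for fh in files:
            fh.close()
```

This is essentially what sort does internally for oversized inputs; the shell version above is likely faster, but the Python version lets you plug in arbitrary parsing or key logic.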


1 Comment

isn't it necessary to also specify the order of sorting for the mergesort at the end, i.e. sort -nrk3 -m sorted_stuff_* > final_sorted_stuff.csv? Without this, won't sort default to, I believe, just sorting based on the first column and then proceeding rightwards?

As I referred in the comments, this answer already provides a possible solution. It is based on the HDF format.

About the sorting problem, there are at least three possible ways to solve it with that approach.

First, you can try to use pandas directly, querying the HDF-stored-DataFrame.

Second, you can use PyTables, which pandas uses under the hood.

Francesc Alted gives a hint in the PyTables mailing list:

The simplest way is by setting the sortby parameter to true in the Table.copy() method. This triggers an on-disk sorting operation, so you don't have to be afraid of your available memory. You will need the Pro version for getting this capability.

In the docs, it says:

sortby : If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used

Third, still with PyTables, you can use the method Table.itersorted().

From the docs:

Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)

Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
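To make the second and third options concrete, here is a small self-contained sketch with PyTables. The table layout and column names are invented for illustration, and note that in current PyTables the indexing features that once required the Pro version are built in. It creates a tiny table, builds a completely sorted index (CSI) on the p-value column, makes an on-disk sorted copy via Table.copy(sortby=...), and streams rows with Table.itersorted():

```python
import tables  # PyTables

class PRow(tables.IsDescription):
    name = tables.StringCol(16)    # hypothetical id column
    p_value = tables.Float64Col()  # the column we sort on

def build_and_sort(h5_path):
    """Write a tiny table, index p_value, and read rows back in sorted order."""
    with tables.open_file(h5_path, mode="w") as h5:
        t = h5.create_table("/", "pvals", PRow)
        for nm, p in [(b"geneA", 0.7), (b"geneB", 0.01), (b"geneC", 0.2)]:
            row = t.row
            row["name"] = nm
            row["p_value"] = p
            row.append()
        t.flush()

        # A CSI (completely sorted) index is required for fully sorted order
        t.cols.p_value.create_csindex()

        # On-disk sorted copy: the sort happens out of core
        t.copy(newname="pvals_sorted", sortby="p_value", checkCSI=True)

        # itersorted() streams rows following the index, without loading the table
        return [r["p_value"] for r in t.itersorted("p_value")]
```

For a multi-TB file you would of course append in chunks rather than row by row, but the indexing and sorted-copy calls are the same.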


Another approach consists in using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.

This makes it possible to solve the sorting problem, along with other data analyses that pandas enables. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.

Introduction

This notebook explores a 3.9 GB CSV file.

This notebook is a primer on out-of-memory data analysis with

  • pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
  • IPython notebook: An interface for writing and sharing python code, text, and plots.
  • SQLite: A self-contained, serverless database that's easy to set up and query from pandas.
  • Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.

Requirements

import pandas as pd
from sqlalchemy import create_engine # database connection 

Import the CSV data into SQLite

  1. Load the CSV, chunk-by-chunk, into a DataFrame
  2. Process the data a bit, strip out uninteresting columns
  3. Append it to the SQLite database

disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory

chunksize = 20000
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    
    # do stuff   

    df.index += index_start

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

Query value counts and order the results

Housing and Development Dept receives the most complaints

df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)

Limiting the number of sorted entries

What are the 10 cities with the most complaints?

df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
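Applied back to the original question, the same SQLite pattern handles the P_VALUE sort: the database does the ordering on disk, and LIMIT (or chunked reads) keeps the result small. A minimal sketch with an in-memory database and invented column names:

```python
import sqlite3

import pandas as pd

# Build a toy "data" table; in practice it would be filled chunk-by-chunk
# with to_sql(..., if_exists='append') as in the loop above.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"GENE": ["a", "b", "c"],
              "P_VALUE": [0.7, 0.01, 0.2]}).to_sql("data", conn, index=False)

# The database performs the sort; only the rows we ask for come back
top = pd.read_sql_query(
    "SELECT GENE, P_VALUE FROM data ORDER BY P_VALUE ASC LIMIT 2", conn)
```

Dropping the LIMIT and iterating with read_sql_query(..., chunksize=...) would stream the fully sorted result without materializing it.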




Blaze might be the tool for you, with its ability to work with pandas and csv files out of core. http://blaze.readthedocs.org/en/latest/ooc.html

import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort()  # Uses Chunked Pandas

For faster processing, load it into a database first, which blaze can control. But if this is a one-off and you have some time, then the posted code should do it.



If your csv file contains only structured data, I would suggest an approach using only Linux commands.

Assume csv file contains two columns, COL_1 and P_VALUE:

map.py:

import sys

for line in sys.stdin:
    col_1, p_value = line.strip().split(',')
    print("%s,%s" % (p_value, col_1))

then the following linux command will generate the csv file sorted numerically by p_value:

cat input.csv | python map.py | sort -t, -k1,1n > output.csv

If you're familiar with Hadoop, using the above map.py together with a simple reduce.py will generate the sorted csv file via the Hadoop streaming system.



Here is my honest suggestion. There are three options you can consider.

  1. I like pandas for its rich documentation and features, but I have been advised that NumPy can feel faster for larger datasets. You can think of using other tools as well for an easier job.

  2. If you are using Python 3, you can break your big data chunk into smaller sets and do concurrent processing. I am too lazy for this and it doesn't look cool, but note that pandas, NumPy and SciPy are, I believe, built with hardware design perspectives in mind to enable multi-threading.

  3. I prefer this one; it is an easy and lazy technique in my opinion. Check the document at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html

You can also use the 'kind' parameter in the pandas sort function you are using.

Godspeed my friend.

3 Comments

I would like to ask you to show some references or examples on how "it feels faster". pandas is built on top of numpy. It is like a numpy data-analysis-flavoured version. Just do df.values to get a numpy.array. Also, DataFrame.sort_values() (the one linked is deprecated) uses numpy.sort(). See the code here. Of course, it may add some overhead, and it might be slightly faster to use numpy (CPU time, perhaps not in terms of programming time), in which case you can easily access numpy objects.
While NumPy is a great tool for problems that fit in RAM, it can't handle problems of the size he's talking about on the author's current machine. (This memory constraint wasn't in the original question; it showed up in the comments later.)
Thank you for the clarity, iled & Back2Basics. Thanks guys.
