How to save & load large pandas dataframes

I have recently started using Pandas for many projects, but one feature I felt was missing was a native file format for the data. This becomes especially important as the data grows.

Numpy has a simple data format that is just a header plus a raw memory dump, which is great because it allows you to memory-map the data straight from disk. Pandas has no equivalent.
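
To illustrate what that buys you, here is a minimal sketch (the file name is arbitrary):

import numpy as np

arr = np.random.rand(1_000_000)
np.save('arr.npy', arr)   # a small header followed by the raw bytes

# Memory-map the file: pages are read in lazily as you touch them,
# so even a huge array "loads" almost instantly.
view = np.load('arr.npy', mmap_mode='r')
print(view[:5])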

After looking at the code a little bit, I realized it’s pretty easy to fake it though:

  1. The data in a Pandas DataFrame is held in a numpy array.
  2. You can save that array using the numpy format.
  3. The numpy code does not care about the file beyond the header: it just maps the rest of the data into memory.
  4. In particular, it does not care if there is something in the file after the data. Thus, you can save the Pandas extra-data after the numpy array on disk, as the quick check below shows.
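
Here is a quick check of point 4 (a throwaway sketch; the file name is arbitrary):

import numpy as np

arr = np.arange(10, dtype=np.float64)
with open('demo.npy', 'wb') as f:
    np.save(f, arr)
with open('demo.npy', 'ab') as f:     # append junk after the array
    f.write(b'trailing bytes that numpy never looks at')

print(np.load('demo.npy'))                  # still loads fine
print(np.load('demo.npy', mmap_mode='r'))   # memory-mapping works too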

I wrote this up. Here is the writing code:

import pickle
import numpy as np

with open(fname, 'wb') as f:     # .npy is binary, so open in binary mode
    np.save(f, data.values)      # the DataFrame's underlying numpy array
meta = data.index, data.columns
with open(fname, 'ab') as f:     # appending positions us at the end of the file
    f.write(pickle.dumps(meta))  # the pickled metadata goes after the array

We save the array to disk with the numpy machinery, then reopen the file in append mode and write the pickled metadata after the array.

Here is the corresponding loading code:

import pickle
import numpy as np
import pandas as pd

values = np.load(fname, mmap_mode='r')
with open(fname, 'rb') as f:
    np.lib.format.read_magic(f)                     # skip the magic string
    np.lib.format.read_array_header_1_0(f)          # skip the .npy header
    f.seek(values.dtype.itemsize * values.size, 1)  # skip the raw array data
    meta = pickle.loads(f.read())                   # the rest is our metadata
frame = pd.DataFrame(values, index=meta[0], columns=meta[1])

Check out this gist for a better version of these, which also supports pandas.Series.
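
For example, a quick round trip (assuming the snippets above are wrapped as save_pandas and load_pandas, as in the gist):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 4), columns=list('abcd'))
save_pandas('test.pdy', df)
loaded = load_pandas('test.pdy')
assert (loaded.values == df.values).all()
assert list(loaded.columns) == list(df.columns)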

As an added bonus, you can load the saved data directly as a numpy array if you do not care about the metadata:

save_pandas('data.pdy', data)
raw = np.load('data.pdy')

Note: I’m not sure this code covers all Pandas cases. It fits my use case and I hope it can be useful for you, but feel free to point out shortcomings (or improvements) in the comments.


8 thoughts on “How to save & load large pandas dataframes”

  1. I don’t know. Unfortunately, I was working on a machine where it’s hard to install software and HDF5 was not installed.

    This is about as fast as it gets for the data itself, since the on-disk representation is simply mmap()ed into memory. Unpickling the metadata does take a while if your matrix is very large (I was working with millions of rows & columns), but it is still quite usable.

  2. Great! np.save(filename, df.values) worked on my 26k x 26k dataframe where to_pickle and to_msgpack were crashing ipython.

    • Unfortunately, the code you posted in the gist doesn’t seem to work if the dataframe has any strings:
      df = pd.DataFrame({
          's': ['a', 'b', 'a', 'c'],
          'v': [0.8, 0.7, 0.75, 0.2]})
      save_pandas("test.df", df)
      load_pandas("test.df")
      >> ValueError: Array can't be memory-mapped: Python objects in dtype.
      load_pandas("test.df", None)
      >> UnpicklingError: bad pickle data
