A Deep Dive into mmap with NumPy

In the realm of data processing and numerical computing, NumPy has established itself as a cornerstone library in Python. It provides a powerful ndarray object for efficient multi-dimensional array operations. However, when dealing with extremely large datasets, memory can become a bottleneck. This is where memory-mapped files (mmap) in combination with NumPy come into play. Memory-mapped files allow us to access files on disk as if they were arrays in memory, enabling us to work with datasets much larger than the available RAM. In this blog, we'll explore the fundamental concepts, usage methods, common practices, and best practices of using mmap with NumPy.

Table of Contents

  1. Fundamental Concepts
    • What is mmap?
    • How mmap integrates with NumPy
  2. Usage Methods
    • Creating a memory-mapped NumPy array
    • Reading and writing to a memory-mapped array
  3. Common Practices
    • Working with large datasets
    • Sharing data between processes
  4. Best Practices
    • Error handling
    • Performance optimization
  5. Conclusion

Fundamental Concepts

What is mmap?

Memory-mapped files (mmap) are a technique that maps a file on disk to a region of memory. This mapping allows a program to access the file's contents using normal memory operations, such as reading and writing, without the need for explicit I/O calls like read() and write(). The operating system takes care of loading the necessary parts of the file into physical memory as needed, a mechanism known as demand paging.
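To make this concrete, here is a minimal sketch using Python's standard-library mmap module (the file name demo.bin is just a placeholder): the file's bytes are read and modified through the mapping itself rather than through read()/write() calls.

```python
import mmap

# Create a small demo file to map (hypothetical name).
with open('demo.bin', 'wb') as f:
    f.write(b'hello memory-mapped world')

with open('demo.bin', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the entire file
    first = bytes(mm[:5])          # read through memory, not f.read()
    mm[:5] = b'HELLO'              # write through memory as well
    mm.flush()                     # push the change back to the file
    mm.close()

with open('demo.bin', 'rb') as f:
    data = f.read()
print(first, data)
```

NumPy's memmap builds exactly this kind of mapping underneath, but exposes it with array semantics instead of raw bytes.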

How mmap integrates with NumPy

NumPy provides numpy.memmap, an ndarray subclass whose data lives in a memory-mapped file on disk rather than in RAM. A memmap array behaves just like a regular NumPy array, but reads pull data from the file on demand, changes made to the array are written back to the underlying file when it is flushed, and, because the mapping is shared, changes made to the file are visible through the array as well.

Usage Methods

Creating a memory-mapped NumPy array

The following code shows how to create a new memory-mapped NumPy array:

import numpy as np

# Create a new memory-mapped array
filename = 'large_array.dat'
shape = (1000, 1000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)

# Fill the array with some values
fp[:] = np.random.rand(*shape)

# Flush the changes to disk explicitly, then release the mapping
fp.flush()
del fp

In this code, we first define the filename, shape, and data type of the array. We then use np.memmap to create a new memory-mapped array in write-plus ('w+') mode, which creates the file (or overwrites it) for reading and writing. After filling the array with random values, we delete the reference to the array; deleting the last reference flushes any pending changes to disk (you can also call fp.flush() explicitly at any point).

Reading and writing to a memory-mapped array

The following code demonstrates how to read and write to an existing memory-mapped array:

import numpy as np

# Open an existing memory-mapped array
filename = 'large_array.dat'
shape = (1000, 1000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='r+', shape=shape)

# Read a slice of the array
slice_data = fp[100:200, 100:200]
print(slice_data)

# Modify a slice of the array
fp[100:200, 100:200] = 0

# Flush the changes to disk and release the mapping
fp.flush()
del fp

Here, we open an existing memory-mapped array in read-plus ('r+') mode, which allows both reading and writing without recreating the file. We read a slice of the array and then overwrite that same region with zeros. Finally, we delete the reference to the array to flush the changes to disk.
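Besides 'r', 'r+', and 'w+', np.memmap also accepts mode='c' (copy-on-write): the array is readable and writable, but writes change only the in-memory copy and are never written back to the file. A small sketch, using a hypothetical file name:

```python
import numpy as np

filename = 'cow_demo.dat'  # hypothetical file for illustration
shape = (4, 4)

# Create a file of ones.
fp = np.memmap(filename, dtype=np.float32, mode='w+', shape=shape)
fp[:] = 1.0
fp.flush()
del fp

# mode='c' is copy-on-write: assignments affect memory only, not the file.
cow = np.memmap(filename, dtype=np.float32, mode='c', shape=shape)
cow[0, 0] = 99.0
in_memory_value = float(cow[0, 0])
del cow

# Reopen read-only: the file on disk is unchanged.
ondisk = np.memmap(filename, dtype=np.float32, mode='r', shape=shape)
on_disk_value = float(ondisk[0, 0])
del ondisk

print(in_memory_value, on_disk_value)
```

Copy-on-write mode is handy for experiments on a shared dataset where you want scratch modifications without any risk of corrupting the source file.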

Common Practices

Working with large datasets

When working with large datasets, memory-mapped arrays can be a lifesaver. Instead of loading the entire dataset into memory, we can access it in chunks. For example:

import numpy as np

filename = 'huge_array.dat'
shape = (10000, 10000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='r', shape=shape)

# Process the array in chunks
chunk_size = 1000
for i in range(0, shape[0], chunk_size):
    chunk = fp[i:i + chunk_size, :]
    # Do some processing on the chunk
    result = np.mean(chunk, axis=1)
    print(result)

del fp

In this code, we process the large array in chunks of size 1000 rows at a time, which reduces the memory footprint.
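The per-chunk results can also be combined into a single global statistic. The sketch below (hypothetical file name and a small demo shape) accumulates a global mean one chunk at a time, so only chunk_size rows are resident at once:

```python
import numpy as np

# Create a small demo file filled with 0.5 (hypothetical name and shape).
filename = 'mean_demo.dat'
shape = (4000, 1000)
fp = np.memmap(filename, dtype=np.float32, mode='w+', shape=shape)
fp[:] = 0.5
fp.flush()
del fp

# Reopen read-only and accumulate the sum chunk by chunk.
fp = np.memmap(filename, dtype=np.float32, mode='r', shape=shape)
chunk_size = 500
total = 0.0
for i in range(0, shape[0], chunk_size):
    # Accumulating in float64 avoids precision loss over many float32 values.
    total += float(fp[i:i + chunk_size, :].sum(dtype=np.float64))
mean = total / (shape[0] * shape[1])
del fp

print(mean)
```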

Sharing data between processes

Memory-mapped arrays can also be used to share data between multiple processes. Here is a simple example using the multiprocessing module:

import numpy as np
import multiprocessing as mp

def worker(offset, chunk_size, filename):
    shape = (1000, 1000)
    dtype = np.float32
    fp = np.memmap(filename, dtype=dtype, mode='r+', shape=shape)
    chunk = fp[offset:offset + chunk_size, :]
    chunk[:] = np.random.rand(*chunk.shape)
    del fp

if __name__ == '__main__':
    filename = 'shared_array.dat'
    shape = (1000, 1000)
    dtype = np.float32
    fp = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)
    del fp

    processes = []
    chunk_size = 200
    for i in range(0, shape[0], chunk_size):
        p = mp.Process(target=worker, args=(i, chunk_size, filename))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

In this code, we create a memory-mapped array and then spawn multiple processes, each of which maps the same file and fills a different chunk of rows. Because the chunks do not overlap, the workers need no locking; if different processes could write to overlapping regions, you would need explicit synchronization.

Best Practices

Error handling

When working with memory-mapped arrays, it's important to handle errors properly. For example, if the file cannot be opened or if there is a disk I/O error, the program should handle these exceptions gracefully.

import numpy as np

try:
    filename = 'nonexistent_file.dat'
    shape = (1000, 1000)
    dtype = np.float32
    fp = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
except FileNotFoundError:
    print(f"File {filename} not found.")
except ValueError as e:
    print(f"Value error: {e}")

Performance optimization

To optimize performance, it's recommended to use an appropriate data type and to access the array in a contiguous manner. For example, accessing the array row-by-row is generally faster than accessing it column-by-column for a C-contiguous array, because row accesses touch contiguous bytes on disk and in the page cache.
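The effect is easy to measure. The sketch below (with a hypothetical file name) sums a C-contiguous memmap first row by row and then column by column; on most systems the row-wise pass is noticeably faster because each row is a contiguous run of bytes:

```python
import time
import numpy as np

filename = 'timing_demo.dat'  # hypothetical file for illustration
shape = (1000, 1000)
fp = np.memmap(filename, dtype=np.float32, mode='w+', shape=shape)
fp[:] = 1.0
fp.flush()

# Row-wise access: each fp[i] is a contiguous block of memory.
t0 = time.perf_counter()
row_total = sum(float(fp[i].sum()) for i in range(shape[0]))
t_row = time.perf_counter() - t0

# Column-wise access: each fp[:, j] strides across the whole array.
t0 = time.perf_counter()
col_total = sum(float(fp[:, j].sum()) for j in range(shape[1]))
t_col = time.perf_counter() - t0

print(f"row-wise: {t_row:.4f}s, column-wise: {t_col:.4f}s")
del fp
```

Both passes compute the same total; only the access pattern, and therefore the timing, differs.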

Conclusion

Memory-mapped files in combination with NumPy provide a powerful way to work with large datasets efficiently. By allowing us to access files on disk as if they were in memory, we can overcome the limitations of available RAM. We've explored the fundamental concepts, usage methods, common practices, and best practices of using mmap with NumPy. With this knowledge, you can now handle large-scale numerical data processing tasks more effectively.
