NumPy provides the ndarray object for efficient multi-dimensional array operations. However, when dealing with extremely large datasets, memory can become a bottleneck. This is where memory-mapped files (mmap) in combination with NumPy come into play. Memory-mapped files allow us to access files on disk as if they were arrays in memory, enabling us to work with datasets that are much larger than the available RAM. In this blog, we'll explore the fundamental concepts, usage methods, common practices, and best practices of using mmap with NumPy.
Memory mapping is a technique that maps a file on disk to a region of memory. The mapping lets a program access the file's contents through ordinary memory operations, reading and writing, without explicit I/O calls such as read() and write(). The operating system loads the necessary parts of the file into physical memory as needed, a mechanism known as demand paging.
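To make the idea concrete, here is a minimal sketch using Python's standard-library mmap module (the same mechanism NumPy builds on); the file name demo.bin is just a placeholder:
import mmap
# Create a small file to map (placeholder name)
with open('demo.bin', 'wb') as f:
    f.write(b'hello memory-mapped world')
with open('demo.bin', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)  # map the whole file
    print(mm[:5])                  # b'hello' -- read like a bytes object
    mm[0:5] = b'HELLO'             # write through the mapping
    mm.flush()                     # push dirty pages back to disk
    mm.close()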
NumPy provides the memmap class, which creates a memory-mapped ndarray. This ndarray behaves just like a regular NumPy array, but its data is stored on disk rather than in memory. Any changes made to the memmap array are automatically reflected in the underlying file, and vice versa.
The following code shows how to create a new memory-mapped NumPy array:
import numpy as np
# Create a new memory-mapped array
filename = 'large_array.dat'
shape = (1000, 1000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)
# Fill the array with some values
fp[:] = np.random.rand(*shape)
# Flush the changes to disk (deleting the last reference also flushes)
fp.flush()
del fp
In this code, we first define the filename, shape, and data type of the array. We then use np.memmap to create a new memory-mapped array in write-plus ('w+') mode, which creates the file, or overwrites it if it already exists, for reading and writing. After filling the array with random values, we flush and delete the reference to the array; dropping the last reference also flushes any pending changes to disk.
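Note that np.memmap stores raw bytes with no header, so the file here is exactly 1000 * 1000 * 4 bytes and you must keep track of the shape and dtype yourself. If you would rather have them stored with the data, NumPy's .npy format supports memory mapping too; a minimal sketch:
import numpy as np
# np.save writes a .npy header containing shape and dtype
np.save('large_array.npy', np.random.rand(1000, 1000).astype(np.float32))
fp = np.load('large_array.npy', mmap_mode='r')  # returns a read-only memmap
print(fp.shape, fp.dtype)  # recovered from the header
del fp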
The following code demonstrates how to read from and write to an existing memory-mapped array:
import numpy as np
# Open an existing memory-mapped array
filename = 'large_array.dat'
shape = (1000, 1000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='r+', shape=shape)
# Read a slice of the array
slice_data = fp[100:200, 100:200]
print(slice_data)
# Modify a slice of the array
fp[100:200, 100:200] = 0
# Flush the changes to disk
fp.flush()
del fp
Here, we open an existing memory-mapped array in read-plus ('r+') mode, which opens the file for reading and writing without truncating it. We read a slice of the array and then set that same slice to zero. Finally, we flush and delete the reference to the array.
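Besides 'r+', np.memmap also accepts 'r' (read-only) and 'c' (copy-on-write). In copy-on-write mode, assignments modify the in-memory copy but are never written back to the file. A minimal sketch, assuming large_array.dat from the earlier example:
import numpy as np
fp = np.memmap('large_array.dat', dtype=np.float32, mode='c', shape=(1000, 1000))
fp[0, 0] = 42.0  # allowed, but only changes the in-memory copy
del fp
check = np.memmap('large_array.dat', dtype=np.float32, mode='r', shape=(1000, 1000))
print(check[0, 0])  # the value on disk is unchanged
del check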
When working with large datasets, memory-mapped arrays can be a lifesaver. Instead of loading the entire dataset into memory, we can access it in chunks. For example:
import numpy as np
filename = 'huge_array.dat'
shape = (10000, 10000)
dtype = np.float32
fp = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
# Process the array in chunks
chunk_size = 1000
for i in range(0, shape[0], chunk_size):
    chunk = fp[i:i + chunk_size, :]
    # Do some processing on the chunk
    result = np.mean(chunk, axis=1)
    print(result)
del fp
In this code, we process the large array in chunks of 1,000 rows at a time, which keeps the memory footprint small: only the pages backing the current chunk need to be resident in RAM.
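Chunked access also lets us compute global statistics incrementally. Here is a sketch, assuming the same huge_array.dat, that accumulates a running sum to obtain the overall mean without ever materializing the full array:
import numpy as np
fp = np.memmap('huge_array.dat', dtype=np.float32, mode='r', shape=(10000, 10000))
total = 0.0
chunk_size = 1000
for i in range(0, fp.shape[0], chunk_size):
    # Accumulate in float64 to limit rounding error across 100 million values
    total += fp[i:i + chunk_size, :].sum(dtype=np.float64)
print(total / fp.size)  # overall mean
del fp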
Memory-mapped arrays can also be used to share data between multiple processes. Here is a simple example using the multiprocessing module:
import numpy as np
import multiprocessing as mp
def worker(offset, chunk_size, filename):
    # Each worker re-opens the same file and writes to its own row range
    shape = (1000, 1000)
    dtype = np.float32
    fp = np.memmap(filename, dtype=dtype, mode='r+', shape=shape)
    chunk = fp[offset:offset + chunk_size, :]
    chunk[:] = np.random.rand(*chunk.shape)
    del fp
if __name__ == '__main__':
    filename = 'shared_array.dat'
    shape = (1000, 1000)
    dtype = np.float32
    # Create the backing file up front so every worker can open it
    fp = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)
    del fp
    processes = []
    chunk_size = 200
    for i in range(0, shape[0], chunk_size):
        p = mp.Process(target=worker, args=(i, chunk_size, filename))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
In this code, we create a memory-mapped array and then spawn multiple processes, each of which fills a different chunk of the array.
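To confirm that the workers actually wrote through to the shared file, we can re-open it read-only after the join loop; a short sketch:
import numpy as np
result = np.memmap('shared_array.dat', dtype=np.float32, mode='r', shape=(1000, 1000))
print(result.min(), result.max())  # should show random values in [0, 1)
del result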
When working with memory-mapped arrays, it's important to handle errors properly. For example, if the file cannot be opened or a disk I/O error occurs, the program should handle these exceptions gracefully.
import numpy as np
filename = 'nonexistent_file.dat'
shape = (1000, 1000)
dtype = np.float32
try:
    # mode='r' requires the file to already exist
    fp = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
except FileNotFoundError:
    print(f"File {filename} not found.")
except ValueError as e:
    print(f"Value error: {e}")
To optimize performance, it's recommended to use the appropriate data type and to access the array in a contiguous manner. For example, accessing a C-contiguous array row-by-row is generally faster than accessing it column-by-column.
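A rough way to observe this is to time explicit row slices against column slices; this is only a sketch, and absolute numbers depend on your disk and the OS page cache:
import time
import numpy as np
shape = (1000, 1000)
fp = np.memmap('large_array.dat', dtype=np.float32, mode='r', shape=shape)
t0 = time.perf_counter()
for i in range(shape[0]):
    _ = fp[i, :].sum()  # each row is one contiguous block
t1 = time.perf_counter()
for j in range(shape[1]):
    _ = fp[:, j].sum()  # each column strides across every row
t2 = time.perf_counter()
print(f"row-wise:    {t1 - t0:.4f} s")
print(f"column-wise: {t2 - t1:.4f} s")
del fp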
Memory-mapped files in combination with NumPy provide a powerful way to work with large datasets efficiently. By letting us access files on disk as if they were in memory, they overcome the limits of available RAM. We've explored the fundamental concepts, usage methods, common practices, and best practices of using mmap with NumPy. With this knowledge, you can now handle large-scale numerical data processing tasks more effectively.
Further reading: Python's mmap module documentation: https://docs.python.org/3/library/mmap.html