How to Use Python Pandas to_sql for Large DataFrames: Writing in Chunks with Official Methods to Avoid MySQL Timeouts

When working with large datasets in Python, pandas is the go-to library for data manipulation. A common workflow involves extracting data (e.g., from CSV, APIs, or databases), transforming it with pandas, and then loading it back into a database like MySQL for storage or further analysis. However, inserting large DataFrames directly into MySQL using pandas.to_sql() can lead to frustrating issues: timeouts, memory overload, or even failed transactions.

The root cause? MySQL enforces connection and statement limits: wait_timeout (default often 28800 seconds, or 8 hours) closes connections the server considers idle, and max_allowed_packet caps the size of a single SQL statement. A long-running insert of millions of rows can trip either limit, and loading an entire large DataFrame into memory to insert all rows at once strains system resources on top of that.

The solution? Use pandas’ official chunking functionality with to_sql(). By splitting the DataFrame into smaller "chunks" and inserting them sequentially, you avoid overwhelming the database connection, reduce memory usage, and ensure the operation completes without timeouts.

In this blog, we’ll dive deep into how to implement chunked writes with pandas.to_sql(), why it works, and advanced tips to optimize performance. Whether you’re a data engineer, analyst, or scientist, this guide will help you reliably load large datasets into MySQL.

Table of Contents#

  1. Understanding the Problem: Why Large DataFrames Cause Timeouts
  2. Pandas to_sql() Basics: A Quick Recap
  3. Chunking with to_sql(): The Official Solution
  4. Step-by-Step Implementation: Writing Large DataFrames to MySQL
  5. Advanced Tips for Optimization
  6. Troubleshooting Common Issues
  7. Conclusion

Understanding the Problem: Why Large DataFrames Cause Timeouts#

Before we fix the issue, let’s understand why inserting large DataFrames with to_sql() fails:

1. MySQL Connection Timeouts#

MySQL servers enforce a wait_timeout (and interactive_timeout) setting, which closes connections that sit idle for longer than the configured period. During a long to_sql() run, the connection can idle between operations (for example, while pandas prepares the next huge batch); if any such pause exceeds wait_timeout, the server terminates the connection mid-operation, causing errors like:

OperationalError: (MySQLdb.OperationalError) (2006, 'MySQL server has gone away')

2. Memory Overload#

Pandas loads the entire DataFrame into memory. For datasets with millions of rows, this can consume gigabytes of RAM. Even if the DataFrame fits, inserting all rows in one batch requires constructing a massive SQL query, which may exceed MySQL’s max_allowed_packet (the maximum size of a single SQL statement), leading to:

OperationalError: (MySQLdb.OperationalError) (1153, "Got a packet bigger than 'max_allowed_packet' bytes")
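As a back-of-the-envelope check, you can estimate how many rows fit into one multi-row INSERT before hitting max_allowed_packet. The byte figures below are illustrative assumptions, not measured values:

```python
# Conservative estimate of rows per multi-row INSERT, given an assumed
# average rendered-row size and MySQL's max_allowed_packet (64 MB is the
# MySQL 8.0 default; older servers often default to much less).
def max_rows_per_packet(avg_row_bytes, max_allowed_packet=64 * 1024 * 1024,
                        safety_factor=0.5):
    """Leave headroom for SQL syntax and escaping via safety_factor."""
    budget = int(max_allowed_packet * safety_factor)
    return max(1, budget // avg_row_bytes)

# Assuming ~200 bytes per rendered row against the 64 MB default:
print(max_rows_per_packet(200))
```

Measure a real row (e.g., the length of one rendered VALUES tuple) before trusting any estimate like this.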

3. Slow Single-Batch Inserts#

Inserting millions of rows in a single batch is inefficient. Databases optimize for smaller, frequent writes, and single large transactions can lock tables, degrade performance, or fail due to resource limits.

The fix? Chunking: Split the DataFrame into smaller subsets and insert them one at a time. Pandas provides a built-in chunksize parameter in to_sql() to do this—no custom loops required!

Pandas to_sql() Basics: A Quick Recap#

If you’re new to pandas.to_sql(), let’s recap its core functionality. The method writes a DataFrame to a SQL table, using SQLAlchemy for database connections (supporting MySQL, PostgreSQL, SQL Server, etc.).

Basic Syntax:#

import pandas as pd
from sqlalchemy import create_engine
 
# Sample small DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})
 
# MySQL connection (replace with your credentials)
engine = create_engine("mysql+pymysql://user:password@host:port/database")
 
# Write to SQL (no chunking for small DataFrames)
df.to_sql(
    name="users",  # Table name
    con=engine,    # SQLAlchemy engine
    if_exists="replace",  # What to do if table exists: 'replace', 'append', or 'fail'
    index=False    # Don't write DataFrame index as a column
)

This works for small DataFrames but fails for large ones. Let’s fix that with chunking.

Chunking with to_sql(): The Official Solution#

Pandas’ to_sql() includes an official chunksize parameter to split the DataFrame into chunks. Here’s how it works:

What is chunksize?#

The chunksize parameter specifies the number of rows to insert per batch. For example, chunksize=1000 splits a 1,000,000-row DataFrame into 1,000 chunks of 1,000 rows each, inserting one chunk at a time.
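The batch count is just ceiling division; a tiny helper makes the arithmetic explicit:

```python
import math

# Number of batches to_sql issues for a given DataFrame length and chunksize.
def num_batches(n_rows, chunksize):
    return math.ceil(n_rows / chunksize)

print(num_batches(1_000_000, 1_000))   # 1000 batches
print(num_batches(1_000_000, 10_000))  # 100 batches
print(num_batches(1_000_001, 10_000))  # 101 batches (the last one is partial)
```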

Why It Works:#

  • Avoids Timeouts: Smaller chunks insert faster, so each batch completes before wait_timeout expires.
  • Reduces Memory Usage: Only one chunk is processed in memory at a time.
  • Faster Writes: Databases optimize for bulk inserts of moderate size (e.g., 1,000–10,000 rows per batch).

Key Notes:#

  • chunksize is an official pandas parameter, documented and maintained by the pandas project, so it will remain compatible with future pandas updates.
  • No need for custom loops—to_sql() handles chunking internally.
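To see the chunking behavior without a MySQL server, you can point the exact same call at an in-memory SQLite engine; the to_sql arguments are unchanged, only the connection string differs:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine
df = pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]})

# chunksize=3 issues 4 batches: 3 + 3 + 3 + 1 rows
df.to_sql("demo", con=engine, if_exists="replace", index=False, chunksize=3)

count = pd.read_sql("SELECT COUNT(*) AS n FROM demo", con=engine)["n"].iloc[0]
print(count)  # 10
```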

Step-by-Step Implementation: Writing Large DataFrames to MySQL#

Let’s walk through a complete example of writing a large DataFrame to MySQL using chunksize.

Step 1: Install Required Libraries#

First, ensure you have the necessary packages:

pip install pandas sqlalchemy pymysql  # pymysql is the MySQL driver for SQLAlchemy

Step 2: Import Libraries#

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import logging  # For monitoring progress

Step 3: Create a Large DataFrame#

Let’s generate a sample large DataFrame (1,000,000 rows) using numpy:

# Generate 1M rows of dummy data
np.random.seed(42)  # For reproducibility
num_rows = 1_000_000
df = pd.DataFrame({
    "user_id": np.arange(1, num_rows + 1),
    "username": ["user_" + str(i) for i in range(num_rows)],
    "age": np.random.randint(18, 99, size=num_rows),
    "signup_date": pd.date_range(start="2020-01-01", periods=num_rows, freq="T")  # 1M minutes of dates
})
 
print(f"DataFrame shape: {df.shape}")  # Output: (1000000, 4)

Step 4: Set Up MySQL Connection with SQLAlchemy#

Use SQLAlchemy’s create_engine() to connect to MySQL. Never hardcode credentials—use environment variables or a config file. For simplicity, we’ll use a placeholder connection string:

# Replace with your MySQL credentials (use environment variables in production!)
db_username = "your_username"
db_password = "your_password"
db_host = "localhost"
db_port = "3306"
db_name = "your_database"
 
# Create SQLAlchemy engine
connection_string = f"mysql+pymysql://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
engine = create_engine(connection_string)

Security Note: Use python-dotenv to load credentials from a .env file:

# .env file (add to .gitignore!)
DB_USER=your_username
DB_PASS=your_password
DB_HOST=localhost
DB_PORT=3306
DB_NAME=your_database
 
# Load with python-dotenv
from dotenv import load_dotenv
import os
load_dotenv()  # Loads variables from .env
connection_string = f"mysql+pymysql://{os.getenv('DB_USER')}:{os.getenv('DB_PASS')}@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}"

Step 5: Write to MySQL with Chunking#

Use to_sql() with chunksize to insert in batches. We’ll also add logging to track progress:

# Configure logging to monitor chunk inserts
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
# Define chunk size (adjust based on your needs)
chunksize = 10_000  # Insert 10,000 rows per batch
 
# Write DataFrame to MySQL
logger.info(f"Starting to write {num_rows} rows to MySQL with chunksize={chunksize}...")
df.to_sql(
    name="large_users",  # Table name in MySQL
    con=engine,
    if_exists="replace",  # Replace table if it exists (use 'append' to add data)
    index=False,
    chunksize=chunksize,  # Official chunking parameter
    method="multi"  # Use executemany for faster bulk inserts (optional but recommended)
)
logger.info("Successfully wrote DataFrame to MySQL!")

Explanation of Key Parameters:#

  • chunksize=10_000: Splits the 1M-row DataFrame into 100 chunks (1M / 10k = 100 batches).
  • method="multi": Uses executemany under the hood, which sends multiple rows in a single SQL statement (e.g., INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...), drastically speeding up inserts compared to the default method=None (single-row inserts).
  • if_exists="replace": Overwrites the table if it exists. Use "append" to add data to an existing table, or "fail" to abort if the table exists.

Advanced Tips for Optimization#

To maximize performance and avoid edge cases, use these advanced strategies:

1. Optimize Chunk Size#

The ideal chunksize depends on your database and hardware. Test with values between 1,000–100,000 rows:

  • Too small (e.g., 100 rows): Too many round-trips to the database; slow.
  • Too large (e.g., 500,000 rows): May still cause timeouts or exceed max_allowed_packet.

Start with 10,000–50,000 rows and adjust based on performance.
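A quick way to pick a value is to benchmark a few candidates. The sketch below uses an in-memory SQLite engine so it runs anywhere; on a real MySQL server the absolute numbers will differ, but the relative comparison is what matters:

```python
import time

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"a": range(20_000), "b": range(20_000)})

for cs in (500, 5_000, 20_000):
    engine = create_engine("sqlite://")  # stand-in for the MySQL engine
    start = time.perf_counter()
    df.to_sql("bench", con=engine, if_exists="replace", index=False, chunksize=cs)
    elapsed = time.perf_counter() - start
    print(f"chunksize={cs}: {elapsed:.3f}s")

# Sanity check: the last run wrote every row
rows = pd.read_sql("SELECT COUNT(*) AS n FROM bench", con=engine).iloc[0, 0]
```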

2. Use method="multi" for Faster Bulk Inserts#

The method="multi" parameter is critical for speed. Without it, to_sql() defaults to inserting one row per SQL statement (slow for large data). With method="multi", pandas uses executemany to bulk-insert rows, reducing round-trips.

Note: Some databases cap the number of bind parameters per statement (SQLite and SQL Server, for example), and MySQL caps total statement size via max_allowed_packet. If a multi-row insert fails with such an error, reduce chunksize or drop method="multi".
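With method="multi", each statement carries chunksize × n_columns bind parameters, so you can check a candidate chunk size against a known cap (SQL Server's 2,100-parameter limit is a common one; the helper itself is plain arithmetic):

```python
# Bind parameters carried by one multi-row INSERT: rows per chunk × columns.
def params_per_insert(chunksize, n_columns):
    return chunksize * n_columns

print(params_per_insert(10_000, 4))  # 40000
print(params_per_insert(500, 4))     # 2000 -- fits under a 2,100-parameter cap
```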

3. Manage Transactions#

When you pass an Engine, pandas wraps the entire to_sql() call in a single transaction, committing only after all chunks are inserted. If one chunk fails, the whole operation rolls back. You can make that transaction explicit with engine.begin(), or commit chunk by chunk (trading atomicity for partial progress):

with engine.begin() as connection:  # One transaction, committed when the block exits
    df.to_sql("large_users", con=connection, if_exists="replace", index=False, chunksize=10_000)
 
# To commit per chunk (keeps partial progress if a later chunk fails):
num_chunks = (len(df) + chunksize - 1) // chunksize  # Ceiling division
for chunk in np.array_split(df, num_chunks):
    with engine.begin() as connection:  # One transaction per chunk
        chunk.to_sql("large_users", con=connection, if_exists="append", index=False)

4. Monitor Progress with Logging#

to_sql() doesn't report progress on its own (and its return value is not iterable), so when you want per-batch logging, chunk the DataFrame manually and log after each insert:

total_chunks = (len(df) + chunksize - 1) // chunksize  # Ceiling division
for i in range(total_chunks):
    chunk = df.iloc[i * chunksize:(i + 1) * chunksize]
    chunk.to_sql("large_users", con=engine, if_exists="append", index=False)
    logger.info(f"Inserted chunk {i + 1}/{total_chunks}")

5. Clean Up Connections#

Explicitly dispose of the engine to free resources after insertion:

engine.dispose()

Troubleshooting Common Issues#

Even with chunking, you may encounter problems. Here’s how to fix them:

Issue 1: "MySQL Server Has Gone Away" (Timeout)#

Cause: wait_timeout is too low for your chunk size.
Fix:

  • Temporarily increase MySQL’s wait_timeout (SET GLOBAL requires admin access; SET SESSION affects only the current connection):
    SET SESSION wait_timeout = 57600;  # 16 hours (the server default is 28800 seconds = 8 hours)
  • Reduce chunksize to make each batch faster.
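On the client side, two standard SQLAlchemy engine options also help: pool_recycle retires pooled connections before the server's wait_timeout can kill them, and pool_pre_ping verifies each connection before use. The sqlite:// URL below is a stand-in so the snippet runs anywhere; swap in your mysql+pymysql connection string:

```python
from sqlalchemy import create_engine

engine = create_engine(
    "sqlite://",          # stand-in; use your mysql+pymysql URL in practice
    pool_recycle=3600,    # discard pooled connections older than 1 hour
    pool_pre_ping=True,   # issue a lightweight ping before reusing a connection
)

with engine.connect() as conn:
    print(conn.closed)  # False while the block is active
```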

Issue 2: Memory Errors#

Cause: The DataFrame is too large to fit in memory, even with chunking.
Fix:

  • Process data in chunks during extraction (e.g., pd.read_csv(chunksize=...) for CSV files) and write each chunk immediately:
    # For CSV files too large to load into memory
    for chunk in pd.read_csv("large_file.csv", chunksize=10_000):
        chunk.to_sql(..., chunksize=10_000, ...)

Issue 3: Data Type Mismatches#

Cause: Pandas dtypes (e.g., int64, datetime64) may not match MySQL’s expected types (e.g., INT, DATETIME).
Fix: Use the dtype parameter in to_sql() to explicitly map pandas columns to MySQL types:

from sqlalchemy.types import Integer, String, DateTime
 
df.to_sql(
    ...,
    dtype={
        "user_id": Integer(),
        "username": String(length=50),  # MySQL VARCHAR(50)
        "signup_date": DateTime()
    }
)

Issue 4: Duplicate Rows#

Cause: if_exists="append" adds rows without checking for duplicates.
Fix:

  • Use if_exists="replace" to overwrite (destructive).
  • Add a unique constraint in MySQL (e.g., UNIQUE(user_id)) and handle errors with try/except.
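For non-destructive de-duplication, to_sql also accepts a callable as its method argument, giving you full control over each batch's INSERT. Below is a hedged sketch of a MySQL upsert using INSERT ... ON DUPLICATE KEY UPDATE; it assumes the target table has a suitable UNIQUE or PRIMARY KEY (note the VALUES() syntax still works in MySQL 8.x but is deprecated in favor of row aliases):

```python
from sqlalchemy import text

# Custom `method` callable for to_sql; the signature pandas expects is
# (table, conn, keys, data_iter), where keys are column names and
# data_iter yields one tuple per row of the current chunk.
def mysql_upsert(table, conn, keys, data_iter):
    columns = ", ".join(f"`{k}`" for k in keys)
    placeholders = ", ".join(f":{k}" for k in keys)
    updates = ", ".join(f"`{k}` = VALUES(`{k}`)" for k in keys)
    stmt = text(
        f"INSERT INTO `{table.name}` ({columns}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

# Usage (sketch, not verified against a live server):
# df.to_sql("large_users", con=engine, if_exists="append", index=False,
#           chunksize=10_000, method=mysql_upsert)
```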

Conclusion#

Writing large DataFrames to MySQL with pandas.to_sql() doesn’t have to result in timeouts or memory errors. By leveraging pandas’ official chunksize parameter, you can split data into manageable batches, reduce connection strain, and ensure reliable inserts.

Key takeaways:

  • Use chunksize to avoid timeouts and memory overload.
  • Pair chunksize with method="multi" for faster bulk inserts.
  • Optimize chunk size (10,000–50,000 rows) and monitor progress with logging.
  • Handle credentials securely and troubleshoot timeouts by adjusting MySQL’s wait_timeout or chunk size.

With these techniques, you can scale pandas workflows to handle even the largest datasets efficiently.
