How to Avoid Duplicate Records and Reuse Existing Entries with factory_boy Factories (SQLAlchemy)

When writing tests for applications using SQLAlchemy, generating test data efficiently is critical. Tools like factory_boy simplify this by creating reusable factories for your models. However, a common pain point is duplicate records: inadvertently creating the same entry (e.g., a user, category, or product) multiple times in tests. This leads to slower tests, unique constraint errors, flaky test suites, and bloated test databases.

In this guide, we’ll explore how to solve this problem by integrating "get or create" logic into factory_boy factories, ensuring existing records are reused instead of duplicated. We’ll focus on SQLAlchemy, the popular ORM, and provide step-by-step examples to implement this pattern effectively.

Table of Contents#

  1. Prerequisites
  2. Understanding the Problem: Why Duplicates Happen
  3. Why Duplicates Matter
  4. Solutions: Reusing Existing Entries with factory_boy
  5. Testing the Implementation
  6. Common Pitfalls and How to Avoid Them
  7. Conclusion
  8. References

Prerequisites#

To follow along, you should have:

  • Basic knowledge of Python (3.8+ recommended)
  • Familiarity with SQLAlchemy models and sessions
  • Experience with factory_boy (basic factory creation)
  • A testing framework (we’ll use pytest for examples)

Install required packages:

pip install factory_boy sqlalchemy pytest

Understanding the Problem: Why Duplicates Happen#

Consider a simple SQLAlchemy model for a Category with a unique name constraint:

# models.py
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
 
Base = declarative_base()
 
class Category(Base):
    __tablename__ = "categories"
    
    id = Column(Integer, primary_key=True)
    name = Column(String(50), unique=True, nullable=False)  # Unique constraint
    description = Column(String(200))

A basic factory_boy factory for Category might look like this:

# factories.py
import factory
from factory.alchemy import SQLAlchemyModelFactory
from models import Category
from sqlalchemy.orm import Session
 
class CategoryFactory(SQLAlchemyModelFactory):
    class Meta:
        model = Category
        sqlalchemy_session = Session()  # Assume a test session is configured
 
    name = factory.Sequence(lambda n: f"Category {n}")
    description = factory.Faker("sentence")

Now, if we create two categories with the same name in tests:

# test_categories.py
def test_duplicate_category():
    # First creation: works
    cat1 = CategoryFactory(name="Books")
    # Second creation: raises UniqueViolation error!
    cat2 = CategoryFactory(name="Books")

This fails because the database enforces the unique name constraint. The factory naively creates a new record every time, ignoring existing entries.

Why Duplicates Matter#

  • Unique Constraint Errors: The most obvious issue—tests fail due to database integrity errors.
  • Slow Tests: Creating duplicate records increases database writes, slowing down test suites (critical for large projects).
  • Flaky Tests: Duplicates can lead to unexpected behavior (e.g., foreign key conflicts) in downstream tests.
  • Bloated Test Data: Unnecessary records make debugging harder and consume more storage.

Solutions: Reusing Existing Entries with factory_boy#

The core solution is to implement "get or create" logic in your factories: check if a record with the desired attributes exists, and return it if found; otherwise, create a new one.

4.1 The "Get or Create" Pattern#

The "get or create" pattern is a common database pattern:

  1. Query the database for a record matching specific criteria (e.g., name="Books").
  2. If found, return the existing record.
  3. If not found, create and return a new record.

In SQLAlchemy, this might look like:

def get_or_create_category(session: Session, **kwargs):
    existing = session.query(Category).filter_by(** kwargs).first()
    if existing:
        return existing
    new_category = Category(**kwargs)
    session.add(new_category)
    session.commit()  # or flush() for transactional tests
    return new_category

We’ll integrate this directly into factory_boy factories.

4.2 Implementing Get-or-Create in factory_boy#

factory_boy’s SQLAlchemyModelFactory uses a _create method to persist instances to the database. By overriding _create, we can inject get-or-create logic.

Step 1: Define Unique Fields#

First, specify which fields define uniqueness for your model (e.g., name for Category). Add a class attribute to the factory:

class CategoryFactory(SQLAlchemyModelFactory):
    class Meta:
        model = Category
        sqlalchemy_session = Session()  # Your test session
 
    # Unique fields for "get or create"
    class Params:
        unique_fields = ("name",)  # Fields that define uniqueness
 
    name = factory.Sequence(lambda n: f"Category {n}")
    description = factory.Faker("sentence")

Step 2: Override _create#

Override _create to check for existing records using unique_fields:

class CategoryFactory(SQLAlchemyModelFactory):
    # ... (Meta and Params as above)
 
    @classmethod
    def _create(cls, model_class, *args, **kwargs):
        # Extract unique fields from kwargs
        unique_kwargs = {k: kwargs[k] for k in cls.Params.unique_fields if k in kwargs}
        
        # If no unique fields provided, create normally
        if not unique_kwargs:
            return super()._create(model_class, *args, **kwargs)
        
        # Query for existing instance
        session = cls._meta.sqlalchemy_session
        existing = session.query(model_class).filter_by(**unique_kwargs).first()
        
        if existing:
            return existing
        
        # Create new instance if not found
        return super()._create(model_class, *args, **kwargs)

How It Works:#

  • unique_kwargs filters the factory’s kwargs to only include fields in unique_fields (e.g., {"name": "Books"}).
  • The factory queries the database for an existing record with unique_kwargs.
  • If found, it returns the existing record; otherwise, it delegates to the default _create method to create a new one.

4.3 Handling Relationships with SubFactories#

Most models have relationships (e.g., a Product belonging to a Category). factory_boy uses SubFactory to handle this, and we can reuse get-or-create logic here too.

Example: Product Model with Category Relationship#

# models.py
class Product(Base):
    __tablename__ = "products"
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    category_id = Column(Integer, ForeignKey("categories.id"))
    category = relationship("Category", backref="products")

Product Factory with SubFactory#

# factories.py
class ProductFactory(SQLAlchemyModelFactory):
    class Meta:
        model = Product
        sqlalchemy_session = Session()
 
    name = factory.Faker("product_name")
    category = factory.SubFactory(CategoryFactory)  # Reuses CategoryFactory's get-or-create

Now, creating two Products with the same Category name will reuse the existing Category:

def test_reuse_category_in_products():
    # Create first product with "Books" category
    product1 = ProductFactory(category__name="Books")
    # Create second product with the same category name
    product2 = ProductFactory(category__name="Books")
    
    # Assert both products share the same category instance
    assert product1.category.id == product2.category.id

The category__name syntax passes name="Books" to the CategoryFactory SubFactory, which triggers get-or-create logic.

4.4 Composite Unique Constraints#

For models with composite unique constraints (e.g., User with unique (username, email)), update unique_fields to include all constrained fields:

# models.py
class User(Base):
    __tablename__ = "users"
    
    id = Column(Integer, primary_key=True)
    username = Column(String(50), nullable=False)
    email = Column(String(100), nullable=False)
    
    __table_args__ = (
        UniqueConstraint("username", "email", name="unique_username_email"),
    )
# factories.py
class UserFactory(SQLAlchemyModelFactory):
    class Meta:
        model = User
        sqlalchemy_session = Session()
 
    class Params:
        unique_fields = ("username", "email")  # Composite unique constraint
 
    username = factory.Faker("user_name")
    email = factory.Faker("email")

Now, UserFactory(username="johndoe", email="[email protected]") will reuse an existing user with both username="johndoe" and email="[email protected]".

4.5 Caching for Performance (Advanced)#

For test suites with many repeated calls to the same factory, add an in-memory cache to avoid redundant database queries:

class CategoryFactory(SQLAlchemyModelFactory):
    # ... (Meta, Params, and fields as above)
    
    _cache = {}  # Class-level cache: {unique_kwargs_tuple: instance}
 
    @classmethod
    def _create(cls, model_class, *args, **kwargs):
        unique_kwargs = {k: kwargs[k] for k in cls.Params.unique_fields if k in kwargs}
        if not unique_kwargs:
            return super()._create(model_class, *args, **kwargs)
        
        # Use a tuple of sorted items as cache key (dictionaries are not hashable)
        cache_key = tuple(sorted(unique_kwargs.items()))
        
        # Check cache first
        if cache_key in cls._cache:
            return cls._cache[cache_key]
        
        # If not in cache, check database
        session = cls._meta.sqlalchemy_session
        existing = session.query(model_class).filter_by(**unique_kwargs).first()
        
        if existing:
            cls._cache[cache_key] = existing
            return existing
        
        # Create new instance and cache it
        new_instance = super()._create(model_class, *args, **kwargs)
        cls._cache[cache_key] = new_instance
        return new_instance

Caveats:

  • Session Rollbacks: If your tests roll back the session after each test, cached instances may become stale (detached from the session). Clear the cache in test teardown:
    @pytest.fixture(autouse=True)
    def clear_factory_caches():
        CategoryFactory._cache.clear()
        UserFactory._cache.clear()
  • Memory Usage: Caching adds memory overhead. Use only for frequently reused factories.

Testing the Implementation#

Verify your factory works with a test case:

# test_categories.py
def test_get_or_create_category():
    # Create first category
    cat1 = CategoryFactory(name="Books")
    # Create second category with the same name
    cat2 = CategoryFactory(name="Books")
    
    # Assert both are the same instance (same ID)
    assert cat1.id == cat2.id
    
    # Verify only one record exists in the database
    session = CategoryFactory._meta.sqlalchemy_session
    count = session.query(Category).filter_by(name="Books").count()
    assert count == 1

Common Pitfalls and How to Avoid Them#

  1. Missing Unique Fields: Forgetting to include all unique constraint fields in unique_fields will lead to duplicates. Always align unique_fields with your model’s UniqueConstraint.

  2. Stale Session Cache: If using SQLAlchemy’s session-level caching (e.g., expire_on_commit=False), the factory may return outdated instances. Use session.refresh(existing) after fetching from the database.

  3. Overusing unique_fields: Avoid including non-unique fields in unique_fields (e.g., description). This will prevent reuse even when unique fields match.

  4. build() vs. create(): The _create method (and thus get-or-create) only runs when using CategoryFactory.create(). CategoryFactory.build() creates in-memory objects without checking the database—use create() for persisted records.

  5. Race Conditions: In concurrent test environments (e.g., pytest-xdist), two processes may check for a record simultaneously, both creating duplicates. Use database-level locks or avoid concurrent writes to shared test data.

Conclusion#

By implementing "get or create" logic in factory_boy factories, you can eliminate duplicate records, speed up tests, and ensure test data consistency. Key takeaways:

  • Override _create to check for existing records using unique_fields.
  • Use SubFactory to reuse related records (e.g., categories for products).
  • Add caching for performance, but handle session rollbacks carefully.
  • Always test factory behavior to verify uniqueness.

This pattern ensures your test suite remains fast, reliable, and easy to maintain.

References#