PAC-MAN "Game Over" screen

Illustration by Sigmund on Unsplash

What 2,000 AI-Generated Tests Taught Me About pytest

Feb 3, 2026

Now that development has calmed down a bit I want to start a dev log. I know, it sounds a bit backwards. But the challenges faced when your nice little app gets slapped in the face by the cold winds of production are far more interesting in my opinion than "today I built a feature". Plus I have more time to actually write!

So today, we're looking at how over 2,000 unit tests and defensive CI checks ate our free GitHub Actions minutes allowance, and how better unit test design and pytest usage saved the day for LEADR and could help you too.

Context

I have been leaning enormously on unit tests when building LEADR as a safety (and sanity) mechanism for coding with agentic gen-AI tools like Claude Code. This isn't "vibe coding"; this is 20 years of software engineering experience carefully applied to designing work and delegating it to a team of AI teammates who can code and debug faster than I ever could. Having a thick layer of tests is one of the methods that keeps rapid development in the realm of confidence rather than chaos, even without AI assistance. So the test suite grew fast: over 2,000 tests across eight domains, covering API routes, service logic, and repository queries. And that's just the leadr-oss repo!

That's great for correctness. Less great for your CI usage.

Our full test suite runtime gradually snuck up to over 16 minutes on a standard GitHub Actions runner as the project grew. That's 16 minutes per push. In a dev-heavy week with multiple PRs and a few fixup commits, you can burn through a surprising chunk of your monthly allowance before Wednesday.

"Free GitHub Actions allowance? I thought open-source products had unlimited minutes?" I hear you say. Well yes, they do. Unless they're owned by an organisation. Gotcha.

So we had a problem. The test suite was doing its job, but the economics of running it on every push were unsustainable on the free tier. Here's what we tried to remedy that, roughly in order from quick wins to proper fixes.

For starters

Before trying anything, find out where the time is actually going. You can't fix what you haven't measured, and gut instinct is a poor substitute for data when you're staring at a 16-minute test run.

pytest has this built in:

pytest --durations=0

The --durations flag reports the slowest tests and their setup/teardown phases at the end of the run. Passing 0 means "show all of them, not just the top N". You can also pass a specific number like --durations=20 if you just want the worst offenders.

If you have a large suite, --durations=0 can produce a wall of output that's hard to parse. The --durations-min flag helps here by filtering out anything below a threshold in seconds:

pytest --durations=0 --durations-min=1

That gives you only the tests and phases that took at least one second, which is usually enough to surface the real problems without the noise.

The output looks something like this:

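Reconstructed here with illustrative timings and test paths rather than real LEADR numbers:

============================= slowest durations =============================
4.91s setup    tests/scores/test_service.py::test_submit_score
1.73s call     tests/boards/test_repository.py::test_list_boards
1.02s call     tests/scores/test_routes.py::test_post_score
0.97s setup    tests/players/test_service.py::test_rank_players
...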

Pay attention to both the call and setup/teardown lines. A test with a fast call but a 5-second setup is telling you something about your fixtures, not your code under test.

What you're looking for at this stage is patterns. Is there one obviously broken test taking 30 seconds that you can rewrite or skip? That's a quick win. Are hundreds of tests each taking a fraction of a second longer than they should because of shared fixture overhead? That's a systemic problem and points you toward the structural fixes later in this article. Either way, you now have an informed baseline before trying any of the following.

The quick fixes

Parallelise with xdist

Some people are probably going to yell at me for jumping straight into suggesting xdist off the bat... But honestly, if you're using modern hardware you should probably be using it from day one.

pytest-xdist distributes your pytest tests across multiple parallel workers. It even integrates out of the box with pytest-cov to collect and report test coverage cleanly.

pytest --cov=./src ./tests -n auto

Great locally on a laptop with 14 cores - it cut runtime to less than 2 minutes. But on GitHub-hosted Actions runners with only 2 vCPUs available the effect was negligible, and sometimes slightly worse once worker startup overhead is factored in. Parallelism only helps if you have cores to parallelise across.

Still, we kept xdist in the local dev tooling: it makes running the test suite less disruptive to the development flow, and therefore more likely to happen often.

Run the tests less often

In theory we could run just a subset of tests on pushes to a PR based on changed (or impacted) files and only run the full suite when PRs are merged into main. This solution didn't sit well with me though. The whole point of the test suite is catching problems before they hit the main branch, not after.

What I did opt for is doubling down on the xdist-enabled test runs mentioned above by adding pytest to our pre-commit hooks, ensuring the tests run before code even gets to CI. This means we can use our local hardware to avoid wasted and failed runs that eat into the allowance. Pre-push would also be an option for even less developer flow friction while still running the checks efficiently.
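A minimal sketch of what that local hook can look like in .pre-commit-config.yaml, assuming pytest and pytest-xdist are installed in the project environment:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pytest
        name: pytest (parallel)
        entry: pytest -n auto
        language: system
        pass_filenames: false
        always_run: true
        # swap to `stages: [pre-push]` to run on push instead of every commit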

The trade-off is obvious: you're relying on developers to actually have the hooks installed and not skip them with --no-verify. For a solo project or small team where you trust the workflow, it works. For larger teams, probably not.

Cancel superseded runs

If you push three commits in quick succession, there's no reason to let all three CI runs complete. Only the latest one matters. GitHub Actions supports this natively with concurrency groups:

# Your workflow file
name: Tests

on: ...

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

...

Simple, effective, and something you should probably have configured regardless on most CI jobs. Any in-progress run for the same branch gets cancelled when a new one starts. This alone saved us a decent number of wasted minutes, especially on days with rapid iteration.

Aggressive caching

We were already using (and loving) uv for dependency management, which gives good performance over the alternatives on its own. But reinstalling dependencies from scratch on every CI run is still time you're paying for. Side note for fellow Pythonistas: if you're using pip, virtualenv, or other older Python package and environment management tools and haven't tried uv yet, you really should. I'm not being paid to say that; it's just a great, free tool. It's built in Rust so it's wicked fast, and it makes the usual faff of handling Python virtual environments painless and reliable.

Astral's official GitHub Action supports caching out of the box. If you're not already using it, the setup is minimal:

- name: Set up uv
  uses: astral-sh/setup-uv@v5
  with:
    enable-cache: true

This caches the uv cache directory between runs, so subsequent runs skip downloading and building packages that haven't changed. For a project with a non-trivial dependency tree like LEADR the saving is noticeable. Not dramatic on its own, but it compounds with everything else.
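The install step that follows then just syncs from the lock file; a minimal sketch, assuming a uv.lock is committed to the repo:

- name: Install dependencies
  run: uv sync --frozen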

The real fixes

The quick fixes above trimmed the edges and introduced some good housekeeping that was probably overdue, but the core problem was still there: the tests themselves were slower than they needed to be. And the reason for that came down to test design.

True unit tests

Because I was focused on getting the app code right, the Claude agent had less oversight during test creation, and it turns out our "unit tests" were actually integration tests.

The LEADR codebase follows Domain-Driven Design principles and separates the API, business logic, and storage layers within any given domain. When I re-examined our DDD patterns, I discovered that the tests for each layer were importing and running all the code in the layers beneath too. The application code itself was functionally correct and thoroughly tested end-to-end. What was missing was structural testability: services were coupled to their repositories rather than accepting them as injected dependencies. A test for the ScoresService (which handles score-related business logic), for example, was instantiating and calling a real ScoreRepository, and that ScoreRepository was talking to the real test database.

Luckily, that's a code quality and maintainability issue, not a correctness issue. The app served users; it just wasn't architected for proper unit testing.

The result: almost every test in the suite required a running PostgreSQL instance, and almost every test was doing actual I/O. That's not a unit test. That's an integration test in disguise.

What we should have been doing is mocking the layer boundaries. A service test should mock its repository. A route test should mock its service. The only tests that should touch the database are the repository tests themselves, and even those should ideally run against an isolated, lightweight fixture.

That was a big refactor and required updating some of the app code to facilitate more dependency injection...

Dependency injection

The fix was straightforward in principle: more dependency injection. We needed to adjust many services to accept a repository as an additional constructor argument, which is standard DDD and wasn't enforced in the initial build.
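In code terms, that's constructor injection. A simplified, hypothetical version of the pattern (ScoreCreate and the method body are invented for illustration):

# Sketch: the service receives its repository instead of building one itself,
# so a test can hand it a mock at the same boundary.
class ScoresService:
    def __init__(self, repo: ScoreRepository) -> None:
        self._repo = repo

    async def submit(self, score_data: ScoreCreate) -> Score:
        # business rules live here; persistence is delegated to the repository
        return await self._repo.create(score_data)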

Refactoring meant replacing real dependencies with mocks or fakes at the right boundary:

# Before: integration test disguised as a unit test
async def test_submit_score(db_session):
    repo = ScoreRepository(db_session)
    service = ScoresService(repo)
    result = await service.submit(score_data)
    assert result.value == 100

# After: actual unit test
from unittest.mock import AsyncMock

async def test_submit_score():
    repo = AsyncMock(spec=ScoreRepository)
    repo.create.return_value = Score(value=100, ...)
    service = ScoresService(repo)
    result = await service.submit(score_data)
    repo.create.assert_called_once()
    assert result.value == 100

The second version runs in microseconds. No database, no I/O, no connection pool, no teardown. Multiply that saving across hundreds of service and route tests and the numbers start to shift meaningfully.

We still have integration tests, of course. They live in a separate test directory and run on a different schedule. The distinction matters.
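In CI terms the split is as simple as pointing each job at the right directory (directory names here are illustrative):

pytest tests/unit            # mocked, fast, runs on every push
pytest tests/integration     # real database, runs nightly or on merge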

This refactoring gave us roughly a 25% reduction in total CI runtime. Not earth-shattering, but real, and the single largest change achieved in this test optimisation sprint. It also came with the added benefit of tests that are actually easier to read and reason about.

Some people might prefer patching (e.g. unittest.mock.patch or pytest's monkeypatch), but having wrestled over the years with the notoriously specific import-path frustrations and the inevitable reams of extra setup code in tests, I know those people are wrong: dependency injection is a worthwhile investment for clear, reliable code.
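For anyone who hasn't hit that particular wall: patch has to target the module where the name is looked up, not where it's defined, which is exactly the kind of trivia DI lets you forget. A hypothetical example (the module path is invented):

from unittest.mock import patch

# Must patch ScoreRepository where the *service module* imports it;
# patching the defining module does nothing, and moving an import breaks the test.
with patch("leadr.scores.service.ScoreRepository") as MockRepo:
    ...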

Fixture scopes

This is actually one I'd made sure to get right the first time round, having learned the lesson the usual (hard) way before. It's worth covering here because it's easy to overlook and it's another place the cost adds up quietly.

pytest fixtures accept a scope parameter that controls how often the fixture is created and torn down. The default is function, meaning the fixture is set up and destroyed for every single test that requests it. Every. Single. Test.

The other options are class, module, package, and session.

For cheap fixtures this doesn't matter. For expensive ones it matters. For expensive ones used by thousands of tests the impact is enormous. Think about a fixture that creates a PostgreSQL database connection, or spins up a test database schema, or starts a subprocess. If that's scoped to function and you have 500 tests requesting it, you're creating and destroying that resource 500 times. That's thrashing, and it's one of the quietest ways to make a test suite slow.

# Thrashing: new connection pool for every single test
@pytest.fixture
async def db_pool():
    pool = await create_pool(dsn=TEST_DSN)
    yield pool
    await pool.close()

# Better: one pool for the entire test session
@pytest.fixture(scope="session")
async def db_pool():
    pool = await create_pool(dsn=TEST_DSN)
    yield pool
    await pool.close()

The rule of thumb is: scope your fixture to the widest level where it's still safe to share. A database connection pool can be session-scoped because it's stateless from the perspective of individual tests. A database transaction probably needs to be function-scoped so each test gets a clean slate. Anything that mutates shared state needs a narrower scope; anything that's read-only or merely provides access to a resource can usually go wider.
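As a concrete example of that split, a function-scoped transaction can sit on top of the session-scoped pool; a sketch assuming an asyncpg-style API:

# One pool per session, one rolled-back transaction per test: isolation
# without rebuilding the expensive resource every time.
@pytest.fixture
async def db_conn(db_pool):
    async with db_pool.acquire() as conn:
        tx = conn.transaction()
        await tx.start()
        yield conn
        await tx.rollback()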

The trap I fell into years ago (and the reason I was careful this time) was having everything at function scope by default and never questioning it. The test suite worked, the tests were isolated, and I had no idea I was paying a massive overhead in setup and teardown on every run. This is exactly the kind of thing --durations=0 surfaces as long setup times.

More power

Truth be told, 2,000 tests just take time to run and no simple hack (aside from €€€) is going to change that. Even better software design will only get you so far. This leaves three real options depending on your situation.

I like infra

Some people like setting up cloud infrastructure. Many cloud providers offer a free tier that can be perfect for the occasional bursts of CI tasks and nightly builds. Exactly how to deploy a self-hosted GitHub Actions runner on AWS without incurring any costs is beyond the scope of this article, but it's perfectly possible. Ask your favourite search engine or AI.

The key thing to look for is a provider with a generous free tier that covers the compute hours you actually need. Spot instances or equivalent can bring costs down further if you're comfortable with the occasional interruption.

I've got time to tinker

If you happen to have another device lying around doing nothing, hosting your own runner can be a game changer. An old laptop, a Raspberry Pi (for lighter workloads), or a NUC sitting under your desk can serve as a dedicated CI runner with zero ongoing cost beyond electricity.

GitHub makes it straightforward to register a self-hosted runner. The main downsides are maintenance (it's your hardware, your problem) and security considerations if you're running workflows from forks or public PRs. For a personal or small-team open-source project, these are manageable.
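Once a runner is registered, pointing a workflow at it is a one-line change (the extra labels are optional):

jobs:
  tests:
    runs-on: [self-hosted, linux, x64]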

F*** it, I'll just pay

By far the easiest option is to pay for a subscription that gives you a greater allowance of minutes and, hopefully, other features that justify the cost.

Despite implementing all of the above methods to some degree for LEADR, we still opted to upgrade. The CI jobs need to always run. We trigger boatloads of tests, linting checks, image builds, docs updates, CLI releases, and more. The reliability of knowing your pipeline will never stall because you've hit a usage limit is worth the money, at least once your project reaches a certain level of activity.

Conclusion

There is no single trick here. The improvements came from stacking up small wins: parallel execution locally, smarter CI configuration, caching, and then the bigger structural fix of actually writing proper unit tests with mocked dependencies.

Our total CI runtime for the test suite still hovers around 12 minutes on a GitHub-hosted runner. Still far from instant, but at least now I'm confident it's 12 minutes well spent.

The biggest lesson, honestly, was about the AI-assisted development workflow. When you're moving fast with agentic tools, it's easy to rubber-stamp generated test code even if you're manually scrutinising app code more closely. Taking the time to verify that your tests are actually testing what you think they're testing, at the granularity you think they're testing it, is worth doing before you end up with 2,000 integration tests and a CI bill to match.