How to Fix Flaky iOS Tests: A Practical Guide

Revyl Team

Flaky tests are the ones that pass sometimes and fail sometimes with no code change in between. Every iOS team hits this eventually. You push a commit, CI goes red, you re-run the same tests, they pass. You ship. Two hours later someone else’s commit fails on the same tests.

It feels random. It is not random.

Flakiness has causes. They are diagnosable. Most of them are fixable with a specific change to how your tests are written or how your test environment is set up.

This guide covers the four main causes of flaky iOS tests and what to actually do about each one.

Why Flakiness Is Worth Taking Seriously

Before getting into fixes, it helps to understand the cost.

A flaky test suite trains engineers to ignore failures. When a test fails 20% of the time for no good reason, people learn to re-run CI instead of investigating. That habit stays even when there is a real bug. Real failures get dismissed as “probably flaky.”

At scale, this gets expensive fast. At Uber, the engineering team calculated that flaky tests were contributing to $25M per year in delayed releases and wasted engineer time. That’s a number that got people’s attention. They built an internal AI system to address it, which cut flakiness by 91% over four months. That experience is a big part of why Revyl exists.

Most teams don’t have Uber’s resources to build something like that internally. But the underlying causes of flakiness are the same at every scale.

Cause 1: Timing and Async Issues

This is the most common cause of iOS test flakiness by a wide margin.

Your test taps a button. The next screen hasn’t finished loading, so the assertion that follows fails because the element it’s looking for isn’t there yet. Run the same test half a second slower and it passes.

The wrong fix: sleep(2). Hard-coded waits make your tests slow and still flaky. A delay that works on your machine might not hold on a slower device or a loaded CI runner.

The right fix: Wait for specific conditions instead of waiting for time.

With XCTest:

// Bad
sleep(2)
XCTAssertTrue(app.buttons["Submit"].exists)

// Good
let submitButton = app.buttons["Submit"]
let exists = submitButton.waitForExistence(timeout: 5)
XCTAssertTrue(exists)

waitForExistence polls until the element appears or the timeout expires. It’s faster when things load quickly and reliable when they don’t.

For more complex async conditions, use XCT expectations:

// Wait until the Welcome label exists, failing after 10 seconds
let predicate = NSPredicate(format: "exists == true")
let expectation = XCTNSPredicateExpectation(predicate: predicate, object: app.staticTexts["Welcome"])
wait(for: [expectation], timeout: 10)
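
The same mechanism works in reverse, which is useful for loading indicators. A small sketch, assuming the app exposes its spinner under the hypothetical identifier "loading-spinner":

// Wait for the loading spinner to disappear before asserting on content
let gone = NSPredicate(format: "exists == false")
let spinnerGone = XCTNSPredicateExpectation(
    predicate: gone,
    object: app.activityIndicators["loading-spinner"]
)
wait(for: [spinnerGone], timeout: 10)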

For network-dependent views: If your UI waits on an API call, your test needs to wait on the same thing. Either mock the network layer and control the timing yourself, or wait for a specific UI element that only appears after the data loads.
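
As a sketch of that second option: if a dashboard renders its balance label only after the API call completes, waiting for that label is an implicit wait on the network. The "account-balance" identifier here is hypothetical:

// The balance label only appears once the API response has rendered,
// so waiting for it also waits out the network call
let balance = app.staticTexts["account-balance"]
XCTAssertTrue(balance.waitForExistence(timeout: 10), "Data never loaded")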

Cause 2: Simulator State Leaking Between Tests

Each test should start from a clean state. If a previous test left the app in a specific state (logged in, mid-flow, with certain data in the database), the next test might behave differently than expected.

This shows up as tests that pass when run alone but fail when run as part of the full suite.

Find it: Run the failing test in isolation (in Xcode, or with xcodebuild’s -only-testing flag). If it passes alone but fails in the suite, state leakage is the likely cause.

Fix it in setUp and tearDown:

override func setUp() {
    super.setUp()
    continueAfterFailure = false

    let app = XCUIApplication()
    // Reset app state before each test
    app.launchArguments = ["--uitesting", "--reset-state"]
    app.launch()
}

override func tearDown() {
    // Clean up after each test:
    // log out, clear local storage, reset to initial state.
    // At minimum, terminate the app so background work
    // can't leak into the next test.
    XCUIApplication().terminate()
    super.tearDown()
}

Use launch arguments to control app state: Add a --uitesting flag your app reads to skip onboarding, use a test account, or reset local storage. This is cleaner than trying to navigate back to a clean state through the UI at the end of every test.

// In your AppDelegate or App struct
if CommandLine.arguments.contains("--reset-state") {
    UserDefaults.standard.removePersistentDomain(forName: Bundle.main.bundleIdentifier!)
    // Clear keychain, local database, etc.
}

Cause 3: Brittle Element Locators

Your test finds elements by querying the UI. If those queries are fragile, any UI change breaks the test even if the underlying functionality hasn’t changed.

Common fragile patterns:

Position-based queries: “The third cell in the table view.” If the order changes, the test breaks.

Label-based queries that change with copy updates: If your test finds the submit button by its exact label text and a designer changes “Submit” to “Confirm,” the test breaks even though the button still works.

Nested query chains that are too specific: app.scrollViews.otherElements.tables.cells.staticTexts["Username"] breaks whenever the view hierarchy changes.

Better approach: accessibility identifiers.

Set them in your app code:

// In your SwiftUI view
Button("Submit") {
    submitForm()
}
.accessibilityIdentifier("submit-button")

// In your UIKit code
submitButton.accessibilityIdentifier = "submit-button"

Query them in your tests:

app.buttons["submit-button"].tap()

Accessibility identifiers don’t change with copy updates or minor UI reorganization. And because they’re separate from the user-facing accessibility label, you can keep them stable without affecting what VoiceOver users hear.

For dynamic content: When you have lists where items change order or content, query by the identifier of the specific item you care about, not its position.

// Bad: breaks if order changes
app.cells.element(boundBy: 2).tap()

// Good: finds the right item regardless of position
app.cells["transaction-\(transactionId)"].tap()
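
For this pattern to work, the app has to give each cell a stable, data-derived identifier. A sketch in SwiftUI, where Transaction, TransactionRow, and transactions are hypothetical stand-ins for your own model, row view, and data source:

// Give each row an identifier derived from the data it displays
List(transactions) { transaction in
    TransactionRow(transaction: transaction)
        .accessibilityIdentifier("transaction-\(transaction.id)")
}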

Cause 4: Network and External Dependencies

Tests that make real network calls are at the mercy of network latency, API reliability, and test data state. A test that hits a staging API can fail because the staging environment was slow, had stale data, or was briefly down for a deploy.

Recommended approach: mock network calls in UI tests.

Libraries like OHHTTPStubs or your own URLProtocol subclass let you intercept network requests and return controlled responses. One caveat: XCUITests run in a separate process from your app, so stubs have to be installed inside the app itself (typically behind a launch argument) or replaced with a local mock server. In unit and integration tests, which run in-process, you can stub directly from the test:

// Stub a successful login response
stub(condition: isPath("/api/login")) { _ in
    let stubData = """
    {"token": "test-token-123", "userId": "test-user"}
    """.data(using: .utf8)!
    return HTTPStubsResponse(data: stubData, statusCode: 200, headers: nil)
}
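
For UI tests, the same idea can live inside the app behind a launch argument. A minimal sketch of the URLProtocol approach; MockURLProtocol and its stub wiring are illustrative, not a library API:

// In the app target: a URLProtocol subclass that serves canned responses
final class MockURLProtocol: URLProtocol {
    // Map request paths to canned JSON bodies
    static var stubs: [String: Data] = [:]

    override class func canInit(with request: URLRequest) -> Bool {
        guard let path = request.url?.path else { return false }
        return stubs[path] != nil
    }

    override class func canonicalRequest(for request: URLRequest) -> URLRequest {
        request
    }

    override func startLoading() {
        guard let url = request.url, let data = Self.stubs[url.path] else {
            client?.urlProtocol(self, didFailWithError: URLError(.unsupportedURL))
            return
        }
        let response = HTTPURLResponse(url: url, statusCode: 200,
                                       httpVersion: "HTTP/1.1", headerFields: nil)!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        client?.urlProtocol(self, didLoad: data)
        client?.urlProtocolDidFinishLoading(self)
    }

    override func stopLoading() {}
}

// At app startup, only when launched by UI tests
if CommandLine.arguments.contains("--uitesting") {
    MockURLProtocol.stubs["/api/login"] =
        Data(#"{"token": "test-token-123", "userId": "test-user"}"#.utf8)
    URLProtocol.registerClass(MockURLProtocol.self)
}

Note that URLProtocol.registerClass only covers the shared URLSession; if your app builds its own URLSessionConfiguration, add MockURLProtocol to that configuration’s protocolClasses instead.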

With mocked network calls:

  • Tests run faster (no real network round trips)
  • Tests are deterministic (same response every time)
  • You can test error states and edge cases reliably

If you need to test against real data (end-to-end integration tests), keep those in a separate test target that runs less frequently and is understood to be slower and less reliable.

A Framework for Diagnosing Flakiness

When you have a flaky test and don’t know which category it falls into, work through this in order:

  1. Run it 10 times in a row. What’s the failure rate? Consistent failures point to a logic bug, not flakiness. Occasional failures are flakiness. (Xcode 13 and later can automate this: choose Run Repeatedly from the test’s context menu, or pass -test-iterations to xcodebuild.)

  2. Run it alone. Does it pass 10/10 times when isolated? If yes, it’s state leakage from another test. If it still fails occasionally, it’s timing or environment.

  3. Add logging around the failure point. What element is missing? What state is the app in? This narrows it to the specific cause; see the sketch after this list.

  4. Check the timing. Is the failure always on the same assertion? Does adding a longer wait make it more reliable? If yes, it’s an async issue.

  5. Check recent UI changes. Did a recent commit change the element your test is looking for? If yes, it’s a brittle locator.
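
For step 3, XCTest’s activity and attachment APIs can capture what the test saw at the failure point. A sketch; the helper name and label are illustrative:

// Attach a screenshot and the element tree at a suspicious point in a test
func logFailureContext(in app: XCUIApplication, label: String) {
    XCTContext.runActivity(named: label) { activity in
        let screenshot = XCTAttachment(screenshot: app.screenshot())
        screenshot.lifetime = .keepAlways
        activity.add(screenshot)

        // app.debugDescription dumps the current element hierarchy
        let hierarchy = XCTAttachment(string: app.debugDescription)
        hierarchy.lifetime = .keepAlways
        activity.add(hierarchy)
    }
}

Call it just before the assertion that flakes; the attachments land in the .xcresult bundle, even for passing runs.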

When Manual Fixes Stop Being Enough

The four causes above are fixable. Most teams can work through them and get to a much more stable test suite.

The harder problem is staying there. As your app grows, new tests get written by people who don’t know all the anti-patterns. UI changes break locators faster than they get fixed. The suite degrades gradually.

This is the problem AI-native testing is designed to handle. Instead of writing “tap the button with accessibility ID submit-button,” you write “tap the submit button.” The system looks at the current UI, finds what you mean, and executes it. When the button moves or changes, the test doesn’t break because the instruction was in plain language, not a brittle locator.

It doesn’t eliminate all flakiness. Timing issues and state leakage still need to be handled at the architecture level. But it removes the locator brittleness category almost entirely, which is a significant part of the maintenance burden for most teams.


At Uber, we built an AI-powered mobile testing system that cut flaky tests by 91% in four months and saved $25M in delayed releases. That’s the core of what we put into Revyl.

If your iOS test suite is the kind of thing your team dreads dealing with, we’d be glad to show you how it works.

Try Revyl free or learn more at revyl.com.