Turning Recorded Production Traffic into Replayable Feature Tests

Patrick Ogbuitepu
May 5, 2025
Engineering/Quality

Writing end-to-end feature tests by hand is slow, and the tests you end up with cover only the cases someone thought to write down. Meanwhile, users and our own testers use the application constantly, hitting edge cases nobody would have thought to script, and almost all of that valuable behaviour disappears the moment the browser tab closes. A better source of test cases is already sitting there: the requests that have already happened.

The idea is simple. While a special recording mode is switched on, every request the application handles gets written to disk as a small file that contains the action being called, every value sent in with it, the relevant session details at that moment, and basic timing information. Each file is named after the action and the time, and they all collect into one folder for that testing session.

    {
      "name": "patient_queue-execute-1718900012.json",
      "start": 1718900012.41,
      "data": {
        "get": { ... },
        "post": { "patient_id": "...", "systolic": "118", "diastolic": "76" },
        "session": { "user_role": "nurse", "store": "..." }
      }
    }

None of this asks the tester to think about test automation at all. They are simply using the application normally: working through an actual triage flow, registering an actual patient, processing an actual discharge. The recording happens quietly as a side effect of that normal use.
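The recording step can be sketched roughly as follows, in Python purely for illustration. The function name `record_request` and its parameters are assumptions, not the actual implementation; only the file layout and the action-plus-time naming follow the example above.

```python
import json
import time
from pathlib import Path


def record_request(session_dir, action, get, post, session, start=None):
    """Write one handled request to disk as a small JSON file.

    Hypothetical helper: the names are illustrative, the layout mirrors
    the recorded-file example in the text.
    """
    start = time.time() if start is None else start
    # Files are named after the action and the (whole-second) time.
    name = f"{action}-{int(start)}.json"
    record = {
        "name": name,
        "start": round(start, 2),
        "data": {"get": dict(get), "post": dict(post), "session": dict(session)},
    }
    # All files for one testing session collect in a single folder.
    folder = Path(session_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / name
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Note that with whole-second names, two calls to the same action within one second would overwrite each other in this sketch, so a real recorder would need some tie-breaker.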

A recording can be replayed in two ways. The first, cheaper way is a headless replay. We feed a recorded file's values straight back into the same request entry point, completely outside a browser, through a small script or test runner that rebuilds the request and checks the shape of the response: the status, the key fields, and the absence of an error in the result. This runs fast enough that we can replay hundreds of recorded scenarios inside our build pipeline in seconds, and it exercises the same code path an actual request would hit.
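A minimal sketch of such a headless replay, in Python for illustration. The `handle_request` callable stands in for the application's real entry point, and both its signature and the dict-shaped response with a `status` key are assumptions made for this example.

```python
import json
from pathlib import Path


def replay_headless(record_path, handle_request, required_fields=("status",)):
    """Feed one recorded file back into the entry point and check the response shape.

    `handle_request(action, get, post, session)` is a hypothetical stand-in
    for the real request entry point. Returns a list of problems (empty = pass).
    """
    record = json.loads(Path(record_path).read_text(encoding="utf-8"))
    # File names look like "<action>-<unix time>.json"; drop the time suffix.
    action = record["name"].rsplit("-", 1)[0]
    data = record["data"]
    response = handle_request(
        action, data.get("get", {}), data.get("post", {}), data.get("session", {})
    )
    if not isinstance(response, dict):
        return ["response is not a dict"]
    # Shape checks only: key fields are present and no error is reported.
    problems = [f"missing field: {name}" for name in required_fields if name not in response]
    if response.get("error"):
        problems.append(f"unexpected error: {response['error']}")
    return problems
```

A test runner could call this for every file in a session folder and fail the build when any returned list is non-empty.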

The second way is a browser-driven replay, for cases where what the user actually sees matters, not only what the server returns. A sequence of recorded actions becomes a script for a browser tool that opens the page, fills in the recorded values, submits, and checks what shows up on screen. This is slower and more brittle than the headless version, but it catches a whole class of bugs the headless version cannot.

We use both: a large, fast set of headless tests covering breadth, and a smaller, slower set of browser tests covering the handful of flows where what the screen shows really matters.

A test written by hand encodes what its author thought mattered. A recorded session encodes what actually happened: the field a tester left blank because the flow allowed it, or a sequence of clicks that arrived in an order nobody who wrote the spec expected. Building our test set from recorded sessions means the blind spots in our tests are our actual blind spots, not a second, separate set introduced by whoever happened to write the tests.

It also makes growing our coverage far cheaper. Instead of an engineer sitting down to list out scenarios, the act of testing a new feature becomes test case generation in itself. Turn on recording, test normally, and every session becomes a candidate test with no extra writing needed.

We currently rely on people to enter dummy data during testing rather than automating the sanitisation process. Because recordings can become a permanent part of our test suite, we review them before storing them and replace any realistic-looking information with clearly fake equivalents that preserve the same structure. This ensures replayed scenarios still exercise the same validation and formatting rules without retaining actual data.
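The review described above is manual, but a small helper can assist the reviewer. The following Python sketch is a hypothetical aid, not our tooling: `mask_value` and `sanitise` are invented names, and the reviewer is assumed to supply the list of fields that look realistic. It replaces letters and digits with fixed placeholders while keeping length, case, and punctuation, so format rules still apply on replay.

```python
import copy


def mask_value(value):
    """Replace every letter with X/x and every digit with 0, keeping structure."""
    def swap(ch):
        if ch.isdigit():
            return "0"
        if ch.isalpha():
            return "X" if ch.isupper() else "x"
        return ch  # punctuation and spacing are preserved as-is
    return "".join(swap(ch) for ch in value)


def sanitise(record, sensitive_fields):
    """Return a copy of a recording with the named fields masked.

    Only fields the reviewer flagged are touched, so values such as vitals
    keep exercising the same range checks.
    """
    clean = copy.deepcopy(record)
    for section in ("get", "post", "session"):
        values = clean["data"].get(section, {})
        for field in sensitive_fields:
            if isinstance(values.get(field), str):
                values[field] = mask_value(values[field])
    return clean
```

Masking of this kind preserves shape but not checksums or value ranges, so fields validated that way would still need a hand-picked fake value.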

We also had to deal with the parts that are supposed to change every time, such as timestamps, generated IDs, and session tokens. A simple comparison between a replay and the original recording would incorrectly report all of these as differences. To prevent that, our replay tooling automatically knows which fields are expected to vary and either skips them or handles them specially before comparing the replayed result with the original one.
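As a rough illustration of the "skip" half of that approach, the Python sketch below removes fields that are expected to vary before comparing two results. The field names in `VOLATILE_FIELDS` and both function names are assumptions for this example; fields that need special handling rather than skipping are not covered here.

```python
# Hypothetical names for fields that legitimately differ on every run.
VOLATILE_FIELDS = frozenset({"timestamp", "id", "token"})


def strip_volatile(value, volatile=VOLATILE_FIELDS):
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_volatile(item, volatile)
            for key, item in value.items()
            if key not in volatile
        }
    if isinstance(value, list):
        return [strip_volatile(item, volatile) for item in value]
    return value


def same_result(recorded, replayed, volatile=VOLATILE_FIELDS):
    """Compare two results while ignoring fields that are expected to vary."""
    return strip_volatile(recorded, volatile) == strip_volatile(replayed, volatile)
```

With this in place, two responses that differ only in a generated ID or a timestamp compare as equal, while a changed status still shows up as a difference.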

We learned that the test process itself has to be planned carefully. Rather than depending on the exact database state, we design scenarios around predictable data and validate behaviour instead of exact output. We check whether a save succeeded or whether the expected error was returned, rather than expecting a specific patient record or stock level to exist. A recording tied to one particular piece of data will inevitably fail once the actual database has moved on.
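One way to express "validate behaviour instead of exact output" is to reduce each response to its outcome before comparing. This Python sketch assumes a dict-shaped response whose optional `error` key holds an error code; that shape and the function names are illustrative assumptions.

```python
def outcome(response):
    """Reduce a response to the behaviour we care about: success, or which error."""
    error = response.get("error")
    if error:
        return ("error", str(error))
    return ("ok", None)


def behaves_the_same(recorded, replayed):
    """True when both responses succeeded, or both failed with the same error."""
    return outcome(recorded) == outcome(replayed)
```

Two successful saves that refer to different patient records count as the same behaviour here, whereas a save that now returns an error does not.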

Finally, we curate the collection instead of letting it grow forever. Not every recorded session is worth keeping as a lasting test. Many are simply repeats of the same basic flow. We review the collection from time to time and keep the sessions that actually exercise a different path. That is the same care we would give to tests written by hand, and it stops a pile of near-duplicate recordings from slowing the whole suite down for no benefit.
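A heuristic like the following Python sketch could help a reviewer spot near duplicates before deciding what to keep. The notion of a "path signature" (action, user role, and which fields were filled in) is an assumption made for this example, not a description of how we judge distinct paths.

```python
def path_signature(record):
    """Summarise a recording by action, role, and the set of non-blank posted fields."""
    action = record["name"].rsplit("-", 1)[0]
    data = record["data"]
    filled = tuple(
        sorted(key for key, value in data.get("post", {}).items() if value not in ("", None))
    )
    role = data.get("session", {}).get("user_role")
    return (action, role, filled)


def pick_distinct(records):
    """Keep the earliest recording for each signature; later repeats are dropped."""
    kept = {}
    for record in sorted(records, key=lambda r: r["start"]):
        kept.setdefault(path_signature(record), record)
    return list(kept.values())
```

Because blank fields change the signature, a session where the tester left a field empty is kept alongside the fully filled-in version of the same flow.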

Once the cleaning and comparison tooling exists, it keeps paying off. Every future testing session becomes a source of new test coverage at almost no extra cost. For a team trying to keep its test coverage growing as fast as its feature set, that is a genuinely different economics from writing every test case by hand.

Maybeach Tech builds testing tools that turn actual usage into test coverage automatically. Get in touch and let us talk about the gaps in your own test suite.
