Recently, I encountered a subtle bug in an event-driven system. Looking at the symptoms, the immediate defect seemed clear to me, but these days I tend to rubber-duck most bugs with an AI model before making the fix. For the last few months I have mostly used Claude, but my employer recently rolled out initial support for Codex, which also seems well regarded in the industry. This time I decided to do the back-and-forth with both models. I did not expect much difference in the results, but I was a little curious how Codex would fare against Claude, and I also wanted to be exhaustive and make sure I didn't miss anything.
Here is the exact technical scenario, the code, and how the two models handled it. For the impatient: on the Claude side the exact model was Opus 4.6 (high effort), and on the Codex side I tried out GPT-5.4 (xhigh effort).
The Architecture and The Bug
The system uses a message broker (AWS SQS) to process asynchronous commands.
- Each command represents a unit of work that needs to be done.
- Commands are generally idempotent unless the state of the system has changed between successive commands of the same kind.
- Commands are “handled” by a number of “command executors”.
- An SQS listener pulls a command from the queue and loops through the command executors, running each one against the incoming command.
- If a handler fails, the loop continues gracefully with the next one, while keeping track of the failed executors.
- Each executor may also return another command representing an additional unit of work.
- Such additional commands are re-queued back to the same SQS queue.
- If any of the handlers fail, the main listener method throws an exception, leaving the message unacknowledged so the queue will retry it.
- The queue also has a DLQ configured, so after a few unsuccessful attempts it will move the message to the DLQ for the on-call engineer to look at.
To prevent redundant work, the listener contained an optimization. It filtered out any follow-up commands that were semantically identical to the incoming command.
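The filter hinges on what "semantically identical" means. The real isSameCommand is not shown in the post; as a purely hypothetical sketch (the field names type, targetId, and createdAt are my invention, not the real system's model), it might compare the requested work while ignoring metadata:

```java
// Hypothetical sketch of a Command with isSameCommand.
// Field names are illustrative, not the real system's model.
record Command(String type, String targetId, long createdAt) {
    // Two commands are "the same" if they request the same unit of work,
    // regardless of when each was created.
    boolean isSameCommand(Command other) {
        return type.equals(other.type) && targetId.equals(other.targetId);
    }
}
```

Under this reading, a follow-up command can be "the same" as the incoming one even though it is a distinct object created later.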
Here is a simplified version of the code I gave the LLMs:
// Method is called from a parent poller class
// after deserializing the SQS message
void processMessage(Command command) {
    var result = new ResultContainer();
    for (var executor : commandExecutors) {
        try {
            result.addSuccess(executor.run(command));
        } catch (Exception e) {
            result.addFailure(executor, e);
        }
    }
    // THE OPTIMIZATION:
    // Filter out commands identical to the incoming one.
    // Rationale: if we fail below,
    // SQS will retry the incoming command anyway.
    var commandsToPublish = result.allCommandsToRequeue()
            .stream()
            .filter(c -> !c.isSameCommand(command))
            .toList();
    queueClient.enqueueCommands(commandsToPublish);
    // If any of the handlers failed,
    // throw to trigger an SQS retry of the original message
    if (!result.allCommandsSucceeded()) {
        throw new RuntimeException("Handlers failed for " + command);
    }
}
Can you spot the flaw?
Round 1: Claude
Claude found the localized logic bug immediately.
It correctly pointed out that the filtering was unconditional: it happened even when the entire loop succeeded. If all handlers succeeded (no exception thrown, so the poller acknowledges the message to SQS) but at least one handler intentionally emitted a follow-up command semantically equivalent to the original, the system would filter it out and silently drop the work.
It proposed this fix, using a ternary operator to only apply the filter if the handlers failed:
// Claude's proposed fix:
var commandsToPublish = result.allCommandsSucceeded()
        ? result.allCommandsToRequeue()
        : result.allCommandsToRequeue().stream()
                .filter(c -> !c.isSameCommand(command))
                .toList();
This was a good catch, but by this point I had thought about the code enough to know the fix itself had a flaw. I prompted it:
Me: But if the current command is not retried even after throwing the exception, because it is about to be sent to the Dead Letter Queue (DLQ), will this fix still work?
Claude backtracked.
Claude: Good catch. If the SQS message is on its final retry attempt, throwing an exception sends it to the DLQ, not back to the active queue. Because the code had already filtered out the follow-up command, the work would be lost entirely.
It concluded that the safest fix was to remove the filter entirely, and proposed this:
void processMessage(Command command) {
    var result = new ResultContainer();
    for (var executor : commandExecutors) {
        try {
            result.addSuccess(executor.run(command));
        } catch (Exception e) {
            result.addFailure(executor, e);
        }
    }
    queueClient.enqueueCommands(result.allCommandsToRequeue());
    // If any of the handlers failed,
    // throw to trigger an SQS retry
    // of the original message
    if (!result.allCommandsSucceeded()) {
        throw new RuntimeException("Handlers failed for " + command);
    }
}
While Claude was technically accurate at a line-by-line level, it relied on me to provide the architectural context about the DLQ. Seeing that I had to supply that extra context, I decided to try the same exercise with Codex. In my experience, Claude can generally find such context on its own, but lately I have been seeing more and more examples where it didn't (or it may just be confirmation bias on my part).
Round 2: Codex
I gave the exact same starting prompt to Codex. Its approach was different enough from Claude's that I would credit its response with a deeper understanding of the system. That is what I value more than the immediate code: help me watch out for things I might be overlooking.
Instead of just looking at that specific code block, it mapped out the execution path across multiple files. Without any prompting from me, it searched for the message’s ApproximateReceiveCount in the base listener class. Because the local processMessage method couldn’t see how many times the message had been delivered, Codex deduced it had no way of knowing if this was the last delivery before the DLQ.
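For context, SQS reports the delivery count via the ApproximateReceiveCount message system attribute, and the redrive policy's maxReceiveCount decides when a message moves to the DLQ. A hedged sketch of the kind of check the listener could in principle have made (plain Java over an attribute map; the maxReceiveCount parameter stands in for the queue's configured redrive limit, which this code would have to be told):

```java
import java.util.Map;

// Sketch: deciding whether this delivery is likely the last one before
// the redrive policy moves the message to the DLQ.
// "ApproximateReceiveCount" is a real SQS system attribute;
// maxReceiveCount is an assumed input mirroring the queue's redrive policy.
class DeliveryInspector {
    static boolean isFinalAttempt(Map<String, String> attributes, int maxReceiveCount) {
        // SQS sets the count to 1 on the first delivery.
        int receiveCount = Integer.parseInt(
                attributes.getOrDefault("ApproximateReceiveCount", "1"));
        return receiveCount >= maxReceiveCount;
    }
}
```

The point Codex made is precisely that processMessage receives none of this: it only sees the deserialized Command, so it cannot make this check.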
It took some additional time and was a little verbose, but it finally concluded that the optimization was inherently unsafe. Effectively, its first solution was what Claude gave me only after I prompted it about the DLQ retry situation.
Round 3: The Dangerous Optimization
I wanted to explore if there was still a place for saving some redundant work by not enqueuing some commands under certain conditions. I didn’t take the time to think this through but asked anyway if I could make the following change safely:
Me: Now that we always enqueue the follow-up commands, if a handler fails, and one of the returned follow-up commands is same as that of the current incoming failed command, let’s just not throw an exception for the current command.
// Unconditionally enqueue all follow-up commands
queueClient.enqueueCommands(result.allCommandsToRequeue());
// My proposed (flawed) idea:
if (!result.allCommandsSucceeded()) {
    var currentCommandIsAccountedFor = result.allCommandsToRequeue()
            .stream()
            .anyMatch(c -> c.isSameCommand(command));
    // If we have already enqueued a similar command,
    // then let the new command handle it.
    // This will also ACK the current SQS message.
    if (!currentCommandIsAccountedFor) {
        throw new RuntimeException("Handlers failed...");
    }
}
When I presented this to Claude, it agreed that this was a reasonable change and offered to write a test for it. But Codex actually pushed back.
Codex: I would take the isSameCommand check out for now. The reason is simple: with the current handlers, it does not buy you anything in practice.
Then it proceeded to list each individual command executor and the follow-up command it returns, and analyzed whether the optimization was actually worth it. It concluded it wasn't. Even if it did work, it would make debugging much harder later by acknowledging failed messages based on a speculative, inferred future replay.
Codex: Even if the new follow-up commands are the same as the incoming current command, those commands will start with a fresh receive count in SQS, and thus you lose the easy visibility into repeatedly failing messages. Bottom line: if you stop throwing, you are no longer using SQS retry/DLQ as the failure mechanism for the messages.
It explained that throwing an exception bounds retries for the current message. If a deterministic bug is crashing the handler, throwing ensures the message’s SQS receive count goes up until it gets safely isolated in the DLQ.
If we acknowledge the failing message and rely on our newly published "covering" command instead, that new command hits SQS as a brand-new message whose receive count starts over. If it fails again, it publishes another new message, which also starts from scratch.
By trying to be clever, my optimization would bypass the DLQ entirely, turning a standard SQS failure into a harder-to-detect failure loop.
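To make the failure loop concrete, here is a toy simulation (plain Java, no real SQS) of the two strategies under a deterministically failing handler. The maxReceiveCount of 3 and the cap of 10 attempts are arbitrary values chosen to keep the simulation finite:

```java
// Toy model: each delivery of an SQS message carries a receive count,
// and the redrive policy moves the message to the DLQ once the count
// reaches MAX_RECEIVE_COUNT.
class RetrySimulation {
    static final int MAX_RECEIVE_COUNT = 3;

    // Strategy A: throw on failure. SQS redelivers the SAME message,
    // so its receive count increments until the DLQ takes it.
    static int deliveriesUntilDlq() {
        int receiveCount = 0;
        while (true) {
            receiveCount++; // SQS increments on every delivery
            if (receiveCount >= MAX_RECEIVE_COUNT) {
                return receiveCount; // redrive policy moves it to the DLQ
            }
            // handler fails deterministically -> message stays on the queue
        }
    }

    // Strategy B: ack the failure and publish a "covering" command.
    // Every covering command is a brand-new message, so its receive
    // count restarts and the DLQ threshold is never reached.
    static int deliveriesInFirstTenAttempts() {
        int deliveries = 0;
        for (int attempt = 0; attempt < 10; attempt++) {
            int receiveCount = 1; // new message: count starts over
            deliveries++;
            // handler fails, we ack anyway and enqueue a fresh copy,
            // so receiveCount never approaches MAX_RECEIVE_COUNT
            assert receiveCount < MAX_RECEIVE_COUNT;
        }
        return deliveries; // still failing, and no DLQ in sight
    }
}
```

Strategy A terminates in the DLQ after a bounded number of deliveries; strategy B keeps failing indefinitely without ever tripping the redrive policy.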
I was very happy with this trade-off analysis. I got even lazier and came up with another suggestion, still exploring whether isSameCommand is actually useful for preventing redundant work. Here was my next suggestion:
Me: What if, while we are accumulating the individual follow-up commands from the executors, we do some deduplication. Like so:
void processMessage(Command command) {
    var result = new ResultContainer();
    for (var executor : commandExecutors) {
        try {
            // My proposal
            var followupCommand = executor.run(command);
            var followupCommandIsAccountedFor = result.allCommandsToRequeue()
                    .stream()
                    .anyMatch(c -> c.isSameCommand(followupCommand));
            // Only enqueue the current executor's follow-up command
            // if a similar command is not already queued
            if (!followupCommandIsAccountedFor) {
                result.addSuccess(followupCommand);
            }
        } catch (Exception e) {
            result.addFailure(executor, e);
        }
    }
    queueClient.enqueueCommands(result.allCommandsToRequeue());
    ..
}
Claude’s response was to agree with the change.
Claude: The handler chain dedup logic in XXX looks correct to me. It properly deduplicates commands across executors using isSameCommand, and the underlying types all have sound equality semantics.
But even here Codex pushed back (though with a slightly softer stance):
Codex: The new logic inside the loop dedupes only one way: it drops a new command if it is a subset of an existing one, but it does not replace an existing narrower command when a broader one arrives later. That is order-dependent and can be non-deterministic if the list of commands is not always in the same order.
It also observed that command objects carry other fields, like source (the executor that originated them), which are used in observability metrics and tracing; losing that might be a bad trade-off.
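Codex's order-dependence point is easy to demonstrate with a toy example. In this hypothetical sketch (ScopedCommand and its broad/narrow flag are my invention), a "broad" command covers a "narrow" one of the same type but not vice versa, making isSameCommand asymmetric; deduping while accumulating then yields different survivors depending on arrival order:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical commands where a "broad" command covers a "narrow" one
// of the same type, but not the reverse, so isSameCommand is really a
// subset check (as Codex noted).
record ScopedCommand(String type, boolean broad) {
    // true if `other` already covers this command's work
    boolean isSameCommand(ScopedCommand other) {
        return type.equals(other.type) && (other.broad || !broad);
    }
}

class Dedup {
    // The in-loop dedup from my proposal: drop a new command if any
    // already-accumulated command covers it.
    static List<ScopedCommand> accumulate(List<ScopedCommand> incoming) {
        var kept = new ArrayList<ScopedCommand>();
        for (var c : incoming) {
            boolean covered = kept.stream().anyMatch(c::isSameCommand);
            if (!covered) kept.add(c);
        }
        return kept;
    }
}
```

With [broad, narrow] the narrow command is dropped and one command survives; with [narrow, broad] nothing is dropped and both survive, because the broad arrival never replaces the narrow one already kept.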
I thought about it and was happy with the reasoning, so, as Codex suggested, I ripped out all the premature optimization and filtering logic, ending up with this dead-simple version:
void processMessage(Command command) {
    var result = new ResultContainer();
    for (var executor : commandExecutors) {
        try {
            result.addSuccess(executor.run(command));
        } catch (Exception e) {
            result.addFailure(executor, e);
        }
    }
    queueClient.enqueueCommands(result.allCommandsToRequeue());
    // If any of the handlers failed,
    // throw to trigger an SQS retry of the original message
    if (!result.allCommandsSucceeded()) {
        throw new RuntimeException("Handlers failed for " + command);
    }
}
The Takeaway
I actually enjoyed this session with Codex a lot more. It felt like I did not have to work as hard to provide additional context, and I was conversing in the kind of shorthand I would only use with another engineer on my team.
Claude acted like a good engineer but stayed within the walls of the immediate code, while Codex acted more like a senior engineer, looking at the infrastructure and architecture surrounding the code.
In this case at least, it genuinely provided value beyond what I asked for: not just writing correct code, but refusing to immediately agree with my questionable attempts. I intend to use it more often now, especially for things that touch system boundaries.
Footnote: Idempotency
The final code might result in a "double-enqueue" (one from the SQS retry and one from the manual enqueue). That is an acceptable trade-off for me right now: as mentioned at the start of the post, command executions can be considered essentially idempotent.
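If double-enqueue ever stops being acceptable, the usual remedy is to make execution idempotent at the handler rather than dedupe at the publisher. A hypothetical sketch using a processed-key check (the key format and the in-memory store are assumptions; the real system would need a durable store):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: an executor that tolerates duplicate deliveries by recording
// a logical key per unit of work. In production this set would be a
// durable store (e.g. a database table with a unique constraint),
// not process memory.
class IdempotentExecutor {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true if work was performed, false if it was a duplicate.
    boolean run(String commandKey) {
        if (!processed.add(commandKey)) {
            return false; // already handled: the duplicate enqueue is harmless
        }
        // ... perform the actual side effect here ...
        return true;
    }
}
```

With this in place, the occasional duplicate message becomes a cheap no-op instead of something the publisher has to anticipate.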