ADR-INF-004: Actor IO Error Handling Pattern with Cats Effect¶
Status: Accepted
Date: November 2025
Context: PR fixing flaky PeerDiscoveryManager tests
Background¶
During testing of the PeerDiscoveryManager actor, we encountered flaky test failures related to error handling when IO tasks were piped to actor recipients. The root cause was non-deterministic error propagation when using IO.onError().unsafeToFuture().pipeTo() pattern.
The Problem¶
The original error handling pattern in PeerDiscoveryManager.pipeToRecipient:
task
.onError(ex => IO(log.error(ex, "Failed to relay result to recipient.")))
.unsafeToFuture()
.pipeTo(recipient)
This approach had several issues:
-
Non-deterministic behavior:
IO.onErrorruns a callback on errors but rethrows the original error. When combined withunsafeToFuture().pipeTo(), the timing of logging vs. message delivery was unpredictable. -
Race conditions: Actor state transitions could race with error handling, leading to inconsistent actor state.
-
Flaky tests: Tests that simulated IO failures would sometimes pass and sometimes fail due to timing issues.
-
Unclear error delivery: Recipients would receive Scala
Failuremessages, but the conversion wasn't explicit in the code, making the error handling contract unclear.
Evidence from CI/Testing¶
- Job 56121089316 showed failing tests with logs: "Failed to start peer discovery." and "Failed to relay result to recipient."
- Tests like "keep serving the known peers if the service fails to start" and "propagate any error from the service to the caller" exhibited intermittent failures.
- The error log "Failed to relay result to recipient." appeared even when tests passed, indicating error handling was executing but in a non-deterministic way.
Decision¶
We adopt an explicit error handling pattern for all IO tasks piped to actors:
private def pipeToRecipient[T](recipient: ActorRef)(task: IO[T]): Unit = {
implicit val ec = context.dispatcher
// Convert IO[T] into a Future[Either[Throwable, T]] so we can explicitly handle errors
val attemptedF = task.attempt.unsafeToFuture()
// Map Left(ex) -> Status.Failure(ex) so recipients get a clear Failure message
val mappedF = attemptedF.map {
case Right(value) => value
case Left(ex) => Status.Failure(ex)
}
mappedF.pipeTo(recipient)
}
Key Principles¶
-
Use
IO.attempt: ConvertIO[T]toIO[Either[Throwable, T]]to make error handling explicit. -
Map to
Status.Failure: ConvertLeft(ex)toorg.apache.pekko.actor.Status.Failure(ex)before piping to recipients. -
Deterministic delivery: Recipients always receive either:
- The expected message type
Ton success -
Status.Failure(ex)on error -
No side-effects in error path: Avoid callbacks like
onErrorthat introduce timing dependencies. -
Self-piping requires failure handlers: When piping to
self, the actor's receive method must handleStatus.Failuremessages.
Implementation¶
Files Updated¶
- PeerDiscoveryManager.scala:
- Added import:
org.apache.pekko.actor.Status - Updated
pipeToRecipientmethod to use explicit error handling -
All tests passing, no more "Failed to relay result to recipient" errors
-
PeerManagerActor.scala:
- Added
pipeToRecipienthelper method (same pattern) - Updated
GetPeershandler to usepipeToRecipient(sender())(getPeers(...)) - Updated
SchedulePruneIncomingPeershandler to usepipeToRecipient(self)(...) - Added
Status.Failurehandler inhandlePruningto gracefully handle pruning errors
Pattern for Piping to External Actors¶
When piping IO results to external actors (e.g., sender() from an ask):
The caller will receive either:
- The result on success
- Status.Failure(ex) on error (which causes Future from ask to fail with the exception)
Pattern for Piping to Self¶
When piping IO results to self:
case StartAsyncOperation =>
pipeToRecipient(self)(performOperation())
case Status.Failure(ex) =>
log.warning("Async operation failed: {}", ex.getMessage)
// Handle failure appropriately (retry, fallback, etc.)
The actor must explicitly handle Status.Failure messages.
Consequences¶
Positive¶
-
Deterministic error behavior: Errors are always delivered as
Status.Failuremessages. -
No race conditions: State transitions and error handling are ordered by the actor mailbox.
-
Testable: Tests can reliably assert on error cases without flakiness.
-
Clear contract: The error handling contract is explicit in the code.
-
Consistent pattern: Same pattern works for all IO-to-actor scenarios.
-
Better debugging:
Status.Failuremessages are visible in actor system logs with standard formatting.
Negative¶
-
Boilerplate: Each actor using IO needs its own
pipeToRecipienthelper or needs to import a shared one. -
Learning curve: Developers need to understand this pattern vs. the simpler but flaky direct
pipeTo. -
Status.Failure handling: Actors piping to
selfmust remember to handleStatus.Failure.
Migration Impact¶
- Low risk: The change is localized to error handling paths and doesn't affect success cases.
- Backward compatible: External callers see the same behavior (Future fails on error).
- Test improvements: Flaky tests become stable.
Related Patterns¶
When NOT to Use This Pattern¶
-
Pure actor messages: When not using Cats Effect IO at all.
-
context.pipeToSelf: Pekko's
context.pipeToSelfhas built-in error handling and is preferred when the Future is already constructed. -
Synchronous operations: When the operation is purely synchronous, use regular message sends.
Alternative Approaches Considered¶
- Domain-level error messages: Wrap results in ADTs like
Result[T]orOperationResult[T]. -
Rejected: More boilerplate, and
Status.Failureis a standard Pekko pattern. -
Try[T] instead of Either: Use
task.attempt.map(_.toTry). -
Rejected:
Eitheris more composable and explicit in Scala 3. -
Supervisor strategy: Let actors crash and restart on errors.
- Rejected: Not appropriate for expected errors like network timeouts or resource allocation failures.
Future Considerations¶
-
Shared utility: Consider creating a shared
ActorIOOpstrait withpipeToRecipientto reduce boilerplate. -
Typed actors: When/if migrating to Pekko Typed, the equivalent pattern would use typed message protocols with explicit error types.
-
Monitoring: Consider adding metrics for
Status.Failurefrequency to detect systemic issues. -
Documentation: Update internal developer docs with this pattern as a best practice.
Compliance Check¶
All network and actor code using unsafeToFuture().pipeTo() should be reviewed:
- ✅
PeerDiscoveryManager.pipeToRecipient- Updated with explicit error handling - ✅
PeerManagerActor.pipeToRecipient- Updated with explicit error handling - ✅
PeerManagerActor.handlePruning- AddedStatus.Failurehandler - ✅ Regular sync actors (
BodiesFetcher,StateNodeFetcher,HeadersFetcher) - Usecontext.pipeToSelfwith explicit error handling - ⚠️
StateStorageActor- Pipes toself, hascase Failure(e) => throw ehandler (rethrows) - ⚠️
SyncStateSchedulerActor- Pipes toselfbut lacks explicitStatus.Failurehandler (future improvement)
Future Improvements¶
The following actors should be reviewed and potentially updated in future work:
-
StateStorageActor: Currently rethrows failures with
case Failure(e) => throw e. Consider whether graceful error handling would be more appropriate than crashing the actor. -
SyncStateSchedulerActor: Pipes IO results to
selfbut doesn't explicitly handleStatus.Failure. Should add handler to prevent unhandled messages.
References¶
- Cats Effect IO
- Pekko Actor Error Handling
- Pekko Status.Failure
- Original issue: Fix flaky PeerDiscoveryManager tests
- ADR-INF-002: Actor System Architecture (context on untyped actors)
Related Issues¶
- Flaky PeerDiscoveryManager tests (resolved)
- CI job 56121089316 (fixed)
- Future: Apply pattern to other actors as needed