ADR-001a: Netty Channel Lifecycle with Cats Effect IO¶
Status: Accepted
Date: November 2025
Parent ADR: INF-001: Scala 3 Migration
Deciders: Chippr Robotics LLC Engineering Team
Context¶
During the Scala 3 and Cats Effect 3 migration (ADR-001), we encountered a critical issue with the vendored scalanet library's UDP channel management. Channels were reporting as "CLOSED" during peer enrollment despite successful bind operations, preventing peer discovery from functioning.
Original Implementation (IOHK scalanet with Monix Task)¶
The original IOHK scalanet library (v0.8.0) used Monix Task and followed this pattern:
// Original Monix Task pattern
private lazy val serverBinding: ChannelFuture =
new Bootstrap().bind(localAddress)
override def client(to: Address): Resource[Task, Channel] = {
for {
_ <- Resource.liftF(raiseIfShutdown)
remoteAddress = to.inetSocketAddress
channel <- Resource {
ChannelImpl(
nettyChannel = serverBinding.channel, // Direct access to lazy val
localAddress = localAddress,
remoteAddress = remoteAddress,
...
).allocated
}
} yield channel
}
private def initialize: Task[Unit] =
toTask(serverBinding) // Wait for bind to complete
.onErrorRecoverWith { ... }
Key Characteristics:
- serverBinding is a lazy val that creates and caches the ChannelFuture
- initialize() waits for the bind operation to complete via toTask()
- Client channels access the Netty channel directly via serverBinding.channel
- No intermediate caching of the channel reference
Migrated Implementation (Initial Cats Effect IO Attempt)¶
The initial migration to Cats Effect IO introduced an optimization:
// Initial CE3 migration with boundChannelRef
class StaticUDPPeerGroup[M] private (
...
boundChannelRef: Ref[IO, Option[io.netty.channel.Channel]]
)
private def initialize: IO[Unit] =
for {
_ <- toTask(serverBinding)
channel = serverBinding.channel()
_ <- boundChannelRef.set(Some(channel)) // Cache channel reference
} yield ()
override def client(to: Address): Resource[IO, Channel] = {
for {
nettyChannel <- Resource.eval(boundChannelRef.get.flatMap {
case Some(ch) => IO.pure(ch)
case None => IO.raiseError(...)
})
channel <- Resource { ... }
} yield channel
}
Problems Introduced:
1. Race Condition: The channel reference was cached in boundChannelRef before Netty's async initialization completed
2. State Staleness: Accessing the cached reference could return a channel in an intermediate state
3. Thread Safety: The channel state was being inspected from different threads than Netty's event loop
4. Lazy Val Semantics: The lazy val serverBinding evaluation timing differed between Task and IO contexts
Investigation Findings¶
Netty Channel Lifecycle¶
Understanding Netty's channel lifecycle was critical:
1. Bootstrap.bind() called
↓
2. Channel created (NEW state)
↓
3. Channel registered with EventLoopGroup (REGISTERED)
↓
4. Bind operation initiated (BINDING)
↓
5. Bind completes, ChannelFuture fires (BOUND)
↓
6. Channel becomes active (ACTIVE)
↓
7. Channel ready for I/O operations
Critical Insight: The ChannelFuture returned by bind() completes at step 5, but the channel may not be in ACTIVE state (step 6) immediately. The cached channel reference at step 5 could be inspected before step 6 completes.
Monix Task vs Cats Effect IO Differences¶
| Aspect | Monix Task | Cats Effect IO |
|---|---|---|
| Evaluation | Lazy by default | Depends on context (eager/lazy) |
| Thread Pool | Scheduler-based | Work-stealing executor |
| Future Integration | Direct Task.fromFuture |
IO.async with callbacks |
| Lazy Val Interaction | Predictable sequencing | Can vary with fiber scheduling |
| Blocking Operations | Explicit .executeOn |
IO.blocking shift |
Key Discovery: When serverBinding (a lazy val containing a ChannelFuture) is evaluated in an IO context, the timing of when downstream operations see the channel state can vary based on fiber scheduling. Monix Task's scheduler had more predictable sequencing.
Root Cause Analysis¶
The bug manifested as:
ERROR - Netty channel is CLOSED when trying to send
Channel: NioDatagramChannel, isActive: false, isRegistered: false
Root causes:
1. Premature Caching: boundChannelRef.set(Some(channel)) happened before the channel was fully active
2. Async Completion: The bind future completing doesn't guarantee channel activation
3. Cross-Thread Access: Checking channel.isActive from IO fiber vs Netty event loop thread
4. Resource Cleanup: If initialization checks failed, the EventLoopGroup shut down, closing all channels
Decision¶
We decided to revert to the original IOHK scalanet pattern:
// Corrected CE3 pattern (matches original)
class StaticUDPPeerGroup[M] private (
...
// boundChannelRef removed
)
private lazy val serverBinding: io.netty.channel.ChannelFuture =
new Bootstrap()
.group(workerGroup)
.channel(classOf[NioDatagramChannel])
.bind(localAddress)
private def initialize: IO[Unit] =
for {
_ <- toTask(serverBinding).handleErrorWith { ... }
_ <- IO(logger.info(s"Server bound to address ${config.bindAddress}"))
} yield ()
override def client(to: Address): Resource[IO, Channel] = {
for {
_ <- Resource.eval(raiseIfShutdown)
remoteAddress = to.inetSocketAddress
nettyChannel = serverBinding.channel() // Direct access, no caching
channel <- Resource { ... }
} yield channel
}
Key Changes:
1. Removed boundChannelRef parameter and all caching
2. Access channel directly from serverBinding.channel() like the original
3. Simplified initialize() to match original pattern
4. Let Netty's internal synchronization handle channel state
Consequences¶
Positive¶
- Eliminated Race Condition: No premature caching of channel references
- Simpler Code: Removed complexity of managing
boundChannelRef - Proven Pattern: Matches battle-tested original IOHK implementation
- Thread Safety: Let Netty manage its own threading and state
- Test Validation: All 3 unit tests pass reliably; initialization and shutdown work correctly
- Robust Shutdown: Synchronous channel close with error handling prevents shutdown failures
Negative¶
- Migration Complexity: Required deep understanding of Netty and effect system differences
- Investigation Time: Significant effort to identify and resolve both initialization and shutdown races
Neutral¶
- Performance: No measurable difference (caching would have been premature optimization anyway)
- Type Safety: Both approaches are type-safe; the issues were runtime lifecycle management
Lessons Learned¶
For Future Effect System Migrations¶
- Validate Async Resource Lifecycles: Don't assume type-level compatibility means behavioral compatibility
- Compare Line-by-Line: When vendoring libraries, compare with original implementation closely
- Test Resource Initialization: Create specific tests for resource lifecycle sequences
- Avoid Premature Optimization: Don't cache async resources unless proven necessary
- Thread Awareness: Be aware of which thread pool/executor is being used for operations
- Understand Framework Internals: Deep understanding of Netty's lifecycle was essential
Pattern for Netty + Cats Effect Integration¶
DO:
- Let Netty manage its own channel state and threading
- Access channels directly from ChannelFutures when needed
- Wait for bind futures to complete before considering resources ready
- Use IO.blocking for operations that might block on Netty event loops
- Use synchronous channel operations (.syncUninterruptibly()) in shutdown paths
- Add comprehensive logging during debugging to track state transitions
- Handle errors gracefully in shutdown code to avoid cascading failures
DON'T: - Cache Netty channel references in separate Refs/state holders - Inspect channel state from threads other than Netty's event loop - Assume ChannelFuture completion means full resource readiness - Use async operations in shutdown that schedule on potentially-terminating executors - Optimize prematurely by introducing intermediate caching - Skip comparing with original implementations when migrating - Let shutdown failures propagate without error handling
Debugging Approach That Worked¶
- Compare with Original: Looked at IOHK scalanet v0.8.0 source code
- Add Detailed Logging: Tracked channel state through initialization sequence
- Check Thread Context: Logged which thread/executor was running operations
- Test Channel State: Verified
isOpen,isActive,isRegisteredat each step - Follow Netty Lifecycle: Understood the channel's state machine
- Simplify Incrementally: Removed complexity until matching original pattern
Implementation Notes¶
Testing Strategy¶
Unit tests validate: - Basic initialization works and channel becomes active - Client channels can be created after initialization - Multiple peer groups can coexist and shut down cleanly
Final Resolution (November 2025): All three unit tests now pass reliably after fixing the shutdown race condition:
-
Shutdown Race Fix: The final issue was in the
shutdown()method, which usedtoTask(channel.close())to asynchronously close the channel. This scheduled work on Netty's EventLoopGroup, but when multiple peer groups were shutting down in sequence, the executor could already be terminating, causing "event executor terminated" errors. -
Solution: Changed to synchronous close with error handling:
// Before (async scheduling that could fail): _ <- toTask(serverBinding.channel().close()) // After (synchronous close with error handling): _ <- IO { val channel = serverBinding.channel() if (channel.isOpen) { channel.close().syncUninterruptibly() } }.handleErrorWith { error => IO(logger.warn(s"Error closing channel: ${error.getMessage}")) }
This avoids scheduling on the potentially-shutting-down executor and handles errors gracefully.
Integration tests (in production): - Actual peer discovery and enrollment - Long-running stability - Network edge cases (timeouts, unreachable peers, etc.)
Migration Checklist for Similar Issues¶
If encountering similar issues elsewhere in the codebase:
- Compare vendored code with original line-by-line
- Check for cached references to async resources
- Validate resource lifecycle timing (creation → ready → cleanup)
- Test cross-thread state inspection
- Add lifecycle logging
- Create unit tests for resource initialization
- Simplify to match proven patterns
- Document findings in ADR
References¶
- GitHub Issue #337
- PR Fix - Commit 61d2076
- Original IOHK scalanet v0.8.0
- Netty User Guide - Channel Lifecycle
- Cats Effect 3 Documentation - Resource
- INF-001: Scala 3 Migration
Review and Update¶
This ADR should be reviewed when: - Additional Netty integration issues are discovered - Cats Effect releases major version updates - Performance issues arise in network layer - Similar patterns are needed elsewhere (HTTP clients, database connections, etc.)