The traditional boundary between development and operations is rapidly dissolving in modern cloud environments. This transformation is particularly evident in scenarios where application code directly manages infrastructure behavior, error handling, and data lifecycle—responsibilities that were once firmly in the operations domain. As James Hamilton noted over a decade ago, “A clear division between ops and dev leads to an over-the-wall approach that is too slow and too inefficient”1. Today’s cloud-native architectures have accelerated this convergence, creating what many now call BizDevOps—a modern approach that blends business, development, and operations to streamline processes and accelerate delivery2.
DynamoDB Streams: Developers as Operational Engineers
Amazon DynamoDB Streams exemplifies this shift by requiring developers to understand and account for operational dynamics when building stream-consuming applications.
Shard Lifecycle Management
DynamoDB stream shards have distinct operational characteristics that developers must comprehend:
- Limited Lifespan: Shards remain active for writing for up to 4 hours before automatically rolling over to new shards, with data retained for 24 hours3.
- Automatic Shard Management: Shards are “ephemeral: They are created and deleted automatically, as needed”4.
- Shard Splitting: When underlying table partitions increase due to growing data volume or higher throughput demands, corresponding stream shards split to maintain performance3.
While AWS Lambda can automate much of the record processing from these shards, developers must still design applications with awareness of these operational dynamics. For instance, they need to account for potential increases in IteratorAge latency during shard discovery after splits or rollovers, which could impact real-time processing guarantees.
Processing Order Requirements
Maintaining correct processing order is critical for data integrity:
“Because shards have a lineage (parent and children), an application must always process a parent shard before it processes a child shard. This helps ensure that the stream records are also processed in the correct order”4.
This requirement directly influences application architecture and falls squarely within the developer’s domain. The DynamoDB Streams Kinesis Adapter helps abstract some complexity by handling “new or expired shards, in addition to shards that split while the application is running”5, but developers must still design with these operational constraints in mind.
BisectBatchOnFunctionError: Coding for Operational Resilience
A compelling example of developers implementing operational logic is the “BisectBatchOnFunctionError” feature for Lambda functions processing DynamoDB Streams:
“When using BisectBatchOnFunctionError, check the BatchSize parameter in the OnFailure destination message’s metadata. The BatchSize could be greater than 1 since Lambda consolidates failed messages metadata when writing to the OnFailure destination”6.
This feature instructs Lambda to split problematic batches into halves and retry them separately when errors occur, effectively isolating “poison pill” records. Developers configure this behavior through code and configuration, defining how the system operates under failure conditions—a responsibility traditionally belonging to operations teams.
Event Sourcing: Developers Shaping Operational Data Flow
Event sourcing represents another domain where developers directly influence operational aspects of data management.
State Reconstruction and Read Models
In event-sourced systems, developers design both the events and the logic to apply them:
“When the system needs to be reconstructed, it reads all the events in the log and applies them in order to rebuild the state of the system. This allows the system to be reconstructed to any point in time by replaying events from the log up to that point”7.
This replay mechanism is essential for system operation—whether rebuilding state after cache invalidation or creating new query-optimized views. The ability to create entirely new schemas or read models by updating projection logic and replaying historical events means developers actively participate in operational data transformation and availability.
Schema Evolution Challenges
The immutability of events creates unique challenges for schema evolution:
“Your ride-sharing app adds a currency field to TripEnded events. What Breaks?”8
Since past events cannot be altered, developers must implement strategies to handle multiple versions of events, such as:
- Implementing version numbers within event payloads
- Creating event adapters that transform old event versions during replay
- Using backward/forward compatibility patterns8
This complex operational concern—maintaining compatibility with historical data formats—is managed entirely through developer-created solutions.
Snapshots as Operational Optimization
Snapshots represent an important operational optimization in event-sourced systems:
“Snapshots are a common optimization technique used in event sourcing systems. The idea is to periodically capture the state of an aggregate and store it as a snapshot. Then, when loading the aggregate, the system can start with the latest snapshot and apply only the events that occurred after it, reducing the number of events that need to be replayed”7.
Determining when and how to create snapshots—balancing performance needs against storage costs—becomes a development decision with direct operational impact.
Resharding Strategies: A Blend of Development and Operations
Resharding—the process of splitting or merging shards in stream processing systems—illustrates the blended responsibility model.
The Resharding Challenge
Resharding can dramatically impact application behavior:
“The Resharding Curse: If RIDE_123’s events scatter across shards during replay…”8
While platform services might automate the mechanics of resharding, developers must design applications resilient to these changes. This requires understanding how resharding affects partitioning, ordering, and parallelism.
Workarounds and Strategies
Developers implement various strategies to handle resharding challenges:
- Sequence-Aware Partitioning: Ensuring related events remain together
- Deterministic State Machines: Rebuilding state correctly regardless of event order8
- Idempotency: Providing safety nets against duplicate processing
These strategies represent operational logic embedded directly in application code.
Kafka Workers: The Traditional Operations-Managed Approach
In contrast to the integrated BizDevOps model, some platforms like Kafka maintain a more traditional separation between development and operations. Kafka Workers, for example, is “a client library which unifies records consuming from Kafka and processing them by user-defined WorkerTasks”9.
In this model:
- Operations teams manage the Kafka cluster environment, handling “capacity planning, performance tuning, cluster setup, and continuous monitoring”10
- Developers focus on implementing WorkerTasks that process records
This separation prevents the implementation of advanced features like automatic bisection of problematic batches, as the operational infrastructure remains separate from application logic. The Kafka engineer “designs safe and robust Kafka systems” and “performs physical infrastructure deployment”10, while developers implement business logic within the constraints of that infrastructure.
The BizDevOps Revolution
The convergence of business, development, and operations represents a fundamental shift in how organizations approach software delivery:
“BizDevOps extends [the DevOps] approach to encompass broader business considerations, fostering a holistic approach to software development that aligns technical efforts with business priorities”2.
This integration offers significant benefits:
- Breaking Down Silos: “Breaking down silos and restructuring teams to promote cross-functional collaboration and communication”2
- Faster Delivery: Eliminating handoffs between separate teams accelerates the development lifecycle
- Improved Quality: Developers who understand operational concerns build more resilient systems
However, this shift also presents challenges:
- Steeper Learning Curve: “Ramp up curve for new developers is extremely steep because of the lack of documentation and process”1
- On-Call Responsibilities: “Developers have to carry pagers. On-call rotations, with people on-call required to respond to a sev-1 within 15 minutes”1
- Balancing Priorities: “For some teams the operations load completely dominated, making it very difficult to tackle new features”1
The Fundamental Partitioning Decision
The dev-ops boundary represents one of the most fundamental partitioning decisions in computing history. Like the K-D tree partitioning sequence that determines system complexity, the choice of where to draw the line between development and operations has exponential implications for how systems evolve.
Historically, this partition made sense when:
- Hardware was expensive and required specialized knowledge
- Change was risky and infrequent
- Operations involved physical infrastructure management
But in cloud environments, this traditional partitioning creates more complexity than it solves. Modern cloud platforms abstract away the physical infrastructure concerns that originally justified the dev-ops separation. When developers can provision databases with a single API call and configure auto-scaling with a few lines of code, the traditional operations expertise becomes less relevant than application-specific operational logic.
This mirrors the insight from the K-D tree article: partition by business domains first, technical concerns second. The business domain boundary for modern applications increasingly includes operational behavior, error handling, and infrastructure management—not just business logic.
Service Mesh: When Operational Concerns Invade Development
Service mesh technologies (Istio, Linkerd) represent a particularly complex example of how operational concerns now require developer expertise. While marketed as simplifying microservices communication, service mesh actually forces developers to understand networking, security policies, and operational complexity that traditionally belonged to operations teams.
Developers as Network Engineers
Service mesh implementation requires developers to understand:
Traffic Management Policies:
- Request routing rules that determine how service calls are distributed
- Circuit breaker configuration for handling service failures
- Retry policies and timeouts that affect application behavior
- Load balancing strategies that impact performance characteristics
Security Configurations:
- mTLS certificate management between services
- Authorization policies that control service-to-service access
- Identity and authentication for service communication
- Security policy debugging when requests fail
Observability Integration:
- Distributed tracing configuration to understand request flows
- Metrics collection setup for monitoring service health
- Logging integration across the mesh infrastructure
- Dashboard interpretation for troubleshooting issues
The Configuration Complexity Shift
Rather than simplifying operations, service mesh transfers networking complexity to developers:
# Developers now write networking policies instead of just business logic
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: user-service
spec:
http:
- match:
- headers:
end-user:
exact: premium
route:
- destination:
host: user-service
subset: v2
weight: 100
- route:
- destination:
host: user-service
subset: v1
weight: 100
The Operational Learning Curve:
- Developers must understand Kubernetes networking concepts
- Debug failures that span application code and mesh configuration
- Coordinate deployments with mesh policy changes
- Troubleshoot performance issues involving proxy behavior
Domain Knowledge Fragmentation
Service mesh fragments domain expertise across multiple configuration layers:
Business Logic in Code:
# The actual business functionality
def process_order(order_id):
user = get_user(order_id) # This call goes through the mesh
inventory = check_inventory(order_id) # This too
return create_order(user, inventory) # And this
Communication Logic in Mesh Config:
# The operational behavior of those calls
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: user-service
spec:
trafficPolicy:
circuitBreaker:
consecutiveErrors: 3
interval: 30s
connectionPool:
tcp:
maxConnections: 10
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 2
The Cognitive Split: Developers can no longer reason about their service behavior by looking at code alone. They must understand both business logic and mesh operational policies—expertise traditionally separated between dev and ops teams.
This represents the ultimate blurring of dev-ops boundaries: developers become responsible for networking and operational concerns while operations teams control the platform that developers must configure. Neither team has complete ownership, creating a complex interdependency that requires both groups to understand each other’s domains.
Conclusion
The examples of DynamoDB Streams and event sourcing demonstrate how modern cloud architectures have fundamentally blurred the lines between development and operations. Developers now routinely write code that directly controls infrastructure behavior, error handling, and data lifecycle management—tasks traditionally under the operations purview.
This shift requires a new mindset and skill set from both developers and operations professionals. Developers must understand operational concerns like shard management, data consistency during resharding, and schema evolution strategies. Operations teams must work more closely with development to create platforms that enable this integrated approach.
As organizations continue to adopt cloud-native architectures and event-driven patterns, the distinction between who builds software and who runs it will continue to fade. The future belongs to teams that can effectively blend business understanding, development expertise, and operational excellence into a cohesive approach to delivering software.
The traditional dev-ops partition, like any early partitioning decision, has exponential consequences. Organizations that recognize this boundary as a choice rather than an immutable fact will be better positioned to leverage the full potential of cloud-native architectures.
References
Originally published on LinkedIn by Gary Y.
Related Articles
- The K-D Tree of Software: Why Partition Sequence Determines System Complexity - Explores how early partitioning decisions have exponential impact on system complexity
- Breaking the Stateful Deployment Ceiling: Dimensional Partitioning for DevOps - How the same partitioning philosophy applies to deployment strategies
-
BizDevOps Revolution: Blurring the Lines Between Business, Dev and Ops ↩ ↩2 ↩3
-
Build Scalable Event-Driven Architectures with Amazon DynamoDB and AWS Lambda ↩ ↩2
-
DynamoDB Streams Kinesis Adapter Documentation ↩
-
Resharding & Schema Evolution in Event Sourcing: Surviving the Hidden Challenges ↩ ↩2 ↩3 ↩4
-
Kafka Workers Library Documentation ↩