Data Infrastructure
Real-Time Data Pipelines: When Milliseconds Matter
October 22, 2024
6 min read
When Real-Time Actually Means Real-Time
Not Real-Time: Processing data in 5-minute micro-batches and calling it "near real-time."
Actually Real-Time: End-to-end latency from event generation to actionable insight measured in single-digit milliseconds to low seconds.
Use Cases That Demand True Real-Time
Financial Trading Systems:
- Market data processing: <10ms
- Trade execution: <1ms
- Risk calculations: <100ms
Industrial Control Systems:
- Sensor data aggregation: <50ms
- Anomaly detection: <100ms
- Control loop adjustments: <10ms
Fraud Detection:
- Transaction scoring: <200ms
- Pattern matching: <500ms
- Decision enforcement: Immediate
Gaming and Live Experiences:
- Player state synchronization: <50ms
- Leaderboard updates: <100ms
- Event aggregation: <200ms
The Three Hard Problems
Problem 1: State Management at Scale
Challenge: Stream processors handle one event at a time, but business logic needs context that spans events. How do you maintain state for millions of entities without killing throughput?
Example: Calculating rolling averages per sensor across 100,000 IoT devices. Each sensor needs independent state, updated every 100ms.
Approaches:
In-Memory State (Fast, Limited):
- RocksDB embedded in stream processor
- Partitioned by key for parallelism
- Periodic snapshots for recovery
- Limited by available memory
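To make the in-memory approach concrete, here is a minimal sketch of the rolling-average example using Apache Flink keyed state. The SensorReading class, the inline source, and the smoothing factor are illustrative placeholders, not a production job.

```java
// Minimal sketch: per-sensor rolling average kept in Flink keyed state.
// SensorReading and the example source are hypothetical; swap in your real source.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RollingAverageJob {

    public static class SensorReading {
        public String sensorId;
        public double value;
        public SensorReading() {}
        public SensorReading(String sensorId, double value) {
            this.sensorId = sensorId;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new SensorReading("sensor-1", 21.5),
                new SensorReading("sensor-1", 22.1),
                new SensorReading("sensor-2", 18.3))
            // Partition by key so each sensor's state lives on exactly one task.
            .keyBy(reading -> reading.sensorId)
            .process(new RollingAverage())
            .print();

        env.execute("rolling-average-per-sensor");
    }

    // Keeps an exponentially weighted moving average per sensor in keyed state.
    // State is included in Flink checkpoints, so recovery restores it.
    public static class RollingAverage
            extends KeyedProcessFunction<String, SensorReading, String> {

        private static final double ALPHA = 0.1; // smoothing factor, tune per use case
        private transient ValueState<Double> average;

        @Override
        public void open(Configuration parameters) {
            average = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("ewma", Double.class));
        }

        @Override
        public void processElement(SensorReading reading, Context ctx, Collector<String> out)
                throws Exception {
            Double current = average.value();
            double updated = (current == null)
                    ? reading.value
                    : ALPHA * reading.value + (1 - ALPHA) * current;
            average.update(updated);
            out.collect(reading.sensorId + " -> " + updated);
        }
    }
}
```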
External State Store (Slower, Scalable):
- Redis for hot state access
- Cassandra for large state volumes
- Trade latency for capacity
- Network hops add milliseconds
Hybrid Approach (Optimal):
- Hot state in-memory (recent, frequently accessed)
- Warm state in Redis (less frequent access)
- Cold state in database (historical, rarely accessed)
- Tiering based on access patterns
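One way to wire up the hybrid tiering is sketched below, assuming Redis (via the Jedis client) as the warm tier and a stubbed loadFromDatabase call standing in for the cold tier. Class and method names are illustrative, not a specific product's API.

```java
// Minimal sketch of a tiered state read path: in-memory hot tier, Redis warm tier,
// and a hypothetical database stub for cold state.
import redis.clients.jedis.Jedis;

import java.util.concurrent.ConcurrentHashMap;

public class TieredStateStore {

    private final ConcurrentHashMap<String, String> hotTier = new ConcurrentHashMap<>();
    private final Jedis warmTier = new Jedis("localhost", 6379);
    private static final int WARM_TTL_SECONDS = 3600;

    public String get(String key) {
        // 1. Hot tier: local memory, no network hop.
        String value = hotTier.get(key);
        if (value != null) {
            return value;
        }

        // 2. Warm tier: Redis, typically sub-millisecond plus one network hop.
        value = warmTier.get(key);
        if (value != null) {
            hotTier.put(key, value); // promote on access
            return value;
        }

        // 3. Cold tier: database lookup (stubbed here), milliseconds or more.
        value = loadFromDatabase(key);
        if (value != null) {
            warmTier.setex(key, WARM_TTL_SECONDS, value); // backfill warm tier
            hotTier.put(key, value);
        }
        return value;
    }

    public void put(String key, String value) {
        hotTier.put(key, value);
        warmTier.setex(key, WARM_TTL_SECONDS, value);
        // Cold-tier writes would typically happen asynchronously.
    }

    // Placeholder for the cold tier; wire up your actual database client here.
    private String loadFromDatabase(String key) {
        return null;
    }
}
```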
Problem 2: Exactly-Once Processing
Challenge: Distributed systems fail. Messages get duplicated or lost. How do you guarantee each event is processed exactly once?
Naive Approach (At-Least-Once):
- Process every message
- Duplicate processing on retry
- Downstream systems see duplicates
Better Approach (Idempotency):
- Make processing operations idempotent
- Duplicate processing has no effect
- Requires careful design
Best Approach (Exactly-Once Semantics):
- Transactional state updates with offset commits
- Kafka transactions for multi-stage processing
- Higher latency overhead (2-5ms per transaction)
Trade-off: Exactly-once adds latency. Is correctness worth milliseconds?
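As a rough sketch of what exactly-once looks like with Kafka transactions, the loop below consumes, transforms, and produces within a single transaction and commits the consumer offsets as part of that same transaction. Topic names, the group id, and the transform are placeholders.

```java
// Minimal sketch of exactly-once consume-transform-produce with Kafka transactions.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "scoring-pipeline");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Only read messages from committed upstream transactions.
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional id enables fencing of zombie producers.
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "scoring-pipeline-tx-1");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("transactions-in"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        String scored = record.value().toUpperCase(); // placeholder transform
                        producer.send(new ProducerRecord<>("transactions-scored", record.key(), scored));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Commit consumed offsets inside the same transaction as the outputs.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    // Production code would also handle producer fencing; kept simple here.
                    producer.abortTransaction();
                }
            }
        }
    }
}
```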
Problem 3: Backpressure and Flow Control
Challenge: Upstream produces faster than downstream can consume. How do you prevent system collapse while maintaining low latency?
What Happens Without Backpressure:
- Queues fill up
- Memory exhausted
- System crashes
- Data loss
Backpressure Strategies:
Buffering (Simple, Limited):
- In-memory buffers absorb bursts
- Works for temporary spikes
- Fails on sustained overload
Rate Limiting (Protective, Blunt):
- Throttle upstream to sustainable rate
- Prevents overload
- Increases latency for all messages
Dynamic Scaling (Flexible, Complex):
- Add processing capacity on demand
- Kubernetes pod autoscaling
- Cloud function concurrency increases
- Takes 30-60 seconds to scale
Load Shedding (Aggressive, Necessary):
- Drop lowest-priority messages
- Maintain SLA for critical data
- Requires priority classification
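A minimal sketch of buffering plus load shedding might look like the following: critical messages apply backpressure to the producer, while low-priority messages are dropped once a bounded buffer stays full. The Message shape, buffer size, and timeout are illustrative.

```java
// Minimal sketch: a bounded buffer that sheds low-priority messages under sustained
// load instead of growing without limit.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SheddingBuffer {

    public static class Message {
        final String payload;
        final boolean critical; // SLA-bearing traffic is never shed
        Message(String payload, boolean critical) {
            this.payload = payload;
            this.critical = critical;
        }
    }

    private final BlockingQueue<Message> buffer = new ArrayBlockingQueue<>(10_000);
    private final AtomicLong shedCount = new AtomicLong();

    // Called by the upstream producer for each incoming message.
    public boolean submit(Message msg) throws InterruptedException {
        if (msg.critical) {
            // Critical messages block the producer: explicit backpressure upstream.
            buffer.put(msg);
            return true;
        }
        // Low-priority messages wait briefly, then are dropped (load shedding).
        boolean accepted = buffer.offer(msg, 5, TimeUnit.MILLISECONDS);
        if (!accepted) {
            shedCount.incrementAndGet(); // surface this as a metric in production
        }
        return accepted;
    }

    // Called by the downstream consumer loop.
    public Message take() throws InterruptedException {
        return buffer.take();
    }

    public long shedSoFar() {
        return shedCount.get();
    }
}
```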
Architecture Patterns That Work
Pattern 1: Lambda Architecture (Batch + Stream)
Structure:
- Batch layer: Complete, accurate, slow
- Speed layer: Approximate, fast
- Serving layer: Merge results
When to Use:
- Correctness matters more than speed
- Complex aggregations required
- Historical reprocessing needed
Trade-offs:
- Maintain two processing paths
- Eventual consistency by design
- Higher operational complexity
Pattern 2: Kappa Architecture (Stream-Only)
Structure:
- Single stream processing path
- Replay log for reprocessing
- No separate batch layer
When to Use:
- Low latency critical
- Simpler operational model preferred
- Stream processing handles all cases
Trade-offs:
- Replay entire log for corrections
- Less suitable for complex batch operations
- Requires mature stream processing
Pattern 3: Staged Event-Driven Architecture
Structure:
- Multiple processing stages
- Each stage independently scalable
- Event bus between stages
Stages:
- Ingestion: Validate, normalize, partition
- Enrichment: Add context, join with reference data
- Processing: Business logic, aggregation
- Actuation: Write to outputs, trigger actions
When to Use:
- Different stages have different scaling needs
- Multiple output destinations
- Complex processing pipelines
Trade-offs:
- More network hops = higher latency
- Debugging spans multiple services
- Eventual consistency between stages
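The sketch below shows the staged idea in miniature, with in-process queues standing in for the event bus between stages. In a real deployment each stage would be a separate service and the queues would be topics; the stages and pool sizes here are illustrative.

```java
// Minimal in-process sketch of staged event-driven processing: each stage has its
// own queue and thread pool, so stages can be sized and scaled independently.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class StagedPipeline {

    public static void main(String[] args) {
        BlockingQueue<String> ingested = new LinkedBlockingQueue<>(10_000);
        BlockingQueue<String> enriched = new LinkedBlockingQueue<>(10_000);

        // Separate pools per stage reflect that stages scale independently.
        ExecutorService enrichmentStage = Executors.newFixedThreadPool(4);
        ExecutorService processingStage = Executors.newFixedThreadPool(8);

        // Enrichment: consume validated events, add context, forward to the next stage.
        for (int i = 0; i < 4; i++) {
            enrichmentStage.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String event = ingested.take();
                        enriched.put(event + "|enriched"); // placeholder enrichment
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Processing: apply business logic and hand results to actuation (stubbed).
        for (int i = 0; i < 8; i++) {
            processingStage.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String event = enriched.take();
                        System.out.println("actuate: " + event); // placeholder actuation
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Ingestion stage (stubbed): validate/normalize and enqueue.
        ingested.offer("event-1");
    }
}
```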
Technology Selection by Latency Requirement
<10ms End-to-End Latency
Constraints:
- Single datacenter deployment
- In-memory processing only
- Minimize serialization overhead
- Direct memory communication
Technology Stack:
- Custom C++/Rust applications
- Shared memory IPC
- DPDK for network I/O
- Avoid general-purpose frameworks
10-100ms End-to-End Latency
Constraints:
- Regional deployment acceptable
- Some external state lookups viable
- Optimized serialization important
Technology Stack:
- Apache Flink for stream processing
- Redis for state management
- Kafka for event streaming
- gRPC for inter-service communication
100ms-1s End-to-End Latency
Constraints:
- Multi-region deployment possible
- Database lookups acceptable
- Standard serialization formats work
Technology Stack:
- Apache Kafka Streams
- PostgreSQL with connection pooling
- REST APIs for external calls
- Standard JSON serialization
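At this tier a stream application can stay simple. Below is a minimal Kafka Streams topology sketch to illustrate the shape of such a job; topic names and the transformation are placeholders.

```java
// Minimal sketch of a Kafka Streams topology: consume, filter, transform, produce.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SimpleStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders-raw");
        orders
            .filter((key, value) -> value != null && !value.isEmpty())
            .mapValues(value -> value.trim()) // placeholder transformation
            .to("orders-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```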
1s+ End-to-End Latency
Constraints:
- Latency not critical
- Correctness and cost matter more
Technology Stack:
- Apache Spark Structured Streaming
- Any database system
- Batch-oriented optimizations
- Cost-effective infrastructure
Monitoring Real-Time Systems
Metrics That Matter
Latency Percentiles:
- P50: Typical performance
- P95: User experience boundary
- P99: Outlier detection
- P99.9: Tail latency issues
Do Not Average Latency:
- Average hides problems
- Use histograms and percentiles
- Track full distribution
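One way to track the full distribution is an HdrHistogram per pipeline stage. The sketch below records latencies in microseconds and reports percentiles rather than an average; the bounds and simulated samples are illustrative.

```java
// Minimal sketch: record per-message latency in an HdrHistogram and report
// percentiles instead of an average. Values are in microseconds.
import org.HdrHistogram.Histogram;

public class LatencyTracker {

    // Track values from 1us up to 60s with 3 significant digits of precision.
    private final Histogram histogram = new Histogram(60_000_000L, 3);

    public void record(long latencyMicros) {
        histogram.recordValue(latencyMicros);
    }

    public String report() {
        return String.format(
            "p50=%dus p95=%dus p99=%dus p99.9=%dus max=%dus",
            histogram.getValueAtPercentile(50.0),
            histogram.getValueAtPercentile(95.0),
            histogram.getValueAtPercentile(99.0),
            histogram.getValueAtPercentile(99.9),
            histogram.getMaxValue());
    }

    public static void main(String[] args) {
        LatencyTracker tracker = new LatencyTracker();
        // Simulated latencies: mostly fast, plus a few slow outliers an average would hide.
        for (int i = 0; i < 10_000; i++) tracker.record(800 + (i % 50));
        for (int i = 0; i < 10; i++) tracker.record(250_000);
        System.out.println(tracker.report());
    }
}
```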
Throughput Metrics:
- Messages per second (current)
- Processing capacity (maximum)
- Backlog size (queue depth)
- Consumer lag (time behind)
Error Rates:
- Processing failures
- Deserialization errors
- External dependency timeouts
- Dead letter queue growth
Alerting Strategies
Latency Degradation:
- Alert on P99 > threshold
- Trend analysis for gradual degradation
- Correlate with deployment events
Backlog Growth:
- Consumer lag > 60 seconds
- Queue depth > capacity threshold
- Sustained growth over 5 minutes
Error Rate Spikes:
- Error rate > 1% of throughput
- Sustained errors for 2+ minutes
- Specific error type correlation
Common Anti-Patterns
Anti-Pattern 1: Synchronous External Calls
What Happens:
- Stream processor calls REST API per message
- Network latency dominates
- Throughput collapses
Fix:
- Batch lookups where possible
- Cache reference data locally
- Use async I/O
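A common version of the fix is to refresh reference data in bulk on a background schedule and serve per-message lookups from local memory, as in the sketch below. fetchAllReferenceData is a placeholder for your actual bulk client call.

```java
// Minimal sketch of the fix: refresh reference data in the background and serve
// per-message lookups from local memory, instead of one REST call per message.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReferenceDataCache {

    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    public ReferenceDataCache() {
        // One bulk fetch every 30 seconds replaces thousands of per-message calls.
        refresher.scheduleAtFixedRate(this::refresh, 0, 30, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            Map<String, String> latest = fetchAllReferenceData();
            cache.putAll(latest);
        } catch (Exception e) {
            // Keep serving the previous snapshot; stale data beats a stalled pipeline.
        }
    }

    // Hot path: pure in-memory lookup, no network hop per message.
    public String lookup(String key) {
        return cache.get(key);
    }

    // Placeholder for the batched/bulk call to the reference-data service.
    private Map<String, String> fetchAllReferenceData() {
        return Map.of();
    }
}
```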
Anti-Pattern 2: Unbounded State Growth
What Happens:
- State grows indefinitely
- Memory exhaustion
- Processing stops
Fix:
- Time-based state eviction
- Size-based limits with LRU eviction
- Separate hot/cold state tiers
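Here is a minimal sketch of size-based bounding with LRU eviction, using a plain LinkedHashMap in access order. The entry cap is illustrative, and stream-processing frameworks offer equivalent state TTL settings.

```java
// Minimal sketch: bound per-key state with an LRU cap so it cannot grow without limit.
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedKeyedState<K, V> {

    private static final int MAX_ENTRIES = 1_000_000;

    // Access-ordered LinkedHashMap evicts the least recently used entry past the cap.
    private final Map<K, V> state = Collections.synchronizedMap(
        new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > MAX_ENTRIES;
            }
        });

    public V get(K key) {
        return state.get(key);
    }

    public void put(K key, V value) {
        state.put(key, value);
    }

    public int size() {
        return state.size();
    }
}
```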
Anti-Pattern 3: Single Point of Failure
What Happens:
- Critical component fails
- Entire pipeline stops
- SLA violated
Fix:
- Replicate stateful components
- Multi-AZ deployment
- Circuit breakers for external dependencies
- Graceful degradation modes
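For external dependencies, even a small circuit breaker goes a long way. The sketch below fails fast to a fallback after repeated failures and retries after a cooldown window; the thresholds are illustrative, and libraries such as Resilience4j provide production-grade versions.

```java
// Minimal sketch of a circuit breaker: after a burst of failures, calls are rejected
// fast for a cooldown window instead of stalling the pipeline.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class CircuitBreaker {

    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_MILLIS = 10_000;

    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAt = new AtomicLong(0);

    public <T> T call(Supplier<T> dependency, T fallback) {
        long opened = openedAt.get();
        if (opened != 0 && System.currentTimeMillis() - opened < OPEN_MILLIS) {
            // Circuit is open: fail fast and degrade gracefully.
            return fallback;
        }
        try {
            T result = dependency.get();
            consecutiveFailures.set(0);
            openedAt.set(0);
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt.set(System.currentTimeMillis());
            }
            return fallback;
        }
    }
}
```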
Anti-Pattern 4: Ignoring Serialization Cost
What Happens:
- Complex serialization format chosen
- CPU spent on encode/decode
- Throughput limited by serialization
Fix:
- Profile serialization overhead
- Use efficient formats (Protobuf, Avro)
- Consider schema evolution requirements
Key Takeaways
- Real-time means different things—define your latency SLA precisely
- State management is the hardest problem at scale
- Exactly-once processing adds latency—understand the trade-off
- Monitor latency percentiles, not averages
- Different latency tiers require different technology choices
- Backpressure is not optional for production systems
Build for the latency you need, not the latency that sounds impressive. A reliable 100ms system beats an unreliable 10ms system every time.
Related services: StreamMesh, FluxStream, InfraPulse