Data Infrastructure

Real-Time Data Pipelines: When Milliseconds Matter

October 22, 2024

6 min read

When Real-Time Actually Means Real-Time

Not Real-Time: Processing data in 5-minute micro-batches and calling it "near real-time."

Actually Real-Time: End-to-end latency from event generation to actionable insight measured in single-digit milliseconds to low seconds.

Use Cases That Demand True Real-Time

Financial Trading Systems:

  • Market data processing: <10ms
  • Trade execution: <1ms
  • Risk calculations: <100ms

Industrial Control Systems:

  • Sensor data aggregation: <50ms
  • Anomaly detection: <100ms
  • Control loop adjustments: <10ms

Fraud Detection:

  • Transaction scoring: <200ms
  • Pattern matching: <500ms
  • Decision enforcement: Immediate

Gaming and Live Experiences:

  • Player state synchronization: <50ms
  • Leaderboard updates: <100ms
  • Event aggregation: <200ms

The Three Hard Problems

Problem 1: State Management at Scale

Challenge: Stream processing is simplest when it is stateless, but real business logic requires state. How do you maintain state for millions of entities without killing throughput?

Example: Calculating rolling averages per sensor across 100,000 IoT devices. Each sensor needs independent state, updated every 100ms.

Approaches:

Embedded Local State (Fast, Limited):

  • RocksDB embedded in the stream processor (memory-speed for hot keys, spilling to local disk)
  • Partitioned by key for parallelism
  • Periodic snapshots for recovery
  • Bounded by the memory and disk of each node

External State Store (Slower, Scalable):

  • Redis for hot state access
  • Cassandra for large state volumes
  • Trade latency for capacity
  • Network hops add milliseconds

Hybrid Approach (Optimal):

  • Hot state in-memory (recent, frequently accessed)
  • Warm state in Redis (less frequent access)
  • Cold state in database (historical, rarely accessed)
  • Tiering based on access patterns
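
A minimal Java sketch of the tiered lookup, assuming a hypothetical TieredStateStore wrapper, a Jedis client for the Redis tier, and the database stubbed behind a loader function. Reads fall through memory, then Redis, then the database, promoting values back up on the way:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.function.Function;
  import redis.clients.jedis.Jedis;

  // Hypothetical tiered state store: hot entries in a bounded in-process LRU
  // map, warm entries in Redis, cold entries behind a database loader.
  public class TieredStateStore {
    private static final int HOT_CAPACITY = 100_000;

    // Access-ordered LinkedHashMap evicts the least recently used entry.
    private final Map<String, String> hot =
      new LinkedHashMap<>(HOT_CAPACITY, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> e) {
          return size() > HOT_CAPACITY;
        }
      };

    private final Jedis warm = new Jedis("localhost", 6379);  // warm tier
    private final Function<String, String> coldLoader;        // e.g. a DB query

    public TieredStateStore(Function<String, String> coldLoader) {
      this.coldLoader = coldLoader;
    }

    public synchronized String get(String key) {
      String value = hot.get(key);              // 1. hot: in-process memory
      if (value != null) return value;

      value = warm.get(key);                    // 2. warm: one network hop
      if (value == null) {
        value = coldLoader.apply(key);          // 3. cold: database lookup
        if (value == null) return null;
        warm.setex(key, 3600, value);           // promote to warm tier
      }
      hot.put(key, value);                      // promote to hot tier
      return value;
    }
  }

The synchronized method keeps the sketch simple; a production store would shard the hot map by key to avoid lock contention.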

Problem 2: Exactly-Once Processing

Challenge: Distributed systems fail. Messages get duplicated or lost. How do you guarantee each event is processed exactly once?

Naive Approach (At-Least-Once):

  • Retry until every message is acknowledged
  • Retries reprocess messages that already succeeded
  • Downstream systems see duplicates

Better Approach (Idempotency):

  • Make processing operations idempotent
  • Duplicate processing has no effect
  • Requires careful design

Best Approach (Exactly-Once Semantics):

  • Transactional state updates with offset commits
  • Kafka transactions for multi-stage processing
  • Higher latency overhead (2-5ms per transaction)

Trade-off: Exactly-once adds latency. Is correctness worth milliseconds?
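
As a sketch with the standard Kafka Java client (topic names, the group id, and transform() are placeholders), the key move is committing input offsets inside the same transaction as the output writes, so a crash-and-retry neither loses nor duplicates events:

  import java.time.Duration;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.*;
  import org.apache.kafka.clients.producer.*;
  import org.apache.kafka.common.TopicPartition;

  // Exactly-once consume-transform-produce loop using Kafka transactions.
  public class ExactlyOnceLoop {
    public static void main(String[] args) {
      Properties pp = new Properties();
      pp.put("bootstrap.servers", "localhost:9092");
      pp.put("transactional.id", "pipeline-stage-1");  // enables transactions
      pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
      producer.initTransactions();

      Properties cp = new Properties();
      cp.put("bootstrap.servers", "localhost:9092");
      cp.put("group.id", "pipeline-stage-1-group");
      cp.put("enable.auto.commit", "false");       // offsets commit transactionally
      cp.put("isolation.level", "read_committed"); // hide aborted upstream writes
      cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
      consumer.subscribe(List.of("events-in"));

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        if (records.isEmpty()) continue;
        producer.beginTransaction();
        try {
          Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
          for (ConsumerRecord<String, String> r : records) {
            producer.send(new ProducerRecord<>("events-out", r.key(), transform(r.value())));
            offsets.put(new TopicPartition(r.topic(), r.partition()),
                        new OffsetAndMetadata(r.offset() + 1));
          }
          // Input offsets and output writes commit or abort as one unit.
          producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
          producer.commitTransaction();
        } catch (Exception e) {
          producer.abortTransaction();
          // A production loop would also seek the consumer back to the
          // last committed offsets before retrying.
        }
      }
    }

    private static String transform(String value) { return value.toUpperCase(); }
  }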

Problem 3: Backpressure and Flow Control

Challenge: Upstream produces faster than downstream can consume. How do you prevent system collapse while maintaining low latency?

What Happens Without Backpressure:

  • Queues fill up
  • Memory exhausted
  • System crashes
  • Data loss

Backpressure Strategies:

Buffering (Simple, Limited):

  • In-memory buffers absorb bursts
  • Works for temporary spikes
  • Fails on sustained overload

Rate Limiting (Protective, Blunt):

  • Throttle upstream to sustainable rate
  • Prevents overload
  • Increases latency for all messages

Dynamic Scaling (Flexible, Complex):

  • Add processing capacity on demand
  • Kubernetes pod autoscaling
  • Cloud function concurrency increases
  • Takes 30-60 seconds to scale

Load Shedding (Aggressive, Necessary):

  • Drop lowest-priority messages
  • Maintain SLA for critical data
  • Requires priority classification
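
A sketch of priority-aware shedding, assuming a three-level priority classification: a bounded queue absorbs bursts, and past a high-water mark only critical messages are admitted:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  // Priority-aware load shedder: low-priority messages are dropped once
  // queue depth crosses the high-water mark, protecting the latency SLA.
  public class LoadShedder<T> {
    public enum Priority { CRITICAL, NORMAL, LOW }

    private final BlockingQueue<T> queue;
    private final int highWaterMark;

    public LoadShedder(int capacity, int highWaterMark) {
      this.queue = new ArrayBlockingQueue<>(capacity);
      this.highWaterMark = highWaterMark;
    }

    // Returns false when the message is shed; callers should count drops.
    public boolean offer(T message, Priority priority) {
      if (queue.size() >= highWaterMark && priority != Priority.CRITICAL) {
        return false;                    // shed under sustained overload
      }
      return queue.offer(message);       // may still reject when truly full
    }

    public T take() throws InterruptedException {
      return queue.take();               // consumer side drains the queue
    }
  }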

Architecture Patterns That Work

Pattern 1: Lambda Architecture (Batch + Stream)

Structure:

  • Batch layer: Complete, accurate, slow
  • Speed layer: Approximate, fast
  • Serving layer: Merge results

When to Use:

  • Correctness matters more than speed
  • Complex aggregations required
  • Historical reprocessing needed

Trade-offs:

  • Maintain two processing paths
  • Eventual consistency by design
  • Higher operational complexity

Pattern 2: Kappa Architecture (Stream-Only)

Structure:

  • Single stream processing path
  • Replay log for reprocessing
  • No separate batch layer

When to Use:

  • Low latency critical
  • Simpler operational model preferred
  • Stream processing handles all cases

Trade-offs:

  • Replay entire log for corrections
  • Less suitable for complex batch operations
  • Requires mature stream processing

Pattern 3: Staged Event-Driven Architecture

Structure:

  • Multiple processing stages
  • Each stage independently scalable
  • Event bus between stages

Stages:

  1. Ingestion: Validate, normalize, partition
  2. Enrichment: Add context, join with reference data
  3. Processing: Business logic, aggregation
  4. Actuation: Write to outputs, trigger actions
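
One way to picture the wiring, with hypothetical topic names: each stage is a consume-transform-produce loop (like the exactly-once loop sketched earlier) that reads one topic and writes the next, so every stage scales and deploys on its own:

  import java.util.function.UnaryOperator;

  // Hypothetical wiring for the four stages: each entry names the input
  // topic, the output topic, and the per-event function the stage applies.
  public final class PipelineStages {
    record Stage(String inTopic, String outTopic, UnaryOperator<String> fn) {}

    static final Stage[] STAGES = {
      new Stage("raw-events",       "validated-events", PipelineStages::ingest),
      new Stage("validated-events", "enriched-events",  PipelineStages::enrich),
      new Stage("enriched-events",  "scored-events",    PipelineStages::process),
      new Stage("scored-events",    "actions",          PipelineStages::actuate),
    };

    static String ingest(String e)  { return e.trim(); }         // validate, normalize
    static String enrich(String e)  { return e + "|context"; }   // join reference data
    static String process(String e) { return e + "|score"; }     // business logic
    static String actuate(String e) { return e; }                // fan out to sinks
  }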

When to Use:

  • Different stages have different scaling needs
  • Multiple output destinations
  • Complex processing pipelines

Trade-offs:

  • More network hops = higher latency
  • Debugging spans multiple services
  • Eventual consistency between stages

Technology Selection by Latency Requirement

<10ms End-to-End Latency

Constraints:

  • Single datacenter deployment
  • In-memory processing only
  • Minimize serialization overhead
  • Direct memory communication

Technology Stack:

  • Custom C++/Rust applications
  • Shared memory IPC
  • DPDK for network I/O
  • Avoid general-purpose frameworks

10-100ms End-to-End Latency

Constraints:

  • Regional deployment acceptable
  • Some external state lookups viable
  • Optimized serialization important

Technology Stack:

  • Apache Flink for stream processing
  • Redis for state management
  • Kafka for event streaming
  • gRPC for inter-service communication
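
For this tier, a minimal Flink (1.x DataStream API) job sketch that revisits the IoT example from earlier: per-sensor rolling averages over 100ms tumbling windows. The socket source and the "sensorId,value" line format are stand-ins for a real Kafka source:

  import org.apache.flink.api.common.functions.AggregateFunction;
  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.tuple.Tuple2;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
  import org.apache.flink.streaming.api.windowing.time.Time;

  // Per-sensor rolling average over 100ms windows.
  public class SensorAverages {
    public static void main(String[] args) throws Exception {
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

      env.socketTextStream("localhost", 9999)
        .map(new MapFunction<String, Tuple2<String, Double>>() {
          @Override
          public Tuple2<String, Double> map(String line) {
            String[] parts = line.split(",");
            return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
          }
        })
        .keyBy(reading -> reading.f0)  // independent state per sensor
        .window(TumblingProcessingTimeWindows.of(Time.milliseconds(100)))
        .aggregate(new AverageAggregate())
        .print();

      env.execute("sensor-rolling-averages");
    }

    // Incremental average: the accumulator holds (sum, count), so per-key
    // state stays tiny no matter how many events a window sees.
    static class AverageAggregate
        implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {
      public Tuple2<Double, Long> createAccumulator() { return Tuple2.of(0.0, 0L); }
      public Tuple2<Double, Long> add(Tuple2<String, Double> r, Tuple2<Double, Long> acc) {
        return Tuple2.of(acc.f0 + r.f1, acc.f1 + 1);
      }
      public Double getResult(Tuple2<Double, Long> acc) { return acc.f0 / acc.f1; }
      public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
      }
    }
  }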

100ms-1s End-to-End Latency

Constraints:

  • Multi-region deployment possible
  • Database lookups acceptable
  • Standard serialization formats work

Technology Stack:

  • Apache Kafka Streams
  • PostgreSQL with connection pooling
  • REST APIs for external calls
  • Standard JSON serialization

1s+ End-to-End Latency

Constraints:

  • Latency not critical
  • Correctness and cost matter more

Technology Stack:

  • Apache Spark Structured Streaming
  • Any database system
  • Batch-oriented optimizations
  • Cost-effective infrastructure

Monitoring Real-Time Systems

Metrics That Matter

Latency Percentiles:

  • P50: Typical performance
  • P95: User experience boundary
  • P99: Outlier detection
  • P99.9: Tail latency issues

Do Not Average Latency:

  • Average hides problems
  • Use histograms and percentiles
  • Track full distribution
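
A small sketch using the open-source HdrHistogram library, which records the full distribution cheaply enough to sit in the hot path:

  import org.HdrHistogram.Histogram;

  // Records per-message latency and reports the percentiles that matter.
  public class LatencyTracker {
    // Track 1 microsecond to 60 seconds at 3 significant digits.
    private final Histogram histogram = new Histogram(60_000_000L, 3);

    public void record(long latencyMicros) {
      histogram.recordValue(latencyMicros);
    }

    public String report() {
      return String.format("p50=%dus p95=%dus p99=%dus p99.9=%dus",
        histogram.getValueAtPercentile(50.0),
        histogram.getValueAtPercentile(95.0),
        histogram.getValueAtPercentile(99.0),
        histogram.getValueAtPercentile(99.9));
    }
  }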

Throughput Metrics:

  • Messages per second (current)
  • Processing capacity (maximum)
  • Backlog size (queue depth)
  • Consumer lag (time behind)

Error Rates:

  • Processing failures
  • Deserialization errors
  • External dependency timeouts
  • Dead letter queue growth

Alerting Strategies

Latency Degradation:

  • Alert on P99 > threshold
  • Trend analysis for gradual degradation
  • Correlate with deployment events

Backlog Growth:

  • Consumer lag > 60 seconds
  • Queue depth > capacity threshold
  • Sustained growth over 5 minutes

Error Rate Spikes:

  • Error rate > 1% of messages processed
  • Sustained errors for 2+ minutes
  • Specific error type correlation

Common Anti-Patterns

Anti-Pattern 1: Synchronous External Calls

What Happens:

  • Stream processor calls REST API per message
  • Network latency dominates
  • Throughput collapses

Fix:

  • Batch lookups where possible
  • Cache reference data locally
  • Use async I/O
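
A sketch of the fix using a Caffeine async cache in front of a hypothetical reference service (the URL is a placeholder): most messages resolve in-process, and misses fetch over the network without blocking the processing thread:

  import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.time.Duration;
  import java.util.concurrent.CompletableFuture;

  // Cached, asynchronous reference-data lookups for stream enrichment.
  public class ReferenceDataClient {
    private final HttpClient http = HttpClient.newHttpClient();

    private final AsyncLoadingCache<String, String> cache = Caffeine.newBuilder()
      .maximumSize(100_000)                        // bound memory use
      .expireAfterWrite(Duration.ofMinutes(5))     // tolerate slightly stale data
      .buildAsync((key, executor) -> fetch(key));  // only misses hit the network

    public CompletableFuture<String> lookup(String key) {
      return cache.get(key);
    }

    private CompletableFuture<String> fetch(String key) {
      HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://reference-service.internal/entity/" + key)).build();
      return http.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                 .thenApply(HttpResponse::body);
    }
  }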

Anti-Pattern 2: Unbounded State Growth

What Happens:

  • State grows indefinitely
  • Memory exhaustion
  • Processing stops

Fix:

  • Time-based state eviction
  • Size-based limits with LRU eviction
  • Separate hot/cold state tiers
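
A sketch of time-based eviction over a plain ConcurrentHashMap (the size-bounded LRU variant appears in the state-tiering sketch earlier): entries untouched past the TTL are removed by a periodic sweep:

  import java.time.Duration;
  import java.util.Iterator;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Per-key state with a time-to-live, so state cannot grow without bound.
  public class ExpiringStateStore<V> {
    private record Entry<T>(T value, long lastAccessMillis) {}

    private final ConcurrentHashMap<String, Entry<V>> state = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public ExpiringStateStore(Duration ttl) {
      this.ttlMillis = ttl.toMillis();
    }

    public void put(String key, V value) {
      state.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    public V get(String key) {
      Entry<V> e = state.computeIfPresent(key,  // reading refreshes the TTL
        (k, old) -> new Entry<>(old.value(), System.currentTimeMillis()));
      return e == null ? null : e.value();
    }

    // Call from a scheduled task, e.g. every few seconds.
    public void evictExpired() {
      long cutoff = System.currentTimeMillis() - ttlMillis;
      Iterator<Map.Entry<String, Entry<V>>> it = state.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue().lastAccessMillis() < cutoff) {
          it.remove();
        }
      }
    }
  }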

Anti-Pattern 3: Single Point of Failure

What Happens:

  • Critical component fails
  • Entire pipeline stops
  • SLA violated

Fix:

  • Replicate stateful components
  • Multi-AZ deployment
  • Circuit breakers for external dependencies
  • Graceful degradation modes
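
A minimal circuit breaker sketch: after a threshold of consecutive failures, calls skip the dependency for a cooldown window and take a fallback instead, so a dead dependency degrades the pipeline rather than stalling it:

  import java.util.concurrent.atomic.AtomicInteger;
  import java.util.concurrent.atomic.AtomicLong;
  import java.util.function.Supplier;

  // Opens after N consecutive failures; retries once the cooldown elapses.
  public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long cooldownMillis;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAtMillis = new AtomicLong();

    public CircuitBreaker(int failureThreshold, long cooldownMillis) {
      this.failureThreshold = failureThreshold;
      this.cooldownMillis = cooldownMillis;
    }

    public T call(Supplier<T> dependency, Supplier<T> fallback) {
      boolean open = consecutiveFailures.get() >= failureThreshold
        && System.currentTimeMillis() - openedAtMillis.get() < cooldownMillis;
      if (open) {
        return fallback.get();             // degrade gracefully, skip the call
      }
      try {
        T result = dependency.get();
        consecutiveFailures.set(0);        // success closes the circuit
        return result;
      } catch (RuntimeException e) {
        if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
          openedAtMillis.set(System.currentTimeMillis());  // (re)open
        }
        return fallback.get();
      }
    }
  }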

Anti-Pattern 4: Ignoring Serialization Cost

What Happens:

  • Complex serialization format chosen
  • CPU spent on encode/decode
  • Throughput limited by serialization

Fix:

  • Profile serialization overhead
  • Use efficient formats (Protobuf, Avro)
  • Consider schema evolution requirements
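
A quick way to profile step one, assuming Jackson 2.12+ (which serializes records natively): time the encode alone and translate it into messages per second per core:

  import com.fasterxml.jackson.databind.ObjectMapper;

  // Micro-benchmark sketch: how much CPU does JSON encoding alone cost?
  public class SerializationProfile {
    record Event(String sensorId, long timestamp, double value) {}

    public static void main(String[] args) throws Exception {
      ObjectMapper mapper = new ObjectMapper();
      Event event = new Event("sensor-42", System.currentTimeMillis(), 21.7);
      int n = 1_000_000;

      // Warm up the JIT before timing.
      for (int i = 0; i < 100_000; i++) mapper.writeValueAsBytes(event);

      long start = System.nanoTime();
      for (int i = 0; i < n; i++) mapper.writeValueAsBytes(event);
      double perMessageMicros = (System.nanoTime() - start) / 1_000.0 / n;

      System.out.printf("JSON encode: %.2f us/message -> %.0f msg/s/core%n",
        perMessageMicros, 1_000_000 / perMessageMicros);
    }
  }

Running the same loop against your Protobuf or Avro encoder makes the comparison concrete.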

Key Takeaways

  • Real-time means different things—define your latency SLA precisely
  • State management is the hardest problem at scale
  • Exactly-once processing adds latency—understand the trade-off
  • Monitor latency percentiles, not averages
  • Different latency tiers require different technology choices
  • Backpressure is not optional for production systems

Build for the latency you need, not the latency that sounds impressive. A reliable 100ms system beats an unreliable 10ms system every time.

Related services: StreamMesh, FluxStream, InfraPulse