Engineering 15 Mar 2026 1 min read

Lessons from Building Distributed Systems at TCS

Real-world insights from designing scalable backend systems handling millions of requests daily at Tata Consultancy Services.

By Chirag Singhal

Since joining TCS in June 2025, I've been working on backend systems that handle millions of requests daily. Here are the hard-won lessons.

1. Event-Driven > Request-Response

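Switching inter-service communication from synchronous REST calls to Apache Kafka events reduced our P99 latency by 60%. The caller no longer blocks while a downstream service finishes its work; it publishes an event and moves on.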

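# Before: synchronous. The caller blocks until service_b has finished.
response = service_b.process(data)

# After: event-driven. Publish an event and return without waiting for the consumer.
# (Assumes producer is a Kafka producer configured to serialize dict values, e.g. as JSON.)
producer.send('events', {'type': 'process', 'data': data})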
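
To make the difference concrete without a broker, here is a minimal in-process sketch. It assumes a standard-library queue as a stand-in for the Kafka topic and a single worker thread as the consumer; `publish`, `consumer`, and the uppercase "processing" step are hypothetical names chosen for illustration, not part of our services.

```python
import queue
import threading

events = queue.Queue()  # in-process stand-in for a Kafka topic (assumption)
results = []

def consumer():
    """Process events independently of whoever published them."""
    while True:
        event = events.get()
        if event is None:  # sentinel value: shut the worker down
            break
        results.append(event["data"].upper())  # placeholder for real work

def publish(data):
    """Enqueue an event and return immediately, without waiting for processing."""
    events.put({"type": "process", "data": data})

worker = threading.Thread(target=consumer)
worker.start()
for item in ["a", "b", "c"]:
    publish(item)      # returns right away; the worker catches up on its own
events.put(None)
worker.join()
print(results)         # prints ['A', 'B', 'C']
```

The publisher's latency here depends only on the enqueue, not on how long the consumer takes. The trade-off is that the publisher no longer gets a result back inline.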

2. Redis is Not Just Caching

We use Redis for:

Rate limiting (sliding window counters)
Session management
Real-time leaderboards
Distributed locks (Redlock algorithm)
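
As an illustration of the first item, here is a rough sketch of a sliding window counter. It keeps the counters in memory so it runs on its own; in Redis, the two counters would typically live in per-window keys updated with INCR and expired with EXPIRE. The class name, parameters, and injectable clock are assumptions made for this example, not our production code.

```python
import time

class SlidingWindowCounter:
    """Approximate sliding-window limiter built from two fixed-window counters."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_index = None  # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def allow(self):
        now = self.clock()
        index = int(now // self.window)
        if self.current_index is None:
            self.current_index = index
        elif index != self.current_index:
            # Rolled into a new window; keep the old count only if it is adjacent.
            adjacent = index - self.current_index == 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        # Weight the previous window by how much of it the sliding window still covers.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True

limiter = SlidingWindowCounter(limit=100, window_seconds=60)
print(limiter.allow())  # the first request in an empty window is allowed
```

The weighted estimate smooths out the burst a plain fixed window allows at its boundary, and it needs only two counters per client rather than a timestamp per request.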

3. Circuit Breakers Save Lives

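When downstream services fail, circuit breakers prevent cascading failures by failing fast instead of letting requests pile up behind a dependency that is already struggling: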

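// apiCall is the async function that calls the downstream service
const breaker = new CircuitBreaker(apiCall, {
  timeout: 3000,                 // treat a call as failed if it takes longer than 3 s
  errorThresholdPercentage: 50,  // open the circuit once 50% of calls are failing
  resetTimeout: 30000,           // after 30 s, let a trial call through (half-open)
});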
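
To show what happens behind a configuration like this, here is a simplified sketch of the closed / open / half-open cycle, written in Python to match the first example. It is not how any particular library implements it: it assumes a consecutive-failure threshold instead of an error percentage, leaves out call timeouts, and uses hypothetical names throughout.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, so this one trial call goes through.
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already failing, which gives that dependency room to recover.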

4. Observability is Non-Negotiable

You can't fix what you can't see. We use:

Prometheus + Grafana for metrics
Jaeger for distributed tracing
ELK stack for centralized logging

5. Design for Failure

Every service assumes its dependencies will fail. Graceful degradation > complete outage.
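
One way to picture graceful degradation is a call that falls back to the last known good value when its dependency fails. The sketch below is illustrative only: `get_recommendations`, `fetch`, and the dict-based cache are hypothetical stand-ins, and a real service would catch narrower exceptions and bound how stale a cached value may be.

```python
def get_recommendations(user_id, fetch, cache, default=()):
    """Return (items, degraded): fresh data if possible, else stale or default data."""
    try:
        items = fetch(user_id)
    except Exception:
        # Dependency failed: degrade to the last known good value, if there is one.
        return list(cache.get(user_id, default)), True
    cache[user_id] = items  # remember the last good response for future fallbacks
    return list(items), False

def broken_fetch(user_id):
    raise TimeoutError("recommendation service unavailable")

cache = {}
print(get_recommendations(1, lambda uid: ["a", "b"], cache))  # prints (['a', 'b'], False)
print(get_recommendations(1, broken_fetch, cache))            # prints (['a', 'b'], True)
```

The `degraded` flag lets the caller decide how to present stale data, for example by hiding a "live" badge, instead of failing the whole request.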
