ShareChat hit 1B features/sec through caching and data modeling, without scaling its database

Indian social platform ShareChat scaled its ML feature store 1000x by optimizing architecture rather than adding infrastructure. The approach cut database load from 2B rows/sec to 18.4M through protocol buffers, service sharding, and aggressive caching - avoiding the instinct to throw more hardware at the problem.

ShareChat's engineering team achieved a 1000x increase in feature store throughput without expanding its ScyllaDB cluster, a case study in optimization over scaling.

The Indian social platform, serving 400M+ monthly users across ShareChat, Moj, and MX TakaTak, powers its recommendation engine with over 200 production ML models. The initial implementation hit 2 billion database rows/sec due to poor data tiling (70 tiles per feature), bogging the system down.
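
To see where that load comes from: with one row per (entity, feature, time-tile), every feature read fans out into roughly 70 row reads. Below is a minimal Go sketch of that naive read path, under the assumption of a per-tile row layout; tileKey, readRow, and the key format are hypothetical stand-ins, not ShareChat's actual schema or client code.

```go
package main

import "fmt"

// tileKey and readRow stand in for the real storage layer; both are
// hypothetical helpers, not ShareChat's actual schema or client code.
func tileKey(entityID, feature string, tile int) string {
	return fmt.Sprintf("%s/%s/%d", entityID, feature, tile)
}

func readRow(key string) float64 {
	// In the naive model, every tile of every feature is a separate row read.
	return 1.0 // placeholder value
}

// fetchFeature assembles one feature from its tiles: at ~70 tiles per
// feature, every feature request fans out into ~70 row reads, which is how
// feature traffic amplifies into billions of rows per second.
func fetchFeature(entityID, feature string, tiles int) float64 {
	var total float64
	for t := 0; t < tiles; t++ {
		total += readRow(tileKey(entityID, feature, t))
	}
	return total
}

func main() {
	const tilesPerFeature = 70
	v := fetchFeature("user-123", "clicks_30d", tilesPerFeature)
	fmt.Println("feature value:", v, "row reads:", tilesPerFeature)
}
```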

Three architectural shifts fixed it. First, data modeling: switching to protocol buffers and Flink processing cut the number of rows 100x. Second, sharding into 27 services with consistent hashing pushed cache hit rates to 95%, and later 99%. Third, infrastructure optimization: a forked FastCache implementation with 100x less contention, gRPC buffer pooling, and garbage collection tuning.
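
The sharding and caching steps are easiest to picture together: route each entity deterministically to one of the 27 services so that repeated requests hit the same warm in-process cache, and only misses fall through to ScyllaDB. The sketch below is a simplified Go illustration, not ShareChat's code: it uses plain hash-mod routing rather than a full consistent-hash ring, and a mutex-guarded map where ShareChat uses its forked FastCache; shardFor, featureService, and loadFromDB are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 27 // number of feature-service shards in the case study

// shardFor routes an entity deterministically to one shard, so repeated
// requests for the same entity land on the same service and its warm cache.
func shardFor(entityID string) int {
	h := fnv.New32a()
	h.Write([]byte(entityID))
	return int(h.Sum32() % numShards)
}

// featureService is a stand-in for one sharded service instance with an
// in-process cache. ShareChat uses a forked FastCache here; a mutex-guarded
// map keeps this sketch self-contained.
type featureService struct {
	mu    sync.Mutex
	cache map[string][]byte
}

func newFeatureService() *featureService {
	return &featureService{cache: make(map[string][]byte)}
}

// loadFromDB is a hypothetical fallback to the database on a cache miss.
func loadFromDB(key string) []byte {
	return []byte("features-for-" + key)
}

func (s *featureService) Get(key string) []byte {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.cache[key]; ok {
		return v // cache hit: no database read at all
	}
	v := loadFromDB(key)
	s.cache[key] = v
	return v
}

func main() {
	shards := make([]*featureService, numShards)
	for i := range shards {
		shards[i] = newFeatureService()
	}
	entity := "user-123"
	svc := shards[shardFor(entity)]
	_ = svc.Get(entity) // miss: goes to the database once
	_ = svc.Get(entity) // hit: served from memory
	fmt.Println("entity", entity, "routed to shard", shardFor(entity))
}
```

At a 99% hit rate, only one feature read in a hundred ever reaches the database, which is why the load could collapse without touching the cluster.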

Final load: 18.4 million rows/sec to serve 1 billion features/sec. The original ScyllaDB cluster was over-provisioned and later downsized.

This matters because feature stores become infrastructure bottlenecks as ML deployments scale. They handle real-time aggregations (1-hour to 30-day windows) for recommendation systems, requiring single-digit millisecond latency. The pattern of optimizing aggressively before scaling contradicts vendor advice but matches what works in production.
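
Tiling is also what makes those window sizes workable: a 30-day aggregate is assembled by merging pre-aggregated tiles rather than scanning raw events. Here is a hedged Go sketch of that merge step; the daily tile granularity and field names are assumptions, since the case study does not spell out ShareChat's exact tile layout.

```go
package main

import (
	"fmt"
	"time"
)

// tile is a pre-aggregated bucket of events, e.g. produced by a Flink job.
// The granularity and fields here are assumptions for illustration.
type tile struct {
	start time.Time
	span  time.Duration
	count int64
	sum   float64
}

// mergeWindow folds the tiles that fall inside [from, to] into one aggregate,
// so serving a 30-day window costs a handful of tile merges rather than a
// scan over every raw event.
func mergeWindow(tiles []tile, from, to time.Time) (count int64, sum float64) {
	for _, t := range tiles {
		end := t.start.Add(t.span)
		if !t.start.Before(from) && !end.After(to) {
			count += t.count
			sum += t.sum
		}
	}
	return count, sum
}

func main() {
	now := time.Now().Truncate(time.Hour)
	var tiles []tile
	// 30 hypothetical daily tiles covering the last 30 days.
	for i := 1; i <= 30; i++ {
		tiles = append(tiles, tile{
			start: now.Add(-time.Duration(i) * 24 * time.Hour),
			span:  24 * time.Hour,
			count: 100,
			sum:   250,
		})
	}
	c, s := mergeWindow(tiles, now.Add(-30*24*time.Hour), now)
	fmt.Printf("30-day window: count=%d sum=%.0f from %d tiles\n", c, s, len(tiles))
}
```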

ShareChat runs a 100+ person ML team. The company hasn't disclosed when these optimizations occurred, though ScyllaDB published the case study this week.

Broader context: 32% of ML/AI use cases are production-ready according to recent surveys, up from 15% a year earlier. Feature stores are becoming critical infrastructure as agentic AI systems require persistent state management. Alternative approaches include Feast with Redis caching, Databricks Feature Store materialization strategies, and SageMaker's offline/online serving split.

The engineering details suggest most organizations could halve their feature store costs by auditing data models and cache strategies before adding capacity. History suggests they won't.