What This Session Is About
The data lakehouse was a major architectural advance — unifying storage, governance, and query capabilities in a way that made large-scale analytics possible for most organizations. But a lakehouse is still fundamentally a storage-and-retrieval system. It doesn't learn from what you do with the data.
Swetha introduced the data flywheel as the next evolution: an architecture where user interactions, model outputs, and evaluation signals continuously feed back into training pipelines and improve model quality over time. She drew on her experience at Google, Twitter, and now OpenAI to map out what this looks like in practice — and what it takes to actually build one.
The Four Components of a Data Flywheel
Every user interaction generates signal: clicks, completions, corrections, thumbs-up/down, dwell time, re-prompts. The flywheel starts with instrumenting your product to capture these signals at scale and routing them into a data pipeline alongside the inputs that generated them.
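As a rough illustration of capturing a signal together with the input and output that produced it, here is a minimal sketch. The `FeedbackEvent` record, its field names, and the `SIGNAL_TYPES` set are hypothetical, chosen to mirror the signals listed above; they are not taken from the talk.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

# Hypothetical signal names, mirroring the examples in the text.
SIGNAL_TYPES = {
    "click", "completion", "correction",
    "thumbs_up", "thumbs_down", "dwell_time", "re_prompt",
}


@dataclass
class FeedbackEvent:
    """One user signal, stored alongside the input and output that produced it."""
    model_input: str
    model_output: str
    signal_type: str
    signal_value: float = 1.0  # e.g. seconds for dwell_time, 1.0 for a thumbs-up
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def __post_init__(self) -> None:
        # Reject unknown signal types early so the pipeline stays clean.
        if self.signal_type not in SIGNAL_TYPES:
            raise ValueError(f"unknown signal type: {self.signal_type}")


def to_record(event: FeedbackEvent) -> str:
    """Serialize one event as a JSON line, ready to append to a pipeline sink."""
    return json.dumps(asdict(event), sort_keys=True)


event = FeedbackEvent(
    model_input="Summarize this ticket",
    model_output="The customer reports a login failure.",
    signal_type="thumbs_down",
)
line = to_record(event)
```

Keeping the input and output in the same record as the signal means a later labeling or training step does not have to join against a separate log.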
Raw feedback is noisy. The flywheel requires a quality filtering and labeling layer that distinguishes high-signal from low-signal examples. This can involve human annotation, automated reward models, or comparative ranking — often a combination. Bad training data doesn't just fail to help; it actively degrades the model.
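One way to combine human annotation with an automated reward score is sketched below. The `Example` fields, the `[0, 1]` score range, and the `min_reward` threshold of 0.8 are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    prompt: str
    response: str
    reward_score: float                 # assumed reward-model output in [0, 1]
    human_label: Optional[bool] = None  # True/False if annotated, None otherwise


def is_high_signal(ex: Example, min_reward: float = 0.8) -> bool:
    """Prefer the human label when one exists; otherwise fall back to the score."""
    if ex.human_label is not None:
        return ex.human_label
    return ex.reward_score >= min_reward


def filter_examples(examples: List[Example], min_reward: float = 0.8) -> List[Example]:
    """Keep only examples judged high-signal."""
    return [ex for ex in examples if is_high_signal(ex, min_reward)]


raw = [
    Example("q1", "a1", reward_score=0.95),
    Example("q2", "a2", reward_score=0.40),
    Example("q3", "a3", reward_score=0.90, human_label=False),  # annotator rejected it
    Example("q4", "a4", reward_score=0.30, human_label=True),   # annotator approved it
]
kept = filter_examples(raw)
```

Letting the human label override the automated score is a design choice for this sketch; a real pipeline might instead route disagreements to a second reviewer.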
The cleaned feedback feeds into a training or fine-tuning pipeline that produces model updates on a cadence (daily, weekly, or triggered by quality thresholds). This requires infrastructure: experiment tracking, model versioning, evaluation harnesses, and safe rollout mechanisms.
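The "cadence or quality threshold" trigger could look something like the following sketch. The function name, the weekly default, and the 0.9 score floor are hypothetical placeholders.

```python
from datetime import datetime, timedelta


def should_retrain(
    last_trained: datetime,
    now: datetime,
    eval_score: float,
    cadence: timedelta = timedelta(days=7),  # assumed weekly cadence
    min_eval_score: float = 0.9,             # assumed quality floor
) -> bool:
    """Retrain when the cadence has elapsed or quality has dropped below the floor."""
    cadence_elapsed = (now - last_trained) >= cadence
    quality_dropped = eval_score < min_eval_score
    return cadence_elapsed or quality_dropped


t0 = datetime(2024, 1, 1)
weekly_due = should_retrain(t0, t0 + timedelta(days=8), eval_score=0.95)
quality_due = should_retrain(t0, t0 + timedelta(days=1), eval_score=0.50)
not_due = should_retrain(t0, t0 + timedelta(days=1), eval_score=0.95)
```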
Without rigorous evaluation, you can't tell if the flywheel is improving the model or degrading it. Swetha emphasized building evaluation benchmarks from real user failures — tasks the model gets wrong today become the test cases for tomorrow's improvement.
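A minimal sketch of turning logged failures into a regression-style evaluation set follows. The record shape (`prompt`, `expected`) and the exact-match scoring are simplifying assumptions; real evaluations usually need a richer notion of correctness.

```python
from typing import Callable, Dict, List


def build_eval_set(failures: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Deduplicate logged failures by prompt, keeping the first expected answer seen."""
    seen: Dict[str, str] = {}
    for failure in failures:
        seen.setdefault(failure["prompt"], failure["expected"])
    return [{"prompt": p, "expected": e} for p, e in seen.items()]


def pass_rate(model_fn: Callable[[str], str], eval_set: List[Dict[str, str]]) -> float:
    """Fraction of cases where the model output exactly matches the expected answer."""
    if not eval_set:
        return 0.0
    passed = sum(
        1 for case in eval_set
        if model_fn(case["prompt"]).strip() == case["expected"].strip()
    )
    return passed / len(eval_set)


# Failures observed in production today (hypothetical data).
failures = [
    {"prompt": "2 + 2", "expected": "4"},
    {"prompt": "capital of France", "expected": "Paris"},
    {"prompt": "2 + 2", "expected": "4"},  # duplicate report of the same failure
]
eval_set = build_eval_set(failures)

# Stand-in for tomorrow's model: it has fixed one of the two failures.
answers = {"2 + 2": "4", "capital of France": "Lyon"}
score = pass_rate(lambda prompt: answers.get(prompt, ""), eval_set)
```

Tracking this pass rate across model versions gives a direct read on whether yesterday's failures are being fixed.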
Key Insights
- 01 The flywheel is a moat, but it takes 12–18 months to start spinning. The compounding effect of continuous improvement is real, but the early months feel slow. Organizations that abandon the flywheel after 3–4 months because they don't see results miss the exponential curve that starts later.
- 02 Data quality matters more than data volume. Swetha's consistent observation across Google, Twitter, and OpenAI: a smaller, carefully curated dataset of high-quality examples outperforms a larger noisy dataset in most fine-tuning scenarios. The quality filtering layer is often more valuable than the collection pipeline.
- 03 The lakehouse is a prerequisite, not a destination. You can't build a flywheel without the lakehouse foundation: unified storage, governed access, reliable pipelines. But organizations that declare victory at the lakehouse stage and stop investing miss the compounding value that comes from closing the feedback loop.
- 04 Evaluation is the hardest part. Building the training pipeline is tractable. Deciding what "better" means and measuring it reliably is where most flywheel initiatives struggle. Swetha recommended starting with a small, manually curated evaluation set derived from real production failures.
- 05 Cultural change is required, not optional. A data flywheel requires product, data, and ML teams to operate in a tightly coupled loop. At most organizations, these teams operate in silos with different cadences and incentives. The flywheel architecture requires, and often forces, a different organizational model.
- 06 Twitter's recommendation system was an early flywheel. Swetha described how Twitter's engagement-optimized feed operated as a de facto flywheel years before the term was widely used: user engagement fed into ranking signal, which improved the feed, which drove more engagement. The lesson: the pattern isn't new, but making it explicit and intentional is.
A lakehouse tells you what happened. A flywheel changes what happens next. The infrastructure looks similar from the outside — the difference is in what you do with the data after you collect it.
From the Q&A
Where do you start if you're building a flywheel from scratch?
Start with feedback capture on your highest-volume user interaction. Don't try to instrument everything at once. Pick one flow, capture inputs and outputs, add a simple rating mechanism, and build a labeled dataset of 500–1000 examples. That's your first fine-tuning experiment and your first evaluation set.
How do you prevent the flywheel from optimizing for the wrong metric?
This is Goodhart's Law applied to data flywheels. The feedback signal you optimize for becomes the metric that gets gamed — by users, by the model, or by both. Swetha's recommendation: use multiple orthogonal quality signals and maintain a human evaluation process that isn't tied to any single automated metric.
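One way to act on the "multiple orthogonal signals" advice is to refuse a model update if any single signal regresses, even when others improve. The sketch below assumes every signal is "higher is better"; the signal names, numbers, and the 0.01 tolerance are hypothetical.

```python
from typing import Dict, List


def regressed_signals(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Names of signals where the candidate is worse than baseline by more than tolerance.

    A signal missing from the candidate counts as regressed.
    """
    return sorted(
        name for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    )


def accept_candidate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """Accept only if no tracked signal regressed."""
    return not regressed_signals(baseline, candidate, tolerance)


baseline = {"thumbs_up_rate": 0.70, "human_eval": 0.80, "task_completion": 0.65}
# The candidate wins on the optimized metric but loses on human evaluation.
candidate = {"thumbs_up_rate": 0.78, "human_eval": 0.74, "task_completion": 0.66}

regressed = regressed_signals(baseline, candidate)
accepted = accept_candidate(baseline, candidate)
```

Here the thumbs-up rate improves while human evaluation drops, which is the pattern that suggests the optimized metric is being gamed.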
What does the flywheel look like at OpenAI vs. a typical enterprise?
At OpenAI, the scale and speed of the feedback loop are orders of magnitude greater than at most enterprises, but the principles are the same. The key difference is that OpenAI has dedicated teams for each layer of the flywheel. At most enterprises, one or two engineers own the whole pipeline, which means you need to be much more selective about what you instrument and iterate on first.
How do you handle the privacy implications of using user data for training?
This is critical and underestimated. User data used for training must be anonymized, users must be informed and have opted in, and in regulated industries you may need to use synthetic data or differential privacy techniques instead. Swetha's strong advice: involve legal and privacy teams from day one, not after the pipeline is built.
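A minimal sketch of an opt-in gate with basic redaction follows. The record fields (`opted_in`, `user_id`) are hypothetical, and the email regex is deliberately simple: masking emails and dropping user IDs is only an illustration and does not amount to real anonymization, which is exactly why legal and privacy review is needed.

```python
import re
from typing import Any, Dict, Iterable, List

# Simplistic email pattern, for illustration only; it will miss edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def training_eligible(records: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Keep only opted-in records, drop the user identifier, and redact emails."""
    eligible = []
    for record in records:
        if not record.get("opted_in", False):
            continue  # missing consent is treated as no consent
        eligible.append({
            "prompt": redact(record["prompt"]),
            "response": redact(record["response"]),
        })
    return eligible


records = [
    {"user_id": "u1", "opted_in": True,
     "prompt": "Email a.b@example.com about the refund", "response": "Done."},
    {"user_id": "u2", "opted_in": False,
     "prompt": "Reset my password", "response": "Sent a reset link."},
    {"user_id": "u3",  # no consent flag recorded
     "prompt": "Hello", "response": "Hi."},
]
clean = training_eligible(records)
```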