Would I Use LLMs To Rebuild Twitter's Dynamic Product Ads? Yes And No!

I led Dynamic Product Ads at Twitter, where we matched millions of users to hundreds of millions of e-commerce products in real time. The output was the top 5-6 products that each user was most likely to buy. The system used product and user embeddings with classic ML models to serve personalized ads at Twitter’s scale. We saw 15-18% improvements in CTR and 12% improvements in conversion rates compared to brand advertisements.

This was a few years ago. Now everyone is talking about AI and Large Language Models as if they will revolutionize everything. So I have been reflecting: if I were to build Dynamic Product Ads today, would I use LLMs? And, more importantly, what would not change at all, and why?

The answer: I’d use LLMs for about 20% of the system, specifically for generating embeddings, and keep everything else the same.

Our Original Approach

The problem at hand: recommend the products a user is most likely to click and buy while they are quickly scrolling their timeline. There were millions of products to choose from, spread across advertisers. On top of that, there were millions of users scrolling at the same time. The system had to make predictions for millions of users with sub-millisecond latency. The Ad Serving pipeline had to complete the prediction in under 50 milliseconds at most. The approach was very textbook (for 2022 at least).

Product Embeddings

We used each product’s metadata, like title, description, category, price, etc., and encoded it in a 128-dimensional dense vector space. We also utilized signals like user engagement with this product or conversion patterns to calculate embeddings.

User Embeddings

Users were represented by vectors based on signals like their engagement on the platform, profile information, and past purchases. Even geographies and the time of day played a key role here.

The Matching Model

At inference time, we would use a two-stage approach. First, we would run a fast approximate nearest neighbor search to retrieve candidate products whose embeddings were close to user embeddings. Then, we would use a gradient boosted decision tree to score those candidates, incorporating additional features like recency, price signals, and context like time of day.
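To make the two stages concrete, here is a minimal sketch in plain Python. It is not the production code: exact cosine search stands in for the ANN index, a hand-written linear formula stands in for the trained gradient boosted decision tree, and every name, weight, and vector (`retrieve_candidates`, `score_candidate`, `price_fit`, the toy catalog) is a hypothetical chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_candidates(user_vec, product_vecs, k):
    """Stage 1: top-k products by similarity (exact search standing in for ANN)."""
    ranked = sorted(product_vecs,
                    key=lambda pid: cosine(user_vec, product_vecs[pid]),
                    reverse=True)
    return ranked[:k]

def score_candidate(similarity, features):
    """Stage 2 stand-in: hypothetical linear score in place of a trained GBDT."""
    return 0.6 * similarity + 0.3 * features["recency"] + 0.1 * features["price_fit"]

def recommend(user_vec, product_vecs, product_features, k=3, top_n=2):
    """Retrieve k candidates, re-score them with extra features, keep top_n."""
    candidates = retrieve_candidates(user_vec, product_vecs, k)
    scored = [(score_candidate(cosine(user_vec, product_vecs[pid]),
                               product_features[pid]), pid)
              for pid in candidates]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_n]]

# Toy 2-dimensional embeddings (the real ones were 128-dimensional).
user = [1.0, 0.0]
products = {
    "shoes": [0.9, 0.1],
    "socks": [0.8, 0.2],
    "shorts": [0.7, 0.7],
    "laptop": [0.0, 1.0],
}
features = {
    "shoes": {"recency": 0.1, "price_fit": 0.5},
    "socks": {"recency": 0.9, "price_fit": 0.5},
    "shorts": {"recency": 0.5, "price_fit": 0.5},
    "laptop": {"recency": 1.0, "price_fit": 1.0},
}

print(recommend(user, products, features))
```

In this toy data the laptop never reaches the scoring stage, no matter how strong its other features are, because retrieval filters it out first. The second stage then reorders the survivors: the fresher item overtakes the slightly closer one.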

This approach worked. The model and ANN (approximate nearest neighbor) were explainable, debuggable, and most importantly, fast enough for Twitter’s scale.

How I’d Approach This Today

It’s 2026 now. If I were building this system today, here’s what I’d actually change.

Better Product Embeddings with LLM Encoders

The biggest improvement would be in generating better product embeddings. Modern Large Language Models are remarkably good at understanding semantic meaning and context. Instead of stitching together product descriptions (most of which were pretty bad to begin with), I’d use an LLM-based encoder to generate product embeddings.

This matters because a product titled “running shoes” would be semantically close to “sneakers for jogging”, even though they don’t share the exact words. Modern sentence transformers from Hugging Face, like all-MiniLM-L6-v2, handle this effortlessly.

We once had a Nike product catalog entry titled ‘Air Max 270 React’ that our 2022 embeddings couldn’t match to users searching for ‘cushioned running shoes’ or ‘athletic sneakers’ because there was no keyword overlap. The product got 35-40% fewer impressions than similar items in its first week until we collected enough engagement data. An LLM-based encoder would have understood the semantic relationship immediately.
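The “no keyword overlap” failure is easy to reproduce. The sketch below uses Jaccard similarity over lowercase word sets as a stand-in for any purely lexical signal. It is an illustration of the failure mode, not a reconstruction of our 2022 embeddings, and the helper names are hypothetical.

```python
def tokens(text):
    """Lowercase word set for a title or query."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of distinct words the two strings have in common."""
    words_a, words_b = tokens(a), tokens(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

title = "Air Max 270 React"
queries = ["cushioned running shoes", "athletic sneakers"]

# Lexical overlap between the catalog title and each query.
overlaps = {query: jaccard(title, query) for query in queries}
print(overlaps)

# A pair that does share words, for contrast.
print(jaccard("running shoes", "trail running shoes"))
```

Both queries score zero against the title, so any signal built only on shared words has nothing to work with until engagement data arrives. A semantic encoder compares learned vectors instead of word sets, which is why it does not depend on that overlap.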

Improved Cold-Start Handling

LLMs would also make cold-start handling a bit better. When a new product appears in a catalog, an LLM can extract rich signals from product descriptions, reviews, and images to generate a reasonable initial embedding. Similarly, for new users with sparse engagement history, modern encoders can better understand their profile information and initial tweets (if they have any) to create meaningful representations. Cold start was always our weakest point in classic embedding-based user-product matching solutions.
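One way to use such an initial embedding is to start from the content-derived vector and shift toward the engagement-derived vector as events accumulate. The sketch below is an assumption about how that could be wired, not something we ran: `blend_embedding` and `prior_strength` are hypothetical, and the weighting rule is a simple shrinkage heuristic.

```python
def blend_embedding(content_vec, engagement_vec, n_events, prior_strength=50):
    """Blend a content-based vector with an engagement-based one.

    With no engagement data the content vector is returned unchanged.
    The engagement weight n / (n + prior_strength) grows toward 1 as
    n_events increases. prior_strength is a hypothetical tuning knob.
    """
    if engagement_vec is None or n_events <= 0:
        return list(content_vec)
    weight = n_events / (n_events + prior_strength)
    return [(1 - weight) * c + weight * e
            for c, e in zip(content_vec, engagement_vec)]

content = [1.0, 0.0]      # e.g. from an encoder run on the product text
engagement = [0.0, 1.0]   # e.g. learned from clicks and conversions

print(blend_embedding(content, None, 0))          # brand-new product
print(blend_embedding(content, engagement, 50))   # equal weight at 50 events
print(blend_embedding(content, engagement, 950))  # mostly engagement
```

The point of the design is that a new product is servable on day one, and the content signal fades out on its own rather than through a hard cutover.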

So, where would LLMs fit into the actual architecture?

Hybrid Approach

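I would still use classic ML for the actual scoring and serving layers. The architecture would look like this: LLM-based encoders generate the product and user embeddings, and the same two-stage pipeline (ANN retrieval followed by a gradient boosted decision tree) handles matching and scoring.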
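A minimal sketch of that split, assuming the encoder runs in an offline batch job and the request path only reads precomputed vectors. `EmbeddingStore` and `toy_encoder` are hypothetical names; the toy encoder just returns crude text statistics in place of a real model.

```python
class EmbeddingStore:
    """In-memory stand-in for a table of precomputed embeddings."""

    def __init__(self, encoder):
        self._encoder = encoder
        self._vectors = {}
        self.encoder_calls = 0  # counts how often the expensive model runs

    def index_catalog(self, catalog):
        """Offline batch job: encode every product exactly once."""
        for product_id, text in catalog.items():
            self._vectors[product_id] = self._encoder(text)
            self.encoder_calls += 1

    def lookup(self, product_id):
        """Online path: a dictionary read, never an encoder call."""
        return self._vectors[product_id]

def toy_encoder(text):
    """Placeholder for an LLM-based encoder: [word count, total letters]."""
    words = text.split()
    return [float(len(words)), float(sum(len(w) for w in words))]

catalog = {"p1": "running shoes", "p2": "sneakers for jogging"}
store = EmbeddingStore(toy_encoder)
store.index_catalog(catalog)

# Simulate serving traffic: many reads, no additional encoding.
for _ in range(1000):
    store.lookup("p1")

print(store.encoder_calls)
```

The encoder cost scales with catalog size and refresh frequency, not with request volume, which is what keeps the expensive model out of the latency budget.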

Why would I use classic models for scoring? The reasons are latency, cost, and explainability.

LLMs cannot predict products for millions of users in under 10 milliseconds. They are overkill for combining numerical features and making a ranking decision. Classic models do this in microseconds. At Twitter’s scale, the difference between 1ms and 10ms inference time translates to millions of dollars in infrastructure costs and measurable drops in user engagement.

Cost matters more than people admit. Running LLM inference for every prediction request would cost 50-100x more than our classic approach.

What About Generating Ad Copy?

There is a lot of hype on the internet about using LLMs to generate personalized ad copy on the fly or to reason about user intent in real time. This is where it gets hard to decide if LLMs would actually be useful or not.

Generating ad copy with LLMs introduces unacceptable risks like hallucinations about product features, inconsistent branding, and hard-to-review content at this scale. The system would need to show millions of ad variations per day, and there would be no way to review them for accuracy and brand safety. One hallucinated claim about a product like “waterproof” when it’s not, or “FDA-approved” when it isn’t, would create legal liability. The risk doesn’t justify the marginal lift in engagement.

What Does Not Change?

Understanding user intent is still the hardest part. Whether the system uses embeddings from 2022 or LLMs from 2026, the fundamental challenge remains the same: inferring what someone wants from noisy signals. Someone who tweets about running shoes might be a marathon runner shopping for their next pair, or a casual observer who just watched a race on television. This problem needs good data to solve, thoughtful feature engineering, and lots of experimentation. No model architecture solves this.

Latency requirements are non-negotiable. At scale, every millisecond counts. Users would abandon experiences that feel slow. Ad systems cannot slow down the loading of the timeline. I have seen several ML infrastructure systems crumble in A/B tests because they added 100ms of latency to a well-oiled system. The model could be better, but the latency requirement trumps all.
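A latency guardrail of this kind can be expressed as a simple check on tail latency. The sketch below is a hypothetical illustration, not the tooling we used: the 50 ms budget comes from the serving requirement mentioned earlier, while the 5 ms allowed regression, the choice of p99, and the sample data are assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1-100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def passes_latency_guardrail(control_ms, treatment_ms,
                             budget_ms=50, max_regression_ms=5, pct=99):
    """True if treatment tail latency fits the budget and barely regresses."""
    p_control = percentile(control_ms, pct)
    p_treatment = percentile(treatment_ms, pct)
    within_budget = p_treatment <= budget_ms
    small_regression = (p_treatment - p_control) <= max_regression_ms
    return within_budget and small_regression

# Synthetic request latencies in milliseconds.
control = [20] * 95 + [40] * 5
slightly_slower = [21] * 95 + [43] * 5
much_slower = [ms + 100 for ms in control]  # a variant that adds 100 ms

print(passes_latency_guardrail(control, slightly_slower))
print(passes_latency_guardrail(control, much_slower))
```

A check like this runs before anyone looks at CTR. A variant that fails it is rejected no matter how good its model metrics are.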

Last-mile problems remain the same. Cold start for new products or users, and data quality issues when catalogs have missing or incorrect product descriptions, are still there. These problems are orthogonal to the model architecture: they require system design thinking, not a different model.

Iteration speed beats model sophistication. A team that can run 10 experiments per week with a simpler model will consistently outperform a team running 1 experiment per week with a very sophisticated model. The ability to test quickly, measure results, and iterate is more valuable than marginal improvements in model quality. When we launched Dynamic Product Ads, we ran 3-4 experiments per week. We tested different embedding dimensions, different ANN algorithms, and different features in the scoring model. Most experiments failed. But the ones that worked compounded. That velocity mattered more than picking the “perfect” model architecture.

The Real Question is: Where Is The Bottleneck?

To be honest, most of the “how would you build it today with modern AI” discussion misses the point. The question should not be what is possible with the new technology. It should be, “What is the actual bottleneck in your system where AI can help?”

For us, the bottleneck was never about the quality of the embeddings. It was about understanding user intent, handling data quality issues in the product catalogs, managing cold-start problems, and building systems that could handle scale. Modern AI genuinely helps with some of these, and there is real value in using it. However, the fundamental system challenges don’t change.

If I were building this system today, I’d spend 20% of my effort on “using LLMs to generate better embeddings” and 80% on the same problems of scale, data quality, experimentation, and understanding user intent.

The Unpopular Opinion

The tech industry loves revolutionary narratives. There was similar hype around blockchain and Web3 a few years ago. But the truth is, most production ML systems already work, scale, and make money. The revolutionary approach of using LLMs for everything would yield a 5% improvement, but would be 10x slower and 100x more expensive. Modern AI is genuinely valuable when applied thoughtfully to bottlenecks, not as a replacement for an entire system that already works.
