At QCon SF 2024, Faye Zhang gave a talk titled Search: from Linear to Multiverse, covering three trends and techniques in AI-powered search: multimodal interaction, personalization, and simulation with AI agents.
Zhang, a Staff Software Engineer at Pinterest, began with stats on the growth of AI as a primary search tool: from 1% of the population in January 2024 to 8% in October 2024. She said that it was projected to reach over 60% by 2027. She mentioned several AI capabilities that make it useful for search, such as its ability to scan reviews quickly or to find items from a visual description.
She then explored the trend toward multimodal interaction with AI search; unlike traditional search, which handles text-only queries, AI models can also accept image, video, or speech input. She cited several research papers, including one about Meta’s Chameleon model, and gave a high-level overview of the architecture of multimodal interaction. The most common strategy is to map all input modalities into the same embedding space, as Meta’s ImageBind model does.
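The core idea behind this strategy is that each modality gets its own encoder, but all encoders project into one shared vector space, so a text query can retrieve images (or audio) by simple similarity search. The sketch below illustrates this with placeholder encoders and random features; it is an illustrative assumption, not ImageBind's actual architecture.

```python
# Illustrative sketch: per-modality encoders projected into one shared embedding
# space, so a text query can retrieve images by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # size of the shared embedding space (illustrative)

class ModalityEncoder(nn.Module):
    """Stand-in for a modality-specific backbone (e.g. a ViT for images,
    a text transformer for queries) followed by a projection head."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.backbone = nn.Linear(input_dim, 1024)   # placeholder for a real encoder
        self.project = nn.Linear(1024, EMBED_DIM)    # maps into the shared space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.backbone(x))
        return F.normalize(self.project(h), dim=-1)  # unit-norm for cosine similarity

# One encoder per modality, all emitting vectors in the same space.
text_encoder = ModalityEncoder(input_dim=768)
image_encoder = ModalityEncoder(input_dim=1024)

# Retrieval: embed the query text and candidate images, rank by cosine similarity.
query_emb = text_encoder(torch.randn(1, 768))        # fake text features
image_embs = image_encoder(torch.randn(100, 1024))   # fake features for 100 images
scores = image_embs @ query_emb.T                    # cosine similarity (unit vectors)
top5 = scores.squeeze(-1).topk(5).indices
print("top-5 image indices:", top5.tolist())
```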
This leads to the next challenge: users want to iterate and refine their search in real time in a “natural and intuitive way.” Zhang gave the example of searching for sunglasses. The user might begin by specifying price and shipping restrictions. The search AI returns several images, the user selects one, then asks for the same color but with a different shape. Zhang outlined an interaction-driven architecture for solving this problem.
This architecture consists of two parts. The first is a vision transformer, which understands image features and their natural-language descriptions. The second is a T5 language model, using both its encoder and decoder, which handles the natural-language interactions. Zhang proposed using the T5 encoder-decoder instead of a more common decoder-only model because it can “deal with embedding and text at the same time,” and also because it can be fine-tuned efficiently.
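One plausible way to wire these two parts together is to project the ViT's patch embeddings into the T5 encoder's embedding space, concatenate them with the user's follow-up text, and let the decoder generate a refined query. The sketch below uses Hugging Face's ViTModel and T5ForConditionalGeneration; the model names, the projection layer, and the fusion-by-concatenation step are my assumptions rather than the design from the talk, and the projection would need fine-tuning before the output is meaningful.

```python
# Hedged sketch: fuse ViT image embeddings with T5 text embeddings so an
# encoder-decoder model can turn a follow-up request into a refined query.
import torch
import torch.nn as nn
from transformers import ViTModel, T5ForConditionalGeneration, T5Tokenizer

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Project ViT features (768-d) into T5's embedding space (512-d for t5-small).
project = nn.Linear(vit.config.hidden_size, t5.config.d_model)

# 1. Encode the image the user selected (random pixels stand in for a real photo).
pixel_values = torch.randn(1, 3, 224, 224)
image_tokens = project(vit(pixel_values=pixel_values).last_hidden_state)

# 2. Embed the user's follow-up request with T5's own token embeddings.
text = "same color, but in a round frame"
text_ids = tokenizer(text, return_tensors="pt").input_ids
text_tokens = t5.get_input_embeddings()(text_ids)

# 3. Concatenate image and text tokens, then let the encoder-decoder generate
#    a refined query to send back to retrieval.
fused = torch.cat([image_tokens, text_tokens], dim=1)
refined_ids = t5.generate(inputs_embeds=fused, max_new_tokens=32)
print(tokenizer.decode(refined_ids[0], skip_special_tokens=True))
```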
Zhang then discussed the personalization of search based on user activity history. She gave an overview of Pinterest’s PinnerFormer, a Transformer-based model which predicts a user’s actions over the next 20 days from the past year of activity. She also discussed a similar model, Hierarchical Sequential Transduction Units (HSTU), from Meta. Next, she reviewed the challenges of bringing these systems into production; in particular, they require a lambda architecture, which has separate real-time and batch data processing pipelines.
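The general shape of such a model is a transformer over the sequence of a user's past actions, producing a user embedding that is trained to score highly against the actions the user takes in a future window. The following is a minimal sketch of that idea (my simplification, not Pinterest's or Meta's implementation), using in-batch negatives for the contrastive objective.

```python
# Minimal sketch of a sequential user model: encode past actions with a
# transformer, take the final hidden state as the user embedding, and train it
# to match embeddings of future actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserSequenceModel(nn.Module):
    def __init__(self, num_actions: int = 10_000, dim: int = 128):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, dim)      # one embedding per action/item id
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, action_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.action_emb(action_ids))
        return F.normalize(h[:, -1], dim=-1)                  # last position = user embedding

model = UserSequenceModel()
past_actions = torch.randint(0, 10_000, (32, 256))   # batch of 32 users, 256 past actions each
future_actions = torch.randint(0, 10_000, (32,))     # one sampled action from each user's future window

user_emb = model(past_actions)
future_emb = F.normalize(model.action_emb(future_actions), dim=-1)

# Contrastive objective: each user's embedding should match their own future
# action more than other users' future actions (in-batch negatives).
logits = user_emb @ future_emb.T / 0.07
loss = F.cross_entropy(logits, torch.arange(32))
loss.backward()
print(f"loss: {loss.item():.3f}")
```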
The third trend she presented was agent simulation, in particular for testing the search system. In this scenario, AI agents simulate real users interacting with the system. This can be done rapidly and at large scale, providing much faster feedback on the search system’s behavior than traditional testing methods. She mentioned it could also be effective for red-teaming and scale testing.
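A harness for this kind of testing might look like the hypothetical sketch below: an LLM-driven "user" with a persona issues a query, inspects the results, and either refines the query or stops, while the harness records whether the goal was met. The search_api and llm functions here are stand-ins for a real search endpoint and a real LLM client, not anything described in the talk.

```python
# Hypothetical sketch of agent-based search testing with a simulated user.
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    persona: str                      # e.g. "budget shopper looking for retro sunglasses"
    goal: str                         # what the agent should be satisfied with
    transcript: list = field(default_factory=list)

def llm(prompt: str) -> str:
    """Stand-in for a call to an LLM; a real harness would call a model here."""
    return "SATISFIED" if "Results:" in prompt else "retro round sunglasses under $30"

def search_api(query: str) -> list[str]:
    """Stand-in for the search system under test."""
    return [f"result for '{query}' #{i}" for i in range(3)]

def run_episode(user: SimulatedUser, max_turns: int = 5) -> bool:
    query = llm(f"You are {user.persona}. Write a search query for: {user.goal}")
    for _ in range(max_turns):
        results = search_api(query)
        user.transcript.append((query, results))
        verdict = llm(f"Persona: {user.persona}\nGoal: {user.goal}\n"
                      f"Results: {results}\nReply SATISFIED or a refined query.")
        if verdict == "SATISFIED":
            return True
        query = verdict                           # agent refines and tries again
    return False                                  # failed within the turn budget

# Many such episodes can run in parallel, giving fast feedback on relevance
# regressions before any real user sees them.
user = SimulatedUser(persona="budget shopper", goal="retro round sunglasses under $30")
print("success:", run_episode(user), "| turns used:", len(user.transcript))
```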
Zhang concluded her talk with a look into the future. First, she pointed out that if agents begin to handle more search tasks for humans, it is likely that search results will become optimized for agents. Her next prediction was about on-device intelligence: because our mobile devices hold so much personal data, they can “create a hyper-personalized experience with privacy.” Finally, she touched on the debate about AGI and whether learning or knowledge comes first. Her personal take is that the two are intertwined, and that an intelligent system doesn’t simply retrieve information but can “generalize, reason, and also innovate.”