Grab responded to the challenge of finding valuable datasets among 200k+ tables by enhancing Hubble, its data discovery tool, with new GenAI-based capabilities. The company shortened the data discovery process by using LLMs to generate dataset documentation and created a Slack bot to bring effective data discovery to data consumers.
Grab manages many analytical datasets across its vast data lake, Kafka streams, production databases, and ML features. Historically, locating the most suitable dataset was challenging for teams working on new data-based products. The company observed that users struggled to find suitable datasets, with 18% of searches abandoned without inspecting search results. Data consumers relied mainly on tribal knowledge, and data discovery took multiple days.
Shreyas Parbat, lead product manager at Grab, shares the team’s vision for improving data discovery:
Given the historical context, our vision was clear: to remove humans in the data discovery loop by automating the entire process using LLM-powered products. We aimed to reduce the time taken for data discovery from multiple days to mere seconds, eliminating the need for anyone to ask their colleagues data discovery questions ever again.
The team behind Hubble, the internal data discovery tool built on top of the DataHub platform, decided to invest heavily in making the discovery process more effective. They started by enhancing the table metadata indexed in Elasticsearch and improving the documentation coverage of data lake tables, which was low, at only 20%.
Engineers conducted user interviews to discover how Elasticsearch should be tuned. Subsequently, they hid irrelevant tables, deboosted deprecated tables, and boosted the most relevant schemas and certified tables. They also added relevant tags and improved the search UI, resulting in a 12% increase in the search click-through rate.
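The tuning described above (hiding, deboosting, and boosting) can be illustrated with a minimal sketch of an Elasticsearch query body built as a plain Python dictionary. This is not Grab's actual configuration: the field names (`name`, `description`, `tags`, `hidden`, `deprecated`, `certified`, `schema`), the schema list, and all boost values are assumptions chosen for illustration. The sketch uses the standard `boosting`, `bool`, `multi_match`, `term`, and `terms` queries from the Elasticsearch Query DSL.

```python
from typing import Any, Dict, List

# Hypothetical list of schemas considered most relevant (assumption).
PREFERRED_SCHEMAS: List[str] = ["core", "analytics"]


def build_table_search_query(user_query: str) -> Dict[str, Any]:
    """Build an Elasticsearch request body (a plain dict) for table search.

    All field names and boost values are illustrative assumptions.
    """
    positive = {
        "bool": {
            # Match the user's text against name, description, and tags.
            "must": [
                {
                    "multi_match": {
                        "query": user_query,
                        "fields": ["name^3", "description", "tags^2"],
                    }
                }
            ],
            # Hide tables flagged as irrelevant.
            "must_not": [{"term": {"hidden": True}}],
            # Optional clauses that raise the score when they match.
            "should": [
                {"term": {"certified": {"value": True, "boost": 2.0}}},
                {"terms": {"schema": PREFERRED_SCHEMAS, "boost": 1.5}},
            ],
        }
    }
    return {
        "query": {
            "boosting": {
                "positive": positive,
                # Deprecated tables still match, but their score is
                # multiplied by negative_boost (deboosting).
                "negative": {"term": {"deprecated": True}},
                "negative_boost": 0.2,
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_table_search_query("driver earnings"), indent=2))
```

The `boosting` query keeps deprecated tables searchable while ranking them lower, whereas `must_not` removes hidden tables from the results entirely; this mirrors the distinction the article draws between deboosting and hiding.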
The team created a solution using GPT-4 to generate documentation based on table schemas and sample data. The new solution was integrated with the Hubble UI so that data producers could easily create table-level documentation or customize the documentation generated by GenAI. As a result, documentation coverage increased to 90%, with 95% of users finding the generated documentation useful.
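To make the idea of generating documentation from a schema and sample data more concrete, the following is a minimal sketch of how a prompt for an LLM might be assembled. The article does not describe Grab's actual prompt or pipeline, so the `Column` type, the `build_doc_prompt` function, the prompt wording, and the example table are all hypothetical. The sketch stops at building the prompt string and deliberately makes no model API call.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Column:
    """A table column: hypothetical minimal schema representation."""
    name: str
    dtype: str


def build_doc_prompt(
    table_name: str,
    columns: List[Column],
    sample_rows: List[Dict[str, Any]],
    max_rows: int = 3,
) -> str:
    """Assemble a documentation prompt from a schema and a few sample rows."""
    schema_lines = "\n".join(f"- {c.name} ({c.dtype})" for c in columns)
    header = " | ".join(c.name for c in columns)
    # Only the first max_rows rows are included to keep the prompt short.
    rows = "\n".join(
        " | ".join(str(row.get(c.name, "")) for c in columns)
        for row in sample_rows[:max_rows]
    )
    return (
        f"Write concise documentation for the table `{table_name}`.\n"
        "Describe the table's purpose and each column.\n\n"
        f"Schema:\n{schema_lines}\n\n"
        f"Sample data:\n{header}\n{rows}\n"
    )


if __name__ == "__main__":
    # Illustrative, made-up table and rows.
    cols = [Column("booking_id", "string"), Column("fare", "double")]
    samples = [{"booking_id": "b-1", "fare": 12.5}]
    print(build_doc_prompt("bookings", cols, samples))
```

The resulting string would then be sent to the model, and, as the article notes, the generated text would remain editable by data producers before being published.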
Generating Dataset Documentation with GPT-4 (Source: Grab Engineering Blog)
The Hubble team created a Slack bot to bring easy data discovery to data consumers. Engineers integrated Hubble with Glean to make data-lake table documentation available on the Glean platform. The resulting HubbleIQ bot, built with Glean Apps, was integrated with Hubble search and Slack.
HubbleIQ Slack Bot Leveraging Glean (Source: Grab Engineering Blog)
Grab plans further enhancements to the GenAI-based functionality, including enriching the documentation generator with more context and allowing analysts to auto-update documentation based on Slack threads. The team also wants to implement Reflexion to further improve the quality of generated documentation.