Google has introduced LangExtract, an open-source Python library designed to help developers extract structured information from unstructured text using large language models such as the Gemini models. The library simplifies the process of converting free-form text, including documents like clinical notes, legal texts, and customer feedback, into structured data. Developers can define extraction tasks through natural language instructions and example data, making it easier to process and organize information from various types of unstructured content.
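To make the idea of defining a task through instructions and examples concrete, here is a minimal sketch of how a few-shot extraction prompt could be assembled. It uses only the Python standard library and is not LangExtract's actual API; the `Example` class, the `build_prompt` function, and the prompt layout are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Example:
    """One worked example: input text plus the extractions expected from it."""
    text: str
    extractions: list[dict] = field(default_factory=list)


def build_prompt(instructions: str, examples: list[Example], document: str) -> str:
    """Assemble a few-shot prompt: instructions, worked examples, then the new input."""
    parts = [instructions.strip(), ""]
    for ex in examples:
        parts.append(f"Text: {ex.text}")
        parts.append(f"Extractions: {json.dumps(ex.extractions)}")
        parts.append("")
    # The new document goes last, with an open slot for the model to complete.
    parts.append(f"Text: {document}")
    parts.append("Extractions:")
    return "\n".join(parts)


# Hypothetical task: pull medications and dosages out of a clinical note.
examples = [
    Example(
        text="Patient was given 250 mg amoxicillin.",
        extractions=[
            {"class": "medication", "text": "amoxicillin"},
            {"class": "dosage", "text": "250 mg"},
        ],
    )
]
prompt = build_prompt(
    "Extract medications and dosages. Use exact text from the input.",
    examples,
    "Start ibuprofen 400 mg twice daily.",
)
print(prompt)
```

The worked example does double duty here: it shows the model what to extract and what output shape to return.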
One of LangExtract’s standout features is its use of controlled generation techniques, which help keep the extracted information consistently formatted. The library also grounds its output in the source: it highlights the relevant spans of text, so each extracted entity can be traced back to its exact location in the original document. This traceability makes extraction results easier to verify and trust.
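The traceability idea can be illustrated with a small sketch that attaches character offsets to extracted strings by locating them in the source text. This is an assumption about how such grounding might work in the simplest case (exact string matching), not a description of LangExtract's internals; `GroundedExtraction` and `ground` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    label: str
    text: str
    start: Optional[int]  # character offset in the source, or None if not found
    end: Optional[int]


def ground(source: str, items: list[tuple[str, str]]) -> list[GroundedExtraction]:
    """Attach character offsets to (label, text) pairs by exact match in the source."""
    grounded = []
    cursor = 0  # search forward first so repeated strings map to successive occurrences
    for label, text in items:
        pos = source.find(text, cursor)
        if pos == -1:
            pos = source.find(text)  # fall back to a search from the start
        if pos == -1:
            # Not present verbatim: keep the item but mark it as ungrounded.
            grounded.append(GroundedExtraction(label, text, None, None))
            continue
        grounded.append(GroundedExtraction(label, text, pos, pos + len(text)))
        cursor = pos + len(text)
    return grounded


source = "Start ibuprofen 400 mg twice daily."
results = ground(
    source,
    [("medication", "ibuprofen"), ("dosage", "400 mg"), ("route", "oral")],
)
for r in results:
    print(r)
```

An item that cannot be located, such as the `"oral"` entry above, is kept with empty offsets, which gives a reviewer a signal that the model may have produced text that is not in the document.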
To handle long and complex documents, LangExtract incorporates advanced strategies like text chunking, parallel processing, and multiple extraction passes. These techniques help improve recall and accuracy, ensuring that the library can effectively extract information from large bodies of text while maintaining high-quality results. This makes LangExtract suitable for applications in various domains, from healthcare to legal documents, without the need for extensive fine-tuning of the underlying models.
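The combination of chunking, parallel processing, and multiple passes can be sketched as follows. This is an illustrative stand-in under stated assumptions, not LangExtract's implementation: the sentence splitter is a naive regular expression, `extract_dosages` is a hypothetical regex-based placeholder for a model call, and all function names and parameters are invented for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Span = tuple[int, int, str]  # (start offset, end offset, matched text)


def chunk_sentences(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Group whole sentences into (offset, chunk) pairs of at most max_chars where possible."""
    chunks = []
    start = None
    end = 0
    # Naive sentence splitter: would also break on decimals such as "2.5".
    for m in re.finditer(r"[^.!?]+[.!?]*\s*", text):
        if start is not None and m.end() - start > max_chars:
            chunks.append((start, text[start:end]))
            start = None
        if start is None:
            start = m.start()
        end = m.end()
    if start is not None:
        chunks.append((start, text[start:end]))
    return chunks


def extract_dosages(offset: int, chunk: str) -> set[Span]:
    """Hypothetical stand-in for a model call: finds dosage strings like '400 mg'."""
    return {
        (offset + m.start(), offset + m.end(), m.group())
        for m in re.finditer(r"\b\d+ mg\b", chunk)
    }


def extract_long(
    text: str,
    extract_fn: Callable[[int, str], set[Span]],
    max_chars: int = 50,
    passes: int = 2,
    workers: int = 4,
) -> list[Span]:
    """Run extract_fn over every chunk in parallel, several times, and merge the results."""
    chunks = chunk_sentences(text, max_chars)
    found: set[Span] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(passes):
            # With a real model, passes can return different results; the union raises recall.
            for result in pool.map(lambda c: extract_fn(*c), chunks):
                found |= result
    return sorted(found)


text = "Give ibuprofen 400 mg. Then amoxicillin 250 mg. Stop aspirin 81 mg."
found = extract_long(text, extract_dosages)
print(found)
```

Because each chunk carries its starting offset, spans found inside a chunk can be mapped back to positions in the full document, and duplicates from repeated passes collapse in the set.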
LangExtract can be integrated with various LLMs, including cloud-based models like Gemini and local models via platforms such as Ollama. This flexibility makes it a versatile tool for developers working across different models. It enables users to define extraction tasks for a wide range of applications without requiring deep expertise in machine learning.
The release of LangExtract has drawn enthusiastic responses from the developer community. Akshay Goel, a key contributor, shared his excitement about the release and his eagerness to see what users build with it, posting:
Excited to release LangExtract alongside the team today and looking forward to seeing what the developer community builds with it!
Developer Kyle Brown described it as a major step forward in AI transparency, converting unstructured text into structured, understandable data. Adding to the momentum, a community member announced a TypeScript port of LangExtract that supports OpenAI models as well as Google’s Gemini, a sign of the community’s active involvement:
For anyone who is interested — I ported this to typescript and added an ability to use OpenAI not just Gemini.
The library is available under the Apache 2.0 license and can be easily installed via pip. It offers an accessible and powerful tool for developers looking to add information extraction capabilities to their applications.