This Open Source Tool Could Save Your Data Team Hundreds of Hours | HackerNoon

News Room · Published 9 June 2025 (last updated 6:37 PM)

CocoIndex supports Qdrant natively: the integration features a high-performance Rust stack with end-to-end incremental processing for scale and data freshness. 🎉 We just rolled out our latest change, which handles automatic target schema setup for Qdrant from the CocoIndex indexing flow.

That means developers don't need to do any schema setup for target stores, including tables, field types, keys, and indexes. The setup is derived by schema inference from the CocoIndex flow definition. This is already supported in the native integrations with Postgres, Neo4j, and Kuzu, and allows for more seamless operation between the indexing pipeline and target stores.

No more manual setup

Previously, users had to manually create the collection before indexing:

curl -X PUT 'http://localhost:6333/collections/image_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "embedding": {
        "size": 768,
        "distance": "Cosine"
      }
    }
  }'

With the new change, users don't need to do any manual collection management.

How it works

Following the dataflow programming model, users define a flow in which every step carries output data type information, and the next step takes that type information as input. See an example (~100 lines of Python end to end).
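As a condensed sketch of what such a flow can look like (adapted from memory of CocoIndex's public text-embedding examples; names like flow_def, LocalFile, SentenceTransformerEmbed, row(), and GeneratedField.UUID are assumptions and may not match the current API exactly):

import cocoindex

QDRANT_COLLECTION = "doc_embeddings"

@cocoindex.flow_def(name="DocEmbedding")
def doc_embedding_flow(flow_builder, data_scope):
    # Source: every file becomes a row with typed fields (filename, content).
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="docs"))

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # Transform: the output type (a fixed-size float vector) is what
        # lets CocoIndex infer the Qdrant vector configuration later.
        doc["embedding"] = doc["content"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"))
        doc_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            embedding=doc["embedding"])

    # Export: the target schema for the Qdrant collection is derived from
    # the collected fields' types, so no manual collection setup is needed.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Qdrant(collection_name=QDRANT_COLLECTION),
        primary_key_fields=["id"])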

In short, the flow can be presented as a lineage graph.

In the declarative dataflow above,

Target = Formula (Source)

This implies both the data and the expected target schema. A single flow definition drives both data processing (including change handling) and target schema setup, providing a single source of truth for both data and schema. A similar way to think about it is how type systems infer data types from operators and inputs, i.e., type inference (as in Rust, for example).
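As a toy illustration of that idea (not CocoIndex internals), a declared output type can be enough to derive the target configuration:

from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class DocEmbedding:
    id: str
    embedding: list[float]  # fixed-size vector produced by the embedding step

def infer_qdrant_collection_config(record_type: type, vector_size: int) -> dict:
    # Derive a Qdrant-style collection config from the record's field types.
    hints = get_type_hints(record_type)
    vector_fields = [name for name, typ in hints.items() if typ == list[float]]
    return {
        "vectors": {
            name: {"size": vector_size, "distance": "Cosine"}
            for name in vector_fields
        }
    }

print(infer_qdrant_collection_config(DocEmbedding, 768))
# {'vectors': {'embedding': {'size': 768, 'distance': 'Cosine'}}}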

In the indexing flow, exporting embeddings and metadata directly to Qdrant is all you need:

doc_embeddings.export(
    "doc_embeddings",
    cocoindex.storages.Qdrant(collection_name=QDRANT_COLLECTION),
    primary_key_fields=["id"],
)

To start a CocoIndex process, users first need to run the setup, which covers all the necessary setup for any backends needed:

cocoindex setup main.py

cocoindex setup will:

  • Create new backends required by the schema, such as tables, collections, etc.
  • Alter existing backends on schema change. It will try a non-destructive update when possible, e.g. when primary keys don't change and the target storage supports in-place schema updates (such as ALTER TABLE in Postgres); otherwise it drops and recreates.
  • Drop stale backends.

Developers then run

cocoindex update main.py [-L]

to start an indexing pipeline (-L for long-running).

If you’ve made logic updates that require the schema on the target store to change, don’t worry. When you run cocoindex update again after the logic update, CocoIndex will infer the new schema for the target store. A cocoindex setup run is required to push the schema to the target store, and the CLI will notify you of this. By design, CocoIndex won’t update any schema without your approval, as some schema updates may involve destructive changes.
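For example, a typical sequence after a logic change might look like this, using the same commands as above:

# 1. Edit the flow definition in main.py (e.g. collect an additional field).
# 2. Push the inferred schema change to the target stores.
cocoindex setup main.py
# 3. Restart the indexing pipeline (-L keeps it running for continuous updates).
cocoindex update main.py -L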

To drop a flow, you’d run

cocoindex drop main.py

cocoindex drop drops the backends when dropping the flow.

All backend entities for the target stores, such as a PostgreSQL table or a Qdrant collection, are owned by the flow as derived data, so they will be dropped too.

Why automatic target schema inference?

The question should really be, why not?

The traditional way is for users to figure out entirely on their own when and how to set up and update the target schema, including the specific schema. Indexing flows often span multiple systems. For example:

On the target store:

  • Vector databases (PGVector, Qdrant, etc.)
  • Relational databases (PostgreSQL)
  • Graph databases (Neo4j, Kuzu, etc.)

The data types you’re outputting and your target schema must match up.

And if there’s any internal state tracking, e.g., in the case of incremental processing, also:

  • Internal tables (state tracking)

Doing this manually is tedious and painful, as all of these systems must agree on schema and structure. This typically requires:

  • Manual setup and syncing of schemas.
  • Tight coordination between developers, DevOps, and data engineers, since the people writing the code may not be the same people deploying or running it in an organization.
  • Debugging misalignments between flow logic and storage layers.
  • Stressful production rollouts.

Any additional moving part in the indexing pipeline adds friction: any mismatch between the logic and the storage schema could result in silent failures or subtle bugs.

  • In some cases the failures aren’t silent but obvious, e.g. if users forget to create a table or collection, the pipeline simply errors out when writing to the target. Even then, figuring out the exact schema and configuration for the target is still subtle.
  • Other scenarios lead to non-obvious issues, i.e. the storage for internal state and the target going out of sync. For example, users may drop and recreate the flow but not the target, or drop and recreate the target but not the internal storage. Once they’re out of sync, the result is hard-to-debug issues. The gist is that a pipeline usually needs multiple backends, and keeping them in sync manually is error prone.

Continuous changes to a system introduce persistent pain in production. Every time a data flow is updated, the target schema must evolve alongside it, making this not a one-off tedious process but an ongoing source of friction.

In real-world data systems, new fields often need indexing, old ones get deprecated, and transformations evolve. If a type changes, the schema must adapt. These shifts magnify the complexity and underscore the need for more resilient, adaptable infrastructure.

Following the dataflow programming model, every step produces derived data all the way to the end. Indexing infrastructure requires data consistency between the indexing pipeline and the target stores, and the fewer loose ends there are, the easier and more robust it will be.

Our Vision: Declarative, Flow-Based Indexing

When we started CocoIndex, our vision was to let developers define data transformation and indexing logic declaratively, and have CocoIndex do the rest. Automatic schema setup is one big step toward that.

We’re committed to taking care of the underlying infrastructure so developers can focus on what matters: the data and the logic. We are serious when we say you can have a production-ready data pipeline for AI in ~100 lines of Python code.

If you’ve ever struggled with keeping your indexing logic and storage setup in sync — we’ve been there. Let us know what you’d love to see next.
