Embedding Pipelines Are The New ETL

In the indexing phase, the chunked content is converted into vectors and stored in a suitable database, where it becomes available for semantic similarity searches. The conversion step, embedding, is carried out by a model trained specifically to turn text or other content into dense numerical representations that encode its meaning. Two chunks that express the same thought in different words produce vectors that lie close together in this mathematical space. Chunks that deal with different topics lie far apart.
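To make "close to each other" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are hand-made stand-ins chosen for illustration, not the output of a real embedding model, and the variable names are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings (illustrative only).
refund_policy = [0.9, 0.1, 0.0]      # "How do refunds work?"
money_back_rules = [0.8, 0.2, 0.1]   # same thought, different words
gpu_drivers = [0.0, 0.1, 0.9]        # unrelated topic

same_topic = cosine_similarity(refund_policy, money_back_rules)
other_topic = cosine_similarity(refund_policy, gpu_drivers)
```

With these toy values, `same_topic` comes out far higher than `other_topic`, which is the property a similarity search relies on.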

When a user asks a question, the system embeds it in the same way, finds the chunks whose vectors are closest and returns them as context for the model’s reasoning process. This differs from the load step of a classic ETL pipeline, but the discipline required is the same: in embedding pipelines, each chunk in the index must be labeled with the name and version of the embedding model that generated it. After all, embedding models evolve, and vectors produced by different versions are not reliably comparable.
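One way to enforce that labeling is to store the model name and version on every index record and to compare a query only against chunks from the same model. The sketch below assumes a simple in-memory list as the index. The `IndexedChunk` type, the `search` function and the model labels are hypothetical names used for illustration only.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    vector: tuple
    model_name: str     # hypothetical label, e.g. "example-embedder"
    model_version: str  # hypothetical label, e.g. "v1"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(index, query_vector, model_name, model_version, k=3):
    """Return the k nearest chunks, comparing only vectors from the given model."""
    comparable = [
        c for c in index
        if c.model_name == model_name and c.model_version == model_version
    ]
    comparable.sort(key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return comparable[:k]

index = [
    IndexedChunk("doc1#0", (0.9, 0.1), "example-embedder", "v1"),
    IndexedChunk("doc2#0", (0.1, 0.9), "example-embedder", "v1"),
    IndexedChunk("doc3#0", (0.9, 0.1), "example-embedder", "v2"),
]

# The query was embedded with v1, so the v2 chunk is never compared against it.
results = search(index, (1.0, 0.0), "example-embedder", "v1", k=2)
```

Production vector stores usually express the same idea as a metadata filter on the query. The point of the sketch is only that the label has to exist on every record for such a filter to be possible.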

This exact problem occurs when embedding models in the pipeline are updated without a proper migration plan. Vectors from different generations end up mixed together in the same index, and search quality deteriorates. The tricky part is that this happens quietly, often in the form of subtly wrong answers. When upgrading an embedding model, I proceed as I would with a schema migration: I plan explicitly, carry out the change in one go, and validate the retrieval quality using a representative query set. After all, there is as much at stake here as with any fundamental change to the data model.
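A simple guard against mixed generations is to count the model labels present in the index and treat anything other than the target version as an unfinished migration. This is a sketch under the assumption that every chunk carries a `(model_name, model_version)` label. The `migration_report` function and the labels are hypothetical.

```python
from collections import Counter

def migration_report(chunk_labels, target):
    """Summarize which embedding generations are present in an index.

    chunk_labels: iterable of (model_name, model_version) tuples, one per chunk.
    target: the (model_name, model_version) the index should be migrated to.
    """
    counts = Counter(chunk_labels)
    leftover = {label: n for label, n in counts.items() if label != target}
    migrated = counts.get(target, 0)
    return {
        "migrated": migrated,
        "leftover": leftover,
        # Complete only if the target is present and nothing else remains.
        "complete": migrated > 0 and not leftover,
    }

labels = [
    ("example-embedder", "v2"),
    ("example-embedder", "v2"),
    ("example-embedder", "v1"),  # a chunk that was missed during the upgrade
]
report = migration_report(labels, target=("example-embedder", "v2"))
```

Here `report["complete"]` is `False` because one chunk still carries the old version. A check like this can gate the cutover before the retrieval quality validation runs.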

Pipeline observability is not optional

Once an embedding pipeline is running in production, the question is no longer whether it runs, but whether it runs correctly. This matters more here than with most other pipelines because errors are less noticeable: the index looks perfect and queries return without errors, yet the system still provides incorrect answers. That continues until someone notices that the AI is not delivering any useful value.

That’s why observability discipline is also needed at this point. As soon as you treat embedding pipelines as production systems, you no longer think in terms of isolated steps, but rather in terms of signals. For example, the number of chunks per document becomes a simple but powerful health check: a sudden drop is usually not a model problem, but a sign of disrupted data collection or upstream parsing errors.
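The chunks-per-document check can be as small as a comparison between the previous run and the current one. The following sketch flags documents whose chunk count fell by more than a configurable fraction. The function name, the 50 percent default and the sample counts are assumptions for illustration.

```python
def chunk_count_alerts(baseline, current, max_drop=0.5):
    """Flag documents whose chunk count dropped by more than max_drop.

    baseline, current: dicts mapping document id -> number of chunks.
    Returns a list of (doc_id, expected, observed) tuples.
    """
    alerts = []
    for doc_id, expected in baseline.items():
        observed = current.get(doc_id, 0)  # a missing document counts as zero chunks
        if expected > 0 and observed < expected * (1 - max_drop):
            alerts.append((doc_id, expected, observed))
    return alerts

baseline = {"handbook.pdf": 10, "pricing.md": 8, "faq.html": 4}
current = {"handbook.pdf": 9, "pricing.md": 1}  # faq.html vanished upstream

alerts = chunk_count_alerts(baseline, current)
```

A small fluctuation such as 10 to 9 chunks passes, while the collapse of `pricing.md` and the disappearance of `faq.html` are both reported. These are the cases that point to parsing or ingestion problems rather than to the model.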

You also need a “golden set” of queries with verifiably correct outputs. Run after each pipeline change, it acts as a kind of data quality check and reveals regressions that never appear as explicit errors. You can also track lineage to find out which version of the embedding model created which chunks and when each document was last read. This makes it possible to attribute query problems to specific changes rather than simply guessing.
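A golden set check can be sketched as a hit rate: for each query, did at least one of the known-correct chunks appear in the top k results? The code below assumes the retriever is any callable that takes a query and `k` and returns chunk ids. `hit_rate_at_k`, `fake_retrieve` and all ids are hypothetical stand-ins.

```python
def hit_rate_at_k(golden_set, retrieve, k=5):
    """Fraction of golden queries with at least one expected chunk in the top k."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    hits = 0
    for query, expected_ids in golden_set.items():
        returned = set(retrieve(query, k))
        if returned & set(expected_ids):
            hits += 1
    return hits / len(golden_set)

golden_set = {
    "how do I reset my password": {"kb-12#3"},
    "what is the refund window": {"kb-40#1", "kb-40#2"},
}

def fake_retrieve(query, k):
    # Stand-in for the real retriever: canned results for the two golden queries.
    canned = {
        "how do I reset my password": ["kb-12#3", "kb-07#0"],
        "what is the refund window": ["kb-99#0", "kb-13#4"],
    }
    return canned[query][:k]

score = hit_rate_at_k(golden_set, fake_retrieve, k=2)
```

With the canned results, one of two queries hits, so `score` is 0.5. In practice the score would be compared against a threshold agreed for the pipeline and recorded together with the embedding model version, so that a drop can be traced back to a specific change.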

Finally, data timeliness becomes a first-class signal. If documents become outdated beyond an acceptable threshold, this should be visible in monitoring before users are given poor results. The metric that brings it all together is Retrieval Quality over Time. This should be treated like any other pipeline SLA: it must be measured, tracked and owned. (fm)
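As a rough illustration of the timeliness signal, the sketch below lists documents whose last read time is older than an allowed age. The function name, the 30-day threshold and the timestamps are assumptions, and the timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

def stale_documents(last_read, max_age, now=None):
    """Return ids of documents last read longer ago than max_age.

    last_read: dict mapping document id -> timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        doc_id for doc_id, read_at in last_read.items()
        if now - read_at > max_age
    )

# Fixed clock and sample timestamps, for illustration only.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_read = {
    "handbook.pdf": datetime(2024, 5, 30, tzinfo=timezone.utc),
    "pricing.md": datetime(2024, 3, 1, tzinfo=timezone.utc),
}

stale = stale_documents(last_read, max_age=timedelta(days=30), now=now)
```

Only `pricing.md` exceeds the 30-day threshold here. Emitting the size of this list as a metric makes outdated content visible in monitoring rather than in user complaints.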

This article was published as part of the English-language Expert Contributor Network run by Foundry. All information about the German expert network can be found here.