Detecting duplicate entities at scale is challenging due to the explosion in pairwise comparisons. A naive approach to matching n items against each other requires O(n²) comparisons. The goal is to prune the search space while maintaining high recall (catching most true duplicates). Modern de-duplication pipelines therefore use a candidate generation phase (via blocking keys, hashing, retrieval, etc.) to compare only records that are likely matches. In this article, we discuss standard and advanced blocking methods, probabilistic hashing (LSH), sparse vs. dense vector similarity filtering, and multimodal (text+image) matching. The emphasis is on achieving high recall on multilingual, multimodal data by combining different strategies.
Blocking Strategies for Candidate Reduction
Blocking is a canonical technique that groups records into blocks based on some key so that only items within the same block are compared. Effective blocking dramatically cuts comparisons while still grouping true duplicates together. Several blocking strategies can be applied, often in multiple passes, to improve recall:
Standard Blocking: Define a blocking key from record attributes (e.g. first 3 characters of product name, normalized brand, etc.). Records sharing the same key go into the same block, and only those are compared. This approach leverages the principle that duplicate records typically share common tokens or features. For example, blocking by the first three letters of the title would group products starting with “Sam” together (likely Samsung items). Standard blocking is simple and fast, but a single blocking key can miss duplicates that differ in that key (low recall if data is dirty or multilingual).
Multi-Pass and Hierarchical Blocking: To improve recall, run multiple blocking passes with different keys and take the union of candidate pairs (a minimal multi-pass sketch follows this list). For example, one pass might block by brand, another by product model, etc., so that if a duplicate pair is missed by one blocking key, it can still be caught by another. After multiple passes, all candidate pairs are merged and a transitive closure is performed to cluster duplicates. Hierarchical blocking goes further by using a sequence of keys in a hierarchy (coarse to fine). For instance, one might first block by broad category (e.g. electronics vs. clothing), then within each category block by brand, then by a finer key like model number. This layered blocking ensures only very similar items are finally compared. Learned blocking rules also combine multiple predicates (even across fields) to create conjunctive blocks (e.g. same city AND same zip code) for higher precision, as well as disjunctive combinations to widen recall.
Canopy Clustering: This is a soft blocking technique that uses an inexpensive similarity measure to create overlapping clusters called canopies. All items within a canopy are considered potential matches. For example, a cheap metric like TF–IDF cosine similarity or shared tokens can group products roughly. Each record can fall into multiple canopies, which preserves recall (the overlap means there is no hard cutoff). A more expensive similarity (e.g. edit distance or an ML model) is then computed within each canopy to make the final decisions. The canopy approach (McCallum et al.) trades some extra candidates for significantly improved recall and speed, and has been shown to outperform traditional single-pass blocking in many cases. Canopy methods can be useful for multilingual data, where an inexpensive comparison (like shared numeric tokens or very common words) can group items across languages before a finer comparison.
Sorted Neighborhood & Windowing: Another classic approach is to sort records by some key and then compare only records that appear within a sliding window of size W in the sorted order. This assumes duplicates will sort near each other. The multi-pass sorted neighborhood method runs this several times with different sort keys. This reduces comparisons from O(n²) to roughly O(n·W), but choosing good sort keys is crucial and challenging.
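To make the multi-pass idea concrete, here is a minimal Python sketch (the records, fields, and key functions are illustrative, not taken from any particular system): each pass blocks on a different key, and the candidate set is the union of within-block pairs from all passes.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; the "title"/"brand" fields and the key functions are illustrative.
records = [
    {"id": 1, "title": "Samsung Galaxy S21 128GB", "brand": "Samsung"},
    {"id": 2, "title": "Galaxy S21 (128 GB) by Samsung", "brand": "samsung"},
    {"id": 3, "title": "Apple iPhone 13", "brand": "Apple"},
]

# Each pass uses a different blocking key; duplicates missed by one key
# can still be caught by another.
blocking_keys = [
    lambda r: r["title"][:3].lower(),       # prefix of the title
    lambda r: r["brand"].strip().lower(),   # normalized brand
]

candidate_pairs = set()
for key_fn in blocking_keys:
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["id"])
    # Only records sharing a block key are compared.
    for ids in blocks.values():
        candidate_pairs.update(combinations(sorted(ids), 2))

print(candidate_pairs)   # e.g. {(1, 2)}: far fewer than all n*(n-1)/2 pairs
```

In this toy example the title-prefix pass misses the Samsung pair, but the normalized-brand pass recovers it, which is exactly why multiple passes improve recall.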
Modern blocking pipelines often learn the optimal combination of blocking rules from labeled examples. For instance, the Dedupe library tries many predicate functions (prefixes, n-grams, etc.) on fields and uses a greedy set-cover algorithm to select a small set of blocking rules that covers all training duplicate pairs while minimizing comparisons. This ensures high recall on known duplicates with a minimal number of blocks. Overall, blocking is a critical first step to tame quadratic complexity, and applied in a multi-pass fashion it can provide a high-recall candidate set.
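The greedy set-cover idea behind such learned blocking can be sketched in a few lines. This is a simplified illustration under assumed inputs (labeled duplicate pairs as tuples of record dicts, plus a handful of candidate predicate functions), not the Dedupe library's actual implementation; it only maximizes coverage of known duplicates and ignores block-size penalties.

```python
# Simplified greedy set cover over blocking predicates.

def covers(predicate, pair):
    """A duplicate pair is covered if both records produce the same predicate value."""
    a, b = pair
    return predicate(a) == predicate(b)

def learn_blocking_rules(predicates, duplicate_pairs):
    uncovered = set(range(len(duplicate_pairs)))
    chosen = []
    while uncovered:
        # Pick the predicate that covers the most still-uncovered duplicate pairs.
        best = max(
            predicates,
            key=lambda p: sum(1 for i in uncovered if covers(p, duplicate_pairs[i])),
        )
        newly = {i for i in uncovered if covers(best, duplicate_pairs[i])}
        if not newly:   # remaining duplicates cannot be covered by any predicate
            break
        chosen.append(best)
        uncovered -= newly
    return chosen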
Probabilistic Candidate Generation (LSH)
Locality Sensitive Hashing (LSH) is a probabilistic technique that hashes items so that similar items land in the same bucket with high probability. Think of hashing as sorting mail into postal bins: LSH creates “smart bins” where similar records naturally end up together, even when they are not exactly identical. It can be seen as a generalization of blocking where the “blocking keys” are derived from random projections or min-wise hashing of features rather than explicit attributes. LSH is especially useful for high-dimensional representations (text shingles, embeddings, etc.) where direct blocking is hard. The key idea: use multiple hash functions from an LSH family such that true duplicates collide in at least one hash with high probability. Each hash function focuses on different aspects of the data: one might group by first tokens, another by last tokens, and a third by structural patterns. This gives similar items multiple opportunities to be bucketed together. Items that collide frequently across different hash functions become high-confidence candidates, while those that rarely collide are likely dissimilar. The figures below illustrate this principle: “Dr. John Smith” and “John Smith MD” collide in all three hash functions because they share multiple tokens, making them the highest-confidence candidate pair with 3 collisions. Meanwhile, unrelated items like “Central Hospital” and “Dr. John Smith” rarely collide, naturally filtering out unlikely matches. Collisions then form candidate pairs for detailed comparison.
Common LSH variants include MinHash for Jaccard similarity (e.g. on sets of title tokens) and SimHash (a form of signed random projection for cosine similarity) for textual or vector data. For example, min-hashing can compress a product description’s set of words into a short signature; identical or nearly identical descriptions will have matching signatures with high probability. A banding technique can further group similar signatures. In industry deployments, LSH (MinHash on spatial or textual features) has turned an N² comparison that took multiple days into a Spark job of a few hours, an order-of-magnitude speedup. This illustrates the power of LSH to trade a tiny drop in precision for massive gains in scalability.
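A pure-Python sketch of MinHash signatures with banding follows (character 3-gram shingles; hash functions simulated by XOR masks; the number of hashes and bands is arbitrary, not tuned):

```python
import random

random.seed(42)
NUM_HASHES, BANDS = 32, 8            # 8 bands of 4 rows each
ROWS = NUM_HASHES // BANDS
MASKS = [random.getrandbits(64) for _ in range(NUM_HASHES)]

def shingles(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(tokens):
    # One minimum per simulated hash function; similar sets agree on many positions.
    # (Python's string hash is randomized per process but consistent within a run.)
    return [min(hash(tok) ^ mask for tok in tokens) for mask in MASKS]

def band_buckets(signature):
    # Banding: items sharing any full band land in the same bucket and become candidates.
    return [hash(tuple(signature[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]

a = band_buckets(minhash_signature(shingles("Samsung Galaxy S21 128GB")))
b = band_buckets(minhash_signature(shingles("Samsung Galaxy S21, 128 GB")))
print(sum(x == y for x, y in zip(a, b)))   # colliding bands; > 0 means candidate pair
```

Items sharing at least one band bucket become candidates; more, shorter bands make the filter more permissive, raising recall at the cost of more candidates.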
Other probabilistic filtering techniques include bloom-filter-based blocking (hashing tokens to bit arrays to quickly pre-filter candidates) and sketches (e.g. using a Count-Min Sketch to count rare combinations). Hashed Dynamic Blocking (HDB) is a good example: it dynamically learns rare blocking keys (combinations of attribute values that are highly indicative of matches) and uses LSH to generalize those keys for large databases. HDB uses a Count-Min Sketch to approximate block sizes without expensive data shuffling: when checking whether a block containing “surname=Smith” is oversized, the sketch can quickly estimate that it holds ~50,000 records, flagging it for intersection with other large blocks rather than requiring costly exact counting. HDB might then discover that while “surname=Smith” alone creates massive blocks, the intersection “surname=Smith AND first_name=John AND birth_year=1985” occurs in only 12 records, making it an excellent blocking key that is both selective and computationally feasible. The LSH component provides a tunable knob between recall and precision in blocking for huge datasets.
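To illustrate how sketch-based block-size estimation could work in principle, here is a toy Count-Min Sketch (not HDB's actual implementation; the width and depth values are arbitrary):

```python
import random

class CountMinSketch:
    """Approximate counter: may overestimate slightly, never underestimates."""
    def __init__(self, width=2048, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        return [(row, (hash(key) ^ seed) % self.width)
                for row, seed in enumerate(self.seeds)]

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

# Stream blocking keys through the sketch, then flag oversized blocks
# without keeping exact per-key counts.
cms = CountMinSketch()
for key in ["surname=Smith"] * 50_000 + ["surname=Nguyen"] * 40:
    cms.add(key)
print(cms.estimate("surname=Smith") > 10_000)   # True: block too big to compare directly
```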
In summary, probabilistic methods like LSH greatly reduce pairwise computations by indexing items into buckets based on similarity-preserving hash functions. Unlike strict blocking, LSH can capture fuzzy matches and allows an explicit trade-off: more hash tables or more permissive LSH parameters increase recall (catch more pairs) at the cost of more candidates. These methods have become standard in industry for near-duplicate detection, often integrated into big-data platforms (e.g. Spark’s MLlib includes MinHashLSH and random projection LSH classes).
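As a usage sketch of the Spark MinHashLSH class mentioned above (the toy DataFrame and the 0.8 Jaccard-distance threshold are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import Tokenizer, HashingTF, MinHashLSH

spark = SparkSession.builder.appName("dedup-candidates").getOrCreate()
df = spark.createDataFrame(
    [(0, "samsung galaxy s21 128gb"),
     (1, "galaxy s21 128 gb samsung"),
     (2, "apple iphone 13")],
    ["id", "title"],
)

# Tokenize titles and hash the tokens into sparse feature vectors.
tokens = Tokenizer(inputCol="title", outputCol="tokens").transform(df)
feats = HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 18).transform(tokens)

# MinHash signatures; more hash tables means higher recall but more candidates.
model = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5).fit(feats)

# Self-join keeping pairs whose estimated Jaccard distance is below the threshold.
pairs = (model.approxSimilarityJoin(feats, feats, 0.8, distCol="jaccard_dist")
              .filter(col("datasetA.id") < col("datasetB.id"))
              .select(col("datasetA.id").alias("id_a"),
                      col("datasetB.id").alias("id_b"),
                      "jaccard_dist"))
pairs.show()
```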
Sparse vs. Dense Vector Similarity Filtering
Another major approach treats duplicate detection as an information retrieval problem, similar to how search engines work. Instead of comparing every record against every other record, the system searches for the most similar records to each item. This significantly reduces computational load while naturally handling multilingual data.
Sparse Text Retrieval
Traditional information retrieval methods like TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 convert text into sparse vectors: mathematical representations where most values are zero, with non-zero values only for the words present in the text. These systems build inverted indexes (like a book’s index), so a query only touches documents that share terms with it.
For academic papers, you might index all publication titles and abstracts, then search for the top-10 most similar papers when checking for duplicates. Instead of comparing against millions of papers, you examine only the most promising candidates. This works well because duplicates often share specific identifying keywords like: journal names, author surnames, medical diagnostic codes, technical terminology, etc.
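A minimal sketch of this retrieve-top-k pattern using TF-IDF cosine similarity (scikit-learn stands in here for a production inverted index such as an Elasticsearch/BM25 deployment; the titles and k are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

titles = [
    "Mobile phone Samsung Galaxy S21 128GB",
    "Samsung Galaxy S21 smartphone, 128 GB",
    "Cotton T-shirt, blue, size M",
]

vectorizer = TfidfVectorizer(analyzer="word")
tfidf = vectorizer.fit_transform(titles)     # sparse document-term matrix

# For each record, retrieve the top-k most similar records instead of
# comparing against the whole collection.
k = 2
scores = linear_kernel(tfidf, tfidf)         # cosine similarity (rows are L2-normalized)
for i, row in enumerate(scores):
    neighbors = row.argsort()[::-1][1:k + 1] # skip the self-match at position 0
    print(i, [(int(j), round(float(row[j]), 2)) for j in neighbors])
```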
The approach leverages decades of search engine optimizations, making it both fast and scalable. However, sparse methods have a fundamental limitation: they only capture lexical similarity through exact word matches. When patient records use different terms like “hypertension” versus “high blood pressure,” these systems fail to recognize the semantic connection. To improve recall for multilingual data, systems can expand queries with translations or use cross-lingual indexing, but they still can’t understand that “手机” (Chinese) and “mobile phone” (English) mean the same thing without explicit dictionaries.
Dense Embeddings and ANN Search
Dense vector embeddings represent a paradigm shift. Neural networks, particularly language models like BERT and RoBERTa, convert text into dense vectors where every position carries meaningful information about semantic content. Unlike sparse vectors that count words, dense embeddings capture meaning, context, and relationships between concepts.
The breakthrough comes with multilingual models such as mBERT, XLM-R, and LaBSE, which map text from different languages into a shared mathematical space. A German patient record about cardiac procedures and its English equivalent can become nearby vectors despite sharing no common words, because the model has learned that “Herzinfarkt” and “heart attack” represent the same medical concept.
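A small sketch of cross-lingual embedding with a multilingual sentence encoder (assuming the sentence-transformers package and the LaBSE checkpoint name below; any multilingual model can be swapped in):

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE maps sentences from 100+ languages into one shared vector space.
model = SentenceTransformer("sentence-transformers/LaBSE")

texts = ["heart attack", "Herzinfarkt", "mobile phone", "手机"]
embeddings = model.encode(texts, normalize_embeddings=True)

# Cosine similarity between cross-lingual equivalents should be high.
sims = util.cos_sim(embeddings, embeddings)
print(float(sims[0][1]), float(sims[2][3]))   # de/en and zh/en pairs
```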
Finding similar records among millions of dense vectors requires Approximate Nearest Neighbor (ANN) search algorithms. HNSW (Hierarchical Navigable Small World) builds graph structures in which similar vectors connect to each other, enabling rapid navigation with close to 100% recall in sub-linear time. Faiss offers alternative approaches like IVF (Inverted File), which partitions the vector space into clusters and searches only the relevant ones, making it feasible to index 100M+ embeddings with query times in the tens of milliseconds.
For financial institutions, separate Faiss indexes might contain embeddings for transaction descriptions and merchant categories, each handling millions of vectors. When investigating suspicious transactions, the system finds the top-k most semantically similar transactions. The critical aspect is setting appropriate values for k (number of neighbors) and similarity thresholds. If k is too low, you miss duplicates (lower recall); if too high, you include distant non-duplicates (more comparisons).
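A minimal Faiss sketch of top-k candidate retrieval with a cosine-similarity threshold (the dimensionality, k, and threshold are placeholders; the vectors here are random, so few pairs will actually pass the threshold):

```python
import numpy as np
import faiss

d = 384                                 # embedding dimensionality (illustrative)
rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, d)).astype("float32")
faiss.normalize_L2(xb)                  # normalize so inner product equals cosine similarity

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(xb)

# Top-k neighbors per record, keeping only pairs above a similarity threshold.
k, threshold = 10, 0.8
sims, ids = index.search(xb[:1000], k)  # search a batch of query vectors
candidates = [
    (q, int(ids[q][j]), float(sims[q][j]))
    for q in range(sims.shape[0])
    for j in range(k)
    if ids[q][j] != q and sims[q][j] >= threshold
]
print(len(candidates))
```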
This approach transforms the matching problem into a search problem: given an item’s vector representation, find the most similar vectors in the database instead of comparing to every item. The result is a system that understands semantic similarity across languages while maintaining computational efficiency for large-scale operations.
Combining Strengths: Hybrid Approaches
Real-world systems increasingly combine both methodologies. A hybrid approach might use BM25 keyword matching to gather a few hundred candidates sharing basic terms, then apply embedding similarity to select the most semantically relevant ones. For healthcare records, this means first finding patients with shared diagnostic codes through keyword matching, then using semantic analysis to identify paraphrased symptoms. Dense vectors enable additional filtering through cosine similarity thresholds (e.g., >0.8) to prune obvious non-matches before expensive comparisons. The union strategy combines results from multiple methods such as text embeddings, image embeddings, and keyword matching, leveraging their complementary strengths. The same idea extends to multimodal entities: start with LSH on text attributes, then use cosine similarity over image embeddings to narrow the candidates, and reserve the final classifier, the most expensive and most accurate stage, for the pairs that remain.
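A compact sketch of such a two-stage hybrid (a TF-IDF shortlist followed by an embedding threshold; the model name, shortlist size, and 0.8 threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sentence_transformers import SentenceTransformer, util

records = [
    "Samsung Galaxy S21 smartphone 128GB",
    "Galaxy S21 mobile phone by Samsung, 128 GB",
    "Samsung washing machine 7kg",
]

# Stage 1: cheap lexical retrieval gathers a shortlist of candidates per record.
tfidf = TfidfVectorizer().fit_transform(records)
lexical = linear_kernel(tfidf, tfidf)
shortlist = {i: np.argsort(lexical[i])[::-1][1:51] for i in range(len(records))}  # top-50

# Stage 2: dense embeddings prune the shortlist with a semantic threshold.
model = SentenceTransformer("sentence-transformers/LaBSE")
emb = model.encode(records, normalize_embeddings=True)
final_pairs = [
    (i, int(j))
    for i, cands in shortlist.items()
    for j in cands
    if i < j and float(util.cos_sim(emb[i], emb[int(j)])) > 0.8
]
print(final_pairs)   # survivors go on to the expensive pairwise matcher
```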
Conclusion
Modern duplicate detection at scale requires combining complementary techniques to achieve both high recall and computational efficiency. No single method excels across all scenarios: traditional blocking provides fast candidate generation but struggles with multilingual data, dense embeddings enable cross-lingual understanding at higher computational cost, and multimodal signals capture duplicates that text-only methods miss. The most effective systems leverage hybrid pipelines that union results from sparse retrieval, dense embeddings, and multimodal signals, achieving high recall while reducing comparisons from quadratic to sub-quadratic complexity. As data grows in volume and diversity, success lies not in choosing between these techniques, but in orchestrating them effectively to handle the full complexity of modern deduplication challenges across languages, domains, and data modalities.
References:
- Elmagarmid, A., Ipeirotis, P., & Verykios, V. Duplicate Record Detection: A Survey. IEEE TKDE, 19(1). https://www.cs.purdue.edu/homes/ake/pub/survey2.pdf
- Borthwick, A. et al. (2020). Scalable Blocking for Very Large Databases. Amazon Science. https://assets.amazon.science/9d/b5/d7abad814e5fba5cc03806900fd0/scalable-blocking-for-very-large-databases.pdf
- Dhonzi, T. et al. (2016). Duplicate detection in Web shops using LSH. SAC ’16. https://personal.eur.nl/frasincar/papers/SAC2016/sac2016.pdf
- Zeakis, A. et al. (2023). Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. PVLDB 16(9). https://www.vldb.org/pvldb/vol16/p2225-skoutas.pdf
- McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching (canopy clustering). KDD '00.
- Dedupe.io Documentation (2024). Blocking and Predicates. https://docs.dedupe.io/en/latest/how-it-works/Making-smart-comparisons.html
- Coupang Engineering Blog (2022). Matching duplicate items to improve catalog quality. https://medium.com/coupang-engineering/matching-duplicate-items-to-improve-catalog-quality-ca4abc827f94
- Databricks Engineering Blog (2017). Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering. https://www.databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html
- Peeters, R., & Bizer, C. (2022). Cross-Language Learning for Product Matching. WWW ’22 Companion. https://arxiv.org/pdf/2110.03338