What's your cost matrix? How much does a false positive hurt? False negative?

I built a commercial system like that for Thermo Fisher, except that their product descriptions came in as natural-language text rather than vectors, which was an extra complication.

Some observations:

1. Crude methods based on vector embeddings, cosine similarity, Levenshtein distance, etc. don't work if you care at all about false positives.

I see sibling comments recommend this, but it cannot work, and it's clear why once you think about it. Values like "black" and "white", or "I" and "II" (part numbers), or "with" and "without", typically sit close together in such crude representations, yet they can be the only thing separating two products that are not interchangeable.
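To make the point concrete, here is a minimal sketch using a plain edit-distance similarity. The product descriptions are made up for illustration, and the `similarity` normalization (one minus distance over the longer length) is just one common choice, not anything specific to the system described here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical descriptions: each pair names two different products.
pairs = [
    ("pipette tips with filter, 200 uL", "pipette tips without filter, 200 uL"),
    ("centrifuge rotor type I", "centrifuge rotor type II"),
]
for a, b in pairs:
    print(f"{similarity(a, b):.2f}  {a!r} vs {b!r}")
```

Both pairs score above 0.9, so any threshold loose enough to catch real duplicates with typos would also merge these.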

2. A hybrid approach worked. The software produced suggestions for which products might be duplicates (along with a soft confidence score), then let a human domain expert accept or reject these suggestions. It also learned from these expert decisions as it went, to save human time.
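A rough sketch of what such a loop could look like, under heavy assumptions: the function names, the tuple layout, and especially the "learning" rule are hypothetical. Here learning is just memoizing the expert's verdict per (category, set of differing tokens), which is a toy stand-in for whatever model a real system would use.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Candidate = Tuple[str, str, str, float]  # (category, desc_a, desc_b, confidence)
DiffKey = Tuple[str, FrozenSet[str]]     # (category, tokens that differ)


def diff_key(category: str, a: str, b: str) -> DiffKey:
    """Key a pair by its category and the tokens present in only one side."""
    return category, frozenset(set(a.lower().split()) ^ set(b.lower().split()))


def review(candidates: List[Candidate],
           ask_expert: Callable[[str, str], bool]) -> Tuple[List[Tuple[str, str]], int]:
    """Show suggestions to the expert, most confident first, reusing past verdicts."""
    learned: Dict[DiffKey, bool] = {}
    duplicates: List[Tuple[str, str]] = []
    questions = 0
    for category, a, b, _score in sorted(candidates, key=lambda c: -c[3]):
        key = diff_key(category, a, b)
        if key not in learned:           # only bother the human for new cases
            learned[key] = ask_expert(a, b)
            questions += 1
        if learned[key]:
            duplicates.append((a, b))
    return duplicates, questions


# Made-up suggestions and a scripted stand-in for the domain expert.
candidates = [
    ("tips", "tips 200ul with filter", "tips 200ul without filter", 0.93),
    ("tips", "tips 10ul with filter", "tips 10ul without filter", 0.91),
    ("tubes", "tube 15ml sterile", "tube 15ml sterile pk", 0.88),
]
duplicates, questions = review(candidates, lambda a, b: "without" not in b)
print(duplicates, questions)
```

The key includes the category because the same differing word can matter in one category and not in another; the second "with / without" pair is resolved without asking the expert again.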

What I quickly learned is that even as a human (programmer with a PhD in ML), I could not look at two product descriptions and make the decision myself. Are these the same product or not? One word, even one letter, could be absolutely vital. Or absolutely irrelevant. Sometimes the very same attribute or word was vital in one product category and irrelevant in another.

Hence the final interactive solution with a domain expert in the middle. It worked well and saved time. It was rather clever, but not in the "hooray NN training" way. A lot of work went into normalizing the surface features intelligently based on context (units, hyphens / tokenization, typos, and so on), because that's a mess in product sheets. The "fancy" downstream ML and clustering part was relatively simple by comparison.
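For a flavor of the normalization work, here is a deliberately crude sketch covering only units and hyphens. The alias table and the rules are illustrative assumptions; a real pipeline would need to be context-aware (for example, a hyphen inside a part number should probably not be split), and typo handling is left out entirely.

```python
import re
from typing import List

# Hypothetical alias table: spelling variants mapped to one canonical unit.
UNIT_ALIASES = {
    "ul": "uL", "\u00b5l": "uL", "\u03bcl": "uL", "microliter": "uL",
    "ml": "mL", "milliliter": "mL",
}


def normalize(text: str) -> List[str]:
    """Lowercase, unify dashes, split numbers from units, canonicalize units."""
    text = text.lower()
    # Map typographic hyphens/dashes and the minus sign to ASCII "-".
    text = re.sub(r"[\u2010-\u2015\u2212]", "-", text)
    # Separate a number glued to a following word: "200ul" -> "200 ul".
    text = re.sub(r"(\d)([a-z\u00b5\u03bc]+)\b", r"\1 \2", text)
    # Tokenize on whitespace and common separators, hyphen included.
    tokens = re.split(r"[\s,;/-]+", text.strip())
    return [UNIT_ALIASES.get(t, t) for t in tokens if t]


print(normalize("Pipette-Tips, 200\u00b5L"))
print(normalize("pipette tips 200 ul"))
```

The two inputs above normalize to the same token list, which is the whole point: surface variation is removed before any similarity or clustering step sees the data.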

But YMMV: the Thermo Fisher products were fairly specialized and sophisticated, and there were millions of them.