Site Feedback

Duplicate Wines

Posted by Philip James, Apr 28, 2008.

A lot of people say this is the biggest issue with Snooth, so I'll post on this thread periodically to keep people updated on the work we are doing to reduce the problem. At this point we've matched and merged over 2 million wines.

Last week we caught 180,000 duplicates, then we changed a few algorithms and caught an additional 18,000. Today, we tried something new and picked up 60,160. We are checking them now for accuracy, but they look good so far.

I expect our progress to slow after this, but each of these runs makes a huge difference to what you see on the screen.

And don't forget that when you look at a nice wine page and see winemaker's notes, images, reviews and several stores selling the wine, each of the components of those pages came from a separate feed. Some wines are already the product of 50+ merges!

Luckily we're data geeks here, and love this stuff.
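To make the "one page, many feeds" point concrete, here is a minimal sketch of folding separate feed records into a single wine page. The `WinePage` class, its field names, and the dict-shaped feed records are all assumptions made up for illustration; this is not Snooth's actual data model or merge code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WinePage:
    """Hypothetical master record that a wine page is rendered from."""
    name: str
    notes: Optional[str] = None
    images: List[str] = field(default_factory=list)
    reviews: List[str] = field(default_factory=list)
    stores: List[str] = field(default_factory=list)
    merge_count: int = 0


def merge_feed_record(page, record):
    """Fold one feed record (a plain dict) into the master page."""
    # Keep the first winemaker's notes we see; later feeds do not overwrite them.
    if page.notes is None and record.get("notes"):
        page.notes = record["notes"]
    # List-like components accumulate, skipping items we already have.
    for key in ("images", "reviews", "stores"):
        existing = getattr(page, key)
        for item in record.get(key, []):
            if item not in existing:
                existing.append(item)
    page.merge_count += 1
    return page


page = WinePage("Example Winery 2005 Grenache")
merge_feed_record(page, {"notes": "Notes from the winery feed"})
merge_feed_record(page, {"stores": ["Store A"]})
merge_feed_record(page, {"stores": ["Store A", "Store B"], "notes": "Other notes"})
```

After those three calls the page carries the first set of notes, two distinct stores, and a merge count of 3, which is the sense in which a single page can be the product of many merges.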


Reply by Philip James, Apr 29, 2008.

OK, Justin's already talking about a new pair of algorithms which caught 1,000 and 7,600 dups respectively. Since they are new, I need to check these by hand first, so I need to brace myself for some Excel pain...

Reply by Philip James, May 2, 2008.

OK, this is really the post where I talk to myself, but anecdotally we're hearing the dup problem (whilst in no way removed) is much improved...

The bloggers and power users I've spoken to have noticed a clear drop in dups, which is great to hear. I'll continue to use this thread to announce further milestones reached here.

Reply by John Andrews, May 5, 2008.

Philip ... being the 'Loxton' guy here ... I think that Loxton, Loxton Wines & Loxton Cellars need to be 'merged'. The official name for the winery is Loxton Cellars.

Reply by Philip James, May 5, 2008.

Thanks, John. We'll get that fixed.

Reply by Philip James, May 7, 2008.

45,000 more wines caught today. We will be running the rules more frequently, as these numbers are just too high.

Reply by Philip James, May 14, 2008.

OK, we're never going to be done, but we're feeling more and more confident with the dedupe algorithms. The basic algorithms run daily and make sure that the 2,500 new wines we add daily are merged into the master records seamlessly. The funkier dedupe rules run weekly and we still check those by hand as we build confidence in how they work.
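For anyone curious what a "basic" daily rule could look like, here is a rough sketch, not the rules we actually run. It assumes each wine is a dict with `winery`, `vintage`, and `varietal` keys, and that the master records are a dict keyed on a normalized match key; the suffix list and all function names are hypothetical.

```python
import re

# Hypothetical list of trailing words that do not distinguish one winery from another.
WINERY_SUFFIXES = {"wines", "cellars", "winery", "vineyards"}


def normalize_winery(name):
    """Lowercase, drop punctuation, and strip trailing generic words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while len(words) > 1 and words[-1] in WINERY_SUFFIXES:
        words.pop()
    return " ".join(words)


def match_key(wine):
    """Wines that share this key are treated as the same wine."""
    return (
        normalize_winery(wine["winery"]),
        wine["vintage"],
        wine["varietal"].strip().lower(),
    )


def merge_new_wines(master, new_wines):
    """Attach each new wine to its master record; return how many were duplicates."""
    merged = 0
    for wine in new_wines:
        key = match_key(wine)
        if key in master:
            merged += 1  # an existing master record absorbs this wine
        master.setdefault(key, []).append(wine)
    return merged


master = {}
todays_wines = [
    {"winery": "Loxton", "vintage": 2005, "varietal": "Syrah"},
    {"winery": "Loxton Wines", "vintage": 2005, "varietal": "syrah "},
    {"winery": "Loxton Cellars", "vintage": 2005, "varietal": "Syrah"},
]
duplicates = merge_new_wines(master, todays_wines)
```

With this key, the three Loxton spellings above collapse into one master record. An exact-match rule like this is cheap enough to run daily; anything fuzzier risks false merges, which is why that kind of rule still gets checked by hand.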

There are still lots of dupes tucked away here and there, but this should be the best it's ever been.

Would love to hear any feedback if you have any reports on how the level of dupes has changed. Thanks.

Reply by joshparent, Jun 5, 2008.

I've noticed a drop in dupes, like you say, but I've also seen some dupes recently that are carbon copies of each other. Check out McPrice Myers 2005 Grenache.

Loving the site, hoping to help out.

Reply by Philip James, Jun 5, 2008.

Josh - we're still checking this, but what I think you found is pretty rare, and kind of cool. Basically, we update millions of prices every night, and then go and merge any newly created wines back to the master records. I think you found a few newly created wines that haven't been merged back to their parent. The merge program will probably finish in about 10 minutes and I think they'll have disappeared by that point.

Like I said, a pretty random find.

We'll be doing this merging on the fly eventually, so the problem will occur less in the future.

Reply by Philip James, Jun 5, 2008.

Ta Da!

It's pretty cool when I'm too excited to wait to see the results and it actually works like I promised. Phew.
