HGFlyGirl t1_j2aen6d wrote on December 30, 2022 at 8:23 PM

I trained a model to find duplicate music files in my brother's huge collection of digital music. He was frustrated by so many duplicates that still had different file names, file sizes and tags. We couldn't find any existing software that could do it, because the tools we found all just looked for matches on those parameters. The model ended up working quite well.

iantimmis t1_j2bb6x0 wrote on December 31, 2022 at 12:02 AM

How did you set it up?

HGFlyGirl t1_j2cpl6v wrote on December 31, 2022 at 7:06 AM

For each pair of files, I took the length of each filename, the Levenshtein distance between the two filenames, each file's size in bytes and each file's duration in ticks.
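A rough sketch of how pair features like these could be computed, written in Python for illustration (the original was built with ML.NET). The `Track` record, its field names, and the order of the feature vector are assumptions made for this example, not details from the actual setup. It also assumes the Levenshtein distance is taken between the two filenames.

```python
from dataclasses import dataclass


@dataclass
class Track:
    # Hypothetical record; field names are illustrative only.
    filename: str
    size_bytes: int
    duration_ticks: int  # assumed to be a plain integer tick count


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pair_features(x: Track, y: Track) -> list:
    """One feature row for a candidate pair; the column order is an assumption."""
    return [
        len(x.filename), len(y.filename),
        levenshtein(x.filename, y.filename),
        x.size_bytes, y.size_bytes,
        x.duration_ticks, y.duration_ticks,
    ]
```

Each row would then be labelled duplicate or not-duplicate by hand for a sample of pairs, and the labelled rows passed to whatever binary classifier is being trained.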

I used the ML.NET AutoML API to train a binary classifier.

[deleted] t1_j2biofr wrote on December 31, 2022 at 12:56 AM

[removed]

Apprehensive_Maize_4 t1_j2du7wc wrote on December 31, 2022 at 3:01 PM

>duplicates

fdupes or any of the programs here didn't work for you?

https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/

HGFlyGirl t1_j2ewwry wrote on December 31, 2022 at 7:29 PM

Tried a few of these things. The problem was that a lot of the songs had been ripped from CDs using different software. So, some would be called things like track01.mp3 with a duplicate with a completely different file name. These could also have different byte lengths and durations. Then there are the ones that come from the original recording, the live version and/or the compilation album - which often differ a bit in all the parameters.
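To illustrate why exact-match tools miss these cases, here is a minimal Python sketch of content-hash duplicate detection, which is roughly the kind of comparison such tools rely on. The function names and the in-memory `files` mapping are made up for this example, and the byte strings merely stand in for real audio data.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """SHA-256 of the raw bytes; any single differing byte changes the digest."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: dict) -> list:
    """files maps a name to its raw bytes; returns groups of byte-identical names."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[content_digest(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Stand-in data: two byte-identical copies and one "re-rip" that differs slightly.
files = {
    "song.mp3": b"\x00\x01\x02\x03",
    "song (copy).mp3": b"\x00\x01\x02\x03",
    "track01.mp3": b"\x00\x01\x02\x04",
}
duplicates = group_exact_duplicates(files)
```

Here only the two byte-identical copies are grouped. The third entry, standing in for the same song ripped with different software, gets a different digest and is not reported, which is the gap a classifier over fuzzier features is meant to cover.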