Submitted by Lintaar t3_zypgdt in MachineLearning
HGFlyGirl t1_j2aen6d wrote
I trained a model to find duplicate music files in my brother's huge collection of digital music. He was frustrated by so many duplicates that still had different file names, file sizes and tags. We couldn't find any existing software that could do it - because they were all just looking for matches on those parameters. The model ended up working quite well.
iantimmis t1_j2bb6x0 wrote
How did you set it up?
HGFlyGirl t1_j2cpl6v wrote
For each pair of files, I took the length of each filename, the Levenshtein distance between the two filenames, each file's size in bytes and each file's duration in ticks.
I used the ML.NET AutoML API to train a binary classifier.
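As a rough illustration of the pair features described above (not the original ML.NET code), here is a minimal Python sketch. The `TrackInfo` record, its field names, and the ordering of the feature vector are assumptions made for the example; the post only names the features themselves. The resulting rows, each labeled duplicate or not, are the kind of input a binary classifier would be trained on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackInfo:
    # Hypothetical record; field names are assumptions, not from the post.
    filename: str
    size_bytes: int
    duration_ticks: int


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def pair_features(x: TrackInfo, y: TrackInfo) -> List[int]:
    """Build one feature row for a pair of files (ordering is an assumption)."""
    return [
        len(x.filename),
        len(y.filename),
        levenshtein(x.filename, y.filename),
        x.size_bytes,
        y.size_bytes,
        x.duration_ticks,
        y.duration_ticks,
    ]
```

Calling `pair_features` on every candidate pair gives one numeric row per pair, with no direct comparison of audio content, which matches the filename, size and duration signals mentioned in the comment.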
Apprehensive_Maize_4 t1_j2du7wc wrote
>duplicates
fdupes or any of the programs here didn't work for you?
https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/
HGFlyGirl t1_j2ewwry wrote
Tried a few of these things. The problem was that a lot of the songs had been ripped from CDs using different software. So, some would be called things like track01.mp3, with a duplicate under a completely different file name. These could also have different byte lengths and durations. Then there are the ones that come from the original recording, the live version and/or the compilation album, which often differ a bit in all the parameters.