Saturday, March 18, 2023

Cleaning up music library

I have a library of music sitting on various CDs and hard drives. Some of these tracks are not available on any streaming service. Some tracks were ripped from CDs. Others were purchased or downloaded. Others were recorded directly. Some have DRM, and some have been ripped or downloaded multiple times. And there are various backups of different versions of these. It is all a big mess.

Google Music / YouTube Music

A good chunk of the music library was uploaded here. At one time there was a direct iTunes link. Other times it was done manually. There was also a limit to the number of files, so things were kept under that. (The number did change.) Files were also re-encoded to MP3 if they were not MP3. This could lead to some poor quality songs appearing as higher bitrate songs. The YouTube Music libraries were the backups for files that seem to have gone AWOL.

iTunes

Much of the library had been imported to iTunes. There were also purchases in iTunes, as well as podcast downloads. In addition, it was used to rip CDs. Older tracks purchased from the store had DRM. (Luckily, most of these were "free tracks" that were not that great.) In this case, I just deleted the .m4p and the video tracks rather than try to find a good source. The whole library did not fit on one of the old hard drives, so it was split into a few pieces. A new library has been somewhat cleaned up and built from scratch.

AudioGrabber/Real Jukebox

These were used to rip CDs back in the late 90s. Real Jukebox created Windows Media files and produced a good number of "garbage files". Audiograbber was better. I initially used the free version that only ripped some tracks. I later purchased the paid version to rip everything. A few years later, the paid version was made free. Oh well. Many of the ripped MP3s were burned onto CDs and copied to old hard drives. At the time, space was at a premium, so the quality is not all that great. Most of the songs ripped here have been re-ripped at a higher quality. However, it can be challenging to differentiate between the "re-ripped" vs. the "re-encoded from an old rip." Some of the original CDs have now become so scratched up (or destroyed) that a better rip cannot be made.

Other downloads and rips

Various other downloads are scattered about. Most have been incorporated into some of the libraries above. However, some have not. Some have appeared multiple times. Amazon Music and Freegal seem to be the worst at making things "different but same". The same track may have a slightly different title, or be a little longer or shorter. I also used Audacity to convert a lot of audiotapes and LP records to digital format. The quality of these can range from horribly muddy to really good. Included are tapes of DJing, mix tapes, spoken takes and all sorts of things.

The Cleanup

Prerequisite: get information about the song.

The mediainfo command works wonders. I installed it on the Mac and on Windows bash (using apt).

mediainfo --Output=JSON . >output.json

This gets all sorts of useful metadata for each song. Then I just need to clear out all the duplicates.
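A minimal Python sketch of reading that output back in. It assumes the JSON layout I have seen from mediainfo: one object per file with a "media" entry holding "@ref" (the path) and a list of "track" entries, where the track of "@type" "General" carries fields such as Performer, Title, Album, Duration, OverallBitRate, FileSize and Encoded_Date. The function name and the flattened dictionary keys are made up for illustration.

```python
import json

def load_tracks(json_text):
    """Return one flat dict per file from mediainfo's JSON output."""
    data = json.loads(json_text)
    # Assumption: a single file yields one object, several files yield a list.
    entries = data if isinstance(data, list) else [data]
    tracks = []
    for entry in entries:
        media = entry.get("media") or {}
        # The "General" track carries the file-level metadata.
        general = next(
            (t for t in media.get("track", []) if t.get("@type") == "General"),
            {},
        )
        tracks.append({
            "path": media.get("@ref", ""),
            "artist": general.get("Performer", ""),
            "title": general.get("Title", ""),
            "album": general.get("Album", ""),
            # Numbers arrive as strings; missing values become 0.
            "duration": float(general.get("Duration") or 0),
            "bitrate": int(float(general.get("OverallBitRate") or 0)),
            "size": int(general.get("FileSize") or 0),
            "encoded": general.get("Encoded_Date", ""),
        })
    return tracks
```

Something like `load_tracks(open("output.json").read())` would then give a list that is easy to sort and compare.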

I started with the current iTunes library. Then brought in all the old files to compare.

Quick clean:

I initially looked for any file with the same ARTIST, TITLE and ALBUM, with the same DURATION, BITRATE and SIZE. This is about as close as you can get to an exact match without taking a fingerprint. This would also catch tracks even if the filename changed. (This will often happen as files are imported into iTunes.)
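The quick clean can be sketched roughly like this in Python. The track dictionaries and their keys (artist, title, album, duration, bitrate, size, path) are hypothetical; they stand in for whatever gets pulled out of the mediainfo output.

```python
def exact_key(track):
    """Everything that must be identical for a 'quick clean' match."""
    return (track["artist"], track["title"], track["album"],
            track["duration"], track["bitrate"], track["size"])

def find_exact_duplicates(library, candidates):
    """Return the candidate tracks whose key already exists in the library."""
    known = {exact_key(t) for t in library}
    return [t for t in candidates if exact_key(t) in known]
```

The filename is deliberately left out of the key, so a renamed copy still matches.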

Account for metadata changes:

The first pass gets the exact matches. However, iTunes and other tools also make changes to the metadata that is stored. Sometimes album covers may be embedded or not. 

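The next step was to give the size a little leeway. I ended up allowing a difference of about 15,000. At first I only matched duplicates that were smaller than the source. At other times I only matched when the source was larger. Then I decided I didn't care which way the difference went.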
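A rough Python sketch of the size leeway. It assumes the 15,000 is in bytes (the unit mediainfo reports for file size) and uses hypothetical track dictionaries with artist, title, album, duration, bitrate and size keys.

```python
def same_within_size(a, b, leeway=15_000):
    """True when the tags and audio properties match and the sizes are close.

    The leeway is assumed to be in bytes. The difference is allowed in
    either direction, matching the final "didn't care" version.
    """
    fields = ("artist", "title", "album", "duration", "bitrate")
    return (all(a[f] == b[f] for f in fields)
            and abs(a["size"] - b["size"]) <= leeway)
```

An embedded album cover is usually much bigger than this, so a larger leeway would be needed to catch those cases.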

Clean up garbage

I still had a bunch of .m4p and .lqt tracks with DRM. I had tried and failed to extract the .lqt files with an old Windows 95 VM and the lqt player binary. (I even set the date back.) These are mostly easy-to-find tracks, and many were available elsewhere. The .m4p files were primarily free tracks from iTunes. I just deleted all of those, along with DRM-protected videos. There were also a bunch of .sid files that I tarred up. They may be interesting to listen to with a sidplayer, but not necessarily as part of the collection. I did a wholesale delete of news podcasts, but kept some more "timeless" ones. There were other random files that were also cleaned up.

The tricky part

Now comes the tricky part. There are still plenty of duplicate and junk files that need to be cleaned up. Some require some manual effort, and different algorithms. 

Manual removal of poor quality rips

Some tape rips were just awful, and not worth maintaining. I added them to my "to get later if I want" list. (Most of the music I cared about was already elsewhere.) There were some CD rips that had problems. Alas, these were often due to problems with the source CD, so a re-rip would not help. Noted and deleted.

Removal of old audiobooks

There were plenty of audiobooks I had already listened to. Many appeared in multiple places, with different encodings. (I had re-encoded them after speeding them up.) These were all cleaned up.

Fuzzy matching

Different tools seemed to mangle characters in different ways. Some limited titles to 30 characters. Others changed special characters. My workaround was to remove any non-alphanumeric characters and construct a key with just the uppercase versions of title, track and album. This helped catch some variations.
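The key construction might look something like this in Python. The function name is made up, and the sketch keeps only A-Z and 0-9, so accented letters are dropped entirely.

```python
import re

def fuzzy_key(*fields):
    """Uppercase each field and drop everything except A-Z and 0-9."""
    return "|".join(re.sub(r"[^A-Z0-9]", "", f.upper()) for f in fields)

# Two spellings of the same track end up with the same key.
print(fuzzy_key("Don't Stop!", "Greatest Hits") ==
      fuzzy_key("Dont stop", "greatest hits"))
```

The "|" separator keeps the fields apart so that text cannot slide from one field into the next and produce a false match.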

bitrate and duration flexibility

If the duplicate has a lower bitrate, it can be deleted, even if its duration differs from the source by a few seconds or a few percentage points. No need to hang on to all the poor quality re-rips.
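As a sketch in Python, with made-up thresholds (3 seconds and 2 percent are placeholders, not the values actually used) and hypothetical track dictionaries holding bitrate and duration:

```python
def is_inferior_copy(dup, source, max_seconds=3.0, max_percent=2.0):
    """True when dup has a lower bitrate and roughly the same length.

    Both thresholds are illustrative assumptions. The duration difference
    has to stay inside the absolute and the relative limit.
    """
    if dup["bitrate"] >= source["bitrate"]:
        return False
    diff = abs(dup["duration"] - source["duration"])
    return diff <= max_seconds and diff <= source["duration"] * max_percent / 100
```

This check only makes sense once the two files have already been matched on their tags.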

size compare

Find all files with the same bitrate and same duration. If they have the same file size or the same encode date, they are duplicates. This will catch files that have been renamed. I had to be conservative, because this could produce false matches between unrelated files that happened to have the same bitrate and duration.
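A small Python sketch of that grouping. The track dictionaries and their keys (path, bitrate, duration, size, encoded) are hypothetical, and an empty encode date is never treated as a match.

```python
from collections import defaultdict
from itertools import combinations

def renamed_duplicates(tracks):
    """Pair up files that share bitrate and duration plus size or encode date."""
    groups = defaultdict(list)
    for t in tracks:
        groups[(t["bitrate"], t["duration"])].append(t)
    pairs = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            same_size = a["size"] == b["size"]
            # Two missing dates should not count as "the same date".
            same_date = bool(a["encoded"]) and a["encoded"] == b["encoded"]
            if same_size or same_date:
                pairs.append((a["path"], b["path"]))
    return pairs
```

Tags are ignored completely here, which is why the results deserve a manual look before anything is deleted.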

encode date compare

Similar to above, but key off encode date. It seemed to have about the same results as above, since it was pretty much doing the same thing.

what's left?

  • A few .wav files. These were already converted to aac
  • Some .vqf files - some also in AAC
  • Messed up files - there were some .ogg and .mp3 files with no length. I deleted most of these small junk files
  • Files without ID3 tags. Some were in directories with album name, with Artist-Track as filename. Some magic regex extracted these names to find the match.
  • "and" and "the" - some have one or the other. Others have "&". Filtering these out resulted in more matches.
  • .m4b files - some had DRM. 
  • .midi files 
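For the untagged files stored as album directory plus Artist-Track filename, the extraction could look roughly like this in Python. The regex is a guess at the general idea, not the pattern actually used, and the function name is made up. It splits on the first "-" only, so a "-" later in the name stays in the title, and it strips only the last extension so periods elsewhere in the filename survive.

```python
import re
from pathlib import PurePosixPath

def guess_tags(path):
    """Guess album, artist and title from 'Album/Artist - Title.ext'."""
    p = PurePosixPath(path)
    stem = p.stem  # removes only the final extension
    m = re.match(r"\s*(.+?)\s*-\s*(.+?)\s*$", stem)
    artist, title = (m.group(1), m.group(2)) if m else ("", stem.strip())
    return {"album": p.parent.name, "artist": artist, "title": title}

print(guess_tags("Best Of/Mr. Big - To Be With You.mp3"))
```

An artist name that itself contains a "-" would still be split in the wrong place, so those remain a manual job.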

...and the manual part

  • Reripped foreign language courses: I had ripped them multiple times with different tools. The names were different each time, as were the bit rates and encoding method. No easy way to find the difference short of manually going through. There was also an older version of these sitting around.
  • Old Nena CD ripped without album name. (deleted)
  • Old low quality rips of Madness songs without titles (deleted)
  • Mr. Holland's Opus soundtrack ripped at low quality with no ID3 tags and filenames with "Various Artists" as the artist.
  • Files that had "CD 1" in one place and "Disc 1" in another. Added normalization of those for the key.
  • Some lectures that had no metadata (and multiple copies) - added metadata to one version, and deleted the others.
  • Truncated files. Some files were much, much shorter in one place. (It seems Google Music often did some of this mangling.) And to make it worse, these often had differences in titles.
  • "Various Artists" being listed as artist name (Freegal...)
  • Items that were tagged as "skips" and deleted. Need to find those.
  • Some songs match, but were a few seconds shorter and a much lower bit rate. I tweaked the comparison to delete some of this junk
  • There were some non-duplicate files that needed to be added to library
  • I needed to use a limit when splitting on "-" to catch songs with "-" in the name. The limit didn't quite work, but grabbing all the pieces and joining them back together got the job done. (This missed my original use case because of a typo, but caught some additional songs.)
  • Needed better stripping of extension to handle periods in filenames
  • Some tracks with lower bitrate, but a little longer than they should be (with the "source" tracks a little shorter than they should be). I did a sample listen, then allowed the "junk" to be up to 5 seconds longer.
  • Sapphire Bullets of Pure Love without "Pure"
  • Go West single as -Go West Single #1
  • Text "sixteen" vs 16
  • Best of Cream (from tape) vs Strange Brew: Very best of cream
  • Google Music re-encodes of AAC. The AAC has a slightly lower bitrate. Delete MP3s that have a slightly better bitrate than the AAC
  • There were a bunch of files without album names. I added a filter to ignore album name when filtering out. This can catch things you don't want to (like live vs. studio version), so it is used cautiously
  • Some more files in odd formats (like VQF)
  • Some with duration not showing up in metadata - but playing fine. Just doing manual
  • Some titles have "A", others don't. Filtering out.
  • Some have [instrumental] or other things. Filtering out.
  • Some more truncation (reduced to 18 character titles after filtering)
  • Name changes (such as "Fingerprints" to actual songs)
  • Bad song name in the source (fixed source name)
  • Songs not containing all metadata (no duration, so we do a format check instead)
  • Retried with "size" and "encode date" and no album matching. No album seemed to catch a few more.
  • A lot of songs had "VARIOUS" as the artist. Made a variation to look those up without artist
  • Parenthesis in songs differed (added option to strip parens)
  • Some dups are technically a little better, but the source has metadata updates (keep the source)
  • small songs - there are some "junk" versions of songs. Allow deleting of songs <30 seconds if another exists
  • Some songs really were unique and were copied over
  • Some had multiple artists in one place and a single artist in another. Did these manually.
  • Classical stuff in different languages, and with composer in different places. Just manual.
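Several of the tweaks in the list above ("CD 1" vs "Disc 1", stripping parentheses and brackets, dropping "A", "and", "the" and "&") can be folded into one normalization step before building the comparison key. A rough Python sketch, with a made-up function name and deliberately crude rules:

```python
import re

def normalize_title(text, strip_parens=True):
    """Reduce a title or album name to a loose comparison key."""
    s = text.upper()
    # Treat "CD 1" and "Disc 1" as the same thing.
    s = re.sub(r"\bCD\s*(\d+)", r"DISC \1", s)
    if strip_parens:
        # Drop "(...)" and "[...]" notes such as [instrumental].
        s = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", s)
    s = s.replace("&", " ")
    # Drop the small words that come and go between sources.
    s = re.sub(r"\b(A|AND|THE)\b", " ", s)
    return re.sub(r"[^A-Z0-9]", "", s)

print(normalize_title("Greatest Hits CD 1") ==
      normalize_title("Greatest Hits - Disc 1"))
```

Stripping parentheses is optional because it can merge a live or instrumental version with the studio one, so it is best saved for a later, more cautious pass.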

... and more strangeness with mediainfo

I loaded the few remaining songs into iTunes. One that I looked at had a comment indicating it was from mp3.com. But mediainfo did not show it. Mediainfo also identified it as a wav file. Very weird. I tried mpgtx, and it just barfed because it thought it was a .wav file. Strange.

and some other things

  • Some diacritical characters that are standard characters in other places (regex match)
  • Doctor to dr.
  • Google Music takeout songs had varying degrees of garbage. Some were truncated to various lengths. Others had artist and album swapped (may have been a source problem.) Other things were uploaded with "Track 1" as the name. Lots of mess to clean up.
  • There are some bizarre lengths on Google Music songs. I ended up looking some up to make sure I had the right one. There were some that were obviously super short junk. The longer ones were trickier. I had cases where I had the junk version in the library, while the good version was in Google Music.
  • Artist names that were slightly different, without track numbers.
  • Some takeout versions were much better than the versions I currently had in the library. This required some manual listening to find what I wanted to use.
  • There were a number of tracks that were ripped multiple times. I just figured anything with a bit rate of at least 128 was good.
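The diacritics and "Doctor" vs "dr." items above can be handled with a small folding step. This Python sketch uses Unicode decomposition rather than the regex match mentioned in the list, so it is a different technique aimed at the same result. The function name is made up.

```python
import re
import unicodedata

def fold_text(text):
    """Lowercase, strip accents, and reduce 'Doctor'/'Dr.' to 'dr'."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\b(?:doctor|dr)\b\.?", "dr", plain.lower())

print(fold_text("Doctor John") == fold_text("Dr. John"))
```

Letters that are not built from a base letter plus an accent (such as "ø" or "ß") pass through unchanged, so a few names would still need a manual mapping.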

And now what...

I went through the existing music library and added everything that was missing. There were actually a few cases where I was able to fix a "Bad" version with a better one. 

I had a second music library for "other stuff". This started out as just "books", but then I added "personal" audio and other things that I don't want cycling through with music, but want fairly accessible. 

Then there is the pile of "I don't need this very accessible" stuff, like the non-standard formats and super good versions.

I did a little Google listening to see if some of these tracks without names could be identified. Some could, some couldn't.

