"Look out honey, 'cause I'm using technology..."


The Musical Gardener's Tools #4: Lazyweb, lazyweb on the wall...

..who is the smartestest wgetter of them all?

I need a little help here. As I described in an earlier post, one of my sources for new music is wget, in combination with an ever-growing list of mp3 blog URLs. The ever-growing part is now slowly becoming a problem. I ran my update script yesterday evening and it took well over 12 hours to complete. (Mind you, I have fiber optics to the door; speed is not an issue, at least not at my end.) That is unacceptable in terms of energy wasted. Also, the way it works potentially wastes a lot of bandwidth for the poor blog owners, mostly because files I have deleted are downloaded again, unless they were removed from the blog in the meantime. Note that this hits sites harder the more music they put up that I don't like or already have, but that should hardly be the measure of all things. Maybe. ;)

I see two ways to solve this:

  1. drastically clean up the list of urls that I harvest from.

    This is possible, I do it semi-regularly, but new and interesting mp3 blogs keep popping up, so this is only a short term solution.

  2. filter out the stuff I know I don't want

    To some extent, I know what I don't want to download. First of all, long podcasts and extended mixes (let's arbitrarily say anything over 20MB), since I like to listen to music at the individual track level; otherwise all my tagging tools and last.fm don't work. Anyway, we're getting past the whole idea that (web) music radio is consumed in an order predefined by someone else. More suggestion, less force feeding, kthxbye. (On a tangent: can we get this for news radio too? Just the individual news items, not a whole, usually extremely repetitive, bulletin as one atomic unit. True podcasting should let me skip items I'm not interested in or have already heard.) Second of all, for obvious reasons, all the files I've already downloaded but deleted.

Since I am far from a Linux command line deity, I thought I would ask here: does anyone have any suggestions on how to start tackling these two problems, given the script:

wget --timeout=5 -U"Mozilla/5.0" -r -l1 -H -t1 -x -nc -np -P ~/mp3blogs/ -A.mp3,.ogg -erobots=off -i ~/mp3blogs/urls.txt

A: How can I limit the size of the mp3s and oggs downloaded this way to, for instance, 20MB per file? Keep in mind that throwing them away after downloading is not an option, since I want to prevent the download from happening at all. I don't think wget has a switch for this, so it will probably not be possible in a one-liner.
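
One way to approach A, sketched in shell: ask the server for the headers only and compare the reported Content-Length against the limit before fetching anything. This assumes curl is installed alongside wget, that the server answers HEAD requests with a Content-Length header (not all do), and that the candidate file URLs are already known one at a time. The function names and MAX_BYTES are made up for illustration; none of this is from the original script.

```shell
#!/bin/sh
# Sketch only: decide from HTTP response headers whether a file is small
# enough to fetch. All names below are hypothetical.
MAX_BYTES=20000000   # the arbitrary 20MB limit, in bytes

# Read HTTP headers on stdin and print the last Content-Length value
# (prints nothing if the header is absent). With redirects, the last
# value belongs to the final response.
content_length() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "content-length" { len = $2 }
        END { if (len != "") print len }'
}

# Succeed only if the given size is a plain number no larger than MAX_BYTES.
small_enough() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # unknown or malformed size: skip it
    esac
    [ "$1" -le "$MAX_BYTES" ]
}

# Fetch one URL, but only if the server reports an acceptable size.
# Assumption: $1 is a direct link to an .mp3 or .ogg file.
fetch_if_small() {
    size=$(curl -sIL --max-time 5 -A "Mozilla/5.0" "$1" | content_length)
    if small_enough "$size"; then
        wget --timeout=5 -U "Mozilla/5.0" -t1 -x -nc -P "$HOME/mp3blogs/" "$1"
    fi
}
```

Files whose size the server does not report are skipped here; flip the check in small_enough if you would rather take the risk. curl also has a --max-filesize option that refuses a transfer when the reported size is over the limit, which might replace the header parsing altogether.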

B: I would like to store the URLs of all the files I do download (probably just in a flat text file for now) and then have my script skip them on later runs. Again, I don't think a one-liner is possible.
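
For B, a flat text file plus grep may already be enough. The sketch below keeps one URL per line in a "seen" file and filters a list of candidate URLs against it with fixed-string, whole-line matching. The file name seen.txt, the candidates.txt list and both function names are assumptions for illustration, and how to get the candidate file URLs out of the blog pages is left open.

```shell
#!/bin/sh
# Sketch only: remember every URL ever fetched and skip it on later runs.
# seen.txt is a hypothetical file name; override SEEN_FILE to change it.
SEEN_FILE="${SEEN_FILE:-$HOME/mp3blogs/seen.txt}"

# Read candidate URLs on stdin, print only those not yet listed in SEEN_FILE.
# -F: fixed strings, -x: match whole lines, -v: invert, -f: patterns from file.
filter_new() {
    touch "$SEEN_FILE"
    grep -vxFf "$SEEN_FILE" || true   # grep exits 1 when nothing is left; not an error here
}

# Record a URL as handled, so it is skipped even after the file is deleted.
mark_seen() {
    printf '%s\n' "$1" >> "$SEEN_FILE"
}

# Usage idea (candidates.txt is a hypothetical list of file URLs, one per line):
#   filter_new < candidates.txt | while IFS= read -r url; do
#       wget --timeout=5 -U "Mozilla/5.0" -t1 -x -nc -P "$HOME/mp3blogs/" "$url" \
#           && mark_seen "$url"
#   done
```

Because the list records URLs and not files on disk, deleting a track no longer causes it to be downloaded again, which is the bandwidth waste described above.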

Solutions to either problem are worth a 20$ amazon voucher from me (or somewhere else, I don't really care, as long as I'm out only 40$ total and it's not too much hassle to get it to you.)

I am, of course, the sole judge of this contest, but I will try to be fair. You don't have to give me a whole script: I'm a fairly competent programmer, just not too deep into bash. If you point me at where to start and I get it to work, that counts as a solution. Although, as I've said, it's going to grow beyond a one-liner, I would like to keep it a simple script; I'm not looking for an application. I could build one in Python myself, but I want to keep it zero maintenance, basically too simple to even put the code into subversion.

UPDATE 2008-01-28: I'm now looking into pavuk, which may or may not have all the features I need. If this works, I just earned myself 40$ :)

UPDATE 2008-01-28.1: pavuk, although having rather exotic naming of options and switches, seems to solve A quite nicely, which is a bandwidth (and time, and thus energy) saver. Finding all the right options was made much easier by this guide. I'm still thinking about solving B, there may be options in pavuk to help me with that too.

For completeness' sake, the updated script looks like this (except it should all be one line...):

pavuk -timeout 5000 -identity "Mozilla/5.0" -lmax 1 -retry 1 -dont_leave_dir -cdir ~/mp3blogs/ -asfx .mp3,.ogg -noRobots
 -urls_file ~/mp3blogs/urls.txt -maxsize 30000000 -fnrules F '*' '%h/%d/%n'
