Update 2008-03-11: There were a number of things wrong with this script making the spidering *waaaay* slower than it needs to be. Fixed that below, and added threading for both the spidering and downloading, thanks to this cool recipe by Wim Schut which lets me run all the sqlite code in a separate thread. (Important because you can only use sqlite connections in the thread in which they were created.) All of this results in a nice speed-up.
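For anyone wondering what "all the sqlite code in a separate thread" looks like, here is a minimal sketch of the general idea. It is not Wim Schut's recipe and not the code in spider.py: the `SqliteWorker` class and its method names are made up for illustration, and it assumes a modern Python 3. One thread creates and owns the connection, and every other thread hands it SQL through a queue.

```python
import queue
import sqlite3
import threading


class SqliteWorker(threading.Thread):
    """Hypothetical helper: owns one sqlite connection, fed through a queue."""

    def __init__(self, path):
        super().__init__(daemon=True)
        self.path = path
        self.requests = queue.Queue()

    def run(self):
        # The connection is created here, inside the worker thread,
        # and is never touched from any other thread.
        conn = sqlite3.connect(self.path)
        while True:
            item = self.requests.get()
            if item is None:  # sentinel: shut down
                break
            sql, params, reply = item
            try:
                rows = conn.execute(sql, params).fetchall()
                conn.commit()
                reply.put((rows, None))
            except sqlite3.Error as exc:
                reply.put((None, exc))
        conn.close()

    def execute(self, sql, params=()):
        """Callable from any thread; blocks until the worker ran the statement."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((sql, params, reply))
        rows, error = reply.get()
        if error is not None:
            raise error
        return rows

    def close(self):
        """Ask the worker to stop and wait for it to finish."""
        self.requests.put(None)
        self.join()
```

Spider and download threads would then all share one worker and call `worker.execute(...)` instead of opening their own connections.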
OK, I said I wasn't going to, but I did end up writing a bit of code, although it didn't get too far out of hand. Yet :). It solves *all* of my problems: it does not download files over 30MB in size, and it never downloads the same link twice.
I found this message on the Python mailing list, which seemed like a very good start. It almost did what I needed, but not quite. The parsing was also overcomplicated and didn't catch all links, so I replaced it with a simple regular expression.
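To give an idea of the regular expression approach, here is a rough sketch. The pattern, the `extract_links` name and the split into audio links and page links are my assumptions for illustration, not the exact expression used in spider.py. It only catches quoted `href` attributes, which is crude but covers most blog markup.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: any quoted href attribute, single or double quotes.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
AUDIO_EXTENSIONS = (".mp3", ".ogg")


def extract_links(html, base_url):
    """Return (audio_links, page_links) found in html, made absolute."""
    audio, pages = [], []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative links
        if url.lower().endswith(AUDIO_EXTENSIONS):
            audio.append(url)
        else:
            pages.append(url)
    return audio, pages
```

The audio links are the ones worth downloading, while the page links are what a deeper spider run would follow next.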
I ended up changing most of the code and functionality (for instance, it now stores links in a database). There's a lot of hard coding in there, which I could factor out if people want to use it, but for now it solves my problems beautifully ;).
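Storing the links in a database is also what makes "never downloads the same link twice" easy. A minimal sketch of one way to do it, assuming a single `links` table keyed on the URL (the table layout and function names here are hypothetical, not the actual schema in spider.py):

```python
import sqlite3


def create_db(conn):
    """Create the (assumed) links table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " downloaded INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def add_link(conn, url):
    """Record url; return True if it was new, False if already known."""
    cur = conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def pending_links(conn):
    """Return all urls that have not been downloaded yet."""
    rows = conn.execute(
        "SELECT url FROM links WHERE downloaded = 0 ORDER BY url"
    )
    return [row[0] for row in rows]


def mark_downloaded(conn, url):
    """Flag url so later download runs skip it."""
    conn.execute("UPDATE links SET downloaded = 1 WHERE url = ?", (url,))
    conn.commit()
```

The primary key does the deduplication: spidering the same page twice simply ignores links that are already in the table, and the download step only looks at rows that are still unflagged.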
It's used with the following syntax:
# initial set up
python spider.py createdb
# add a new blog to be harvested
python spider.py add http://url.of.blog/
# (shallowly) spider all blogs for new links to files
python spider.py
# spider a url to a specific depth (5 for example should get
# most everything, but will take a while)
python spider.py deepspider 5
# download all files
python spider.py download
A minor problem is that curl doesn't do *minimum* file sizes, and for a lot of broken links it still downloads something small that isn't really an ogg or mp3 file, but an HTTP response. I can probably solve this better, but for now I call the download from an update script as follows:
python spider.py download
find . -iname "*.mp3" -size "-100k" -print0 | xargs -0 rm
find . -iname "*.ogg" -size "-100k" -print0 | xargs -0 rm
find . -iname "*.mp3" -print0 | xargs -0 mp3gain -k -r -f
find . -iname "*.ogg" -print0 | xargs -0 vorbisgain -fr
Translation: download files, throw away suspiciously small ones, mp3/vorbisgain what's left.
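The "throw away suspiciously small ones" step could also live in Python rather than in the update script. This is only a sketch of that idea, not something spider.py does: the `remove_small_audio` name is made up, and the cut-off is a plain byte comparison, so it will not match the rounding that `find -size` applies exactly.

```python
from pathlib import Path

MIN_BYTES = 100 * 1024  # assumed cut-off, roughly the 100k used with find
AUDIO_SUFFIXES = {".mp3", ".ogg"}


def remove_small_audio(root, min_bytes=MIN_BYTES):
    """Delete audio files under root smaller than min_bytes.

    Returns the sorted list of paths that were removed.
    """
    # Collect first, then delete, so the tree is not modified mid-walk.
    candidates = [
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and path.suffix.lower() in AUDIO_SUFFIXES
        and path.stat().st_size < min_bytes
    ]
    for path in candidates:
        path.unlink()
    return sorted(candidates)
```

Calling `remove_small_audio(".")` after the download would replace the two `find ... | xargs -0 rm` lines. The gain steps still need mp3gain and vorbisgain.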
Here's the code:
Edit 2008-04-18: Moved the code to Google Code, so I don't have to update it here. Find the latest version here: spider.py