Saturday, May 12, 2007

How to make a backup of all blog content?

I like to have a backup of my blog on my notebook, so that I can run searches in it when I am not connected.

Blogspot has nice URLs for each post - e.g.
is the URL for a post that I wrote on making email -> blogger (mostly) work. This suggests a file system where there are directories 2007, 2007/03 and then a file 2007/03/how-to-make-email-to-blogger-work.html, which would be a case of nice software engineering.
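
To make the mapping concrete, here is a minimal sketch of how a post URL could be turned into a local path. The host name `example.blogspot.com` and the helper name `post_path` are placeholders invented for illustration; nothing here is provided by Blogger.

```shell
#!/bin/sh
# post_path: hypothetical helper that strips the scheme and host from a
# post URL, leaving the year/month/name part to use as a local file path.
post_path() {
  url=$1
  rest=${url#*://}             # drop "http://" or "https://"
  printf '%s\n' "${rest#*/}"   # drop the host name up to the first slash
}

# example.blogspot.com is a placeholder host, not the real blog address.
post_path "http://example.blogspot.com/2007/03/how-to-make-email-to-blogger-work.html"
```

The last line prints the path relative to the blog root, which is the layout the local mirror would use.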

How would I make a personal file system which mirrors my blog and has this structure? I'm unable to do this. I tried to use wget with its recursive get options and it gets lost. A key feature that I want is to be able to say wget -N (timestamping), so that only new or modified posts are picked up and the rest are not brought down again.
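
One possible shape for this, sketched under assumptions: if a list of post URLs were available in a file, wget's timestamping and directory options could rebuild the year/month tree without recursion. The file name `posts.txt`, the function name `mirror_posts` and the `WGET` override are hypothetical. Timestamping only saves work when the server reports a last-modified time for each page, which has not been checked for Blogger here.

```shell
#!/bin/sh
# mirror_posts: fetch every URL listed in a file into a year/month tree.
#   -N   only download when the remote copy is newer than the local one
#   -x   recreate the URL's directory part locally (2007/03/...)
#   -nH  leave the host name out of the local path
#   -P   put everything under the given directory
#   -i   read the URLs from a file
# WGET can be overridden (e.g. WGET=echo) to print the arguments instead
# of running wget; this override is an addition for dry runs.
mirror_posts() {
  list=$1
  dest=$2
  ${WGET:-wget} -N -x -nH -P "$dest" -i "$list"
}

# Usage (posts.txt is a hypothetical file with one post URL per line):
#   mirror_posts posts.txt "$HOME/blog"
```

Because the function takes no interactive input, it could be called from a crontab entry as it stands. Building `posts.txt` in the first place is the part this sketch leaves open.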

Right now, I have a simple and dumb solution: I take one file per month, and I fetch the whole thing every time (which is wasteful of Google's resources). I use this script:


# BLOG_URL must hold the blog's base URL, ending in a slash.
rm -f *.html *.text
for year in 2005 2006 2007 ; do
  for month in 01 02 03 04 05 06 07 08 09 10 11 12 ; do
    wget "${BLOG_URL}${year}_${month}_01_archive.html"
    links -dump "${year}_${month}_01_archive.html" > "$year$month.text"
  done
done

This works, but it's not a nice solution: (a) I'm wasting bandwidth and Google's resources, and the waste will grow as the years go by, and (b) it doesn't get me the clean, well-organised file system with nice file names that ought to be possible.
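
Problem (a) could be reduced with a small check before each download. The sketch below assumes that archive pages for past months no longer change, which is not strictly true (old posts can be edited and can attract new comments), so it trades completeness for bandwidth. The function name `needs_fetch` is made up for this example.

```shell
#!/bin/sh
# needs_fetch: decide whether a monthly archive page should be downloaded.
# Past months are fetched once; the current month is always refetched.
# Arguments: file name of the archive page, its YYYYMM, the current YYYYMM.
needs_fetch() {
  file=$1 ym=$2 now=$3
  [ "$ym" -gt "$now" ] && return 1   # month has not happened yet
  [ "$ym" -eq "$now" ] && return 0   # current month may still change
  [ ! -s "$file" ]                   # past month: fetch if missing or empty
}

# Usage inside the year/month loop:
#   now=$(date +%Y%m)
#   file="${year}_${month}_01_archive.html"
#   if needs_fetch "$file" "$year$month" "$now" ; then
#     ... wget and links -dump as before ...
#   fi
```

For this to help, the `rm -f` line at the top of the script would have to go, since it deletes the very files the check relies on.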


  1. This is a good guide ...

    And BTW, Blogger does not store articles in a directory structure as you think it does. That's only a virtual representation of the articles, which are stored in a flat database.

    Nonetheless, some of the tools in the article above should do what you want.


  2. sir,

    this might be useful

  3. Sir,

    Firefox has a wonderful add-on - DownThemAll. It allows one to download all the files from a particular blog directory [one-at-a-time], e.g. 2007 / 2006 / 2005 / ..etc.

    The good thing is that we can choose the format of the files we wish to download from the site / blog. This could be [.pdf], [.html], [.doc], or one can even enter a different file format.

    In case of a blog, the downloaded html pages will look exactly as they appear online, i.e. the RHS index, blog-owner's pic, etc.

  4. Ravi, thanks for the pointer. But you know me: I don't like doing anything which requires interaction. That takes too much time. I want a 100% automatable solution that can be stuck into a crontab and then I can forget about it.

    If I interacted with software, I'd get a lot less done! :-)

