I just had a question about wget usage ("Rip javadocs from a doc site to a local zip file") and wondered where to post it. Here are the question counts for the wget tag on the sites I checked:
- ServerFault (83)
- SuperUser (134)
- Unix & Linux (21), Ask Ubuntu (12)
- Stack Overflow (303)
If we don't limit ourselves to tagged questions but search for the word itself, we find even more on each site:
- Pro Webmasters (21)
- Unix & Linux (79), Ask Ubuntu (168), Ask Different (11)
- On Stack Overflow (303), ServerFault (83) and SuperUser (134), the search simply forwards to the corresponding tag.
Most of them contain quite similar questions.
I finally found my answer (it is not possible) at ServerFault, but I'm still not sure on which site my question would fit best.
Is there some clear guideline describing the limits?
Jeff said that it depends on my profession/role. So here is what my question would look like (if I hadn't already found that there is no solution with the current wget and that I would need some alternative program):
I'm a programmer who wants to download (once) a website containing Javadoc documentation (which I will later use to program against the API).
These consist of a bunch of HTML files, and most of them contain a link of the form `index.html?<filename>` (where `<filename>` is the current file name). There is only one `index.html`, so all of these URLs refer to the same file (and give an identical result when downloaded). Using `wget -r`, every one of them is downloaded and saved individually. With `--reject 'index.html\?*'`, wget deletes them after downloading, but it still downloads them all, in effect fetching the same file hundreds or thousands of times for bigger APIs, wasting both my bandwidth and the server's. How can I avoid downloading them in the first place?
This is my current command line:
    wget --no-parent --recursive --level=inf --page-requisites --wait=1 http://epaul.github.com/jsch-documentation/simple.javadoc/
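(As an aside, not part of the question itself: I read that newer wget releases add a `--reject-regex` option that is matched against the complete URL before a request is sent, which might avoid fetching the `index.html?<filename>` variants at all. I couldn't test it with the version I have, so take the following only as a hedged sketch:)

    # Hedged sketch, assuming a wget build that supports --reject-regex.
    # The regex is matched against the full URL (including the query part),
    # so the index.html?<filename> duplicates should never be requested.
    # Untested on my setup.
    wget --no-parent --recursive --level=inf --page-requisites --wait=1 \
         --reject-regex 'index\.html\?' \
         http://epaul.github.com/jsch-documentation/simple.javadoc/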
So, I'm (for this concrete problem) a programmer and want to use the downloaded website for programming purposes, but the same problem could occur for a webmaster who wants to copy another website, a sysadmin who wants to provide a local mirror of some other site for his users, or a plain (Unix or Windows or Mac or ...) user who simply wants to download the site for offline reading.
I don't (currently) want to write a program that repeatedly downloads this website (or similar ones), but if I did, the question and its answers would not really differ, I think.
I'm not always a programmer: I'm also sometimes a server administrator (for an association's server), a webmaster (for the same association, and for my own blog), and a power user (on various Linux systems at home and at the university - and I tested the command line on one of these).
Did I just get really unlucky and hit the corner case where all these fields touch, or is there a problem with the delineation?
Or do I simply misunderstand this and it is totally clear where this question belongs?