Wget web aspirator - Basic useful operations

  toucheatout  2006-11-03 11:05  IT  

Wget, principles and comparable software

Wget is a very powerful remote file retriever. In the open source field, its main competitor is curl. Both tools are packaged by most distributions, and at least one of them is often installed by default. Wget can also 'aspirate' (recursively download) a whole site, which curl does not do on its own. In that mode it works a bit like a search engine spider: seeded with a single entry point, it downloads a page, extracts the links it contains, and queues them as new items to download. (See htdig for an open source tool closer to a real search engine: the --spider option of wget only checks that pages exist and extracts their links, and has no built-in page indexing.)

Simple uses of wget: single download, site mirroring

The simplest use of wget is to retrieve just one page (or a particular file, for instance an archive or a package):
wget http://www.example.com
However, to activate link extraction and recursion over the links found, it is better used like this:
wget -r -l 5 -p -k http://www.example.com
That goes to a depth of 5 in the site before giving up retrieval. The -p option instructs wget to retrieve everything needed to display each page (images, stylesheets, ...), even if fetching those elements takes the recursion one level beyond the maximum depth.
Mirroring an entire site is just as easy:
wget --mirror http://example.com
Getting only a subpart of the site: if only the content within a specific directory is desired, one can exclude the parent and other folders:
wget --no-parent -r -l inf -k http://www.example.com/subPartOfTheSite/
The trailing slash matters: without it, wget treats subPartOfTheSite as a file name, and --no-parent no longer confines the download to that directory. The -k option asks wget to convert the links in the downloaded files to relative links, so that the site can be browsed locally after download. -l inf removes the depth limit on the recursion. No file outside the path /subPartOfTheSite/ will be considered for download.
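When the start URL comes from a variable or a script argument, the final slash of a directory URL is easy to lose, and wget relies on that slash to tell a directory from a file when it applies --no-parent. The following is only a sketch of a wrapper that normalizes the URL first. The function name subdir_mirror_cmd is made up for illustration, and it assumes the URL given really names a directory (not a file or a URL with a query string). It prints the command instead of running it, so the result can be checked before any download starts.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): print the wget command that
# would fetch a single sub-directory of a site.
# Assumption: $1 is the URL of a directory, with or without its final slash.
subdir_mirror_cmd() {
    url=$1
    # Append the trailing slash if it is missing, so that wget treats the
    # last path component as a directory and not as a file name.
    case $url in
        */) ;;
        *)  url="$url/" ;;
    esac
    # Echo the command rather than executing it (dry run).
    printf 'wget --no-parent -r -l inf -k %s\n' "$url"
}
```

Called as subdir_mirror_cmd http://www.example.com/subPartOfTheSite, it prints the same command as above with the slash appended. Once the output looks right, it can be run with eval or pasted into the shell.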

Usual wget usage

Wget is used primarily to:

  • retrieve data (a tgz archive from the command line, a site mirror as above, ...);
  • report from within a firewalled zone through scheduled jobs (some dynamic DNS services are updated this way, for instance);
  • trigger processing on a server (for instance site maintenance). This is implemented like the previous item, but the interest is reversed: what matters is what the request makes the server do, not the information the client sends.

Getting a bit more creative

Once familiar with the basics, one can push a bit further, for instance with the -O - option (that is, print the retrieved page on stdout). A pipe can then feed the page to some program (for instance to extract and analyze information from the <head> section).
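As an illustration of the pipe idea, here is a minimal sketch of a filter that pulls the <title> out of a page read on stdin. The name page_title is made up for this example. It assumes a single, lowercase <title> element without nested tags; a real HTML parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read an HTML page on stdin and print the text
# of its <title> element.
# Assumptions: one lowercase <title>...</title> pair, no nested tags.
page_title() {
    # Join all lines first, so a title split across several lines is
    # still matched, then keep only the text between the two tags.
    tr '\n' ' ' | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p'
}
```

It would typically be used as wget -q -O - http://www.example.com | page_title, where -q keeps wget's own progress messages off the terminal.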

Set up as a cron job, wget can regularly "ping" a script, transferring data if needed (for instance in a simple monitoring application that records in a database the date, time and IP address together with application data: a status code, the results of some processing, the current CPU load, ...). Many CMSes use this technique to implement their own cron (Drupal does).
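The "ping" could be sketched as follows. Everything specific here is an assumption made for the example: the monitor.php script, its status and load parameters, the ping_url helper and the path of the cron script. The values are assumed to be URL-safe, since no encoding is done. The server side is expected to record the date, time and client IP by itself.

```shell
#!/bin/sh
# Hypothetical helper: build the URL that reports a status code and a
# load figure to a monitoring script.
# Assumptions: the parameter names "status" and "load" are invented,
# and the values need no URL encoding.
ping_url() {
    base=$1
    status=$2
    load=$3
    printf '%s?status=%s&load=%s\n' "$base" "$status" "$load"
}

# Example crontab entry, every 10 minutes (assumed script path):
#   */10 * * * * /usr/local/bin/ping-monitor.sh
#
# Inside that script, the ping itself could be (Linux only, because of
# /proc/loadavg); the page returned by the server is discarded:
#   wget -q -O /dev/null "$(ping_url http://www.example.com/monitor.php 0 "$(cut -d' ' -f1 /proc/loadavg)")"
```

Keeping the URL construction in its own function makes it possible to check what would be sent without contacting the server.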

 