Fetching Webpage Content in Python and Perl

Python

2005-02-04

Suppose you want to fetch a webpage. The following code does it:

# -*- coding: utf-8 -*-
# Python

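from urllib.request import urlopen  # Python 3; in Python 2: from urllib import urlopen
# read() returns bytes; decode them to text (assuming the page is UTF-8)
print(urlopen('http://xahlee.org/Periodic_dosage_dir/_p2/russell-lecture.html').read().decode('utf-8', 'replace'))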
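
The one-liner above has no timeout and does not look at the page's declared encoding. The following is a minimal sketch for Python 3, where “urlopen” lives in “urllib.request”. The function name “fetch_text” and its parameters are made up for illustration, and falling back to UTF-8 when the server names no charset is an assumption.

```python
from urllib.request import urlopen

def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch url and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        raw = resp.read()  # the body, as bytes
        # Use the charset from the Content-Type header when there is one;
        # otherwise fall back to default_charset (an assumption).
        charset = resp.headers.get_content_charset() or default_charset
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")
```

Calling fetch_text('http://xahlee.org/Periodic_dosage_dir/_p2/russell-lecture.html') should return the page as a string. A failed connection raises urllib.error.URLError, which the caller can catch.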

Sometimes, when working with HTML pages, you need to create links. In a URL, certain characters need to be percent-encoded. For example, a space becomes “%20”, and “http://xahlee.org/~xah” is often written as “http://xahlee.org/%7Exah”.

In Python, the “quote” function does it. “unquote” reverses it.

# -*- coding: utf-8 -*-
# Python

from urllib.parse import quote  # Python 3; in Python 2: from urllib import quote
print(quote("~joe's home page"))
print('http://www.google.com/search?q=' + quote("ménage à trois"))
# (rely on the French to teach us interesting words)
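
The text says “unquote” reverses “quote”, so here is a small round-trip sketch for Python 3, where both functions live in “urllib.parse”. The helper “search_url” is a made-up name for illustration, not part of any library.

```python
from urllib.parse import quote, unquote

def search_url(query, base="http://www.google.com/search?q="):
    """Build a search link with the query percent-encoded (hypothetical helper)."""
    return base + quote(query)

original = "ménage à trois"
encoded = quote(original)            # non-ASCII chars are encoded as UTF-8 bytes
print(encoded)                       # m%C3%A9nage%20%C3%A0%20trois
print(unquote(encoded) == original)  # True: unquote undoes quote
print(search_url(original))
```

Note that “quote” leaves “/” alone by default, which suits path segments. For form-style query strings, “quote_plus” (same module) encodes spaces as “+” instead.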

For detail about URL encoding, see http://en.wikipedia.org/wiki/Percent-encoding

Reference: Python documentation for the urllib module.

Perl

In Perl, there are several ways to get a webpage's content. Long story short, the easiest is to use the Perl programs HEAD or GET, found in “/usr/bin” or “/usr/local/bin”. Example:

GET 'http://yahoo.com/'

HEAD and GET are usually installed along with Perl. They belong to the libwww-perl (LWP) distribution: when it is installed, Perl contaminates your bin dirs with these programs. In the unix shell, try the GET example above.

HEAD is similar to GET, except that it returns only the response headers (a summary of the page) instead of the page itself. (HEAD and GET are two request methods of the HTTP protocol. The Perl scripts are named that way for this reason.)
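
For comparison with the Python section, the same HEAD versus GET distinction can be sketched in Python 3 as follows. The function names “build_head_request” and “fetch_headers” are made up for illustration. Only the standard “urllib.request” module is used.

```python
from urllib.request import Request, urlopen

def build_head_request(url):
    """Return a Request that asks for headers only (HTTP HEAD)."""
    return Request(url, method="HEAD")

def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict.

    Headers that occur more than once are collapsed to a single entry.
    """
    with urlopen(build_head_request(url), timeout=timeout) as resp:
        return dict(resp.headers.items())
```

Calling fetch_headers('http://yahoo.com/') should give roughly what the HEAD program prints: fields such as Content-Type and Date, without the page body.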

If you need more complexity, Perl has LWP::Simple and LWP::UserAgent to begin with. (There are a host of spaghetti others.) Both need to be installed separately. The following code is untested, but should do it.

# perl

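use strict;
use warnings;
# use LWP::Simple;  # the simpler interface: print get($url);
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->timeout(120);  # give up after 120 seconds
my $url = 'http://yahoo.com/';
my $request = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);

# stop with the HTTP status line if the fetch failed
die $response->status_line, "\n" unless $response->is_success;

my $content = $response->content();
print $content;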

Note the style of the above Perl code. Many Perl modules sport an object-oriented interface, often alongside a plain procedural one. LWP::UserAgent (object-oriented) and LWP::Simple (procedural) are such a pair.


Page created: 2005-01.
© 2005 by Xah Lee.