Scraping a website into Drupal using Perl
Posted on: Saturday, December 5th 2009 by JF Paradis

Perl has been at the root of web development since the beginning: even Amazon is built on Perl. Today, Perl gives you access via CPAN to a set of over 18,000 mature modules on just about anything. There is even an Acme:: namespace reserved for joke modules.

Perl has a lot of benefits for a Drupal developer. First, the syntax of PHP has been greatly influenced by Perl, so most PHP programmers should feel comfortable in Perl. It is easy to install extra Perl modules on any Linux distribution from the command line using CPAN, or on shared hosts using the administration interface. And Perl is often faster than PHP for this kind of work, which makes it an excellent candidate for the heavy-lifting part of a website.

Let's build a small Perl script to:

  1. Log into a website
  2. Parse a page and search for specific content
  3. Format the content as an RSS feed
  4. Load the feed into Drupal

This solution is simple to build using only four CPAN modules. Here is how it goes:

STEP 1: The first line in the Perl script is the shebang, which points to the location of Perl on your system.

#!/usr/bin/perl -w

On shared hosts, you might have to use something like this to tell Perl to look inside your home directory:

#!/ramdisk/bin/perl -w
#
# Hostmonster fix
BEGIN {
    my $homedir = ( getpwuid($>) )[7];
    my @user_include;
    foreach my $path (@INC) {
        if ( -d $homedir . '/perl' . $path ) {
            push @user_include, $homedir . '/perl' . $path;
        }
    }
    unshift @INC, @user_include;
}

STEP 2: Declare the modules you intend to use (these must be installed first):

use CGI::Minimal;
use WWW::Mechanize;
use XML::RSS;
use HTTP::Message;
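
A missing module only shows up as a compile error when the script runs, so it can help to check first. The following is a minimal sketch and not part of the original script; the helper name module_available is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    # Turn Foo::Bar into Foo/Bar.pm, the path form that require accepts.
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# The four modules used in this article.
my @needed = qw(CGI::Minimal WWW::Mechanize XML::RSS HTTP::Message);

foreach my $module (@needed) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

Any module reported as MISSING can then be installed with the cpan client, for example: cpan WWW::Mechanize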

STEP 3: Define some constants. We'll provide a user agent (here IE8) to make sure the system will not reject us by mistake.

my $login_url = "https://example.com";
my $login_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
#
my $login_form_name = "form_login";
my $login_field_user = "login_id";
my $login_field_pass = "passwd";

STEP 4: Use CGI::Minimal to read the login and password coming from POST or GET:

# Gets us access to the HTTP request data
my $cgi = CGI::Minimal->new;
#
# Get the name and value for each parameter:
my $login_user = $cgi->param('user');
my $login_pass = $cgi->param('pass');

STEP 5: WWW::Mechanize is our Swiss Army knife, letting us post forms, click buttons, follow links, and so on. With only six lines of code, WWW::Mechanize can read the login page, find the login form on it, enter the user name and password, submit the form, and return the next page.

# The autocheck => 1 tells Mechanize to die if any IO fails, so you don't have to manually check.
my $mech = WWW::Mechanize->new(autocheck => 1, agent =>$login_agent);
#
# Fetch the login page
$mech->get($login_url);
#
# Find and select the form by name, returning an HTML::Form object
$mech->form_name($login_form_name);
#
# Fill specific fields on the form
$mech->field($login_field_user,$login_user);
$mech->field($login_field_pass,$login_pass);
#
# Click the submit button
$mech->click();
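
Because autocheck makes Mechanize die on a failed request, a script that should fail gracefully can wrap the calls in eval. This is only a sketch under assumptions: the file:// URL below is an invented path that does not exist, chosen so the failure can be reproduced without touching the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Assumed-for-illustration URL that cannot be fetched (no such local file).
my $bad_url = 'file:///no/such/dir/login.html';

# The eval block yields 1 only if get() did not die.
my $ok    = eval { $mech->get($bad_url); 1 };
my $error = $ok ? '' : $@;

if ($ok) {
    print "Fetched: ", $mech->uri, "\n";
}
else {
    # In a CGI script you would probably log this and return an error page.
    print "Request failed: $error";
}
```

The same pattern can surround the form_name, field and click calls, so that a changed login page produces a readable error instead of a bare server failure.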

STEP 6: We could then navigate the site by following links using WWW::Mechanize, but let's say the content we are interested in is on the next page. We want to extract the following information:

<a href="http://example.com/123" class="post">
Link to post 123
</a>

With the help of WWW::Mechanize we can extract all the links which have class "post":

# Use the same $mech object that performed the login
my @links = $mech->find_all_links(
    tag   => 'a',
    class => 'post',
);
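
To try the link extraction without logging into a real site, the same query can be run against a local file. The sketch below assumes a WWW::Mechanize release recent enough to accept the class criterion; the sample HTML and the temporary file are made-up test data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Write a small sample page to a temporary file (made-up test data).
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><body>
<a href="http://example.com/123" class="post">
Link to post 123
</a>
<a href="http://example.com/about">About</a>
</body></html>
HTML
close $tmp;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Load the file through a file:// URL instead of the live site.
$mech->get( URI::file->new_abs( $tmp->filename ) );

# Same criteria as in the script: <a> tags whose class is exactly "post".
my @links = $mech->find_all_links( tag => 'a', class => 'post' );

# Each element is a WWW::Mechanize::Link object.
foreach my $link (@links) {
    print $link->url, " => ", $link->text, "\n";
}
```

Only the first link should be reported, since the second one has no class attribute.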

STEP 7: Now build the RSS result using XML::RSS:

# Syndication feed
my $rss = XML::RSS->new(version => '2.0');
#
# Create xml content
foreach (@links) {
    $rss->add_item(
        title => $_->text,
        link  => $_->url
    );
}
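
The feed built above has no channel metadata, although RSS 2.0 expects a title, link and description for the channel, and some parsers are strict about it. Here is a minimal sketch of how that could be added; the channel values are placeholders, and the single hard-coded item stands in for the links found in the previous step.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

my $rss = XML::RSS->new( version => '2.0' );

# Placeholder channel metadata (assumed values, adjust for your site).
$rss->channel(
    title       => 'Scraped posts',
    link        => 'https://example.com',
    description => 'Posts extracted from example.com',
);

# One hard-coded item standing in for the scraped links.
$rss->add_item(
    title => 'Link to post 123',
    link  => 'http://example.com/123',
);

my $xml = $rss->as_string;
print $xml;
```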

STEP 8: The last step in the script simply returns the result using HTTP::Message:

# Manage the HTTP response
my $response = HTTP::Message->new;
#
# Create message with xml as text
$response->header('Content-Type' => 'application/rss+xml');
$response->content($rss->as_string);
#
# Send message to client
print $response->as_string;

STEP 9: Finally, in Drupal, download and install FeedAPI and enable FeedAPI, FeedAPI Node and SimplePie Parser (external library required). Then create a Feed node with the URL pointing to your script:

http://localhost/feed.pl?user=foo&pass=bar
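
If the user name or password contains characters such as & or spaces, the query string has to be escaped. A small sketch using the URI module (installed alongside WWW::Mechanize as one of its dependencies) can build the URL safely; foo and bar are the sample values from the URL above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Script location from the article; the credentials are sample values.
my $uri = URI->new('http://localhost/feed.pl');

# query_form escapes names and values as needed.
$uri->query_form( user => 'foo', pass => 'bar' );

my $feed_url = $uri->as_string;
print "$feed_url\n";    # prints http://localhost/feed.pl?user=foo&pass=bar
```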

That's it! A very simple and strong foundation to build upon. For example, this can be used to perform a search on a site, or to return the results as plain XML by replacing XML::RSS with XML::Generator.

Have fun!

Comments

Cool, will check out QueryPath. Thanks!

I think I should try it :).

If Perl isn't your thing check out QueryPath. It's a PHP-based solution that allows jQuery-like traversal of documents, like HTML. You can get the links with class 'post' from a page with something like the following:

$links = qp($url, 'a.post');

There's also a Drupal module for it.
