my $connection = Win32::Internet->new();
my $amazon = $connection->FetchURL($amazonBook{"url"} . $amazonBook{"isbn"} . "/");
# ...
if($amazon =~/<title>\s*Amazon\.com: (.*): Books.*<\/title>/)
{
$amazonBook{"title"} = $1;
}
elsif($amazon =~/<title>\s*Amazon\.ca: Books: (.*).*<\/title>/)
{
$amazonBook{"title"} = $1;
}
elsif($amazon =~/<title>\s*Amazon\.ca: (.*): Books.*<\/title>/)
{
$amazonBook{"title"} = $1;
}
elsif($amazon =~/<title>\s*Amazon\.com: (.*): Books.*<\/title>/)
{
$amazonBook{"title"} = $1;
}
else
{
$amazonBook{"status"} = "PARSE ERROR: title";
$amazonBook{"error"} = "PARSE ERROR: title";
}
# ...
if($amazon=~/<span class="price">(.*)<\/span>/)
{
$amazonBook{"price"} = $1;
}
elsif($amazon=~/<span class=price><b>CDN\$ (.*)<\/b><\/span>/)
{
$amazonBook{"price"} = $1;
}
#if($amazonBook{"price"} =~/<font color=\#990000>CDN\$ (.*)<\/font>/)
#{
# $amazonBook{"price"} = $1;
#}
#elsif($amazonBook{"price"} =~ /\$(.*)/)
#{
# $amazonBook{"price"} = $1;
#}
So clumsy in comparison! Every time Amazon made the price bold or not bold, the whole program would break, so by the end I had four or five different regular expressions for each piece of data I needed to scrape, depending on the whims of Amazon. I particularly love the one that looks for a specific font color in order to find the price. I'm positive regular expressions can be done much better than I did them, but still, using classes to drill down to a small subset of the HTML and then drilling down by element seems like a much better way to go.
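For what it's worth, here is a rough sketch of how the price regexes alone might have been tightened up. It is not what the original script did: `first_match` is a hypothetical helper, the two HTML snippets are made up to mirror the two markup shapes the code above handled, and the single pattern assumes the price always sits inside a `price` span. Unlike the original first branch, it captures only the number, without the currency symbol.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture from the first pattern that matches.
sub first_match {
    my ($html, @patterns) = @_;
    for my $re (@patterns) {
        return $1 if $html =~ $re;
    }
    return;    # nothing matched
}

# One tolerant pattern instead of one per markup variation:
# quotes around the class, the <b> tag, and the "CDN$" prefix are all optional.
my @price_patterns = (
    qr{<span\s+class="?price"?>\s*(?:<b>)?\s*(?:CDN)?\$?\s*([\d.,]+)}i,
);

# Made-up snippets in the two shapes the original code handled.
my $plain = '<span class="price">$12.99</span>';
my $bold  = '<span class=price><b>CDN$ 15.50</b></span>';

print first_match($plain, @price_patterns), "\n";    # 12.99
print first_match($bold,  @price_patterns), "\n";    # 15.50
```

Keeping the patterns in a list means a new markup variation is one more entry rather than another `elsif` branch. It would still break the day the price moved out of that span, which is the argument for a real HTML parser in the first place.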