Tuesday, July 1, 2014

Throwback Tuesday: Screen scraping the hard way

The much-more-precise Beautiful Soup code I posted reminded me of when I used to screen scrape without really having any idea how. I'm sure HTML tree parsers existed back in 2003, but I had just learned regular expressions in my college courses, and of course I was determined to use them. My friend had just started an online bookselling business, so I wrote him some apps to scrape bulk pricing information from Amazon.
 my $connection = Win32::Internet->new();  
 my $amazon = $connection->FetchURL($amazonBook{"url"} . $amazonBook{"isbn"} . "/");  
 # ...
 if($amazon =~/<title>\s*Amazon\.com: (.*): Books.*<\/title>/)  
      {  
           $amazonBook{"title"} = $1;        
      }  
      elsif($amazon =~/<title>\s*Amazon\.ca: Books: (.*).*<\/title>/)  
      {  
           $amazonBook{"title"} = $1;        
      }  
      elsif($amazon =~/<title>\s*Amazon\.ca: (.*): Books.*<\/title>/)  
      {  
           $amazonBook{"title"} = $1;        
      }  
      elsif($amazon =~/<title>\s*Amazon\.com: (.*): Books.*<\/title>/)  
      {  
           $amazonBook{"title"} = $1;        
      }  
      else  
      {  
           $amazonBook{"status"} = "PARSE ERROR: title";  
           $amazonBook{"error"} = "PARSE ERROR: title";  
      }  
 # ...
 if($amazon=~/<span class="price">(.*)<\/span>/)  
      {  
           $amazonBook{"price"} = $1;        
      }  
      elsif($amazon=~/<span class=price><b>CDN\$ (.*)<\/b><\/span>/)  
      {  
           $amazonBook{"price"} = $1;        
      }  
      #if($amazonBook{"price"} =~/<font color=\#990000>CDN\$ (.*)<\/font>/)  
      #{  
      #     $amazonBook{"price"} = $1;  
      #}  
      #elsif($amazonBook{"price"} =~ /\$(.*)/)  
      #{  
      #     $amazonBook{"price"} = $1;  
      #}  

So clumsy in comparison! Every time Amazon made the price bold or not bold, the whole program broke, so by the end I had four or five different regular expressions for each piece of data I needed to scrape, depending on the whims of Amazon. I particularly love the one that looks for a specific font color to find the price. I'm positive regular expressions can be done much better than I did them, but, still, using classes to drill down to a small subset of the HTML and then drilling down by element seems like a much better way to go.
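
For a rough idea of what drilling down by class looks like, here is a minimal sketch in Python. It uses the standard library's html.parser rather than Beautiful Soup, purely so it runs with no installs. The names PriceParser and extract_price and the sample markup are made up for illustration; this is not Amazon's real page structure. The point is that one class lookup covers the bold, not-bold, and font-color variants that each needed their own regular expression above.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PriceParser(HTMLParser):
    """Collect the text inside the first <span class="price">,
    ignoring whatever formatting tags are nested inside it."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside the price span
        self.chunks = []    # text fragments found inside the span
        self.done = False   # True once the first price span has closed

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            # Nested tag such as <b> or <font> inside the price span.
            self.depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "price" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_price(html):
    """Return the text of the first price span, or None if there is none."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


if __name__ == "__main__":
    # Hypothetical markup variants, loosely modeled on the patterns above.
    samples = [
        '<span class="price">$12.99</span>',
        '<span class=price><b>CDN$ 15.50</b></span>',
        '<span class="price"><font color=#990000>CDN$ 9.00</font></span>',
    ]
    for sample in samples:
        print(extract_price(sample))
```

The returned string still includes the currency prefix, so stripping "CDN$" or "$" would be a separate step. With Beautiful Soup the same idea is usually a one-line lookup by class, but the sketch above sticks to the standard library.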
