Imagescraper: Difference between revisions

Latest revision as of 22:05, 15 December 2009

Use this to immediately pull down all the big pictures from Google Image searches. WARNING: scrapers get out of date when the target site changes - this is currently in need of an update.

Imagescraper in action

This script combs Google's image results for the link to the "full" image, rather than just the thumbnail. As the original full image locations are determined, they are placed in the HTML result. Once you submit a search, you should start to get full images right away in the results. Be careful: simple searches that return a large number of hits can pull down LOTS of large images!
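
The core step, recovering the full-image URL from each result link, might look roughly like the sketch below. It is a Python illustration, not the released Perl code. It assumes each result is an anchor pointing at `/imgres` with the full image in an `imgurl` query parameter and the hosting page in `imgrefurl`; that layout is an assumption about Google's results format and may no longer hold. The names `ImgresLinkParser` and `full_image_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ImgresLinkParser(HTMLParser):
    """Collect full-image URLs from anchors that point at /imgres."""

    def __init__(self):
        super().__init__()
        self.full_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.path != "/imgres":
            return
        # Assumed query layout: imgurl = full image, imgrefurl = hosting page.
        imgurl = parse_qs(parsed.query).get("imgurl")
        if imgurl:
            self.full_images.append(imgurl[0])


def full_image_urls(html):
    """Return the full-image URLs found in one results page, in page order."""
    parser = ImgresLinkParser()
    parser.feed(html)
    return parser.full_images
```

Thumbnails and unrelated links are skipped because only anchors whose path is `/imgres` are inspected.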

There's a secondary function as well. The resulting images are wrapped with HTML that will do an imagescrape of the site where the image is located, if you click on the image itself.
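
A minimal Python sketch of that click-through wrapping follows. The script path `imagescrape.cgi` and the query parameter `site` are hypothetical, since the source does not say how the released Perl script encodes the follow-up scrape. The sketch also assumes "the site where the image is located" means the image URL's own host.

```python
from html import escape
from urllib.parse import urlencode, urlparse


def wrap_image(full_image_url, scraper_path="imagescrape.cgi"):
    """Return an <a><img></a> snippet whose link re-scrapes the image's host."""
    parsed = urlparse(full_image_url)
    # Assumption: the follow-up scrape targets the root of the image's host.
    site = f"{parsed.scheme}://{parsed.netloc}/"
    # Hypothetical parameter name; urlencode percent-encodes the site URL.
    link = f"{scraper_path}?{urlencode({'site': site})}"
    return '<a href="{}"><img src="{}"></a>'.format(
        escape(link, quote=True), escape(full_image_url, quote=True)
    )
```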

The script uses my favorite Perl module, WWW::Mechanize.

Someone asked me to release the perl code under the GPL. It's just a quick sloppy hack, so no promises, but here it is. One word of warning: it's tied to the format of Google's image search and results pages. If they change, the script will need to be updated. That being said, it's been working as-is for a long time now - for years, actually, without change. Also of note: I got into this based on some article discussion somewhere - we stand on the shoulders of giants - keep on sharing! :>

Imagescrape perl code

Just extract the files to a path accessible from your Perl-enabled, Apache-hosted website. If you're using mod_deflate, you'll want to disable it on the results page for better responsiveness with the streaming results; see this post (http://news.thedigitalmachine.com/2008/11/01/mod_deflate-p0wn3d/) for details.

Try out Imagescraper here:

http://thedigitalmachine.com/imagescrape/

@@ Line 1: / Line 1: @@
+Use this to immediately pull down all the big pictures from Google Image searches.  WARNING: scrapers get out of date when the target site changes - this is currently in need of an update.
 <br>
 [[Image:Imagescrape sample.jpg|center|frame|none|Imagescraper in action]]
@@ Line 9: / Line 11: @@
 The script uses my favorite Perl module, [http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm WWW::Mechanize].
-Someone asked me to release the perl code under the GPL.  It's just a quick sloppy hack, so no promises, but here it is.  One word of warning: it's pretty brittle, it's tied to the exact format of Google's image search and results pages.  If they change, the script will need to be updated.  Also of note: I got into this based on some article discussion somewhere - we stand on the shoulders of giants - keep on sharing!  :>
+Someone asked me to release the perl code under the GPL.  It's just a quick sloppy hack, so no promises, but here it is.  One word of warning: it's tied to the format of Google's image search and results pages.  If they change, the script will need to be updated.  That being said, it's been working as-is for a long time now - for years, actually, without change.  Also of note: I got into this based on some article discussion somewhere - we stand on the shoulders of giants - keep on sharing!  :>
 [http://thedigitalmachine.com/files/imagescrape.tar.gz Imagescrape perl code]
+Just extract the files to a path accessible from your Perl-enabled apache-hosted website.  If you're using [mod_deflate], for better responsiveness with the streaming results, you'll want to disable it on the results page; see [http://news.thedigitalmachine.com/2008/11/01/mod_deflate-p0wn3d/ this post] for details.
 Try out Imagescraper here:
 http://thedigitalmachine.com/imagescrape/