Web Scraping Images with Node, Xray, and Download


Node makes scraping images off the web extremely easy using a couple of handy packages: Xray and Download. Simply scrape the img tags, grab all the src attributes, filter out the images you don't want, then hand the rest over to Download to fetch them.
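As a rough illustration of the filtering step, here is a minimal sketch. It assumes the scraped records look like `{ src, width }` with string attribute values; the helper name `selectImageUrls` and the sample URLs are hypothetical and not taken from the lesson.

```javascript
// Hypothetical helper: picks the URLs worth downloading from scraped records.
// Assumes each record looks like { src: '...', width: '120' }.
function selectImageUrls(images, minWidth) {
  return images
    .filter(function (img) {
      // Attribute values are scraped as strings, so parse before comparing.
      var width = parseInt(img.width, 10);
      // Records with no src or no usable width are skipped.
      return Boolean(img.src) && width > minWidth;
    })
    .map(function (img) {
      return img.src;
    });
}

// Made-up sample data standing in for a scrape result.
var scraped = [
  { src: 'http://example.com/logo.png', width: '300' },
  { src: 'http://example.com/spacer.gif', width: '1' },
  { src: 'http://example.com/no-width.png' }
];

console.log(selectImageUrls(scraped, 100));
```

The resulting array of URLs is what would then be passed, one at a time, to the download step.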

Nabil Makhout
~ 9 years ago

Let's say I put my application on an application server. How will things download then? Won't it download the images onto the server? If so, how would I be able to do it on the client's PC?

Paul
~ 9 years ago

Hi. You can't create (or download) files on a client machine from the server because of security restrictions: https://en.wikipedia.org/wiki/JavaScript#Security

Sequoia McDowell
~ 9 years ago

FYI, if you're scraping large files like MP3s rather than small images, you might not want to start the downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this is so you don't accidentally DoS the site if it's a small mom & pop server rather than Google.

A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time" which may work better for you and the site operator.
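To make the idea concrete, here is a minimal hand-rolled sketch of limited parallelism. It is not async's parallelLimit itself; `runWithLimit` is a hypothetical helper, and it assumes each task is a function that returns a promise.

```javascript
// Hypothetical helper: runs task functions with at most `limit` in flight.
// Each task is a function that returns a promise. Resolves with the results
// in task order, or rejects on the first failure.
function runWithLimit(tasks, limit) {
  limit = Math.max(1, limit); // guard against a limit that would never start work
  return new Promise(function (resolve, reject) {
    var results = new Array(tasks.length);
    var next = 0;      // index of the next task to start
    var active = 0;    // number of tasks currently running
    var finished = 0;  // number of tasks completed successfully
    var failed = false;

    if (tasks.length === 0) {
      resolve(results);
      return;
    }

    function startNext() {
      // Start tasks until the limit is reached or none are left.
      while (!failed && active < limit && next < tasks.length) {
        (function (index) {
          active++;
          tasks[index]().then(function (value) {
            results[index] = value;
            active--;
            finished++;
            if (finished === tasks.length) {
              resolve(results);
            } else {
              startNext();
            }
          }, function (err) {
            failed = true; // stop scheduling further tasks
            reject(err);
          });
        })(next++);
      }
    }

    startNext();
  });
}
```

Assuming the download function returns a promise, each URL could be wrapped as `function () { return Download(url, './images'); }` and the resulting array passed in with a limit of 5.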

John Lorrey
~ 8 years ago

Hmm, this Download npm module doesn't seem to want to work in that for loop. I removed the for loop and used the url-download module instead, passing in the whole array to download:

var download = require("url-download");
download(results, './images').on('close', function (err, url) {
  console.log(url + ' has been downloaded.');
});

Hope this helps someone reading this.

izdb
~ 8 years ago

This lesson doesn't appear to work at all anymore. I copied the code and ran it on Node v5.3.0, and it fails with no errors.

{
  "name": "xray-tuts",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "download": "^5.0.2",
    "x-ray": "^2.3.1"
  }
}

Vinny
~ 8 years ago

Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:

// ... ^^ imports and xray config (assumes fs and Download are required there)

(function(err, result) {
  if (err) {
    // Report scrape failures instead of failing silently.
    return console.error(err);
  }

  result.filter(function(img) {
    // Keep only images wider than 100px.
    return img.width > 100;
  })
  .forEach(function(img) {
    // Here is the new download code.
    // Download takes asset url and download destination.
    // forEach is used because the return value is not needed.
    Download(img.src, './images');
  });

  // Write the original return result to JSON file
  fs.writeFile('./results.json', JSON.stringify(result, null, '\t'), function(err) {
    if (err) console.error(err);
  });
});

Alonso Lamas
~ 8 years ago

Thanks Vinny!