It can be useful to block image requests when using puppeteer for web scraping and other activities.
This can help speed up the page load time and reduce the amount of data that needs to be downloaded.
This is especially useful when using a proxy server to scrape pages, as it can reduce the amount of bandwidth used, and therefore reduce the overall cost of your proxy per page scraped.
Using request interception naïvely
The easiest way to block images with puppeteer is to use the built-in request interception feature.
Once request interception is turned on, every request will stall unless it's continued, responded to, or aborted.
Here's a naïve example of blocking image requests with puppeteer (don't use this in production!):
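A minimal sketch of such a naïve handler could look like the following. It is an illustration rather than the original script: the TARGET_URL environment variable, the log message format and the screenshot.png file name are all assumptions.

```javascript
// Naive handler: abort every image request and let everything else through.
// It assumes no other code is intercepting requests on this page.
function blockImagesNaively(request) {
  if (request.resourceType() === 'image') {
    console.log(`Blocking image request: ${request.url()}`); // illustrative log format
    request.abort();
  } else {
    request.continue();
  }
}

async function main(url) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // From here on, every request stalls until a handler resolves it.
    await page.setRequestInterception(true);
    page.on('request', blockImagesNaively);
    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png', fullPage: true }); // assumed file name
  } finally {
    await browser.close();
  }
}

// Usage (assumed convention): TARGET_URL=https://example.com node block-images.js
if (process.env.TARGET_URL) {
  main(process.env.TARGET_URL).catch((error) => {
    console.error(error);
    process.exitCode = 1;
  });
}
```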
When this script is run, the output on the command line will look similar to:
This puppeteer script is blocking all image requests, including data:image/svg+xml images that are loaded via CSS.
This will lead to a great saving in bandwidth.
Let's see the resulting screenshot to show the images that were blocked:
You'll notice that the images look broken because their requests were blocked.
However, the brand logos and some icons are still visible in the screenshot. This is because the brand logos are inline SVG embedded inside the HTML document, and not loaded via an image request.
The trouble with request interception
If you are using puppeteer to do web scraping, it is likely that you're also going to be using a third-party package that also wants to intercept requests.
Examples of third-party packages that hook into puppeteer to intercept requests are:
- adblockers, such as @cliqz/adblocker
- resource blockers, such as puppeteer-extra-plugin-block-resources
If you naïvely handle request interception as above whilst using third-party libraries that also want to intercept requests, you will likely run into problems.
The main problem is that puppeteer will raise a Request is already handled! exception if you try to continue, abort or respond to a request that has already been handled by another library.
Certain requests could also get stalled, causing timeouts when navigating to certain URLs.
Being aware of multiple request interception handlers
These problems can be guarded against by always assuming that you are not the only one intercepting requests.
Let's rewrite the request interception handler above to be more robust:
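A sketch of what the more defensive handler might look like is shown below. The function names blockImages and enableImageBlocking are illustrative, and the page argument is assumed to be a puppeteer Page created elsewhere.

```javascript
// Abort image requests, but only if no other handler has already
// resolved the request.
function blockImages(request) {
  if (request.isInterceptResolutionHandled()) return;

  if (request.resourceType() === 'image') {
    request.abort();
  } else {
    request.continue();
  }
}

// Wiring, assuming `page` is a puppeteer Page created elsewhere.
async function enableImageBlocking(page) {
  await page.setRequestInterception(true);
  page.on('request', blockImages);
}
```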
We added the request.isInterceptResolutionHandled() check to ensure that the request hasn't already been handled by another handler before we handle it.
This prevents us from receiving the Request is already handled! exception and shields us from other bugs.
Multiple async request interception handlers
Handlers can also be async, and the value of request.isInterceptResolutionHandled() is only safe to rely on in the same synchronous code block as the call to request.continue/abort/respond.
If you are awaiting an asynchronous operation as part of your request interception handler, you need to ensure that you always call request.isInterceptResolutionHandled() in the same synchronous context, after the await, before going on to call abort/continue/respond.
Let's see an example of this:
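The following sketch illustrates the pattern under some assumptions: shouldBlock is a hypothetical helper standing in for whatever asynchronous work the handler needs (here it is simulated with a short timer), and blockImagesAsync is an illustrative name.

```javascript
// Hypothetical async decision, standing in for any awaited work
// (a cache read, a database query, a remote blocklist check).
async function shouldBlock(request) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return request.resourceType() === 'image';
}

async function blockImagesAsync(request) {
  const block = await shouldBlock(request);

  // Another handler may have resolved the request while we were awaiting,
  // so check immediately before acting, with no await in between.
  if (request.isInterceptResolutionHandled()) return;

  if (block) {
    await request.abort();
  } else {
    await request.continue();
  }
}

// Registration, assuming `page` is a puppeteer Page with interception enabled:
// page.on('request', blockImagesAsync);
```

Checking only at the top of the handler would not be enough here, because the state of the request can change during the await.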
Using co-operative request intercept mode
Co-operative request interception is a way for multiple libraries to intercept requests in a way where they do not compete with each other and with your own puppeteer script.
Using co-operative request interception requires passing a priority into the request.abort, request.continue or request.respond methods as the second argument.
The default priority is 0, and the method that gets called with the highest priority wins.
However, if any one handler does not pass a priority into the abort/continue/respond methods, then that handler will prevail.
It's therefore important to check the source code of any third-party packages that are doing request interception to ensure they have been updated to use co-operative request interception before relying on it.
Here's an example of how co-operative request interception works:
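A sketch of two competing handlers, using the priorities 1 for continue and 2 for abort described in this section, might look like this. The handler names and the registerHandlers helper are illustrative, and page is assumed to be a puppeteer Page created elsewhere.

```javascript
// First handler: votes to continue the request with priority 1.
function continueHandler(request) {
  if (request.isInterceptResolutionHandled()) return;
  request.continue(request.continueRequestOverrides(), 1);
}

// Second handler: votes to abort the request with priority 2.
function abortHandler(request) {
  if (request.isInterceptResolutionHandled()) return;
  request.abort('failed', 2);
}

// Wiring, assuming `page` is a puppeteer Page created elsewhere.
async function registerHandlers(page) {
  await page.setRequestInterception(true);
  page.on('request', continueHandler);
  page.on('request', abortHandler);
}
```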
In the above example, the second handler will always win, and the request will be aborted because abort's priority of 2 is greater than continue's priority of 1.
Using the Chrome DevTools Protocol directly to block images with puppeteer
It's also possible to drop down a level of abstraction and use the underlying Chrome DevTools Protocol (CDP) to intercept requests.
First, we need to get a handle to a CDP session and enable the Fetch domain. The Fetch domain allows us to substitute the browser's network layer with our own custom code.
When enabling the Fetch domain, we pass in a urlPattern which filters the requests that we want to intercept based on their URL. In our example, we want to intercept all requests, so we use the wildcard * as the urlPattern.
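As a sketch, opening a CDP session and enabling the Fetch domain for all URLs could look like the following. The function name enableFetchInterception is illustrative, and page is assumed to be a puppeteer Page created elsewhere.

```javascript
// Open a CDP session for the page and pause every request it makes.
async function enableFetchInterception(page) {
  // page.createCDPSession() is available in recent puppeteer versions;
  // older ones expose page.target().createCDPSession() instead.
  const client = await page.createCDPSession();

  // '*' matches every URL, so all requests will be paused.
  await client.send('Fetch.enable', {
    patterns: [{ urlPattern: '*' }],
  });

  return client;
}
```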
Next we need to set up an event handler for the Fetch.requestPaused event.
This event gets fired when a request matching the urlPattern is made.
Within this event handler function, we run custom code to decide what to do with each individual request.
As this post is about blocking image requests, we'll check the resourceType of the request and if it's an Image, we can abort the request.
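A sketch of such a handler is shown below. It assumes client is a CDP session on which the Fetch domain has already been enabled; the function name attachImageBlocker and the choice of BlockedByClient as the error reason are illustrative.

```javascript
// Fail paused image requests and let every other request continue.
function attachImageBlocker(client) {
  client.on('Fetch.requestPaused', async ({ requestId, resourceType }) => {
    if (resourceType === 'Image') {
      await client.send('Fetch.failRequest', {
        requestId,
        errorReason: 'BlockedByClient',
      });
    } else {
      // Every paused request must be resolved, otherwise it stalls.
      await client.send('Fetch.continueRequest', { requestId });
    }
  });
}
```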
Running this will give us the same behaviour as the request interception examples above.
We could also check for other resource types here, such as fonts, scripts and stylesheets and decide to block those too, as this will save us extra bandwidth.
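One way to express that decision is a small lookup of CDP resource types, sketched below. The names BLOCKED_RESOURCE_TYPES and shouldBlockResource are illustrative, and the set of types to block is an assumption to adjust per site.

```javascript
// CDP resource type names are capitalised ('Image'), unlike the lowercase
// values returned by puppeteer's request.resourceType() ('image').
const BLOCKED_RESOURCE_TYPES = new Set(['Image', 'Font', 'Script', 'Stylesheet']);

// Returns true when a paused request of this type should be failed.
function shouldBlockResource(resourceType) {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}
```

Inside a Fetch.requestPaused handler, the resourceType === 'Image' comparison would then become shouldBlockResource(resourceType). Blocking scripts or stylesheets can change what the page renders, so it is worth checking that the data you are scraping is still present.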