Block Web Crawlers With Rails
Sometimes you just need to block web crawlers from accessing your web site or web app. In this post we take a look at how to do just that using Rails.
Search engines “crawl” and “index” web content through programs called robots (a.k.a. crawlers or spiders). This may be problematic for our projects in situations such as:
- a staging environment
- migrating data from a legacy system to new locations
- rolling out alpha or beta features
Approaches to blocking crawlers in these scenarios include:
- authentication (best)
- robots.txt (crawling)
- X-Robots-Tag (indexing)
Problem: Duplicate Content
With multiple environments or during a data migration period, duplicate content may be accessible to crawlers. Search engines will have to guess which version to index, assign authority, and rank in query results.
For example, we periodically back up our production data and restore it to the staging environment using Parity:
production backup
staging restore production
Things Search Engines Do
In order to provide results, a search engine may prepare by doing these things:
1. check a domain’s robots settings (e.g. http://example.com/robots.txt)
2. request a web page on the domain (e.g. http://example.com/)
3. check the web page’s X-Robots-Tag HTTP response header
4. cache the web page (saving its response body)
5. index the web page (extract keywords from the response body for fast lookup)
6. follow links on the web page to other web pages and repeat
Steps 1, 2, 3, and 6 are generally “crawling” steps and steps 4 and 5 are generally “indexing” steps.
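The six steps can be walked through with a toy crawler. The sketch below is a simplified simulation, not how any real search engine is implemented: the PAGES hash, the DISALLOWED_PREFIXES list, and the crawl method are all hypothetical, and caching and indexing are collapsed into a single keyword index.

```ruby
require "set"

# Hypothetical in-memory site: path => response headers, body, and outgoing links.
PAGES = {
  "/"       => { headers: {}, body: "welcome home", links: ["/beta", "/admin"] },
  "/beta"   => { headers: { "X-Robots-Tag" => "none" }, body: "secret beta", links: ["/hidden"] },
  "/admin"  => { headers: {}, body: "admin panel", links: [] },
  "/hidden" => { headers: {}, body: "only linked from beta", links: [] },
}.freeze

# Path prefixes that the site's robots.txt is assumed to disallow (step 1).
DISALLOWED_PREFIXES = ["/admin"].freeze

def crawl(start_path)
  index = {}       # keyword => set of paths containing it
  requested = []   # paths the crawler actually fetched
  queue = [start_path]
  seen = Set.new(queue)

  until queue.empty?
    path = queue.shift
    # Step 1: skip anything robots.txt disallows, without requesting it.
    next if DISALLOWED_PREFIXES.any? { |prefix| path.start_with?(prefix) }

    # Step 2: request the page.
    page = PAGES.fetch(path)
    requested << path

    # Step 3: read the X-Robots-Tag response header.
    directives = page[:headers].fetch("X-Robots-Tag", "").split(",").map(&:strip)
    blocked = directives.include?("none")

    # Steps 4 and 5: store keywords unless indexing is forbidden.
    unless blocked || directives.include?("noindex")
      page[:body].split.each { |word| (index[word] ||= Set.new) << path }
    end

    # Step 6: follow links unless following is forbidden.
    next if blocked || directives.include?("nofollow")
    page[:links].each { |link| queue << link if seen.add?(link) }
  end

  { requested: requested, index: index }
end

result = crawl("/")
```

In this run /admin is never requested because of the robots.txt rule, /beta is requested but neither indexed nor followed because of its header, and /hidden is never discovered.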
Solution: Authentication (Best)
The most reliable way to hide content from a crawler is with authentication such as HTTP Basic authentication:
class ApplicationController < ActionController::Base
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
    http_basic_authenticate_with(
      name: ENV.fetch("BASIC_AUTH_USERNAME"),
      password: ENV.fetch("BASIC_AUTH_PASSWORD"),
    )
  end
end
This often is all we need for situations such as a staging environment. The following approaches are more limited but may be more suitable for other situations.
Notice we can control whether crawlers are allowed to access content via config in the environment. We can use Parity again to add configuration to Heroku staging:
staging config:set DISALLOW_ALL_WEB_CRAWLERS=true
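To make the HTTP Basic check less of a black box, here is a rough plain-Ruby sketch of what verifying an Authorization header involves. It is not the code behind http_basic_authenticate_with; the method names basic_auth_header, secure_equal?, and authorized? are hypothetical and exist only for this illustration.

```ruby
# Build the header value a client would send for a username and password.
def basic_auth_header(name, password)
  "Basic " + ["#{name}:#{password}"].pack("m0") # "m0" is Base64 without newlines
end

# Compare every byte so timing does not reveal how many leading bytes matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

# Returns true when the Authorization header carries the expected credentials.
def authorized?(authorization_header, expected_name:, expected_password:)
  scheme, encoded = authorization_header.to_s.split(" ", 2)
  return false unless scheme&.casecmp?("Basic") && encoded

  name, password = encoded.unpack1("m").split(":", 2)
  return false unless name && password

  # Evaluate both comparisons before combining, so neither is skipped.
  name_ok = secure_equal?(name, expected_name)
  password_ok = secure_equal?(password, expected_password)
  name_ok && password_ok
end

header = basic_auth_header("staging", "s3cret")
authorized?(header, expected_name: "staging", expected_password: "s3cret")
```

A crawler sends no Authorization header at all, so it receives a 401 response and never sees the content.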
Solution: robots.txt (Crawling)
The robots exclusion standard helps robots decide what action to take. A robot first looks at the /robots.txt file on the domain before crawling it.
It began as a de facto standard with no standards body behind it (it was only formalized as RFC 9309 in 2022), and compliance is opt-in for robots. Mainstream robots such as Googlebot respect the standard, but bad actors may not.
An example /robots.txt file looks like this:
User-agent: *
Disallow: /
This blocks (i.e. disallows) all content (/) to all crawlers (User-agents). See this list of Google crawlers for examples of user agent tokens.
The original standard does not support globbing or regular expressions in this file, although some crawlers, including Googlebot, accept limited * and $ wildcards. See what can go in it.
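To illustrate how a well-behaved robot might interpret such a file, here is a minimal sketch of prefix-based matching. It makes simplifying assumptions: user agent tokens are compared exactly rather than as substrings, only User-agent and Disallow lines are understood, and wildcards are ignored. The method names parse_robots_txt and allowed? are hypothetical.

```ruby
# Parse robots.txt text into { "user-agent token" => [disallowed path prefixes] }.
def parse_robots_txt(text)
  rules = {}
  agents = []
  in_agent_lines = false

  text.each_line do |line|
    # Drop comments, then split "Field: value" on the first colon.
    field, value = line.sub(/#.*/, "").split(":", 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines share one group; otherwise a new group starts.
      agents = [] unless in_agent_lines
      in_agent_lines = true
      agents << value.downcase
      rules[value.downcase] ||= []
    when "disallow"
      in_agent_lines = false
      # An empty Disallow value means nothing is disallowed.
      agents.each { |agent| rules[agent] << value } unless value.empty?
    else
      in_agent_lines = false
    end
  end
  rules
end

# A path is allowed unless it starts with a disallowed prefix for the agent's
# own group, falling back to the "*" group when the agent has no group.
def allowed?(rules, user_agent, path)
  group = rules.fetch(user_agent.downcase) { rules.fetch("*", []) }
  group.none? { |prefix| path.start_with?(prefix) }
end

rules = parse_robots_txt("User-agent: *\nDisallow: /\n")
allowed?(rules, "Googlebot", "/posts/1") # => false
```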
Add Climate Control to the Gemfile to control environment variables in tests:
gem "climate_control"
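The idea behind the gem is to set environment variables for the duration of a block and restore them afterward, so one test's configuration cannot leak into another. The sketch below shows that idea in plain Ruby; with_env is a hypothetical helper written for this illustration, not the gem's implementation.

```ruby
# Temporarily apply the given environment variables while the block runs.
def with_env(overrides)
  original = overrides.keys.to_h { |key| [key, ENV[key]] }
  overrides.each { |key, value| ENV[key] = value }
  yield
ensure
  # Restore previous values; assigning nil deletes a variable that was unset before.
  original&.each { |key, value| ENV[key] = value }
end

with_env("DISALLOW_ALL_WEB_CRAWLERS" => "true") do
  ENV["DISALLOW_ALL_WEB_CRAWLERS"] # "true" only inside the block
end
```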
In spec/requests/robots_txt_spec.rb:
require "rails_helper"

describe "robots.txt" do
  context "when not blocking all web crawlers" do
    it "allows all crawlers" do
      get "/robots.txt"

      expect(response.code).to eq "404"
      expect(response.headers["X-Robots-Tag"]).to be_nil
    end
  end

  context "when blocking all web crawlers" do
    it "blocks all crawlers" do
      ClimateControl.modify "DISALLOW_ALL_WEB_CRAWLERS" => "true" do
        get "/robots.txt"
      end

      expect(response).to render_template "disallow_all"
      expect(response.headers["X-Robots-Tag"]).to eq "none"
    end
  end
end
Google recommends no robots.txt if we want all our content to be crawled.
In config/routes.rb:
get "/robots.txt" => "robots_txts#show"
In app/controllers/robots_txts_controller.rb:
class RobotsTxtsController < ApplicationController
  def show
    if disallow_all_crawlers?
      render "disallow_all", layout: false, content_type: "text/plain"
    else
      # head replaces render nothing: true, which was removed in Rails 5.1
      head :not_found
    end
  end

  private

  def disallow_all_crawlers?
    ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
  end
end
If we’re using an authentication library such as Clearance site-wide, we’ll want to skip its filter in our controller:
class ApplicationController < ActionController::Base
  before_action :require_login
end

class RobotsTxtsController < ApplicationController
  skip_before_action :require_login
end
Remove the default Rails robots.txt and prepare the custom directory:
rm public/robots.txt
mkdir app/views/robots_txts
In app/views/robots_txts/disallow_all.erb:
User-agent: *
Disallow: /
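Seen from the HTTP side, the route, controller, and template together behave roughly like the Rack-style endpoint below. This is a sketch for illustration only, not a replacement for the Rails code: the constant names are hypothetical, and plain Ruby's to_s.strip.empty? stands in for ActiveSupport's present?.

```ruby
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"

# A Rack-style endpoint is any object that responds to call(env) and returns
# [status, headers, body]. The Rack gem is not needed to define or call one.
ROBOTS_TXT_ENDPOINT = lambda do |_env|
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].to_s.strip.empty?
    # No robots.txt at all: crawlers treat a 404 as "everything is allowed".
    [404, { "content-type" => "text/plain" }, []]
  else
    [200, { "content-type" => "text/plain" }, [DISALLOW_ALL]]
  end
end

status, _headers, body = ROBOTS_TXT_ENDPOINT.call({})
```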
Solution: X-Robots-Tag (Indexing)
It is possible for search engines to index content without crawling it, because websites might link to it. So, our robots.txt technique blocked crawling, but not indexing.
Adding an X-Robots-Tag header to our responses covers the indexing side: well-behaved crawlers that fetch a URL and see the header will not index the content or follow its links. Because it is an HTTP response header, a crawler only sees it on URLs it actually requests.
You may have seen meta tags like this in projects you’ve worked on:
<meta name="robots" content="noindex,nofollow">
The X-Robots-Tag header has the same effect as the robots meta tag, but applies it to all content types in our app (e.g. images, scripts, styles), not only HTML files.
To block robots in our environment, we want a header like this:
X-Robots-Tag: none
The none directive is equivalent to noindex, nofollow. It tells robots not to index the page or follow its links.
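As a small illustration of how a robot might read this header, the sketch below expands a header value into individual directives, treating none as shorthand for noindex and nofollow. The method names are hypothetical, and the sketch ignores the optional user agent prefix that the header can carry.

```ruby
# Expand an X-Robots-Tag (or robots meta tag) value into a list of directives.
def robots_directives(header_value)
  directives = header_value.to_s.downcase.split(",").map(&:strip).reject(&:empty?)
  # "none" is shorthand for "noindex, nofollow".
  directives.flat_map { |d| d == "none" ? %w[noindex nofollow] : [d] }.uniq
end

def indexable?(header_value)
  !robots_directives(header_value).include?("noindex")
end

def followable?(header_value)
  !robots_directives(header_value).include?("nofollow")
end

indexable?("none")     # => false
followable?("noindex") # => true
indexable?(nil)        # => true (no header means no restriction)
```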
In lib/rack_x_robots_tag.rb:
module Rack
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)

      if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
        headers["X-Robots-Tag"] = "none"
      end

      [status, headers, response]
    end
  end
end
In config/application.rb:
require_relative "../lib/rack_x_robots_tag"

module YourAppName
  class Application < Rails::Application
    config.middleware.use Rack::XRobotsTag
  end
end
Our specs will now pass.
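Because a Rack middleware is just an object that wraps another callable, it can also be checked outside Rails. The sketch below repeats the middleware with one assumed change (to_s.empty? in place of ActiveSupport's present?, so it runs in plain Ruby) and wraps a hypothetical inner_app that stands in for the Rails application.

```ruby
module Rack
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)
      # Plain-Ruby stand-in for ENV[...].present?
      headers["X-Robots-Tag"] = "none" unless ENV["DISALLOW_ALL_WEB_CRAWLERS"].to_s.empty?
      [status, headers, response]
    end
  end
end

# Hypothetical downstream app; it builds a fresh headers hash on every call.
inner_app = ->(_env) { [200, { "content-type" => "text/html" }, ["<h1>Staging</h1>"]] }
stack = Rack::XRobotsTag.new(inner_app)

ENV["DISALLOW_ALL_WEB_CRAWLERS"] = "true"
_status, blocked_headers, _body = stack.call({})

ENV.delete("DISALLOW_ALL_WEB_CRAWLERS")
_status, open_headers, _body = stack.call({})
```

With the variable set, blocked_headers includes the X-Robots-Tag entry; with it unset, open_headers contains only what the inner app returned.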
Conclusion
Our environment’s content can be blocked in three different ways from crawling and indexing by web robots that respect the robots exclusion standard (most importantly Google).
Use authentication to entirely hide it, or robots.txt plus the X-Robots-Tag for more granular control.
Published at DZone with permission of Dan Croak, DZone MVB. See the original article here.