Getting blocked from crawling can happen very fast. Maybe we exceed a maximum number of requests, maybe we send too many requests within a certain time interval, or maybe the website has simply discovered that we are not a "real" human or that we disobey the rules of friendly crawling.
There are four strategies for not getting blocked:
- Rotating user agents
- Pauses
- Referers
- Proxies
1. User-Agent
The first strategy is to camouflage your crawler. When you send out a request, it contains several additional, often hidden, pieces of information that tell the web server whether the request is coming from a real human or from a crawler. We can tinker with this information to camouflage our requests.
The user-agent is one such piece of information. It tells the server which web browser the request is coming from. This is usually done so that the web server can serve you a customized version of the web page depending on the browser you are using. If we run a crawler, the user-agent is usually missing or does not look like a real browser, and the web server is likely to reject our request.
Therefore, we can avoid getting blocked by setting the user-agent ourselves. Using the requests library, we can do so as follows:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
r = requests.get('https://en.wikipedia.org', headers=headers)
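To see why this matters, you can inspect the user-agent that requests sends when you do not set one yourself; a quick check (the exact version string depends on your installation):

import requests

r = requests.get('https://en.wikipedia.org')
# Without a custom header, requests identifies itself as something like 'python-requests/2.31.0'
print(r.request.headers['User-Agent'])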
What user-agent should you set? The best way is to rotate over a list of real user-agents and pick one at random for each request. You can find a list of user-agents at Udger. Now let's write a function that returns a random user-agent for each request:
import numpy as np  # used further below for picking random pauses
import random

def get_random_user_agent():
    user_agent = ''
    useragent_file = 'ua_file.txt'
    try:
        # Read one user-agent per line, dropping empty lines
        with open(useragent_file) as f:
            user_agents = [line.strip() for line in f if line.strip()]
        # Pick one user-agent at random
        user_agent = random.choice(user_agents)
    except Exception as exception:
        print('Error: ' + str(exception))
    finally:
        return user_agent
The ua_file.txt contains one user-agent per line from Udger. This is an excerpt of the file:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.53 Safari/525.19
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.36 Safari/525.19
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/7.0.540.0 Safari/534.10
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.4 (KHTML, like Gecko) Chrome/6.0.481.0 Safari/534.4
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.86 Safari/533.4
The function get_random_user_agent returns a random user-agent from the file. You can call it right away in your request:
user_agent = get_random_user_agent()
headers = {'user-agent': user_agent}
r = requests.get('https://en.wikipedia.org', headers=headers)
2. Pauses
The second strategy is to make random pauses between your requests. To a monitoring tool, your requests then look less systematic. At least supposedly.
You can make a pause by creating a list of potential pause durations and picking one at random after each request.
import time

pauses = [7, 4, 5, 11]
r = requests.get('https://en.wikipedia.org', headers=headers)
# Wait a randomly chosen number of seconds before sending the next request
pause = np.random.choice(pauses)
time.sleep(pause)
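If you prefer pauses that are not limited to a few fixed values, you can also draw the duration from a continuous range; here is a minimal sketch using random.uniform (the 3 to 12 second range is just an example):

import random
import time

# Draw a pause of between 3 and 12 seconds (example range), so no two pauses are exactly alike
pause = random.uniform(3, 12)
time.sleep(pause)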
3. Referers
The third strategy is to modify the so-called referer (originally a misspelling of referrer, but over time it has established itself as the technical term). The referer is a header field in the request sent to a web page that identifies the web page you are coming from. Thus, by looking at the referer, the web page can see where the request originated.
Referers are usually used for statistical purposes: the web server can track where its visitors are coming from. But referers can also be used to identify crawlers, and we would prefer to stay camouflaged.
We can tinker with the referer to avoid being discovered. This can be done similarly to setting the user-agent:
headers = {
    'user-agent': user_agent,
    'referer': 'http://en.wikipedia.org'
}
There are two options for what URL to put into the referer. First, it is advisable to put the country-specific domain of a search engine into the referer. To the web page, it then looks as if you are a genuine user who searched the web and came across it. For instance, if I am crawling a German web page, I would set the referer to:
headers = {
    'user-agent': user_agent,
    'referer': 'http://www.google.de'
}
Second, if you are crawling data from, say, a product catalogue or a long list, you should set the referer to the overview list you are departing from. For example, if you are crawling all articles in Wikipedia's category of Pulitzer Prize-winning newspapers, then you should set the referer to the link of the category overview:
headers = {
    'user-agent': user_agent,
    'referer': 'https://en.wikipedia.org/wiki/Category:Pulitzer_Prize-winning_newspapers'
}
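Putting the pieces together, a category crawl could look roughly like the following sketch. It assumes the get_random_user_agent function and the pauses list from above, and the two article URLs are only illustrative examples:

import time
import numpy as np
import requests

referer = 'https://en.wikipedia.org/wiki/Category:Pulitzer_Prize-winning_newspapers'
# Illustrative article links from the category overview; replace them with the links you actually want to crawl
article_urls = [
    'https://en.wikipedia.org/wiki/Anchorage_Daily_News',
    'https://en.wikipedia.org/wiki/The_Boston_Globe',
]

for url in article_urls:
    headers = {
        'user-agent': get_random_user_agent(),
        'referer': referer
    }
    r = requests.get(url, headers=headers)
    # Random pause before requesting the next article
    time.sleep(np.random.choice(pauses))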
4. Proxy IP addresses
The above strategies are the first go-to. However, even if you set your user-agent, pause frequently, and set a referer, web pages can still tell whether you are a crawler. This is because they can track one main piece of information: your IP address.
The IP address is something like a street address on the Internet. Every computer on the Internet needs a unique IP address. Sending a request to a web server is like sending mail to the web server's street address. As in real life, the recipient can easily monitor who is sending mail, and if that mail arrives in unusual patterns (e.g., every five seconds for several days), the sender is likely to get blocked.
We can avoid being recognized by our IP address by hiding it behind a proxy. A proxy is nothing more than a computer that forwards our request under its own IP address, thereby hiding that of our crawler.
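As a small preview, the requests library accepts a proxies mapping that routes a request through a proxy; a minimal sketch, assuming a hypothetical proxy reachable at 10.10.1.10:3128:

import requests

# Hypothetical proxy address; replace it with a proxy you actually have access to
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
r = requests.get('https://en.wikipedia.org', headers=headers, proxies=proxies)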
Working with proxy IP addresses is not trivial, and you can find out how in a separate blog post.