Crawling 101: Crawling a static web page

We begin by building our first crawler. The goal is a crawler that takes the address of a static web page as input and stores the page on the computer.

Hands-on: Retrieving your first web page

Here’s our template code for requesting data from APIs. Have a look at it, and then we will go through it block by block. In this example, we will be working with the Internet Archive API, which allows us to look up historical versions of a web page. It is a great way of obtaining historical data; more about this in data sources.

#import the packages we need
import requests

#setup
address="https://en.wikipedia.org/wiki/Main_Page"
url = ('http://web.archive.org/cdx/search/cdx?url=' + address + '&output=json')
save_folder = "downloads\\"
save_name = "wikifrontpage.html"

#do the request
r = requests.get(url)

#save the response to a file (the downloads folder must already exist)
file = open(save_folder + save_name, "w+", encoding="utf-8")
file.write(r.text)
file.close()

First, we import the requests library in a new Python file. The requests library provides functions for sending HTTP requests to web servers and retrieving their responses. That’s all we need for now.

import requests

Next, we will call up the web page we want to obtain. To retrieve data, we can use a GET request. The GET method retrieves data from a URL that you specify. To make a GET request, invoke requests.get(). Let’s make a request.

#do the request
r = requests.get(url)
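
The setup above assembles the query URL by string concatenation. As a minimal sketch of an alternative, the address can be percent-encoded with the standard library before it goes into the query string, so that characters such as & or ? inside the address cannot be mistaken for part of the query. The helper name build_cdx_url and the constant CDX_ENDPOINT are hypothetical and only serve the illustration; the optional limit parameter is assumed to cap the number of rows the Internet Archive returns.

```python
from urllib.parse import urlencode

# Endpoint taken from the setup block above.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(address, limit=None):
    """Return the query URL for `address` (hypothetical helper)."""
    params = {"url": address, "output": "json"}
    if limit is not None:
        # Assumption: `limit` restricts how many rows come back.
        params["limit"] = limit
    # urlencode percent-encodes each value and joins the pairs with "&".
    return CDX_ENDPOINT + "?" + urlencode(params)


url = build_cdx_url("https://en.wikipedia.org/wiki/Main_Page")
print(url)
```

The resulting url can be passed to requests.get(url) exactly as before. requests.get() also accepts a params dictionary, which performs the same encoding for you.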

The response is now stored in the variable r. It contains whatever the server returned for the URL we specified. For an ordinary web page, r.text holds the complete source code of the page; for the Internet Archive query used here, it holds a JSON list of the archived versions.
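
For the Internet Archive query in this example, the response body is JSON: a list of rows in which the first row names the columns and every further row describes one archived capture. The following sketch shows one way to turn such rows into dictionaries and to build the address of an archived copy from them. The function names captures_from_cdx and snapshot_url are hypothetical, and the sample row is made up to show the shape of the data; with a live response you would pass r.json() instead.

```python
def captures_from_cdx(rows):
    """Turn CDX rows (header row first) into a list of dictionaries."""
    if not rows:
        # No captures found: the service returns an empty list.
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Build the Wayback Machine address of one archived capture."""
    return "http://web.archive.org/web/{timestamp}/{original}".format(**capture)


# Illustrative data in the shape of r.json(); the values are invented.
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype",
     "statuscode", "digest", "length"],
    ["org,wikipedia,en)/wiki/main_page", "20200101000000",
     "https://en.wikipedia.org/wiki/Main_Page", "text/html",
     "200", "EXAMPLEDIGEST", "12345"],
]

captures = captures_from_cdx(sample_rows)
print(snapshot_url(captures[0]))
```

Requesting the printed address with requests.get() would then retrieve the HTML of that archived version, which is the content one would typically want to save under a .html name.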

In a final step, we take the response and store it as a file on our computer. There it can rest for later analyses.

#save the response to a file (the downloads folder must already exist)
file = open(save_folder + save_name, "w+", encoding="utf-8")
file.write(r.text)
file.close()
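
The saving step assumes that the target folder already exists and builds the path with a Windows-style backslash. As a hedged alternative, the sketch below uses pathlib, creates the folder when it is missing, and writes the text as UTF-8 so that non-ASCII characters do not depend on the platform's default encoding. The helper name save_page is hypothetical.

```python
from pathlib import Path


def save_page(text, folder, name):
    """Write `text` to folder/name and return the resulting path."""
    target_dir = Path(folder)
    # Create the folder (and any missing parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_text(text, encoding="utf-8")
    return target


# Usage with the response from above (commented out, needs a live request):
# save_page(r.text, "downloads", "wikifrontpage.html")
```

Because Path joins the folder and file name itself, the same call works on Windows, macOS, and Linux.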

That is all there is to say about the basic structure of a crawler.
