Pro tip: Get notified by email when the crawler has done its job

Crawling can take hours or days to complete, and some crawls run on a schedule, say, once a week or even daily. An automated notification system is useful for two reasons. First, it confirms that everything went well. Second, if something goes wrong, we can act immediately to get the crawler up and running again.

In this post, you will learn how to set up an email-based notification system in Python so that you receive a message when the crawl job has finished or has run into an error.

Getting started: Setting up an Email Account

Let’s begin by setting up a throwaway account for our research project and configuring it so that our Python code can access it. A throwaway account is useful because we need to adjust the security settings, which makes the account a little more vulnerable to hackers, and you don’t want to do this with your main email account. In addition, daily notifications quickly fill an inbox with (mostly) uninteresting messages that you may want to keep out of your primary inbox.

We will use Gmail for simplicity, but any other email provider should be perfectly fine as well.

To set up a Gmail account for our project, do the following:

Hands-on: Getting notified when crawling is done

First, some housekeeping. We need to import smtplib, which is part of Python’s standard library:

    import smtplib

Next, let’s set up a connection to the email server. The following block configures the SMTP server and is tailored to Gmail. If you use a different email provider, you need to look up its SMTP address and port.

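    server = smtplib.SMTP('smtp.gmail.com', 587)  # the constructor already connects
    server.ehlo()      # identify ourselves to the server
    server.starttls()  # upgrade the connection to TLS
    server.ehlo()      # identify again over the encrypted connection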
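
If you prefer to keep the connection steps in one place, they can be wrapped in a small function. The following is only a sketch: the function name open_connection and the smtp_factory parameter are our own inventions (the parameter exists so that the function can be tried out without a real mail server), and passing an explicit SSL context to starttls is an optional extra that makes Python verify the server certificate.

```python
import smtplib
import ssl


def open_connection(host="smtp.gmail.com", port=587, smtp_factory=smtplib.SMTP):
    """Open an SMTP connection and upgrade it to TLS before any login."""
    server = smtp_factory(host, port)  # the constructor already connects
    server.ehlo()                      # identify ourselves to the server
    # Upgrade to TLS; the default context verifies the server certificate.
    server.starttls(context=ssl.create_default_context())
    server.ehlo()                      # identify again over the encrypted connection
    return server
```

Calling open_connection() with no arguments uses the Gmail settings from above; for another provider, pass its SMTP address and port.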

Now, we need to specify our email credentials so that Python can log in to Gmail. Make sure not to share this code with anyone while it contains your password. We then log in with the email and password combination.

    email = "" #paste the throwaway email address between the quotes
    password = "" #paste your password between the quotes

    server.login(email, password)
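
One way to avoid having the password in the script at all is to read it from environment variables. This is a minimal sketch, not part of the original setup: the function name load_credentials and the variable names CRAWLER_EMAIL and CRAWLER_EMAIL_PASSWORD are hypothetical, so pick whatever names suit your project.

```python
import os


def load_credentials(env=os.environ):
    """Read the login from environment variables instead of the source file."""
    # CRAWLER_EMAIL and CRAWLER_EMAIL_PASSWORD are hypothetical names.
    email = env.get("CRAWLER_EMAIL", "")
    password = env.get("CRAWLER_EMAIL_PASSWORD", "")
    if not email or not password:
        raise RuntimeError("Set CRAWLER_EMAIL and CRAWLER_EMAIL_PASSWORD first")
    return email, password
```

With this in place, `email, password = load_credentials()` replaces the two hard-coded lines, and the script can be shared without exposing the account.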

Next, we compose the email that will be sent to us once the crawler has finished. For simplicity, we send an email with the subject "Data crawler status report" and the text "Crawling finished successfully!".

    subject = "Data crawler status report"
    body = "Crawling finished successfully!"
    # A blank line separates the header (the subject) from the body.
    msg = "Subject: {}\n\n{}".format(subject, body)
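
Formatting the message by hand works for a short ASCII notice. As an alternative, the standard library’s email.message.EmailMessage class can build the headers for us. The sketch below assumes a helper called build_message (our own name) and uses crawler@example.com as a placeholder address.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build an email with proper From, To and Subject headers."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)  # plain-text body
    return message


# crawler@example.com is a placeholder for the throwaway address.
message = build_message(
    "crawler@example.com",
    "crawler@example.com",
    "Data crawler status report",
    "Crawling finished successfully!",
)
```

A message built this way is sent with server.send_message(message) rather than server.sendmail(...).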

Finally, let’s send the email.

    server.sendmail(email, email, msg)  # from our account, to our account
    server.quit()                       # close the connection
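
So far the email only reports success. To also get a notice when the crawler runs into an error, the crawl can be wrapped in a try/except block. The following is a rough sketch under a few assumptions: run_and_report is a made-up name, crawl stands for whatever function starts your crawler, and send is a hypothetical function taking a subject and a body that wraps the sending steps shown above.

```python
import traceback


def run_and_report(crawl, send):
    """Run crawl() and report success or failure through send(subject, body)."""
    subject = "Data crawler status report"
    try:
        crawl()
    except Exception:
        # Include the traceback so the email shows where the crawl failed.
        send(subject, "Crawling failed:\n\n" + traceback.format_exc())
        return False
    send(subject, "Crawling finished successfully!")
    return True
```

The function returns True or False so that the calling script can decide what to do next, for example whether to retry the crawl.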

Quick-code: All in one

Here’s the full code, ready to copy and paste:

    import smtplib

    server = smtplib.SMTP('smtp.gmail.com', 587)  # the constructor already connects
    server.ehlo()
    server.starttls()
    server.ehlo()

    email = "" # paste the throwaway email address between the quotes
    password = "" # paste your password between the quotes
    server.login(email, password)

    subject = "Data crawler status report"
    body = "Crawling finished successfully!"
    msg = "Subject: {}\n\n{}".format(subject, body)

    server.sendmail(email, email, msg)
    server.quit()
