Why is Web Scraping Useful?

Access to relevant information is crucial when we are attempting to solve a problem or improve our understanding. In many cases, this information is already available to us in external or internal datasets and access is not an issue. Sometimes, however, these datasets are not available to us directly, but we can create them ourselves by leveraging web scraping and the wealth of publicly available information on the web. Web scraping is a powerful technique that drastically increases the amount of data available to us by letting us collect publicly available information directly from web pages. If you are interested in learning more about web scraping, follow along below as I build several web scraping scripts using Python.

Tutorial Requirements

Although you will not need previous experience writing Python code, running the scripts will require a Python interpreter and a few third-party libraries that you may have to install: requests, beautifulsoup4 (the current version of Beautiful Soup), and pandas. The scripts also use time, csv (optional), and unicodedata (optional), but these are part of Python's standard library and do not need to be installed separately. I won't be explaining how to install a Python interpreter or any of the libraries, as there are plenty of great resources online that already do that. In addition to the interpreter and libraries, you will also want a text editor with syntax highlighting, such as Atom or Sublime Text.


Tutorial Outline

1. Web Scraping Script Example

2. Addressing Potential Errors

3. Additional Web Scraping Examples


Web Scraping Script Example

This was the first web scraping script that I wrote, and it was inspired by my experience in the SEO (Search Engine Optimization) field. When optimizing web pages for search, it is incredibly important to ensure that the page titles and header tags of key pages are optimized effectively. Page titles and header tags are some of the most basic elements of a web page, but they are quick to update and a good place to start when optimizing. Before writing the seo_snapshot.py script, I knew that I wanted the script to be able to loop through a list of web pages and extract the page title and h1 header tag of each page in the list. In addition, I also wanted to calculate the character count of the page titles in order to identify pages with titles outside of Moz's recommended 50-60 character range.
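To make the character-count check concrete, here is a quick illustration (the example title below is made up, not taken from a real page):

# Hypothetical page title used only to illustrate the character-count idea
example_title = 'Web Scraping Tutorial | Analysis From Data'
print(len(example_title))               # 42 characters
print(50 <= len(example_title) <= 60)   # False - shorter than Moz's recommended 50-60 range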

SEO Snapshot Script: seo_snapshot.py
SEO Snapshot Script (after addressing potential errors): seo_snapshot2.py

Process

After creating the seo_snapshot.py file and opening it in a text editor, you will need to import the three main libraries: requests, pandas, and BeautifulSoup from beautifulsoup4 (imported as bs4). If you have any issues when importing beautifulsoup4, please refer to the Beautiful Soup documentation. Also, at a later time, you can extend this list with the optional standard-library modules csv, time, and unicodedata in order to add functionality to the script.

import requests
from bs4 import BeautifulSoup
import pandas as pd
# import csv
# import time
# import unicodedata

Next, we want to create four lists: a complete list of web pages to crawl, an empty list of page titles, an empty list of page title character counts, and an empty list of h1s. As the creation of the three empty lists is simple and straightforward, we will do that first. Note: the difference between the [square brackets] below and (parentheses) is extremely subtle in this font. If you are unsure about which one is being used in a particular line of code, please copy and paste the lines into your text editor.

title_list = []
title_char_list = []
h1_list = []

As for the creation of the complete list of web pages to crawl, we will include three methods to generate our list: manual list entry, using an Excel file, and using a CSV file. For manual list entry, we will add this comment and line of code:

# If manually entering in pages as a list
page_list = ['http://www.analysisfromdata.com/', 'http://www.analysisfromdata.com/portfolio', 'http://www.analysisfromdata.com/about-me']

Tip: I've noticed that if you use http instead of https in the page_list, the script will work for both secure and non-secure websites, but using https will not work for non-secure websites. This is largely because requests follows redirects by default, so an http URL for a secure site is typically redirected to the https version automatically. For this reason, I've chosen to use the http version of the URLs in the example list despite the example site being secure.
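If you'd like to verify this behavior for yourself, the small standalone check below (not part of seo_snapshot.py) requests the http version of a page and inspects where the request ended up; response_obj.url holds the final URL after any redirects and response_obj.history lists the intermediate redirect responses.

import requests

# Spot check: request the http version of a secure page and see where we end up
response_obj = requests.get('http://www.analysisfromdata.com/')
print(response_obj.url)       # the final URL after any redirects (often the https version)
print(response_obj.history)   # intermediate redirect responses, e.g. [<Response [301]>]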

To create the page list from an Excel or CSV file, we will build a dataframe using read_csv() or read_excel() and then convert the first column of that dataframe into a list using the tolist() method. A few notes: because read_csv() and read_excel() are functions in the pandas library, we have to prepend them with our alias 'pd.'; we reference the first column using index 0 because Python is zero-indexed; all of the lines below are intentionally commented out because in this example we have already built page_list using the manual entry method above; and you will need to replace the '/path/to/file/here/filename.' with the file path and filename for your Excel or CSV file.

# If building the list from an excel file:
# page_df = pd.read_excel('/path/to/file/here/filename.xlsx')
# page_list = page_df[page_df.columns[0]].tolist()

# If building the list from a CSV file:
# page_df = pd.read_csv('/path/to/file/here/filename.csv')
# page_list = page_df[page_df.columns[0]].tolist()
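For reference, here is a minimal sketch of the CSV option in action, assuming a hypothetical file named pages.csv whose first column contains the URLs (the column header doesn't matter because we select the column by position):

import pandas as pd

# 'pages.csv' is a hypothetical file with the URLs in its first column, one per row
page_df = pd.read_csv('pages.csv')
page_list = page_df[page_df.columns[0]].tolist()
print(page_list)   # quick sanity check before crawling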

Now that we have our completed list of pages to scrape, we need to use the requests.get() method to send a GET request to each page. We will pass two arguments to requests.get(): the URL of each page and a headers parameter that will "trick" the server into treating our request as one from a regular web user rather than a Python program. The 'Mozilla/5.0…' User-Agent value below is one of many that you can choose to place in your script, so if you aren't getting good response objects, it may be wise to try other User-Agent values and/or add additional headers. Also, we are going to place our requests.get() call within a for loop to easily iterate through our list of web pages.

for page in page_list:
    # the line below should have a single tab before it
    response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
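If you suspect the server isn't happy with your request, one quick, hedged way to check is to inspect the response itself; the standalone snippet below (not part of the script) prints the status code and body length for a single page.

import requests

test_headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'}
response_obj = requests.get('http://www.analysisfromdata.com/', headers=test_headers)
print(response_obj.status_code)   # 200 means success; 403 or 429 often means the request was blocked
print(len(response_obj.text))     # a near-empty body can also signal a problem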

After obtaining our response object, we will pass its text to BeautifulSoup() to obtain a BeautifulSoup object before attempting to pull out the title and h1 of each page. Note: passing 'html.parser' as the second argument tells Beautiful Soup which parser to use and avoids a warning about relying on a default parser.

    # the line below should have a single tab before it
    soup_obj = BeautifulSoup(response_obj.text, 'html.parser')

We can now use the find() and getText() methods to extract the title and h1 of each page and append each value to the end of its list. Note: we use getText() because we only want the text content of each element, not the surrounding HTML tags.

    # Each line below should have a single tab before it
    title_value = soup_obj.find('title').getText()
    # Adding the line below will help to normalize any odd characters in the title
    # title_value = unicodedata.normalize('NFKD', title_value)
    title_list.append(title_value)
    title_char_value = len(title_value)
    title_char_list.append(title_char_value)
    h1_value = soup_obj.find('h1').getText()
    # Adding the line below will help to normalize any odd characters in the h1
    # h1_value = unicodedata.normalize('NFKD', h1_value)
    h1_list.append(h1_value)
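If find() and getText() are new to you, the short self-contained example below (using an inline HTML snippet instead of a live page) shows what each call returns:

from bs4 import BeautifulSoup

html = '<html><head><title>Example Title</title></head><body><h1>Example H1</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))                  # <title>Example Title</title> - the whole tag
print(soup.find('title').getText())        # 'Example Title' - just the text content
print(len(soup.find('title').getText()))   # 13 - the character count we store in title_char_list
print(soup.find('h1').getText())           # 'Example H1'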

We will eventually be augmenting the for loop above to account for potential errors and odd cases, but before that, let's print our lists and compare them to the live web pages. Tip: Use your browser's "Inspect" or "View Page Source" functions on individual web pages to easily identify the title and h1 tag of that page.

# Now that we are done with our for loop, the remaining lines should have no tabs before them
print(page_list)
print(title_list)
print(title_char_list)
print(h1_list)

If the titles and h1s look good and match what is found on the live web pages, insert a '# ' at the start of each of the print lines above to comment them out. Then, follow the instructions below to combine our lists into a dictionary and convert that dictionary into a DataFrame using pandas.

data = {'Web Pages':page_list, 'Page Title':title_list, 'Page Title Length':title_char_list, 'h1':h1_list}
df = pd.DataFrame(data)
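Before exporting, an optional sanity check is to look at the DataFrame's shape and first few rows:

# Optional check that the DataFrame was assembled as expected
print(df.shape)    # (number of pages, 4) - one row per page, one column per list
print(df.head())   # first few rows: Web Pages, Page Title, Page Title Length, h1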

Now, choose whether to export the df dataframe as an Excel or CSV file by removing the '# ' that precedes the corresponding line of code. After doing so, you will need to replace the '/path/to/file/here/filename.' with a valid file path on your computer and a filename of your choice. Note: I used 'index = False' because I didn't want a column of row numbers and 'header = True' because our dataframe has column names that I want to preserve. Also note that exporting to .xlsx may require an additional library such as openpyxl to be installed.

# If wanting to export dataframe as Excel
# df.to_excel('/path/to/file/here/filename.xlsx', index = False, header = True)

# If wanting to export dataframe as CSV
# df.to_csv('/path/to/file/here/filename.csv', index = False, header = True)

We have now pieced together all of the core lines of code for our seo_snapshot.py web scraping script, and the completed script is included below for review. Note: remember to remove the '# ' from either the df.to_csv() or df.to_excel() line towards the bottom, after inserting the appropriate file path and filename of your choice.

import requests
from bs4 import BeautifulSoup
import pandas as pd
# import csv
# import time
# import unicodedata

title_list = []
title_char_list = []
h1_list = []

# If manually entering in pages as a list
page_list = ['http://www.analysisfromdata.com/', 'http://www.analysisfromdata.com/portfolio', 'http://www.analysisfromdata.com/about-me']

# If building the list from an excel file:
# page_df = pd.read_excel('/path/to/file/here/filename.xlsx')
# page_list = page_df[page_df.columns[0]].tolist()

# If building the list from a CSV file:
# page_df = pd.read_csv('/path/to/file/here/filename.csv')
# page_list = page_df[page_df.columns[0]].tolist()

for page in page_list:
    # the line below should have a single tab before it
    response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
    # the line below should have a single tab before it
    soup_obj = BeautifulSoup(response_obj.text, 'html.parser')
    # Each line below should have a single tab before it
    title_value = soup_obj.find('title').getText()
    # Adding the line below will help to normalize any odd characters in the title
    # title_value = unicodedata.normalize('NFKD', title_value)
    title_list.append(title_value)
    title_char_value = len(title_value)
    title_char_list.append(title_char_value)
    h1_value = soup_obj.find('h1').getText()
    # Adding the line below will help to normalize any odd characters in the h1
    # h1_value = unicodedata.normalize('NFKD', h1_value)
    h1_list.append(h1_value)

# Now that we are done with our for loop, the remaining lines should have no tabs before them
# print(page_list)
# print(title_list)
# print(title_char_list)
# print(h1_list)

data = {'Web Pages':page_list, 'Page Title':title_list, 'Page Title Length':title_char_list, 'h1':h1_list}
df = pd.DataFrame(data)

# If wanting to export dataframe as Excel
# df.to_excel('/path/to/file/here/filename.xlsx', index = False, header = True)

# If wanting to export dataframe as CSV
# df.to_csv('/path/to/file/here/filename.csv', index = False, header = True)


Addressing Potential Errors

We have now successfully built a Python web scraping script that performs its intended function! That being said, our script is quite prone to errors, and we should now try to account for them. The simplest method of addressing errors is to begin using the script and only remedy the errors that we observe firsthand. A better method is to first brainstorm several sources of potential errors, use those sources to test the script, and then create appropriate remedies. After that point, we would put the script to use and address one-off errors as needed. As this is a tutorial, we will use the better method and begin by brainstorming sources of potential errors.

For this brainstorming session, let's start with what we know. We know that the script works when we give it a short list of accessible pages on claimed domains, specifically when those pages have a page title and an h1 header tag. Referring to our script and using what we know, we can identify the following potential sources of errors: (1) our page list references one or more unclaimed domain names that do not correspond to actual websites, (2) one or more of our pages doesn't have a page title, (3) one or more of our pages has more than one page title, (4) one or more of our pages doesn't have an h1 header tag, and (5) one or more of our pages has more than one h1 header tag. Now that we have identified several potential sources of errors, let's try to account for as many of them as possible. To start, change the values in your page_list as shown below and then run the script from the terminal. Tip: alternatively, in the Python shell, you can run the entire script with the command exec(open('/path/to/file/here/seo_snapshot.py').read()).

# Replace the initial page_list definition line (towards the top of the script) with the line below, for testing purposes of (1)
page_list = ['http://www.analysisfromdataaa.com/bad-domain']

When we use the page above in our script, requests.get() is not able to access the page because it references an unclaimed domain that doesn't correspond to an actual website. This raises one type of Request Exception (in this case a connection error, since the domain name cannot be resolved), but we can catch it by adding 'try' and 'except' lines to our for loop.

# Replace the entire for loop with the code below
for page in page_list:
    # the line below should have a single tab before it
    try:
        # Each line below should have two tabs before it
        response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
        soup_obj = BeautifulSoup(response_obj.text, 'html.parser')
        title_value = soup_obj.find('title').getText()
        # Adding the line below will help to normalize any odd characters in the title
        # title_value = unicodedata.normalize('NFKD', title_value)
        title_list.append(title_value)
        title_char_value = len(title_value)
        title_char_list.append(title_char_value)
        h1_value = soup_obj.find('h1').getText()
        # Adding the line below will help to normalize any odd characters in the h1
        # h1_value = unicodedata.normalize('NFKD', h1_value)
        h1_list.append(h1_value)
    # the line below should have a single tab before it
    except requests.exceptions.RequestException as e:
        # Each line below should have two tabs before it
        title_list.append('Error, Req. Excep.')
        title_char_list.append('Error, Req. Excep.')
        h1_list.append('Error, Req. Excep.')

By adding the 'try:' line directly below the 'for page in page_list:' line, we are telling our computer to try all of our existing code in the for loop. The 'except requests.exceptions.RequestException as e:' line then instructs the computer to run the code beneath it only if it observes a Request Exception - a connection error is one type of Request Exception, but it is good to be aware that there are others, such as HTTP errors and timeouts. In the case of a Request Exception, we use 'Error, Req. Excep.' as that page's corresponding entry in each of the lists.
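To see this try/except behavior in isolation, here is a minimal standalone sketch that requests the same intentionally unclaimed URL and catches the resulting exception:

import requests

try:
    requests.get('http://www.analysisfromdataaa.com/bad-domain')
except requests.exceptions.RequestException as e:
    print('Request failed:', type(e).__name__)   # typically ConnectionError for an unresolvable domain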

Now that we have addressed (1), let's try to address the cases of a page lacking a page title (2) or lacking an h1 header tag (4). To simulate a page without a specified element, we will change our original page_list to a single page on this website that contains neither a title nor an h1 header element.

# Replace the current page_list definition line (towards the top of the script) with this line, for testing purposes of (2) and (4)
page_list = ['http://www.analysisfromdata.com/scrape-test']

When we use our new page_list definition and run our script, our script does not finish running and we observe the following printout:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 32, in <module>
AttributeError: 'NoneType' object has no attribute 'getText'

Although you may not see 'line 32' in your traceback, whatever line number you do see is the line where your computer encountered the error. If we review our code, we will see that it is the title_value = soup_obj.find('title').getText() line that is giving us problems. To better understand why, let's take a closer look at our current soup_obj by running the following command directly in our Python shell after observing the error above: soup_obj

When we run 'soup_obj' in our Python shell, we get the following printout:
<html><body><p><strong>Scrape test :)</strong></p></body></html>

We can clearly see that there is neither a page title nor an h1 header tag on the '/scrape-test/' page, and this is what causes the AttributeError on line 32. Rhetorically, how can we get the text content of something that doesn't exist? To address this, we will switch all find() methods to find_all()[0] and add several if/else statements within our for loop. The find_all() method returns a list of all matches, and we add the [0] because we are only concerned with the first match at the moment. We can also wrap find_all() in the len() function to determine whether we found any matches at all: a len(find_all()) value of 0 tells us there were no matches. By comparison, find() returns None when it doesn't find a match, while find_all() returns an empty list, which is much more useful to us. We will use this len(find_all()) combination in our if/else statements to decide whether to append the actual values or an error message to title_list and h1_list.
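The difference is easy to see on the '/scrape-test' markup itself; the short snippet below (reusing the inline HTML from the printout above) compares what find() and find_all() return when there is no h1:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><strong>Scrape test :)</strong></p></body></html>', 'html.parser')
print(soup.find('h1'))            # None - calling .getText() on this raises the AttributeError we saw
print(soup.find_all('h1'))        # [] - an empty list we can safely inspect
print(len(soup.find_all('h1')))   # 0 - "no matches", which our if/else statements can branch on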

Replace the entire for loop again with the code below to address potential sources of errors (2) and (4).

# Replace the entire for loop with the code below
for page in page_list:
    # the line below should have a single tab before it
    try:
        # Each line below should have two tabs before it
        response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
        soup_obj = BeautifulSoup(response_obj.text, 'html.parser')
        if len(soup_obj.find_all('title')) >= 1:
            # Each line below should have three tabs before it
            title_value = soup_obj.find_all('title')[0].getText()
            # Adding the line below will help to normalize any odd characters in the title
            # title_value = unicodedata.normalize('NFKD', title_value)
            title_list.append(title_value)
            title_char_value = len(title_value)
            title_char_list.append(title_char_value)
        else:
            title_list.append('Error, no page title')
            title_char_list.append('Error, no page title')
        if len(soup_obj.find_all('h1')) >= 1:
            h1_value = soup_obj.find_all('h1')[0].getText()
            # Adding the line below will help to normalize any odd characters in the h1
            # h1_value = unicodedata.normalize('NFKD', h1_value)
            h1_list.append(h1_value)
        else:
            h1_list.append('Error, no h1')
    except requests.exceptions.RequestException as e:
        title_list.append('Error, Req. Excep.')
        title_char_list.append('Error, Req. Excep.')
        h1_list.append('Error, Req. Excep.')

We now have to account for cases where a webpage may, for some odd reason, have more than one page title (3) or more than one h1 header tag (5). Neither of these will cause a true error when running our script, but having two page titles or two h1 header tags is bad practice, and it would be good to identify the pages where this occurs. To do this, we will define two additional lists (title_count_list and h1_count_list), add lines of code that append values to these lists based on the len(find_all()) results we added in the previous step, and update our dictionary definition to account for the two additional lists. To test this scenario, we will add a page to our page list that contains two titles and two h1 header tags.
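As a quick illustration of how len(find_all()) surfaces duplicates, the inline snippet below uses made-up markup containing two h1 tags:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1>First heading</h1><h1>Second heading</h1>', 'html.parser')
print(len(soup.find_all('h1')))           # 2 - the count we will store in h1_count_list
print(soup.find_all('h1')[0].getText())   # 'First heading' - we still record only the first h1's text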

# Replace the current page_list definition line (towards the top of the script) with this line, for overall testing purposes
page_list = ['http://www.analysisfromdata.com/scrape-test', 'http://www.analysisfromdata.com/scrape-test-duplicates']

# Defining additional title_count_list and h1_count_list
title_count_list = []
h1_count_list = []

# Replace the entire for loop with the code below
for page in page_list:
    # the line below should have a single tab before it
    try:
        # Each line below should have two tabs before it
        response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
        soup_obj = BeautifulSoup(response_obj.text, 'html.parser')
        if len(soup_obj.find_all('title')) >= 1:
            # Each line below should have three tabs before it
            title_value = soup_obj.find_all('title')[0].getText()
            # Adding the line below will help to normalize any odd characters in the title
            # title_value = unicodedata.normalize('NFKD', title_value)
            title_list.append(title_value)
            title_char_value = len(title_value)
            title_char_list.append(title_char_value)
            title_count_value = len(soup_obj.find_all('title'))
            title_count_list.append(title_count_value)
        else:
            title_list.append('Error, no page title')
            title_char_list.append('Error, no page title')
            title_count_list.append('Error, no page title')
        if len(soup_obj.find_all('h1')) >= 1:
            h1_value = soup_obj.find_all('h1')[0].getText()
            # Adding the line below will help to normalize any odd characters in the h1
            # h1_value = unicodedata.normalize('NFKD', h1_value)
            h1_list.append(h1_value)
            h1_count_value = len(soup_obj.find_all('h1'))
            h1_count_list.append(h1_count_value)
        else:
            h1_list.append('Error, no h1')
            h1_count_list.append('Error, no h1')
    except requests.exceptions.RequestException as e:
        title_list.append('Error, Req. Excep.')
        title_char_list.append('Error, Req. Excep.')
        h1_list.append('Error, Req. Excep.')
        # Appending to the count lists here as well keeps all six lists the same length, which pd.DataFrame(data) requires
        title_count_list.append('Error, Req. Excep.')
        h1_count_list.append('Error, Req. Excep.')

# Updating the dictionary definition line to account for the two new lists: title_count_list and h1_count_list
data = {'Web Pages':page_list, 'Page Title':title_list, 'Number of Page Titles':title_count_list, 'Page Title Length':title_char_list, 'h1':h1_list, 'Number of h1s':h1_count_list}

We have now finished improving our web scraping script by addressing all of the potential errors discussed in this section. Our improved script has been inserted below and is also available at the top of this page under the name "seo_snapshot2.py". Hopefully you found this tutorial useful and maybe now you feel comfortable writing a web scraping script of your own! If you have any questions or suggestions for how to improve the script, please contact me by visiting my LinkedIn page.

import requests
from bs4 import BeautifulSoup
import pandas as pd
# import csv
# import time
# import unicodedata

title_list = []
title_char_list = []
h1_list = []

# Defining additional title_count_list and h1_count_list
title_count_list = []
h1_count_list = []

# Replace the current page_list definition line (towards the top of the script) with this line, for overall testing purposes
page_list = ['http://www.analysisfromdata.com/scrape-test', 'http://www.analysisfromdata.com/scrape-test-duplicates']

# If building the list from an excel file:
# page_df = pd.read_excel('/path/to/file/here/filename.xlsx')
# page_list = page_df[page_df.columns[0]].tolist()

# If building the list from a CSV file:
# page_df = pd.read_csv('/path/to/file/here/filename.csv')
# page_list = page_df[page_df.columns[0]].tolist()

for page in page_list:
    # the line below should have a single tab before it
    try:
        # Each line below should have two tabs before it
        response_obj = requests.get(page, headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'})
        soup_obj = BeautifulSoup(response_obj.text, 'html.parser')
        if len(soup_obj.find_all('title')) >= 1:
            # Each line below should have three tabs before it
            title_value = soup_obj.find_all('title')[0].getText()
            # Adding the line below will help to normalize any odd characters in the title
            # title_value = unicodedata.normalize('NFKD', title_value)
            title_list.append(title_value)
            title_char_value = len(title_value)
            title_char_list.append(title_char_value)
            title_count_value = len(soup_obj.find_all('title'))
            title_count_list.append(title_count_value)
        else:
            title_list.append('Error, no page title')
            title_char_list.append('Error, no page title')
            title_count_list.append('Error, no page title')
        if len(soup_obj.find_all('h1')) >= 1:
            h1_value = soup_obj.find_all('h1')[0].getText()
            # Adding the line below will help to normalize any odd characters in the h1
            # h1_value = unicodedata.normalize('NFKD', h1_value)
            h1_list.append(h1_value)
            h1_count_value = len(soup_obj.find_all('h1'))
            h1_count_list.append(h1_count_value)
        else:
            h1_list.append('Error, no h1')
            h1_count_list.append('Error, no h1')
    except requests.exceptions.RequestException as e:
        title_list.append('Error, Req. Excep.')
        title_char_list.append('Error, Req. Excep.')
        h1_list.append('Error, Req. Excep.')
        # Appending to the count lists here as well keeps all six lists the same length, which pd.DataFrame(data) requires
        title_count_list.append('Error, Req. Excep.')
        h1_count_list.append('Error, Req. Excep.')

# Now that we are done with our for loop, the remaining lines should have no tabs before them
# print(page_list)
# print(title_list)
# print(title_char_list)
# print(h1_list)

# Updating the dictionary definition line to account for the two new lists: title_count_list and h1_count_list
data = {'Web Pages':page_list, 'Page Title':title_list, 'Number of Page Titles':title_count_list, 'Page Title Length':title_char_list, 'h1':h1_list, 'Number of h1s':h1_count_list}
df = pd.DataFrame(data)

# If wanting to export dataframe as Excel
# df.to_excel('/path/to/file/here/filename.xlsx', index = False, header = True)

# If wanting to export dataframe as CSV
# df.to_csv('/path/to/file/here/filename.csv', index = False, header = True)


Additional Web Scraping Examples

Pickem Script: pickem.py
Disc Golf Script (Regular Sequence): discgolf_regular.py
Disc Golf Script (Random Sample): discgolf_random.py
Storage Script: storage.py