Rahul Banerjee

#Day24 - How to scrape tables and other use cases of Beautiful Soup Part2

#Day24 - How to scrape tables and other use cases of Beautiful Soup Part2

Subscribe to my newsletter and never miss my upcoming articles

Listen to this article

In yesterday's article, we talked about getting started with Beautiful Soup. We discussed the following functions

  • pretiffy()
  • find()
  • find_all()
  • select() Today we will try to scrape the data in the table of the worldometer website Screen Shot 2021-04-13 at 9.20.54 PM.png

The table has an id "main_table_countries_today". We will use the id to get the table element. Let's talk about the structure of the table

<table>
     <thead>
     </thead>
     <tr>
           <td> </td>
           <td> </td>
           <td> </td>
           .
           .
           .
           .
    </tr>
</table>

Screen Shot 2021-04-13 at 9.38.59 PM.png

"thead" contains the header row ( "Country,Other" , "Total Cases" , "New Cases" .........) . If this seems confusing, let's start actually scraping the elements and see the output

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.worldometers.info/coronavirus/").text

soup = BeautifulSoup(html, features= 'html.parser')

table = soup.select("#main_table_countries_today")[0]

headers = table.find("thead").get_text()

print(headers)

Screen Shot 2021-04-13 at 9.54.33 PM.png

We can use the split() function to break the string into a list of elements.

headers = headers.split("\n")
headers = [header for header in headers if header]
print(headers)

'''
OUTPUT
['#', 'Country,Other', 'TotalCases', 'NewCases', 'TotalDeaths', 
'NewDeaths', 'TotalRecovered', 'NewRecovered', 'ActiveCases',
 'Serious,Critical', 'Tot\xa0Cases/1M pop', 'Deaths/1M pop', 'TotalTests', 'Tests/', 
'1M pop', 'Population', 'Continent', 
'1 Caseevery X ppl1 Deathevery X ppl1 Testevery X ppl']
'''

We split by "/n" and then clean up the data. We remove the empty elements. Now let's try to scrap one of the "tr" elements

num_headers = len(headers)
table_body = table.find("tbody")
rows = table_body.find_all("tr")
for idx,row_element in enumerate(rows[8:]):
  row= row_element.get_text().split("\n")[1:]
  if len(row) != num_headers:
    print("Error!")
    break
print(" No Errors")
'''
OUTPUT
 No Errors
'''
  • We get all the elements
  • We start from element 8 since the row with "USA" is the 8th element in the list.
  • The first element in the row is an empty element and ignore it
  • We put a check to ensure that the length of the row and the headers are the same

Now, we have all the data. The data can be transformed and stored as a list of dictionaries or in a CSV.

How to get attributes of the tags

Let's try to get the href value inside a "a tag".

a_tag = soup.find('a')
print(a_tag)
print(f"Attributes :  {a_tag.__dict__['attrs']}")

'''
OUTPUT
<a class="navbar-brand" href="/"><img border="0" src="/img/worldometers-logo.gif" title="Worldometer"/></a>

Attributes :  {'href': '/', 'class': ['navbar-brand']}
'''

To get the href, we can simply do the following

href = a_tag['href']

Let's try to get the URL of the image inside the "a tag", i.e the value for "src"

img = soup.select("a img")[0]
print(img)
img_src = img['src']
print(f'Src is {img_src}')

'''
OUTPUT
<img border="0" src="/img/worldometers-logo.gif" title="Worldometer"/>
Src is /img/worldometers-logo.gif
'''
#100daysofcode#python#programming-languages#codenewbies#web-scraping
 
Share this