Web scraping with Python



If you don’t want to type the code, you can copy the notebook web_scraping.ipynb into your home directory and run the cells directly.

To do this, open a terminal in JupyterLab (open a new launcher and click on the “Terminal” button) and run the following command:

cp ~/projects/def-sponsor00/shared/dhsi/web_scraping.ipynb .

Don’t forget the dot at the end: it tells cp to copy the file into the current directory.

You should then see the web_scraping.ipynb file appear in the left panel. Double-click on it to open it.

Load packages

import requests                 # To download the HTML data from a site
from bs4 import BeautifulSoup   # To parse the HTML data
import pandas as pd             # To store our data in a DataFrame

Get the data

url = "https://www.scrapethissite.com/pages/simple/"
page = requests.get(url)
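
A request can fail (network problem, missing page, server error), in which case page would not hold the HTML we expect. As a minimal sketch, the download could be wrapped in a small helper that sets a timeout and raises an error on a bad status code. The helper name fetch_page and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10.0):
    """Download a page and raise an exception if the request failed.

    The name `fetch_page` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response


# Usage (requires network access):
# page = fetch_page("https://www.scrapethissite.com/pages/simple/")
# print(page.status_code)
```

Without a timeout, requests.get can wait indefinitely on an unresponsive server, which is why the sketch passes one explicitly.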

Explore the raw data

print(page.text[:70])
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">

Parse the data

soup = BeautifulSoup(page.content, "html.parser")

Extract the relevant section of the HTML

test = soup.find('div', attrs={'class': 'col-md-4 country'})
print(test.prettify())
<div class="col-md-4 country">
 <h3 class="country-name">
  <i class="flag-icon flag-icon-ad">
  </i>
  Andorra
 </h3>
 <div class="country-info">
  <strong>
   Capital:
  </strong>
  <span class="country-capital">
   Andorra la Vella
  </span>
  <br/>
  <strong>
   Population:
  </strong>
  <span class="country-population">
   84000
  </span>
  <br/>
  <strong>
   Area (km
   <sup>
    2
   </sup>
   ):
  </strong>
  <span class="country-area">
   468.0
  </span>
  <br/>
 </div>
</div>

Extract information for one country

test_name = test.find('h3', class_="country-name")
test_cap = test.find('span', class_="country-capital")
test_pop = test.find('span', class_="country-population")
test_area = test.find('span', class_="country-area")

Let’s look at our test data:

print(test_name, test_cap, test_pop, test_area)
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3> <span class="country-capital">Andorra la Vella</span> <span class="country-population">84000</span> <span class="country-area">468.0</span>

This is quite ugly …

Let’s remove the HTML markup by extracting only the text:

print(test_name.text, test_cap.text, test_pop.text, test_area.text)
                            Andorra
                         Andorra la Vella 84000 468.0

This is better, but let’s also remove the leading and trailing whitespace:

print(test_name.text.strip(), test_cap.text.strip(), test_pop.text.strip(), test_area.text.strip())
Andorra Andorra la Vella 84000 468.0

Finally a readable result!
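
Beautiful Soup can also strip whitespace while extracting the text, through get_text(strip=True). The sketch below compares both approaches on a small hand-written HTML snippet (an assumption modelled on the output shown earlier, so that the example runs without network access). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating one country block from the page
html = """
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong>
    <span class="country-capital">Andorra la Vella</span><br/>
  </div>
</div>
"""

snippet = BeautifulSoup(html, "html.parser")
name_tag = snippet.find("h3", class_="country-name")
capital_tag = snippet.find("span", class_="country-capital")

# .text keeps the surrounding whitespace, so .strip() is needed afterwards
name_a = name_tag.text.strip()
# get_text(strip=True) strips each piece of text while extracting it
name_b = name_tag.get_text(strip=True)

print(name_a == name_b)
print(name_b, "|", capital_tag.get_text(strip=True))
```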

Extract information for all countries

Let’s collect the HTML blocks for all countries:

data = soup.find_all('div', class_="col-md-4 country")

data is a ResultSet object created by Beautiful Soup. It is a subclass of list, so it can be indexed, sliced, and iterated over in a loop.

type(data)
bs4.element.ResultSet
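
To make the list-like behaviour of a ResultSet concrete, here is a small sketch on hand-written HTML (an assumption, used so that the example runs offline). The names snippet, blocks, and names are illustrative.

```python
from bs4 import BeautifulSoup

# Two blocks with the class string we search for, and one without
html = """
<div class="col-md-4 country"><h3 class="country-name">Andorra</h3></div>
<div class="col-md-4 country"><h3 class="country-name">Anguilla</h3></div>
<div class="other"><h3 class="country-name">Not a country block</h3></div>
"""

snippet = BeautifulSoup(html, "html.parser")
blocks = snippet.find_all("div", class_="col-md-4 country")

print(isinstance(blocks, list))  # a ResultSet is a subclass of list
print(len(blocks))               # number of matching <div> elements

first = blocks[0]                # indexing returns a single Tag
# Slicing and iteration work as with any list
names = [b.find("h3").get_text(strip=True) for b in blocks[:2]]
print(names)
```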

For each element of data (so for each country), we can now get our information:

for country in data[:5]:
    name = country.find('h3', class_="country-name")
    cap = country.find('span', class_="country-capital")
    pop = country.find('span', class_="country-population")
    area = country.find('span', class_="country-area")
    print(name.text.strip(), cap.text.strip(), pop.text.strip(), area.text.strip())
Andorra Andorra la Vella 84000 468.0
United Arab Emirates Abu Dhabi 4975593 82880.0
Afghanistan Kabul 29121286 647500.0
Antigua and Barbuda St. John's 86754 443.0
Anguilla The Valley 13254 102.0
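
The loop above assumes that every country block contains all four elements. find returns None when nothing matches, and calling .text on None raises an AttributeError. On messier pages, a small helper can make the loop more robust. The sketch below is one possible approach: the helper name text_or_none and the made-up "Testland" block are assumptions for illustration only.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first match, or None if absent."""
    element = parent.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


# Hand-written, fictional block that deliberately lacks the population <span>
html = """
<div class="col-md-4 country">
  <h3 class="country-name">Testland</h3>
  <span class="country-capital">Test City</span>
  <span class="country-area">1.0</span>
</div>
"""

block = BeautifulSoup(html, "html.parser").find("div", class_="col-md-4 country")

row = (
    text_or_none(block, "h3", "country-name"),
    text_or_none(block, "span", "country-capital"),
    text_or_none(block, "span", "country-population"),  # missing in this block
    text_or_none(block, "span", "country-area"),
)
print(row)
```

The missing value comes back as None instead of stopping the loop with an error.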

Store results in a DataFrame

First, we store the results in a list:

ls = []

for country in data:
    name = country.find('h3', class_="country-name")
    cap = country.find('span', class_="country-capital")
    pop = country.find('span', class_="country-population")
    area = country.find('span', class_="country-area")
    ls.append((name.text.strip(), cap.text.strip(), pop.text.strip(), area.text.strip()))
type(ls)
list
print(ls)
[('Andorra', 'Andorra la Vella', '84000', '468.0'), ('United Arab Emirates', 'Abu Dhabi', '4975593', '82880.0'), ('Afghanistan', 'Kabul', '29121286', '647500.0'), ('Antigua and Barbuda', "St. John's", '86754', '443.0'), ('Anguilla', 'The Valley', '13254', '102.0'), ('Albania', 'Tirana', '2986952', '28748.0'), ('Armenia', 'Yerevan', '2968000', '29800.0'), ('Angola', 'Luanda', '13068161', '1246700.0'), ('Antarctica', 'None', '0', '1.4E7'), ('Argentina', 'Buenos Aires', '41343201', '2766890.0'), ('American Samoa', 'Pago Pago', '57881', '199.0'), ('Austria', 'Vienna', '8205000', '83858.0'), ('Australia', 'Canberra', '21515754', '7686850.0'), ('Aruba', 'Oranjestad', '71566', '193.0'), ('Åland', 'Mariehamn', '26711', '1580.0'), ('Azerbaijan', 'Baku', '8303512', '86600.0'), ('Bosnia and Herzegovina', 'Sarajevo', '4590000', '51129.0'), ('Barbados', 'Bridgetown', '285653', '431.0'), ('Bangladesh', 'Dhaka', '156118464', '144000.0'), ('Belgium', 'Brussels', '10403000', '30510.0'), ('Burkina Faso', 'Ouagadougou', '16241811', '274200.0'), ('Bulgaria', 'Sofia', '7148785', '110910.0'), ('Bahrain', 'Manama', '738004', '665.0'), ('Burundi', 'Bujumbura', '9863117', '27830.0'), ('Benin', 'Porto-Novo', '9056010', '112620.0'), ('Saint Barthélemy', 'Gustavia', '8450', '21.0'), ('Bermuda', 'Hamilton', '65365', '53.0'), ('Brunei', 'Bandar Seri Begawan', '395027', '5770.0'), ('Bolivia', 'Sucre', '9947418', '1098580.0'), ('Bonaire', 'Kralendijk', '18012', '328.0'), ('Brazil', 'Brasília', '201103330', '8511965.0'), ('Bahamas', 'Nassau', '301790', '13940.0'), ('Bhutan', 'Thimphu', '699847', '47000.0'), ('Bouvet Island', 'None', '0', '49.0'), ('Botswana', 'Gaborone', '2029307', '600370.0'), ('Belarus', 'Minsk', '9685000', '207600.0'), ('Belize', 'Belmopan', '314522', '22966.0'), ('Canada', 'Ottawa', '33679000', '9984670.0'), ('Cocos [Keeling] Islands', 'West Island', '628', '14.0'), ('Democratic Republic of the Congo', 'Kinshasa', '70916439', '2345410.0'), ('Central African Republic', 
'Bangui', '4844927', '622984.0'), ('Republic of the Congo', 'Brazzaville', '3039126', '342000.0'), ('Switzerland', 'Bern', '7581000', '41290.0'), ('Ivory Coast', 'Yamoussoukro', '21058798', '322460.0'), ('Cook Islands', 'Avarua', '21388', '240.0'), ('Chile', 'Santiago', '16746491', '756950.0'), ('Cameroon', 'Yaoundé', '19294149', '475440.0'), ('China', 'Beijing', '1330044000', '9596960.0'), ('Colombia', 'Bogotá', '47790000', '1138910.0'), ('Costa Rica', 'San José', '4516220', '51100.0'), ('Cuba', 'Havana', '11423000', '110860.0'), ('Cape Verde', 'Praia', '508659', '4033.0'), ('Curacao', 'Willemstad', '141766', '444.0'), ('Christmas Island', 'Flying Fish Cove', '1500', '135.0'), ('Cyprus', 'Nicosia', '1102677', '9250.0'), ('Czech Republic', 'Prague', '10476000', '78866.0'), ('Germany', 'Berlin', '81802257', '357021.0'), ('Djibouti', 'Djibouti', '740528', '23000.0'), ('Denmark', 'Copenhagen', '5484000', '43094.0'), ('Dominica', 'Roseau', '72813', '754.0'), ('Dominican Republic', 'Santo Domingo', '9823821', '48730.0'), ('Algeria', 'Algiers', '34586184', '2381740.0'), ('Ecuador', 'Quito', '14790608', '283560.0'), ('Estonia', 'Tallinn', '1291170', '45226.0'), ('Egypt', 'Cairo', '80471869', '1001450.0'), ('Western Sahara', 'Laâyoune / El Aaiún', '273008', '266000.0'), ('Eritrea', 'Asmara', '5792984', '121320.0'), ('Spain', 'Madrid', '46505963', '504782.0'), ('Ethiopia', 'Addis Ababa', '88013491', '1127127.0'), ('Finland', 'Helsinki', '5244000', '337030.0'), ('Fiji', 'Suva', '875983', '18270.0'), ('Falkland Islands', 'Stanley', '2638', '12173.0'), ('Micronesia', 'Palikir', '107708', '702.0'), ('Faroe Islands', 'Tórshavn', '48228', '1399.0'), ('France', 'Paris', '64768389', '547030.0'), ('Gabon', 'Libreville', '1545255', '267667.0'), ('United Kingdom', 'London', '62348447', '244820.0'), ('Grenada', "St. 
George's", '107818', '344.0'), ('Georgia', 'Tbilisi', '4630000', '69700.0'), ('French Guiana', 'Cayenne', '195506', '91000.0'), ('Guernsey', 'St Peter Port', '65228', '78.0'), ('Ghana', 'Accra', '24339838', '239460.0'), ('Gibraltar', 'Gibraltar', '27884', '6.5'), ('Greenland', 'Nuuk', '56375', '2166086.0'), ('Gambia', 'Bathurst', '1593256', '11300.0'), ('Guinea', 'Conakry', '10324025', '245857.0'), ('Guadeloupe', 'Basse-Terre', '443000', '1780.0'), ('Equatorial Guinea', 'Malabo', '1014999', '28051.0'), ('Greece', 'Athens', '11000000', '131940.0'), ('South Georgia and the South Sandwich Islands', 'Grytviken', '30', '3903.0'), ('Guatemala', 'Guatemala City', '13550440', '108890.0'), ('Guam', 'Hagåtña', '159358', '549.0'), ('Guinea-Bissau', 'Bissau', '1565126', '36120.0'), ('Guyana', 'Georgetown', '748486', '214970.0'), ('Hong Kong', 'Hong Kong', '6898686', '1092.0'), ('Heard Island and McDonald Islands', 'None', '0', '412.0'), ('Honduras', 'Tegucigalpa', '7989415', '112090.0'), ('Croatia', 'Zagreb', '4491000', '56542.0'), ('Haiti', 'Port-au-Prince', '9648924', '27750.0'), ('Hungary', 'Budapest', '9982000', '93030.0'), ('Indonesia', 'Jakarta', '242968342', '1919440.0'), ('Ireland', 'Dublin', '4622917', '70280.0'), ('Israel', 'None', '7353985', '20770.0'), ('Isle of Man', 'Douglas', '75049', '572.0'), ('India', 'New Delhi', '1173108018', '3287590.0'), ('British Indian Ocean Territory', 'None', '4000', '60.0'), ('Iraq', 'Baghdad', '29671605', '437072.0'), ('Iran', 'Tehran', '76923300', '1648000.0'), ('Iceland', 'Reykjavik', '308910', '103000.0'), ('Italy', 'Rome', '60340328', '301230.0'), ('Jersey', 'Saint Helier', '90812', '116.0'), ('Jamaica', 'Kingston', '2847232', '10991.0'), ('Jordan', 'Amman', '6407085', '92300.0'), ('Japan', 'Tokyo', '127288000', '377835.0'), ('Kenya', 'Nairobi', '40046566', '582650.0'), ('Kyrgyzstan', 'Bishkek', '5776500', '198500.0'), ('Cambodia', 'Phnom Penh', '14453680', '181040.0'), ('Kiribati', 'Tarawa', '92533', '811.0'), ('Comoros', 
'Moroni', '773407', '2170.0'), ('Saint Kitts and Nevis', 'Basseterre', '51134', '261.0'), ('North Korea', 'Pyongyang', '22912177', '120540.0'), ('South Korea', 'Seoul', '48422644', '98480.0'), ('Kuwait', 'Kuwait City', '2789132', '17820.0'), ('Cayman Islands', 'George Town', '44270', '262.0'), ('Kazakhstan', 'Astana', '15340000', '2717300.0'), ('Laos', 'Vientiane', '6368162', '236800.0'), ('Lebanon', 'Beirut', '4125247', '10400.0'), ('Saint Lucia', 'Castries', '160922', '616.0'), ('Liechtenstein', 'Vaduz', '35000', '160.0'), ('Sri Lanka', 'Colombo', '21513990', '65610.0'), ('Liberia', 'Monrovia', '3685076', '111370.0'), ('Lesotho', 'Maseru', '1919552', '30355.0'), ('Lithuania', 'Vilnius', '2944459', '65200.0'), ('Luxembourg', 'Luxembourg', '497538', '2586.0'), ('Latvia', 'Riga', '2217969', '64589.0'), ('Libya', 'Tripoli', '6461454', '1759540.0'), ('Morocco', 'Rabat', '31627428', '446550.0'), ('Monaco', 'Monaco', '32965', '1.95'), ('Moldova', 'Chişinău', '4324000', '33843.0'), ('Montenegro', 'Podgorica', '666730', '14026.0'), ('Saint Martin', 'Marigot', '35925', '53.0'), ('Madagascar', 'Antananarivo', '21281844', '587040.0'), ('Marshall Islands', 'Majuro', '65859', '181.3'), ('Macedonia', 'Skopje', '2062294', '25333.0'), ('Mali', 'Bamako', '13796354', '1240000.0'), ('Myanmar [Burma]', 'Naypyitaw', '53414374', '678500.0'), ('Mongolia', 'Ulan Bator', '3086918', '1565000.0'), ('Macao', 'Macao', '449198', '254.0'), ('Northern Mariana Islands', 'Saipan', '53883', '477.0'), ('Martinique', 'Fort-de-France', '432900', '1100.0'), ('Mauritania', 'Nouakchott', '3205060', '1030700.0'), ('Montserrat', 'Plymouth', '9341', '102.0'), ('Malta', 'Valletta', '403000', '316.0'), ('Mauritius', 'Port Louis', '1294104', '2040.0'), ('Maldives', 'Malé', '395650', '300.0'), ('Malawi', 'Lilongwe', '15447500', '118480.0'), ('Mexico', 'Mexico City', '112468855', '1972550.0'), ('Malaysia', 'Kuala Lumpur', '28274729', '329750.0'), ('Mozambique', 'Maputo', '22061451', '801590.0'), ('Namibia', 
'Windhoek', '2128471', '825418.0'), ('New Caledonia', 'Noumea', '216494', '19060.0'), ('Niger', 'Niamey', '15878271', '1267000.0'), ('Norfolk Island', 'Kingston', '1828', '34.6'), ('Nigeria', 'Abuja', '154000000', '923768.0'), ('Nicaragua', 'Managua', '5995928', '129494.0'), ('Netherlands', 'Amsterdam', '16645000', '41526.0'), ('Norway', 'Oslo', '5009150', '324220.0'), ('Nepal', 'Kathmandu', '28951852', '140800.0'), ('Nauru', 'Yaren', '10065', '21.0'), ('Niue', 'Alofi', '2166', '260.0'), ('New Zealand', 'Wellington', '4252277', '268680.0'), ('Oman', 'Muscat', '2967717', '212460.0'), ('Panama', 'Panama City', '3410676', '78200.0'), ('Peru', 'Lima', '29907003', '1285220.0'), ('French Polynesia', 'Papeete', '270485', '4167.0'), ('Papua New Guinea', 'Port Moresby', '6064515', '462840.0'), ('Philippines', 'Manila', '99900177', '300000.0'), ('Pakistan', 'Islamabad', '184404791', '803940.0'), ('Poland', 'Warsaw', '38500000', '312685.0'), ('Saint Pierre and Miquelon', 'Saint-Pierre', '7012', '242.0'), ('Pitcairn Islands', 'Adamstown', '46', '47.0'), ('Puerto Rico', 'San Juan', '3916632', '9104.0'), ('Palestine', 'None', '3800000', '5970.0'), ('Portugal', 'Lisbon', '10676000', '92391.0'), ('Palau', 'Melekeok', '19907', '458.0'), ('Paraguay', 'Asunción', '6375830', '406750.0'), ('Qatar', 'Doha', '840926', '11437.0'), ('Réunion', 'Saint-Denis', '776948', '2517.0'), ('Romania', 'Bucharest', '21959278', '237500.0'), ('Serbia', 'Belgrade', '7344847', '88361.0'), ('Russia', 'Moscow', '140702000', '1.71E7'), ('Rwanda', 'Kigali', '11055976', '26338.0'), ('Saudi Arabia', 'Riyadh', '25731776', '1960582.0'), ('Solomon Islands', 'Honiara', '559198', '28450.0'), ('Seychelles', 'Victoria', '88340', '455.0'), ('Sudan', 'Khartoum', '35000000', '1861484.0'), ('Sweden', 'Stockholm', '9828655', '449964.0'), ('Singapore', 'Singapore', '4701069', '692.7'), ('Saint Helena', 'Jamestown', '7460', '410.0'), ('Slovenia', 'Ljubljana', '2007000', '20273.0'), ('Svalbard and Jan Mayen', 'Longyearbyen', 
'2550', '62049.0'), ('Slovakia', 'Bratislava', '5455000', '48845.0'), ('Sierra Leone', 'Freetown', '5245695', '71740.0'), ('San Marino', 'San Marino', '31477', '61.2'), ('Senegal', 'Dakar', '12323252', '196190.0'), ('Somalia', 'Mogadishu', '10112453', '637657.0'), ('Suriname', 'Paramaribo', '492829', '163270.0'), ('South Sudan', 'Juba', '8260490', '644329.0'), ('São Tomé and Príncipe', 'São Tomé', '175808', '1001.0'), ('El Salvador', 'San Salvador', '6052064', '21040.0'), ('Sint Maarten', 'Philipsburg', '37429', '21.0'), ('Syria', 'Damascus', '22198110', '185180.0'), ('Swaziland', 'Mbabane', '1354051', '17363.0'), ('Turks and Caicos Islands', 'Cockburn Town', '20556', '430.0'), ('Chad', "N'Djamena", '10543464', '1284000.0'), ('French Southern Territories', 'Port-aux-Français', '140', '7829.0'), ('Togo', 'Lomé', '6587239', '56785.0'), ('Thailand', 'Bangkok', '67089500', '514000.0'), ('Tajikistan', 'Dushanbe', '7487489', '143100.0'), ('Tokelau', 'None', '1466', '10.0'), ('East Timor', 'Dili', '1154625', '15007.0'), ('Turkmenistan', 'Ashgabat', '4940916', '488100.0'), ('Tunisia', 'Tunis', '10589025', '163610.0'), ('Tonga', "Nuku'alofa", '122580', '748.0'), ('Turkey', 'Ankara', '77804122', '780580.0'), ('Trinidad and Tobago', 'Port of Spain', '1228691', '5128.0'), ('Tuvalu', 'Funafuti', '10472', '26.0'), ('Taiwan', 'Taipei', '22894384', '35980.0'), ('Tanzania', 'Dodoma', '41892895', '945087.0'), ('Ukraine', 'Kiev', '45415596', '603700.0'), ('Uganda', 'Kampala', '33398682', '236040.0'), ('U.S. Minor Outlying Islands', 'None', '0', '0.0'), ('United States', 'Washington', '310232863', '9629091.0'), ('Uruguay', 'Montevideo', '3477000', '176220.0'), ('Uzbekistan', 'Tashkent', '27865738', '447400.0'), ('Vatican City', 'Vatican City', '921', '0.44'), ('Saint Vincent and the Grenadines', 'Kingstown', '104217', '389.0'), ('Venezuela', 'Caracas', '27223228', '912050.0'), ('British Virgin Islands', 'Road Town', '21730', '153.0'), ('U.S. 
Virgin Islands', 'Charlotte Amalie', '108708', '352.0'), ('Vietnam', 'Hanoi', '89571130', '329560.0'), ('Vanuatu', 'Port Vila', '221552', '12200.0'), ('Wallis and Futuna', 'Mata-Utu', '16025', '274.0'), ('Samoa', 'Apia', '192001', '2944.0'), ('Kosovo', 'Pristina', '1800000', '10908.0'), ('Yemen', 'Sanaa', '23495361', '527970.0'), ('Mayotte', 'Mamoudzou', '159042', '374.0'), ('South Africa', 'Pretoria', '49000000', '1219912.0'), ('Zambia', 'Lusaka', '13460305', '752614.0'), ('Zimbabwe', 'Harare', '11651858', '390580.0')]

Then, we create a list with the column names for our DataFrame:

cols = ["Name", "Capital", "Population", "Area"]

Finally, we can create our DataFrame:

df = pd.DataFrame(ls, columns=cols)
type(df)
pandas.core.frame.DataFrame
print(df)
                     Name           Capital Population       Area
0                 Andorra  Andorra la Vella      84000      468.0
1    United Arab Emirates         Abu Dhabi    4975593    82880.0
2             Afghanistan             Kabul   29121286   647500.0
3     Antigua and Barbuda        St. John's      86754      443.0
4                Anguilla        The Valley      13254      102.0
..                    ...               ...        ...        ...
245                 Yemen             Sanaa   23495361   527970.0
246               Mayotte         Mamoudzou     159042      374.0
247          South Africa          Pretoria   49000000  1219912.0
248                Zambia            Lusaka   13460305   752614.0
249              Zimbabwe            Harare   11651858   390580.0

[250 rows x 4 columns]
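
At this point all four columns hold strings, because the values were scraped as text. To compute with the population and area, they would need to be converted to numbers. A minimal sketch using pd.to_numeric on a few rows copied from the list printed earlier (the names sample and df_small are illustrative, not part of the original notebook):

```python
import pandas as pd

# A few rows copied from the scraped list
sample = [
    ("Andorra", "Andorra la Vella", "84000", "468.0"),
    ("Antarctica", "None", "0", "1.4E7"),
    ("Zimbabwe", "Harare", "11651858", "390580.0"),
]
cols = ["Name", "Capital", "Population", "Area"]
df_small = pd.DataFrame(sample, columns=cols)

# Convert the text columns to numbers; scientific notation such as
# '1.4E7' is parsed as well
df_small["Population"] = pd.to_numeric(df_small["Population"])
df_small["Area"] = pd.to_numeric(df_small["Area"])

print(df_small.dtypes)
print(df_small["Population"].sum())
```

The same two conversion lines could be applied to df. Note that a missing capital appears in the scraped data as the string 'None', not as a missing value.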
