Create Your Own Dataset: A Step-by-Step Guide to Web Scraping using Selenium and Python for Data Collection and Analysis

Mahesh Jadhav
6 min read · May 13, 2023

The first and most important requirement in the field of data science is data. However, it’s not always possible to find the required datasets on websites like Kaggle or Datahub. If you’re lucky enough to work for a company, you might have access to data through APIs or databases. But most of the time the data is not readily available, and you need to figure out how to create your own dataset from scratch. In such cases, web scraping can be an effective tool for gathering the required data.

Web scraping is a technique that extracts information from websites or web pages. This process uses automated software tools to scan the HTML of a website and gather various types of information, such as text, images, and links. The collected data can be examined, manipulated, and used for a variety of purposes like data mining, market research, and machine learning.

Nowadays, there are many tools available to scrape websites, but some of them do not work when a website generates its content dynamically using JavaScript. This is why Selenium is becoming increasingly popular: it offers customizable configuration and behaves like a real person interacting with the website through a browser.
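Before diving in, here is a minimal sketch of the pattern we will use throughout: Selenium loads the page in a real browser so that JavaScript-generated content is rendered, and BeautifulSoup then parses the resulting HTML. This sketch assumes Selenium 4.6+ (or a chromedriver already on your PATH) and uses a placeholder URL.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")        # run without opening a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")         # placeholder URL; substitute any JavaScript-heavy page
html = driver.page_source                 # HTML after the browser has rendered the page
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)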

Problem statement:

“Conduct an analysis of all the leading IT companies operating in India.”

There are various websites from which you can obtain this data, but we will be using one of the best among them, AmbitionBox. It is a career advisory and job search platform based in India that also provides reviews and details about companies.

Let’s explore the AmbitionBox website using the browser.

[Image: AmbitionBox webpage]

Observing the webpage carefully, let’s see how we can extract the data.

  1. AmbitionBox provides details for approximately 14,588 companies, but for our purposes, we will be focusing mainly on IT companies.
  2. Each company is assigned a separate box, which contains all the details regarding the company.
  3. Each page on the website has around 30 boxes, meaning that one page represents data about 30 companies.
  4. As there are around 333 pages available, we will have access to data on around 9,990 companies (the paginated URL pattern is sketched below).
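As a hypothetical illustration (the exact query parameters are taken from the scraping script further below), the listing is paginated through a page query parameter, so looping from 1 to 333 covers roughly 333 × 30 ≈ 9,990 companies.

BASE_URL = ("https://www.ambitionbox.com/list-of-companies"
            "?IndustryName=it-services-and-consulting&sort_by=popularity&page={}")

urls = [BASE_URL.format(p) for p in range(1, 334)]
print(len(urls))   # 333
print(urls[0])     # first page of the listing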

You can get the code below as a Jupyter notebook here, and the extracted dataset is available here.

Importing all required libraries

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from time import sleep
import re
# Set up the path where we have installed the chromedriver
DRIVER_PATH = r'D:\Applications\Chromium\chromedriver_win32'

# Create a Service object using the specified driver path, which enables
# communication between the Selenium WebDriver and the browser
s = Service(DRIVER_PATH)
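A side note, not part of the original script: on Selenium 4.6 or newer, Selenium Manager can locate (and if needed download) a matching chromedriver automatically, so the explicit path can often be skipped. If the Service above complains about the path, point DRIVER_PATH at the chromedriver executable itself (e.g. chromedriver.exe) rather than its folder.

# Alternative sketch, assuming Selenium >= 4.6: no explicit driver path needed
from selenium import webdriver

driver = webdriver.Chrome()   # the driver binary is resolved by Selenium Manager
driver.quit()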

Extracting the data

In the next code block, we will use Selenium to iterate through all the pages and extract the data using BeautifulSoup along with regular expressions. The extracted data will be appended row by row to a DataFrame. Additionally, we will print the page number after reading each page, so that if any issue occurs we can identify the page where the problem happened.

%%time

# Create an empty DataFrame to collect the scraped rows
company_df = pd.DataFrame(columns=[
    'name', 'rating', 'review', 'tags', 'company_type',
    'headquarters', 'age', 'total_emp', 'about'
])

# Set up options for the Selenium session
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--window-size=1920,1200")

# Iterate over all pages
for p in range(1, 334):
    try:
        driver = webdriver.Chrome(service=s, options=options)
        url = f"https://www.ambitionbox.com/list-of-companies?IndustryName=it-services-and-consulting&sort_by=popularity&page={p}"
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        companies = soup.find_all("div", class_="company-content-wrapper")

        for i in companies:
            name = None
            rating = None
            review = None
            tags = None
            company_type = None
            headquarters = None
            age = None
            total_emp = None
            about = None

            try:
                name = i.find('h2', class_='company-name bold-title-l').text.strip()
                rating = i.find('p', class_='rating').text.strip()
                review = i.find('a', class_='review-count sbold-Labels').text.strip()
                tags = ','.join([t.find('a', class_='ab_chip body-medium').text.strip()
                                 for t in i.find('ul', class_='chips-block')])

                names = i.find('div', class_='company-basic-info')

                for n in names.find_all('p'):
                    try:
                        if n.find('i', class_='icon-domain'): company_type = n.text.strip()
                        if n.find('i', class_='icon-pin-drop'): headquarters = n.text.strip()
                        if n.find('i', class_='icon-access-time'): age = n.text.strip()
                        if n.find('i', class_='icon-supervisor-account'): total_emp = n.text.strip()
                    except:
                        continue

                about = i.find('p', class_='description').text.strip()
            except:
                continue
            finally:
                # Append whatever was collected for this company (missing fields stay None)
                company_df.loc[len(company_df)] = [name, rating, review, tags, company_type,
                                                   headquarters, age, total_emp, about]

        sleep(1)
    except Exception as e:
        print(f"Something went wrong: {e}")
    finally:
        driver.quit()
        print(p, end='|')

print('done')
1|2|3|4|5|6|7|8|9|10| ... |329|330|331|332|333|done
CPU times: total: 31.3 s
Wall time: 44min 14s

Let’s explore the dataset created

company_df.head()

Manipulating the Data

We have obtained the data, but it is not in a format that will be useful for analysis. Therefore, we need to clean the data and retain only the necessary information.
To ensure that we do not lose our original data in case anything goes wrong, we will perform these manipulations on a copy of the original dataset.

df = company_df.copy()

1. review

This column holds numerical data, but pandas has read it in as strings. We will write a function that extracts only the numeric part of each string.

def transform_review(data):
    try:
        if re.search(r'\d+k', data):
            return float(re.search(r'\d+\.\d+|\d+', data).group()) * 1000
        elif re.search(r'\d+', data):
            return float(re.search(r'\d+', data).group())
        else:
            return 0
    except:
        return 0

df['review'] = df['review'].apply(transform_review)
df.head(2)
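A quick sanity check of transform_review on some made-up review strings (the exact wording on the site may differ):

print(transform_review('1.2k Reviews'))   # 1200.0
print(transform_review('874 Reviews'))    # 874.0
print(transform_review('no reviews'))     # 0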

2. headquarters

The headquarters column contains both the headquarters location and the number of offices the company has. We can split this information into two separate columns, “headquarter” and “total_offices”.

  • headquarter

def transform_headquarter(data):
    try:
        return data.split('+')[0].strip()
    except:
        return None

df['headquarter'] = df['headquarters'].apply(transform_headquarter)
  • total_offices

def transform_total_offices(data):
    try:
        if re.search(r'\d+', data):
            data = data.split('+')[1].strip()
            return int(re.search(r'\d+', data).group()) + 1
        else:
            return 1
    except:
        return None

df['total_offices'] = df['headquarters'].apply(transform_total_offices)
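To see both transformations in action, here are some illustrative inputs (made up for the example; the real strings come from the scraped page):

print(transform_headquarter('Mumbai +12 more'))    # Mumbai
print(transform_total_offices('Mumbai +12 more'))  # 13
print(transform_total_offices('Pune'))             # 1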

Now that we have successfully added the separated columns, we can delete the ‘headquarters’ column, which is no longer needed.

del df['headquarters']
df.head(2)

3. age

This column simply shows the company’s age in years, counted from the year of its establishment. We can convert it into the founding year and label it ‘founded_year’ to make it easier to interpret.

def transform_age(data):
    try:
        return int(pd.to_datetime('today').date().year - int(re.search(r'\d+', data).group()))
    except:
        return None

df['founded_year'] = df['age'].apply(transform_age)
del df['age']
df.head(2)
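For example, with a made-up input (the result depends on the year the script is run; here assuming 2023):

print(transform_age('38 years old'))   # 1985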

4. total_emp

Due to the mixture of text and numbers, the data is stored as strings instead of numeric values, so we will convert it into numeric format. Please note that the result is not the exact number of employees; it is an approximation derived from the employee-count range shown for employees working in India.

def transform_total_emp(data):
    try:
        # Extract the numeric values and their units, e.g. '1k-5k Employees'
        values = list(map(float, re.findall(r'\d+\.\d+|\d+', data)))
        units = re.findall(r'Lakh|k', data)
        if len(values) == 0:
            return 0
        elif len(values) == 1:
            if units and units[0] == 'Lakh':
                return values[0] * 100000
            elif units and units[0] == 'k':
                return values[0] * 1000
            else:
                return values[0]
        elif len(values) == 2:
            # Two values describe a range; reduce it to its midpoint
            if len(units) == 0:
                return (values[0] + values[1]) / 2
            elif len(units) == 1:
                return (values[0] + values[1] * 1000) / 2
            elif len(units) == 2:
                if units[0] == units[1] == 'Lakh':
                    return (values[0] + values[1]) / 2 * 100000
                elif units[0] == units[1] == 'k':
                    return (values[0] + values[1]) / 2 * 1000
                else:
                    return ((values[0] * 1000) + (values[1] * 100000)) / 2
        else:
            return 0
    except:
        return None

df['total_emp'] = df['total_emp'].apply(transform_total_emp)
df.head(2)
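A few illustrative employee-count strings (made up for the example) show how the ranges are reduced to a single number:

print(transform_total_emp('51-200 Employees'))   # 125.5   -> midpoint of the range
print(transform_total_emp('1k-5k Employees'))    # 3000.0  -> (1 + 5) / 2 * 1000
print(transform_total_emp('1 Lakh+ Employees'))  # 100000.0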
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9990 entries, 0 to 9989
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   name           9990 non-null   object
 1   rating         9990 non-null   object
 2   review         9990 non-null   float64
 3   tags           9990 non-null   object
 4   company_type   7344 non-null   object
 5   total_emp      9101 non-null   float64
 6   about          6699 non-null   object
 7   headquarter    9171 non-null   object
 8   total_offices  9166 non-null   float64
 9   founded_year   7675 non-null   float64
dtypes: float64(4), object(6)
memory usage: 858.5+ KB

The data has been cleaned successfully. Let’s correct the datatypes of the columns so that the dataset is stored efficiently.

col_dtypes = {
    'name': str,
    'rating': float,
    'review': float,
    'tags': str,
    'company_type': str,
    'total_emp': float,
    'about': str,
    'headquarter': str,
    'total_offices': int,
    'founded_year': int,
}
df = df.astype(col_dtypes, errors='ignore')

Let’s check the entire dataset once again.

df.head()

We have all the necessary details, so now we can save the dataset in the desired format.

df.to_csv('AmbitionBox_Dataset.csv',index=False)
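Optionally, since CSV stores everything as text and loses the dtypes we just corrected, a binary format such as Parquet preserves them. This is not part of the original notebook and assumes pyarrow or fastparquet is installed:

df.to_parquet('AmbitionBox_Dataset.parquet', index=False)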

You can get the original Python Jupyter notebook here, and the extracted dataset is available here.
