Create Your Own Dataset: A Step-by-Step Guide to Web Scraping using Selenium and Python for Data Collection and Analysis

Mahesh Jadhav
6 min read · May 13, 2023

The first and most important requirement in the field of data science is data. However, it’s not always possible to find the required datasets on websites like Kaggle or Datahub. If you’re lucky enough to work for a company, you might have access to data through APIs or databases. But most of the time the data is not readily available, and you need to figure out how to create your own dataset from scratch. In such cases, web scraping can be an effective tool for gathering the required data.

Web scraping is a technique that extracts information from websites or web pages. This process uses automated software tools to scan the HTML of a website and gather various types of information, such as text, images, and links. The collected data can be examined, manipulated, and used for a variety of purposes like data mining, market research, and machine learning.

Nowadays, there are many tools available to scrape websites, but some of them do not work when a website generates its content dynamically using JavaScript. This is why Selenium is becoming increasingly popular: it offers customizable configuration and behaves like a real person interacting with the website through a browser.
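Before diving in, here is a minimal sketch of the pattern we will use throughout: Selenium loads the page in a real browser so that JavaScript-generated content is rendered, and BeautifulSoup then parses the resulting HTML. This sketch assumes Selenium 4.6+ (or a chromedriver already on your PATH) and uses a placeholder URL.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")        # run without opening a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")         # placeholder URL; substitute any JavaScript-heavy page
html = driver.page_source                 # HTML after the browser has rendered the page
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)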

Problem statement:

“Conduct an analysis of all the leading IT companies operating in India.”

There are various websites from which you can obtain this data, but we will be using one of the best among them, AmbitionBox. It is a career advisory and job search platform based in India that also provides reviews and details about companies.

Let’s explore the AmbitionBox website using the browser.

[Image: AmbitionBox webpage]

Observing the webpage carefully, let’s see how we can extract the data.

  1. AmbitionBox provides details for approximately 14,588 companies, but for our purposes, we will be focusing mainly on IT companies.
  2. Each company is assigned a separate box, which contains all the details regarding the company.
  3. Each page on the website has around 30 boxes, meaning that one page represents data about 30 companies.
  4. As there are around 333 pages available, we will have access to data on around 9,990 companies (the paginated URL pattern is sketched below).
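As a hypothetical illustration (the exact query parameters are taken from the scraping script further below), the listing is paginated through a page query parameter, so looping from 1 to 333 covers roughly 333 × 30 ≈ 9,990 companies.

BASE_URL = ("https://www.ambitionbox.com/list-of-companies"
            "?IndustryName=it-services-and-consulting&sort_by=popularity&page={}")

urls = [BASE_URL.format(p) for p in range(1, 334)]
print(len(urls))   # 333
print(urls[0])     # first page of the listing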

You can get the code below as a Jupyter notebook here, and the extracted dataset is available here.

Importing all required libraries

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from time import sleep
import re
# Set up the path where we have installed the chromedriver
DRIVER_PATH = r'D:\Applications\Chromium\chromedriver_win32'

# Create a Service object using the specified driver path, which enables
# communication between the Selenium WebDriver and the browser
s = Service(DRIVER_PATH)
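A side note, not part of the original script: on Selenium 4.6 or newer, Selenium Manager can locate (and if needed download) a matching chromedriver automatically, so the explicit path can often be skipped. If the Service above complains about the path, point DRIVER_PATH at the chromedriver executable itself (e.g. chromedriver.exe) rather than its folder.

# Alternative sketch, assuming Selenium >= 4.6: no explicit driver path needed
from selenium import webdriver

driver = webdriver.Chrome()   # the driver binary is resolved by Selenium Manager
driver.quit()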

Extracting the data

In the next code block, we will use Selenium to iterate through all the pages and extract the data using BeautifulSoup along with regular expressions. The extracted data will be appended row by row to a DataFrame. Additionally, we will print the page number after reading each page, so that if any issue occurs we can identify the page where the problem happened.

%%time

# Create an empty DataFrame to collect the scraped rows
company_df = pd.DataFrame(columns=[
    'name', 'rating', 'review', 'tags', 'company_type',
    'headquarters', 'age', 'total_emp', 'about'
])

# Set up options for the Selenium session
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--window-size=1920,1200")

# Iterate over all pages
for p in range(1, 334):
    try:
        driver = webdriver.Chrome(service=s, options=options)
        url = f"https://www.ambitionbox.com/list-of-companies?IndustryName=it-services-and-consulting&sort_by=popularity&page={p}"
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        companies = soup.find_all("div", class_="company-content-wrapper")

        for i in companies:
            name = None
            rating = None
            review = None
            tags = None
            company_type = None
            headquarters = None
            age = None
            total_emp = None
            about = None

            try:
                name = i.find('h2', class_='company-name bold-title-l').text.strip()
                rating = i.find('p', class_='rating').text.strip()
                review = i.find('a', class_='review-count sbold-Labels').text.strip()
                tags = ','.join([t.find('a', class_='ab_chip body-medium').text.strip()
                                 for t in i.find('ul', class_='chips-block')])

                names = i.find('div', class_='company-basic-info')

                for n in names.find_all('p'):
                    try:
                        if n.find('i', class_='icon-domain'): company_type = n.text.strip()
                        if n.find('i', class_='icon-pin-drop'): headquarters = n.text.strip()
                        if n.find('i', class_='icon-access-time'): age = n.text.strip()
                        if n.find('i', class_='icon-supervisor-account'): total_emp = n.text.strip()
                    except:
                        continue

                about = i.find('p', class_='description').text.strip()
            except:
                continue
            finally:
                # Append whatever was collected for this company (missing fields stay None)
                company_df.loc[len(company_df)] = [name, rating, review, tags, company_type,
                                                   headquarters, age, total_emp, about]

        sleep(1)
    except Exception as e:
        print(f"Something went wrong: {e}")
    finally:
        driver.quit()
        print(p, end='|')

print('done')
1|2|3|4|5|6|7|8|9|10| ... |329|330|331|332|333|done
CPU times: total: 31.3 s
Wall time: 44min 14s

Let’s explore the dataset created

company_df.head()

Manipulating the Data

We have obtained the data, but it is not in a format that will be useful for analysis. Therefore, we need to clean the data and retain only the necessary information.
To ensure that we do not lose our original data in case anything goes wrong, we will perform these manipulations on a copy of the original dataset.

df = company_df.copy()

1. review

This column holds numerical data, but pandas has read it in as strings. We will write a function that extracts only the numeric part of each string.

def transform_review(data):
    try:
        if re.search(r'\d+k', data):
            return float(re.search(r'\d+\.\d+|\d+', data).group()) * 1000
        elif re.search(r'\d+', data):
            return float(re.search(r'\d+', data).group())
        else:
            return 0
    except:
        return 0

df['review'] = df['review'].apply(transform_review)
df.head(2)
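A quick sanity check of transform_review on some made-up review strings (the exact wording on the site may differ):

print(transform_review('1.2k Reviews'))   # 1200.0
print(transform_review('874 Reviews'))    # 874.0
print(transform_review('no reviews'))     # 0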

2. headquarters

The headquarters column contains both the headquarters location and the number of offices the company has. We can split this information into two separate columns, “headquarter” and “total_offices”.

  • headquarter

def transform_headquarter(data):
    try:
        return data.split('+')[0].strip()
    except:
        return None

df['headquarter'] = df['headquarters'].apply(transform_headquarter)
  • total_offices

def transform_total_offices(data):
    try:
        if re.search(r'\d+', data):
            data = data.split('+')[1].strip()
            return int(re.search(r'\d+', data).group()) + 1
        else:
            return 1
    except:
        return None

df['total_offices'] = df['headquarters'].apply(transform_total_offices)
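To see both transformations in action, here are some illustrative inputs (made up for the example; the real strings come from the scraped page):

print(transform_headquarter('Mumbai +12 more'))    # Mumbai
print(transform_total_offices('Mumbai +12 more'))  # 13
print(transform_total_offices('Pune'))             # 1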

Now that we have successfully added the separated columns, we can delete the ‘headquarters’ column, which is no longer needed.

del df['headquarters']
df.head(2)

3. age

This column simply shows the company’s age in years, counted from the year of its establishment. We can convert it into the founding year and label it ‘founded_year’ to make it easier to interpret.

def transform_age(data):
    try:
        return int(pd.to_datetime('today').date().year - int(re.search(r'\d+', data).group()))
    except:
        return None

df['founded_year'] = df['age'].apply(transform_age)
del df['age']
df.head(2)
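For example, with a made-up input (the result depends on the year the script is run; here assuming 2023):

print(transform_age('38 years old'))   # 1985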

4. total_emp

Due to the mixture of text and numbers, the data is stored as strings instead of numeric values, so we will convert it into numeric format. Please note that the result is not the exact number of employees; it is an approximation derived from the employee-count range shown for employees working in India.

def transform_total_emp(data):
    try:
        # Extract the numeric values and their units, e.g. '1k-5k Employees'
        values = list(map(float, re.findall(r'\d+\.\d+|\d+', data)))
        units = re.findall(r'Lakh|k', data)
        if len(values) == 0:
            return 0
        elif len(values) == 1:
            if units and units[0] == 'Lakh':
                return values[0] * 100000
            elif units and units[0] == 'k':
                return values[0] * 1000
            else:
                return values[0]
        elif len(values) == 2:
            # Two values describe a range; reduce it to its midpoint
            if len(units) == 0:
                return (values[0] + values[1]) / 2
            elif len(units) == 1:
                return (values[0] + values[1] * 1000) / 2
            elif len(units) == 2:
                if units[0] == units[1] == 'Lakh':
                    return (values[0] + values[1]) / 2 * 100000
                elif units[0] == units[1] == 'k':
                    return (values[0] + values[1]) / 2 * 1000
                else:
                    return ((values[0] * 1000) + (values[1] * 100000)) / 2
        else:
            return 0
    except:
        return None

df['total_emp'] = df['total_emp'].apply(transform_total_emp)
df.head(2)
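A few illustrative employee-count strings (made up for the example) show how the ranges are reduced to a single number:

print(transform_total_emp('51-200 Employees'))   # 125.5   -> midpoint of the range
print(transform_total_emp('1k-5k Employees'))    # 3000.0  -> (1 + 5) / 2 * 1000
print(transform_total_emp('1 Lakh+ Employees'))  # 100000.0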
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9990 entries, 0 to 9989
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   name           9990 non-null   object
 1   rating         9990 non-null   object
 2   review         9990 non-null   float64
 3   tags           9990 non-null   object
 4   company_type   7344 non-null   object
 5   total_emp      9101 non-null   float64
 6   about          6699 non-null   object
 7   headquarter    9171 non-null   object
 8   total_offices  9166 non-null   float64
 9   founded_year   7675 non-null   float64
dtypes: float64(4), object(6)
memory usage: 858.5+ KB

The data has been cleaned successfully. Let’s correct the datatypes of the columns so that the dataset is stored efficiently.

col_dtypes = {
    'name': str,
    'rating': float,
    'review': float,
    'tags': str,
    'company_type': str,
    'total_emp': float,
    'about': str,
    'headquarter': str,
    'total_offices': int,
    'founded_year': int,
}
df = df.astype(col_dtypes, errors='ignore')

Let’s check the entire dataset once again.

df.head()

We have all the necessary details, so now we can save the dataset in the desired format.

df.to_csv('AmbitionBox_Dataset.csv',index=False)
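Optionally, since CSV stores everything as text and loses the dtypes we just corrected, a binary format such as Parquet preserves them. This is not part of the original notebook and assumes pyarrow or fastparquet is installed:

df.to_parquet('AmbitionBox_Dataset.parquet', index=False)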

You can get the original Python Jupyter notebook here, and the extracted dataset is available here.
