Data is the new diamond, and web scraping is a powerful technique for gathering that valuable data from the internet. Today, we will learn how to build a web crawler that scrapes job listings from iimjobs.com.
Part 1: Before we write the code
Pre-requisites:
Web Driver — Version:120.0.6095.0 for my example
Chromium — Version:120.0.6095.0 for my example
Selenium Library
BeautifulSoup4 Library
Note: The versions of Web Driver and Chromium should be the same.
You can download my version of the web driver and the chromium from here: https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html?prefix=Win/1216615/
If you want, you can choose your preferred versions from here:
- https://googlechromelabs.github.io/chrome-for-testing/
- https://vikyd.github.io/download-chromium-history-version/#/
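Since the driver and browser versions need to line up, a quick check can save some debugging later. Here is a minimal sketch that compares only the major version number (the part before the first dot). The helper names are made up for illustration, and treating a matching major version as "close enough" is an assumption; using identical builds, as in my example, is the safest choice.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(driver_version: str, browser_version: str) -> bool:
    """True when both version strings share the same major version."""
    return major_version(driver_version) == major_version(browser_version)


if __name__ == "__main__":
    # The first pair uses the version from the prerequisites above;
    # the second browser version is an arbitrary example of a mismatch.
    print(versions_match("120.0.6095.0", "120.0.6095.0"))
    print(versions_match("120.0.6095.0", "119.0.6045.105"))
```

You can paste in the version strings shown by your own driver and browser to confirm they belong together.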
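You can install Selenium and BeautifulSoup4 using the pip command in the terminal. The script below also imports pandas and NumPy, parses the page with html5lib, and writes an .xlsx file (for which pandas needs openpyxl), so install those as well.
pip install selenium
pip install beautifulsoup4
pip install pandas numpy html5lib openpyxl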
Part 2: Let’s write the code
Create one Python file and give it your desired name.
Now, import the necessary libraries at the top of the file, like this.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import os #This will be used to get the present working directory
import pandas as pd # Pandas for using the DataFrame
import numpy as np # Numpy for making calculations
import time
import datetime
After importing the necessary libraries, it’s time to define the chrome options for the chrome driver.
# Give your path of the Chrome Driver in the CHROMEDRIVER_PATH variable
CHROMEDRIVER_PATH = r'C:\Program Files\chromedriver_win32\chromedriver.exe'
service = Service(CHROMEDRIVER_PATH)
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
# In options.binary_location, give your Chromium path.
chrome_options.binary_location = r"C:\Users\TAMANG\Downloads\Win_1216615_chrome-win\chrome-win\chrome.exe"
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.add_argument('--no-sandbox')
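The two paths above are specific to my machine. If you would rather not edit the script every time, one option is to read them from environment variables and fall back to a default. This is only a sketch: the helper path_from_env and the variable names CHROMEDRIVER_PATH and CHROMIUM_PATH are my own choices here, not something Selenium looks for.

```python
import os


def path_from_env(var_name: str, default: str) -> str:
    """Return the environment variable's value, or the default if it is unset or empty."""
    value = os.environ.get(var_name, "").strip()
    return value or default


# Hypothetical variable names; the defaults are the paths used in this post.
CHROMEDRIVER_PATH = path_from_env(
    "CHROMEDRIVER_PATH",
    r"C:\Program Files\chromedriver_win32\chromedriver.exe",
)
CHROMIUM_PATH = path_from_env(
    "CHROMIUM_PATH",
    r"C:\Users\TAMANG\Downloads\Win_1216615_chrome-win\chrome-win\chrome.exe",
)
```

You would then pass CHROMEDRIVER_PATH to Service(...) and assign CHROMIUM_PATH to chrome_options.binary_location, exactly as before.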
Next, we will write the Chrome Driver code inside the main function.
Before writing it, you first need to understand the structure of the page you are going to scrape. In our case, that is the IIMJobs website.
So, head over to the IIMJobs website and go to the page you want to scrape.

Let’s search for IT jobs

Now, copy the URL of the web page that you are going to scrape and paste it inside driver.get(), as shown in the code below.
We create a DataFrame with the following columns: ‘Job Title’, ‘Experience Reqd’, ‘City’, ‘Date Posted’, ‘URL’, as these are the values we are going to extract from the page.
Now, let us understand how to extract the first value, ‘Job Title’.
First, right-click on the Job Title, then click on the Inspect option.

This opens the browser's developer tools on the Elements panel, where you can study the HTML and pinpoint the tag and class names that identify the element.
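Once you know the tag and class, BeautifulSoup can find the element for you. Here is a small sketch on a made-up piece of HTML so you can try it without opening a browser. Only the class name mrmob5 hidden-xs and the mainContainer id come from the script in this post; the sample markup, job title, and URL are invented for illustration, and the real page will look different.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page structure is more involved.
SAMPLE_HTML = """
<div id="mainContainer">
  <a class="mrmob5 hidden-xs" href="https://example.com/job/1">Software Engineer (3-5 yrs)</a>
</div>
"""

# "html.parser" ships with Python, so this sketch needs no extra parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", id="mainContainer")

# Passing the full class string matches the class attribute exactly.
link = container.find("a", class_="mrmob5 hidden-xs")

title_text = link.get_text(strip=True)  # visible text of the link
job_url = link.get("href")              # value of the href attribute

print(title_text)
print(job_url)
```

The same find() and get() calls work on driver.page_source once the real page is loaded.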

We have also added scrolling code just before we scrape the page, so that more of the data is loaded.
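The script in this post scrolls a fixed number of times. As a variation (not what my script does), here is a sketch that stops early once the page height stops growing. It assumes the page loads more results when you reach the bottom; the function name and its pause and max_rounds parameters are my own. The only thing it needs from the driver is the execute_script method.

```python
import time


def scroll_to_bottom(driver, pause: float = 0.75, max_rounds: int = 20) -> int:
    """Scroll until the page height stops changing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return rounds
```

Call it as scroll_to_bottom(driver) right after driver.get(...) and before reading driver.page_source.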
def main():
    global dff1
    dff1 = pd.DataFrame(columns=['Job Title', 'Experience Reqd', 'City', 'Date Posted', 'URL'])
    driver = webdriver.Chrome(service=service, options=chrome_options)
    driver.get("https://www.iimjobs.com/search/IT-0-0-0-1.html")

    # Scroll to the bottom repeatedly so that more listings are loaded
    scroll = np.arange(1, 20)
    counter = 0
    for _ in scroll:
        driver.execute_script("window.scrollTo(0,(document.body.scrollHeight))")
        time.sleep(0.75)

    soup1 = BeautifulSoup(driver.page_source, 'html5lib')
    results = soup1.find('div', id='mainContainer')
    job_elems1 = results.find_all('div', class_=['col-lg-9 col-md-9 col-sm-8 container pdmobr5', 'col-lg-3 col-md-3 col-sm-4 pdlr0 mtb2 hidden-xs'])
    # print(job_elems1)

    # The matched divs alternate: even ones hold title, experience and URL,
    # odd ones hold date posted and city
    for job_elem1 in job_elems1:
        finding = counter % 2
        if finding == 0:
            try:
                print(counter)
                counter = counter + 1
                # Title
                T_n_E = job_elem1.find('a', class_='mrmob5 hidden-xs')
                before_the_parts = T_n_E.get_text()
                parts = before_the_parts.split("(", 1)
                T = parts[0].strip()
                E = parts[1].strip(")")
                Title = T
                # print(Title)
                # Experience
                Exp = E
                # print(Exp)
                # URL
                U = job_elem1.find('a', class_='mrmob5 hidden-xs').get('href')
                URL = U
            except Exception:
                print("EXCEPTION OCCURRED | COUNTER = " + str(counter))
        else:
            try:
                print(counter)
                counter = counter + 1
                # Date Posted
                D = job_elem1.find('span', class_='gry_txt txt12 original')
                Date = D.text
                print(Date)
                # City
                try:
                    C = job_elem1.find('span')
                    City = C.text.strip()
                    print(City)
                except Exception:
                    City = None
            except Exception:
                print("EXCEPTION OCCURRED | COUNTER = " + str(counter))
        if finding == 1:
            # Both halves of a listing have been read, so store the row
            dff1 = pd.concat([dff1, pd.DataFrame([[Title, Exp, City, Date, URL]], columns=['Job Title', 'Experience Reqd', 'City', 'Date Posted', 'URL'])], ignore_index=True)
            dff1.to_excel("IIMJobsJobListing_BANKING_FINANCE" + str(datetime.date.today()) + ".xlsx", index=False)
            print(dff1)

    # driver.find_element(By.XPATH, '/html/body/div[3]/div[3]/div[9]/div[5]/div/div/div[3]/div/a').click()
    time.sleep(0.5)
    driver.close()

main()  # Calling the main function at the end
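In the script, the job title and the experience come from one piece of link text that is split on the first opening parenthesis. If a title has no parenthesis, that split raises an exception. Here is a small sketch of a more forgiving version. The function name and the sample titles are made up for illustration, and it assumes the text looks like "Title (experience)".

```python
from typing import Optional, Tuple


def split_title_and_experience(text: str) -> Tuple[str, Optional[str]]:
    """Split 'Title (experience)' into its two parts.

    Returns (title, None) when the text has no opening parenthesis.
    """
    title, sep, rest = text.partition("(")
    if not sep:
        return text.strip(), None
    # Drop the closing parenthesis and any surrounding whitespace
    return title.strip(), rest.strip().rstrip(")").strip()


if __name__ == "__main__":
    # Invented sample titles
    print(split_title_and_experience("Software Engineer (3-5 yrs)"))
    print(split_title_and_experience("Data Analyst"))
```

Returning None for the missing experience keeps the row usable instead of skipping the listing.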
I have also written a more complex version that uses multi-threading to scrape several pages at the same time, each in its own browser window. You can go through the code in my GitHub here: https://github.com/SAGAR-TAMANG/web-scraping-iimjobs
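To give a rough idea of how scraping several pages in parallel can look, here is a minimal sketch using Python's built-in thread pool. This is not the code from the repository. The names scrape_many and fake_scrape_page are hypothetical, and the placeholder function stands in for a real scraper that would open its own browser window, scrape one page, and return its rows.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def scrape_many(
    urls: List[str],
    scrape_page: Callable[[str], List[dict]],
    max_workers: int = 4,
) -> List[dict]:
    """Run scrape_page on each URL in a thread pool and flatten the results."""
    rows: List[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        for page_rows in pool.map(scrape_page, urls):
            rows.extend(page_rows)
    return rows


def fake_scrape_page(url: str) -> List[dict]:
    """Placeholder that stands in for a real Selenium-based page scraper."""
    return [{"URL": url, "Job Title": "placeholder"}]


if __name__ == "__main__":
    # Placeholder URLs, not real listing pages
    pages = ["https://example.com/page1", "https://example.com/page2"]
    print(scrape_many(pages, fake_scrape_page, max_workers=2))
```

In a real version, it is generally safer for each call to create and close its own driver rather than sharing one driver between threads.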
A tutorial video on YouTube will be released soon describing this.
Part 3: Keep tinkering, keep learning.









