Introduction
Web scraping is a powerful technique to extract valuable information from websites. In this blog post, we explore the process of scraping car listings from OLX, focusing on the Tamil Nadu region. We will cover topics such as web scraping, data cleaning, and parsing, providing both code snippets and detailed explanations.
Web Scraping OLX Car Listings
To kickstart our adventure, we use the requests library to fetch the HTML content of OLX's car listings in Tamil Nadu. The BeautifulSoup library helps parse the HTML, and by locating a key marker ("myads") in the response, we narrow the content down to the relevant section.
import requests
from bs4 import BeautifulSoup

url = "https://www.olx.in/tamil-nadu_g2001173/cars_c84/q-cars"
response = requests.get(url)
# Stringify the raw bytes; the later cleaning steps rely on the
# backslash-escaped sequences (e.g. \u002F) this representation keeps
content = str(response.content)
# Narrow the content down to the section starting at the "myads" marker
p = content.find("myads")
content = content[p:]
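The marker-based narrowing can be isolated into a small helper. The sketch below uses a synthetic payload instead of a live request; the function name narrow_to_marker and the sample string are illustrative, not part of the original script:

```python
def narrow_to_marker(content, marker="myads"):
    """Return the content from the first occurrence of marker onward.

    Falls back to the full content if the marker is missing, so a layout
    change on the page does not silently slice everything away.
    """
    p = content.find(marker)
    return content if p == -1 else content[p:]

# Synthetic payload standing in for the fetched page
sample = '<html><head>...</head>myads{"title":"Swift Dzire"}'
print(narrow_to_marker(sample))  # the part starting at "myads"
```

Guarding against a missing marker matters here because content.find() returns -1 when the marker is absent, and slicing with content[-1:] would keep only the last character.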
Cleaning and Extracting Data
The raw HTML content is then saved to a file for reference. Next, we split the content on the "title" keyword to obtain a list of data chunks, where each chunk represents one car listing.
content_list = content.split("title")
dlist = []
for txt in content_list:
    # Re-attach the "title" prefix that split() removed, then clean the chunk
    val = 'title' + txt
    # Replace the escaped "/" sequences (\u002F) with spaces
    val = val.replace("\\u002F", " ")
    # Trim everything from the "spell" key onward; strip stray quotes and commas
    val = val[:val.find("spell")].strip('"').strip(",")
    # Drop the image metadata between the "images" and "package" keys
    val = val.replace(val[val.find("images"):val.find("package")], "")
    # Keep the chunk up to (and including) the closing "]}"
    val = val[:val.find("]}") + 2]
    if len(val) > 2:
        dlist.append(val)
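To see why each chunk maps to one listing, here is a toy illustration of the split step; the sample string is invented for demonstration:

```python
# Every occurrence of "title" starts a new chunk; the text before the
# first occurrence becomes chunk 0, which later filtering discards.
sample = 'junk,"title":"Maruti Swift","title":"Hyundai i10"'
chunks = sample.split("title")
print(len(chunks))   # 3
print(chunks[1])     # '":"Maruti Swift","'
```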
Extracting Relevant Information
With the data chunks in hand, we filter out unwanted entries and extract the relevant details: car titles and prices. A list, final_data, stores this refined information.
final_data = []
for data in dlist:
    # Skip chunks that are clearly not listings
    if ":" not in data: continue
    if 'title"' not in data: continue
    if "OLX" in data: continue
    if "Length" in data: continue
    if "_length" in data: continue
    # Slice out the title text and the raw price value
    title = data[data.find("title")+7 : data.find('","')+1]
    value = data[data.find('"raw":')+6 : data.find(',"currency"')]
    # Keep only listings whose price is purely numeric
    if not value.isdigit(): continue
    final_data.append([title, value])
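Once final_data is populated, a natural next step is persisting it in a structured format. The sketch below writes a CSV with Python's standard csv module; the two sample rows are hypothetical, included only so the snippet runs on its own:

```python
import csv

# Hypothetical rows standing in for the scraped [title, value] pairs
final_data = [["Maruti Swift Dzire 2015", "350000"],
              ["Hyundai i10 2012", "210000"]]

with open("cars.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])  # header row
    writer.writerows(final_data)
```

A CSV file loads directly into pandas or a spreadsheet, which makes the later analysis or visualization step straightforward.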
Parsing and Cleaning HTML Content
To better understand the extracted information, we define a function to clean HTML content and another to identify the starting word of the description.
import re

def cleanhtml(raw_html):
    # Replace every HTML tag with a " || " separator so fields stay delimited
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' || ', raw_html)
    return cleantext

def start_word(txt):
    # word_mix() is a helper defined elsewhere in the project (not shown here)
    sw = ""
    for w in txt.split():
        if word_mix(w) == {',', 'n'}:
            sw = w
            break
    return sw
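Here is what cleanhtml does to a small snippet of listing markup; the sample HTML is invented for illustration, and the function is restated so the snippet runs on its own:

```python
import re

def cleanhtml(raw_html):
    # Every HTML tag becomes a " || " separator, keeping fields delimited
    return re.sub(re.compile('<.*?>'), ' || ', raw_html)

sample = "<p>Single owner</p><p>Petrol</p>"
print(cleanhtml(sample))  # ' || Single owner ||  || Petrol || '
```

Replacing tags with a visible separator (rather than deleting them) preserves the field boundaries, which the parser relies on downstream.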
Parsing Car Descriptions
We create a parser function to process the cleaned HTML content, extracting relevant details. The results are then written to an output file.
with open("output_cars.txt", "w") as f:
    for i in range(len(car_content)):
        # parser() and car_content are defined earlier in the full script
        data = parser(car_content[i])
        f.write(str(data) + "\n")
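The parser itself is not reproduced in this post. As a minimal stand-in sketch (the parser_stub name and its splitting logic are assumptions, not the author's code), one could split the cleaned text on the " || " separators that cleanhtml produces:

```python
def parser_stub(cleaned):
    # Split on the " || " separators emitted by cleanhtml and drop empty fields
    return [field.strip() for field in cleaned.split("||") if field.strip()]

print(parser_stub(' || Single owner ||  || Petrol || '))  # ['Single owner', 'Petrol']
```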
Conclusion
Web scraping is a valuable skill for extracting information from websites. In this journey, we’ve explored the OLX car listings in Tamil Nadu, delving into web scraping, data cleaning, and parsing techniques. By combining these skills, we can transform raw HTML content into structured data for further analysis or visualization.