Introduction
Web scraping is a powerful technique to extract valuable information from websites. In this blog post, we explore the process of scraping car listings from OLX, focusing on the Tamil Nadu region. We will cover topics such as web scraping, data cleaning, and parsing, providing both code snippets and detailed explanations.
Web Scraping OLX Car Listings
To kickstart our adventure, we use the requests library to fetch the HTML content of OLX's car listings in Tamil Nadu. The BeautifulSoup library helps parse the HTML, and by locating a key marker ("myads") in the response, we narrow the content down to the relevant section.
import requests
from bs4 import BeautifulSoup

url = "https://www.olx.in/tamil-nadu_g2001173/cars_c84/q-cars"
response = requests.get(url)
# Stringify the raw bytes; the later cleaning steps rely on the
# backslash-escaped sequences (e.g. \u002F) this representation keeps
content = str(response.content)
# Narrow the content down to the section starting at the "myads" marker
p = content.find("myads")
content = content[p:]
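The marker-based narrowing can be isolated into a small helper. The sketch below uses a synthetic payload instead of a live request; the function name narrow_to_marker and the sample string are illustrative, not part of the original script:

```python
def narrow_to_marker(content, marker="myads"):
    """Return the content from the first occurrence of marker onward.

    Falls back to the full content if the marker is missing, so a layout
    change on the page does not silently slice everything away.
    """
    p = content.find(marker)
    return content if p == -1 else content[p:]

# Synthetic payload standing in for the fetched page
sample = '<html><head>...</head>myads{"title":"Swift Dzire"}'
print(narrow_to_marker(sample))  # the part starting at "myads"
```

Guarding against a missing marker matters here because content.find() returns -1 when the marker is absent, and slicing with content[-1:] would keep only the last character.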
Cleaning and Extracting Data
The raw HTML content is then saved to a file for reference. Next, we split the content on the "title" keyword to obtain a list of data chunks, where each chunk represents one car listing.
content_list = content.split("title")
dlist = []
for txt in content_list:
    # Re-attach the "title" prefix that split() removed, then clean the chunk
    val = 'title' + txt
    # Replace the escaped "/" sequences (\u002F) with spaces
    val = val.replace("\\u002F", " ")
    # Trim everything from the "spell" key onward; strip stray quotes and commas
    val = val[:val.find("spell")].strip('"').strip(",")
    # Drop the image metadata between the "images" and "package" keys
    val = val.replace(val[val.find("images"):val.find("package")], "")
    # Keep the chunk up to (and including) the closing "]}"
    val = val[:val.find("]}") + 2]
    if len(val) > 2:
        dlist.append(val)
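To see why each chunk maps to one listing, here is a toy illustration of the split step; the sample string is invented for demonstration:

```python
# Every occurrence of "title" starts a new chunk; the text before the
# first occurrence becomes chunk 0, which later filtering discards.
sample = 'junk,"title":"Maruti Swift","title":"Hyundai i10"'
chunks = sample.split("title")
print(len(chunks))   # 3
print(chunks[1])     # '":"Maruti Swift","'
```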
Extracting Relevant Information
With the data chunks in hand, we filter out unwanted entries and extract the relevant details: car titles and prices. A list, final_data, stores this refined information.
final_data = []
for data in dlist:
    # Skip chunks that are clearly not listings
    if ":" not in data: continue
    if 'title"' not in data: continue
    if "OLX" in data: continue
    if "Length" in data: continue
    if "_length" in data: continue
    # Slice out the title text and the raw price value
    title = data[data.find("title")+7 : data.find('","')+1]
    value = data[data.find('"raw":')+6 : data.find(',"currency"')]
    # Keep only listings whose price is purely numeric
    if not value.isdigit(): continue
    final_data.append([title, value])
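Once final_data is populated, a natural next step is persisting it in a structured format. The sketch below writes a CSV with Python's standard csv module; the two sample rows are hypothetical, included only so the snippet runs on its own:

```python
import csv

# Hypothetical rows standing in for the scraped [title, value] pairs
final_data = [["Maruti Swift Dzire 2015", "350000"],
              ["Hyundai i10 2012", "210000"]]

with open("cars.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])  # header row
    writer.writerows(final_data)
```

A CSV file loads directly into pandas or a spreadsheet, which makes the later analysis or visualization step straightforward.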
Parsing and Cleaning HTML Content
To better understand the extracted information, we define a function to clean HTML content and another to identify the starting word of the description.
import re

def cleanhtml(raw_html):
    # Replace every HTML tag with a " || " separator so fields stay delimited
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' || ', raw_html)
    return cleantext

def start_word(txt):
    # word_mix() is a helper defined elsewhere in the project (not shown here)
    sw = ""
    for w in txt.split():
        if word_mix(w) == {',', 'n'}:
            sw = w
            break
    return sw
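Here is what cleanhtml does to a small snippet of listing markup; the sample HTML is invented for illustration, and the function is restated so the snippet runs on its own:

```python
import re

def cleanhtml(raw_html):
    # Every HTML tag becomes a " || " separator, keeping fields delimited
    return re.sub(re.compile('<.*?>'), ' || ', raw_html)

sample = "<p>Single owner</p><p>Petrol</p>"
print(cleanhtml(sample))  # ' || Single owner ||  || Petrol || '
```

Replacing tags with a visible separator (rather than deleting them) preserves the field boundaries, which the parser relies on downstream.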
Parsing Car Descriptions
We create a parser function to process the cleaned HTML content, extracting relevant details. The results are then written to an output file.
with open("output_cars.txt", "w") as f:
    for i in range(len(car_content)):
        # parser() and car_content are defined earlier in the full script
        data = parser(car_content[i])
        f.write(str(data) + "\n")
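The parser itself is not reproduced in this post. As a minimal stand-in sketch (the parser_stub name and its splitting logic are assumptions, not the author's code), one could split the cleaned text on the " || " separators that cleanhtml produces:

```python
def parser_stub(cleaned):
    # Split on the " || " separators emitted by cleanhtml and drop empty fields
    return [field.strip() for field in cleaned.split("||") if field.strip()]

print(parser_stub(' || Single owner ||  || Petrol || '))  # ['Single owner', 'Petrol']
```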
Conclusion
Web scraping is a valuable skill for extracting information from websites. In this journey, we’ve explored the OLX car listings in Tamil Nadu, delving into web scraping, data cleaning, and parsing techniques. By combining these skills, we can transform raw HTML content into structured data for further analysis or visualization.