Exploratory Data Analysis and Market Basket Analysis with Python

In retail, understanding customer behavior and optimizing product offerings can make a real difference. In this blog post, we’ll explore how to perform Exploratory Data Analysis (EDA) and Market Basket Analysis using Python on a dataset of retail transactions.

Introduction

The dataset we’re working with contains information about retail transactions. It includes details such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. Our goal is to explore customer purchase patterns and uncover associations between different products.

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

# Load the dataset ('data_path' is a placeholder for the path to the Excel file)
data = pd.read_excel('data_path')

A first look at the column types and the first few rows:

data.info()
data.head()

Data Overview

Before diving into the analysis, let’s get an overview of the dataset. We observe that there are over 500,000 entries, and some columns, such as Customer ID and Description, have missing values.

data.info()
data.Country.value_counts()
len(data.CustomerID.unique())

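We have transactions from various countries, with the majority coming from the United Kingdom. There are 4,373 distinct CustomerID values. Because unique() counts the missing value (NaN) as one of them, this corresponds to 4,372 identified customers.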
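The difference between counting with and without the missing value can be checked on a small example. This is only a sketch on an invented toy column, not the real dataset:

```python
import pandas as pd

# Toy CustomerID column (values invented); None stands for a missing ID
customer_ids = pd.Series([17850.0, 13047.0, None, 17850.0], name="CustomerID")

with_nan = len(customer_ids.unique())          # NaN is counted as a distinct value
without_nan = customer_ids.nunique()           # NaN is excluded by default
missing_rows = int(customer_ids.isna().sum())  # rows with no CustomerID at all

print(with_nan, without_nan, missing_rows)
```

On the real data, the same three calls on data.CustomerID show how many customers are identified and how many rows carry no ID.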

Customer Insights

Let’s identify the top customers based on their total purchase amount:

data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
customer_purchase = data.groupby(['CustomerID']).TotalPrice.sum().sort_values(ascending=False)

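Sorting in descending order puts the biggest spenders first; customer_purchase.head() shows CustomerID 14646.0 leading the pack. Two details are worth keeping in mind: groupby drops rows with a missing CustomerID by default, and because credit records are still in the data at this point, the totals are net of any credited amounts.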
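To see how a credit record affects the ranking, here is a minimal sketch on invented toy transactions. The column names follow the dataset, everything else is made up for illustration:

```python
import pandas as pd

# Toy transactions (all values invented for illustration)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536367", "536368"],
    "CustomerID": [1.0, 2.0, 1.0, None],
    "Quantity": [10, 4, -2, 5],
    "UnitPrice": [2.0, 5.0, 2.0, 1.0],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Net spend per customer; the row with a missing CustomerID is dropped by groupby
spend = toy.groupby("CustomerID")["TotalPrice"].sum().sort_values(ascending=False)
print(spend)
```

Customer 1.0 bought for 20.0 but was credited 4.0, so the net total of 16.0 ranks below customer 2.0 with 20.0.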

Data Cleaning

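To ensure accurate analysis, we clean the data by removing credit records (invoices whose number starts with 'C') and stripping stray whitespace from the product descriptions:

# Remove credit records; .copy() avoids a SettingWithCopyWarning on the next assignment
data = data[~data.InvoiceNo.astype(str).str.startswith('C')].copy()
# Strip whitespace from Description
data['Description'] = data.Description.str.strip()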
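The two cleaning steps can be tried on a toy frame first. The sketch below assumes, as may happen when reading from Excel, that InvoiceNo mixes integers with strings such as 'C536366'; the cast to str makes startswith work on both. All values are invented:

```python
import pandas as pd

# Toy data (invented): InvoiceNo mixes integers and strings
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536366", 536367],
    "Description": [" WHITE METAL LANTERN", "WHITE METAL LANTERN ", "POSTAGE"],
    "Quantity": [6, -6, 1],
})

# Flag credit records, keep the rest, then tidy the descriptions
is_credit = toy["InvoiceNo"].astype(str).str.startswith("C")
clean = toy[~is_credit].copy()
clean["Description"] = clean["Description"].str.strip()

print(clean)
```

After cleaning, only the two regular invoices remain and the descriptions no longer carry leading or trailing spaces.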

Basket Creation

Now, we focus on transactions from a specific country, for instance, Germany:

data_Germany = data[data.Country == 'Germany']
basket_Germany = (
    data_Germany.groupby(['InvoiceNo', 'Description'])['Quantity']
    .sum()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index('InvoiceNo')
)

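The basket now has one row per invoice and one column per product. It is then encoded for further analysis: each cell becomes True if the invoice contains the item and False otherwise, which is the one-hot format that apriori works with:

# Encode quantities as booleans (True = item present in the invoice)
basket_Germany = basket_Germany > 0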
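The reshaping from transaction lines to an encoded basket is easier to follow on a handful of rows. This is a minimal sketch on invented data; it uses unstack(fill_value=0) as a shorter variant of the fillna step above:

```python
import pandas as pd

# Toy transaction lines (invented): one row per invoice line
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A2"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "POSTAGE"],
    "Quantity": [2, 1, 3, 1, 1],
})

# One row per invoice, one column per product, summed quantities in the cells
quantities = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
)

# True where the invoice contains the product at least once
basket = quantities > 0
print(basket)
```

Invoice A1 lists MUG twice, so the quantities are summed to 5 before encoding; the encoded basket only records that MUG is present.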

Market Basket Analysis

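Using the Apriori algorithm, we identify frequent itemsets and generate association rules:

# Itemsets that appear in at least 5% of the German invoices
frq_items = apriori(basket_Germany, min_support=0.05, use_colnames=True)
# Keep rules with a confidence of at least 0.1
rules = association_rules(frq_items, metric="confidence", min_threshold=0.1)
# Strongest rules first: by confidence, then by lift
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])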
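The three metrics behind these rules (support, confidence, and lift) can be computed by hand on a few baskets. The sketch below uses plain Python on invented baskets and the standard textbook definitions; it is not how mlxtend computes them internally:

```python
# Toy baskets (invented); each set holds the items on one invoice
baskets = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "SPOON"},
    {"MUG"},
    {"PLATE", "SPOON"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

s = support({"MUG", "PLATE"})
c = confidence({"MUG"}, {"PLATE"})
l = lift({"MUG"}, {"PLATE"})
print(s, round(c, 3), round(l, 3))
```

A lift above 1 means the two sides occur together more often than their individual frequencies would suggest; here the lift for MUG and PLATE is below 1, so buying a MUG makes a PLATE slightly less likely than average in this toy data.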

The association rules provide insights into item relationships. Let’s explore some of the interesting rules:

rules.head(20)

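Among the top rules, we find associations like “JUMBO BAG WOODLAND ANIMALS” being associated with “POSTAGE” with high confidence. POSTAGE looks like a shipping charge rather than a product, so a rule like this may say more about how orders are invoiced than about what customers like to buy together. If POSTAGE appears on most invoices, almost any item will predict it with high confidence, which is why lift is worth checking alongside confidence.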
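One possible way to focus on product-to-product rules is to filter out any rule that mentions POSTAGE and rank the rest by lift. The sketch below works on an invented table shaped like the output of association_rules (itemsets stored as frozensets); the numbers and the helper name mentions are made up for illustration:

```python
import pandas as pd

# Toy rules table (numbers invented), mimicking the association_rules output layout
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"JUMBO BAG WOODLAND ANIMALS"}),
        frozenset({"MUG"}),
        frozenset({"PLATE"}),
    ],
    "consequents": [
        frozenset({"POSTAGE"}),
        frozenset({"PLATE"}),
        frozenset({"MUG"}),
    ],
    "confidence": [1.0, 0.8, 0.6],
    "lift": [1.2, 3.0, 3.0],
})

def mentions(itemset_column, item):
    """Boolean Series: does each itemset in the column contain `item`?"""
    return itemset_column.apply(lambda itemset: item in itemset)

has_postage = (
    mentions(toy_rules["antecedents"], "POSTAGE")
    | mentions(toy_rules["consequents"], "POSTAGE")
)
product_rules = toy_rules[~has_postage].sort_values("lift", ascending=False)
print(product_rules)
```

An alternative with a similar effect would be to drop the POSTAGE column from the basket before running apriori.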

Conclusion

In this blog post, we explored retail transaction data, identified top customers, cleaned the data, and performed Market Basket Analysis. Understanding customer behavior and product associations can empower businesses to make informed decisions.

This is just a glimpse into the vast world of data analysis and its application in the retail domain. Further exploration and fine-tuning of parameters can reveal deeper insights, paving the way for data-driven strategies.
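As an illustration of what tuning a parameter such as min_support does, the sketch below counts how many itemsets survive at different thresholds. It is a brute-force enumeration in plain Python on invented baskets, suitable for toy data only and not a substitute for apriori; the function name frequent_itemsets is hypothetical:

```python
from itertools import combinations

# Toy baskets (invented); each set holds the items on one invoice
baskets = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "SPOON"},
    {"MUG"},
    {"PLATE", "SPOON"},
    {"MUG", "SPOON"},
]

def frequent_itemsets(baskets, min_support, max_size=2):
    """Brute-force search over itemsets of up to max_size items."""
    items = sorted(set().union(*baskets))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            share = sum(set(combo) <= b for b in baskets) / len(baskets)
            if share >= min_support:
                found[combo] = share
    return found

# Number of surviving itemsets at each threshold
counts = {t: len(frequent_itemsets(baskets, t)) for t in (0.3, 0.5, 0.7)}
print(counts)
```

Lower thresholds keep more itemsets, including the pairs, while higher ones leave only the most common single items. On real data the same trade-off decides how many rules there are to inspect.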

By leveraging Python and its rich ecosystem of libraries, businesses can unlock valuable information hidden within their data, driving growth and enhancing customer satisfaction.

Feel free to experiment with your own datasets and adapt the code to suit your specific business needs. Happy analyzing!