Pandas is a powerful and versatile open-source library for data analysis in Python. It provides easy-to-use data structures like Series and DataFrames, making it an essential tool for handling and manipulating data in machine learning projects. In this blog post, we will explore some key aspects of Pandas that are crucial for anyone working in the field of machine learning.
Agenda
Let’s break down the agenda for this Pandas tutorial:
- Handling Duplicates
- Function Application – map, apply, groupby, rolling, str
- Merge, Join & Concatenate
- Pivot Tables
- Normalizing JSON
Handling Duplicates
Sometimes, ensuring that data is not duplicated can be challenging, and it becomes crucial in the data cleaning step to identify and eliminate duplicate entries.
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1], 'B': [1, 1, 3, 7, 8, 1], 'C': [3, 1, 1, 6, 7, 1]})
Detecting Duplicates with duplicated()
The duplicated() function helps identify duplicate rows in a DataFrame.
df.duplicated()
This returns a boolean Series with True for each row that repeats an earlier one. In the example, row 5 is flagged because it is identical to row 1.
Displaying Duplicate Rows
To display the duplicate rows, you can use boolean indexing:
df[df.duplicated()]
This will show the duplicate rows in the DataFrame.
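By default only the later copy of a repeated row is flagged. The following sketch, which reuses the example DataFrame from above, shows how the keep parameter of duplicated() changes which copies are marked; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1],
                   'B': [1, 1, 3, 7, 8, 1],
                   'C': [3, 1, 1, 6, 7, 1]})

# Default keep='first': the first occurrence is kept, so only row 5 is flagged.
first_kept = df.duplicated()

# keep='last': the last occurrence is kept, so row 1 is flagged instead.
last_kept = df.duplicated(keep='last')

# keep=False: every row that has an identical twin is flagged (rows 1 and 5).
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

Using keep=False is handy when you want to inspect every copy side by side before deciding which one to drop.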
Identifying Duplicates Based on Specific Columns
You can narrow down duplicate detection by specifying columns using the subset parameter:
df[df.duplicated(subset=['A','B'])]
This example considers columns ‘A’ and ‘B’ together, revealing rows 1 and 5 as duplicates based on these columns.
Handling duplicates is a crucial step in ensuring data accuracy and reliability. By using Pandas functions like duplicated(), you can easily identify and manage duplicate entries in your datasets.
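Once duplicates are identified, they still have to be removed. A minimal sketch using drop_duplicates() on the same example DataFrame might look like this; resetting the index afterwards is optional and shown here only as one common choice.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1],
                   'B': [1, 1, 3, 7, 8, 1],
                   'C': [3, 1, 1, 6, 7, 1]})

# Drop rows that repeat an earlier row across all columns (removes row 5).
deduped = df.drop_duplicates()

# Drop rows that repeat an earlier row in columns A and B only
# (removes rows 1 and 5), then renumber the index from 0.
deduped_ab = df.drop_duplicates(subset=['A', 'B']).reset_index(drop=True)

print(deduped)
print(deduped_ab)
```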
Function Application
Using map for Transformation
The map function in Pandas is excellent for transforming one column into another. Let’s look at an example where we create a new column, ‘age_category’, based on the ‘Age’ column.
titanic_data['age_category'] = titanic_data.Age.map(lambda age: 'Kid' if age < 18 else 'Adult')
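One caveat with the lambda above: a missing age is not less than 18, so it ends up labelled 'Adult'. The sketch below uses a small hypothetical stand-in for titanic_data (the real dataset is not reproduced here) and a helper named age_category, introduced for illustration, that labels missing ages separately.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, including one missing age.
titanic_data = pd.DataFrame({'Age': [4.0, 17.0, 18.0, 35.0, np.nan]})

# The original lambda: NaN < 18 is False, so the missing age becomes 'Adult'.
naive = titanic_data['Age'].map(lambda age: 'Kid' if age < 18 else 'Adult')

def age_category(age):
    """Label an age, keeping missing values distinct."""
    if pd.isna(age):
        return 'Unknown'
    return 'Kid' if age < 18 else 'Adult'

titanic_data['age_category'] = titanic_data['Age'].map(age_category)
print(titanic_data)
```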
Using apply on Series and DataFrames
The apply function is versatile and works on both Series and DataFrames. Here, we first compute the sum of the ‘Age’ column by passing the name of an aggregation, then label each age with a custom function. The second call returns a new Series; assign it to a column if you want to keep it.
titanic_data.Age.apply('sum')
titanic_data.Age.apply(lambda age: 'Kid' if age < 18 else 'Adult')
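To make the two calls concrete, here is a small sketch on a hypothetical Series of ages. The categorize helper and its threshold argument are illustrative names, not part of the original example; extra keyword arguments given to apply are forwarded to the function.

```python
import pandas as pd

# Hypothetical ages standing in for titanic_data.Age.
ages = pd.Series([4.0, 17.0, 18.0, 35.0], name='Age')

# Passing a method name as a string dispatches to that Series method.
total = ages.apply('sum')

def categorize(age, threshold=18):
    """Return 'Kid' below the threshold, otherwise 'Adult'."""
    return 'Kid' if age < threshold else 'Adult'

# Element-wise application with the default threshold.
default_labels = ages.apply(categorize)

# Keyword arguments after the function are passed through to it.
custom_labels = ages.apply(categorize, threshold=21)

print(total, default_labels.tolist(), custom_labels.tolist())
```

For a plain aggregation, calling ages.sum() directly is the more common spelling.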
Using apply on DataFrames for Multiple Columns
With axis=1, apply passes each row of the DataFrame to your function, so the function can read several columns at once. In this example, the function doubles the fare for male passengers and leaves it unchanged for everyone else.
def fare_function(row):
    if row.Sex == 'male':
        return row.Fare * 2
    else:
        return row.Fare
titanic_data.apply(fare_function, axis=1)
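Row-wise apply calls a Python function once per row, which can be slow on large DataFrames. As a sketch of an alternative, the same rule can be written as a vectorized expression with numpy.where; the three-row DataFrame below is a hypothetical stand-in for titanic_data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data.
titanic_data = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                             'Fare': [7.25, 71.0, 8.0]})

def fare_function(row):
    """Double the fare for male passengers, leave others unchanged."""
    if row.Sex == 'male':
        return row.Fare * 2
    return row.Fare

# Row-wise version, as in the text.
row_wise = titanic_data.apply(fare_function, axis=1)

# Vectorized version: one condition evaluated over the whole column.
vectorized = pd.Series(
    np.where(titanic_data['Sex'] == 'male',
             titanic_data['Fare'] * 2,
             titanic_data['Fare']),
    index=titanic_data.index,
)

print(row_wise.tolist(), vectorized.tolist())
```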
Grouping Data with groupby
The groupby function is handy for splitting data into groups, applying a function to each group, and combining the results. Here, we calculate the mean age for male and female passengers, then use agg to get the mean, minimum, and maximum age per group in one call.
titanic_data.groupby(['Sex']).Age.mean()
titanic_data.groupby(['Sex']).Age.agg(['mean', 'min', 'max'])
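The split-apply-combine steps can be tried on a small hypothetical stand-in for titanic_data. The sketch also shows named aggregation, which lets you choose the output column names; mean_age, youngest, and oldest are names picked here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data.
titanic_data = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                             'Age': [22.0, 38.0, 30.0, 26.0]})

# One aggregate per group: the result is a Series indexed by Sex.
mean_age = titanic_data.groupby('Sex')['Age'].mean()

# Named aggregation: output_column=(input_column, function).
summary = titanic_data.groupby('Sex').agg(
    mean_age=('Age', 'mean'),
    youngest=('Age', 'min'),
    oldest=('Age', 'max'),
)

print(mean_age)
print(summary)
```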
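Window-based Operations with rolling
The rolling function is useful for window-based operations. In this example, we calculate the sum and minimum value for a rolling window of 5 in the ‘Age’ column. Setting min_periods=1 produces a result even for the first rows, where fewer than 5 values are available.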
titanic_data.Age.rolling(window=5, min_periods=1).agg(['sum', 'min'])
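The effect of min_periods is easiest to see on a short series. This sketch uses a hypothetical Series of six values and a window of 3 rather than 5, purely to keep the output small.

```python
import pandas as pd

# Hypothetical values standing in for titanic_data.Age.
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# With min_periods=1, partial windows at the start still produce a value.
rolled = ages.rolling(window=3, min_periods=1).agg(['sum', 'min'])

# Without min_periods, a full window is required, so the first
# window - 1 results are NaN.
strict = ages.rolling(window=3).sum()

print(rolled)
print(strict)
```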
String Utilities with str
For columns containing strings, Pandas provides str utilities. This example filters rows containing ‘Mr’ in the ‘Name’ column.
titanic_data[titanic_data.Name.str.contains('Mr')]
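Note that a plain substring test for 'Mr' also matches 'Mrs'. The sketch below, on a few hypothetical names in the style of the Titanic dataset, contrasts the substring match with a regular expression that requires the title 'Mr.' exactly. It also passes na=False so that missing names count as non-matches rather than producing missing values in the mask.

```python
import pandas as pd

# Hypothetical names, including one missing value.
names = pd.Series(['Braund, Mr. Owen',
                   'Cumings, Mrs. John',
                   'Heikkinen, Miss. Laina',
                   None])

# Substring match: 'Mrs.' contains 'Mr', so it matches too.
loose = names.str.contains('Mr', na=False)

# Regex match: a word boundary, then 'Mr' followed by a literal dot.
strict = names.str.contains(r'\bMr\.', regex=True, na=False)

print(names[loose].tolist())
print(names[strict].tolist())
```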
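Merge, Join & Concatenate
Using concat for Stacking DataFrames
The pd.concat function stacks DataFrames vertically. Older tutorials use DataFrame.append for this, but that method was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the approach that works across versions.
pd.concat([df1, df2], ignore_index=True)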
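As a minimal sketch of stacking with pd.concat, which works in pandas 2.0 and later where DataFrame.append is no longer available, consider two hypothetical DataFrames df1 and df2 with the same columns:

```python
import pandas as pd

# Hypothetical inputs with matching columns.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical stacking; ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```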
Merging DataFrames with merge
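The merge function combines DataFrames on a specified key column. By default it performs an inner join, keeping only keys present in both DataFrames; pass how='left' to keep every row from the left DataFrame.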
left.merge(right, on='key')
left.merge(right, on='key', how='left')
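The difference between the two calls shows up when the keys do not fully overlap. In this sketch, left and right are hypothetical DataFrames, and indicator=True adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical inputs: K2 exists only on the left, K3 only on the right.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': [10, 20, 30]})

# Inner join (the default): only keys found in both frames survive.
inner = left.merge(right, on='key')

# Left join: every left row is kept; unmatched rows get NaN in B.
left_join = left.merge(right, on='key', how='left', indicator=True)

print(inner)
print(left_join)
```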
Combining DataFrames with join
The join function combines DataFrames based on index values. By default it performs a left join, keeping every row of the calling DataFrame.
left.join(right)
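A short sketch of index-based joining, using two hypothetical DataFrames whose indexes overlap only partly. If the two frames shared a column name, join would additionally need the lsuffix or rsuffix arguments to disambiguate them.

```python
import pandas as pd

# Hypothetical inputs indexed by key; K1 is missing from the right frame.
left = pd.DataFrame({'A': [1, 2, 3]}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'B': [10, 20]}, index=['K0', 'K2'])

# Default left join on the index: K1 is kept with NaN in B.
joined = left.join(right)

# Inner join on the index: only K0 and K2 remain.
inner = left.join(right, how='inner')

print(joined)
print(inner)
```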
Pivot Tables
Extracting Information with Pivot Tables
Pivot tables are a powerful way to extract important information from data. In this example, we create a pivot table to summarize sales data.
pd.pivot_table(sales_data, index=['Manager', 'Rep'], values=['Account', 'Price'], aggfunc=['sum', 'mean'])
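Since the sales data itself is not shown, here is a sketch on a small hypothetical sales_data table with only Manager, Rep, and Price columns. Aggregations are passed as strings, which recent pandas versions prefer over NumPy functions such as np.sum.

```python
import pandas as pd

# Hypothetical sales records; names and prices are made up for illustration.
sales_data = pd.DataFrame({
    'Manager': ['Ann', 'Ann', 'Ann', 'Bob'],
    'Rep': ['Cy', 'Cy', 'Di', 'Ed'],
    'Price': [100, 300, 50, 200],
})

# One row per (Manager, Rep) pair, one column per aggregation of Price.
table = pd.pivot_table(
    sales_data,
    index=['Manager', 'Rep'],
    values=['Price'],
    aggfunc=['sum', 'mean'],
)

print(table)
```

The resulting columns form a two-level index, so a single cell is addressed as table.loc[('Ann', 'Cy'), ('sum', 'Price')].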
Normalizing JSON Data
Handling Hierarchical JSON Data
JSON data is not always flat; it can be hierarchical. The json_normalize function is useful for normalizing such data.
json_data = [{'state': 'Florida', 'shortname': 'FL', 'info': {'governor': 'Rick Scott'}, 'counties': [{'name': 'Dade', 'population': 12345}, {'name': 'Broward', 'population': 40000}]}]
pd.json_normalize(json_data, 'counties', ['state', ['info', 'governor']])
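Put together as a runnable sketch, the call flattens each county into its own row and repeats the state-level fields alongside it. The record_path and meta keyword names are spelled out here for readability; the variable name counties is chosen for illustration.

```python
import pandas as pd

json_data = [{
    'state': 'Florida',
    'shortname': 'FL',
    'info': {'governor': 'Rick Scott'},
    'counties': [{'name': 'Dade', 'population': 12345},
                 {'name': 'Broward', 'population': 40000}],
}]

# record_path: the nested list to expand into rows.
# meta: parent-level fields to copy onto every row; a nested field is
# given as a path, and its column is named with a dot ('info.governor').
counties = pd.json_normalize(
    json_data,
    record_path='counties',
    meta=['state', ['info', 'governor']],
)

print(counties)
```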
By mastering these Pandas techniques, you’ll be better equipped to handle complex data manipulation tasks in your data analysis and machine learning projects.
Conclusion
Pandas is an indispensable tool for any data scientist or machine learning practitioner. In this tutorial, we’ve covered just a fraction of what Pandas can offer. As you delve deeper into data analysis and machine learning, mastering Pandas will undoubtedly enhance your productivity and analytical capabilities.
Stay tuned for more tutorials on advanced Pandas functionalities!