Exploring Key Pandas Functions for Data Analysis

Aaishwarya Kulkarni
4 min readJun 29, 2024

--

I’ve been using Pandas for many years and find certain functions especially helpful for data analysis. These functions are essential for transforming and understanding complex datasets. Their efficiency makes them my go-to tools for extracting meaningful insights and preparing data for further analysis and machine learning tasks.

pandas.melt

The melt function can be used where data needs to be reshaped from a wide format to a long format. This is useful when you have columns representing values that would be better represented as rows.

df_melted = pd.melt(df, 
id_vars = ["Month"],
var_name = "Product",
value_name = "Sales")

If you observe, the column ‘Month’ becomes the identifier meaning it should not be melted. The rest columns will be melted and the corresponding values will become a new column called ‘Sales’.

Practical use:

  1. Reshaping the dataset to a long format can simplify the process of creating visualizations by ensuring that each variable is represented correctly.
  2. When merging datasets, it may be necessary to convert to a long format to align with specific keys and ensure accurate merging.
  3. Machine learning algorithms often require reshaping the dataset to a long format.

pandas.cut and pandas.qcut

The cut function can be used to segment and sort data values into bins, while qcut function can be used to discretize a variable into equal-sized buckets based on rank or sample quantiles.

pandas.cut

bins = [20, 30, 40, 50]
labels = ['20-30', '30-40', '40-50']
df['AgeGroup'] = pd.cut(df['Age'],
bins=bins,
labels=labels,
right=False)

If you observe, we have specified the edges of the ‘bins’ such as [20, 30), [30, 40), [40, 50) where the right edge is excluded because of the right=False parameter. We also specified the labels that will be assigned to each bin.

pandas.qcut

df['SalaryBin'] = pd.qcut(df['Salary'],
q=4,
labels=["Low", "Medium", "High", "Very High"])

If you observe, we have specified q=4, which will divide ‘Salary’ in 4 equal sized quantiles, i.e. each bin will contain approx. 25% of data points. We have also specified labels that will be assigned to each quantile.

Practical use:

cut:

  1. Summarizing data across demographic groups, facilitating analysis and comparisons based on specified criteria.
  2. Creating income brackets to facilitate economic analysis and categorization.
  3. Grading system with fixed grade groups allowing for the creation of fixed grade groups based on specified thresholds for evaluation and classification purposes.

qcut:

  1. Dividing the dataset into quantiles to ensure bin has an equal number of observations.
  2. Categorizing customers based on spending into equal-frequency bins for targeted marketing strategies.
  3. Segmenting performance metrics, like sales figures, into percentiles to identify top and bottom performers.

Both cut and qcut are essential tools in data preprocessing, enabling easier analysis and meaningful categorization of continuous data for machine learning.

pandas.crosstab

The crosstab function allows you to compute a frequency table of 2 or more variables which helps us get insights into the relationship between them.

age_group_department = pd.crosstab(df['AgeGroup'], 
df['Department'])
salary_sum = pd.crosstab(df['AgeGroup'], 
df['Department'],
values=df['Salary'],
aggfunc='sum')

If we observe, the age_group_department dataframe is a cross-tabulation of ‘AgeGroup’ and ‘Department’ that shows frequency of occurrences of each combination of these categories, while salary_sum dataframe is a cross-tabulation that sums the ‘Salary’ values of the combination of ‘AgeGroup’ and ‘Department’.

Practical use:

  1. Market analysis for analyzing survey results to understand market segmentation.
  2. Comparing sales performance across different regions and periods.
  3. Understanding expenditure patterns to see where most spending occurs and identify cost-saving opportunities.

pandas.factorize

The factorize function is used to encode categorical data as numerical labels.

df['Department_Code'], department_categories = pd.factorize(df['Department'])
department_categories --> Index(['HR', 'Finance', 'IT', 'Marketing'], dtype='object')

If we observe, the ‘Department’ column in dataframe was converted to numerical labels creating a new column called ‘Department_Code’. Its corresponding unique categories are stored in ‘department_categories’.

Practical use:

  1. Many machine learning algorithms require categorical variables to be converted into numerical input for processing.
  2. Storing categorical data as integers rather than strings can save memory and improve performance.
  3. For visualizations, numerical codes can be easier to plot.

Thanks for reading! Happy Coding!

--

--

No responses yet