Unveiling the Starbucks Mysteries of Buy One Get One Free Offers

Michelle Hsu
8 min read · Dec 30, 2020

Do you consider yourself the kind of customer who is easily lured by advertisements into purchasing something you don't necessarily need? Or are you someone like me who tends to disregard attractive commercial offers and stick to your own purchasing behavior?

With the simulated data provided by Starbucks, I'm able to dig deeper into the secrets of one of the most captivating offers: Buy One Get One Free (BOGO). The dataset contains offer information, customer profiles, and records of customer purchases and offer responses.

For this Udacity capstone project, I'd like to understand how customers behave when they receive a BOGO offer. Do they respond to the information and complete the offer? Do they complete the offer without even seeing the offer information? Or is it possible that factors other than the offer itself matter more in leading a customer to complete the BOGO offer?

Project Overview

This project is a requirement of the Udacity Data Scientist Program. The data, provided by Starbucks, contains offer information, customer profiles, and transaction records. This post walks through the problem statement, the exploratory data analysis, and the models built to solve the problem.

Problem Statement

I want to see what factors influence whether a customer completes the Buy One Get One Free (BOGO) offer. Some people might be incentivized after viewing the offer information; others might complete the offer due to certain characteristics other than the offer itself. In short, I want to build a supervised model that can classify whether a customer will complete the BOGO offer.

Metrics

The metrics used in this project are listed as follows:
1. Accuracy
2. Area Under the ROC Curve (AUC)
3. F1 score
4. Confusion Matrix
5. Feature Importances
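
For reference, here is a minimal sketch of how the first four metrics can be computed with scikit-learn; y_test, y_pred, and y_prob are hypothetical arrays of true labels, predicted labels, and predicted positive-class probabilities:

```python
# A minimal sketch of computing the first four metrics with scikit-learn.
# y_test, y_pred, and y_prob are hypothetical arrays of true labels,
# predicted labels, and predicted positive-class probabilities.
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             confusion_matrix)

accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)    # AUC scores probabilities, not hard labels
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows are true classes, columns predicted
```

Feature importances, by contrast, come from the fitted tree-based model itself (its feature_importances_ attribute) rather than from a metrics function.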

1. Data Exploration & Visualization & Preprocessing

1.1. portfolio — containing offer ids and metadata about each offer (duration, type, etc.)

Figure 1. Portfolio

The dataframe Portfolio (Figure 1.) contains the offer information such as offer type, duration, and reward amount. There are 3 different types of offers: discount, informational, and BOGO.

Figure 2. Average Reward by Offer Type

Based on Figure 2. Average Reward by Offer Type, we know the BOGO offer type has the highest average reward ($7.50) compared to the others.

Note that the cell values in the "channels" column are lists, which is not desirable for machine learning. Therefore, I split the list values into 4 separate dummy-variable columns: email, mobile, social, and web.
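
One way to do this split in pandas (a sketch; the exact code in my repo may differ):

```python
# A sketch of expanding the list-valued "channels" column into dummy
# columns; "portfolio" is the offers dataframe described above.
import pandas as pd

channel_dummies = (portfolio['channels']
                   .apply(lambda chs: pd.Series({ch: 1 for ch in chs}))
                   .fillna(0)
                   .astype(int))
portfolio = pd.concat([portfolio.drop(columns='channels'), channel_dummies],
                      axis=1)
```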

1.2. profile — demographic data for each customer

Figure 3. Profile

The dataframe Profile (Figure 3.) contains customer demographic information such as gender, income, and age.

At first glance, you can see there are some missing values in the dataframe. Specifically, the "gender" and "income" columns contain missing values (Figure 4.). Oddly, the percentage of missing values is the same for both variables.

Figure 4. Profile Missing Values

After further investigation of the missing values, all customer profiles with missing values in "gender" and "income" have an "age" of 118, confirming that these data points are erroneous user profiles. Therefore, I dropped any data points with missing "gender" and "income" values.
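
In code, the check and cleanup might look like this (a sketch against the profile dataframe):

```python
# A sketch of the check and cleanup described above; "profile" is the
# customer dataframe.
missing = profile['gender'].isna() & profile['income'].isna()

# Every row missing gender and income should report the placeholder age 118.
assert (profile.loc[missing, 'age'] == 118).all()

# Drop the erroneous profiles.
profile = profile.dropna(subset=['gender', 'income'])
```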

Figure 5. Income Distribution by Gender

According to Figure 5. Income Distribution by Gender, the income of female customers is roughly normally distributed, with most female customers earning between $70k and $85k. Male customers' incomes are mostly concentrated between $50k and $80k, and the incomes of other customers mostly range from $55k to $70k. Also, there are more female customers with incomes above $80k than male and other customers.

1.3. transcript — records for transactions, offers received, offers viewed, and offers completed

Figure 6. Transcript

The dataframe Transcript (Figure 6.) contains records of transactions and customers' responses to offers, such as offer received, offer viewed, and offer completed.

Note that the "value" column holds dict-like values which need to be split into separate columns for offer id, reward, and amount.

Figure 7. Value split into 4 variables

After splitting the "value" column (Figure 7.), we can see that there are two duplicated offer-id columns, "offer id" and "offer_id", because different event types use different key names. To be consistent, I combined these 2 columns into "offer id".
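
A sketch of both steps, assuming the transcript dataframe:

```python
# A sketch of splitting the dict-valued "value" column and coalescing the
# two offer-id spellings; "transcript" is the events dataframe.
import pandas as pd

value_df = transcript['value'].apply(pd.Series)  # offer id, offer_id, reward, amount
transcript = pd.concat([transcript.drop(columns='value'), value_df], axis=1)

# The two spellings come from different event types and never co-occur,
# so fill one from the other and drop the duplicate.
transcript['offer id'] = transcript['offer id'].fillna(transcript['offer_id'])
transcript = transcript.drop(columns='offer_id')
```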

1.4. Combine the 3 dataframes and perform further data preprocessing

Figure 8. Merge 3 dataframes

Under the project scope, we only care about people who received the BOGO offer and whether they completed it. Therefore, I filtered for records with "offer_type" equal to bogo (Figure 9.), as sketched below.

Figure 9. Filter for offer_type as bogo
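
A sketch of the merge and filter, assuming the dataset's person/id key columns (the renames line the keys up with the column names used later in this post):

```python
# A sketch of merging the three dataframes and filtering for BOGO records.
profile = profile.rename(columns={'id': 'customer id'})
portfolio = portfolio.rename(columns={'id': 'offer id'})

df = (transcript
      .merge(profile, left_on='person', right_on='customer id', how='left')
      .merge(portfolio, on='offer id', how='left'))

# Keep only the BOGO records for this analysis.
bogo_df = df[df['offer_type'] == 'bogo']
```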

Based on the dataset we have, there are 4 scenarios under this project scope:

1. people who didn't view the BOGO offer but completed it

2. people who viewed the BOGO offer and completed it

3. people who viewed the BOGO offer but didn't complete it

4. people who didn't view and didn't complete the BOGO offer

Step by step, I extracted the 4 scenarios with the following feature engineering and data cleaning steps (sketched in code below). Eventually, I arrived at the master_df shown in Figure 10.

  • remove unnecessary columns — ‘event’, ‘customer id’, ‘offer id’, ‘reward_x’, ‘amount’, ‘offer_type’
  • rename column “reward_y” to “reward”
  • extract the feature "membership_days" from "became_member_on", then drop the column "became_member_on"
  • convert "gender" into dummy variables (M, F, and O), then drop the column "gender"
  • handle missing values by dropping the affected records, since the missing values all come from the same records
Figure 10. Cleaned master_df
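
In code, these steps might look roughly like this (a sketch; the reference date used for membership_days is one reasonable choice, shown here as an assumption):

```python
# A sketch of the cleaning steps listed above; column names follow the figures.
import pandas as pd

master_df = bogo_df.drop(columns=['event', 'customer id', 'offer id',
                                  'reward_x', 'amount', 'offer_type'])
master_df = master_df.rename(columns={'reward_y': 'reward'})

# became_member_on is stored as an integer like 20170823; derive membership days
# relative to an assumed reference date.
member_date = pd.to_datetime(master_df['became_member_on'].astype(str),
                             format='%Y%m%d')
master_df['membership_days'] = (pd.Timestamp('2018-12-31') - member_date).dt.days
master_df = master_df.drop(columns='became_member_on')

# Gender dummies (M, F, O), then drop rows that still have missing values.
master_df = pd.concat([master_df.drop(columns='gender'),
                       pd.get_dummies(master_df['gender'])], axis=1)
master_df = master_df.dropna()
```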

Before building the model, I conducted a quick exploratory data analysis on the cleaned dataset.

1.4.1. Correlation analysis

Figure 11. Correlation Analysis
  • "reward" has a strong positive correlation with "difficulty"
  • "offer_viewed" has a moderately positive correlation with "social"
  • "social" has moderately positive correlations with "difficulty" and "reward" but a moderately negative correlation with "duration"
  • "web" has moderately negative correlations with "difficulty", "duration", and "reward"
  • "M" has a negative correlation with "F", which is expected since the gender dummy variables are mutually exclusive
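
These observations come from a standard correlation heatmap, along these lines (a sketch using seaborn):

```python
# A sketch of the correlation analysis behind Figure 11.
import matplotlib.pyplot as plt
import seaborn as sns

corr = master_df.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```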

1.4.2. Income Distribution by Offer Completion

Based on the density plot Income Distribution by whether the Customer Completed the Offer (Figure 12.), we can see that people who completed the BOGO offer tend to have higher incomes than people who didn't. Most of the people who completed the offer have incomes ranging from $70k to $80k; on the other hand, people who didn't complete the offer have incomes ranging from $30k to $45k. Also, more people with incomes above $80k completed the BOGO offer than didn't.

Figure 12. Income Distribution by whether the Customer Completed the Offer

1.4.3. Customers completed BOGO vs. Customers who didn’t complete BOGO

From Figure 13. Customers completed BOGO vs. Customers who didn't complete BOGO, we know that more people completed the BOGO offers (~60%) than didn't (~40%).

Figure 13. Customers completed BOGO vs. Customers who didn’t complete BOGO

2. Modeling & Metrics

For the modeling part, I tried 5 different algorithms: Naive Bayes, Logistic Regression, KNearestNeighbors, Random Forest, and a Neural Network.

Note that, to further refine the models, I implemented cross-validation and hyperparameter tuning for Logistic Regression, KNearestNeighbors, and Random Forest.

For Logistic Regression, I implemented GridSearchCV and fine-tuned the hyperparameter C with 5-fold cross-validation (Figure 14.).

Figure 14. Logistic Regression with CV
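
The setup looks roughly like this (a sketch; X_train and y_train are the hypothetical training features and labels, and the C grid shown is illustrative):

```python
# A sketch of the logistic regression tuning; the C grid is illustrative.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```

The KNearestNeighbors setup below follows the same pattern, swapping in KNeighborsClassifier and a grid over n_neighbors.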

For KNearestNeighbors, I also implemented GridSearchCV and tuned the hyperparameter n_neighbors with 5-fold cross-validation (Figure 15.).

Figure 15. KNearestNeighbors with CV

For Random Forest, I implemented GridSearchCV and tuned the hyperparameters n_estimators, max_depth, and min_samples_leaf with 5-fold cross-validation (Figure 16.).

Figure 16. Random Forest with CV
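
A sketch with an illustrative grid:

```python
# A sketch of the random forest tuning; the grid values are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
```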

3. Results & Conclusion

According to the Metrics table (Figure 17.), overall Random Forest with CV performs better than the other models, even the neural network. Although the accuracy of Random Forest with CV is not the highest, its AUC and F1 score are not bad at all.

Figure 17. Metrics

Looking at the confusion matrix (Figure 18.), the best model, Random Forest with CV, does pretty well at classifying the positive class, meaning people who completed the BOGO offer. However, the model can certainly improve its predictions for people who didn't complete the BOGO offer.

Figure 18. Confusion Matrix

Since we chose a tree-based model, we can see which features play the most important roles in predicting whether the customer will complete the BOGO offer. According to the Feature Importances chart (Figure 19.), membership_days turns out to be the most influential feature, followed by time and income. Surprisingly, offer_viewed is not even among the top 5 features, which means whether a customer views the BOGO offer or not doesn't significantly impact offer completion.

Figure 19. Feature Importances
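
For reference, the importances can be read straight off the fitted forest (a sketch, reusing the hypothetical best_rf and X_train from the tuning sketch above):

```python
# A sketch of extracting feature importances from the tuned random forest.
import pandas as pd

importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())
```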

Note that there are more data points in the positive class than in the negative class. Although the class imbalance is not severe (60% vs. 40%), it might still affect model performance. In addition, we don't have a sufficient number of features; the model uses only 15 features in total. Therefore, even after refining the models with cross-validation and hyperparameter tuning, the room for model improvement is limited.

In the future, we could gather more information about the offer, such as the month when the offer is released, the number of people targeted by the offer, the type of product in the BOGO offer, etc. Additionally, it'd be beneficial to know more about customer demographics beyond income and age. For example, education level, race, and whether the customer lives in a city are all good indicators to include in the model.

In terms of further improving the models, I would handle the class imbalance issue and conduct more feature engineering to make sure the dataset is representative and generalizable. I might also try other machine learning algorithms such as XGBoost, Support Vector Machines, and other neural network architectures.

If you are interested in the code or craving more details on the analysis, you can find all of that in my GitHub repo here.
