Instacart - Predictions on Test Set

23 minute read

In this notebook we load the trained Random Forest classifier, built in the Instacart Model Fitting notebook, and use it to predict reordered products for the test set of the Instacart Market Basket Analysis Kaggle competition.

Support Functions

from IPython.display import Markdown, display
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats
#----------------------------------------------------------------------
# Functions to load datasets into memory using space efficient data types.

def load_orders(path):
    def convert_eval_set(value):
        if 'prior' == value:
            return np.uint8(1)
        elif 'train' == value:
            return np.uint8(2)
        else:
            return np.uint8(3) # 'test'

    def convert_days_since_prior_order(value):
        # days_since_prior_order is capped at 30 in the dataset, so it fits in an int8;
        # missing values (each customer's first order) are mapped to -1.
        if '' == value:
            return np.int8(-1)
        else:
            return np.int8(float(value))  # np.float was removed from NumPy; use the builtin

    orders = pd.read_csv(path,
                         dtype={'order_id': np.uint32,
                                'user_id': np.uint32,
                                'order_number': np.uint8,
                                'order_dow': np.uint8,
                                'order_hour_of_day': np.uint8},
                         converters={'eval_set':convert_eval_set,
                                     'days_since_prior_order':convert_days_since_prior_order})

    orders = orders.astype({'eval_set': np.uint8, 
                            'days_since_prior_order': np.int8})
    
    return orders

def load_orders_prods(path):
    return pd.read_csv(path, dtype={'order_id': np.uint32,
                                    'product_id': np.uint32,
                                    'add_to_cart_order': np.uint8,
                                    'reordered': np.uint8})

#----------------------------------------------------------------------
# Function to generate markdown output
# Ref: https://stackoverflow.com/a/32035217
def printmd(string):
    display(Markdown(string))

Load Testing Data

orders = load_orders('data/split/sf_test_set_orders.csv')
orders_prods = load_orders_prods('data/split/sf_test_set_prior_order_products.csv')
# Split the orders into prior orders (purchase history) and each customer's final (test) order.
prior_orders_only = orders[(1 == orders.eval_set)]
final_orders_only = orders[(1 != orders.eval_set)]

Meta Features - Mean Order Length and Mean Reorder Ratio Per Customer

# Compute mean order length per customer.

orders_length = orders_prods.groupby('order_id').add_to_cart_order.max().reset_index()
orders_length.rename(columns={'add_to_cart_order': 'total_items_ordered'}, inplace=True)
orders_length_merge = orders_length.merge(prior_orders_only[['order_id','user_id']], on='order_id')
orders_length_merge['order_id'] = orders_length_merge.order_id.astype(np.uint32)

mean_order_length_per_customer = orders_length_merge.groupby('user_id').total_items_ordered.mean().round().reset_index()
mean_order_length_per_customer['user_id'] = mean_order_length_per_customer.user_id.astype(np.uint32)
mean_order_length_per_customer.rename(columns={'total_items_ordered': 'mean_order_length'}, inplace=True)
mean_order_length_per_customer['mean_order_length'] = mean_order_length_per_customer.mean_order_length.astype(np.uint16)

del orders_length_merge

# Compute mean reorder ratio per customer.

# For each order compute ratio of re-ordered items to total ordered items.
orders_reorder_ratio = orders_prods.groupby('order_id').reordered.sum() / orders_length.set_index('order_id').total_items_ordered
orders_reorder_ratio = orders_reorder_ratio.reset_index()

del orders_length

# Exclude each customer's first order: none of its products can be reorders,
# so its reorder ratio is always zero and would skew the mean reorder ratio
# both overall and per user.
orders_reorder_ratio = orders_reorder_ratio.merge(prior_orders_only[prior_orders_only.order_number > 1], on='order_id')
orders_reorder_ratio.rename(columns={0: 'reorder_ratio'}, inplace=True)
orders_reorder_ratio['order_id'] = orders_reorder_ratio.order_id.astype(np.uint32)

mean_reorder_ratio_per_customer = orders_reorder_ratio.groupby('user_id').reorder_ratio.mean().reset_index()
mean_reorder_ratio_per_customer['user_id'] = mean_reorder_ratio_per_customer.user_id.astype(np.uint32)
mean_reorder_ratio_per_customer.rename(columns={'reorder_ratio': 'mean_reorder_ratio'}, inplace=True)
mean_reorder_ratio_per_customer['mean_reorder_ratio'] = mean_reorder_ratio_per_customer.mean_reorder_ratio.astype(np.float16)

del orders_reorder_ratio
mean_order_length_per_customer.head()
user_id mean_order_length
0 3 7
1 4 4
2 6 5
3 11 13
4 12 15
mean_reorder_ratio_per_customer.head()
user_id mean_reorder_ratio
0 3 0.718750
1 4 0.035706
2 6 0.142822
3 11 0.402588
4 12 0.181763

Feature Engineering

Merge Prior Orders and Ordered Products

flat_order_prods = orders_prods.merge(prior_orders_only[['order_id','user_id','order_number','days_since_prior_order']], on='order_id')
flat_order_prods.head()
order_id product_id add_to_cart_order reordered user_id order_number days_since_prior_order
0 13 17330 1 0 45082 2 1
1 13 27407 2 0 45082 2 1
2 13 35419 3 0 45082 2 1
3 13 196 4 0 45082 2 1
4 13 44635 5 0 45082 2 1

Days Since First Order (DSFO) per Order per Customer

DSFO_popc = prior_orders_only.copy()
# Days since first order, per order, per customer.
# Add one since each user's first order has days_since_prior_order set to -1.
DSFO_popc['DSFO'] = DSFO_popc.groupby(['user_id']).days_since_prior_order.cumsum() + 1
DSFO_popc['DSFO'] = DSFO_popc.DSFO.astype(np.uint16)
del DSFO_popc['eval_set']
del DSFO_popc['order_number']
del DSFO_popc['order_dow']
del DSFO_popc['order_hour_of_day']
del DSFO_popc['days_since_prior_order']
DSFO_popc.head()
order_id user_id DSFO
0 1374495 3 0
1 444309 3 9
2 3002854 3 30
3 2037211 3 50
4 2710558 3 62

Max Days Since First Order (DSFO) per Customer

max_DSFO_pc = DSFO_popc.groupby(['user_id']).DSFO.max().reset_index()
max_DSFO_pc.rename(columns={'DSFO': 'max_DSFO'}, inplace=True)
max_DSFO_pc.head()
user_id max_DSFO
0 3 133
1 4 55
2 6 18
3 11 123
4 12 100

Number of Orders per Customer

orders_pc = prior_orders_only.groupby('user_id').order_number.max().reset_index()
orders_pc['user_id'] = orders_pc.user_id.astype(np.uint32)
orders_pc.rename(columns={'order_number': 'number_of_orders'}, inplace=True)
orders_pc.head()
user_id number_of_orders
0 3 12
1 4 5
2 6 3
3 11 7
4 12 5

Final Summary for Products Ordered per Customer

# days since first order per product per order per customer
props_pppc = flat_order_prods[['order_id','product_id','reordered']].merge(DSFO_popc, on="order_id")

# aggregate to get properties for each product ordered for each customer
props_pppc = props_pppc.groupby(['user_id','product_id']).agg({'DSFO': [min, max],
                                                               'reordered': sum})

# flatten hierarchical column index
props_pppc = props_pppc.reset_index()
props_pppc.columns = ['_'.join(col).strip('_') for col in props_pppc.columns.values]

# add max_DSFO and total orders per customer
props_pppc = props_pppc.merge(max_DSFO_pc, on='user_id')
props_pppc = props_pppc.merge(orders_pc, on='user_id')

# change data types for space efficiency
props_pppc['user_id'] = props_pppc.user_id.astype(np.uint32)
props_pppc['product_id'] = props_pppc.product_id.astype(np.uint32)

# add days since last order for the customer's final order
props_pppc = props_pppc.merge(final_orders_only[['user_id','days_since_prior_order']], on="user_id")
props_pppc.rename(columns={'days_since_prior_order': 'last_order_DSLO'}, inplace=True)

# Compute the reorder and recency probabilities for each product ordered by each customer.
props_pppc['reorder_prob'] = (props_pppc['reordered_sum'] + 1) / props_pppc['number_of_orders']
# The .where() sets recency_prob to 0 whenever the final order was placed on the same day
# as the previous one (last_order_DSLO == 0). This also avoids a division by zero for
# customers whose prior orders were all placed on the day of their first order
# (max_DSFO == 0), which would otherwise produce NaN or inf values.
props_pppc['recency_prob'] = (props_pppc['DSFO_max'] / (props_pppc['max_DSFO'] + props_pppc['last_order_DSLO'])).where(props_pppc['last_order_DSLO'] > 0, 0)

# change all float64 fields to float16
props_pppc['reorder_prob'] = props_pppc.reorder_prob.astype(np.float16)
props_pppc['recency_prob'] = props_pppc.recency_prob.astype(np.float16)
# drop the columns we no longer need
del props_pppc['DSFO_min']
del props_pppc['DSFO_max']
del props_pppc['reordered_sum']
del props_pppc['max_DSFO']
del props_pppc['number_of_orders']
del props_pppc['last_order_DSLO']

props_pppc['pred_reordered_prob'] = 0
props_pppc['pred_reordered_prob'] = props_pppc.pred_reordered_prob.astype(np.float16)

props_pppc.head()
user_id product_id reorder_prob recency_prob pred_reordered_prob
0 3 248 0.083313 0.062500 0.0
1 3 1005 0.083313 0.743164 0.0
2 3 1819 0.250000 0.527832 0.0
3 3 7503 0.083313 0.208374 0.0
4 3 8021 0.083313 0.062500 0.0

Predictions on Test Data

Load Random Forest Model to Make Reorder Predictions

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
import joblib  # older scikit-learn versions exposed this as sklearn.externals.joblib

Extract Testing Feature Matrix

X_test = props_pppc.loc[:, ['reorder_prob', 'recency_prob']].values

Predict Product Reordering

The trained random forest model is loaded from disk and used to predict the probability that each product will be reordered. The nearly 5 million products previously ordered by the test-set customers are processed in batches of at most one million, and the model is deleted and reloaded from disk for each batch. This keeps memory usage within the resources available to the Docker container used to carry out these experiments.

total_products = len(X_test)
print("Total products in test orders: {0}".format(total_products))
batch_size = 1000000
n_batches = int(np.ceil(total_products / batch_size))
for i in range(n_batches):
    # Reload the model for each batch and delete it afterwards to keep memory usage low.
    clf = joblib.load('randomforest-all-training-data.pkl')
    start_pos = i * batch_size
    end_pos = min(start_pos + batch_size, total_products)
    print("Processing {0} - {1}".format(start_pos, end_pos))
    y_hat_probs = clf.predict_proba(X_test[start_pos:end_pos])
    props_pppc.loc[start_pos:end_pos-1,['pred_reordered_prob']] = y_hat_probs[:,1]
    del clf
Total products in test orders: 4833292
Processing 0 - 1000000


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.6s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:    1.1s
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:    1.7s
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:    2.5s
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    3.2s
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    4.3s
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    5.5s
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed:    6.8s
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed:    8.0s
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed:    9.5s
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed:   10.9s
[Parallel(n_jobs=2)]: Done 109 tasks      | elapsed:   12.8s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   14.3s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   16.2s finished


Processing 1000000 - 2000000


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.6s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:    1.3s
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:    1.8s
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:    2.8s
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    3.6s
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    4.6s
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    5.6s
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed:    7.0s
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed:    8.3s
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed:    9.8s
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed:   11.4s
[Parallel(n_jobs=2)]: Done 109 tasks      | elapsed:   13.2s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   17.0s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   20.5s finished


Processing 2000000 - 3000000


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.5s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:    1.2s
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:    1.7s
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:    2.5s
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    3.3s
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    4.4s
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    6.0s
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed:    7.2s
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed:    8.9s
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed:   10.6s
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed:   12.0s
[Parallel(n_jobs=2)]: Done 109 tasks      | elapsed:   13.7s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   15.8s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   23.0s finished


Processing 3000000 - 4000000


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    1.4s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:    2.5s
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:    3.5s
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:    4.5s
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    5.4s
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    6.4s
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    7.5s
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed:    8.9s
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed:   10.1s
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed:   11.8s
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed:   13.3s
[Parallel(n_jobs=2)]: Done 109 tasks      | elapsed:   15.0s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   16.7s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   18.5s finished


Processing 4000000 - 4833292


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.2s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed:    1.0s
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:    1.4s
[Parallel(n_jobs=2)]: Done  21 tasks      | elapsed:    2.2s
[Parallel(n_jobs=2)]: Done  28 tasks      | elapsed:    2.8s
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    3.8s
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    4.6s
[Parallel(n_jobs=2)]: Done  57 tasks      | elapsed:    6.3s
[Parallel(n_jobs=2)]: Done  68 tasks      | elapsed:    7.3s
[Parallel(n_jobs=2)]: Done  81 tasks      | elapsed:    8.5s
[Parallel(n_jobs=2)]: Done  94 tasks      | elapsed:    9.9s
[Parallel(n_jobs=2)]: Done 109 tasks      | elapsed:   11.3s
[Parallel(n_jobs=2)]: Done 124 tasks      | elapsed:   12.7s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   14.4s finished

Save Predictions to Disk

The predictions are saved to disk so that, should the Jupyter notebook kernel stop working, we do not need to run the predictions again; loading them from disk is much faster.

props_pppc.to_csv('predictions.csv')

Load Predictions from Disk

def load_predictions_data(path):
    predictions = pd.read_csv(path, dtype={'user_id': np.uint32,
                                           'product_id': np.uint32,
                                           'pred_reordered_prob': np.float16},
                              usecols=['user_id','product_id','pred_reordered_prob'])

    return predictions


predictions = load_predictions_data('predictions.csv')

Add Order ID and User ID to Predictions

predictions = predictions.merge(final_orders_only[['order_id','user_id']], on='user_id')
predictions.sort_values(['order_id','pred_reordered_prob'], inplace=True, ascending=[True, False])
predictions.head()
user_id product_id pred_reordered_prob order_id
858078 36855 13107 0.938965 17
858079 36855 21463 0.882324 17
858080 36855 38777 0.846680 17
858081 36855 21709 0.792969 17
858082 36855 47766 0.792969 17

Calculate Mean Number of Reordered Products per Customer

The expected number of reordered products in each customer's final order is estimated as the mean order length times the mean reorder ratio, so that, optionally, only that many top-ranked predicted products are kept per final order.

predictions = predictions.merge(mean_order_length_per_customer, on='user_id')
predictions = predictions.merge(mean_reorder_ratio_per_customer, on='user_id')
predictions['order_length'] = predictions.mean_order_length * predictions.mean_reorder_ratio
predictions['order_length'] = predictions['order_length'].round()
predictions.order_length = predictions.order_length.astype(int)
del predictions['user_id']
del predictions['mean_order_length']
del predictions['mean_reorder_ratio']
predictions_above_threshold = predictions.loc[predictions.pred_reordered_prob > 0.5]
del predictions['pred_reordered_prob']
predictions.head()
product_id order_id order_length
0 13107 17 2
1 21463 17 2
2 38777 17 2
3 21709 17 2
4 47766 17 2

Generate Reordered Products List for Each Last Order

LimitToMeanReorderLength = False

if LimitToMeanReorderLength:
    predictions_above_threshold = predictions_above_threshold.groupby('order_id').apply(lambda x: list(x['product_id'])[:list(x['order_length'])[0]])
else:
    predictions_above_threshold = predictions_above_threshold.groupby('order_id').apply(lambda x: list(x['product_id']))
predictions_above_threshold.head()
order_id
17     [13107, 21463, 38777, 21709, 47766, 26429, 392...
34     [39180, 39475, 47792, 47766, 2596, 16083, 4350...
137    [23794, 41787, 24852, 38689, 2326, 5134, 25890...
182    [9337, 39275, 13629, 5479, 47672, 47209, 33000...
257    [49235, 24852, 27104, 27966, 29837, 30233, 450...
dtype: object

Save Reordered Products List per Final Order to Disk

predictions_above_threshold.to_csv('submit_predictions_0_5.csv')

Determine List of Empty Orders and Save to Disk

Empty orders are final orders for which our predictive model expects no reordered products. Although the model assigns some probability to every product in a customer's history, for some customers none of the probabilities for their final order exceed the threshold we use. The run shown here uses a threshold of 0.5; as the results further down show, a threshold of 0.7 gave the best mean F1-score.

# dataframe with single column full of final order ids
all_last_orders = pd.DataFrame(predictions['order_id'].unique(), columns=['order_id'])

predictions_above_threshold = predictions_above_threshold.reset_index()

empty_orders = all_last_orders.loc[~all_last_orders.order_id.isin(predictions_above_threshold.order_id.values)]['order_id']
empty_orders.reset_index().to_csv('empty_orders_0_5.csv', index=False, header=False, columns=['order_id'])

Verify That All Final Orders Were Included

There are 75,000 final orders. To verify that we included all of them in our predictions, we add the number of orders in the empty orders list to the number of orders that include at least one predicted reordered product.

printmd("Normal orders **{0}** + Empty orders **{1}** = **{2}** total orders.".format(len(predictions_above_threshold),
                                                                                      len(empty_orders),
                                                                                      len(empty_orders) + len(predictions_above_threshold)))

Normal orders 74875 + Empty orders 125 = 75000 total orders.

Prepare Kaggle Submission File

The two CSV files generated above, one for the normal final orders and the other for the empty final orders, need to be reformatted and combined to satisfy the submission file format expected by Kaggle for this particular competition.

This was done with a small bash script, prep_submission_file.sh, which uses the sed and cat tools.
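For reference, roughly the same combination can be sketched directly in pandas. The snippet below is only a hedged illustration, not the actual prep_submission_file.sh script: it assumes the competition's submission format of an order_id,products header with space-separated product ids and the literal string None for orders with no predicted reorders, and it reuses the predictions_above_threshold and empty_orders objects defined above. The submission.csv file name is made up for the example.

# Hypothetical pandas equivalent of prep_submission_file.sh (not the script actually used).
# Assumes the Kaggle format: an "order_id,products" header, space-separated product ids,
# and the literal string "None" for orders with no predicted reorders.
submission = predictions_above_threshold.rename(columns={0: 'products'}).copy()
submission['products'] = submission['products'].apply(lambda ids: ' '.join(str(i) for i in ids))

none_rows = pd.DataFrame({'order_id': empty_orders.values, 'products': 'None'})

submission = pd.concat([submission, none_rows], ignore_index=True).sort_values('order_id')
submission.to_csv('submission.csv', index=False)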

Results on Kaggle Instacart Market Basket Analysis Competition

The following are the mean F1-scores achieved on the Instacart Market Basket Analysis competition hosted by Kaggle using only a probability threshold on top of the predictions from the trained random forest model.

Reorder Probability Threshold    Mean F1-Score
0.5                              0.3343295
0.6                              0.3558827
0.7                              0.3622994
0.8                              0.3276510

Finally, we performed one final test using a probability threshold of 0.7 as before but combined with truncating the reordered products list to satisfy the mean number of reordered products per customer. In this experiment, the mean F1-score reported by Kaggle was 0.3313422.

Conclusion

The best mean F1-score achieved using the fitted random forest model is 0.3622994. Although this score is nowhere near the 0.4091449 achieved by the challenge winner, it is a good start, especially when one considers the simplicity of the features, the model, and the computational resources required.

Further Improvement

There are many things one could try to improve the model, starting with additional features, such as a slightly different recency probability computed from the order number over the total number of orders instead of the days-since-last-order metric we used (a rough sketch follows below). Another option is to use clustering to identify customer types, which could then be explored individually, with a specific model fitted to each type.
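As an illustration of that first idea, the following sketch derives an order-number-based recency feature from the flat_order_prods and orders_pc tables built earlier. It is a hypothetical variant that was not evaluated in this notebook, and the column name recency_prob_alt is made up for the example.

# Hypothetical alternative recency feature (not evaluated in this notebook): for each
# customer/product pair, take the most recent order_number in which the product appeared
# and normalise it by the customer's total number of prior orders.
alt_recency = flat_order_prods.groupby(['user_id', 'product_id']).order_number.max().reset_index()
alt_recency = alt_recency.merge(orders_pc, on='user_id')
alt_recency['recency_prob_alt'] = (alt_recency.order_number / alt_recency.number_of_orders).astype(np.float16)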

One can also use more powerful and complex methods, such as LSTMs. For instance, Kaggle user SJV used a mixture of deep learning models for feature extraction and prediction, attaining third place with a mean F1-score of 0.4081041. To read more about this head over here. Another common technique was F1 maximization, also used by Onodera along with an array of features to place 2nd in the competition with a mean F1-score of 0.4082039. You can read more about his approach here, but keep in mind that you need approximately 300GB of RAM to fit and evaluate his models. In comparison, the models fitted and evaluated in these notebooks work in under 4GB of RAM.