The goal of the challenge is to predict which articles the customer will purchase in the week immediately following the end of the provided training data.
There is a PyTorch baseline by Ahmet Erdem (https://www.kaggle.com/code/aerdem4/h-m-pure-pytorch-baseline) that makes predictions based on each customer's purchase history. Commenters have noted that the model implemented is similar to a graph neural network. This approach uses only article_id and t_dat, the article IDs and the dates of purchase.
Some data wrangling takes the provided “transactions_train.csv” to the following form:
Data frame showing 3 training samples: 3584, 75808, and 231659.
Each row is a training sample. Even though customer_id is displayed, it’s not actually used in the modelling. “article_id” and “week_history” belong to the context period, and “week” and “target” belong to the target period. We want to predict what happens in the target period, based on what happened in the context period.
Take the first row as an example (sample 3584). The target period is week 1 (as indicated by the “week” column), which is the week one week before the end of the training data period. The articles purchased in this period are listed in “target”, so this is what we would like our model to be able to predict. During the context period, “article_id” lists all the articles that have been purchased, and “week_history” lists the corresponding weeks in which they were purchased. So, the weeks in “week_history” are prior to the week in “week”. In this training sample, 3 articles were purchased in week 6, while 10 articles were purchased in week 4.
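Under the assumption that weeks are numbered backwards from the end of the training data (the most recent week being week 0), the wrangling described above can be sketched roughly as follows. The column names customer_id, article_id and t_dat come from the competition data; the make_sample helper and the toy transactions are hypothetical, not the baseline's actual code:

```python
import pandas as pd

# Toy stand-in for transactions_train.csv (columns as in the real file).
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2", "c2"],
    "article_id":  [101, 102, 103, 101, 104],
    "t_dat": pd.to_datetime(
        ["2020-08-01", "2020-08-15", "2020-09-04", "2020-08-20", "2020-09-12"]
    ),
})

# Number weeks backwards from the end of the training data: the most
# recent week is week 0, the week before it is week 1, and so on.
end_date = transactions["t_dat"].max()
transactions["week"] = (end_date - transactions["t_dat"]).dt.days // 7

def make_sample(df, target_week, week_hist_max):
    """Build one training sample for a single customer.

    Context  = purchases in the week_hist_max weeks before target_week.
    Target   = articles bought in target_week itself.
    """
    context = df[(df["week"] > target_week)
                 & (df["week"] <= target_week + week_hist_max)]
    target = df[df["week"] == target_week]
    return {
        "article_id": context["article_id"].tolist(),
        "week_history": context["week"].tolist(),
        "week": target_week,
        "target": target["article_id"].tolist(),
    }

sample = make_sample(transactions[transactions["customer_id"] == "c1"],
                     target_week=1, week_hist_max=8)
# sample["article_id"] and sample["week_history"] hold the context-period
# purchases; sample["target"] holds the articles bought in the target week.
```

In this toy data, customer c1's context period contains the purchases from weeks 6 and 4, and the week-1 purchase becomes the prediction target.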
Whilst the length of the target period is fixed at 1 week (because this is what we must do for the submission samples), the length of the context period can be increased by raising the number of preceding weeks (week_hist_max). The longer the context period, the more purchased articles are likely to be collected in the purchase history, resulting in longer "article_id" and "week_history" lists.
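A minimal sketch of how week_hist_max trims the history, using made-up numbers (all values here are illustrative, not from the dataset):

```python
# Past purchases for one customer, as (week-before-target, article) pairs.
week_history = [8, 6, 6, 4, 3, 2]   # weeks relative to the target week
article_id   = [11, 12, 13, 14, 15, 16]

# Keep only purchases that fall within the last week_hist_max weeks
# before the target week; a larger week_hist_max keeps a longer history.
week_hist_max = 5
kept = [(a, w) for a, w in zip(article_id, week_history) if w <= week_hist_max]
# With week_hist_max = 5, only the three most recent purchases survive;
# with week_hist_max = 8, all six would.
```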
The sequence length (seq_len) is the maximum number of purchased articles the model will actually look at, starting from the most recently purchased. This is needed because, whilst the purchase history can vary in length from sample to sample, within a training batch all samples must have the same length. So, if the purchase history contains more articles than the sequence length allows, the history is truncated; if it contains fewer, the sequence is padded with the placeholder article.