Don't randomize training data for recommender systems.

When you train machine learning models, you often shuffle the data so that the model doesn't learn unintended patterns from the order of the training examples. The same concern applies to recommender systems. But if you randomize a user's data so that the recommender can see what the user consumed after the point in time when the recommendations are supposed to happen, you introduce a data leak, which is a far more serious problem.

We train and evaluate recommender systems so that they make customers happy and improve their lives. To prepare a recommender to do that job as well as possible, it's essential to train it to mimic the environment it will operate in, and that is not a reality where it knows things from the future. At least not yet.

You should always treat events as time-sensitive. When training a recommender, split the data at a specific timestamp to simulate what the recommender could do at that point, knowing only the data recorded before it. Then use the remaining data (logged after the timestamp) to evaluate the predictions, i.e. see how many of the later consumed content items the recommender can predict.
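A minimal sketch of that temporal split, using a made-up interaction log (the event tuples and the cutoff date are illustrative, not from any real dataset):

```python
from datetime import datetime

# Hypothetical interaction log: (user_id, item_id, timestamp).
events = [
    ("u1", "item_a", datetime(2024, 1, 5)),
    ("u2", "item_a", datetime(2024, 1, 20)),
    ("u1", "item_b", datetime(2024, 2, 10)),
    ("u2", "item_c", datetime(2024, 3, 1)),
]

# Split at a fixed timestamp instead of shuffling:
# everything logged before the cutoff is training data,
# everything at or after it is held out for evaluation.
split_at = datetime(2024, 2, 1)
train = [e for e in events if e[2] < split_at]
test = [e for e in events if e[2] >= split_at]
```

This keeps every training interaction strictly earlier than every evaluation interaction, so the model never sees "the future" for any user.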

I needed to get it off my chest (again).