Let’s talk about sex, mr ChatBot

Posted by kimfalk

The accepted abstract for my lightning talk at the AltRecSys workshop at RecSys2024 in Bari (the image was not part of the submission).

Almost any news source, whether a regular newspaper, LinkedIn, or similar, will tell you these days that ChatGPT and LLMs in general will solve any problem, steal your job and do it better than you. We see one blog post after another showing how to create cold-start recommenders with an LLM, without providing any user data, that outperform SOTA algorithms. The LLM just “knows.”

However, this is only true for domains and topics often discussed on the internet and topics falling within the ethics and rules of the country where the model was created or, indeed, the company itself. As soon as you move away from those, you start having problems.

One example where this is especially true is in the booming market of sex toys. Many of the product features and words and phrases used to describe sex toys will fall outside of company guidelines and often be part of the stop word list. As a result, it is tough to use the embeddings created by the LLMs. The LLMs will simply ignore essential features of the product or otherwise produce embeddings that are, in fact, useless due to misunderstanding the semantics of the words.

Facing these problems, many customers looking for sex toys get lost in the many different choices. Since many product types are taboo, it’s difficult for customers to learn about them and for stores to personalise experiences and create recommendations for the customers. Furthermore, from an industry practitioner’s point of view, this also sets unrealistic expectations of what can be done using these technologies, as the expectation is that you can create great recommendations as long as you use an LLM.

Most LLMs are trained on publicly available data, and the articles that do talk about sex toys, in many cases, present a distorted worldview and push biases that most system engineers want to remove. This also hints at a broader problem, since many use cases encounter similar issues: LLMs can only learn what they are given. So, decisions about data, and its availability, have a significant impact on the applicability of LLMs.

This becomes even worse when LLMs are used for smaller languages. Everything that works in English becomes complicated when moved to a small language such as one of the Scandinavian languages. The models often confuse the meaning of words because many of these languages have similar words with slightly different meanings, and the training data is often mixed between the languages due to poor language classification.

Sex toys and Scandinavian languages might both be extreme examples, but they do illustrate problems and biases that are introduced into our systems.

Cosine similarity doesn’t always make sense

Posted by kimfalk

Harald Steck’s paper “Is cosine similarity of embeddings really about similarity?”[1] shows mathematically that cosine similarity (CS) doesn’t always make sense when calculating similarity in recommender systems.

For example, it might not work if normalization is performed incorrectly during model training or if used to compare vectors from different latent spaces. The paper doesn’t say it never makes sense, only that you can get into situations where it doesn’t.

Therefore, it doesn’t mean you should stop using cosine similarity altogether; it is only a reminder that you should always test and evaluate your assumptions. This is valid for cosine similarity and anything else you base your system on.
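To make the point a bit more concrete, here is a minimal sketch of one way things can go wrong. It assumes a toy dot-product model with two latent dimensions and made-up numbers (none of this is taken from the paper's experiments). Rescaling the latent dimensions of the user vectors and applying the inverse scaling to the item vectors leaves every predicted score unchanged, yet the cosine similarity between two items changes a lot.

```python
import math

def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def rescale(vec, factors):
    """Multiply each latent dimension by its own factor."""
    return [x * f for x, f in zip(vec, factors)]

# Made-up embeddings for one user and two items (illustrative only).
user = [1.0, 0.5]
item_a = [1.0, 0.1]
item_b = [0.1, 1.0]

# Scale the user dimensions up/down and the item dimensions the opposite way.
scale = [10.0, 0.1]
inverse = [1.0 / s for s in scale]

user_scaled = rescale(user, scale)
item_a_scaled = rescale(item_a, inverse)
item_b_scaled = rescale(item_b, inverse)

# The model's prediction (the dot product) is the same before and after.
score_before = dot(user, item_a)
score_after = dot(user_scaled, item_a_scaled)

# The item-item cosine similarity is not.
cos_before = cosine(item_a, item_b)
cos_after = cosine(item_a_scaled, item_b_scaled)

print(f"score:  {score_before:.3f} -> {score_after:.3f}")
print(f"cosine: {cos_before:.3f} -> {cos_after:.3f}")
```

Both sets of embeddings describe the same model as far as predictions go, so which cosine value you get depends on choices made during training rather than on the items themselves. That is why it pays to test the similarity you rely on.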

[1] https://research.netflix.com/publication/is-cosine-similarity-of-embeddings-really-about-similarity

Using LLMs doesn’t always help readability.

Posted by kimfalk

As an experienced reviewer of recommender systems articles, I have had the privilege of evaluating submissions for numerous large conferences, primarily on the industry track but also on the research track.

A rarely discussed barrier to getting your article accepted at one of these conferences is its readability.

While numerous tools are available to assist writers, it’s important to exercise caution. Before the introduction of LLMs, one of the biggest offenders was Google Translate, which has “helped” non-English speakers translate text. Unfortunately, many of these translations don’t actually mean the same thing as the original. With the introduction of LLMs, many English-speaking authors now also hurt readability by using such tools.

An LLM is a great tool for making your language sound richer and more colorful, which is great if you are writing a novel or other creative piece of content. However, in a scientific article, the best approach is to simplify as much as possible. To convey your research, please make it easier for the reader.

If you do use an LLM or any other tool to help you write, please do the reviewers and future readers a favor and make sure that you and others understand what is written before submitting.

LLMs are great but are not making Recommender systems obsolete (yet)

Posted by kimfalk

LLMs are great and can do mind-boggling things with their language comprehension capabilities. They have generative abilities that make them seem like oracles, but please be cautious, because they are not.

Stuffing an LLM into a recommender system does not solve all problems. In fact, at this point it might create quite a few more than it solves.

That’s not to say that LLMs don’t have a place in the world of RecSys, but they are another component rather than a replacement altogether. The idea that they will make behavioral data obsolete seems a bit naive to me. LLMs can enhance recommender systems by using their language comprehension capabilities to help generate personalized recommendations. However, it’s essential to recognize their limitations: while they excel at understanding language, they may not adequately address all the complexities of user behavior and context, and can potentially create more issues than they solve.

Does the solution have to contain machine learning?

Posted by kimfalk

Does a solution have to contain machine learning to be good or to tap into the voice of the many – do we need to have an LLM?

In many cases, the answer might be no, and certainly not as the first solution. If you consider recommender systems or reranking models, simply reordering the items according to recency can significantly improve the experience.
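As a rough illustration of how little code such a baseline needs, here is a sketch that reorders a list of candidates by recency. The item structure and the `published` field are made up for the example; any timestamp you already store would do.

```python
from datetime import date

# Hypothetical candidate items; "published" is an assumed field name.
candidates = [
    {"id": "old-hit", "published": date(2023, 3, 1)},
    {"id": "fresh", "published": date(2024, 9, 20)},
    {"id": "recent", "published": date(2024, 6, 2)},
]

def rerank_by_recency(items):
    """Return the items sorted with the newest first."""
    return sorted(items, key=lambda item: item["published"], reverse=True)

reranked_ids = [item["id"] for item in rerank_by_recency(candidates)]
print(reranked_ids)
```

No training, no model, and it can be in production the same day.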

But of course, don’t stop there. There can be many more things to try that might bring improvements. However, one of the most significant issues in recommender development is that it is tough to evaluate a system without testing it on users. If you have a simple idea that could improve your KPI, putting it into production will likely earn you a lot of money while you are still battling with more complex algorithms.

Having something simple in production also enables you to start evaluating and monitoring, and to ensure you have set up the collection of the data you need before adding more complexity to the system. A simple solution also provides you with evidence and data that could enable your (machine learning) system to become even better. If nothing else, it provides a benchmark to compare the much more complex solution against.

January is the month of Experiments.

Posted by kimfalk

January is the time when all the data science experiments that have been ready for the last two months, but were delayed by the code freeze around Christmas and maybe even Cyber weekend, can finally run. As a data scientist, this is a great time because you are probably allowed to test riskier things in production. But remember that you are allowed to do it because it is a slow period.

And be diligent. There are challenges with this month because your historical data is all over the place due to the changed habits of Christmas shoppers (see previous post), which could mean that models will probably not perform well in production.

Another reason to be extra on your toes when setting up the A/B tests is that the KPIs of most e-commerce sites are plummeting compared to last month’s capitalist spree of spending.

Are you sure you have data to prove that it is not somehow your experiment’s fault?

The drifty month of January

Posted by kimfalk

Happy New Year, everyone,

Welcome to January, the month where most behavioural-based e-commerce recommender systems struggle.

Hopefully, December was full of lots of transactions, so the system has a lot of good-quality data. But most customers don’t have or want to have the same shopping habits in January as they did in December. Also, what makes it all murkier is that most people buy presents for other people, not just one person but several people, so most taste profiles have gone crazy.

Your recommender training pipeline might still produce offline evaluations on par with last month’s evaluations (at least if the data still includes Christmas shoppers), but will they correspond with what happens online? Andrew Ng talks about the importance of your data representing the production environment in which the model will be deployed. This raises the question: is it better to remove the December data altogether, as November data might look much more like what is happening in January?

No matter what, it’s now a good idea to filter out anything Christmas-related.
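One simple way to experiment with this is to drop a date window from the interaction log before training. The sketch below assumes interactions are stored as dictionaries with a `day` field; the field names, the data and the chosen window are all made up for illustration.

```python
from datetime import date

# Hypothetical interaction log (illustrative data only).
interactions = [
    {"user": "ann", "item": "running-shoes", "day": date(2023, 11, 14)},
    {"user": "ann", "item": "toy-train", "day": date(2023, 12, 18)},
    {"user": "bob", "item": "headphones", "day": date(2024, 1, 9)},
]

def drop_period(rows, start, end):
    """Keep only the rows whose day falls outside [start, end]."""
    return [row for row in rows if not (start <= row["day"] <= end)]

# Assumed window: all of December. Adjust to whatever counts as
# the Christmas period in your own data.
training_rows = drop_period(interactions, date(2023, 12, 1), date(2023, 12, 31))
print([row["item"] for row in training_rows])
```

Whether removing the window actually helps is an empirical question, so compare both variants online rather than trusting the offline numbers alone.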

Don’t randomize training data for recommender systems.

Posted by kimfalk

When you train machine learning models, you often have to randomize the data so that the model doesn’t learn unintended patterns from the order of the training data. In recommender systems, the same concern applies. Still, if you randomize a user’s data so that the recommender gets to know what the user consumed after the time the recommendations should be made, you introduce a data leak, which is a more significant concern.

We train and evaluate recommender systems so that they make customers happy and improve their lives. To prepare the recommenders for that task as well as possible, it’s essential to train them in a way that mimics the environment they will perform in, and that is not a reality where they know things from the future. At least not yet.

You should always treat events as time-sensitive. When training a recommender, you should split the data at a specific timestamp, to simulate what the recommender can do at that point knowing only the data recorded before it. Then, use the remaining data (logged after the timestamp) to evaluate the predictions, i.e. see how many content items the recommender can predict.
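A minimal sketch of such a split could look like the following. It assumes events are `(user, item, timestamp)` tuples with plain integer timestamps, and the recommendations are hard-coded stand-ins for whatever a model trained on the earlier data would produce.

```python
# Hypothetical interaction log: (user, item, timestamp).
events = [
    ("ann", "item-1", 10),
    ("ann", "item-2", 20),
    ("bob", "item-3", 25),
    ("ann", "item-4", 40),
    ("bob", "item-1", 50),
]

def split_at_timestamp(events, split_time):
    """Everything before split_time is training data, the rest is for evaluation."""
    train = [e for e in events if e[2] < split_time]
    test = [e for e in events if e[2] >= split_time]
    return train, test

def count_hits(recommendations, test_events):
    """Count test events whose item was recommended to that user."""
    return sum(
        1
        for user, item, _ in test_events
        if item in recommendations.get(user, [])
    )

train, test = split_at_timestamp(events, split_time=30)

# Stand-in for the output of a recommender trained on `train` only.
recommendations = {"ann": ["item-4", "item-3"], "bob": ["item-2"]}

hits = count_hits(recommendations, test)
print(f"{len(train)} training events, {len(test)} test events, {hits} hit(s)")
```

The important part is that nothing logged at or after the split timestamp ever reaches the training step.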

I needed to get it off my chest (again).

Personalisation is Personal

Posted by kimfalk

Without turning this into a play on words, it’s interesting that personalization is very personal. The more people I talk to about what personalization is, the more answers I get.

This is one of those funny paradoxes because, in basically any survey you read, it is clear that, if asked, people want personalization. But if you ask the same people to describe personalization, they can’t explain it.

When discussing personalization, it often comes down to anecdotal examples of where Netflix’s recommender didn’t work. But to turn that on its head, the reason Netflix is brought up is that it’s so good that when it fails, it gets noticed, while most attempts at personalization are challenging to spot.

I am curious: can any of you explain what good personalization is (without using specific examples)?

Usually, I define it as optimizing for two things:
* Help the user find what they are looking for fast
* Educate the users and help them find the content they didn’t know they needed but do.


What is Retrieval Augmented Generation (RAG)

Posted by kimfalk

A RAG framework contains an LLM paired with a knowledge base.

A RAG process takes a query and assesses if it relates to subjects defined in the paired knowledge base. If yes, it searches its knowledge base to extract information related to the user’s question. Any relevant context in the knowledge base is then passed to the LLM along with the original query, and an answer is produced.
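The flow above can be sketched in a few lines. This is a toy version under loud assumptions: the knowledge base is a list of strings, relevance is decided by simple word overlap (real systems typically use embeddings and a vector index), and `fake_llm` is a hypothetical stand-in for a call to an actual model.

```python
import re

# Hypothetical knowledge base (illustrative content only).
knowledge_base = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping takes three to five working days.",
]

def words(text):
    """Lowercased set of words in a text, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, min_overlap=2):
    """Return documents sharing at least min_overlap words with the query."""
    query_words = words(query)
    scored = [(len(query_words & words(doc)), doc) for doc in documents]
    relevant = [pair for pair in scored if pair[0] >= min_overlap]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in relevant]

def build_prompt(query, context):
    """Pass any retrieved context to the model along with the original query."""
    if not context:
        return f"Question: {query}"
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def fake_llm(prompt):
    """Stand-in for a real LLM call; it only reports what it received."""
    return f"(answer based on a prompt of {len(prompt)} characters)"

def answer(query, documents, llm=fake_llm):
    context = retrieve(query, documents)
    return llm(build_prompt(query, context))

query = "How many days do I have for returns?"
context = retrieve(query, knowledge_base)
prompt = build_prompt(query, context)
print(prompt)
print(answer(query, knowledge_base))
```

Here the "does the query relate to the knowledge base" check and the search are folded into a single overlap threshold; a query with no matching documents is sent to the model without extra context.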

This helps with two things: firstly, it reduces the risk of hallucinations, and secondly, it reduces the chances that an LLM will leak sensitive data, as you can leave it out of the training data.

The knowledge base can also be a recommender system, which will allow the LLM to extract context and feed that into the recommender that, in return, delivers crisp recommendations. (This idea is investigated in the RecSys23 article: Retrieval-augmented Recommender System: Enhancing Recommender Systems with Large Language Models, https://lnkd.in/dHvK8SNJ)

Notes from A Curious Mind

Author: kimfalk
