Using LLMs doesn’t always help readability.


As an experienced reviewer of recommender systems articles, I have had the privilege of evaluating submissions for numerous large conferences, primarily on the industry track but also on the research track.

A rarely discussed barrier to getting your article accepted at one of these conferences is its readability.

While numerous tools are available to assist writers, it’s important to exercise caution. Before the introduction of LLMs, one of the biggest offenders was Google Translate, which has “helped” non-English speakers translate their text; unfortunately, many of those translations don’t actually mean the same thing as the original. With the introduction of LLMs, many English-speaking authors are now hurting readability with tools as well.

An LLM is a great tool for making your language sound richer and more colorful, which is great if you are writing a novel or another piece of creative content. In a scientific article, however, the best approach is to simplify as much as possible: the goal is to convey your research, so make it as easy as possible for the reader.

If you do use an LLM or any other tool to help you write, please do the reviewers and future readers a favor and make sure that you (and others) understand what is written before submitting.

LLMs are great but are not making recommender systems obsolete (yet)

LLMs are great and can do mind-boggling things with their language comprehension capabilities. Their generative abilities make them seem like oracles, but be careful: they are not.

Stuffing an LLM into a recommender system does not solve all problems. In fact, at this point it might create quite a few more than it solves.

That’s not to say that they don’t have a place in the world of RecSys, but as another component rather than a replacement altogether. The idea that LLMs will make behavioral data obsolete seems a bit naive to me. Their language comprehension can certainly enhance recommender systems and help generate personalized recommendations, but it’s essential to recognize their limitations: excelling at language does not mean they capture all the complexities of user behavior and context, and they may create more issues than they solve.

Does the solution have to contain machine learning?

Does a solution have to contain machine learning to be good? And to tap into the voice of the many, do we need an LLM?

In many cases, the answer is no, and certainly not as the first solution. In recommender systems or reranking models, simply reordering items by recency can significantly improve the experience.

But of course, don’t stop there. There are many more things to try that might be improvements. However, one of the most significant issues in recommender development is that it is tough to evaluate a system without testing it on users. If you have a simple idea that could improve your KPI, it will likely earn you a lot of money while you battle with more complex algorithms.

Having something simple in production also lets you start evaluating and monitoring, and ensures that the data collection you need is in place before adding more complexity to the system. A simple solution also provides evidence and data that could make your system (or machine learning model) even better. If nothing else, it provides a benchmark to compare the much more complex solution against.
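As a concrete illustration, here is a minimal sketch of such a recency baseline in Python; the item structure and field names are made up for the example:

```python
from datetime import datetime, timezone

# Hypothetical candidate items with a last-interaction timestamp
# (the field names are invented for this sketch).
items = [
    {"item_id": "A", "last_interaction": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"item_id": "B", "last_interaction": datetime(2023, 12, 24, tzinfo=timezone.utc)},
    {"item_id": "C", "last_interaction": datetime(2024, 1, 10, tzinfo=timezone.utc)},
]

# The "model": most recently interacted-with items first.
recency_ranked = sorted(items, key=lambda item: item["last_interaction"], reverse=True)
print([item["item_id"] for item in recency_ranked])  # ['C', 'A', 'B']
```

Anything more complex then has to beat this ordering in an A/B test to justify its extra cost.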

January is the month of Experiments.

January is the time when all the data science experiments that have been ready for the last two months, but were delayed by the Christmas code freeze and maybe even Cyber weekend, finally get to run. As a data scientist, this is a great time because you are probably allowed to test riskier things in production. But remember that you are allowed to do so because it is a slow period.

And be diligent. This month has its challenges: your historical data is all over the place due to the changed habits of Christmas shoppers (see the previous post), which means models will probably not perform well in production.

Another reason to be extra on your toes when setting up the A/B tests is that the KPIs of most e-commerce sites are plummeting compared to last month’s capitalist spending spree.

Are you sure you have data to prove that it is not somehow your experiment’s fault?

The drifty month of January

Happy New Year, everyone,

Welcome to January, the month where most behaviour-based e-commerce recommender systems struggle.

Hopefully, December was full of transactions, so the system has plenty of good-quality data. But most customers don’t have, or want to have, the same shopping habits in January as they did in December. What makes it all murkier is that many people were buying presents for other people, and not just one person but several, so most taste profiles have gone haywire.

Your recommender training pipeline might still produce offline evaluations on par with last month’s (at least if the data still includes the Christmas shoppers), but will they correspond with what happens online? Andrew Ng talks about the importance of your data representing the production environment the model will be deployed in. This begs the question: is it better to remove the December data altogether, since November data might look much more like what is happening in January?

No matter what, it’s now a good idea to filter out anything Christmas-related.
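If the interaction log lives in a pandas DataFrame, both options (dropping December entirely, or keeping it but filtering out the Christmas-related items) can be sketched roughly like this; the column names and tags are assumptions for the example:

```python
import pandas as pd

# Hypothetical interaction log; the column names and tags are invented for this sketch.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": ["socks", "xmas-sweater", "headphones", "gift-wrap"],
    "item_tags": [["basics"], ["christmas"], ["electronics"], ["christmas"]],
    "timestamp": pd.to_datetime(["2023-11-20", "2023-12-15", "2023-12-28", "2023-12-23"]),
})

# Option 1: drop the December data entirely and train on November-like behaviour.
train_no_december = interactions[interactions["timestamp"] < "2023-12-01"]

# Option 2: keep December but remove anything tagged as Christmas-related.
train_no_christmas = interactions[
    ~interactions["item_tags"].apply(lambda tags: "christmas" in tags)
]
```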

Don’t randomize training data for recommender systems.

When you train machine learning models, you often randomize the data so that the model doesn’t learn unintended patterns from the order of the training examples. The same concern applies to recommender systems. But if you randomize a user’s data so that the recommender gets to see what the user consumed after the point in time where the recommendations should be made, you introduce a data leak, which is a far bigger problem.

Recommender systems are trained and evaluated so they can make customers happy and improve their lives. To prepare a recommender to do that task as well as possible, it’s essential to train it in conditions that mimic the environment it will perform in, and that is not a reality where it knows things from the future. At least not yet.

Always treat events as time-sensitive. When training a recommender, split the data at a specific timestamp to simulate what the recommender could do at that point, knowing only the data recorded before it. Then use the remaining data (logged after the timestamp) to evaluate the predictions, i.e. see how many of the consumed content items the recommender can predict.
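A minimal sketch of such a time-based split, assuming the interactions sit in a pandas DataFrame (the column names are invented for the example):

```python
import pandas as pd

# Hypothetical interaction log, in no particular order.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": ["a", "b", "c", "a", "d"],
    "timestamp": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-01-20", "2024-01-03", "2024-01-25"]
    ),
})

# Split at a point in time instead of shuffling: everything before the split
# is what the recommender is allowed to know, everything after is what we
# try to predict during evaluation.
split_time = pd.Timestamp("2024-01-15")
train = events[events["timestamp"] < split_time]
test = events[events["timestamp"] >= split_time]
```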

I needed to get it off my chest (again).

Personalisation is Personal

Without turning this into a play on words, it’s interesting that personalization is very personal. The more people I talk to about what personalization is, the more answers I get.

This is one of those funny paradoxes: basically any survey you read makes it clear that, if asked, people want personalization. But if you ask the same people to describe personalization, they can’t explain it.

When discussing personalization, it often comes down to anecdotal examples of where Netflix’s recommender didn’t work. But to turn that on its head: the reason Netflix is brought up is that it’s so good that when it fails, it gets noticed, while most other attempts at personalization are hard to spot at all.

I am curious: can any of you explain what good personalization is (without using specific examples)?

Usually, I define it as optimizing for two things:
* Help the user find what they are looking for fast
* Educate the users and help them find the content they didn’t know they needed but do.

Image created with Bing Image Creator (“how an AI monster creates personalisation”): https://www.bing.com/images/create/how-an-ai-monster-creates-personalisation/654e30f34782406ab53817eab6538d03?id=JCygKOhTQOy09Os8B%2fBjnQ%3d%3d&view=detailv2&idpp=genimg&idpclose=1&FORM=SYDBIC

What is Retrieval Augmented Generation (RAG)?

A RAG framework contains an LLM paired with a knowledge base.

A RAG process takes a query and assesses if it relates to subjects defined in the paired knowledge base. If yes, it searches its knowledge base to extract information related to the user’s question. Any relevant context in the knowledge base is then passed to the LLM along with the original query, and an answer is produced.

This helps with two things: firstly, it reduces the risk of hallucinations, and secondly, it reduces the chance that the LLM will leak sensitive data, since you can leave that data out of the training data.

The knowledge base can also be a recommender system, which allows the LLM to extract context and feed it into the recommender, which in return delivers crisp recommendations. (This idea is investigated in the RecSys ’23 article “Retrieval-augmented Recommender System: Enhancing Recommender Systems with Large Language Models”, https://lnkd.in/dHvK8SNJ.)
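To make the flow concrete, here is a minimal, hand-rolled sketch of the retrieve-then-generate loop described above; the toy knowledge base, the keyword-overlap “retrieval” and the call_llm placeholder are all assumptions for illustration, not a production setup:

```python
# Toy knowledge base: in practice this would be a vector store, search index
# or, as suggested above, a recommender system.
knowledge_base = {
    "return policy": "Items can be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Naive keyword overlap as a stand-in for proper retrieval.
    def score(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(
        knowledge_base.items(),
        key=lambda kv: score(kv[0] + " " + kv[1]),
        reverse=True,
    )
    return [doc for _, doc in ranked[:top_k]]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your LLM of choice here.
    return f"(LLM answer based on: {prompt!r})"

def rag_answer(query: str) -> str:
    # Retrieve relevant context, then pass it to the LLM together with the query.
    context = "\n".join(retrieve(query))
    prompt = (
        f"Answer the question using only this context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return call_llm(prompt)

print(rag_answer("How long does standard shipping take?"))
```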

Dialogue summaries with LLMs

You are part of a group on some messaging service, Slack, WhatsApp or similar. Half the people in the group are on the other side of the world, so you wake up to kilometres of chat chains, or you are in a group where people like to send messages far more often than you like to read them. Wouldn’t it be great if you could just get the gist of the discussion rather than go through all of it? These are examples from my personal life; more serious ones could be summarization of police reports, medical reports or customer interviews.

This is where LLMs, with their text understanding, should be able to help. They can save you from those long message chains about who should bring what to the next school picnic. But how does it actually work? To start out, it’s always good to have some data you can use to evaluate (or train) your LLM.

One such dataset can be found here [1] and loaded using Hugging Face.

huggingface_dataset = "knkarthick/dialogsum"
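Assuming the Hugging Face datasets library is installed, loading it looks roughly like this (the field names follow the dataset card):

```python
from datasets import load_dataset

# Load the DialogSum dataset named above from the Hugging Face Hub.
dataset = load_dataset("knkarthick/dialogsum")

# Each example pairs a dialogue with a human-written summary.
example = dataset["train"][0]
print(example["dialogue"])
print(example["summary"])
```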

The dataset contains dialogues and human-written summaries of them. This is an example of such a dialogue from the dataset:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

This dialogue is labelled with the following human-curated summary:

#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

To make the LLM summarise it, you can paste something like the following into the prompt:

Summarize the following conversation:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
Summary:

Remember that LLMs are there to complete text strings, so the model doesn’t really know this is a dialogue. It treats it like any other document whose continuation it is asked to predict. This is important to remember because it means that “Summarize the following conversation:” might not be the best way to start the document, and “Summary:”, for that matter, might work better written as “What did they talk about:”. It depends on how the model was trained, on what data, and whether it was trained with specific templates for this task.
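To try this concretely, a prompt like the one above can be run through FLAN-T5 (the model used later in this post) via the transformers library; the checkpoint size and generation settings are just example choices:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # example checkpoint; larger ones summarize better
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dialogue = """#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there."""

prompt = f"Summarize the following conversation:\n{dialogue}\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```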

Another thing to consider when creating a dialogue summarizer is to keep an eye on the tokenizer.

The diagram is taken from the Attention is all you need article [2].

The diagram shows the impressive Transformer architecture, which people are going to great lengths to train. But what surprised me was that the tokenizer doesn’t recognize “#Person1#” as a special token, which means that if I replace it with “#kim#”, the Transformer sees something else, and the output of the model can therefore change. I wrote a bit more about that here on the blog. The reason to point this out is that the tokenizer is usually not considered part of the pipeline that needs to be trained, and it is not part of the diagram above; it is what formats the inputs at the bottom of the diagram.
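You can see this directly by inspecting the tokenizer; here using FLAN-T5’s tokenizer as an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Neither speaker marker is a single reserved token: both are split into
# sub-word pieces, and different names split differently, so the model
# receives different inputs.
print(tokenizer.tokenize("#Person1#"))
print(tokenizer.tokenize("#kim#"))
```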

References:

[1] DialogSum dataset, knkarthick/dialogsum on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum

[2] Vaswani et al., “Attention Is All You Need”, NeurIPS 2017, https://arxiv.org/abs/1706.03762

Token names can change the output of your LLM when creating dialogue summaries

I ran into an interesting but unexpected effect when I was playing around with generating dialogue summaries with the FLAN-T5 LLM. 

If you are new to LLM dialogue summarization, have a look at this post: Dialogue summaries with LLMs.

Changing the token names changes the output. The first version of the dialogue used #kim# and #person1# as speaker tokens.

To my knowledge, dialogue members are usually defined as #person1#, #person2#, etc. I was lazy, so I used #kim# for one of them, and I discovered that the model assumed me to be a woman (Kim is a female name in many places) and generated the following summary:

Kim is unsure of her dream job.

I looked at it and thought, okay, I will change the name to see if a more typically male name would make it write “he” instead of “she”. So I changed Kim to Lars to see if the model was using information from the token. I had expected it to return the same summary as above, only with Lars and “he” instead. But it returned:

#Lars#: I don’t think a dream job exists, or that there is one out there.

Now, if it’s a male name, we are apparently no longer unsure. Lastly, I tried #person2# instead of any specific name and got the following:

#Person1#: I don’t think a dream job exists, or that there is one out there

This is the same as the output with Lars, only the subject has changed: now it’s #Person1# who doesn’t think a dream job is out there. I’m not sure I have investigated it enough to call it bias, but I will be sure to keep my subject tokens to #person1# in the future.

The dialogue was about defining your dream job, but the exact content is unimportant here.

The problem described here is not with the Transformer trained inside the LLM in particular. I would have thought that the tokenizer would have some reserved tokens for words of the form #<something>#. On the other hand, the Transformer should be able to understand it if the data used to train it also contained different kinds of words of the form #<something>#.
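For anyone who wants to reproduce the effect, here is a rough sketch of the experiment; the dialogue is made up for illustration (the original dream-job dialogue is not included here), and the checkpoint is an example choice:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Stand-in dialogue: only the speaker token changes between runs.
dialogue_template = """{speaker}: I don't know what my dream job would be.
#person1#: Maybe there is no single dream job out there.
{speaker}: Perhaps you are right."""

def summarize(speaker: str) -> str:
    prompt = (
        "Summarize the following conversation:\n"
        + dialogue_template.format(speaker=speaker)
        + "\nSummary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Any difference between these outputs comes from the speaker token alone.
for speaker in ["#kim#", "#Lars#", "#person2#"]:
    print(speaker, "->", summarize(speaker))
```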