Dialogue summaries with LLMs

You are part of a group on some messaging service, Slack, WhatsApp or similar, half the people in the group are on the other side of the world, and you wake up to kilometres of chat chains. Or you are simply in a group where people like to send messages much more frequently than you like to read them. Wouldn't it be great if you could just get the gist of the discussion rather than go through all of it? These are just examples from my personal life; more serious ones could be summarization of police reports, medical reports or customer interviews.

This is where LLMs, with their text understanding, should be able to help. They can save you from those long message chains about who should bring what for the next school picnic. But how does it actually work? To start out, it's always good to have some data you can use to evaluate (or train) your LLM.

One such dataset can be found here [1] and loaded using Hugging Face.

huggingface_dataset = "knkarthick/dialogsum"
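If you want to follow along, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch (the field names dialogue and summary are the ones this dataset uses):

from datasets import load_dataset

huggingface_dataset = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset)

# Each record holds a dialogue and its human-written summary.
example = dataset["train"][0]
print(example["dialogue"])
print(example["summary"])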

The dataset contains dialogues and human summarization of them. This is an example of such a dialogue from the dataset:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

This dialogue is labelled with the following human-curated summary:

#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

To make the LLM summarise it, you can paste something like the following into the prompt:

Summarize the following conversation:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
Summary:

Remember that LLMs are there to complete text strings, so the model doesn't really know this is a dialogue. It treats it like any other document it is asked to predict the continuation of. This is important to keep in mind, because it means that "Summarize the following conversation:" might not be the best way to start the document, and the same goes for "Summary:"; it might work better written as "What did they talk about:". It depends on how the model was trained, on the data, and on whether the model was trained with specific templates for this task.
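To make this concrete, here is a minimal sketch of feeding such a prompt to a seq2seq model. I use google/flan-t5-base (the model from the next post) purely as an example, and the prompt wording is exactly the kind of thing to experiment with:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dialogue = """#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now."""

prompt = f"Summarize the following conversation:\n{dialogue}\nSummary:"

# Tokenize the prompt and let the model complete it.
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))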

Another thing to consider when creating a dialogue summarizer is to keep an eye on the tokenizer.

The diagram is taken from the Attention Is All You Need article [2].

The diagram shows the impressive transformer architecture that people go to great lengths to train. But what surprised me was that the tokenizer doesn't recognize "#Person1#" as a special token, which means that if I replace it with "#kim#", the transformer sees something else, and that can change the output of the model. I wrote a bit more about that in a separate blog post. A reason to point this out is that tokenizers are usually not considered part of the pipeline that needs to be trained, and they are not part of the diagram above; they are what formats the inputs at the bottom of the diagram.
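You can check this yourself by tokenizing the same line with different speaker markers. A small sketch, assuming the FLAN-T5 tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# "#Person1#" is not a reserved token, so it is split into sub-word pieces,
# and "#kim#" becomes a different sequence of pieces entirely.
print(tokenizer.tokenize("#Person1#: What time is it, Tom?"))
print(tokenizer.tokenize("#kim#: What time is it, Tom?"))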

References:

[1] DialogSum dataset, knkarthick/dialogsum on Hugging Face.
[2] Vaswani et al., "Attention Is All You Need", 2017.

Token names can change the output of your LLM when creating dialogue summaries

I ran into an interesting but unexpected effect when I was playing around with generating dialogue summaries with the FLAN-T5 LLM. 

If you are new to LLM dialogue summarization, have a look at the post Dialogue summaries with LLMs.

Changing the token names changes the output. In the first version of the dialogue, the speakers were marked #kim# and #person1#.

To my knowledge, dialogue members are usually marked as #person1#, #person2#, etc. I was lazy, so I used #kim# for one of them, and I discovered that the model assumed me to be a woman (Kim is a female name in many places) and generated the following summary:

Kim is unsure of her dream job.

I looked at it and thought, okay, I will change the name to see if a more common male name will make it write he instead of she. So I changed Kim to Lars, to see whether it used information from the token. I had expected it to return the same as above, only with Lars and he instead. But it returned:

#Lars#: I don’t think a dream job exists, or that there is one out there.

Now if it's a male name, we are no longer unsure. Lastly, I tried #person2# instead of any specific name, and got the following:

#Person1#: I don’t think a dream job exists, or that there is one out there

This is the same as the one with Lars, only the subject has changed; now it's person1 who doesn't think a dream job is out there. I'm not sure if I have investigated it enough to call it bias, but I will be sure to keep my speaker tokens to #person1#, #person2# in the future.

The dialogue was about defining your dream job, but the exact content is unimportant here.

The problem described here is not about the transformers trained inside the LLMs in particular. I would have thought that the tokenizer would have some reserved tokens for words of the form #<something>#. On the other hand, the transformers should be able to understand such markers if the data used to train them also contained words of the form #<something>#.
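If you control the model, one way around this is to register the speaker markers as dedicated tokens yourself. A sketch, with the caveat that the new token embeddings are untrained until you fine-tune with data containing them:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Treat the speaker markers as single tokens instead of sub-word pieces.
tokenizer.add_tokens(["#Person1#", "#Person2#"])
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens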

RecSys Summer School 2023

Thank you to everyone who planned, executed, and joined the RecSys Summer School #RSSS2023. It was an intense week, with many great talks but also plenty of discussions and different perspectives heard and had during the breaks and social events. I love the intense immersion into the RecSys topic, and it is hard not to feel a bit empty after it, but luckily RecSys 2023 (https://lnkd.in/eiDAJP2U) is just around the corner (in time, not location).

I am still digesting the talks and content. I was happy that the week also reflected that evaluation is discussed extensively in Recommender Systems circles. Evaluating recommenders is hard, if not almost impossible offline, and very difficult in online scenarios. There are many reasons for that, such as data bias and the fickle nature of users. But one thing that was highlighted by many (including myself, wink wink) was that an evaluation is not a number. Evaluation is answering a question, where metrics provide evidence to answer that question. Lien Michiels gave my favourite talk of the week, about offline evaluation (even if she did have many slides with bullet points).

A recommender system is never a single-objective component, and I enjoyed Robin Burke talking about the intricacies of multistakeholder recommenders. There are always many parties who expect to be considered in a recommender system. Besides the stakeholders, there are also social constraints, such as fairness, that should be considered. Christine Bauer gave a very interesting talk on this subject.

Recommenders should also educate the user about what is in the catalogue. Most importantly, the recommender should not confuse users and push them away from the current use case. This and other subjects were discussed in the frame of e-commerce in an excellent talk delivered by Humberto Corona, who shared his experiences with recommenders in e-commerce.

Conversational recommenders will be a big thing with the new chatbots like ChatGPT, and Cataldo Musto gave an excellent introduction to these. It is still a research question of how to merge the power of the bots with good recommendations.

Knowledge-based recommendations are everything that starts with the content metadata and uses it to find similarities and create recommendations based on that. Pasquale Lops and Marco de Gemmis gave us a good introduction. They explained why we should consider content data first and behavioural data as side information, not vice versa.

This was just to mention a few. All talks were fascinating!

Thank you again to Alan Said, Toine Bogers and Maria Maistro for organising and for inviting me to talk!

A book conversation

Hi ChatGPT, thank you for recommending my book to me. I am delighted you wrote so positively about it.

What is also worth knowing, before I praise it for being the oracle it must be to recommend my book, is that I actually tried regenerating the text four times, and then, when my book didn't appear, told it so.

Then I reposted the question and finally got the version I sought, where it was recommending my book.

You usually never see how much work went into making it say the things it’s quoted to say. Remember, the text is only ever as credible as its sources and writers.

The book is out!

The printed book is out!

Four years in the making, nothing by the standards of George R. R. Martin, but still a long time. I was happy to see that last week, before being fully released, it appeared among the 10 most sold of Manning's early releases.
Thanks to all who bought it, supported it, reviewed it and waited so long for the final version!


I am delighted it is completed and I hope that you will enjoy it. Please feel free to comment, review or discuss with me. I also do talks if there is an audience that would like to hear about recommenders.

For now, happy days! Can’t wait to hold a paper copy later this week.

The printed book (and the ebook) are available here and will be for sale on all good webshops in the near future.

Get the book here

Christmas present recommender

There are many people out there who can feel the adrenaline slowly rising and the stress setting in as we near the jolly time of Christmas. We are not all bound to spend fortunes on presents to celebrate the birth of Christ, but if you are, and are unsure what you should give your loved ones, then you have an excellent opportunity to try out implementing a recommender system.

If you have a friend whom you know likes books, or you are just looking for a book yourself, one way to approach it is by downloading one of the public datasets with book ratings. For example, there is one collected from the Book-Crossing website. Using that and implementing a content-based recommender, you could find books similar to the ones your friend likes. The cool thing about content-based recommenders is that they don't need ratings from users, so if you have descriptions of other books, you can compare them with the descriptions of the books your friend liked.

So, finding the perfect present for a book-loving friend could be only a few steps away (a small code sketch follows the list):

  • Get a list of what the friend likes (probably that can be found on goodreads.com)
  • A data set with book descriptions
  • (optional) My book to understand how to implement a content-based recommender (book)
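To make the content-based idea concrete, here is a minimal sketch using TF-IDF and cosine similarity over book descriptions; the titles and descriptions are just placeholders for whatever dataset you end up using:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = {
    "The Hobbit": "A hobbit leaves home on an adventure with dwarves and a wizard.",
    "The Fellowship of the Ring": "A hobbit sets out on a quest to destroy a ring of power.",
    "Pride and Prejudice": "A witty courtship among the English landed gentry.",
}
liked_title = "The Hobbit"  # what your friend likes

titles = list(descriptions)
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(descriptions[t] for t in titles)

# Rank all other books by similarity to the liked one.
sims = cosine_similarity(vectors[titles.index(liked_title)], vectors).ravel()
ranked = sorted(zip(titles, sims), key=lambda pair: -pair[1])
print([title for title, _ in ranked if title != liked_title])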

You could also go more general and take Amazon's product dataset, or use one with movies, in which case I also describe other types of recommenders you can use, such as collaborative filtering or learning to rank. I had much fun with the last one.

If you do implement a recommender, please write a comment and happy seasonal festivity, if you celebrate this one.

RecSys 2017

I am just back from Como, Italy, where I attended RecSys 2017. It was great to meet in person all the people whose work I have been reading and whose presentations I have YouTubed. And more importantly, it was great to talk about my work and get it validated by people working on similar problems. I am in the legal content business now, and I have to admit that I didn't find anybody working with the same type of content as us, but it shares some of the challenges that, for example, news sites battle with. A large part of my reason for going to a conference like RecSys is also to polish the reasons why I do what I do. Meeting so many geeks and discussing all the nerdy details is a great way to recharge my motivation for doing it.

RecSys is an academic conference, where aspiring researchers come to show off their research, improving accuracy in predicting ratings on a narrow list of datasets. But it also has an industry track, where the industry comes to show off how they do recommender systems in practice. I think that both parts are great, only I would wish that the research were a little more geared towards running systems online, and the industry a bit more technical. But it is a huge strength that both industry and academia are present, and I hope that will continue.

At lunch, a group of us would meet under the trees of the beautiful Villa Erba, where we had lots of interesting talks about how we could make the research part more relevant for businesses, most of us being practitioners. And we all agreed (stop me if I am wrong :)) that it is a shame that all research is evaluated offline, using metrics that most businesses (and researchers) agree say little about the quality of the recommender system. The result is that researchers are left with the task of optimizing machine learning algorithms on datasets that are meant for recommender systems, rather than actually building recommender systems.

The metrics used are also a point to discuss, because precision and recall at K are not good measurements of a recommendation, but they are the best that we have come up with. Yet if that is how we measure the quality of research, and use it as a benchmark, we should also agree on how to split the data into training and test sets, and how to order each user's ratings when splitting them into the test set. I saw several good talks at the conference about creating a framework for doing this, but it seems to get forgotten again as soon as the question round is finished.
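To show the kind of decision that needs standardizing, here is a sketch of one possible convention: a per-user chronological split, where each user's most recent ratings are held out as the test set (the column names are assumptions about the rating data at hand):

import pandas as pd

def chronological_split(ratings: pd.DataFrame, test_fraction: float = 0.2):
    # Hold out the latest test_fraction of each user's ratings as the test set.
    ratings = ratings.sort_values(["user_id", "timestamp"])
    group = ratings.groupby("user_id")
    test_size = (group["user_id"].transform("size") * test_fraction).astype(int)
    rank_from_end = group.cumcount(ascending=False)  # 0 = most recent rating
    is_test = rank_from_end < test_size
    return ratings[~is_test], ratings[is_test]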

Next year's conference has a challenge sponsored by Spotify, which is great, and I look forward to playing with the data. I wish that they could also make a live service available for researchers to try out their algorithms online, and thereby also focus research on many of the practical issues faced by a recommender systems engineer.

No matter what, I hope to be there again at RecSys 2018.

Recommending a Recommender summer project

Today it's sunny in Copenhagen, so it would be a great day to start collecting data for an ice-cream vendor recommender system. What you need is a lot of ratings of different ice-cream shops, maybe some data on which types of ice-cream they sell, and whether they have seats and such.

Then you can spend the days when it's too hot to be outside reading about how to handle geodata and how to implement the recommender system.

You can implement a content-based recommender using the data on the vendors to suggest similar places for people planning a trip, and use a "learning to rank" algorithm to combine ratings and position into recommendations for users on the move (a rough sketch of that combination follows below).
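As a rough sketch of the ratings-plus-position idea, one simple baseline is to discount each vendor's average rating by its distance to the user; a learning-to-rank model would learn this trade-off from data instead of using a hand-picked formula like the one below (the vendor data is made up):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two coordinates, in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def score(vendor, user_lat, user_lon):
    # Nearer and better-rated vendors win.
    distance = haversine_km(user_lat, user_lon, vendor["lat"], vendor["lon"])
    return vendor["avg_rating"] / (1 + distance)

vendors = [
    {"name": "Gelato on the corner", "lat": 55.676, "lon": 12.568, "avg_rating": 4.6},
    {"name": "Soft ice by the lakes", "lat": 55.683, "lon": 12.560, "avg_rating": 4.1},
]
ranked = sorted(vendors, key=lambda v: -score(v, 55.678, 12.571))
print([v["name"] for v in ranked])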

Incidentally, today is a good day to start such a project, because Manning has a deal of the day on "Geoprocessing with Python", which means you can get it at half price today, and tomorrow you will find the same deal for my book "Practical Recommender Systems". Go and have an ice-cream and think about it.

And if you do it, please let me know how it went!

Can copy-paste behaviour predict film taste?

(A deleted scene from the Practical Recommender Systems book)

People use a computer in different ways. When I need to copy-paste something, I always use <ctrl>-c and <ctrl>-v, while my wife insists (how irritating!) on always using the mouse right-click menu. I do not know if such habits can be translated into implicit ratings, unless they could be based on whether you are a geek or not. Just for fun, we could pursue the thought a bit. Let's pretend that we are looking at films. If we have The Matrix, the geek movie of all time, who would you recommend it to if you only knew how the user performs copy-paste? I would answer: the ones using the <ctrl> way. My wife is helping me a lot with this (book), so I should be careful about guessing what kind of films a person who copy-pastes with the right-click menu likes. Still, I would put her way in between the geek way and the ones doing copy-paste using the menu at the top of the window.

Asking around, I tried to find members of each group: the ones that use the window menu to copy-paste, the ones that right-click, and finally the ones that use only the keyboard. Then I asked them to point out which of three movies they liked most. This was the result of my little survey:

  • <ctrl>-way = The Matrix.
  • Right-click-way = Life of Walter Mitty
  • Window menu-way = You’ve Got Mail

[Figure: effort to do copy-paste]
You can maybe think of it as shown in the figure.

To put this into practice, I would need to record a series of copy-paste events from the user. At that point, the system could recommend a movie that fits this user’s behavior.
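Just to illustrate the (tongue-in-cheek) idea, here is a sketch that counts a user's copy-paste events and maps the dominant style to a movie; the event names and the mapping are made up:

from collections import Counter

STYLE_TO_MOVIE = {
    "keyboard": "The Matrix",
    "right_click": "Life of Walter Mitty",
    "window_menu": "You've Got Mail",
}

def recommend(copy_paste_events):
    # copy_paste_events is a list like ["keyboard", "keyboard", "right_click"].
    style, _count = Counter(copy_paste_events).most_common(1)[0]
    return STYLE_TO_MOVIE.get(style)

print(recommend(["keyboard", "keyboard", "right_click"]))  # -> "The Matrix"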

I hope that this has never been implemented in practice. Still, if you think about it, if you have the choice of recommending between "The Matrix" and "You've Got Mail", the copy-paste behaviour could maybe help the system better understand what to recommend.

The conclusion is that although evidence might not be an obvious telltale about users’ tastes, it might contribute to making the implicit ratings more precise.