Dialogue summaries with LLM

You are part of a group on some messaging service, Slack, WhatsApp or similar. Half the people in the group are on the other side of the world, so you wake up to kilometres of chat chains, or you are simply in a group where people like to send messages much more frequently than you like to read them. Wouldn't it be great to just get the gist of the discussion rather than go through all of it? These are examples from my personal life; more serious ones could be summarization of police reports, medical reports or customer interviews.

This is where LLMs, with their text understanding, should be able to help. They can save you from those long message chains about who should bring what to the next school picnic. But how does it actually work? To start out, it's always good to have some data you can use to evaluate (or train) your LLM.

One such dataset can be found here [1] and loaded using Hugging Face.

huggingface_dataset = "knkarthick/dialogsum"
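As a minimal sketch, assuming the Hugging Face datasets library is installed, it can then be loaded like this:

from datasets import load_dataset

dataset = load_dataset(huggingface_dataset)

# Each record pairs a dialogue with a human-written summary.
example = dataset["train"][0]
print(example["dialogue"])
print(example["summary"])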

The dataset contains dialogues and human-written summaries of them. This is an example of such a dialogue from the dataset:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

This dialogue is labelled with the following human-curated summary:

#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
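Since every dialogue comes with a reference summary like this, you can score whatever the LLM produces against it. A minimal sketch, assuming the Hugging Face evaluate library (and its rouge_score dependency) is installed, and with a hypothetical model output:

import evaluate

rouge = evaluate.load("rouge")

model_summary = "#Person1# has to hurry to catch the nine-thirty train."  # hypothetical model output
reference = "#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time."

# Returns overlap scores such as rouge1, rouge2, rougeL, rougeLsum.
scores = rouge.compute(predictions=[model_summary], references=[reference])
print(scores)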

To make the LLM summarise it, you can use something like the following as the prompt:

Summarize the following conversation:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
Summary:
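As a sketch of how that looks in code, assuming the transformers library and google/flan-t5-base as the model (an arbitrary choice here; any instruction-tuned model would do):

from transformers import pipeline

summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = """Summarize the following conversation:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
Summary:"""

print(summarizer(prompt, max_new_tokens=50)[0]["generated_text"])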

Remember that LLMs are built to complete text strings, so the model doesn't really know this is a dialogue. It treats it like any other document whose continuation it is asked to predict. This is important to remember, because it means that "Summarize the following conversation:" might not be the best way to start the document, and the same goes for "Summary:"; it could work better phrased as "What did they talk about:". It depends on how the model was trained, on the data, and on whether the model was trained with specific templates for this task.
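One way to check is to run the same dialogue through a few different templates and compare the outputs. A sketch, reusing the summarizer pipeline and the example loaded earlier (the third template is just an illustrative variant):

templates = [
    "Summarize the following conversation:\n{dialogue}\nSummary:",
    "{dialogue}\nWhat did they talk about:",
    "Dialogue:\n{dialogue}\nWhat was discussed?",
]

dialogue = example["dialogue"]
for template in templates:
    prompt = template.format(dialogue=dialogue)
    print(summarizer(prompt, max_new_tokens=50)[0]["generated_text"])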

Another thing to consider when creating a dialogue summarizer is to keep an eye on the tokenizer.

[Figure: the Transformer architecture, taken from the Attention Is All You Need paper [2].]

The diagram shows the impressive transformer architecture that people go to great lengths to train. But what surprised me was that the tokenizer doesn't recognize "#Person1#" as a special token, which means that if I replace it with "#kim#", the transformer sees something else, and that can change the output of the model. I wrote a bit more about that in the post "Token names can change the output of your LLM when creating dialogue summaries". A reason to point this out is that tokenizers are usually not considered part of the pipeline that needs to be trained, and they are not shown in the diagram above; they are what formats the inputs at the bottom of the diagram.
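You can see this directly by tokenizing the two variants. A sketch, again assuming a flan-t5 tokenizer (other tokenizers will split the text differently but show the same effect):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Neither name is a single special token; each is split into ordinary subwords.
print(tokenizer.tokenize("#Person1#: What time is it, Tom?"))
print(tokenizer.tokenize("#kim#: What time is it, Tom?"))

The two inputs tokenize into different subword sequences, so the model really does see something else when only the name changes.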

References:

[1] knkarthick/dialogsum, Hugging Face Datasets: https://huggingface.co/datasets/knkarthick/dialogsum
[2] Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
