Token names can change the output of your LLM when creating dialogue summaries

I ran into an interesting and unexpected effect while playing around with generating dialogue summaries with the FLAN-T5 LLM.

If you are new to LLM dialogue summarization, have a look at this post: Dialogue summaries with LLMs.

Changing the token names changes the output. The first version of the dialogue used the speaker tokens #kim# and #person1#.

To my knowledge, dialogue participants are usually labelled #person1#, #person2#, etc. I was lazy, so I used #kim# for one of them, and I discovered that the model assumed Kim to be a woman (Kim is a female name in many places). It generated the following summary:

Kim is unsure of her dream job.

I looked at it and thought: okay, I will change the name to see whether a more common male name makes it write he instead of she. So I changed Kim to Lars, a more common male name, to see if the model was using information from the token itself. I expected it to return the same summary as above, only with Lars and he. Instead, it returned:

#Lars#: I don’t think a dream job exists, or that there is one out there.

Now that the name is male, the subject is no longer unsure. Lastly, I tried #person2# instead of any specific name, and got the following:

#Person1#: I don’t think a dream job exists, or that there is one out there

This is the same output as with Lars, only the subject has changed: now it’s #Person1# who doesn’t think a dream job is out there. I’m not sure I have investigated this enough to call it bias, but I will be sure to keep my subject tokens to #person1#-style names in the future.
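The experiment is straightforward to reproduce. Below is a minimal sketch using the Hugging Face transformers library with the google/flan-t5-small checkpoint (an assumption on my part; the post does not say which FLAN-T5 size was used), and a stand-in dialogue rather than the original one. It summarizes the same conversation once per speaker token so the outputs can be compared side by side.

```python
# Minimal sketch: summarize the same dialogue with different speaker tokens
# and compare the generated summaries. Assumes google/flan-t5-small; the
# original experiment may have used a larger FLAN-T5 checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Stand-in dialogue about dream jobs; {speaker} is replaced by each token name.
dialogue = (
    "#person1#: What would your dream job be?\n"
    "{speaker}: I am not sure. I don't think a dream job exists, "
    "or that there is one out there."
)

def summarize(speaker_token: str) -> str:
    prompt = (
        "Summarize the following conversation.\n\n"
        + dialogue.format(speaker=speaker_token)
        + "\n\nSummary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

summaries = {tok: summarize(tok) for tok in ("#kim#", "#Lars#", "#person2#")}
for tok, summary in summaries.items():
    print(f"{tok}: {summary}")
```

The outputs from the small checkpoint will not match the ones quoted above word for word, but swapping only the speaker token isolates its effect on the summary.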

The dialogue was about defining your dream job, but the exact content is unimportant here.

The problem described here is probably not with the trained transformer weights in particular. I would have thought that the tokenizer would have some reserved tokens for words of the form #&lt;something&gt;#, but it apparently does not, so the names reach the model as ordinary text. On the other hand, the model should be able to handle them if the data used to train it also contained different words of the form #&lt;something&gt;#.
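The reserved-token hypothesis is easy to check directly. The sketch below (my own check, assuming the google/flan-t5-small tokenizer; the FLAN-T5 sizes share the same SentencePiece vocabulary) prints how each speaker name is tokenized, showing that #kim# and friends are split into ordinary subword pieces rather than mapped to single reserved tokens.

```python
# Inspect how the FLAN-T5 tokenizer (SentencePiece) splits #<something># names.
# Assumption: google/flan-t5-small; its tokenizer is shared across FLAN-T5 sizes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

for name in ("#kim#", "#Lars#", "#person1#", "#person2#"):
    pieces = tokenizer.tokenize(name)
    # Each name comes back as several subword pieces, not one reserved token,
    # so the literal name (and whatever the model associates with it) leaks in.
    print(name, "->", pieces)
```

Since the name survives as plain subwords, the model is free to apply whatever associations (such as gender) it learned for that name during training.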
