A book conversation

Hi ChatGPT, thank you for recommending my book to me. I am delighted you wrote so positively about it.

What is also worth knowing, before I praise it for being the oracle it must be to recommend my book, is that I actually tried regenerating the text four times, and then when my book didn’t appear, told it that:

Then I reposted the question and finally came up with the version I sought, where it was recommending my book.

You usually never see how much work went into making it say the things it’s quoted as saying. Remember, the text is only ever as credible as its sources and writers.

The book is Out!

The printed book is out!

Four years in the making, nothing by the standards of George R. R. Martin, but still a long time. I was happy to see that last week, before being fully released, it appeared among the 10 most sold in Manning’s early releases.
Thanks to all who bought it, supported it, reviewed it and waited so long for the final version!

I am delighted it is completed and I hope that you will enjoy it. Please feel free to comment, review or discuss with me. I also do talks if there is an audience that would like to hear about recommenders.

For now, happy days! Can’t wait to hold a paper copy later this week.

The printed book (and the ebook) are available here and will be for sale on all good webshops in the near future.

Get the book here

Christmas present recommender

Many people out there can feel the adrenaline slowly rising in their bodies, and the stress starting, as we near the jolly time of Christmas. We are not all bound to spend fortunes on presents to celebrate the birth of Christ, but if you are, and you are unsure what to give your loved ones, then you have an excellent opportunity to try implementing a recommender system.

If you have a friend whom you know likes books, or you are just looking for a book yourself, one way to approach it is to download one of the public datasets of book ratings. For example, there is one collected from the BookCrossing website. Using that and implementing a content-based recommender, you could find books similar to the ones that your friend likes. The cool thing about content-based recommenders is that they don’t need ratings from users, so if you have descriptions of other books you can compare them with the descriptions of the books your friend liked.

So, finding the perfect present for a book-loving friend could be only a few steps away:

  • Get a list of what the friend likes (that can probably be found on goodreads.com)
  • Get a data set with book descriptions
  • (optional) My book to understand how to implement a content-based recommender (book)
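The steps above can be sketched in a few lines. This is a minimal illustration of a content-based recommender, assuming TF-IDF vectors and cosine similarity as the way to compare descriptions (one common choice, not necessarily the one used in the book). The catalogue, titles and function names are all made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical toy catalogue: titles and descriptions are invented for illustration.
BOOKS = {
    "Dragon Quest": "a young wizard fights a dragon in a magic kingdom",
    "Castle of Spells": "a wizard school hidden in a castle full of magic",
    "Bread at Home": "recipes for baking bread and cakes in your own kitchen",
    "Space Traders": "merchants fly a spaceship between distant planets",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(descriptions):
    """Turn each description into a sparse TF-IDF vector (dict: word -> weight)."""
    tokens = {title: tokenize(text) for title, text in descriptions.items()}
    n_docs = len(tokens)
    # Number of descriptions each word appears in.
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    vectors = {}
    for title, words in tokens.items():
        counts = Counter(words)
        vectors[title] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_titles, descriptions, top_n=3):
    """Rank the books the friend has not read by similarity to the liked ones."""
    vectors = tfidf_vectors(descriptions)
    # The friend's profile is the sum of the vectors of the liked books.
    profile = Counter()
    for title in liked_titles:
        for word, weight in vectors[title].items():
            profile[word] += weight
    scores = {
        title: cosine(profile, vector)
        for title, vector in vectors.items()
        if title not in liked_titles
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(["Dragon Quest"], BOOKS, top_n=2))
```

With real data you would replace BOOKS with the descriptions from your data set and pass in the titles from your friend’s list.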

You could also go more general and take Amazon’s product dataset, or use one with movies, in which case I also describe other types of recommenders you can use, such as collaborative filtering or learning to rank. I had much fun with the last one.

If you do implement a recommender, please write a comment. Happy seasonal festivities, if you celebrate this one.

RecSys 2017

I am just back from Como, Italy, where I attended RecSys2017. It was great to meet in person all the people whose work I have been reading and whose presentations I have YouTubed. And more importantly, it was great to talk about my work and get it verified by people working on similar problems. I am in the legal content business now, and I have to admit that I didn’t find anybody who had the same type of content as us, but it has some of the same challenges that, for example, news sites battle with. A large part of my reason for going to a conference like RecSys is also to polish up the reasons why I do what I do. Meeting so many geeks and discussing all the nerdy details is a great way to recharge my motivation for doing it.

RecSys is an academic conference, where aspiring researchers come to show off their research, improving accuracy in predicting ratings on a narrow list of datasets. But it also has an industry track, where the industry comes to show off how they do recommender systems in practice. I think that both parts are great; I only wish that the research were a little more geared towards running recommenders online, and the industry talks a bit more technical. But it is a huge strength that both industry and academia are present, and I hope that will continue.

At lunch, a group of us would meet under the trees of the beautiful Villa Erba, where we had lots of interesting talks about how we could make the research part more relevant for businesses – most of us being professionals. And we all agreed (stop me if I am wrong :)) that it is a shame that all research is evaluated offline, using metrics that most businesses (and researchers) actually agree don’t say anything about the quality of the recommender system. The result is that researchers are left with the task of optimizing machine learning algorithms on datasets that are meant to be used for recommender systems, rather than actually doing recommender systems.

The metrics used are also a point to discuss, because precision and recall at K are not a good measurement of a recommendation, but they are the best that they/we have come up with. But if that is how we should measure the quality of research, and use it as a benchmark, we should also agree on how to split the data into training and test sets, and how to sort the ratings, split by user, in the test set. I have seen several good talks at the conference about creating a framework for doing this, but it seems that it gets forgotten again as soon as the question round is finished.
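To make the discussion concrete, here is a small sketch of precision and recall at K, together with one possible convention for splitting the data: each user’s ratings are sorted by timestamp and the most recent ones go to the test set. The split is only an assumption for illustration (it is exactly the kind of choice that has not been agreed on), and the function names and tuple layout are made up.

```python
from collections import defaultdict


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations.

    recommended: ranked list of item ids produced by the recommender.
    relevant: set of item ids the user actually liked in the test set.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def temporal_split(ratings, test_fraction=0.2):
    """Split ratings per user: the newest ratings of each user become test data.

    ratings: list of (user, item, rating, timestamp) tuples (assumed layout).
    """
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda row: row[3])  # oldest first
        n_test = int(len(rows) * test_fraction)
        cut = len(rows) - n_test
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# One of four top-2 recommendations is relevant: precision 0.5, recall 1/3.
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2))
```

Two papers that use a random split and a temporal split respectively will report different numbers for the same algorithm, which is why a shared convention matters for benchmarking.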

Next year’s conference has a challenge sponsored by Spotify, which is great, and I look forward to playing with the data. I wish that they could also make a live service available for researchers to try out their algorithms live, thereby also focusing research on many of the practical issues that a recommender systems engineer faces.

No matter what, I hope to be there again at RecSys 2018.

Recommending a Recommender summer project

Today it’s sunny in Copenhagen, so it would be a great day to start collecting data for an ice cream vendor recommender system. What you need is a lot of ratings of different ice cream shops, and maybe some data on which types of ice cream they sell, whether they have seats, and such.

Then you can spend the days when it’s too hot to be outside reading about how to handle geodata and then how to implement the recommender systems.

You can implement a content-based recommender using the data on the vendor for suggesting similar places for people planning a trip, and use a “learning to rank” algorithm to combine ratings and position into recommendations for users on the move.
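As a rough sketch of the “users on the move” part: the snippet below ranks vendors by a score that combines average rating and distance to the user. Note that this is a hand-weighted linear score, not a learning-to-rank algorithm; the point of learning to rank would be to learn weights like w_rating and w_distance from user behaviour instead of guessing them. The vendor names, coordinates and weights are all invented.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres (haversine formula)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def rank_vendors(vendors, user_lat, user_lon, w_rating=1.0, w_distance=0.5):
    """Sort vendors by rating minus a distance penalty (weights are guesses)."""
    def score(vendor):
        distance = haversine_km(user_lat, user_lon, vendor["lat"], vendor["lon"])
        return w_rating * vendor["rating"] - w_distance * distance

    return sorted(vendors, key=score, reverse=True)


# Hypothetical vendors around central Copenhagen.
VENDORS = [
    {"name": "Far but Great", "rating": 4.8, "lat": 55.776, "lon": 12.568},
    {"name": "Close and Good", "rating": 4.0, "lat": 55.677, "lon": 12.568},
]

for vendor in rank_vendors(VENDORS, user_lat=55.676, user_lon=12.568):
    print(vendor["name"])
```

Setting w_distance to 0 turns it back into a plain rating-based ranking, which is closer to what you would want for people planning a trip from home.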

Incidentally, today is a good day to start on such a project because Manning has a deal of the day on “Geoprocessing with Python”, which means you can get it at half price today, and tomorrow you will find the same deal for my book “Practical Recommender Systems”. Go and have an ice cream and think about it.

And if you do it, please let me know how it went!

Can copy-paste behaviour predict film taste?

(A deleted scene from the Practical Recommender Systems book)

People use a computer in different ways. When I need to copy-paste something I always use <ctrl>-c and <ctrl>-v, while my wife insists (how irritating!) on always using the mouse right-click menu. I do not know if such habits can be translated into implicit ratings, unless they say something about whether you are a geek or not. Just for fun, we could pursue the thought a bit. Let’s pretend that we are looking at films. If we have The Matrix – the geek movie of all time – who would you recommend it to if you only knew how the user performs copy-paste? I would answer: the ones using the <ctrl> way. My wife is helping me a lot with this (book), so I should be careful about guessing what kind of films a person who copy-pastes with the right-click menu likes, but I would put her way in between the geek way and the way of those who copy-paste using the menu at the top of the window.

Asking around, I tried to find members of each group: the ones that use the window menu to copy-paste, the ones that right-click and finally the ones that use only the keyboard. Then I asked them to point out which of three movies they liked most. This was the result of my little survey:

  • <ctrl>-way = The Matrix.
  • Right-click-way = Life of Walter Mitty
  • Window menu-way = You’ve Got Mail


effort to do copy-paste
You can maybe think of it as shown in the figure.

To put this into practice I would need to record a series of copy-paste events from the user, at which point the system could recommend the movie that fits this user’s behaviour.
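Just for fun, the idea could be sketched like this: record which copy-paste method each event used, find the user’s dominant method, and look up the film that group preferred in the little survey. The event labels ("keyboard", "right_click", "window_menu") and the function name are made up; only the film mapping comes from the survey above.

```python
from collections import Counter

# Survey result from the post: copy-paste method -> preferred film.
SURVEY_FAVOURITE = {
    "keyboard": "The Matrix",
    "right_click": "Life of Walter Mitty",
    "window_menu": "You’ve Got Mail",
}


def recommend_from_copy_paste(events):
    """Recommend the film preferred by the group whose method the user uses most.

    events: list of hypothetical event labels, one per recorded copy-paste.
    Returns None when no known copy-paste event has been recorded yet.
    """
    counts = Counter(event for event in events if event in SURVEY_FAVOURITE)
    if not counts:
        return None
    dominant_method, _ = counts.most_common(1)[0]
    return SURVEY_FAVOURITE[dominant_method]


print(recommend_from_copy_paste(["keyboard", "right_click", "keyboard"]))
```

In a real system the counts would more likely become one weak signal among many implicit ratings, rather than deciding the recommendation on their own.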

I hope that this has never been implemented in practice, but if you think about it, when you have to choose between recommending “The Matrix” and “You’ve Got Mail”, the copy-paste behaviour could maybe help give the system a better understanding of what to recommend.

The conclusion is that even if a piece of evidence is not an obvious telltale of a user’s taste, it might contribute to making the implicit ratings more precise.

Introducing Practical Recommender Systems

Front page of Practical Recommender Systems

For a computer scientist like me, the world of IT is such an exciting place! Since I started at university, I have seen the creation of companies like Amazon and Google, and later Netflix. They were for sure lucky to be in the right place at the right time, but it is ingenuity that has kept them in the market. What they did is a long story, but what I find interesting is that they have taken large quantities of content and made it accessible to the masses.

One of the advantages of being an internet business is the fact that you are not limited by physical walls like traditional shops, and your list of products can be close to never-ending. If a physical store were truly so vast, customers would struggle to find anything and simply get lost. They would probably go to the shop next door, which has fewer products, and buy things that are not exactly what they wanted but are easily accessible.

Offering lots of content does not ensure success, not even if you have precisely what your users want. Often 20% of your content will produce 80% of your business. If you can match the remaining 80% of the content with your users, you will have more happy users and more business. The problem of activating that last 80% of the content is called the long tail problem.
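If you want to check how close your own catalogue is to the 20/80 pattern, a minimal sketch could look like this. It assumes you have one sales (or view) count per item; the numbers below are invented so that the split comes out at exactly 80%.

```python
def head_share(sales_per_item, head_fraction=0.2):
    """Share of total sales produced by the best-selling fraction of items."""
    ordered = sorted(sales_per_item, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always count at least one item as the "head".
    head_size = max(1, round(len(ordered) * head_fraction))
    return sum(ordered[:head_size]) / total


# Hypothetical catalogue of ten items: two hits and eight long-tail items.
sales = [400, 400, 25, 25, 25, 25, 25, 25, 25, 25]
print(head_share(sales))  # 0.8
```

The remaining 1 - head_share(sales) is what the long tail contributes today, and it is that part a recommender system tries to grow.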

A way to make the content more accessible to users is to add a recommender system to your site. It can attempt to predict what your customers want and serve it to them.

Implementing recommender systems is an intriguing task. The actual algorithms, like collaborative or content-based filtering, are just a small part of it. If you do not feed the algorithm with the right data, it will not produce anything worth looking at. Using user ratings will often not produce the results that users want. Looking at context is also often something worth thinking about. And when it is all implemented and running, how do you know that it is working, and how do you measure improvements?

I never found a book answering these questions; I found lots of good books explaining how to implement the algorithms mentioned above, but never a book that described everything around them as well. So I started working on one. It just came out in an early release at Manning.

Go and have a look, the first chapter is free!


Big Data, the silver bullet?

A sign that Big Data is big can be seen in the fact that the term has found its way all the way to the average Danish newspaper reader. Not long ago Politiken, the biggest newspaper in Denmark, had it as a theme and wrote six pages about it [among others this one, 1].

The newspaper article refers to hungry.dk, a Danish company, to illustrate the use of Big Data in a business.

hungry.dk provides a portal for take-away restaurants and is highlighted because it uses data from one of the new data interfaces delivered by the Danish state as part of the digitaliser.dk initiative. They retrieve smiley data for restaurants and merge it with their own restaurant data, enabling them to remove places that do not comply with the Danish food regulations. This, the article says, is one example of how companies can use Big Data, something in which we are lagging behind in Denmark, thereby losing jobs and our competitive edge by not taking advantage of it [ 3 ].
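The kind of merge described here can be sketched in a few lines. This is not hungry.dk’s implementation and not the real format of the smiley data; the field names (id, restaurant_id, compliant) and the records are assumptions made up to show the idea of joining two sources and filtering on the result.

```python
def filter_compliant(restaurants, inspections):
    """Keep only restaurants whose inspection record says they comply.

    restaurants: list of dicts with a hypothetical "id" field.
    inspections: list of dicts with hypothetical "restaurant_id" and
                 "compliant" fields (stand-ins for the smiley data).
    Restaurants without any inspection record are dropped, which is a
    cautious assumption rather than a rule taken from the article.
    """
    status_by_id = {row["restaurant_id"]: row["compliant"] for row in inspections}
    return [r for r in restaurants if status_by_id.get(r["id"], False)]


# Invented example data.
restaurants = [
    {"id": 1, "name": "Pizza Place"},
    {"id": 2, "name": "Burger Corner"},
    {"id": 3, "name": "Sushi Spot"},
]
inspections = [
    {"restaurant_id": 1, "compliant": True},
    {"restaurant_id": 2, "compliant": False},
]

print([r["name"] for r in filter_compliant(restaurants, inspections)])
```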

I think it’s great that hungry.dk merges its list of takeout places with the smiley database, and it’s a great example of how businesses should take advantage of the data provided by the state. But for me Big Data is about analysing large datasets to find patterns or to predict the future, not so much about merging data from different sources. Merging data from different sources could very well be a step in a Big Data problem, but it is not a Big Data problem in itself.

Either way, Big Data is something that is on a lot of people’s minds, and a tool that most companies should consider using. The data that the Danish state provides is there to be used, even if you could ask why the state should pay for supporting heavy usage of the databases to enable businesses to bloom.

But where to start? Most books and websites make it sound like it’s just a matter of asking the Big Data engineer for the key, and then the river of knowledge will magically start flowing out of your databases and make exactly your company special. But is it that easy?

The first point of order is to consider what data you have, or how to get it. Do you save backups? Do you ensure that customer information is treated as a critical item, and actually saved in a way that makes it retrievable? Data can also be collected from a public API, or from Twitter to find trends and moods, or you can make a smart Facebook app so people completely voluntarily tell you all their secrets about everything from their dishwasher breaking to where they want to go on holiday next year (imagine if a dishwasher seller could get info like that, or easyJet knew where people dreamt about going). By the way, are you sure that your Facebook birthday app doesn’t already collect data like that?

Once you have the data, and enough of it to be statistically relevant (see 5), you can introduce the data to a statistician and her machine to enable them to learn something from it. Actually, you should probably include her when collecting the data too.

With success you can optimise your sales, your campaigns or how you build better bridges. There are virtually no limits to what you can make a machine help you with, if only you know how to teach it. One area where many businesses claim to gain from machine learning is recommendation systems, which are said to add up to 10% more sales. [The classic examples are netflix.com and Amazon.co.uk].

It sounds easy, but like most investigations, a single hitch can end in false conclusions, which will be hard to spot and resolve.

“The way to turn data into insight is to squash the notion that big data is a silver bullet. We preach that data and analytics is important but then we empower people to be curious and ask questions and get involved in big data analytics.” [ 6 ]

Everybody is Talking About Big Data.

Everybody is talking about it, and everybody is saying that they will soon have a version ready that will utilize the heaps of data piling up in databases around us. But what is actually possible to achieve with it? Some say EVERYTHING, others are a bit more sceptical and think:

it’s being paraded around as a magic bullet, raising unrealistic expectations that will surely be disappointed. – Cathy O’Neil and Rachel Schutt in “Doing Data Science”

In my opinion, Big Data can be used for many things, but as with everything using statistics, you should remember that correlation does not imply causation: just because something happens just after something else, it does not follow that one is a reaction to the other. Manipulated correctly, data can prove almost any thesis, and its contradicting thesis. It is exciting to search for patterns or structure in the sea of data, to seek out information which no man has seen before, but always be careful and sceptical, especially when the results are too good.

I think it’s interesting that people can analyse data and find that children perform better in school when they eat breakfast every day[1], but personally I am more into predicting things. Whether it is recommending good books, predicting earthquakes or finding pregnant women from their shopping habits[2], it is incredibly cool.

I have been working with recommendation systems, studied machine learning at university, and I am now working with it. I will always try to collect new ideas and learn more, which I intend to write about here. Many can also be found in Danish here at QED.dk

My hope is that this blog can be a place for people to come and read new interesting posts on Big Data and Machine Learning, but please also add to the discussion in comments or as guest bloggers.

Thank you for reading this, hope to see you again!