News recommendation in practice: building the recommendation system pipeline

This article explains how the recommendation pipeline is built. It has two parts: offline and online.

Offline

The offline part runs batch computations over the previously stored material (article) portraits and user portraits. For each user it produces a popular page list and a recommended page list and caches both in Redis, so that the online service can fetch them directly. The rest of this section walks through how these two lists are generated and cached.

List of popular pages

When a user enters the system and clicks the hot button, the online service fetches the popular page list for that user. The offline part generates this list and caches it in Redis when the user logs in, so the online service only needs to pull it from Redis.

On the popular page, articles are ordered by popularity. Each article's popularity (hot) value is computed from its publication time and from user behavior records (the numbers of likes, favorites, and reads), and the articles are then sorted by that value. Computing it therefore needs only the static and dynamic information in the article portrait, and it is updated once a day. The logic is as follows:

After the materials are processed in the early morning each day, we have the publication time of every article (a static feature) and the numbers of likes, favorites, and reads it has accumulated so far (dynamic features). We then traverse the material pool. For each article, we subtract the publication time from the current time to get the article's age, drop articles that were published too long ago, and combine the age with the dynamic features in the heat formula to compute the article's hot value. Sorting by hot value gives the hot list, which is cached in Redis as a zset (sorted set). A zset is used because Redis keeps its members ordered by score, here the hot value, so no separate sorting step is needed. This is the public hot list, which serves as the initial state of each user's own hot list.

Users have different interests, so they click different articles on the same popular page, and before showing articles to a user we usually filter out what that user has already been shown. For this reason, when a user logs in we create a separate popular page list for that user, initialized from the public list above. After that, whenever the user opens the popular page, articles are served from the user's own list. This happens in the online service, which is covered in detail later.

To summarize, the offline hot list process traverses the material pool every day, computes each article's hot value from its static and dynamic features, and sorts by hot value to produce a public hot list template, which is the starting point for each user's separate hot list. The code is as follows:

def get_hot_rec_list(self):
        """Get the likes, favorites and creation time of the material, calculate the popularity and generate a list of popular recommendations to store in redis
        """
        # Traverse all articles in the material pool
        for item in self.feature_protrail_collection.find():
            news_id = item['news_id']
            news_cate = item['cate']
            news_ctime = item['ctime']
            news_likes_num = item['likes']
            news_collections_num = item['collections']
            news_read_num = item['read_num']
            news_hot_value = item['hot_value']

            #print(news_id, news_cate, news_ctime, news_likes_num, news_collections_num, news_read_num, news_hot_value)

            # Convert the creation time and compute the article's age. This assumes the current
            # time is later than the creation time; no exceptions are caught here.
            news_ctime_standard = datetime.strptime(news_ctime, "%Y-%m-%d %H:%M")
            cur_time_standard = datetime.now()
            time_diff = cur_time_standard - news_ctime_standard
            time_day_diff = time_diff.days
            # total_seconds() covers the whole interval; .seconds would drop the whole days
            time_hour_diff = time_diff.total_seconds() / 3600

            # Only keep content from the last 3 days
            if time_day_diff > 3:
                continue
            
            # Compute the heat score. The formula (adapted from the Rubik's Cube show heat formula) can be tuned.
            # The like, favorite, and read counts are cumulative totals, so this recomputes the
            # current heat from scratch rather than adding an increment to the previous hot_value.
            # news_hot_value = (news_likes_num * 6 + news_collections_num * 3 + news_read_num * 1) * 10 / (time_hour_diff+1)**1.2
            # 72 hours = 3 days
            news_hot_value = (news_likes_num * 0.6 + news_collections_num * 0.3 + news_read_num * 0.1) * 10 / (1 + time_hour_diff / 72) 

            #print(news_likes_num, news_collections_num, time_hour_diff)

            # Update article hot_value of material pool
            item['hot_value'] = news_hot_value
            # replace_one swaps in the updated document (Collection.update was removed in PyMongo 4)
            self.feature_protrail_collection.replace_one({'news_id': news_id}, item)

            #print("news_hot_value: ", news_hot_value)

            # Save to redis. With nx=True, zadd only adds new members; the score of an existing member is not overwritten.
            self.reclist_redis.zadd('hot_list', {'{}_{}'.format(news_cate, news_id): news_hot_value}, nx=True)
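To make the time decay in the formula concrete, here is a minimal standalone sketch of the same calculation. The function name `hot_value`, its parameters, and the sample counts are assumptions made for illustration and are not part of the original project; the weights, the 3-day cutoff, and the 72-hour scale are copied from the code above.

```python
from datetime import datetime, timedelta


def hot_value(likes, collections, reads, ctime, now,
              max_age_days=3, scale_hours=72):
    """Weighted interaction count, damped by the article's age.

    Returns None for articles older than max_age_days so that the
    caller can skip them, as the loop above does with `continue`.
    """
    age = now - ctime
    if age.days > max_age_days:
        return None
    age_hours = age.total_seconds() / 3600
    weighted = likes * 0.6 + collections * 0.3 + reads * 0.1
    return weighted * 10 / (1 + age_hours / scale_hours)


# Hypothetical sample data: the same counts at three different ages.
now = datetime(2022, 11, 11, 12, 0)
fresh = hot_value(10, 5, 100, now, now)
three_days = hot_value(10, 5, 100, now - timedelta(hours=72), now)
stale = hot_value(10, 5, 100, now - timedelta(hours=100), now)
print(fresh, three_days, stale)
```

With identical interaction counts, the article that is 72 hours old gets half the score of the brand-new one, and the article that is more than 3 days old is dropped entirely.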

Recommended page list

The recommendation page is the first page a user sees on entering the system. The online service fetches the recommended page list for the current user. Like the hot list, it is generated offline and cached in Redis when the user enters.

The recommendation page is where the recommendation system does its main work. Each user gets a different recommendation page, which is what "a thousand users, a thousand faces" refers to. To achieve this, we build features from the stored user portraits and item portraits and let a model predict the ranking, which is what makes the result personalized. For a new user there is no stored user portrait yet, so the personalized process cannot be applied, and we treat this case as a cold start. This part therefore splits into cold start and personalized recommendation. The logic is as follows:

Cold start: Cold start mainly applies to new users. We have no detailed user portrait for them, only coarse information such as age and gender (collected at registration). From this we select articles suited to the user's age and gender, and then build a cold-start recommendation list from those articles' popularity. This is only a simple approach; cold start is in practice a more complicated problem, and interested readers can consult other materials or discuss it with us. Here users are divided into four groups by age and gender:

def generate_cold_start_news_list_to_redis_for_register_user(self):
        """Make a cold start news list for registered users
        """
        for user_info in self.register_user_sess.query(RegisterUser).all():
            if int(user_info.age) < 23 and user_info.gender == "female":
                redis_key = "cold_start_group:{}".format(str(1))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) >= 23 and user_info.gender == "female":
                redis_key = "cold_start_group:{}".format(str(2))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) < 23 and user_info.gender == "male":
                redis_key = "cold_start_group:{}".format(str(3))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) >= 23 and user_info.gender == "male":
                redis_key = "cold_start_group:{}".format(str(4))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            else:
                pass 
        print("generate_cold_start_news_list_to_redis_for_register_user.")
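The four-way grouping above can also be written as one small pure function, which is easier to test in isolation. This is only a sketch: `cold_start_group_key` and its `age_threshold` parameter are hypothetical names, while the group numbers and the key format follow the code above.

```python
def cold_start_group_key(age, gender, age_threshold=23):
    """Return the Redis key of the user's cold-start group, or None.

    Groups: 1 = younger female, 2 = older female,
            3 = younger male,   4 = older male.
    """
    if gender not in ("female", "male"):
        # Unknown gender: no cold-start list, as in the `else: pass` branch above
        return None
    is_older = int(age) >= age_threshold
    group_id = {
        ("female", False): 1,
        ("female", True): 2,
        ("male", False): 3,
        ("male", True): 4,
    }[(gender, is_older)]
    return "cold_start_group:{}".format(group_id)


print(cold_start_group_key(20, "female"))
print(cold_start_group_key("35", "male"))
```

The caller would then pass the returned key to `copy_redis_sorted_set` only when it is not None.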

Personalization: Personalized recommendation is mainly for existing users. We capture their interests through the normal recommendation process to deliver personalized recommendations and improve the user experience. This part follows the familiar flow of recall → ranking → re-ranking → personalized list generation. Recall quickly narrows the massive item library down to a small set of items the user may be interested in, based on some user features; the emphasis is on speed. Fine ranking brings in more features and uses more complex models to personalize the order; the emphasis is on accuracy. Re-ranking starts from the fine-ranking results and applies business strategies such as deduplication, insertion, scattering of similar items, and diversity guarantees, which are driven mainly by product strategy or by user-experience goals. Together these stages form the funnel-shaped architecture of the personalized recommendation system. Each stage is a key part of recommendation with a great deal of detail of its own, so it is not covered further here; if there is a chance, I will write it up separately.
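The funnel can be illustrated with a toy pipeline. Everything in this sketch is an assumption made for illustration: the function names, the data layout, the category-match recall, and the scoring rule are stand-ins, not the models or strategies the project actually uses.

```python
def recall(user, items, limit=100):
    """Cheap filter: keep items whose category matches a user interest."""
    hits = [it for it in items if it["cate"] in user["interests"]]
    return hits[:limit]


def rank(user, candidates):
    """Stand-in for the ranking model: score each candidate, sort descending."""
    def score(it):
        return user["interests"][it["cate"]] * it["hot_value"]
    return sorted(candidates, key=score, reverse=True)


def rerank(ranked, max_per_cate=2, page_size=10):
    """Business rules: drop duplicates and cap how many items share a category."""
    seen, per_cate, page = set(), {}, []
    for it in ranked:
        if it["news_id"] in seen or per_cate.get(it["cate"], 0) >= max_per_cate:
            continue
        seen.add(it["news_id"])
        per_cate[it["cate"]] = per_cate.get(it["cate"], 0) + 1
        page.append(it["news_id"])
        if len(page) == page_size:
            break
    return page


# Hypothetical user portrait and material pool.
user = {"interests": {"sports": 0.9, "tech": 0.5}}
items = [
    {"news_id": "a", "cate": "sports", "hot_value": 10.0},
    {"news_id": "b", "cate": "sports", "hot_value": 8.0},
    {"news_id": "c", "cate": "sports", "hot_value": 6.0},
    {"news_id": "d", "cate": "tech", "hot_value": 9.0},
    {"news_id": "e", "cate": "finance", "hot_value": 50.0},
]
page = rerank(rank(user, recall(user, items)))
print(page)
```

Recall drops the finance article despite its high hot value because it matches no interest, ranking orders the rest by score, and re-ranking skips the third sports article to keep the page diverse.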

To summarize the generation of the recommended page list: users are first split into two groups by type. A new user goes through the cold-start process, which builds a cold-start recommendation list from the user's coarse information. An existing user goes through the personalized process, which builds a personalized list via recall → ranking → re-ranking. In both cases the list is finally stored in Redis.

This completes the offline process: for each user, it has generated a popular page list and a recommended page list.

Next, we look at the online part.

Online

The online part serves the requests triggered by user behavior while the user is using the app or system. When a user first enters the system, they land on the news recommendation page, and the system fetches and displays the articles for that page. When the user opens the popular page, the system fetches and displays the popular page list. The following describes some details of these two online fetching processes.

Get the list of recommended pages: This service is triggered when the user first enters the system, and when the user refreshes or pulls down while browsing the recommended page. When the service is triggered, it first determines whether the user is new or existing.

  • If it is a new user, read from the cold-start list stored offline and select a specified number of articles to recommend (for example, 10 at a time). Before recommending, articles that have already been exposed must be removed, since repeated exposure hurts the user experience. For this purpose we keep an exposure list for each user, and whenever a batch of articles is exposed we also update that list.
  • If it is an existing user, read from the personalized recommendation list stored offline. As above, select the specified number of articles, remove those already exposed, generate the final recommendation list, and update the user's exposure record.
  • This completes the recommendation service for the recommended page.
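The select, deduplicate, and update steps shared by both branches can be sketched without Redis. This is a minimal illustration under assumptions: `next_page` is a hypothetical helper, the candidate list stands for whichever list (cold-start or personalized) was read from the cache, and a plain Python set stands in for the user's exposure record.

```python
def next_page(candidates, exposed, page_size=10):
    """Pick the next page from a ranked candidate list, skipping exposed items.

    `candidates` is ordered best-first. `exposed` is the user's exposure
    set and is updated in place with everything that is returned.
    """
    page = []
    for news_id in candidates:
        if news_id in exposed:
            continue
        page.append(news_id)
        if len(page) == page_size:
            break
    exposed.update(page)
    return page


# Hypothetical data: "n2" was already shown to this user elsewhere.
exposed = {"n2"}
ranked = ["n1", "n2", "n3", "n4", "n5"]
first = next_page(ranked, exposed, page_size=2)
second = next_page(ranked, exposed, page_size=2)
print(first, second)
```

Because the exposure set is updated after every page, a refresh returns the next unseen articles rather than repeating the previous page.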

Get the list of popular pages: This service is triggered when the user clicks into the popular page or refreshes it. When triggered, the service again distinguishes new users from existing users.

The code is as follows:

def get_hot_list(self, user_id):
        """Top Page List Results"""
        hot_list_key_prefix = "user_id_hot_list:"
        hot_list_user_key = hot_list_key_prefix + str(user_id)

        user_exposure_prefix = "user_exposure:"
        user_exposure_key = user_exposure_prefix + str(user_id)

        # If this user has no hot list in Redis yet, copy the public hot list
        if self.reclist_redis_db.exists(hot_list_user_key) == 0: # exists() returns 1 if the key exists, 0 otherwise
            print("copy a hot_list for {}".format(hot_list_user_key))
            # Create the user's own hot list by copying "hot_list" into a key that contains the user_id
            self.reclist_redis_db.zunionstore(hot_list_user_key, ["hot_list"])

        # A page shows 10 items by default, but more candidates (200) are fetched here,
        # because some of them may already have been exposed on the recommended page
        article_num = 200

        # zrevrange returns the members ("cate_newsid" strings) ordered by score from high to low
        candiate_id_list = self.reclist_redis_db.zrevrange(hot_list_user_key, 0, article_num-1)

        if len(candiate_id_list) > 0:
            # Fetch the details of each news item by news_id; the result is a list of news info dicts in display order
            news_info_list = []
            selected_news = []   # record what was actually chosen
            cou = 0

            # Load the user's exposure set
            print("self.exposure_redis_db.exists(key)",self.exposure_redis_db.exists(user_exposure_key))
            if self.exposure_redis_db.exists(user_exposure_key) > 0:
                exposure_list = self.exposure_redis_db.smembers(user_exposure_key)
                news_expose_list = set(map(lambda x: x.split(':')[0], exposure_list))
            else:
                news_expose_list = set()

            for i in range(len(candiate_id_list)):
                candiate = candiate_id_list[i]
                news_id = candiate.split('_')[1]

                # Skip items already exposed, whether on the recommended page or the hot page
                if news_id in news_expose_list:
                    continue

                # TODO: static information cannot be loaded for some news items.
                # The cause is that json.loads() raises on some of the data stored in redis, so that data needs cleaning.
                # Such items could be filtered out during material processing.
                try:
                    news_info_dict = self.get_news_detail(news_id)
                except Exception as e:
                    with open("/home/recsys/news_rec_server/logs/news_bad_cases.log", "a+") as f:
                        f.write(news_id + "\n")
                        print("there are not news detail info for {}".format(news_id))
                    continue
                # Confirm with the front end whether the JSON keys need single or double quotes
                news_info_list.append(news_info_dict)
                news_expose_list.add(news_id)
                # Note: the original member string contains the category as well as the news id
                selected_news.append(candiate)
                cou += 1
                if cou == 10:
                    break
            
            if len(selected_news) > 0:
                # Remove the items just served from the user's cached list; otherwise they would be served again.
                # zrem returns the number of removed members, which can be used to check the removal.
                removed_num = self.reclist_redis_db.zrem(hot_list_user_key, *selected_news)
                print("the numbers of be removed:", removed_num)

            # Save the updated exposure set
            self._save_user_exposure(user_id,news_expose_list)
            return news_info_list 
        else:
            #TODO temporary workaround, not ideal: the user's list is used up, so copy the public hot list again
            self.reclist_redis_db.zunionstore(hot_list_user_key, ["hot_list"])
            print("copy a hot_list for {}".format(hot_list_user_key))
            # The list was re-copied after the user went through all of it, so also clear the user's exposure data.
            # Note: if the public "hot_list" itself is empty, the recursive call below keeps recursing.
            self.exposure_redis_db.delete(user_exposure_key)
            return  self.get_hot_list(user_id)
  • If it is a new user, first generate a popular page list for the user from the public hot list template stored offline, then select the specified number of articles from it. As above, remove exposed articles, generate the final list, and update the exposure record.
  • If it is an existing user, read from the user's own popular list stored offline, select the specified number of articles, remove exposed articles, generate the final list, and update the exposure record.
  • This completes the recommendation service for the popular page.

This completes the recommendation-related flow of the online part: the online service builds and serves the lists for the user's recommended page and popular page.

Tags: Big Data Cache Storage

Posted by unklematt on Fri, 11 Nov 2022 12:48:10 +0300