WE ARE THE BRAIN OF THE DATA
Artificial Intelligence focused on consumer behavior
Daylight has spent more than two years developing a set of self-learning algorithms. In this first stage, they are built to identify and predict consumer behavior using all the variables available in every section of the web where our solution is deployed, each of which feeds our main knowledge base (with the client's permission).
This core of Daylight grows in processing power and in learning capability with each new client web section we bring in. In this first stage, we have translated the self-learning algorithms we have been working on into a sophisticated and complex recommender system that can be integrated into our prospective clients' systems. Looking ahead, our VP, Anthony Bell, a doctor in the field, is leading us to become a global source of research and a leader in Business Analytics, Commerce, Fintech, and Core AI. Daylight's work to date has resulted in a complex recommender system that predicts web-based consumer behavior and is now ready to enhance sales for online companies.
Only a small number of people in the world are able to develop the technology we have in our hands, and we count on a leader who has led and shaped the field for over 25 years.
Most important, commercially we are focusing our proprietary technology on a mainstream need that cuts across industries today: sales, with a product that helps companies sell more.
Specifically, companies in the 100 to 300 million USD annual revenue range.
A trillion dollar a year industry
Daylight's CEO vision
Hello, my name is Ignacio Blavi and I'm the CEO of Daylight A.I., one of the world's top data-driven behavior and communications companies.
We have always wondered how to achieve a certain goal, planning and planning how to do so. THE KEY HAS ALWAYS BEEN INFORMATION. Today, expert scientists use insights to tell you everything you need to know about your target audience and how to reach your prospective clients. That is why DAYLIGHT exists: to light up the dark around you, so you can make better, goal-driven business decisions.
Daylight's software architecture
We have researched architectures built for running recommender systems for some time. Most of them are batch-oriented; some also include real-time processing to capture current trends. In our experience, relying entirely on a real-time architecture introduces problems: above all, it does not provide detailed analytics, because processing happens on small batches of information (e.g., single-row transactions).
Combining batch and real-time processing results in a lambda architecture. It strikes a compromise: data is analyzed in real time to approximate what is being searched or clicked most at that moment, and is also stored in HDFS for later batch processing over the complete dataset. This lets us know what is going on right now and still have detailed analytics later. The architecture of our recommender system is presented in the figure below.
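As a rough illustration of how the speed and batch layers of such a lambda architecture can coexist, the sketch below uses Spark Structured Streaming to land click events from a Kafka topic into HDFS while a separate batch job reads the full history. The topic name, paths, and schema are hypothetical assumptions, not our production configuration.

```python
# Minimal lambda-style sketch (PySpark). Topic name, paths and schema are
# illustrative assumptions, not the production configuration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daylight-lambda-sketch").getOrCreate()

# Speed layer: continuously append raw click events from Kafka into HDFS.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
    .option("subscribe", "click-events")               # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_event", "timestamp")
)

landing = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///datalake/raw/events")         # assumed path
    .option("checkpointLocation", "hdfs:///datalake/chk")   # assumed path
    .trigger(processingTime="1 minute")
    .start()
)

# Real-time approximation: what is being clicked most right now.
trending = (
    events.groupBy(F.window("timestamp", "1 minute")).count()
    .writeStream.outputMode("complete").format("console").start()
)

# Batch layer (run on a schedule, once data has landed): detailed analytics
# over the complete dataset stored in HDFS.
full_history = spark.read.parquet("hdfs:///datalake/raw/events")
daily_counts = full_history.groupBy(F.to_date("timestamp").alias("day")).count()
```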
Before discussing the architecture in detail, we present a scenario of how recommendations take place for a new customer and for a returning customer.
Scenario:
• A customer visits the e-commerce website for the first time. In this case, the website can only recommend what is currently trending in general.
• The customer creates an account and provides credential information. User credentials are stored on the client's data server. We now have demographic data that helps us improve recommendations for this user based on location, gender, age, profession, etc.
• As the user keeps browsing the website, performing queries, clicks, and other actions, this data keeps accumulating in the client's dataset in the form of relational databases, logs, and multimedia. At the same time, the user actions are sent to the recommender system engine (9) through an API we develop (10). These actions mostly include user credentials and the features of the product, or product category, the user searched for or clicked.
• Once per day, week, or whatever interval we set together with the client, we fetch all the raw data from the client's data server into our data lake (4). The raw data in the data lake (HDFS) is read, cleaned, preprocessed, and transformed so that it is easier to provide as input to our recommendation models. This preprocessing is done with Python scripts we have developed.
• The cleaned data is stored alongside the raw data in HDFS.
• From the preprocessed data, we create journeys. In every journey, the customer is on one side and a purchased product on the other; in the middle are the searches, clicks, add-to-carts, customer service calls, and other actions that led the user to buy the product. Journeys are stored in our Mongo cluster (a sketch of such a journey document appears after this list).
• From our Mongo cluster, we can generate dashboards of statistics, insights, and other relevant information showing how the business is currently doing.
• We choose a periodic interval for running our recommendation model over the full dataset in HDFS. This can be twice per day, every night, every week, or another schedule we agree on; it depends on the website traffic and the amount of new information the website receives. We use Apache Oozie to schedule when the jobs we define should run.
• By running the model at these intervals, we keep it updated with the latest information, so the recommendations are more accurate and more specific to each customer's properties.
• Going back a step: when the user performs actions on the website, the website sends that information to the model, which predicts which products the customer is most likely to click and returns this information to the website. The website uses it to populate the homepage, or any other section, with content relevant to the user.
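To make the journey concept concrete, here is a sketch of what one journey document might look like when written to the Mongo cluster. The field names and the pymongo usage are illustrative assumptions; the actual schema is defined by our preprocessing configuration.

```python
# Hypothetical journey document and insert into the Mongo cluster (pymongo).
# Field names and connection string are assumptions for illustration only.
from pymongo import MongoClient

journey = {
    "customer_id": 10482,                    # the customer on one side
    "purchased_product_id": 16904,           # the purchased product on the other
    "actions": [                             # what happened in between
        {"type": "search", "query": "running shoes", "ts": "2019-03-02T10:15:00"},
        {"type": "click", "product_id": 16883, "ts": "2019-03-02T10:16:12"},
        {"type": "add_to_cart", "product_id": 16904, "ts": "2019-03-02T10:20:40"},
        {"type": "customer_service_call", "ts": "2019-03-03T09:05:00"},
        {"type": "purchase", "product_id": 16904, "ts": "2019-03-03T09:30:00"},
    ],
}

client = MongoClient("mongodb://mongo:27017")   # assumed connection string
client["daylight"]["journeys"].insert_one(journey)
```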
With the scenario explained, we now discuss some of the technologies we use to build our recommender system.
The data on the client's server is updated regularly and lives in relational databases, NoSQL databases, flat files, CSV files, Excel sheets, and other formats. It needs to be stored in our local infrastructure or cloud so we can preprocess it and make it ready for our learning models. Previously we moved the data into our platform manually, and we wrote a Python script to download data that exists as files and multimedia. The remaining problem was extracting data held in relational form inside database management systems; for that, we use Apache Sqoop.
Apache Sqoop acts as an intermediate layer between Hadoop and relational databases. It greatly eases this process: we simply provide the database location, authentication details, and target HDFS path, and Sqoop handles the rest. We also consider Apache Flume as a way to land user actions in our data lake: Flume lets us listen for user actions, buffer them in a channel, and deliver them into HDFS.
We use the Hadoop Distributed File System (HDFS), on a cluster of commodity hardware nodes, to store the client's data. HDFS is a big player in this architecture: beyond the fault tolerance and scalability it offers, the number of tools and frameworks that run on top of it is very large. We use Hive to write jobs that produce analytics reports from the data. Hive also helps us aggregate the data in various contexts, so that when building the models it is easier to access only the information we want.
To perform computation, we use Apache Spark. Spark helps us especially when we need to run iterative algorithms (as most machine learning algorithms are), which is impractical with the MapReduce paradigm. On Spark, queries, model building, and analytics run faster than in MapReduce. Furthermore, Spark provides several prebuilt machine learning algorithms in its MLlib library, which we use to extract insight from data. We also considered Apache Mahout, but it was slower than MLlib.
Cleaning, preprocessing, data transformation, and journey building are all developed in Python. We have written scripts that read data from HDFS, apply cleaning and preprocessing functions, and transform the result into a JSON format that we dump into Mongo. The process is largely automatic: we only need to set up configuration files stating which processing functions to apply to which dataset.
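The sketch below illustrates, under assumed names, how such a configuration-driven preprocessing step might be wired: a config maps datasets to cleaning functions, and the cleaned records are dumped into Mongo as JSON-like documents. It is a simplification of the idea, not our actual scripts.

```python
# Configuration-driven preprocessing sketch (names, paths and functions are assumptions).
from pymongo import MongoClient

# Which processing functions apply to which dataset (normally read from a config file).
CONFIG = {
    "orders": ["drop_empty_rows", "normalize_ids"],
    "clicks": ["drop_empty_rows"],
}

def drop_empty_rows(records):
    # Keep records that have at least one non-empty value.
    return [r for r in records if any(v not in (None, "") for v in r.values())]

def normalize_ids(records):
    for r in records:
        r["product_id"] = int(r["product_id"])
    return records

FUNCTIONS = {"drop_empty_rows": drop_empty_rows, "normalize_ids": normalize_ids}

def preprocess(dataset_name, records):
    for step in CONFIG.get(dataset_name, []):
        records = FUNCTIONS[step](records)
    return records

if __name__ == "__main__":
    # In production the raw records would be read from HDFS; here we inline a toy sample.
    raw = [
        {"product_id": "16904", "customer_id": 10482},
        {"product_id": "", "customer_id": None},
    ]
    cleaned = preprocess("orders", raw)
    MongoClient("mongodb://mongo:27017")["daylight"]["orders"].insert_many(cleaned)
```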
Algorithms
Different algorithms serve different purposes when recommending items to users. In the sections below, we explore the details of the ones we have implemented and use in our product. These algorithms represent the state of the art in modern recommender systems.
Cluster Based Recommender Systems
Clustering is the process of partitioning a set of data objects or items into subgroups. Each subgroup is a cluster, such that items in a cluster are similar to one another, yet dissimilar to objects in other clusters. Different clustering methods may generate different clusters on the same data set. The partitioning is not performed by humans, but by the clustering algorithm. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data [1].
We have implemented a cluster-based recommender system that takes a dataset of items and clusters them based on specified features. To make cluster-based recommendations, we first need to generate the clusters: once the dataset is stored, we feed it to our API. The recommender engine is configurable; the user can increase or decrease the number of clusters to generate and the number of iterations required by some of the clustering methods.
RSCluster rscluster = new RSCluster();
rscluster.train(dataset, 545, 34, 0); // first: dataset,
                                      // second: number of clusters,
                                      // third: iterations,
                                      // fourth: clustering method
Table 1. Java code for training the cluster-based recommender system
After the clusters have been generated, the trained model will be saved and ready to be used. The user will need to send the item’s id and the number of items she wants to be returned.
RSCluster rscluster = new RSCluster();
rscluster.load(trainedModel);
rscluster.recommend(353, 10); // first: item ID,
                              // second: number of recommended items to return
Table 2. Java code for getting cluster-based recommendations
We mentioned that there are different clustering methods. For our cluster-based recommender system, we have implemented three clustering algorithms (a Spark MLlib sketch follows the list):
• K-Means: a centroid-based algorithm that uses a centroid to represent each cluster. The centroid can be defined in different ways, for example as the mean or the medoid of the points assigned to the cluster. Given a dataset D containing n objects, we define the number of clusters k. First, we arbitrarily choose k objects as the cluster centroids. Each object is assigned to the cluster whose centroid is nearest. After this initial assignment, we update the cluster centroids by computing the mean of each cluster, and with the new centroids we reassign each object to the cluster whose centroid is now nearest. This is an iterative process; k-means is not guaranteed to converge to the global optimum and often terminates at a local optimum. [1]
The time complexity of the k-means algorithm is O(ntk), where n is the total number of objects or items in the dataset, t is the number of iterations, and k is the number of clusters. The algorithm is efficient on large datasets and relatively scalable.
• Fuzzy K-means: It is known that the items inside a cluster generated by clustering methods are similar to each other, but dissimilar to items in the other clusters. Fuzzy k-means is a data clustering technique wherein each data point belongs to a cluster to some degree that is specified by a membership grade. [2]
The fuzzy k-means algorithm attempts to partition a finite collection of elements X = {x_1, x_2, ..., x_n} into a collection of c fuzzy clusters with respect to some given criterion. A fuzzy set allows degrees of membership: a single point can have partial membership in more than one class, and there can be no empty classes and no class containing no data points. Given a finite set of data, the algorithm returns a list of c cluster centers V = {v_i}, i = 1, ..., c, and a partition matrix U = [u_ij], i = 1, ..., c, j = 1, ..., n, where u_ij is a numerical value in [0, 1] that tells the degree to which element x_j belongs to the i-th cluster.
Using this algorithm, the user sends the item's ID and the number of recommended items she wants. The returned items are a combination based on the percentages with which the input item belongs to the clusters: if the input item belongs 60% to cluster A, 30% to cluster B, and 10% to cluster C, the recommended items are drawn from those clusters in roughly those proportions.
• DBSCAN: To find clusters of arbitrary shape, we can alternatively model clusters as dense regions in the data space separated by sparse regions. This is the main strategy behind density-based clustering methods, which can discover clusters of non-spherical shape. The density of an object o can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters. [3]
A user-specified parameter ε > 0 specifies the radius of the neighborhood we consider for every object. The ε-neighborhood of an object o is the space within a radius ε centered at o. Because the neighborhood size is fixed by ε, the density of a neighborhood can be measured simply by the number of objects in it. To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions. An object is a core object if its ε-neighborhood contains at least MinPts objects. Core objects are the pillars of dense regions.
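As mentioned earlier, Spark MLlib provides prebuilt implementations of such methods. The following is a minimal sketch of clustering items by feature vectors with MLlib's KMeans; the toy data, column names, and parameter values are illustrative assumptions, and this is only an illustration of the underlying clustering step, not the RSCluster implementation itself.

```python
# Minimal item-clustering sketch with Spark MLlib KMeans.
# Column names, k and maxIter values are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("item-clustering-sketch").getOrCreate()

# Item features (id, price, popularity, category_code) -- toy data for illustration.
items = spark.createDataFrame(
    [(353, 19.9, 0.8, 3.0), (354, 21.5, 0.7, 3.0), (900, 499.0, 0.1, 7.0)],
    ["item_id", "price", "popularity", "category_code"],
)

assembler = VectorAssembler(
    inputCols=["price", "popularity", "category_code"], outputCol="features"
)
featurized = assembler.transform(items)

kmeans = KMeans(k=2, maxIter=20, seed=1)      # number of clusters and iterations
model = kmeans.fit(featurized)
clustered = model.transform(featurized)        # adds a 'prediction' cluster column

# Recommending "similar" items then reduces to returning items from the same cluster.
clustered.select("item_id", "prediction").show()
```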
Frequent Pattern based Recommender System
Frequent patterns reveal valuable information in a dataset, and we have leveraged this by developing a frequent pattern based recommender system. The following is a plain-language description of how it works:
1. The recommender system requires the dataset in a specific format to generate the frequent patterns. Data must be transactional, where each entry represents the items a user clicked, bought, etc., for example:
6318 11495 7906 0021 5443 0023
11853 14274 10387 13422 13199 13119 14204 11706 14663 11818 11495 7906 14727 14729 14720 10387
13755 12642 3650 2720 12747 13872 13691 11052 13237 13499 13538 13440 14708 14706 11293 13025 13204 13085 14882
14694 14678 2720 12642 2454 254 0014 1456 9956 14649 14707 14729 13755 14678 2720 2545 23 2145
2. We generate the frequent patterns using the FP-Growth algorithm, which adopts a divide-and-conquer strategy. It works as follows: first, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or "pattern fragment," and mines each database separately. For each "pattern fragment," only its associated data sets need to be examined. Therefore, this approach may substantially reduce the size of the data sets to be searched, along with the "growth" of patterns being examined. [4]
FP-Growth generates patterns in this format (itemset, support count):
[[16883, 16846, 16885, 16903, 16893, 16904]], 242
[[16883, 16846, 16885, 16903, 16893, 16904, 16888]], 174
[[16883, 16846, 16885, 16903, 16893, 16888]], 26
[[16883, 16846, 16885, 16903, 16876]], 301
[[16883, 16846, 16885, 16903, 16876, 16893]], 26
[[16883, 16846, 16885, 16903, 16876, 16893, 16904]], 320
[[16883, 16846, 16885, 16903, 16876, 16893, 16904, 16888]], 115
3. For the recommendation, we implemented the association rules technique. Association rule mining consists of first finding frequent itemsets (sets of items, for example item A and item B, satisfying a minimum support threshold, i.e., a percentage of the task-relevant tuples), from which strong association rules of the form A ⇒ B are generated. These rules also satisfy a minimum confidence threshold (a prespecified probability of B holding given that A holds).
The output of the association rule mining algorithm has this format (antecedent; consequent; confidence):
16908, 16905, 16846, 16810, 16904; 16881; 0.7272727272727273
16908, 16905, 16846, 16810, 16904; 16838; 0.9090909090909091
16908, 16905, 16846, 16810, 16904; 16896; 0.7727272727272727
16908, 16905, 16846, 16810, 16904; 16848; 0.7727272727272727
16908, 16905, 16846, 16810, 16904; 16910; 0.8181818181818182
16908, 16905, 16846, 16810, 16904; 16842; 0.7272727272727273
16908, 16905, 16846, 16810, 16904; 16885; 0.9545454545454546
4. The last step is the recommendation itself. A list of items is sent to the recommender system, which returns the consequent items of the matching association rule with the highest confidence. A minimal Spark MLlib sketch of this pipeline follows.
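The sketch below uses Spark MLlib's FPGrowth to mine frequent patterns and association rules and then produce recommendations; the support and confidence thresholds and the toy transactions are illustrative assumptions, not our production settings.

```python
# Frequent-pattern mining and rule-based recommendation sketch (Spark MLlib FPGrowth).
# Thresholds and transactions are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fp-recommender-sketch").getOrCreate()

# Transactional data: each row is the list of item IDs from one user session.
transactions = spark.createDataFrame(
    [
        (0, [16883, 16846, 16885, 16903]),
        (1, [16883, 16846, 16885, 16904]),
        (2, [16883, 16846, 16888]),
    ],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)

model.freqItemsets.show()        # frequent patterns with their support counts
model.associationRules.show()    # rules: antecedent, consequent, confidence

# Recommendation step: given the items a user has interacted with,
# transform() appends the consequents of the matching rules as predictions.
current_session = spark.createDataFrame([(0, [16883, 16846])], ["id", "items"])
model.transform(current_session).show(truncate=False)
```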
Deep Learning
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in recommendation performance. With the ever-growing volume, complexity, and dynamicity of online information, recommender systems have become a key solution to this information overload. In recent years, deep learning's revolutionary advances in speech recognition, image analysis, and natural language processing have gained significant attention, and recent studies also demonstrate its effectiveness in information retrieval and recommendation tasks. Applying deep learning techniques to recommender systems has been gaining momentum due to its state-of-the-art performance and high-quality recommendations. In contrast to traditional recommendation models, deep learning provides a better understanding of users' demands, items' characteristics, and the historical interactions between them.
Deep learning is a subfield of machine learning. It learns multiple levels of representations and abstractions from data and can solve both supervised and unsupervised learning tasks. In this subsection, we clarify the deep learning concepts used in recommender systems:
• Multilayer Perceptron (MLP) is a feedforward neural network with one or more hidden layers between the input layer and the output layer. Here, the perceptron can employ an arbitrary activation function and does not necessarily represent a strictly binary classifier.
• Autoencoder (AE) is an unsupervised model attempting to reconstruct its input data in the output layer. In general, the bottleneck layer (the middle-most layer) is used as a salient feature representation of the input data. There are many variants of autoencoders such as denoising autoencoder, marginalized denoising autoencoder, sparse autoencoder, contractive autoencoder and variational autoencoder (VAE)
• Convolutional Neural Network (CNN) is a special kind of feedforward neural network with convolution layers and pooling operations. It can capture global and local features, significantly enhancing efficiency and accuracy, and it performs well on data with a grid-like topology.
• Recurrent Neural Network (RNN) is suitable for modelling sequential data. Unlike feedforward neural network, there are loops and memories in RNN to remember former computations. Variants such as Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) network are often deployed in practice to overcome the vanishing gradient problem.
• Deep Semantic Similarity Model (DSSM), or more specifically, Deep Structured Semantic Model [45], is a deep neural network for learning semantic representations of entities in a common continuous semantic space and measuring their semantic similarities.
• Restricted Boltzmann Machine (RBM) is a two-layer neural network consisting of a visible layer and a hidden layer. It can be easily stacked to a deep net. Restricted here means that there are no intra-layer communications in visible layer or hidden layer.
• Neural Autoregressive Distribution Estimation (NADE) is an unsupervised neural network built atop an autoregressive model and a feedforward neural network. It is a tractable and efficient estimator for modelling data distributions and densities.
• Generative Adversarial Network (GAN) is a generative neural network consisting of a discriminator and a generator. The two networks are trained simultaneously by competing in a minimax game framework.
Deep learning based recommendation models fit into two broad categories: models using a single deep learning technique, and deep composite models (recommender systems that involve two or more deep learning techniques).
Model using a Single Deep Learning Technique. In this category, models are divided into eight subcategories corresponding to the eight deep learning techniques: MLP-, AE-, CNN-, RNN-, DSSM-, RBM-, NADE-, and GAN-based recommender systems. The deep learning technique in use determines the strengths and application scenarios of these recommendation models. For instance, MLP can easily model nonlinear interactions between users and items; CNN is capable of extracting local and global representations from heterogeneous data sources such as textual and visual information; RNN enables the recommender system to model the temporal dynamics of rating data and the sequential influence of content information; DSSM is able to perform semantic matching between users and items.
Deep Composite Model. Some deep learning based recommendation models use more than one deep learning technique. The motivation is that different techniques can complement one another and enable a more powerful hybrid model. There are many possible combinations of these eight techniques, though not all have been explored; some possible combinations are: CNN and Autoencoder, CNN and RNN, CNN and MLP, RNN and Autoencoder, RNN and MLP, CNN and DSSM, RNN and DSSM.
Over the last year and a half we have experimented with all of these algorithms; here is a quick roundup of our findings:
• MLP - The Multilayer Perceptron is a concise but effective model and is widely used in many areas, especially in industry, for example video recommendations. Multilayer feedforward networks have been shown to approximate any measurable function to any desired degree of accuracy. It is the basis of many advanced models, but it is not very practical for e-commerce recommendations on its own, since it only aims to capture the non-linear relationship between users and items.
• AE - There are two general ways of applying an autoencoder to a recommender system: (1) using the autoencoder to learn lower-dimensional feature representations at the bottleneck layer; or (2) filling the blanks of the rating matrix directly in the reconstruction layer. Autoencoders can be integrated with traditional recommender systems such as collaborative filtering. A tightly coupled model learns the parameters of the autoencoder and the recommender model simultaneously, which lets the recommender model guide the autoencoder towards more semantic features. A loosely coupled model works in two steps: learning salient feature representations via autoencoders, and then feeding these representations to the recommender system. Both forms have their own strengths and shortcomings: the tightly coupled model requires careful design and optimization to avoid local optima, but recommendation and feature learning are performed at once; the loosely coupled method can easily be extended to existing advanced models, but it requires more training steps.
• CNN - The Convolutional Neural Network is powerful at processing visual, textual, and audio information, but it is not ideal for e-commerce-style recommendations. It can be used in combination with other deep learning models when images or audio are part of the features, but it is quite complex to integrate. Most CNN-based recommender systems use the CNN for feature extraction.
• RNN - The Recurrent Neural Network is specifically suited to coping with the temporal dynamics of ratings and sequential features in a recommender system. This is the algorithm we use, so we discuss it in detail in a later section.
• DSSM - The Deep Semantic Similarity Model is a deep neural network widely used in information retrieval. The issue with this model is that it rests on the hypothesis that users who have similar tastes in one domain should have similar tastes in other domains; intuitively, this assumption can be unreasonable in many cases.
• RBM - The Restricted Boltzmann Machine was the first recommendation model built atop deep learning. It is normally used in combination with collaborative filtering, which is a major restriction: many e-commerce platforms do not have such rating data, making it unsuitable for our platform.
• NADE and GAN - NADE- and GAN-based recommender systems are emerging and not yet production ready. NADE presents a tractable method for approximating the real distribution of the source data and produces state-of-the-art recommendation accuracy in terms of rating prediction (compared with other deep learning based recommendation models) on several experimental datasets; GAN is capable of fusing a discriminative model with a generative model and possesses the advantages of both schools of thinking.
Our approach
In many online systems where recommendations are applied, interactions between a user and the system are organized into sessions. A session is a group of interactions that take place within a given time frame. Sessions from a user can occur on the same day, or over several days, weeks, or months. A session usually has a goal, such as finding an item to purchase, or listening to music of a certain style or mood. A simple way of incorporating past user session information into a session-based algorithm would be to simply concatenate past and current user sessions. While this seems like a reasonable approach, our experiments show that it is not the best solution. In our solution, we use a novel algorithm based on RNNs that can deal with both cases: (i) session-aware recommendation, when user identifiers are present and we can propagate information from the previous user session to the next, thus improving recommendation accuracy, and (ii) session-based recommendation, when there are no past sessions (i.e., no user identifiers). The algorithm is based on a Hierarchical RNN in which the hidden state of a lower-level RNN at the end of one user session is passed as input to a higher-level RNN, which aims to predict a good initialization (i.e., a good context vector) for the hidden state of the lower RNN in the user's next session.
Session-based Recurrent Neural Network
Our model is based on the session-based Recurrent Neural Network (RNN henceforth) model [5]. The RNN is based on a single Gated Recurrent Unit (GRU) layer that models the interactions of the user within a session. The RNN takes as input the current item ID in the session and outputs a score for each item representing the likelihood of being the next item in the session. Formally, for each session $S_m = \{i_{m,1}, i_{m,2}, \dots, i_{m,N_m}\}$, the RNN computes the following session-level representation:

$$s_{m,n} = GRU_{ses}(i_{m,n}, s_{m,n-1}), \quad n = 1, \dots, N_m - 1$$

where $GRU_{ses}$ is the session-level GRU and $s_{m,n}$ its hidden state at step $n$, with $s_{m,0} = 0$ (the null vector), and $i_{m,n}$ is the one-hot vector of the current item ID. The output of the RNN is a score $\hat{r}_{m,n}$ for every item in the catalog, indicating the likelihood of being the next item in the session (or, equivalently, its relevance for the next step in the session):

$$\hat{r}_{m,n} = g(s_{m,n}), \quad n = 1, \dots, N_m - 1$$
where $g(\cdot)$ is a non-linear function such as softmax or $\tanh$, depending on the loss function. During training, the scores are compared to a one-hot vector of the next item ID in the session to compute the loss. The network can be trained with several ranking loss functions such as cross-entropy, BPR, and TOP1. In our implementation, the TOP1 loss always outperformed the other ranking losses, so we consider only this loss. The TOP1 loss is the regularized approximation of the relative rank of the relevant item. The relative rank of the relevant item is given by $\frac{1}{N_S} \sum_{j=1}^{N_S} I\{\hat{r}_{s,j} > \hat{r}_{s,i}\}$, where $\hat{r}_{s,j}$ is the score of a sampled 'irrelevant item' and $\hat{r}_{s,i}$ the score of the relevant one. $I\{\cdot\}$ is approximated with a sigmoid. To force the scores of negative examples ('irrelevant items') towards zero, a regularization term is added to the loss. The final loss function is:

$$L_s = \frac{1}{N_S} \sum_{j=1}^{N_S} \left[ \sigma(\hat{r}_{s,j} - \hat{r}_{s,i}) + \sigma(\hat{r}_{s,j}^2) \right]$$

The RNN is trained efficiently with session-parallel mini-batches. At each training step, the input to $GRU_{ses}$ is the stacked one-hot representation of the current item ID of a batch of sessions. The session-parallel mechanism keeps a pointer to the current item of every session in the mini-batch and resets the hidden state of the RNN when sessions end. To further reduce the computational complexity, the loss is computed over the current item IDs and a sample of negative items. Specifically, the current item ID of each session is used as a positive item and the IDs of the remaining sessions in the mini-batch as negative items when computing the loss. This makes explicit negative item sampling unnecessary and enables popularity-based sampling. However, since user identifiers are unknown in pure session-based scenarios, there is a good chance that negative samples will be 'contaminated' by positive items the user interacts with in other sessions.
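To make the session-level GRU and the TOP1 loss concrete, here is a minimal sketch in PyTorch (the framework, dimensions, and toy mini-batch are our illustration choices, not a description of the production code). It scores all catalog items from the GRU hidden state and computes the TOP1 loss using the targets of the other sessions in the mini-batch as negative samples, as described above.

```python
# Minimal session-based GRU scorer with TOP1 loss (PyTorch sketch).
# Sizes and the toy mini-batch are illustrative assumptions.
import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    def __init__(self, n_items, embed_dim=64, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_items, embed_dim)
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_items)

    def forward(self, item_ids, hidden):
        # item_ids: (batch,) current item ID of each session in the mini-batch
        h = self.gru(self.embed(item_ids), hidden)
        scores = self.out(h)            # one score per catalog item
        return scores, h

def top1_loss(scores, target_ids):
    # In-batch negatives: the targets of the other sessions act as negative samples.
    r_pos = scores.gather(1, target_ids.unsqueeze(1))   # (batch, 1) relevant-item scores
    r_neg = scores[:, target_ids]                       # (batch, batch) scores of in-batch items
    pairwise = torch.sigmoid(r_neg - r_pos) + torch.sigmoid(r_neg ** 2)
    mask = ~torch.eye(scores.size(0), dtype=torch.bool) # exclude each session's own positive
    return pairwise[mask].mean()

n_items, batch, hidden_dim = 1000, 4, 100
model = SessionGRU(n_items, hidden_dim=hidden_dim)
hidden = torch.zeros(batch, hidden_dim)            # s_{m,0} = 0 (null vector)
current = torch.randint(0, n_items, (batch,))      # current item of each session
target = torch.randint(0, n_items, (batch,))       # next item of each session

scores, hidden = model(current, hidden)
loss = top1_loss(scores, target)
loss.backward()
```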
Personalized Session-based Hierarchical Recurrent Neural Network
Our HRNN model builds on top of the RNN by: (i) adding an additional GRU layer to model information across user sessions and to track the evolution of the user's interests over time; and (ii) using a powerful user-parallel mini-batch mechanism for efficient training [6].
Architecture. Besides the session-level GRU, our HRNN model adds one user-level GRU ($GRU_{usr}$) to model the user's activity across sessions.
The figure below shows a graphical representation of HRNN.
Model using Single Deep Learning Technique. In this category, models are divided into eight subcategories in conformity with the eight deep learning models: MLP, AE, CNN, RNN, DSSM, RBM, NADE and GAN based recommender system. The deep learning technique in use determines the strengths and application scenarios of these recommendation models. For instance, MLP can easily model the nonlinear interactions between users and items; CNN is capable of extracting local and global representations from heterogeneous data sources such as textual and visual information; RNN enables the recommender system to model the temporal dynamics of rating data and sequential influences of content information; DSSM is able to perform semantic matching between users and items.
Deep Composite Model. Some deep learning based recommendation models utilize more than one deep learning technique. The motivation is that different deep learning techniques can complement one another and enable a more powerful hybrid model. There are many possible combinations of these eight deep learning techniques but not all have been exploited. Some possible combinations are:
On the last year and a half, we have experimented and worked with all of these algorithms and here is a quick roundup of our findings: CNN and Autoencoder, CNN and RNN, CNN and MLP, RNN and Autoencoder, RNN and MLP, CNN and DSSM, RNN and DSSM.
• MLP - Multilayer Perceptron is a concise but effective model. As such, it is widely used in many areas, especially in industry areas such as video recommendations. Multilayer feedforward
networks are demonstrated to be able to approximate any measurable function to any desired degree of accuracy. It is the basis of many advanced models but not very practical for e-commerce recommendations as it only aims to capture the non-linear relationship between users and items.
• AE - there exist two general ways of applying autoencoder to recommender system: (1) using autoencoder to learn lower-dimensional
feature representations at the bottleneck layer; or (2) fling the blanks of rating matrix directly in the reconstruction layer. Autoencoders can be integrated with traditional recommender systems like collaborative filtering. Tightly coupled model learns the parameters of autoencoder and recommender model simultaneously, which enables recommender model to provide guidance for autoencoder to learn more semantic features. Loosely coupled model is performed in two steps: learning salient feature representations via autoencoders, and then feeding these feature representations to recommender system. Both forms have their own strengths and shortcomings. For example, tightly coupled model requires carefully design and optimization to avoid the local optimum, but recommendation and feature learning can be performed at once; loosely coupled method can be easily extended to existing advanced models, but they require more training steps.
• CNN - Convolution Neural Network is powerful in processing visual, textual and audio information but not ideal for the e-commerce type of recommendations. It can be used in combination with other deep learning models when images or audio is part of the features but it’s quite complex to integrate. Most of the CNN based recommender systems utilize CNN for feature extraction.
• RRN - Recurrent neural network is specifically suitable for coping with the temporal dynamics of ratings and sequential features in recommender system. This is the algorithm we are using so we will talk about it on detail in a further section.
• DSSM - Deep Semantic Similarity Model is a deep neural network widely used in information retrieval area. The issue with this model is that it is based on the hypothesis that users have similar tastes in one domain should have similar tastes in other domains. Intuitively, this assumption might be unreasonable in many cases.
• RBM - Restricted Boltzmann Machine is the first recommendation model that built atop deep learning. It is normally used combined with Collaborative Filtering and this is a huge restriction as a lot of e- commerce platforms don’t have such features making them unavailable for our platform.
• NADE and GAN: NADE and GAN based recommender systems. These are some emerging algorithms and not yet production ready. NADE presents a tractable method for approximating the real distribution of source data and produces state-of-the-art recommendation accuracy in terms of rating prediction (compared with other deep learning based recommendation models) on several experimental datasets; GAN is capable of fusing discriminative model with generative model together and posses the advantages of these two schools of thinking.
Our approach
In many online systems where recommendations are applied, interactions between a user and the system are organized into sessions. A session is a group of interactions that take place within a given time frame. Sessions from a user can occur on the same day, or over several days, weeks, or months. A session usually has a goal, such as finding an item to purchase, or listening to music of a certain style or mood. A simple way of incorporating past user session information in session-based algorithm would be to simply concatenate past and current user sessions. While this seems like a reasonable approach, our experiments show that this is not the best solution. In our solution, we use a novel algorithm based on RNNs that can deal with both cases: (i) session-aware recommenders, when user identifiers are present and propagate information from the previous user session to the next, thus improving the recommendation accuracy, and (ii) session-based recommenders, when there are no past sessions (i.e., no user identifiers). The algorithm is based on a Hierarchical RNN where the hidden state of a lower- level RNN at the end of one user session is passed as an input to a higher-level RNN which aims at predicting a good initialization (i.e., a good context vector) for the hidden state of the lower RNN for the next session of the user.
Session-based Recurrent Neural Network
Our model is based on the session-based Recurrent Neural Network (RNN henceforth) model [5]. RNN is based on a single Gated Recurrent Unit (GRU) layer that models the interactions of the user within a session. The RNN takes as input the current item ID in the session and outputs a score for each item representing the likelihood of being the next item in the session. Formally, for each session 𝑆𝑚 = { 𝑖𝑚,1, 𝑖𝑚,2, ... , 𝑖𝑚,𝑁}, RNN computes the following session- level representation.
𝑠=𝐺𝑅𝑈(𝑖,𝑠),𝑛=1,...,𝑁−1𝑚,𝑛𝑠𝑒𝑠𝑚,𝑛𝑚,𝑛−1𝑚
Where𝐺𝑅𝑈𝑠𝑒𝑠 isthesession-levelGRUand𝑠𝑚,𝑛itshiddenstateatstep𝑛,being𝑠𝑚, 0=0(thenullvector), and 𝑖 is the one-hot vector of the current item ID. The output of the RNN is a score 𝑟̂ for every item
𝑚,𝑛 𝑚,𝑛
in the catalog indicating the likelihood of being the next item in the session (or, equivalently, its relevance for the next step in the session)
𝑟̂ =𝑔(𝑠 ),𝑛=1,...,𝑁 −1 𝑚,𝑛 𝑚,𝑛 𝑚
where $g(\cdot)$ is a non-linear function such as softmax or $\tanh$, depending on the loss function. During training, the scores are compared to a one-hot vector of the next item ID in the session to compute the loss. The network can be trained with several ranking loss functions such as cross-entropy, BPR and TOP1. In our implementation, the TOP1 loss consistently outperformed the other ranking losses, so we consider only this loss. The TOP1 loss is a regularized approximation of the relative rank of the relevant item. The relative rank of the relevant item is given by $\frac{1}{N_S} \sum_{j=1}^{N_S} I\{\hat{r}_{s,j} > \hat{r}_{s,i}\}$, where $\hat{r}_{s,j}$ is the score of a sampled 'irrelevant item' and $I\{\cdot\}$ is approximated with a sigmoid. To force the scores of negative examples ('irrelevant items') towards zero, a regularization term is added to the loss. The final loss function is as follows:

$$ L_s = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma(\hat{r}_{s,j} - \hat{r}_{s,i}) + \sigma(\hat{r}_{s,j}^2) $$

The RNN is trained efficiently with session-parallel mini-batches. At each training step, the input to $\mathrm{GRU}_{ses}$ is the stacked one-hot representation of the current item ID of a batch of sessions. The session-parallel mechanism keeps pointers to the current item of every session in the mini-batch and resets the hidden state of the RNN when a session ends. To further reduce the computational complexity, the loss is computed over the current item IDs and a sample of negative items. Specifically, the current item ID of each session is used as the positive item, and the IDs of the items in the remaining sessions of the mini-batch are used as negative items when computing the loss. This makes explicit negative item sampling unnecessary and enables popularity-based sampling. However, since user identifiers are unknown in pure session-based scenarios, there is a good chance that negative samples will be 'contaminated' by positive items the user interacts with in other sessions.
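To make the formulation above concrete, the following is a minimal PyTorch-style sketch of the session-level GRU scoring step and the TOP1 loss with in-batch negative sampling. The embedding lookup (in place of an explicit one-hot multiplication), the $\tanh$ output activation, the layer sizes and all names are illustrative assumptions rather than our production implementation.

```python
import torch
import torch.nn as nn


class SessionGRU(nn.Module):
    """Session-level GRU: scores every catalog item as the likely next item."""

    def __init__(self, n_items, hidden_size=100):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot item vector by a weight matrix.
        self.item_emb = nn.Embedding(n_items, hidden_size)
        self.gru = nn.GRUCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, n_items)

    def forward(self, item_ids, hidden):
        # item_ids: (batch,) current item IDs; hidden: (batch, hidden_size) previous state s_{m,n-1}
        x = self.item_emb(item_ids)
        hidden = self.gru(x, hidden)            # s_{m,n} = GRU_ses(i_{m,n}, s_{m,n-1})
        scores = torch.tanh(self.out(hidden))   # r_hat_{m,n} = g(s_{m,n}), here g = tanh
        return scores, hidden


def top1_loss(scores, target_ids):
    """TOP1 loss with in-batch negatives: the positive item of every other
    session in the mini-batch serves as a sampled 'irrelevant' item."""
    # scores: (batch, n_items); target_ids: (batch,) IDs of the true next items
    sampled = scores[:, target_ids]          # (batch, batch): column j = score of session j's positive item
    positive = sampled.diag().unsqueeze(1)   # r_hat_{s,i}: each session's own positive score
    # sigmoid(r_j - r_i) approximates I{r_j > r_i}; sigmoid(r_j^2) pushes negative scores towards zero
    return (torch.sigmoid(sampled - positive) + torch.sigmoid(sampled ** 2)).mean()
```

For simplicity the sketch keeps the diagonal term (each positive compared with itself) inside the loss, as common reference implementations of this loss do.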
Personalized Session-based Hierarchical Recurrent Neural Network
Our HRNN model builds on top of the session-based RNN by: (i) adding an additional GRU layer to model information across user sessions and to track the evolution of user interests over time; (ii) using a powerful user-parallel mini-batch mechanism for efficient training [6].
Architecture. Besides the session-level GRU, our HRNN model adds one user-level GRU ($\mathrm{GRU}_{usr}$) to model the user activity across sessions.
The figure below shows a graphical representation of HRNN.
At each time step, recommendations are generated by $\mathrm{GRU}_{ses}$, as in the session-based RNN. However, when a session ends, the user representation is updated. When a new session starts, the hidden state of $\mathrm{GRU}_{usr}$ is used to initialize $\mathrm{GRU}_{ses}$ and, optionally, propagated in input to $\mathrm{GRU}_{ses}$.
Formally, for each user $u$ with sessions $C^u = \{S^u_1, S^u_2, \dots, S^u_{M_u}\}$, the user-level GRU takes as input the session-level representations $s^u_1, s^u_2, \dots, s^u_{M_u}$, where $s^u_m = s^u_{m,N_m-1}$ is the last hidden state of $\mathrm{GRU}_{ses}$ for each user session $S^u_m$, and uses them to update the user-level representation $c^u_m$. Henceforth we drop the user superscript $u$ to unclutter the notation. The user-level representation $c_m$ is updated as

$$ c_m = \mathrm{GRU}_{usr}(s_m, c_{m-1}), \qquad m = 1, \dots, M_u $$

where $c_0 = 0$ (the null vector). The input to the user-level GRU is connected to the last hidden state of the session-level GRU. In this way, the user-level GRU can track the evolution of the user across sessions and, in turn, model the dynamics of the user's interests seamlessly. Notice that the user-level representation is kept fixed throughout a session and is updated only when the session ends.

The user-level representation is then used to initialize the hidden state of the session-level GRU. Given $c_m$, the initial hidden state $s_{m+1,0}$ of the session-level GRU for the following session is set to

$$ s_{m+1,0} = \tanh(W_{init} c_m + b_{init}) $$

where $W_{init}$ and $b_{init}$ are the initialization weights and biases, respectively. In this way, the information relative to the preferences expressed by the user in previous sessions is transferred to the session level. Session-level representations are then updated as follows:

$$ s_{m+1,n} = \mathrm{GRU}_{ses}(i_{m+1,n}, s_{m+1,n-1}, [\, c_m \,]), \qquad n = 1, \dots, N_{m+1} - 1 $$
where the square brackets indicate that $c_m$ can optionally be propagated in input to the session-level GRU. The model is trained end-to-end using back-propagation. The weights of $\mathrm{GRU}_{usr}$ are updated only between sessions, i.e. when a session ends and the following session starts. However, when the user representation is propagated in input to $\mathrm{GRU}_{ses}$, the weights of $\mathrm{GRU}_{usr}$ are also updated within sessions, even though $c_m$ is kept fixed. We also tried propagating the user-level representation to the final prediction layer, but this always led to a severe degradation in performance, even with respect to the simple session-based RNN, so we discarded this setting from the discussion. Note that $\mathrm{GRU}_{usr}$ does not simply pass the hidden state of the previous user session on to the next, but also learns (during training) how user sessions evolve over time. In effect, $\mathrm{GRU}_{usr}$ computes and evolves a user profile based on the user's previous sessions, thus personalizing $\mathrm{GRU}_{ses}$. In the original RNN, users who had clicked/interacted with the same sequence of items in a session would get the same recommendations; in HRNN this is no longer the case: recommendations are also influenced by the user's past sessions.
In summary, we considered the following two HRNN settings, depending on how the user representation $c_m$ is used:
• HRNN Init, in which $c_m$ is used only to initialize the representation of the next session.
• HRNN All, in which $c_m$ is used for initialization and also propagated in input at each step of the next session.
In HRNN Init, the session-level GRU can exploit the user's historical preferences along with the session-level dynamics of the user's interest. HRNN All instead enforces the use of the user representation at the session level, at the expense of slightly greater model complexity. This can lead to substantially different results depending on the recommendation scenario.
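The sketch below, again in PyTorch style and purely illustrative, shows how the two settings differ in code: HRNN Init only uses $c_m$ to initialize the next session, while HRNN All additionally concatenates $c_m$ to the input at every within-session step. Layer sizes, names and the omitted output/loss machinery are assumptions.

```python
import torch
import torch.nn as nn


class HRNN(nn.Module):
    """Hierarchical RNN sketch: a session-level GRU personalized by a user-level GRU."""

    def __init__(self, n_items, hidden_size=100, propagate_to_input=False):
        super().__init__()
        self.propagate_to_input = propagate_to_input            # False -> HRNN Init, True -> HRNN All
        self.item_emb = nn.Embedding(n_items, hidden_size)
        in_size = hidden_size * 2 if propagate_to_input else hidden_size
        self.gru_ses = nn.GRUCell(in_size, hidden_size)          # GRU_ses
        self.gru_usr = nn.GRUCell(hidden_size, hidden_size)      # GRU_usr
        self.init_ses = nn.Linear(hidden_size, hidden_size)      # W_init, b_init
        self.out = nn.Linear(hidden_size, n_items)

    def end_of_session(self, s_last, c_prev):
        """Session boundary: update the user representation and initialize the next session."""
        c_m = self.gru_usr(s_last, c_prev)                        # c_m = GRU_usr(s_m, c_{m-1})
        s_init = torch.tanh(self.init_ses(c_m))                   # s_{m+1,0} = tanh(W_init c_m + b_init)
        return c_m, s_init

    def step(self, item_ids, s_prev, c_m):
        """Within-session step: s_{m+1,n} = GRU_ses(i_{m+1,n}, s_{m+1,n-1}, [c_m])."""
        x = self.item_emb(item_ids)
        if self.propagate_to_input:                               # HRNN All: concatenate c_m to the input
            x = torch.cat([x, c_m], dim=-1)
        s = self.gru_ses(x, s_prev)
        scores = torch.tanh(self.out(s))
        return scores, s
```

The extra input concatenation performed when `propagate_to_input=True` is the additional model complexity of HRNN All mentioned above.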
Learning. For the sake of efficiency in training, we have adapted the session-parallel mini-batch mechanism to account for user identifiers during training, as seen in the figure below.
We first group sessions by user and then sort the session events within each group by time-stamp. We then order the users at random. At the first iteration, the first item of the first session of each of the first B users constitutes the input to the HRNN; the second item of each such session constitutes its output. The output is then used as input for the next iteration, and so on. When a user has been processed completely, the hidden states of both $\mathrm{GRU}_{usr}$ and $\mathrm{GRU}_{ses}$ are reset and the next user takes its place in the mini-batch.
With user-parallel mini-batches we can train HRNNs efficiently over users with different numbers of sessions and sessions of different lengths. Moreover, this mechanism allows negative items to be sampled in a user-independent fashion, hence reducing the chance of 'contamination' of the negative samples with actual positive items. The sampling procedure is still popularity-based, since the likelihood of an item appearing in the mini-batch is proportional to its popularity. Both properties are known to be beneficial for pairwise learning with implicit user feedback.
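A rough sketch of the user-parallel mini-batch iteration described above follows; the data layout (one concatenated, time-ordered item stream per user, each assumed to contain at least two items) and the simplified termination behaviour are assumptions for illustration only.

```python
def user_parallel_batches(user_streams, batch_size):
    """Yield (inputs, targets, resets) over `batch_size` parallel user slots.

    `user_streams` is a list in which each element is one user's sessions,
    time-ordered and concatenated into a single item stream (session
    boundaries would additionally trigger the session-level reset handled
    by the model). `resets[k]` is True when slot k has switched to a new
    user, signalling that the hidden states of both GRU_usr and GRU_ses
    for that slot must be reset.
    """
    n_users = len(user_streams)
    slots = list(range(min(batch_size, n_users)))   # which user each slot currently holds
    cursors = [0] * len(slots)                      # position inside each user's stream
    resets = [True] * len(slots)
    next_user = len(slots)
    while True:
        inputs = [user_streams[u][cursors[k]] for k, u in enumerate(slots)]
        targets = [user_streams[u][cursors[k] + 1] for k, u in enumerate(slots)]
        yield inputs, targets, list(resets)
        resets = [False] * len(slots)
        for k in range(len(slots)):
            cursors[k] += 1
            if cursors[k] + 1 >= len(user_streams[slots[k]]):   # user exhausted: no next target left
                if next_user >= n_users:
                    return                                       # simplification: stop when users run out
                slots[k] = next_user                             # put the next user in this slot
                cursors[k] = 0
                resets[k] = True
                next_user += 1
```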