Stage I: Classical Recommendation

Discover the first stage of our model, merging cutting-edge Collaborative Filtering with Semantic Similarity techniques. We provide accurate, personalized recommendations by examining user interactions and semantic connections among items. Join us in revolutionizing recommendation systems through innovation.

Dataset exploration

1. MolRec dataset

The MolRec dataset contains implicit feedback: users, chemical compounds, and ratings inferred from the total citation count of papers related to each compound's ChEBI ID. It is used to train the Collaborative Filtering model and to inspect the chemical compounds it recommends.

Datasets

1.2K Authors

user_id is the identifier assigned to each individual user within our system.

Over 100 Chemical Compounds

chebi_id identifies a chemical compound. Each identifier is assigned by the ChEBI database, which is maintained by EMBL-EBI.

5.4K Ratings

The rating is the total number of citations of papers related to the chemical compound with that chebi_id.

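95% Sparsity Level

About 95% of the possible user-compound pairs have no rating. This sparsity arises naturally, because each user is linked to only a small fraction of the compounds.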
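As a quick sanity check, the sparsity level can be recomputed from the rounded figures quoted above. This is a minimal sketch: the counts 1200, 100, and 5400 are approximations of "1.2K", "over 100", and "5.4K", and the function name is ours.

```python
def sparsity(n_users: int, n_items: int, n_ratings: int) -> float:
    """Fraction of user-item pairs that carry no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


# Rounded dataset figures (assumed): 1.2K users, about 100 compounds, 5.4K ratings.
level = sparsity(n_users=1200, n_items=100, n_ratings=5400)
print(f"Sparsity: {level:.1%}")
```

With these rounded inputs the result lands close to the 95% reported for the dataset.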

2. ChEBI Ontology OWL file dataset

The ChEBI Ontology OWL file is a structured data representation of the Chemical Entities of Biological Interest (ChEBI) ontology using the Web Ontology Language (OWL). ChEBI focuses on small chemical compounds relevant to biological research, providing a systematic classification of these entities. The OWL file includes definitions, relationships, and properties that describe the molecular structures, functional groups, classifications, and other attributes of these chemical entities. Researchers and developers use this file to build applications, perform data analysis, and exchange chemical information across platforms and domains. We use it as the input to the chemical semantic similarity algorithm based on the ChEBI ontology.

(Reference Link: https://www.ebi.ac.uk/chebi/downloadsForward.do)
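To give a feel for the structure of an OWL file, the sketch below reads subclass ("is-a") links from a tiny RDF/XML snippet using only the Python standard library. The snippet, its example.org IRIs, and the function name are made up for illustration. The real ChEBI file is far larger and also contains other relationship types, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Toy ontology (hypothetical IRIs): TOY_3 is-a TOY_2 is-a TOY_1.
TOY_OWL = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:about="http://example.org/TOY_1"/>
  <owl:Class rdf:about="http://example.org/TOY_2">
    <rdfs:subClassOf rdf:resource="http://example.org/TOY_1"/>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/TOY_3">
    <rdfs:subClassOf rdf:resource="http://example.org/TOY_2"/>
  </owl:Class>
</rdf:RDF>"""


def read_is_a_edges(owl_text: str) -> dict:
    """Map each named class IRI to the list of its direct parent IRIs."""
    root = ET.fromstring(owl_text)
    parents = {}
    for cls in root.iter(f"{{{OWL}}}Class"):
        iri = cls.get(f"{{{RDF}}}about")
        if iri is None:
            continue  # skip anonymous class expressions
        targets = [sub.get(f"{{{RDF}}}resource")
                   for sub in cls.findall(f"{{{RDFS}}}subClassOf")]
        # Keep only links that point directly at a named class.
        parents[iri] = [t for t in targets if t is not None]
    return parents


parents = read_is_a_edges(TOY_OWL)
for child, direct_parents in parents.items():
    print(child, "->", direct_parents)
```

For a file the size of the full ontology, a streaming parser such as `ET.iterparse` would likely be a better fit than loading everything at once.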

Model Overview

Collaborative Filtering (CF)

The Collaborative Filtering (CF) component uses two algorithms, Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR):

1. ALS is a matrix factorization technique that uncovers latent factors in user-item interactions.

2. BPR focuses on pairwise comparisons to learn personalized rankings.
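The two ideas can be sketched in a few lines of NumPy. This is an illustrative toy, not the implementation used in our pipeline: the confidence weighting `1 + alpha * rating`, all hyperparameter values, and the 4x4 rating matrix are assumptions chosen for the example.

```python
import numpy as np


def als_implicit(R, factors=2, reg=0.1, alpha=10.0, iters=15, seed=0):
    """Implicit-feedback ALS: alternately solve a weighted ridge regression
    for every user vector, then for every item vector."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)   # binary preference: was the pair observed?
    C = 1.0 + alpha * R         # confidence grows with the implicit rating
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    ridge = reg * np.eye(factors)
    for _ in range(iters):
        for u in range(n_users):
            YtC = Y.T * C[u]    # scale each item column by its confidence
            X[u] = np.linalg.solve(YtC @ Y + ridge, YtC @ P[u])
        for i in range(n_items):
            XtC = X.T * C[:, i]
            Y[i] = np.linalg.solve(XtC @ X + ridge, XtC @ P[:, i])
    return X, Y


def bpr_step(X, Y, u, i, j, lr=0.05, reg=0.01):
    """One BPR gradient step: push user u to rank observed item i above
    unobserved item j."""
    x_u = X[u].copy()
    diff = x_u @ (Y[i] - Y[j])
    g = 1.0 / (1.0 + np.exp(diff))   # sigmoid(-diff)
    X[u] += lr * (g * (Y[i] - Y[j]) - reg * X[u])
    Y[i] += lr * (g * x_u - reg * Y[i])
    Y[j] += lr * (-g * x_u - reg * Y[j])


# Toy implicit ratings (made up): rows are users, columns are compounds.
R = np.array([[5, 3, 0, 0],
              [4, 0, 0, 0],
              [0, 0, 2, 6],
              [0, 0, 0, 3]], dtype=float)
X, Y = als_implicit(R)
scores = X @ Y.T   # predicted preference for every user-compound pair
print(np.round(scores, 2))
```

ALS fits the whole matrix by solving a small linear system per user and per item, while BPR updates the factors one (user, observed item, unobserved item) triple at a time, so it optimizes the ranking directly rather than the reconstructed values.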

The Semantic Similarity (Content Based) component uses chemical semantic similarity based on the ChEBI ontology (ONTO). It uses DiShIn to compute similarity between entities in the semantic base, with the Resnik similarity metric quantifying the semantic relatedness between items.
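For intuition, Resnik similarity scores two terms by the information content (IC) of their most informative common ancestor. The sketch below implements plain Resnik on a made-up seven-term hierarchy, with IC estimated from the ontology structure alone. It does not call DiShIn and does not reproduce DiShIn's handling of disjunctive common ancestors. The term names and the `PARENTS` table are illustrative assumptions.

```python
import math

# Toy is-a hierarchy (hypothetical): each term maps to its direct parents.
PARENTS = {
    "chemical entity": [],
    "organic compound": ["chemical entity"],
    "inorganic compound": ["chemical entity"],
    "alcohol": ["organic compound"],
    "ethanol": ["alcohol"],
    "methanol": ["alcohol"],
    "water": ["inorganic compound"],
}


def ancestors(term):
    """Return the term itself plus every term reachable through is-a links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Structure-based IC: the fewer terms a concept covers, the higher its IC."""
    n_covered = sum(1 for t in PARENTS if term in ancestors(t))
    return math.log(len(PARENTS) / n_covered)


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


print(resnik("ethanol", "methanol"))  # share the specific ancestor "alcohol"
print(resnik("ethanol", "water"))     # share only the root
```

Two compounds that meet low in the hierarchy receive a high score, while compounds that share only the root receive a score of zero.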

ONTO can be combined with CF by multiplying their similarity scores. The next section also examines the performance of this CF and ONTO hybrid.
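The multiplication step can be sketched as follows. The compound names and score values are invented for illustration, and treating a missing ONTO score as zero is our assumption rather than a documented rule.

```python
def hybrid_top_k(cf_scores, onto_scores, k):
    """Rank candidates by the product of their CF and ONTO scores."""
    hybrid = {
        item: cf_scores[item] * onto_scores.get(item, 0.0)  # assumed default
        for item in cf_scores
    }
    return sorted(hybrid, key=hybrid.get, reverse=True)[:k]


# Hypothetical scores for one user and three candidate compounds.
cf_scores = {"compound_a": 0.9, "compound_b": 0.6, "compound_c": 0.4}
onto_scores = {"compound_a": 0.2, "compound_b": 0.8, "compound_c": 0.5}

ranked = hybrid_top_k(cf_scores, onto_scores, k=3)
print(ranked)
```

Because the scores are multiplied, a candidate needs to do reasonably well on both signals to reach the top: compound_a has the highest CF score here but drops once its low ONTO score is factored in.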

Semantic Similarity (Content Based)

Results & Evaluation

Evaluation metrics: Precision, Recall, and F-Measure (relevance of the ranked list); MRR, nDCG, and IAUC (correctness of the ranking order).
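Several of these metrics are easy to compute by hand for a single ranked list. The sketch below assumes binary relevance (a compound is either relevant or not) and a cutoff `k`. The function names and the example list are ours, and IAUC is left out.

```python
import math


def precision_recall_f(ranked, relevant, k):
    """Precision, recall, and F-measure over the top-k recommendations."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item (0 if none is found)."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(position + 1)
              for position, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1)
               for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical recommendation list and ground truth for one user.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "d", "x"}

precision, recall, f_measure = precision_recall_f(ranked, relevant, k=5)
rr = reciprocal_rank(ranked, relevant)
gain = ndcg(ranked, relevant, k=5)
print(precision, recall, f_measure, rr, gain)
```

MRR is the mean of the reciprocal rank over all users, and the other metrics are averaged across users in the same way.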

The figures below show a consistent trend: combining CF with ONTO yields better results than CF alone. Among the algorithms, ALS_ONTO performs best. It outperforms the other algorithms on F-measure, Recall, Precision, nDCG, and DCG, and its MRR is comparable to theirs. We therefore use ALS_ONTO to generate 3 to 5 recommended chemical compounds, which are then fed into the Stage II model.

However, because of limited computing hardware, limited time, and a relatively small dataset with high sparsity (around 0.6 to 0.7), we were unable to replicate the accuracy reported in the original paper.