This post is written retrospectively: the project was completed about a month ago, on 19 May '23, for my computer science subject Machine Learning (COMP30027) at uni. I am very proud of how it turned out. I learned invaluable skills, which I will cover here, and want to document them while they are still fresh in my mind.
Introduction
This project aimed to create and tune a supervised machine-learning model to predict the rating of a book from a set of given attributes (features). We were provided with over 30,000 book instances by the COMP30027 team. The dataset was collected from GoodReads.com, the go-to website for information on books and personal reviews from people around the world. Each instance contained the following attributes:
name
authors
publish year
publish month
publish day
publisher
language
page numbers
description
book rating (3, 4, or 5)
The name, authors, and description features are raw text, which was encoded using the doc2vec and CountVectorizer text-encoding methods. These methods convert written text into high-dimensional numeric vectors that a model can work with, since most models struggle with raw text.
Several instances were missing the language attribute; however, due to time constraints, only a brief feature preprocessing and selection process could be conducted. The existing values in the language column were encoded with a one-hot encoder and concatenated onto the datasets used, but this did not improve model performance (I will expand on this later). In future work, I would populate the missing language values by analysing the description feature and then incorporate the column. Building a language classifier using a convolutional neural network would be an interesting project to embark on in the future.
An extremely large class imbalance caused the accuracy metric to overestimate the model's true performance. Therefore, the F_1 score was used, which combines the model's precision and recall and so captures performance on the minority classes. The relevant equations are as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F_1 = 2 × (Precision × Recall) / (Precision + Recall)
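A small hypothetical example makes the imbalance problem concrete: a classifier that always predicts the majority rating scores well on accuracy but poorly on F_1 (macro-averaged here for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels mimicking the heavy imbalance between ratings 3, 4, 5.
y_true = [4, 4, 4, 4, 4, 4, 4, 4, 3, 5]
y_pred = [4] * 10  # a lazy model that always predicts the majority rating

acc = accuracy_score(y_true, y_pred)              # looks good: 0.8
f1 = f1_score(y_true, y_pred, average="macro")    # exposes the failure
print(acc, f1)
```

The accuracy of 0.8 hides the fact that the minority ratings are never predicted, which the macro F_1 score (about 0.30) reveals.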
Preliminary Model Testing
Several models were chosen for testing, and the best-performing two were selected for further tuning. The Naive Bayes model was not considered because the nature of the encoded text data violates the Naive Bayes independence assumption. The following models were tested on all six datasets (the results below are for the author CountVectorizer dataset):
| Model | F_1 Score |
| --- | --- |
| Zero-R | 0.274 |
| LinearSVC | 0.442 |
| Decision Tree | 0.465 |
| K-NN | 0.400 |
| Logistic Regression | 0.382 |
| AdaBoost (Decision Tree) | 0.337 |
Across all tests, the two models with the most consistent success were chosen. LinearSVC showed the most consistently high F_1 score, while the Decision Tree did well but inconsistently. This suggested both models were uncovering the underlying relationships. At the time, I attributed the Decision Tree's inconsistency to the large class imbalance, so I decided that a properly tuned AdaBoost model would be preferable instead, since it biases towards the minority class.
Linear Support Vector Classifier
Naturally, SVMs are binary classifiers; however, our dataset contains multiple class labels, so a multi-class SVM technique was necessary. The scikit-learn documentation states that LinearSVC uses a 'one-vs-rest' approach by default, so I decided to build another model using the 'one-vs-one' approach. The 'one-vs-one' approach creates n(n − 1)/2 classifiers for n discrete class labels, where each classifier separates one pair of class labels. The 'one-vs-rest' (or 'one-vs-all') method creates n classifiers, where each classifier distinguishes between one class label and all the rest.
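The two multi-class strategies can be sketched with scikit-learn's wrappers (the data here is synthetic, standing in for the encoded book features, with three classes like the three ratings):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the encoded book features; 3 classes like ratings 3/4/5.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)

# One-vs-one fits n(n-1)/2 classifiers, one-vs-rest fits n.
print(len(ovo.estimators_), len(ovr.estimators_))
```

For n = 3 classes the two counts happen to coincide (3(3 − 1)/2 = 3 = n); the strategies only diverge in size for four or more classes.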
A GridSearch was performed to tune the C hyperparameter of the LinearSVC over all six datasets. The C values tested were [0.1, 0.5, 1, 10], and performance was compared using the F_1 score. The following table shows the best-performing C value and its corresponding F_1 score for each dataset:
| Dataset | Best C Value | F_1 Score |
| --- | --- | --- |
| Name (doc2vec) | 1 | 0.28 |
| Authors (doc2vec) | 0.1 | 0.27 |
| Description (doc2vec) | 1 | 0.28 |
| Name (CountVectorizer) | 1 | 0.31 |
| Authors (CountVectorizer) | 1 | 0.44 |
| Description (CountVectorizer) | 1 | 0.28 |
The LinearSVC performed best on the CountVectorizer author dataset, achieving 70% accuracy.
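The grid search above could be sketched as follows (synthetic data again stands in for one of the six encoded datasets, and macro-averaged F_1 is assumed as the scoring metric):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic placeholder for one of the six encoded datasets.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

grid = GridSearchCV(
    LinearSVC(max_iter=5000),
    param_grid={"C": [0.1, 0.5, 1, 10]},  # the C values tested in the project
    scoring="f1_macro",                   # macro F_1 assumed here
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_["C"], round(grid.best_score_, 3))
```

Running this once per dataset yields a (best C, F_1) pair per row of the table above.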
AdaBoost
AdaBoost is an ensemble machine-learning method that aggregates the predictions of multiple base classifiers to produce the final classification. The base classifiers are trained iteratively: the results of each base classifier determine its model weight (the influence it has over the final decision) and the instance weights used to train the next model. Instances the previous base classifier found hard to classify are up-weighted, increasing the probability they appear in the training of the next base classifier, and the minority class tends to contain the hard-to-classify instances. This can bias the model towards the minority class and uncover patterns that would otherwise be obscured by the majority class. The base classifier, in this case, is a short decision tree (a decision stump).

Intuitively, the AdaBoost model seemed to fit this classification problem well. A highly imbalanced class distribution and correlated dataset features suggested a correctly tuned AdaBoost model could outperform a single Decision Tree. A GridSearch was performed for hyperparameter optimisation. First, the decision stump depth was optimised:
| Decision Stump Depth | Accuracy | F_1 Score |
| --- | --- | --- |
| 1 | 0.699 | 0.326 |
| 2 | 0.699 | 0.339 |
| 3 | 0.697 | 0.339 |
With the best decision stump depth discovered, a GridSearch was performed on the number of base classifiers:
| Number of Estimators | Accuracy | F_1 Score |
| --- | --- | --- |
| 50 | 0.699 | 0.339 |
| 100 | 0.699 | 0.343 |
| 200 | 0.697 | 0.333 |
Radial Basis Function Kernel SVC
A noteworthy mention is the SVC using the Radial Basis Function (RBF) kernel. I stumbled upon this classifier while trying different SVC kernels for fun. The RBF SVC didn't have the best F_1 score but had the highest accuracy of all the models. Since my class was holding a Kaggle competition for this project, I submitted this model's predictions; it achieved 74% accuracy on the competition dataset.
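Trying the RBF kernel requires only swapping the estimator, since RBF is in fact SVC's default kernel; a minimal sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced 3-class data standing in for the encoded book features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)  # "rbf" is SVC's default kernel
pred = clf.predict(X_test)

# On imbalanced data these two numbers can diverge sharply,
# which is exactly what happened in the project.
print(accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro"))
```

Reporting both metrics side by side makes the accuracy/F_1 gap visible instead of hiding it behind a single number.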
Discussion
I will keep this brief for now and might add more later.
The higher accuracy achieved by the RBF SVC suggests a non-linear relationship in the data. A GridSearch over this model could not be explored further due to its computational cost; a stronger computer will be needed for future projects. Given the apparent non-linearity, an ensemble method that is more robust to non-linear relationships, such as Random Forest, should be explored in the future. Random Forest utilises Bootstrap Aggregation and would be the first method I would explore if the project were continued.
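That follow-up experiment would be a short one; a minimal sketch, again on synthetic imbalanced stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class data standing in for the book dataset.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# Bootstrap aggregation: each tree trains on a bootstrap sample of instances
# and considers a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, scoring="f1_macro", cv=5)
print(round(scores.mean(), 3))
```

Cross-validated macro F_1 would let this model be compared directly against the LinearSVC and AdaBoost tables above.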
In reflection, although this project was frustrating at times, it was enjoyable and extremely rewarding. I was also very satisfied with my final report, which will be appended to this post once the semester is finished.