News classification (unbalanced classes) — NLP
Natural Language Processing, Machine Learning, TF-IDF, Bag of Words, Sparse Vectors, spaCy, Deep Learning, Doc2Bow, Keras
1. Introduction
Natural Language Processing (NLP) is a field of artificial intelligence that helps computers to understand, interpret and manipulate human language. NLP is concerned with research into computationally efficient mechanisms for human-machine communication using natural language.
Case Study
In this project we work with the meneame.net news dataset. The Menéame dataset contains 177,000 news items from different newspapers. Our goal is to build a model that, given the text of a news item, classifies it into a certain class or type of news.
Throughout this project we will see that, due to the imbalance of the classes (types of news) in our dataset, this classification is quite difficult to carry out, and we will look at how we have tried to solve this problem.
2. Data
The class to predict is the type of news (the sub column of the database), predicted from the noticia (news item) and extracto (extract) columns. First, we look at which categories (the variable we are going to predict) appear in our dataframe.

We can see that the dataset is quite unbalanced, as 76.2% of the news items are in the mnm category. Mnm is a category that comes from the Menéame website, which does not represent any specific topic. This is why we are interested in eliminating this class from our data in order to be able to make the prediction properly.
We delete the mnm class and observe what data we have.
Now we see that there are 4 relevant classes in our data, representing 98.7% of it. We will therefore stick to these four classes: tecnología (technology), ocio (leisure), actualidad (current news) and cultura (culture), and predict only the data belonging to one of these categories.
We chose these classes because the others do not have enough representation in the database to be taken into account.

As can be seen more clearly in this graph, the classes are quite unbalanced, with actualidad (current news) representing 57.9% of the data, while ocio (leisure) accounts for only 8.3%. Therefore, later on, we will balance the classes to see what improvement there is between training the models on balanced classes versus unbalanced ones.
Text normalisation is the process of transforming text into a canonical form. We normalise the text with the following function. What does this function do?

- We clean up mentions and urls.
- Then, we convert to tokens and remove punctuation marks. Tokenising a text consists of dividing the text into its constituent units, where a unit is understood to be the simplest element with its own meaning for the analysis in question, in this case, the words. These tokens are transformed to lower case.
- We also lemmatise. Lemmatisation reduces each word to its lemma, separating the root (lexeme) from the inflectional morphemes. We also remove digits, words with fewer than two letters, and stop words.
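The cleaning steps above can be sketched as follows. The original pipeline uses spaCy for tokenisation and lemmatisation; this simplified version uses only the standard library, the stop-word list is a tiny illustrative subset, and the length threshold is illustrative.

```python
import re

# Tiny illustrative Spanish stop-word subset (the real pipeline uses
# spaCy's full stop-word list and its lemmatiser)
STOP_WORDS = {"el", "la", "los", "las", "de", "en", "que", "y", "un", "una"}

def normalize(text: str) -> list:
    text = re.sub(r"@\w+", " ", text)          # remove mentions
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.lower()
    # Tokenise on letters only, dropping punctuation and digits,
    # then filter out very short words and stop words
    tokens = re.findall(r"[a-záéíóúüñ]+", text)
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

print(normalize("Nueva ley de tecnología https://example.com @usuario 2021"))
# → ['nueva', 'ley', 'tecnología']
```

In the real project, each surviving token would additionally be replaced by its spaCy lemma before being returned.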
We merge the text of the noticia (news item) column and the extracto (extract) column into one. We split the data into a training set and a test set, and train the models on the training set.
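A minimal sketch of this merge-and-split step, on a toy stand-in for the Menéame dataframe (the column names come from the post; the `texto` column name is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Menéame dataframe
df = pd.DataFrame({
    "noticia":  ["texto uno", "texto dos", "texto tres", "texto cuatro"],
    "extracto": ["resumen uno", "resumen dos", "resumen tres", "resumen cuatro"],
    "sub":      ["actualidad", "ocio", "cultura", "tecnología"],
})

# Merge the news body and the extract into a single text field
df["texto"] = df["noticia"] + " " + df["extracto"]

X_train, X_test, y_train, y_test = train_test_split(
    df["texto"], df["sub"], test_size=0.25, random_state=42
)
```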
3. Models with Sparse Vectors
We are going to perform feature extraction, converting the text documents into numeric vectors so that we can train our models. To start with, we use Vector Space Models: Bag of Words and TF-IDF.
To evaluate the models we will use the F1 score instead of accuracy. The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful, especially with an uneven class distribution (as is the case here).
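A quick example of the metric, using scikit-learn's `f1_score` with `average="weighted"`, which averages the per-class F1 scores weighted by class support and is therefore a sensible choice when the classes are unbalanced (the labels here are illustrative):

```python
from sklearn.metrics import f1_score

y_true = ["actualidad", "ocio", "ocio", "cultura", "actualidad", "actualidad"]
y_pred = ["actualidad", "actualidad", "ocio", "cultura", "actualidad", "ocio"]

# Per-class F1 scores are averaged, weighted by each class's support
score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 3))  # → 0.667
```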
With these sparse vectors, we trained several models: Logistic Regression, Multinomial NB, SGDClassifier, Linear SVC, Decision Tree and Random Forest.
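Two of these combinations can be sketched as scikit-learn pipelines. The corpus below is a tiny illustrative stand-in; the real project fits these pipelines on the normalised news texts from the previous step.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the normalised news texts
texts  = ["partido de fútbol", "nuevo móvil presentado", "concierto de rock",
          "elecciones generales", "tablet barata", "festival de cine"]
labels = ["ocio", "tecnología", "cultura", "actualidad", "tecnología", "cultura"]

for name, clf in [("LogReg + TF-IDF", Pipeline([("vec", TfidfVectorizer()),
                                                ("clf", LogisticRegression())])),
                  ("MultinomialNB + BoW", Pipeline([("vec", CountVectorizer()),
                                                    ("clf", MultinomialNB())]))]:
    clf.fit(texts, labels)
    print(name, clf.predict(["nuevo festival de música"]))
```

The same pattern extends to the other classifiers listed above by swapping the final estimator in the pipeline.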



- We can see that the decision trees perform significantly worse than the other models.
- We can see that the models have a high failure rate in the Ocio (leisure) category, which is why we have to find a way to try to achieve a better result in this aspect.
- On the other hand, the F1 score alone is not decisive when choosing a model: some models predict the great majority of items as actualidad (current news) and achieve high precision there, which gives them a good F1 score anyway. We therefore also need to analyse the confusion matrix.
4. Models with Balanced Classes
As we have pointed out, we think the results could be improved by balancing the classes since, for example, the leisure class (ocio), which many of our models frequently misclassify, represents only 8.3% of our data, while actualidad (current news) represents 56.6%. In the results above, current news has the highest F1 score, and we think this is because it also has the largest number of observations in our dataset. This is why class balancing can help us.
There are two ways of balancing classes:
- Under-sampling: Remove samples from over-represented classes. It is most useful when the dataset is large.
- Over-sampling: Add more samples from under-represented classes. It is most useful when the dataset is small.

Although our dataset is large, we will balance the classes using oversampling. We will do this with SMOTE, an oversampling method that creates synthetic samples of the minority classes, using the imblearn package.
There is also the option in scikit-learn of balancing the classes inside the model with the parameter class_weight="balanced"; we will use this in some models later. For now, we use SMOTE.
Let’s train the models and see the results. As we said before, the F1 score alone is not entirely conclusive, so we also take the confusion matrix into account. We look at the best models:



- We can see that in this part, we obtain better results in the leisure category (ocio) than in the models without the balanced classes.
- The results of the classification trees are still quite poor.
- We did not implement a support vector machine (SVM) due to the high computational cost, which is why in the following section we will study the LSA algorithm to, among other things, reduce the dimensionality and be able to implement an SVM.
5. Models with Topic Modeling — LSA
In this section we perform topic modelling with two main purposes. First, to reduce dimensionality, so that we can handle models such as the SVC that we did not run before because of their computational cost. Second, to see how the classifier models perform on the dense, low-dimensional matrix produced by LSA.
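LSA amounts to a truncated SVD applied to the TF-IDF matrix; a minimal sketch (the toy corpus and the 2 components are illustrative, the real project would use far more components):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["partido de fútbol en el estadio", "nuevo móvil con mejor cámara",
         "concierto de rock este viernes", "resultados de las elecciones",
         "tablet barata en oferta", "estreno de cine en el festival"]

# Project each TF-IDF document vector onto a small number of latent "topics"
tfidf = TfidfVectorizer().fit_transform(texts)
lsa = TruncatedSVD(n_components=2, random_state=42)
X_lsa = lsa.fit_transform(tfidf)
print(X_lsa.shape)  # → (6, 2): dense, low-dimensional document vectors
```

Classifiers such as the SVC are then trained on `X_lsa` instead of the full sparse matrix, which is what makes them computationally feasible.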



In this case the model with the best f1-score is the LinearSVC, although the SVC offers very good results.
6. Models with Doc2Bow
Now we are going to use feature extraction with dense vectors. In sparse models the numeric vector is not related to the meaning of the words. Dense models such as Doc2Bow encode each word or sentence as a numeric vector that carries semantic information.



- We notice that with doc2bow dense vectors our models improve substantially.
- The SVC is quite a good model, one of the best. It performs a good prediction of the leisure class, while maintaining good accuracy in the other classes.
7. Keras
In this section, neural networks are used to try to find an optimal solution to the problem.
Specifically, ignoring computational cost, the best option would be to use a convolutional network. However, due to the high computational cost of convolutions, the procedure actually carried out is to define a multilayer perceptron focused on solving the classification problem.
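Such a multilayer perceptron can be sketched in Keras as below; the layer sizes, dropout rate and input dimension are illustrative assumptions, not the post's exact architecture.

```python
from tensorflow import keras

# Illustrative sizes: TF-IDF features in, the 4 news categories out
n_features, n_classes = 5000, 4

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),                          # regularisation
    keras.layers.Dense(n_classes, activation="softmax") # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The model is then fitted on the vectorised training texts with integer-encoded labels via `model.fit`.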

The results obtained using a multilayer perceptron are not good: it does not achieve a good F1 score, but it does provide a good solution to the problem of classifying the Leisure (ocio) category.
8. Final steps
We select what we consider to be the best models and try to improve them by tuning their parameters. Then we see how they perform on the test set.
We do this in order to test the models on completely unknown data. In this case, the step is somewhat redundant, but it gives us an estimate of how these models would perform on new news items.
- Logistic Regression with TFIDF — Balanced classes

- Linear SVC — LSA

- SVC — Doc2Bow

9. Conclusions
In conclusion, we would like to highlight several important points that we have found throughout this project:
- Firstly, the models have considerable difficulty predicting the Leisure (ocio) class, as can be seen throughout the paper. This class is often confused with the others, partly because it is the class with the fewest samples, and partly because leisure naturally overlaps with culture and current news, so some confusion makes sense. In addition, the news items in our dataset do not contain much text. None of the models achieve a good prediction for this class without reducing the F1 score of the other classes.
- It is worth noting the improvement brought by class balancing. The classifiers improved their predictions on the sparse vectors once the classes were balanced; we also balanced classes in some models via scikit-learn's class_weight option. This improvement is understandable since, as already mentioned, actualidad (current news) had a fairly overwhelming majority over the other classes.
- We can see that the SGDClassifier works well, but we reject it because it is not a stable model: it shows large variations between runs, which makes it unattractive to us.
- We highlight the models that have worked best: Logistic Regression, SVC and LinearSVC. We can see that Keras and, above all, the classification trees do not work well on this problem and obtain rather poor results.
- The application of dense vectors with Doc2bow to the models shows an improvement in the prediction of the Leisure (Ocio) class with respect to the other ways of vectorising the information.
Finally, we chose the Logistic Regression model with the TF-IDF matrix because it has the best F1 score among the models we considered best. The others also perform quite well, with higher accuracy on the leisure class; since all of them perform well, we chose the one with the best F1 score.