
How can a Data Scientist create a classifier if some of the data is mislabelled?


In software development in general, and video games in particular, it is always important to be able to analyse system performance and user behaviour. For analysts to gather information and make useful recommendations, and for developers to use those recommendations to improve the product, care must be taken in advance not only about correct logging, but also about correct labelling of the data. Because this is not always possible, some data is left out of the analysis, or worse, incorrect conclusions are drawn from it.

The approach to dealing with mislabelled data described in this article will be helpful to any analyst or data scientist who has to work with such data, wants to use it to build solutions anyway, and is not looking for an easy way out.


What is mislabelled data and why does it happen?

Mislabelled data is data whose labels do not correspond to reality. For example, you have a set of pictures of cats and dogs, but for some reason some of the cats are labelled as dogs. Such a problem can arise for several reasons: the subjectivity of the person labelling the data; errors in obtaining the data; and, in the case of indirect labelling, the choice of the wrong algorithm. Obviously, such problems can occur in absolutely any field: medicine, entertainment, training - anywhere.

As you know, games have a lot of events (such as promotions for some content, the launch of a new mode) with completely different objectives, ranging from attracting new players to monetizing or increasing their engagement. After hosting another game event, you, as an analyst, have the following task: to implement an algorithm that can be used to predict the players' participation in a similar event in the future.

The inputs that are available to you include:

  • Player characteristics (how many battles they play, what modes they play, what vehicles they prefer, etc.).
  • The player's participation in the event (e.g. whether they purchased the content offered).

It seems that nothing prevents you from training an algorithm on player characteristics to predict the probability of their participation in the event. However, something goes wrong during the event, and you realise that some of the players who wanted to participate are unable to do so. If it's a purchase, they try to make it, but they can't because the amount of content offered is limited; if it's a new mode, they try to play it, but they can't enter the battle due to technical problems. These players remain in a "haven't participated, but would like to" status.

What a shame! After all, as an analyst, you really need accurate data about the participants of the event, in order to use this information to train the algorithm.

That said, this third status doesn't really exist in the data: each player is recorded only as a participant or a non-participant. And there is no technical way to check whether a label is correct or not. All you know is that some of the players have fallen into the wrong class.

What to do?

Option one: nothing!

Well, we can't train the model and make predictions - it happens. We can always calculate descriptive statistics for the players who participated in the event (for example, the average number of battles per day) and make simple rules for selecting potential participants in a new event. In the case of the average number of battles per day, it may turn out that participants have a metric value that is on average 30% higher than that of non-participants. We then assume that all players whose metric is at the participants' level will be potential participants in the next event.
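A rule of this kind could be sketched as follows. The column names, the values, and the choice of the cutoff (the midpoint between the two group means) are all assumptions made for illustration, not something prescribed above.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
players = pd.DataFrame({
    "battles_per_day": [12.0, 15.0, 14.0, 4.0, 5.0, 13.0, 6.0, 3.0],
    "participated":    [1,    1,    1,    0,   0,   0,    0,   0],
})

# Descriptive statistic per group: mean battles per day.
group_means = players.groupby("participated")["battles_per_day"].mean()

# Assumed rule: a player is a potential participant if their metric is
# closer to the participants' mean than to the non-participants' mean.
cutoff = (group_means.loc[0] + group_means.loc[1]) / 2
players["potential_participant"] = players["battles_per_day"] >= cutoff

print(group_means)
print(players)
```

Note that the rule also flags the class 0 player with 13 battles per day, which is exactly the kind of player who may have wanted to participate but could not.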


Pros:

  • Simple and quick.

Cons:

  • Lack of scalability, or a limited number of metrics you can cover. If we are talking about 1-2 metrics, it's easy to write rules for them. But in reality, there are many more metrics that you will want to compare. And if you want to look at the mutual influence of several attributes, this becomes very difficult and the speed advantage disappears.
  • Accuracy/quality. You simply won't be able to assess them adequately.
  • The influence of faulty data. The mislabelled players do not go anywhere, so their characteristics distort the values of the metrics in question. For example, the "non-participants" of the event will include potential participants with a much higher number of battles per day than the true non-participants. As a result, the average value or some other statistic for this group will be erroneously inflated.
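The inflation effect is easy to see on a few made-up numbers. This is only a toy illustration; the values below are hypothetical.

```python
import numpy as np

# Battles per day (made-up values).
true_non_participants = np.array([3.0, 4.0, 5.0, 6.0])
hidden_would_be_participants = np.array([13.0, 14.0])  # recorded as class 0 by mistake

# What the analyst actually observes as "non-participants".
observed_class_0 = np.concatenate([true_non_participants, hidden_would_be_participants])

clean_mean = true_non_participants.mean()
observed_mean = observed_class_0.mean()
print(clean_mean, observed_mean)
```

Two mislabelled players out of six are enough to move the group average from 4.5 to 7.5 in this example.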

Option two: train the algorithm on the available labels

We take all possible metrics we can think of (including the average number of fights per day) as a feature vector to characterize players, and participation of a player in an event as a target variable, assuming we have no errors in the data. We train an algorithm on this data, using the magic of machine learning, and use the resulting algorithm to predict player participation in the next event.


Pros:

  • Scalability. Because it is a machine learning algorithm, you take into account many more features and their mutual influence on the target variable.
  • Accuracy/quality. Now you can assess the quality of the resulting model (there is a reason why so many quality metrics have been invented for machine learning methods), but you will probably not be satisfied with what you get.

Cons:

  • The impact of faulty data. You won't be satisfied with the quality of the algorithm because the erroneous labels are still present in the training sample and have a significant impact on learning.

Option three: relabel the players and train the model on the relabelled data

Before training the model on the characteristics of the players and the target variable, we try to obtain new, more accurate values for the target variable. Only after that do we train the final algorithm, using the updated values.


Pros:

  • Scalability. It's still the same machine learning model, with which you account for multiple features and their interrelationships.
  • Accuracy/quality. As in the previous option, the quality of the model can be assessed.
  • No influence from erroneous data. If you manage to relabel your sample properly (we describe how to do it below), the impact of erroneous data on the training result will be minimised.

Considering that the latter option has the most advantages and we are not looking for easy ways out, this is what we will focus on.

How do you find incorrectly labelled objects in the sample?

The quality of the final model depends on two things: the quality of the input data (especially its labels) and the features and tuning of the selected model. In our case, the main focus is on data quality, so we will not optimise the models themselves. However, this is no excuse for skipping that step in your own work! To relabel the data we'll need:

  • the original sample with the "wrong" labels,
  • several machine learning methods with different architectures,
  • time and powerful hardware.

Suppose the initial data is as follows. Here X={x1,x2,…, xn} is a vector of features describing each player, and y is the target variable corresponding to whether the player participated in the event or not (1 - participated, 0 - did not participate respectively).

import pandas as pd
import numpy as np

data = pd.read_csv('../data.csv')
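If no comparable dataset is at hand, a synthetic one with hidden label noise can stand in for it. The sketch below is an assumption for experimentation only: the feature names `x1` and `x2`, the 60% share of blocked players, and the rule that generates willingness to participate are all invented and are not taken from the data used in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# True (unobserved) willingness to participate depends on activity.
battles_per_day = rng.gamma(shape=2.0, scale=5.0, size=n)
wants_to_participate = (battles_per_day + rng.normal(0, 2, n) > 12).astype(int)

# Something goes wrong: roughly 60% of willing players (assumed share)
# cannot take part, so they end up recorded as class 0.
blocked = (wants_to_participate == 1) & (rng.random(n) < 0.6)
y_observed = np.where(blocked, 0, wants_to_participate)

toy_data = pd.DataFrame({
    "x1": battles_per_day,
    "x2": rng.normal(size=n),  # an uninformative feature
    "y": y_observed,
})
print(toy_data["y"].value_counts())
```

Because the true labels are known here, such a dataset also makes it possible to check how many hidden participants a relabelling procedure actually recovers.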

We start by training a basic classifier to understand the quality of a model trained on the mislabelled sample. We choose Random Forest and its implementation in scikit-learn as the classifier.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

RS = 42  # random seed (assumed value)

# features are all columns except the target 'y'
X_data, y_data = data.drop(['y'], axis=1), data['y']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=.3, random_state=RS)
clf = RandomForestClassifier(n_estimators=250, random_state=42, n_jobs=15).fit(X_train, y_train)
y_pred = clf.predict(X_test)

Let's look at the quality of the model by printing the confusion matrix and the basic quality metrics.

from sklearn.metrics import classification_report

# confusion matrix with row and column totals
df_confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(df_confusion)
print(classification_report(y_test, y_pred))

We see that the classification quality for class 1, i.e. event participants, is very low: recall for class 1 is 0.19 and its f1-score is 0.29. The average f1-score for the model is 0.62.
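As a reminder of how these numbers come out of a confusion matrix, here is a small sketch. The counts are hypothetical: they are not the confusion matrix of the model above and were chosen only so that recall comes out at 0.19.

```python
# Hypothetical counts for class 1 (NOT the article's confusion matrix).
tp = 19   # participants predicted as participants
fn = 81   # participants predicted as non-participants
fp = 12   # non-participants predicted as participants

recall = tp / (tp + fn)        # share of real participants the model finds
precision = tp / (tp + fp)     # share of predicted participants that are real
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 2), round(precision, 2), round(f1, 2))
```

A low recall for class 1 means that most real participants are sent to class 0, which is what one would expect when part of class 0 in the training data consists of would-be participants.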

If you were not going to relabel the data, you would hardly be satisfied with these results, given that the model assigns almost all event participants to the class of those who will not participate. You would end up going back to calculating descriptive statistics.

Let's hope you decide to move on. Schematically, the whole relabelling procedure boils down to the following. Split the raw data into N parts with the same proportion of class 0 and class 1 objects in each. For each part, train 5 or more machine learning methods on the remaining N-1 parts and use them to predict the held-out part; the methods should preferably differ in architecture and be able to predict probabilities. In our case we use the familiar Random Forest, as well as logistic regression, Naive Bayes, XGBoost, and CatBoost.
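The scheme can be sketched compactly with scikit-learn's `StratifiedKFold` and `cross_val_predict`. This is a minimal sketch under several assumptions: the data is synthetic (`make_classification`), half of the true positives are hidden in class 0 on purpose, only three scikit-learn models are used to keep dependencies light, and an object is moved from class 0 to class 1 only when every model agrees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the player features (assumption).
X, y_true = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)

# Hide half of the true participants in class 0 to imitate the mislabelling.
rng = np.random.default_rng(0)
y_noisy = y_true.copy()
positives = np.flatnonzero(y_true == 1)
hidden = rng.choice(positives, size=len(positives) // 2, replace=False)
y_noisy[hidden] = 0

models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    make_pipeline(StandardScaler(), GaussianNB()),
    RandomForestClassifier(n_estimators=100, random_state=0),
]

# Each object is scored by models that never saw it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
threshold = 0.5
votes = np.column_stack([
    cross_val_predict(m, X, y_noisy, cv=cv, method="predict_proba")[:, 1] >= threshold
    for m in models
])

# Move an object from class 0 to class 1 only when every model agrees.
y_new = np.where(votes.all(axis=1), 1, y_noisy)
print("relabelled 0 -> 1:", int((y_new != y_noisy).sum()))
```

Wrapping the scaler and the model in one pipeline means the scaler is refitted on the training folds only, so no information from the held-out part leaks into training.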

To do this, we initialise the models with the required parameters. By the way, it is better to select these parameters at this stage through hyperparameter optimisation.

from catboost import CatBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

clfs = {}

logreg_model = LogisticRegression(C=100)
clfs['LogReg'] = {'clf': LogisticRegression(), 'name':'LogisticRegression', 'model': logreg_model}

rf_model = RandomForestClassifier(n_estimators=250, max_depth=18, n_jobs=15)
clfs['RandomForest'] = {'clf': RandomForestClassifier(), 'name':'RandomForest', 'model': rf_model}

xgb_model = XGBClassifier(n_estimators=500, max_depth=10, learning_rate=0.1, n_jobs=15)
clfs['XGB'] = {'clf': XGBClassifier(), 'name': 'XGBClassifier', 'model': xgb_model}

catb_model = CatBoostClassifier(learning_rate=0.2, iterations=500, depth=10, thread_count=15, verbose=False)
clfs['CatBoost'] = {'clf': CatBoostClassifier(), 'name': 'CatBoostClassifier', 'model': catb_model}

nb_model = GaussianNB()
clfs['NB'] = {'clf': GaussianNB(), 'name':'GaussianNB', 'model': nb_model}

Next, divide the original data into 5 parts with the same proportion of class 0 and class 1 examples in each.

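def split_shuffled(df, n_parts=5):
    # shuffle the rows, then cut them into n_parts chunks of (almost) equal size
    shuffled = df.sample(frac=1)
    return [shuffled.iloc[idx] for idx in np.array_split(np.arange(len(shuffled)), n_parts)]

data_0 = split_shuffled(data[data['y'] == 0])
data_1 = split_shuffled(data[data['y'] == 1])

# part i combines the i-th chunk of each class (DataFrame.append was removed in pandas 2.0)
dfs = {i: pd.concat([data_0[i], data_1[i]]) for i in range(5)}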

Finally, we relabel the data, iterating over each of the 5 parts of the sample and using each of the 5 models defined above for prediction.

from sklearn.preprocessing import StandardScaler

threshold = 0.5
relabeled_parts = []
for i in range(5):
    # test - dataframe #i, train - all except #i
    df_test = dfs[i]
    df_train = pd.concat([value for key, value in dfs.items() if key != i])
    X_train, y_train = df_train.drop(['y'], axis=1), df_train['y']
    X_test, y_test = df_test.drop(['y'], axis=1), df_test['y']

    df_w_predicts = df_test.copy()
    for value in clfs.values():
        model = value['model']
        if value['name'] in ['LogisticRegression', 'GaussianNB']:
            # these models get standardised features; the scaler is fitted on the training part only
            scaler = StandardScaler().fit(X_train)
            model.fit(scaler.transform(X_train), y_train)
            predicts = (model.predict_proba(scaler.transform(X_test)
                                               )[:, 1] >= threshold).astype(bool)
        else:
            model.fit(X_train, y_train)
            predicts = (model.predict_proba(X_test)[:, 1] >= threshold).astype(bool)

        # one boolean column per model: True if it predicts participation
        df_w_predicts[value['name']] = predicts
    # collect the held-out part once all models have voted
    relabeled_parts.append(df_w_predicts)

relabeled_data = pd.concat(relabeled_parts)

As a result of the relabelling, each model predicts the probability that the player was a participant in the event. The target variable is changed only if all the models predict a probability above some threshold; in the current example, threshold = 0.5. The data will look as follows:
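As a rough illustration with made-up values, a table of this shape and the consensus rule might look like the sketch below. The model column names follow the `name` entries in `clfs`; the feature column, the rows, and the assumption that labels only ever move from class 0 to class 1 are hypothetical.

```python
import pandas as pd

model_cols = ["LogisticRegression", "RandomForest", "XGBClassifier",
              "CatBoostClassifier", "GaussianNB"]

# Made-up rows: one feature, the original label y and one vote per model.
example = pd.DataFrame(
    [
        [14.0, 1, True,  True,  True,  True,  True],
        [13.0, 0, True,  True,  True,  True,  True],   # all models agree -> relabelled
        [9.0,  0, True,  False, True,  True,  False],  # no consensus -> stays 0
        [3.0,  0, False, False, False, False, False],
    ],
    columns=["battles_per_day", "y"] + model_cols,
)

# A player becomes class 1 if they already were, or if every model votes for it.
consensus = example[model_cols].all(axis=1)
example["y_new"] = (example["y"].astype(bool) | consensus).astype(int)
print(example[["battles_per_day", "y", "y_new"]])
```

Requiring a unanimous vote is deliberately conservative: a single dissenting model is enough to keep the original label.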

A logical question arises: how do we check the quality of the relabelling? One option is to plot the distributions of key player metrics for the actual participants in the event, the potential participants (i.e. those we have moved from class 0 to class 1), and the non-participants. As a result, you will get the following:

We can clearly see that the distributions of the main metrics of potential participants (formerly "non-participants") are almost identical to the distributions of real participants.
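A numeric counterpart to such plots is to compare summary statistics of a key metric across the three groups. The sketch below uses made-up rows and assumes the columns `y` (original label) and `y_new` (label after relabelling); both names and all values are hypothetical.

```python
import pandas as pd

# Made-up rows; y is the original label, y_new the label after relabelling.
df = pd.DataFrame({
    "battles_per_day": [14.0, 12.5, 15.0, 13.0, 13.5, 4.0, 5.5, 3.0],
    "y":     [1, 1, 1, 0, 0, 0, 0, 0],
    "y_new": [1, 1, 1, 1, 1, 0, 0, 0],
})

def group_of(row):
    """Assign each player to one of the three groups being compared."""
    if row["y"] == 1:
        return "participant"
    if row["y_new"] == 1:
        return "potential participant"
    return "non-participant"

df["group"] = df.apply(group_of, axis=1)

# Compare the groups through a few summary statistics of a key metric.
summary = df.groupby("group")["battles_per_day"].agg(["count", "mean", "median"])
print(summary)
```

If the relabelling is reasonable, the "potential participant" row should sit close to the "participant" row and far from the "non-participant" row, as it does in this toy example.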

We train Random Forest again to compare the quality of the models before and after relabelling. The resulting model can also be used to predict the participation of new players in the next event.

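# Assumption: a player is moved from class 0 to class 1 only when all five models
# predict participation; players originally in class 1 keep their label.
model_cols = [value['name'] for value in clfs.values()]
relabeled_data['y_new'] = (relabeled_data['y'].astype(bool)
                           | relabeled_data[model_cols].all(axis=1)).astype(int)

# drop the old label and the per-model votes so that they do not leak into the features
X_data = relabeled_data.drop(['y', 'y_new'] + model_cols, axis=1)
y_data = relabeled_data['y_new']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=.3, random_state=42)
clf = RandomForestClassifier(n_estimators=250, random_state=42, n_jobs=15).fit(X_train, y_train)
y_pred = clf.predict(X_test)

df_confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(df_confusion)
print(classification_report(y_test, y_pred))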

We see that the quality of classification has improved: the average f1-score has increased to 0.84, i.e. by about 35% relative to the earlier 0.62! Recall for class 1 is now 0.64, and we have not lost any recall for class 0. This means that we have started to classify potential event participants much more correctly.


What next?

I have talked about one of the options for improving the quality of the raw data. To improve the final classification algorithm, you can experiment some more:

  • The final classification method. In addition to Random Forest you can try other methods. You should also optimise the hyperparameters of the algorithm and the threshold probability at which the model assigns an object to class 1.
  • The threshold probability value for relabelling. In our example this value is 0.5. It can be moved either way, depending on the result you want: to relabel as many or as few players as possible. In general, when selecting the threshold value, you should be guided first by common sense, then by some kind of reference (if there is one, of course) and, as demonstrated above, by comparing the distributions of key metrics in the actual and relabelled classes.
  • A different approach to relabelling the sample. On the Internet one can find implementations of relabelling based, for example, on clustering. First an unsupervised learning problem is solved and all the data is divided into segments. Each segment is then assigned the class label that is most frequent among its objects. Objects whose label differs from the one assigned to their segment are thus relabelled.
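The segment-based idea from the last point can be sketched with k-means. This is one possible reading of it, not a reference implementation: the data is a made-up one-dimensional feature, and both the choice of `KMeans` and the number of segments are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 1-D feature with two obvious groups (made-up values).
X = np.array([[1.0], [1.2], [0.8], [1.1], [9.0], [9.5], [8.7], [9.2]])
# Observed labels: one object in the second group carries the "wrong" label 0.
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])

# Step 1: unsupervised segmentation of the data.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: every object receives the most frequent label of its segment.
y_new = y.copy()
for s in np.unique(segments):
    in_segment = segments == s
    majority = np.bincount(y[in_segment]).argmax()
    y_new[in_segment] = majority

print(y_new)
```

The majority vote only helps when correctly labelled objects dominate each segment; if most would-be participants in a segment are recorded as class 0, this rule would relabel the real participants instead.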

Hopefully you will rarely encounter such situations in your work, but if you do, you will find this material useful!

