Mathematically, the chi-square test is performed for two distributions, two of which determine the similarity level of their respective variances. In his null hypothesis , he assumes that these distributions are independent. Thus, this test can be used to determine the best performance for a given dataset by identifying the characteristics on which the output class label is most dependent. For each function in the dataset, is calculated and then ordered in descending order according to
value. The higher the value
the more the output label depends on the function and the higher the value that the function has to determine the output.
Suppose the object under consideration has m attribute values, and the output data has k class labels. Then the value is given by the following expression:
where
— observed frequency
— expected frequency
A contingency table with m rows and k columns is created for each object. Each cell (i, j) denotes the number of rows that have the attribute attribute as i and the class label as k. Thus, each cell in this table represents the observed frequency. To calculate the expected frequency for each cell, the fraction of the function value in the total dataset is first calculated and then multiplied by the total number of labels in the current class.
Solved example:
Consider the following table:
The output variable here is a column named “ PlayTennis ", which determines whether you played tennis on a given day, taking into account the weather.
The contingency table for the Outlook function is structured as follows: