The main challenges in detecting credit card fraud are:
How to solve these problems?
Before getting into the code, it is recommended that you work in a Jupyter notebook. If it is not installed on your computer, you can use Google Colab.
You can download the dataset from this link
If the link doesn't work, go to this link and log in to Kaggle to download the dataset.
Code: import all required libraries
Code: data loading
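The loading step can be sketched as follows. This is a minimal illustration, not the original code: the file name `creditcard.csv` is an assumption, and the in-memory CSV below is a made-up stand-in with only a few of the 31 columns so that the snippet runs without the download.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; the values are illustrative only.
# With the downloaded dataset this would be pd.read_csv("creditcard.csv"),
# where the file name is an assumption.
sample_csv = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0,-1.2,0.3,150.00,0\n"
    "1,0.9,-0.1,2.50,0\n"
    "2,-2.4,1.8,0.00,1\n"
)
data = pd.read_csv(sample_csv)

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first rows, to check that the columns were parsed
```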
Code: understanding the data
Code: data description
Output:
(284807, 31)
                Time            V1  ...         Amount          Class
count  284807.000000  2.848070e+05  ...  284807.000000  284807.000000
mean    94813.859575  3.919560e-15  ...      88.349619       0.001727
std     47488.145955  1.958696e+00  ...     250.120109       0.0415
min         0.000000 -5.640751e+01  ...       0.000000       0.000000
25%     54201.500000 -9.203734e-01  ...       5.600000       0.000000
50%     84692.000000  1.810880e-02  ...      22.000000       0.000000
75%    139320.500000  1.315642e+00  ...      77.165000       0.000000
max    172792.000000  2.454930e+00  ...   25691.160000       1.000000
[8 rows x 31 columns]
Code: data imbalance
Time to explain the data we are dealing with.
Only 0.17% of all transactions are fraudulent, so the data is highly imbalanced. Let's first apply our models without balancing the data. If we don't get good accuracy, we can then find a way to balance the dataset.
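The 0.17% figure can be checked with a short calculation. The sketch below uses the row counts that appear in the `describe()` outputs of this article (492 fraudulent and 284315 valid transactions). It also shows why accuracy alone can mislead on such data: a hypothetical model that labels every transaction as valid would still score very high.

```python
# Row counts taken from the describe() outputs shown in this article.
n_fraud = 492
n_valid = 284315
n_total = n_fraud + n_valid  # 284807 rows in total

fraud_share = n_fraud / n_total

# A hypothetical model that always predicts "valid" never catches fraud,
# yet its accuracy equals the share of valid transactions.
baseline_accuracy = n_valid / n_total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting 'valid': {baseline_accuracy:.4%}")
```

This is why precision, recall, and related scores are reported alongside accuracy later on.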
Code: print information about the amount of the fraudulent transaction
Output:
Amount details of the fraudulent transaction
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
Code: print information about the amount of the valid transaction
Amount details of valid transaction
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
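As we can see, the mean transaction amount is higher for fraudulent transactions (about 122.21) than for valid ones (about 88.29), so Amount carries some information about the class. This is part of what makes the problem solvable.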
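The two summaries above can be produced by filtering on the `Class` column and calling `describe()` on `Amount`. The snippet below is a sketch on a tiny made-up frame (the values are illustrative, not from the dataset). The names `fraud` and `valid` are chosen here for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same 'Amount' and 'Class' columns as the dataset.
data = pd.DataFrame({
    "Amount": [5.0, 22.0, 80.0, 1.0, 300.0],
    "Class": [0, 0, 0, 1, 1],
})

fraud = data[data["Class"] == 1]  # rows labelled as fraud
valid = data[data["Class"] == 0]  # rows labelled as valid

print("Amount details of the fraudulent transaction")
print(fraud["Amount"].describe())
print("Amount details of valid transaction")
print(valid["Amount"].describe())
```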
Code: Building a correlation matrix
A correlation matrix gives us a graphical idea of how features correlate with each other and can help us predict which features are most relevant for the prediction.
In the heatmap we can see that most features do not correlate with other features, but some features correlate positively or negatively with each other. For example, V2 and V5 correlate strongly negatively with the feature Amount. We also see some correlation between V20 and Amount. This gives us a deeper understanding of the data available to us.
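As a rough sketch of how such a matrix is computed, the snippet below builds a small synthetic frame and calls `DataFrame.corr()`. The data is generated here purely for illustration: the column named `Amount` is constructed to correlate negatively with `V2`, and `V1` is independent noise. None of the numbers come from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
v2 = rng.normal(size=n)

data = pd.DataFrame({
    "V1": rng.normal(size=n),  # independent of the other columns
    "V2": v2,
    # Built on purpose to correlate negatively with V2 (synthetic, for illustration).
    "Amount": -0.8 * v2 + rng.normal(scale=0.5, size=n),
})

corrmat = data.corr()  # pairwise Pearson correlation between columns
print(corrmat.round(2))
```

With seaborn installed, passing `corrmat` to `sns.heatmap` would draw the matrix as a heatmap.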
Code: Split X and Y Values
Split the data into input parameters (X) and output values (Y).
(284807, 30) (284807,)
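A minimal sketch of this step, assuming the label column is named `Class` as in the outputs above. The frame is a made-up four-row stand-in, so the printed shapes differ from the ones shown for the real dataset.

```python
import pandas as pd

# Made-up stand-in with the label column 'Class' and a few feature columns.
data = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "V1": [0.1, -0.2, 0.3, -0.4],
    "Amount": [10.0, 20.0, 30.0, 40.0],
    "Class": [0, 0, 1, 0],
})

X = data.drop(columns=["Class"])  # every column except the label
Y = data["Class"]                 # the label: 1 = fraud, 0 = valid

print(X.shape, Y.shape)

# Plain NumPy arrays, in case later steps expect arrays rather than DataFrames.
x_data, y_data = X.to_numpy(), Y.to_numpy()
```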
Splitting into training and testing sets
We will divide the dataset into two main groups. One for training the model and the other for testing the performance of our trained model.
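One way to do this split is scikit-learn's `train_test_split`. The sketch below runs on synthetic arrays. The 80/20 ratio and `random_state=42` are assumptions, and `stratify=y` is an addition not mentioned in the source: it keeps the share of the rare class the same in both groups, which can matter for imbalanced data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 2 features, 10% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# test_size and random_state are assumed values; stratify is optional.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("positives in train/test:", y_train.sum(), y_test.sum())
```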
Code: building a random forest model using scikit-learn
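A minimal sketch of fitting a random forest with scikit-learn. It uses synthetic imbalanced data from `make_classification` instead of the real dataset, and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions rather than the settings behind the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are assumed values for illustration.
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)  # predicted class for each test row
print(y_pred[:10])
```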
Code: computing the evaluation metrics
The model used is Random Forest classifier
The accuracy is 0.9995611109160493
The precision is 0.9866666666666667
The recall is 0.7551020408163265
The F1-Score is 0.8554913294797689
The Matthews correlation coefficient is 0.8629589216367891
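To see how these scores relate to each other, they can be recomputed from confusion-matrix counts. The counts below are not given in the source. They are inferred: 74 true positives, 1 false positive, and 24 false negatives are consistent with the reported precision (74/75) and recall (74/98), and the test-set size of 56962 assumes a 20% split of the 284807 rows. Treat them as an assumption.

```python
import math

# Inferred counts (an assumption, not stated in the source).
tp, fp, fn = 74, 1, 24
n_test = 56962               # assumes a 20% test split of 284807 rows
tn = n_test - tp - fp - fn   # remaining rows are true negatives

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("F1", f1), ("MCC", mcc)]:
    print(f"{name}: {value:.6f}")
```

Under these assumed counts, 25 errors out of 56962 rows give the very high accuracy, while the 24 missed fraud cases are what hold recall down to about 0.755.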
Code: visualizing the confusion matrix
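The confusion matrix itself can be computed with scikit-learn's `confusion_matrix`. The sketch below uses short made-up label lists and prints the matrix as text. The label names `Normal` and `Fraud` are chosen here for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = fraud, 0 = normal).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
labels = ["Normal", "Fraud"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(f"{'':>14}{'pred Normal':>13}{'pred Fraud':>13}")
for name, row in zip(labels, cm.tolist()):
    print(f"{'true ' + name:>14}{row[0]:>13}{row[1]:>13}")
```

With seaborn installed, `sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)` would draw the same matrix as an annotated heatmap.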
Comparison with other algorithms, without dealing with the data imbalance.
As you can see, our random forest model gives better results even for recall, which is the hardest part.