The demand for real time processing has increased significantly, because processing huge volumes of data is not, on its own, enough to react to changing business conditions in real time. Real time processing is required when data must be processed quickly and the resulting actions must be computed and initiated in real time.
A real time processing system involves the following stages:
Real time processing requires data to be processed as it flows (streaming data). Processing streaming data has its own challenges and limitations. A real time processing platform should be able:
- to handle huge amounts of data, possibly arriving at bursty rates.
- to process streaming data online and in real time.
- to adapt autonomously to a changing environment and changing data patterns.
- to be computationally efficient.
- to preserve interpretability and transparency as the data and the model change.
In this blog, we will look at a use case: how a real time processing system can be combined with machine learning to detect credit card fraud.
Use case: Credit card fraud detection
Statistics show that 4 out of every 10,000 active credit card accounts are fraudulent, and that credit card companies lose approximately seven cents per hundred dollars as a result. For 2010, this translated into roughly $8.6 billion. (source: cnbc, wikipedia)
Credit card fraud needs to be detected in near real time so that it can be reacted to quickly. Many companies have started building systems that detect fraudulent transactions and send a notification whenever one occurs. Detecting suspicious transactions in real time is critical in financial systems. Once a fraud is detected, possible actions are:
- Email or SMS the credit card holder
- Initiate a manual verification process
- Block the card from any transactions till verified
The amount of data generated by these transactions is huge and requires a big data platform to process. Here we describe a scalable platform for the real time processing system mentioned above.
Fraud detection system
Credit card fraud detection is more than a real time stream processing problem: it also requires techniques to decide whether a transaction is fraudulent or not. These detection techniques use machine learning algorithms to learn the cardholder’s usage patterns and detect suspicious activity or anomalies.
Transactions are fed in real time into the fraud detection system, which analyzes each one against the knowledge it has built from the user's past transaction history and decides whether the current transaction is fraudulent or not. If a transaction is detected as fraudulent, the system can take action.
Fig 1.1 High level architecture
Credit card fraud detection involves two types of processing:
Offline processing: The transaction history of each credit card is analyzed to understand its usage patterns. A model is then built to classify future transactions as fraudulent or not.
Real time processing: New transactions are matched against the model (built during the offline processing) to detect whether they are fraudulent or not.
In the offline processing stage, historic data is preprocessed and analyzed to generate a learning model for each card. HDFS is well suited to storing the huge amount of historic credit card transaction data. Spark is used to cleanse, normalize and process the data and to generate a learning model per card; the models are then persisted into HBase. HBase efficiently supports random, real time read/write access to data.
Fig 1.2 Preprocessing and analysis of historic data
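To make the cleanse and normalize steps more concrete, here is a minimal sketch in plain Python. It is not the Spark job itself: the record fields (`card_id`, `amount`), the function names and the choice of z-score normalization are all assumptions made for illustration, and an ordinary dict stands in for the HBase table keyed by card id. Fitting the actual model is left out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cleanse(records):
    """Drop records with a missing card id or a non-numeric / non-positive amount."""
    clean = []
    for rec in records:
        amount = rec.get("amount")
        if rec.get("card_id") and isinstance(amount, (int, float)) and amount > 0:
            clean.append(rec)
    return clean


def build_card_profiles(records):
    """Group cleansed records by card and compute per-card normalization statistics."""
    amounts_by_card = defaultdict(list)
    for rec in cleanse(records):
        amounts_by_card[rec["card_id"]].append(float(rec["amount"]))
    return {
        card_id: {
            "count": len(amounts),
            "mean_amount": mean(amounts),
            "std_amount": pstdev(amounts),
        }
        for card_id, amounts in amounts_by_card.items()
    }


def normalize_amount(profile, amount):
    """z-score of an amount against the card's history (0.0 if the history has no spread)."""
    if profile["std_amount"] == 0:
        return 0.0
    return (amount - profile["mean_amount"]) / profile["std_amount"]


# Illustrative history; the last two records are malformed and get dropped.
history = [
    {"card_id": "card-1", "amount": 20.0},
    {"card_id": "card-1", "amount": 40.0},
    {"card_id": "card-2", "amount": 100.0},
    {"card_id": None, "amount": 15.0},
    {"card_id": "card-2", "amount": -5.0},
]

# A plain dict stands in for the HBase table (assumption for this sketch).
profile_store = build_card_profiles(history)
```

Keying the store by card id mirrors the random read access pattern that the real time stage needs: one lookup per incoming transaction.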
Real time processing
Kafka acts as a buffer that holds the transaction streams. Kafka can handle hundreds of megabytes of reads and writes per second from thousands of clients and is very well suited to real time streaming use cases. Spark loads the model (generated during offline processing) from HBase and uses it to classify each transaction in the real time stream from Kafka as fraudulent or not. The predicted results are persisted into HBase. Elasticsearch provides real time search and analytics capabilities, while Kibana is a flexible analytics and visualization platform that enables real time summaries and charts of streaming data.
Fig 1.3 Processing streams of real time data
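The flow described above (read a transaction, look up the card's model, score it, persist the result) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a list stands in for the Kafka stream, dicts stand in for the HBase model and result tables, and the field names, the model layout (`weights`, `bias`) and the handling of cards without a model are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(model, features):
    """Fraud probability for one transaction under a per-card logistic model."""
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return sigmoid(z)


def process_stream(transactions, model_store, result_store, threshold=0.5):
    """Score each incoming transaction and persist the prediction."""
    for txn in transactions:
        model = model_store.get(txn["card_id"])
        if model is None:
            # Assumption: cards without a model are recorded as unscored.
            result_store[txn["txn_id"]] = {"probability": None, "fraud": None}
            continue
        probability = score(model, txn["features"])
        result_store[txn["txn_id"]] = {
            "probability": probability,
            "fraud": probability >= threshold,
        }


# Stand-ins for HBase (models, results) and Kafka (the stream); values are illustrative.
model_store = {"card-1": {"weights": [2.0, 1.0], "bias": -3.0}}
stream = [
    {"txn_id": "t1", "card_id": "card-1", "features": [0.0, 0.0]},
    {"txn_id": "t2", "card_id": "card-1", "features": [3.0, 1.0]},
    {"txn_id": "t3", "card_id": "card-9", "features": [1.0, 1.0]},
]
results = {}
process_stream(stream, model_store, results)
```

In this sketch `t1` is scored as normal, `t2` as fraudulent, and `t3` is left unscored because no model exists for its card.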
The goal is to find out whether the card is being used by someone other than the cardholder. This requires understanding the cardholder’s usage pattern in order to detect anomalies.
Logistic regression is used to generate models based on the usage patterns. Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables (called “features”) determine an outcome. Here the outcome is binary: a transaction is either fraudulent or not.
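As a rough illustration of what such a model computes, logistic regression turns a weighted sum of the features into a probability, p = 1 / (1 + e^-(b + w·x)), and the weights w and bias b are learned from labelled history. The sketch below fits them with plain batch gradient descent on a toy, single-feature dataset. The function names, the learning rate and the data are assumptions for illustration; in the platform described here this step would be done by Spark rather than hand-written code.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive (fraud) class."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


def train_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Toy data: one normalized feature per transaction, label 1 = fraudulent.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
```

With this toy data the learned weight is positive, so larger feature values push the predicted fraud probability towards 1.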
The modelling approach should:
- detect and reduce noise in arriving data instances
- detect concept drift
- support model simplification and evaluation
- allow fast algorithms and heuristics to be developed
Features are derived from the transaction history of the cardholder. For credit cards, the features (independent variables) can be:
- Number of transactions per day
- Location and time difference between two transactions
- Average transaction amount per day
- Transaction currency, etc.
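A few of the features listed above can be derived from raw transactions as in the sketch below. The record layout (`timestamp`, `amount`, `city`) and the function names are hypothetical, and comparing city names is a deliberately crude stand-in for a real location difference.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_features(transactions):
    """Per day: number of transactions and average transaction amount for one card."""
    amounts_by_day = defaultdict(list)
    for txn in transactions:
        amounts_by_day[txn["timestamp"].date()].append(txn["amount"])
    return {
        day: {"txn_count": len(amounts), "avg_amount": sum(amounts) / len(amounts)}
        for day, amounts in amounts_by_day.items()
    }


def gaps_between_transactions(transactions):
    """For each consecutive pair in time order: seconds elapsed and whether the city changed."""
    ordered = sorted(transactions, key=lambda t: t["timestamp"])
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        gaps.append({
            "seconds": (current["timestamp"] - previous["timestamp"]).total_seconds(),
            "location_changed": current["city"] != previous["city"],
        })
    return gaps


# Illustrative transactions for a single card.
card_history = [
    {"timestamp": datetime(2015, 3, 1, 9, 0), "amount": 20.0, "city": "city-A"},
    {"timestamp": datetime(2015, 3, 1, 9, 30), "amount": 40.0, "city": "city-A"},
    {"timestamp": datetime(2015, 3, 2, 10, 0), "amount": 90.0, "city": "city-B"},
]

per_day = daily_features(card_history)
gaps = gaps_between_transactions(card_history)
```

A short time gap combined with a change of location is the kind of signal these features are meant to expose to the model.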
The model is generated using the features derived from past transactions and is then tested against real data. Features can be added or removed based on the outcome, and the process is repeated until satisfactory results are achieved.
The final outcome is a probability, which is compared against a manually chosen threshold to mark each transaction as fraudulent or not.
In this post we covered the different stages and the architecture of a real time processing system. In the upcoming blog we will cover more about logistic regression and how it solves problems similar to credit card fraud detection. Neural networks, decision trees and support vector machines can also be used to solve the same problem.