Churn Baby Churn
Photo taken while volunteering in Skid Row/DTLA.
View the whole project on GitHub here.
Background
This project focused on customer churn analysis—predicting whether a customer would leave based on their historical data. The dataset originated from an earlier project, where I created a bot to scrape and modify Excel workbooks. I wanted to do a little bit more with the dataset and this gave me the perfect excuse to use it.
Tools Used
The raw data from my previous automation project was imported into a PostgreSQL database. This allowed for efficient storage, management, and querying. Before importing, I further modified the dataset by enriching it with additional calculations and cleaning up inconsistencies.
Then, using SQL, I conducted exploratory data analysis (EDA) to uncover patterns, clean the data, and prepare it for modeling. Key steps included handling missing values, identifying outliers, and creating calculated columns to enrich the dataset.
The Model
I selected a Random Forest classifier for its robustness and ability to handle both numerical and categorical data effectively. This decision was influenced by its reputation for strong performance on structured data and minimal hyperparameter tuning requirements.
The data was split into training and test sets. Using scikit-learn, I trained the Random Forest model and fine-tuned it through cross-validation.
One challenge was ensuring the SQL transformations aligned with the requirements of the machine learning model. For example, categorical columns needed encoding, and normalization was applied to numerical features where necessary.
Model Performance
The Random Forest classifier achieved an accuracy of 100%, as shown in the classification report below:
While the perfect accuracy score is impressive, it is important to note some potential concerns:
Data Imbalance: The dataset appears to be imbalanced, with a significantly higher number of True instances (158) compared to False instances (25). This could lead to overly optimistic performance metrics since the model might focus more on the majority class.
Potential Overfitting: Given the perfect precision, recall, and F1-scores, there is a possibility that the model has overfitted to the training data, especially if the test set is not sufficiently diverse.
Analyzing the Data with SQL
Through SQL analysis, I identified key drivers for customer churn. The process also reinforced my understanding of database management and the synergy between SQL and machine learning workflows. Find the results of the SQL queries here.
While I did not deploy this specific project, I could leverage Docker for containerizing the application and AWS for hosting the model as an API. This would enable scalable and accessible predictions for real-world applications. By implementing deployment strategies, I could provide a live service capable of ingesting new data, running predictions, and delivering results to end-users in real-time.
Until next time.
- Mali