My First Data Science Project

JPMC QUANT CHALLENGE 2023

James Sawyer
4 min read · May 23, 2023

I have always considered myself to be an avid learner, and the learning project that I have recently undertaken over the past five months (or so) has been in the field of Data Science. The following is my first go at a supervised learning project using the JPMC QUANT CHALLENGE 2023 dataset from Kaggle.


Introduction

You may not know this about me, but I studied Finance, with an emphasis on Investment Management, at university. I competed in stock market challenges, studied portfolio management, and researched the fundamentals of different companies, but I had never dived into Machine Learning for any of my coursework (logically). So when I began studying the field of data science, one question always lingered in my mind: “Can you predict the price of a stock?” (not a unique idea, I know). For my first project, I thought it would be fun to try to answer that very question.

Let’s Talk Data

I used the JPMC Quant Challenge dataset from Kaggle. The file has 15,000 rows and 12 columns. Each column contains fundamental data about a particular stock, with a total of 100 stocks in the dataset. For the sake of this project, I chose to focus on a single stock.
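
To give a rough idea of the setup, loading the file and pulling out one stock might look like the sketch below. The file name and the stock_id column are assumptions on my part, not the exact schema of the Kaggle download.

import pandas as pd

# Load the Kaggle export. The file name and the "stock_id" column are
# assumptions; the actual download may use different names.
df = pd.read_csv("jpmc_quant_challenge_2023.csv")
print(df.shape)                      # (15000, 12)
print(df["stock_id"].nunique())      # 100 stocks

# Keep a single stock for the rest of the analysis
single = df[df["stock_id"] == df["stock_id"].iloc[0]].copy()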

The dataset itself was rather clean, and only a few of the columns had outliers.

Visualizations

Histogram for each column
Boxplot for each column
Price of the stock over time
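
For reference, these three plots can be produced along the lines below, continuing from the loading sketch above. The date and price column names are my own guesses at the schema.

import matplotlib.pyplot as plt

numeric = single.select_dtypes("number")

# Histogram for each numeric column
numeric.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Boxplot for each numeric column
numeric.plot(kind="box", subplots=True, layout=(3, 4), figsize=(12, 8))
plt.tight_layout()
plt.show()

# Price of the stock over time (assumes "date" and "price" columns)
single.sort_values("date").plot(x="date", y="price", figsize=(10, 4))
plt.show()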

I winsorized the data to cap the outliers at the tails of each column rather than drop those rows outright.
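
A minimal sketch of the winsorization step, using SciPy. The 5th/95th percentile limits shown are illustrative rather than the exact cut-offs I used.

from scipy.stats.mstats import winsorize

# Cap the tails of each numeric column at the 5th and 95th percentiles
# (the limits are illustrative, not the exact values used)
for col in numeric.columns:
    single[col] = winsorize(single[col], limits=[0.05, 0.05])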

At this point, I standardized the dataset. Then I performed a correlation analysis to examine the relationships between the variables and between each variable and the target. The following three visualizations are what I found most interesting.
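
The standardization and correlation step looked roughly like this, treating price as the target, which, again, is an assumed column name.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Standardize the features (everything except the assumed "price" target)
features = single.select_dtypes("number").drop(columns=["price"])
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      columns=features.columns, index=features.index)

# Correlation between the features and the target
corr = pd.concat([scaled, single["price"]], axis=1).corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()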

As you can see, the variables often did not correlate linearly with the target, but some of the variables had a linear correlation with each other (based on visualization alone). During this stage, I also created a new feature called “Working Capital”, the difference between current assets and current liabilities, since those two variables were highly correlated with each other.
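
The feature itself is a one-liner; the column names below are assumed labels for the fundamentals.

# New feature: Working Capital = current assets minus current liabilities
# (the column names are assumptions about the dataset's labels)
single["working_capital"] = (
    single["current_assets"] - single["current_liabilities"]
)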

Results

I tested two different Supervised Learning Models. The first model was Ordinary Least Squares (OLS). The OLS model, in hindsight, was a poor choice. The variables did not appear to be linearly correlated with the target, and the OLS model reflected that with an R-squared value of 0.049.
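
A sketch of the OLS baseline with scikit-learn. The 80/20 split and the random seed are illustrative choices, not necessarily what produced the exact figure above.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized feature matrix after the feature engineering above;
# "price" as the target column is still an assumption.
X = single.select_dtypes("number").drop(columns=["price"])
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)
y = single["price"]

# Illustrative 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ols = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, ols.predict(X_test)))   # compare with the 0.049 reported above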

The second Supervised Learning Model that I chose for the project was the Support Vector Regression (SVR). SVR can model non-linear relationships more effectively than OLS, and that became apparent when running the model. I achieved an R-squared value of 0.56.
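
The untuned SVR baseline is nearly identical to set up, reusing the split from the OLS sketch.

from sklearn.svm import SVR
from sklearn.metrics import r2_score

# SVR with the default RBF kernel, before any tuning
svr = SVR(kernel="rbf")
svr.fit(X_train, y_train)
print(r2_score(y_test, svr.predict(X_test)))   # compare with the 0.56 reported above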

Using the following hyperparameters, I was able to achieve an R-squared score of 0.91: gamma=0.5, C=10, epsilon=0.05.
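
A sketch of how that tuning could be done with a small grid search. The grid itself is illustrative; its best combination matches the values quoted above.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Illustrative grid whose best combination matches the quoted values
param_grid = {
    "gamma":   [0.1, 0.5, 1.0],
    "C":       [1, 10, 100],
    "epsilon": [0.01, 0.05, 0.1],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # expected: {'C': 10, 'epsilon': 0.05, 'gamma': 0.5}
print(r2_score(y_test, search.best_estimator_.predict(X_test)))   # compare with the 0.91 above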

Conclusion

I truly enjoyed this project and am proud of the results. I also admit that there are limitations to the project itself. For example, I only analyzed one of the one hundred available stocks in the dataset, so the analysis can definitely be scaled and improved upon further. I would also like to note that a machine learning model for stock prediction should only be used as one tool within a fuller analysis, as there are many outside factors that come into play when predicting a stock’s price.

After completing this project, I also performed a time series forecasting analysis on the stock, which I will discuss in a subsequent post.
