Becoming an Angeleno
* Photo taken while at my first Dodgers game.
In this post, I document how I extracted data from the MLB Data API, processed it into a roster with team statistics, and stored the results in PostgreSQL. This data warehouse will support various analyses and the creation of prediction models in the future.
Inspiration
I was inspired to do this after looking through my photos and finding the pictures from my first L.A. Dodgers game. I went last year with the first friends I met in L.A. because one of them wanted to go to get the “true American experience.” Once I walked through the stadium and saw the beautiful view of the field and the palm trees on the horizon, I instantly fell in love with the culture. I couldn’t help but join the crowd whenever a player advanced a base, and I loved hearing the walk-up music that showed the team’s personality. Mookie Betts instantly became my favorite player, and I made sure to grab a blanket on the way out so I could bring my experience back home. Now, every time I wear my Dodgers hat, I feel like a true L.A. local.
So, it only made sense to download data specific to the L.A. Dodgers for this project.
Extracting Data from the MLB API
Unfortunately, the MLB Data API doesn’t have extensive documentation, which made it a bit challenging to extract all the data I needed. After reviewing several API endpoints, I settled on four that provided the information I was looking for:
Team Profiles - by Active: This endpoint helped me find the team ID and team key, which were necessary to access other data, such as active players, games played in 2024, and box scores.
Player Profiles - by Team: This endpoint allowed me to gather information about all the active players on the Dodgers, which I used to build the team roster.
Schedules: This endpoint helped me retrieve the game IDs for each of the Dodgers games played in 2024.
Box Score [Final]: Finally, I used this endpoint to pull player statistics after each game.
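The way these four endpoints chain together can be sketched roughly as follows. This is an illustration of the flow, not the project's actual code: the endpoint labels, the `fetch_json` helper, and every field name (`team_id`, `game_id`, `home_team_id`, `away_team_id`) are assumptions made for the example, since the real URLs and response shapes are not shown in this post. The sketch also uses only the team ID, although the text mentions a team key as well.

```python
def collect_box_scores(fetch_json, team_name, season):
    """Walk the four endpoints in order and return the raw results.

    `fetch_json(endpoint, **params)` is a hypothetical helper supplied by the
    caller; it should perform the HTTP request and return decoded JSON.
    """
    # 1. Team profiles: find the ID of the team we care about.
    teams = fetch_json("team_profiles_active")
    team_id = next(t["team_id"] for t in teams if t["name"] == team_name)

    # 2. Player profiles by team: the active roster.
    roster = fetch_json("player_profiles_by_team", team_id=team_id)

    # 3. Schedules: keep only the games this team played in the season.
    schedule = fetch_json("schedules", season=season)
    game_ids = [
        game["game_id"]
        for game in schedule
        if team_id in (game["home_team_id"], game["away_team_id"])
    ]

    # 4. Final box scores: one request per game.
    box_scores = [fetch_json("box_score_final", game_id=gid) for gid in game_ids]

    return {"team_id": team_id, "roster": roster, "box_scores": box_scores}
```

Passing the fetch function in as an argument keeps the API key, base URL, and rate limiting in one place, and makes it easy to swap in canned responses while developing.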
Processing the Data and Storing It in PostgreSQL
After gathering the necessary data, I loaded everything into a data frame. I then calculated stats such as slugging percentage, on-base percentage, and OPS. Once the calculations were complete, I uploaded the data to PostgreSQL.
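For reference, the three stats follow the standard formulas: on-base percentage is (H + BB + HBP) / (AB + BB + HBP + SF), slugging percentage is total bases divided by at-bats, and OPS is the sum of the two. A minimal plain-Python sketch of summing per-game box score rows into season totals and applying those formulas is below. The project itself used a data frame, and the row layout and key names here (`player`, `ab`, `h`, `doubles`, and so on) are assumptions made for the example.

```python
from collections import defaultdict

# Counting stats needed for OBP and SLG (key names are hypothetical).
COUNTING_STATS = ("ab", "h", "doubles", "triples", "hr", "bb", "hbp", "sf")


def season_totals(game_rows):
    """Sum per-game counting stats into one totals dict per player."""
    totals = defaultdict(lambda: dict.fromkeys(COUNTING_STATS, 0))
    for row in game_rows:
        for stat in COUNTING_STATS:
            totals[row["player"]][stat] += row.get(stat, 0)
    return dict(totals)


def rate_stats(t):
    """Compute OBP, SLG, and OPS from one player's season totals."""
    singles = t["h"] - t["doubles"] - t["triples"] - t["hr"]
    total_bases = singles + 2 * t["doubles"] + 3 * t["triples"] + 4 * t["hr"]
    obp_denominator = t["ab"] + t["bb"] + t["hbp"] + t["sf"]
    # Guard against players with no plate appearances or at-bats.
    obp = (t["h"] + t["bb"] + t["hbp"]) / obp_denominator if obp_denominator else 0.0
    slg = total_bases / t["ab"] if t["ab"] else 0.0
    return {"obp": obp, "slg": slg, "ops": obp + slg}
```

Computing the rate stats from season totals, rather than averaging per-game percentages, weights each game by its actual number of plate appearances.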
With this process, I now have a database that is ready for deeper analysis and for building prediction models based on player performance throughout the 2024 season.
Click here to view the entire project on GitHub.