Building an end-to-end Movie Recommendation App in iOS with Machine Learning

Kevin Abram
29 min read · Sep 29, 2024


Photo by Mika Baumeister on Unsplash

In this article, I want to share my learnings in building an end-to-end Movie Recommendation App in iOS with Machine Learning. This includes creating the iOS app with SwiftUI, creating the backend using Flask, and building the machine learning model using Python.

I will also include a step-by-step guide to building the full app, covering the data analysis, data wrangling, the machine learning model, the backend, and the iOS app.

So what would it look like?

As shown in the demo video above, the main features of this app are:

  1. The login screen to sign in as a specific user.
  2. The home screen to get movies the user has watched and the recommended movies (that utilize machine learning).
  3. The detail screen to get more information about a specific movie.
  4. The profile screen to sign out and then sign in as another user.

For the full code, you can refer to this GitHub repository: https://github.com/kevinabram111/Movie-Recommendation

To build this app, I used several tools and programming languages:

  1. Mobile iOS App: Xcode, Swift and SwiftUI
  2. Backend Service: Flask and Python
  3. Machine Learning Model: Python (using the scikit-surprise library) in Google Colab
  4. Data Analysis and Data Wrangling: R and RStudio (you can use Visual Studio Code as an alternative)

So, without further ado, let us have a deeper dive into the step-by-step process of recreating this app.

Photo by Kelly Sikkema on Unsplash

Part 1: The requirements gathering

For the requirements, we will need several pieces of data and tools to get started:

  1. MovieLens Dataset: https://grouplens.org/datasets/movielens. We will be using the small (roughly 1 MB) dataset since it is enough for this case.
  2. TMDB Account API: https://developer.themoviedb.org/reference/intro/getting-started
  3. Python: https://www.python.org
  4. Flask: https://pypi.org/project/Flask. You can then find the quickstart here: https://flask.palletsprojects.com/en/3.0.x/quickstart
  5. Xcode: https://developer.apple.com/xcode
  6. RStudio: https://posit.co/download/rstudio-desktop

A note: if you only want to run the app, the TMDB account is optional, since the backend reads from the CSV files and does not call any external API.

The TMDB account is used for data scraping, which I will explain in the next few steps.

Photo by Lukas Blazek on Unsplash

Part 2: Preliminary Data Analysis

I will be using R for all of the data analysis. You can open the file using RStudio, which you can install from this link: https://posit.co/download/rstudio-desktop.

However, in this step you are free to use Python or another tool to explore the dataset further. In my case, I used the code below:


library(dplyr)
library(tidyr)
library(ClusterR)
library(ggplot2)
library(tidyverse)

# Load the datasets
movies <- read.csv("movies.csv")
ratings <- read.csv("ratings.csv")
links <- read.csv("links.csv")

# Check the structure of the datasets
str(movies)
str(ratings)
str(links)

# Summary of the datasets
summary(movies)
summary(ratings)
summary(links)

# Count the number of movies by genre (genres are combined in the 'genres' column)
# Split genres into separate rows for more detailed analysis
movies_genres <- movies %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Plot the count of movies by genre
ggplot(movies_genres, aes(x = reorder(genres, count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Count of Movies by Genre", x = "Genre", y = "Count")

# Distribution of ratings
ggplot(ratings, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "coral", color = "black") +
  labs(title = "Distribution of Ratings", x = "Rating", y = "Count")

# Number of ratings per movie
ratings_per_movie <- ratings %>%
  group_by(movieId) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Plot the number of ratings per movie (top 20)
top_20_movies <- head(ratings_per_movie, 20)
ggplot(top_20_movies, aes(x = reorder(movieId, -count), y = count)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Top 20 Movies with Most Ratings", x = "Movie ID", y = "Number of Ratings")

# Check for missing values in the links dataset
missing_links <- sum(is.na(links$tmdbId))

# Output the number of missing TMDB IDs
cat("Number of missing TMDB IDs:", missing_links)

# Merge the movies and ratings datasets
movies_ratings <- merge(movies, ratings, by = "movieId")

# Average rating for each movie
avg_ratings <- movies_ratings %>%
  group_by(title) %>%
  summarise(avg_rating = mean(rating), rating_count = n()) %>%
  arrange(desc(avg_rating))

# Plot top 10 highest-rated movies with more than 100 ratings
top_rated_movies <- avg_ratings %>%
  filter(rating_count > 100) %>%
  top_n(10, wt = avg_rating)

ggplot(top_rated_movies, aes(x = reorder(title, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "Top 10 Highest Rated Movies (with > 100 ratings)", x = "Movie Title", y = "Average Rating")

That is the full script I used for the preliminary data analysis. If it looks like a lot at once, don't worry; I will walk through it piece by piece.

library(dplyr)
library(tidyr)
library(ClusterR)
library(ggplot2)
library(tidyverse)

# Load the datasets
movies <- read.csv("movies.csv")
ratings <- read.csv("ratings.csv")
links <- read.csv("links.csv")

# Check the structure of the datasets
str(movies)
str(ratings)
str(links)

# Summary of the datasets
summary(movies)
summary(ratings)
summary(links)

In the code above, we load the libraries needed for the R script and then read the CSV files. Make sure you download the data from MovieLens here first: https://grouplens.org/datasets/movielens.

Make sure you download the small (about 1 MB) dataset. The larger datasets will take far too long for the scraping step described later. We will only use movies.csv, ratings.csv, and links.csv for this project.

The structure of the dataset

For the structure of the data, here is the information that we can get:

  1. There are about 9,700 different movies in the dataset; movieId values run from 1 up to 193609, so the IDs are not contiguous
  2. The title and genres columns are characters, which is a form of text
  3. There are 610 users, with userId values from 1 to 610
  4. The minimum rating is 0.5 and the maximum rating is 5, and both the mean and the median are around 3.5, so the ratings sit toward the upper half of the scale

# Count the number of movies by genre (genres are combined in the 'genres' column)
# Split genres into separate rows for more detailed analysis
movies_genres <- movies %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Plot the count of movies by genre
ggplot(movies_genres, aes(x = reorder(genres, count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Count of Movies by Genre", x = "Genre", y = "Count")

The next step is to get the individual movie genres and count them. Since the dataset has a structure like Action|Drama|Mystery, we will need to separate it by | first. Then we will display it on a bar chart and sort it from the genres with the highest number of counts to the lowest.

Count of movies by genre displayed in a bar chart

Based on the bar chart above, we can see that the top 5 movie genres with the highest number are Drama, Comedy, Thriller, Action, and Romance.

There are several potential reasons for this. One is that these genres may simply be the most popular among viewers; another is that some genres, such as Drama, may be easier to produce than others. However, we would need a deeper analysis to confirm this.

# Distribution of ratings
ggplot(ratings, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "coral", color = "black") +
  labs(title = "Distribution of Ratings", x = "Rating", y = "Count")

The next phase is to check the distribution of the ratings using a histogram. Since the ratings start at 0.5 and end at 5, a bin width of 0.5 is appropriate.

Distribution of Ratings through histogram

From the chart above, we can see that the ratings are concentrated toward the higher end of the scale: most movies in the dataset tend to receive good ratings rather than bad ones. With plenty of good ratings available, we can use this rating data to build our machine learning model later.

# Number of ratings per movie
ratings_per_movie <- ratings %>%
  group_by(movieId) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Plot the number of ratings per movie (top 20)
top_20_movies <- head(ratings_per_movie, 20)
ggplot(top_20_movies, aes(x = reorder(movieId, -count), y = count)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Top 20 Movies with Most Ratings", x = "Movie ID", y = "Number of Ratings")

The next phase is to find the movies with the highest number of ratings and show them as a bar chart. We will only plot the top 20, since there are a lot of movies in the dataset.

Top 20 movies with the most ratings displayed in a bar chart

From the chart above, we can see the movies with the most ratings. The number of ratings likely correlates with how many people watched a movie and how popular it was, but confirming that would require more data.

From the bar chart above, for example, the top 5 movies that have the most ratings are:

  1. Movie_id 356: Forrest Gump
  2. Movie_id 318: Shawshank Redemption
  3. Movie_id 296: Pulp Fiction
  4. Movie_id 593: Silence of the Lambs
  5. Movie_id 2571: The Matrix (1999)

# Merge the movies and ratings datasets
movies_ratings <- merge(movies, ratings, by = "movieId")

# Average rating for each movie
avg_ratings <- movies_ratings %>%
  group_by(title) %>%
  summarise(avg_rating = mean(rating), rating_count = n()) %>%
  arrange(desc(avg_rating))

# Plot top 10 highest-rated movies with more than 100 ratings
top_rated_movies <- avg_ratings %>%
  filter(rating_count > 100) %>%
  top_n(10, wt = avg_rating)

ggplot(top_rated_movies, aes(x = reorder(title, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "Top 10 Highest Rated Movies (with > 100 ratings)", x = "Movie Title", y = "Average Rating")

The next chart displays the top 10 highest-rated movies with more than 100 ratings. I use 100 ratings as a minimum benchmark because movies with only a handful of ratings can end up with averages at the extreme high or low end, which makes them unsuitable for this comparison.

Top 10 Highest Rated Movies displayed through bar chart

From the bar chart above, we can see that the top 10 highest-rated movies are: The Shawshank Redemption, The Godfather, Fight Club, The Godfather: Part II, The Departed, Goodfellas, The Dark Knight, The Usual Suspects, The Princess Bride, and Star Wars: Episode IV - A New Hope.

This is reassuring, since most of the highly rated movies here are well-known classics; titles like The Dark Knight and Star Wars remain popular to this day.

Photo by Leon Seibert on Unsplash

Part 3: Web Scraping

For the web scraping, make sure that you have installed RStudio and have created a TMDB account here: https://developer.themoviedb.org/reference/intro/getting-started

For the web scraping part, I used the TMDB movie-details endpoint and looped over each of the movies. The responses are then exported to JSON. But first, I want to test the endpoint.

library(httr)

url <- "https://api.themoviedb.org/3/movie/862"

queryString <- list(language = "en-US")

# Your API key and base URL
response <- VERB("GET", url, query = queryString, add_headers(''), content_type("application/octet-stream"), accept("application/json"))

content(response, "text")

Above is the simple code I used to test whether the endpoint works. If it does not work for you, you most likely need to supply your bearer token in add_headers. You can find the details in the TMDB API documentation.
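For reference, here is a minimal sketch of how that header can be filled in, assuming you have stored your TMDB API Read Access Token (the long v4 token) in a variable named api_key:

# A sketch: TMDB expects the Authorization header in the form "Bearer <token>"
api_key <- "YOUR_TMDB_READ_ACCESS_TOKEN"  # placeholder, replace with your own token
response <- VERB("GET", url, query = queryString,
                 add_headers('Authorization' = paste("Bearer", api_key)),
                 content_type("application/octet-stream"),
                 accept("application/json"))

Note that the scraping loop later passes api_key straight into add_headers, so in that script store the full "Bearer <token>" string in api_key.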

Test the API endpoint for scraping

Based on the results above, the movie details come back in JSON format, which means the request works. With that confirmed, let's implement the full loop.

# Load necessary libraries
library(httr)
library(jsonlite)

# Read the CSV file
movie_data <- read.csv("links.csv")

# Extract the tmdbId column
tmdb_ids <- movie_data$tmdbId

# Initialize an empty list to store the JSON responses
json_list <- list()

# Your API key and base URL
api_key <- ""
base_url <- "https://api.themoviedb.org/3/movie/"

# Loop over each tmdbId and make API requests
for (tmdb_id in tmdb_ids) {
  url <- paste0(base_url, tmdb_id)
  queryString <- list(language = "en-US")

  response <- VERB("GET", url, query = queryString,
                   add_headers('Authorization' = api_key),
                   content_type("application/octet-stream"),
                   accept("application/json"))

  # Parse the JSON content
  json_content <- content(response, "text")

  # Append the JSON content to the list
  json_list <- append(json_list, list(json_content))

  # Optional: Print progress
  cat("Processed tmdbId:", tmdb_id, "\n")
}

# Export the list of JSON objects to a file
writeLines(json_list, "movie_data.json")

cat("Data export complete. JSON saved to movie_data.json\n")

The next step is to loop over every tmdbId in the links CSV file. The loop calls the API endpoint once per tmdbId and fetches the details of the movie referenced by that ID.

Scraping process for the tmdb_id

When the code runs, it calls the API endpoint for each movie, one at a time. In other words, the loop works like this:

  1. Get the detail of tmdb_id 1 through the endpoint and append it to the JSON file
  2. Get the detail of tmdb_id 2 through the endpoint and append it to the JSON file
  3. Get the detail of tmdb_id 3 through the endpoint and append it to the JSON file, and so on.

The process takes a while, around 30 minutes or more as far as I remember. Note that it will take much longer if you choose a larger dataset, so just use the small dataset for now.

Sample of the scraped JSON result

After all of the JSON responses have been acquired, the output will look similar to the data above. The web scraping process has been successful, so let us move on to the next step.

Photo by Dan Loran on Unsplash

Part 4: Data Wrangling

For this step, I want to combine all of the data acquired from the JSON file and merge it into a single dataset, including the handling of NULL values. For our requirements, we need two kinds of data:

  1. Data of movies, complete with their image and descriptions
  2. Data of users, complete with their username and password

Let us get started with the movie data:

# This code is to combine several data into the movies data set, like imdb_id, original_title, overview, poster_path, backdrop_path and many more

# Load necessary libraries
library(dplyr)
library(readr)
library(jsonlite)

# Load the movies.csv file
movies <- read_csv("movies.csv")

# Load the links.csv file
links <- read_csv("links.csv")

# Add the "tt" prefix to the imdbId in links.csv
links <- links %>%
  mutate(imdbId = paste0("tt", imdbId))

# Initialize an empty list to store each parsed JSON line with relevant columns
json_list <- list()

# Open the JSON file and read it line by line
con <- file("movie_data.json", "r")
while (TRUE) {
  line <- readLines(con, n = 1, warn = FALSE)
  if (length(line) == 0) {
    break
  }

  # Try to parse the JSON and only extract the necessary fields
  json_parsed <- tryCatch(fromJSON(line, flatten = TRUE), error = function(e) NULL)

  if (!is.null(json_parsed)) {
    # Check and handle missing fields by assigning NA if the field doesn't exist or is empty
    imdb_id <- if (!is.null(json_parsed$imdb_id) && length(json_parsed$imdb_id) > 0) json_parsed$imdb_id else NA
    original_title <- if (!is.null(json_parsed$original_title) && length(json_parsed$original_title) > 0) json_parsed$original_title else NA
    overview <- if (!is.null(json_parsed$overview) && length(json_parsed$overview) > 0) json_parsed$overview else NA
    poster_path <- if (!is.null(json_parsed$poster_path) && length(json_parsed$poster_path) > 0) json_parsed$poster_path else NA
    backdrop_path <- if (!is.null(json_parsed$belongs_to_collection$backdrop_path) &&
                        length(json_parsed$belongs_to_collection$backdrop_path) > 0) {
      json_parsed$belongs_to_collection$backdrop_path
    } else {
      NA
    }

    # Create a data frame for this row and append it to the list
    json_list <- append(json_list, list(
      data.frame(imdb_id = imdb_id,
                 original_title = original_title,
                 overview = overview,
                 poster_path = poster_path,
                 backdrop_path = backdrop_path,
                 stringsAsFactors = FALSE)
    ))
  }
}
close(con)

# Combine the list into a single data frame
json_data <- bind_rows(json_list)

# Merge the links.csv with json_data using imdbId and imdb_id for matching
links_json_combined <- links %>%
  left_join(json_data, by = c("imdbId" = "imdb_id"))

# Merge the combined data with movies.csv by movieId
final_combined <- movies %>%
  left_join(links_json_combined, by = "movieId")

# Write the final combined data to a new CSV file
write_csv(final_combined, "combined_movies_with_json.csv")

# View the first few rows of the final combined dataset
head(final_combined)

For the movie data, we combine movies.csv with the scraped JSON, using links.csv as the reference. One catch: the JSON stores IMDb IDs with a "tt" prefix (for example "tt0114709"), while the imdbId column in links.csv only contains 0114709.

Because of this, I add the "tt" prefix to links.csv before matching it against the JSON records. The fields we need from the JSON are:

  1. imdb_id: To be used as a reference
  2. original_title: To be used as the title of the movie
  3. overview: To be used as a description when we view the details of the movie
  4. poster_path: To be used on the image on the home of our app
  5. backdrop_path: To be used as the image that will be shown when we open the detail of the show

After the code has run, it creates a new file named combined_movies_with_json.csv with all of the combined information. (The machine learning and backend code later read this data as combined_movies.csv, so rename the file accordingly.)

# Load necessary libraries
library(dplyr)
library(readr)

# Load the ratings data
ratings <- read_csv("ratings.csv")

# Find duplicate userId
duplicate_users <- ratings %>%
  group_by(userId) %>%
  filter(n() > 1) %>%
  distinct(userId) # Ensure unique userId

# Create a new column called "password" with the value "pass"
# Create a new column called "username" by concatenating "user" with userId
duplicate_users <- duplicate_users %>%
  mutate(password = "pass",
         username = paste0("user", userId)) %>%
  select(userId, username, password) # Select userId, username, and password columns

# View the result
print(duplicate_users)

# Optionally, save the result to a new CSV file
write_csv(duplicate_users, "users.csv")

For the users, on the other hand, we refer directly to the ratings CSV file. We take each distinct userId without duplicates and then automatically generate a username and password. Every password is "pass" and every username follows the format user + id. For example (a sample of the resulting users.csv follows this list):

  1. userId 1 becomes the username "user1" with the password "pass"
  2. userId 2 becomes the username "user2" with the password "pass"
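The resulting users.csv looks roughly like this (a sketch based on the code above; the row order follows the userIds found in ratings.csv):

userId,username,password
1,user1,pass
2,user2,pass
3,user3,pass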

From that, we have successfully created the dataset that we can use for our backend. For the next steps, we will be using Python for creating the Machine Learning model.

Photo by Hitesh Choudhary on Unsplash

Part 5: Machine Learning Implementation

For this process, I will be using Google Colab and Python. Why? When I tried creating the machine learning model in R, something always went wrong that frustrated me. And since we use Flask as our backend, a model created with Python in Google Colab is much easier to plug into Flask itself.

from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

# Step 1: Install the 'surprise' library (for SVD, UBCF, and IBCF)
!pip install scikit-surprise

# Step 2: Import necessary libraries
import pandas as pd
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import cross_validate

# Load ratings.csv from Google Drive
ratings = pd.read_csv('/content/drive/My Drive/ratings.csv')

# Load combined_movies.csv from Google Drive
movies = pd.read_csv('/content/drive/My Drive/combined_movies.csv')

# Define a Reader object and specify the rating scale (0.5 to 5.0)
reader = Reader(rating_scale=(0.5, 5.0))

# Load the data from the pandas dataframe
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Train-test split (80% training, 20% testing)
trainset, testset = train_test_split(data, test_size=0.2)

# Initialize and train the SVD model
print("\nEvaluating SVD model:")
model = SVD()
model.fit(trainset)
predictions = model.test(testset)
accuracy.rmse(predictions)

# Initialize and train the User-based Collaborative Filtering (UBCF) model
print("\nEvaluating User-based Collaborative Filtering (UBCF) model:")
ubcf_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': True})  # UBCF
ubcf_model.fit(trainset)
ubcf_predictions = ubcf_model.test(testset)
accuracy.rmse(ubcf_predictions)

# Initialize and train the Item-based Collaborative Filtering (IBCF) model
print("\nEvaluating Item-based Collaborative Filtering (IBCF) model:")
ibcf_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': False})  # IBCF
ibcf_model.fit(trainset)
ibcf_predictions = ibcf_model.test(testset)
accuracy.rmse(ibcf_predictions)

# Function to recommend top-N movies for a specific user
def get_recommendations(user_id, model, data, n=10):
    # Get the list of all movie IDs
    all_movie_ids = data.df['movieId'].unique()

    # Get the movies the user has already rated
    rated_movies = data.df[data.df['userId'] == user_id]['movieId'].values

    # Get the list of movies the user hasn't rated yet
    unrated_movies = [movie for movie in all_movie_ids if movie not in rated_movies]

    # Predict ratings for all unrated movies
    predictions = [model.predict(user_id, movie_id) for movie_id in unrated_movies]

    # Sort the predictions by estimated rating in descending order
    top_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)

    # Get the top N movie IDs
    top_movie_ids = [pred.iid for pred in top_predictions[:n]]

    return top_movie_ids

# Get recommendations for a specific user
user_id = 1
top_movie_ids = get_recommendations(user_id, model, data, n=10)

# Merge to get the movie titles
recommended_movies = movies[movies['movieId'].isin(top_movie_ids)]
print(recommended_movies[['movieId', 'title']])

# Save the trained SVD model using pickle
import pickle

with open('svd_recommender_model.pkl', 'wb') as f:
    pickle.dump(model, f)

files.download('svd_recommender_model.pkl')

Above is the code that I used to create the machine learning model. I tried three methods: Matrix Factorization (SVD), User-based Collaborative Filtering (UBCF), and Item-based Collaborative Filtering (IBCF).

The code may look dense, so let me guide you through it step by step:

from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

# Step 1: Install the 'surprise' library (for SVD, UBCF, and IBCF)
!pip install scikit-surprise

# Step 2: Import necessary libraries
import pandas as pd
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import cross_validate

# Load ratings.csv from Google Drive
ratings = pd.read_csv('/content/drive/My Drive/ratings.csv')

# Load combined_movies.csv from Google Drive
movies = pd.read_csv('/content/drive/My Drive/combined_movies.csv')

From the code above, I imported several tools that will help us in creating the Machine Learning model. This includes Surprise, which has the functions necessary for us to create the models using the methods that I mentioned above (SVD, UBCF, and IBCF).

# Define a Reader object and specify the rating scale (assuming it's 1 to 5)
reader = Reader(rating_scale=(0.5, 5.0))

# Load the data from the pandas dataframe
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Train-test split (80% training, 20% testing)
trainset, testset = train_test_split(data, test_size=0.2)

In the code above, we define the Reader. Since the ratings range from 0.5 to 5, we use that as the rating scale when loading the data for the machine learning model.

Then we pick three columns: userId, movieId, and rating. These three variables are all we need to construct the model.

We then split the data: 80% goes into training and 20% is held out for testing (validation). This is a fair split, since we need plenty of data for training as well.

# Initialize and train the SVD model
print("\nEvaluating SVD model:")
model = SVD()
model.fit(trainset)
predictions = model.test(testset)
accuracy.rmse(predictions)

# Initialize and train the User-based Collaborative Filtering (UBCF) model
print("\nEvaluating User-based Collaborative Filtering (UBCF) model:")
ubcf_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': True}) # UBCF
ubcf_model.fit(trainset)
ubcf_predictions = ubcf_model.test(testset)
accuracy.rmse(ubcf_predictions)

# Initialize and train the Item-based Collaborative Filtering (IBCF) model
print("\nEvaluating Item-based Collaborative Filtering (IBCF) model:")
ibcf_model = KNNBasic(sim_options={'name': 'cosine', 'user_based': False}) # IBCF
ibcf_model.fit(trainset)
ibcf_predictions = ibcf_model.test(testset)
accuracy.rmse(ibcf_predictions)

Then, we will create 3 different models that use three different approaches: SVD, UBCF, and IBCF. We will fit the training data into the model and test them.

Printing out the rMSE

As you can see, we also print out the RMSE. What is that? RMSE stands for root mean squared error. Put simply, a lower RMSE means the predicted ratings are closer to the actual ratings, which equates to better accuracy.

In this case, the best RMSE belongs to the SVD model, with a lower score (0.8699) than the other models. That is why we will use the matrix factorization (SVD) model as our recommender.
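For intuition, here is a small sketch of what accuracy.rmse computes, using the prediction objects that Surprise returns (pred.r_ui is the true rating from the test set and pred.est is the model's prediction):

import numpy as np

# Root mean squared error computed by hand from the SVD predictions on the test set
squared_errors = [(pred.r_ui - pred.est) ** 2 for pred in predictions]
manual_rmse = np.sqrt(np.mean(squared_errors))
print(f"Manual RMSE for SVD: {manual_rmse:.4f}")  # should match accuracy.rmse(predictions)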

# Function to recommend top-N movies for a specific user
def get_recommendations(user_id, model, data, n=10):
    # Get the list of all movie IDs
    all_movie_ids = data.df['movieId'].unique()

    # Get the movies the user has already rated
    rated_movies = data.df[data.df['userId'] == user_id]['movieId'].values

    # Get the list of movies the user hasn't rated yet
    unrated_movies = [movie for movie in all_movie_ids if movie not in rated_movies]

    # Predict ratings for all unrated movies
    predictions = [model.predict(user_id, movie_id) for movie_id in unrated_movies]

    # Sort the predictions by estimated rating in descending order
    top_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)

    # Get the top N movie IDs
    top_movie_ids = [pred.iid for pred in top_predictions[:n]]

    return top_movie_ids

# Get recommendations for a specific user
user_id = 1
top_movie_ids = get_recommendations(user_id, model, data, n=10)

# Merge to get the movie titles
recommended_movies = movies[movies['movieId'].isin(top_movie_ids)]
print(recommended_movies[['movieId', 'title']])

For the next step, we generate predictions with the model for the movies the user has not rated, sort them by estimated rating, and keep the top 10. In the example above, we test it on userId 1. We also merge the result with the movies data so that the printout includes the titles.

The recommendations that are given to the user

As you can see, the machine learning model works: it recommended 10 different movies in the printout. This model can now be used in our backend.

# Save the trained SVD model using pickle
import pickle

with open('svd_recommender_model.pkl', 'wb') as f:
    pickle.dump(model, f)

files.download('svd_recommender_model.pkl')

The final step is exporting our machine learning model to a pickle (.pkl) file so that it can be used in our Flask backend. The code is as simple as the snippet above: we serialize the model with pickle and then download it. The downloaded file is named svd_recommender_model.pkl.

Photo by imgix on Unsplash

Part 6: Backend Implementation

For the backend, we will be using Flask. If you want to run the Flask app, you need to do two steps:

  1. Install Python: https://www.python.org/
  2. Install Flask: https://pypi.org/project/Flask. You can then find the QuickStart here: https://flask.palletsprojects.com/en/3.0.x/quickstart

The Flask code looks like this:

from flask import Flask, jsonify, request
import pandas as pd
import pickle

app = Flask(__name__)

# Step 1: Load the data (ratings.csv, combined_movies.csv, and users.csv)
ratings_file_path = 'ratings.csv'  # Update this path with your own path
movies_file_path = 'combined_movies.csv'  # Update this path with your own path
users_file_path = 'users.csv'  # Update this path with your own path

ratings = pd.read_csv(ratings_file_path)
movies = pd.read_csv(movies_file_path)
users = pd.read_csv(users_file_path)

# Step 2: Load the pre-trained SVD model from the pickle file
with open('svd_recommender_model.pkl', 'rb') as f:
    svd_model = pickle.load(f)

# Convert users data to a dictionary for quick lookup of password and userId
users_dict = users.set_index('username')[['password', 'userId']].to_dict(orient='index')

# Login endpoint
@app.route('/login', methods=['POST'])
def login():
    """
    Endpoint to check if the username and password provided are correct.
    :return: Success with userId or failure response in JSON format.
    """
    data = request.json
    username = data.get('username')
    password = data.get('password')

    if username in users_dict and users_dict[username]['password'] == password:
        user_id = users_dict[username]['userId']
        return jsonify({"status": 200, "userId": user_id, "message": "Success", "movies": []}), 200
    else:
        return jsonify({"status": 400, "message": "Username or Password Not Found", "movies": []}), 400

# Function to recommend top-N movies for a specific user
def get_recommendations(user_id, model, n=10):
    all_movie_ids = ratings['movieId'].unique()
    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values
    unrated_movies = [movie for movie in all_movie_ids if movie not in rated_movies]

    predictions = [model.predict(user_id, movie_id) for movie_id in unrated_movies]
    top_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
    top_movie_ids = [pred.iid for pred in top_predictions[:n]]

    return top_movie_ids

# Endpoint to fetch movies a user has rated
@app.route('/user/<int:user_id>/movies', methods=['GET'])
def user_rated_movies(user_id):
    """
    Endpoint to return the movies the user has rated.
    """
    try:
        # Get the movies the user has already rated
        user_rated_movies = ratings[ratings['userId'] == user_id]

        # Merge with movie details
        user_movies = user_rated_movies.merge(movies, on='movieId')

        # Replace NaN values with None
        user_movies = user_movies.where(pd.notnull(user_movies), None)

        # Convert the movies data to a list of dicts
        user_movies_list = user_movies[['movieId', 'title', 'genres', 'original_title', 'overview', 'poster_path', 'backdrop_path']].to_dict(orient='records')

        # Return the response in JSON format
        return jsonify({
            "status": 200,
            "userId": user_id,
            "message": "Success",
            "movies": user_movies_list
        }), 200

    except Exception as e:
        return jsonify({"status": 400, "message": str(e), "movies": []}), 400

# Endpoint to provide movie recommendations for a specific user
@app.route('/user/<int:user_id>/recommendations', methods=['GET'])
def user_recommendations(user_id):
    """
    Endpoint to return movie recommendations for a user based on user_id only.
    """
    try:
        # Get top-N movie recommendations for the user
        top_movie_ids = get_recommendations(user_id, svd_model, n=10)

        # Merge to get the movie titles and other details
        recommended_movies = movies[movies['movieId'].isin(top_movie_ids)]

        # Replace NaN values with None (which corresponds to null in JSON)
        recommended_movies = recommended_movies.where(pd.notnull(recommended_movies), None)

        # Convert recommended movies to a list of dicts
        recommended_movies_list = recommended_movies[['movieId', 'title', 'genres', 'original_title', 'overview', 'poster_path', 'backdrop_path']].to_dict(orient='records')

        # Return the recommended movies in JSON format
        return jsonify({
            "status": 200,
            "userId": user_id,
            "message": "Success",
            "movies": recommended_movies_list
        }), 200

    except Exception as e:
        return jsonify({"status": 400, "message": str(e), "movies": []}), 400

# Run the Flask app
if __name__ == '__main__':
    app.run(debug=True)

Above is all the code that will be used for the Flask backend. It has three endpoints:

  1. Login: To sign in to a specific user with username and password.
  2. Movies: To get the list of movies that a user has watched by their user_id.
  3. Recommendations: To get the recommended movies that are recommended to the user by their user_id.

I will go through the code step by step so that you will understand.

from flask import Flask, jsonify, request
import pandas as pd
import pickle

app = Flask(__name__)

# Step 1: Load the data (ratings.csv, combined_movies.csv, and users.csv)
ratings_file_path = 'ratings.csv'  # Update this path with your own path
movies_file_path = 'combined_movies.csv'  # Update this path with your own path
users_file_path = 'users.csv'  # Update this path with your own path

ratings = pd.read_csv(ratings_file_path)
movies = pd.read_csv(movies_file_path)
users = pd.read_csv(users_file_path)

# Step 2: Load the pre-trained SVD model from the pickle file
with open('svd_recommender_model.pkl', 'rb') as f:
    svd_model = pickle.load(f)

# Convert users data to a dictionary for quick lookup of password and userId
users_dict = users.set_index('username')[['password', 'userId']].to_dict(orient='index')

The code above will get the data from ratings, combined movies, and users. This code will also load the pre-trained SVD model from the pickle file that we can use.

For the login, we use a dictionary keyed by username, which makes it quick to check whether a user exists and to look up their password and userId without scanning the CSV file each time.
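To make the lookup concrete, users_dict ends up shaped roughly like this (a sketch based on the users.csv generated earlier):

# users_dict maps each username to its password and userId, for example:
# {
#     "user1": {"password": "pass", "userId": 1},
#     "user2": {"password": "pass", "userId": 2},
#     ...
# }
users_dict.get("user1")  # -> {"password": "pass", "userId": 1}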

# Login endpoint
@app.route('/login', methods=['POST'])
def login():
    """
    Endpoint to check if the username and password provided are correct.
    :return: Success with userId or failure response in JSON format.
    """
    data = request.json
    username = data.get('username')
    password = data.get('password')

    if username in users_dict and users_dict[username]['password'] == password:
        user_id = users_dict[username]['userId']
        return jsonify({"status": 200, "userId": user_id, "message": "Success", "movies": []}), 200
    else:
        return jsonify({"status": 400, "message": "Username or Password Not Found", "movies": []}), 400

The code above is the login endpoint. When it is called, we look the username up in the dictionary. If the username exists and the password matches, we return 200 along with the userId. Otherwise, we return 400 with the message "Username or Password Not Found".
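If you want to exercise the endpoint without the iOS app, here is a small sketch using the requests library (it assumes the server is running locally on port 5000 and that user1/pass exists in users.csv):

import requests

# Log in as one of the generated users
resp = requests.post("http://127.0.0.1:5000/login",
                     json={"username": "user1", "password": "pass"})
print(resp.status_code)  # 200 on success, 400 otherwise
print(resp.json())       # e.g. {"status": 200, "userId": 1, "message": "Success", "movies": []}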

# Endpoint to fetch movies a user has rated
@app.route('/user/<int:user_id>/movies', methods=['GET'])
def user_rated_movies(user_id):
    """
    Endpoint to return the movies the user has rated.
    """
    try:
        # Get the movies the user has already rated
        user_rated_movies = ratings[ratings['userId'] == user_id]

        # Merge with movie details
        user_movies = user_rated_movies.merge(movies, on='movieId')

        # Replace NaN values with None
        user_movies = user_movies.where(pd.notnull(user_movies), None)

        # Convert the movies data to a list of dicts
        user_movies_list = user_movies[['movieId', 'title', 'genres', 'original_title', 'overview', 'poster_path', 'backdrop_path']].to_dict(orient='records')

        # Return the response in JSON format
        return jsonify({
            "status": 200,
            "userId": user_id,
            "message": "Success",
            "movies": user_movies_list
        }), 200

    except Exception as e:
        return jsonify({"status": 400, "message": str(e), "movies": []}), 400

In the code above, we fetch the movies that the given user_id has rated (watched) and return them in JSON format, including movieId, title, genres, original_title, overview, poster_path, and backdrop_path.

# Function to recommend top-N movies for a specific user
def get_recommendations(user_id, model, n=10):
    all_movie_ids = ratings['movieId'].unique()
    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values
    unrated_movies = [movie for movie in all_movie_ids if movie not in rated_movies]

    predictions = [model.predict(user_id, movie_id) for movie_id in unrated_movies]
    top_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
    top_movie_ids = [pred.iid for pred in top_predictions[:n]]

    return top_movie_ids

# Endpoint to provide movie recommendations for a specific user
@app.route('/user/<int:user_id>/recommendations', methods=['GET'])
def user_recommendations(user_id):
    """
    Endpoint to return movie recommendations for a user based on user_id only.
    """
    try:
        # Get top-N movie recommendations for the user
        top_movie_ids = get_recommendations(user_id, svd_model, n=10)

        # Merge to get the movie titles and other details
        recommended_movies = movies[movies['movieId'].isin(top_movie_ids)]

        # Replace NaN values with None (which corresponds to null in JSON)
        recommended_movies = recommended_movies.where(pd.notnull(recommended_movies), None)

        # Convert recommended movies to a list of dicts
        recommended_movies_list = recommended_movies[['movieId', 'title', 'genres', 'original_title', 'overview', 'poster_path', 'backdrop_path']].to_dict(orient='records')

        # Return the recommended movies in JSON format
        return jsonify({
            "status": 200,
            "userId": user_id,
            "message": "Success",
            "movies": recommended_movies_list
        }), 200

    except Exception as e:
        return jsonify({"status": 400, "message": str(e), "movies": []}), 400

# Run the Flask app
if __name__ == '__main__':
    app.run(debug=True)

Last but not least is the recommendations endpoint, which uses the machine learning model we created to score the movies the user has not rated yet and return the top 10.

We also merge the movie IDs returned by the model with the movie details in the JSON response, so that our application only ever has to talk to this backend.

Now, to run the backend application, you will need to install Python and Flask first by following the steps above. After that, you need to do these steps:

  1. Go to the terminal and install the required packages in the directory of the Python application, for example: pip install flask pandas scikit-surprise
  2. Type flask --app <project_name> run to run the app. In this case, it will be flask --app flask_project run. After a moment, it will start serving on a specific port, for example http://127.0.0.1:5000.
  3. You can then try out some endpoints of the backend once it is running by opening URLs like http://127.0.0.1:5000/user/1/movies (see the quick smoke test after this list).
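Once the server is up, a quick smoke test from Python can confirm that both endpoints respond (a sketch; userId 1 exists in the MovieLens data):

import requests

base = "http://127.0.0.1:5000"
for path in ("/user/1/movies", "/user/1/recommendations"):
    r = requests.get(base + path)
    titles = [m["title"] for m in r.json()["movies"]]
    print(path, r.status_code, titles[:3])  # show the first few titles from each endpoint
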
Photo by Ather Energy on Unsplash

Part 7: iOS App Implementation

For the iOS app, you will need to download Xcode first: https://developer.apple.com/xcode/

After that, you can open the .xcodeproj file and run the app once it has opened successfully. As for the code, let me explain a bit.

import SwiftUI

@MainActor
class UserStateViewModel: ObservableObject {

    enum LoginState {
        case loggedIn(userId: Int)
        case loggedOut
    }

    @Published var loginState: LoginState = .loggedOut
}

@main
struct movie_recommendationApp: App {

    @StateObject var userStateViewModel = UserStateViewModel()

    init() {
        if let image = UIImage(named: "back")?.withRenderingMode(.alwaysOriginal) {
            UINavigationBar.appearance().backIndicatorImage = image
            UINavigationBar.appearance().backIndicatorTransitionMaskImage = image
        }
    }

    var body: some Scene {
        WindowGroup {
            NavigationStack {
                ApplicationSwitcher()
            }
            .navigationViewStyle(.stack)
            .environmentObject(userStateViewModel)
        }
    }
}

struct ApplicationSwitcher: View {

    @EnvironmentObject var vm: UserStateViewModel

    var body: some View {
        switch vm.loginState {
        case .loggedIn(let userId):
            ContentView(userId: userId)
        case .loggedOut:
            LoginView()
        }
    }
}

The code above defines the login state that we will use. Since the app supports logging in and out, we keep this state in an environment object with two possible cases: logged in (carrying the userId) and logged out.

The first screen that we will discuss is the login screen, which will look like this:

Login screen

For the login screen, I will focus only on the implementation of the API call, which is:

func login() {
    guard let url = URL(string: "http://127.0.0.1:5000/login") else { return }

    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = ["username": username.lowercased(), "password": password.lowercased()]

    guard let httpBody = try? JSONSerialization.data(withJSONObject: body, options: []) else { return }

    request.httpBody = httpBody

    URLSession.shared.dataTask(with: request) { data, response, error in
        guard let data = data, error == nil else {
            DispatchQueue.main.async {
                self.errorMessage = "Network error. Please try again."
                self.showError = true
            }
            return
        }

        // Decode the response using Codable
        do {
            let loginResponse = try JSONDecoder().decode(LoginResponse.self, from: data)

            if loginResponse.message == "Success", let userId = loginResponse.userId {
                // Handle successful login, use the userId
                DispatchQueue.main.async {
                    vm.loginState = .loggedIn(userId: userId)
                    self.loginSuccessfull = true
                }
            } else {
                DispatchQueue.main.async {
                    self.errorMessage = loginResponse.message
                    self.showError = true
                }
            }
        } catch {
            DispatchQueue.main.async {
                self.errorMessage = "Failed to parse response."
                self.showError = true
            }
        }
    }.resume()
}

In the code above, we call the login endpoint and send a request body containing the username and password.

After the URLSessionDataTask is called, we will try to decode the response:

  1. If the response message is "Success", we take the userId from the decoded data and switch to the home screen
  2. Otherwise, we show the error message in the application
  3. We also handle other cases, such as network errors and the backend not being deployed yet, with messages like "Network error" and "Failed to parse response" (a sketch of the LoginResponse model follows this list)
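The LoginResponse model used by the decoder is not shown in the snippet above. A minimal sketch that matches the JSON returned by the Flask /login endpoint would be:

/// Sketch of the Codable model assumed by the login call above.
/// Extra JSON keys (such as "movies") are simply ignored by JSONDecoder.
struct LoginResponse: Codable {
    let status: Int
    let userId: Int?   // present only on a successful login
    let message: String
}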

The next is the home screen, which will look like this:

Home Screen

For the home screen, I will again focus only on the implementation of the API calls, since the full view code is fairly long:

/// Function to fetch watched movies for the user
func fetchWatchedMovies(userId: Int) {
    guard let url = URL(string: "http://127.0.0.1:5000/user/\(userId)/movies") else {
        return
    }

    URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else {
            DispatchQueue.main.async {
                self.errorMessage = "Failed to fetch movies. Please try again."
            }
            return
        }

        // Decode the movies response
        do {
            let response = try JSONDecoder().decode(UserMoviesResponse.self, from: data)

            // Handle success or failure based on status
            if response.status == 200 {
                DispatchQueue.main.async {
                    self.watchedMovies = response.movies
                }
            } else {
                DispatchQueue.main.async {
                    self.errorMessage = "Error: \(response.message ?? "Unknown error")"
                }
            }
        } catch {
            DispatchQueue.main.async {
                print("Error decoding JSON: \(error)")
                self.errorMessage = "Failed to parse movies. Please try again."
            }
        }
    }.resume()
}

/// Function to fetch recommended movies for the user
func fetchRecommendedMovies(userId: Int) {
    guard let url = URL(string: "http://127.0.0.1:5000/user/\(userId)/recommendations") else {
        return
    }

    URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else {
            DispatchQueue.main.async {
                self.errorMessage = "Failed to fetch recommendations. Please try again."
            }
            return
        }

        // Log the raw response to the console (optional)
        if let jsonString = String(data: data, encoding: .utf8) {
            print("Raw JSON Response: \(jsonString)")
        }

        // Decode the recommendations response
        do {
            let response = try JSONDecoder().decode(UserMoviesResponse.self, from: data)

            // Handle success or failure based on status
            if response.status == 200 {
                DispatchQueue.main.async {
                    self.recommendedMovies = response.movies
                }
            } else {
                DispatchQueue.main.async {
                    self.errorMessage = "Error: \(response.message ?? "Unknown error")"
                }
            }
        } catch {
            DispatchQueue.main.async {
                print("Error decoding JSON: \(error)")
                self.errorMessage = "Failed to parse recommendations. Please try again."
            }
        }
    }.resume()
}

This code does similar things as before, but it calls two API endpoints (the response models are sketched after the list below):

  1. http://127.0.0.1:5000/user/<int:user_id>/movies: To get the movies that are watched by the user by user id.
  2. http://127.0.0.1:5000/user/<int:user_id>/recommendations: To get the movies that are recommended to the user by user id.
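As with the login call, the UserMoviesResponse and Movie models are not shown above. A minimal sketch that mirrors the JSON fields returned by the Flask backend would be (in the project, Movie is nested inside ContentView as ContentView.Movie):

/// Sketch of the response models assumed by the home screen.
struct UserMoviesResponse: Codable {
    let status: Int
    let userId: Int?
    let message: String?
    let movies: [Movie]
}

struct Movie: Codable, Identifiable {
    let movieId: Int
    let title: String
    let genres: String?
    let original_title: String?
    let overview: String?
    let poster_path: String?
    let backdrop_path: String?

    var id: Int { movieId }
}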

The next is the detail screen, which will look like this:

Detail Screen

For the detail screen above, the code will be:

struct DetailView: View {

    var movie: ContentView.Movie
    let movieImageWidth = UIScreen.main.bounds.width
    let movieImageHeight = UIScreen.main.bounds.width * 2 / 3

    var body: some View {
        VStack {
            ScrollView(.vertical) {

                if let backdropPath = movie.backdrop_path {
                    // Use the reusable MovieImageView for the backdrop image
                    MovieImageView(imagePath: backdropPath, width: movieImageWidth, height: movieImageHeight)
                } else if let posterPath = movie.poster_path {
                    // Use the reusable MovieImageView for the poster image
                    MovieImageView(imagePath: posterPath, width: movieImageWidth, height: movieImageHeight)
                } else {
                    // Fallback to placeholder
                    Rectangle()
                        .frame(width: movieImageWidth, height: movieImageHeight)
                        .foregroundColor(MovieAppColors.slate)
                }

                VStack {
                    HStack {
                        Text(movie.title)
                            .font(.system(size: 32, weight: .semibold))
                            .foregroundColor(.white)
                        Spacer()
                    }
                    Spacer()
                    HStack {
                        Text(movie.overview ?? "")
                            .font(.system(size: 20, weight: .regular))
                            .foregroundColor(MovieAppColors.lightGray)
                        Spacer()
                    }
                }
                .padding()
            }
        }
        .ignoresSafeArea()
        .background(MovieAppColors.black)
    }
}

/// Reusable component to handle the image loading
struct MovieImageView: View {
    let imagePath: String
    let width: CGFloat
    let height: CGFloat

    var body: some View {
        let imageUrl = URL(string: "https://image.tmdb.org/t/p/w500\(imagePath)")

        WebImage(url: imageUrl) { image in
            image
                .resizable()
                .scaledToFill()
                .frame(width: width, height: height)
                .clipped()
        } placeholder: {
            Rectangle()
                .foregroundColor(MovieAppColors.lightGray)
                .frame(width: width, height: height)
        }
    }
}

In this code, we display the movie information that came from the endpoint. The data includes:

  1. backdrop_path: used as the background image; if it is missing, we fall back to poster_path, and to a plain placeholder if both are missing
  2. Title: To display the title of the movie
  3. Overview: To display the description of the movie

The last screen will be the profile screen, which will be:

Profile Screen

For the profile screen, the code will be:

struct ProfileView: View {

    @EnvironmentObject var vm: UserStateViewModel
    @Environment(\.dismiss) var dismiss

    var body: some View {
        VStack {
            Spacer()
            HStack {
                Spacer()
                Button(action: {
                    vm.loginState = .loggedOut
                    dismiss()
                }, label: {
                    Text("Sign Out")
                        .font(.system(size: 32, weight: .semibold))
                        .foregroundColor(MovieAppColors.red)
                })
                Spacer()
            }
            Spacer()
        }
        .ignoresSafeArea()
        .background(MovieAppColors.black)
    }
}

#Preview {
    ProfileView()
}

The profile screen has a single Sign Out button. Tapping it sets the login state back to logged out, which signs the user out of the app and returns them to the login screen.

Photo by Myles Tan on Unsplash

Part 8: Conclusion

To conclude, deploying a machine learning model behind a backend is very doable. From this project, however, I found that Python and Google Colab provide tooling for this that is less error-prone than what I managed with R.

This can serve as a jump-start, and it can be improved even further; a natural next step would be moving the datasets into a MySQL database. I hope you liked this article. You can follow me if you want more updates.

You can follow me on LinkedIn: https://www.linkedin.com/in/kevinabram

You can also check my GitHub here: https://github.com/kevinabram111

If you want the link to this project, you can find it here: https://github.com/kevinabram111/Movie-Recommendation

Written by Kevin Abram

iOS Engineer | Technology Consultant