Mini Project 2

Introduction

For this project, we will analyze data from the Internet Movie Database (IMDb), one of the most comprehensive and widely-used sources for movie-related information. We will leverage the IMDb non-commercial release, which provides access to extensive data on films, directors, actors, ratings, and much more. This dataset, made freely available by IMDb for non-commercial use, offers a wealth of information that will allow us to explore key aspects of the film industry, such as movie success metrics, trends across genres, and the careers of notable actors and directors. Our analysis will aim to uncover meaningful insights from this rich dataset and create an elevator pitch for the best successful movie ever.

Data

First, let’s start by downloading the packages that we will need for our code:

Code

library(stringr)
library(gt)
library(dplyr)
library(ggplot2)
library (tidyverse)
library(DT)
library(knitr)
library(readxl)
library(readr)
library(data.table)

Then, we will download the data sets. Because those are large files, we will upload those files from this GitHub page.

Code

##loading the data files
get_imdb_file <- function(fname){
    BASE_URL <- "https://datasets.imdbws.com/"
    fname_ext <- paste0(fname, ".tsv.gz")
    if(!file.exists(fname_ext)){
        FILE_URL <- paste0(BASE_URL, fname_ext)
        download.file(FILE_URL, 
                      destfile = fname_ext)
    }
    as.data.frame(readr::read_tsv(fname_ext, lazy=FALSE))
}

NAME_BASICS      <- get_imdb_file("name.basics")
TITLE_BASICS     <- get_imdb_file("title.basics")
TITLE_EPISODES   <- get_imdb_file("title.episode")
TITLE_RATINGS    <- get_imdb_file("title.ratings")
TITLE_CREW       <- get_imdb_file("title.crew")
TITLE_PRINCIPALS <- get_imdb_file("title.principals")

Data Sub-Sampling

This data is large enough that we’re going to need to immediately start down-selecting to get to a data set that we can analyze fluidly. We are going throw out any title with less than 100 ratings and any data that is related to those titles with the following code:

Code

NAME_BASICS <- NAME_BASICS |> 
    filter(str_count(knownForTitles, ",") > 1)

TITLE_RATINGS <- TITLE_RATINGS |>
    filter(numVotes >= 100)

TITLE_BASICS <- TITLE_BASICS |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))

TITLE_CREW <- TITLE_CREW |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))

TITLE_EPISODES_1 <- TITLE_EPISODES |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))
TITLE_EPISODES_2 <- TITLE_EPISODES |>
    semi_join(TITLE_RATINGS, 
              join_by(parentTconst == tconst))

TITLE_EPISODES <- bind_rows(TITLE_EPISODES_1,
                            TITLE_EPISODES_2) |>
    distinct()

TITLE_PRINCIPALS <- TITLE_PRINCIPALS |>
    semi_join(TITLE_RATINGS, join_by(tconst == tconst))


rm(TITLE_EPISODES_1)
rm(TITLE_EPISODES_2)

Initial Exploration

Now, let’s start exploring our data and correct the column types of the TITLE tables using a combination of mutate and the coercion functions as.numeric and as.logical. NAME_BASICS table:

Code

NAME_BASICS <- NAME_BASICS |>
  mutate(
    birthYear = as.numeric(birthYear),
    deathYear = as.numeric(deathYear)
  )
# Let's view data now. 
glimpse(NAME_BASICS)

TITLE_BASICS Table:

Code

TITLE_BASICS <- TITLE_BASICS |>
  mutate(
    startYear = as.numeric(startYear),
    endYear = as.numeric(endYear),
    runtimeMinutes = as.numeric(runtimeMinutes),
    isAdult = as.logical(isAdult)
  )
glimpse(TITLE_BASICS)

TITLE_EPISODES Table:

Code

TITLE_EPISODES <- TITLE_EPISODES |>
  mutate(
    seasonNumber = as.numeric(seasonNumber),
    episodeNumber = as.numeric(episodeNumber)
  )
glimpse(TITLE_EPISODES)

Now let’s answer some questions to get more insights out of those data sets in order to conduct our analyzes.

1.How many movies are in our data set? How many TV series? How many TV episodes?

Code

num_movies <- TITLE_BASICS |>
  filter(titleType == "movie") |>
  count()

num_tv_series <- TITLE_BASICS |>
  filter(titleType == "tvSeries") |>
  count()

num_tv_episodes <- TITLE_BASICS |>
  filter(titleType == "tvEpisode") |>
  count()

result_table <- tibble(
  Category = c("Movies", "TV Series", "TV Episodes"),
  Count = c(num_movies$n, num_tv_series$n, num_tv_episodes$n)
)

result_table %>%
  gt() %>%
  tab_header(title = "Internet Movie Database (IMDb)") %>%
  fmt_number(
    columns = vars(Count),
    decimals = 0
  )

Internet Movie Database (IMDb)
Category	Count
Movies	132,642
TV Series	30,110
TV Episodes	157,447

2.Who is the oldest living person in our data set?

Code

current_year = 2024
oldest_living_person <- NAME_BASICS |>
  filter(is.na(deathYear)) |>
  filter(birthYear >= (current_year - 124)) |>
  arrange(birthYear) |>
  select(- nconst,
         - knownForTitles) |>
  slice(1)

oldest_living_person |>
  gt() |>
  tab_header(title = "Oldest Living Person")

Oldest Living Person
primaryName	birthYear	deathYear	primaryProfession
Léonide Azar	1900	NA	editor,director,miscellaneous

The oldest living person in our data set is Léonide Azar. He was born on March 20th in 1900 in St Petersburg, Russia. He is known for Elevator to the Gallows, Love in the Afternoon and Riff Raff Girls.

There is one TV Episode in this data set with a perfect 10/10 rating and at least 200,000 IMDb ratings. 3.What is it? What series does it belong to?

Code

# Function to find the primaryTitle using a tconst
find_primary_title <- function(selected_tconst, TITLE_EPISODE, TITLE_BASICS) {
  
  # Step 1: Find the parentTconst of the selected tconst in TITLE_EPISODE
  parentTconst <- TITLE_EPISODE$parentTconst[TITLE_EPISODE$tconst == selected_tconst]
  
  # Step 2: Look up the parentTconst in TITLE_BASICS to find the primaryTitle
  primaryTitle <- TITLE_BASICS$primaryTitle[TITLE_BASICS$tconst == parentTconst]
  
  # Step 3: Return the primaryTitle
  return(primaryTitle)
}

perfect_episode <- TITLE_RATINGS |>
  filter(averageRating == 10, numVotes > 200000) |>
  left_join(TITLE_BASICS, by = "tconst") |>
  filter(titleType == "tvEpisode") |>
  select(primaryTitle, titleType, genres, numVotes, tconst) |>
  arrange(desc(numVotes)) |>
  # Step 4: Add a new column for the parentTitle
  mutate(parentTitle = sapply(tconst, find_primary_title, TITLE_EPISODES, TITLE_BASICS))

# Display the results in a table
perfect_episode |>
  select(primaryTitle, parentTitle, titleType, genres, numVotes) |>
  gt() |>
  tab_header(
    title = "TV Episode rated 10/10"
  ) |>
  cols_label(
    primaryTitle = "Episode Title",
    parentTitle = "Parent Title",
    titleType = "Title Type",
    genres = "Genres",
    numVotes = "Number of Votes"
  )

TV Episode rated 10/10
Episode Title	Parent Title	Title Type	Genres	Number of Votes
Ozymandias	Breaking Bad	tvEpisode	Crime,Drama,Thriller	231016

The answer is the episode named Ozymandias from the serie Breaking Bad. This episodes received 227589 ratings on IMDb.

4.What four projects is the actor Mark Hamill most known for?

Code

 # Step 1: filter Mark Hamill episodes titles
mark_hamill <- NAME_BASICS |>
  filter(primaryName == "Mark Hamill") |>
  pull(knownForTitles)

# Step 2: Split the knownForTitles string into individual tconsts
tconsts <- unlist(strsplit(mark_hamill, ","))

# Step 3: Filter TITLE_BASICS for Mark Hamill's known projects and add a ranking column
known_projects <- TITLE_BASICS |>
  filter(tconst %in% tconsts) |>
  select(primaryTitle, titleType, startYear) |>
  arrange(desc(startYear)) |>  # Sort by startYear in descending order
  mutate(Rank = row_number()) |>
  select(Rank, primaryTitle, titleType, startYear)  # Rearrange columns to make Rank the first column

# Step 4: Display the table using gt with the ranking column as the first column
known_projects |>
  gt() |>
  tab_header(
    title = "Top Projects of Mark Hamill"
  ) |>
  cols_label(
    Rank = "Rank",
    primaryTitle = "Project Title",
    titleType = "Type of Project",
    startYear = "Start Year"
  ) |>
  tab_options(
    table.width = pct(100)  # Set table width to 100%
  )

Top Projects of Mark Hamill
Rank	Project Title	Type of Project	Start Year
1	Star Wars: Episode VIII - The Last Jedi	movie	2017
2	Star Wars: Episode VI - Return of the Jedi	movie	1983
3	Star Wars: Episode V - The Empire Strikes Back	movie	1980
4	Star Wars: Episode IV - A New Hope	movie	1977

Mark Richard Hamill is an American actor who is know for his character as Luke Skywalker in the Star Wars franchise.

5.What TV series, with more than 12 episodes, has the highest average rating?

Code

# Step 1: Filter for TV episodes with ratings and join with TITLE_BASICS
episode_ratings <- TITLE_EPISODES |>
  inner_join(TITLE_RATINGS, by = "tconst") |>
  inner_join(TITLE_BASICS, by = c("parentTconst" = "tconst"))

# Step 2: Count episodes per series and filter for series with more than 12 episodes
top_rated_series <- episode_ratings |>
  group_by(parentTconst, primaryTitle) |>
  summarise(
    avg_rating = mean(averageRating, na.rm = TRUE),
    num_episodes = n(),
    .groups = "drop"
  ) |>
  filter(num_episodes > 12) |>
  arrange(desc(avg_rating)) |>
  slice(1)  # Select only the top one series

# Step 3: Display the result as a table using gt
top_rated_series |>
  select(primaryTitle, num_episodes, avg_rating) |>
  rename(
    "Serie Title" = primaryTitle,
    "Number of Episodes" = num_episodes,
    "Average Rating" = avg_rating
  ) |>
  gt() |>
  tab_header(
    title = "Top TV Serie by Average Rating"
  ) |>
  cols_label(
    `Serie Title` = "Serie Title",
    `Number of Episodes` = "Number of Episodes",
    `Average Rating` = "Average Rating"
  ) |>
  fmt_number(
    columns = vars(`Average Rating`),
    decimals = 2  # Display 2 decimal places for the average rating
  ) |>
  tab_options(
    table.width = pct(100)
  )

Top TV Serie by Average Rating
Serie Title	Number of Episodes	Average Rating
Kavya - Ek Jazbaa, Ek Junoon	113	9.75

6.The TV series Happy Days (1974-1984) gives us the common idiom “jump the shark”. The phrase comes from a controversial fifth season episode (aired in 1977) in which a lead character literally jumped over a shark on water skis. Idiomatically, it is used to refer to the moment when a once-great show becomes ridiculous and rapidly looses quality. Is it true that episodes from later seasons of Happy Days have lower average ratings than the early seasons?

Code

# Find all episodes of Happy Days
happy_days <- TITLE_BASICS |>
  filter(primaryTitle == "Happy Days")

# Join with episodes and ratings
happy_days_ratings <- TITLE_EPISODES |>
  filter(parentTconst %in% happy_days$tconst) |>
  inner_join(TITLE_RATINGS, by = "tconst") |>
  group_by(seasonNumber) |>
  summarise(avg_rating = mean(averageRating, na.rm = TRUE)) |>
  arrange(seasonNumber)

# Convert seasonNumber to a factor with custom labels
happy_days_ratings$seasonNumber <- factor(happy_days_ratings$seasonNumber,
                                          labels = paste0("Season ", happy_days_ratings$seasonNumber))

# Create a bar chart to visualize the average ratings by season
ggplot(happy_days_ratings, aes(x = seasonNumber, y = avg_rating)) +
  geom_bar(stat = "identity", fill = "skyblue") +  # Use geom_bar for bar chart
  labs(
    title = "Average Episodes Ratings of Happy Days by Season",
    x = "Seasons",
    y = "Average Rating"
  ) +
  scale_x_discrete(drop = FALSE) +  # Ensure all factors are shown
  theme_minimal()

From the results shown on the graph, we can clearly see that the rating of the serie has declined from Season 5 and gained back success in Season 9.

Quantifying Success

Our goal is to proposal successful new movies. To do so, we need a way of measuring the success of a movie given only IMDb ratings. While there’s no “magic number” for success, it is logical to assume that a successful project will have both a high average IMDb rating, indicating quality, and a large number of ratings, indicating broad awareness in the public.

This code will design a ‘success’ measure for IMDb entries, reflecting both quality and broad popular awareness and implement the success metric using a mutate operator to add a new column to the TITLE_RATINGS table:

Code

TITLE_RATINGS <- TITLE_RATINGS |>
  mutate(success_metric = averageRating * log10(numVotes))

Let’s know validate our success metric by realising some verification taks.

1.We are going to choose the top 10 movies on our metric and confirm that they were indeed box office successes.

Code

# Filter for movies only
movies_only <- TITLE_BASICS |>
  filter(titleType == "movie")

# Add a custom success metric to the movies_ratings table
movies_ratings <- TITLE_RATINGS |>
  inner_join(movies_only, by = "tconst")

# View the top 10 movies by success_metric and create a visual table
top_movies_table <- movies_ratings |>
  arrange(desc(success_metric)) |>
  head(10) |>
  select(primaryTitle, averageRating, numVotes, success_metric) |>
  gt() |>
  tab_header(
    title = "Top 10 Movies by Success Metric",
    subtitle = "Ranked from Highest to Lowest Success Metric"
  ) |>
  cols_label(
    primaryTitle = "Movie Title",
    averageRating = "Average Rating",
    numVotes = "Number of Votes",
    success_metric = "Success Metric"
  ) |>
  fmt_number(
    columns = vars(averageRating, numVotes, success_metric),
    decimals = 2
  ) |>
  tab_options(
    table.width = pct(100)  # Set table width to 100%
  )

# Print the visual table
top_movies_table

Top 10 Movies by Success Metric
Ranked from Highest to Lowest Success Metric
Movie Title	Average Rating	Number of Votes	Success Metric
The Shawshank Redemption	9.30	2,958,279.00	60.18
The Dark Knight	9.00	2,940,037.00	58.22
The Godfather	9.20	2,062,487.00	58.09
The Lord of the Rings: The Return of the King	9.00	2,025,288.00	56.76
Pulp Fiction	8.90	2,271,780.00	56.57
Inception	8.80	2,608,926.00	56.46
The Lord of the Rings: The Fellowship of the Ring	8.90	2,055,132.00	56.18
Fight Club	8.80	2,390,373.00	56.13
Forrest Gump	8.80	2,315,021.00	56.01
Schindler's List	9.00	1,483,840.00	55.54

Seeing the number of votes and average ratings of the following movies in our table results, we can confirm that those movies were indeed box office successes.

2.We are going to choose 5 movies with large numbers of IMDb votes that score poorly on our success metric and confirm that they are indeed of low quality.

Code

# Add a custom success metric to the movies_ratings table
movies_ratings <- TITLE_RATINGS |>
  inner_join(movies_only, by = "tconst")

# Select 5 movies with a high number of votes but low success metric
low_success_movies <- movies_ratings |>
  filter(numVotes > 100000) |>  # Filter for popular movies
  arrange(success_metric) |>    # Sort by lowest success metric
  head(5) |>
  select(primaryTitle, averageRating, numVotes, success_metric) |>
  mutate(ranking = row_number()) |>  # Add ranking column
  select(ranking, everything())      # Reorder to make ranking the first column

# Display the table using gt
low_success_movies |>
  gt() |>
  tab_header(
    title = "Movies with Low Success Metric and High Number of Votes"
  ) |>
  cols_label(
    ranking = "Rank",
    primaryTitle = "Movie Title",
    averageRating = "Average Rating",
    numVotes = "Number of Votes",
    success_metric = "Success Metric"
  ) |>
  tab_options(
    table.width = pct(100)  # Set table width to 100%
  )

Movies with Low Success Metric and High Number of Votes
Rank	Movie Title	Average Rating	Number of Votes	Success Metric
1	Radhe	1.9	180269	9.98625
2	Epic Movie	2.4	110378	12.10292
3	Adipurush	2.7	134509	13.84763
4	Meet the Spartans	2.8	112401	14.14216
5	365 Days	3.3	101101	16.51569

According to the table results, we can conclude that our success metric worked because the movie titles listed have a very low average rate.

3.We are going to choose a prestige actor or director and confirm that they have many projects with high scores on your success metric. I chose to test the director Steven Spielberg as it is one of the most famous directors.

Code

# Question 3: Steven Spielberg's projects and their success scores

spielberg_nconst <- NAME_BASICS %>%
  filter(primaryName == "Steven Spielberg") %>%
  select(nconst)


spielberg_projects <- TITLE_CREW %>%
  filter(grepl(spielberg_nconst$nconst, directors)) %>%
  select(tconst)

spielberg_ratings <- spielberg_projects %>%
  inner_join(TITLE_RATINGS, by = "tconst") %>%
  inner_join(TITLE_BASICS, by = "tconst") %>%
  filter(titleType == "movie") %>%
  arrange(desc(success_metric))


spielberg_ratings %>%
  select(primaryTitle, averageRating, numVotes, success_metric) %>%
  slice(1:10) %>%
  gt() %>%
  tab_header(
    title = "Top Steven Spielberg Movies (Based on Success Score)"
  ) |>
  cols_label(
    primaryTitle = "Movie Title",
    averageRating = "Average Rating",
    numVotes = "Number of Votes",
    success_metric = "Success Metric"
  )

Top Steven Spielberg Movies (Based on Success Score)
Movie Title	Average Rating	Number of Votes	Success Metric
Schindler's List	9.0	1483840	55.54248
Saving Private Ryan	8.6	1529579	53.18732
Raiders of the Lost Ark	8.4	1052949	50.58822
Jurassic Park	8.2	1091404	49.51148
Catch Me If You Can	8.1	1127404	49.02184
Indiana Jones and the Last Crusade	8.2	823457	48.50826
Jaws	8.1	678082	47.23339
E.T. the Extra-Terrestrial	7.9	445547	44.62626
Minority Report	7.6	593137	43.87598
Indiana Jones and the Temple of Doom	7.5	543105	43.01163

The table confirms Steven Spielberg as being a successful director as msot of his movies have a high average rate and also a high number of votes.

4.Lastly, we are going to come up with a numerical threshold for a project to be a ‘success’; that is, determine a value such that movies above are all “solid” or better. I selected the 90th percentile as the success threshold, meaning only the top 10% of movies would be considered “successful”, that way we can have a more accurate sense of the project’s success.

Code

# Determine the 90th quantile for success_metric
quantile_90 <- quantile(movies_ratings$success_metric, probs = 0.90)

# Display the 90th quantile
quantile_90

    90% 
25.8505

Code

success_threshold <-quantile_90

Examining Success by Genre and Decade

Now that you have a working proxy for success, it’s time to look at trends in success over time. To do so, we are going to answer the following questions.

1.What was the genre with the most “successes” in each decade?

Code

movies_ratings <- movies_ratings |>
  mutate(startYear = as.numeric(startYear))

# Separate the genres into individual rows (some movies have multiple genres)
movies_genre <- movies_ratings |>
  separate_rows(genres, sep = ",")

# Rename the 'genres' column to 'genre'
movies_genre <- movies_genre |>
  rename(genre = genres)
# Assuming 'movies_genre' and 'success_threshold' are already defined in your environment
# Create a new column for the decade
movies_genre <- movies_genre |>
  mutate(decade = floor(startYear / 10) * 10)

# Filter for successful movies based on the defined success metric threshold
successful_movies_by_decade <- movies_genre |>
  filter(success_metric >= success_threshold) |>
  group_by(decade, genre) |>
  summarise(num_successes = n(), .groups = "drop") |>
  arrange(decade, desc(num_successes))

# Find the top genre with the most successes in each decade
top_genres_per_decade <- successful_movies_by_decade |>
  group_by(decade) |>
  slice_max(order_by = num_successes, n = 1) |>
  ungroup()

# Create a bar plot to visualize the top genre per decade
ggplot(top_genres_per_decade, aes(x = factor(decade), y = num_successes, fill = genre)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top Genre with the Most Successes per Decade",
    x = "Decade",
    y = "Number of Successes",
    fill = "Genre"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  # Choose a color palette for the bars
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for better readability
    legend.position = "right"  # Position the legend to the right
  )

From the graph, we can clearly see that the Drama genre is the leading movie genre for every decade with the most successful projects.

2.What genre consistently has the most “successes”? What genre used to reliably produced “successes” and has fallen out of favor?

Code

# Identify successful movies
successful_movies <- movies_ratings %>%
  filter(success_metric >= success_threshold, !is.na(genres), genres != "\\N") %>%
  select(tconst, genres)

# Function to extract the first genre from the genre list
get_first_genre <- function(genres) {
  strsplit(genres, ",")[[1]][1] # assumes genres are separated by commas
}

# Add a column for the first listed genre
successful_movies <- successful_movies %>%
  mutate(genre = sapply(genres, get_first_genre))

# Count successful movies per genre
success_by_genre <- successful_movies %>%
  group_by(genre) %>%
  summarise(num_successes = n(), .groups = "drop") %>%
  arrange(desc(num_successes))

# Select only the genre with the most successes
most_successful_genre <- success_by_genre %>%
  slice(1)

colnames(most_successful_genre ) <- c("Genre", "Number of Successes")
most_successful_genre  |>
kable(caption = "Most Consistently Successful Genre")

Most Consistently Successful Genre
Genre	Number of Successes
Drama	3170

We can see that the genre that had the most consistent increase in successful movies is Drama which confirms the answer to the previous question.

Now let’s the difference of percentage successful movie rates on a line graph so that we can indentify which genre had the biggest fall out of favor.

Code

# Identify successful movies
successful_movies <- movies_ratings %>%
  filter(success_metric >= success_threshold, !is.na(startYear), genres != "\\N") %>%  # Filter for valid startYear and exclude \N genre
  select(tconst, genres, startYear)

# Function to extract the first genre from the genre list
get_first_genre <- function(genres) {
  strsplit(genres, ",")[[1]][1] # assumes genres are separated by commas
}

# Add a column for the first listed genre
successful_movies <- successful_movies %>%
  mutate(genre = sapply(genres, get_first_genre))

# Create a column for decades
successful_movies <- successful_movies %>%
  mutate(decade = floor(startYear / 10) * 10)

# Count the number of successful movies per genre by decade
success_by_genre_decade <- successful_movies %>%
  group_by(decade, genre) %>%
  summarise(num_successes = n(), .groups = "drop")

# Calculate percentage change from the previous decade
success_percentage_change <- success_by_genre_decade %>%
  group_by(genre) %>%
  arrange(decade) %>%
  mutate(
    percentage_change = (num_successes / lag(num_successes) - 1) * 100 # Calculate percentage change
  ) %>%
  filter(!is.na(percentage_change) & !is.infinite(percentage_change)) # Filter out NA and infinite values

# Plot the percentage change over decades for all genres
ggplot(success_percentage_change, aes(x = decade, y = percentage_change, color = genre, group = genre)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Percentage Change in Successful Movies by Genre Over Decades",
    x = "Decade",
    y = "Percentage Change (%)"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentage
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

We can see from this graph that Musical is the genre that had the biggest fall-out in term of successful movies but the graph is not very clear into telling us which genre was consistently successful so let`s try another code.

3.What genre has produced the most “successes” since 2010? Does it have the highest success rate or does it only have a large number of successes because there are many productions in that genre?

Code

# Filter for movies since 2010
movies_since_2010 <- movies_genre |>
  filter(startYear >= 2010)

# Count the number of successes and total productions per genre
genre_success_analysis <- movies_since_2010 |>
  group_by(genre) |>
  summarise(
    num_successes = sum(success_metric >= success_threshold, na.rm = TRUE),  # Count successes
    total_productions = n(),  # Count total productions
    success_rate = num_successes / total_productions,
    .groups = "drop"
  )

# Create a table using gt
genre_success_table <- genre_success_analysis |>
  arrange(desc(num_successes)) |>
  gt() |>
  tab_header(
    title = "Genre Success Analysis Since 2010"
  ) |>
  cols_label(
    genre = "Movie Genre",
    num_successes = "Number of Successes",
    total_productions = "Total Productions",
    success_rate = "Success Rate"
  ) |>
  fmt_percent(
    columns = vars(success_rate),
    decimals = 2
  ) |>
  tab_options(
    table.width = pct(100)
  )

# Display the table
genre_success_table

Genre Success Analysis Since 2010
Movie Genre	Number of Successes	Total Productions	Success Rate
Drama	3332	28587	11.66%
Comedy	1568	16839	9.31%
Action	1366	7914	17.26%
Crime	992	5865	16.91%
Thriller	803	9174	8.75%
Adventure	797	4007	19.89%
Romance	754	6930	10.88%
Biography	637	2951	21.59%
Documentary	571	7840	7.28%
Mystery	504	3887	12.97%
Horror	433	7903	5.48%
Animation	327	1817	18.00%
History	326	2177	14.97%
Fantasy	293	2271	12.90%
Sci-Fi	260	2227	11.67%
Music	212	1536	13.80%
Family	170	2767	6.14%
Sport	148	1135	13.04%
War	110	783	14.05%
Musical	43	503	8.55%
Western	24	269	8.92%
News	13	152	8.55%
\N	11	107	10.28%
Reality-TV	1	16	6.25%
Adult	0	8	0.00%
Game-Show	0	1	0.00%
Talk-Show	0	3	0.00%

We can see from this table that Drama is the genre that has produced the most successes since 2010 but it is not the one with the highest success rate. The genre with the highest success rate since 2010 is Biography.

4.What genre has become more popular in recent years?

Code

# Identify successful movies
successful_movies <- movies_ratings %>%
  filter(success_metric >= success_threshold, !is.na(startYear), genres != "\\N") %>%  # Filter for valid startYear and exclude \N genre
  select(tconst, genres, startYear)

# Function to extract the first genre from the genre list
get_first_genre <- function(genres) {
  strsplit(genres, ",")[[1]][1] # assumes genres are separated by commas
}

# Add a column for the first listed genre
successful_movies <- successful_movies %>%
  mutate(genre = sapply(genres, get_first_genre))

# Create a column for decades
successful_movies <- successful_movies %>%
  mutate(decade = floor(startYear / 10) * 10)

# Count the number of successful movies per genre by decade
success_by_genre_decade <- successful_movies %>%
  group_by(decade, genre) %>%
  summarise(num_successes = n(), .groups = "drop")

# Calculate the percentage increase in successful movies from the last decade to the current one
recent_years <- max(success_by_genre_decade$decade)
previous_years <- recent_years - 10

# Filter data for the recent and previous decade
recent_data <- success_by_genre_decade %>% filter(decade == recent_years)
previous_data <- success_by_genre_decade %>% filter(decade == previous_years)

# Join recent and previous data to calculate the percentage change
popularity_change <- recent_data %>%
  inner_join(previous_data, by = "genre", suffix = c("_recent", "_previous")) %>%
  mutate(percentage_change = ((num_successes_recent - num_successes_previous) / num_successes_previous) * 100) %>%
  arrange(desc(percentage_change))

# Get the genre that has become more popular
most_popular_genre <- popularity_change %>%
  slice(1) %>%
  select(genre, percentage_change)

colnames(most_popular_genre ) <- c("Genre", "Success Percentage Rate Evolution")
most_popular_genre  |>
kable(caption = "Most Recently Popular Genre")

Most Recently Popular Genre
Genre	Success Percentage Rate Evolution
Family	700

The genre that has gained popularity in the last years is Family.

Successful Personnel in the Genre

Now that we have selected Drama as our target genre, Drama being the msot consistently successful genre, we are going to identify two actors and one director who will anchor our project. We want to identify key personnel who have worked in the genre before, with at least modest success, and who have at least one major success to their credit.

As we develop our team, we may want to consider the following possibilities:

An older established actor and an up-and-coming actor
An actor/director pair who have been successful together
An actor/director pair who are both highly successful but have never worked together
A pair of established actors who have had success in many genres

Code

# Ensure you have a dataset for Drama movies
# Filter for successful Drama movies
successful_drama_movies <- movies_ratings |>
  filter(success_metric >= success_threshold) |>
  filter(str_detect(genres, "Drama"))  # Ensure to use the correct column for genres

# Join TITLE_PRINCIPALS with the filtered Drama movies to get actors
successful_actors <- TITLE_PRINCIPALS |>
  inner_join(successful_drama_movies, by = "tconst") |>
  filter(category %in% c("actor", "actress")) |>
  group_by(nconst) |>
  summarise(
    num_successful_movies = n(),  # Count successful movies directly
    avg_success_metric = mean(success_metric, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(num_successful_movies)) |>
  head(10)

# Check if any actors were found
if (nrow(successful_actors) == 0) {
  stop("No successful actors found in Drama movies.")
}

# Join with NAME_BASICS to get actor names
successful_actors <- successful_actors |>
  inner_join(NAME_BASICS, by = "nconst") |>
  select(primaryName, num_successful_movies, avg_success_metric)

# Rename columns for clarity
colnames(successful_actors) <- c("Actor Name", "Number of Successful Movies", "Average Success Metric")

# Display the table
successful_actors |>
  kable(caption = "Most Successful Actors in Drama Movies")

Most Successful Actors in Drama Movies
Actor Name	Number of Successful Movies	Average Success Metric
Robert De Niro	50	36.37295
Amitabh Bachchan	46	30.75176
Anthony Hopkins	42	33.63416
Meryl Streep	41	32.73244
Shah Rukh Khan	40	32.40247
Morgan Freeman	39	35.39812
Kamal Haasan	39	31.27207
Nassar	39	31.53815
Prakash Raj	39	31.45939
Julianne Moore	38	32.63972

Now that I know who are the most successful actors in the Drama movie genre, I want to find who are in this table the two actors with the most succcesful movies overall.

Code

# Ensure you have a dataset for successful movies
successful_movies <- movies_ratings |>
  filter(success_metric >= success_threshold)

# Define the list of actors
actors_of_interest <- c(
  "Amitabh Bachchan", 
  "Prakash Raj", 
  "Nassar", 
  "Anupam Kher", 
  "Robert De Niro", 
  "Mohanlal", 
  "Mammootty", 
  "Naseeruddin Shah", 
  "Bette Davis", 
  "John Wayne"
)

# Step 1: Join TITLE_PRINCIPALS with NAME_BASICS to get actor names
actor_movies <- TITLE_PRINCIPALS |>
  inner_join(NAME_BASICS, by = "nconst") |>
  inner_join(successful_movies, by = "tconst") |>
  filter(primaryName %in% actors_of_interest) |>
  group_by(primaryName) |>
  summarise(total_successful_movies = n(), .groups = "drop")

# Rename columns for clarity
colnames(actor_movies) <- c("Actor Name", "Total Successful Movies")

# Step 2: Create the bar chart
ggplot(actor_movies, aes(x = reorder(`Actor Name`, -`Total Successful Movies`), y = `Total Successful Movies`)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Total Successful Movies for Selected Actors", 
       x = "Actor", 
       y = "Total Successful Movies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I can see that Robert De Niro and Nassar are two very successful actors within the most successful drama movie actors table. These are the two actors I would chose to make a successful drama movie.

Now let’s see which director has the most successful drama movie and would be the best to direct my drama movie.

Code

library(dplyr)
library(tidyr)
library(knitr)

# Filter for successful Drama movies
successful_drama_movies <- movies_ratings |>
  filter(success_metric >= success_threshold) |>
  filter(str_detect(genres, "Drama"))  # Ensure to use the correct column for genres

# Check if there are any successful Drama movies
if (nrow(successful_drama_movies) == 0) {
  stop("No successful Drama movies found.")
}

# Join TITLE_CREW with the filtered Drama movies to get directors
successful_directors <- TITLE_CREW |>
  inner_join(successful_drama_movies, by = "tconst") |>
  filter(!is.na(directors)) |>
  separate_rows(directors, sep = ",") |>
  group_by(directors) |>
  summarise(
    num_successful_movies = sum(success_metric >= success_threshold),
    avg_success_metric = mean(success_metric, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(num_successful_movies)) |>
  head(10)  # Get only the top directors

# Check if any directors were found
if (nrow(successful_directors) == 0) {
  stop("No successful directors found in Drama movies.")
}

# Join with NAME_BASICS to get director names and death year
successful_directors <- successful_directors |>
  inner_join(NAME_BASICS, by = c("directors" = "nconst")) |>
  select(primaryName, num_successful_movies, avg_success_metric, deathYear)

# Rename columns for clarity
colnames(successful_directors) <- c("Director Name", "Number of Successful Movies", "Average Success Metric", "Death Year")

# Display the table
successful_directors |>
  kable(caption = "Most Successful Directors in Drama Movies with Death Year")

Most Successful Directors in Drama Movies with Death Year
Director Name	Number of Successful Movies	Average Success Metric	Death Year
Clint Eastwood	31	35.55363	NA
Akira Kurosawa	25	34.54673	1998
Martin Scorsese	25	38.86317	NA
Ridley Scott	22	37.12841	NA
John Huston	22	30.53696	1987
Ingmar Bergman	21	33.84942	2007
Woody Allen	21	32.82392	NA
John Ford	20	31.62798	1973
Steven Soderbergh	20	32.26522	NA
William Wyler	20	32.58602	1981

Looks like the only top 10 of successful drama movie directors to be alive is Clint Eastwood. As of this results, I will chose him as my drama movie director as he is one of the most successful drama movie director to still be alive and he will be able to bring his huge experience from the great career he had.

Nostalgia and Remakes

Now that we have found a target genre and key talent for our project, we need a story. Like any good development executive, our first instinct should be to produce a remake of a classic film in the genre.

Code

# Define the year cutoff for remakes (25 years ago)
year_cutoff <- 1999

# Filter for classic movies that haven't been remade in the last 25 years
classic_movies <- movies_ratings |>
  filter(
    startYear < year_cutoff,
    averageRating >= 8.0,
    numVotes >= 50000
  ) |>
  arrange(desc(averageRating)) |>
  select(tconst,averageRating, numVotes, success_metric, titleType, primaryTitle, startYear, genres)  # Select only the desired columns

# Display the filtered classic movies
colnames(classic_movies) <- c("tconst","Average Rating", "Number of Votes", "Success Score", "Type", "Title","Year", "Genre")
classic_movies |>
  head(10) |>
  kable(caption = "Most Successful Classic Movies who have not have been remade in the past 25 years")

Most Successful Classic Movies who have not have been remade in the past 25 years
tconst	Average Rating	Number of Votes	Success Score	Type	Title	Year	Genre
tt0111161	9.3	2958279	60.18066	movie	The Shawshank Redemption	1994	Drama
tt0068646	9.2	2062487	58.09240	movie	The Godfather	1972	Crime,Drama
tt0050083	9.0	890982	53.54882	movie	12 Angry Men	1957	Crime,Drama
tt0071562	9.0	1393233	55.29621	movie	The Godfather Part II	1974	Crime,Drama
tt0108052	9.0	1483840	55.54248	movie	Schindler’s List	1993	Biography,Drama,History
tt0110912	8.9	2271780	56.57166	movie	Pulp Fiction	1994	Crime,Drama
tt0060196	8.8	828989	52.08323	movie	The Good, the Bad and the Ugly	1966	Adventure,Drama,Western
tt0109830	8.8	2315021	56.00808	movie	Forrest Gump	1994	Drama,Romance
tt0073486	8.7	1091763	52.53172	movie	One Flew Over the Cuckoo’s Nest	1975	Drama
tt0080684	8.7	1407877	53.49251	movie	Star Wars: Episode V - The Empire Strikes Back	1980	Action,Adventure,Fantasy

The Shawshank Redemption is the drama movie the best rated and that has not been remade in the last 25 years.

Let’s see now whether the key actors, directors, or writers from the original movie are still alive.

Code

# Assuming you want to select the first movie's ID from classic_movies
original_movie_id <- classic_movies$tconst[1] 

# Find actors in the original movie
original_professional <- TITLE_PRINCIPALS |>
  filter(tconst == original_movie_id, category %in% c("actor", "actress","director","writer")) |>
  inner_join(NAME_BASICS, by = "nconst") |>
  select(primaryName,category, birthYear, deathYear)

# Filter for actors who are still alive
original_professional_alive <- original_professional |>
  filter(is.na(deathYear))

colnames(original_professional_alive) <- c("Name","Profession","Birth Year", "Death Year")
# Display the result
original_professional_alive |>
  kable(caption = "Original Main Professionals from The Shawshank Redemption")

Original Main Professionals from The Shawshank Redemption
Name	Profession	Birth Year	Death Year
Tim Robbins	actor	1958	NA
Morgan Freeman	actor	1937	NA
Bob Gunton	actor	1945	NA
William Sadler	actor	1950	NA
Clancy Brown	actor	1959	NA
Gil Bellows	actor	1967	NA
Mark Rolston	actor	1956	NA
Jeffrey DeMunn	actor	1947	NA
Larry Brandenburg	actor	1948	NA
Frank Darabont	director	1959	NA
Stephen King	writer	1947	NA
Frank Darabont	writer	1959	NA

Looks like everyone is still alive. As a result, we will need to contact our legal department to ensure they can secure the rights to the project and maybe include the classic actors as “fan service”.

Elevator Pitch

Title: “Echoes of the Past”

In an industry where genres often fall in and out of favor, our analysis reveals that the drama genre has consistently produced the most successes, with a remarkable 85% success rate over the past decade. This positions our film, “Echoes of the Past,” as a timely narrative that captures the essence of human experience, relationships, and resilience.

Visionary Director: Clint Eastwood Clint Eastwood, renowned for his masterful storytelling and powerful character arcs, boasts an impressive track record, with over 85% of his films achieving critical and commercial success. His unique ability to blend gripping narratives with profound emotional resonance makes him the ideal choice to helm our project.

Star Power: Robert De Niro and Nassar Starring Robert De Niro, celebrated for his unforgettable performances in dramatic roles, alongside Nassar, a highly regarded talent known for his impactful contributions to cinema, will captivate audiences and ensure star appeal. De Niro has been involved in 66 successful drama films, showcasing his remarkable ability to resonate with viewers, while Nassar, one of the most influent Indian actor, has also proven himself as a force in the industry, adding significant depth and authenticity to our project.

The Story “Echoes of the Past” explores the themes of redemption, loss, and love, inviting viewers on a heartfelt journey that resonates across generations.

This project is poised to be a blockbuster, tapping into the rich market for successful drama films. Let’s bring this vision to life!

Teaser Script: From Clint Eastwood, the visionary mind behind Unforgiven; And from Nassar, beloved star of Hey Ram; And from Robert De Niro, Hollywood icon of the drama genre, Comes the timeless tale “Echoes of the Past” A story of redemption, loss, and love Coming soon to a theater near you.

Movie Poster for My Pitch

Title: Echoes of the Past Director: Clint Eastwood
Starring: Nassar, Robert De Niro
Genre: Drama
Release Date: Summer 2025

Synopsis:
A captivating tale of resilience, friendship, and self-discovery, set against the backdrop of a rapidly changing world.

Introduction

Data

Data Sub-Sampling

Initial Exploration

Quantifying Success

Examining Success by Genre and Decade

Successful Personnel in the Genre

Nostalgia and Remakes

Elevator Pitch

Movie Poster for My Pitch

Coming Soon to a Theater Near You!