The TMDb data provides details of 10,000 movies. The data include details like casts, revenue, budget, popularity etc. This data can be analysed to find interesting pattern between movies.
We can use this data to try to answer following questions:
# import necessary libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
# Load TMDb data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
tmdb_movies = pd.read_csv('tmdb-movies.csv')
tmdb_movies.head()
tmdb_movies.describe()
As evident from the data, it seems we have cast of the movie as string separated by |
symbol. This needs to be converted into a suitable type in order to consume it properly later.
# Pandas read empty string value as nan, make it empty string
tmdb_movies.cast.fillna('', inplace=True)
tmdb_movies.genres.fillna('', inplace=True)
tmdb_movies.director.fillna('', inplace=True)
tmdb_movies.production_companies.fillna('', inplace=True)
def string_to_array(data):
"""
This function returns given splitss the data by separator `|` and returns the result as array
"""
return data.split('|')
tmdb_movies.cast = tmdb_movies.cast.apply(string_to_array)
tmdb_movies.genres = tmdb_movies.genres.apply(string_to_array)
tmdb_movies.director = tmdb_movies.director.apply(string_to_array)
tmdb_movies.production_companies = tmdb_movies.production_companies.apply(string_to_array)
It's evident from observations below that there is no clear trend in change in mean revenue over years.
def yearly_growth(mean_revenue):
return mean_revenue - mean_revenue.shift(1).fillna(0)
# Show change in mean revenue over years, considering only movies for which we have revenue data
movies_with_budget = tmdb_movies[tmdb_movies.budget_adj > 0]
movies_with_revenue = movies_with_budget[movies_with_budget.revenue_adj > 0]
revenues_over_years = movies_with_revenue.groupby('release_year').sum()
revenues_over_years.apply(yearly_growth)['revenue'].plot()
revenues_over_years[['budget_adj', 'revenue_adj']].plot()
def log(data):
return np.log(data)
movies_with_revenue[['budget_adj', 'revenue_adj']].apply(log) \
.sort_values(by='budget_adj').set_index('budget_adj')['revenue_adj'].plot(figsize=(20,6))
Since popularity column indicates all time popularity of the movie, it might not be the right metric to measure popularity over years. We can measure popularty of a movie based on average vote. I think a movie is popular if vote_average >= 7
.
On analyzing the popular movies since 1960(check illustrations below), following onservations can be made:
Drama
genreComedy
, Action
and Adventure
got popular.Documentry
, Action
and Animation
movies got more popularity.def popular_movies(movies):
return movies[movies['vote_average']>=7]
def group_by_genre(data):
"""
This function takes a Data Frame having and returns a dictionary having
release_year as key and value a dictionary having key as movie's genre
and value as frequency of the genre that year.
"""
genres_by_year = {}
for (year, position), genres in data.items():
for genre in genres:
if year in genres_by_year:
if genre in genres_by_year[year]:
genres_by_year[year][genre] += 1
else:
genres_by_year[year][genre] = 1
else:
genres_by_year[year] = {}
genres_by_year[year][genre] = 1
return genres_by_year
def plot(genres_by_year):
"""
This function iterates over each row of Data Frame and prints rows
having release_year divisible by 5 to avoid plotting too many graphs.
"""
for year, genres in genres_by_year.items():
if year%5 == 0:
pd.DataFrame(grouped_genres[year], index=[year]).plot(kind='bar', figsize=(20, 6))
# Group movies by genre for each year and try to find the correlations
# of genres over years.
grouped_genres = group_by_genre(tmdb_movies.groupby('release_year').apply(popular_movies).genres)
plot(grouped_genres)
We can consider those movies with at least 1 billion revenue and see what are common properties among them.
Considering this criteria and based on illustrations below, we can make following observations about highest grossing movies:
Adventure
and Action
are most common genres among these movies followed by Science Fiction
, Fantasy
and Family
.Steven Spielberg
and Peter Jackson
are directors who have highest muber of movies having at least 1 billion revenue.Warner Bros.
, Walt Disney
, Fox Film
and Universal picture
seems to have figured out the secret of highest grossing movies. They have highest number of at least a billion revenue movies. This does not mean all their movies have pretty high revenue.highest_grossing_movies = tmdb_movies[tmdb_movies['revenue_adj'] >= 1000000000]\
.sort_values(by='revenue_adj', ascending=False)
highest_grossing_movies.head()
def count_frequency(data):
frequency_count = {}
for items in data:
for item in items:
if item in frequency_count:
frequency_count[item] += 1
else:
frequency_count[item] = 1
return frequency_count
highest_grossing_genres = count_frequency(highest_grossing_movies.genres)
print(highest_grossing_genres)
pd.DataFrame(highest_grossing_genres, index=['Genres']).plot(kind='bar', figsize=(20, 8))
highest_grossing_movies.vote_average.hist()
def list_to_dict(data, label):
"""
This function creates returns statistics and indices for a data frame
from a list having label and value.
"""
statistics = {label: []}
index = []
for item in data:
statistics[label].append(item[1])
index.append(item[0])
return statistics, index
import operator
high_grossing_dirs = count_frequency(highest_grossing_movies.director)
revenues, indexes = list_to_dict(sorted(high_grossing_dirs.items(), key=operator.itemgetter(1), reverse=True)[:20], 'revenue')
pd.DataFrame(revenues, index=indexes).plot(kind='bar', figsize=(20, 5))
high_grossing_cast = count_frequency(highest_grossing_movies.cast)
revenues, index = list_to_dict(sorted(high_grossing_cast.items(), key=operator.itemgetter(1), reverse=True)[:30], 'number of movies')
pd.DataFrame(revenues, index=index).plot(kind='bar', figsize=(20, 5))
high_grossing_prod_comps = count_frequency(highest_grossing_movies.production_companies)
revenues, index = list_to_dict(sorted(high_grossing_prod_comps.items(), key=operator.itemgetter(1), reverse=True)[:30]\
, 'number of movies')
pd.DataFrame(revenues, index=index).plot(kind='bar', figsize=(20, 5))
We can see the top 30 highest grossing directors in bar chart below.
It seems Steven Spielberg surpasses other directors in gross revenue.
def grossing(movies, by):
"""
This function returns the movies' revenues over key passed as `by` value in argument.
"""
revenues = {}
for id, movie in movies.iterrows():
for key in movie[by]:
if key in revenues:
revenues[key].append(movie.revenue_adj)
else:
revenues[key] = [movie.revenue_adj]
return revenues
def gross_revenue(data):
"""
This functions computes the sum of values of the dictionary and
return a new dictionary with same key but cummulative value.
"""
gross = {}
for key, revenues in data.items():
gross[key] = np.sum(revenues)
return gross
gross_by_dirs = grossing(movies=movies_with_revenue, by='director')
director_gross_revenue = gross_revenue(gross_by_dirs)
top_15_directors = sorted(director_gross_revenue.items(), key=operator.itemgetter(1), reverse=True)[:15]
revenues, index = list_to_dict(top_15_directors, 'director')
pd.DataFrame(data=revenues, index=index).plot(kind='bar', figsize=(15, 9))
We can find the top 30 actors based on gross revenue as shown in subsequent sections below.
As we can see Harison Ford tops the chart with highest grossing.
gross_by_actors = grossing(movies=tmdb_movies, by='cast')
actors_gross_revenue = gross_revenue(gross_by_actors)
top_15_actors = sorted(actors_gross_revenue.items(), key=operator.itemgetter(1), reverse=True)[:15]
revenues, indexes = list_to_dict(top_15_actors, 'actors')
pd.DataFrame(data=revenues, index=indexes).plot(kind='bar', figsize=(15, 9))
We tried to answer quite a few questions using TMDb data whch are summarised below:
Drama
genre is common traits among popular movies with Comedy
, Action
, Adventure
and Animation
getting popular in recent years.Adventure
& Action
genres. Also Highest grossing movies are correlated with high popularity.