Project: Investigate TMDb Movie Data

Table of Contents

Introduction

The TMDb data provides details of 10,000 movies. The data include details like casts, revenue, budget, popularity etc. This data can be analysed to find interesting pattern between movies.

We can use this data to try to answer following questions:

  • What is the yearly revenue change?
  • Which genres are most popular from year to year?
  • What kinds of properties are associated with movies that have high revenues?
  • Who are top 15 highest grossing directors?
  • Who are top 15 highest grossing actors?
In [1]:
# import necessary libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

Data Wrangling

General Properties

In [2]:
# Load TMDb data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

tmdb_movies = pd.read_csv('tmdb-movies.csv')
tmdb_movies.head()
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

In [3]:
tmdb_movies.describe()
Out[3]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

Data Cleaning

As evident from the data, it seems we have cast of the movie as string separated by | symbol. This needs to be converted into a suitable type in order to consume it properly later.

In [4]:
# Pandas read empty string value as nan, make it empty string
tmdb_movies.cast.fillna('', inplace=True)
tmdb_movies.genres.fillna('', inplace=True)
tmdb_movies.director.fillna('', inplace=True)
tmdb_movies.production_companies.fillna('', inplace=True)
In [5]:
def string_to_array(data):
    """
        This function returns given splitss the data by separator `|` and returns the result as array
    """
    return data.split('|')

Convert cast, genres, director and production_companies columns to array

In [6]:
tmdb_movies.cast = tmdb_movies.cast.apply(string_to_array)
In [7]:
tmdb_movies.genres = tmdb_movies.genres.apply(string_to_array)
In [8]:
tmdb_movies.director = tmdb_movies.director.apply(string_to_array)
In [9]:
tmdb_movies.production_companies = tmdb_movies.production_companies.apply(string_to_array)

Exploratory Data Analysis

Research Question 1: What is the yearly revenue change?

It's evident from observations below that there is no clear trend in change in mean revenue over years.

  • Mean revenue from year to year is quite unstable. This can be attributed to number of movies and number of movies having high or low revenue
  • The gap between budget and revenue have widened after 2000. This can be attributed to circulation of movies worldwide compared to earlier days.
  • There seems to be a correlation between gross budget and gross revenue over years.
  • When log of revenue_adj is plotted against log of budget_adj, we can see a clear correlation between revenue of a movie against the budget
In [10]:
def yearly_growth(mean_revenue):
    return mean_revenue - mean_revenue.shift(1).fillna(0)
In [11]:
# Show change in mean revenue over years, considering only movies for which we have revenue data
movies_with_budget = tmdb_movies[tmdb_movies.budget_adj > 0]
movies_with_revenue = movies_with_budget[movies_with_budget.revenue_adj > 0]

revenues_over_years = movies_with_revenue.groupby('release_year').sum()
revenues_over_years.apply(yearly_growth)['revenue'].plot()
revenues_over_years[['budget_adj', 'revenue_adj']].plot()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b6f62358>
In [12]:
def log(data):
    return np.log(data)
In [13]:
movies_with_revenue[['budget_adj', 'revenue_adj']].apply(log) \
.sort_values(by='budget_adj').set_index('budget_adj')['revenue_adj'].plot(figsize=(20,6))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b6e8aac8>

Since popularity column indicates all time popularity of the movie, it might not be the right metric to measure popularity over years. We can measure popularty of a movie based on average vote. I think a movie is popular if vote_average >= 7.

On analyzing the popular movies since 1960(check illustrations below), following onservations can be made:

  • Almost all popular movies have Drama genre
  • Over years Comedy, Action and Adventure got popular.
  • In recent years, Documentry, Action and Animation movies got more popularity.
In [14]:
def popular_movies(movies):
    return movies[movies['vote_average']>=7]
In [15]:
def group_by_genre(data):
    """
        This function takes a Data Frame having and returns a dictionary having
        release_year as key and value a dictionary having key as movie's genre
        and value as frequency of the genre that year.
    """
    genres_by_year = {}
    for (year, position), genres in data.items():
        for genre in genres:
            if year in genres_by_year:
                if genre in genres_by_year[year]:
                    genres_by_year[year][genre] += 1
                else:
                    genres_by_year[year][genre] = 1
            else:
                genres_by_year[year] = {}
                genres_by_year[year][genre] = 1
    return genres_by_year
            
In [16]:
def plot(genres_by_year):
    """
        This function iterates over each row of Data Frame and prints rows
        having release_year divisible by 5 to avoid plotting too many graphs.
    """
    for year, genres in genres_by_year.items():
        if year%5 == 0:
            pd.DataFrame(grouped_genres[year], index=[year]).plot(kind='bar', figsize=(20, 6))
In [17]:
# Group movies by genre for each year and try to find the correlations
# of genres over years.
grouped_genres = group_by_genre(tmdb_movies.groupby('release_year').apply(popular_movies).genres)
plot(grouped_genres)

Research Question 3: What kinds of properties are associated with movies that have high revenues?

We can consider those movies with at least 1 billion revenue and see what are common properties among them.

Considering this criteria and based on illustrations below, we can make following observations about highest grossing movies:

  • Adventure and Action are most common genres among these movies followed by Science Fiction, Fantasy and Family.
  • Most of the movies have more than 7 average vote, some movies have less than 7 but that is because of less number of total votes. This means highest grossing movies are popular as well.
  • Steven Spielberg and Peter Jackson are directors who have highest muber of movies having at least 1 billion revenue.
  • Most of the directors have only one movies having at least a billion revenue, hence there seems to be no corelation between highest grossing movies and directors.
  • Most of the cast have one movie having at least a billion revenue.
  • Warner Bros., Walt Disney, Fox Film and Universal picture seems to have figured out the secret of highest grossing movies. They have highest number of at least a billion revenue movies. This does not mean all their movies have pretty high revenue.
In [18]:
highest_grossing_movies = tmdb_movies[tmdb_movies['revenue_adj'] >= 1000000000]\
                            .sort_values(by='revenue_adj', ascending=False)
highest_grossing_movies.head()
Out[18]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
1386 19995 tt0499549 9.432768 237000000 2781505847 Avatar [Sam Worthington, Zoe Saldana, Sigourney Weave... http://www.avatarmovie.com/ [James Cameron] Enter the World of Pandora. ... In the 22nd century, a paraplegic Marine is di... 162 [Action, Adventure, Fantasy, Science Fiction] [Ingenious Film Partners, Twentieth Century Fo... 12/10/09 8458 7.1 2009 2.408869e+08 2.827124e+09
1329 11 tt0076759 12.037933 11000000 775398007 Star Wars [Mark Hamill, Harrison Ford, Carrie Fisher, Pe... http://www.starwars.com/films/star-wars-episod... [George Lucas] A long time ago in a galaxy far, far away... ... Princess Leia is captured and held hostage by ... 121 [Adventure, Action, Science Fiction] [Lucasfilm, Twentieth Century Fox Film Corpora... 3/20/77 4428 7.9 1977 3.957559e+07 2.789712e+09
5231 597 tt0120338 4.355219 200000000 1845034188 Titanic [Kate Winslet, Leonardo DiCaprio, Frances Fish... http://www.titanicmovie.com/menu.html [James Cameron] Nothing on Earth could come between them. ... 84 years later, a 101-year-old woman named Ros... 194 [Drama, Romance, Thriller] [Paramount Pictures, Twentieth Century Fox Fil... 11/18/97 4654 7.3 1997 2.716921e+08 2.506406e+09
10594 9552 tt0070047 2.010733 8000000 441306145 The Exorcist [Linda Blair, Max von Sydow, Ellen Burstyn, Ja... http://theexorcist.warnerbros.com/ [William Friedkin] Something almost beyond comprehension is happe... ... 12-year-old Regan MacNeil begins to adapt an e... 122 [Drama, Horror, Thriller] [Warner Bros., Hoya Productions] 12/26/73 1113 7.2 1973 3.928928e+07 2.167325e+09
9806 578 tt0073195 2.563191 7000000 470654000 Jaws [Roy Scheider, Robert Shaw, Richard Dreyfuss, ... http://www.jaws25.com/ [Steven Spielberg] Don't go in the water. ... An insatiable great white shark terrorizes the... 124 [Horror, Thriller, Adventure] [Universal Pictures, Zanuck/Brown Productions] 6/18/75 1415 7.3 1975 2.836275e+07 1.907006e+09

5 rows × 21 columns

Find common genres in highest grossing movies

In [19]:
def count_frequency(data):
    frequency_count = {}
    for items in data:
        for item in items:
            if item in frequency_count:
                frequency_count[item] += 1
            else:
                frequency_count[item] = 1
    return frequency_count
In [20]:
highest_grossing_genres = count_frequency(highest_grossing_movies.genres)
print(highest_grossing_genres)
pd.DataFrame(highest_grossing_genres, index=['Genres']).plot(kind='bar', figsize=(20, 8))
{'Action': 23, 'Adventure': 32, 'Fantasy': 15, 'Science Fiction': 16, 'Drama': 9, 'Romance': 2, 'Thriller': 9, 'Horror': 2, 'Family': 15, 'Crime': 5, 'Mystery': 1, 'Animation': 8, 'Comedy': 4, 'Music': 1}
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b8ec5208>

Popularity of highest grossing movies

In [21]:
highest_grossing_movies.vote_average.hist()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b6fcd780>

Directors of highest grossing movies

In [22]:
def list_to_dict(data, label):
    """
        This function creates returns statistics and indices for a data frame
        from a list having label and value.
    """
    statistics = {label: []}
    index = []
    for item in data:
        statistics[label].append(item[1])
        index.append(item[0])
    return statistics, index
In [23]:
import operator

high_grossing_dirs = count_frequency(highest_grossing_movies.director)

revenues, indexes = list_to_dict(sorted(high_grossing_dirs.items(), key=operator.itemgetter(1), reverse=True)[:20], 'revenue')

pd.DataFrame(revenues, index=indexes).plot(kind='bar', figsize=(20, 5))
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b78d6e80>

Cast of highest grossing movies

In [24]:
high_grossing_cast = count_frequency(highest_grossing_movies.cast)

revenues, index = list_to_dict(sorted(high_grossing_cast.items(), key=operator.itemgetter(1), reverse=True)[:30], 'number of movies')
pd.DataFrame(revenues, index=index).plot(kind='bar', figsize=(20, 5))
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b73b97b8>

Production companies of highest grossing movies

In [25]:
high_grossing_prod_comps = count_frequency(highest_grossing_movies.production_companies)

revenues, index = list_to_dict(sorted(high_grossing_prod_comps.items(), key=operator.itemgetter(1), reverse=True)[:30]\
                                 , 'number of movies')

pd.DataFrame(revenues, index=index).plot(kind='bar', figsize=(20, 5))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b70b3080>

Highest grossing budget

Research Question 4: Who are top 15 highest grossing directors?

We can see the top 30 highest grossing directors in bar chart below.

It seems Steven Spielberg surpasses other directors in gross revenue.

In [26]:
def grossing(movies, by):
    """
        This function returns the movies' revenues over key passed as `by` value in argument. 
    """
    revenues = {}
    for id, movie in movies.iterrows():
        for key in movie[by]:
            if key in revenues:
                revenues[key].append(movie.revenue_adj)
            else:
                revenues[key] = [movie.revenue_adj]
    return revenues
In [27]:
def gross_revenue(data):
    """
        This functions computes the sum of values of the dictionary and
        return a new dictionary with same key but cummulative value.
    """
    gross = {}
    for key, revenues in data.items():
        gross[key] = np.sum(revenues)
    return gross
In [28]:
gross_by_dirs = grossing(movies=movies_with_revenue, by='director')

director_gross_revenue = gross_revenue(gross_by_dirs)

top_15_directors = sorted(director_gross_revenue.items(), key=operator.itemgetter(1), reverse=True)[:15]
revenues, index = list_to_dict(top_15_directors, 'director')
pd.DataFrame(data=revenues, index=index).plot(kind='bar', figsize=(15, 9))
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b7998a20>

Research Question 5: Who are top 15 highest grossing actors?

We can find the top 30 actors based on gross revenue as shown in subsequent sections below.

As we can see Harison Ford tops the chart with highest grossing.

In [29]:
gross_by_actors = grossing(movies=tmdb_movies, by='cast')
actors_gross_revenue = gross_revenue(gross_by_actors)

top_15_actors = sorted(actors_gross_revenue.items(), key=operator.itemgetter(1), reverse=True)[:15]
revenues, indexes = list_to_dict(top_15_actors, 'actors')
pd.DataFrame(data=revenues, index=indexes).plot(kind='bar', figsize=(15, 9))
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8b7872e10>

Conclusions

We tried to answer quite a few questions using TMDb data whch are summarised below:

  • The gap between gross budget and revenue have increased over years.
  • Drama genre is common traits among popular movies with Comedy, Action, Adventure and Animation getting popular in recent years.
  • There seems to be correlation between highest grossing movies and Adventure & Action genres. Also Highest grossing movies are correlated with high popularity.
  • There also seems to be correlation between a movie's budget and it's revenue. A movie having high budget seems to have a high revenue.