Saturday, June 22, 2013

Warrior Zombies from Outer Space II: Mayhem Unleashed


Given the speed at which I consume them, it's only fitting that the first post on this blog is about movies. (Although, by that logic, it could have equally well been about sandwiches, Nutella, or tissue paper. Note to self: Look for a Nutella consumption dataset.)
Anyway, this post is about movie taglines - specifically, the words that constitute them.
The data is pretty much there for the picking – IMDb hosts a number of freely available¹ datasets, and one of them is about taglines.

The data is in an odd format, but at least it’s all available in one place. After the coding equivalent of jamming the fork into the toaster and jerking it around until something pops, I have the data in a usable structure.

Once here, R’s tm package makes quick work of the word frequency analysis, and I have derived a dataset with common words and their frequencies in movie taglines. After removing some highly frequent English words (articles, pronouns, some prepositions, etc.), here’s a list of the most used words in movie taglines, ordered by frequency:

love, life, story, world, time, film, comedy, death, woman, don’t

Not many surprises there – until we look at the fraction of taglines these terms occur in:

Term      Share of taglines
love      7.5%
life      6.0%
story     5.0%
world     3.8%
time      3.1%
film      2.3%
comedy    2.2%
death     2.1%
woman     2.1%
dont      1.9%


These numbers are way higher than I expected. ‘Love’ alone occurs in a whopping 7.5% of all movie taglines!
Here’s a visual representation of the words you'd have seen most often in movie taglines (the size of each word is proportional to the frequency of its occurrence).

Yes Hollywood, we see right through you.
The R code to parse the data and make the word cloud is available on GitHub if you're interested.
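If you'd rather see the idea than dig through a repository, here's a minimal sketch of the "fraction of taglines containing each word" calculation. It is in Python, not the R/tm code mentioned above, and everything in it is an assumption for illustration: the tiny stopword list, the `tokenize` and `tagline_share` helpers, and the made-up sample taglines.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a
# much fuller list, such as the one shipped with a text-mining package).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "it",
             "he", "she", "you", "for", "his", "her"}


def tokenize(tagline):
    """Lowercase, drop apostrophes (so "don't" becomes "dont"),
    keep runs of letters, and discard stopwords."""
    cleaned = tagline.lower().replace("'", "").replace("’", "")
    return [w for w in re.findall(r"[a-z]+", cleaned) if w not in STOPWORDS]


def tagline_share(taglines):
    """Return {term: fraction of taglines that contain the term}."""
    doc_counts = Counter()
    for tagline in taglines:
        # set() so a word repeated within one tagline is counted once
        doc_counts.update(set(tokenize(tagline)))
    total = len(taglines)
    return {term: count / total for term, count in doc_counts.items()}


# Made-up taglines, purely for illustration (not from the IMDb data)
sample = [
    "A story of love and love lost",
    "Love is the answer",
    "Don't go in the woods",
    "The story of a lifetime",
]

shares = tagline_share(sample)
# Sort by share (descending), breaking ties alphabetically
top = sorted(shares.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top)
```

Counting each tagline once per word (rather than counting every occurrence) is what makes the result readable as "the share of taglines this word shows up in".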

I'm kinda keen to know if this trend has been constant through the years. Let's do the same thing, except looking at the taglines decade by decade. Here’s the list of top 10 words in movie taglines from each decade² – from the forties to the teens (Teens? Onesies? I like ‘onesies’).



1940s      1950s      1960s      1970s      1980s      1990s      2000s      2010s
action     story      love       love       love       love       love       love
love       love       story      story      story      life       life       life
story      world      world      film       life       story      story      story
thrills    terror     picture    time       time       time       world      world
adventure  adventure  film       world      hes        world      time       sometimes
romance    woman      woman      life       world      comedy     sometimes  dont
gun        screen     adventure  movie      comedy     hes        film       time
west       picture    life       death      adventure  murder     dont       film
thrill     gun        time       terror     terror     film       comedy     family
screen     girl       motion     hes        movie      dont       family     cant

Again, I think a visual representation might come in handy.




‘story’ and ‘love’ are part of the Top 10 list in each decade, but the other words are distinctly symbolic of the movies of each era:

  • The 40s are the years of ‘action’, ‘adventure’, ‘thrills’ and ‘west’.
  • The 50s go slightly more romantic, and scale stuff up, adding ‘woman’, ‘girl’, ‘world’ and ‘terror’.
  • In the 60s, ‘girl’ is out, but ‘woman’ is still in. No more ‘gun’ and ‘terror’; instead, it’s about ‘life’ and ‘time’, both of which are here to stay.
  • ‘terror’ makes a comeback in the 70s; ‘adventure’ goes out. And ‘death’ is explored.
  • In the 80s, ‘comedy’ makes the list for the first time.
  • The 90s are the only time ‘murder’ was cool.
  • The 00s (I like to call these the noughties) and (the early part of) the 10s show a distinct change in the values that sell: ‘family’, ‘sometimes’ and ‘cant’ are popular.

I've put up the word clouds for each decade on imgur, and the code to generate them on GitHub.
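For the curious, the decade-by-decade breakdown boils down to bucketing taglines by decade and counting words within each bucket. Here's a rough Python sketch of that idea (again, not the R code I actually used). The helper names `decade_of` and `top_words_by_decade`, the short stopword list, and the sample taglines are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not the list used in the post)
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "it"}


def decade_of(year):
    """Map a year to its decade label, e.g. 1947 -> '1940s'."""
    return f"{year // 10 * 10}s"


def top_words_by_decade(movies, k=10):
    """movies: iterable of (year, tagline) pairs.
    Returns {decade: [up to k terms, most widespread first]}."""
    counts = defaultdict(Counter)
    for year, tagline in movies:
        words = re.findall(r"[a-z]+", tagline.lower().replace("'", ""))
        # Count each tagline once per word it contains
        counts[decade_of(year)].update({w for w in words if w not in STOPWORDS})
    return {
        decade: [w for w, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for decade, c in sorted(counts.items())
    }


# Made-up (year, tagline) pairs, purely for illustration
sample = [
    (1947, "Action in the West"),
    (1949, "Love and action"),
    (1994, "A story of love"),
    (1998, "Love and murder"),
]

result = top_words_by_decade(sample, k=2)
print(result)
```

Ties are broken alphabetically here, which is an arbitrary choice; with real data the counts are large enough that ties near the top are rare.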

So that’s that for frequent words. But I'm also after words that are frequent exclusively in high (or low) rated movies. Or to look at it another way, words that, in retrospect, are indicative of the movie’s success.

One way of doing this is to segment the data into different parts by performance, and do the same analysis as above. But the prior frequencies will likely dominate these lists. What I really want is words whose presence (or absence) is highly indicative of the movie’s rating.

NOTE: Some math to follow. If you're uncomfortable with arithmetic and/or statistics, skip a couple of paragraphs.

For a given term, if D1 is the distribution of movie ratings with the term present in the tagline, and D2 is the distribution of movie ratings with the term absent in the tagline, I'm going to define my divergence/separation metric as³,⁴:

<Obligatory CORRELATION DOES NOT IMPLY CAUSATION warning> 

Adding such words will not automatically make your movie successful – this is offered as a post-event descriptive analysis, not a predictive one. I'm not implying any causality here.


</warning>

This divergence is just a magnitude, with no direction attached - so I had to split the most divergent terms into a ‘good movie’ keywords list and a ‘bad movie’ keywords list.
So, without further math or ado, the 10 terms that correspond to highest ratings:

animation, masterpiece, vision, magnificent, production, startling, french, glorious, smashing, grand         

And the 10 terms that correspond to the lowest ratings:

outer, zombies, ancient, woods, experiment, pray, tonight, mayhem, warrior, unleashed

Again, I've put up code to generate these lists on GitHub.
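To give a flavour of the approach without the full code, here's a small Python sketch that ranks terms by how far apart the "term present" and "term absent" rating distributions sit. Big caveat: this uses a standardized difference of means (the difference in average rating divided by the pooled standard deviation) as a stand-in, which is an assumption on my part and not necessarily the metric defined in this post. It is signed, so the good/bad split falls out of the sort order. The function names, the `min_count` cutoff, and the ratings are all invented for illustration.

```python
import math
from statistics import mean, variance


def signed_separation(with_term, without_term):
    """Stand-in metric (an assumption, not the post's formula):
    difference in mean rating divided by the pooled standard deviation.
    Positive means ratings are higher when the term is present."""
    n1, n2 = len(with_term), len(without_term)
    pooled_var = ((n1 - 1) * variance(with_term)
                  + (n2 - 1) * variance(without_term)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return 0.0  # identical ratings everywhere: no measurable separation
    return (mean(with_term) - mean(without_term)) / math.sqrt(pooled_var)


def rank_terms(movies, min_count=2):
    """movies: list of (rating, set_of_tagline_terms).
    Returns [(term, score)] sorted from most positive to most negative.
    Terms present in (or absent from) fewer than min_count movies are skipped."""
    vocab = set().union(*(terms for _, terms in movies))
    scores = {}
    for term in vocab:
        present = [r for r, terms in movies if term in terms]
        absent = [r for r, terms in movies if term not in terms]
        if len(present) >= min_count and len(absent) >= min_count:
            scores[term] = signed_separation(present, absent)
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Invented ratings and terms, purely for illustration
movies = [
    (8.5, {"masterpiece", "vision"}),
    (8.0, {"masterpiece", "love"}),
    (7.0, {"love"}),
    (3.5, {"zombies", "love"}),
    (3.0, {"zombies", "mayhem"}),
    (6.0, {"story"}),
]

ranked = rank_terms(movies)
for term, score in ranked:
    print(f"{term:12s} {score:+.2f}")
```

The `min_count` filter matters more than it looks: a word that appears in only one or two taglines can land at the top or bottom of the list by pure chance.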

If these lists make you have second thoughts about making Warrior Zombies from Outer Space II: Mayhem Unleashed, don't be disheartened - because like I said earlier, there is certainly a correlation, but it’s not necessarily a causal relationship⁵. And hey, I know a bunch of people who would watch the hell out of that movie.



Footnotes
1 Going through and adhering to the legal clauses for use of the datasets is left as an exercise for the reader.
2 The punctuation has been removed from the data to make the analysis easier. So if you see “cant”, that’s probably “can't”, and so on.
3 It is possible that a better metric might have been used, or even a simpler one, but for some reason, I went with this. Other suggestions are welcome.
4 IMDb ratings are arguably not the best indicators of movie success, but they're certainly one way of estimating it, and there is probably going to be a future post analyzing how reliable a measure this is.
5 EDIT: Revisiting this, the final two lists of words don't seem particularly robust. 



