Wednesday, September 11, 2013

The Happiest Emoticons

Clearly, a :) is happier than a :( but what about a :-* and a :-D ? Or a :-| and a :-o ? In this post I attempt to rank emoticons in order of how happy someone has to be to use each one. (And I punctuate horribly to avoid mixing punctuation with the emoticons.)

To start off, I need a collection of emoticons associated with some text. And where else would I find this but in that gigantic compendium of everyday emotions, the definitive corpus of our age - Twitter.
If you'd rather read the code yourself, it's available here, but the methodology is this: I collect lots of tweets containing emoticons, assign each tweet a 'sentiment' score[1], and then order the emoticons by the average sentiment score of the tweets containing each one.

The tweet gathering process is fairly direct. I parse tweets obtained from the streaming API[2] which contain any of a set of predefined emoticons and write them out to a file. If you want to, have a look at the Python code here. For the purpose of the R analysis, the tweet texts are already in a file. Each line is then (a) parsed for the emoticons it contains, and (b) assigned a sentiment score[3].
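To make steps (a) and (b) concrete, here is a minimal Python sketch of the idea (the actual analysis is done in R, and this is not that code). The tiny word list, the emoticon subset and all function names are made up for illustration; a real lookup file would hold far more words.

```python
from collections import defaultdict

# Hypothetical, tiny word-score lookup; a real lookup file is much larger.
WORD_SCORES = {"good": 1, "fun": 1, "funny": 1, "happy": 1,
               "sad": -1, "bad": -1, "hate": -1}

# Hypothetical subset of the emoticons being tracked.
EMOTICONS = [":-)", ":-(", ":)", ":(", "o.O", ":*", "8D", "xD"]


def find_emoticons(text):
    """Return the tracked emoticons that occur anywhere in the text."""
    return [e for e in EMOTICONS if e in text]


def score_text(text):
    """Sum the lookup scores of the words in the text (unknown words count 0)."""
    words = (w.strip(".,!?&").lower() for w in text.split())
    return sum(WORD_SCORES.get(w, 0) for w in words)


def mean_score_by_emoticon(tweets):
    """Average the tweet scores separately for each emoticon found."""
    scores = defaultdict(list)
    for text in tweets:
        s = score_text(text)
        for e in find_emoticons(text):
            scores[e].append(s)
    return {e: sum(v) / len(v) for e, v in scores.items()}
```

Given the resulting dictionary `means`, the ranking is just `sorted(means, key=means.get, reverse=True)`. Note that a plain substring match like the one in `find_emoticons` will also fire on character sequences that were never meant as emoticons.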

Finally, we plot each tweet on an emoticon-score plot. Like so:

The tiny vertical black lines mark the mean score for each emoticon.
There is no ordering to the colour scale. The colours just help differentiate each row.

The complete data collection, analysis and plotting code can be found on this GitHub repo.

Okay, so here's a list of observations and (partial) explanations for some surprises:
  1. o.O and :* score higher than :-)
    I think the ubiquity of :-) is its burden. People feel :-) for all sorts of reasons. Also, the score for o.O is computed over a much smaller number of tweets, so its mean may be unstable.
  2. I can understand people using :-) at sad stuff, but what kind of a person uses :-( for happy tweets? (There aren't many of these, but a couple of them are too far right.) Let's look at one of those tweets:

    Wow I was sleeping sooooo good which doesn't happen very often & They called from work & woke me up .. Now I can't go back to sleep :-(  

    That makes sense. It's a tweet that turned sour halfway through, but overall it had a pretty high density of positive words, so it's no surprise that our scorer tagged it with a positive score.
  3. Here's a tweet with an 8D in it:

    Got to take a pic with heage ! Who has by far been the most fun, funny and candid lecturer(in my... http://t.co/WaOTW8D2YO

    Notice anything funny? It's a happy tweet, but the emoticon we were looking for is conspicuously absent! Actually, the 8D does occur in the tweet - albeit in a URL: http://t.co/WaOTW8D2YO
    Thanks to Twitter's automatic URL shortening using t.co, it's entirely possible to see an arbitrary collection of alphanumeric characters in a tweet without any semantic information. So be wary of the scores for stuff like 8D and xD.
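One way to guard against this kind of false match is sketched below in Python. It is only an illustration, not the code used for the chart above: the URL pattern is deliberately simplified, and the function names are invented for this example. It drops URLs before matching and only accepts an emoticon at the start of the tweet or right after whitespace.

```python
import re

# Simplified URL pattern (an assumption): http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs so their random characters cannot be mistaken for emoticons."""
    return URL_RE.sub(" ", text)


def contains_emoticon(text, emoticon):
    """True if the emoticon appears at the start or after whitespace, outside any URL."""
    pattern = r"(?:^|\s)" + re.escape(emoticon)
    return re.search(pattern, strip_urls(text)) is not None
```

With this check, the lecturer tweet above no longer counts as an 8D tweet, while a tweet like "That was great 8D" still does. The trade-off is that an emoticon typed directly after a word, with no space, is missed.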
So the next time you can't tell what someone is trying to convey with an emoticon, this chart might come in handy as a reference. In the meantime, if you're happy and you know it, contort your pupils o.O


Notes
[1] A linear scale where positive is happy, negative is unhappy
[2] Twitter's Search API handles punctuation poorly, so that's not an option
[3] Assignment of this score is done via a relatively simple lookup mechanism. This file provides a good evaluation
All code used has been made available on GitHub.

8 comments:

  1. Hmm... such a high score for o.O compared to o_O and O_o is quite interesting, especially when they are used interchangeably. I'm guessing a sentence/word ending in 'o' and the next starting with 'O' (leading to an 'o.O') will also contribute to this. If not English, maybe Italian or something?

    Replies
    1. That's entirely possible, but that would only give us a higher *number* of tweets in the o.O category. I see no reason for that sample to be biased towards positive sentiment either.

  2. You might lose some genuine emoticons (it's not like there's an absence of data in twitter), but could you refine the search so that you're only looking for emoticons that have a space in front? E.g. " 8D" Unless the emoticon is at the beginning of the tweet, this should remove the emoticons within urls or other strange non-emotional occurrences.

    Replies
    1. Agreed. That should certainly be a cleaner sample. We'd miss some tweets which mean it as an emoticon, but just don't bother to put in a space before it, but like you say, there isn't a dearth of data.
      I'll try this out.

  3. And you could conclude that there are way more positive tweets in the world than negative. o.O (Unless most of the negative tweets are hidden in negative emoticons not included.)

    Replies
    1. I think it's both. Twitter as a medium encourages positive tweets more than negative ones, but emoticons, too, are more likely to be used when you want to convey a positive emotion.

      Although, a quick sample taken just now, without any preference for emoticon-containing tweets, does show a higher fraction of positive tweets. The internet is a happy place. Who would've thought? o.O

  4. Any references on your sentiment visualization? It is really great.

    Replies
    1. Thanks. I created it using the ggplot2 package in R. If you want to have a look at the code, it's here: https://github.com/CabbagesAndKings/EmoticonSentiment/blob/master/scripts/scoreEmoticons.R
      (The plotting function is near the very end)
