Tuesday, July 9, 2013

This, That or the Other: Of Pasta, Pokémon and The Sopranos

Identifying proper noun categories using machine learning

OPHIRA EISENBERG: Jonathan, what's our next game?
JONATHAN COULTON: Well, if you have listened to Ask Me Another before, then this game will probably sound familiar to you. It is one of my favorites. It's called This, That, or the Other. We will name an item and all you have to do is tell us which of three categories that item belongs to. Today's categories are grains, world currencies or Pokémon characters.


That was an excerpt from the transcript of an old episode of NPR’s whimsical trivia/puzzle radio show Ask Me Another.

Other editions of the show have asked contestants to distinguish between:
  • A tech company, a car model, or a Star Wars location;
  • A Harry Potter spell, a prescription drug, or a piece of IKEA furniture;
  • A type of cheese, a dance move, or a Moby Dick character;
  • and, my personal favourite, a type of pasta, a title of an opera, or a character on The Sopranos.

You see how this makes for a funny radio show. Here’s how this makes for an interesting post on a data blog:

If you were to guess, based on what you know about how each category sounds, would you be right more than a third of the time? Better still, could we train a model to pick up on features that are particular to each category, and then get that model to perform better than chance on unseen examples?
Okay, so the data collection part isn't very difficult for a number of those categories, thanks to Wikipedia maintaining parseable articles in list form, like List of Pokémon1. I managed to easily procure lists of 

  • Currencies
  • Pokémon
  • Pastas
  • Cheese
  • Locations in the Star Wars Universe
For starters, the features I decided to create were simply the occurrences of letters and ‘bi-letters’ in each name. For example, Naboo (die-hard followers may recognise it as the place Jar Jar Binks was from; for the uninitiated, it’s a planet in the Star Wars universe) would have a ‘True’ value for the features:

n, a, b, o, ^n, na, ab, bo, oo, o$

The ^ and the $ are special symbols I used to indicate the beginning and the end of a word respectively.

If you've worked in the field of Natural Language Processing, you'll recognize these features as analogues of unigrams and bigrams in language models.
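A sketch of that feature extraction in Python (the function name is my own for illustration, not necessarily what the actual code uses):

```python
def name_features(name):
    """Letter and 'bi-letter' occurrence features for a name.

    '^' and '$' mark the beginning and end of the word,
    as in the Naboo example above.
    """
    name = name.lower()
    padded = '^' + name + '$'
    features = {ch: True for ch in name}                 # single letters
    features.update({padded[i:i + 2]: True               # bi-letters
                     for i in range(len(padded) - 1)})
    return features

print(sorted(name_features('Naboo')))
```

Running it on 'Naboo' yields exactly the ten features listed above: n, a, b, o, ^n, na, ab, bo, oo, o$.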

Next, I trained a Naïve Bayes model in Python, using the excellent NLTK library. I picked categories two at a time, but Naïve Bayes extends very naturally to any number of target classes. Feel free to play around with the code; you can find it on GitHub here.
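A minimal sketch of the training-and-scoring loop, assuming NLTK is installed. The lists here are tiny illustrative samples; the real data comes from the Wikipedia lists mentioned earlier.

```python
import random

import nltk

def name_features(name):
    # Same letter/bi-letter features described above.
    name = name.lower()
    padded = '^' + name + '$'
    feats = {ch: True for ch in name}
    feats.update({padded[i:i + 2]: True for i in range(len(padded) - 1)})
    return feats

# Tiny illustrative samples standing in for the full Wikipedia-derived lists.
pastas = ['penne', 'fusilli', 'rigatoni', 'farfalle', 'orecchiette',
          'linguine', 'tortellini', 'cannelloni', 'gemelli', 'ziti']
star_wars = ['naboo', 'tatooine', 'dagobah', 'endor', 'hoth',
             'kashyyyk', 'mustafar', 'coruscant', 'geonosis', 'bespin']

labeled = ([(name_features(n), 'pastas') for n in pastas] +
           [(name_features(n), 'starWars') for n in star_wars])
random.seed(0)
random.shuffle(labeled)
test_set, train_set = labeled[:5], labeled[5:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```

With lists this small the accuracy number jumps around; the 80%+ figures below come from the full lists.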

With just those simple features, I was able to get upwards of 80% accuracy2 on most pairs. In fact, on a pair like Cheese vs Pasta (I would totally watch a movie with that title; oi, in some circles I could pass that off as humour!), which seems like a difficult pair to classify, I got as much as 92% accuracy.

Here’s a twist on the problem statement: what if you were designing the game, and wanted to pick the hardest items to guess? We can directly extend the earlier results to get these: we simply need the items that the algorithm misclassified. So here’s a test to see how you do on the toughies. In the following set, can you guess whether each item is a pasta, or a location in the Star Wars universe?

  1. Bestine
  2. Quelli
  3. Falleen
  4. Alfabeto
  5. Felucia
  6. Sorprese
  7. Sulorine
  8. Egg barley

Answers:
Star Wars Locations: 1, 2, 3, 5, 7
Pastas: 4, 6, 8 (Yeah, the last one was a giveaway3 )

Do well? Pat yourself on the back, for today, you have outwitted a machine.
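For the curious, harvesting such toughies is mechanical once you have a trained classifier: keep the held-out items it gets wrong. A sketch, again with tiny illustrative lists rather than the real data:

```python
import nltk

def name_features(name):
    # Letter/bi-letter features, as before.
    name = name.lower()
    padded = '^' + name + '$'
    feats = {ch: True for ch in name}
    feats.update({padded[i:i + 2]: True for i in range(len(padded) - 1)})
    return feats

def misclassified(classifier, labeled_names):
    """Return (name, actual, predicted) for every item the model gets wrong."""
    return [(name, label, classifier.classify(name_features(name)))
            for name, label in labeled_names
            if classifier.classify(name_features(name)) != label]

# Train on a few names, then look for errors on held-out ones.
train = ([(name_features(n), 'pastas') for n in
          ['penne', 'fusilli', 'rigatoni', 'linguine', 'tortellini']] +
         [(name_features(n), 'starWars') for n in
          ['naboo', 'tatooine', 'dagobah', 'endor', 'hoth']])
classifier = nltk.NaiveBayesClassifier.train(train)
held_out = [('felucia', 'starWars'), ('sorprese', 'pastas')]
print(misclassified(classifier, held_out))
```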

So why does the model classify these incorrectly? The top 10 features the model picked4 might provide some insight:

  Feature/Value    Dominant cat : Lesser cat    Ratio of Occurrence
  li = True        pastas : starWars            23.2 : 1.0
  ti = True        pastas : starWars            12.8 : 1.0
  et = True        pastas : starWars            10.8 : 1.0
  i$ = True        pastas : starWars             9.0 : 1.0
  ^p = True        pastas : starWars             8.8 : 1.0
  tt = True        pastas : starWars             8.1 : 1.0
  length = 5       starWars : pastas             7.9 : 1.0
  ci = True        pastas : starWars             6.9 : 1.0
  f = True         pastas : starWars             6.3 : 1.0
  nn = True        pastas : starWars             6.2 : 1.0
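A table in that shape is what NLTK will print for you. A sketch with toy made-up feature data, just to show the API call (a real run on the pasta/Star Wars lists produces output like the table above):

```python
import nltk

# Toy training data: 'li' mostly signals pastas, 'oo' mostly Star Wars.
train = ([({'li': True}, 'pastas')] * 9 + [({'li': True}, 'starWars')] +
         [({'oo': True}, 'starWars')] * 9 + [({'oo': True}, 'pastas')])

classifier = nltk.NaiveBayesClassifier.train(train)
# Prints each feature/value, the dominant and lesser label,
# and the ratio of their likelihoods.
classifier.show_most_informative_features(10)
```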


That’s that for these models. Here are some suggestions for other cool things you could do with the code if you have a teeny weeny bit of coding experience (No math/stats/machine learning experience required):


  • Find out if your name looks like a grain or a kind of pasta (And then go around claiming "I just don't get grain-people. Pasta-ites FTW!")
  • Gather more lists (Simpsons characters, varieties of chili, brands of cosmetics, ….) and see which ones look like which, er, other ones. (The data is simply a text file, with one item on each line)
  • See which the cheesiest pastas are! (I'm so terribly sorry. I'll never ever try to be funny again. Ever.) 
I’m working on an R + Shiny app that would make all of the above much easier for the casual browser, but it’s much more fun to play around with the code, don’t you think?

Don't you?
Don't leave me hanging here, guys.
Guys?
Fine. I'll just make the app.


Footnotes
1 While on that topic, check out these weird lists on Wikipedia: 
2 I’m using accuracy simply as #correct predictions / (#correct predictions + #incorrect predictions)
3 Only goes to show that there are more/better features to be derived here.
4 I was working with a slightly different version here, which included a variable for length. Don't be thrown off by that.



2 comments:

  1. (If you haven't already tried) Mentalfloss has hordes of these in the form of quizzes : http://mentalfloss.com/quizzes

  2. These are way too addictive! I loved 'Celebrity Baby Name or Computer Virus?'. Unfortunately, I only scored 45%.
