Today I closed down the Great Language Game, a site I've run for nearly five years. The game played you a short audio clip and challenged you to identify what language was being spoken. Despite the simple premise, it was enjoyed by some 3.2 million players from around the world, and even inspired original linguistics research. Here I reflect on the process of building and running the game.

What inspired its creation?

I'm half-Swedish, but born and raised in Melbourne, Australia. I've always loved the mixed cultural heritage of modern Australia, where 26% of people were born elsewhere. This means a large number of Australians speak languages other than English in the home, or perhaps with their parents and relatives.

As I finished school, I went through a naive phase where I hoped to become a polyglot, or even a hyper-polyglot (speaker of many languages). When I first started forgetting a language I had studied, it was a shock. I learned that you only get to keep languages you use. This means your lifestyle and profession really limit the number of languages you can speak. But perhaps if we can't learn them all, at least we can learn to tell the difference between them.

Fast forward a few years and I suddenly realized that I had all the tools I needed to make this small dream come true. I wrote the Great Language Game for myself, and the other few hundred hobby linguists I thought might enjoy it. The day after release, I had to take time off work to improve my site: first tens, then hundreds of thousands of people began listening to languages and trying to guess their identity. Who knew so many people would love foreign tongues?

Sourcing audio

Coming from the research world, where all data should be carefully licensed and attributed, I thought a lot about getting samples of different languages. I began at first with samples from a single Australian broadcaster, SBS, which has podcasts in nearly 70 languages. This was an amazing kick-start.

An SBS employee made contact to correct a language, and so I asked about the idea of official permission to use the audio. He thought asking would raise more problems than it might solve, but the worry of being taken down stuck in my mind. To hedge against it, I quickly included samples from many other broadcasters from across the world, including BBC, Voice of America and Sveriges Radio.

I learned some interesting things about the boundaries between languages, in particular that they're often political rather than linguistic. For example, should I include Bosnian, Serbian and Croatian separately or as the merged Serbo-Croatian; and should Hindi and Urdu be separate, or merged into Hindustani? Even language names sometimes got complaints; I developed the policy of exclusively using language names from Wikipedia or Ethnologue.

Linguistic research

A linguist, Hedvig Skirgård, contacted me about the game and its player dataset. Having experienced how hard it can be to get data in the research world, I was keen to make data open for her and others to use. So, I did a careful export, made it public, and she and her colleague Sean Roberts began diving in. The result was a fascinating research paper on how and why people confuse languages.

This process opened up a bunch of questions: why had I included one language rather than another? What audio did I have, and where did each sample come from? I also had many people asking me to add more languages, but I was unhappy with the way I was handling all this audio data. I didn't want to add any more to the existing project, so I decided to rebuild.

Rebuilding

I made two decisions when I rebuilt the game. Firstly, I separated out all my audio sourcing efforts into a new open project, making the audio dataset totally independent from the game. This in theory would allow others to contribute audio more directly to the game, and for that audio to be used for other purposes. That dataset is now the Wide Language Index.

Secondly, I decided to use much more traditional software tools the second time around. (For developers: I moved from Flask and MongoDB to Django and Postgres.) This made my life easier and made the project better to work with in a large number of ways.

Some time around completing this rewrite, I began to run out of energy. I shipped the new version, although it still had fewer languages and fewer features. Usage dropped, but I thought it would be worth it in the end. In time, people discovered the Wide Language Index and began contributing actively, and this meant we could add languages that I couldn't identify on my own, such as Egyptian Arabic.

Over time though, I came to realize how much energy the game took, even when I wasn't actively working on it. Having decided it had run its course, I shut it down to free up energy for new projects.

Some thanks

Over the years I ran the game, I received a huge amount of correspondence. People told me about using the game in classrooms, of their high scores, of their fascinating backgrounds and upbringings that gave them a particular ear for languages, and of their requests and hopes for more. I'd like to thank you all for your energy, and for your time enjoying the world's languages.

Although the game is now closed down, I've left datasets behind for anyone else who's interested in developing similar projects themselves, as well as for researchers who are interested in looking at confusion again:

  • Wide Language Index: a catalog of audio samples from different languages, categorized by language, mirrored, and annotated
  • Language Confusion Dataset: confusion data from the language game, for use by hobbyists and researchers

I look forward to trying and playing new games made by others, games that help us think about and explore the rich world we live in.