The Aarrgh and the Aardwolf
Curating our Word Lists, Part 1
It's surprisingly difficult to make a good word list for a game. I mean, you think it's going to be hard. But it's even harder than that.
Finding an appropriately sized starting list is the first challenge. We started by downloading SOWPODS (Collins Scrabble Words), the official word list used in many English-language Scrabble tournaments around the world. But SOWPODS has over 279,000 words in it. That is a heck of a lot of words.
I consider myself something of a word nerd, and I'm unfamiliar with probably the majority of the words on it. Here's a sample from the top of the list:
AA
AAH
AAHED
AAHING
AAHS
AAL
AALII
AALIIS
AALS
AARDVARK
AARDVARKS
AARDWOLF
AARDWOLVES
AARGH
AARRGH
AARRGHH
AARTI
AARTIS
AAS
You won't regret googling aardwolf, and I love that there are three accepted variations of aargh. But I think most people would be pretty annoyed if a word in their puzzle turned out to be aals. I didn't want to sort through 279,000 words by hand, so we needed a different option.
After trying and rejecting a few other lists – too big, too small, too hot, too cold – we landed on the "Spell Checker Oriented Word Lists", aka SCOWL (and Friends). Thanks, Jed Hartman, for blogging about it! SCOWL is an awesome resource with lists that can be filtered and combined based on a slew of things like frequency, region, abbreviations, proper names, and more.
Next, we decided that we wanted to remove most verb forms and plurals to reduce repetition in the game – e.g. keep jump, but remove jumps, jumped, and jumping. (Fun Fact – these different forms of a word are known in linguistics as "inflections".) SCOWL has a file containing data about parts of speech and inflections. It was perfect for this. We wrote a PowerShell script that compared our main list against the inflections list and removed the words we didn't want.
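Our script was PowerShell, but purely as an illustration, here's roughly what that comparison could look like in Python. This is a minimal sketch, not our actual code: the function name remove_inflections is made up, and it assumes the inflection data has already been parsed into a dictionary mapping each base word to its inflected forms, which is not necessarily how SCOWL's file is laid out.

```python
def remove_inflections(words, inflections):
    """Return `words` with inflected forms dropped and base forms kept.

    `inflections` maps a base word to its inflected forms. This shape is an
    assumption for the example, not the real SCOWL file format.
    """
    # Collect every inflected form, skipping any form identical to its base.
    inflected = {
        form
        for base, forms in inflections.items()
        for form in forms
        if form != base
    }
    # Preserve the original order of the main list.
    return [w for w in words if w not in inflected]


words = ["jump", "jumps", "jumped", "jumping", "clay"]
inflections = {"jump": ["jumps", "jumped", "jumping"]}
kept = remove_inflections(words, inflections)
print(kept)  # ['jump', 'clay']
```

One caveat with a sketch this blunt: a word like saw is both a base word and an inflection of see, so it would be dropped here. That's one reason to only remove "most" inflections and keep an exceptions list.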
For this game, we also removed words shorter than three letters or longer than eight letters.
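The length rule is the easiest step to picture in code. A tiny sketch in Python, with hypothetical names (MIN_LEN, MAX_LEN, within_length) chosen just for this example:

```python
MIN_LEN, MAX_LEN = 3, 8  # the bounds described above; the names are hypothetical


def within_length(words, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only words whose length is between min_len and max_len, inclusive."""
    return [w for w in words if min_len <= len(w) <= max_len]


sample = ["aa", "aah", "aardvark", "aardwolves"]
filtered = within_length(sample)
print(filtered)  # ['aah', 'aardvark']
```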
Then we used SCOWL as a starting point for our normal and hard difficulty word lists, based on their "size" categories. Common words are found in the smaller size categories. Uncommon words often correlate with difficulty, but not always, so we ended up moving some words around. For example, our "Move from normal to hard" list includes semantic, nominal, and parity. Our "Move from hard to normal" list includes starry, clay, and balloon. 🎈
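To make the "size plus overrides" idea concrete, here's a rough Python sketch. Everything specific in it is an assumption for illustration: the cutoff of 50, the size values attached to the sample words, and the names assign_difficulty and NORMAL_MAX_SIZE. Only the two override lists come from the examples above.

```python
NORMAL_MAX_SIZE = 50  # hypothetical cutoff, not necessarily the one we used


def assign_difficulty(word_sizes, to_hard, to_normal, normal_max=NORMAL_MAX_SIZE):
    """Split words into normal/hard by size category, then apply manual overrides."""
    lists = {"normal": set(), "hard": set()}
    for word, size in word_sizes.items():
        # Smaller size categories hold more common words.
        tier = "normal" if size <= normal_max else "hard"
        # Manual override lists win over the size-based guess.
        if word in to_hard:
            tier = "hard"
        elif word in to_normal:
            tier = "normal"
        lists[tier].add(word)
    return lists


# Made-up size values, purely for the example.
word_sizes = {"jump": 10, "clay": 60, "parity": 35, "aardwolf": 80}
to_hard = {"semantic", "nominal", "parity"}
to_normal = {"starry", "clay", "balloon"}

lists = assign_difficulty(word_sizes, to_hard, to_normal)
print(sorted(lists["normal"]))  # ['clay', 'jump']
print(sorted(lists["hard"]))    # ['aardwolf', 'parity']
```

Keeping the overrides as separate lists, rather than editing the generated lists by hand, means the whole thing can be regenerated from SCOWL without losing those decisions.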
We wanted to accept solutions other than the one the game made for the puzzle at hand, so we created a broader set of words for our solution checker and a stricter subset for our puzzle generator. That way, the player can submit an inflection or a more obscure word that wasn't intended by the generator.
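In code terms, the generator list is a subset of the checker list. A minimal Python sketch of that relationship, with hypothetical function names and sample words (this is not our game code):

```python
import random


def build_lists(generator_words, extra_accepted):
    """Return (generator, checker), where checker is a superset of generator."""
    generator = frozenset(generator_words)
    # Everything the generator can pick is also a valid answer,
    # plus inflections and more obscure words.
    checker = generator | frozenset(extra_accepted)
    return generator, checker


def pick_puzzle_word(generator, rng):
    """Choose a puzzle word from the strict list only."""
    return rng.choice(sorted(generator))


def is_valid_solution(word, checker):
    """Accept any word from the broad list, ignoring case."""
    return word.lower() in checker


generator, checker = build_lists(
    ["jump", "clay", "balloon"],
    ["jumps", "jumped", "jumping", "aardwolf"],
)
puzzle_word = pick_puzzle_word(generator, random.Random(0))
print(is_valid_solution("Jumps", checker))  # True
print("jumps" in generator)                 # False
```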
This gave us a pretty good starting point for our lists, but the hardest part was yet to come – manually picking words to remove based on subject matter. There's enough to unpack there that I think it deserves its own post, so I'll see you in Part 2!
Have you made word lists for a game? Feel free to reach out with your thoughts or questions!
Top image from: Marie van Dieren