7500 Word Dictionary

0
0
Published 2023-06-12


Here's a 7500 word dictionary cart for making word games and whatnot. It contains the most common 3-6 letter words according to wiktionary.com, including proper names. The loader is 264 tokens, and the data is 11317, stored over the full map (including shared gfx), plus the last 44 SFX. So there are just 128 sprites and 20 SFX spare. The 5 most common and 10 least common words on the list are:

Technical details..

The data is generated using a convoluted toolchain process, that I'll post later on if I find time to organize it into something useful.

The compression works by enumerating every possible word in order of word size first, and then alphabetical order. So, A is 0, B is 1, AA is 26, and so on. This means that to store the whole dictionary, only the distances between each word's index is needed. There are many clusters of close words (e.g the distance between CAN and CAP is only 2), so the distances are sorted into range categories depending on how many bits are needed to store each range. Repeated categories are common and so can be encoded with a single bit -- otherwise a 3-bit value is used to store the category for each distance. The encoding utility greedy-searches sets of 5 category bit-lengths and a roman cypher to try to optimize the encoded size, which saved around 1k compared with hand-optimizing the parameter set.