Getting a Large List of Nouns or Adjectives in Python with NLTK; or Python Mad Libs
Introduction
When you need a large collection of English nouns or adjectives for a project -- whether for a word game, text generation, or a Mad Libs program -- the Natural Language Toolkit (NLTK) is the most accessible starting point in Python. NLTK ships with curated datasets like WordNet and the Brown corpus that give you tens of thousands of categorized words out of the box. Understanding how to extract words by part of speech from these resources opens the door to creative programming projects and serious NLP pipelines alike.
Getting Started with NLTK
Before you can access any corpus, you need to install NLTK and download the relevant data packages. NLTK separates its library code from its data, so you must explicitly fetch the datasets you plan to use.
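A minimal sketch of that setup step, assuming NLTK itself is already installed (for example with pip install nltk). The helper name download_nltk_data and the package list are illustrative. The identifiers are ones NLTK's downloader has commonly used, but they can differ between releases (newer versions may expect averaged_perceptron_tagger_eng), so check them against your installed version. The universal_tagset package is included because the simplified Brown tags depend on it.

```python
# Data packages used in this article. The identifiers are assumptions that
# may vary between NLTK releases.
NLTK_PACKAGES = (
    "wordnet",                     # WordNet lexical database
    "brown",                       # Brown corpus (tagged text)
    "universal_tagset",            # mapping from native tags to simplified tags
    "averaged_perceptron_tagger",  # model used by nltk.pos_tag
)


def download_nltk_data(packages=NLTK_PACKAGES):
    """Download each package and report which downloads succeeded."""
    import nltk  # imported here so this file loads even if NLTK is missing

    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in packages}
```

Call download_nltk_data() once per environment; NLTK stores the data locally, so later runs do not need to fetch it again.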
This downloads WordNet (a structured lexical database), the Brown corpus (a tagged collection of real English text), and the tagger models needed for part-of-speech tagging.
Extracting Nouns and Adjectives from WordNet
WordNet organizes English words into synsets -- groups of synonymous words -- and labels each synset with a part of speech. This makes it the cleanest source for extracting word lists by category. The reason WordNet works so well here is that every entry already carries its grammatical role, so you never need to guess.
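One way the extraction might look, as a sketch rather than the original listing. The function names clean_lemma_names and wordnet_words are hypothetical; all_synsets and lemma_names are part of NLTK's WordNet interface, and the WordNet data is assumed to be downloaded already.

```python
def clean_lemma_names(lemma_names):
    """Return a sorted, de-duplicated list with underscores turned into spaces."""
    return sorted({name.replace("_", " ") for name in lemma_names})


def wordnet_words(pos):
    """Collect every lemma WordNet lists under one part of speech.

    pos is a WordNet constant: wn.NOUN ("n"), wn.ADJ ("a"),
    wn.VERB ("v"), or wn.ADV ("r").
    """
    from nltk.corpus import wordnet as wn  # needs the 'wordnet' data package

    names = (
        name
        for synset in wn.all_synsets(pos=pos)
        for name in synset.lemma_names()
    )
    return clean_lemma_names(names)


# Usage, once the data is available:
#   nouns = wordnet_words("n")
#   adjectives = wordnet_words("a")
```

Building the result through a set removes the duplicates that arise when one word belongs to several synsets.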
The pos parameter accepts wn.NOUN, wn.ADJ, wn.VERB, and wn.ADV. The replace('_', ' ') call converts multi-word entries like "ice_cream" into readable strings.
Extracting Words from the Brown Corpus
The Brown corpus takes a different approach. Instead of a structured dictionary, it is a collection of real sentences where each word has been tagged with a part-of-speech label. This gives you words as they actually appear in context, which can be more natural for text generation.
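A sketch of filtering the tagged corpus, assuming the brown and universal_tagset data packages are installed. The helper names words_with_tag and brown_words are illustrative; brown.tagged_words(tagset="universal") is the NLTK call that yields (word, tag) pairs with simplified tags.

```python
def words_with_tag(tagged_words, wanted_tag):
    """Return sorted, lowercased, purely alphabetic words carrying wanted_tag."""
    return sorted(
        {
            word.lower()
            for word, tag in tagged_words
            if tag == wanted_tag and word.isalpha()
        }
    )


def brown_words(wanted_tag):
    """Extract words from the Brown corpus by universal tag, e.g. 'NOUN' or 'ADJ'."""
    from nltk.corpus import brown  # needs the 'brown' data package

    return words_with_tag(brown.tagged_words(tagset="universal"), wanted_tag)


# Usage, once the data is available:
#   nouns = brown_words("NOUN")
#   adjectives = brown_words("ADJ")
```

Lowercasing merges sentence-initial forms such as "Dog" with "dog"; drop the lower() call if capitalization matters for your project.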
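The universal tagset collapses the Brown corpus's many fine-grained native tags into broader categories such as NOUN and ADJ. If you need finer control, skip the tagset parameter and filter on Brown tags like NN, NNS, NP (proper noun), or JJ, JJR (comparative), JJT (superlative). Note that these are Brown tags, not Penn Treebank tags, so names such as NNP will not match anything.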
Building a Mad Libs Generator
With your word lists in hand, building a Mad Libs game is straightforward. The idea is to define a story template with placeholders and fill them randomly from your extracted lists.
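The template-and-fill idea can be sketched as follows. The function name fill_template, the placeholder syntax {noun} / {adjective}, and the small sample lists are all assumptions made for illustration; in practice you would pass in lists built from WordNet or the Brown corpus.

```python
import random
import re

# Matches placeholders such as {noun} or {adjective}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, word_lists, rng=random):
    """Replace each placeholder with a random word from the matching list."""

    def pick(match):
        # match.group(1) is the placeholder name, e.g. "noun".
        return rng.choice(word_lists[match.group(1)])

    return PLACEHOLDER.sub(pick, template)


# Tiny stand-in lists so the example runs without any NLTK data.
SAMPLE_WORDS = {
    "adjective": ["fuzzy", "ancient", "purple"],
    "noun": ["teapot", "volcano", "librarian"],
}
STORY = "The {adjective} {noun} chased a {adjective} {noun} across town."

print(fill_template(STORY, SAMPLE_WORDS))
```

Each placeholder is drawn independently, so repeated {noun} slots can receive different words. random.choice needs a sequence, so convert any set of words to a list before passing it in. Supplying random.Random(42) as rng makes the output repeatable.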
Alternative Approach with spaCy
If you need more accurate part-of-speech tagging on custom text (rather than pulling from a pre-built corpus), spaCy is a strong alternative. Its trained statistical pipelines typically tag modern English more accurately than NLTK's default tagger, which is an averaged perceptron model rather than a purely rule-based one.
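A rough sketch of that workflow, assuming spaCy and its small English model en_core_web_sm are installed separately. The helper names words_by_pos and tag_text are hypothetical; pos_, lemma_, and is_alpha are standard spaCy token attributes.

```python
def words_by_pos(doc, wanted_pos):
    """Return sorted, lowercased lemmas of alphabetic tokens tagged wanted_pos."""
    return sorted(
        {
            token.lemma_.lower()
            for token in doc
            if token.pos_ == wanted_pos and token.is_alpha
        }
    )


def tag_text(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline over text and return the tagged document."""
    import spacy  # imported here so this file loads even if spaCy is missing

    nlp = spacy.load(model_name)
    return nlp(text)


# Usage, once spaCy and the model are installed:
#   doc = tag_text("The quick brown fox jumps over the lazy dog.")
#   nouns = words_by_pos(doc, "NOUN")
#   adjectives = words_by_pos(doc, "ADJ")
```

spaCy's pos_ attribute uses universal tags such as NOUN and ADJ, and tags proper nouns separately as PROPN.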
spaCy does not ship with a word list like WordNet, so it works best when you want to extract words from your own documents rather than generate a standalone vocabulary.
Common Pitfalls
- Forgetting to download data: NLTK separates code from data. Calling wordnet.all_synsets() without first running nltk.download('wordnet') raises a LookupError.
- Multi-word lemmas: WordNet entries like "ice_cream" or "New_York" slip into your word lists. Filter with str.isalpha() if you need single words only.
- Mixing up tagsets: The Brown corpus default tags (NN, JJ) differ from the universal tagset (NOUN, ADJ). Using the wrong tag name returns an empty list with no warning.
- Duplicate words across synsets: A single word can appear in many synsets, inflating your list. Always collect into a set rather than a list during extraction.
- Assuming WordNet covers slang or modern words: WordNet was built from formal English. It lacks internet slang, brand names, and many recently coined terms.
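Three of these pitfalls can be reproduced with plain Python on made-up data, with no NLTK download required. The sample lists below are invented for illustration only.

```python
# Multi-word lemmas: the underscore makes str.isalpha() return False.
lemma_names = ["ice_cream", "New_York", "dog", "cat"]
single_words = [name for name in lemma_names if name.isalpha()]
print(single_words)  # ['dog', 'cat']

# Mixing up tagsets: filtering universal-tagged data on a Brown tag
# silently matches nothing.
tagged = [("dog", "NOUN"), ("red", "ADJ"), ("dog", "NOUN")]
wrong_tag = [word for word, tag in tagged if tag == "NN"]
print(wrong_tag)  # []

# Duplicates: a set keeps each word once, however often it appears.
unique_nouns = {word for word, tag in tagged if tag == "NOUN"}
print(unique_nouns)  # {'dog'}
```

An unexpectedly empty result is therefore a hint to check which tagset the data actually uses.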
Summary
- Use wn.all_synsets(pos=wn.NOUN) to pull all nouns from WordNet, and wn.ADJ for adjectives.
- The Brown corpus provides real-world tagged text; filter on universal tags like NOUN and ADJ for cleaner categories.
- Always collect words into a set to avoid duplicates from multiple synsets.
- Building a Mad Libs generator is as simple as defining a template string and calling random.choice() on your word lists.
- For tagging your own custom text rather than pulling from a corpus, spaCy with a trained model typically gives more accurate results than NLTK's default tagger.
- Remember to download NLTK data packages before accessing any corpus -- this is the most common source of errors for new users.

