Getting a Large List of Nouns or Adjectives in Python with NLTK; or Python Mad Libs

Introduction

When you need a large collection of English nouns or adjectives for a project -- whether for a word game, text generation, or a Mad Libs program -- the Natural Language Toolkit (NLTK) is the most accessible starting point in Python. NLTK ships with curated datasets like WordNet and the Brown corpus that give you tens of thousands of categorized words out of the box. Understanding how to extract words by part of speech from these resources opens the door to creative programming projects and serious NLP pipelines alike.

Getting Started with NLTK

Before you can access any corpus, you need to install NLTK and download the relevant data packages. NLTK separates its library code from its data, so you must explicitly fetch the datasets you plan to use.

python
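import nltk

nltk.download('wordnet')           # WordNet lexical database
nltk.download('brown')             # Brown corpus (tagged text)
nltk.download('universal_tagset')  # mapping from corpus tags to universal tags
# Tagger model; older NLTK releases name it 'averaged_perceptron_tagger'
nltk.download('averaged_perceptron_tagger_eng')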

This downloads WordNet (a structured lexical database), the Brown corpus (a tagged collection of real English text), the mapping that converts corpus tags to the universal tagset, and the tagger model used when you tag your own text with nltk.pos_tag.
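
To avoid re-downloading on every run, one option is to probe each resource and fetch it only when the probe fails. The sketch below is a minimal illustration under stated assumptions: ensure_data and setup_article_data are hypothetical helper names, and the probes rely on NLTK raising LookupError when a corpus is missing.

```python
from typing import Callable, Dict, List


def ensure_data(probes: Dict[str, Callable[[], object]],
                download: Callable[[str], object]) -> List[str]:
    """Run each probe; download the package only if the probe raises LookupError.

    Returns the names of the packages that had to be fetched.
    """
    fetched = []
    for package, probe in probes.items():
        try:
            probe()
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched


def setup_article_data() -> List[str]:
    """Wire ensure_data to NLTK (imported lazily, so nothing runs at import time)."""
    import nltk
    from nltk.corpus import brown, wordnet as wn

    probes = {
        "wordnet": lambda: wn.synsets("dog"),  # touches the WordNet files
        "brown": lambda: brown.words()[:1],    # touches the Brown corpus files
    }
    return ensure_data(probes, lambda name: nltk.download(name, quiet=True))
```

Calling setup_article_data() once at program start returns the list of packages it had to fetch, which is empty when everything is already installed.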

Extracting Nouns and Adjectives from WordNet

WordNet organizes English words into synsets -- groups of synonymous words -- and labels each synset with a part of speech. This makes it the cleanest source for extracting word lists by category. The reason WordNet works so well here is that every entry already carries its grammatical role, so you never need to guess.

python
from nltk.corpus import wordnet as wn

# Get all nouns
nouns = set()
for synset in wn.all_synsets(pos=wn.NOUN):
    for lemma in synset.lemmas():
        nouns.add(lemma.name().replace('_', ' '))

# Get all adjectives
adjectives = set()
for synset in wn.all_synsets(pos=wn.ADJ):
    for lemma in synset.lemmas():
        adjectives.add(lemma.name().replace('_', ' '))

print(f"Nouns: {len(nouns)}")            # ~117,000
print(f"Adjectives: {len(adjectives)}")  # ~21,000

The pos parameter accepts wn.NOUN, wn.ADJ, wn.VERB, and wn.ADV. The replace('_', ' ') call converts multi-word entries like "ice_cream" into readable strings.
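
Raw lemma names include proper nouns, multi-word phrases, and entries with digits or punctuation, which are often unwanted in a word game. The following sketch shows one possible cleanup pass. Note that clean_lemma_names is a hypothetical helper, and the rules it applies (single, lowercase, alphabetic words of a minimum length) are assumptions about what a word-game list should contain, not something WordNet prescribes.

```python
from typing import Iterable, List


def clean_lemma_names(names: Iterable[str], min_length: int = 3) -> List[str]:
    """Keep single, lowercase, purely alphabetic words; return them sorted.

    WordNet joins multi-word lemmas with underscores (e.g. 'ice_cream') and
    keeps capitals on proper nouns (e.g. 'New_York'), so both are dropped here.
    """
    kept = set()
    for name in names:
        if name.isalpha() and name.islower() and len(name) >= min_length:
            kept.add(name)
    return sorted(kept)


sample = ["dog", "ice_cream", "New_York", "Paris", "ox", "well-being", "happy", "dog"]
print(clean_lemma_names(sample))  # ['dog', 'happy']
```

With WordNet loaded, the same helper could be fed a generator such as (lemma.name() for synset in wn.all_synsets(pos=wn.NOUN) for lemma in synset.lemmas()).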

Extracting Words from the Brown Corpus

The Brown corpus takes a different approach. Instead of a structured dictionary, it is a collection of real sentences where each word has been tagged with a part-of-speech label. This gives you words as they actually appear in context, which can be more natural for text generation.

python
from nltk.corpus import brown

# tagset='universal' maps the corpus's native Brown tags to coarse universal tags
tagged_words = brown.tagged_words(tagset='universal')

# In the universal tagset, NOUN covers every noun tag and ADJ every adjective tag
nouns = set(word.lower() for word, tag in tagged_words if tag == 'NOUN')
adjectives = set(word.lower() for word, tag in tagged_words if tag == 'ADJ')

print(f"Brown nouns: {len(nouns)}")            # on the order of tens of thousands
print(f"Brown adjectives: {len(adjectives)}")  # several thousand

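The universal tagset collapses the Brown corpus's many native tags into 12 broad categories. If you need finer control, skip the tagset parameter and filter on the native Brown tags, such as NN, NNS, NP (proper noun), or JJ, JJR (comparative), JJT (superlative). These differ from Penn Treebank tags in places: Brown uses NP where Penn Treebank uses NNP, for example.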
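
Filtering on native tags takes a little care, because Brown tags can carry suffixes such as -TL (title) or -HL (headline), so a word tagged NN-TL is still a noun. The sketch below strips those suffixes before comparing, and also counts occurrences so that rare words can be dropped. Both base_tag and frequent_words are hypothetical helpers, and the toy list stands in for brown.tagged_words() called without a tagset argument.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def base_tag(tag: str) -> str:
    """Drop Brown suffixes such as -TL or -HL: 'NN-TL' -> 'NN'."""
    return tag.split("-")[0]


def frequent_words(tagged: Iterable[Tuple[str, str]],
                   wanted: Set[str], min_count: int = 1) -> List[str]:
    """Lowercase words whose base tag is in `wanted`, seen at least min_count times."""
    counts = Counter(
        word.lower() for word, tag in tagged
        if base_tag(tag) in wanted and word.isalpha()
    )
    return sorted(w for w, c in counts.items() if c >= min_count)


# Toy data standing in for brown.tagged_words() (native Brown tags)
toy = [("Fulton", "NP-TL"), ("County", "NN-TL"), ("jury", "NN"),
       ("said", "VBD"), ("county", "NN"), ("recent", "JJ"), ("juries", "NNS")]
print(frequent_words(toy, {"NN", "NNS"}))               # ['county', 'juries', 'jury']
print(frequent_words(toy, {"NN", "NNS"}, min_count=2))  # ['county']
```

Raising min_count is a simple way to bias a generated vocabulary toward common words, at the cost of a smaller list.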

Building a Mad Libs Generator

With your word lists in hand, building a Mad Libs game is straightforward. The idea is to define a story template with placeholders and fill them randomly from your extracted lists.

python
import random
from nltk.corpus import wordnet as wn

def get_words(pos):
    words = set()
    for synset in wn.all_synsets(pos=pos):
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.isalpha():  # skip hyphenated/multi-word
                words.add(name)
    return list(words)

nouns = get_words(wn.NOUN)
adjectives = get_words(wn.ADJ)
verbs = get_words(wn.VERB)

template = (
    "The {adj} {noun1} decided to {verb} across the {adj2} {noun2}. "
    "Everyone agreed it was the most {adj3} thing they had ever seen."
)

story = template.format(
    adj=random.choice(adjectives),
    noun1=random.choice(nouns),
    verb=random.choice(verbs),
    adj2=random.choice(adjectives),
    noun2=random.choice(nouns),
    adj3=random.choice(adjectives),
)
print(story)
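
Filling a template by hand needs one keyword argument per placeholder. A slightly more general version can read the placeholders from the template itself. The sketch below assumes a naming convention that is not part of the original code: each placeholder is a category name optionally followed by digits (noun, noun2, adj3). The helper name fill_template is hypothetical, and the short word lists stand in for the WordNet lists.

```python
import random
import string
from typing import Dict, List, Optional


def fill_template(template: str, words: Dict[str, List[str]],
                  rng: Optional[random.Random] = None) -> str:
    """Fill every {placeholder} with a random word from its category.

    A placeholder such as 'noun2' belongs to category 'noun': trailing
    digits only distinguish repeated slots of the same kind.
    """
    rng = rng or random.Random()
    choices = {}
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
    for _, field, _, _ in string.Formatter().parse(template):
        if field and field not in choices:
            category = field.rstrip("0123456789")
            choices[field] = rng.choice(words[category])
    return template.format(**choices)


words = {"adj": ["sleepy", "purple"], "noun": ["walrus", "teapot"], "verb": ["dance"]}
template = "The {adj} {noun1} decided to {verb} across the {adj2} {noun2}."
print(fill_template(template, words, random.Random(0)))
```

Passing a seeded random.Random makes a story reproducible, which is convenient when testing; omitting it gives a fresh story on each call.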

Alternative Approach with spaCy

If you need more accurate part-of-speech tagging on custom text (rather than pulling from a pre-built corpus), spaCy is a strong alternative. It uses trained neural models, whereas NLTK's default tagger is an averaged perceptron (a simpler trained model, not a rule-based one), and spaCy's models often give better accuracy on modern English.

python
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The brilliant engineer designed an elegant solution for the complex problem."
doc = nlp(text)

nouns = [token.text for token in doc if token.pos_ == "NOUN"]
adjectives = [token.text for token in doc if token.pos_ == "ADJ"]

print(f"Nouns: {nouns}")            # expected: ['engineer', 'solution', 'problem']
print(f"Adjectives: {adjectives}")  # expected: ['brilliant', 'elegant', 'complex']

spaCy does not ship with a word list like WordNet, so it works best when you want to extract words from your own documents rather than generate a standalone vocabulary.
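
When harvesting vocabulary from your own documents, it usually helps to store each token's lemma rather than its surface form, so that "problems" and "problem" collapse into one entry. The sketch below works on any sequence of objects exposing the pos_, lemma_, and is_alpha attributes that spaCy tokens provide. The helper name vocabulary_by_pos is hypothetical, and the stand-in tokens exist only so the example runs without a model installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, Set


def vocabulary_by_pos(tokens: Iterable, wanted=("NOUN", "ADJ")) -> Dict[str, Set[str]]:
    """Group lowercase lemmas by coarse part of speech, skipping non-alphabetic tokens."""
    vocab = defaultdict(set)
    for token in tokens:
        if token.pos_ in wanted and token.is_alpha:
            vocab[token.pos_].add(token.lemma_.lower())
    return dict(vocab)


# Stand-ins for spaCy tokens; with spaCy installed, pass nlp(text) instead.
fake_doc = [
    SimpleNamespace(pos_="ADJ", lemma_="complex", is_alpha=True),
    SimpleNamespace(pos_="NOUN", lemma_="problem", is_alpha=True),
    SimpleNamespace(pos_="NOUN", lemma_="Problem", is_alpha=True),
    SimpleNamespace(pos_="PUNCT", lemma_=".", is_alpha=False),
]
print(vocabulary_by_pos(fake_doc))
```

Running this over many documents and merging the resulting sets yields a domain-specific word list that can complement the WordNet and Brown lists.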

Common Pitfalls

  • Forgetting to download data: NLTK separates code from data. Calling wordnet.all_synsets() without first running nltk.download('wordnet') raises a LookupError.
  • Multi-word lemmas: WordNet entries like "ice_cream" or "New_York" slip into your word lists. Filter with str.isalpha() if you need single words only.
  • Mixing up tagsets: The Brown corpus default tags (NN, JJ) differ from the universal tagset (NOUN, ADJ). Filtering on a tag name from the wrong tagset produces an empty result with no warning.
  • Duplicate words across synsets: A single word can appear in many synsets, inflating your list. Always collect into a set rather than a list during extraction.
  • Assuming WordNet covers slang or modern words: WordNet was built from formal English. It lacks internet slang, brand names, and many recently coined terms.
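
The tagset pitfall is easy to reproduce and to guard against. The sketch below shows one way to turn a silent empty result into an error. The helper name words_with_tag is hypothetical, and the toy list stands in for brown.tagged_words(tagset='universal').

```python
from typing import Iterable, Set, Tuple


def words_with_tag(tagged: Iterable[Tuple[str, str]], tag: str) -> Set[str]:
    """Collect lowercase words with `tag`; fail loudly if the tag never occurs."""
    pairs = list(tagged)
    seen_tags = {t for _, t in pairs}
    if tag not in seen_tags:
        raise ValueError(f"tag {tag!r} not found; tags present: {sorted(seen_tags)}")
    return {word.lower() for word, t in pairs if t == tag}


universal = [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN")]

# A Brown-style tag against universal-tagged data silently matches nothing...
print({w for w, t in universal if t == "NN"})  # set()

# ...whereas the guarded helper reports the mismatch.
try:
    words_with_tag(universal, "NN")
except ValueError as err:
    print(err)
```

The helper materializes its input as a list, which is acceptable for a corpus the size of Brown but worth reconsidering for much larger streams.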

Summary

  • Use wn.all_synsets(pos=wn.NOUN) to pull all nouns from WordNet, and wn.ADJ for adjectives.
  • The Brown corpus provides real-world tagged text; filter on universal tags like NOUN and ADJ for cleaner categories.
  • Always collect words into a set to avoid duplicates from multiple synsets.
  • Building a Mad Libs generator is as simple as defining a template string and calling random.choice() on your word lists.
  • For tagging your own custom text rather than pulling from a corpus, spaCy with a trained model often gives more accurate results than NLTK's default averaged perceptron tagger.
  • Remember to download NLTK data packages before accessing any corpus -- this is the most common source of errors for new users.
