List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer
Introduction
CountVectorizer can build the vocabulary and the term-document matrix, but it does not automatically return the words sorted by total corpus frequency. To get that ranking, you usually fit the vectorizer, sum the counts across all documents, and then sort the resulting vocabulary terms by their totals.
What CountVectorizer Gives You
After fitting, CountVectorizer provides two useful things:
- a sparse matrix of token counts per document
- the vocabulary, which maps each term to its feature column
The matrix tells you how many times each token appears in each document. To rank vocabulary terms by total occurrence in the whole corpus, you need to collapse that matrix across rows.
Sum Counts Across the Corpus
The simplest approach is to sum the matrix by column, because each column corresponds to one vocabulary item.
This gives you each vocabulary term and its total count in the entire corpus.
Sort by Frequency Descending
Once you have the words and totals, sorting is straightforward.
That is the usual answer when people want “the vocabulary according to occurrence.”
A pandas View Can Be Convenient
If you want a table or plan to export the result, wrapping it in a DataFrame can be clearer.
This becomes especially useful when you also want to filter, plot, or export the results later.
Be Careful About Tokenization Rules
The ranking is only as meaningful as the tokenization rules used by CountVectorizer.
Important parameters include:
- stop_words
- ngram_range
- lowercase
- min_df
- max_df
For example, leaving stop words in the vocabulary (the default, since stop_words is None unless you set it) means common words such as "the" may dominate the ranking. That may be correct for raw frequency analysis, but it may be unhelpful for topic discovery.
Corpus Frequency Is Not the Same as Document Frequency
Another common source of confusion is the difference between:
- total occurrence count across the corpus
- number of documents containing the term
With X as the count matrix returned by fit_transform, X.sum(axis=0) gives corpus frequency. If instead you want document frequency, you would count how many rows contain a nonzero value for each column.
Those two measures answer different questions, so be explicit about which one you need.
Common Pitfalls
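The most common mistake is assuming vectorizer.vocabulary_ is already ordered by frequency. It is not: it is a dictionary that maps each term to its column index, and those indices carry no frequency information.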
Another common issue is forgetting that CountVectorizer returns a sparse matrix and therefore needs an aggregation step to compute total counts. Developers also often interpret raw rankings without thinking about stop words, tokenization settings, or whether corpus frequency is actually the right measure for the task.
Summary
- CountVectorizer builds the vocabulary and count matrix, but not a pre-sorted frequency list.
- Sum the count matrix by column to get total corpus frequency per term.
- Pair the totals with get_feature_names_out() and sort descending.
- Adjust tokenization parameters if stop words or n-grams matter.
- Be explicit about whether you want corpus frequency or document frequency.

