List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

Introduction

CountVectorizer can build the vocabulary and the term-document matrix, but it does not automatically return the words sorted by total corpus frequency. To get that ranking, you usually fit the vectorizer, sum the counts across all documents, and then sort the resulting vocabulary terms by their totals.

What CountVectorizer Gives You

After fitting, CountVectorizer provides two useful things:

  • a sparse matrix of token counts per document
  • the vocabulary order used for the feature columns
python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the quick brown fox jumps",
    "the quick blue fox",
    "blue fox jumps quickly",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

The matrix tells you how many times each token appears in each document. To rank vocabulary terms by total occurrence in the whole corpus, you need to collapse that matrix across rows.

Sum Counts Across the Corpus

The simplest approach is to sum the matrix by column, because each column corresponds to one vocabulary item.

python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the quick brown fox jumps",
    "the quick blue fox",
    "blue fox jumps quickly",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out()
counts = np.asarray(X.sum(axis=0)).ravel()

for word, count in zip(words, counts):
    print(word, count)

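This gives you each vocabulary term and its total count in the entire corpus. Summing a sparse matrix along an axis returns a 1 × n_features matrix rather than a flat array, which is why the code converts it with np.asarray(...).ravel().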

Sort by Frequency Descending

Once you have the words and totals, sorting is straightforward.

python
ranking = sorted(
    zip(words, counts),
    key=lambda item: item[1],
    reverse=True,
)

for word, count in ranking:
    print(f"{word}: {count}")

That is the usual answer when people want “the vocabulary according to occurrence.”

A pandas View Can Be Convenient

If you want a table or plan to export the result, wrapping it in a DataFrame can be clearer.

python
import pandas as pd

freq_df = pd.DataFrame({
    "word": words,
    "count": counts,
}).sort_values("count", ascending=False)

print(freq_df)

This becomes especially useful when you also want to filter, plot, or export the results later.

Be Careful About Tokenization Rules

The ranking is only as meaningful as the tokenization rules used by CountVectorizer.

Important parameters include:

  • 'stop_words'
  • 'ngram_range'
  • 'lowercase'
  • 'min_df'
  • 'max_df'
python
vectorizer = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 1),
    min_df=1,
)

For example, leaving stop words in (the default, stop_words=None) means common words such as "the" may dominate the ranking, while passing stop_words="english" filters them out. Keeping them may be correct for raw frequency analysis, but it may be unhelpful for topic discovery.

Corpus Frequency Is Not the Same as Document Frequency

Another common source of confusion is the difference between:

  • total occurrence count across the corpus
  • number of documents containing the term

X.sum(axis=0) gives corpus frequency. If instead you want document frequency, you would count how many rows contain a nonzero value for each column.

Those two measures answer different questions, so be explicit about which one you need.

Common Pitfalls

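The most common mistake is assuming vectorizer.vocabulary_ is already ordered by frequency. It is not: it is a dict that maps each term to its column index in the count matrix, and those indices are assigned in alphabetical order of the terms.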

Another common issue is forgetting that fit_transform returns a sparse matrix of per-document counts, so an aggregation step is needed to compute corpus totals. Developers also often interpret raw rankings without thinking about stop words, tokenization settings, or whether corpus frequency is actually the right measure for the task.

Summary

  • CountVectorizer builds the vocabulary and count matrix, but not a pre-sorted frequency list.
  • Sum the count matrix by column to get total corpus frequency per term.
  • Pair the totals with get_feature_names_out() and sort descending.
  • Adjust tokenization parameters if stop words or n-grams matter.
  • Be explicit about whether you want corpus frequency or document frequency.
