Concatenate custom features with CountVectorizer

Introduction

To combine text features from CountVectorizer with custom numerical features in scikit-learn, use scipy.sparse.hstack() to horizontally stack the sparse matrix from CountVectorizer with your custom feature array. For a cleaner approach, use ColumnTransformer with Pipeline to apply CountVectorizer to text columns and passthrough or transform numerical columns in a single step. This pattern is standard for NLP tasks where text bag-of-words features need to be combined with metadata like document length, sentiment scores, or categorical attributes.

The Problem

CountVectorizer produces a sparse matrix of word counts, but your model also needs non-text features:

python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat on the mat", "the dog sat on the log", "the cat and the dog"]
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(texts)

print(text_features.shape)  # (3, 8), a sparse matrix
print(type(text_features))  # <class 'scipy.sparse._csr.csr_matrix'>

# Custom features (e.g., word count, sentiment)
custom_features = np.array([
    [6, 0.5],   # word_count, sentiment
    [6, 0.3],
    [5, 0.7]
])

# np.hstack cannot combine these directly: one is sparse, the other dense

Method 1: scipy.sparse.hstack (Direct)

python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack, csr_matrix
import numpy as np

texts = ["the cat sat on the mat", "the dog sat on the log", "the cat and the dog"]
custom = np.array([[6, 0.5], [6, 0.3], [5, 0.7]])

vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(texts)

# Convert custom features to sparse and stack.
# format='csr' pins the result type; the default format depends on the SciPy version.
custom_sparse = csr_matrix(custom)
combined = hstack([text_features, custom_sparse], format='csr')

print(combined.shape)  # (3, 10): 8 text features + 2 custom features
print(type(combined))  # <class 'scipy.sparse._csr.csr_matrix'>

This preserves sparsity, which is critical for large vocabularies where a dense matrix would use too much memory.
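
To make the memory argument concrete, here is a rough sketch that compares the bytes a CSR matrix actually stores with the bytes an equivalent dense float64 array would need. The matrix dimensions and the number of non-zero entries are illustrative assumptions, not measurements from a real corpus.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Assumed sizes: 10,000 documents x 50,000 vocabulary terms,
# with at most 500,000 non-zero entries (about 0.1% of all cells).
n_docs, n_terms, nnz = 10_000, 50_000, 500_000

rng = np.random.default_rng(0)
rows = rng.integers(0, n_docs, size=nnz)
cols = rng.integers(0, n_terms, size=nnz)
vals = np.ones(nnz, dtype=np.float64)

# Duplicate (row, col) pairs are summed, so X.nnz may be slightly below nnz.
X = csr_matrix((vals, (rows, cols)), shape=(n_docs, n_terms))

# CSR stores three arrays: the values, their column indices, and row pointers.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# Size of the dense equivalent, computed arithmetically (never allocated).
dense_bytes = n_docs * n_terms * np.dtype(np.float64).itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"dense:  {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array would need 4 GB, while the sparse representation stays in the range of a few megabytes.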

Method 2: ColumnTransformer + Pipeline

Use ColumnTransformer for a clean pipeline that handles text and numeric columns together:

python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'text': ['great product love it', 'terrible waste of money',
             'decent quality okay', 'amazing best purchase ever!'],
    'word_count': [4, 4, 3, 4],
    'has_exclamation': [0, 0, 0, 1],
    'label': [1, 0, 1, 1]
})

# The text selector is a string (1-D column); the numeric selector is a list (2-D)
preprocessor = ColumnTransformer([
    ('text', CountVectorizer(), 'text'),
    ('numeric', StandardScaler(), ['word_count', 'has_exclamation'])
])

pipeline = Pipeline([
    ('features', preprocessor),
    ('classifier', LogisticRegression())
])

pipeline.fit(df[['text', 'word_count', 'has_exclamation']], df['label'])
predictions = pipeline.predict(df[['text', 'word_count', 'has_exclamation']])
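
The main payoff of this approach shows up at prediction time. The sketch below refits the same kind of pipeline and then scores two unseen rows; the `new_rows` frame and its values are hypothetical, chosen only to show that the fitted vocabulary and scaler are reused without any manual bookkeeping.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_cols = ['text', 'word_count', 'has_exclamation']

train = pd.DataFrame({
    'text': ['great product love it', 'terrible waste of money',
             'decent quality okay', 'amazing best purchase ever!'],
    'word_count': [4, 4, 3, 4],
    'has_exclamation': [0, 0, 0, 1],
    'label': [1, 0, 1, 1],
})

pipeline = Pipeline([
    ('features', ColumnTransformer([
        ('text', CountVectorizer(), 'text'),
        ('numeric', StandardScaler(), ['word_count', 'has_exclamation']),
    ])),
    ('classifier', LogisticRegression()),
])
pipeline.fit(train[feature_cols], train['label'])

# Hypothetical unseen rows. Words that are not in the training vocabulary
# (e.g. "awful") are ignored by the fitted CountVectorizer.
new_rows = pd.DataFrame({
    'text': ['love this product', 'awful waste'],
    'word_count': [3, 2],
    'has_exclamation': [0, 0],
})

predictions = pipeline.predict(new_rows[feature_cols])

# The fitted vectorizer is reachable through the ColumnTransformer step.
fitted_vectorizer = pipeline.named_steps['features'].named_transformers_['text']
vocab_size = len(fitted_vectorizer.vocabulary_)
print(predictions, vocab_size)
```

Because the vocabulary is fixed at fit time, the new rows are mapped onto the same columns the classifier was trained on.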

Method 3: FeatureUnion

FeatureUnion concatenates outputs from multiple transformers:

python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Custom feature extractor: character count and word count per document
def extract_custom_features(texts):
    return np.array([[len(t), t.count(' ') + 1] for t in texts])

feature_union = FeatureUnion([
    ('bow', CountVectorizer()),
    ('custom', FunctionTransformer(extract_custom_features))
])

pipeline = Pipeline([
    ('features', feature_union),
    ('classifier', LogisticRegression())
])

texts = ['great product', 'terrible quality', 'love this item', 'do not buy']
labels = [1, 0, 1, 0]

pipeline.fit(texts, labels)
print(pipeline.predict(['amazing product']))
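
The custom branch above feeds raw character and word counts into the model unscaled. One possible variation, sketched here as an assumption rather than as part of the original recipe, is to nest a small pipeline inside the FeatureUnion so the custom features are standardized before they are concatenated with the word counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def extract_custom_features(texts):
    # Character count and whitespace-based word count per document.
    return np.array([[len(t), t.count(' ') + 1] for t in texts], dtype=float)

feature_union = FeatureUnion([
    ('bow', CountVectorizer()),
    # Extract the custom features, then scale them to zero mean, unit variance.
    ('custom', make_pipeline(FunctionTransformer(extract_custom_features),
                             StandardScaler())),
])

texts = ['great product', 'terrible quality', 'love this item', 'do not buy']
combined = feature_union.fit_transform(texts)

print(combined.shape)  # (4, 12): 10 vocabulary terms + 2 custom features
```

The result is still a sparse matrix, because FeatureUnion stacks sparsely whenever at least one branch returns sparse output.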

Method 4: TfidfVectorizer + Custom Features

TfidfVectorizer is often preferred over CountVectorizer for text classification:

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack, csr_matrix
from sklearn.svm import LinearSVC
import numpy as np

texts = ['good movie', 'bad movie', 'great film', 'terrible film', 'ok movie']
labels = [1, 0, 1, 0, 1]

# Text features
tfidf = TfidfVectorizer(max_features=1000)
text_feat = tfidf.fit_transform(texts)

# Custom features
custom = np.array([
    [2, 0.8],   # word_count, sentiment_score
    [2, 0.2],
    [2, 0.9],
    [2, 0.1],
    [2, 0.5]
])

# Scale custom features to match TF-IDF range [0, 1]
scaler = MinMaxScaler()
custom_scaled = scaler.fit_transform(custom)

# Combine
combined = hstack([text_feat, csr_matrix(custom_scaled)])

clf = LinearSVC()
clf.fit(combined, labels)
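
With manual stacking, the fitted vectorizer and scaler have to be reused by hand when new data arrives. The following sketch shows one way to do that; `new_texts` and `new_custom` are hypothetical samples added for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

train_texts = ['good movie', 'bad movie', 'great film', 'terrible film', 'ok movie']
train_custom = np.array([[2, 0.8], [2, 0.2], [2, 0.9], [2, 0.1], [2, 0.5]])
labels = [1, 0, 1, 0, 1]

tfidf = TfidfVectorizer(max_features=1000)
scaler = MinMaxScaler()

# Fit both transformers on the training data only.
X_train = hstack([tfidf.fit_transform(train_texts),
                  csr_matrix(scaler.fit_transform(train_custom))], format='csr')
clf = LinearSVC().fit(X_train, labels)

# Hypothetical new samples: call transform(), not fit_transform(), so the
# columns and the scaling match the training matrix.
new_texts = ['good film', 'terrible movie']
new_custom = np.array([[2, 0.7], [2, 0.15]])
X_new = hstack([tfidf.transform(new_texts),
                csr_matrix(scaler.transform(new_custom))], format='csr')

# The column count must be identical, or predict() will fail.
assert X_new.shape[1] == X_train.shape[1]
print(clf.predict(X_new))
```

Calling fit_transform() on the new data instead would build a different vocabulary and silently misalign the columns.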

Getting Feature Names

python
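import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Example frame; the sentiment values are illustrative
df = pd.DataFrame({
    'text': ['great product love it', 'terrible waste of money',
             'decent quality okay', 'amazing best purchase ever'],
    'word_count': [4, 4, 3, 4],
    'sentiment': [0.9, 0.1, 0.6, 0.95]
})

preprocessor = ColumnTransformer([
    ('text', CountVectorizer(), 'text'),
    ('numeric', 'passthrough', ['word_count', 'sentiment'])
])

preprocessor.fit(df)

# Get all feature names, prefixed with the transformer name
feature_names = preprocessor.get_feature_names_out()
print(feature_names)
# ['text__amazing', 'text__best', ..., 'numeric__word_count', 'numeric__sentiment']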

Common Pitfalls

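  • Using np.hstack instead of scipy.sparse.hstack: np.hstack does not understand SciPy sparse matrices and typically fails with a dimension error. The common workaround, calling .toarray() first, makes the matrix dense, which consumes massive memory for large vocabularies (e.g., 100,000 documents x 50,000 words is about 40 GB as float64). Always use scipy.sparse.hstack to preserve sparsity.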
  • Not scaling custom features before concatenation: CountVectorizer produces counts (0, 1, 2, ...) and TfidfVectorizer produces scores between 0.0 and 1.0, while custom features may be in a completely different range (e.g., word count 0-500). Without scaling, the larger-magnitude features can dominate regularized linear models and distance-based models. Use StandardScaler or MinMaxScaler on custom features.
  • Forgetting to apply the same transformations at prediction time: If you use hstack manually, you must apply the same vectorizer.transform() and scaler.transform() at prediction time. A Pipeline with ColumnTransformer handles this automatically.
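  • Passing the text column to CountVectorizer in ColumnTransformer as a list: CountVectorizer expects a 1-D sequence of strings. A string selector ('text') passes a 1-D column, while a list selector (['text']) passes a 2-D DataFrame, which CountVectorizer does not process as documents. In ColumnTransformer, pass the column name as a string ('text'), not a list (['text']).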
  • Mismatching row counts between text features and custom features: If the text array and custom feature array have different numbers of rows, hstack produces a cryptic dimension error. Always verify that both arrays have the same number of samples before concatenating.
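
Two of the pitfalls above, the column selector and the row-count mismatch, are easy to reproduce outside a pipeline. The sketch below does so with a toy DataFrame; `stack_features` is a hypothetical helper written for this example, not a scikit-learn or SciPy function.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': ['great product', 'terrible quality', 'love this item']})

# A string selector yields a 1-D Series of documents ...
good = CountVectorizer().fit_transform(df['text'])
# ... while a list selector yields a 2-D DataFrame. Iterating a DataFrame
# produces its column labels, so the only "document" seen is the word "text".
bad = CountVectorizer().fit_transform(df[['text']])

print(good.shape)  # (3, 7)
print(bad.shape)   # (1, 1)

def stack_features(text_features, custom_features):
    """Hypothetical helper: verify row counts, then stack sparsely."""
    if text_features.shape[0] != custom_features.shape[0]:
        raise ValueError(
            f"row mismatch: {text_features.shape[0]} text rows vs "
            f"{custom_features.shape[0]} custom rows"
        )
    return hstack([text_features, csr_matrix(custom_features)], format='csr')

custom = np.array([[2, 0.8], [2, 0.2], [3, 0.9]])
combined = stack_features(good, custom)
print(combined.shape)  # (3, 9)
```

An explicit check like this turns a late, hard-to-read stacking error into a message that names both row counts.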

Summary

  • Use scipy.sparse.hstack([text_features, csr_matrix(custom_features)]) for manual concatenation
  • Use ColumnTransformer with Pipeline for production ML pipelines (cleanest approach)
  • Use FeatureUnion when working with raw text input (no DataFrame)
  • Scale custom features to match the range of text features before combining
  • Preserve sparsity — never convert large sparse matrices to dense arrays
