Creating many feature columns in TensorFlow

Introduction

TensorFlow feature columns (tf.feature_column) bridge raw tabular data and model input layers by defining how each feature is transformed: numeric columns pass through directly (optionally normalized), categorical columns are encoded as one-hot vectors or embeddings, and crossed columns capture feature interactions. When a dataset has many features, you can generate the list of feature columns programmatically from the data's schema instead of defining each one by hand. Note that tf.feature_column belongs to the Estimator API and is deprecated in recent TensorFlow 2.x releases; for new Keras models, use the tf.keras.layers preprocessing layers instead.

Numeric Columns

python
import tensorflow as tf

# Single numeric column
age = tf.feature_column.numeric_column("age")

# With normalization
def zscore_normalizer(mean, std):
    return lambda x: (x - mean) / std

income = tf.feature_column.numeric_column(
    "income",
    normalizer_fn=zscore_normalizer(50000, 15000)
)

# Create many numeric columns from a list
numeric_features = ["age", "income", "credit_score", "account_balance", "tenure_months"]
numeric_columns = [tf.feature_column.numeric_column(f) for f in numeric_features]

numeric_column handles continuous values. Use normalizer_fn to standardize features during input processing.
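The mean and standard deviation given to the normalizer are normally estimated from the training split rather than typed in by hand. The following is a minimal sketch in plain Python, reusing the zscore_normalizer helper from above; the train_incomes list is made-up illustration data, and in a real pipeline the returned function would be passed as normalizer_fn and applied to tensors.

```python
import statistics


def zscore_normalizer(mean, std):
    """Return a function that standardizes a value with fixed statistics."""
    return lambda x: (x - mean) / std


# Hypothetical training values, for illustration only
train_incomes = [30000.0, 45000.0, 50000.0, 55000.0, 70000.0]

# Estimate the statistics once, from the training data
income_mean = statistics.fmean(train_incomes)
income_std = statistics.pstdev(train_incomes)  # population standard deviation

normalize_income = zscore_normalizer(income_mean, income_std)
normalized = [normalize_income(v) for v in train_incomes]
```

After normalization the training values have zero mean and unit standard deviation; validation and test data should be transformed with the same two numbers.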

Categorical Columns

python
# Known vocabulary
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["male", "female", "other"]
)

# Large vocabulary from file
city = tf.feature_column.categorical_column_with_vocabulary_file(
    "city", vocabulary_file="cities.txt", vocabulary_size=1000
)

# Hash bucket for high-cardinality features
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=10000
)

# Integer identity (already encoded as integers 0 to N-1)
day_of_week = tf.feature_column.categorical_column_with_identity(
    "day_of_week", num_buckets=7
)

# Wrap categorical columns for use in dense layers
gender_onehot = tf.feature_column.indicator_column(gender)
city_embedding = tf.feature_column.embedding_column(city, dimension=16)
user_embedding = tf.feature_column.embedding_column(user_id, dimension=32)

Categorical columns must be wrapped in indicator_column (one-hot) or embedding_column (dense vector) before feeding into dense layers.
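To make the one-hot wrapping concrete, here is a rough plain-Python sketch of what a vocabulary lookup followed by indicator_column computes conceptually. The helper names lookup_index and one_hot are hypothetical and this is not TensorFlow's implementation; it assumes the default settings, where an out-of-vocabulary value maps to -1 and therefore ends up as an all-zero vector.

```python
def lookup_index(value, vocabulary, default_value=-1):
    """Map a string to its position in the vocabulary, or default_value if absent."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return default_value


def one_hot(index, depth):
    """Return a list of length depth with a 1.0 at index (all zeros if out of range)."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


gender_vocab = ["male", "female", "other"]

known = one_hot(lookup_index("female", gender_vocab), len(gender_vocab))
unknown = one_hot(lookup_index("unspecified", gender_vocab), len(gender_vocab))
```

An embedding column starts from the same integer index but uses it to select a row of a trainable matrix instead of building a one-hot vector, which is why it scales better to large vocabularies.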

Generating Columns Programmatically

python
import pandas as pd
import tensorflow as tf

# Infer column types from a DataFrame (ideally the training split only)
df = pd.read_csv("data.csv")

feature_columns = []

for col in df.columns:
    if col == "target":
        continue

    if pd.api.types.is_integer_dtype(df[col]) or pd.api.types.is_float_dtype(df[col]):
        feature_columns.append(
            tf.feature_column.numeric_column(col)
        )
    elif df[col].dtype == "object":
        # Drop missing values so the vocabulary contains only strings
        vocab = df[col].dropna().unique().tolist()
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(col, vocab)
        if len(vocab) <= 10:
            # Small vocabulary: one-hot encode
            feature_columns.append(tf.feature_column.indicator_column(cat_col))
        else:
            # Large vocabulary: learn a dense embedding
            dim = min(len(vocab) // 2, 50)
            feature_columns.append(tf.feature_column.embedding_column(cat_col, dimension=dim))

print(f"Created {len(feature_columns)} feature columns")

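For datasets with dozens or hundreds of features, generating the columns from the DataFrame's dtypes (or from separate schema metadata) avoids defining each one by hand and keeps the column list in sync with the data.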
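When the schema is available as metadata rather than as a loaded DataFrame, the same decision logic can run without touching the data. The sketch below is an illustration under assumptions: the schema dictionary layout (a "kind" and an optional "cardinality" per feature) and the plan_columns helper are hypothetical, and the thresholds are the ones used in the loop above. It only produces a plan; each entry would then be turned into the corresponding tf.feature_column call.

```python
def plan_columns(schema, onehot_max=10, max_dim=50):
    """Decide, per feature, which kind of feature column to create.

    Returns a list of (name, column_kind, size) tuples, where size is the
    one-hot depth for indicator columns, the embedding dimension for
    embedding columns, and None for numeric columns.
    """
    plan = []
    for name, spec in schema.items():
        if spec["kind"] == "numeric":
            plan.append((name, "numeric", None))
        elif spec["kind"] == "categorical":
            cardinality = spec["cardinality"]
            if cardinality <= onehot_max:
                plan.append((name, "indicator", cardinality))
            else:
                plan.append((name, "embedding", min(cardinality // 2, max_dim)))
        else:
            raise ValueError(f"Unknown feature kind for {name!r}: {spec['kind']!r}")
    return plan


# Hypothetical schema metadata, for illustration only
schema = {
    "age": {"kind": "numeric"},
    "gender": {"kind": "categorical", "cardinality": 3},
    "city": {"kind": "categorical", "cardinality": 1000},
    "plan_type": {"kind": "categorical", "cardinality": 24},
}

column_plan = plan_columns(schema)
```

Separating the plan from the column construction makes the choices easy to review and test before any TensorFlow object is built.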

Bucketized and Crossed Columns

python
# Bucketize continuous features into ranges
# (5 boundaries produce 6 buckets)
age_bucket = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("age"),
    boundaries=[18, 25, 35, 50, 65]
)

# Cross features to capture interactions
# (6 age buckets x 3 genders = 18 combinations hashed into 20 buckets)
age_gender = tf.feature_column.crossed_column(
    [age_bucket, "gender"], hash_bucket_size=20
)

# Wrap crossed column for dense layers
age_gender_indicator = tf.feature_column.indicator_column(age_gender)

Bucketized columns convert continuous values to categorical ranges. Crossed columns combine multiple categorical features to learn interactions.
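The mapping from a value to its bucket can be sketched with the standard bisect module. This is an illustration of the convention, not TensorFlow code: it assumes each boundary is the inclusive lower edge of the next bucket, so the boundaries from the example above yield the ranges below 18, 18 to 24, 25 to 34, 35 to 49, 50 to 64, and 65 and over. The bucket_index helper is a hypothetical name.

```python
import bisect

AGE_BOUNDARIES = [18, 25, 35, 50, 65]
GENDERS = ["male", "female", "other"]


def bucket_index(value, boundaries):
    """Return the bucket index for value; a value equal to a boundary
    falls into the bucket that starts at that boundary."""
    return bisect.bisect_right(boundaries, value)


# N boundaries always produce N + 1 buckets
num_age_buckets = len(AGE_BOUNDARIES) + 1

# Number of distinct (age bucket, gender) pairs a cross has to represent
num_cross_combinations = num_age_buckets * len(GENDERS)

example_indices = [bucket_index(a, AGE_BOUNDARIES) for a in (17, 18, 30, 65, 90)]
```

Counting the combinations first is a useful check when choosing hash_bucket_size for a crossed column, since a size close to the number of combinations makes hash collisions likely.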

Using with Estimators

python
# Build feature columns
feature_columns = numeric_columns + [gender_onehot, city_embedding, age_bucket]

# Create an Estimator
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[128, 64, 32],
    n_classes=2
)

# Input function. train_df and train_labels are assumed to be a pandas
# DataFrame of features and a matching Series of 0/1 labels.
def input_fn(df, labels, batch_size=32, training=True):
    dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if training:
        # Repeat so that `steps`, not the dataset size, decides when training stops
        dataset = dataset.shuffle(1000).repeat()
    return dataset.batch(batch_size)

# Train
estimator.train(input_fn=lambda: input_fn(train_df, train_labels), steps=1000)

Modern Alternative: Keras Preprocessing Layers

python
# TF 2.x Keras approach (recommended over feature_columns)
import tensorflow as tf

# Define input layers
inputs = {
    "age": tf.keras.Input(shape=(1,), name="age", dtype=tf.float32),
    "income": tf.keras.Input(shape=(1,), name="income", dtype=tf.float32),
    "gender": tf.keras.Input(shape=(1,), name="gender", dtype=tf.string),
    "city": tf.keras.Input(shape=(1,), name="city", dtype=tf.string),
}

# Preprocessing layers learn their statistics and vocabularies with adapt().
# train_df is assumed to be a pandas DataFrame holding the training split.
age_normalizer = tf.keras.layers.Normalization(axis=None)
age_normalizer.adapt(train_df["age"].to_numpy(dtype="float32"))
income_normalizer = tf.keras.layers.Normalization(axis=None)
income_normalizer.adapt(train_df["income"].to_numpy(dtype="float32"))

age_norm = age_normalizer(inputs["age"])
income_norm = income_normalizer(inputs["income"])

gender_encoded = tf.keras.layers.StringLookup(vocabulary=["male", "female", "other"],
                                                output_mode="one_hot")(inputs["gender"])

# Build the city vocabulary from the data, capped at 1000 tokens (including OOV)
city_lookup = tf.keras.layers.StringLookup(max_tokens=1000)
city_lookup.adapt(train_df["city"].to_numpy())
city_encoded = city_lookup(inputs["city"])
city_embedded = tf.keras.layers.Embedding(city_lookup.vocabulary_size(), 16)(city_encoded)
city_flat = tf.keras.layers.Flatten()(city_embedded)

# Combine all features
combined = tf.keras.layers.Concatenate()([age_norm, income_norm, gender_encoded, city_flat])
x = tf.keras.layers.Dense(64, activation="relu")(combined)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Keras preprocessing layers are the modern replacement for tf.feature_column. They are part of the model itself, so the exported model accepts raw feature values and can be deployed with TF Serving without separate preprocessing code.

Common Pitfalls

  • Forgetting to wrap categorical columns: Raw categorical_column_with_* columns cannot be used directly in dense layers. Wrap them in indicator_column (one-hot) or embedding_column (dense vector).
  • Hash bucket collisions: categorical_column_with_hash_bucket maps different values to the same bucket when hash_bucket_size is too small. Use a size at least 2-5x the number of unique values.
  • Embedding dimension too large: A common rule is dimension = min(cardinality // 2, 50). An oversized embedding wastes parameters and memory without improving accuracy.
  • Mixing feature_column and Keras layers: tf.feature_column is designed for Estimators. While tf.keras.layers.DenseFeatures bridges them into Keras, the cleaner approach is to use Keras preprocessing layers directly.
  • Not normalizing numeric features: Numeric features with very different scales (age 0-100, income 0-1M) can cause training instability. Use normalizer_fn in numeric_column or the Keras Normalization layer.
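The hash bucket pitfall can be checked empirically before training. The sketch below counts how many values land in an already occupied bucket for several bucket sizes. It is an approximation under stated assumptions: MD5 from hashlib stands in for TensorFlow's internal hash function, the user IDs are synthetic, and bucket_of and count_collisions are hypothetical helper names, so the exact counts will differ from what categorical_column_with_hash_bucket produces even though the trend is the same.

```python
import hashlib


def bucket_of(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size) using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def count_collisions(values, hash_bucket_size):
    """Count values that share a bucket with at least one earlier value."""
    used_buckets = {bucket_of(v, hash_bucket_size) for v in values}
    return len(values) - len(used_buckets)


# 1000 synthetic unique IDs, for illustration only
user_ids = [f"user_{i}" for i in range(1000)]

# Compare sizes of 0.5x, 1x and 5x the number of unique values
collisions_by_size = {
    size: count_collisions(user_ids, size) for size in (500, 1000, 5000)
}
```

With fewer buckets than unique values at least half of the IDs must collide, and even at 1x a substantial share still does, which is the motivation for the 2-5x guideline above.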

Summary

  • tf.feature_column provides numeric_column, categorical_column_with_*, bucketized_column, and crossed_column
  • Categorical columns must be wrapped in indicator_column or embedding_column for dense layers
  • Generate feature columns programmatically from DataFrame schema for large datasets
  • Use crossed_column to capture feature interactions
  • For modern TF 2.x code, prefer Keras preprocessing layers (Normalization, StringLookup, Embedding) over tf.feature_column
  • Always normalize numeric features to avoid training instability
