Introduction
TensorFlow feature columns (tf.feature_column) bridge raw tabular data and model input layers by defining how each feature is transformed: numeric columns pass through directly, categorical columns are encoded as one-hot or embedding vectors, and crossed columns capture feature interactions. When a dataset has many features, you can generate the feature column list programmatically from schema metadata instead of defining each column by hand. Note that tf.feature_column belongs to the TF 1.x/2.x Estimator API and is no longer recommended for new code; for modern TF 2.x Keras models, use the tf.keras.layers preprocessing layers instead.
Numeric Columns
import tensorflow as tf

# Single numeric column
age = tf.feature_column.numeric_column("age")

# With normalization
def zscore_normalizer(mean, std):
    return lambda x: (x - mean) / std

income = tf.feature_column.numeric_column(
    "income",
    normalizer_fn=zscore_normalizer(50000, 15000)
)

# Create many numeric columns from a list
numeric_features = ["age", "income", "credit_score", "account_balance", "tenure_months"]
numeric_columns = [tf.feature_column.numeric_column(f) for f in numeric_features]
numeric_column handles continuous values. Pass normalizer_fn to standardize a feature during input processing, and compute the mean and standard deviation it uses from the training data only.
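The mean and standard deviation given to a normalizer usually come from the training data rather than being typed in by hand. Below is a minimal plain-Python sketch of that step. The names `fit_zscore` and `train_incomes` are hypothetical and used only for illustration; they are not TensorFlow APIs.

```python
import statistics

def fit_zscore(values):
    """Return (mean, std) computed from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def zscore_normalizer(mean, std):
    # Same closure shape as the normalizer_fn in the example above.
    return lambda x: (x - mean) / std

# Hypothetical training values, for illustration only.
train_incomes = [30000.0, 45000.0, 50000.0, 55000.0, 70000.0]
income_mean, income_std = fit_zscore(train_incomes)
normalize_income = zscore_normalizer(income_mean, income_std)
normalized = [normalize_income(v) for v in train_incomes]
```

Because the closure uses only subtraction and division, the same function should also work when TensorFlow calls it on a tensor, so `normalize_income` could be passed as normalizer_fn.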
Categorical Columns
# Known vocabulary
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["male", "female", "other"]
)

# Large vocabulary from file
city = tf.feature_column.categorical_column_with_vocabulary_file(
    "city", vocabulary_file="cities.txt", vocabulary_size=1000
)

# Hash bucket for high-cardinality features
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=10000
)

# Integer identity (already encoded as integers 0-N)
day_of_week = tf.feature_column.categorical_column_with_identity(
    "day_of_week", num_buckets=7
)

# Wrap categorical columns for use in dense layers
gender_onehot = tf.feature_column.indicator_column(gender)
city_embedding = tf.feature_column.embedding_column(city, dimension=16)
user_embedding = tf.feature_column.embedding_column(user_id, dimension=32)
Categorical columns must be wrapped in indicator_column (one-hot) or embedding_column (dense vector) before feeding into dense layers.
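To make the difference between the two wrappers concrete, here is a plain-Python sketch of what each produces for a single value. It is an illustration, not TensorFlow's implementation: the helper names and the tiny embedding table are hypothetical, and the sketch assumes that out-of-vocabulary values map to an all-zero vector.

```python
def lookup_id(value, vocabulary):
    """Map a string to its vocabulary index, or -1 if it is not in the vocabulary."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return -1

def one_hot(value, vocabulary):
    """Dense one-hot vector; out-of-vocabulary values become all zeros (assumption)."""
    vec = [0.0] * len(vocabulary)
    idx = lookup_id(value, vocabulary)
    if idx >= 0:
        vec[idx] = 1.0
    return vec

def embed(value, vocabulary, table):
    """Dense embedding: select the table row that matches the value's index."""
    idx = lookup_id(value, vocabulary)
    dim = len(table[0])
    return list(table[idx]) if idx >= 0 else [0.0] * dim

vocab = ["male", "female", "other"]
# Hypothetical 3 x 2 embedding table; in a real model these rows are trained weights.
table = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]]

female_onehot = one_hot("female", vocab)
unknown_onehot = one_hot("unknown", vocab)
female_embedding = embed("female", vocab, table)
```

The one-hot vector grows with the vocabulary, while the embedding width stays fixed at the chosen dimension. That is why embeddings suit large vocabularies such as city or user_id.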
Generating Columns Programmatically
import pandas as pd

# Infer column types from a DataFrame
df = pd.read_csv("data.csv")

feature_columns = []

for col in df.columns:
    if col == "target":
        continue

    # is_numeric_dtype also matches int32/float32, not only int64/float64
    if pd.api.types.is_numeric_dtype(df[col]):
        feature_columns.append(
            tf.feature_column.numeric_column(col)
        )
    elif df[col].dtype == "object":
        # Drop missing values so NaN does not end up in the vocabulary
        vocab = df[col].dropna().unique().tolist()
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(col, vocab)
        if len(vocab) <= 10:
            # Small vocabulary: one-hot encode
            feature_columns.append(tf.feature_column.indicator_column(cat_col))
        else:
            # Large vocabulary: embed, capping the dimension at 50
            dim = min(len(vocab) // 2, 50)
            feature_columns.append(tf.feature_column.embedding_column(cat_col, dimension=dim))

print(f"Created {len(feature_columns)} feature columns")
For datasets with dozens or hundreds of features, generating columns from the DataFrame's dtypes or other schema metadata avoids repetitive, error-prone hand-written definitions.
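The same decision logic can be separated from column construction, which makes it easy to check without TensorFlow or a CSV file. The sketch below assumes a hypothetical schema format (a dict that maps each column name to either the string "numeric" or a list of category values); `plan_feature_columns` is an illustrative helper, not a library function.

```python
def plan_feature_columns(schema, target="target", onehot_max=10, max_dim=50):
    """Decide an encoding per feature from schema metadata.

    schema maps a column name to either "numeric" or a list of category values.
    Returns a dict of column name -> (encoding, detail), where detail is the
    cardinality for indicator columns and the dimension for embedding columns.
    """
    plan = {}
    for name, spec in schema.items():
        if name == target:
            continue  # the label is not a feature
        if spec == "numeric":
            plan[name] = ("numeric", None)
        else:
            cardinality = len(set(spec))
            if cardinality <= onehot_max:
                plan[name] = ("indicator", cardinality)
            else:
                plan[name] = ("embedding", min(cardinality // 2, max_dim))
    return plan

# Hypothetical schema, for illustration only.
schema = {
    "age": "numeric",
    "gender": ["male", "female", "other"],
    "city": [f"city_{i}" for i in range(240)],
    "zip": [f"zip_{i}" for i in range(30)],
    "target": "numeric",
}
plan = plan_feature_columns(schema)
```

Each entry in `plan` can then be turned into the matching numeric_column, indicator_column, or embedding_column call.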
Bucketized and Crossed Columns
# Bucketize continuous features into ranges
age_bucket = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("age"),
    boundaries=[18, 25, 35, 50, 65]
)

# Cross features to capture interactions
age_gender = tf.feature_column.crossed_column(
    [age_bucket, "gender"], hash_bucket_size=20
)

# Wrap crossed column for dense layers
age_gender_indicator = tf.feature_column.indicator_column(age_gender)
Bucketized columns convert continuous values into categorical ranges; N boundaries produce N + 1 buckets. Crossed columns hash combinations of categorical features into a fixed number of buckets so that the model can learn their interactions.
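The two transformations can be sketched in plain Python to show what the resulting IDs look like. This is an illustration under stated assumptions rather than TensorFlow's implementation: the helper names are hypothetical, and CRC32 is only a stand-in for the hash, since TensorFlow uses a different hash function.

```python
import bisect
import zlib

def bucketize(value, boundaries):
    """Bucket index for value; a value equal to a boundary goes to the upper bucket."""
    return bisect.bisect_right(boundaries, value)

def cross(values, hash_bucket_size):
    """Hash a combination of categorical values into a fixed number of buckets."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size  # stand-in hash, not TensorFlow's

boundaries = [18, 25, 35, 50, 65]
# Five boundaries give six buckets, indexed 0 to 5.
age_buckets = [bucketize(a, boundaries) for a in (10, 18, 30, 65, 80)]
crossed_id = cross([bucketize(30, boundaries), "female"], hash_bucket_size=20)
```

With 6 age buckets and 3 genders there are 18 possible combinations, so a hash_bucket_size of 20 leaves little headroom and some combinations are likely to share a bucket.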
Using with Estimators
# Build feature columns
feature_columns = numeric_columns + [gender_onehot, city_embedding, age_bucket]

# Create an Estimator
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[128, 64, 32],
    n_classes=2
)

# Training input function: yields (features dict, labels) batches
def input_fn(df, labels, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    # repeat() keeps batches flowing so training can run for the full `steps`
    return dataset.shuffle(1000).repeat().batch(batch_size)

# Train (train_df and train_labels are assumed to be loaded already)
estimator.train(input_fn=lambda: input_fn(train_df, train_labels), steps=1000)
Modern Alternative: Keras Preprocessing Layers
# TF 2.x Keras approach (recommended over feature_columns)
import tensorflow as tf

# Define input layers
inputs = {
    "age": tf.keras.Input(shape=(1,), name="age", dtype=tf.float32),
    "income": tf.keras.Input(shape=(1,), name="income", dtype=tf.float32),
    "gender": tf.keras.Input(shape=(1,), name="gender", dtype=tf.string),
    "city": tf.keras.Input(shape=(1,), name="city", dtype=tf.string),
}

# Preprocessing layers that learn their state from data. Call .adapt() on
# training data for each one before building the model, for example
# age_normalizer.adapt(train_ages), where train_ages is a hypothetical array.
# Without adapt() they hold no useful statistics or vocabulary.
age_normalizer = tf.keras.layers.Normalization()
income_normalizer = tf.keras.layers.Normalization()
city_lookup = tf.keras.layers.StringLookup(max_tokens=1000)

age_norm = age_normalizer(inputs["age"])
income_norm = income_normalizer(inputs["income"])

# Fixed vocabulary: the one-hot output has 4 slots (index 0 is for unknown values)
gender_encoded = tf.keras.layers.StringLookup(vocabulary=["male", "female", "other"],
                                              output_mode="one_hot")(inputs["gender"])
city_encoded = city_lookup(inputs["city"])
# 1000 embedding rows match max_tokens=1000 in the lookup above
city_embedded = tf.keras.layers.Embedding(1000, 16)(city_encoded)
city_flat = tf.keras.layers.Flatten()(city_embedded)

# Combine all features
combined = tf.keras.layers.Concatenate()([age_norm, income_norm, gender_encoded, city_flat])
x = tf.keras.layers.Dense(64, activation="relu")(combined)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
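Keras preprocessing layers are the modern replacement for tf.feature_column. Because they are part of the model itself, the preprocessing is saved and exported with it, so a served model (for example under TF Serving) can accept raw feature values.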
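One detail that is easy to miss in the Keras version is how lookup indices line up with the one-hot width and the embedding table. With its default settings, StringLookup reserves index 0 for out-of-vocabulary tokens and numbers the vocabulary from 1. A one-hot output therefore has len(vocabulary) + 1 slots, and an Embedding needs at least that many rows. The plain-Python sketch below mimics that layout; the helper names are hypothetical and this is not the Keras implementation.

```python
OOV_INDEX = 0  # default StringLookup behavior: one OOV slot at index 0

def build_index(vocabulary):
    """Number vocabulary tokens from 1, leaving index 0 for unknown tokens."""
    return {token: i + 1 for i, token in enumerate(vocabulary)}

def string_lookup(tokens, index):
    """Map tokens to integer IDs; unknown tokens fall back to the OOV index."""
    return [index.get(t, OOV_INDEX) for t in tokens]

def one_hot(i, depth):
    """One-hot vector of the given depth with a 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(depth)]

vocabulary = ["male", "female", "other"]
index = build_index(vocabulary)
depth = len(vocabulary) + 1  # vocabulary plus the OOV slot

ids = string_lookup(["female", "unknown", "male"], index)
encoded = [one_hot(i, depth) for i in ids]
```

The same accounting explains why an Embedding with 1000 rows pairs with max_tokens=1000: the token limit already includes the out-of-vocabulary slot.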
Common Pitfalls
Forgetting to wrap categorical columns: Raw categorical_column_with_* columns cannot be used directly in dense layers. Wrap them in indicator_column (one-hot) or embedding_column (dense vector).
Hash bucket collisions: categorical_column_with_hash_bucket maps different values to the same bucket whenever their hashes collide, and collisions become frequent when hash_bucket_size is small relative to the number of unique values. As a rule of thumb, use a size of roughly 2-5x the number of unique values; some collisions remain even then.
Embedding dimension too large: One common heuristic is dimension = min(cardinality // 2, 50). An oversized embedding wastes parameters and memory and rarely improves accuracy.
Mixing feature_column and Keras layers: tf.feature_column is designed for Estimators. While tf.keras.layers.DenseFeatures bridges them into Keras, the cleaner approach is to use Keras preprocessing layers directly.
Not normalizing numeric features: Numeric features on very different scales (age 0-100, income 0-1M) can slow or destabilize training. Use normalizer_fn in numeric_column or a Keras Normalization layer.
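The hash bucket pitfall above can be quantified with a short calculation. The sketch below assumes hashes are spread uniformly over the buckets and estimates the probability that a given value shares its bucket with at least one other value. The function name and the figure of 10,000 distinct values are hypothetical.

```python
def collision_probability(num_values, hash_bucket_size):
    """Probability that a given value shares its bucket with at least one
    other value, assuming uniformly distributed hashes."""
    if num_values < 2:
        return 0.0
    # Each of the other (num_values - 1) values misses this bucket with
    # probability (1 - 1/hash_bucket_size).
    return 1.0 - (1.0 - 1.0 / hash_bucket_size) ** (num_values - 1)

num_values = 10_000  # hypothetical number of distinct values
# Bucket sizes of 1x, 2x, and 5x the number of distinct values.
rates = {m: collision_probability(num_values, m * num_values) for m in (1, 2, 5)}
```

Under this assumption the share of values affected drops from roughly 63% at 1x to about 39% at 2x and 18% at 5x. A larger hash_bucket_size reduces collisions but does not eliminate them.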
Summary
tf.feature_column provides numeric_column, categorical_column_with_*, bucketized_column, and crossed_column
Categorical columns must be wrapped in indicator_column or embedding_column for dense layers
Generate feature columns programmatically from DataFrame schema for large datasets
Use crossed_column to capture feature interactions
For modern TF 2.x code, prefer Keras preprocessing layers (Normalization, StringLookup, Embedding) over tf.feature_column
Always normalize numeric features to avoid training instability