Infer multivalent features with tfdv from pandas dataframe

Introduction

TensorFlow Data Validation (TFDV) is a library for analyzing and validating machine learning data. It computes descriptive statistics, infers a schema describing each feature's type, domain, presence, and valency, and flags anomalies when new data is validated against that schema. Multivalent features (features with multiple values per example, such as tags or categories) require special handling. This article shows how TFDV represents them and what to watch for when the data starts as a pandas DataFrame.

What Are Multivalent Features?

A multivalent (or multi-valued) feature holds more than one value per example, and the number of values can vary from row to row:

python

import pandas as pd

# Single-valued (univalent) features: one value per row
df_simple = pd.DataFrame({
    'age': [25, 30, 35],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Multivalent features: multiple values per row
df_multi = pd.DataFrame({
    'user_id': [1, 2, 3],
    'tags': [['python', 'ml'], ['java'], ['python', 'web', 'api']],
    'scores': [[85, 90], [75, 80, 95], [88]]
})

Examples include: product categories, user interests, search keywords, and multi-label classifications.
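
To make "valency" concrete, the sketch below counts the values per example for each column of a small dataset that mirrors df_multi. It uses plain Python lists rather than pandas or TFDV, and the helper name valency_summary is hypothetical. The minimum and maximum it reports are only an analogue of the per-example value counts that TFDV's statistics summarize.

```python
# Illustrative only: plain-Python stand-in for the df_multi columns above.
columns = {
    "user_id": [1, 2, 3],
    "tags": [["python", "ml"], ["java"], ["python", "web", "api"]],
    "scores": [[85, 90], [75, 80, 95], [88]],
}


def valency_summary(values):
    """Return (min, max) number of values per example for one column.

    Scalars count as a single value; lists and tuples count their length.
    """
    counts = [len(v) if isinstance(v, (list, tuple)) else 1 for v in values]
    return min(counts), max(counts)


summary = {name: valency_summary(vals) for name, vals in columns.items()}
for name, (lo, hi) in summary.items():
    kind = "multivalent" if hi > 1 else "univalent"
    print(f"{name}: min={lo}, max={hi} -> {kind}")
```

A column whose maximum count exceeds 1 is multivalent. A column where both bounds are 1 behaves like an ordinary scalar feature.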

Setting Up TFDV

bash

pip install tensorflow-data-validation

python

import tensorflow_data_validation as tfdv
import pandas as pd

Generating Statistics from a DataFrame

python

# Create a DataFrame
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 28, 42],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA'],
    'purchase_amount': [99.5, 150.0, 75.25, 200.0, 50.0]
})

# Generate statistics
stats = tfdv.generate_statistics_from_dataframe(df)

# Visualize statistics (renders in a notebook environment)
tfdv.visualize_statistics(stats)

Inferring a Schema

python

# Infer schema from statistics
schema = tfdv.infer_schema(stats)

# Display the schema
tfdv.display_schema(schema)

The schema describes each feature's type, domain, presence, and valency:

python

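# Access feature details programmatically
# (type prints as the FeatureType enum value; unset value_count bounds print as 0)
for feature in schema.feature:
    print(f"{feature.name}: type={feature.type}, "
          f"presence={feature.presence.min_fraction}, "
          f"value_count=[{feature.value_count.min}, {feature.value_count.max}]")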

Handling Multivalent Features

TFDV represents multivalent features using value_count constraints in the schema:

python

# Build the data as an Apache Arrow table with list-type columns.
# Note: generate_statistics_from_dataframe accepts only a pandas DataFrame,
# so the table is converted back with to_pandas() before statistics are
# computed. Whether the list columns are then recognized as multivalent may
# depend on your TFDV version; inspect the inferred value_count to confirm.

import pyarrow as pa

# Create Arrow table with list-type columns
table = pa.table({
    'user_id': [1, 2, 3],
    'tags': [['python', 'ml'], ['java', 'spring'], ['python', 'web', 'api']],
    'scores': [[85, 90], [75], [88, 92, 95]]
})

# Generate stats (via pandas) and infer the schema
stats = tfdv.generate_statistics_from_dataframe(table.to_pandas())
schema = tfdv.infer_schema(stats)

Setting Valency Constraints

python

from tensorflow_metadata.proto.v0 import schema_pb2  # get_feature returns a schema_pb2.Feature

# Manually set multivalent constraints: each example must carry 1 to 10 tags
feature = tfdv.get_feature(schema, 'tags')
feature.value_count.min = 1
feature.value_count.max = 10

# For fixed-length features
# (assumes the schema contains a feature named 'embedding'; the lookup fails otherwise)
feature = tfdv.get_feature(schema, 'embedding')
feature.shape.dim.add().size = 128  # fixed 128-dim vector
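
To illustrate what a value_count constraint expresses, the sketch below checks each example's list length against a minimum and maximum in plain Python. It is not how TFDV implements validation. The function name check_value_count and the sample rows are assumptions made for illustration.

```python
def check_value_count(rows, min_count, max_count):
    """Return indices of rows whose number of values falls outside the bounds.

    A row of None is treated as missing and skipped, since presence is a
    separate constraint from valency.
    """
    violations = []
    for i, values in enumerate(rows):
        if values is None:
            continue
        if not (min_count <= len(values) <= max_count):
            violations.append(i)
    return violations


# Hypothetical 'tags' column: one empty list, one missing row, one long list.
tags = [["python", "ml"], [], ["java"], None, ["a", "b", "c", "d"]]
bad_rows = check_value_count(tags, min_count=1, max_count=3)
print(bad_rows)  # [1, 4]
```

Row 1 is rejected for having too few values and row 4 for having too many. The missing row 3 is left to the presence constraint.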

Validating Data Against a Schema

python

# Generate stats for new data
new_df = pd.DataFrame({
    'user_id': [6, 7],
    'age': [22, None],  # missing value
    'city': ['Chicago', 'NYC'],  # 'Chicago' does not appear in the training data
    'purchase_amount': [-10.0, 200.0]  # negative value; only flagged if the schema defines a numeric range
})

new_stats = tfdv.generate_statistics_from_dataframe(new_df)

# Validate against schema
anomalies = tfdv.validate_statistics(new_stats, schema)
tfdv.display_anomalies(anomalies)
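
One kind of check that schema validation performs can be sketched without TFDV: comparing new categorical values against the set seen in training. The snippet below is a simplified stand-in, not TFDV's implementation. The helper name out_of_domain is hypothetical, and the "domain" is just the set of training values.

```python
# City values from the training DataFrame and the new DataFrame above.
train_cities = ["NYC", "LA", "NYC", "SF", "LA"]
new_cities = ["Chicago", "NYC"]

# The "domain" here is simply the set of values seen in training.
domain = set(train_cities)


def out_of_domain(values, domain):
    """Return the sorted values that do not appear in the domain."""
    return sorted(set(values) - domain)


unexpected = out_of_domain(new_cities, domain)
print(unexpected)  # ['Chicago']
```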

Feature Crosses and Combined Analysis

python

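# Restrict statistics to a subset of features with feature_allowlist
# (this limits which columns are analyzed; it does not compute crosses)
stats = tfdv.generate_statistics_from_dataframe(
    df,
    stats_options=tfdv.StatsOptions(
        feature_allowlist=['age', 'city', 'purchase_amount']
    )
)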

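Feature Crosses: The feature_allowlist option only restricts which features are analyzed; it does not by itself compute crosses. Combining features (for example, tags with city) before generating statistics can expose interactions that per-feature statistics hide.
Handling Missing Values: TFDV reports how many examples are missing each feature, so you can decide whether to impute the values, drop the rows, or relax the schema's presence constraint.
Custom Schema: Edit the inferred schema to encode domain knowledge, such as tighter value_count bounds or an explicit domain, so that validation reflects real requirements rather than one sample of data.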
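
One way to read "combining multivalent features" is as a per-example Cartesian product of their values. The sketch below builds such a cross in plain Python. It is not a TFDV API. The helper cross_values and the levels feature are hypothetical names introduced only for this example.

```python
from itertools import product

# Two hypothetical multivalent features per example.
rows = [
    {"tags": ["python", "ml"], "levels": ["beginner"]},
    {"tags": ["java"], "levels": ["beginner", "advanced"]},
]


def cross_values(row, left, right):
    """Return every 'left_x_right' pairing of two features' values in one example."""
    return [f"{a}_x_{b}" for a, b in product(row[left], row[right])]


crossed = [cross_values(row, "tags", "levels") for row in rows]
print(crossed)
```

The crossed column is itself multivalent, so it can be given to statistics generation like any other list-valued feature.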

Practical Pipeline Example

python

import tensorflow_data_validation as tfdv

# Assumes train_df and serving_df are existing pandas DataFrames
# that both contain a 'tags' column.

# Step 1: Generate training stats and schema
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Step 2: Set constraints for multivalent features
tags_feature = tfdv.get_feature(schema, 'tags')
tags_feature.value_count.min = 1
tags_feature.value_count.max = 20

# Step 3: Validate serving data
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema)

if anomalies.anomaly_info:
    print("Data anomalies detected!")
    tfdv.display_anomalies(anomalies)
else:
    print("Data passes validation")

# Step 4: Compare training and serving distributions
tfdv.visualize_statistics(
    lhs_statistics=train_stats,
    rhs_statistics=serving_stats,
    lhs_name='Training',
    rhs_name='Serving'
)

Common Pitfalls

Pandas list columns: pandas stores list-valued columns with the generic object dtype, so TFDV may not recognize them as multivalent features. Check the inferred schema's value_count for such columns, and consider supplying the data in Apache Arrow or TFRecord format for better support.
Schema drift: Review and regenerate the schema when the training data distribution changes significantly. A stale schema produces false anomalies.
Large datasets: generate_statistics_from_dataframe expects the entire DataFrame in memory. For large datasets, use generate_statistics_from_tfrecord or generate_statistics_from_csv, which run as Apache Beam pipelines and process the data in batches.
Version compatibility: Install a TFDV release that is compatible with your TensorFlow version. Mismatches can cause import errors or schema incompatibilities.
Missing values: TFDV distinguishes a missing feature from a feature whose value is an empty list. Set presence.min_fraction appropriately for optional multivalent features.
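
The distinction in the last pitfall between a missing feature and an empty list can be sketched in plain Python. Here None stands for a missing value and [] for a value that is present but empty. The helper name presence_report is hypothetical, and the fraction it computes is only an analogue of what presence.min_fraction constrains.

```python
def presence_report(rows):
    """Count missing (None), empty ([]), and populated examples for one feature."""
    missing = sum(1 for r in rows if r is None)
    empty = sum(1 for r in rows if r is not None and len(r) == 0)
    populated = len(rows) - missing - empty
    return {
        "missing": missing,
        "empty": empty,
        "populated": populated,
        # Fraction of examples where the feature is present at all.
        "present_fraction": (len(rows) - missing) / len(rows),
    }


# Hypothetical 'tags' column with one empty list and one missing value.
tags = [["python"], [], None, ["ml", "web"]]
report = presence_report(tags)
print(report)
```

In this sketch the empty list still counts as present. A presence threshold would therefore not catch it, while a minimum value count of 1 would.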

Summary

TFDV infers schemas from data, detecting feature types, distributions, and anomalies
Multivalent features have multiple values per example (lists, arrays)
Use value_count constraints in the schema to validate multivalent feature lengths
Generate statistics with tfdv.generate_statistics_from_dataframe() and validate with tfdv.validate_statistics()
Compare training and serving data distributions to detect data drift