Apache Spark MLlib: How to Build Labeled Points for String Features


Introduction

LabeledPoint requires numeric features, so raw string columns must be encoded before you can train an MLlib model with them. The main trap is mixing modern pyspark.ml transformers with legacy pyspark.mllib algorithms without converting vectors correctly.

A reliable approach is to do all categorical encoding in the DataFrame-based ml pipeline, then convert the final assembled feature vector into an mllib LabeledPoint only at the edge where the legacy algorithm needs it.

Why Raw Strings Do Not Work

MLlib algorithms operate on numeric vectors. A value like "red" or "small" has no mathematical meaning until you map it to numbers.

For categorical features, the usual sequence is:

1. Index the string values to stable numeric IDs.
2. One-hot encode them when ordinal meaning would be misleading.
3. Assemble them with numeric columns into one feature vector.

If you skip these steps and instead hand-roll a separate mapping in each job, the mappings will not agree, and your training and inference pipelines drift apart.
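The three steps can be sketched in plain Python to show what the encoding does to a single row. This is only an illustration under stated assumptions, not Spark's implementation: the helper names `fit_index`, `one_hot`, and `encode_row` are hypothetical, IDs are assigned most-frequent-first with alphabetical tie-breaking chosen here for determinism, and the one-hot vector keeps every category as a dense list.

```python
from collections import Counter


def fit_index(values):
    """Map each distinct string to an integer ID, most frequent first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(idx, size):
    """Return a dense one-hot list of length `size` with 1.0 at position `idx`."""
    return [1.0 if i == idx else 0.0 for i in range(size)]


def encode_row(color, weight, color_index):
    """Assemble one numeric feature list: one-hot color followed by weight."""
    return one_hot(color_index[color], len(color_index)) + [weight]


colors = ["red", "blue", "red"]
color_index = fit_index(colors)           # "red" is most frequent, so it gets ID 0
features = encode_row("blue", 7.2, color_index)
print(color_index, features)
```

The point of the sketch is that the mapping is learned once from the data and then reused for every row, which is the property the Spark transformers provide at scale.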

Build the Encoding Pipeline with `pyspark.ml`

The DataFrame API is the safest place to encode string columns.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.master("local[*]").appName("lp-demo").getOrCreate()

# Toy data: a numeric label, two string features, one numeric feature.
rows = [
    (1.0, "red", "small", 10.5),
    (0.0, "blue", "medium", 7.2),
    (1.0, "red", "large", 11.8),
]

df = spark.createDataFrame(rows, ["label", "color", "size", "weight"])

# Step 1: map each string column to numeric indices.
# handleInvalid="keep" sends unseen values to an extra bucket instead of failing.
color_indexer = StringIndexer(inputCol="color", outputCol="color_idx", handleInvalid="keep")
size_indexer = StringIndexer(inputCol="size", outputCol="size_idx", handleInvalid="keep")

# Step 2: one-hot encode the indices (multi-column OneHotEncoder, Spark 3.0+).
encoder = OneHotEncoder(
    inputCols=["color_idx", "size_idx"],
    outputCols=["color_vec", "size_vec"]
)

# Step 3: combine the encoded vectors and the numeric column into one feature vector.
assembler = VectorAssembler(
    inputCols=["color_vec", "size_vec", "weight"],
    outputCol="features"
)

pipeline = Pipeline(stages=[color_indexer, size_indexer, encoder, assembler])
model = pipeline.fit(df)
transformed = model.transform(df)
transformed.select("label", "features").show(truncate=False)
```

This keeps category handling reproducible. The fitted StringIndexer stores the mapping learned from training data, which is much safer than a handwritten dictionary buried in application code.
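To see why a mapping that is fitted once and reused beats a mapping rebuilt in each job, consider this plain-Python sketch. It is an illustration only: `first_seen_mapping` is a hypothetical helper standing in for the kind of ad hoc dictionary that tends to appear in application code, and the batches are made up.

```python
def first_seen_mapping(values):
    """Assign IDs in order of first appearance (an ad hoc, order-dependent scheme)."""
    mapping = {}
    for value in values:
        # setdefault only inserts when the value is new, so IDs stay stable within one call.
        mapping.setdefault(value, len(mapping))
    return mapping


training_batch = ["red", "blue", "red"]
scoring_batch = ["blue", "red"]

train_map = first_seen_mapping(training_batch)   # built in the training job
naive_map = first_seen_mapping(scoring_batch)    # rebuilt independently at inference
mappings_agree = train_map == naive_map          # the two jobs disagree on IDs

# Reusing the mapping learned at training time keeps IDs consistent.
scoring_ids = [train_map[v] for v in scoring_batch]
print(train_map, naive_map, mappings_agree, scoring_ids)
```

The two jobs assign opposite IDs to the same colors, so a model trained on one mapping would silently misread features encoded with the other. Reusing the fitted pipeline model avoids this, because the mapping is carried by the fitted object rather than recomputed.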

Convert the `ml` Vector into an `mllib` `LabeledPoint`

This is the part that often gets overlooked. The vector produced by VectorAssembler belongs to the ml API, while LabeledPoint expects an mllib vector.

Use Vectors.fromML during conversion.

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Convert each row: cast the label to float and turn the ml vector into an mllib vector.
labeled_points = transformed.select("label", "features").rdd.map(
    lambda row: LabeledPoint(float(row["label"]), Vectors.fromML(row["features"]))
)

# Inspect a few converted points.
for item in labeled_points.take(3):
    print(item)
```

That conversion makes the boundary explicit and prevents subtle type mismatches.

Handling Unseen Categories

Production data rarely matches the training set perfectly. If an unseen category appears and your indexer is left at the default, handleInvalid="error", the transformation fails.

That is why handleInvalid="keep" is often the right choice for categorical inputs that may evolve. It gives unknown categories a reserved bucket instead of crashing the job.

You should still monitor how often this happens. A sudden rise in unknown values usually means upstream data changed and the model may need retraining.
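The reserved-bucket behavior and the monitoring idea can be sketched in plain Python. This is a conceptual illustration, not Spark code: `index_with_keep` and `unknown_rate` are hypothetical helpers, and the sketch assumes the unknown bucket sits at the index equal to the number of known labels, which is how StringIndexer documents its "keep" option.

```python
def index_with_keep(value, mapping):
    """Return the known ID, or the reserved ID len(mapping) for unseen values."""
    return mapping.get(value, len(mapping))


def unknown_rate(values, mapping):
    """Fraction of values that fall into the reserved unknown bucket."""
    if not values:
        return 0.0
    unknown = sum(1 for v in values if v not in mapping)
    return unknown / len(values)


color_index = {"red": 0, "blue": 1}           # learned at training time
batch = ["red", "green", "blue", "green"]     # "green" never appeared in training

ids = [index_with_keep(v, color_index) for v in batch]
rate = unknown_rate(batch, color_index)
print(ids, rate)
```

In a real job the same ratio could be computed per batch and compared against a threshold you choose, so that a jump in unknown values triggers an alert rather than going unnoticed.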

When to Avoid `LabeledPoint` Entirely

If you are using modern Spark estimators such as those in pyspark.ml.classification or pyspark.ml.regression, stay in the DataFrame API and keep the features column as an ml vector. Converting to LabeledPoint is only necessary for older mllib algorithms or legacy code you cannot remove yet.

In new codebases, prefer the modern API from end to end. It is easier to pipeline, persist, and deploy.

Common Pitfalls

- Feeding raw string columns directly into an MLlib algorithm that expects numeric vectors.
- Creating ad hoc category mappings in Python code and then using different mappings in another job.
- Forgetting that VectorAssembler creates an ml vector, not an mllib one.
- Leaving handleInvalid at the default and discovering unseen categories only after a production failure.
- Converting to LabeledPoint too early instead of keeping preprocessing in the modern pipeline API.

Summary

- Raw string features must be encoded before they can become part of a LabeledPoint.
- Use StringIndexer, OneHotEncoder, and VectorAssembler in pyspark.ml to build a stable preprocessing pipeline.
- Convert the final ml vector with Vectors.fromML when creating legacy mllib LabeledPoint objects.
- Configure unknown-category handling deliberately instead of relying on default failure behavior.
- Prefer the modern DataFrame API unless a legacy MLlib algorithm forces the conversion step.
