How to Feed Variable-Length Input Data to the scikit-learn MLP Classifier

Introduction

MLPClassifier in scikit-learn requires a fixed-width feature matrix. If your samples have variable length, the model cannot consume them directly. You must transform each sample into a consistent numeric representation before calling fit.

Why Variable-Length Inputs Do Not Work Directly

Scikit-learn estimators expect X to have shape (n_samples, n_features): one row per sample and the same number of columns in every row. That means every sample must produce the same number of features.

A ragged input like this is not valid for MLPClassifier:

python
raw = [
    [0.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
]

The input layer of an MLP has a fixed number of units, so the model has no way to represent “one sample has two features, another has four.” The preprocessing step is therefore the real solution.
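To make the failure concrete, the sketch below passes a ragged list straight to fit and records whether it is rejected. The labels y and the flag fit_failed are illustrative names that are not part of the original example, and the exact exception message depends on the installed NumPy and scikit-learn versions, so the code only checks that an error is raised.

```python
from sklearn.neural_network import MLPClassifier

raw = [
    [0.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
]
y = [0, 1, 0]  # illustrative labels, assumed for this sketch

# The set of distinct sample lengths shows why X cannot be rectangular.
lengths = {len(seq) for seq in raw}
print("distinct lengths:", lengths)

try:
    MLPClassifier(max_iter=50).fit(raw, y)
    fit_failed = False
except (ValueError, TypeError) as exc:
    # Input validation rejects the ragged list before any training happens.
    fit_failed = True
    print("fit rejected the input:", type(exc).__name__)
```

Checking the set of lengths before training is a cheap way to catch this problem early: if it contains more than one value, the data needs one of the transformations described in the following sections.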

Strategy 1: Pad or Truncate Numeric Sequences

If the order of the values matters and the sequences are not too long, padding or truncation is the simplest baseline.

python
import numpy as np
from sklearn.neural_network import MLPClassifier

raw = [
    [1.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
    [0.2, 0.3, 0.1],
]
y = np.array([0, 1, 0, 1])

# Every sample is forced to exactly this many features.
max_len = 4

def pad_or_truncate(seq, width, pad_value=0.0):
    # Keep at most `width` values, then right-pad with `pad_value`.
    seq = list(seq[:width])
    return seq + [pad_value] * (width - len(seq))

# X now has shape (4, max_len), so it is a valid feature matrix.
X = np.array([pad_or_truncate(seq, max_len) for seq in raw], dtype=np.float32)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=42)
clf.fit(X, y)
print(clf.predict(X))

This makes the data rectangular, which is what scikit-learn needs.

The drawback is that truncation may discard useful information, while padding may inject many meaningless zeros if the sequences vary a lot.
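One way to soften both drawbacks is sketched below, under two assumptions that are not from the original text: the width is chosen from a percentile of the observed lengths rather than the maximum, and a 0/1 mask is appended so the network can tell padding zeros from real zeros. The helper pad_with_mask and the 90th-percentile cutoff are hypothetical choices for illustration.

```python
import numpy as np

raw = [
    [1.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
    [0.2, 0.3, 0.1],
]

# Pick the width from the length distribution so a few very long samples
# do not force heavy padding on everything else (assumed heuristic).
lengths = np.array([len(seq) for seq in raw])
width = max(1, int(np.percentile(lengths, 90)))

def pad_with_mask(seq, width, pad_value=0.0):
    # Hypothetical helper: returns `width` values followed by `width` mask flags.
    kept = list(seq[:width])
    n_pad = width - len(kept)
    values = kept + [pad_value] * n_pad
    mask = [1.0] * len(kept) + [0.0] * n_pad  # 1.0 = real value, 0.0 = padding
    return values + mask

X = np.array([pad_with_mask(seq, width) for seq in raw], dtype=np.float32)
print(width, X.shape)
```

The mask doubles the number of features, so this trade-off is mainly worthwhile when zero is also a meaningful data value.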

Strategy 2: Extract Fixed Summary Features

Sometimes the sequence itself is less important than its statistics. In that case, derive a fixed set of numeric features from each sample.

python
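import numpy as np

# Same ragged sample data as in the padding example.
raw = [
    [1.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
    [0.2, 0.3, 0.1],
]


def sequence_features(seq):
    # Reduce a sequence of any length to four statistics.
    arr = np.array(seq, dtype=np.float32)
    if arr.size == 0:
        # Empty sequences get a neutral all-zero feature vector.
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(arr.mean()),
        float(arr.std()),
        float(arr.min()),
        float(arr.max()),
    ]

# Shape (n_samples, 4) regardless of the original sequence lengths.
X_stats = np.array([sequence_features(seq) for seq in raw], dtype=np.float32)
print(X_stats)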

This often generalizes better than naïve padding when the sequence lengths vary widely and the task depends more on aggregate properties than on exact position.
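As a rough illustration of how such features feed into training, the sketch below adds the sequence length as a fifth feature and fits a small classifier. The function summary_with_length, the labels, and the new five-element sample are assumptions made for this example, and with only four training rows the prediction itself is not meaningful.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

raw = [
    [1.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
    [0.2, 0.3, 0.1],
]
y = np.array([0, 1, 0, 1])  # illustrative labels

def summary_with_length(seq):
    # Hypothetical variant: length plus mean, std, min, and max.
    arr = np.asarray(seq, dtype=np.float64)
    if arr.size == 0:
        return [0.0] * 5
    return [float(arr.size), arr.mean(), arr.std(), arr.min(), arr.max()]

X_stats = np.array([summary_with_length(seq) for seq in raw])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=42)
clf.fit(X_stats, y)

# A sequence longer than any training sample still maps to five features.
new = np.array([summary_with_length([0.4, 0.6, 0.2, 0.8, 0.1])])
pred = clf.predict(new)
print(new.shape, pred)
```

Because the feature count no longer depends on the input length, no truncation is needed at inference time.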

Strategy 3: Vectorize Text-Like Inputs

If the variable-length input is text or tokens, use a vectorizer rather than padding raw lists yourself.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier

texts = ["good fast", "bad slow", "good reliable", "slow buggy"]
y = [1, 0, 1, 0]

# The vectorizer learns a vocabulary; each text becomes one row with
# one column per vocabulary term, whatever the text's length.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)),
])

model.fit(texts, y)
print(model.predict(["good and reliable"]))

This is usually the most natural approach for variable-length text because the vectorizer maps every sample into the same fixed-width feature space: one column per vocabulary term learned during fit.
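The fixed width can be checked directly by inspecting the vectorizer output. The sketch below reuses the four training texts from the example above; the longer query sentence is an illustrative input chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["good fast", "bad slow", "good reliable", "slow buggy"]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)

# One column per distinct term seen during fit.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_text.shape)

# A longer text still maps to the same number of columns; terms that were
# not seen during fit ("and", "surprisingly") are simply ignored.
X_new = vectorizer.transform(["good and reliable and surprisingly fast"])
print(X_new.shape, X_new.nnz)
```

Words outside the learned vocabulary contribute nothing at inference time, so the training corpus should cover the terms the model is expected to react to.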

Use a Pipeline When Preprocessing Matters

Once you choose a feature representation, put preprocessing and model training into one pipeline when possible. That keeps train and inference behavior consistent.

For dense numeric inputs, scaling is often helpful too.

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Scaling is fitted on the training data only and reapplied at predict time.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=800, random_state=42)),
])

This does not solve variable length by itself, but it improves the full workflow once the features are fixed-width.
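The length-fixing step can also live inside the pipeline, so that ragged data goes in at both training and prediction time. The sketch below assumes a small custom transformer, PadOrTruncate, which is a hypothetical helper written for this example and not a scikit-learn class; the labels and the new sample are likewise illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class PadOrTruncate(TransformerMixin, BaseEstimator):
    """Hypothetical helper: ragged numeric sequences -> fixed-width array."""

    def __init__(self, width=4, pad_value=0.0):
        self.width = width
        self.pad_value = pad_value

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        # Start from a matrix full of the pad value, then copy in real values.
        out = np.full((len(X), self.width), self.pad_value, dtype=np.float64)
        for i, seq in enumerate(X):
            kept = list(seq)[: self.width]
            out[i, : len(kept)] = kept
        return out

raw = [
    [1.2, 0.7],
    [0.5, 0.1, 0.4, 1.1],
    [0.9],
    [0.2, 0.3, 0.1],
]
y = np.array([0, 1, 0, 1])  # illustrative labels

model = Pipeline([
    ("pad", PadOrTruncate(width=4)),
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=42)),
])
model.fit(raw, y)

# A five-element sample is truncated to four values by the same step.
preds = model.predict([[0.3, 0.8, 0.1, 0.5, 0.9]])
print(preds)
```

Keeping the padding rule in one place means the width and pad value used at inference cannot drift from those used in training.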

Sometimes a Different Model Is Better

If sequence order and long-range context are central to the problem, padding data into MLPClassifier may not be the best design. Recurrent models, transformers, or other sequence-aware architectures may fit the problem better.

That said, a fixed-feature scikit-learn pipeline is often much faster to build and easier to debug, so it can still be the right baseline.

Common Pitfalls

A common mistake is trying to pass ragged Python lists directly into MLPClassifier.fit and expecting scikit-learn to infer the structure.

Another mistake is padding without thinking about how much information truncation destroys or how much noise the padding values add.

Developers also often apply preprocessing by hand outside a pipeline, which makes it easy for inference-time inputs to be transformed differently from the training data.

Finally, if sequence order is essential, do not force the problem into a fixed-width MLP representation just because the API is convenient.

Summary

  • MLPClassifier requires every sample to have the same number of features.
  • Convert variable-length inputs into fixed-width vectors before training.
  • Padding, truncation, summary features, and vectorization are the main strategies.
  • Use pipelines so preprocessing and training stay consistent.
  • Reconsider the model choice if the task truly depends on sequence structure.
