Could not convert string to float error from the Titanic competition

Introduction

The Titanic Kaggle dataset mixes numeric columns such as Age and Fare with string columns such as Sex, Embarked, and Cabin. The error “could not convert string to float” appears when code sends those raw string values into a model or a conversion step that expects numbers only. The fix is not to coerce everything blindly, but to inspect the column types and encode categorical features correctly.

Why the Error Happens

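Most machine learning estimators in scikit-learn expect a numeric matrix. If you pass a DataFrame that contains text columns, scikit-learn's input validation tries to convert the whole input to a float array. The conversion stops at the first string it cannot parse and typically raises a ValueError such as could not convert string to float: 'male'.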

A minimal example:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Fare": [7.25, 71.28, 7.92],
        "Survived": [0, 1, 1],
    }
)

X = train[["Pclass", "Sex", "Fare"]]
y = train["Survived"]

model = LogisticRegression(max_iter=200)
model.fit(X, y)  # raises ValueError: could not convert string to float: 'male'
```

That fails because Sex still contains text.
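To confirm where the failure comes from, you can wrap the fit call and look at the exception. The following is a minimal sketch on the same kind of toy data; the variable name error_message is only illustrative, and the exact wording of the message may differ between library versions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for the Titanic training data (values are illustrative).
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Fare": [7.25, 71.28, 7.92],
        "Survived": [0, 1, 1],
    }
)

error_message = None
try:
    # "Sex" is still text, so input validation cannot build a float matrix.
    LogisticRegression(max_iter=200).fit(
        toy[["Pclass", "Sex", "Fare"]], toy["Survived"]
    )
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The message usually quotes the first offending value, which points to the column that still needs encoding.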

Inspect the Data Before Training

The first step is always to confirm which columns are numeric and which are categorical.

```python
import pandas as pd

train = pd.read_csv("train.csv")
print(train.dtypes)
print(train.head())
```

On Titanic data, columns such as Name, Sex, Ticket, Cabin, and Embarked need special handling. Columns such as Age may also need imputation because missing values can cause different training errors even after strings are fixed.
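Beyond printing dtypes, it can help to split the columns into numeric and non-numeric groups and to count missing values per column. The sketch below uses a few made-up rows in place of train.csv so that it runs on its own; the names sample, numeric_cols, text_cols, and missing_counts are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; with the real data, use pd.read_csv("train.csv").
sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, np.nan, 54.0],
        "Fare": [7.25, 71.28, 7.92, 51.86],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Columns that can go to an estimator as they are (apart from missing values).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Columns that still hold text and need encoding first.
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of missing entries in each column.
missing_counts = sample.isna().sum()

print("numeric:", numeric_cols)
print("text:", text_cols)
print(missing_counts)
```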

A Robust Fix with a Preprocessing Pipeline

The cleanest solution is to separate numeric and categorical columns and preprocess them differently. ColumnTransformer lets you do that in one reusable pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")

features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
X = train[features]
y = train["Survived"]

numeric_features = ["Pclass", "Age", "Fare"]
categorical_features = ["Sex", "Embarked"]

# Numeric columns: fill missing values with the column median.
numeric_pipeline = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode. Categories unseen during fit are ignored at predict time.
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Route each group of columns to its own preprocessing pipeline.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ]
)

# Chain preprocessing and the classifier so both are fitted together.
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)

model.fit(X, y)
print("Training complete")
```

This handles both missing numeric values and string categories in a single training flow.

When Manual Conversion Is Enough

If you only need a quick experiment, you can convert one categorical column by hand.

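```python
import pandas as pd

train = pd.read_csv("train.csv")

# Map the two Sex values to integers.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Fill missing ports with "S", the most frequent value in the training set.
train["Embarked"] = train["Embarked"].fillna("S")

# One-hot encode Embarked, dropping the first level to avoid a redundant column.
train = pd.get_dummies(train, columns=["Embarked"], drop_first=True)
```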

This is acceptable for a small notebook, but it becomes harder to maintain as the feature set grows.
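One maintenance cost of the manual route is that the test set must receive exactly the same encoding. A subtle trap is that get_dummies with drop_first=True derives its columns from whatever values happen to be present, so a test file with fewer distinct ports can produce different columns. A possible workaround, sketched below on invented rows, is to fix the category levels up front; the helper encode_manually and the names sex_codes and embarked_levels are hypothetical.

```python
import pandas as pd

# Invented rows for illustration; the test part lacks "C" and "Q".
train_part = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]}
)
test_part = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["S", None]})

sex_codes = {"male": 0, "female": 1}
embarked_levels = ["C", "Q", "S"]  # fixed so train and test share columns


def encode_manually(frame):
    """Apply the same hand-written encoding to any Titanic-like frame."""
    out = frame.copy()
    out["Sex"] = out["Sex"].map(sex_codes)
    # Declaring the levels makes get_dummies emit a column per level,
    # even when a level does not occur in this particular frame.
    out["Embarked"] = pd.Categorical(
        out["Embarked"].fillna("S"), categories=embarked_levels
    )
    return pd.get_dummies(out, columns=["Embarked"], drop_first=True)


train_encoded = encode_manually(train_part)
test_encoded = encode_manually(test_part)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Both frames end up with the same columns in the same order, which is what an estimator fitted on the training frame expects.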

Missing Values Are a Separate Problem

It is common to fix the string columns and still get another exception, typically a ValueError reporting that the input contains NaN, because Age or Cabin contains missing data. That does not mean the first fix failed. It means the dataset has more than one preprocessing issue.

Treat the steps separately:

encode text columns
impute or drop missing values
train the estimator on the cleaned result

That sequence is easier to debug than trying random conversions until the error changes.
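The three steps can be sketched one at a time, with a check after each so that a failure points at a single cause. The rows below are invented, and the names frame, remaining_text_cols, and remaining_missing are illustrative; median imputation is just one reasonable choice here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented rows for illustration only.
frame = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "female", "male", "male"],
        "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
        "Survived": [0, 1, 1, 1, 0, 0],
    }
)

# Step 1: encode text columns, then verify no text columns remain.
frame["Sex"] = frame["Sex"].map({"male": 0, "female": 1})
remaining_text_cols = frame.select_dtypes(exclude="number").columns.tolist()

# Step 2: impute missing values, then verify nothing is missing.
frame["Age"] = frame["Age"].fillna(frame["Age"].median())
remaining_missing = int(frame.isna().sum().sum())

# Step 3: train on the cleaned result.
clf = LogisticRegression(max_iter=200)
clf.fit(frame[["Sex", "Age"]], frame["Survived"])

print(remaining_text_cols, remaining_missing)
```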

Common Pitfalls

Calling .astype(float) on a mixed-type DataFrame and hoping every column will convert cleanly.
Passing raw columns such as Sex or Embarked directly into scikit-learn estimators.
Forgetting that missing values in numeric columns cause separate training failures.
Encoding the training set manually but forgetting to apply the same transformation to test data.
Dropping useful categorical columns entirely instead of encoding them.
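The first pitfall is easy to reproduce. In the sketch below (invented values; the names mixed, blanket_cast_failed, and numeric_only are illustrative), a blanket cast fails on the text column, while casting only the columns that are already numeric works.

```python
import pandas as pd

# Two invented rows with one text column and one numeric column.
mixed = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]})

try:
    # Attempts to parse "male" as a number and raises.
    mixed.astype(float)
    blanket_cast_failed = False
except (ValueError, TypeError):
    blanket_cast_failed = True

# Casting only the already-numeric columns succeeds; "Sex" still needs encoding.
numeric_only = mixed.select_dtypes(include="number").astype(float)

print(blanket_cast_failed, list(numeric_only.columns))
```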

Summary

The error happens because a model or conversion step received string data where numeric input was required.
Inspect Titanic column types before training.
Encode categorical features such as Sex and Embarked instead of forcing them into floats.
Handle missing numeric values separately with imputation.
Prefer a scikit-learn preprocessing pipeline so training and prediction use the same transformations.