Could not convert string to float error from the Titanic competition
Introduction
The Titanic Kaggle dataset mixes numeric columns such as Age and Fare with string columns such as Sex, Embarked, and Cabin. The error “could not convert string to float” appears when code sends those raw string values into a model or a conversion step that expects numbers only. The fix is not to coerce everything blindly, but to inspect the column types and encode categorical features correctly.
Why the Error Happens
Most machine learning estimators in scikit-learn expect a numeric matrix. If you pass a DataFrame containing text columns, scikit-learn tries to convert it to a NumPy float array, and that conversion raises a ValueError as soon as it reaches a string such as male or S.
A minimal example:
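The sketch below reproduces the failure on a four-row stand-in for the Titanic data. The values and the choice of LogisticRegression are assumptions made for illustration; the exception is caught only so the snippet runs to completion.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for the Titanic training data (illustrative values only).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

X = df[["Sex", "Age"]]  # Sex is still raw text
y = df["Survived"]

error_message = None
try:
    LogisticRegression().fit(X, y)
except ValueError as exc:
    # scikit-learn fails while converting X to a float array.
    error_message = str(exc)
    print(error_message)
```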
That fails because Sex still contains text.
Inspect the Data Before Training
The first step is always to confirm which columns are numeric and which are categorical.
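One way to do that check, sketched on a small hand-made frame rather than the real train.csv (the rows here are assumptions for illustration), is to split the columns by dtype and count missing values per column.

```python
import numpy as np
import pandas as pd

# Illustrative rows; in practice df would come from the Titanic train.csv.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns pandas already treats as numbers.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else (text here) needs encoding before training.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
# Missing values per column, to plan imputation separately.
missing_counts = df.isna().sum()

print(df.dtypes)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(missing_counts)
```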
On Titanic data, columns such as Name, Sex, Ticket, Cabin, and Embarked need special handling. Columns such as Age may also need imputation because missing values can cause different training errors even after strings are fixed.
A Robust Fix with a Preprocessing Pipeline
The cleanest solution is to separate numeric and categorical columns and preprocess them differently. ColumnTransformer lets you do that in one reusable pipeline.
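A minimal sketch of such a pipeline follows. The feature lists, the median and most-frequent imputation strategies, and the LogisticRegression classifier are assumptions chosen for illustration, and the six rows stand in for the real training file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative rows with a missing Age and a missing Embarked value.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

numeric_features = ["Age", "Fare"]
categorical_features = ["Sex", "Embarked"]

# Numeric columns: fill gaps with the column median.
numeric_transformer = SimpleImputer(strategy="median")

# Text columns: fill gaps with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = df[numeric_features + categorical_features]
y = df["Survived"]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```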
This handles both missing numeric values and string categories in a single training flow.
When Manual Conversion Is Enough
If you only need a quick experiment, you can convert one categorical column by hand.
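For example, Sex can be mapped to integers with a dictionary. The 0/1 assignment below is an arbitrary choice for illustration, and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Arbitrary numeric codes; any value missing from this dict becomes NaN.
sex_codes = {"male": 0, "female": 1}
df["Sex"] = df["Sex"].map(sex_codes)

print(df)
```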
This is acceptable for a small notebook, but it becomes harder to maintain as the feature set grows.
Missing Values Are a Separate Problem
It is common to fix the string columns and still get another exception because Age or Cabin contains missing data. That does not mean the first fix failed. It means the dataset has more than one preprocessing issue.
Treat the steps separately:
- encode text columns
- impute or drop missing values
- train the estimator on the cleaned result
That sequence is easier to debug than trying random conversions until the error changes.
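The three steps can be sketched one at a time, with a check after each so a failure points at a single step. The rows, the use of pd.get_dummies, and median imputation are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Step 1: encode text columns, then confirm no text columns remain.
features = pd.get_dummies(df[["Sex", "Age"]], columns=["Sex"], dtype=float)
remaining_text = features.select_dtypes(exclude="number").columns.tolist()

# Step 2: impute missing values, then confirm none remain.
features["Age"] = features["Age"].fillna(features["Age"].median())
remaining_missing = int(features.isna().sum().sum())

# Step 3: train on the cleaned result.
clf = LogisticRegression(max_iter=1000).fit(features, df["Survived"])

print(remaining_text, remaining_missing)
```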
Common Pitfalls
- Calling .astype(float) on a mixed-type DataFrame and hoping every column will convert cleanly.
- Passing raw columns such as Sex or Embarked directly into scikit-learn estimators.
- Forgetting that missing values in numeric columns cause separate training failures.
- Encoding the training set manually but forgetting to apply the same transformation to test data.
- Dropping useful categorical columns entirely instead of encoding them.
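The train/test pitfall can be avoided by fitting an encoder once on the training data and reusing it on the test data. The sketch below assumes a OneHotEncoder and uses a made-up category "X" to show what happens to a value the encoder never saw.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
# "X" is a hypothetical category that does not appear in the training data.
test = pd.DataFrame({"Embarked": ["C", "S", "X"]})

# Fit on the training data only; unseen categories encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["Embarked"]]).toarray()

# Reuse the fitted encoder so test columns line up with training columns.
test_encoded = encoder.transform(test[["Embarked"]]).toarray()

print(encoder.categories_)
print(test_encoded)
```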
Summary
- The error happens because a model or conversion step received string data where numeric input was required.
- Inspect Titanic column types before training.
- Encode categorical features such as Sex and Embarked instead of forcing them into floats.
- Handle missing numeric values separately with imputation.
- Prefer a scikit-learn preprocessing pipeline so training and prediction use the same transformations.

