Introduction
Weka (Waikato Environment for Knowledge Analysis) is a Java-based machine learning toolkit with a GUI and a powerful command-line interface. While the GUI is useful for exploration, the command line is essential for automation, scripting, reproducible experiments, and processing large datasets on servers without a display. Every Weka algorithm can be invoked from the command line with full parameter control.
Prerequisites
# Check Java is installed (Weka requires Java 8+)
java -version

# Download Weka from https://www.cs.waikato.ac.nz/ml/weka/
# Extract to a directory, e.g., ~/weka-3-8-6/

# Set the classpath
export WEKAJAR=~/weka-3-8-6/weka.jar

# Test the installation
java -cp $WEKAJAR weka.core.Version
# 3.8.6
Loading and Inspecting Data
Weka uses ARFF (Attribute-Relation File Format) as its native format, but can read CSV files too:
# View dataset summary
java -cp $WEKAJAR weka.core.Instances data/iris.arff

# Convert CSV to ARFF
java -cp $WEKAJAR weka.core.converters.CSVLoader data.csv > data.arff

# Convert ARFF to CSV (the saver reads the input file directly)
java -cp $WEKAJAR weka.core.converters.CSVSaver -i data.arff -o data.csv
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
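The header/data structure above can be checked with standard shell tools before Weka ever sees the file. The following is a minimal sketch, not a full ARFF validator: the file name tiny.arff and its contents are made up for illustration, and the field check is naive (it assumes no quoted values containing commas).

```shell
# Write a tiny ARFF file; the file name and contents are illustrative only.
cat > tiny.arff <<'EOF'
@relation tiny

@attribute length numeric
@attribute width numeric
@attribute class {a, b}

@data
5.1,3.5,a
4.9,3.0,b
EOF

# Count the declared attributes (ARFF header keywords are case-insensitive).
attr_count=$(grep -ci '^@attribute' tiny.arff)

# Count data rows whose comma-separated field count differs from attr_count.
# Naive check: it does not handle quoted values that contain commas.
bad_rows=$(awk -F, -v n="$attr_count" '
  in_data && NF && $0 !~ /^%/ && NF != n { bad++ }
  tolower($0) ~ /^@data/ { in_data = 1 }
  END { print bad + 0 }
' tiny.arff)

echo "attributes=$attr_count bad_rows=$bad_rows"
```

A non-zero bad_rows count suggests a row with a missing or extra value, which is worth fixing before training.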
Running Classifiers
Decision Tree (J48)
# Train and evaluate with 10-fold cross-validation
java -cp $WEKAJAR weka.classifiers.trees.J48 \
  -t data/iris.arff \
  -x 10

# Key options:
#   -t    Training file
#   -T    Separate test file
#   -x    Number of cross-validation folds
#   -o    Output statistics only (do not print the model)
#   -p 0  Output predictions (0 = no attributes, 1-n = specific attributes)
Random Forest
java -cp $WEKAJAR weka.classifiers.trees.RandomForest \
  -t data/iris.arff \
  -I 100 \
  -x 10

#   -I      Number of trees (iterations)
#   -K      Number of features to consider at each split (0 = auto)
#   -depth  Maximum depth (0 = unlimited)
Naive Bayes
java -cp $WEKAJAR weka.classifiers.bayes.NaiveBayes \
  -t data/iris.arff \
  -x 10
SVM (SMO)
java -cp $WEKAJAR weka.classifiers.functions.SMO \
  -t data/iris.arff \
  -C 1.0 \
  -x 10

#   -C  Complexity parameter (regularization)
#   -K  Kernel: "weka.classifiers.functions.supportVector.PolyKernel" for polynomial
Train/Test Split
# Use a separate test file
java -cp $WEKAJAR weka.classifiers.trees.J48 \
  -t train.arff \
  -T test.arff

# Percentage split (66% train, 34% test)
java -cp $WEKAJAR weka.classifiers.trees.J48 \
  -t data/iris.arff \
  -split-percentage 66
Saving and Loading Models
# Save trained model
java -cp $WEKAJAR weka.classifiers.trees.J48 \
  -t data/iris.arff \
  -d model.j48

# Load model and classify new data
java -cp $WEKAJAR weka.classifiers.trees.J48 \
  -l model.j48 \
  -T new_data.arff \
  -p 0
Data Preprocessing (Filters)
# Normalize numeric attributes to [0, 1]
java -cp $WEKAJAR weka.filters.unsupervised.attribute.Normalize \
  -i data.arff -o normalized.arff

# Remove an attribute (e.g., column 1)
java -cp $WEKAJAR weka.filters.unsupervised.attribute.Remove \
  -R 1 \
  -i data.arff -o filtered.arff

# Discretize numeric attributes into 5 bins
java -cp $WEKAJAR weka.filters.unsupervised.attribute.Discretize \
  -B 5 \
  -i data.arff -o discretized.arff

# Resample toward a uniform class distribution (for class imbalance)
# -B 1.0 biases fully toward uniform; supervised filters need the class index (-c)
java -cp $WEKAJAR weka.filters.supervised.instance.Resample \
  -B 1.0 \
  -c last \
  -i data.arff -o resampled.arff

# Combine a filter and a classifier using FilteredClassifier
java -cp $WEKAJAR weka.classifiers.meta.FilteredClassifier \
  -F "weka.filters.unsupervised.attribute.Normalize" \
  -W weka.classifiers.trees.J48 \
  -t data.arff -x 10
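FilteredClassifier takes a single filter specification, so applying several filters in sequence needs a wrapper. One option is weka.filters.MultiFilter, which accepts repeated -F options. The nested quoting is easy to get wrong, so this sketch builds the command as a bash array and prints it as a dry run instead of executing it. The particular filter combination and the fallback jar path are assumptions for illustration.

```shell
# Each inner filter spec is quoted inside the single MultiFilter argument.
multi_spec='weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.Remove -R 1" -F "weka.filters.unsupervised.attribute.Normalize"'

# Build the command as an array so every argument keeps its boundaries.
cmd=(java -cp "${WEKAJAR:-weka.jar}" weka.classifiers.meta.FilteredClassifier
     -F "$multi_spec"
     -W weka.classifiers.trees.J48
     -t data.arff -x 10)

# Dry run: print one argument per line so the quoting can be inspected.
printf '%s\n' "${cmd[@]}"

# To actually run it (requires Java and weka.jar): "${cmd[@]}"
```

The whole MultiFilter specification must arrive at Weka as one argument, which is why it is stored in a single-quoted variable and expanded inside double quotes.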
Clustering
# K-Means clustering
java -cp $WEKAJAR weka.clusterers.SimpleKMeans \
  -t data/iris.arff \
  -N 3

#   -N  Number of clusters
# Note: -x (cross-validation) applies only to density-based clusterers such as EM

# Expectation-Maximization
java -cp $WEKAJAR weka.clusterers.EM \
  -t data/iris.arff \
  -N -1

#   -N -1  Auto-select number of clusters using cross-validation
Feature Selection
# Evaluate attributes using InfoGain
java -cp $WEKAJAR weka.attributeSelection.InfoGainAttributeEval \
  -i data/iris.arff \
  -s "weka.attributeSelection.Ranker -T 0.01"

# Use CfsSubsetEval with BestFirst search
java -cp $WEKAJAR weka.attributeSelection.CfsSubsetEval \
  -i data/iris.arff \
  -s "weka.attributeSelection.BestFirst -D 1"
Scripting and Automation
#!/bin/bash
# Run multiple classifiers and compare results

WEKAJAR=~/weka-3-8-6/weka.jar
DATA=data/iris.arff

classifiers=(
  "weka.classifiers.trees.J48"
  "weka.classifiers.trees.RandomForest -I 100"
  "weka.classifiers.bayes.NaiveBayes"
  "weka.classifiers.functions.SMO"
  "weka.classifiers.lazy.IBk -K 5"
)

for clf in "${classifiers[@]}"; do
  echo "=== $clf ==="
  # $clf is left unquoted on purpose so the class name and its options split into separate arguments
  java -cp "$WEKAJAR" $clf -t "$DATA" -x 10 2>&1 | grep "Correctly Classified"
  echo
done
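To compare classifiers in a table rather than by eye, the accuracy can be pulled out of the matched line. This is a minimal sketch: extract_accuracy is a hypothetical helper, and it assumes the usual layout in which the instance count is followed by the percentage. Weka can print this line more than once in a single run (for example for both training-set and cross-validation statistics), so the helper keeps the last match.

```shell
# Hypothetical helper: read Weka evaluation output on stdin and print the
# percentage from the last "Correctly Classified Instances" line.
extract_accuracy() {
  awk '/Correctly Classified Instances/ { acc = $5 }
       END { if (acc != "") print acc }'
}

# Made-up sample line in the usual layout (count, then percentage), used only
# to exercise the parser; it is not a real result.
sample='Correctly Classified Instances          90               90      %'
printf '%s\n' "$sample" | extract_accuracy
```

Inside the loop, the grep could be replaced with acc=$(java ... 2>&1 | extract_accuracy) followed by echo "$clf,$acc" to produce one CSV row per classifier.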
Common Pitfalls
Memory for large datasets: Weka loads the entire dataset into memory. For large files, increase the JVM heap, e.g. java -Xmx4g -cp $WEKAJAR ... Otherwise the run can fail with an OutOfMemoryError.
Class attribute position: Weka assumes the last attribute is the class by default. If your class column is not last, specify it with -c (e.g., -c 1 for the first column).
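ARFF format errors: Missing quotes around nominal values with spaces, incorrect @attribute declarations, or mismatched data types cause cryptic parsing errors. Validate your ARFF file before running experiments, for example by loading it with weka.core.Instances as in the dataset summary command above; a file that fails to parse will be reported there.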
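Classpath issues: Classes from packages installed via the Package Manager are not inside weka.jar, so they cannot be launched by class name with only weka.jar on the classpath. Either run them through weka.Run (java -cp $WEKAJAR weka.Run followed by the scheme name and its options), or add the package's own jar to the classpath yourself; installed packages live under ~/wekafiles/packages/ by default.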
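Reproducibility: The evaluation option -s sets the random seed used to shuffle data for cross-validation or a percentage split (e.g., -s 42). Randomized algorithms such as RandomForest and SimpleKMeans take their own seed through the scheme option -S. Set both explicitly to get reproducible results across runs.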
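Several of these pitfalls can be avoided by always spelling out the heap size, class index, and seed instead of relying on defaults. The sketch below uses a hypothetical helper, weka_cmd, that only prints the command so it can be reviewed or logged before running. The fallback jar path and the fixed 10 folds are assumptions.

```shell
# Hypothetical helper: print (not run) a Weka command with the heap size,
# class attribute index, and cross-validation seed stated explicitly.
weka_cmd() {
  local heap=$1 classifier=$2 data=$3 class_index=$4 seed=$5
  echo "java -Xmx${heap} -cp ${WEKAJAR:-weka.jar} ${classifier} -t ${data} -c ${class_index} -s ${seed} -x 10"
}

# Example: 4 GB heap, J48, class in column 5, seed 42.
weka_cmd 4g weka.classifiers.trees.J48 data/iris.arff 5 42
```

Logging the printed command next to the results makes it easier to rerun an experiment with exactly the same settings.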
Summary
Run any Weka classifier from the command line: java -cp weka.jar weka.classifiers.trees.J48 -t data.arff -x 10
Use -t for training file, -T for test file, -x for cross-validation folds, -d/-l to save/load models
Preprocess data with filters: weka.filters.unsupervised.attribute.Normalize, Remove, Discretize
Use FilteredClassifier to chain preprocessing and classification in one command
Increase memory with -Xmx4g for large datasets
Script multiple experiments in bash for systematic comparisons