Algorithm to determine how positive or negative a statement/text is

Introduction

Sentiment analysis is the task of determining how positive or negative a piece of text is. At a high level, you take text, convert it into a representation a program can score, and then map that score to labels such as positive, neutral, or negative.

Two common approaches

There are two practical starting points: rule-based sentiment scoring and supervised machine learning.

A rule-based system uses a sentiment lexicon, which is just a dictionary of words mapped to positive or negative weights. It is easy to build and explain, which makes it a good first algorithm.

A supervised model learns from labeled examples such as movie reviews or customer comments. This approach usually performs better once you have enough training data, but it requires a labeled dataset and a training pipeline.

A simple lexicon-based algorithm

Here is a minimal sentiment scorer in Python.

python
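import re

# Toy lexicon: each word maps to a signed weight
# (positive words > 0, negative words < 0).
LEXICON = {
    "good": 2,
    "great": 3,
    "excellent": 4,
    "love": 3,
    "bad": -2,
    "terrible": -4,
    "hate": -3,
    "slow": -1,
}


def sentiment_score(text: str) -> int:
    """Sum the lexicon weights of every token in the text."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    # Words missing from the lexicon contribute 0.
    return sum(LEXICON.get(token, 0) for token in tokens)


samples = [
    "I love this product, it is excellent",  # love + excellent = 7
    "The service was slow and terrible",     # slow + terrible = -5
    "The package arrived",                   # no lexicon words = 0
]

for sample in samples:
    score = sentiment_score(sample)
    print(sample, score)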

This algorithm tokenizes the text, looks up each token in the lexicon, and adds the scores together. Positive totals indicate positive sentiment, negative totals indicate negative sentiment, and zero suggests neutral or unknown sentiment.
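
The mapping from totals to labels described above can be sketched as a small helper. The function name `label_from_score` and the `threshold` parameter are illustrative assumptions, not part of the original scorer; the threshold adds an optional dead zone around zero so that very weak scores are treated as neutral.

```python
def label_from_score(score: int, threshold: int = 0) -> str:
    """Map a raw lexicon score to a coarse label.

    `threshold` is a hypothetical dead zone: scores within
    [-threshold, threshold] are reported as neutral.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# The three totals produced by the sample sentences above.
for score in (7, -5, 0):
    print(score, label_from_score(score))
```

With the default threshold of 0 this reproduces the rule in the text. A wider dead zone is a judgment call that should be tuned against real examples.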

Why this simple algorithm works and where it fails

The benefit of a lexicon method is transparency. If the final score is -5, you can inspect exactly which words contributed to that result.

The weakness is context. A lexicon scorer struggles with negation, sarcasm, domain-specific meaning, and multi-word phrases. For example, "not good" should be negative, but a naive lexicon still adds a positive score for "good".

A small improvement is to handle obvious negation.

python
import re

LEXICON = {"good": 2, "bad": -2, "great": 3, "awful": -3}
NEGATIONS = {"not", "never", "no"}


def sentiment_with_negation(text: str) -> int:
    """Lexicon score where a negation word flips the next token's weight."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    total = 0
    flip = False  # True when the previous token was a negation word

    for token in tokens:
        if token in NEGATIONS:
            flip = True
            continue

        value = LEXICON.get(token, 0)
        total += -value if flip else value
        # The flip applies only to the token right after the negation.
        flip = False

    return total


print(sentiment_with_negation("not good"))  # -2
print(sentiment_with_negation("not bad"))   # 2

This is still far from perfect, but it shows how sentiment algorithms grow from simple rules into richer language models.
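
One limitation of the negation snippet is that the flip reaches only the token immediately after the negation word, so "not very good" still scores as positive. A possible next step, sketched here under the assumption that negation should carry over a few tokens, is a small negation window. The name `sentiment_with_window` and the window size of 3 are illustrative choices, not established values.

```python
import re

LEXICON = {"good": 2, "bad": -2, "great": 3, "awful": -3}
NEGATIONS = {"not", "never", "no"}
NEGATION_WINDOW = 3  # assumed value; tune on real data


def sentiment_with_window(text: str) -> int:
    """Lexicon score where a negation flips the next few tokens."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    total = 0
    remaining = 0  # how many upcoming tokens are still negated

    for token in tokens:
        if token in NEGATIONS:
            remaining = NEGATION_WINDOW
            continue

        value = LEXICON.get(token, 0)
        if remaining > 0:
            value = -value
            remaining -= 1
        total += value

    return total


print(sentiment_with_window("not very good"))   # -2
print(sentiment_with_window("not bad at all"))  # 2
```

The window is a blunt tool: it also reaches across clause boundaries, so "not bad but great" has "great" flipped as well. Rules like this tend to trade one class of error for another.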

A machine-learning version

If you have labeled examples, a common baseline is TF-IDF plus logistic regression. The model learns which words or phrases tend to correlate with positive or negative labels.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy training set; a real model needs far more labeled examples.
texts = [
    "I love this phone",
    "This movie was great",
    "I hate this product",
    "The app is terrible",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Turn each text into a vector of TF-IDF weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a linear classifier on those vectors.
model = LogisticRegression()
model.fit(X, labels)

# New text must go through the same fitted vectorizer (transform, not fit).
prediction = model.predict(vectorizer.transform(["great product"]))
print(prediction[0])

This approach captures patterns better than a hand-written lexicon, but only if the training data is representative of the text you care about.
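
To make the TF-IDF half of that baseline more concrete, here is a pure-Python sketch of the textbook weighting: term frequency multiplied by the log of inverse document frequency. This is a simplified formula chosen for illustration. scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each vector, so its numbers will differ. The function name `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

docs = [
    "i love this phone",
    "this movie was great",
    "i hate this product",
    "the app is terrible",
]


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tfidf(docs)
# "love" appears in one document, "this" in three, so "love" gets more weight.
print(weights[0]["love"], weights[0]["this"])
```

Words that appear in most documents carry little information about sentiment, which is why the classifier benefits from having them down-weighted before training.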

Choosing the right output

Not every sentiment system is limited to positive or negative labels. Many systems return:

  • a discrete label such as positive, neutral, or negative
  • a confidence score
  • a continuous sentiment value such as -1.0 to 1.0

The best choice depends on the downstream use case. Dashboards often want a scalar score. Moderation pipelines may prefer thresholds and confidence.
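
As a rough illustration of how one raw score could feed all three output styles, the sketch below squashes a lexicon total into a bounded value and derives a label and a confidence from it. Everything here is an assumption made for the example: the `SentimentResult` type, the `to_result` helper, the `scale` and `neutral_band` defaults, and especially the use of the absolute value as "confidence", which is a heuristic rather than a calibrated probability.

```python
import math
from dataclasses import dataclass


@dataclass
class SentimentResult:
    label: str         # "positive", "neutral", or "negative"
    value: float       # continuous sentiment between -1.0 and 1.0
    confidence: float  # heuristic: distance of value from 0


def to_result(raw_score: float, scale: float = 4.0,
              neutral_band: float = 0.2) -> SentimentResult:
    """Convert an unbounded raw score into label, value, and confidence."""
    # tanh squashes any real number into the range -1.0 to 1.0.
    value = math.tanh(raw_score / scale)
    if value > neutral_band:
        label = "positive"
    elif value < -neutral_band:
        label = "negative"
    else:
        label = "neutral"
    return SentimentResult(label=label, value=value, confidence=abs(value))


for raw in (7, -5, 0):
    print(raw, to_result(raw))
```

A dashboard could plot `value` directly, while a moderation pipeline might act only when `confidence` exceeds a threshold. For a trained classifier, the predicted class probability is usually a better confidence signal than this heuristic.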

Common Pitfalls

A common mistake is assuming sentiment is just word counting. Real language includes negation, sarcasm, context, and domain-specific terms that simple scoring misses.

Another issue is using a model trained on one domain in another domain. A model trained on movie reviews may behave badly on financial news or support tickets.
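
One cheap, rough signal of domain mismatch is how much of the target domain's vocabulary the training data ever saw. The sketch below computes that overlap on made-up toy sentences. The helper names `vocabulary` and `coverage` are hypothetical, and low coverage is only a warning sign, not a measurement of model accuracy.

```python
import re


def vocabulary(texts):
    """Set of distinct lowercase word tokens across all texts."""
    return {tok for text in texts
            for tok in re.findall(r"[a-z']+", text.lower())}


def coverage(train_texts, target_texts):
    """Fraction of distinct target-domain words also seen in training."""
    train_vocab = vocabulary(train_texts)
    target_vocab = vocabulary(target_texts)
    if not target_vocab:
        return 0.0
    return len(target_vocab & train_vocab) / len(target_vocab)


# Toy data, invented for illustration only.
movie_reviews = ["The movie was great", "I hate this plot"]
support_tickets = ["The app crashes on login", "I hate this update"]

print(coverage(movie_reviews, support_tickets))
```

Here fewer than half of the support-ticket words appear in the movie-review text, and the shared ones are mostly function words. Evaluating on a labeled sample from the target domain remains the more reliable check.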

It is also easy to ignore class balance. If most training data is positive, a model may look accurate while performing poorly on negative examples.
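
The class-balance problem is easy to see with a deliberately extreme toy case: a "model" that always predicts the majority class. The data and the helper functions `accuracy` and `recall` below are illustrative assumptions written in plain Python.

```python
# 1 = positive, 0 = negative; a deliberately imbalanced toy set.
y_true = [1] * 90 + [0] * 10
# A degenerate "model" that always predicts the majority class.
y_pred = [1] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Of the examples whose true label is `cls`, the fraction predicted as `cls`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


print(accuracy(y_true, y_pred))   # 0.9
print(recall(y_true, y_pred, 1))  # 1.0
print(recall(y_true, y_pred, 0))  # 0.0
```

Accuracy looks respectable at 0.9, yet the model never identifies a single negative example. Reporting per-class recall (or precision and F1) alongside accuracy exposes this.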

Summary

  • Sentiment analysis estimates how positive or negative a text is.
  • Lexicon-based systems are simple and interpretable, making them a strong first baseline.
  • Supervised models such as TF-IDF plus logistic regression usually perform better when labeled data is available.
  • Context, negation, and domain mismatch are major sources of error.
  • Start simple, measure performance, and add complexity only when the use case demands it.
