Algorithm to determine how positive or negative a statement/text is
Introduction
Sentiment analysis is the task of determining how positive or negative a piece of text is. At a high level, you take text, convert it into a representation a program can score, and then map that score to a label such as positive, neutral, or negative.
Two common approaches
There are two practical starting points: rule-based sentiment scoring and supervised machine learning.
A rule-based system uses a sentiment lexicon, which is just a dictionary of words mapped to positive or negative weights. It is easy to build and explain, which makes it a good first algorithm.
A supervised model learns from labeled examples such as movie reviews or customer comments. This approach usually performs better once you have enough training data, but it requires a labeled dataset and a training pipeline.
A simple lexicon-based algorithm
Here is a minimal sentiment scorer in Python.
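The sketch below assumes a tiny hand-made lexicon. The words, their weights, and the function names (`tokenize`, `score_sentiment`, `label_sentiment`) are illustrative choices, not a standard resource or API.

```python
import re

# Hypothetical toy lexicon: the words and weights are illustrative assumptions.
LEXICON = {
    "good": 2,
    "great": 3,
    "love": 3,
    "happy": 2,
    "bad": -2,
    "terrible": -3,
    "hate": -3,
    "sad": -2,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentiment(text, lexicon=LEXICON):
    """Sum the lexicon weights of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def label_sentiment(score):
    """Map a raw score to a coarse label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this lexicon, `score_sentiment("I love this great phone")` returns 6, which `label_sentiment` maps to positive.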
This algorithm tokenizes the text, looks up each token in the lexicon, and adds the scores together. Positive totals indicate positive sentiment, negative totals indicate negative sentiment, and zero suggests neutral or unknown sentiment.
Why this simple algorithm works and where it fails
The benefit of a lexicon method is transparency. If the final score is -5, you can inspect exactly which words contributed to that result.
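That kind of inspection can be sketched by returning the matched words alongside the total. The lexicon and the `explain_score` helper here are hypothetical, chosen only to illustrate the idea.

```python
import re

# Assumed weights, for illustration only.
LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3, "slow": -2}


def explain_score(text, lexicon=LEXICON):
    """Return the total score and the (word, weight) pairs that produced it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    contributions = [(token, lexicon[token]) for token in tokens if token in lexicon]
    total = sum(weight for _, weight in contributions)
    return total, contributions
```

For the input "Terrible battery and slow screen", this returns a total of -5 together with the pairs `("terrible", -3)` and `("slow", -2)`, so the result can be traced back to individual words.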
The weakness is context. A lexicon method struggles with negation, sarcasm, domain-specific meaning, and multi-word phrases. For example, "not good" should be negative, but a naive lexicon still adds a positive score for "good".
A small improvement is to handle obvious negation.
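One way to do this, shown as a rough sketch, is to flip the sign of the token that immediately follows a negation word. The negator list, the lexicon weights, and the one-token window are all assumptions made for the example.

```python
import re

# Assumed toy lexicon and negator list, for illustration only.
LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3}
NEGATORS = {"not", "no", "never", "isn't", "wasn't", "don't", "doesn't", "can't"}


def score_with_negation(text, lexicon=LEXICON, negators=NEGATORS):
    """Sum lexicon weights, flipping the sign of the token right after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    negate_next = False
    for token in tokens:
        if token in negators:
            negate_next = True
            continue
        weight = lexicon.get(token, 0)
        total += -weight if negate_next else weight
        # The flag covers only the single token after the negator.
        negate_next = False
    return total
```

Under these assumptions "not good" scores -2 and "not bad" scores 2. The window is deliberately narrow, so "not very good" still scores 2, because "very" uses up the negation before "good" is reached.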
This is still far from perfect, but it shows how sentiment algorithms grow from simple rules into richer language models.
A machine-learning version
If you have labeled examples, a common baseline is TF-IDF plus logistic regression. The model learns which words or phrases tend to correlate with positive or negative labels.
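A minimal sketch of that baseline, assuming scikit-learn is installed, might look like the following. The eight training sentences are made up for illustration and are far too few for a real model; `predict_sentiment` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only.
train_texts = [
    "I love this product, it works great",
    "Absolutely fantastic service and friendly staff",
    "Great value, very happy with the purchase",
    "Wonderful experience, would recommend",
    "Terrible quality, it broke after one day",
    "I hate the new update, awful design",
    "Very disappointed, the support was useless",
    "Awful experience, would not recommend",
]
train_labels = ["positive"] * 4 + ["negative"] * 4

# Unigrams and bigrams let the model weigh short phrases as well as single words.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
classifier = LogisticRegression()
model = make_pipeline(vectorizer, classifier)
model.fit(train_texts, train_labels)


def predict_sentiment(text):
    """Return the predicted label and the model's probability for that label."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return str(classifier.classes_[best]), float(probabilities[best])
```

Calling `predict_sentiment("great product, love it")` returns a label and a probability. With so little training data the numbers are not meaningful, but the pipeline has the same shape as a realistic one.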
This approach captures patterns better than a hand-written lexicon, but only if the training data is representative of the text you care about.
Choosing the right output
Not every sentiment system needs only positive or negative labels. Many systems return:
- a discrete label such as positive, neutral, or negative
- a confidence score
- a continuous sentiment value, for example in the range -1.0 to 1.0
The best choice depends on the downstream use case. Dashboards often want a scalar score. Moderation pipelines may prefer thresholds and confidence.
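As a sketch of how one raw score could feed all three output styles, the example below squashes an unbounded score into the range -1.0 to 1.0 with `tanh` and then applies a threshold. The `scale` and `threshold` values are arbitrary assumptions, and using the absolute value as "confidence" is a crude stand-in for a calibrated probability.

```python
import math


def normalize_score(raw_score, scale=3.0):
    """Squash an unbounded raw score into the range -1.0 to 1.0 using tanh."""
    return math.tanh(raw_score / scale)


def to_label(value, threshold=0.2):
    """Turn a normalized value into a discrete label with a neutral band."""
    if value >= threshold:
        return "positive"
    if value <= -threshold:
        return "negative"
    return "neutral"


def describe(raw_score):
    """Bundle the label, the continuous value, and a rough confidence proxy."""
    value = normalize_score(raw_score)
    return {
        "label": to_label(value),
        "score": value,
        "confidence": abs(value),  # assumption: distance from 0 as a proxy
    }
```

A dashboard could plot the `score` field directly, while a moderation pipeline could act only when `confidence` exceeds a threshold chosen for that use case.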
Common Pitfalls
A common mistake is assuming sentiment is just word counting. Real language includes negation, sarcasm, context, and domain-specific terms that simple scoring misses.
Another issue is using a model trained on one domain in another domain. A model trained on movie reviews may behave badly on financial news or support tickets.
It is also easy to ignore class balance. If most training data is positive, a model may look accurate while performing poorly on negative examples.
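A small worked example makes the class-balance problem concrete. The data is invented: 90 positive and 10 negative examples, scored against a degenerate model that always predicts positive.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of examples with the given true label that were predicted correctly."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions_for_label:
        return 0.0
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


# Invented, imbalanced evaluation set and an always-positive "model".
y_true = ["positive"] * 90 + ["negative"] * 10
y_pred = ["positive"] * 100

print(accuracy(y_true, y_pred))            # 0.9
print(recall(y_true, y_pred, "negative"))  # 0.0
```

Accuracy alone looks respectable here, while recall on the negative class shows that the model never detects a negative example. Reporting per-class metrics is one way to catch this.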
Summary
- Sentiment analysis estimates how positive or negative a text is.
- Lexicon-based systems are simple and interpretable, making them a strong first baseline.
- Supervised models such as TF-IDF plus logistic regression usually perform better when labeled data is available.
- Context, negation, and domain mismatch are major sources of error.
- Start simple, measure performance, and add complexity only when the use case demands it.

