SPORTS OR POLITICS?

Binary Text Classification using scikit-learn
Zenith | Roll: M25CSA032 | NLP Assignment 1

Live Classifier Demo

The demo runs a client-side keyword heuristic, derived from a TF-IDF analysis of the training data, to approximate the full model's predictions. It does not run the trained scikit-learn model itself.
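A keyword heuristic of this kind can be sketched in a few lines. The version below is only an illustration in Python: the term lists, their weights, and the `classify` function are hypothetical, since the page does not show the actual keywords or the client-side code.

```python
import re

# Hypothetical keyword weights for illustration only; the demo's real
# terms (taken from its TF-IDF analysis) are not shown in the source.
SPORTS_TERMS = {"game": 1.0, "team": 1.0, "season": 0.8, "player": 0.8, "score": 0.6}
POLITICS_TERMS = {"government": 1.0, "president": 1.0, "law": 0.8, "rights": 0.8, "vote": 0.6}


def classify(text):
    """Return 'Sports', 'Politics', or 'Unknown' from summed keyword weights."""
    tokens = re.findall(r"[a-z]+", text.lower())
    sports = sum(SPORTS_TERMS.get(t, 0.0) for t in tokens)
    politics = sum(POLITICS_TERMS.get(t, 0.0) for t in tokens)
    if sports == politics:
        # No keyword matched, or both classes scored the same
        return "Unknown"
    return "Sports" if sports > politics else "Politics"
```

A heuristic like this is cheap enough to run in the browser, but it only looks at a fixed vocabulary, so its output can differ from what the trained classifier would predict.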


Project Overview

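This system classifies text documents into two categories: Sports and Politics. It was built as part of the CSL 7640 coursework to compare three feature representations (bag-of-words, TF-IDF, and TF-IDF with bigrams) across three machine learning algorithms (Multinomial Naive Bayes, LinearSVC, and Random Forest).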


The Data

The dataset is a subset of the 20 Newsgroups corpus.
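One way to build such a subset with scikit-learn is sketched below. The choice of newsgroups is an assumption (the page does not list which groups were used), as are the helper names `to_binary_label` and `load_subset` and the decision to strip headers, footers, and quotes.

```python
# Assumed category choice: the source does not say which newsgroups were used.
SPORTS_GROUPS = ("rec.sport.baseball", "rec.sport.hockey")
POLITICS_GROUPS = ("talk.politics.guns", "talk.politics.mideast", "talk.politics.misc")


def to_binary_label(group_name):
    """Map a newsgroup name to 0 (Sports) or 1 (Politics)."""
    if group_name in SPORTS_GROUPS:
        return 0
    if group_name in POLITICS_GROUPS:
        return 1
    raise ValueError(f"unexpected newsgroup: {group_name}")


def load_subset(subset="train"):
    """Download the chosen newsgroups and return (texts, binary_labels)."""
    # Imported here so the label mapping above works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,
        categories=list(SPORTS_GROUPS + POLITICS_GROUPS),
        remove=("headers", "footers", "quotes"),  # assumption: strip metadata
    )
    labels = [to_binary_label(bunch.target_names[i]) for i in bunch.target]
    return bunch.data, labels
```

Calling `load_subset("train")` and `load_subset("test")` would give the corpus's predefined split; the first call downloads the data.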

Performance

Best Accuracy: 95.2%
Best F1-Score: 0.95
ML Techniques: 3
Best Feature: TF-IDF

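Key Finding: Multinomial Naive Bayes with TF-IDF features gave the best result (95.2% accuracy). Naive Bayes scored slightly higher than LinearSVC under all three feature representations, and both linear models outperformed Random Forest by roughly 3 to 5 percentage points.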

Experimental Results

Configuration                      Accuracy   F1-Score
TF-IDF + MultinomialNB             0.9520     0.9515
BoW + MultinomialNB                0.9480     0.9475
TF-IDF (Bigram) + MultinomialNB    0.9460     0.9455
TF-IDF + LinearSVC                 0.9450     0.9442
BoW + LinearSVC                    0.9390     0.9385
TF-IDF (Bigram) + LinearSVC        0.9410     0.9405
TF-IDF + RandomForest              0.9150     0.9120
BoW + RandomForest                 0.9020     0.8980
TF-IDF (Bigram) + RandomForest     0.9100     0.9080
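
A grid like the one above can be produced by pairing each vectorizer with each classifier in a scikit-learn pipeline. The following is a minimal sketch rather than the project's actual script: default hyperparameters, `ngram_range=(1, 2)` for the bigram setting, macro-averaged F1, and the function name `run_grid` are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Factories so that every configuration gets fresh, unfitted estimators.
FEATURES = {
    "BoW": lambda: CountVectorizer(),
    "TF-IDF": lambda: TfidfVectorizer(),
    # Assumption: "Bigram" means unigrams plus bigrams
    "TF-IDF (Bigram)": lambda: TfidfVectorizer(ngram_range=(1, 2)),
}
MODELS = {
    "MultinomialNB": lambda: MultinomialNB(),
    "LinearSVC": lambda: LinearSVC(),
    "RandomForest": lambda: RandomForestClassifier(random_state=0),
}


def run_grid(train_texts, train_labels, test_texts, test_labels):
    """Fit every feature/model pair; return (name, accuracy, f1) rows, best first."""
    rows = []
    for feat_name, make_feat in FEATURES.items():
        for model_name, make_model in MODELS.items():
            pipe = make_pipeline(make_feat(), make_model())
            pipe.fit(train_texts, train_labels)
            pred = pipe.predict(test_texts)
            acc = accuracy_score(test_labels, pred)
            # Assumption: macro averaging; the source does not say which was used
            f1 = f1_score(test_labels, pred, average="macro")
            rows.append((f"{feat_name} + {model_name}", acc, f1))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Passing the training and test splits of the corpus to `run_grid` returns nine rows sorted by accuracy. The exact numbers depend on preprocessing, the split, and hyperparameters, so they will not necessarily match the table.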