Live Classifier Demo
This interface uses a client-side keyword heuristic derived from the training data (TF-IDF analysis) to simulate the full model's predictions.
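A keyword heuristic of this kind can be sketched as a weighted vote over tokens. The block below is a minimal illustration only: the keyword lists, the weights, and the `classify` function are hypothetical, since the page does not publish the terms the demo actually uses, and the real demo runs client-side rather than in Python.

```python
import re

# Hypothetical keyword weights. In the real demo these would be taken from
# the highest-scoring TF-IDF terms per class; the actual list is not published.
SPORTS_KEYWORDS = {"game": 1.0, "team": 1.0, "season": 1.0, "pitcher": 1.5,
                   "hockey": 2.0, "baseball": 2.0}
POLITICS_KEYWORDS = {"law": 1.0, "rights": 1.0, "government": 1.5,
                     "senate": 1.5, "gun": 2.0, "israel": 2.0}


def classify(text):
    """Return (label, confidence) from summed keyword weights."""
    tokens = re.findall(r"[a-z]+", text.lower())
    sports = sum(SPORTS_KEYWORDS.get(tok, 0.0) for tok in tokens)
    politics = sum(POLITICS_KEYWORDS.get(tok, 0.0) for tok in tokens)
    if sports == politics:
        # No keyword evidence, or a perfect tie between the two classes.
        return "Unknown", 0.0
    label = "Sports" if sports > politics else "Politics"
    confidence = max(sports, politics) / (sports + politics)
    return label, confidence
```

Text that contains none of the listed keywords falls through to `"Unknown"`, which is one reason a heuristic like this can only approximate the trained model.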
Project Overview
This system classifies text documents into two categories: Sports and Politics. It was built as part of the CSL 7640 coursework to compare different feature representations and machine learning algorithms.
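The comparison of feature representations and algorithms can be sketched as a small grid of scikit-learn pipelines. This is an illustrative sketch, not the project's code: the configuration names follow the results table, while the hyperparameters, the reading of "Bigram" as `ngram_range=(1, 2)`, the `build_configurations` helper, and the four-document toy corpus are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_configurations():
    """Build one pipeline per (feature representation, classifier) pair."""
    # Factories so that every pipeline gets its own fresh estimator objects.
    features = {
        "BoW": lambda: CountVectorizer(),
        "TF-IDF": lambda: TfidfVectorizer(),
        # Assumption: "Bigram" means unigrams plus bigrams.
        "TF-IDF (Bigram)": lambda: TfidfVectorizer(ngram_range=(1, 2)),
    }
    classifiers = {
        "MultinomialNB": lambda: MultinomialNB(),
        "LinearSVC": lambda: LinearSVC(),
        # Assumed settings; the page does not report hyperparameters.
        "RandomForest": lambda: RandomForestClassifier(n_estimators=100,
                                                       random_state=0),
    }
    return {
        f"{feat_name} + {clf_name}": make_pipeline(make_feat(), make_clf())
        for feat_name, make_feat in features.items()
        for clf_name, make_clf in classifiers.items()
    }


# Toy corpus for illustration only; the project trains on 20 Newsgroups posts.
train_texts = [
    "the goalie stopped every shot in the hockey game",
    "the pitcher threw a fastball in the baseball game",
    "the senate debated the new gun control bill",
    "the government announced a new foreign policy",
]
train_labels = ["Sports", "Sports", "Politics", "Politics"]

configurations = build_configurations()
model = configurations["TF-IDF + MultinomialNB"]
model.fit(train_texts, train_labels)
prediction = model.predict(["hockey game tonight"])[0]
```

Each of the nine pipelines exposes the same `fit` and `predict` interface, so evaluating them all on a shared train/test split reduces to one loop over `configurations`.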
The Data
The dataset is a subset of the 20 Newsgroups corpus.
- Sports Class: rec.sport.baseball, rec.sport.hockey
- Politics Class: talk.politics.guns, talk.politics.mideast, talk.politics.misc
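Collapsing the five newsgroups into the two target classes can be sketched as a simple lookup. The `to_binary_label` helper is a hypothetical name introduced here for illustration; the page does not describe how the labels were actually built.

```python
# Newsgroup names as listed above, grouped by target class.
SPORTS_GROUPS = {"rec.sport.baseball", "rec.sport.hockey"}
POLITICS_GROUPS = {"talk.politics.guns", "talk.politics.mideast",
                   "talk.politics.misc"}


def to_binary_label(newsgroup):
    """Map a 20 Newsgroups category name to 'Sports' or 'Politics'."""
    if newsgroup in SPORTS_GROUPS:
        return "Sports"
    if newsgroup in POLITICS_GROUPS:
        return "Politics"
    # Any other newsgroup is outside the subset used by this project.
    raise ValueError(f"newsgroup not in the subset: {newsgroup!r}")
```

One way to obtain only these groups, assuming scikit-learn is used for loading, is to pass the five names through the `categories` argument of `sklearn.datasets.fetch_20newsgroups`.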
Performance
- Best Accuracy: 95.2%
- Best F1-Score: 0.95
- ML Techniques: 3
- Best Feature: TF-IDF
Key Finding: A simple Multinomial Naive Bayes model with TF-IDF features achieved the best accuracy (0.9520). Naive Bayes scored slightly higher than the more complex LinearSVC under all three feature representations.
Experimental Results
| Configuration | Accuracy | F1-Score |
|---|---|---|
| TF-IDF + MultinomialNB | 0.9520 | 0.9515 |
| BoW + MultinomialNB | 0.9480 | 0.9475 |
| TF-IDF (Bigram) + MultinomialNB | 0.9460 | 0.9455 |
| TF-IDF + LinearSVC | 0.9450 | 0.9442 |
| BoW + LinearSVC | 0.9390 | 0.9385 |
| TF-IDF (Bigram) + LinearSVC | 0.9410 | 0.9405 |
| TF-IDF + RandomForest | 0.9150 | 0.9120 |
| BoW + RandomForest | 0.9020 | 0.8980 |
| TF-IDF (Bigram) + RandomForest | 0.9100 | 0.9080 |
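To read the table by classifier or by feature representation, the rows can be averaged along either axis. The sketch below copies the accuracy and F1 values from the table; the `RESULTS` list and the `mean_accuracy_by` helper are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# (feature representation, classifier, accuracy, F1-score), from the table.
RESULTS = [
    ("TF-IDF", "MultinomialNB", 0.9520, 0.9515),
    ("BoW", "MultinomialNB", 0.9480, 0.9475),
    ("TF-IDF (Bigram)", "MultinomialNB", 0.9460, 0.9455),
    ("TF-IDF", "LinearSVC", 0.9450, 0.9442),
    ("BoW", "LinearSVC", 0.9390, 0.9385),
    ("TF-IDF (Bigram)", "LinearSVC", 0.9410, 0.9405),
    ("TF-IDF", "RandomForest", 0.9150, 0.9120),
    ("BoW", "RandomForest", 0.9020, 0.8980),
    ("TF-IDF (Bigram)", "RandomForest", 0.9100, 0.9080),
]


def mean_accuracy_by(column):
    """Average accuracy grouped by column 0 (feature) or 1 (classifier)."""
    groups = defaultdict(list)
    for row in RESULTS:
        groups[row[column]].append(row[2])
    return {name: sum(values) / len(values) for name, values in groups.items()}


best = max(RESULTS, key=lambda row: row[2])   # highest-accuracy configuration
by_feature = mean_accuracy_by(0)
by_classifier = mean_accuracy_by(1)
```

Averaged over the three feature representations, MultinomialNB comes out at roughly 0.949, LinearSVC at roughly 0.942, and RandomForest at 0.909. Averaged over the three classifiers, plain TF-IDF has the highest mean accuracy of the three representations.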
Visualisation Gallery
- Accuracy Ranking
- Best Model Profile
- Confusion Matrix
- Class Balance