This repository contains an end-to-end NLP classification project developed as part of a MasterSchool Data Science program.
The objective is to build a feature-driven pipeline to classify tweets as disaster-related or not, combining interpretable linguistic feature engineering with TF-IDF vectorization and linear classification models.
Twitter is a major channel for real-time reporting of emergency events. The ability to automatically identify genuine disaster signals from informal, figurative, or ambiguous language has direct applications in early warning systems and crisis monitoring.
The classification task is non-trivial: disaster-related vocabulary frequently appears in non-disaster contexts, creating semantic ambiguity that surface-level features cannot fully resolve.
This project:
- performs exploratory data analysis to identify discriminative linguistic patterns,
- engineers interpretable features from both raw and cleaned text,
- builds a preprocessing pipeline combining word-level TF-IDF, character-level TF-IDF, and scaled numeric features,
- trains and tunes two linear classifiers with cross-validation,
- and evaluates model behavior with attention to the asymmetric cost of false negatives.
The core challenge is semantic ambiguity.
The same words ("fire", "flood", "crash", "bomb") appear in both real emergencies and everyday conversation. A tweet saying "she's a suicide bomb" uses the same vocabulary as a tweet reporting a real attack. No word-frequency-based model can fully resolve this overlap.
This motivates a feature engineering strategy that goes beyond raw word counts, incorporating grammatical signals, composite linguistic scores, and semantic distinctiveness measures.
The project follows a structured analytical pipeline:
**Exploratory Data Analysis:** surface-level linguistic signals, lexical features, POS distributions, and n-gram analysis, identifying discriminative patterns before any modeling decision.

**Text Preprocessing:** noise removal (URLs, mentions, hashtags), character normalization, and lemmatization, designed to preserve signal while eliminating platform-specific artifacts.
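The cleaning step can be sketched with a few regular expressions. The function name and exact rules below are illustrative, and the lemmatization pass (e.g. nltk's `WordNetLemmatizer`) is omitted to keep the sketch dependency-free:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip platform noise before lemmatization and vectorization.

    Illustrative rules only; the actual notebook may order or scope
    these steps differently.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # mentions
    text = text.replace("#", " ")                       # hashtag marker (keep the word itself)
    text = re.sub(r"[^a-z\s]", " ", text)               # normalize to plain letters
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Forest FIRE near #California!! http://t.co/abc @user"))
# forest fire near california
```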
**Feature Engineering:**
- Surface signals extracted from raw text: `num_numbers`, `has_link`, `num_exclam`, `num_question`, `has_ellipsis`
- POS count features from cleaned text: past/present verbs, common nouns, pronouns
- Weighted composite scores: `event_score` (factual, event-oriented language) and `social_score` (conversational, informal language)
- Semantic indices: `ambiguity_ratio` and `distinctive_signal`, built from lemmatized nouns, verbs, and pronouns on the training set only
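The raw-text surface signals reduce to simple string operations. A hedged sketch (the exact regexes are assumptions; the POS counts require a tagger such as `nltk.pos_tag` on the cleaned text and are omitted here):

```python
import re

def surface_features(raw: str) -> dict:
    """Surface signals computed on the *raw* tweet, before any cleaning.
    Feature names match the list above; the detection rules are illustrative."""
    return {
        "num_numbers": len(re.findall(r"\d+", raw)),
        "has_link": int(bool(re.search(r"https?://|www\.", raw))),
        "num_exclam": raw.count("!"),
        "num_question": raw.count("?"),
        "has_ellipsis": int("..." in raw or "\u2026" in raw),
    }
```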
**Vectorization:**
- Word-level TF-IDF with `ngram_range=(1, 2)` captures compositional lexical patterns
- Character-level TF-IDF with `ngram_range=(3, 4)` captures subword patterns and spelling variation
- `StandardScaler` applied to numeric features
All preprocessing steps are combined in a single ColumnTransformer and wrapped in a sklearn Pipeline together with the classifier. This ensures that vectorization and scaling are fitted exclusively on the training set, with no leakage into the test set.
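A minimal sketch of that composition, assuming illustrative column names (`text`, `num_exclam`, `has_link`), a toy data frame, and a `char_wb` analyzer for the character n-grams (the notebook's exact settings may differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame; the real project uses the cleaned Kaggle data
# plus the engineered numeric columns described above.
df = pd.DataFrame({
    "text": ["forest fire evacuation ordered", "huge flood hits the city",
             "this party is fire lol", "my heart is on fire for you",
             "earthquake damage reported downtown", "crash on highway two injured",
             "that movie was a total bomb", "bombing at my exam today haha"],
    "num_exclam": [0, 1, 2, 0, 0, 1, 0, 2],
    "has_link":   [1, 1, 0, 0, 1, 1, 0, 0],
})
y = [1, 1, 0, 0, 1, 1, 0, 0]

preprocess = ColumnTransformer([
    ("word_tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)), "text"),
    ("numeric", StandardScaler(), ["num_exclam", "has_link"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(df, y)  # vectorizers and scaler are fitted on training data only
```

Because the transformers live inside the pipeline, `fit` touches only the training split; `predict` on held-out data reuses the fitted vocabulary and scaling statistics.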
**Modeling:**
- Logistic Regression and Linear SVM
- Both tuned with GridSearchCV (5-fold CV, optimizing F1 on the disaster class)
- Threshold analysis on Logistic Regression probability estimates
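The tuning step can be sketched as follows, on a toy corpus and with word-level TF-IDF only for brevity; `scoring="f1"` scores the positive (disaster) class, and the Linear SVM would be tuned analogously with `LinearSVC`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (illustrative); the real search runs on the full feature pipeline.
texts = [
    "forest fire spreads fast", "flood warning issued now",
    "earthquake hits the coast", "building collapse rescue underway",
    "wildfire evacuation ordered", "storm damage reported downtown",
    "this song is fire", "i could eat a horse",
    "my exam was a disaster lol", "crashing at your place tonight",
    "you are the bomb", "dying of laughter right now",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1",  # F1 on the positive (disaster) class
    cv=5,
)
grid.fit(texts, labels)
```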
Feature engineering is grounded in the exploratory analysis, not applied generically.
Event Score aggregates signals associated with factual, event-oriented language: number presence, link presence, past-tense verbs (VBD, VBN), and common nouns (NN). Each signal is weighted by its observed discriminative ratio (disaster/non-disaster mean) on the training set.
Social Score aggregates signals associated with conversational language: exclamation marks, question marks, present-tense verbs (VBP, VB), and pronouns (PRP, PRP$). Weights follow the same empirical approach.
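The weighting scheme can be sketched like this; the toy frame, the epsilon guard, and the plain weighted sum are assumptions about details the text leaves open:

```python
import pandas as pd

# Toy training frame (illustrative values).
train = pd.DataFrame({
    "num_numbers": [2, 0, 1, 1, 3, 0],
    "has_link":    [1, 0, 1, 1, 1, 0],
    "target":      [1, 0, 1, 0, 1, 0],
})
event_features = ["num_numbers", "has_link"]

# Weight = observed discriminative ratio on the training set:
# mean within the disaster class / mean within the non-disaster class.
disaster = train[train["target"] == 1]
other = train[train["target"] == 0]
eps = 1e-9  # guard against division by zero (assumption)
weights = {f: disaster[f].mean() / (other[f].mean() + eps) for f in event_features}

# Composite score = weighted sum of the signals.
train["event_score"] = sum(train[f] * w for f, w in weights.items())
```

`social_score` follows the same recipe with its own feature list (exclamation marks, question marks, present-tense verbs, pronouns).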
Ambiguity Ratio measures the proportion of lemmas in a tweet that appear in both classes. A high value indicates vocabulary shared across disaster and non-disaster tweets, making classification harder.
Distinctive Signal counts lemmas that appear in only one class. A high value indicates vocabulary strongly associated with a specific class.
Both indices are constructed exclusively from the training set to prevent data leakage.
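Both indices reduce to set operations over per-class lemma vocabularies. A sketch with hypothetical helper names, operating on pre-lemmatized token lists:

```python
def build_lemma_sets(train_lemmas, train_labels):
    """Per-class lemma vocabularies from the *training* split only."""
    disaster, other = set(), set()
    for lemmas, label in zip(train_lemmas, train_labels):
        (disaster if label == 1 else other).update(lemmas)
    shared = disaster & other     # lemmas seen in both classes
    exclusive = disaster ^ other  # lemmas seen in exactly one class
    return shared, exclusive

def ambiguity_ratio(lemmas, shared):
    """Fraction of a tweet's lemmas that both classes use."""
    return sum(l in shared for l in lemmas) / len(lemmas) if lemmas else 0.0

def distinctive_signal(lemmas, exclusive):
    """Count of lemmas tied to exactly one class in training."""
    return sum(l in exclusive for l in lemmas)
```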
| Model | Class | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | Non-Disaster (0) | 0.82 | 0.87 | 0.84 | 0.82 |
| Logistic Regression | Disaster (1) | 0.81 | 0.74 | 0.78 | 0.82 |
| Linear SVM | Non-Disaster (0) | 0.81 | 0.87 | 0.84 | 0.81 |
| Linear SVM | Disaster (1) | 0.81 | 0.73 | 0.77 | 0.81 |
Logistic Regression is selected as the final model. Beyond marginal performance advantages, it produces probability estimates enabling threshold adjustment, useful when minimizing false negatives is a priority.
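The threshold trade-off can be illustrated on toy probabilities (the values below are made up, standing in for `predict_proba(X)[:, 1]` output):

```python
import numpy as np

# Toy P(disaster) scores and matching gold labels (illustrative values).
probs  = np.array([0.92, 0.61, 0.48, 0.35, 0.30, 0.10])
y_true = np.array([1,    1,    1,    1,    0,    0])

def recall_at(threshold: float) -> float:
    """Recall on the disaster class when predicting 1 above `threshold`."""
    preds = (probs >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    return tp / (tp + fn)

# Lowering the threshold trades precision for recall, i.e. fewer missed disasters.
print(recall_at(0.5), recall_at(0.3))  # 0.5 1.0
```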
The pipeline works with word frequencies and grammatical patterns; it does not understand meaning or context. Three categories of errors were identified:
- Figurative language — disaster vocabulary used metaphorically
- Noisy labels — tweets ambiguous even for human annotators
- Vocabulary gaps — real incidents reported with place names unseen during training
These limitations motivate contextual models like BERT as a natural next step.
The dataset (`train.csv`) is publicly available on Kaggle: *NLP with Disaster Tweets*.
It contains 7,613 labeled tweets with two columns of interest: `text` and `target` (1 = disaster, 0 = non-disaster).
```
DisasterTweets/
│
├── notebooks/
│   └── disaster_tweets_NLP_classification_project.ipynb  # Full analytical pipeline
│
├── reports/
│   └── presentation.pdf                                  # Project presentation slides
│
├── .gitignore
├── requirements.txt
└── README.md
```
```bash
pip install -r requirements.txt
```
Key dependencies: pandas, numpy, scikit-learn, nltk, matplotlib, seaborn, wordcloud
Maria Petralia · Data Science Program, MasterSchool · March 2026