This repository contains an end-to-end NLP classification project developed as part of a MasterSchool Data Science program.
The objective is to build a feature-driven pipeline to classify tweets as disaster-related or not, combining interpretable linguistic feature engineering with TF-IDF vectorization and linear classification models.
Twitter is a major channel for real-time reporting of emergency events. The ability to automatically identify genuine disaster signals from informal, figurative, or ambiguous language has direct applications in early warning systems and crisis monitoring.
The classification task is non-trivial: disaster-related vocabulary frequently appears in non-disaster contexts, creating semantic ambiguity that surface-level features cannot fully resolve.
This project:
- performs exploratory data analysis to identify discriminative linguistic patterns,
- engineers interpretable features from both raw and cleaned text,
- builds a preprocessing pipeline combining word-level TF-IDF, character-level TF-IDF, and scaled numeric features,
- trains and tunes two linear classifiers with cross-validation,
- and evaluates model behavior with attention to the asymmetric cost of false negatives.
The core challenge is semantic ambiguity.
The same words ("fire", "flood", "crash", "bomb") appear in both real emergencies and everyday conversation. A tweet saying "she's a suicide bomb" uses the same vocabulary as a tweet reporting a real attack. No word-frequency-based model can fully resolve this overlap.
This motivates a feature engineering strategy that goes beyond raw word counts, incorporating grammatical signals, composite linguistic scores, and semantic distinctiveness measures.
The project follows a structured analytical pipeline:
**Exploratory Data Analysis:** surface-level linguistic signals, lexical features, POS distributions, and n-gram analysis, identifying discriminative patterns before any modeling decision.

**Text Preprocessing:** noise removal (URLs, mentions, hashtags), character normalization, and lemmatization, designed to preserve signal while eliminating platform-specific artifacts.
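The cleaning step can be sketched with a few regular expressions. The function name and exact rules below are illustrative, and the lemmatization pass (e.g. nltk's `WordNetLemmatizer`) is omitted to keep the sketch dependency-free:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip platform noise before lemmatization and vectorization.

    Illustrative rules only; the actual notebook may order or scope
    these steps differently.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # mentions
    text = text.replace("#", " ")                       # hashtag marker (keep the word itself)
    text = re.sub(r"[^a-z\s]", " ", text)               # normalize to plain letters
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Forest FIRE near #California!! http://t.co/abc @user"))
# forest fire near california
```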
**Feature Engineering:**
- Surface signals extracted from raw text: `num_numbers`, `has_link`, `num_exclam`, `num_question`, `has_ellipsis`
- POS count features from cleaned text: past/present verbs, common nouns, pronouns
- Weighted composite scores: `event_score` (factual, event-oriented language) and `social_score` (conversational, informal language)
- Semantic indices: `ambiguity_ratio` and `distinctive_signal`, built from lemmatized nouns, verbs, and pronouns on the training set only
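The raw-text surface signals reduce to simple string operations. A hedged sketch (the exact regexes are assumptions; the POS counts require a tagger such as `nltk.pos_tag` on the cleaned text and are omitted here):

```python
import re

def surface_features(raw: str) -> dict:
    """Surface signals computed on the *raw* tweet, before any cleaning.
    Feature names match the list above; the detection rules are illustrative."""
    return {
        "num_numbers": len(re.findall(r"\d+", raw)),
        "has_link": int(bool(re.search(r"https?://|www\.", raw))),
        "num_exclam": raw.count("!"),
        "num_question": raw.count("?"),
        "has_ellipsis": int("..." in raw or "\u2026" in raw),
    }
```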
**Vectorization:**
- Word-level TF-IDF with `ngram_range=(1, 2)` captures compositional lexical patterns
- Character-level TF-IDF with `ngram_range=(3, 4)` captures subword patterns and spelling variation
- `StandardScaler` applied to numeric features
All preprocessing steps are combined in a single ColumnTransformer and wrapped in a sklearn Pipeline together with the classifier. This ensures that vectorization and scaling are fitted exclusively on the training set, with no leakage into the test set.
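A minimal sketch of that composition, assuming illustrative column names (`text`, `num_exclam`, `has_link`), a toy data frame, and a `char_wb` analyzer for the character n-grams (the notebook's exact settings may differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame; the real project uses the cleaned Kaggle data
# plus the engineered numeric columns described above.
df = pd.DataFrame({
    "text": ["forest fire evacuation ordered", "huge flood hits the city",
             "this party is fire lol", "my heart is on fire for you",
             "earthquake damage reported downtown", "crash on highway two injured",
             "that movie was a total bomb", "bombing at my exam today haha"],
    "num_exclam": [0, 1, 2, 0, 0, 1, 0, 2],
    "has_link":   [1, 1, 0, 0, 1, 1, 0, 0],
})
y = [1, 1, 0, 0, 1, 1, 0, 0]

preprocess = ColumnTransformer([
    ("word_tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)), "text"),
    ("numeric", StandardScaler(), ["num_exclam", "has_link"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(df, y)  # vectorizers and scaler are fitted on training data only
```

Because the transformers live inside the pipeline, `fit` touches only the training split; `predict` on held-out data reuses the fitted vocabulary and scaling statistics.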
**Modeling:**
- Logistic Regression and Linear SVM
- Both tuned with GridSearchCV (5-fold CV, optimizing F1 on the disaster class)
- Threshold analysis on Logistic Regression probability estimates
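The tuning step can be sketched as follows, on a toy corpus and with word-level TF-IDF only for brevity; `scoring="f1"` scores the positive (disaster) class, and the Linear SVM would be tuned analogously with `LinearSVC`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (illustrative); the real search runs on the full feature pipeline.
texts = [
    "forest fire spreads fast", "flood warning issued now",
    "earthquake hits the coast", "building collapse rescue underway",
    "wildfire evacuation ordered", "storm damage reported downtown",
    "this song is fire", "i could eat a horse",
    "my exam was a disaster lol", "crashing at your place tonight",
    "you are the bomb", "dying of laughter right now",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1",  # F1 on the positive (disaster) class
    cv=5,
)
grid.fit(texts, labels)
```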
Feature engineering is grounded in the exploratory analysis, not applied generically.
Event Score aggregates signals associated with factual, event-oriented language: number presence, link presence, past-tense verbs (VBD, VBN), and common nouns (NN). Each signal is weighted by its observed discriminative ratio (disaster/non-disaster mean) on the training set.
Social Score aggregates signals associated with conversational language: exclamation marks, question marks, present-tense verbs (VBP, VB), and pronouns (PRP, PRP$). Weights follow the same empirical approach.
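The weighting scheme can be sketched like this; the toy frame, the epsilon guard, and the plain weighted sum are assumptions about details the text leaves open:

```python
import pandas as pd

# Toy training frame (illustrative values).
train = pd.DataFrame({
    "num_numbers": [2, 0, 1, 1, 3, 0],
    "has_link":    [1, 0, 1, 1, 1, 0],
    "target":      [1, 0, 1, 0, 1, 0],
})
event_features = ["num_numbers", "has_link"]

# Weight = observed discriminative ratio on the training set:
# mean within the disaster class / mean within the non-disaster class.
disaster = train[train["target"] == 1]
other = train[train["target"] == 0]
eps = 1e-9  # guard against division by zero (assumption)
weights = {f: disaster[f].mean() / (other[f].mean() + eps) for f in event_features}

# Composite score = weighted sum of the signals.
train["event_score"] = sum(train[f] * w for f, w in weights.items())
```

`social_score` follows the same recipe with its own feature list (exclamation marks, question marks, present-tense verbs, pronouns).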
Ambiguity Ratio measures the proportion of lemmas in a tweet that appear in both classes. A high value indicates vocabulary shared across disaster and non-disaster tweets, making classification harder.
Distinctive Signal counts lemmas that appear in only one class. A high value indicates vocabulary strongly associated with a specific class.
Both indices are constructed exclusively from the training set to prevent data leakage.
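Both indices reduce to set operations over per-class lemma vocabularies. A sketch with hypothetical helper names, operating on pre-lemmatized token lists:

```python
def build_lemma_sets(train_lemmas, train_labels):
    """Per-class lemma vocabularies from the *training* split only."""
    disaster, other = set(), set()
    for lemmas, label in zip(train_lemmas, train_labels):
        (disaster if label == 1 else other).update(lemmas)
    shared = disaster & other     # lemmas seen in both classes
    exclusive = disaster ^ other  # lemmas seen in exactly one class
    return shared, exclusive

def ambiguity_ratio(lemmas, shared):
    """Fraction of a tweet's lemmas that both classes use."""
    return sum(l in shared for l in lemmas) / len(lemmas) if lemmas else 0.0

def distinctive_signal(lemmas, exclusive):
    """Count of lemmas tied to exactly one class in training."""
    return sum(l in exclusive for l in lemmas)
```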
| Model | Class | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | Non-Disaster (0) | 0.82 | 0.87 | 0.84 | 0.82 |
| Logistic Regression | Disaster (1) | 0.81 | 0.74 | 0.78 | 0.82 |
| Linear SVM | Non-Disaster (0) | 0.81 | 0.87 | 0.84 | 0.81 |
| Linear SVM | Disaster (1) | 0.81 | 0.73 | 0.77 | 0.81 |
Logistic Regression is selected as the final model. Beyond marginal performance advantages, it produces probability estimates enabling threshold adjustment, useful when minimizing false negatives is a priority.
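The threshold trade-off can be illustrated on toy probabilities (the values below are made up, standing in for `predict_proba(X)[:, 1]` output):

```python
import numpy as np

# Toy P(disaster) scores and matching gold labels (illustrative values).
probs  = np.array([0.92, 0.61, 0.48, 0.35, 0.30, 0.10])
y_true = np.array([1,    1,    1,    1,    0,    0])

def recall_at(threshold: float) -> float:
    """Recall on the disaster class when predicting 1 above `threshold`."""
    preds = (probs >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    return tp / (tp + fn)

# Lowering the threshold trades precision for recall, i.e. fewer missed disasters.
print(recall_at(0.5), recall_at(0.3))  # 0.5 1.0
```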
The pipeline works with word frequencies and grammatical patterns; it does not understand meaning or context. Three categories of errors were identified:
- Figurative language — disaster vocabulary used metaphorically
- Noisy labels — tweets ambiguous even for human annotators
- Vocabulary gaps — real incidents reported with place names unseen during training
These limitations motivate contextual models like BERT as a natural next step.
The dataset (`train.csv`) is publicly available on Kaggle: *NLP with Disaster Tweets*.
It contains 7,613 labeled tweets with two columns of interest: `text` and `target` (1 = disaster, 0 = non-disaster).
```
DisasterTweets/
│
├── notebooks/
│   └── disaster_tweets_NLP_classification_project.ipynb  # Full analytical pipeline
│
├── reports/
│   └── presentation.pdf                                  # Project presentation slides
│
├── .gitignore
├── requirements.txt
└── README.md
```
```bash
pip install -r requirements.txt
```
Key dependencies: pandas, numpy, scikit-learn, nltk, matplotlib, seaborn, wordcloud
Maria Petralia · Data Science Program, MasterSchool · March 2026