MapiAI/DisasterTweets-NLP-Classification

🐦 Disaster Tweets — NLP Binary Classification


This repository contains an end-to-end NLP classification project developed as part of a MasterSchool Data Science program.

The objective is to build a feature-driven pipeline to classify tweets as disaster-related or not, combining interpretable linguistic feature engineering with TF-IDF vectorization and linear classification models.


📌 Project Overview

Twitter is a major channel for real-time reporting of emergency events. The ability to automatically identify genuine disaster signals from informal, figurative, or ambiguous language has direct applications in early warning systems and crisis monitoring.

The classification task is non-trivial: disaster-related vocabulary frequently appears in non-disaster contexts, creating semantic ambiguity that surface-level features cannot fully resolve.

This project:

  • performs exploratory data analysis to identify discriminative linguistic patterns,
  • engineers interpretable features from both raw and cleaned text,
  • builds a preprocessing pipeline combining word-level TF-IDF, character-level TF-IDF, and scaled numeric features,
  • trains and tunes two linear classifiers with cross-validation,
  • and evaluates model behavior with attention to the asymmetric cost of false negatives.

🎯 Problem Statement

The core challenge is semantic ambiguity.

The same words (*fire*, *flood*, *crash*, *bomb*) appear in both real emergencies and everyday conversation. A tweet saying "she's a suicide bomb" uses the same vocabulary as a tweet reporting a real attack. No word-frequency-based model can fully resolve this overlap.

This motivates a feature engineering strategy that goes beyond raw word counts, incorporating grammatical signals, composite linguistic scores, and semantic distinctiveness measures.


🧭 Methodology

The project follows a structured analytical pipeline:

**Exploratory Data Analysis:** Surface-level linguistic signals, lexical features, POS distributions, and n-gram analysis — identifying discriminative patterns before any modeling decision.

**Text Preprocessing:** Noise removal (URLs, mentions, hashtags), character normalization, and lemmatization — designed to preserve signal while eliminating platform-specific artifacts.

**Feature Engineering:**

  • Surface signals extracted from raw text: num_numbers, has_link, num_exclam, num_question, has_ellipsis
  • POS count features from cleaned text: past/present verbs, common nouns, pronouns
  • Weighted composite scores: event_score (factual, event-oriented language) and social_score (conversational, informal language)
  • Semantic indices: ambiguity_ratio and distinctive_signal, built from lemmatized nouns, verbs, and pronouns on the training set only
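The surface signals can be sketched with simple regex-based rules. This is a minimal illustration; the notebook's exact definitions may differ:

```python
import re

import pandas as pd


def surface_signals(text: str) -> dict:
    """Extract simple surface features from a raw tweet.

    Feature names mirror those listed above; the exact rules
    (e.g. what counts as a link) are assumptions.
    """
    return {
        "num_numbers": len(re.findall(r"\d+", text)),
        "has_link": int(bool(re.search(r"https?://\S+", text))),
        "num_exclam": text.count("!"),
        "num_question": text.count("?"),
        "has_ellipsis": int("..." in text or "\u2026" in text),
    }


# Expand the per-tweet dicts into one column per feature.
df = pd.DataFrame({"text": ["Forest fire near town!!! https://t.co/x", "What time is it?"]})
features = df["text"].apply(surface_signals).apply(pd.Series)
```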

**Vectorization:**

  • Word-level TF-IDF with ngram_range=(1,2) captures compositional lexical patterns
  • Character-level TF-IDF with ngram_range=(3,4) captures subword patterns and spelling variation
  • StandardScaler applied to the engineered numeric features

All preprocessing steps are combined in a single ColumnTransformer and wrapped in a sklearn Pipeline together with the classifier. This ensures that vectorization and scaling are fitted exclusively on the training set, with no leakage into the test set.
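A sketch of that combined preprocessor follows. The column names (`clean_text`, the numeric feature list) and the `char_wb` analyzer are assumptions for illustration, not the notebook's exact configuration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical subset of the engineered numeric columns.
numeric_cols = ["num_numbers", "has_link", "num_exclam", "event_score", "social_score"]

preprocess = ColumnTransformer([
    # Word-level TF-IDF over unigrams and bigrams.
    ("word_tfidf", TfidfVectorizer(ngram_range=(1, 2)), "clean_text"),
    # Character-level TF-IDF over 3- and 4-grams (within word boundaries).
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)), "clean_text"),
    # Standardize the numeric features alongside the sparse text blocks.
    ("numeric", StandardScaler(), numeric_cols),
])

# Wrapping everything in one Pipeline means fit() touches only training data.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
```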

**Modeling:**

  • Logistic Regression and Linear SVM
  • Both tuned with GridSearchCV (5-fold CV, optimizing F1 on the disaster class)
  • Threshold analysis on Logistic Regression probability estimates
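The tuning setup can be sketched as follows, using a simplified text-only pipeline and a hypothetical `C` grid (the notebook's actual search space may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# scoring="f1" optimizes F1 on the positive (disaster) class by default.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # hypothetical grid
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1", n_jobs=-1)
```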

🧠 Feature Engineering Design

Feature engineering is grounded in the exploratory analysis, not applied generically.

Event Score aggregates signals associated with factual, event-oriented language: number presence, link presence, past-tense verbs (VBD, VBN), and common nouns (NN). Each signal is weighted by its observed discriminative ratio (disaster/non-disaster mean) on the training set.

Social Score aggregates signals associated with conversational language: exclamation marks, question marks, present-tense verbs (VBP, VB), and pronouns (PRP, PRP$). Weights follow the same empirical approach.
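The ratio-based weighting behind both composite scores might look like this. Column names and the epsilon guard are assumptions; the notebook's exact computation may differ:

```python
import pandas as pd


def ratio_weights(train_df, signal_cols, label_col="target"):
    """Weight each signal by its disaster/non-disaster mean ratio,
    computed on the training set only."""
    pos = train_df[train_df[label_col] == 1][signal_cols].mean()
    neg = train_df[train_df[label_col] == 0][signal_cols].mean()
    return pos / (neg + 1e-9)  # epsilon avoids division by zero


def composite_score(df, signal_cols, weights):
    """Weighted sum of the signals, e.g. event_score or social_score."""
    return (df[signal_cols] * weights).sum(axis=1)
```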

Ambiguity Ratio measures the proportion of lemmas in a tweet that appear in both classes. A high value indicates vocabulary shared across disaster and non-disaster tweets, making classification harder.

Distinctive Signal counts lemmas that appear in only one class. A high value indicates vocabulary strongly associated with a specific class.

Both indices are constructed exclusively from the training set to prevent data leakage.
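The two indices reduce to set operations over training-set lemmas. A simplified sketch, assuming tweets are already lemmatized into token lists:

```python
def build_vocab_sets(train_lemmas, train_labels):
    """Split the training vocabulary into shared and class-unique lemmas."""
    disaster, non_disaster = set(), set()
    for lemmas, label in zip(train_lemmas, train_labels):
        (disaster if label == 1 else non_disaster).update(lemmas)
    shared = disaster & non_disaster       # lemmas seen in both classes
    unique = disaster ^ non_disaster       # lemmas seen in only one class
    return shared, unique


def ambiguity_ratio(lemmas, shared):
    """Fraction of a tweet's lemmas that appear in both classes."""
    return sum(l in shared for l in lemmas) / len(lemmas) if lemmas else 0.0


def distinctive_signal(lemmas, unique):
    """Count of lemmas strongly associated with a single class."""
    return sum(l in unique for l in lemmas)
```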


✅ Results

| Model | Class | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | Non-Disaster (0) | 0.82 | 0.87 | 0.84 | 0.82 |
| Logistic Regression | Disaster (1) | 0.81 | 0.74 | 0.78 | 0.82 |
| Linear SVM | Non-Disaster (0) | 0.81 | 0.87 | 0.84 | 0.81 |
| Linear SVM | Disaster (1) | 0.81 | 0.73 | 0.77 | 0.81 |

Logistic Regression is selected as the final model. Beyond its marginal performance advantage, it produces probability estimates that enable threshold adjustment, which is useful when minimizing false negatives is a priority.
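Threshold adjustment on those probability estimates can be sketched as follows; the 0.35 default here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_with_threshold(clf, X, threshold=0.35):
    """Label a tweet as disaster when P(disaster) exceeds `threshold`.

    Values below 0.5 trade precision for recall on the disaster class,
    reducing false negatives.
    """
    proba = clf.predict_proba(X)[:, 1]  # column 1 = P(disaster)
    return (proba >= threshold).astype(int)


# Toy illustration on one synthetic feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
```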


⚠️ Limitations

The pipeline works with word frequencies and grammatical patterns; it does not understand meaning or context. Three categories of errors were identified:

  • Figurative language — disaster vocabulary used metaphorically
  • Noisy labels — tweets ambiguous even for human annotators
  • Vocabulary gaps — real incidents reported with place names unseen during training

These limitations motivate contextual models like BERT as a natural next step.


📊 Dataset

The dataset (train.csv) is publicly available on Kaggle: NLP with Disaster Tweets

It contains 7,613 labeled tweets with two key columns: `text` and `target` (1 = disaster, 0 = non-disaster).


📁 Repository Structure

```
DisasterTweets/
│
├── notebooks/
│   └── disaster_tweets_NLP_classification_project.ipynb   # Full analytical pipeline
│
├── reports/
│   └── presentation.pdf   # Project presentation slides
│
├── .gitignore
├── requirements.txt
└── README.md
```

📦 Requirements

```
pip install -r requirements.txt
```

Key dependencies: pandas, numpy, scikit-learn, nltk, matplotlib, seaborn, wordcloud


👤 Author

Maria Petralia
Data Science Program — MasterSchool, March 2026
