GapClean (v1.0.5)

Written by Aarya Venkat, PhD

Description

GapClean is a memory-efficient tool for cleaning gappy multiple sequence alignments. It reads FASTA, Stockholm, Clustal, and PHYLIP input, writes FASTA by default, and offers three modes for gap removal:

Threshold Mode: Remove columns exceeding a gap percentage
Seed Mode: Remove gaps relative to a reference sequence
Entropy Mode: Remove columns by diversity (Shannon entropy), keeping either variable or conserved positions

NEW in v1.0.5:

Multi-format support: Stockholm, Clustal, PHYLIP, FASTA
Auto-detect input format (no flags needed!)
Convert between formats seamlessly
Works with Pfam alignments (.sto files)

Also in v1.0.4:

Simple Python API for notebooks and scripts
Auto-managed temp files (no manual cleanup!)
Returns statistics dict with metrics and timing

And v1.0.3:

Entropy-based gap removal mode
Pure Python implementation (Windows compatible!)
Comprehensive testing and documentation

Features

Memory Efficient: Process alignments larger than RAM using 2D chunking
Fast: Optimized NumPy operations for gap detection
Scalable: Handles million-sequence datasets (1M+ sequences in 35 seconds)
Cross-Platform: Works on Windows, macOS, and Linux
Flexible: Three gap removal modes for different use cases
Well-Tested: 48+ tests ensuring reliability
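
The 2D chunking idea can be illustrated with a minimal pure-Python sketch. It assumes the alignment is a list of equal-length strings and that "-" is the only gap character; the function name and structure are illustrative, not GapClean's internals (which, per the feature list, rely on NumPy). In a real out-of-core tool each tile would be read from disk instead of sliced from memory.

```python
def gap_counts_chunked(seqs, row_chunk=5000, col_chunk=5000, gap="-"):
    """Count gaps per column, visiting one (row, column) tile at a time."""
    n_cols = len(seqs[0]) if seqs else 0
    counts = [0] * n_cols
    for r in range(0, len(seqs), row_chunk):        # block of sequences
        rows = seqs[r:r + row_chunk]
        for c in range(0, n_cols, col_chunk):       # block of columns
            for row in rows:
                # Only this slice of the row is examined for the current tile
                for j, ch in enumerate(row[c:c + col_chunk], start=c):
                    if ch == gap:
                        counts[j] += 1
    return counts

# Tiny alignment: 3 sequences x 4 columns, processed in 2x3 tiles
seqs = ["A-C-", "A-G-", "AT--"]
counts = gap_counts_chunked(seqs, row_chunk=2, col_chunk=3)
print(counts)  # [0, 2, 1, 3]
```

Because only per-column counters persist between tiles, peak memory in this sketch depends on the tile size rather than on the full alignment.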

Performance

Benchmarks on Apple M1 (16 GB RAM) using Pfam protein families (70% gap threshold):

Scale	Pfam	Sequences	Length	Time	Size
Tiny	PF15608	931	148	<0.1s	215 KB
Small	PF00637	38,583	648	2s	27 MB
Medium	PF00535	157,052	1,285	11s	206 MB
Large	PF00069	1,051,876	3,667	35s	7 GB

In these benchmarks, processing time grows with alignment size, from under 0.1 s for a small family (931 sequences) to 35 s for a dataset of more than one million sequences.

Installation

pip install gapclean

Requirements: Python 3.8+

Quick Start

Threshold Mode

Remove columns with >75% gaps:

gapclean -i input.fa -o output.fa -t 75
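
As a rough illustration of what threshold mode computes, the sketch below drops every column whose gap percentage is strictly above the threshold. It assumes an in-memory list of equal-length strings with "-" as the only gap character; the helper names are hypothetical and this is not GapClean's implementation.

```python
def threshold_keep_mask(seqs, threshold_pct, gap="-"):
    """Return one boolean per column: True if the column is kept."""
    n_seqs = len(seqs)
    mask = []
    for column in zip(*seqs):                      # iterate column by column
        gap_pct = 100.0 * column.count(gap) / n_seqs
        mask.append(gap_pct <= threshold_pct)      # remove only columns above the threshold
    return mask

def apply_mask(seqs, mask):
    """Keep only the characters whose column is marked True."""
    return ["".join(ch for ch, keep in zip(s, mask) if keep) for s in seqs]

seqs = ["A-C-", "A-G-", "AT--", "ATC-"]
mask = threshold_keep_mask(seqs, 75)   # column gap percentages: 0, 50, 25, 100
cleaned = apply_mask(seqs, mask)
print(cleaned)  # ['A-C', 'A-G', 'AT-', 'ATC']
```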

Seed Mode

Remove gaps relative to first sequence:

gapclean -i input.fa -o output.fa -s 0

Entropy Mode (NEW!)

Remove columns based on diversity:

# Keep variable regions (SNP detection)
gapclean -i input.fa -o output.fa --entropy-min 1.0

# Keep conserved regions (alignment cleaning)
gapclean -i input.fa -o output.fa --entropy-max 1.5
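
To make the entropy thresholds concrete, here is a minimal sketch of per-column Shannon entropy filtering. It assumes entropy is measured in bits (log base 2) over non-gap characters only; whether GapClean uses the same base and gap handling is not stated here, so treat the numbers as illustrative. The function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column, gap="-"):
    """Shannon entropy in bits of the non-gap characters in one column."""
    residues = [ch for ch in column if ch != gap]
    if not residues:
        return 0.0
    total = len(residues)
    return sum((n / total) * math.log2(total / n) for n in Counter(residues).values())

def entropy_keep_mask(seqs, entropy_min=None, entropy_max=None):
    """True for columns whose entropy lies within the requested bounds."""
    mask = []
    for column in zip(*seqs):
        h = column_entropy(column)
        too_low = entropy_min is not None and h < entropy_min
        too_high = entropy_max is not None and h > entropy_max
        mask.append(not (too_low or too_high))
    return mask

seqs = ["AAC", "AGC", "ATG", "ACG"]
entropies = [column_entropy(col) for col in zip(*seqs)]
print(entropies)                                  # [0.0, 2.0, 1.0]
print(entropy_keep_mask(seqs, entropy_min=1.0))   # [False, True, True]
print(entropy_keep_mask(seqs, entropy_max=1.5))   # [True, False, True]
```

Under the bits assumption, a DNA column can reach at most 2.0 (four equally frequent bases), which gives a sense of scale for thresholds such as 1.0 and 1.5.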

Usage

gapclean [options]

Required Arguments:
  -i, --input    Input alignment file (FASTA, Stockholm, Clustal, or PHYLIP; auto-detected)
  -o, --output   Output cleaned alignment file (FASTA by default)

Gap Removal Mode (choose one):
  -t, --threshold      Percentage threshold (0-100)
  -s, --seed           Seed sequence index (0-based)
  --entropy-min        Remove columns with entropy < threshold (keep variable)
  --entropy-max        Remove columns with entropy > threshold (keep conserved)

Optional Arguments:
  --row-chunk-size   Sequences per chunk (default: 5000)
  --col-chunk-size   Columns per chunk (default: 5000)
  --output-format    Output format, e.g. fasta or stockholm (default: fasta)
  -h, --help         Show help message
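
For readers who want to mirror this interface in their own wrapper scripts, the "choose one" rule can be expressed with an argparse mutually exclusive group. This is a hypothetical sketch built only from the flags listed above; it is not GapClean's actual parser, and treating --entropy-min and --entropy-max as exclusive of each other is an assumption.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="gapclean")
    parser.add_argument("-i", "--input", required=True)
    parser.add_argument("-o", "--output", required=True)

    # Exactly one gap removal mode must be given (assumption, see lead-in)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-t", "--threshold", type=float)
    mode.add_argument("-s", "--seed", type=int)
    mode.add_argument("--entropy-min", type=float)
    mode.add_argument("--entropy-max", type=float)

    parser.add_argument("--row-chunk-size", type=int, default=5000)
    parser.add_argument("--col-chunk-size", type=int, default=5000)
    return parser

args = build_parser().parse_args(["-i", "input.fa", "-o", "output.fa", "-t", "75"])
print(args.threshold, args.seed, args.row_chunk_size)
```

Passing two modes at once (for example -t 75 together with -s 0) makes argparse exit with a usage error.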

Examples

Phylogenetic Analysis

# Remove very gappy columns before tree building
gapclean -i gene_alignment.fa -o cleaned.fa -t 80

Variant Analysis

# Remove gaps relative to reference genome (first sequence)
gapclean -i variants.fa -o positions.fa -s 0

SNP Detection

# Keep only variable positions (DNA)
gapclean -i population.fa -o snps.fa --entropy-min 1.0

Stockholm Format (Pfam)

# Auto-detects Stockholm input, outputs FASTA (fast - recommended!)
gapclean -i PF00535.sto -o cleaned.fa -t 70

# Stockholm output (slower - only if you need the metadata)
gapclean -i pfam_seed.sto -o cleaned.sto --output-format stockholm -t 75

# Explicit format specification
gapclean -i pfam_seed.sto -o cleaned.txt -t 75 --output-format fasta
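
Input auto-detection of the kind used in the commands above can plausibly be done by inspecting the first non-empty line of the file, since each supported format has a recognizable opening. The sketch below relies on those conventions (Stockholm files start with "# STOCKHOLM", Clustal files with "CLUSTAL", FASTA records with ">", and PHYLIP files with two integers). It is an assumption about one way to do this, not GapClean's detection logic, and `sniff_alignment_format` is a hypothetical name.

```python
def sniff_alignment_format(first_line):
    """Guess the alignment format from the first non-empty line, or return None."""
    line = first_line.strip()
    if line.startswith("# STOCKHOLM"):
        return "stockholm"
    if line.startswith("CLUSTAL"):
        return "clustal"
    if line.startswith(">"):
        return "fasta"
    parts = line.split()
    # PHYLIP header: number of sequences followed by alignment length
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "phylip"
    return None

print(sniff_alignment_format("# STOCKHOLM 1.0"))  # stockholm
print(sniff_alignment_format(">sp|P12345|EXAMPLE"))  # fasta
print(sniff_alignment_format(" 5 42"))  # phylip
```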

Format Conversion

# Convert Clustal to FASTA while cleaning (recommended - fast)
gapclean -i alignment.aln -o output.fa -t 50

# Convert FASTA to Stockholm (slower for large alignments)
gapclean -i input.fa -o output.sto --output-format stockholm -t 75

Performance Note: Stockholm, Clustal, and PHYLIP output is much slower than FASTA for large alignments (100K+ sequences) because it goes through Biopython's format writers. GapClean defaults to FASTA output for best performance. Use --output-format only if you specifically need a non-FASTA format.

Memory-Constrained Systems

# Process large alignment with limited RAM
gapclean -i huge_alignment.fa -o cleaned.fa -t 75 \
  --row-chunk-size 1000 --col-chunk-size 1000
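
A back-of-the-envelope estimate helps when picking chunk sizes. The sketch below assumes each alignment cell occupies one byte and ignores temporary arrays and interpreter overhead, so real usage will be higher; `chunk_bytes` is a hypothetical helper, not part of GapClean.

```python
def chunk_bytes(row_chunk_size, col_chunk_size, bytes_per_cell=1):
    """Approximate size of one (rows x columns) tile in bytes."""
    return row_chunk_size * col_chunk_size * bytes_per_cell

default_mb = chunk_bytes(5000, 5000) / 1e6   # default chunk sizes
small_mb = chunk_bytes(1000, 1000) / 1e6     # sizes used in the command above
print(f"default tile: {default_mb:.0f} MB, reduced tile: {small_mb:.0f} MB")
# default tile: 25 MB, reduced tile: 1 MB
```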

Python API

Use GapClean programmatically in Python scripts and Jupyter notebooks:

from gapclean import clean_alignment

# Threshold mode - remove columns with >50% gaps
stats = clean_alignment(
    input_file='input.fa',
    output_file='output.fa',
    threshold=50
)

print(f"Removed {stats['columns_removed']} columns")
print(f"Took {stats['elapsed_seconds']:.1f} seconds")

# Stockholm format (auto-detected)
stats = clean_alignment('pfam.sto', 'cleaned.fa', threshold=70)

# Explicit format conversion
stats = clean_alignment(
    'input.aln', 'output.sto',
    threshold=50,
    input_format='clustal',
    output_format='stockholm'
)

# Seed mode - remove gaps relative to first sequence
stats = clean_alignment('input.fa', 'output.fa', seed_index=0)

# Entropy mode - keep only conserved regions
stats = clean_alignment('input.fa', 'output.fa', entropy_max=1.5)

# Quiet mode for pipelines
stats = clean_alignment('input.fa', 'output.fa', threshold=75, verbose=False)
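
When cleaning many alignments in a pipeline, the returned statistics dicts can be combined into a single report. The sketch below uses only the two keys shown above ('columns_removed' and 'elapsed_seconds'); the helper name and the stand-in data are hypothetical, and in practice each dict would come from a clean_alignment call.

```python
def summarize_runs(stats_by_file):
    """Aggregate per-file statistics dicts (expects at least one entry)."""
    total_cols = sum(s["columns_removed"] for s in stats_by_file.values())
    total_secs = sum(s["elapsed_seconds"] for s in stats_by_file.values())
    slowest = max(stats_by_file, key=lambda name: stats_by_file[name]["elapsed_seconds"])
    return {
        "files": len(stats_by_file),
        "columns_removed": total_cols,
        "elapsed_seconds": total_secs,
        "slowest": slowest,
    }

# Stand-in results; real values would be returned by clean_alignment(...)
example = {
    "a.fa": {"columns_removed": 120, "elapsed_seconds": 2.0},
    "b.fa": {"columns_removed": 30, "elapsed_seconds": 0.5},
}
summary = summarize_runs(example)
print(summary)
```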

Why use the Python API?

One function call: No temp file management, no subprocess overhead
Multi-format support: Stockholm, Clustal, PHYLIP, FASTA - auto-detected!
Returns statistics: Get metrics about the cleaning operation
Perfect for pipelines: Integrate into larger bioinformatics workflows
Jupyter-friendly: See the included visualization tutorial notebook

Documentation

Full documentation available at: https://arikat.github.io/GapClean/

What's New in v1.0.3

Entropy-based gap removal: Identify variable and conserved regions
Windows compatibility: Pure Python, no external dependencies
Better error messages: Clear, actionable feedback
Type hints: Full type annotations for better IDE support
Comprehensive tests: 48+ tests for reliability
Professional docs: Beautiful Material for MkDocs site
CI/CD: Automated testing on Windows, macOS, Linux

See CHANGELOG for full details.

Citation

If you use GapClean in your research, please cite:

Venkat, A. (2026). GapClean: Memory-efficient gap removal for multiple sequence alignments.
https://github.com/arikat/GapClean

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE file for details.

Support

Documentation: https://arikat.github.io/GapClean/
Issues: https://github.com/arikat/GapClean/issues
PyPI: https://pypi.org/project/gapclean/

Thank Gappy for his service. He is a retired detective.
