A from-scratch implementation of a Transformer-based language model, built to demystify the internals of modern LLMs: tokenization, attention, training, and text generation.
LEmma implements the core components of the original "Attention Is All You Need" Transformer:
- Multi-Head Self-Attention — allows each token to attend to every other token in the sequence
- Position-wise Feed-Forward Network — processes each token representation independently
- Sinusoidal Positional Encoding — injects token position information into embeddings
- Residual Connections + Layer Norm — pre-norm style for stable training
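The sinusoidal encoding above can be sketched as follows (a NumPy illustration of the standard formula from the paper; LEmma's actual implementation lives in `positionalencoding.py`, and the function name here is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # broadcast to (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=32)
assert pe.shape == (16, 32)
assert pe[0, 0] == 0.0 and pe[0, 1] == 1.0  # position 0: sin(0)=0, cos(0)=1
```

Because each dimension pair oscillates at a different frequency, every position receives a unique pattern that the model can use to recover relative and absolute order.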
- Custom Character Tokenizer: simple character-level tokenizer (`CharTokenizer`) for converting text into token sequences and back.
- Transformer Model: a minimal Transformer with multi-head self-attention, feed-forward layers, and positional encoding.
- Training Pipeline: training script that feeds text data through the Transformer and optimizes with cross-entropy loss.
- Text Generation: Sampling script with temperature and top-k support for generating sequences from a trained model.
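A character-level tokenizer like the one above is small enough to sketch in full. This is a plausible shape for `CharTokenizer`, not necessarily the exact API in `chartokenizer.py`:

```python
class CharTokenizer:
    """Minimal character-level tokenizer: one integer id per unique character."""

    def __init__(self, text: str):
        chars = sorted(set(text))                            # fixed vocabulary from the corpus
        self.stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}   # id -> char

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
assert tok.decode(tok.encode("hello")) == "hello"  # lossless round trip
```

Character-level tokenization keeps the vocabulary tiny (every distinct character in the corpus) at the cost of longer sequences than subword schemes like the one behind `huggingface_tokenizer.py`.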
```
LEmma/
├── src/
│   └── lemma/
│       ├── models/
│       │   ├── attention.py
│       │   ├── feedforward.py
│       │   ├── positionalencoding.py
│       │   ├── transformerblock.py
│       │   └── transformer.py
│       ├── tokenizer/
│       │   ├── chartokenizer.py
│       │   └── huggingface_tokenizer.py
│       └── utils/
│           ├── train.py
│           ├── sample.py
│           ├── prepare_data.py
│           └── hf_tokenize.py
├── data/
├── checkpoints/
├── configs/
├── tests/
├── pyproject.toml
└── README.md
```
Set up a virtual environment and install LEmma:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Train a model:

```bash
python src/lemma/utils/train.py
```

Generate text:

```bash
python src/lemma/utils/sample.py
```
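The temperature and top-k options mentioned above shape the next-token distribution before sampling. A minimal NumPy sketch of the idea (the function name and exact flags in `sample.py` may differ):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, top_k: int = 0) -> int:
    """Sample one token id from raw logits with temperature and optional top-k filtering."""
    logits = logits / max(temperature, 1e-8)   # <1.0 sharpens, >1.0 flattens the distribution
    if top_k > 0:
        cutoff = np.sort(logits)[-top_k]       # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)  # mask everything below it
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.1, -1.0])
token_id = sample_next_token(logits, temperature=0.8, top_k=2)
assert token_id in (0, 1)  # top_k=2 restricts sampling to the two highest logits
```

With `top_k=1` this degenerates to greedy decoding; larger `top_k` and higher temperature trade determinism for diversity.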