BurnOutTrader/databento_scripts

DataBento to Parquet Conversion Scripts

Convert raw DataBento .dbn.zst files into consolidated Parquet files for use with NautilusTrader backtesting.

Installation

1. Clone as Submodule

git submodule add https://github.com/BurnOutTrader/databento_scripts.git scripts/databento

2. Install Dependencies

pip install pyarrow psutil databento-dbn zstandard

You also need zstd installed on your system (for integrity verification):

# Debian/Ubuntu
sudo apt install zstd

# macOS
brew install zstd

# Or install via cargo
cargo install zstd

3. Install DBN CLI

cargo install dbn-cli

Verify: dbn --version
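
If you prefer to confirm both command-line tools from Python before a long run, a small pre-flight check along these lines may help. This is only a sketch and not part of the repository: the function name missing_tools is hypothetical, and it assumes the executables are named dbn and zstd as in the steps above.

```python
import shutil

def missing_tools(names=("dbn", "zstd")):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("dbn and zstd are both available")
```

An empty list means both tools were found; anything else names what still needs installing or adding to PATH.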

Configuration

Edit paths.py to set your data directories:

from pathlib import Path

DATA_DIR = Path("/your/data/storage")
DATABENTO_DIR = DATA_DIR / "databento"
DBN_RAW_DIR = DATA_DIR / "databento_raw"
DEFINITIONS_DIR = DATABENTO_DIR / "definitions"

Create the directories:

mkdir -p /your/data/storage/databento_raw
mkdir -p /your/data/storage/databento

Data Structure

Place your DataBento .dbn.zst files in databento_raw/:

databento_raw/
├── definitions/     # Instrument definitions
├── candles_1d/      # Daily OHLCV
├── candles_1h/      # Hourly OHLCV
├── candles_1m/      # Minute OHLCV
├── candles_1s/      # Second OHLCV
├── mbo/             # Market-by-order
├── mbp10/           # Market-by-price (10 levels)
├── tbbo/            # Top-of-book
├── statistics/
└── status/

Usage

Step 1: Process Definitions (must run first)

cd scripts/databento
python databento_definitions_to_parquet.py

This creates instrument definition files needed to decode market data.

Step 2: Process Market Data

python databento_to_parquet.py

Select which datatypes to process when prompted, or choose all.

Step 3 (optional): Merge Symbol-Specific Data

For datatypes where you download per-symbol rather than all symbols (e.g. mbo, mbp10):

python databento_symbol_merge.py

  1. Select the raw data folder(s) to process
  2. The script scans the files and lists all available symbols
  3. Select which symbols to merge (e.g. 1,3,8)

New records are merged into the existing Parquet files with exact-row deduplication, so it is safe to run the script multiple times or to add new symbols incrementally.
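
Exact-row deduplication means a record is dropped only when every field matches a record that is already present. The library-free sketch below illustrates that idea on plain tuples. The real script works on Parquet data and may implement this differently; merge_exact_rows and the sample rows are made up for illustration.

```python
def merge_exact_rows(existing, new):
    """Append rows from `new` that are not already in `existing`.

    Rows are tuples; two rows are duplicates only if every field is equal.
    The order of first appearance is preserved.
    """
    seen = set()
    merged = []
    for row in list(existing) + list(new):
        if row not in seen:
            seen.add(row)
            merged.append(row)
    return merged

# Made-up sample rows: (timestamp, symbol, price)
existing = [(1717632000, "AAA", 100.25), (1717632001, "AAA", 100.50)]
new = [(1717632001, "AAA", 100.50), (1717632002, "BBB", 200.00)]

merged = merge_exact_rows(existing, new)
print(len(merged))
```

Merging the same input a second time leaves the result unchanged, which is the property that makes repeated runs safe.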

Output

Parquet files are saved to databento/:

databento/
├── definitions/
│   └── FuturesContract/
│       └── 20250606.parquet
├── candles_1d/
│   └── Bar/
│       └── 20250606.parquet  # Daily
├── mbo/
│   └── OrderBookDelta/
│       └── 20250606.parquet  # Daily
└── ...

Features

  • Integrity Check: SHA256 verification before processing
  • Resume: Tracks progress, safe to interrupt
  • RAM Management: Monitors memory, auto-throttles
  • Parallel: Uses all CPU cores
  • Smart Splitting: Large files split automatically

Progress Files

Stored in the databento/ directory:

  • .data_done: processed data periods
  • .definitions_done: processed definitions
  • .definitions_day_done: processed definition days
  • .dbn_integrity: cached checksums
  • .splitting_files: files currently being split

Delete these to force a re-run.

Troubleshooting

dbn not found: Add Cargo's bin directory (typically ~/.cargo/bin) to PATH, or edit dbn_split.py

Out of memory: Lower the WORKER count in the script

Script fails: Check the error output. The integrity check reports any corrupted files

Resume after crash: Just re-run; the script picks up where it left off

CME/GLBX statistics missing raw-symbol mappings: On some historical CME/GLBX days, statistics records carry a numeric instrument_id that is absent from both the DBN metadata raw-symbol mapping table and the nearby definitions files. It appears that CME never published an instrument definition for these IDs. For statistics only, databento_to_parquet.py falls back to writing the record with instrument_id="<numeric id>", so the data is preserved for later repair instead of the whole day failing. Each fallback occurrence writes a JSON trace file to databento/statistics_fallback_logs/.


How It Works

Overview

The scripts convert compressed .dbn.zst files from DataBento into Apache Parquet format. This is necessary because:

  1. NautilusTrader reads Parquet natively
  2. Parquet is columnar and efficient for backtesting
  3. Raw DBN files are too slow for iterative research

File Processing Flow

.dbn.zst → Integrity Check → Split (if large) → Process to Parquet → Consolidate by period

1. Integrity Verification

Before processing any file, the scripts verify integrity:

  1. SHA256 Checksum: Computed for each file and cached in .dbn_integrity. On subsequent runs, cached checksums are reused - only new files need computation.

  2. Readability Check: Runs zstd -t on each file to verify it's a valid compressed archive.

If any file fails:

  • Script prints a report of failures
  • Stops immediately - no processing occurs
  • Fix the corrupted files and re-run

This prevents wasting hours processing corrupted data only to get errors at the end.
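
The two checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the cache is shown as a plain dict keyed by filename (the on-disk format of .dbn_integrity is not documented here), and zstd_readable needs the zstd executable on PATH.

```python
import hashlib
import subprocess
from pathlib import Path

def sha256_of(path, block_size=1 << 20):
    """Stream the file through SHA256 so large files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def checksum_with_cache(path, cache):
    """Reuse a cached checksum when present, otherwise compute and store it."""
    key = Path(path).name  # assumption: cache entries are keyed by filename
    if key not in cache:
        cache[key] = sha256_of(path)
    return cache[key]

def zstd_readable(path):
    """Return True if `zstd -t` accepts the file as a valid archive."""
    result = subprocess.run(["zstd", "-t", str(path)], capture_output=True)
    return result.returncode == 0
```

Streaming the hash in fixed-size blocks keeps memory use flat regardless of file size, which matters for multi-gigabyte MBO files.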

2. File Splitting

Large .dbn.zst files are split to manage memory:

  • Threshold: Files under 100MB are processed whole
  • Over 100MB: Split into chunks using dbn --limit

Split limits vary by datatype:

  • MBO/MBP10/TBBO: 10,000 records/chunk
  • Definitions: 10,000 records/chunk
  • All others: 500,000 records/chunk

Split files are stored in {datatype}/splits/{filename}/ and deleted after successful processing.
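
The split decision described above amounts to a size check followed by a per-datatype lookup. A minimal sketch, assuming "100MB" means 100 MiB and that a file of exactly that size is processed whole (the text does not specify either); chunk_limit and the constant names are hypothetical:

```python
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB

# Records per chunk, taken from the list above.
CHUNK_LIMITS = {
    "mbo": 10_000,
    "mbp10": 10_000,
    "tbbo": 10_000,
    "definitions": 10_000,
}
DEFAULT_CHUNK_LIMIT = 500_000  # every other datatype

def chunk_limit(datatype, size_bytes):
    """Return records per chunk, or None when the file is processed whole."""
    if size_bytes <= SIZE_THRESHOLD_BYTES:
        return None
    return CHUNK_LIMITS.get(datatype, DEFAULT_CHUNK_LIMIT)

print(chunk_limit("mbo", 250 * 1024 * 1024))
print(chunk_limit("candles_1m", 250 * 1024 * 1024))
```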

3. Resume Tracking

The scripts track progress to support interruption:

  • .data_done: Lists processed datatype/period combinations (e.g., mbo/20250606)
  • .definitions_done: Lists processed definition filenames
  • .definitions_day_done: Lists processed definition days
  • .splitting_files: Lists files currently being split

On resume:

  1. Script loads the done files
  2. Skips any input files that match entries
  3. Checks for incomplete splits (file in .splitting_files but no .split_done marker)
  4. Clears incomplete splits and resplits them
  5. Continues processing remaining files
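
The skip-what-is-done part of this flow can be sketched as below. It assumes a done file holds one entry per line (for example mbo/20250606), which matches the description above but is not confirmed against the scripts; all three function names are hypothetical.

```python
from pathlib import Path

def load_done(done_file):
    """Read one entry per line into a set; a missing file means nothing is done."""
    path = Path(done_file)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}

def mark_done(done_file, entry):
    """Append an entry once its output has been written successfully."""
    with open(done_file, "a") as fh:
        fh.write(entry + "\n")

def pending(entries, done):
    """Keep only the entries that have not been processed yet, in order."""
    return [entry for entry in entries if entry not in done]
```

Appending an entry only after its output is written means an interrupted period is simply absent from the file and gets processed again on the next run.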

4. RAM Management

Processing uses parallel workers, but RAM is monitored:

  • Buffer: Keeps 6GB free at all times (configurable)
  • Estimates: Each datatype has a default RAM estimate until real measurements come in
  • Pressure Relief: If free RAM drops below the buffer, the script kills the worker processing the latest date, deletes its partial .tmp files, and re-queues that day. The re-queued worker restarts from the day's first chunk and writes a fresh parquet, so there is no risk of duplicates
  • Auto-throttling: Number of concurrent workers adjusts based on available RAM
  • Streaming writes: Workers stream chunks through ParquetWriter one at a time, so only one chunk's worth of data is in memory at once (not the entire day)
  • Stale cleanup: On startup, any orphan .tmp files from previous interrupted runs are deleted

This prevents OOM crashes while maximizing throughput.

After the first file for a datatype is processed, the script learns the actual RAM usage and can adjust future processing accordingly. This is especially important for large files that may exceed initial estimates.
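
The auto-throttling arithmetic can be pictured as: subtract the buffer from available RAM, then see how many per-worker estimates fit. The sketch below is an assumption about the general shape, not the script's actual logic; allowed_workers is a hypothetical name, and in practice the available figure would likely come from psutil.virtual_memory().available.

```python
GIB = 1024 ** 3

def allowed_workers(available_bytes, per_worker_bytes, max_workers, buffer_bytes=6 * GIB):
    """Return how many workers fit while leaving `buffer_bytes` of RAM free."""
    usable = available_bytes - buffer_bytes
    if usable <= 0 or per_worker_bytes <= 0:
        return 0
    return min(max_workers, usable // per_worker_bytes)

# 16 GiB available, 2 GiB estimated per worker, at most 8 workers:
# (16 - 6) / 2 leaves room for 5 workers.
print(allowed_workers(16 * GIB, 2 * GIB, 8))
```

Once a measured per-worker figure replaces the default estimate, the same calculation yields a tighter worker count.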

[Figure: memvcpu.png]

4a. Statistics Mapping Fallback

statistics is the only datatype with a missing-symbol fallback. If DatabentoDataLoader raises ValueError: No raw symbol found for <instrument_id>, the worker:

  1. Re-reads the raw DBN statistics records directly
  2. Writes the affected parquet rows with instrument_id set to that numeric ID as a string
  3. Emits a per-occurrence JSON log in databento/statistics_fallback_logs/

This is a preservation path for degraded source days, not a symbology fix. The fallback keeps the statistics rows available so they can be repaired later with corrected Databento mappings or a post-processing script.
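
The control flow of the fallback can be sketched as a narrow try/except. Everything here is illustrative: load and read_raw are hypothetical callables standing in for the real loader and the direct DBN reader, rows are shown as dicts, and the fields of the JSON trace are invented because the real log format is not documented here.

```python
import json
from pathlib import Path

def load_with_fallback(load, read_raw, source, log_dir):
    """Try the normal loader; on a missing raw symbol, keep rows keyed by numeric ID."""
    try:
        return load(source)
    except ValueError as exc:
        if "No raw symbol found" not in str(exc):
            raise  # unrelated errors still stop the run
        # Preserve every row, with the numeric ID stored as a string.
        rows = [dict(row, instrument_id=str(row["instrument_id"])) for row in read_raw(source)]
        log_dir = Path(log_dir)
        log_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical trace fields, one JSON file per occurrence.
        trace = {"source": str(source), "error": str(exc), "rows": len(rows)}
        (log_dir / (Path(source).name + ".json")).write_text(json.dumps(trace, indent=2))
        return rows
```

Matching on the error message keeps the fallback narrow: any other ValueError propagates as before.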

5. Output Grouping

Files are consolidated into one parquet per trading day so they match NautilusTrader's ParquetDataCatalog expectations:

Datatype     Grouping                  Example
candles_1d   Daily                     20250606.parquet
candles_1h   Daily                     20250606.parquet
candles_1m   Daily                     20250606.parquet
candles_1s   Daily                     20250606.parquet
mbo          Daily                     20250606.parquet
mbp10        Daily                     20250606.parquet
tbbo         Daily                     20250606.parquet
statistics   Daily                     20250606.parquet
status       Daily                     20250606.parquet
definitions  By instrument type + day  FuturesContract/20250606.parquet
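
To make the naming concrete, the sketch below derives a day key from a nanosecond timestamp and builds the matching output path. It assumes the day boundary is the UTC calendar date, which the text does not state (the scripts may define a trading day differently); day_key and output_path are hypothetical names.

```python
from datetime import datetime, timezone
from pathlib import Path

def day_key(ts_ns):
    """Format a nanosecond UNIX timestamp as YYYYMMDD, using the UTC date."""
    seconds = ts_ns // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y%m%d")

def output_path(root, datatype, class_name, ts_ns):
    """Build <root>/<datatype>/<class_name>/<YYYYMMDD>.parquet."""
    return Path(root) / datatype / class_name / f"{day_key(ts_ns)}.parquet"

# Noon UTC on 6 June 2025, expressed in nanoseconds.
ts = int(datetime(2025, 6, 6, 12, tzinfo=timezone.utc).timestamp()) * 1_000_000_000
print(output_path("databento", "mbo", "OrderBookDelta", ts).as_posix())
```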

6. Cleanup

On successful completion with 0 errors, you can choose to delete the inputs and intermediate files:

  • Original .dbn.zst files
  • Split chunks in splits/ directories
  • Progress tracking files

This saves disk space but removes the ability to re-run. Always back up the raw files first.
