Convert raw DataBento .dbn.zst files into consolidated Parquet files for use with NautilusTrader backtesting.
git submodule add https://github.com/BurnOutTrader/databento_scripts.git scripts/databento
pip install pyarrow psutil databento-dbn zstandard
You also need zstd installed on your system (for integrity verification):
# Debian/Ubuntu
sudo apt install zstd
# macOS
brew install zstd
# Or install via cargo
cargo install zstd
cargo install dbn
Verify: dbn --version
Edit paths.py to set your data directories:
from pathlib import Path
DATA_DIR = Path("/your/data/storage")
DATABENTO_DIR = DATA_DIR / "databento"
DBN_RAW_DIR = DATA_DIR / "databento_raw"
DEFINITIONS_DIR = DATABENTO_DIR / "definitions"
Create the directories:
mkdir -p /your/data/storage/databento_raw
mkdir -p /your/data/storage/databento
Place your DataBento .dbn.zst files in databento_raw/:
databento_raw/
├── definitions/ # Instrument definitions
├── candles_1d/ # Daily OHLCV
├── candles_1h/ # Hourly OHLCV
├── candles_1m/ # Minute OHLCV
├── candles_1s/ # Second OHLCV
├── mbo/ # Market-by-order
├── mbp10/ # Market-by-price
├── tbbo/ # Top-of-book
├── statistics/
└── status/
cd scripts/databento
python databento_definitions_to_parquet.py
This creates instrument definition files needed to decode market data.
python databento_to_parquet.py
Select which datatypes to process when prompted, or choose all.
For datatypes where you download per-symbol rather than all symbols (e.g. mbo, mbp10):
python databento_symbol_merge.py
- Select the raw data folder(s) to process
- The script scans the files and lists all available symbols
- Select which symbols to merge (e.g. 1,3,8)
New records are merged into existing parquets with exact-row deduplication, so it's safe to run multiple times or add new symbols incrementally.
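The idea behind exact-row deduplication can be sketched in a few lines. This is a conceptual illustration only, not the script's code: the real merge operates on parquet data, and the function name `merge_exact_dedup` and the sample rows are hypothetical.

```python
from typing import Iterable


def merge_exact_dedup(existing: Iterable[tuple], incoming: Iterable[tuple]) -> list[tuple]:
    """Append incoming rows to existing ones, dropping rows identical in every column."""
    seen = set()
    merged = []
    for row in list(existing) + list(incoming):
        if row not in seen:  # a row counts as a duplicate only if all fields match
            seen.add(row)
            merged.append(row)
    return merged


# Hypothetical rows: (ts_event, symbol, price)
existing = [(1, "ESM5", 5300.25), (2, "ESM5", 5300.50)]
incoming = [(2, "ESM5", 5300.50), (3, "ESM5", 5300.75)]

merged = merge_exact_dedup(existing, incoming)
rerun = merge_exact_dedup(merged, incoming)  # running the merge again adds nothing
```

Because a row is dropped only when every field matches, re-running the merge with the same input leaves the result unchanged, which is what makes repeated runs safe.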
Parquet files are saved to databento/:
databento/
├── definitions/
│ └── FuturesContract/
│ └── 20250606.parquet
├── candles_1d/
│ └── Bar/
│ └── 20250606.parquet # Daily
├── mbo/
│ └── OrderBookDelta/
│ └── 20250606.parquet # Daily
└── ...
- Integrity Check: SHA256 verification before processing
- Resume: Tracks progress, safe to interrupt
- RAM Management: Monitors memory, auto-throttles
- Parallel: Uses all CPU cores
- Smart Splitting: Large files split automatically
Stored in databento/ directory:
- .data_done: processed data periods
- .definitions_done: processed definitions
- .dbn_integrity: cached checksums
- .splitting_files: current splits
Delete these to force a re-run.
dbn not found: Add cargo bin to PATH or edit dbn_split.py
Out of memory: Lower WORKER count in the script
Script fails: Check the error output - integrity check will report corrupted files
Resume after crash: Just re-run, it picks up where it left off
CME/GLBX statistics missing raw-symbol mappings: On some CME/GLBX historical days, statistics records carry a numeric instrument_id that is present in the payload but missing from the DBN metadata raw-symbol mapping table and from nearby definitions files. The ID appears in the statistics file, but CME does not appear to have ever published an instrument definition for it. For statistics only, databento_to_parquet.py falls back to writing the record with instrument_id="<numeric id>", so the data is preserved for later repair instead of stopping the whole day. Each fallback occurrence writes a JSON trace file to databento/statistics_fallback_logs/.
The scripts convert compressed .dbn.zst files from DataBento into Apache Parquet format. This is necessary because:
- NautilusTrader reads Parquet natively
- Parquet is columnar and efficient for backtesting
- Raw DBN files are too slow for iterative research
.dbn.zst → Integrity Check → Split (if large) → Process to Parquet → Consolidate by period
Before processing any file, the scripts verify integrity:
- SHA256 Checksum: Computed for each file and cached in .dbn_integrity. On subsequent runs, cached checksums are reused - only new files need computation.
- Readability Check: Runs zstd -t on each file to verify it's a valid compressed archive.
If any file fails:
- Script prints a report of failures
- Stops immediately - no processing occurs
- Fix the corrupted files and re-run
This prevents wasting hours processing corrupted data only to get errors at the end.
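The two checks above might look roughly like the following. This is a minimal sketch, not the script's actual code: the JSON cache layout and the function names are assumptions, and only `zstd -t` (the CLI's test mode) is taken from the description above.

```python
import hashlib
import json
import shutil
import subprocess
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large archives are never fully loaded."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_checksum(path: Path, cache_file: Path) -> str:
    """Return the cached checksum for path, computing and storing it if absent.

    Assumption: the cache is a JSON object keyed by filename. The real
    .dbn_integrity format may differ.
    """
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    key = path.name
    if key not in cache:
        cache[key] = sha256_of(path)
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[key]


def is_readable_zstd(path: Path) -> bool:
    """Run `zstd -t`; exit code 0 means the archive decompresses cleanly."""
    if shutil.which("zstd") is None:
        raise RuntimeError("zstd CLI not found on PATH")
    result = subprocess.run(["zstd", "-t", str(path)], capture_output=True)
    return result.returncode == 0
```

Streaming the hash in fixed-size blocks keeps memory flat regardless of file size, and the cache lookup is what lets later runs skip files that were already hashed.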
Large .dbn.zst files are split to manage memory:
- Threshold: Files under 100MB are processed whole
- Over 100MB: Split into chunks using dbn --limit
Split limits vary by datatype:
- MBO/MBP10/TBBO: 10,000 records/chunk
- Definitions: 10,000 records/chunk
- All others: 500,000 records/chunk
Split files are stored in {datatype}/splits/{filename}/ and deleted after successful processing.
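The split rules above can be expressed as a small lookup. The sketch below only mirrors the numbers and the directory pattern stated in this section; the names are hypothetical, the root directory is a placeholder parameter, and treating "100MB" as 100 MiB (with a file of exactly that size being split) is an assumption.

```python
from pathlib import Path

# Assumption: "100MB" is interpreted as 100 MiB
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024

# Records per chunk for the datatypes listed above; everything else uses the default
CHUNK_LIMITS = {"mbo": 10_000, "mbp10": 10_000, "tbbo": 10_000, "definitions": 10_000}
DEFAULT_CHUNK_LIMIT = 500_000


def chunk_limit(datatype: str) -> int:
    """Return the records-per-chunk limit for a datatype."""
    return CHUNK_LIMITS.get(datatype, DEFAULT_CHUNK_LIMIT)


def needs_split(size_bytes: int) -> bool:
    """Files under the threshold are processed whole."""
    return size_bytes >= SIZE_THRESHOLD_BYTES


def split_dir(root: Path, datatype: str, filename: str) -> Path:
    """Build {datatype}/splits/{filename}/ under a caller-supplied root."""
    return root / datatype / "splits" / filename
```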
The scripts track progress to support interruption:
- .data_done: Lists processed datatype/period combinations (e.g., mbo/20250606)
- .definitions_done: Lists processed definition filenames
- .definitions_day_done: Lists processed definition days
- .splitting_files: Lists files currently being split
On resume:
- Script loads the done files
- Skips any input files that match entries
- Checks for incomplete splits (file in .splitting_files but no .split_done marker)
- Clears incomplete splits and re-splits them
- Continues processing remaining files
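The skip-on-resume step can be sketched as follows. This assumes a done file holding one entry per line, which is a guess about the on-disk format, and the function names are hypothetical.

```python
from pathlib import Path


def load_done(done_file: Path) -> set[str]:
    """Read completed keys; a missing file means nothing has been processed yet."""
    if not done_file.exists():
        return set()
    return {line.strip() for line in done_file.read_text().splitlines() if line.strip()}


def mark_done(done_file: Path, key: str) -> None:
    """Append a key once its output has been fully written."""
    with done_file.open("a") as fh:
        fh.write(key + "\n")


def pending(keys: list[str], done_file: Path) -> list[str]:
    """Return the keys that still need processing, preserving input order."""
    done = load_done(done_file)
    return [k for k in keys if k not in done]
```

Appending a key only after the corresponding output is complete is what makes an interrupted run safe to repeat: anything not yet recorded is simply processed again.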
Processing uses parallel workers, but RAM is monitored:
- Buffer: Keeps 6GB free at all times (configurable)
- Estimates: Each datatype has a default RAM estimate until real measurements come in
- Pressure Relief: If free RAM drops below the buffer, the script kills the worker processing the latest date, deletes its partial .tmp files, and re-queues it. The re-queued worker restarts from the first chunk of that day and writes a fresh parquet, so there is no risk of duplicates
- Auto-throttling: Number of concurrent workers adjusts based on available RAM
- Streaming writes: Workers stream chunks through ParquetWriter one at a time, so only one chunk's worth of data is in memory at once (not the entire day)
- Stale cleanup: On startup, any orphan .tmp files from previous interrupted runs are deleted
This prevents OOM crashes while maximizing throughput.
After the first file for a datatype is processed, the script learns the actual RAM usage and can adjust future processing accordingly. This is especially important for large files that may exceed initial estimates.
statistics is the only datatype with a missing-symbol fallback. If DatabentoDataLoader raises ValueError: No raw symbol found for <instrument_id>, the worker:
- Re-reads the raw DBN statistics records directly
- Writes the affected parquet rows with instrument_id set to that numeric ID as a string
- Emits a per-occurrence JSON log in databento/statistics_fallback_logs/
This is a preservation path for degraded source days, not a symbology fix. The fallback keeps the statistics rows available so they can be repaired later with corrected Databento mappings or a post-processing script.
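The control flow of the fallback can be sketched like this. The loader and raw reader are passed in as plain callables so the example stays self-contained; the function name, the trace fields, and the log filename are all hypothetical, and only the error message and the string-valued instrument_id come from the description above.

```python
import json
from pathlib import Path
from typing import Callable


def load_with_fallback(load: Callable[[], list],
                       read_raw: Callable[[], list[dict]],
                       log_dir: Path,
                       source_name: str) -> list:
    """Try the normal loader; on a missing-symbol error, keep the raw rows instead."""
    try:
        return load()
    except ValueError as exc:
        if "No raw symbol found" not in str(exc):
            raise  # unrelated errors must still surface
        # Preserve each record, storing the numeric ID as a string
        rows = [{**rec, "instrument_id": str(rec["instrument_id"])} for rec in read_raw()]
        log_dir.mkdir(parents=True, exist_ok=True)
        trace = {"source": source_name, "error": str(exc), "rows": len(rows)}
        (log_dir / f"{source_name}.json").write_text(json.dumps(trace, indent=2))
        return rows
```

Matching on the error message keeps the fallback narrow: any other ValueError still stops processing rather than being silently absorbed.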
Files are consolidated into one parquet per trading day so they match
NautilusTrader's ParquetDataCatalog expectations:
| Datatype | Grouping | Example |
|---|---|---|
| candles_1d | Daily | 20250606.parquet |
| candles_1h | Daily | 20250606.parquet |
| candles_1m | Daily | 20250606.parquet |
| candles_1s | Daily | 20250606.parquet |
| mbo | Daily | 20250606.parquet |
| mbp10 | Daily | 20250606.parquet |
| tbbo | Daily | 20250606.parquet |
| statistics | Daily | 20250606.parquet |
| status | Daily | 20250606.parquet |
| definitions | By instrument type + day | FuturesContract/20250606.parquet |
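The naming scheme in the table, combined with the output layout shown earlier, amounts to a simple path rule. The helper below is a hypothetical illustration of that rule, not a function from the scripts.

```python
from datetime import date
from pathlib import Path


def daily_parquet_path(root: Path, datatype: str, data_class: str, day: date) -> Path:
    """Build {root}/{datatype}/{data_class}/{YYYYMMDD}.parquet for one trading day."""
    return root / datatype / data_class / f"{day:%Y%m%d}.parquet"


path = daily_parquet_path(Path("databento"), "mbo", "OrderBookDelta", date(2025, 6, 6))
```

Here `path` is `databento/mbo/OrderBookDelta/20250606.parquet`, matching the example tree in the output section.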
On successful completion with 0 errors, you can choose to delete raw source files:
- Original .dbn.zst files
- Split chunks in splits/ directories
- Progress tracking files
This saves disk space but removes the ability to re-run. Always back up raw files first.
