Refactor little endian binary decoding into a shared internal utility by Sao-Ali · Pull Request #182 · DataHaskell/dataframe

Sao-Ali · 2026-03-12T17:34:43Z

This PR addresses endian decoding consistency and performance by replacing loop-based byte-to-word decoding with unrolled little-endian helpers, then moving those helpers into a
shared internal binary utility used by both Parquet and Lazy binary paths. This keeps behavior aligned across code paths, removes duplicate implementations, and makes future binary
changes safer because there is one canonical implementation.

Replaced loop-based LE decoding logic with unrolled helper implementation.
Added a shared internal binary utility module as the canonical source.
Updated Parquet binary helpers to call the shared utility.
Updated Lazy binary read paths to call the shared utility.
Kept external behavior/API compatible while removing duplicate logic.
Verified with formatter, lint, full build, and full test suite pass

… module and update Parquet and Lazy binary readers to use the same implementation so byte decoding behavior is consistent across code paths and easier to maintain; this removes duplicated decode logic, keeps existing APIs intact through Parquet.Binary wrappers, adds and preserves endian regression coverage, and confirms compatibility by running formatting, lint, full build, and the complete test suite successfully.

mchav · 2026-03-12T17:58:07Z

src/DataFrame/IO/Parquet/Binary.hs

-            )
-    | otherwise =
-        littleEndianWord32 (BS.take 4 $ bytes `BS.append` BS.pack [0, 0, 0, 0])
+littleEndianWord32 = Binary.littleEndianWord32


We don't need to keep this re export.

mchav · 2026-03-12T17:58:50Z

src/DataFrame/Internal/Binary.hs

+littleEndianWord32 bytes
+    | len >= 4 =
+        assembleWord32
+            (BS.index bytes 0)


Unsafe index is fine here.

mchav · 2026-03-12T17:59:45Z

src/DataFrame/Internal/Binary.hs

+
+byteAtOrZero :: Int -> BS.ByteString -> Int -> Word8
+byteAtOrZero len bytes i
+    | i < len = BS.index bytes i


Unsafe index here.

…y and remove redundant re exports; update parquet modules and tests to import DataFrame.Internal.Binary directly, and fix byteAtOrZero bounds check to avoid unsafe indexing on negative offsets while keeping guarded fast path indexing unchanged

mchav requested changes Mar 12, 2026

View reviewed changes

mchav approved these changes Mar 15, 2026

View reviewed changes

mchav merged commit 86b02b2 into DataHaskell:main Mar 15, 2026
13 of 14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor little endian binary decoding into a shared internal utility#182

Refactor little endian binary decoding into a shared internal utility#182
mchav merged 2 commits intoDataHaskell:mainfrom
Sao-Ali:issue-164-unroll-loops-decode-unsigned-integers

Sao-Ali commented Mar 12, 2026

Uh oh!

mchav Mar 12, 2026

Uh oh!

mchav Mar 12, 2026

Uh oh!

mchav Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Sao-Ali commented Mar 12, 2026

Uh oh!

mchav Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

mchav Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

mchav Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants