Skip to content

Refactor little endian binary decoding into a shared internal utility#182

Merged
mchav merged 2 commits intoDataHaskell:mainfrom
Sao-Ali:issue-164-unroll-loops-decode-unsigned-integers
Mar 15, 2026
Merged

Refactor little endian binary decoding into a shared internal utility#182
mchav merged 2 commits intoDataHaskell:mainfrom
Sao-Ali:issue-164-unroll-loops-decode-unsigned-integers

Conversation

@Sao-Ali
Copy link
Contributor

@Sao-Ali Sao-Ali commented Mar 12, 2026

This PR addresses endian decoding consistency and performance by replacing loop-based byte-to-word decoding with unrolled little-endian helpers, then moving those helpers into a
shared internal binary utility used by both Parquet and Lazy binary paths. This keeps behavior aligned across code paths, removes duplicate implementations, and makes future binary
changes safer because there is one canonical implementation.

  • Replaced loop-based LE decoding logic with unrolled helper implementation.
  • Added a shared internal binary utility module as the canonical source.
  • Updated Parquet binary helpers to call the shared utility.
  • Updated Lazy binary read paths to call the shared utility.
  • Kept external behavior/API compatible while removing duplicate logic.
  • Verified with formatter, lint, full build, and full test suite pass

… module and update Parquet and Lazy binary readers to use the same implementation so byte decoding behavior is

  consistent across code paths and easier to maintain; this removes duplicated decode logic, keeps existing APIs intact through Parquet.Binary wrappers, adds and preserves endian
  regression coverage, and confirms compatibility by running formatting, lint, full build, and the complete test suite successfully.
)
| otherwise =
littleEndianWord32 (BS.take 4 $ bytes `BS.append` BS.pack [0, 0, 0, 0])
littleEndianWord32 = Binary.littleEndianWord32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to keep this re export.

littleEndianWord32 bytes
| len >= 4 =
assembleWord32
(BS.index bytes 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe index is fine here.


byteAtOrZero :: Int -> BS.ByteString -> Int -> Word8
byteAtOrZero len bytes i
| i < len = BS.index bytes i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe index here.

…y and remove redundant re exports; update parquet modules and tests to import DataFrame.Internal.Binary

  directly, and fix byteAtOrZero bounds check to avoid unsafe indexing on negative offsets while keeping guarded fast path indexing unchanged
@mchav mchav merged commit 86b02b2 into DataHaskell:main Mar 15, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants