Description
In DataFrame.IO.Parquet.Thrift, we currently use ColumnMetaData and FileMetadata (whose names make it clear that they are metadata), as well as ColumnChunk and RowGroup:

    data ColumnChunk = ColumnChunk
        { columnChunkFilePath :: String
        , columnChunkMetadataFileOffset :: Int64
        , columnMetaData :: ColumnMetaData
        , columnChunkOffsetIndexOffset :: Int64
        , columnChunkOffsetIndexLength :: Int32
        , columnChunkColumnIndexOffset :: Int64
        , columnChunkColumnIndexLength :: Int32
        , cryptoMetadata :: ColumnCryptoMetadata
        , encryptedColumnMetadata :: BS.ByteString
        }
        deriving (Show, Eq)

    data RowGroup = RowGroup
        { rowGroupColumns :: [ColumnChunk]
        , totalByteSize :: Int64
        , rowGroupNumRows :: Int64
        , rowGroupSortingColumns :: [SortingColumn]
        , fileOffset :: Int64
        , totalCompressedSize :: Int64
        , ordinal :: Int16
        }
        deriving (Show, Eq)

The issue:
These records do not contain the actual bytes; they are metadata too, but their names suggest otherwise. Likewise, readRowGroup and readColumnChunk suggest that they load an entire row group / column chunk into memory as a RowGroup / ColumnChunk. (This confused me a bit.)
It would be conceptually cleaner to rename them so that the names indicate they are just metadata (and contain no actual bytes).
The risk is that the module currently has no export control and is an exposed module, so a rename would change the public API.
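To make the trade-off concrete, here is a minimal sketch of one way the rename could be softened. The name RowGroupMetadata, the reduced field list, and the helper summarizeRowGroup are all hypothetical and only for illustration; the deprecated type synonym is one possible mitigation, not something the library does today.

```haskell
import Data.Int (Int64)

-- Hypothetical new name: describes a row group without holding its bytes.
-- Only a subset of the original fields is shown to keep the sketch short.
data RowGroupMetadata = RowGroupMetadata
    { rowGroupNumRows :: Int64
    , totalByteSize :: Int64
    , fileOffset :: Int64
    }
    deriving (Show, Eq)

-- The old type name survives as a synonym, so existing type signatures
-- that mention RowGroup keep compiling.
type RowGroup = RowGroupMetadata

-- Importing modules that still use the old name get a compiler warning.
{-# DEPRECATED RowGroup "Use RowGroupMetadata instead" #-}

-- Hypothetical helper written against the old name, to show that the
-- synonym is interchangeable with the new type.
summarizeRowGroup :: RowGroup -> String
summarizeRowGroup rg =
    show (rowGroupNumRows rg) ++ " rows, " ++ show (totalByteSize rg) ++ " bytes"
```

Note that a type synonym only covers the type name. Code that builds or pattern-matches with the old RowGroup data constructor would still need updating, so this reduces the breakage rather than removing it.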
This is a simple change, and I can submit a PR if needed. Alternatively, we can add Haddock documentation to the readColumnChunk / readRowGroup functions and their data types to reduce confusion.
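As a rough illustration of the documentation-only option, the sketch below adds Haddock comments to a trimmed-down copy of ColumnChunk. The record is cut to three fields so the example stands alone, and the comment wording is my own suggestion, with field meanings inferred from the field names rather than taken from the library.

```haskell
import Data.Int (Int32, Int64)

{- | Metadata describing a single column chunk.

This record stores locations and sizes only. It does not hold the
chunk's page data, so constructing one does not load the chunk into memory.
-}
data ColumnChunk = ColumnChunk
    { columnChunkFilePath :: String
    -- ^ Path of the file that stores the chunk's data.
    , columnChunkOffsetIndexOffset :: Int64
    -- ^ Byte offset of the chunk's offset index (a location, not the index itself).
    , columnChunkOffsetIndexLength :: Int32
    -- ^ Length in bytes of the chunk's offset index.
    }
    deriving (Show, Eq)
```

A similar one-line note on readColumnChunk and readRowGroup, stating that they decode metadata rather than page data, would cover the functions as well; their signatures are not shown here, so they are left out of the sketch.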