Read fixed size strings without padding characters #692

Open
FlyingSamson wants to merge 5 commits into ess-dmsc:master from FlyingSamson:fixed-size-strings
@FlyingSamson
Contributor

This is one way to address the current issue that reading a fixed-size string into a std::string also returns all of its padding characters.

Trimming will certainly introduce some overhead compared to returning the string untrimmed (I did not run any benchmarks, though), and that overhead needs to be considered.
On the other hand, when the caller actually wants the trimmed string, forcing them to trim it themselves after it has been created might cost even more.
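
To make the trade-off concrete, here is a minimal sketch of trimming at read time. It is not the code in this PR: `trim_fixed_size` is a hypothetical helper, and it assumes the padding is a run of one character at the end of the buffer (`'\0'` by default, or `' '` for space-padded strings).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of this PR): build a std::string from a
// fixed-size character buffer, dropping the trailing padding characters.
// Assumption: padding is a run of `padding` characters at the end.
std::string trim_fixed_size(const char* buffer, std::size_t size,
                            char padding = '\0') {
  std::size_t length = size;
  // Walk back from the end until a non-padding character is found.
  while (length > 0 && buffer[length - 1] == padding) {
    --length;
  }
  // Copy only the meaningful prefix; no second trimming pass is needed.
  return std::string(buffer, length);
}
```

Trimming before the std::string is constructed costs one backward scan over the padding, whereas trimming afterwards means the caller first pays for copying the full padded buffer.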

Resolves #661

Also relevant for #215 and #224


AND_GIVEN("an instance of a smaller string") {
std::string write = "hell";
THEN("we have to resize the string to the appropriate size") {
Contributor Author

Not entirely sure why this was required in the first place.

From what I understand, the code will write shorter strings just fine. This line merely ensured that, in the REQUIRE_THAT below, write also contained trailing \0 characters.

@ggoneiESS
Member

I don't quite have the mental capacity to deal with this on a Friday evening, but I would refer you to my comments on a different PR:

ess-dmsc/streaming-data-types#102 (comment)

- variable-length datasets cannot be compressed
- the data no longer exists contiguously (it necessarily becomes an array of pointers to strings, rather than just raw data)

And (academic but technical arguments)

- heap storage requires more space than regular 'raw data' storage (i.e. how the HDF5 object exists in memory)
- general reduction in I/O efficiency because it requires individual write operations for each data element rather than one write per dataset chunk (actually, chunking isn't allowed at all)

Performance is definitely at a premium vs. storage.
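
The layout difference described in the list above can be sketched in plain C++. This only illustrates the two in-memory shapes and makes no HDF5 calls; `pack_fixed` and `as_pointers` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Fixed-size layout: every string occupies `record_size` bytes in a single
// contiguous buffer, padded with '\0'. Longer strings are truncated here.
std::vector<char> pack_fixed(const std::vector<std::string>& values,
                             std::size_t record_size) {
  std::vector<char> buffer(values.size() * record_size, '\0');
  for (std::size_t i = 0; i < values.size(); ++i) {
    const std::size_t n = std::min(values[i].size(), record_size);
    std::copy_n(values[i].data(), n, buffer.data() + i * record_size);
  }
  return buffer;
}

// Variable-length layout: an array of pointers, one per string. The
// pointers stay valid only while `values` is alive and unmodified.
std::vector<const char*> as_pointers(const std::vector<std::string>& values) {
  std::vector<const char*> pointers;
  pointers.reserve(values.size());
  for (const auto& value : values) {
    pointers.push_back(value.c_str());
  }
  return pointers;
}
```

The fixed-size buffer can be handed over as one block of raw data, while the pointer array requires following each pointer to reach the characters. The price of the fixed layout is the padding, which is what this PR proposes to strip on read.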

I found this via the HDF5 clinic: https://steven-varga.ca/blog/hdf5-fixed-vs-variable-benchmark/. It provides a C++ file, which it might be possible to incorporate into a filewriter test.

Surely if it's likely to be a different length, wouldn't it be better to use VarLengthString instead anyway?

Development

Successfully merging this pull request may close these issues.

Fixed-size string attributes include \0 in resulting std::string

2 participants