Streamline parcels-benchmarks#42

Open
VeckoTheGecko wants to merge 34 commits into main from improvements

Conversation

@VeckoTheGecko
Contributor

This PR reworks parcels-benchmarks in a way that (I hope) is much easier to work with. Follow the README and let me know what you think.

Changes:

  • Replaces the parcels_benchmarks internal package (which provided the CLI tool for adding dataset hashes etc.). Now instead:

    • An intake-xarray catalog is defined in catalogs/parcels-benchmarks/catalog.yml. The top of the file has a comment which contains the link to the ZIP to be downloaded.
      • This streamlines our approach, making it easier for the benchmarking scripts to go straight from data on disk to xarray dataset.
      • We can use other options available via intake.
      • This approach allows us to get familiar with intake which will likely be used for our HPC systems after v4 is released.
    • A script (scripts/download-catalog.py) downloads the data for a catalog and takes an output_dir (both via CLI args). This uses curl to download the dataset, then unzips all nested zip files (deleting the original zips). The script also copies the catalog file into the output_dir (which matters since the datasets in the catalog are defined relative to the catalog file).
      • If a catalog is already downloaded (i.e., if the folder already exists), it's skipped.
      • Pro: The use of curl makes this approach quite transparent: one can easily see download speeds and decide whether to cancel.
      • Con: There is no longer the concept of "known hashes"; this is something we can bring back in the future if we want [1]
    • Pixi is used, via the setup-data task, to download all the datasets.
      • This makes our data approach much more flexible should we want to change it in future
  • Requires a PARCELS_BENCHMARKS_DATA_FOLDER environment variable to be explicitly set, which then acts as the working space for the data. This environment variable is used in both the download and benchmarking code.
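To make the flow above concrete, here is a hedged sketch of how a benchmarking script might resolve the environment variable and open the copied catalog. `data_dir` is a hypothetical helper (not from the repo), the catalog path and dataset name are illustrative, and the intake calls are the standard `intake.open_catalog` / `.to_dask()` API:

```python
# Sketch: resolve the required PARCELS_BENCHMARKS_DATA_FOLDER environment
# variable, then open the catalog that download-catalog.py copied alongside
# the data. Helper and dataset names are illustrative, not from the repo.
import os
from pathlib import Path


def data_dir() -> Path:
    """Return the benchmark data folder, failing loudly if the env var is unset."""
    try:
        return Path(os.environ["PARCELS_BENCHMARKS_DATA_FOLDER"])
    except KeyError:
        raise RuntimeError(
            "Set PARCELS_BENCHMARKS_DATA_FOLDER before downloading or benchmarking"
        ) from None


# In a benchmarking script (requires intake and intake-xarray):
#   import intake
#   cat = intake.open_catalog(str(data_dir() / "parcels-benchmarks" / "catalog.yml"))
#   ds = cat["some_dataset"].to_dask()  # lazily-loaded xarray dataset
```

In practice the download itself is driven through the Pixi setup-data task, so the environment variable has to be set before invoking it.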

We needed the following things to ease development:

  • Download all datasets before running benchmarks

  • Make the download progress of datasets transparent

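The nested-unzip step described above can be sketched with the standard library alone; `unzip_nested` is a hypothetical name and the real scripts/download-catalog.py may differ in detail:

```python
# Sketch of the nested-unzip step: after curl fetches the top-level ZIP,
# extract every .zip found under the output directory next to itself,
# delete the archive, and repeat until no zip files remain.
# (unzip_nested is an assumed name, not taken from the repo.)
import zipfile
from pathlib import Path


def unzip_nested(root: Path) -> None:
    """Extract all .zip files under root in place, deleting each archive."""
    while zips := list(root.rglob("*.zip")):
        for zip_path in zips:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(zip_path.parent)
            zip_path.unlink()  # remove the original archive
```

Looping until no zips remain handles arbitrarily deep nesting (a zip inside a zip) without recursion.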
Footnotes

  1. Given we are the sole owners of our data sources, I don't think this is a concern.

@VeckoTheGecko
Contributor Author

Not all the benchmarks are running yet. Once this is merged I'll fix the rest in #40.

Let me know what you think of this @fluidnumerics-joe

@VeckoTheGecko
Contributor Author

Oh, and since Parcels is now a submodule, I think you'll need to run `git submodule update --init --recursive` (if you aren't doing a fresh clone per the README).
