Validate downloaded file integrity and raise ValueError on hash mismatch#8833
Validate downloaded file integrity and raise ValueError on hash mismatch#8833e-mny wants to merge 15 commits intoProject-MONAI:devfrom
Conversation
for more information, see https://pre-commit.ci
DCO Remediation Commit for Enoch Mok <enochmokny@gmail.com> I, Enoch Mok <enochmokny@gmail.com>, hereby add my Signed-off-by to this commit: d02da49 Signed-off-by: Enoch Mok <enochmokny@gmail.com>
for more information, see https://pre-commit.ci
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
…tch; used ./runtests.sh autofix Signed-off-by: Enoch Mok <enochmokny@gmail.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe pull request modifies Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
monai/apps/utils.py (2)
263-276:⚠️ Potential issue | 🟠 MajorValidate the temporary file before caching it.
The file is moved to
filepathbefore the hash check, so a mismatch leaves the corrupted/tampered file in the cache and later retries hit the bad existing file.Proposed fix
- file_dir = filepath.parent - if file_dir: - os.makedirs(file_dir, exist_ok=True) - shutil.move(f"{tmp_name}", f"{filepath}") # copy the downloaded to a user-specified cache. + if not check_hash(tmp_name, hash_val, hash_type): + raise ValueError( + f"{hash_type} hash check of downloaded file failed: URL={url}, " + f"filepath={filepath}, expected {hash_type}={hash_val}, " + f"The file may be corrupted or tampered with. " + "Please retry the download or verify the source." + ) + file_dir = filepath.parent + if file_dir: + os.makedirs(file_dir, exist_ok=True) + shutil.move(f"{tmp_name}", f"{filepath}") # copy the downloaded to a user-specified cache. except (PermissionError, NotADirectoryError): # project-monai/monai issue `#3613` `#3757` for windows pass logger.info(f"Downloaded: {filepath}") - if not check_hash(filepath, hash_val, hash_type): - raise ValueError( - f"{hash_type} hash check of downloaded file failed: URL={url}, " - f"filepath={filepath}, expected {hash_type}={hash_val}, " - f"The file may be corrupted or tampered with. " - "Please retry the download or verify the source." - )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 263 - 276, The code moves the downloaded temporary file (tmp_name) to the final cache path (filepath) before calling check_hash, which can leave a corrupted file cached; instead, run check_hash against the temporary file (use check_hash(tmp_name, hash_val, hash_type)), only create the target directory and shutil.move(tmp_name, filepath) after the hash passes, and if the hash fails delete the temporary file and raise the ValueError (preserve the same error message). Update the logic around tmp_name, filepath and the check_hash call accordingly in the same function where tmp_name and filepath are used so that corrupted downloads are never moved into the cache.
215-224:⚠️ Potential issue | 🟡 MinorDocument the new
ValueErrorcontract.Line 223 still documents
RuntimeError, but downloaded hash mismatches now raiseValueError.Proposed fix
- RuntimeError: When the hash validation of the ``url`` downloaded file fails. + ValueError: When the hash validation of the ``url`` downloaded file fails.As per coding guidelines, "Docstrings should be present for all definition which describe each variable, return value, and raised exception in the appropriate section of the Google-style of docstrings."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 215 - 224, Update the docstring Raises section for the download routine in monai/apps/utils.py so it documents the new ValueError contract: replace the RuntimeError entry that states "When the hash validation of the ``url`` downloaded file fails." (and any other RuntimeError entry referring to downloaded hash mismatch) with ValueError, and ensure the entry that mentions the existing ``filepath`` hash validation also reflects the correct exception type if it changed; keep other urllib-related exceptions (URLError, HTTPError, ContentTooShortError, IOError) as-is and keep messaging consistent with the parameters ``filepath`` and ``url``.tests/test_utils.py (1)
78-87:⚠️ Potential issue | 🔴 CriticalDo not skip or swallow
ValueErrorhash mismatches.This defeats the new
ValueErrorbehavior, and non-matchingValueErrors fall through silently because the branch does not re-raise.Proposed fix
DOWNLOAD_FAIL_MSGS = ( "unexpected EOF", # incomplete download "network issue", "gdown dependency", # gdown not installed - "hash check", # check hash value of downloaded file "limit", # HTTP Error 503: Egress is over the account limit "authenticate", "timed out", # urlopen error [Errno 110] Connection timed out "HTTPError", # HTTPError: 429 Client Error: Too Many Requests for huggingface hub ) @@ - except ValueError as v_e: - if "hash check" in str(v_e): - raise unittest.SkipTest(f"Hash value error while downloading: {v_e}") from v_eAlso applies to: 179-187
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_utils.py` around lines 78 - 87, The code currently swallows ValueError exceptions when matching messages in DOWNLOAD_FAIL_MSGS; change the exception handling where DOWNLOAD_FAIL_MSGS is consulted so that if the caught exception is a ValueError it is re-raised immediately (do not treat it as a benign download failure), while preserving the existing fallback logic for other exception types; update both places that check DOWNLOAD_FAIL_MSGS (the exception handlers that inspect exception messages around the DOWNLOAD_FAIL_MSGS use) to first if isinstance(e, ValueError): raise before matching messages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 198-218: The test asserts the wrong exception type: update
test_download_url to expect ValueError (not RuntimeError) when download_url
detects a hash mismatch, and also update the test's docstring "Raises:" section
to list ValueError instead of RuntimeError; locate the assertion block that
calls download_url with a bad hash and change the exception class accordingly
and adjust the docstring text to reflect ValueError as the raised exception from
download_url.
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 263-276: The code moves the downloaded temporary file (tmp_name)
to the final cache path (filepath) before calling check_hash, which can leave a
corrupted file cached; instead, run check_hash against the temporary file (use
check_hash(tmp_name, hash_val, hash_type)), only create the target directory and
shutil.move(tmp_name, filepath) after the hash passes, and if the hash fails
delete the temporary file and raise the ValueError (preserve the same error
message). Update the logic around tmp_name, filepath and the check_hash call
accordingly in the same function where tmp_name and filepath are used so that
corrupted downloads are never moved into the cache.
- Around line 215-224: Update the docstring Raises section for the download
routine in monai/apps/utils.py so it documents the new ValueError contract:
replace the RuntimeError entry that states "When the hash validation of the
``url`` downloaded file fails." (and any other RuntimeError entry referring to
downloaded hash mismatch) with ValueError, and ensure the entry that mentions
the existing ``filepath`` hash validation also reflects the correct exception
type if it changed; keep other urllib-related exceptions (URLError, HTTPError,
ContentTooShortError, IOError) as-is and keep messaging consistent with the
parameters ``filepath`` and ``url``.
In `@tests/test_utils.py`:
- Around line 78-87: The code currently swallows ValueError exceptions when
matching messages in DOWNLOAD_FAIL_MSGS; change the exception handling where
DOWNLOAD_FAIL_MSGS is consulted so that if the caught exception is a ValueError
it is re-raised immediately (do not treat it as a benign download failure),
while preserving the existing fallback logic for other exception types; update
both places that check DOWNLOAD_FAIL_MSGS (the exception handlers that inspect
exception messages around the DOWNLOAD_FAIL_MSGS use) to first if isinstance(e,
ValueError): raise before matching messages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9b3e7460-f1d0-4896-8762-66d40df2812f
📒 Files selected for processing (3)
monai/apps/utils.pytests/test_utils.pytests/testing_data/data_config.json
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/utils.py (1)
258-272:⚠️ Potential issue | 🔴 Critical🐛 Critical: hash check runs against the wrong path — always fails for new downloads.
At line 262 you call
check_hash(filepath, ...), but the downloaded bytes live attmp_name.filepathisn't created until theshutil.moveon line 272.check_hashwill hit itsopen(...)exceptbranch, log an error, and returnFalse, so every successfully-downloaded file will raiseValueError("... hash check of downloaded file failed ...")regardless of whether the hash actually matches.The reason the new
TestDownloadUrl.test_download_url"passes" today is thatskip_if_downloading_fails()converts this ValueError into aSkipTest— the success path isn't actually being exercised.🔧 Proposed fix
if not tmp_name.exists(): raise RuntimeError( f"Download of file from {url} to {filepath} failed due to network issue or denied permission." ) - if not check_hash(filepath, hash_val, hash_type): + if not check_hash(tmp_name, hash_val, hash_type): raise ValueError( f"{hash_type} hash check of downloaded file failed: URL={url}, " f"filepath={filepath}, expected {hash_type}={hash_val}, " f"The file may be corrupted or tampered with. " "Please retry the download or verify the source." ) file_dir = filepath.parent if file_dir: os.makedirs(file_dir, exist_ok=True) shutil.move(f"{tmp_name}", f"{filepath}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 258 - 272, The hash check is being done against filepath which doesn't exist yet; change the code to run check_hash on the temporary download path tmp_name (i.e., call check_hash(tmp_name, hash_val, hash_type)) before creating/moving to filepath, then ensure the destination directory (file_dir) is created and finally shutil.move(tmp_name, filepath); update any related error messages if needed to reference the tmp_name when reporting the downloaded file being validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 212-218: The failing test is network-fragile because the second
download_url call expecting a ValueError can instead raise network errors; wrap
the bad-hash scenario in the same network-skip guard or separate network from
hash checking: first call download_url under skip_if_downloading_fails() with
the correct hash to ensure the file is present, then call the pure hash checker
(e.g., check_hash or whatever validation routine used by download_url) against
that file to assert it raises ValueError for a bad hash, or re-enter
skip_if_downloading_fails() around the bad-hash download and convert network
exceptions to skips while asserting that a returned/created file triggers the
ValueError path. Ensure you reference download_url and skip_if_downloading_fails
(and the module's hash validation function) when making the change.
- Around line 185-187: The except ValueError handler in tests/test_utils.py
currently only raises SkipTest when the exception message contains "hash check"
and otherwise swallows the ValueError; modify the except ValueError as v_e block
to re-raise v_e when the message does not indicate a hash-check failure (i.e.,
keep the unittest.SkipTest(f"Hash value error while downloading: {v_e}") from
v_e branch for "hash check" and add a plain raise v_e for all other cases) so
genuine ValueErrors from the test body are propagated.
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 258-272: The hash check is being done against filepath which
doesn't exist yet; change the code to run check_hash on the temporary download
path tmp_name (i.e., call check_hash(tmp_name, hash_val, hash_type)) before
creating/moving to filepath, then ensure the destination directory (file_dir) is
created and finally shutil.move(tmp_name, filepath); update any related error
messages if needed to reference the tmp_name when reporting the downloaded file
being validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3be07b5-74d1-45d2-abef-997ad1617025
📒 Files selected for processing (2)
monai/apps/utils.pytests/test_utils.py
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/utils.py (1)
215-223:⚠️ Potential issue | 🟡 MinorUpdate the stale raised-exception docs.
Line 216 still says existing-file hash validation raises
RuntimeError, but Line 231 now raisesValueError.Proposed fix
Raises: - RuntimeError: When the hash validation of the ``filepath`` existing file fails. + ValueError: When the hash validation of the ``filepath`` existing file fails. RuntimeError: When a network issue or denied permission prevents the file download from ``url`` to ``filepath``.As per coding guidelines, "Docstrings should be present for all definition which describe each variable, return value, and raised exception in the appropriate section of the Google-style of docstrings."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 215 - 223, The docstring's Raises section is inconsistent: it still claims a RuntimeError is raised for an existing-file hash mismatch (refers to ``filepath``) while the implementation now raises ValueError for the downloaded-URL hash check (refers to ``url``). Edit the Raises block in the function's Google-style docstring (the one mentioning filepath and url) so the exception types and descriptions match the current behavior: replace the stale RuntimeError mention with the correct exception type(s) and ensure the entries for the filepath/hash check and the url/download/hash check accurately state ValueError (or RuntimeError only where actually raised), and keep other URL/HTTP/IO exceptions unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@monai/apps/utils.py`:
- Around line 231-233: Tests expecting a RuntimeError for hash-mismatch should
be updated to expect ValueError and the revised message text; change the
assertions in tests/apps/test_download_and_extract.py (the block that calls
download_url(... wrong_md5) and the other occurrence around the same check) to
catch ValueError instead of RuntimeError and assert the message contains "md5
hash check" (or more generically "{hash_type} hash check") to match the new
raise in monai/apps/utils.py (the ValueError raised in the hash check path).
In `@tests/test_utils.py`:
- Around line 213-215: The test currently calls
download_url(filepath=model_path, hash_val="0"*64,
hash_type=SAMPLE_TIFF_HASH_TYPE) without the required url, causing a TypeError
before the intended hash check; update the test to pass a valid url (e.g., the
same model_path or a known test URL) as the url argument when calling
download_url so the function runs to the hash validation and raises ValueError
as expected (adjust the call in tests/test_utils.py where download_url is
invoked inside the with self.assertRaises(ValueError) block).
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 215-223: The docstring's Raises section is inconsistent: it still
claims a RuntimeError is raised for an existing-file hash mismatch (refers to
``filepath``) while the implementation now raises ValueError for the
downloaded-URL hash check (refers to ``url``). Edit the Raises block in the
function's Google-style docstring (the one mentioning filepath and url) so the
exception types and descriptions match the current behavior: replace the stale
RuntimeError mention with the correct exception type(s) and ensure the entries
for the filepath/hash check and the url/download/hash check accurately state
ValueError (or RuntimeError only where actually raised), and keep other
URL/HTTP/IO exceptions unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a321ffa5-076e-4483-8541-5bec4e0da138
📒 Files selected for processing (2)
monai/apps/utils.pytests/test_utils.py
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_utils.py (1)
78-83:⚠️ Potential issue | 🟠 MajorDo not suppress hash-integrity failures here.
This reintroduces the behavior the PR is meant to avoid: hash mismatches become
SkipTestinstead of visible failures underskip_if_downloading_fails().Proposed fix
DOWNLOAD_FAIL_MSGS = ( "unexpected EOF", # incomplete download "network issue", "gdown dependency", # gdown not installed - "hash check", # check hash value of downloaded file "limit", # HTTP Error 503: Egress is over the account limit "authenticate", "timed out", # urlopen error [Errno 110] Connection timed out "HTTPError", # HTTPError: 429 Client Error: Too Many Requests for huggingface hub ) @@ - except ValueError as v_e: - if "hash check" in str(v_e): - raise unittest.SkipTest(f"Hash value error while downloading: {v_e}") from v_e - - raise v_e - -Also applies to: 179-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_utils.py` around lines 78 - 83, The DOWNLOAD_FAIL_MSGS list used by skip_if_downloading_fails() is incorrectly treating hash-integrity failures as transient download failures; remove the "hash check" entry from DOWNLOAD_FAIL_MSGS so hash mismatches raise test failures rather than being converted to SkipTest, and make the same change to the other identical failure-message tuple used later in the file (the second DOWNLOAD_FAIL_MSGS-style list referenced by skip_if_downloading_fails()). Ensure skip_if_downloading_fails() continues to match only genuine download/network/gdown/limit messages and not integrity/hash errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 197-217: The test reuses model_path so download_url sees the
existing file and skips re-downloading, preventing the hash-mismatch path from
being exercised; update TestDownloadUrl.test_download_url to either delete
model_path (os.remove(model_path)) before calling download_url for the mismatch
case or use a new filepath (e.g., model_path_mismatch) and wrap that mismatch
download call with skip_if_downloading_fails() so the test forces a fresh
download and still skips in quick/network-failing environments; reference
download_url, model_path, TestDownloadUrl.test_download_url, and
skip_if_downloading_fails when making the change.
---
Outside diff comments:
In `@tests/test_utils.py`:
- Around line 78-83: The DOWNLOAD_FAIL_MSGS list used by
skip_if_downloading_fails() is incorrectly treating hash-integrity failures as
transient download failures; remove the "hash check" entry from
DOWNLOAD_FAIL_MSGS so hash mismatches raise test failures rather than being
converted to SkipTest, and make the same change to the other identical
failure-message tuple used later in the file (the second
DOWNLOAD_FAIL_MSGS-style list referenced by skip_if_downloading_fails()). Ensure
skip_if_downloading_fails() continues to match only genuine
download/network/gdown/limit messages and not integrity/hash errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b5d0b2c-e70c-4e3d-bd08-86cee7116a1d
📒 Files selected for processing (2)
tests/apps/test_download_and_extract.pytests/test_utils.py
| except (ContentTooShortError, HTTPError, ValueError, RuntimeError) as e: | ||
| if isinstance(e, ValueError): | ||
| # FIXME: skip MD5 check as current downloading method may fail | ||
| self.assertTrue(str(e).startswith("md5 check")) | ||
| self.assertIn("hash check", str(e)) | ||
| return # skipping this test due the network connection errors |
There was a problem hiding this comment.
| except (ContentTooShortError, HTTPError, ValueError, RuntimeError) as e: | |
| if isinstance(e, ValueError): | |
| # FIXME: skip MD5 check as current downloading method may fail | |
| self.assertTrue(str(e).startswith("md5 check")) | |
| self.assertIn("hash check", str(e)) | |
| return # skipping this test due the network connection errors | |
| except ValueError as e: | |
| # FIXME: skip MD5 check as current downloading method may fail | |
| self.assertIn("hash check", str(e)) | |
| return | |
| except (ContentTooShortError, HTTPError, RuntimeError) as e: | |
| return # skipping this test due the network connection errors |
I think a separate clause makes sense despite what the original was.
| SAMPLE_TIFF_HASH_TYPE = "sha256" | ||
|
|
||
|
|
||
| class TestDownloadUrl(unittest.TestCase): |
There was a problem hiding this comment.
We have a conflict now from your previous PR being merged. These two classes possibly need to be merged as the two tests are different.
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
for more information, see https://pre-commit.ci
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
Fixes #8832 .
Description
Adds a check on downloaded files to verify their hash against the expected value and raises a
ValueErrorif there is a mismatch, indicating possible corruption or tampering.Types of changes
./runtests.sh -f -u --net --coverage../runtests.sh --quick --unittests --disttests.