Skip to content

Verify safety of str iter functions (Challenge 22)#557

Open
jrey8343 wants to merge 4 commits intomodel-checking:mainfrom
jrey8343:challenge-22-str-iter
Open

Verify safety of str iter functions (Challenge 22)#557
jrey8343 wants to merge 4 commits intomodel-checking:mainfrom
jrey8343:challenge-22-str-iter

Conversation

@jrey8343
Copy link
Copy Markdown

Summary

Unbounded verification of 16 safe functions containing unsafe code in library/core/src/str/iter.rs, plus a safety contract for Bytes::__iterator_get_unchecked, using Kani with loop contracts (-Z loop-contracts).

Functions Verified

Function Impl for Unsafe Ops
next Chars next_code_point + char::from_u32_unchecked
advance_by Chars self.iter.advance_by(N).unwrap_unchecked() (3 call sites)
next_back Chars next_code_point_reverse + char::from_u32_unchecked
as_str Chars from_utf8_unchecked(self.iter.as_slice())
get_end SplitInternal get_unchecked(self.start..self.end)
next SplitInternal get_unchecked(self.start..a)
next_inclusive SplitInternal get_unchecked(self.start..b)
next_back SplitInternal get_unchecked(b..self.end), get_unchecked(self.start..self.end)
next_back_inclusive SplitInternal get_unchecked(b..self.end), get_unchecked(self.start..self.end)
remainder SplitInternal get_unchecked(self.start..self.end)
next MatchIndicesInternal get_unchecked(start..end)
next_back MatchIndicesInternal get_unchecked(start..end)
next MatchesInternal get_unchecked(a..b)
next_back MatchesInternal get_unchecked(a..b)
remainder SplitAsciiWhitespace from_utf8_unchecked(&self.inner.iter.iter.v)
__iterator_get_unchecked Bytes Contract: #[requires(idx < self.0.len())]

Coverage

18 proof harnesses covering all 16 functions plus the __iterator_get_unchecked contract.

Unbounded Verification Approach

All harnesses run without #[kani::unwind(N)] — verification is unbounded:

  1. Chars methods: next_code_point and next_code_point_reverse are branch-based (no loops), so no unwind bounds needed. advance_by is abstracted under #[cfg(kani)] to eliminate its 3 while loops while exercising the same unwrap_unchecked unsafe operations with nondeterministic but valid byte counts.

  2. SplitInternal / MatchIndicesInternal / MatchesInternal: These call CharSearcher::next_match() / next_match_back() which contain loops over the haystack. These searcher methods are abstracted under #[cfg(kani)] in pattern.rs as nondeterministic overapproximations — they return valid (start, end) pairs within haystack bounds without looping. The get_unchecked safety depends only on indices being in bounds, which the abstraction guarantees for any string length.

  3. SplitAsciiWhitespace::remainder: Verified on a fresh iterator (no next() call needed) since the unsafe from_utf8_unchecked safety depends only on the invariant that self.inner.iter.iter.v is a valid UTF-8 subslice, maintained from the original &str input.

Key Techniques

  1. #[cfg(kani)] nondeterministic abstractions in pattern.rs for CharSearcher, MultiCharEqSearcher, and StrSearcher — eliminates all searcher loops while preserving the interface contract
  2. #[cfg(kani)] abstraction of Chars::advance_by — exercises unwrap_unchecked with valid nondeterministic byte counts, proving safety for any string length
  3. Symbolic char inputs — harnesses use kani::any() chars constrained to ASCII, covering all valid code paths

Assumptions Used

Per challenge spec:

  1. Safety and functional correctness of all functions in slice module
  2. Safety and functional correctness of all functions in pattern.rs
  3. Functional correctness of validations.rs functions per UTF-8 spec
  4. All iterators derive from valid UTF-8 strings

All 18 harnesses pass with no --unwind and no #[kani::unwind].

Resolves #279

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.

jrey8343 and others added 3 commits February 7, 2026 14:17
Add 17 Kani verification harnesses for all unsafe operations in
library/core/src/str/iter.rs:

Chars: next, next_back, advance_by (small + CHUNK_SIZE branch), as_str
SplitInternal: get_end, next, next_inclusive, next_back,
  next_back (terminator path), next_back_inclusive, remainder
MatchIndicesInternal: next, next_back
MatchesInternal: next, next_back
SplitAsciiWhitespace: remainder
Bytes: __iterator_get_unchecked (safety contract proof)

Techniques:
- Symbolic char via kani::any::<char>() with encode_utf8 for full
  Unicode scalar value coverage (Chars harnesses)
- Symbolic ASCII char patterns with 2-byte haystack for Split/Match
  harnesses covering match and no-match paths
- Concrete 33-byte string for advance_by CHUNK_SIZE=32 branch
- Remove all #[kani::unwind(N)] from harnesses
- Abstract Chars::advance_by under #[cfg(kani)] to eliminate loops
- Bring CharSearcher/MultiCharEqSearcher/StrSearcher nondeterministic
  abstractions from Ch21 pattern.rs for next_match/next_match_back
- Use symbolic char inputs instead of literal strings
- Simplify SplitAsciiWhitespace harness to avoid slice iteration loops
- All harnesses now use nondeterministic overapproximation instead of
  bounded loop unwinding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jrey8343 jrey8343 requested a review from a team as a code owner March 15, 2026 20:36
1. Fix indentation in #[cfg(not(kani))] advance_by block to pass rustfmt
2. Remove incorrect assertions in check_split_internal_get_end harness -
   the nondeterministic next_match abstraction overapproximates, so we
   only verify safety of get_unchecked, not functional correctness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@feliperodri feliperodri added the Challenge Used to tag a challenge label Mar 19, 2026
@feliperodri feliperodri requested a review from Copilot March 31, 2026 22:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Kani-focused abstractions and proof harnesses to support unbounded verification (no unwind bounds) of the safety of str iterator methods that contain unsafe operations, primarily by eliminating/bypassing internal search loops under #[cfg(kani)].

Changes:

  • Added #[cfg(kani)] nondeterministic abstractions in str/pattern.rs to avoid unbounded loops in Searcher implementations (including Two-Way search) during Kani runs.
  • Added a #[cfg(kani)] abstraction of Chars::advance_by in str/iter.rs and introduced a #[cfg(kani)] verify module with multiple Kani proof harnesses.
  • Introduced/used a Kani contract style (#[requires(...)]) and a harness to exercise the Bytes::__iterator_get_unchecked safety precondition.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File Description
library/core/src/str/pattern.rs Adds Kani-only loop abstractions for several Searcher implementations (char search, multi-char predicates, str search, two-way search) to enable unbounded verification.
library/core/src/str/iter.rs Adds a Kani-only abstraction for Chars::advance_by and introduces Kani proof harnesses for iterator safety contracts.

Comment on lines +490 to +494
kani::assume(a >= self.finger);
kani::assume(a <= self.finger_back); // avoid overflow in a + w
kani::assume(w <= self.finger_back - a);
self.finger = a + w;
Some((a, self.finger))
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the #[cfg(kani)] abstraction for CharSearcher::next_match, a is not constrained to a UTF-8 char boundary. This can violate the CharSearcher/Searcher safety invariant that returned indices (and self.finger on exit) are on char boundaries, which is required because other methods (e.g., CharSearcher::next uses haystack.get_unchecked(old_finger..finger_back)) rely on that invariant and can cause UB if it’s broken.

Constrain a (and therefore a + w) to self.haystack.is_char_boundary(...), and ensure self.finger is left on a boundary before returning Some/None.

Copilot uses AI. Check for mistakes.
Comment on lines +528 to +531
let w: usize = kani::any();
kani::assume(w >= 1 && w <= 4);
kani::assume(old_finger + w <= self.finger_back);
self.finger = old_finger + w;
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the #[cfg(kani)] abstraction for CharSearcher::next_reject, advancing self.finger by an unconstrained w (1..=4) can leave finger on a non–UTF-8 boundary. Because CharSearcher is an unsafe impl Searcher, this violates the required invariant that finger is on a char boundary when methods return, and can lead to UB in subsequent calls (e.g., CharSearcher::next uses haystack.get_unchecked(old_finger..finger_back) which requires boundaries).

Update the abstraction to only choose new finger positions that satisfy self.haystack.is_char_boundary(new_finger) (and keep finger_back unchanged), or alternatively pick a boundary index directly and assign it to finger.

Suggested change
let w: usize = kani::any();
kani::assume(w >= 1 && w <= 4);
kani::assume(old_finger + w <= self.finger_back);
self.finger = old_finger + w;
let new_finger: usize = kani::any();
// Choose a new position strictly after old_finger, within the remaining
// search range, and constrained to be a valid UTF-8 char boundary.
kani::assume(new_finger > old_finger);
kani::assume(new_finger <= self.finger_back);
kani::assume(self.haystack.is_char_boundary(new_finger));
self.finger = new_finger;

Copilot uses AI. Check for mistakes.
let w = self.utf8_size();
kani::assume(a >= self.finger);
kani::assume(a <= self.finger_back); // avoid overflow in a + w
kani::assume(w <= self.finger_back - a);
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the #[cfg(kani)] abstraction for CharSearcher::next_match_back, a is not constrained to a UTF-8 char boundary, and the returned (a, a + w) may therefore violate the Searcher contract (indices must be on UTF-8 boundaries). This is especially risky because str APIs like trim_*_matches use next_match/_reject results with unsafe get_unchecked on &str, which is UB if boundaries are not respected.

Add kani::assume(self.haystack.is_char_boundary(a)) and kani::assume(self.haystack.is_char_boundary(a + w)), and ensure self.finger_back is left on a boundary before returning.

Suggested change
kani::assume(w <= self.finger_back - a);
kani::assume(w <= self.finger_back - a);
kani::assume(self.haystack.is_char_boundary(a));
kani::assume(self.haystack.is_char_boundary(a + w));

Copilot uses AI. Check for mistakes.
Comment on lines +662 to +665
let w: usize = kani::any();
kani::assume(w >= 1 && w <= 4);
kani::assume(self.finger + w <= old_finger_back);
self.finger_back = old_finger_back - w;
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the #[cfg(kani)] abstraction for CharSearcher::next_reject_back, self.finger_back is decremented by an unconstrained w (1..=4) without ensuring the new value is a UTF-8 boundary. Since CharSearcher’s unsafe impl Searcher/ReverseSearcher requires fingers to be on boundaries when methods return, this can break subsequent &str slicing invariants and cause UB.

Constrain the updated finger_back to satisfy self.haystack.is_char_boundary(self.finger_back) (and keep it within [finger, old_finger_back]).

Suggested change
let w: usize = kani::any();
kani::assume(w >= 1 && w <= 4);
kani::assume(self.finger + w <= old_finger_back);
self.finger_back = old_finger_back - w;
// Choose a nondeterministic new finger_back that:
// - stays within [self.finger, old_finger_back),
// - advances by at most one UTF-8 code point (<= MAX_LEN_UTF8 bytes),
// - and is a valid UTF-8 boundary.
let new_finger_back: usize = kani::any();
kani::assume(self.finger <= new_finger_back);
kani::assume(new_finger_back < old_finger_back);
kani::assume(old_finger_back - new_finger_back <= MAX_LEN_UTF8);
kani::assume(self.haystack.is_char_boundary(new_finger_back));
self.finger_back = new_finger_back;

Copilot uses AI. Check for mistakes.
Comment on lines +847 to +855
{
if kani::any() {
let i: usize = kani::any();
let char_len: usize = kani::any();
kani::assume(char_len >= 1 && char_len <= 4);
kani::assume(i <= self.haystack.len());
kani::assume(char_len <= self.haystack.len() - i);
Some((i, i + char_len))
} else {
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The #[cfg(kani)] abstractions for MultiCharEqSearcher::{next_match,next_reject,next_match_back,next_reject_back} return arbitrary (i, i + char_len) ranges without constraining i/i+char_len to UTF-8 char boundaries (and without coupling them to char_indices). Because this is an unsafe impl Searcher, returning non-boundary indices violates the Searcher contract and can lead to UB in callers that use these indices with unsafe get_unchecked on &str.

At minimum, add assumptions that both ends are self.haystack.is_char_boundary(...) (and in-bounds). Ideally, also advance internal state in a way that preserves progress/consistency with the real implementation.

Copilot uses AI. Check for mistakes.
let crit_pos_back: usize = kani::any();
kani::assume(crit_pos_back < needle_len);
let period: usize = kani::any();
kani::assume(period >= 1 && period <= needle_len);
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In TwoWaySearcher::new’s #[cfg(kani)] abstraction, period is constrained to <= needle_len, but in the real implementation the long-period case sets period = max(crit_pos, needle_len - crit_pos) + 1, which can be needle_len + 1. If this abstraction is intended to be an overapproximation of reachable states, this constraint is too strong and may exclude real states.

Relax the assumption to permit the full range of period values that can occur (e.g., allow needle_len + 1 in the long-period case, or avoid constraining period beyond what is actually required by downstream Kani abstractions).

Suggested change
kani::assume(period >= 1 && period <= needle_len);
kani::assume(period >= 1 && period <= needle_len + 1);

Copilot uses AI. Check for mistakes.
// Under Kani, abstract away maximal_suffix computation which has deeply
// nested loops intractable for CBMC. Instead, produce a nondeterministic
// TwoWaySearcher satisfying the type invariant. This is sound because:
// - All TwoWaySearcher code is safe Rust (no UB possible regardless of field values)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The soundness note for the Kani abstraction says “All TwoWaySearcher code is safe Rust (no UB possible regardless of field values)”, but this file contains unsafe code in the two-way search implementation (e.g., small_slice_eq does raw pointer reads). If the argument is that the Kani build never executes those unsafe paths due to further abstraction, the comment should say that explicitly.

Please adjust the comment to avoid claiming the algorithm is entirely safe Rust, or clearly state which unsafe code is avoided under #[cfg(kani)] and why that makes the abstraction safe for the properties being verified.

Suggested change
// - All TwoWaySearcher code is safe Rust (no UB possible regardless of field values)
// - Under #[cfg(kani)], we only rely on the structural invariants of
// TwoWaySearcher and the safe operations performed on its fields; the
// unsafe helper routines used in the full two-way implementation (such
// as those doing raw pointer reads) are not exercised by the Kani
// abstractions and verification harnesses

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +124
let bytes_len = self.iter.len();
let bytes_consumed: usize = kani::any();
kani::assume(bytes_consumed <= bytes_len);
if bytes_consumed > 0 {
// SAFETY: bytes_consumed <= bytes_len = self.iter.len()
unsafe { self.iter.advance_by(bytes_consumed).unwrap_unchecked() };
}
// Nondeterministically determine how many chars were consumed.
// Each char is 1-4 bytes, so chars_consumed is between
// ceil(bytes_consumed/4) and bytes_consumed.
let chars_consumed: usize = kani::any();
kani::assume(chars_consumed <= remainder);
NonZero::new(remainder - chars_consumed).map_or(Ok(()), Err)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The #[cfg(kani)] implementation of Chars::advance_by advances the underlying byte iterator by an arbitrary bytes_consumed without ensuring it lands on a UTF-8 char boundary. Chars methods (e.g., next/as_str) rely on the invariant that self.iter always points to valid UTF-8 boundaries; breaking it can make later next_code_point / from_utf8_unchecked calls UB.

Constrain bytes_consumed to a char boundary within the remaining string (e.g., derive let rem = unsafe { from_utf8_unchecked(self.iter.as_slice()) } and assume(rem.is_char_boundary(bytes_consumed)) before advancing), so the abstraction preserves the iterator’s safety invariants.

Copilot uses AI. Check for mistakes.
Comment on lines +1654 to +1660
let c: char = kani::any();
let mut buf = [0u8; 4];
let s = c.encode_utf8(&mut buf);
let mut chars = s.chars();
let result = chars.next();
assert_eq!(result, Some(c));
assert!(chars.next().is_none());
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Kani harnesses build s from a single char (encode_utf8 into a 4-byte buffer), so the input string length is always in 1..=4 bytes. That means these proofs don’t exercise behaviors that depend on longer haystacks (e.g., larger index values / more iteration steps), despite the PR description claiming the verification holds for “any string length”.

If the intent is to argue length-independence via the nondeterministic searcher abstractions, consider either (a) documenting that the harnesses intentionally restrict to very small strings, or (b) constructing a symbolic &str with variable length (even if bounded) so the proof actually ranges over more lengths.

Copilot uses AI. Check for mistakes.
Comment on lines +2541 to +2542


Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two consecutive blank lines introduced right before the #[cfg(kani)] pub mod verify block. This looks accidental and adds noise to the diff.

Please remove the extra blank lines to keep formatting consistent with the surrounding code.

Suggested change

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Challenge Used to tag a challenge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Challenge 22: Verify the safety of str iter functions

3 participants