Multiple calls to create_chat_completion() fail with "llama_decode: failed to decode, ret = -1" #2140

@thisisayushg

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am running LiquidAI's LFM2.5-1.2B-Instruct model. Calling create_chat_completion() multiple times on the same Llama instance should not raise an error.

Current Behavior

When calling create_chat_completion() twice in a row, the second call fails with the error "llama_decode: failed to decode, ret = -1".

Environment and Context

I am using llama-cpp-python v0.3.16 on Python 3.13.9.

  • Windows 11

Failure Information (for bugs)

Tracing the issue back, it appears the context and KV cache need to be reset between create_chat_completion() calls, but this is not currently happening.

Steps to Reproduce

from pathlib import Path
from llama_cpp import Llama

llm = Llama(
    model_path=str(Path.home() / "AppData/Local/llama.cpp/LiquidAI_LFM2.5-1.2B-Instruct-GGUF_LFM2.5-1.2B-Instruct-Q4_K_M.gguf"),
    n_ctx=1000,
)

system_prompt = "You are a helpful assistant"
prompt = "suggest me places to visit during winter season"

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ],
)
print(response)

# llm.reset()                   # Using this works
# llm._ctx.kv_cache_clear()     # Using this works

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ],
)
print(response)

Failure Logs

init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 519
 - the tokens for sequence 0 in the input batch have a starting position of Y = 29
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
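
The invariant the log reports can be illustrated with a small standalone sketch (plain Python, no llama.cpp needed; the function name and structure are illustrative, not the actual library code): the KV cache for a sequence ends at position X, and a new batch for that sequence must start at Y = X + 1.

```python
def validate_batch(last_cached_pos, batch_start_pos):
    """Mimic the consistency check llama_decode performs: token
    positions within a sequence must remain consecutive (Y = X + 1).
    An empty cache is modeled as last_cached_pos = -1."""
    if batch_start_pos != last_cached_pos + 1:
        raise ValueError(
            f"inconsistent sequence positions: cache ends at X = {last_cached_pos}, "
            f"batch starts at Y = {batch_start_pos}; required Y = X + 1"
        )

# First call: empty cache (X = -1), prompt starts at position 0 -> OK.
validate_batch(-1, 0)

# Second call without a reset: the cache still ends at X = 519 from the
# first completion, but the new prompt is assigned positions starting at
# Y = 29 -> fails, matching the log above.
try:
    validate_batch(519, 29)
except ValueError as e:
    print(e)
```

This is why clearing the KV cache between calls (as in the commented-out workarounds above) avoids the error: it puts the cache back in the empty state, so the next prompt can legally start at position 0.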
