
[Feature] Introduce Knowledge Compiling Module#160

Open
wangxingjun778 wants to merge 30 commits into main from feat/search_wiki

Conversation


@wangxingjun778 (Member) commented Apr 14, 2026

🚀 New Features & Capabilities

  • Knowledge Compile Module: Introduced a new module for offline document processing.
  • Hierarchical Indexing: Converts documents into hierarchical tree indices and knowledge clusters to structure data effectively.
  • New CLI Command: Added a command-line interface entry point to trigger the knowledge compilation process.
  • Search Pipeline Integration: Integrated the generated artifacts (tree indices/clusters) into the existing search pipeline to enhance retrieval precision.

🛠️ Improvements & Optimizations

  • Health Check Utility: Added a linting utility to perform system health checks.
  • I/O Optimization: Implemented file hash reuse throughout the pipeline to reduce redundant I/O operations.
  • Configuration Flexibility: Removed hardcoded model names and processing limits, allowing for dynamic configuration.

🐛 Bug Fixes & Performance Tuning

  • Cross-Reference Performance: Addressed performance bottlenecks identified in the cross-reference building process.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a 'Knowledge Compile' module, enabling offline document processing into hierarchical tree indices and knowledge clusters. It adds a new CLI command, a linting utility for health checks, and integrates these artifacts into the search pipeline for improved retrieval precision. Feedback focuses on addressing performance bottlenecks in cross-reference building, eliminating hardcoded model names and processing limits, and optimizing I/O by reusing file hashes throughout the pipeline.

Comment on lines +906 to +926
for i in range(len(cluster_ids)):
    for j in range(i + 1, len(cluster_ids)):
        cid_a, cid_b = cluster_ids[i], cluster_ids[j]
        shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
        if not shared:
            continue

        pair_key = (min(cid_a, cid_b), max(cid_a, cid_b))
        if pair_key in pairs_seen:
            continue
        pairs_seen.add(pair_key)

        weight = min(len(shared) * 0.25, 1.0)
        c_a = await self._storage.get(cid_a)
        c_b = await self._storage.get(cid_b)
        if c_a and c_b:
            self._add_edge(c_a, cid_b, "co_occur", weight)
            self._add_edge(c_b, cid_a, "co_occur", weight)
            await self._storage.update(c_a)
            await self._storage.update(c_b)
            edges_created += 1

Severity: high

The _build_cross_references method implements an O(N^2) loop over cluster pairs, performing multiple asynchronous database operations (get, update) within the inner loop. This will lead to severe performance issues as the knowledge base grows. Consider batching these updates or using a more efficient graph construction strategy.
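One way to address this (a sketch only; `cluster_to_files`, the 0.25 weight step, and the 1.0 cap mirror the snippet above, while the function name and return shape are hypothetical) is to invert the file-to-cluster map so that only clusters which actually share a file are ever paired, then flush the resulting edges to storage in one batch:

```python
# Sketch: batch co-occurrence edge building. Storage calls are deferred until
# after the edge map is computed; names other than cluster_to_files are
# illustrative, not the PR's actual API.
from collections import defaultdict
from itertools import combinations

def build_co_occur_edges(cluster_to_files: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Return {(cid_a, cid_b): weight}, touching each cluster pair at most once.

    Instead of scanning all O(N^2) cluster pairs, invert the map so only
    clusters that share at least one file are paired.
    """
    file_to_clusters: defaultdict[str, set[str]] = defaultdict(set)
    for cid, files in cluster_to_files.items():
        for f in files:
            file_to_clusters[f].add(cid)

    edges: dict[tuple[str, str], float] = {}
    for cids in file_to_clusters.values():
        for cid_a, cid_b in combinations(sorted(cids), 2):
            key = (cid_a, cid_b)
            if key not in edges:
                shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
                edges[key] = min(len(shared) * 0.25, 1.0)
    return edges
```

After computing the edge map, the caller can fetch and update each touched cluster exactly once, outside the pairing loop, rather than issuing `get`/`update` calls per pair.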

Comment thread src/sirchmunk/cli/cli.py
llm = OpenAIChat(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=llm_api_key,
    model=os.getenv("LLM_MODEL_NAME", "gpt-5.2"),

Severity: medium

The model name "gpt-5.2" is hardcoded here and in other command functions (_compile_status, _compile_lint). It should be centralized or made configurable via environment variables to avoid duplication and facilitate future updates.
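A minimal way to centralize this (a sketch; the env var names follow the snippet above, while `LLMConfig` and `load_llm_config` are hypothetical names):

```python
# Sketch: one source of truth for LLM settings, so no command function
# repeats the default model string inline.
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LLMConfig:
    base_url: str = field(
        default_factory=lambda: os.getenv("LLM_BASE_URL", "https://api.openai.com/v1")
    )
    model: str = field(
        default_factory=lambda: os.getenv("LLM_MODEL_NAME", "gpt-5.2")
    )

def load_llm_config() -> LLMConfig:
    # Every command (_compile_status, _compile_lint, ...) reads from here,
    # so changing the default model becomes a one-line edit.
    return LLMConfig()
```

Each command would then construct its client from the shared config instead of repeating the defaults inline.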

"""Result of compiling a single file."""

path: str
tree: Optional[DocumentTree] = None

Severity: medium

Add a file_hash field to FileCompileResult to allow passing the already-computed hash through the pipeline, avoiding redundant I/O.

Suggested change
-    tree: Optional[DocumentTree] = None
+    path: str
+    file_hash: str = ""

    When *shallow* is True (or file is ineligible for tree indexing),
    the pipeline skips tree building and summarises via a direct LLM call.
    """
    result = FileCompileResult(path=entry.path)

Severity: medium

Populate the file_hash in the result object using the hash already available in FileEntry.

Suggested change
-    result = FileCompileResult(path=entry.path)
+    result = FileCompileResult(path=entry.path, file_hash=entry.file_hash)

    report.trees_built += 1
    # Update manifest
    manifest.files[result.path] = FileManifestEntry(
        file_hash=get_fast_hash(result.path) or "",

Severity: medium

Use the file_hash from the result object instead of re-calculating it by reading the file again.

Suggested change
-    file_hash=get_fast_hash(result.path) or "",
+    file_hash=result.file_hash,

Comment thread src/sirchmunk/search.py (Outdated)
    from sirchmunk.learnings.tree_indexer import DocumentTree

    trees: List[DocumentTree] = []
    for tree_file in sorted(tree_cache.glob("*.json"))[:50]:

Severity: medium

The number of tree indices processed during probing is hardcoded to 50. Similar hardcoded limits exist in _probe_compile_hints (50 clusters, 100 trees). In large environments, these limits significantly restrict the effectiveness of the knowledge network. Consider making these thresholds configurable.
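One configurable-threshold sketch (the env var names and the `ProbeLimits` type are hypothetical; the defaults match the hardcoded values called out above):

```python
# Sketch: lift the hardcoded probe limits (50 tree files, 50 clusters,
# 100 trees) into one env-driven config object.
import os
from dataclasses import dataclass

def _int_env(name: str, default: int) -> int:
    """Read an integer env var, falling back to the default on absence or junk."""
    try:
        return int(os.getenv(name, default))
    except ValueError:
        return default

@dataclass(frozen=True)
class ProbeLimits:
    tree_files: int  # tree index files scanned from the on-disk cache
    clusters: int    # clusters probed in _probe_compile_hints
    trees: int       # tree indices probed in _probe_compile_hints

def load_probe_limits() -> ProbeLimits:
    return ProbeLimits(
        tree_files=_int_env("SIRCHMUNK_PROBE_TREE_FILES", 50),
        clusters=_int_env("SIRCHMUNK_PROBE_CLUSTERS", 50),
        trees=_int_env("SIRCHMUNK_PROBE_TREES", 100),
    )
```

The slice above would then become `sorted(tree_cache.glob("*.json"))[:limits.tree_files]`.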


async def _check_clusters(self, report: LintReport, auto_fix: bool) -> None:
    """Validate each knowledge cluster."""
    all_clusters = await self._storage.find("", limit=10000)

Severity: medium

The limit of 10,000 clusters for linting might be insufficient for very large knowledge bases. Consider using pagination or making the limit configurable.
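Pagination could look like the following sketch (the `find(query, limit, offset)` signature is an assumption; the PR's storage layer may expose a different cursor API):

```python
# Sketch: page through clusters instead of one capped find() call.
from typing import Any, AsyncIterator

async def iter_clusters(storage: Any, page_size: int = 1000) -> AsyncIterator[Any]:
    """Yield every stored cluster page by page, so linting never silently truncates."""
    offset = 0
    while True:
        page = await storage.find("", limit=page_size, offset=offset)
        if not page:
            return
        for cluster in page:
            yield cluster
        if len(page) < page_size:  # short page: nothing left to fetch
            return
        offset += page_size
```

`_check_clusters` would then iterate `async for cluster in iter_clusters(self._storage)` rather than materializing a single capped list.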
