
[Feature] Introduce Knowledge Compiling Module#160

Open
wangxingjun778 wants to merge 30 commits into main from feat/search_wiki

Conversation


@wangxingjun778 (Member) commented Apr 14, 2026

🚀 New Features & Capabilities

  • Knowledge Compile Module: Introduced a new module for offline document processing.
  • Hierarchical Indexing: Converts documents into hierarchical tree indices and knowledge clusters to structure data effectively.
  • New CLI Command: Added a command-line interface entry point to trigger the knowledge compilation process.
  • Search Pipeline Integration: Integrated the generated artifacts (tree indices/clusters) into the existing search pipeline to enhance retrieval precision.

🛠️ Improvements & Optimizations

  • Health Check Utility: Added a linting utility to perform system health checks.
  • I/O Optimization: Implemented file hash reuse throughout the pipeline to reduce redundant I/O operations.
  • Configuration Flexibility: Removed hardcoded model names and processing limits, allowing for dynamic configuration.

🐛 Bug Fixes & Performance Tuning

  • Cross-Reference Performance: Addressed performance bottlenecks identified in the cross-reference building process.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a 'Knowledge Compile' module, enabling offline document processing into hierarchical tree indices and knowledge clusters. It adds a new CLI command, a linting utility for health checks, and integrates these artifacts into the search pipeline for improved retrieval precision. Feedback focuses on addressing performance bottlenecks in cross-reference building, eliminating hardcoded model names and processing limits, and optimizing I/O by reusing file hashes throughout the pipeline.

Comment on lines +906 to +926
for i in range(len(cluster_ids)):
    for j in range(i + 1, len(cluster_ids)):
        cid_a, cid_b = cluster_ids[i], cluster_ids[j]
        shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
        if not shared:
            continue

        pair_key = (min(cid_a, cid_b), max(cid_a, cid_b))
        if pair_key in pairs_seen:
            continue
        pairs_seen.add(pair_key)

        weight = min(len(shared) * 0.25, 1.0)
        c_a = await self._storage.get(cid_a)
        c_b = await self._storage.get(cid_b)
        if c_a and c_b:
            self._add_edge(c_a, cid_b, "co_occur", weight)
            self._add_edge(c_b, cid_a, "co_occur", weight)
            await self._storage.update(c_a)
            await self._storage.update(c_b)
            edges_created += 1

Severity: high

The _build_cross_references method implements an O(N^2) loop over cluster pairs, performing multiple asynchronous database operations (get, update) within the inner loop. This will lead to severe performance issues as the knowledge base grows. Consider batching these updates or using a more efficient graph construction strategy.
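One way to address this (a sketch only; `cluster_to_files`, the 0.25 weight step, and the 1.0 cap mirror the snippet above, while the function name and return shape are hypothetical) is to invert the file-to-cluster map so that only clusters which actually share a file are ever paired, then flush the resulting edges to storage in one batch:

```python
# Sketch: batch co-occurrence edge building. Storage calls are deferred until
# after the edge map is computed; names other than cluster_to_files are
# illustrative, not the PR's actual API.
from collections import defaultdict
from itertools import combinations

def build_co_occur_edges(cluster_to_files: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Return {(cid_a, cid_b): weight}, touching each cluster pair at most once.

    Instead of scanning all O(N^2) cluster pairs, invert the map so only
    clusters that share at least one file are paired.
    """
    file_to_clusters: defaultdict[str, set[str]] = defaultdict(set)
    for cid, files in cluster_to_files.items():
        for f in files:
            file_to_clusters[f].add(cid)

    edges: dict[tuple[str, str], float] = {}
    for cids in file_to_clusters.values():
        for cid_a, cid_b in combinations(sorted(cids), 2):
            key = (cid_a, cid_b)
            if key not in edges:
                shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
                edges[key] = min(len(shared) * 0.25, 1.0)
    return edges
```

After computing the edge map, the caller can fetch and update each touched cluster exactly once, outside the pairing loop, rather than issuing `get`/`update` calls per pair.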

Comment thread src/sirchmunk/cli/cli.py
llm = OpenAIChat(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=llm_api_key,
    model=os.getenv("LLM_MODEL_NAME", "gpt-5.2"),

Severity: medium

The model name "gpt-5.2" is hardcoded here and in other command functions (_compile_status, _compile_lint). It should be centralized or made configurable via environment variables to avoid duplication and facilitate future updates.
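A minimal way to centralize this (a sketch; the env var names follow the snippet above, while `LLMConfig` and `load_llm_config` are hypothetical names):

```python
# Sketch: one source of truth for LLM settings, so no command function
# repeats the default model string inline.
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LLMConfig:
    base_url: str = field(
        default_factory=lambda: os.getenv("LLM_BASE_URL", "https://api.openai.com/v1")
    )
    model: str = field(
        default_factory=lambda: os.getenv("LLM_MODEL_NAME", "gpt-5.2")
    )

def load_llm_config() -> LLMConfig:
    # Every command (_compile_status, _compile_lint, ...) reads from here,
    # so changing the default model becomes a one-line edit.
    return LLMConfig()
```

Each command would then construct its client from the shared config instead of repeating the defaults inline.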

"""Result of compiling a single file."""

path: str
tree: Optional[DocumentTree] = None

Severity: medium

Add a file_hash field to FileCompileResult to allow passing the already-computed hash through the pipeline, avoiding redundant I/O.

Suggested change
-    tree: Optional[DocumentTree] = None
+    path: str
+    file_hash: str = ""

    When *shallow* is True (or file is ineligible for tree indexing),
    the pipeline skips tree building and summarises via a direct LLM call.
    """
    result = FileCompileResult(path=entry.path)

Severity: medium

Populate the file_hash in the result object using the hash already available in FileEntry.

Suggested change
-    result = FileCompileResult(path=entry.path)
+    result = FileCompileResult(path=entry.path, file_hash=entry.file_hash)

    report.trees_built += 1
    # Update manifest
    manifest.files[result.path] = FileManifestEntry(
        file_hash=get_fast_hash(result.path) or "",

Severity: medium

Use the file_hash from the result object instead of re-calculating it by reading the file again.

Suggested change
-    file_hash=get_fast_hash(result.path) or "",
+    file_hash=result.file_hash,

Comment thread src/sirchmunk/search.py (Outdated)
    from sirchmunk.learnings.tree_indexer import DocumentTree

    trees: List[DocumentTree] = []
    for tree_file in sorted(tree_cache.glob("*.json"))[:50]:

Severity: medium

The number of tree indices processed during probing is hardcoded to 50. Similar hardcoded limits exist in _probe_compile_hints (50 clusters, 100 trees). In large environments, these limits significantly restrict the effectiveness of the knowledge network. Consider making these thresholds configurable.
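One configurable-threshold sketch (the env var names and the `ProbeLimits` type are hypothetical; the defaults match the hardcoded values called out above):

```python
# Sketch: lift the hardcoded probe limits (50 tree files, 50 clusters,
# 100 trees) into one env-driven config object.
import os
from dataclasses import dataclass

def _int_env(name: str, default: int) -> int:
    """Read an integer env var, falling back to the default on absence or junk."""
    try:
        return int(os.getenv(name, default))
    except ValueError:
        return default

@dataclass(frozen=True)
class ProbeLimits:
    tree_files: int  # tree index files scanned from the on-disk cache
    clusters: int    # clusters probed in _probe_compile_hints
    trees: int       # tree indices probed in _probe_compile_hints

def load_probe_limits() -> ProbeLimits:
    return ProbeLimits(
        tree_files=_int_env("SIRCHMUNK_PROBE_TREE_FILES", 50),
        clusters=_int_env("SIRCHMUNK_PROBE_CLUSTERS", 50),
        trees=_int_env("SIRCHMUNK_PROBE_TREES", 100),
    )
```

The slice above would then become `sorted(tree_cache.glob("*.json"))[:limits.tree_files]`.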


async def _check_clusters(self, report: LintReport, auto_fix: bool) -> None:
    """Validate each knowledge cluster."""
    all_clusters = await self._storage.find("", limit=10000)

Severity: medium

The limit of 10,000 clusters for linting might be insufficient for very large knowledge bases. Consider using pagination or making the limit configurable.
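Pagination could look like the following sketch (the `find(query, limit, offset)` signature is an assumption; the PR's storage layer may expose a different cursor API):

```python
# Sketch: page through clusters instead of one capped find() call.
from typing import Any, AsyncIterator

async def iter_clusters(storage: Any, page_size: int = 1000) -> AsyncIterator[Any]:
    """Yield every stored cluster page by page, so linting never silently truncates."""
    offset = 0
    while True:
        page = await storage.find("", limit=page_size, offset=offset)
        if not page:
            return
        for cluster in page:
            yield cluster
        if len(page) < page_size:  # short page: nothing left to fetch
            return
        offset += page_size
```

`_check_clusters` would then iterate `async for cluster in iter_clusters(self._storage)` rather than materializing a single capped list.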
