[Feature] Introduce Knowledge Compiling Module #160
wangxingjun778 wants to merge 30 commits into main
Conversation
Code Review
This pull request introduces a 'Knowledge Compile' module, enabling offline document processing into hierarchical tree indices and knowledge clusters. It adds a new CLI command, a linting utility for health checks, and integrates these artifacts into the search pipeline for improved retrieval precision. Feedback focuses on addressing performance bottlenecks in cross-reference building, eliminating hardcoded model names and processing limits, and optimizing I/O by reusing file hashes throughout the pipeline.
```python
for i in range(len(cluster_ids)):
    for j in range(i + 1, len(cluster_ids)):
        cid_a, cid_b = cluster_ids[i], cluster_ids[j]
        shared = cluster_to_files[cid_a] & cluster_to_files[cid_b]
        if not shared:
            continue

        pair_key = (min(cid_a, cid_b), max(cid_a, cid_b))
        if pair_key in pairs_seen:
            continue
        pairs_seen.add(pair_key)

        weight = min(len(shared) * 0.25, 1.0)
        c_a = await self._storage.get(cid_a)
        c_b = await self._storage.get(cid_b)
        if c_a and c_b:
            self._add_edge(c_a, cid_b, "co_occur", weight)
            self._add_edge(c_b, cid_a, "co_occur", weight)
            await self._storage.update(c_a)
            await self._storage.update(c_b)
            edges_created += 1
```
The _build_cross_references method implements an O(N^2) loop over cluster pairs, performing multiple asynchronous database operations (get, update) within the inner loop. This will lead to severe performance issues as the knowledge base grows. Consider batching these updates or using a more efficient graph construction strategy.
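One way to realize the suggested batching is to invert the `cluster_to_files` mapping and count shared files per cluster pair in a single pass, accumulating edge weights in memory instead of issuing two `get`/`update` round-trips per pair. A sketch under that assumption (`build_cooccurrence_edges` is a hypothetical helper, not code from this PR; the 0.25-per-file weighting and 1.0 cap mirror the diff above):

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_edges(cluster_to_files, weight_per_file=0.25, max_weight=1.0):
    """Compute co-occurrence edge weights in memory in one pass, so the
    caller can fetch and update all touched clusters in a single batch."""
    # Invert the mapping: which clusters reference each file?
    file_to_clusters = defaultdict(set)
    for cid, files in cluster_to_files.items():
        for f in files:
            file_to_clusters[f].add(cid)

    # Count shared files per (ordered) cluster pair; only pairs that
    # actually share a file are ever visited.
    shared = defaultdict(int)
    for clusters in file_to_clusters.values():
        for cid_a, cid_b in combinations(sorted(clusters), 2):
            shared[(cid_a, cid_b)] += 1

    # Same weighting rule as the diff: 0.25 per shared file, capped at 1.0.
    return {pair: min(n * weight_per_file, max_weight) for pair, n in shared.items()}
```

The caller would then attach both edge directions for each pair and flush all modified clusters with one bulk storage write.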
```python
llm = OpenAIChat(
    base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=llm_api_key,
    model=os.getenv("LLM_MODEL_NAME", "gpt-5.2"),
```
| """Result of compiling a single file.""" | ||
|
|
||
| path: str | ||
| tree: Optional[DocumentTree] = None |
```python
When *shallow* is True (or file is ineligible for tree indexing),
the pipeline skips tree building and summarises via a direct LLM call.
"""
result = FileCompileResult(path=entry.path)
```
```python
report.trees_built += 1
# Update manifest
manifest.files[result.path] = FileManifestEntry(
    file_hash=get_fast_hash(result.path) or "",
```
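On the I/O point from the review summary: the manifest update above recomputes `get_fast_hash` even when the hash was already taken earlier in the run. A minimal per-run memoization sketch (the real function's signature and hashing scheme in the PR are assumptions here, and caching by path alone is only safe within a single compile pass, since files may change between runs):

```python
import hashlib
from functools import lru_cache
from pathlib import Path
from typing import Optional

@lru_cache(maxsize=4096)
def get_fast_hash(path: str) -> Optional[str]:
    """Cheap fingerprint: first 64 KiB of content plus file size.
    Memoized so later pipeline stages reuse the hash computed at scan
    time instead of re-reading the file."""
    p = Path(path)
    try:
        head = p.read_bytes()[:65536]
        size = p.stat().st_size
    except OSError:
        return None
    return hashlib.blake2b(head + str(size).encode(), digest_size=16).hexdigest()
```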
```python
from sirchmunk.learnings.tree_indexer import DocumentTree

trees: List[DocumentTree] = []
for tree_file in sorted(tree_cache.glob("*.json"))[:50]:
```
The number of tree indices processed during probing is hardcoded to 50. Similar hardcoded limits exist in _probe_compile_hints (50 clusters, 100 trees). In large environments, these limits significantly restrict the effectiveness of the knowledge network. Consider making these thresholds configurable.
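One way to make these thresholds configurable is a small settings object with environment-variable overrides. A sketch only: the field names and `KC_*` env-var names below are hypothetical, not taken from the PR; the defaults match the hardcoded values the comment describes.

```python
import os
from dataclasses import dataclass, field

def _env_int(name: str, default: int) -> int:
    """Read an integer limit from the environment, falling back on bad values."""
    try:
        return int(os.getenv(name, default))
    except (TypeError, ValueError):
        return default

@dataclass
class CompileLimits:
    """Illustrative replacement for the inline probing constants."""
    probe_trees: int = field(default_factory=lambda: _env_int("KC_PROBE_TREES", 50))
    probe_clusters: int = field(default_factory=lambda: _env_int("KC_PROBE_CLUSTERS", 50))
    hint_trees: int = field(default_factory=lambda: _env_int("KC_HINT_TREES", 100))
```

The probing loop would then slice with `[:limits.probe_trees]` instead of the literal `[:50]`, letting large deployments raise the caps without a code change.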
```python
async def _check_clusters(self, report: LintReport, auto_fix: bool) -> None:
    """Validate each knowledge cluster."""
    all_clusters = await self._storage.find("", limit=10000)
```
🚀 New Features & Capabilities
🛠️ Improvements & Optimizations
🐛 Bug Fixes & Performance Tuning