Enhance language detection: more manifest signals, recursive scan, Makefile heuristics, and tests #57

Merged
JordanCoin merged 2 commits into main from codex/fix-language-detection-in-detectlanguagesfromfiles
Mar 27, 2026

@JordanCoin
Owner

Motivation

  • Improve the fallback language detection used when the daemon/dep-graph isn't available so language inference is more accurate.
  • Add signals for additional ecosystems (C#, Kotlin, TypeScript, Swift, C/C++) and monorepos so common repo layouts are recognized.
  • Detect languages from sources nested in subdirectories and glean C/C++ from Makefile contents to avoid missing languages in non-top-level layouts.

Description

  • Reworked detectLanguagesFromFiles to use a map[string][]string of manifest → languages and a small addLang helper to accumulate results.
  • Added many manifest signals (e.g. build.gradle.kts → kotlin, tsconfig.json → typescript, Podfile → swift, Makefile → make) and C# detection via a repo-root glob for *.csproj/*.sln.
  • Detect JS/TS monorepos by checking packages/*/package.json, switch from top-level-only scanning to recursive scanning with scanner.ScanFiles and scanner.NewGitIgnoreCache, and translate detected file extensions into languages via scanner.DetectLanguage.
  • Added applyMakefileHeuristics which reads a Makefile (up to 128KB) and heuristically adds c, cpp when it finds C/C++-related tokens.
  • Cleaned up some heuristics (remove make placeholder after applying heuristics) and added missing imports (io, strings).
  • Added unit tests in cmd/context_test.go covering manifest signals and subdirectory source detection.

Testing

  • Ran the new tests with go test ./cmd -run TestDetectLanguagesFromFiles and both tests passed.
  • Ran the package test suite with go test ./... and the test run completed successfully (including the new tests).

Codex Task

Copilot AI review requested due to automatic review settings March 26, 2026 15:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e47b587829

cmd/context.go Outdated
if strings.Contains(content, "g++") || strings.Contains(content, "clang++") || strings.Contains(content, ".cpp") || strings.Contains(content, ".cc") {
	addLang("cpp")
}
if strings.Contains(content, "gcc") || strings.Contains(content, "clang") || strings.Contains(content, ".c") {

P2: Avoid inferring C from C++-only Makefiles

The C heuristic is overly broad: strings.Contains(content, "clang") and especially strings.Contains(content, ".c") both match common C++-only tokens like clang++ and .cpp, so repositories with only C++ build rules will be mislabeled as both cpp and c. This affects language-driven behavior (e.g., context output and skill matching) whenever fallback detection relies on Makefile parsing.

Copilot AI left a comment

Pull request overview

This PR improves the “cheap fallback” language detection used when the daemon/dep-graph isn’t available by adding more manifest-based signals, supporting monorepo layouts, scanning for source files recursively, and adding Makefile-based heuristics for C/C++.

Changes:

  • Expanded manifest→language signals (Gradle Kotlin DSL, tsconfig, Podfile, Makefile) and added repo-root glob detection for C# projects.
  • Switched fallback detection from top-level-only extension checks to a recursive scan using scanner.ScanFiles + NewGitIgnoreCache.
  • Added applyMakefileHeuristics and new unit tests covering manifest signals and nested source detection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cmd/context.go | Reworks fallback language detection to include more manifest signals, Makefile heuristics, and recursive scanning. |
| cmd/context_test.go | Adds unit tests validating new manifest signals and subdirectory source detection. |

cmd/context.go Outdated
Comment on lines +304 to +309
// Makefile heuristics for C/C++ projects.
if _, hasMakefile := langs["make"]; hasMakefile {
applyMakefileHeuristics(filepath.Join(root, "Makefile"), addLang)
}
for _, entry := range entries {
if entry.IsDir() {
continue
}
if lang := scanner.DetectLanguage(entry.Name()); lang != "" {
langs[lang] = true
delete(langs, "make")

Copilot AI Mar 26, 2026

Makefile is added as a manifest signal for "make", but the function then unconditionally deletes langs["make"]. As a result, "make" can never be reported as a detected language, and langs["make"] is effectively just a sentinel. Either keep "make" in the final output (don’t delete it), or avoid adding it to manifests and instead check for Makefile existence directly before running heuristics (and update comments/PR description accordingly).

cmd/context.go Outdated
if strings.Contains(content, "g++") || strings.Contains(content, "clang++") || strings.Contains(content, ".cpp") || strings.Contains(content, ".cc") {
	addLang("cpp")
}
if strings.Contains(content, "gcc") || strings.Contains(content, "clang") || strings.Contains(content, ".c") {

Copilot AI Mar 26, 2026

applyMakefileHeuristics will often add c for C++-only Makefiles: strings.Contains(content, "clang") matches clang++, and strings.Contains(content, ".c") also matches .cpp. This can cause systematic false positives (reporting both c and cpp). Tighten these checks (e.g., token/word-boundary matching, or explicitly excluding clang++/.cpp when detecting C).

Suggested change
- if strings.Contains(content, "gcc") || strings.Contains(content, "clang") || strings.Contains(content, ".c") {
+ if strings.Contains(content, "gcc") ||
+     (strings.Contains(content, "clang") && !strings.Contains(content, "clang++")) ||
+     (strings.Contains(content, ".c") && !strings.Contains(content, ".cpp")) {
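
As an illustration of the token-matching alternative mentioned in this comment, the sketch below splits the Makefile into tokens and compares whole compiler names and file extensions instead of substrings. The function name makefileLanguages and the separator set are assumptions, and this is not what was merged; the token list (gcc, clang, .c, g++, clang++, .cpp, .cc) is the one from the reviewed code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// makefileLanguages is a hypothetical helper that reports whether Makefile
// content suggests C and/or C++, matching whole tokens rather than substrings.
func makefileLanguages(content string) (c, cpp bool) {
	// Split on whitespace and common Makefile punctuation so that
	// "CC=gcc" or "$(CXX)" yield separate tokens.
	tokens := strings.FieldsFunc(content, func(r rune) bool {
		return strings.ContainsRune(" \t\r\n=:()${},;\\\"'", r)
	})
	for _, tok := range tokens {
		// Base handles absolute compiler paths such as /usr/bin/gcc.
		switch filepath.Base(tok) {
		case "gcc", "clang":
			c = true
		case "g++", "clang++":
			cpp = true
		}
		// Ext returns the suffix after the final dot, so ".cpp" never
		// matches the ".c" case.
		switch filepath.Ext(tok) {
		case ".c":
			c = true
		case ".cpp", ".cc":
			cpp = true
		}
	}
	return c, cpp
}

func main() {
	c, cpp := makefileLanguages("CXX = clang++\nmain: main.cpp\n\t$(CXX) -o main main.cpp\n")
	fmt.Println("c:", c, "cpp:", cpp) // prints "c: false cpp: true"
}
```

Unlike the exclusion-based suggestion, whole-token matching also keeps working for mixed projects: a Makefile that mentions both main.c and util.cpp reports both languages.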

cmd/context.go Outdated
Comment on lines 310 to 316
// Include subdirectory-only source files.
gitCache := scanner.NewGitIgnoreCache(root)
if files, err := scanner.ScanFiles(root, gitCache, nil, nil); err == nil {
	for _, f := range files {
		addLang(scanner.DetectLanguage(f.Path))
	}
}

Copilot AI Mar 26, 2026

This adds a full recursive scanner.ScanFiles walk in the fallback path, but buildProjectContext may still call countSourceFiles(root) afterwards, which performs another full ScanFiles walk. That makes the daemon-missing path potentially do two complete directory traversals. Consider reusing the scanned file list/count from this walk (or returning both languages + fileCount from the fallback) to avoid duplicate I/O on large repos.

- Remove 'make' sentinel from manifests, check Makefile directly
- Tighten C heuristic: exclude clang++ false positive
- Cache ScanFiles result to avoid double directory walk
- Fix build error in countSourceFiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JordanCoin JordanCoin merged commit e9ed7c1 into main Mar 27, 2026
12 checks passed
@JordanCoin JordanCoin deleted the codex/fix-language-detection-in-detectlanguagesfromfiles branch March 27, 2026 01:29
2 participants