
[Feat]: Write crawling code for per-song tag extraction (#173) #176

Merged
GulSam00 merged 3 commits into develop from feat/173-extractSongTags
Apr 4, 2026

Conversation

@GulSam00
Owner

@GulSam00 GulSam00 commented Apr 2, 2026

User description

📌 PR Title

[Feat]: Write crawling code for per-song tag extraction

📌 Changes

  • Add AI-based automatic song tag extraction (packages/crawling/src/utils/getSongTag.ts)
  • Add tag-extraction cron script (packages/crawling/src/cron/taggingSongs.ts)
  • Add Supabase DB query/insert utilities (packages/crawling/src/supabase/)
  • Move crawling scripts into the cron/ directory
  • Improve GitHub Actions workflow step names and add a tag-extraction workflow
  • Update CLAUDE.md

💬 Additional Notes


PR Type

Enhancement


Description

  • Add AI-based automatic song tagging using OpenAI GPT-4o-mini

  • Implement song tagging cron script with Supabase integration

  • Reorganize crawling scripts from crawling/ to cron/ directory

  • Add new GitHub Actions workflow for automated daily tagging

  • Update documentation and improve workflow step names


Diagram Walkthrough

flowchart LR
  A["Song Database"] -->|getSongsAllDB| B["Tagging Script"]
  C["Tagged Songs Cache"] -->|getSongTagSongIdsDB| B
  B -->|autoTagSong| D["OpenAI GPT-4o-mini"]
  D -->|tag IDs| B
  B -->|postSongTagsDB| E["song_tags Table"]
  F["GitHub Actions"] -->|Daily Schedule| B

File Walkthrough

Relevant files

Enhancement (4 files)
  • getSongTag.ts: AI-powered song tag extraction utility (+92/-0)
  • taggingSongs.ts: Cron script for batch song tagging (+59/-0)
  • getDB.ts: Add song and tag query functions (+24/-0)
  • postDB.ts: Add song tags insertion function (+12/-0)

Miscellaneous (3 files)
  • crawlYoutube.ts: Update import path after directory move (+1/-1)
  • crawlYoutubeVerify.ts: Update import path and increase limit (+2/-3)
  • sitemap-0.xml: Update sitemap timestamp (+1/-1)

Configuration changes (5 files)
  • tagging_song.yml: New workflow for automated song tagging (+43/-0)
  • crawl_recent_tj.yml: Improve workflow step name clarity (+1/-1)
  • update_ky_youtube.yml: Improve workflow step name clarity (+1/-1)
  • verify_ky_youtube.yml: Improve workflow step name clarity (+1/-1)
  • package.json: Update script paths and add tag-songs command (+4/-3)

Documentation (2 files)
  • CLAUDE.md: Document tagging pipeline and workflows (+27/-0)
  • CLAUDE.md: Update crawling package description (+1/-1)

Formatting (2 files)
  • route.ts: Reformat TypeScript type definition (+5/-3)
  • route.ts: Consolidate array operations on single line (+1/-3)

Additional files (1 file)
  • crawlRecentTJ.ts [link]

@vercel
Contributor

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: singcode · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Apr 2, 2026 4:28pm

@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/describe

@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/review

@qodo-code-review

Review Summary by Qodo

Add AI-based automatic song tagging with OpenAI integration

✨ Enhancement


Walkthroughs

Description
• Add AI-based automatic song tagging using OpenAI GPT-4o-mini
• Create tagging cron script with Supabase integration
• Reorganize crawling scripts from crawling/ to cron/ directory
• Add new GitHub Actions workflow for daily song tagging
• Update documentation and improve workflow step names

File Changes

1. packages/crawling/src/utils/getSongTag.ts ✨ Enhancement +92/-0
   AI-powered song tag extraction utility

2. packages/crawling/src/cron/taggingSongs.ts ✨ Enhancement +59/-0
   Cron script for batch song tagging

3. packages/crawling/src/supabase/getDB.ts ✨ Enhancement +24/-0
   Add database queries for songs and tags

4. packages/crawling/src/supabase/postDB.ts ✨ Enhancement +12/-0
   Add song tags insertion function

5. packages/crawling/src/cron/crawlYoutube.ts Miscellaneous +1/-1
   Update import path after directory reorganization

6. packages/crawling/src/cron/crawlYoutubeVerify.ts Miscellaneous +2/-3
   Update import path and increase processing limit

7. .github/workflows/tagging_song.yml ⚙️ Configuration changes +43/-0
   New workflow for daily song tagging

8. .github/workflows/crawl_recent_tj.yml 📝 Documentation +1/-1
   Improve workflow step name clarity

9. .github/workflows/update_ky_youtube.yml 📝 Documentation +1/-1
   Improve workflow step name clarity

10. .github/workflows/verify_ky_youtube.yml 📝 Documentation +1/-1
    Improve workflow step name clarity

11. packages/crawling/package.json ⚙️ Configuration changes +4/-3
    Update script paths and add tag-songs command

12. packages/crawling/CLAUDE.md 📝 Documentation +27/-0
    Document tagging pipeline and workflows

13. CLAUDE.md 📝 Documentation +1/-1
    Update crawling package description

14. apps/web/src/app/api/search/route.ts Formatting +5/-3
    Reformat TypeScript type definition

15. apps/web/src/app/api/songs/thumb-up/route.ts Formatting +1/-3
    Simplify array sorting chain formatting

16. apps/web/public/sitemap-0.xml Miscellaneous +1/-1
    Update sitemap timestamp

17. packages/crawling/src/cron/crawlRecentTJ.ts Additional files +0/-0
    ...


@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/improve

@qodo-code-review

qodo-code-review bot commented Apr 2, 2026

Code Review by Qodo


New Review Started

This review has been superseded by a new analysis


ⓘ The new review experience is currently in Beta. Learn more


1 similar comment

@qodo-code-review

qodo-code-review bot commented Apr 2, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (1)



Action required

1. taggingSongs.ts stops at 5000 📎 Requirement gap ≡ Correctness
Description
The new tagging cron stops iterating after processing 5000 songs, which can leave the remaining
songs unprocessed when the song list exceeds that size. This violates the requirement to reliably
traverse the entire song list for per-song tag extraction.
Code

packages/crawling/src/cron/taggingSongs.ts[R17-21]

+// 2. Sequential traversal (test: run only 5 times)
+let processedCount = 0;
+for (const song of allSongs) {
+  if (processedCount >= 5000) break;
+  if (taggedSongIds.has(song.id)) {
Evidence
PR Compliance ID 2 requires explicit logic to traverse the entire song list; the added loop contains
a hard break at 5000 processed songs, so only a subset is processed when allSongs.length > 5000.

Implement song list iteration logic
packages/crawling/src/cron/taggingSongs.ts[17-21]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`packages/crawling/src/cron/taggingSongs.ts` breaks out of the main loop when `processedCount >= 5000`, which prevents processing the full song list.

## Issue Context
Compliance requires iterating through the entire song list predictably; a hard-coded cap causes unintentional partial processing as the dataset grows.

## Fix Focus Areas
- packages/crawling/src/cron/taggingSongs.ts[17-21]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
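The fix can be sketched as a small traversal helper in which the cap is an explicit, opt-in parameter instead of a hard-coded 5000, so production runs cover the full list by default. `tagAllSongs`, the `Song` shape, and the `tagSong` callback are illustrative stand-ins for the script's actual `autoTagSong`/`postSongTagsDB` pipeline, not the project's real API:

```typescript
type Song = { id: string; title: string };

// Traverse every song, skipping already-tagged ones. The limit defaults to
// "no cap", so the full list is processed unless a test run explicitly
// passes a small value.
async function tagAllSongs(
  allSongs: Song[],
  taggedSongIds: Set<string>,
  tagSong: (song: Song) => Promise<boolean>, // stands in for autoTagSong + postSongTagsDB
  limit: number = Infinity,
): Promise<number> {
  let processedCount = 0;
  for (const song of allSongs) {
    if (processedCount >= limit) break;
    if (taggedSongIds.has(song.id)) continue;
    const ok = await tagSong(song);
    if (ok) taggedSongIds.add(song.id); // avoid reprocessing within the same run
    processedCount++;
  }
  return processedCount;
}
```

A test run could then pass `limit: 5` explicitly, matching the intent of the original comment without leaving a stray cap in the production path.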


2. Incomplete tagged-song preload 🐞 Bug ≡ Correctness
Description
getSongTagSongIdsDB() loads only 50,000 rows from song_tags (a many-to-many mapping), so
taggingSongs.ts can miss many already-tagged songs and re-run OpenAI + attempt duplicate inserts for
them.
Code

packages/crawling/src/supabase/getDB.ts[R102-110]

+export async function getSongTagSongIdsDB(): Promise<Set<string>> {
+  const supabase = getClient();
+
+  const { data, error } = await supabase.from('song_tags').select('song_id').limit(50000);
+
+  if (error) throw error;
+
+  return new Set(data.map(row => row.song_id));
+}
Evidence
taggingSongs.ts relies on getSongTagSongIdsDB() to skip already-tagged songs, but
getSongTagSongIdsDB() applies limit(50000) to song_tags rows. Since song_tags is a (song_id, tag_id)
mapping and the pipeline assigns 1~4 tags per song, the mapping table’s row count grows much faster
than unique song count; limiting rows will exclude many tagged songs from the Set, breaking the skip
logic and causing reprocessing/duplicate inserts.

packages/crawling/src/supabase/getDB.ts[88-110]
packages/crawling/src/cron/taggingSongs.ts[11-24]
packages/crawling/CLAUDE.md[93-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`getSongTagSongIdsDB()` currently fetches only `limit(50000)` rows from `song_tags`. Because `song_tags` has multiple rows per song (1~4 tags per song), this limit does **not** represent “50k songs”; it represents “50k mappings”. As the table grows, many tagged songs will not be loaded into the Set and the tagging cron will re-tag them and attempt duplicate inserts.

## Issue Context
`taggingSongs.ts` uses this Set as the sole mechanism to skip already-tagged songs, so the correctness and completeness of this Set is critical.

## Fix Focus Areas
- packages/crawling/src/supabase/getDB.ts[102-110]
- packages/crawling/src/cron/taggingSongs.ts[11-24]

## Suggested fix
- Implement pagination over `song_tags` using `.range(from, to)` (or another paging strategy) until no rows are returned, adding `song_id` to a `Set`.
- Optionally order by `song_id` to make paging deterministic.
- Consider updating `taggedSongIds` in-memory after a successful insert (`taggedSongIds.add(song.id)`) to avoid any accidental reprocessing within the same run.

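The paging strategy the suggested fix describes might look like the following sketch. `fetchPage` is an injected stand-in for `supabase.from('song_tags').select('song_id').order('song_id').range(from, to)`, used here so the loop's termination logic is visible without a live database:

```typescript
type TagRow = { song_id: string };

// Page through the song_tags mapping table until a short page signals the
// end, collecting every song_id into a Set. Because song_tags holds 1-4
// rows per song, no fixed row limit can safely stand for "all tagged
// songs"; only exhaustive paging can.
async function loadTaggedSongIds(
  fetchPage: (from: number, to: number) => Promise<TagRow[]>,
  pageSize: number = 1000,
): Promise<Set<string>> {
  const taggedSongIds = new Set<string>();
  for (let from = 0; ; from += pageSize) {
    const rows = await fetchPage(from, from + pageSize - 1);
    for (const row of rows) taggedSongIds.add(row.song_id);
    if (rows.length < pageSize) break; // short (or empty) page: no more rows
  }
  return taggedSongIds;
}
```

Ordering by `song_id` keeps the pages deterministic across requests, as the review notes; duplicate `song_id` values from multi-tag songs simply collapse in the Set.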



Remediation recommended

3. Non-idempotent tag inserts 🐞 Bug ☼ Reliability
Description
postSongTagsDB() blindly inserts rows for all tagIds, and taggingSongs.ts does not enforce
uniqueness/size on AI output, so duplicate tag IDs (or reruns) can create duplicate mappings or fail
inserts depending on DB constraints.
Code

packages/crawling/src/supabase/postDB.ts[R55-65]

+export async function postSongTagsDB(songId: string, tagIds: number[]) {
+  const supabase = getClient();
+  const rows = tagIds.map(tagId => ({ song_id: songId, tag_id: tagId }));
+
+  const { error } = await supabase.from('song_tags').insert(rows);
+  if (error) {
+    console.error('postSongTagsDB error:', error);
+    return false;
+  }
+  return true;
+}
Evidence
The OpenAI response is parsed and returned directly as tag_ids without enforcing the documented
1~4 constraint or uniqueness, and the DB write path uses a plain .insert(rows) with no
upsert/on-conflict behavior. This makes the pipeline fragile to (a) duplicate IDs in model output
and (b) reprocessing of already-tagged songs (which is made more likely by the incomplete preload
issue).

packages/crawling/src/supabase/postDB.ts[55-65]
packages/crawling/src/utils/getSongTag.ts[46-88]
packages/crawling/src/cron/taggingSongs.ts[26-41]
packages/crawling/CLAUDE.md[111-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`postSongTagsDB()` performs a raw insert of `(song_id, tag_id)` rows and the caller passes model output through without sanitization. This can lead to duplicate rows (data integrity issue) or insert errors (pipeline instability), especially when reruns or partial preloads occur.

## Issue Context
The tagging pipeline expects 1~4 tags per song, but this is currently only a prompt instruction, not an enforced invariant.

## Fix Focus Areas
- packages/crawling/src/supabase/postDB.ts[55-65]
- packages/crawling/src/utils/getSongTag.ts[46-88]
- packages/crawling/src/cron/taggingSongs.ts[26-41]

## Suggested fix
- Before inserting, sanitize tag IDs:
 - Ensure `Array.isArray(tagIds)`
 - Filter to integers only
 - Deduplicate (`new Set`)
 - Enforce max length (e.g., `.slice(0, 4)`)
- Make DB write idempotent:
 - Prefer `upsert(rows, { onConflict: 'song_id,tag_id', ignoreDuplicates: true })` (or equivalent supported by your Supabase/PostgREST setup)
 - If upsert options aren’t available, consider inserting per row and ignoring duplicate-key errors explicitly.
- Optionally log when output is truncated/deduped for auditability.

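The sanitization half of the suggested fix can be isolated as a pure function along these lines. The name `sanitizeTagIds` and the default cap of 4 mirror the review's suggestion and are illustrative; the idempotent-write half (upsert with `onConflict`/`ignoreDuplicates`) still belongs in `postSongTagsDB` itself:

```typescript
// Normalize model output before any DB write: reject non-arrays, keep only
// integer IDs, deduplicate, and cap the count (the pipeline documents 1-4
// tags per song). Order of first appearance is preserved.
function sanitizeTagIds(tagIds: unknown, maxTags: number = 4): number[] {
  if (!Array.isArray(tagIds)) return [];
  const integers = tagIds.filter((id): id is number => Number.isInteger(id));
  return [...new Set(integers)].slice(0, maxTags);
}
```

Calling this in the cron script before `postSongTagsDB(songId, tagIds)` turns the 1-4 constraint from a prompt instruction into an enforced invariant, whatever the model returns.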


4. Browser close not awaited 🐞 Bug ☼ Reliability
Description
crawlYoutubeVerify.ts does not use try/finally for Puppeteer cleanup and calls browser.close()
without await, so errors during the now-longer 5000-iteration run can leak a Chrome process and
destabilize the workflow.
Code

packages/crawling/src/cron/crawlYoutubeVerify.ts[R48-51]

+  if (index >= 5000) break;
}

browser.close();
Evidence
The verify cron launches a browser at top-level, runs a long loop, and then calls browser.close()
without awaiting and without any finally block. With the PR increasing the loop cap to 5000, the
window for exceptions/interruptions grows, making missing cleanup more impactful.

packages/crawling/src/cron/crawlYoutubeVerify.ts[12-51]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Puppeteer browser lifecycle in `crawlYoutubeVerify.ts` is not protected by `try/finally`, and `browser.close()` is called without `await`. If an exception escapes the loop or the script terminates unexpectedly, Chrome can remain running and the job may hang or waste resources.

## Issue Context
This PR increases the maximum processed items from 2000 to 5000, increasing total runtime and the chance of mid-run errors.

## Fix Focus Areas
- packages/crawling/src/cron/crawlYoutubeVerify.ts[12-51]

## Suggested fix
- Wrap the main logic in `try { ... } finally { await browser.close(); }`.
- Ensure all early exits (break/return) still pass through the `finally`.
- Consider adding a top-level catch to log fatal errors and rethrow/exit with non-zero status.




Advisory comments

5. Unnecessary contents write 🐞 Bug ⛨ Security
Description
tagging_song.yml grants contents: write, but the job only checks out, installs deps, writes .env,
and runs the tagging script (no repository write step), increasing blast radius if the workflow
execution is compromised.
Code

.github/workflows/tagging_song.yml[R8-10]

+permissions:
+  contents: write # needed for push permissions
+
Evidence
The workflow explicitly requests contents: write, but the defined steps do not include any git
commit/push action, and the invoked tagging script only interacts with Supabase/OpenAI and logs
results. Reducing token permissions follows least-privilege and limits damage from supply-chain
attacks during pnpm install.

.github/workflows/tagging_song.yml[8-43]
packages/crawling/src/cron/taggingSongs.ts[1-59]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The tagging workflow requests `contents: write` but doesn’t perform repository write operations. This unnecessarily expands the permissions of `GITHUB_TOKEN`.

## Issue Context
The workflow runs on a schedule and executes `pnpm install` + a custom script, which is a common supply-chain risk surface. Least privilege reduces impact.

## Fix Focus Areas
- .github/workflows/tagging_song.yml[8-10]

## Suggested fix
- Remove the `permissions:` block entirely (defaults to minimal), or set `permissions: contents: read`.
- If future steps will push commits, add the write permission only in the workflow/job/step where it is required.




@qodo-code-review

PR Description updated to latest commit (3023c13)

Comment on lines +17 to +21
// 2. Sequential traversal (test: run only 5 times)
let processedCount = 0;
for (const song of allSongs) {
  if (processedCount >= 5000) break;
  if (taggedSongIds.has(song.id)) {

Action required

1. taggingSongs.ts stops at 5000 📎 Requirement gap ≡ Correctness

The new tagging cron stops iterating after processing 5000 songs, which can leave the remaining
songs unprocessed when the song list exceeds that size. This violates the requirement to reliably
traverse the entire song list for per-song tag extraction.
Agent Prompt
## Issue description
`packages/crawling/src/cron/taggingSongs.ts` breaks out of the main loop when `processedCount >= 5000`, which prevents processing the full song list.

## Issue Context
Compliance requires iterating through the entire song list predictably; a hard-coded cap causes unintentional partial processing as the dataset grows.

## Fix Focus Areas
- packages/crawling/src/cron/taggingSongs.ts[17-21]


Comment on lines +102 to +110
export async function getSongTagSongIdsDB(): Promise<Set<string>> {
  const supabase = getClient();

  const { data, error } = await supabase.from('song_tags').select('song_id').limit(50000);

  if (error) throw error;

  return new Set(data.map(row => row.song_id));
}

Action required

2. Incomplete tagged-song preload 🐞 Bug ≡ Correctness

getSongTagSongIdsDB() loads only 50,000 rows from song_tags (a many-to-many mapping), so
taggingSongs.ts can miss many already-tagged songs and re-run OpenAI + attempt duplicate inserts for
them.
Agent Prompt
## Issue description
`getSongTagSongIdsDB()` currently fetches only `limit(50000)` rows from `song_tags`. Because `song_tags` has multiple rows per song (1~4 tags per song), this limit does **not** represent “50k songs”; it represents “50k mappings”. As the table grows, many tagged songs will not be loaded into the Set and the tagging cron will re-tag them and attempt duplicate inserts.

## Issue Context
`taggingSongs.ts` uses this Set as the sole mechanism to skip already-tagged songs, so the correctness and completeness of this Set is critical.

## Fix Focus Areas
- packages/crawling/src/supabase/getDB.ts[102-110]
- packages/crawling/src/cron/taggingSongs.ts[11-24]

## Suggested fix
- Implement pagination over `song_tags` using `.range(from, to)` (or another paging strategy) until no rows are returned, adding `song_id` to a `Set`.
- Optionally order by `song_id` to make paging deterministic.
- Consider updating `taggedSongIds` in-memory after a successful insert (`taggedSongIds.add(song.id)`) to avoid any accidental reprocessing within the same run.


@GulSam00 GulSam00 merged commit c377757 into develop Apr 4, 2026
2 checks passed
@GulSam00 GulSam00 deleted the feat/173-extractSongTags branch April 4, 2026 14:30

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write crawling code for per-song tag extraction

1 participant