
[Feat]: Write crawling code for per-song tag extraction (#173) #176

Merged
GulSam00 merged 3 commits into develop from feat/173-extractSongTags
Apr 4, 2026

Conversation

@GulSam00
Owner

@GulSam00 GulSam00 commented Apr 2, 2026

User description

📌 PR Title

[Feat]: Write crawling code for per-song tag extraction

📌 Changes

  • Add AI-based automatic song tag extraction (packages/crawling/src/utils/getSongTag.ts)
  • Add tag-extraction cron script (packages/crawling/src/cron/taggingSongs.ts)
  • Add Supabase DB query/insert utilities (packages/crawling/src/supabase/)
  • Move crawling scripts into the cron/ directory
  • Improve GitHub Actions workflow step names and add a tag-extraction workflow
  • Update CLAUDE.md

💬 Additional Notes


PR Type

Enhancement


Description

  • Add AI-based automatic song tagging using OpenAI GPT-4o-mini

  • Implement song tagging cron script with Supabase integration

  • Reorganize crawling scripts from crawling/ to cron/ directory

  • Add new GitHub Actions workflow for automated daily tagging

  • Update documentation and improve workflow step names


Diagram Walkthrough

flowchart LR
  A["Song Database"] -->|getSongsAllDB| B["Tagging Script"]
  C["Tagged Songs Cache"] -->|getSongTagSongIdsDB| B
  B -->|autoTagSong| D["OpenAI GPT-4o-mini"]
  D -->|tag IDs| B
  B -->|postSongTagsDB| E["song_tags Table"]
  F["GitHub Actions"] -->|Daily Schedule| B

File Walkthrough

Relevant files

Enhancement (4 files)
  • getSongTag.ts: AI-powered song tag extraction utility (+92/-0)
  • taggingSongs.ts: Cron script for batch song tagging (+59/-0)
  • getDB.ts: Add song and tag query functions (+24/-0)
  • postDB.ts: Add song tags insertion function (+12/-0)

Miscellaneous (3 files)
  • crawlYoutube.ts: Update import path after directory move (+1/-1)
  • crawlYoutubeVerify.ts: Update import path and increase limit (+2/-3)
  • sitemap-0.xml: Update sitemap timestamp (+1/-1)

Configuration changes (5 files)
  • tagging_song.yml: New workflow for automated song tagging (+43/-0)
  • crawl_recent_tj.yml: Improve workflow step name clarity (+1/-1)
  • update_ky_youtube.yml: Improve workflow step name clarity (+1/-1)
  • verify_ky_youtube.yml: Improve workflow step name clarity (+1/-1)
  • package.json: Update script paths and add tag-songs command (+4/-3)

Documentation (2 files)
  • CLAUDE.md: Document tagging pipeline and workflows (+27/-0)
  • CLAUDE.md: Update crawling package description (+1/-1)

Formatting (2 files)
  • route.ts: Reformat TypeScript type definition (+5/-3)
  • route.ts: Consolidate array operations on single line (+1/-3)

Additional files (1 file)
  • crawlRecentTJ.ts [link]

@vercel
Contributor

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: singcode · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Apr 2, 2026 4:28pm

@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/describe

@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/review

@qodo-code-review

Review Summary by Qodo

Add AI-based automatic song tagging with OpenAI integration

✨ Enhancement


Walkthroughs

Description
• Add AI-based automatic song tagging using OpenAI GPT-4o-mini
• Create tagging cron script with Supabase integration
• Reorganize crawling scripts from crawling/ to cron/ directory
• Add new GitHub Actions workflow for daily song tagging
• Update documentation and improve workflow step names

File Changes

1. packages/crawling/src/utils/getSongTag.ts ✨ Enhancement +92/-0
   AI-powered song tag extraction utility

2. packages/crawling/src/cron/taggingSongs.ts ✨ Enhancement +59/-0
   Cron script for batch song tagging

3. packages/crawling/src/supabase/getDB.ts ✨ Enhancement +24/-0
   Add database queries for songs and tags

4. packages/crawling/src/supabase/postDB.ts ✨ Enhancement +12/-0
   Add song tags insertion function

5. packages/crawling/src/cron/crawlYoutube.ts Miscellaneous +1/-1
   Update import path after directory reorganization

6. packages/crawling/src/cron/crawlYoutubeVerify.ts Miscellaneous +2/-3
   Update import path and increase processing limit

7. .github/workflows/tagging_song.yml ⚙️ Configuration changes +43/-0
   New workflow for daily song tagging

8. .github/workflows/crawl_recent_tj.yml 📝 Documentation +1/-1
   Improve workflow step name clarity

9. .github/workflows/update_ky_youtube.yml 📝 Documentation +1/-1
   Improve workflow step name clarity

10. .github/workflows/verify_ky_youtube.yml 📝 Documentation +1/-1
    Improve workflow step name clarity

11. packages/crawling/package.json ⚙️ Configuration changes +4/-3
    Update script paths and add tag-songs command

12. packages/crawling/CLAUDE.md 📝 Documentation +27/-0
    Document tagging pipeline and workflows

13. CLAUDE.md 📝 Documentation +1/-1
    Update crawling package description

14. apps/web/src/app/api/search/route.ts Formatting +5/-3
    Reformat TypeScript type definition

15. apps/web/src/app/api/songs/thumb-up/route.ts Formatting +1/-3
    Simplify array sorting chain formatting

16. apps/web/public/sitemap-0.xml Miscellaneous +1/-1
    Update sitemap timestamp

17. packages/crawling/src/cron/crawlRecentTJ.ts Additional files +0/-0
    ...


@GulSam00
Owner Author

GulSam00 commented Apr 2, 2026

/improve

@qodo-code-review

qodo-code-review bot commented Apr 2, 2026

Code Review by Qodo


New Review Started

This review has been superseded by a new analysis


ⓘ The new review experience is currently in Beta. Learn more


1 similar comment

@qodo-code-review

qodo-code-review bot commented Apr 2, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (1)



Action required

1. taggingSongs.ts stops at 5000 📎 Requirement gap ≡ Correctness
Description
The new tagging cron stops iterating after processing 5000 songs, which can leave the remaining
songs unprocessed when the song list exceeds that size. This violates the requirement to reliably
traverse the entire song list for per-song tag extraction.
Code

packages/crawling/src/cron/taggingSongs.ts[R17-21]

+// 2. Sequential traversal (test: run only 5 times)
+let processedCount = 0;
+for (const song of allSongs) {
+  if (processedCount >= 5000) break;
+  if (taggedSongIds.has(song.id)) {
Evidence
PR Compliance ID 2 requires explicit logic to traverse the entire song list; the added loop contains
a hard break at 5000 processed songs, so only a subset is processed when allSongs.length > 5000.

Implement song list iteration logic
packages/crawling/src/cron/taggingSongs.ts[17-21]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`packages/crawling/src/cron/taggingSongs.ts` breaks out of the main loop when `processedCount >= 5000`, which prevents processing the full song list.

## Issue Context
Compliance requires iterating through the entire song list predictably; a hard-coded cap causes unintentional partial processing as the dataset grows.

## Fix Focus Areas
- packages/crawling/src/cron/taggingSongs.ts[17-21]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
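The fix can be sketched as a small traversal helper in which the cap is an explicit, opt-in parameter instead of a hard-coded 5000, so production runs cover the full list by default. `tagAllSongs`, the `Song` shape, and the `tagSong` callback are illustrative stand-ins for the script's actual `autoTagSong`/`postSongTagsDB` pipeline, not the project's real API:

```typescript
type Song = { id: string; title: string };

// Traverse every song, skipping already-tagged ones. The limit defaults to
// "no cap", so the full list is processed unless a test run explicitly
// passes a small value.
async function tagAllSongs(
  allSongs: Song[],
  taggedSongIds: Set<string>,
  tagSong: (song: Song) => Promise<boolean>, // stands in for autoTagSong + postSongTagsDB
  limit: number = Infinity,
): Promise<number> {
  let processedCount = 0;
  for (const song of allSongs) {
    if (processedCount >= limit) break;
    if (taggedSongIds.has(song.id)) continue;
    const ok = await tagSong(song);
    if (ok) taggedSongIds.add(song.id); // avoid reprocessing within the same run
    processedCount++;
  }
  return processedCount;
}
```

A test run could then pass `limit: 5` explicitly, matching the intent of the original comment without leaving a stray cap in the production path.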


2. Incomplete tagged-song preload 🐞 Bug ≡ Correctness
Description
getSongTagSongIdsDB() loads only 50,000 rows from song_tags (a many-to-many mapping), so
taggingSongs.ts can miss many already-tagged songs and re-run OpenAI + attempt duplicate inserts for
them.
Code

packages/crawling/src/supabase/getDB.ts[R102-110]

+export async function getSongTagSongIdsDB(): Promise<Set<string>> {
+  const supabase = getClient();
+
+  const { data, error } = await supabase.from('song_tags').select('song_id').limit(50000);
+
+  if (error) throw error;
+
+  return new Set(data.map(row => row.song_id));
+}
Evidence
taggingSongs.ts relies on getSongTagSongIdsDB() to skip already-tagged songs, but
getSongTagSongIdsDB() applies limit(50000) to song_tags rows. Since song_tags is a (song_id, tag_id)
mapping and the pipeline assigns 1~4 tags per song, the mapping table’s row count grows much faster
than unique song count; limiting rows will exclude many tagged songs from the Set, breaking the skip
logic and causing reprocessing/duplicate inserts.

packages/crawling/src/supabase/getDB.ts[88-110]
packages/crawling/src/cron/taggingSongs.ts[11-24]
packages/crawling/CLAUDE.md[93-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`getSongTagSongIdsDB()` currently fetches only `limit(50000)` rows from `song_tags`. Because `song_tags` has multiple rows per song (1~4 tags per song), this limit does **not** represent “50k songs”; it represents “50k mappings”. As the table grows, many tagged songs will not be loaded into the Set and the tagging cron will re-tag them and attempt duplicate inserts.

## Issue Context
`taggingSongs.ts` uses this Set as the sole mechanism to skip already-tagged songs, so the correctness and completeness of this Set is critical.

## Fix Focus Areas
- packages/crawling/src/supabase/getDB.ts[102-110]
- packages/crawling/src/cron/taggingSongs.ts[11-24]

## Suggested fix
- Implement pagination over `song_tags` using `.range(from, to)` (or another paging strategy) until no rows are returned, adding `song_id` to a `Set`.
- Optionally order by `song_id` to make paging deterministic.
- Consider updating `taggedSongIds` in-memory after a successful insert (`taggedSongIds.add(song.id)`) to avoid any accidental reprocessing within the same run.

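The paging strategy the suggested fix describes might look like the following sketch. `fetchPage` is an injected stand-in for `supabase.from('song_tags').select('song_id').order('song_id').range(from, to)`, used here so the loop's termination logic is visible without a live database:

```typescript
type TagRow = { song_id: string };

// Page through the song_tags mapping table until a short page signals the
// end, collecting every song_id into a Set. Because song_tags holds 1-4
// rows per song, no fixed row limit can safely stand for "all tagged
// songs"; only exhaustive paging can.
async function loadTaggedSongIds(
  fetchPage: (from: number, to: number) => Promise<TagRow[]>,
  pageSize: number = 1000,
): Promise<Set<string>> {
  const taggedSongIds = new Set<string>();
  for (let from = 0; ; from += pageSize) {
    const rows = await fetchPage(from, from + pageSize - 1);
    for (const row of rows) taggedSongIds.add(row.song_id);
    if (rows.length < pageSize) break; // short (or empty) page: no more rows
  }
  return taggedSongIds;
}
```

Ordering by `song_id` keeps the pages deterministic across requests, as the review notes; duplicate `song_id` values from multi-tag songs simply collapse in the Set.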



Remediation recommended

3. Non-idempotent tag inserts 🐞 Bug ☼ Reliability
Description
postSongTagsDB() blindly inserts rows for all tagIds, and taggingSongs.ts does not enforce
uniqueness/size on AI output, so duplicate tag IDs (or reruns) can create duplicate mappings or fail
inserts depending on DB constraints.
Code

packages/crawling/src/supabase/postDB.ts[R55-65]

+export async function postSongTagsDB(songId: string, tagIds: number[]) {
+  const supabase = getClient();
+  const rows = tagIds.map(tagId => ({ song_id: songId, tag_id: tagId }));
+
+  const { error } = await supabase.from('song_tags').insert(rows);
+  if (error) {
+    console.error('postSongTagsDB error:', error);
+    return false;
+  }
+  return true;
+}
Evidence
The OpenAI response is parsed and returned directly as tag_ids without enforcing the documented
1~4 constraint or uniqueness, and the DB write path uses a plain .insert(rows) with no
upsert/on-conflict behavior. This makes the pipeline fragile to (a) duplicate IDs in model output
and (b) reprocessing of already-tagged songs (which is made more likely by the incomplete preload
issue).

packages/crawling/src/supabase/postDB.ts[55-65]
packages/crawling/src/utils/getSongTag.ts[46-88]
packages/crawling/src/cron/taggingSongs.ts[26-41]
packages/crawling/CLAUDE.md[111-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`postSongTagsDB()` performs a raw insert of `(song_id, tag_id)` rows and the caller passes model output through without sanitization. This can lead to duplicate rows (data integrity issue) or insert errors (pipeline instability), especially when reruns or partial preloads occur.

## Issue Context
The tagging pipeline expects 1~4 tags per song, but this is currently only a prompt instruction, not an enforced invariant.

## Fix Focus Areas
- packages/crawling/src/supabase/postDB.ts[55-65]
- packages/crawling/src/utils/getSongTag.ts[46-88]
- packages/crawling/src/cron/taggingSongs.ts[26-41]

## Suggested fix
- Before inserting, sanitize tag IDs:
 - Ensure `Array.isArray(tagIds)`
 - Filter to integers only
 - Deduplicate (`new Set`)
 - Enforce max length (e.g., `.slice(0, 4)`)
- Make DB write idempotent:
 - Prefer `upsert(rows, { onConflict: 'song_id,tag_id', ignoreDuplicates: true })` (or equivalent supported by your Supabase/PostgREST setup)
 - If upsert options aren’t available, consider inserting per row and ignoring duplicate-key errors explicitly.
- Optionally log when output is truncated/deduped for auditability.

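The sanitization half of the suggested fix can be isolated as a pure function along these lines. The name `sanitizeTagIds` and the default cap of 4 mirror the review's suggestion and are illustrative; the idempotent-write half (upsert with `onConflict`/`ignoreDuplicates`) still belongs in `postSongTagsDB` itself:

```typescript
// Normalize model output before any DB write: reject non-arrays, keep only
// integer IDs, deduplicate, and cap the count (the pipeline documents 1-4
// tags per song). Order of first appearance is preserved.
function sanitizeTagIds(tagIds: unknown, maxTags: number = 4): number[] {
  if (!Array.isArray(tagIds)) return [];
  const integers = tagIds.filter((id): id is number => Number.isInteger(id));
  return [...new Set(integers)].slice(0, maxTags);
}
```

Calling this in the cron script before `postSongTagsDB(songId, tagIds)` turns the 1-4 constraint from a prompt instruction into an enforced invariant, whatever the model returns.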


4. Browser close not awaited 🐞 Bug ☼ Reliability
Description
crawlYoutubeVerify.ts does not use try/finally for Puppeteer cleanup and calls browser.close()
without await, so errors during the now-longer 5000-iteration run can leak a Chrome process and
destabilize the workflow.
Code

packages/crawling/src/cron/crawlYoutubeVerify.ts[R48-51]

+  if (index >= 5000) break;
}

browser.close();
Evidence
The verify cron launches a browser at top-level, runs a long loop, and then calls browser.close()
without awaiting and without any finally block. With the PR increasing the loop cap to 5000, the
window for exceptions/interruptions grows, making missing cleanup more impactful.

packages/crawling/src/cron/crawlYoutubeVerify.ts[12-51]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Puppeteer browser lifecycle in `crawlYoutubeVerify.ts` is not protected by `try/finally`, and `browser.close()` is called without `await`. If an exception escapes the loop or the script terminates unexpectedly, Chrome can remain running and the job may hang or waste resources.

## Issue Context
This PR increases the maximum processed items from 2000 to 5000, increasing total runtime and the chance of mid-run errors.

## Fix Focus Areas
- packages/crawling/src/cron/crawlYoutubeVerify.ts[12-51]

## Suggested fix
- Wrap the main logic in `try { ... } finally { await browser.close(); }`.
- Ensure all early exits (break/return) still pass through the `finally`.
- Consider adding a top-level catch to log fatal errors and rethrow/exit with non-zero status.




Advisory comments

5. Unnecessary contents write 🐞 Bug ⛨ Security
Description
tagging_song.yml grants contents: write, but the job only checks out, installs deps, writes .env,
and runs the tagging script (no repository write step), increasing blast radius if the workflow
execution is compromised.
Code

.github/workflows/tagging_song.yml[R8-10]

+permissions:
+  contents: write # needed for push permissions
+
Evidence
The workflow explicitly requests contents: write, but the defined steps do not include any git
commit/push action, and the invoked tagging script only interacts with Supabase/OpenAI and logs
results. Reducing token permissions follows least-privilege and limits damage from supply-chain
attacks during pnpm install.

.github/workflows/tagging_song.yml[8-43]
packages/crawling/src/cron/taggingSongs.ts[1-59]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The tagging workflow requests `contents: write` but doesn’t perform repository write operations. This unnecessarily expands the permissions of `GITHUB_TOKEN`.

## Issue Context
The workflow runs on a schedule and executes `pnpm install` + a custom script, which is a common supply-chain risk surface. Least privilege reduces impact.

## Fix Focus Areas
- .github/workflows/tagging_song.yml[8-10]

## Suggested fix
- Remove the `permissions:` block entirely (defaults to minimal), or set `permissions: contents: read`.
- If future steps will push commits, add the write permission only in the workflow/job/step where it is required.




@qodo-code-review

PR Description updated to latest commit (3023c13)

Comment on lines +17 to +21
// 2. Sequential traversal (test: run only 5 times)
let processedCount = 0;
for (const song of allSongs) {
  if (processedCount >= 5000) break;
  if (taggedSongIds.has(song.id)) {

Action required

1. taggingSongs.ts stops at 5000 📎 Requirement gap ≡ Correctness

The new tagging cron stops iterating after processing 5000 songs, which can leave the remaining
songs unprocessed when the song list exceeds that size. This violates the requirement to reliably
traverse the entire song list for per-song tag extraction.
Agent Prompt
## Issue description
`packages/crawling/src/cron/taggingSongs.ts` breaks out of the main loop when `processedCount >= 5000`, which prevents processing the full song list.

## Issue Context
Compliance requires iterating through the entire song list predictably; a hard-coded cap causes unintentional partial processing as the dataset grows.

## Fix Focus Areas
- packages/crawling/src/cron/taggingSongs.ts[17-21]


Comment on lines +102 to +110
export async function getSongTagSongIdsDB(): Promise<Set<string>> {
  const supabase = getClient();

  const { data, error } = await supabase.from('song_tags').select('song_id').limit(50000);

  if (error) throw error;

  return new Set(data.map(row => row.song_id));
}

Action required

2. Incomplete tagged-song preload 🐞 Bug ≡ Correctness

getSongTagSongIdsDB() loads only 50,000 rows from song_tags (a many-to-many mapping), so
taggingSongs.ts can miss many already-tagged songs and re-run OpenAI + attempt duplicate inserts for
them.
Agent Prompt
## Issue description
`getSongTagSongIdsDB()` currently fetches only `limit(50000)` rows from `song_tags`. Because `song_tags` has multiple rows per song (1~4 tags per song), this limit does **not** represent “50k songs”; it represents “50k mappings”. As the table grows, many tagged songs will not be loaded into the Set and the tagging cron will re-tag them and attempt duplicate inserts.

## Issue Context
`taggingSongs.ts` uses this Set as the sole mechanism to skip already-tagged songs, so the correctness and completeness of this Set is critical.

## Fix Focus Areas
- packages/crawling/src/supabase/getDB.ts[102-110]
- packages/crawling/src/cron/taggingSongs.ts[11-24]

## Suggested fix
- Implement pagination over `song_tags` using `.range(from, to)` (or another paging strategy) until no rows are returned, adding `song_id` to a `Set`.
- Optionally order by `song_id` to make paging deterministic.
- Consider updating `taggedSongIds` in-memory after a successful insert (`taggedSongIds.add(song.id)`) to avoid any accidental reprocessing within the same run.


@GulSam00 GulSam00 merged commit c377757 into develop Apr 4, 2026
2 checks passed
@GulSam00 GulSam00 deleted the feat/173-extractSongTags branch April 4, 2026 14:30

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write crawling code for per-song tag extraction

1 participant