#724 Switch the behavior of Hive repair table on reruns. Do it only if explicitly asked. #725

Merged
yruslan merged 2 commits into main from feature/724-hive-re-create-only-on-schema-change
Mar 23, 2026

yruslan (Collaborator) commented Mar 20, 2026

Closes #724

Summary by CodeRabbit

  • New Features

    • Added a CLI flag to force Hive table recreation: --force-recreate-hive-tables
    • New runtime configuration key pramen.runtime.hive.force.recreate (defaults to false)
  • Tests

    • Added/updated tests to cover the new CLI option and runtime flag behavior, including cases for initial table creation and forced re-creation

coderabbitai bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b93c2df-7eac-41ec-9f5f-82dd924c3480

📥 Commits

Reviewing files that changed from the base of the PR and between c6a2ed6 and 59a478e.

📒 Files selected for processing (3)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/mocks/job/JobSpy.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/task/TaskRunnerBaseSuite.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala

Walkthrough

Introduces a new runtime flag pramen.runtime.hive.force.recreate (default false) wired through CLI, runtime config, and task runner; task runner now uses this flag to decide Hive table force-recreation instead of basing that decision on rerun reason.
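
As a rough illustration of the wiring described in the walkthrough, the sketch below models the configuration as a plain `Map` rather than the real Typesafe Config object. Only the key name `pramen.runtime.hive.force.recreate` and its default of false come from the PR; the object `ForceRecreateFlagExample` and both helper methods are hypothetical names invented for this example.

```scala
object ForceRecreateFlagExample {
  // Key name taken from the PR summary; everything else is a simplified stand-in.
  val ForceRecreateKey = "pramen.runtime.hive.force.recreate"

  /** Hypothetical model: an optional CLI value, when present, overrides the config entry. */
  def applyCmdLine(conf: Map[String, Boolean], cliFlag: Option[Boolean]): Map[String, Boolean] =
    cliFlag.fold(conf)(value => conf.updated(ForceRecreateKey, value))

  /** Reads the flag, falling back to the documented default of false when the key is absent. */
  def forceReCreateHiveTables(conf: Map[String, Boolean]): Boolean =
    conf.getOrElse(ForceRecreateKey, false)
}
```

Under these assumptions, omitting the CLI flag leaves the configuration untouched, so the flag resolves to false unless it is set elsewhere.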

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration & CLI: pramen/core/src/main/resources/reference.conf, pramen/core/src/main/scala/za/co/absa/pramen/core/app/config/RuntimeConfig.scala, pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala | Added pramen.runtime.hive.force.recreate = false; added forceReCreateHiveTables: Boolean to RuntimeConfig; added CLI flag --force-recreate-hive-tables and plumbing to write it into runtime config. |
| Task Execution Logic: pramen/core/src/main/scala/za/co/absa/pramen/core/runner/task/TaskRunnerBase.scala | Replaced rerun-based recreation decision with check against runtimeConfig.forceReCreateHiveTables; recreation still occurs on schema diffs when appropriate. |
| Tests & Mocks: pramen/core/src/test/scala/za/co/absa/pramen/core/RuntimeConfigFactory.scala, pramen/core/src/test/scala/za/co/absa/pramen/core/cmd/CmdLineConfigSuite.scala, pramen/core/src/test/scala/za/co/absa/pramen/core/mocks/job/JobSpy.scala, pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/task/TaskRunnerBaseSuite.scala | Test factory extended to accept forceReCreateHiveTables; added CLI parsing test for the new flag; JobSpy records recreateHiveTable calls; task-runner tests updated/extended to assert behavior when forceReCreateHiveTables is true/false (new parameterization and assertions). |
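
The changed decision in the task runner can be summarized with a small standalone sketch. It mirrors the boolean expression quoted in the diff for TaskRunnerBase.scala, but the object `HiveRecreateDecision`, the method `shouldRecreate`, and the use of `Seq[String]` for schema changes are assumptions made for illustration and are not part of Pramen.

```scala
// Illustrative sketch only; the real logic lives in TaskRunnerBase.scala.
object HiveRecreateDecision {
  /**
    * Hypothetical helper: a Hive table is re-created when the schema changed
    * before or after the transformation, or when re-creation is explicitly forced.
    * The rerun reason no longer takes part in the decision.
    */
  def shouldRecreate(schemaChangesBefore: Seq[String],
                     schemaChangesAfter: Seq[String],
                     forceReCreateHiveTables: Boolean): Boolean =
    schemaChangesBefore.nonEmpty || schemaChangesAfter.nonEmpty || forceReCreateHiveTables
}
```

With this model, a rerun that has no schema changes and no forced flag evaluates to false, which is the behavior the PR aims for.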

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Arguments
    participant CmdLine as CmdLineConfig
    participant Config as RuntimeConfig
    participant TaskRunner as TaskRunnerBase
    participant Hive as Hive Table Operations

    CLI->>CmdLine: --force-recreate-hive-tables
    CmdLine->>CmdLine: parse -> forceReCreateHiveTables = true
    CmdLine->>Config: applyCmdLineToConfig(forceReCreateHiveTables)
    Config->>TaskRunner: expose runtimeConfig.forceReCreateHiveTables
    TaskRunner->>TaskRunner: determine recreate flag (config or schemaChanged)
    alt runtimeConfig.forceReCreateHiveTables == true
        TaskRunner->>Hive: createOrRefreshHiveTable(schema, date, recreate=true)
    else
        TaskRunner->>Hive: createOrRefreshHiveTable(schema, date, recreate=schemaChanged)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A tiny flag hops in, quiet and neat,
It tells the runner when old tables to meet,
Now reruns won't clatter the hive in the night,
Only schemas that change get the recreate light,
Hoppity-hop—tests confirm all is right!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: switching Hive repair behavior from automatic on reruns to explicit request, matching the PR's core objective. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue #724 by adding a runtime flag to control Hive table recreation, changing the logic from automatic reruns to only when explicitly requested or schema changes. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing the Hive table recreation control feature: configuration, runtime config, CLI flag, logic updates, test factory, and test coverage. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala (1)

141-142: Minor: Inconsistent variable naming.

The local variable forcereCreateHiveTables uses lowercase 'r' in "recreate", while the field and constant use forceReCreateHiveTables with uppercase 'R'. Consider aligning the casing for consistency.

✏️ Suggested fix
-    for (forcereCreateHiveTables <- cmd.forceReCreateHiveTables)
-      accumulatedConfig = accumulatedConfig.withValue(FORCE_RECREATE_HIVE_TABLES, ConfigValueFactory.fromAnyRef(forcereCreateHiveTables))
+    for (forceReCreateHiveTables <- cmd.forceReCreateHiveTables)
+      accumulatedConfig = accumulatedConfig.withValue(FORCE_RECREATE_HIVE_TABLES, ConfigValueFactory.fromAnyRef(forceReCreateHiveTables))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala`
around lines 141 - 142, Rename the local pattern variable
forcereCreateHiveTables to match the casing used elsewhere
(forceReCreateHiveTables) so naming is consistent; update the for-comprehension
binding in CmdLineConfig (the for (...) <- cmd.forceReCreateHiveTables) and any
references such as accumulatedConfig.withValue(FORCE_RECREATE_HIVE_TABLES,
ConfigValueFactory.fromAnyRef(...)) to use forceReCreateHiveTables.
pramen/core/src/main/scala/za/co/absa/pramen/core/app/config/RuntimeConfig.scala (1)

168-169: Minor: Extra space in assignment.

There's a double space before ConfigUtils which appears to be a typo.

✏️ Suggested fix
       maxAttempts,
-      forceReCreateHiveTables =  ConfigUtils.getOptionBoolean(conf, FORCE_RECREATE_HIVE_TABLES).getOrElse(false)
+      forceReCreateHiveTables = ConfigUtils.getOptionBoolean(conf, FORCE_RECREATE_HIVE_TABLES).getOrElse(false)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/main/scala/za/co/absa/pramen/core/app/config/RuntimeConfig.scala`
around lines 168 - 169, Fix the minor whitespace typo in RuntimeConfig.scala by
removing the extra space before the ConfigUtils call in the
forceReCreateHiveTables assignment so it reads with a single space between the
equals sign and the call (symbol: forceReCreateHiveTables and
ConfigUtils.getOptionBoolean). Ensure no other spacing changes are introduced
around that assignment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22f7ec76-144f-4f47-bca9-b088bc7b0f5e

📥 Commits

Reviewing files that changed from the base of the PR and between 253f431 and c6a2ed6.

📒 Files selected for processing (6)
  • pramen/core/src/main/resources/reference.conf
  • pramen/core/src/main/scala/za/co/absa/pramen/core/app/config/RuntimeConfig.scala
  • pramen/core/src/main/scala/za/co/absa/pramen/core/cmd/CmdLineConfig.scala
  • pramen/core/src/main/scala/za/co/absa/pramen/core/runner/task/TaskRunnerBase.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/RuntimeConfigFactory.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/cmd/CmdLineConfigSuite.scala

Comment on lines 413 to 415
       val hiveWarnings = if (task.job.outputTable.hiveTable.nonEmpty) {
-        val recreate = schemaChangesBeforeTransform.nonEmpty || schemaChangesAfterTransform.nonEmpty || task.reason == TaskRunReason.Rerun
+        val recreate = schemaChangesBeforeTransform.nonEmpty || schemaChangesAfterTransform.nonEmpty || runtimeConfig.forceReCreateHiveTables
         task.job.createOrRefreshHiveTable(dfTransformed.schema, task.infoDate, recreate)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for tests that verify the recreate parameter value in createOrRefreshHiveTable calls

echo "=== Searching for tests that check recreate parameter ==="
rg -n "recreate" --type scala pramen/core/src/test/

echo ""
echo "=== Check JobSpy implementation for recreate handling ==="
rg -n -A5 "createOrRefreshHiveTable" pramen/core/src/test/scala/za/co/absa/pramen/core/mocks/job/JobSpy.scala

Repository: AbsaOSS/pramen

Length of output: 3190


🏁 Script executed:

#!/bin/bash
echo "=== Search for TaskRunnerBaseSuite ==="
fd -t f "TaskRunnerBaseSuite" pramen/core/src/test/

echo ""
echo "=== Search for tests that verify recreate behavior based on schema changes or rerun reason ==="
rg -n "schemaChanges|forceReCreateHiveTables|recreate.*true|recreate.*false" --type scala pramen/core/src/test/ | grep -i "task.*runner\|runner.*test" | head -30

echo ""
echo "=== Check if TaskRunnerBaseSuite exists and its content ==="
find pramen/core/src/test -name "*TaskRunner*" -type f

Repository: AbsaOSS/pramen

Length of output: 812


🏁 Script executed:

#!/bin/bash
echo "=== Read TaskRunnerBaseSuite to examine test coverage ==="
wc -l pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/task/TaskRunnerBaseSuite.scala

echo ""
echo "=== Show content with line numbers ==="
cat -n pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/task/TaskRunnerBaseSuite.scala

Repository: AbsaOSS/pramen

Length of output: 33640


Core behavioral change looks correct, but test coverage gap should be addressed.

The change correctly implements the PR objective: Hive table recreation now occurs only when schema changes are detected OR when explicitly requested via runtimeConfig.forceReCreateHiveTables, rather than automatically on every rerun.

However, existing tests only verify that createOrRefreshHiveTable is called, not the value of the recreate parameter. Add tests that explicitly verify:

  1. recreate = false when isRerun = true, forceReCreateHiveTables = false, and no schema changes
  2. recreate = true when forceReCreateHiveTables = true

This ensures the behavioral change is properly validated and prevents regressions.
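
One possible shape for such tests is sketched below. It is not the project's actual test code: `HiveTableSpy` is a hypothetical stand-in for a job spy that records only the `recreate` argument (the real `createOrRefreshHiveTable` also takes a schema and an info date), and `runHiveStep` is a simplified model of the runner step rather than the TaskRunnerBase implementation.

```scala
import scala.collection.mutable.ListBuffer

object RecreateSpyExample {
  /** Hypothetical spy: remembers the `recreate` argument of every call. */
  class HiveTableSpy {
    val recreateCalls: ListBuffer[Boolean] = ListBuffer.empty[Boolean]

    def createOrRefreshHiveTable(recreate: Boolean): Unit = recreateCalls += recreate
  }

  /**
    * Simplified model of the runner step. `isRerun` is accepted but deliberately
    * ignored, reflecting that the rerun reason no longer forces re-creation.
    */
  def runHiveStep(spy: HiveTableSpy,
                  isRerun: Boolean,
                  schemaChanged: Boolean,
                  forceReCreateHiveTables: Boolean): Unit =
    spy.createOrRefreshHiveTable(recreate = schemaChanged || forceReCreateHiveTables)
}
```

A test built on this idea would run the step once per scenario and compare `recreateCalls` with the expected value, instead of only checking that the method was invoked.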

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/main/scala/za/co/absa/pramen/core/runner/task/TaskRunnerBase.scala`
around lines 413 - 415, Test coverage is missing for the new recreate logic in
TaskRunnerBase: add unit tests that call the code path that reaches
task.job.createOrRefreshHiveTable and assert the boolean passed for the recreate
parameter; specifically, add one test where isRerun = true,
runtimeConfig.forceReCreateHiveTables = false and both
schemaChangesBeforeTransform and schemaChangesAfterTransform are empty and
assert createOrRefreshHiveTable was called with recreate = false, and another
where runtimeConfig.forceReCreateHiveTables = true and assert recreate = true;
target the TaskRunnerBase behavior (mock task.job and verify the
createOrRefreshHiveTable(...) call and its recreate argument) so changes to
recreate logic are validated.

github-actions bot commented Mar 20, 2026

Unit Test Coverage

| Metric | Coverage |
|---|---|
| Overall Project | 84.4% 🍏 |
| Files changed | 95.51% 🍏 |

| Module | Coverage |
|---|---|
| pramen:core Jacoco Report | 86.36% 🍏 |

Files

| Module | File | Coverage |
|---|---|---|
| pramen:core Jacoco Report | CmdLineConfig.scala | 95.17% -0.67% 🍏 |
| | RuntimeConfig.scala | 92.22% 🍏 |
| | TaskRunnerBase.scala | 82.74% 🍏 |

@yruslan yruslan merged commit 44ea912 into main Mar 23, 2026
7 checks passed
@yruslan yruslan deleted the feature/724-hive-re-create-only-on-schema-change branch March 23, 2026 07:54

Development

Successfully merging this pull request may close these issues.

Do not re-create Hive tables on rerun, only when schema has changed

1 participant