feat(datetime): prototype JVM UDF path for Hour/Minute/Second (engine=java) #4321

Draft
andygrove wants to merge 15 commits into apache:main from andygrove:datetime-jvm-udf-prototype

Conversation

@andygrove
Member

Which issue does this PR close?

Part of #4311.

Rationale for this change

Comet's native Rust implementations of hour, minute, and second incorrectly apply timezone conversion to TimestampNTZType inputs (#3180). Today this means Comet falls back to Spark for the NTZ case, losing native execution for the rest of the stage.

This PR prototypes the approach outlined in #4311: add an opt-in JVM UDF path that produces bit-exact Spark results for the affected expressions, in exchange for a JNI round-trip per native batch. It mirrors the engine-selection idiom proposed in the open regex PR (#4239), keeping the framing consistent across families. The default stays rust; users opt into java for full Spark compatibility.

The scope is intentionally narrow: three closely related expressions sharing a serde shape and a UDF base class. Once this engine-config model is reviewed, it extends naturally to TruncTimestamp, DateFormat, and the field-extraction expressions in a follow-up.
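The opt-in described above would be set through the usual Spark configuration mechanisms; a minimal sketch (the key and its values come from this PR, the `spark-sql` invocation is standard Spark):

```shell
# Opt into the JVM UDF path for Hour/Minute/Second for one session.
# Default is rust; java trades a JNI round-trip per batch for
# bit-exact Spark results on TimestampNTZType inputs.
spark-sql --conf spark.comet.exec.datetime.engine=java
```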

What changes are included in this PR?

  • New config spark.comet.exec.datetime.engine with values rust (default) and java. Lives in CometConf.scala.
  • New DateTimeFieldUDF abstract base in common/src/main/scala/org/apache/comet/udf/, implementing the Arrow vector iteration, null preservation, and TZ/NTZ branching using java.time. Concrete subclasses HourUDF, MinuteUDF, SecondUDF each supply a one-line extract(LocalDateTime): Int.
  • Fork in CometHour/CometMinute/CometSecond (in spark/src/main/scala/org/apache/comet/serde/datetime.scala): when engine=java, the serde builds a JvmScalarUdf proto pointing at the matching UDF class; when engine=rust, the existing native path is unchanged. A shared private DateTimeFieldUdfHelper.buildProto factors the proto construction.
  • The engine=java branch in getSupportLevel matches explicitly on TimestampType | TimestampNTZType and returns Unsupported for anything else, so a future input-type change in Spark falls back cleanly instead of crashing the UDF at runtime.
  • Compat doc: new "Engine Selection (experimental)" section in docs/source/user-guide/latest/compatibility/expressions/datetime.md.
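The TZ/NTZ branching that the `DateTimeFieldUDF` bullet describes can be sketched in plain `java.time` code. This is a hypothetical scalar reduction of the design, not the actual Comet class: the real base class iterates Arrow vectors and preserves nulls, and the `evalMicros` name and signature here are illustrative.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical sketch of the base-class shape; the real DateTimeFieldUDF
// lives in common/src/main/scala/org/apache/comet/udf/ and operates on
// Arrow vectors rather than single values.
abstract class DateTimeFieldUDF {
    // Each concrete subclass supplies a one-line field extraction.
    abstract int extract(LocalDateTime ldt);

    // Spark represents timestamps as microseconds since the Unix epoch.
    // sessionTz == null models the TimestampNTZType branch: no timezone
    // conversion, which is exactly what the rust path currently gets wrong.
    int evalMicros(long micros, ZoneId sessionTz) {
        Instant instant = Instant.EPOCH.plusNanos(micros * 1000L);
        ZoneId zone = (sessionTz != null) ? sessionTz : ZoneOffset.UTC;
        return extract(LocalDateTime.ofInstant(instant, zone));
    }
}

class HourUDF extends DateTimeFieldUDF {
    int extract(LocalDateTime ldt) { return ldt.getHour(); }
}

class MinuteUDF extends DateTimeFieldUDF {
    int extract(LocalDateTime ldt) { return ldt.getMinute(); }
}

class SecondUDF extends DateTimeFieldUDF {
    int extract(LocalDateTime ldt) { return ldt.getSecond(); }
}
```

The appeal of this shape is that adding a new field extraction is a one-line subclass, while the timezone handling is written (and tested) exactly once.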

No native Rust changes; the JvmScalarUdf proto and JvmScalarUdfExpr planner integration were already merged in #4232.
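The explicit input-type gating in `getSupportLevel` described above can be sketched as follows. The enum names and method shape are illustrative stand-ins, not the Comet serde API; the real code lives in `spark/src/main/scala/org/apache/comet/serde/datetime.scala`.

```java
// Illustrative stand-ins for the Comet serde types.
enum SupportLevel { COMPATIBLE, UNSUPPORTED }

enum SparkDataType { TIMESTAMP, TIMESTAMP_NTZ, STRING, DATE }

class DateTimeSerdeSketch {
    // engine=java gates on an explicit allow-list of input types, so a new
    // timestamp-like type in a future Spark release falls back to Spark
    // cleanly instead of crashing the UDF at runtime.
    static SupportLevel getSupportLevel(String engine, SparkDataType inputType) {
        if ("java".equals(engine)) {
            switch (inputType) {
                case TIMESTAMP:
                case TIMESTAMP_NTZ:
                    return SupportLevel.COMPATIBLE;
                default:
                    return SupportLevel.UNSUPPORTED;
            }
        }
        // engine=rust: the existing native-path gating is unchanged
        // (the real rust branch has its own TimestampNTZType handling,
        // elided in this sketch).
        return SupportLevel.COMPATIBLE;
    }
}
```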

This PR was scaffolded with the project's brainstorming and writing-plans skills.

How are these changes tested?

Three test suites, totaling 17 new tests:

  • DateTimeFieldUDFSuite (Scala unit, in the spark module): 8 tests exercising each UDF directly on Arrow vectors. Covers TZ + NTZ, multiple session timezones, and null preservation.
  • CometDatetimeJvmSuite (Spark integration): 8 tests with engine=java set in sparkConf. Each of hour/minute/second is checked across UTC, America/Los_Angeles, and Asia/Tokyo on both TimestampType and TimestampNTZType, plus a DST fall-back boundary test in LA and a plan-inspection test asserting native execution. All use checkSparkAnswerAndOperator to verify both correctness against Spark and that the operator ran natively.
  • CometDatetimeEngineDefaultSuite (Spark integration): 1 test asserting the config defaults to rust and that the rust path still runs natively under defaults. Regression guard for the existing path.

Also verified locally: spotless:apply is a no-op, cargo clippy --all-targets --workspace -- -D warnings is clean, and CometTemporalExpressionSuite (27 tests) still passes unchanged, confirming the rust default path is untouched.
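The DST fall-back boundary case exercised in CometDatetimeJvmSuite can be demonstrated with plain java.time, no Comet or Spark dependencies. On the US transition (first Sunday of November, which falls on 2026-11-01) the local time 01:30 occurs twice in America/Los_Angeles, and both instants must extract to the same hour and minute under TimestampType semantics. The class name here is illustrative, not part of the suite.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstOverlapDemo {
    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        // 2026-11-01 01:30 is ambiguous in LA: it occurs once in PDT
        // and again, one hour later, in PST.
        LocalDateTime ambiguous = LocalDateTime.of(2026, 11, 1, 1, 30);
        ZonedDateTime earlier =
            ZonedDateTime.ofLocal(ambiguous, la, null).withEarlierOffsetAtOverlap();
        ZonedDateTime later = earlier.withLaterOffsetAtOverlap();
        // Two distinct instants, one hour apart, same wall-clock fields.
        System.out.println(later.toEpochSecond() - earlier.toEpochSecond()); // 3600
        System.out.println(earlier.getHour() + ":" + earlier.getMinute());  // 1:30
        System.out.println(later.getHour() + ":" + later.getMinute());      // 1:30
    }
}
```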

andygrove added 15 commits May 14, 2026 08:14

Tests for comet-common code live in the spark module by convention; the
spark module already has ScalaTest wiring, while the common module
previously had only JUnit. This reverts the scalatest plumbing added to
common/pom.xml and relocates DateTimeFieldUDFSuite next to the other
Scala test suites.

DateTimeFieldUDFSuite has been relocated to the spark module (which
already has ScalaTest), so common no longer needs scalatest dependencies
or the scalatest-maven-plugin.

When engine=java, getSupportLevel previously returned Compatible
unconditionally, relying on the Spark analyzer to reject non-timestamp
input types for Hour/Minute/Second. If a future Spark version accepts a
new timestamp-like input type, DateTimeFieldUDF would throw
IllegalArgumentException at runtime instead of falling back cleanly. Now
the java branch matches on TimestampType | TimestampNTZType and returns
Unsupported otherwise, mirroring how the rust branch already gates on
TimestampNTZType. Also: align Compatible() call form across both
branches (was Compatible(None) in the java branch, Compatible() in the
rust branch). [skip ci]
@andygrove andygrove marked this pull request as draft May 14, 2026 15:37