
Add support for dfdl:lengthKind="endOfParent" #1652

Open
olabusayoT wants to merge 1 commit into apache:main from olabusayoT:daf-238-endOfParent

@olabusayoT
Contributor

  • Implemented logic to handle elements with dfdl:lengthKind="endOfParent".
  • Added validations to enforce schema/spec constraints and error conditions specific to endOfParent lengthKind.
  • Introduced determination of effective length units for parent elements and related error checks.
  • Removed NYI errors.
  • Added tests for endOfParent elements with different lengthKinds, including nested endOfParent.

DAFFODIL-238

@olabusayoT force-pushed the daf-238-endOfParent branch from 332415b to 7e7bd04 on April 8, 2026 18:33
@olabusayoT changed the title from 'App support for dfdl:lengthKind="endOfParent"' to 'Add support for dfdl:lengthKind="endOfParent"' on Apr 8, 2026
Member

@stevedlawrence left a comment

Looks great. I'm surprised how much of this we kind of already supported with just some tweaks.

notYetImplemented("lengthKind='endOfParent' for complex type")
case LengthKind.EndOfParent =>
notYetImplemented("lengthKind='endOfParent' for simple type")
// per DFDL Spec 9.3.2, endOfParent is already positioned at parent's end so length is zero
Member

This comment is a bit confusing to me. This function is supposed to return whether or not we statically know if this element must have non-zero length. I imagine we can rarely statically know that for endOfParent elements, so I think returning false here is correct. But the comment kind of makes it sound like the length is always zero, which kind of contradicts that.

Reading this portion of the spec (which this comment copies), I think the spec is talking about the runtime evaluation of whether or not a field is zero length. I believe the spec is just saying that an endOfParent element has a zero-length representation if it is already at the parent's end (i.e. bitLimit == bitPosition). Since this is more about runtime, I'm not sure this comment belongs here, and removing it might avoid that confusion.
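
To make the static versus runtime distinction concrete, here is a minimal sketch. `ToyState`, `isZeroLengthAtRuntime`, and `staticallyKnownNonZeroLength` are hypothetical names made up for illustration and are not Daffodil's actual API; the string-based `lengthKind` is also just an assumption to keep the example self-contained.

```scala
// Toy stand-in for parser state (hypothetical, not Daffodil's real parser state).
final case class ToyState(bitPos: Long, bitLimit: Option[Long])

// Runtime question: is the endOfParent element's representation zero length?
// It is when we are already positioned at the parent's end (bitLimit == bitPos).
def isZeroLengthAtRuntime(s: ToyState): Boolean =
  s.bitLimit.contains(s.bitPos)

// Compile-time question: do we statically know the element must be non-zero length?
// In this sketch only a constant, positive explicit length qualifies; endOfParent
// falls into the default case because its length depends on the runtime position.
def staticallyKnownNonZeroLength(lengthKind: String, constLength: Option[Long]): Boolean =
  (lengthKind, constLength) match {
    case ("explicit", Some(n)) => n > 0
    case _ => false
  }
```

The first function can only be answered with a parse position in hand, which is why a comment about it reads oddly next to the second one.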

}

final lazy val immediatelyEnclosingElementParent: Option[ElementBase] = {
val p = optLexicalParent.flatMap {
Member

Note that I think the lexical parent chain does not extend past global decls, so if a global decl has endOfParent then I'm not sure we will correctly check EOP restrictions for anything that references that decl. I'm wondering if the checks need to go down instead of up?

For example, maybe an element needs to check if it has properties that would disallow children with lengthKind EOP and, if so, check if any children are EOP? Or check if any immediate children have EOP, and if so then check if they are compatible?

case e: ElementBase => Some(e)
case ge: GlobalElementDecl => Some(ge.asRoot)
case s: SequenceTermBase => s.immediatelyEnclosingElementParent
case c: ChoiceTermBase => c.immediatelyEnclosingElementParent
Member

Do we need to return the choice in some cases? It looks like the logic for EOP sometimes cares about the choice so I'm not sure we can bypass this?

</tdml:dfdlInfoset>
</tdml:infoset>
</tdml:parserTestCase>

Member

Some lines from the spec that I'm not sure I saw tests for:

> The dfdl:lengthKind 'endOfParent' can also be used on the document root to allow the last element to consume the data up to the end of the data stream.
I assume this means something like:

<element name="root" dfdl:lengthKind="endOfParent">
  <complexType>
    <sequence>
      <element name="field1" ... />
      <element name="rest" dfdl:lengthKind="endOfParent" ... />
    </sequence>
  </complexType>
</element>

So rest should contain everything up until the end of the data. That doesn't really work well with the assumption that we'll always have a bit limit, since the root element won't ever set bitLimit in this case.
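
A rough sketch of the fallback this implies, assuming the end of the data is known when the element is parsed (which may not hold for streaming input). `endOfParentLengthInBits` and its parameters are hypothetical names, not existing Daffodil functions.

```scala
// Hypothetical helper: how many bits an endOfParent element would consume.
// parentBitLimit is None when no enclosing parser has set a bit limit,
// e.g. when the enclosing element is the document root.
def endOfParentLengthInBits(
  bitPos: Long,
  parentBitLimit: Option[Long],
  dataEndBit: Long
): Long = {
  // Fall back to the end of the data stream when there is no bit limit.
  val end = parentBitLimit.getOrElse(dataEndBit)
  require(end >= bitPos, "current position is past the end of the parent")
  end - bitPos
}
```

For example, with no bit limit, a position of 16 bits, and 96 bits of data, the element would cover the remaining 80 bits.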

> A simple element must have one of [...] dfdl:representation 'binary' and a packed decimal representation

I don't think there are any tests for packed binary formats, and it looks like there aren't any modifications to the code to support packed types

> A simple element must have one of [...] dfdl:representation 'text'

I think this implies that you can have simple types with bool/numbers/dates/times/etc as long as representation is text. I think these should all work because they just add a converter to a specified length string parser, but we should have tests for them.

> The dfdl:lengthKind 'endOfParent' means that the element is terminated [...] or the end of an enclosing choice with dfdl:choiceLengthKind ‘explicit’.

choiceLengthKind="explicit" isn't used very often, but we should probably have a couple tests to make sure this works with EOP children.

notYetImplemented("lengthKind='endOfParent' for complex type")
case LengthKind.EndOfParent =>
notYetImplemented("lengthKind='endOfParent' for simple type")
case LengthKind.EndOfParent => new SpecifiedLengthEndOfParent(this, body)
Member

I think this is only needed for complex types? My thinking is that simple types with lengthKind EOP should have a parent parser (either complex or an explicit-length choice) that has already set the bit limit via one of these specified length parsers. So the bit limit has already been set correctly and we don't need another parser to do that.

But this is needed for complex types with EOP, since they need to skip any bits up to their parent's bit limit that their children might not have consumed.

So I think this wants to be:

case LengthKind.EndOfParent if isComplexType => new SpecifiedLengthEndOfParent(this, body)

And then bodyRequiresSpecifiedLength wants to be modified to make it so it evaluates to false if this is a simple type with lengthKind EOP.

Contributor Author

Hmm what if the simpleType is a root element with lengthKind EOP (where the user intends it to go to the end of the datastream)?

Member

Good point. So maybe it instead wants to be something like this?

case LengthKind.EndOfParent => {
  Assert.invariant(isComplexType || isRoot)
  new SpecifiedLengthEndOfParent(this, body)
}

}
case None if this.isInstanceOf[Root] => LengthUnits.Characters
case _ =>
Assert.invariantFailed(
Member

I feel like this invariant might break if we have something like a global element decl with a child with EOP. That EOP will want to reach up to find where it's used but won't be able to find a parent because it only looks lexically.

Contributor Author

Do you mean a child that's an element reference? Do we have any other way to look at a parent that's not optLexicalParent? Even immediatelyEnclosingGroupDef uses optLexicalParent.

Contributor Author

I added the test below, and it works as expected where the lengthKind is EOP:

	<xs:element name="text_string_txt_bytes" type="xs:string" dfdl:lengthUnits="bytes" nillable="true" />
	<xs:element name="text_string_txt_ref2">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="ex:text_string_txt_bytes"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="text_string_txt_ref3">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="ex:text_string_txt_ref2"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>

Member

I might just be forgetting how optLexicalParent works.

Refreshing my memory, it looks like the GlobalElementDecl DSOM object that represents the global definition does not have an optLexicalParent (or I guess more correctly it does, but it is the SchemaDocument).

But we also have an "ElementRef" DSOM object that represents the local element ref to that global definition. And the "ElementRef" is what is in the DSOM tree.

So as long as these functions are run within the context of ElementRef then maybe these invariants hold.

But I think maybe things get tricky if we try to recursively look up multiple parents though?

For example, if we are in the context of the ElementRef(text_string_txt_bytes) and ask for its optLexicalParent, we'll get the GlobalElementDecl(text_string_txt_ref2). But if we then recursively ask for that GlobalElementDecl's optLexicalParent, we'll get a SchemaDocument.

Similarly, say we have a group like this:

<group name="foo">
  <sequence>
    <element name="someEopElement" ... />
  </sequence>
</group>

Asking for the optLexicalParent of someEopElement returns the GlobalSequenceGroupDef for foo, whose optLexicalParent is the SchemaDocument and not whatever references the group.

And that makes sense because multiple different things could reference the group, so we don't really know which parent to examine.

So I'm not really sure how recursively looking up parents can work. The recursion essentially ends at the global definition. It works fine if the thing you are looking for is within the scope of your global element (e.g. an ElementRef is always inside an Element), but I think it breaks once groups get involved. Unless those are handled some other way, and maybe I'm just looking at the wrong spot when I'm inspecting values of optLexicalParent?
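
The concern can be sketched with a toy model. `Node` and `lexicalAncestors` are hypothetical stand-ins and not the real DSOM classes; the node names only mirror the objects discussed in this thread.

```scala
// Toy model of lexical parent links (hypothetical, not Daffodil's DSOM).
final case class Node(name: String, lexicalParent: Option[Node])

// Walk lexical parents upward until there are none left.
def lexicalAncestors(n: Node): List[String] =
  n.lexicalParent match {
    case Some(p) => p.name :: lexicalAncestors(p)
    case None => Nil
  }

val schemaDoc = Node("SchemaDocument", None)
// Global group definition "foo" and the endOfParent element declared inside it.
val groupFoo = Node("GlobalSequenceGroupDef(foo)", Some(schemaDoc))
val eopElem = Node("someEopElement", Some(groupFoo))
// A global element "bar" that references the group.
val elemBar = Node("GlobalElementDecl(bar)", Some(schemaDoc))
val groupRef = Node("GroupRef(foo)", Some(elemBar))
```

In this model `lexicalAncestors(eopElem)` yields only the group definition and the schema document; the referencing element `bar` never appears, so an upward walk cannot see the properties of the element that actually encloses the data.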

nextSibling.isDefined && nextSibling.get.isInstanceOf[ModelGroup],
"%s is specified as dfdl:lengthKind=\"endOfParent\", but a model group is defined between this element and the end of the enclosing component",
context
)
Member

I think nextSibling is lexical, so I don't think it will detect errors if this is a global element decl that is referenced in a manner where it has siblings? Or maybe the context will be the element reference and not the global decl, so it will work? Do we have tests for this?

Contributor Author

Yeah, the context ends up being the ElementRef, which seems to accurately give the next sibling. I added tests with and without siblings to confirm.

Member

What if we have an example like this:

<group name="foo">
  <sequence>
    <element name="eopElement" ... />
  </sequence>
</group>

<element name="bar">
  <complexType>
    <sequence>
      <group ref="foo" />
      <element name="laterSibling" ... />
    </sequence>
  </complexType>
</element>

In this case there is effectively a sibling after eopElement, but I'm not sure we would detect that, since I'm not sure optLexicalParent sees past the global group def. Though, maybe we have logic to allow group refs to have parents? I seem to remember something where we copy groups, but I might be thinking of something else.
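
One way to picture a check that goes down from the referencing side instead of up from the EOP element is sketched below. `Term`, `Elem`, `Sequence`, `flatten`, and `eopNotLast` are hypothetical names, and the sketch assumes group references have already been resolved into nested sequences, which may not match how Daffodil represents them.

```scala
// Toy term tree (hypothetical, not Daffodil's DSOM).
sealed trait Term
final case class Elem(name: String, isEop: Boolean = false) extends Term
// A resolved group reference shows up here as a nested Sequence.
final case class Sequence(children: List[Term]) extends Term

// Flatten nested sequences into the elements as they appear in the data.
def flatten(t: Term): List[Elem] = t match {
  case e: Elem => List(e)
  case Sequence(cs) => cs.flatMap(flatten)
}

// Names of endOfParent elements that are followed by at least one sibling.
def eopNotLast(s: Sequence): List[String] =
  flatten(s).dropRight(1).filter(_.isEop).map(_.name)

// Content of "bar": group ref to foo (containing eopElement), then laterSibling.
val barContent = Sequence(List(
  Sequence(List(Elem("eopElement", isEop = true))),
  Elem("laterSibling")
))
```

Starting from `bar`, `eopNotLast(barContent)` reports `eopElement`, because the walk sees both the group's contents and the sibling that follows the group reference.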

case _ => // do nothing
}
schemaDefinitionWhen(
representation == Representation.Text && knownEncodingWidthInBits != 8 && parentEffectiveLengthUnits != LengthUnits.Characters,
Member

I did some testing and changed the CSV schema to this:

  <element name="file" dfdl:lengthKind="explicit" dfdl:length="10" dfdl:terminator="%NL;">
    <complexType>
      <sequence>
        <element name="field" type="xs:string" dfdl:lengthKind="endOfParent" />
      </sequence>
    </complexType>
  </element>

And I got this stack trace:

org.apache.daffodil.lib.exceptions.Abort: Invariant broken: KnownEncodingMixin.this.isKnownEncoding
org.apache.daffodil.lib.exceptions.Assert$.abort(Assert.scala:153)
org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName(EncodingRuntimeData.scala:56)
org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName$(EncodingRuntimeData.scala:46)
org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName$lzyINIT1(LocalElementDecl.scala:25)
	at org.apache.daffodil.lib.exceptions.Assert$.abort(Assert.scala:153)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName(EncodingRuntimeData.scala:56)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName$(EncodingRuntimeData.scala:46)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName$lzyINIT1(LocalElementDecl.scala:25)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName(LocalElementDecl.scala:25)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingCharset(EncodingRuntimeData.scala:62)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingCharset$(EncodingRuntimeData.scala:46)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingCharset$lzyINIT1(LocalElementDecl.scala:25)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingCharset(LocalElementDecl.scala:25)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingWidthInBits(EncodingRuntimeData.scala:81)
	at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingWidthInBits$(EncodingRuntimeData.scala:46)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingWidthInBits$lzyINIT1(LocalElementDecl.scala:25)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingWidthInBits(LocalElementDecl.scala:25)
	at org.apache.daffodil.core.grammar.ElementBaseGrammarMixin.checkEndOfParentElem(ElementBaseGrammarMixin.scala:325)
	at org.apache.daffodil.core.grammar.ElementBaseGrammarMixin.checkEndOfParentElem$(ElementBaseGrammarMixin.scala:49)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.checkEndOfParentElem$lzyINIT1(LocalElementDecl.scala:25)
	at org.apache.daffodil.core.dsom.LocalElementDeclBase.checkEndOfParentElem(LocalElementDecl.scala:25)

I think the issue is that the default encoding used by the CSV schema is UTF-8, which seems to cause problems here.

Contributor Author

I'm not able to replicate this. And I think CSV's default encoding is ASCII?

Member

It looks like it uses the $dfdl:encoding variable which I think defaults to UTF-8

https://github.com/DFDLSchemas/CSV/blob/master/src/csv-base-format.dfdl.xsd#L48

That's a relatively new change to CSV, maybe you're using an older version?

schemaDefinitionWhen(
hasTerminator,
"%s is specified as dfdl:lengthKind=\"endOfParent\", but specifies a dfdl:terminator.",
context
Member

I don't think we need to include context in the error string. I believe the error context is captured and output as part of the SDE.

case LengthKind.EndOfParent => LengthMultipleOf(1) // NYI
case LengthKind.EndOfParent =>
eb.immediatelyEnclosingElementParent match {
case Some(parent) => parent.elementSpecifiedLengthApprox
Member

I don't think this is quite right. The length of this element isn't the same as the parent length, it's whatever is left over of the parent after the previous siblings.

So this element's length is kind of parent.elementSpecifiedLengthApprox - priorAlignmentApprox (i.e., the length of the parent minus wherever we are starting), but we can't just subtract approx things since they are potentially multiples.

That said, I wonder if we don't really need to get this element's approx length perfect, because no elements come after it, and the endingAlignApprox of the parent won't need this specific value since it is going to be known from the parent's explicit length? Maybe this just becomes LengthMultipleOf 1 or 8 depending on length units? This might need some more thought...
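
As a sketch of that last idea only: the types below are toy stand-ins that reuse names from the diff (`LengthMultipleOf`, `LengthUnits`) but are defined locally, and `eopLengthApprox` is a hypothetical function. Treating characters as a multiple of 1 bit is a conservative assumption here, since some encodings are not a whole number of bytes wide.

```scala
// Toy stand-ins defined locally; not the real Daffodil types.
sealed trait LengthUnits
case object Bits extends LengthUnits
case object Bytes extends LengthUnits
case object Characters extends LengthUnits

final case class LengthMultipleOf(nBits: Long)

// Approximate an endOfParent element's length from the parent's effective units.
def eopLengthApprox(parentUnits: LengthUnits): LengthMultipleOf =
  parentUnits match {
    case Bytes => LengthMultipleOf(8)
    // Bits, and conservatively Characters, since a character's width in bits
    // depends on the encoding and is not always a multiple of 8.
    case Bits | Characters => LengthMultipleOf(1)
  }
```

A multiple of 1 is always a safe approximation, so the only thing this buys is a tighter answer when the parent's units are bytes.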
