8382713: [VectorAPI] Perform late inlining of failed vector intrinsics #30876

Open
jatin-bhateja wants to merge 15 commits into openjdk:master from jatin-bhateja:JDK-8382713

Conversation

@jatin-bhateja

@jatin-bhateja jatin-bhateja commented Apr 22, 2026

Member

Currently, we attempt lazy intrinsification of vector intrinsics during the incremental inlining stage. If intrinsification fails because the inline expander does not see the constant context it expects, a static call is generated, which incurs a call overhead penalty.

As per the following comment from @iwanowww on the JDK-8303762 pull request:
#24104 (comment)

We should attempt procedure inlining of failed vector intrinsics to avoid the penalties associated with call overhead; for vector operations whose fallback implementation uses other vector APIs, it will also save the boxing penalty.

This patch addresses the concern by adding a new hybrid call generator (LateInlineVectorCallGenerator) that encapsulates both the intrinsic and the parser call generators. During incremental inlining, the intrinsic gets multiple chances to succeed. If all attempts fail, the fallback implementation is inlined instead, avoiding the call overhead penalty.

Please review and share your feedback.

Best Regards,
Jatin
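
The control flow described above (retry the intrinsic across incremental inlining rounds, then inline the fallback) can be pictured with a small toy model. This is only a sketch in plain Java, not the HotSpot C++ code: the class `LateInlineSketch`, the `generate` method, the `Outcome` enum, and the fixed attempt count are all hypothetical names and assumptions chosen for illustration.

```java
import java.util.function.BooleanSupplier;

public class LateInlineSketch {
    enum Outcome { INTRINSIFIED, FALLBACK_INLINED }

    // Hypothetical stand-in for a hybrid call generator: it owns both an
    // intrinsic expansion attempt and a fallback that inlines the Java body.
    static Outcome generate(BooleanSupplier tryIntrinsic, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Each incremental inlining round gives the intrinsic another chance,
            // e.g. because arguments became constant after earlier inlining.
            if (tryIntrinsic.getAsBoolean()) {
                return Outcome.INTRINSIFIED;
            }
        }
        // All attempts failed: inline the fallback instead of emitting a call.
        return Outcome.FALLBACK_INLINED;
    }

    public static void main(String[] args) {
        int[] round = {0};
        // Models an intrinsic that only succeeds in the third round.
        Outcome late = generate(() -> ++round[0] >= 3, 5);
        // Models an intrinsic that can never succeed on this target.
        Outcome never = generate(() -> false, 5);
        System.out.println(late + " " + never);
    }
}
```

The point of the sketch is only that no path ends in an out-of-line call: the result is either the intrinsic or the inlined fallback.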



Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8382713: [VectorAPI] Perform late inlining of failed vector intrinsics (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/30876/head:pull/30876
$ git checkout pull/30876

Update a local copy of the PR:
$ git checkout pull/30876
$ git pull https://git.openjdk.org/jdk.git pull/30876/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 30876

View PR using the GUI difftool:
$ git pr show -t 30876

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30876.diff

Using Webrev

Link to Webrev Comment

@jatin-bhateja

Member Author

/label add hotspot-compiler-dev

@bridgekeeper

bridgekeeper Bot commented Apr 22, 2026


👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk Bot commented Apr 22, 2026


❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 22, 2026
@openjdk

openjdk Bot commented Apr 22, 2026


@jatin-bhateja
The hotspot-compiler label was successfully added.

@openjdk

openjdk Bot commented Apr 22, 2026


The total number of required reviews for this PR has been set to 2 based on the presence of this label: hotspot-compiler. This can be overridden with the /reviewers command.

@openjdk

openjdk Bot commented Apr 22, 2026


@jatin-bhateja To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request. In order to have an "RFR" email sent to the correct mailing list, you will need to add one or more applicable labels manually using the /label pull request command.

Applicable Labels
  • build
  • client
  • compiler
  • core-libs
  • hotspot
  • hotspot-compiler
  • hotspot-gc
  • hotspot-jfr
  • hotspot-runtime
  • i18n
  • ide-support
  • javadoc
  • jdk
  • net
  • nio
  • security
  • serviceability
  • shenandoah

@openjdk openjdk Bot added the rfr Pull request is ready for review label Apr 22, 2026
@mlbridge

mlbridge Bot commented Apr 22, 2026


@iwanowww iwanowww left a comment

Contributor

Thanks, Jatin!

}
};

bool LateInlineVectorCallGenerator::inline_fallback() const {

Contributor

What's the purpose of this method? All vector intrinsics do have fallback implementation. If there are any cases added later, then they don't have to rely on LateInlineVectorCallGenerator.

Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/doCall.cpp Outdated
@jatin-bhateja

Member Author

Hi @iwanowww , your comments have been addressed.

Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
@jatin-bhateja

jatin-bhateja commented Apr 29, 2026

Member Author

I modified the BlackScholes benchmark to use FloatVector.SPECIES_512 and then explicitly passed
-XX:UseAVX=2 to force intrinsic failure. The following are the performance numbers with and without
InlineVectorFallback; we see some improvement, although the difference is within the error margins.

CommandLine: java -jar target/benchmarks.jar -f 1 -i 5 -wi 1 -w 30 -jvmArgs "-XX:UseAVX=2 --add-modules=jdk.incubator.vector -XX:+UnlockDiagnosticVMOptions -XX:+InlineVectorFallback" BlackScholes.vector_black_scholes


With -XX:-InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7460.391 ± 1412.273  ops/s

With -XX:+InlineVectorFallback
Benchmark                          (size)   Mode  Cnt     Score      Error  Units
BlackScholes.vector_black_scholes    1024  thrpt    5  7851.062 ± 1765.271  ops/s
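
To put the two scores in context, the arithmetic can be checked directly. The sketch below treats the reported Error column as a symmetric half-width around each score (an assumption about how to read the table) and computes the relative gain plus whether the two intervals overlap. The class and method names are hypothetical.

```java
public class ErrorMarginCheck {
    // True when [a - ea, a + ea] and [b - eb, b + eb] share at least one point.
    static boolean intervalsOverlap(double a, double ea, double b, double eb) {
        return (a - ea) <= (b + eb) && (b - eb) <= (a + ea);
    }

    // Relative change of candidate over baseline, in percent.
    static double relativeGainPercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores and errors copied from the benchmark table above (ops/s).
        double off = 7460.391, offErr = 1412.273; // -XX:-InlineVectorFallback
        double on = 7851.062, onErr = 1765.271;   // -XX:+InlineVectorFallback

        System.out.println("relative gain (%): " + relativeGainPercent(off, on));
        System.out.println("intervals overlap: " + intervalsOverlap(off, offErr, on, onErr));
    }
}
```

With these inputs the gain is a little over 5% and the intervals overlap, so a longer run (more iterations or forks) would be needed to call the difference significant.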

@jatin-bhateja

Member Author

Hi @iwanowww , your comments have been addressed.

@iwanowww iwanowww left a comment

Contributor

Overall, looks good. Minor suggestions follow.

Comment thread src/hotspot/share/opto/c2_globals.hpp Outdated
product(bool, EnableVectorAggressiveReboxing, false, EXPERIMENTAL, \
"Enables aggressive reboxing of vectors") \
\
product(bool, InlineVectorFallback, true, DIAGNOSTIC, \

Contributor

Let's call it IncrementalInlineVector and put it next to IncrementalInline et al.

Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
Comment thread src/hotspot/share/opto/compile.cpp Outdated
@jatin-bhateja

Member Author

Hi @iwanowww , your comments have been addressed, please share the results of your test run.

@iwanowww

iwanowww commented May 6, 2026

Contributor

Unfortunately, I see multiple failures in Vector API-related tests. They failed mostly on linux-aarch64, but there were a few linux-x64 failures [1] as well. I'll take a closer look, but it seems the problem on linux-aarch64 is that fallback implementations are unconditionally inlined, which causes problems (multiple tests on 512-bit vectors fail due to memory exhaustion [2]).

[1] In particular:

  • compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)
Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]
  • compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]
         
         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

@jatin-bhateja

Member Author

[1] In particular:

  • compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)
Failed IR Rules (3) of Methods (2)
----------------------------------
1) Method "compiler.vectorapi.TestVectorTest::branch" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_", "_#CMOVE_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"

2) Method "compiler.vectorapi.TestVectorTest::cmove" - [Failed IR rules: 2]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#CMP_I#_"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - failOn: Graph contains forbidden nodes:
         * Constraint 1: "(\\d+(\\s){2}(CmpI.*)+(\\s){2}===.*)"

   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_TEST#_", "1", "_#CMOVE_I#_", "1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 2: "(\\d+(\\s){2}(CMoveI.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 7 = 1 [given]

Hi @iwanowww, this failure is tied to UseAVX=0, where fromBitsCoerced is not intrinsified. Earlier it remained a CallStaticJavaNode, but now it gets inlined, and the newly inlined code has a graph shape that introduces CMoveI and CmpI nodes; the test fails because the IR rules do not expect these nodes. One target-agnostic fix is to guard these IR rules with the -XX:-IncrementalInlineVector flag, but that would defeat the purpose of the test, since IncrementalInlineVector is on by default. Since the test runs on multiple targets, guarding by UseAVX > 0 may not be desirable.

Let me know what you think.


  • compiler/vectorapi/VectorMaskCompareNotTest.java (w/ -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation)
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareULEMaskNotByte" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CAST#_", "= 1", "_#VECTOR_MASK_CMP#_", "= 3"}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(XorVMask.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

         * Constraint 2: "(\\d+(\\s){2}(XorV.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 3 = 0 [given]

With the patch, inlining the vector intrinsic fallbacks generates more code in the compilation unit, and this affects the inlining of
other methods, e.g. AbstractMask::intoArray. When intoArray is NOT inlined, the mask must be boxed before it is passed to the non-inlined method. As a result, the VectorMaskCmp encapsulated in the VectorBoxNode gets an additional user, a VectorStoreMask created at https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L270 during VectorBoxNode scalarization.

This increases the outcnt of the VectorMaskCmpNode and inhibits the optimization that folds XorVMask (VectorMaskCmp, maskAll(true)):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectornode.cpp#L2366

Increasing InlineSmallCode to 10000 allows intoArray to be inlined: the mask is not boxed, VectorMaskCmp has outcnt=1, XorVMask is folded, and the test passes.
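
The interaction between an extra user and the fold can be illustrated with a toy graph. This is a simplified model in plain Java, not the C2 implementation: the `Node` class, the `foldXor` helper, and the opcode strings are hypothetical, and the single-use condition is modeled only from the description above.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleUseFoldSketch {
    // Minimal IR node: an opcode, its inputs, and the nodes that use it.
    static final class Node {
        final String op;
        final List<Node> inputs = new ArrayList<>();
        final List<Node> users = new ArrayList<>();

        Node(String op, Node... ins) {
            this.op = op;
            for (Node in : ins) {
                inputs.add(in);
                in.users.add(this); // registering a user raises the input's outcnt
            }
        }

        int outcnt() { return users.size(); }
    }

    // Fold Xor(Cmp, AllTrue) into a negated compare, but only when the compare
    // has no other user; with a second user the original compare stays alive.
    static Node foldXor(Node xor) {
        if (!xor.op.equals("XorVMask") || xor.inputs.size() != 2) {
            return xor;
        }
        Node cmp = xor.inputs.get(0);
        Node allTrue = xor.inputs.get(1);
        boolean shape = cmp.op.equals("VectorMaskCmp") && allTrue.op.equals("MaskAll(true)");
        if (shape && cmp.outcnt() == 1) {
            return new Node("VectorMaskCmp(negated)");
        }
        return xor;
    }

    public static void main(String[] args) {
        // Case 1: the compare feeds only the xor, so the fold applies.
        Node cmp1 = new Node("VectorMaskCmp");
        Node xor1 = new Node("XorVMask", cmp1, new Node("MaskAll(true)"));
        System.out.println(foldXor(xor1).op);

        // Case 2: boxing added a VectorStoreMask user, so the fold is blocked.
        Node cmp2 = new Node("VectorMaskCmp");
        new Node("VectorStoreMask", cmp2);
        Node xor2 = new Node("XorVMask", cmp2, new Node("MaskAll(true)"));
        System.out.println(foldXor(xor2).op);
    }
}
```

In this model the second case keeps the XorVMask node, which mirrors the IR rule failure that reports XorVMask nodes where none were expected.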

[2] jdk/incubator/vector/ByteVector512LoadStoreTests.java

    --- Allocation timelime by phase ---
        Phase seq. number                             Bytes                  Nodes
                 (380 older entries lost)
          >4                  incrementalInline 157724064 (+0)        32839 (+0) 
...
           >227          incrementalInline_igvn 174587184 (+261984)   34935 (-18) 
          <4 (cont.)          incrementalInline 174587184 (+0)        34935 (+0) 
         <2 (cont.)                   optimizer 180612856 (+6025672)  34697 (-238) 
...
         <295 (cont.)                  regalloc 177447520 (+0)        83805 (+0) 
          >305                         buildIFG 221892144 (+44444624)  83398 (-407) 
...
         <295 (cont.)                  regalloc 299650328 (+0)        81331 (+0) 
          >318                    regAllocSplit 1073764680 (+774114352)  81331 (+0) 
    ---

#  Internal Error (.../src/hotspot/share/compiler/compilationMemoryStatistic.cpp:935), pid=1510298, tid=1510316
#  fatal error: c2 (1695) jdk/incubator/vector/ByteVector512$ByteShuffle512::intoMemorySegment((Ljava/lang/foreign/MemorySegment;JLjava/nio/ByteOrder;)V): Hit MemLimit - limit: 1073741824 now: 1073764680

Overall, there is a tradeoff in unconditionally inlining vector intrinsic fallbacks: most of them are bulky, and inlining them may affect inlining decisions in the calling context.

Do you think it would be beneficial to initially limit the scope of inlining to only a few intrinsics? For example:
https://github.com/jatin-bhateja/jdk/blob/46fcc9acc05bdef5fd01f4972ed9a66de5f07198/src/hotspot/share/opto/callGenerator.cpp#L463

Please let me know your views.

@openjdk

openjdk Bot commented May 8, 2026


@jatin-bhateja Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@iwanowww

iwanowww commented May 8, 2026

Contributor

compiler/vectorapi/TestVectorTest.java (w/ -XX:UseAVX=0)

target agnostic fix is to guard these IR rules with -XX:-IncrementalInlineVector flag, but it will defeat the purpose of this test since IncrementalInlineVector is default on.

I don't see why it defeats the purpose of the test. It's an IR test and limiting possible IR shapes is fine.

compiler/vectorapi/VectorMaskCompareNotTest.java

With the patch, the vector intrinsic fallback inlining generates more code in the compilation unit, this affects inlining of
other methods, e.g. AbstractMask::intoArray.

Are we missing @ForceInline on AbstractMask::intoArray? Are any other methods not inlined?

Do you think it's beneficial to limit the scope of inlining to only a few intrinsics initially?

I think regular inlining heuristics should be applied to vector fallback implementations.

@@ -47,16 +47,16 @@ public static void main(String[] args) {
     public int call() { return 1; }
 
     @Test
-    @IR(failOn = {IRNode.CMP_I, IRNode.CMOVE_I})
+    @IR(failOn = {IRNode.CMP_I, IRNode.CMOVE_I}, applyIf = {"IncrementalInlineVector", "false"})

Contributor

Does it mean that the rule is disabled unless the test is explicitly run with -XX:-IncrementalInlineVector?
I doubt it will be regularly executed in such mode. So, it defeats the purpose of the test, doesn't it?

Instead, why don't you explicitly run the test with -XX:-IncrementalInlineVector flag?

Member Author

Done, I am now passing -XX:-IncrementalInlineVector to the test invocation.

@@ -1294,7 +1294,7 @@ public static void testCompareMaskNotDoubleNegative() {
     public static void main(String[] args) {
         TestFramework testFramework = new TestFramework();
         testFramework.setDefaultWarmup(5000)
-                     .addFlags("--add-modules=jdk.incubator.vector")
+                     .addFlags("--add-modules=jdk.incubator.vector", "-XX:InlineSmallCode=100000")

Contributor

Should AbstractMask::intoArray() be marked w/ @ForceInline instead?

Member Author

With @ForceInline on AbstractMask::intoArray, the test passes with "-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation" but fails with the default options due to a difference in inlining:


Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareNEMaskNotFloatNaN" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"asimd", "true", "avx", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CMP#_", "= 2"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 3: "(\d+(\s){2}(VectorMaskCmp.*)+(\s){2}===.*)"
           - Failed comparison: [found] 0 = 2 [given]
           - No nodes matched!

With -XX:InlineSmallCode=1000000 it passes with all the configurations.

Contributor

Please, elaborate where it fails. Does func.apply(m).intoArray(mr, 0); in testCompareMaskNotFloat cause problems?

@jatin-bhateja jatin-bhateja May 19, 2026

Member Author

I investigated it further; here is my analysis:

Adding @ForceInline to AbstractMask::intoArray is desirable for vector intrinsic inlining, but it exposes a pre-existing bug in C2's switch profiling.

The bug is in Parse::do_tableswitch() in parse2.cpp: when a mature MDO has all-zero MultiBranchData counts, merge_ranges() marks every arm as never_reached, and jump_switch_ranges() collapses the entire switch to a single unstable_if trap. The parser should treat this as "no useful profile" (fall back to cnt = 1.0F), not "every arm is cold." I confirmed this analysis by passing -XX:-TieredCompilation or -XX:-UseSwitchProfiling — the test passes with either flag.

This profiling issue is orthogonal to the vector intrinsic late inlining work and should be addressed in a separate PR. For now, @ForceInline on AbstractMask::intoArray is not added, and -XX:InlineSmallCode=1000000 is added to the failing test as a workaround.
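
The behavior proposed in this analysis (treat an all-zero switch profile as "no useful profile" rather than "every arm is cold") can be sketched as follows. This is a toy model in plain Java, not the code in parse2.cpp: `armWeights` and `neverReached` are hypothetical helpers, and the diagnosis itself is still under discussion in this thread.

```java
import java.util.Arrays;

public class SwitchProfileSketch {
    // Returns one weight per switch arm. If every recorded count is zero, the
    // profile carries no information, so each arm gets a neutral weight of 1.0
    // instead of being treated as never taken.
    static float[] armWeights(long[] counts) {
        boolean allZero = Arrays.stream(counts).allMatch(c -> c == 0);
        float[] weights = new float[counts.length];
        for (int i = 0; i < counts.length; i++) {
            weights[i] = allZero ? 1.0f : (float) counts[i];
        }
        return weights;
    }

    // An arm is only considered cold when its weight is exactly zero.
    static boolean neverReached(float weight) {
        return weight == 0.0f;
    }

    public static void main(String[] args) {
        // All-zero profile: no arm is marked cold.
        System.out.println(Arrays.toString(armWeights(new long[] {0, 0, 0})));
        // Mixed profile: only the arm with a zero count is marked cold.
        System.out.println(Arrays.toString(armWeights(new long[] {5, 0, 12})));
    }
}
```

Under this policy a switch with an empty profile keeps all of its arms, while a genuinely cold arm in an otherwise populated profile can still be pruned.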

Contributor

Thanks for the details. Hm, that doesn't sound right. There's no support for caller-sensitive profiling yet, so each method profile data is stored in a dedicated per-method MDO instance. (There are deoptimization counts which may depend on inlining, but regular branch counts should not be affected.) Anyway, let's continue investigating it separately.
Please, file a follow-up bug for it. Does -XX:-IncrementalInlineVector work as a workaround? I'm not fond of InlineSmallCode tweaks.

Member Author

I have filed a follow-up JBS issue for this: https://bugs.openjdk.org/browse/JDK-8385134

Comment thread src/hotspot/share/opto/doCall.cpp Outdated
Comment thread src/hotspot/share/opto/doCall.cpp Outdated
Comment thread test/hotspot/jtreg/compiler/vectorapi/TestVectorTest.java Outdated

@iwanowww

Contributor

Overall, looks good. I tweaked test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java [1] and submitted the patch for testing.

[1]

diff --git a/test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java b/test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java
index 935363f8526..4aeb5ba36b0 100644
--- a/test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java
+++ b/test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java
@@ -1295,7 +1295,7 @@ public static void main(String[] args) {
         TestFramework testFramework = new TestFramework();
         testFramework.setDefaultWarmup(5000)
                      .addFlags("--add-modules=jdk.incubator.vector",
-                               "-XX:InlineSmallCode=1000000")
+                               "-XX:-IncrementalInlineVector")
                      .start();
     }
 }

Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
@iwanowww

Contributor

One more IR test failure:

Test: compiler/vectorapi/VectorCompareWithZeroTest.java
Platform: linux-aarch64
Flags: -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation


Failed IR Rules (5) of Methods (5)
----------------------------------
1) Method "compiler.vectorapi.VectorCompareWithZeroTest::testByteVectorEqualToZero" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VMASK_CMP_ZERO_I_NEON#_", ">= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Final Code":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(vmaskcmp_zeroI_neon.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

2) Method "compiler.vectorapi.VectorCompareWithZeroTest::testDoubleVectorLessThanZero" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VMASK_CMP_ZERO_D_NEON#_", ">= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Final Code":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(vmaskcmp_zeroD_neon.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

3) Method "compiler.vectorapi.VectorCompareWithZeroTest::testFloatVectorLessEqualToZero" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VMASK_CMP_ZERO_F_NEON#_", ">= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Final Code":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(vmaskcmp_zeroF_neon.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

4) Method "compiler.vectorapi.VectorCompareWithZeroTest::testLongVectorGreaterThanZero" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VMASK_CMP_ZERO_L_NEON#_", ">= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Final Code":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(vmaskcmp_zeroL_neon.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

5) Method "compiler.vectorapi.VectorCompareWithZeroTest::testShortVectorNotEqualToZero" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VMASK_CMP_ZERO_I_NEON#_", ">= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Final Code":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(vmaskcmp_zeroI_neon.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 >= 1 [given]
           - No nodes matched!

@openjdk

openjdk Bot commented Jun 4, 2026


@jatin-bhateja this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8382713
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk Bot added the merge-conflict Pull request has merge conflict with target branch label Jun 4, 2026
@openjdk openjdk Bot removed the merge-conflict Pull request has merge conflict with target branch label Jun 11, 2026
@jatin-bhateja

Member Author

Hi @iwanowww, I added handling, using an RAII-based mechanism, to prevent spurious messages from accumulating during fallback call generator selection. The patch also explicitly prints the message "late inline succeeded (vector intrinsic fallback)" when intrinsification fails but the fallback generator (inlining) succeeds.

Please let me know if the patch looks good to land now; your earlier comments have been addressed.
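
The scoped suppression described above can be pictured with a Java analogue of RAII, using try-with-resources. This is only a sketch: `InlineLog`, `Suspension`, `suspend`, and the probe message text are hypothetical and do not correspond to the actual classes in printinlining.hpp.

```java
import java.util.ArrayList;
import java.util.List;

public class SuspendedLogSketch {
    static final class InlineLog {
        // Scope guard whose close() does not throw checked exceptions.
        interface Suspension extends AutoCloseable {
            @Override
            void close();
        }

        private final List<String> messages = new ArrayList<>();
        private int suspendDepth = 0;

        // Messages recorded while suspended are dropped.
        void record(String msg) {
            if (suspendDepth == 0) {
                messages.add(msg);
            }
        }

        // Suspends recording until the returned guard is closed.
        Suspension suspend() {
            suspendDepth++;
            return () -> suspendDepth--;
        }

        List<String> messages() {
            return messages;
        }
    }

    public static void main(String[] args) {
        InlineLog log = new InlineLog();
        try (InlineLog.Suspension guard = log.suspend()) {
            // Hypothetical probe message; dropped because the log is suspended.
            log.record("probe: intrinsic attempt failed");
        }
        // Recording resumes automatically once the guard goes out of scope.
        log.record("late inline succeeded (vector intrinsic fallback)");
        System.out.println(log.messages());
    }
}
```

Tying the suspension to a scope means an early return or exception during the probe cannot leave the log permanently muted.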

@iwanowww iwanowww left a comment

Contributor

Overall, looks good.

return &_nullStream;
}
if (is_suspended()) {
locate(state, callee);

Contributor

Why do you perform locate call?

Member Author

locate() was retained to keep the IPInlineSite topology intact (one node per JVMState frame) while only the message append is suppressed. But that topology is already guaranteed by record ordering — the order in which record() calls happen during compilation ensures every node's parent exists before the node itself is needed — so locate() rebuilding the path during a suspended probe is unnecessary, and dropping it is safe.

Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
Comment thread src/hotspot/share/opto/callGenerator.cpp Outdated
Comment thread src/hotspot/share/opto/printinlining.hpp Outdated
@jatin-bhateja

Member Author

Overall, looks good.

Hi @iwanowww, your comments have been addressed.

@iwanowww iwanowww left a comment

Contributor

Looks good. Submitted for testing.

@iwanowww

Contributor

Strangely, compiler.vectorapi.VectorMaskCompareNotTest still fails. I noticed that you changed default warmup setting. Why did you do that?

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "compiler.vectorapi.VectorMaskCompareNotTest::testCompareUGTMaskNotByteCast" - [Failed IR rules: 1]:
   * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"avx2", "true", "rvv", "true"}, counts={"_#XOR_V_MASK#_", "= 0", "_#XOR_V#_", "= 0", "_#VECTOR_MASK_CMP#_", "= 1"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 3: "(\\d+(\\s){2}(VectorMaskCmp.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 = 1 [given]
           - No nodes matched!

@jatin-bhateja

Member Author

Strangely, compiler.vectorapi.VectorMaskCompareNotTest still fails. I noticed that you changed default warmup setting. Why did you do that?

I rebased onto the latest mainline; it looks like the warmup change was mistakenly introduced by the previous merge.

Kindly re-verify.

@iwanowww iwanowww left a comment

Contributor

Looks good.

@jatin-bhateja

Member Author

Hi @mhaessig, we need one more approval here to move this to the ready state. Could you take a look?

@eme64 eme64 left a comment

Contributor

@jatin-bhateja This looks like important work, so thanks for working on it!

I have some questions about the tests below. In particular, why did you have to set -XX:-IncrementalInlineVector in some of the IR tests? If the flag is now on by default, would it not be more important to have IR rules with the flag enabled? Which IR rules are affected?

Also: Could we have some new IR tests that demonstrate the benefit of late vector inlining, and make sure there won't be regressions on it?

public static void main(String[] args) {
TestFramework.runWithFlags("--add-modules=jdk.incubator.vector");
TestFramework.runWithFlags("--add-modules=jdk.incubator.vector",
"-XX:-IncrementalInlineVector");

Contributor

Why did you add these flags here? Would the IR rules fail without them?
Suggestion: can you do one run with the flag and one without, then show which IR rules are affected and guard those rules with the flag?

Comment on lines +256 to +257
.addFlags("--add-modules=jdk.incubator.vector",
"-XX:-IncrementalInlineVector")

Contributor

Same question about flag here.

Comment on lines +1750 to +1751
.addFlags("--add-modules=jdk.incubator.vector",
"-XX:-IncrementalInlineVector")

Contributor

Same question about flag here.

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review

3 participants