[BUG] Node Disable Triggers Fabric.exe Crash and Ungraceful Service Termination #1548

@rkyser

Describe the bug:

During rolling node disables/restarts on a 3-node cluster whose cert is close to expiry (<30 days), Fabric.exe intermittently crashes while the node is preparing to move to the Disabled state. Application services on the node do not receive their graceful shutdown callbacks; ETL traces show these services as "terminated." The exact relationship between the Fabric.exe crash and the service terminations is not yet clear, but the two appear to be related. Crash dumps show an access violation consistent with a use-after-free during execution of the secure session expiration timer callback.

Timeline:

  • Nov 22: Completed several rolling restart iterations without any problems.
  • Nov 25: Repeated the test and started experiencing failures. Upon investigation, found that Fabric.exe was crashing. This is when we noticed the certificate expiration warning had appeared in the cluster (the cert expires Dec 25, 2025, exactly 30 days out).
  • Nov 26: Generated a new certificate (expires in February) and updated the cluster. The Fabric.exe crashes appear to have been resolved.

Area/Component:

Transport security / secure session expiration timer.

To Reproduce:

Steps to reproduce the behavior:

  1. Provision a 3-node cluster (Standard_D4s_v4) with a certificate expiring in <30 days.
  2. Deploy a test application to the cluster.
  3. In a loop, perform rolling graceful restarts of Service Fabric VMs:
    • Disable-ServiceFabricNode -NodeName {node-name} -Intent Restart -Force
    • Wait for the node to gracefully stop all application services.
    • Restart the VM.
    • VM returns in Disabled state.
    • Enable-ServiceFabricNode -NodeName {node-name}.
  4. Continue repeating step 3 until the observed behavior (see below) occurs. Repro is intermittent.

Expected behavior:

Disabling a Service Fabric node during a rolling restart should gracefully drain and stop services, honoring lifecycle callbacks, without crashing Fabric.exe or terminating application services abruptly; once re-enabled, services should come back normally.

Observed behavior:

Intermittent Fabric.exe crash while the node is being disabled/stopped. Crash dumps show an access violation consistent with a use-after-free during the secure session expiration timer callback. Application services on the node terminate ungracefully without receiving shutdown callbacks; the exact relationship to the Fabric.exe crash is not yet clear. The issues started after the cluster began warning that the Let's Encrypt-issued cluster cert would expire in <30 days.

Screenshots:

N/A.

Service Fabric Runtime Version:

SF runtime: 11.0.2707.1

Environment:

  • Azure 3-node cluster (Standard_D4s_v4) on Windows Server 2022 Datacenter Azure Edition.
  • Certificates: Let's Encrypt-issued cluster certs.

If this is a regression, which version did it regress from?

  • Unknown.

Additional context:

  1. .dmp and .etl files are available upon request.
  2. In case it matters, several service instances use FabricClient to connect to the SF cluster.

Fabric Dump Analysis:

WinDbg dump analysis of Fabric.exe.3984.dmp:

0:020> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : AV.Dereference
    Value: NullClassPtr

    Key  : AV.Fault
    Value: Write

    Key  : Analysis.CPU.mSec
    Value: 1999

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 10426

    Key  : Analysis.Init.CPU.mSec
    Value: 3687

    Key  : Analysis.Init.Elapsed.mSec
    Value: 31768

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 125

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 245840

    Key  : Timeline.Process.Start.DeltaSec
    Value: 245818

    Key  : WER.OS.Branch
    Value: fe_release

    Key  : WER.OS.Timestamp
    Value: 2021-05-07T15:00:00Z

    Key  : WER.OS.Version
    Value: 10.0.20348.1

    Key  : WER.Process.Version
    Value: 11.0.2707.1


FILE_IN_CAB:  Fabric.exe.3984.dmp

NTGLOBALFLAG:  0

PROCESS_BAM_CURRENT_THROTTLED: 0

PROCESS_BAM_PREVIOUS_THROTTLED: 0

APPLICATION_VERIFIER_FLAGS:  0

CONTEXT:  (.ecxr)
rax=00000001c68fea68 rbx=00000001c68feae8 rcx=0000000000000000
rdx=00000001c68feae8 rsi=0000000000000000 rdi=0000000000000000
rip=00007ff6e758f7d8 rsp=00000001c68fea20 rbp=00007ff6e9d5b901
 r8=0000000000000000  r9=00000001c68fea80 r10=0000000000000002
r11=00000001c68fea60 r12=0000000000000000 r13=00000001c68feae8
r14=0000000000000001 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
Fabric+0x1bf7d8:
00007ff6`e758f7d8 4088696a        mov     byte ptr [rcx+6Ah],bpl ds:00000000`0000006a=??
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff6e758f7d8 (Fabric+0x00000000001bf7d8)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000001
   Parameter[1]: 000000000000006a
Attempt to write to address 000000000000006a

PROCESS_NAME:  Fabric.exe

WRITE_ADDRESS:  000000000000006a 

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000001

EXCEPTION_PARAMETER2:  000000000000006a

STACK_TEXT:  
00000001`c68fea20 00007ff6`e758f519     : 00007ff6`e9d5b950 00000001`c68feae8 00007ff6`e9d5b950 00000000`00000000 : Fabric+0x1bf7d8
00000001`c68fea70 00007ff6`e8990e4a     : 00000187`c3c2cd60 00000019`254d3800 7fffffff`ffffffff ffffffe6`dab2c800 : Fabric+0x1bf519
00000001`c68fead0 00007ff6`e7d0c71a     : 00000187`c3e54b50 00000187`c37b3ac0 00000000`00000000 00000187`c3c2cd70 : Fabric+0x15c0e4a
00000001`c68febe0 00007ff6`e89942fa     : 00000187`00000013 00000187`00000000 00000187`00000004 00000187`00000001 : Fabric+0x93c71a
00000001`c68fec30 00007ff6`e7d0c71a     : 00000187`c37b3ac0 00000187`c4243750 00000000`00000000 00000000`00000000 : Fabric+0x15c42fa
00000001`c68fec90 00007ff6`e8993d1e     : 00000187`00000013 00000000`00000000 00000000`00000004 00007ffb`00000001 : Fabric+0x93c71a
00000001`c68fece0 00007ff6`e7d1742a     : 00000187`c4243750 00000187`c3d3aeb0 00000000`00000000 00007ff6`e898f5c5 : Fabric+0x15c3d1e
00000001`c68ff5e0 00007ff6`e89905c2     : 00000187`00000013 00000001`00000000 00000187`00000004 00007ffb`00000001 : Fabric+0x94742a
00000001`c68ff630 00007ff6`e947cb3f     : 00000187`c3d3aeb0 00000187`c4377bf8 00000001`c68ff7f0 00007ff6`e74ef433 : Fabric+0x15c05c2
00000001`c68ff6f0 00007ff6`e75919bf     : 00000187`00000008 00007ff6`e74f33d5 00000001`c68ff800 00007ff6`e75547dc : Fabric+0x20acb3f
00000001`c68ff720 00007ff6`e7591d2f     : 00000001`c68ff800 00000001`c68ffd58 00000001`c68ff8c0 00007ff6`e76809bf : Fabric+0x1c19bf
00000001`c68ff7a0 00007ff6`e88bda61     : 00000187`c39ff100 00000001`c68ffaa0 00000187`c38204c0 00000000`00000000 : Fabric+0x1c1d2f
00000001`c68ff7d0 00007ff6`e75919bf     : 00000187`00000008 00000000`00000000 00000187`c39ff150 00000001`c68ffaa0 : Fabric+0x14eda61
00000001`c68ff850 00007ff6`e7591cd7     : 00000001`c68ff940 00000187`c39ff150 00000001`c68ffaa0 00000187`c39ff408 : Fabric+0x1c19bf
00000001`c68ff8d0 00007ff6`e88bdffa     : 00000187`c39ff408 00000001`c68ffaa0 00000187`00000000 00000000`00000000 : Fabric+0x1c1cd7
00000001`c68ff900 00007ff6`e8566a15     : 00000187`c41a05d0 00000001`c68ffaa0 00000000`00000000 00007ffb`5b80c4da : Fabric+0x14edffa
00000001`c68ff9a0 00007ff6`e8566360     : 00000000`7ffe0386 00007ffb`5b807498 00000187`c3d3a6d0 00000000`7ffe0386 : Fabric+0x1196a15
00000001`c68ffb60 00007ffb`5b806e52     : 00000187`c3d3a6d0 00000001`c68ffd58 00000000`00000000 00007ffb`5b806d3a : Fabric+0x1196360
00000001`c68ffba0 00007ffb`5b80b8e8     : 00000187`c3d3a858 00000187`c4237d80 00000000`00000000 00000000`00000000 : ntdll!TppExecuteWaitCallback+0xae
00000001`c68ffbf0 00007ffb`5a7b4cb0     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!TppWorkerThread+0x448
00000001`c68ffee0 00007ffb`5b87edcb     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x10
00000001`c68fff10 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x2b


SYMBOL_NAME:  Fabric+1bf7d8

MODULE_NAME: Fabric

IMAGE_NAME:  Fabric.exe

STACK_COMMAND:  ~20s; .ecxr ; kb

FAILURE_BUCKET_ID:  NULL_CLASS_PTR_WRITE_c0000005_Fabric.exe!Unknown

OS_VERSION:  10.0.20348.1

BUILDLAB_STR:  fe_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

IMAGE_VERSION:  11.0.2707.1

FAILURE_ID_HASH:  {92037c43-0161-0cbe-6ff0-ac2eee0530f1}

Followup:     MachineOwner

Hypothesis:

Disclaimer: This analysis is based on source code predating 11.x SF releases.

SecurityContext::CreateSessionExpirationTimer captures this as a raw pointer in the timer callback, alongside a weak pointer to the connection. The callback writes sessionExpiration_ through the raw pointer and then calls connection->OnSessionExpired():

// SecurityContext.cpp
void SecurityContext::CreateSessionExpirationTimer()
{
    static const StringLiteral SessionExpirationTimerTag("SessionExpiration");
    auto connectionWPtr = connection_;
    sessionExpirationTimer_ = Timer::Create(
        SessionExpirationTimerTag,
        [connectionWPtr, this](TimerSPtr const&)
        {
            if (auto connection = connectionWPtr.lock())
            {
                sessionExpiration_ = StopwatchTime::MaxValue;  // writes to member via raw 'this'
                connection->OnSessionExpired();
            }
        });
}

SecurityContext::~SecurityContext calls sessionExpirationTimer_->Cancel(), but the timer uses default Timer settings (allowConcurrency=true, SetCancelWait() not used). With those settings, Timer::Cancel() does not wait for in-flight callbacks and disassociates them from the threadpool:

// SecurityContext.cpp
SecurityContext::~SecurityContext()
{
    trace.SecurityContextDestructing(id_, TraceThis, --objCount);

    sessionExpirationTimer_->Cancel();  // does not wait for in-flight callbacks

    if (SecIsValidHandle(&hSecurityContext_))
    {
        DeleteSecurityContext(&hSecurityContext_);
    }
}
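
To make "waiting for in-flight callbacks" concrete, here is a minimal sketch of a cancel-and-wait gate in standard C++. It is not the Service Fabric Timer implementation, and it does not claim to show what SetCancelWait() does internally; CallbackGate, TryEnter, Leave, and CancelAndWait are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not a Service Fabric type): counts callbacks that are
// currently running so a destructor can block until they have drained.
class CallbackGate {
public:
    // Called at the start of a callback. Returns false once cancelled,
    // in which case the callback must return without touching the owner.
    bool TryEnter() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (cancelled_) return false;
        ++inFlight_;
        return true;
    }

    // Called at the end of a callback that successfully entered.
    void Leave() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--inFlight_ == 0) drained_.notify_all();
    }

    // Rejects new callbacks, then waits for running ones to finish.
    // Must not be called from inside a callback (it would wait on itself).
    void CancelAndWait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cancelled_ = true;
        drained_.wait(lock, [this] { return inFlight_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable drained_;
    int inFlight_ = 0;
    bool cancelled_ = false;
};

// Single-threaded walk-through of the gate's state transitions.
void DemoGate() {
    CallbackGate gate;
    assert(gate.TryEnter());   // a callback starts before cancellation
    gate.Leave();              // ...and finishes
    gate.CancelAndWait();      // nothing in flight, so this returns at once
    assert(!gate.TryEnter());  // a late timer fire is now rejected
}
```

In this sketch the gate has to outlive every callback that might still call TryEnter(), for example by being held through a shared pointer captured in the callback. Blocking in a destructor also has its own costs, so this is one possible direction rather than a recommended fix.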

If the timer fires while the owning SecurityContext is being destroyed (e.g., during connection teardown when disabling a node), the callback can run against a freed object, resulting in a use-after-free and an access violation. This is consistent with the dump, where the AV is a write to a near-null address (0x6a).
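
One common way to avoid this class of bug is for the callback to hold a weak reference to its owner instead of a raw this pointer. The sketch below illustrates the idea in standard C++ only. It assumes the owner is managed by std::shared_ptr, which may not match how SecurityContext is actually owned in the 11.x code; SecurityContextSketch, Connection, and Callback are hypothetical stand-ins, not Service Fabric types.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the timer's callback slot.
using Callback = std::function<void()>;

// Hypothetical stand-in for the transport connection.
struct Connection {
    int expiredCalls = 0;
    void OnSessionExpired() { ++expiredCalls; }
};

// Hypothetical stand-in for the security context. Assumption: instances are
// always owned by std::shared_ptr, so shared_from_this() is valid.
class SecurityContextSketch
    : public std::enable_shared_from_this<SecurityContextSketch> {
public:
    explicit SecurityContextSketch(std::weak_ptr<Connection> connection)
        : connection_(std::move(connection)) {}

    // Builds the expiration callback with weak references to both the owner
    // and the connection, so a late fire cannot touch a destroyed object.
    Callback MakeExpirationCallback() {
        std::weak_ptr<SecurityContextSketch> selfW = shared_from_this();
        std::weak_ptr<Connection> connectionW = connection_;
        return [selfW, connectionW]() {
            auto self = selfW.lock();
            auto connection = connectionW.lock();
            if (!self || !connection) return;  // owner or connection is gone
            self->sessionExpired = true;       // safe: 'self' keeps it alive
            connection->OnSessionExpired();
        };
    }

    bool sessionExpired = false;

private:
    std::weak_ptr<Connection> connection_;
};

// Simulates the timer firing after the owner was destroyed while the
// connection is still alive. Returns the number of OnSessionExpired() calls.
int FireAfterOwnerDestroyed() {
    auto connection = std::make_shared<Connection>();
    auto context = std::make_shared<SecurityContextSketch>(connection);
    Callback callback = context->MakeExpirationCallback();
    context.reset();  // owner destroyed; a raw 'this' capture would dangle here
    callback();       // late timer fire becomes a no-op
    return connection->expiredCalls;
}

// Simulates the normal case where the owner is still alive when the timer fires.
int FireWhileOwnerAlive() {
    auto connection = std::make_shared<Connection>();
    auto context = std::make_shared<SecurityContextSketch>(connection);
    Callback callback = context->MakeExpirationCallback();
    callback();
    return connection->expiredCalls;
}
```

Here FireAfterOwnerDestroyed() returns 0 and FireWhileOwnerAlive() returns 1. Whether this pattern applies to the real code depends on how SecurityContext lifetimes are managed there, which this report cannot confirm.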


Assignees: /cc @microsoft/service-fabric-triage
