Description
Describe the bug:
During rolling node disables/restarts on a 3-node cluster whose certificate expires in under 30 days, Fabric.exe intermittently crashes while preparing to move to the Disabled state. Application services on the node do not receive their graceful shutdown callbacks; ETL traces show these services as "terminated." The exact relationship between the Fabric.exe crash and the service terminations is not yet clear, but the two appear to be related. Crash dumps show an access violation consistent with a use-after-free during secure session expiration timer execution.
Timeline:
- Nov 22: Completed several rolling restart iterations without any problems.
- Nov 25: Repeated the test and started experiencing failures. Upon investigation, found that Fabric.exe was crashing. This is when we noticed the certificate expiration warning had appeared in the cluster (cert expiring Dec 25, 2025, exactly 30 days out).
- Nov 26: Generated a new certificate (expires February) and updated the cluster. The Fabric.exe crashes appear to have been resolved.
Area/Component:
Transport security / secure session expiration timer.
To Reproduce:
Steps to reproduce the behavior:
1. Provision a 3-node cluster (Standard_D4s_v4) with a certificate expiring in <30 days.
2. Deploy a test application to the cluster.
3. In a loop, perform rolling graceful restarts of Service Fabric VMs:
   - Disable-ServiceFabricNode -NodeName {node-name} -Intent Restart -Force
   - Wait for the node to gracefully stop all application services.
   - Restart the VM.
   - VM returns in Disabled state.
   - Enable-ServiceFabricNode -NodeName {node-name}
4. Continue repeating step 3 until the observed behavior (see below) occurs. Repro is intermittent.
Expected behavior:
Disabling a Service Fabric node during a rolling restart should gracefully drain and stop services, honoring lifecycle callbacks, without crashing Fabric.exe or terminating application services abruptly; once re-enabled, services should come back normally.
Observed behavior:
Intermittent Fabric.exe crash while the node is being disabled/stopped. Crash dumps show an access violation consistent with a use-after-free during the secure session expiration timer callback. Application services on the node terminate ungracefully without receiving shutdown callbacks; the exact relationship to the Fabric.exe crash is not yet clear. The issues started after the cluster began warning that the cluster certificate (issued by Let's Encrypt) would expire in <30 days.
Screenshots:
N/A.
Service Fabric Runtime Version:
SF runtime: 11.0.2707.1
Environment:
- Azure 3-node cluster (Standard_D4s_v4) on Windows Server 2022 Datacenter Azure Edition.
- Certificates: Let's Encrypt-issued cluster certs.
If this is a regression, which version did it regress from?
- Unknown.
Additional context:
- .dmp and .etl files available upon request.
- If it matters, several service instances use the FabricClient to connect to the SF cluster.
Fabric Dump Analysis:
WinDbg dump analysis of Fabric.exe.3984.dmp:
0:020> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
KEY_VALUES_STRING: 1
Key : AV.Dereference
Value: NullClassPtr
Key : AV.Fault
Value: Write
Key : Analysis.CPU.mSec
Value: 1999
Key : Analysis.DebugAnalysisManager
Value: Create
Key : Analysis.Elapsed.mSec
Value: 10426
Key : Analysis.Init.CPU.mSec
Value: 3687
Key : Analysis.Init.Elapsed.mSec
Value: 31768
Key : Analysis.Memory.CommitPeak.Mb
Value: 125
Key : Timeline.OS.Boot.DeltaSec
Value: 245840
Key : Timeline.Process.Start.DeltaSec
Value: 245818
Key : WER.OS.Branch
Value: fe_release
Key : WER.OS.Timestamp
Value: 2021-05-07T15:00:00Z
Key : WER.OS.Version
Value: 10.0.20348.1
Key : WER.Process.Version
Value: 11.0.2707.1
FILE_IN_CAB: Fabric.exe.3984.dmp
NTGLOBALFLAG: 0
PROCESS_BAM_CURRENT_THROTTLED: 0
PROCESS_BAM_PREVIOUS_THROTTLED: 0
APPLICATION_VERIFIER_FLAGS: 0
CONTEXT: (.ecxr)
rax=00000001c68fea68 rbx=00000001c68feae8 rcx=0000000000000000
rdx=00000001c68feae8 rsi=0000000000000000 rdi=0000000000000000
rip=00007ff6e758f7d8 rsp=00000001c68fea20 rbp=00007ff6e9d5b901
r8=0000000000000000 r9=00000001c68fea80 r10=0000000000000002
r11=00000001c68fea60 r12=0000000000000000 r13=00000001c68feae8
r14=0000000000000001 r15=0000000000000000
iopl=0 nv up ei pl nz na pe nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
Fabric+0x1bf7d8:
00007ff6`e758f7d8 4088696a mov byte ptr [rcx+6Ah],bpl ds:00000000`0000006a=??
Resetting default scope
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ff6e758f7d8 (Fabric+0x00000000001bf7d8)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000001
Parameter[1]: 000000000000006a
Attempt to write to address 000000000000006a
PROCESS_NAME: Fabric.exe
WRITE_ADDRESS: 000000000000006a
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.
EXCEPTION_CODE_STR: c0000005
EXCEPTION_PARAMETER1: 0000000000000001
EXCEPTION_PARAMETER2: 000000000000006a
STACK_TEXT:
00000001`c68fea20 00007ff6`e758f519 : 00007ff6`e9d5b950 00000001`c68feae8 00007ff6`e9d5b950 00000000`00000000 : Fabric+0x1bf7d8
00000001`c68fea70 00007ff6`e8990e4a : 00000187`c3c2cd60 00000019`254d3800 7fffffff`ffffffff ffffffe6`dab2c800 : Fabric+0x1bf519
00000001`c68fead0 00007ff6`e7d0c71a : 00000187`c3e54b50 00000187`c37b3ac0 00000000`00000000 00000187`c3c2cd70 : Fabric+0x15c0e4a
00000001`c68febe0 00007ff6`e89942fa : 00000187`00000013 00000187`00000000 00000187`00000004 00000187`00000001 : Fabric+0x93c71a
00000001`c68fec30 00007ff6`e7d0c71a : 00000187`c37b3ac0 00000187`c4243750 00000000`00000000 00000000`00000000 : Fabric+0x15c42fa
00000001`c68fec90 00007ff6`e8993d1e : 00000187`00000013 00000000`00000000 00000000`00000004 00007ffb`00000001 : Fabric+0x93c71a
00000001`c68fece0 00007ff6`e7d1742a : 00000187`c4243750 00000187`c3d3aeb0 00000000`00000000 00007ff6`e898f5c5 : Fabric+0x15c3d1e
00000001`c68ff5e0 00007ff6`e89905c2 : 00000187`00000013 00000001`00000000 00000187`00000004 00007ffb`00000001 : Fabric+0x94742a
00000001`c68ff630 00007ff6`e947cb3f : 00000187`c3d3aeb0 00000187`c4377bf8 00000001`c68ff7f0 00007ff6`e74ef433 : Fabric+0x15c05c2
00000001`c68ff6f0 00007ff6`e75919bf : 00000187`00000008 00007ff6`e74f33d5 00000001`c68ff800 00007ff6`e75547dc : Fabric+0x20acb3f
00000001`c68ff720 00007ff6`e7591d2f : 00000001`c68ff800 00000001`c68ffd58 00000001`c68ff8c0 00007ff6`e76809bf : Fabric+0x1c19bf
00000001`c68ff7a0 00007ff6`e88bda61 : 00000187`c39ff100 00000001`c68ffaa0 00000187`c38204c0 00000000`00000000 : Fabric+0x1c1d2f
00000001`c68ff7d0 00007ff6`e75919bf : 00000187`00000008 00000000`00000000 00000187`c39ff150 00000001`c68ffaa0 : Fabric+0x14eda61
00000001`c68ff850 00007ff6`e7591cd7 : 00000001`c68ff940 00000187`c39ff150 00000001`c68ffaa0 00000187`c39ff408 : Fabric+0x1c19bf
00000001`c68ff8d0 00007ff6`e88bdffa : 00000187`c39ff408 00000001`c68ffaa0 00000187`00000000 00000000`00000000 : Fabric+0x1c1cd7
00000001`c68ff900 00007ff6`e8566a15 : 00000187`c41a05d0 00000001`c68ffaa0 00000000`00000000 00007ffb`5b80c4da : Fabric+0x14edffa
00000001`c68ff9a0 00007ff6`e8566360 : 00000000`7ffe0386 00007ffb`5b807498 00000187`c3d3a6d0 00000000`7ffe0386 : Fabric+0x1196a15
00000001`c68ffb60 00007ffb`5b806e52 : 00000187`c3d3a6d0 00000001`c68ffd58 00000000`00000000 00007ffb`5b806d3a : Fabric+0x1196360
00000001`c68ffba0 00007ffb`5b80b8e8 : 00000187`c3d3a858 00000187`c4237d80 00000000`00000000 00000000`00000000 : ntdll!TppExecuteWaitCallback+0xae
00000001`c68ffbf0 00007ffb`5a7b4cb0 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!TppWorkerThread+0x448
00000001`c68ffee0 00007ffb`5b87edcb : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x10
00000001`c68fff10 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x2b
SYMBOL_NAME: Fabric+1bf7d8
MODULE_NAME: Fabric
IMAGE_NAME: Fabric.exe
STACK_COMMAND: ~20s; .ecxr ; kb
FAILURE_BUCKET_ID: NULL_CLASS_PTR_WRITE_c0000005_Fabric.exe!Unknown
OS_VERSION: 10.0.20348.1
BUILDLAB_STR: fe_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
IMAGE_VERSION: 11.0.2707.1
FAILURE_ID_HASH: {92037c43-0161-0cbe-6ff0-ac2eee0530f1}
Followup: MachineOwner
Hypothesis:
Disclaimer: This analysis is based on source code predating 11.x SF releases.
SecurityContext::CreateSessionExpirationTimer captures raw this in the timer callback and uses a weak pointer to the connection. The callback writes sessionExpiration_ and calls connection->OnSessionExpired():
// SecurityContext.cpp
void SecurityContext::CreateSessionExpirationTimer()
{
    static const StringLiteral SessionExpirationTimerTag("SessionExpiration");
    auto connectionWPtr = connection_;
    sessionExpirationTimer_ = Timer::Create(
        SessionExpirationTimerTag,
        [connectionWPtr, this](TimerSPtr const&)
        {
            if (auto connection = connectionWPtr.lock())
            {
                sessionExpiration_ = StopwatchTime::MaxValue; // writes to member via raw 'this'
                connection->OnSessionExpired();
            }
        });
}
SecurityContext::~SecurityContext calls sessionExpirationTimer_->Cancel(), but the timer is created with default Timer settings (allowConcurrency=true, SetCancelWait() not used). With those settings, Timer::Cancel() does not wait for in-flight callbacks and disassociates them from the threadpool:
// SecurityContext.cpp
SecurityContext::~SecurityContext()
{
    trace.SecurityContextDestructing(id_, TraceThis, --objCount);
    sessionExpirationTimer_->Cancel(); // does not wait for in-flight callbacks
    if (SecIsValidHandle(&hSecurityContext_))
    {
        DeleteSecurityContext(&hSecurityContext_);
    }
}
If the timer fires while the owning SecurityContext is being destroyed (e.g., during connection teardown when disabling a node), the callback can run against a freed object, resulting in a use-after-free access violation. This is consistent with the dump, which shows an AV on a write to a near-null address (0x6a).
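One possible mitigation, sketched here only to illustrate the hypothesis, is to have the callback capture a weak pointer to its owner instead of raw this, so a late callback becomes a no-op once the owner is gone. The sketch below uses only the C++ standard library. ConnectionSketch, SecurityContextSketch, MakeExpirationCallback and RunTeardownDemo are hypothetical names, not Service Fabric types, and the sketch assumes the owning object is held by a shared_ptr, which may not match how SecurityContext is actually owned in 11.x.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the transport connection (not the Service Fabric type).
struct ConnectionSketch
{
    std::atomic<int> expiredCalls{0};
    void OnSessionExpired() { ++expiredCalls; }
};

// Hypothetical stand-in for SecurityContext. Assumes shared_ptr ownership.
class SecurityContextSketch : public std::enable_shared_from_this<SecurityContextSketch>
{
public:
    explicit SecurityContextSketch(std::weak_ptr<ConnectionSketch> connection)
        : connection_(std::move(connection))
    {
    }

    // Returns the callback a timer would invoke on a threadpool thread.
    // Must be called on an object that is already owned by a shared_ptr.
    std::function<bool()> MakeExpirationCallback()
    {
        std::weak_ptr<SecurityContextSketch> selfWPtr = shared_from_this();
        std::weak_ptr<ConnectionSketch> connectionWPtr = connection_;
        return [selfWPtr, connectionWPtr]() -> bool
        {
            auto self = selfWPtr.lock();             // pins the owner for the callback's duration
            auto connection = connectionWPtr.lock();
            if (!self || !connection)
            {
                return false; // owner or connection already gone: touch nothing
            }
            self->sessionExpired_ = true;            // member write through a live owner
            connection->OnSessionExpired();
            return true;
        };
    }

    bool SessionExpired() const { return sessionExpired_; }

private:
    std::weak_ptr<ConnectionSketch> connection_;
    std::atomic<bool> sessionExpired_{false};
};

// Usage sketch: invoke the callback once while the owner is alive and once
// after teardown. Returns how many times the connection was notified.
int RunTeardownDemo()
{
    auto connection = std::make_shared<ConnectionSketch>();
    auto context = std::make_shared<SecurityContextSketch>(connection);
    auto callback = context->MakeExpirationCallback();

    bool firedWhileAlive = callback();
    assert(firedWhileAlive && context->SessionExpired());

    context.reset();                      // owner destroyed; the "timer" still holds the callback
    bool firedAfterTeardown = callback(); // late callback is a no-op
    assert(!firedAfterTeardown);
    (void)firedWhileAlive;
    (void)firedAfterTeardown;             // silence unused warnings when NDEBUG is set
    return connection->expiredCalls.load();
}
```

A trade-off of this pattern is that the callback can briefly hold the last reference to the owner, in which case the destructor runs on the timer thread. The alternative is to have cancellation wait for in-flight callbacks in the destructor, which the hypothesis above suggests is not happening with the default Timer settings.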
Assignees: /cc @microsoft/service-fabric-triage