SAP HANA High Availability (…Well, my version) on AWS Using EventBridge and Lambda

Introduction

I don’t like Pacemaker clusters.

They work until they don’t and then you usually lose everything. They are very difficult to get right. On nearly all the HA projects I’ve worked on over the past 15 years, they’ve either been taken out completely in favour of manual failover using hdbnsutil -sr_takeover or replaced with a more robust solution like SIOS LifeKeeper or HPE Serviceguard. Both are supported on AWS and Azure and are arguably superior to Pacemaker, but they come at a significant commercial cost.

So I built something different. No cluster software on the instances at all. The HA logic lives entirely in AWS-native services — EventBridge, Lambda, CloudWatch, and Route 53 — and delivers the same outcome: automated failover, HSR takeover, and automatic re-registration of the failed node as the new secondary when it recovers.

Important caveat: this is not a supported SAP HA configuration. On AWS and Azure, SAP-certified HA solutions include Pacemaker with the SAPHana resource agent, SIOS LifeKeeper, and HPE Serviceguard. If you’re building a production landscape, stop here and go and implement one of those. If you’re interested in what’s possible with AWS-native services, read on.

The solution handles two failure scenarios:

- OS/instance failure: detected in seconds via EventBridge EC2 state-change notifications
- HANA process crash: detected within 30 seconds via a custom CloudWatch metric pushed by a lightweight monitor running on each node

Both trigger the same automated response: HSR takeover on the surviving node, Route 53 DNS update, re-registration of the failed node as secondary when it comes back.

Full repo: https://github.com/neilaspin/hana-ha-aws

See Appendix — Deployment Steps for the Ansible, Terraform, shell and script commands.

Architecture

┌─────────────────────────────────────────────────────────┐
│                         AWS VPC                         │
│                                                         │
│  ┌──────────────┐    HSR (syncmem)    ┌──────────────┐  │
│  │  HANA SITE1  │◄───────────────────►│  HANA SITE2  │  │
│  │  (primary)   │                     │  (secondary) │  │
│  └──────────────┘                     └──────────────┘  │
│         │                                    │          │
│         └──────────────────────────────────┬─┘          │
│                              CloudWatch    │            │
│                            Custom Metrics  │            │
└────────────────────────────────────────────┼────────────┘
                    ┌────────────────────────┼──────────────┐
                    │                        │              │
               EventBridge               CloudWatch      Route 53
           (EC2 state change)      (HANA process alarm) (private zone)
                    │                        │              │
                    └────────────┬───────────┘              │
                                 │                          │
                          Lambda Function                   │
                          (hana-failover)                   │
                                 │                          │
                                 └──────────────────────────┘
                                   updates DNS on failover

Component | Role
EC2 (r5.xlarge, SLES for SAP) | HANA nodes, one per AZ
HANA System Replication (HSR) | syncmem / logreplay mode
Amazon EventBridge | Detects EC2 instance state changes in seconds
Amazon CloudWatch | Custom metric from HANA process monitor (10s interval)
AWS Lambda (Python 3.12) | Orchestrates takeover and re-registration via SSH
Route 53 private hosted zone | Stable DNS endpoint for HANA clients
AWS Secrets Manager | Stores SSH private key for Lambda

How It Works

Failure Detection

Two complementary detection paths run in parallel. You need both — they catch different things.

Path 1 — EC2 instance failure (seconds)

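EventBridge fires the moment an EC2 instance transitions to stopping, stopped, or terminated, so any failure that ends in one of those states (a hard stop, or a host failure that stops the instance) is covered. No polling. The event arrives at the Lambda within one to two seconds of the state change. A kernel panic that leaves the instance in the running state emits no state-change event, so this path does not catch it.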
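
To make the shape of that event concrete, here is a minimal sketch of the check applied to an incoming state-change notification. The helper name is_failure_event and the instance IDs are illustrative assumptions, not taken from the repository; the fields it reads (source, detail.state, detail.instance-id) are the same ones the handler later in this post uses.

```python
# Hypothetical helper: decides whether an EC2 state-change event
# describes one of the watched HANA nodes going down.
FAILURE_STATES = {"stopping", "stopped", "terminated"}

def is_failure_event(event, watched_ids):
    """Return True for an EC2 state change to a failure state on a watched instance."""
    if event.get("source") != "aws.ec2":
        return False
    detail = event.get("detail", {})
    return (detail.get("state") in FAILURE_STATES
            and detail.get("instance-id") in watched_ids)

# Trimmed, hand-written example event; the instance IDs are placeholders.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0aaaaaaaaaaaaaaaa", "state": "stopping"},
}
watched = {"i-0aaaaaaaaaaaaaaaa", "i-0bbbbbbbbbbbbbbbb"}
print(is_failure_event(sample_event, watched))
```

Because the instance passes through stopping before stopped, the same failure can produce more than one matching event, which is why the Lambda re-checks Route 53 before acting.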

Path 2 — HANA process crash (≤30 seconds)

I simulated losing HANA by issuing HDB kill-9 on a running instance. This failure is completely invisible to EC2 monitoring: the OS stays up, the status checks stay green, HANA is dead and nobody knows. A lightweight shell script runs as a systemd service on each node, checking every 10 seconds whether hdbnameserver is alive and pushing a custom CloudWatch metric (namespace HANA/Health, metric HANARunning): 1 for healthy, 0 for down. Three consecutive zeros trigger a CloudWatch alarm, which notifies the Lambda via SNS.

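#!/bin/bash
# /usr/local/bin/hana_monitor.sh: runs as a systemd service
# Instance ID and region come from the instance metadata service.
# These token-less requests only work while IMDSv1 is enabled on the instance.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

while true; do
  # 1 if the HANA nameserver process is running, 0 otherwise
  if pgrep -f hdbnameserver > /dev/null 2>&1; then
    VALUE=1
  else
    VALUE=0
  fi

  # Publish a high-resolution data point for this instance
  aws cloudwatch put-metric-data \
    --namespace HANA/Health \
    --metric-name HANARunning \
    --dimensions InstanceId="$INSTANCE_ID" \
    --value "$VALUE" \
    --storage-resolution 1 \
    --region "$REGION"

  sleep 10
done

# /etc/systemd/system/hana-monitor.service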
[Unit]
Description=HANA Process Monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/hana_monitor.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
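
The "three consecutive zeros" rule is easy to reason about offline. The sketch below is a simplified model, not CloudWatch's actual evaluation engine: it assumes exactly one data point per 10-second period and no missing data. The function name and the sample series are illustrative.

```python
def alarm_fires(datapoints, evaluation_periods=3, threshold=1):
    """Return the index of the data point at which the alarm would fire, or None.

    Simplified model: the alarm fires once the most recent
    `evaluation_periods` points are all below `threshold`.
    """
    breaching_run = 0
    for i, value in enumerate(datapoints):
        # Extend the run of breaching points, or reset it on a healthy point
        breaching_run = breaching_run + 1 if value < threshold else 0
        if breaching_run >= evaluation_periods:
            return i
    return None

crash = [1, 1, 1, 0, 0, 0, 0]   # HANA killed after the third poll
blip = [1, 0, 1, 0, 0, 1]       # never three zeros in a row
print(alarm_fires(crash))
print(alarm_fires(blip))
```

In this model a single missed or failed poll does not trigger a takeover; only a sustained run of zeros does.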

Failover

When the Lambda fires — from either path — it:

1. Checks Route 53 to confirm the failing instance is the current primary
2. SSHes to the surviving node using the key from Secrets Manager
3. Runs hdbnsutil -sr_takeover
4. Updates the Route 53 A record to point to the new primary’s private IP
5. Immediately attempts re-registration of the failed node as the new secondary

Route 53 TTL is 30 seconds. HANA clients reconnect to the new primary as soon as their connection drops and DNS resolves.
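
As a rough back-of-the-envelope budget for how long clients may be unable to reach HANA, the sketch below adds up detection, takeover, and DNS caching. The 30-second TTL and the detection figures come from this post; the takeover duration is a placeholder assumption, since it depends on the system and is not measured here.

```python
def worst_case_outage(detection_s, takeover_s, dns_ttl_s=30):
    """Upper bound in seconds before a client resolves the new primary.

    Assumes the client cached the old record just before the update,
    so it waits a full TTL after Route 53 is changed.
    """
    return detection_s + takeover_s + dns_ttl_s

TAKEOVER_S = 60  # placeholder assumption, not a measured value

print(worst_case_outage(detection_s=2, takeover_s=TAKEOVER_S))   # EC2 stop path
print(worst_case_outage(detection_s=30, takeover_s=TAKEOVER_S))  # process crash path
```

Clients or resolvers that cache DNS answers for longer than the record's TTL would extend this bound.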

Re-registration

After the takeover, the Lambda SSHes into the former primary and:

1. Stops HANA (or confirms it is already stopped)
2. Waits for all HANA processes to exit cleanly
3. Runs hdbnsutil -sr_register pointing at the new primary
4. Starts HANA, which comes up as secondary and begins log replay

If re-registration fails because the instance isn’t reachable yet, the Lambda logs a warning and retries on the next EC2 running state-change event.
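
The stop, wait, register, start sequence described above boils down to a handful of remote commands. The sketch below only builds the command strings, so it can be read and tested without SSH access. build_reregistration_commands is a hypothetical helper, and the SID, site name, and host name in the usage lines are placeholders; the hdbnsutil options are the ones used in the Lambda later in this post.

```python
def build_reregistration_commands(sid, site_name, remote_host, instance_no="00"):
    """Return the shell commands run on the returning node, in order."""
    adm = f"{sid.lower()}adm"
    register = (
        "hdbnsutil -sr_register"
        f" --name={site_name}"
        f" --remoteHost={remote_host}"
        f" --remoteInstance={instance_no}"
        " --replicationMode=syncmem"
        " --operationMode=logreplay"
    )
    return [
        f"sudo su - {adm} -c 'HDB stop'",       # 1. stop HANA if it is running
        f"sudo su - {adm} -c 'HDB info'",       # 2. polled until no HANA processes remain
        f"sudo su - {adm} -c \"{register}\"",   # 3. register against the new primary
        f"sudo su - {adm} -c 'HDB start'",      # 4. start HANA as the secondary
    ]

for cmd in build_reregistration_commands("HDB", "SITE1", "ip-10-0-2-15"):
    print(cmd)
```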

Role Awareness

The Lambda doesn’t hardcode which node is primary. It queries Route 53 on every invocation — whichever node’s private IP matches the current record is the primary. This means it works correctly across multiple failovers in either direction without any reconfiguration.
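
A minimal sketch of that lookup, separated from the AWS calls so the logic is visible on its own. resolve_roles is a hypothetical helper, and the instance IDs and IP addresses are placeholders.

```python
def resolve_roles(dns_ip, node_private_ips):
    """Map the Route 53 record value to (primary_id, secondary_id).

    node_private_ips: dict of instance ID -> private IP for the two nodes.
    Raises ValueError if the record matches neither node.
    """
    for instance_id, ip in node_private_ips.items():
        if ip == dns_ip:
            others = [i for i in node_private_ips if i != instance_id]
            return instance_id, others[0]
    raise ValueError(f"DNS record {dns_ip} matches no known node")

nodes = {"i-site1": "10.0.1.10", "i-site2": "10.0.2.10"}  # placeholder IDs and IPs
print(resolve_roles("10.0.1.10", nodes))  # before failover
print(resolve_roles("10.0.2.10", nodes))  # after failover to SITE2
```

The design choice here is that the DNS record is the single source of truth for "who is primary", so nothing has to be stored or updated separately after a failover.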

The Lambda Function

import boto3
import json
import os
import stat
import tempfile
import time
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

r53 = boto3.client('route53')
ec2 = boto3.client('ec2')
secrets = boto3.client('secretsmanager')

# Deployment-specific settings come from Lambda environment variables
PRIMARY_INSTANCE_ID = os.environ['PRIMARY_INSTANCE_ID']
SECONDARY_INSTANCE_ID = os.environ['SECONDARY_INSTANCE_ID']
HOSTED_ZONE_ID = os.environ['HOSTED_ZONE_ID']
HANA_SID = os.environ.get('HANA_SID', 'HDB')
HANA_INSTANCE = os.environ.get('HANA_INSTANCE', '00')
HANA_HOSTNAME = os.environ['HANA_HOSTNAME']
SSH_KEY_SECRET_ARN = os.environ['SSH_KEY_SECRET_ARN']
SSH_USER = os.environ.get('SSH_USER', 'ec2-user')

def get_ssh_key_file():
    """Fetch the SSH private key from Secrets Manager and write it to an owner-read-only temp file."""
    secret = secrets.get_secret_value(SecretId=SSH_KEY_SECRET_ARN)
    key_material = secret.get('SecretString') or secret['SecretBinary'].decode()
    f = tempfile.NamedTemporaryFile(mode='w', suffix='.pem', delete=False)
    f.write(key_material)
    f.close()
    os.chmod(f.name, stat.S_IRUSR)
    return f.name

def get_public_ip(instance_id):
    # Raises KeyError while the instance has no public IP (for example when stopped)
    r = ec2.describe_instances(InstanceIds=[instance_id])
    return r['Reservations'][0]['Instances'][0]['PublicIpAddress']

def get_private_ip(instance_id):
    r = ec2.describe_instances(InstanceIds=[instance_id])
    return r['Reservations'][0]['Instances'][0]['PrivateIpAddress']

def ssh_run(host, key_file, command, timeout=300):
    """Run a command over SSH and return (exit code, stdout, stderr)."""
    import paramiko
    client = paramiko.SSHClient()
    # AutoAddPolicy accepts unknown host keys without verification
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(hostname=host, username=SSH_USER, key_filename=key_file, timeout=30)
        _, stdout, stderr = client.exec_command(command, timeout=timeout)
        # Read the output before asking for the exit status, which can
        # otherwise hang when the command produces a lot of output
        out = stdout.read().decode()
        err = stderr.read().decode()
        rc = stdout.channel.recv_exit_status()
    finally:
        client.close()
    return rc, out, err

def update_route53(new_ip):
    """Point the HANA client hostname at the new primary."""
    r53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': HANA_HOSTNAME,
                    'Type': 'A',
                    'TTL': 30,
                    'ResourceRecords': [{'Value': new_ip}],
                }
            }]
        }
    )
    logger.info(f"Route 53 updated: {HANA_HOSTNAME} -> {new_ip}")

def get_r53_ip():
    """Return the IP the HANA hostname currently points at, or None if no record exists."""
    response = r53.list_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        StartRecordName=HANA_HOSTNAME,
        StartRecordType='A',
        MaxItems='1',
    )
    for rrset in response['ResourceRecordSets']:
        if rrset['Name'].rstrip('.') == HANA_HOSTNAME.rstrip('.'):
            return rrset['ResourceRecords'][0]['Value']
    return None

def handle_failover(failing_instance_id):
    # The surviving node is whichever of the two instances did not fail
    target_id = (SECONDARY_INSTANCE_ID if failing_instance_id == PRIMARY_INSTANCE_ID
                 else PRIMARY_INSTANCE_ID)
    logger.info(f"Initiating HSR takeover: {failing_instance_id} down, target={target_id}")
    key_file = get_ssh_key_file()
    target_ip = get_public_ip(target_id)

    rc, out, err = ssh_run(
        target_ip, key_file,
        f"sudo su - {HANA_SID.lower()}adm -c 'hdbnsutil -sr_takeover'",
    )
    logger.info(f"Takeover stdout: {out}")
    if err:
        logger.warning(f"Takeover stderr: {err}")
    if rc != 0:
        raise RuntimeError(f"sr_takeover failed (rc={rc}): {err}")

    # Repoint DNS only once the takeover has succeeded
    new_primary_ip = get_private_ip(target_id)
    update_route53(new_primary_ip)
    logger.info(f"Failover complete, new primary: {target_id} ({new_primary_ip})")

    # Best effort: the failed node is often unreachable at this point
    try:
        handle_reregistration(failing_instance_id)
    except Exception as e:
        logger.warning(f"Re-registration of {failing_instance_id} failed: {e}; will retry on next boot")

def handle_secondary_restart(instance_id):
    """Restart HANA on a secondary whose process monitor raised an alarm."""
    logger.info(f"Restarting HANA on secondary {instance_id}")
    key_file = get_ssh_key_file()
    ip = get_public_ip(instance_id)
    rc, out, err = ssh_run(ip, key_file,
                           f"sudo su - {HANA_SID.lower()}adm -c 'HDB start'", timeout=600)
    if rc != 0:
        raise RuntimeError(f"HDB start failed on secondary (rc={rc}): {err}")
    logger.info(f"HANA restarted on secondary {instance_id}")

def handle_reregistration(returning_id):
    logger.info(f"Re-registering {returning_id} as HSR secondary")

    current_primary_id = (SECONDARY_INSTANCE_ID if returning_id == PRIMARY_INSTANCE_ID
                          else PRIMARY_INSTANCE_ID)
    site_name = 'SITE1' if returning_id == PRIMARY_INSTANCE_ID else 'SITE2'
    primary_private_ip = get_private_ip(current_primary_id)
    # Host name derived from the private IP, e.g. ip-10-0-1-10
    primary_hostname = "ip-" + primary_private_ip.replace('.', '-')

    key_file = get_ssh_key_file()
    returning_ip = None

    # Wait for SSH on the returning node: 12 attempts, 15 seconds apart
    for attempt in range(12):
        try:
            returning_ip = get_public_ip(returning_id)
            rc, _, _ = ssh_run(returning_ip, key_file, "echo ready", timeout=15)
            if rc == 0:
                break
        except Exception as e:
            logger.info(f"SSH not ready on {returning_id} (attempt {attempt+1}/12): {e}")
        time.sleep(15)
    else:
        # Without this guard the steps below would run against an unreachable host
        raise RuntimeError(f"SSH never became ready on {returning_id}")

    # Stop HANA; "|| true" tolerates the case where it is already stopped
    ssh_run(returning_ip, key_file,
            f"sudo su - {HANA_SID.lower()}adm -c 'HDB stop' 2>&1 || true")

    # Poll until hdbdaemon has gone (40 checks, 15 seconds apart)
    for _ in range(40):
        rc, out, _ = ssh_run(returning_ip, key_file,
                             f"sudo su - {HANA_SID.lower()}adm -c 'HDB info' 2>&1 || true", timeout=30)
        if 'hdbdaemon' not in out:
            break
        time.sleep(15)
    else:
        raise RuntimeError("HANA did not stop within 10 minutes")

    rc, out, err = ssh_run(
        returning_ip, key_file,
        f'sudo su - {HANA_SID.lower()}adm -c "'
        f'hdbnsutil -sr_register'
        f' --name={site_name}'
        f' --remoteHost={primary_hostname}'
        f' --remoteInstance={HANA_INSTANCE}'
        f' --replicationMode=syncmem'
        f' --operationMode=logreplay"',
    )
    if rc != 0:
        raise RuntimeError(f"sr_register failed (rc={rc}): {err}")

    ssh_run(returning_ip, key_file,
            f"sudo su - {HANA_SID.lower()}adm -c 'HDB start'",
            timeout=600)

    logger.info(f"Re-registration complete for {returning_id}")

def handler(event, context):
    logger.info(f"Event: {json.dumps(event)}")

    # Path 1: EC2 state-change notification delivered by EventBridge
    if event.get('source') == 'aws.ec2':
        detail = event.get('detail', {})
        state = detail.get('state')
        instance_id = detail.get('instance-id')

        if (state in ('stopped', 'terminated', 'stopping')
                and instance_id in (PRIMARY_INSTANCE_ID, SECONDARY_INSTANCE_ID)):
            if get_r53_ip() == get_private_ip(instance_id):
                handle_failover(instance_id)
            else:
                logger.info(f"{instance_id} stopped but is not current primary, no action")

        elif (state == 'running'
                and instance_id in (PRIMARY_INSTANCE_ID, SECONDARY_INSTANCE_ID)):
            if get_r53_ip() != get_private_ip(instance_id):
                handle_reregistration(instance_id)
            else:
                logger.info(f"{instance_id} is already current primary, no action")
        return

    # Path 2: CloudWatch alarm delivered via SNS
    if 'Records' in event:
        for record in event['Records']:
            if record.get('EventSource') == 'aws:sns':
                message = json.loads(record['Sns']['Message'])
                if message.get('NewStateValue') == 'ALARM':
                    dims = message.get('Trigger', {}).get('Dimensions', [])
                    instance_id = next(
                        (d['value'] for d in dims if d['name'] == 'InstanceId'), None)
                    if instance_id and get_r53_ip() == get_private_ip(instance_id):
                        handle_failover(instance_id)
                    elif instance_id:
                        logger.info(f"HANA alarm for {instance_id}: secondary, restarting HANA")
                        handle_secondary_restart(instance_id)
                    else:
                        logger.warning("HANA alarm with no InstanceId dimension")
        return

    logger.warning("Unrecognized event format")
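
The SNS branch of the handler is the less obvious of the two, because the alarm details arrive as a JSON string nested inside the SNS record. The sketch below isolates that parsing step. instance_from_alarm is a hypothetical helper, and the sample notification is hand-written and heavily trimmed; it contains only the fields the handler above reads.

```python
import json

def instance_from_alarm(sns_event):
    """Return the InstanceId of the first record in ALARM state, or None."""
    for record in sns_event.get("Records", []):
        if record.get("EventSource") != "aws:sns":
            continue
        # The alarm payload is a JSON document serialized into Sns.Message
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                return dim.get("value")
    return None

# Trimmed example; the instance ID is a placeholder.
alarm_message = {
    "AlarmName": "hana-process-failed-i-0aaaaaaaaaaaaaaaa",
    "NewStateValue": "ALARM",
    "Trigger": {
        "MetricName": "HANARunning",
        "Namespace": "HANA/Health",
        "Dimensions": [{"name": "InstanceId", "value": "i-0aaaaaaaaaaaaaaaa"}],
    },
}
sample = {"Records": [{"EventSource": "aws:sns",
                       "Sns": {"Message": json.dumps(alarm_message)}}]}
print(instance_from_alarm(sample))
```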

Building the Lambda Package

Paramiko’s compiled dependencies must be built for Linux x86_64, not for your Mac or Windows build machine. Use pip’s --platform flag:

#!/bin/bash
set -e
PACKAGE_DIR="./lambda/package"
rm -rf "$PACKAGE_DIR" && mkdir -p "$PACKAGE_DIR"
cp lambda/failover.py "$PACKAGE_DIR/"
# Fetch prebuilt Linux wheels regardless of the build machine's OS
pip install paramiko \
  --platform manylinux2014_x86_64 \
  --python-version 3.12 \
  --only-binary=:all: \
  --target "$PACKAGE_DIR/"
cd "$PACKAGE_DIR" && zip -r ../lambda_failover.zip .

Infrastructure (Terraform)

Key excerpts — full Terraform is in the companion repository.

EventBridge — watches both nodes for instance state changes:

resource "aws_cloudwatch_event_rule" "primary_down" {
  name = "hana-primary-down"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance State-change Notification"]
    detail = {
      state       = ["stopped", "terminated", "stopping"]
      instance-id = [aws_instance.hana_primary.id, aws_instance.hana_secondary.id]
    }
  })
}

resource "aws_cloudwatch_event_rule" "instance_running" {
  name = "hana-instance-running"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance State-change Notification"]
    detail = {
      state       = ["running"]
      instance-id = [aws_instance.hana_primary.id, aws_instance.hana_secondary.id]
    }
  })
}

Route 53 private hosted zone:

resource "aws_route53_zone" "hana" {
  name = var.private_zone_name # e.g. hana.internal
  vpc {
    vpc_id = aws_vpc.hana.id
  }
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.hana.zone_id
  name    = "hana-primary.${var.private_zone_name}"
  type    = "A"
  ttl     = 30
  records = [aws_instance.hana_primary.private_ip]
}

CloudWatch alarms — one per node, 10-second high-resolution periods:

for INSTANCE in i-0xxxxxxxxxxxx i-0yyyyyyyyyyyy; do
  aws cloudwatch put-metric-alarm \
    --alarm-name "hana-process-failed-${INSTANCE}" \
    --namespace HANA/Health \
    --metric-name HANARunning \
    --dimensions Name=InstanceId,Value="${INSTANCE}" \
    --period 10 \
    --evaluation-periods 3 \
    --statistic Maximum \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data ignore \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:hana-failover \
    --region eu-west-1
done

Note: --treat-missing-data ignore matters here. Set it to breaching and you’ll get false alarms in the window between alarm creation and the first metric arriving.
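
For anyone who would rather create the alarms from Python than from the CLI, the sketch below builds the equivalent keyword arguments for boto3's put_metric_alarm. It is an illustration rather than part of the repository: hana_alarm_params is a hypothetical helper, and the instance ID and topic ARN are the same placeholders used in the CLI example. The function only builds a dictionary, so it runs without AWS credentials.

```python
def hana_alarm_params(instance_id, topic_arn):
    """Keyword arguments for CloudWatch put_metric_alarm, mirroring the CLI call."""
    return {
        "AlarmName": f"hana-process-failed-{instance_id}",
        "Namespace": "HANA/Health",
        "MetricName": "HANARunning",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Period": 10,                 # high-resolution alarm period in seconds
        "EvaluationPeriods": 3,       # 3 x 10 s = 30 s of consecutive zeros
        "Statistic": "Maximum",
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "ignore", # keep the current state when no data arrives
        "AlarmActions": [topic_arn],
    }

params = hana_alarm_params(
    "i-0xxxxxxxxxxxx",
    "arn:aws:sns:eu-west-1:123456789012:hana-failover",
)
print(params["AlarmName"])
# With credentials configured, the alarm could then be created with:
#   boto3.client("cloudwatch", region_name="eu-west-1").put_metric_alarm(**params)
```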

Tested Failure Scenarios

All scenarios below were tested live, with both nodes taking turns as primary to confirm fully bidirectional operation.

Scenario 1: Killing the OS (EC2 instance stop)

The first test is stopping the EC2 instance itself — equivalent to a hard power-off or hypervisor failure.

aws ec2 stop-instances --instance-ids i-0xxxxxxxxxxxx

EventBridge fires the moment the instance transitions to stopping. The Lambda receives the event within one to two seconds, checks Route 53 to confirm the stopping instance is the current primary, and SSHes to the surviving secondary to run hdbnsutil -sr_takeover. Route 53 is updated to point at the new primary’s private IP.

When the stopped instance comes back up, EventBridge fires a running event. The Lambda detects that its private IP no longer matches the Route 53 record and automatically re-registers it as the new HSR secondary — no manual intervention required.

This was tested in both directions. After SITE1 was stopped and SITE2 took over, SITE2 was then stopped to confirm SITE1 could take over in the reverse direction.

Detection time: ~2 seconds

Scenario 2: Killing the HANA process (HDB kill-9)

The more insidious failure — HANA dies while the OS stays up. EC2 status checks stay green, EventBridge sees nothing. Completely invisible to AWS infrastructure monitoring.

sudo -u hdbadm HDB kill-9

The process monitor detects hdbnameserver is gone within the next 10-second polling cycle. It pushes HANARunning = 0 to CloudWatch. After three consecutive zero readings (30 seconds), the CloudWatch alarm fires and delivers the notification to the Lambda via SNS.

The screenshot below shows this in action. The left alarm (hana-process-failed-i-0ebf62666d…) has turned red — HANA is down on SITE1. The right alarm (SITE2) remains green, confirming the secondary is healthy and ready for takeover.

The Lambda receives the SNS notification, confirms the failing node is the current primary via Route 53, and runs the takeover on SITE2. Once the takeover and re-registration complete, both alarms return to OK.

The metric graphs show the characteristic signature: HANARunning drops to 0 during the failure, then returns to 1 once HANA is restarted on the re-registered secondary.

Detection time: ~30 seconds

Scenario | Detection method | Detection time | Result
aws ec2 stop-instances | EventBridge | ~2 seconds | Takeover + re-registration ✓
Instance hardware failure | EventBridge | ~2 seconds | Takeover + re-registration ✓
HDB kill-9 (OS stays up) | CloudWatch alarm | ~30 seconds | Takeover + re-registration ✓
Secondary HANA crash | CloudWatch alarm | ~30 seconds | Automatic HDB start ✓

What This Replaces

Pacemaker component | AWS equivalent
Cluster daemon (corosync/pacemaker) | EventBridge + Lambda
STONITH / fencing | Not required: Lambda only acts on the surviving node
Resource agent (SAPHana) | Lambda handle_failover / handle_reregistration
Cluster VIP | Route 53 private zone (TTL 30s)
Cluster logs (crm_mon) | CloudWatch Logs (/aws/lambda/hana-failover)

No cluster software on the HANA nodes. The nodes have no knowledge of each other beyond HSR itself.

Advantages and Limitations

Advantages

- Split-brain is not possible in this design: the Lambda only ever acts on the surviving node, never both simultaneously. That’s one less thing to worry about compared to Pacemaker.
- All logic in one place: single Lambda, easy to read and modify
- Cost: no additional software licences
- Bidirectional without reconfiguration

Limitations

- SSH dependency: The Lambda reaches the surviving node over public IP via SSH. If SSH is unreachable, the takeover won’t complete.
- Lambda timeout / re-registration: Set to 600 seconds. Re-registration including HANA startup can take several minutes; this gives us a bit of headroom.
- No SAP support: This is not a supported HA configuration. On AWS and Azure, SAP-certified HA solutions include Pacemaker with the SAPHana resource agent (on SLES or RHEL), SIOS LifeKeeper, and HPE Serviceguard, all of which are proven, supported options. They come at a significant cost in terms of licensing, implementation, and ongoing support, but that’s the price of a supported production configuration.
- Coverage: This solution covers two failure scenarios, EC2 instance failure and HANA process crash. In the real world there are many more potential causes of downtime: network partitions, storage failures, AZ outages, OS-level issues that don’t kill the instance, and HANA errors that don’t manifest as a dead process. The process monitor only checks hdbnameserver. A more robust implementation would also monitor hdbindexserver, hdbpreprocessor, and hdbxsengine, as it’s entirely possible for the indexserver to crash while the nameserver stays up, leaving HANA effectively unusable but the health metric still showing green. These scenarios are not covered here.
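
The process-coverage gap described above could be narrowed by requiring several HANA processes rather than just the nameserver. The sketch below shows only the decision logic, under the assumption that something else (pgrep, HDB info, or similar) has already produced the list of running process names. hana_running_value and REQUIRED_PROCESSES are hypothetical names, and which processes to require is a per-system choice.

```python
# Assumption: the processes that must all be present for HANA to count as healthy
REQUIRED_PROCESSES = ("hdbnameserver", "hdbindexserver")

def hana_running_value(running_process_names, required=REQUIRED_PROCESSES):
    """Return 1 only if every required HANA process is present, else 0."""
    running = set(running_process_names)
    return 1 if all(name in running for name in required) else 0

print(hana_running_value(["hdbdaemon", "hdbnameserver", "hdbindexserver"]))
print(hana_running_value(["hdbdaemon", "hdbnameserver"]))  # indexserver crashed
```

The returned value would be pushed as the HANARunning metric in place of the single pgrep result, so the alarm and Lambda need no changes.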

Summary

EventBridge detects EC2 instance failures within a couple of seconds, a custom CloudWatch metric covers HANA process failures that EC2 monitoring can’t see, and a single Lambda function orchestrates the whole thing. All the cluster logic is in one place, none of it is on the nodes, and it works bidirectionally without reconfiguration.

It’s an interesting exploration of what’s possible with AWS-native services. I don’t expect it to replace Pacemaker in a production SAP landscape anytime soon, but it was a worthwhile exercise.

Appendix – Deployment Steps

1. Deploy infrastructure

cd terraform
terraform apply

Note: only run this once. After initial deployment, never run `terraform apply` again — use AWS CLI for any updates.

2. Generate inventory

./generate_inventory.sh

3. Install HANA on both nodes

cd ansible
ansible-playbook -i inventory.ini hana-install.yml

4. Configure HSR

ansible-playbook -i inventory.ini hsr-setup.yml

5. Build and deploy the Lambda package

cd ..
./build.sh
cd lambda/package && zip -r ../lambda_failover.zip .
aws lambda update-function-code --function-name hana-failover --zip-file fileb://../lambda_failover.zip --region us-east-2

6. Deploy the HANA process monitor to both nodes

cd ../../ansible
ansible-playbook -i inventory.ini hana-monitor.yml

7. Create the CloudWatch alarms

for INSTANCE in <primary-id> <secondary-id>; do
  aws cloudwatch put-metric-alarm \
    --alarm-name "hana-process-failed-${INSTANCE}" \
    --namespace HANA/Health \
    --metric-name HANARunning \
    --dimensions Name=InstanceId,Value="${INSTANCE}" \
    --period 10 --evaluation-periods 3 \
    --statistic Maximum --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data ignore \
    --alarm-actions arn:aws:sns:us-east-2:<account-id>:hana-failover \
    --region us-east-2
done

 

 

 

​ IntroductionI don’t like Pacemaker clusters.They work until they don’t and then you usually lose everything. They are very difficult to get right. On nearly all the HA projects I’ve worked on over the past 15 years, they’ve either been taken out completely in favour of manual failover using hdbnsutil -sr_takeover or replaced with a more robust solution like SIOS LifeKeeper or HPE Serviceguard. Both are supported on AWS and Azure and are arguably superior to Pacemaker, but they come at a significant commercial cost.So I built something different. No cluster software on the instances at all. The HA logic lives entirely in AWS-native services — EventBridge, Lambda, CloudWatch, and Route 53 — and delivers the same outcome: automated failover, HSR takeover, and automatic re-registration of the failed node as the new secondary when it recovers.Important caveat: this is not a supported SAP HA configuration. On AWS and Azure, SAP-certified HA solutions include Pacemaker with the SAPHana resource agent, SIOS LifeKeeper, and HPE Serviceguard. If you’re building a production landscape, stop here and go and implement one of those. If you’re interested in what’s possible with AWS-native services, read on.The solution handles two failure scenarios:OS/instance failure — detected in seconds via EventBridge EC2 state-change notificationsHANA process crash — detected within 30 seconds via a custom CloudWatch metric pushed by a lightweight monitor running on each nodeBoth trigger the same automated response: HSR takeover on the surviving node, Route 53 DNS update, re-registration of the failed node as secondary when it comes back.Full repo: https://github.com/neilaspin/hana-ha-awsSee Appendix — Deployment Steps for the Ansible, Terraform, shell and script commands.Architecture┌─────────────────────────────────────────────────────────┐
│ AWS VPC │
│ │
│ ┌──────────────┐ HSR (syncmem) ┌──────────────┐ │
│ │ HANA SITE1 │◄───────────────────►│ HANA SITE2 │ │
│ │ (primary) │ │ (secondary) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ └──────────────────────────────────┬─┘ │
│ CloudWatch │ │
│ Custom Metrics │ │
└────────────────────────────────────────────┼────────────┘

┌────────────────────────┼──────────────┐
│ │ │
EventBridge CloudWatch Route 53
(EC2 state change) (HANA process alarm) (private zone)
│ │ │
└────────────┬───────────┘ │
│ │
Lambda Function │
(hana-failover) │
│ │
└──────────────────────────┘
updates DNS on failoverComponent RoleEC2 (r5.xlarge, SLES for SAP)HANA nodes — one per AZHANA System Replication (HSR)syncmem / logreplay modeAmazon EventBridgeDetects EC2 instance state changes in secondsAmazon CloudWatchCustom metric from HANA process monitor (10s interval)AWS Lambda (Python 3.12)Orchestrates takeover and re-registration via SSHRoute 53 private hosted zoneStable DNS endpoint for HANA clientsAWS Secrets ManagerStores SSH private key for LambdaHow It WorksFailure DetectionTwo complementary detection paths run in parallel. You need both — they catch different things.Path 1 — EC2 instance failure (seconds)EventBridge fires the moment an EC2 instance transitions to stopping, stopped, or terminated. OS panic, hard stop, hypervisor failure — all covered. No polling. The event arrives at the Lambda within one to two seconds of the state change.Path 2 — HANA process crash (≤30 seconds)I simulated losing HANA by issuing HDB kill-9 on a running instance, but this is completely invisible to EC2 monitoring — the OS stays up, the status checks stay green, HANA is dead and nobody knows. A lightweight shell script runs as a systemd service on each node, checking every 10 seconds whether hdbnameserver is alive and pushing a custom CloudWatch metric (HANA/Health :: HANARunning) — 1 for healthy, 0 for down. Three consecutive zeros triggers a CloudWatch alarm, which fires via SNS to the Lambda.# /usr/local/bin/hana_monitor.sh — runs as systemd service
#!/bin/bash
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

while true; do
if pgrep -f hdbnameserver > /dev/null 2>&1; then
VALUE=1
else
VALUE=0
fi

aws cloudwatch put-metric-data
–namespace HANA/Health
–metric-name HANARunning
–dimensions InstanceId=”$INSTANCE_ID”
–value “$VALUE”
–storage-resolution 1
–region “$REGION”

sleep 10
done# /etc/systemd/system/hana-monitor.service
[Unit]
Description=HANA Process Monitor
After=network.target

[Service]
ExecStart=/usr/local/bin/hana_monitor.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.targetFailoverWhen the Lambda fires — from either path — it:Checks Route 53 to confirm the failing instance is the current primarySSHes to the surviving node using the key from Secrets ManagerRuns hdbnsutil -sr_takeoverUpdates the Route 53 A record to point to the new primary’s private IPImmediately attempts re-registration of the failed node as the new secondaryRoute 53 TTL is 30 seconds. HANA clients reconnect to the new primary as soon as their connection drops and DNS resolves.Re-registrationAfter the takeover, the Lambda SSH’s into the former primary and:Stops HANA (or confirms it is already stopped)Waits for all HANA processes to exit cleanlyRuns hdbnsutil -sr_register pointing at the new primaryStarts HANA — it comes up as secondary and begins log replayIf re-registration fails because the instance isn’t reachable yet, the Lambda logs a warning and retries on the next EC2 running state-change event.Role AwarenessThe Lambda doesn’t hardcode which node is primary. It queries Route 53 on every invocation — whichever node’s private IP matches the current record is the primary. This means it works correctly across multiple failovers in either direction without any reconfiguration.The Lambda Functionimport boto3
import json
import os
import stat
import tempfile
import time
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

r53 = boto3.client(‘route53’)
ec2 = boto3.client(‘ec2’)
secrets = boto3.client(‘secretsmanager’)

PRIMARY_INSTANCE_ID = os.environ[‘PRIMARY_INSTANCE_ID’]
SECONDARY_INSTANCE_ID = os.environ[‘SECONDARY_INSTANCE_ID’]
HOSTED_ZONE_ID = os.environ[‘HOSTED_ZONE_ID’]
HANA_SID = os.environ.get(‘HANA_SID’, ‘HDB’)
HANA_INSTANCE = os.environ.get(‘HANA_INSTANCE’, ’00’)
HANA_HOSTNAME = os.environ[‘HANA_HOSTNAME’]
SSH_KEY_SECRET_ARN = os.environ[‘SSH_KEY_SECRET_ARN’]
SSH_USER = os.environ.get(‘SSH_USER’, ‘ec2-user’)

def get_ssh_key_file():
secret = secrets.get_secret_value(SecretId=SSH_KEY_SECRET_ARN)
key_material = secret.get(‘SecretString’) or secret[‘SecretBinary’].decode()
f = tempfile.NamedTemporaryFile(mode=’w’, suffix=’.pem’, delete=False)
f.write(key_material)
f.close()
os.chmod(f.name, stat.S_IRUSR)
return f.name

def get_public_ip(instance_id):
r = ec2.describe_instances(InstanceIds=[instance_id])
return r[‘Reservations’][0][‘Instances’][0][‘PublicIpAddress’]

def get_private_ip(instance_id):
r = ec2.describe_instances(InstanceIds=[instance_id])
return r[‘Reservations’][0][‘Instances’][0][‘PrivateIpAddress’]

def ssh_run(host, key_file, command, timeout=300):
import paramiko
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname=host, username=SSH_USER, key_filename=key_file, timeout=30)
_, stdout, stderr = client.exec_command(command, timeout=timeout)
rc = stdout.channel.recv_exit_status()
out = stdout.read().decode()
err = stderr.read().decode()
client.close()
return rc, out, err

def update_route53(new_ip):
r53.change_resource_record_sets(
HostedZoneId=HOSTED_ZONE_ID,
ChangeBatch={
‘Changes’: [{
‘Action’: ‘UPSERT’,
‘ResourceRecordSet’: {
‘Name’: HANA_HOSTNAME,
‘Type’: ‘A’,
‘TTL’: 30,
‘ResourceRecords’: [{‘Value’: new_ip}],
}
}]
}
)
logger.info(f”Route 53 updated: {HANA_HOSTNAME} -> {new_ip}”)

def get_r53_ip():
response = r53.list_resource_record_sets(
HostedZoneId=HOSTED_ZONE_ID,
StartRecordName=HANA_HOSTNAME,
StartRecordType=’A’,
MaxItems=’1′,
)
for rrset in response[‘ResourceRecordSets’]:
if rrset[‘Name’].rstrip(‘.’) == HANA_HOSTNAME.rstrip(‘.’):
return rrset[‘ResourceRecords’][0][‘Value’]
return None

def handle_failover(failing_instance_id):
target_id = SECONDARY_INSTANCE_ID if failing_instance_id == PRIMARY_INSTANCE_ID
else PRIMARY_INSTANCE_ID
logger.info(f”Initiating HSR takeover: {failing_instance_id} down, target={target_id}”)
key_file = get_ssh_key_file()
target_ip = get_public_ip(target_id)

rc, out, err = ssh_run(
target_ip, key_file,
f”sudo su – {HANA_SID.lower()}adm -c ‘hdbnsutil -sr_takeover'”,
)
logger.info(f”Takeover stdout: {out}”)
if err:
logger.warning(f”Takeover stderr: {err}”)
if rc != 0:
raise RuntimeError(f”sr_takeover failed (rc={rc}): {err}”)

new_primary_ip = get_private_ip(target_id)
update_route53(new_primary_ip)
logger.info(f”Failover complete — new primary: {target_id} ({new_primary_ip})”)

try:
handle_reregistration(failing_instance_id)
except Exception as e:
logger.warning(f”Re-registration of {failing_instance_id} failed: {e} — will retry on next boot”)

def handle_secondary_restart(instance_id):
logger.info(f”Restarting HANA on secondary {instance_id}”)
key_file = get_ssh_key_file()
ip = get_public_ip(instance_id)
rc, out, err = ssh_run(ip, key_file,
f”sudo su – {HANA_SID.lower()}adm -c ‘HDB start'”, timeout=600)
if rc != 0:
raise RuntimeError(f”HDB start failed on secondary (rc={rc}): {err}”)
logger.info(f”HANA restarted on secondary {instance_id}”)

def handle_reregistration(returning_id):
    logger.info(f"Re-registering {returning_id} as HSR secondary")

    # The node that is not returning is the current primary.
    current_primary_id = (
        SECONDARY_INSTANCE_ID if returning_id == PRIMARY_INSTANCE_ID
        else PRIMARY_INSTANCE_ID
    )
    site_name = 'SITE1' if returning_id == PRIMARY_INSTANCE_ID else 'SITE2'
    primary_private_ip = get_private_ip(current_primary_id)
    # Build the EC2-style private hostname (ip-a-b-c-d) from the private IP.
    primary_hostname = "ip-" + primary_private_ip.replace('.', '-')

    key_file = get_ssh_key_file()
    returning_ip = None

    # Wait up to 3 minutes (12 x 15 s) for SSH to come up on the returning node.
    for attempt in range(12):
        try:
            returning_ip = get_public_ip(returning_id)
            rc, _, _ = ssh_run(returning_ip, key_file, "echo ready", timeout=15)
            if rc == 0:
                break
        except Exception as e:
            logger.info(f"SSH not ready on {returning_id} (attempt {attempt+1}/12): {e}")
        time.sleep(15)
    else:
        # Fail explicitly instead of carrying on with an unreachable node.
        raise RuntimeError(f"SSH not reachable on {returning_id} after 12 attempts")

    # HANA must be stopped before the node can be registered as secondary.
    ssh_run(returning_ip, key_file,
            f"sudo su - {HANA_SID.lower()}adm -c 'HDB stop' 2>&1 || true")

    # Poll until hdbdaemon disappears from 'HDB info' (40 x 15 s = 10 minutes).
    for _ in range(40):
        rc, out, _ = ssh_run(returning_ip, key_file,
                             f"sudo su - {HANA_SID.lower()}adm -c 'HDB info' 2>&1 || true", timeout=30)
        if 'hdbdaemon' not in out:
            break
        time.sleep(15)
    else:
        raise RuntimeError("HANA did not stop within 10 minutes")

    rc, out, err = ssh_run(
        returning_ip, key_file,
        f'sudo su - {HANA_SID.lower()}adm -c "'
        f'hdbnsutil -sr_register'
        f' --name={site_name}'
        f' --remoteHost={primary_hostname}'
        f' --remoteInstance={HANA_INSTANCE}'
        f' --replicationMode=syncmem'
        f' --operationMode=logreplay"',
    )
    if rc != 0:
        raise RuntimeError(f"sr_register failed (rc={rc}): {err}")

    ssh_run(returning_ip, key_file,
            f"sudo su - {HANA_SID.lower()}adm -c 'HDB start'",
            timeout=600)

    logger.info(f"Re-registration complete for {returning_id}")

def handler(event, context):
    logger.info(f"Event: {json.dumps(event)}")

    # Path 1: EC2 state-change notification delivered by EventBridge.
    if event.get('source') == 'aws.ec2':
        detail = event.get('detail', {})
        state = detail.get('state')
        instance_id = detail.get('instance-id')

        if (state in ('stopped', 'terminated', 'stopping')
                and instance_id in (PRIMARY_INSTANCE_ID, SECONDARY_INSTANCE_ID)):
            # The Route 53 record decides which node currently counts as primary.
            if get_r53_ip() == get_private_ip(instance_id):
                handle_failover(instance_id)
            else:
                logger.info(f"{instance_id} stopped but is not current primary, no action")

        elif (state == 'running'
                and instance_id in (PRIMARY_INSTANCE_ID, SECONDARY_INSTANCE_ID)):
            if get_r53_ip() != get_private_ip(instance_id):
                handle_reregistration(instance_id)
            else:
                logger.info(f"{instance_id} is already current primary, no action")
        return

    # Path 2: CloudWatch alarm delivered via SNS.
    if 'Records' in event:
        for record in event['Records']:
            if record.get('EventSource') == 'aws:sns':
                message = json.loads(record['Sns']['Message'])
                if message.get('NewStateValue') == 'ALARM':
                    dims = message.get('Trigger', {}).get('Dimensions', [])
                    instance_id = next(
                        (d['value'] for d in dims if d['name'] == 'InstanceId'), None)
                    if instance_id and get_r53_ip() == get_private_ip(instance_id):
                        handle_failover(instance_id)
                    elif instance_id:
                        logger.info(f"HANA alarm for {instance_id} - secondary, restarting HANA")
                        handle_secondary_restart(instance_id)
                    else:
                        logger.warning("HANA alarm with no InstanceId dimension")
        return

    logger.warning("Unrecognized event format")

Building the Lambda Package

Paramiko’s compiled dependencies need to be built for Linux x86_64, not for your Mac or Windows build machine. Use pip’s --platform flag to fetch the right wheels:

#!/bin/bash
set -e
PACKAGE_DIR="./lambda/package"
rm -rf "$PACKAGE_DIR" && mkdir -p "$PACKAGE_DIR"
cp lambda/failover.py "$PACKAGE_DIR/"
pip install paramiko \
  --platform manylinux2014_x86_64 \
  --python-version 3.12 \
  --only-binary=:all: \
  --target "$PACKAGE_DIR/"
cd "$PACKAGE_DIR" && zip -r ../lambda_failover.zip .

Infrastructure (Terraform)

Key excerpts only; the full Terraform is in the companion repository.

EventBridge watches both nodes for instance state changes:

resource "aws_cloudwatch_event_rule" "primary_down" {
  name = "hana-primary-down"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance State-change Notification"]
    detail = {
      state       = ["stopped", "terminated", "stopping"]
      instance-id = [aws_instance.hana_primary.id, aws_instance.hana_secondary.id]
    }
  })
}

resource "aws_cloudwatch_event_rule" "instance_running" {
  name = "hana-instance-running"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance State-change Notification"]
    detail = {
      state       = ["running"]
      instance-id = [aws_instance.hana_primary.id, aws_instance.hana_secondary.id]
    }
  })
}

Route 53 private hosted zone:

resource "aws_route53_zone" "hana" {
  name = var.private_zone_name # e.g. hana.internal
  vpc {
    vpc_id = aws_vpc.hana.id
  }
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.hana.zone_id
  name    = "hana-primary.${var.private_zone_name}"
  type    = "A"
  ttl     = 30
  records = [aws_instance.hana_primary.private_ip]
}

CloudWatch alarms, one per node, with 10-second high-resolution periods:

for INSTANCE in i-0xxxxxxxxxxxx i-0yyyyyyyyyyyy; do
  aws cloudwatch put-metric-alarm \
    --alarm-name "hana-process-failed-${INSTANCE}" \
    --namespace HANA/Health \
    --metric-name HANARunning \
    --dimensions Name=InstanceId,Value="${INSTANCE}" \
    --period 10 \
    --evaluation-periods 3 \
    --statistic Maximum \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data ignore \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:hana-failover \
    --region eu-west-1
done

Note: --treat-missing-data ignore matters here. Set it to breaching and you’ll get false alarms in the window between alarm creation and the first metric arriving.

Tested Failure Scenarios

All scenarios below were tested live, with both nodes taking turns as primary to confirm fully bidirectional operation.

Scenario 1: Killing the OS (EC2 instance stop)

The first test stops the EC2 instance itself, which stands in for a hard power-off or hypervisor failure.

aws ec2 stop-instances --instance-ids i-0xxxxxxxxxxxx

EventBridge fires the moment the instance transitions to stopping. The Lambda receives the event within one to two seconds, checks Route 53 to confirm the stopping instance is the current primary, and SSHes to the surviving secondary to run hdbnsutil -sr_takeover. Route 53 is then updated to point at the new primary’s private IP.

When the stopped instance comes back up, EventBridge fires a running event. The Lambda detects that the instance’s private IP no longer matches the Route 53 record and automatically re-registers it as the new HSR secondary, with no manual intervention required.

This was tested in both directions. After SITE1 was stopped and SITE2 took over, SITE2 was then stopped to confirm SITE1 could take over in the reverse direction.

Detection time: ~2 seconds

Scenario 2: Killing the HANA process (HDB kill-9)

This is the more insidious failure: HANA dies while the OS stays up. EC2 status checks stay green and EventBridge sees nothing, so the failure is completely invisible to AWS infrastructure monitoring.

sudo -u hdbadm HDB kill-9

The process monitor detects that hdbnameserver is gone within the next 10-second polling cycle and pushes HANARunning = 0 to CloudWatch. After three consecutive zero readings (30 seconds), the CloudWatch alarm fires and delivers the notification to the Lambda via SNS.

The screenshot below shows this in action. The left alarm (hana-process-failed-i-0ebf62666d…) has turned red: HANA is down on SITE1.
The right alarm (SITE2) remains green, confirming the secondary is healthy and ready for takeover.

The Lambda receives the SNS notification, confirms the failing node is the current primary via Route 53, and runs the takeover on SITE2. Once the takeover and re-registration complete, both alarms return to OK.

The metric graphs show the characteristic signature: HANARunning drops to 0 during the failure, then returns to 1 once HANA is restarted on the re-registered secondary.

Detection time: ~30 seconds

| Scenario | Detection method | Detection time | Result |
|---|---|---|---|
| aws ec2 stop-instances | EventBridge | ~2 seconds | Takeover + re-registration ✓ |
| Instance hardware failure | EventBridge | ~2 seconds | Takeover + re-registration ✓ |
| HDB kill-9 (OS stays up) | CloudWatch alarm | ~30 seconds | Takeover + re-registration ✓ |
| Secondary HANA crash | CloudWatch alarm | ~30 seconds | Automatic HDB start ✓ |

What This Replaces

| Pacemaker component | AWS equivalent |
|---|---|
| Cluster daemon (corosync/pacemaker) | EventBridge + Lambda |
| STONITH / fencing | Not required: Lambda only acts on the surviving node |
| Resource agent (SAPHana) | Lambda handle_failover / handle_reregistration |
| Cluster VIP | Route 53 private zone (TTL 30s) |
| Cluster logs (crm_mon) | CloudWatch Logs (/aws/lambda/hana-failover) |

No cluster software on the HANA nodes. The nodes have no knowledge of each other beyond HSR itself.

Advantages and Limitations

Advantages

- Split-brain is not possible in this design, because the Lambda only ever acts on the surviving node, never both simultaneously. That’s one less thing to worry about compared to Pacemaker.
- All logic in one place: a single Lambda, easy to read and modify
- Cost: no additional software licences
- Bidirectional without reconfiguration

Limitations

- SSH dependency: The Lambda reaches the surviving node over its public IP via SSH. If SSH is unreachable, the takeover won’t complete.
- Lambda timeout / re-registration: The timeout is set to 600 seconds. Re-registration including HANA startup can take several minutes; this gives us a bit of headroom.
- No SAP support: This is not a supported HA configuration.
On AWS and Azure, SAP-certified HA solutions include Pacemaker with the SAPHana resource agent (on SLES or RHEL), SIOS LifeKeeper, and HPE Serviceguard, all of which are proven, supported options. They come at a significant cost in terms of licensing, implementation, and ongoing support, but that’s the price of a supported production configuration.
- Coverage: This solution covers two failure scenarios: EC2 instance failure and HANA process crash. In the real world there are many more potential causes of downtime: network partitions, storage failures, AZ outages, OS-level issues that don’t kill the instance, and HANA errors that don’t manifest as a dead process. The process monitor only checks hdbnameserver. A more robust implementation would also monitor hdbindexserver, hdbpreprocessor, and hdbxsengine, because the indexserver can crash while the nameserver stays up, leaving HANA effectively unusable while the health metric still shows green. These scenarios are not covered here.

Summary

EventBridge detects EC2 instance failures within a couple of seconds, a custom CloudWatch metric covers HANA process failures that EC2 monitoring can’t see, and a single Lambda function orchestrates the whole thing. All the cluster logic is in one place, none of it is on the nodes, and it works bidirectionally without reconfiguration.

It’s an interesting exploration of what’s possible with AWS-native services. I don’t expect it to replace Pacemaker in a production SAP landscape anytime soon, but it was a worthwhile exercise.

Appendix – Deployment Steps

1. Deploy infrastructure

cd terraform
terraform apply

Note: only run this once. After initial deployment, never run `terraform apply` again — use AWS CLI for any updates.

2. Generate inventory

./generate_inventory.sh

3. Install HANA on both nodes

cd ansible
ansible-playbook -i inventory.ini hana-install.yml

4. Configure HSR

ansible-playbook -i inventory.ini hsr-setup.yml

5. Build and deploy the Lambda package

cd ..
./build.sh
cd lambda/package && zip -r ../lambda_failover.zip .
aws lambda update-function-code --function-name hana-failover --zip-file fileb://../lambda_failover.zip --region us-east-2

6. Deploy the HANA process monitor to both nodes

cd ansible
ansible-playbook -i inventory.ini hana-monitor.yml

7. Create the CloudWatch alarms

for INSTANCE in <primary-id> <secondary-id>; do
  aws cloudwatch put-metric-alarm \
    --alarm-name "hana-process-failed-${INSTANCE}" \
    --namespace HANA/Health \
    --metric-name HANARunning \
    --dimensions Name=InstanceId,Value="${INSTANCE}" \
    --period 10 --evaluation-periods 3 \
    --statistic Maximum --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data ignore \
    --alarm-actions arn:aws:sns:us-east-2:<account-id>:hana-failover \
    --region us-east-2
done
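
If you would rather create the alarms from Python than from the shell, the loop in step 7 could look roughly like the sketch below. It assumes boto3 is installed and AWS credentials are configured. The function names build_alarm_kwargs and create_alarms are hypothetical and not taken from the repo, and the region default simply mirrors the CLI example. The parameter values are the same ones passed to put-metric-alarm in step 7.

```python
"""Sketch: create the per-node HANARunning alarms from Python instead of the CLI."""


def build_alarm_kwargs(instance_id, topic_arn):
    """Return keyword arguments for CloudWatch put_metric_alarm.

    The values mirror the CLI loop in step 7. The function name is
    hypothetical and exists only for this example.
    """
    return {
        "AlarmName": f"hana-process-failed-{instance_id}",
        "Namespace": "HANA/Health",
        "MetricName": "HANARunning",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Period": 10,              # 10-second high-resolution period
        "EvaluationPeriods": 3,    # 3 x 10 s, so roughly 30 s until ALARM
        "Statistic": "Maximum",
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "ignore",
        "AlarmActions": [topic_arn],
    }


def create_alarms(instance_ids, topic_arn, region="us-east-2"):
    """Create one alarm per instance. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for instance_id in instance_ids:
        cloudwatch.put_metric_alarm(**build_alarm_kwargs(instance_id, topic_arn))
```

Usage would be a single call such as create_alarms(["<primary-id>", "<secondary-id>"], "arn:aws:sns:us-east-2:<account-id>:hana-failover"). Keeping the argument builder separate from the API call makes it easy to check the alarm settings without touching AWS.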
