Cisco 4100 Clustering. Part 4: Rebuilding a failed unit

One of the Cisco Firepower Threat Defense (FTD) units in an HA cluster experienced multiple failures caused by a hard drive malfunction, which required rebuilding the unit from scratch. Here are some of the pain points I had to work through to get the new appliance back online and clustered.

Formatting the drive failed, and rebooting the appliance did not make a difference.

/ssa/slot # show fault
Severity  Code   Last Transition Time     ID      Description
--------  -----  -----------------------  ------  -----------
Cleared   F1545  2018-03-23T10:16:11.785  200066  Slot 1, is not operationally up
Major     F1547  2018-03-23T10:15:25.508  198282  Disk format is failed on slot 1
Major     F1550  2018-03-21T21:50:01.785  198574  Failed to install App Instance ftd on slot 1. Error: Insufficient_Disk_Space

You can also try to reinitialize the blade.

# scope ssa
/ssa # scope slot 1
/ssa/slot # reinitialize
Warning: Reinitializing blade takes a few minutes. All the application data on blade will get lost. Please backup application running config files before commit-buffer.
/ssa/slot* # commit-buffer
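After the reinitialize completes, it is worth confirming that the slot came back to an operable state before redeploying the logical device. A quick check from the same scope (exact output varies by FXOS version, so treat this as a sketch):

# scope ssa
/ssa # show slot 1
/ssa # scope slot 1
/ssa/slot # show fault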

If all else fails, the next step before contacting TAC is to generate a tech-support file.

# connect local-mgmt
(local-mgmt)# show tech-support fprm detail
The detailed tech-support information is located at workspace:///techsupport/xyz_FPRM.tar

Once the file is generated, copy it to an FTP server and attach it to the case to save time.

(local-mgmt)# copy workspace:///techsupport/xyz_FPRM.tar ftp:
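Running copy with a bare "ftp:" target, as above, prompts interactively for the server details; you can also spell out the destination inline. The server IP, username, and path below are placeholders, not values from this deployment:

(local-mgmt)# copy workspace:///techsupport/xyz_FPRM.tar ftp://admin@192.0.2.10/incoming/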

Since the health monitoring alerts clearly indicated an issue with the appliance hard drive, we had it replaced. The new drive required re-configuring the appliance from scratch. Once the initial setup was completed, I ran into a login issue: for some reason the new password worked on the console and over SSH, but GUI login kept failing, so I had to go through the reset procedure described here.

Next, I discovered that the original FTD installation image version is no longer available for download on CCO. An image version mismatch prevents the new unit from joining the existing cluster, so I had to specifically request the image from TAC.

Once all the versions matched, I still could not sync the unit back into the cluster. The unit kept getting kicked out, either because of an application health check failure or because it was stuck in the "SLAVE_BULK_SYNC" state. The tech-support file will indicate the reason for the health check failure. Long story short, we ended up replacing the whole appliance. The replacement unit was again rebuilt from scratch and joined the cluster successfully.

The next step was to join the FTD to the Firepower Management Center (FMC). Keep in mind that even though the FTD is in a cluster, you add it to FMC as a separate managed device using its management IP address.

This is where I ran into another issue. The FTD would not join FMC even though all settings were copied to the new unit following these steps. The fix was to update the FTD manually from the CLI with the "configure manager add <IP>" command. I have seen this happen before on FirePOWER modules, and apparently it is a bug. After that, I attempted the join from FMC again and it was successful.
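The manual registration command takes the FMC address and a registration key, and the key must match what you enter on the FMC side when adding the device. The IP and key below are placeholders:

> configure manager add 192.0.2.50 <registration key>
> show managers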

I thought I was done, but not yet. My cluster was patched to the latest version, and since we did not upgrade the FTD before adding it to the cluster, the upgrade option was not available in FMC. You could wait for the next patch and apply it to the whole cluster to even things out, but otherwise the FTD has to be upgraded via the CLI. I had to copy the upgrade file to the FTD from the FMC and run the upgrade from the CLI.

FTD:/ngfw/var/sf/updates# scp admin@<FMC IP>:/var/sf/updates/<upgrade file> .

Once copied, run the update from the CLI, pointing it at the file in /ngfw/var/sf/updates/.

I have noticed on several occasions that the file may need to be copied to a specific updates location, which may change based on the version, so if this location (/ngfw/var/sf/updates/) does not work, check with TAC.
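One approach that has worked on some FTD releases is starting the upgrade from expert mode with the bundled install script. Treat the script invocation and the package placeholder below as assumptions to verify against your release notes or with TAC:

> expert
admin@FTD:~$ sudo install_update.pl /ngfw/var/sf/updates/<upgrade file>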

Finally, here is a list of commands I found useful while troubleshooting clustering issues.

connect module 1 console
connect ftd

Cluster status
> show cluster info

Enter the diagnostic CLI
> system support diagnostic-cli

Cluster configuration
# sh run cluster

Check for time differences between units
# sh clock

View cluster interface stats
# sh int po48

Cluster health data
# sh cluster info health
# sh cluster info trace

Capture packets to verify cluster communication
# cap ccl interface cluster buffer 33554430 match ip any any
# sh cap ccl
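To analyze the capture off-box in Wireshark, the ASA-style pcap export may work from the diagnostic CLI. The server address below is a placeholder, and you should verify the command is supported on your release:

# copy /pcap capture:ccl tftp://192.0.2.10/ccl.pcap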

# scope ssa
Logical device status
/ssa # show logical-device

# scope eth-uplink
/eth-uplink # scope fabric a
Interface statistics and status
/eth-uplink/fabric # show interface
/eth-uplink/fabric # sh port-channel

