The Reasons Wells Fargo's Latest Data Outage Should Scare You - A LOT!
Friday, July 17, 2020
February 6th, 2019: The bankers, IT staff, CEO, and executive team at Wells Fargo went to sleep with the money, trust, and (somewhat shaken) confidence of their 70 million customers across the globe. The next morning should have been just another Thursday for most of them. But it was going to prove to be anything but.
February 7th, 2019: The day started out fairly normal for about 70 million customers worldwide and quickly turned into one that shall live in infamy for Wells Fargo Bank, based in San Francisco.
At 6:06 AM, the official Wells Fargo Twitter account, @wellsfargo, posted the following tweet:
FIRST ALERT TWEET- MAY BE EXPERIENCING ISSUES
Some Wells Fargo customers tweeted back, annoyed but used to the issues: Wells Fargo had technical difficulties on Jan 1, Jan 2, and Feb 1, all related to outages of its online banking and mobile applications. This seemed like just another one of those outages.
As the morning progressed, the information became a little more ominous and vague, but offered some hope to the customers that this was just a "glitch."
7:05 AM: Wells Fargo tweets that they are working to "restore services as soon as possible."
It appears that they have determined the issue, and all will be well soon.
9:44 AM: A Wells Fargo update mentions systems issues due to a power shutdown following "routine maintenance."
It appears that there was a power- and fire-related incident following "routine maintenance."
Oh good- this isn't that bad then?
FROM 9:44 AM ON FEBRUARY 7, 2019 UNTIL 9:10 AM ON FEBRUARY 8 (ALMOST 24 HOURS), THERE IS NO UPDATE FROM WELLS FARGO'S TWITTER ACCOUNT.
Multiple tweets show that customers are frustrated, confused, and furious about the lack of communication and access to their money, and several tweets state that they are switching their accounts to other banks, including commercial and personal accounts, direct deposit, etc.
February 8th, 2019:
9:44 AM: The following tweet expresses regret and offers an apology, asking users to call if they have concerns, but warns that "hold times are longer than usual."
Note that customers haven't been able to access their money, direct deposit, or run payroll for the past 24 hours or more.
FAST FORWARD TO FEB 9, 2019 AT 8:32 AM:
It appears that an all-clear is sounded after what looks like a full restoration of customer data and critical business functionality.
A FULL 48 OR MORE HOURS OF DOWNTIME FOR 70 MILLION CUSTOMERS.
It doesn't take a lot of experience with IT or disaster recovery, or even basic business sense, to ask the obvious question: how (Hotel-Tango-Foxtrot) could something like this happen to a publicly-traded company with 70 million customers?
We will not know the real reasons for the outage until an investigation is launched (yet again for Wells Fargo) and a few people get fired, possibly including the CEO, CIO, and several key executives. In the meantime, here are some obvious things that went wrong.
Replication of data to a secondary site is a must-have for any critical business function that needs four, five, or six nines of availability.
Clearly documented communication plans, and exercises performed regularly with key leadership and business function leaders, help to maintain the integrity of critical information flow internally, as well as to customers, in the event of a disaster or other business interruption.
"Point of Least Impact" maintenance windows are common sense: unless you are deliberately running "Chaos Monkey" scenarios, maintenance is usually scheduled for non-impactful times of the day or, in the case of 24/7 global customers, at times of least disruption.
Having a history of integrity, continuous improvement, and transparency to your customers usually reaps benefits and is indicative of a healthy, caring company culture.
While the data isn't available for us to analyze, it's clear that Wells Fargo has some serious deficiencies in every single one of these areas.
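To put the "nines" in perspective, here is a rough back-of-the-envelope sketch of how much downtime each availability level allows per year, and where a 48-hour outage would land. The function names are made up for illustration, and the 48-hour figure is simply the approximate downtime from the timeline above, not an official number.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001
    return HOURS_PER_YEAR * 60 * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Best-case yearly availability if this outage were the only downtime."""
    return 1 - outage_hours / HOURS_PER_YEAR


for n in (4, 5, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")

# Approximate outage length taken from the timeline above (an assumption).
print(f"After a 48-hour outage: {availability_after_outage(48):.2%} availability")
```

Four nines leaves a budget of roughly 52.56 minutes a year. A single 48-hour outage caps the year at about 99.45% availability, which is barely two nines.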
It's apparent, based on tweets and the news, that Wells Fargo became aware of the issues via customer complaints, not internal mechanisms, which suggests that their monitoring systems are (allegedly and apparently) lacking basic functionality for early warning and critical alerting for proactive remediation.
What is also clear is that it took upwards of 3 hours for them to determine or call out a root cause (which remains vague and ominous) as a fire and smoke-related incident. Unless the fire suppression systems in the data center were water-based rather than halon or gas-based, with sprinklers damaging servers and network gear, this seems rather implausible.
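As a rough illustration of what "early warning" can mean in practice, here is a minimal sketch of a probe-based monitor that raises an alert after a few consecutive failed health checks. Everything here (the `HealthMonitor` class, the threshold of 3, the simulated probe results) is hypothetical and has nothing to do with Wells Fargo's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthMonitor:
    """Records an alert once `threshold` consecutive probes have failed."""
    probe: Callable[[], bool]   # returns True when the service looks healthy
    threshold: int = 3          # hypothetical value; tune per service
    failures: int = 0
    alerts: List[str] = field(default_factory=list)

    def tick(self) -> None:
        """Run one probe; treat an exception the same as a failed check."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False
        if healthy:
            self.failures = 0
            return
        self.failures += 1
        # Alert exactly once, at the moment the threshold is reached.
        if self.failures == self.threshold:
            self.alerts.append(
                f"ALERT: {self.threshold} consecutive probe failures"
            )


# Simulated probe results: one healthy check, then the service goes dark.
results = iter([True, False, False, False, False])
monitor = HealthMonitor(probe=lambda: next(results), threshold=3)
for _ in range(5):
    monitor.tick()
print(monitor.alerts)
```

In a real deployment the probe would hit an actual endpoint and the alert would page a human, but the idea is the same: the bank should hear about the outage from its own systems before it hears about it on Twitter.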
Enterprise Risk Management
If a "protect human life first" directive was issued at this data center, it would (in the normal world) have happened at around 6:30 AM, been classified as an incident, and triggered their already-in-place, documented, tested, and validated procedure for secondary site cut-over (failover). It didn't.
Some customers were quick to say that this didn't pass the smell test and that something just isn't right here, but that seems to be a pattern with Wells Fargo.
They were down on Jan 1, Jan 2, Jan 7, and then Feb 1, with similar (but contained) issues affecting all their customers, or subsets of them. This tells us that their mission-critical systems were just not providing the stability and performance that one would expect.
Finally, it is sad (and shocking) that an organization the size of Wells Fargo could not communicate with customers for over 24 hours while their business suffered, cash was frozen, cards, payroll, and direct deposit didn't work, and, most of all, confusion poured in over what the actual root cause was.
This was a failure of epic proportions- and while this culture of mediocrity may be part of the DNA at Wells Fargo, it certainly will not be tolerated by its ever-diminishing customers.
It is critical for a business to perform a thorough audit of its systems, procedures, and incident response personnel to know exactly what its Maximum Tolerable Downtime (MTD), Recovery Point Objectives (RPO), and Recovery Time Objectives (RTO) are.
It is also critical that your infrastructure is monitored, data classified and labeled, and incident handling and response procedures are created, tested, failed over, and tested again.
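Once those objectives are written down, checking an incident against them is simple arithmetic. The sketch below is illustrative only: the 4-hour RTO, the 15-minute RPO, and the replication timestamp are hypothetical values, and the outage window uses the first alert tweet (6:06 AM, Feb 7) and the all-clear (8:32 AM, Feb 9) from the timeline above as stand-ins for the real start and end times.

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if service came back within the Recovery Time Objective."""
    return restored - outage_start <= rto


def meets_rpo(last_replicated: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True if the newest replicated data is no older than the Recovery Point Objective."""
    return outage_start - last_replicated <= rpo


# Tweet timestamps from the timeline, used as a proxy for the outage window.
first_alert = datetime(2019, 2, 7, 6, 6)
all_clear = datetime(2019, 2, 9, 8, 32)

# Hypothetical objectives and replication time, for illustration only.
rto = timedelta(hours=4)
rpo = timedelta(minutes=15)
last_replicated = datetime(2019, 2, 7, 6, 0)

print("Downtime:", all_clear - first_alert)
print("RTO met:", meets_rto(first_alert, all_clear, rto))
print("RPO met:", meets_rpo(last_replicated, first_alert, rpo))
```

With these numbers the downtime works out to 50 hours and 26 minutes, so even a generous RTO would have been blown many times over.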
It is a guarantee that this recent incident will result in a loss of customers, goodwill, and trust for Wells Fargo, thanks to its unfortunate pattern of negligence and mismanagement.
At cloudskope, we help companies investigate, audit, understand, and mitigate their risks before they become incidents. Contact us to start your audit and understand your risks.
Not doing so could very well mean you're the next headline that hits CNN.