A little over two weeks ago, the news media started to carry stories about NatWest customers having trouble accessing their accounts. The stories nearly all took the line that a “glitch” in the bank’s computer systems had led to some accounts not being updated correctly. Over the next couple of days it became apparent that the problem was widespread and continuing.
The media eventually realised that the problem was not confined to NatWest but also affected RBS, Ulster Bank and several others. It became apparent that the problem centred on the RBS clearing system, which is housed in Edinburgh.
The news reports mostly changed the name of the bank involved, partly because it allowed them to keep whipping RBS; since Fred the Shred’s day the bank has become a hate figure for the media.
However, they all continued to call the fault a “glitch”. In the IT industry this wasn’t being referred to as a glitch. A glitch is an annoyance – this was a full-scale, industrial-strength, grade A disaster.
This is a long post – grab a coffee.
I make a good living out of working with large companies – mainly banks nowadays – being consulted on various topics across IT. Mainly, I help them test new software and advise them how best to arrange their systems to ensure that all new code is fit for purpose. So the RBS debacle is of particular interest to me. There has been little (reliable) public pronouncement from RBS regarding the detail of the problem but the various leaks that have come out over the past fortnight have allowed us to build up a fairly detailed and plausible explanation of just what did happen. And it does not reflect well on RBS.
Firstly, some explanation of technical terms may be needed. Big companies like banks perform most of their data processing on mainframe computers. These are often very much like the things that used to populate science-fiction films in the 1960s, although without the flashing lights and whirring tape reels. Think of the traffic control computer room in “The Italian Job” and you’ll have some idea of the scale of the thing. Mainframes are very expensive in floor space, electricity, maintenance and manning but they are amazingly powerful and reliable – they just never break down (well, almost never, but that’s another story). A mainframe is to your PC what a Formula 1 racing car or a battle tank is to a 2CV – they all move people around but in rather different fashions. Mainframes were meant to have died out long ago with the advent of smaller, cheaper computers but there are good reasons why they are still in use: we’ll come to these later.
As well as the mainframe(s) there are lots of other systems dealing with the various account types and counterparties that the bank has. Some of these may be large bespoke systems comprising many powerful workstations; others may be sitting on a single PC using software hacked up in someone's kitchen many years ago.
Every night, after the branches are closed, every bank has its processing run. This often starts at about 18:30 and can last all through the night, finishing in the small hours. The run starts from the bank’s financial position at the start of the day and then applies all the changes that have been made since then to arrive at a definitive financial position for that night. This is an enormously complicated process that takes in and sends out many hundreds or thousands of data files or data streams during the nightly catch-up. The various jobs that make up the process have to be performed in a strict order. Many of the jobs depend on others and many of them provide information for subsequent jobs. At the end of a hard night’s processing, the bank should know just where it stands. This processing run happens every Monday to Friday night, apart from bank holidays, when a smaller run may be done just to maintain a few peripheral systems.
Because the sequence of jobs is so large and complex, the task of managing it is handed to another computer program called The Batch (it may not have capital letters in real life but it should have). The Batch does not do any of the important calculations itself, but ensures that the jobs that do are run in the correct order, that all of their dependencies are met before they start, and that they complete successfully. In case of any problems, The Batch notifies the local operators (the ops guys) who will see if they can fix the problem easily. Sometimes it can be a technical problem like a failed or full hard disc, a broken printer or a missing input file. These may be rectified quickly and The Batch restarted. Sometimes the problem is more complicated and the ops will call on second-line support, who have a more detailed knowledge of the system. These guys are the systems administrators (sysadmins for short). If the sysadmin cannot fix it, he will ask the ops to call out the developer responsible for the system. This is quite serious and the production manager may be dragged out of his bed at the same time. This is an emergency situation and no-one goes home until the problem is resolved.
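For the technically minded, here is a minimal sketch of what "run the jobs in the correct order" means. It is not how CA-7 or any real scheduler is written; the job names and the `run_job` callback are invented purely for illustration, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each job maps to the set of jobs that must
# complete successfully before it is allowed to start.
DEPENDENCIES = {
    "load_branch_transactions": set(),
    "load_card_transactions": set(),
    "post_to_accounts": {"load_branch_transactions", "load_card_transactions"},
    "calculate_interest": {"post_to_accounts"},
    "produce_statements": {"post_to_accounts", "calculate_interest"},
}


def run_order(dependencies):
    """Return a job order in which every job follows all of its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def run_batch(dependencies, run_job):
    """Run jobs in dependency order and stop at the first failure.

    run_job is a caller-supplied function taking a job name and returning
    True on success. Returns (completed_jobs, failed_job_or_None).
    """
    completed = []
    for job in run_order(dependencies):
        if not run_job(job):
            # A real scheduler would alert the operators at this point.
            return completed, job
        completed.append(job)
    return completed, None


if __name__ == "__main__":
    # Simulate a run in which the interest calculation fails.
    done, failed = run_batch(DEPENDENCIES, lambda job: job != "calculate_interest")
    print("completed:", done)
    print("failed:", failed)
```

The point of the sketch is that the schedule, meaning the table of jobs and their dependencies, is the valuable part. Lose that table and nothing knows what to run next, or in what order.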
In RBS The Batch is handled by a bit of software from Computer Associates, called CA-7. It’s a big bit of software, well known and very reliable. Everyone (probably) uses it.
It is important to remember that the ops are not supposed to try to fix anything really serious; they are there to fix the simple things and make sure that someone else fixes the difficult stuff. There are always at least two ops on duty at all times. They live in their own secure office, usually called the Control Centre, and some of these offices are quite impressive, full of glowing screens, flashing lights and occasional contented little beeps from monitoring kit. You are not allowed into the Control Centre without the express permission of the ops, and quite right too. The Control Centre and the computer it manages may not be in the same building, town or even continent.
Ops have no discretion. They have procedures to follow. They are employed to tend systems, not to make decisions. They have very responsible positions and look after expensive systems but they are not allowed any sharp tools.
I was a sysadmin doing production support for a bank much bigger than RBS for a while. Every second or third night I would be woken up by the ops. My pager would go off and I would dial in to the system and usually cure the problem fairly quickly. A couple of times I had to ask the ops to rouse a developer or two to fix something a bit more complicated, not a task to be contemplated lightly. Waking the developer, that is, not fixing the problem. Developers are in the job they’re in because they don’t like other people and prefer just to deal with code. They especially don’t like other people at 3am. That's why the sysadmins get the ops to make the call.
Because the processing runs only on nights following bank opening days, there is a whole weekend every week for the installation of new software. This is often done on Saturdays, after Friday’s batch run has completed and been backed up. We usually did ours on Friday evenings before the batch – this gave us a full production run to test the new software, safe in the knowledge that we had a weekend to return to the Thursday night backup and start again. Also, we then didn’t have to pay for a Saturday shift. We could do that because the new software had been tested (often by me) and passed fit for production use so we had a high degree of confidence that there would be no problems.
That’s the background. What happened at RBS? When? And where?
Well, as far as we can work out…
On Wednesday 19th June stories started to appear about problems that NatWest customers were having with their accounts: balances were incorrect, cash machines would not pay out, automatic payments had not been made, deposited cash had not appeared, in fact just about every problem imaginable. Eventually it was realised that the whole of the RBS group and other banks that relied on RBS for processing services were affected. Absolute chaos reigned for a few days until a grip was got and the mess started to clear up. However, even today, 16 days after the first reports, customers of Ulster Bank are still having difficulties accessing their own money.
Because the problem became widely known on the 19th, many people in the industry thought that it was because of a failure in the Tuesday night batch run. The story got out that a software fix had been implemented on the Tuesday and the industry’s opprobrium rained down on RBS’s corporate head for installing new software midweek. In fact, that seems not to have been the case, but the real situation may actually have been much worse.
It seems that the fix was installed during the previous weekend, as is industry practice. We don’t yet know what the fix was supposed to do, although there is some suspicion that it was not unrelated to the mobile app that went live earlier the previous week. Unfortunately it was either insufficiently tested or not tested at all, because problems became apparent on the Monday or Tuesday. At some point on the Tuesday evening it was decided to remove the new software from the batch run. Strangely, this seems to have been attempted while The Batch was actually running and some processing had already taken place. This is a high-risk strategy, not to be undertaken by the faint-hearted or unskilled.
Unfortunately, whoever did it seems to have been creatively unskilled because instead of removing one job from the schedule, he deleted the entire schedule, which would just stop everything dead. I have never administered CA-7 myself but I am told by people who have that this isn’t easy – jobs are removed individually, not in bulk. Something went badly wrong, something that should not have been possible either practically or procedurally.
Mistakes happen everywhere, but well-prepared organisations recover from them by exercising their recovery plans. In this case the recovery should have been by simply restoring the latest backup of the CA-7 schedule. Unfortunately, again, it appears that they did not have a viable backup. This is inexcusable.
This original mistake seems to have been made by an op. This is not a job for an op; it is arguable whether even a sysadmin should have the system privileges to allow it. This should have been at least a developer task. What probably happened next is that the developers were all scrambled to try and recreate the batch schedule from scratch. This would have been a Herculean task, all the more so since RBS have been sacking many of their experienced developers and support staff in Edinburgh and replacing them with less-experienced but cheaper personnel – the last of their Edinburgh-based batch support staff were laid off just three weeks before the disaster. Add to this the fact that we are dealing with a mainframe. An old machine, running old systems, many with badly-written, wrong or just plain missing documentation. And no-one left who had any great knowledge of the system.
It’s amazing that they managed to achieve as much as they did in a week or so. Remember that for every day The Batch doesn’t run another day’s transactions pile up behind it. A normal batch may take up to twelve hours to run. All the extra checking that would have been in place during the catch-up batches could easily have extended the processing time by 50%. I suspect that for many people this was the hardest week’s work they will ever do in their lives and very possibly the best paid.
However, they didn’t get it quite right. All over Britain just now bank statements are landing on door mats that show duplicate transactions and balances that don’t match the sums of the transactions. From this it seems reasonable to deduce that when The Batch was restarted some of the databases were reflecting a point part-way through the Tuesday night processing when some transactions had already been loaded. On restart some jobs did not take cognisance of this fact and loaded them again.
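To show how that sort of double-loading can be guarded against, here is a small sketch of a load job that is safe to re-run after an interruption. It is purely illustrative: I have no knowledge of how the RBS load jobs are actually written, and the transaction ids, account names and the `load_transactions` function are all invented. Amounts are in whole pence to keep the arithmetic exact.

```python
def load_transactions(ledger, applied_ids, transactions):
    """Apply each transaction at most once, keyed on a unique transaction id.

    ledger      -- dict mapping account name to balance in pence
    applied_ids -- set of transaction ids that have already been applied
    transactions -- iterable of (txn_id, account, amount_in_pence) tuples
    """
    for txn_id, account, amount in transactions:
        if txn_id in applied_ids:
            # Already loaded before the interruption, so skip it on restart.
            continue
        ledger[account] = ledger.get(account, 0) + amount
        applied_ids.add(txn_id)
    return ledger


if __name__ == "__main__":
    # Hypothetical input file for one night's run.
    tuesday_file = [
        ("T001", "alice", -2500),   # card payment
        ("T002", "alice", 120000),  # salary
        ("T003", "bob", -4000),     # direct debit
    ]

    ledger = {"alice": 10000, "bob": 5000}
    applied = set()

    # The run is interrupted after the first two transactions...
    load_transactions(ledger, applied, tuesday_file[:2])
    # ...and on restart the whole file is fed in again.
    load_transactions(ledger, applied, tuesday_file)

    print(ledger)
```

Without the check on `applied_ids`, the restart would apply the first two transactions twice, which is exactly the kind of duplicate entry that would then turn up on a statement.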
Anyway, their hard work largely paid off and about a week ago, after two consecutive and unprecedented weekends of branch openings, the bank was able to announce that things were “largely” in order. They were spinning this as a success; as an IT professional I count it as a major failure. Ulster Bank is still not back in full operation, and is not expected to be until some time next week, which leads one to think that the UB systems had not been fully integrated into the RBS structure. It seems that other banks that rely on RBS for processing, or that send payments to or receive payments from RBS, are also still having problems.
The disaster has caused millions of pounds in overtime and other costs for RBS; many millions of pounds’ worth of losses for RBS customers because of missed payments, damage to credit ratings and even an extra weekend in jail for one poor sap; tens of millions of pounds in reputational damage; and probably hundreds of millions of pounds in compensation. Not to mention the wrath of the FSA, who have now been given a good excuse for sticking their corporate nose even further into the internal affairs of private businesses. Mind you, look on the bright side: RBS HR have probably saved the thick end of £2M from their staffing costs over the past couple of years. Assuming, of course, that they didn't have to pay out more than that in consultancy fees to the people who made things work again, the same people who are now probably also on lucrative contracts to document the systems that they had been saying for years needed documenting.
So there we are, the story of UK banking's worst and most public IT disaster. So far. It's nearly all conjecture, of course, but informed conjecture.
Why did it happen at all? That is a whole 'nother story. Come back next week and I'll also explain why I'm switching my remaining RBS accounts to the Cooperative Bank.
Our grateful thanks go to The Register, the planet's pre-eminent hard-working IT website. If you really want to know what is going on in the world of IT go there, all us professionals do.