Even though PACS systems are designed to run in mission-critical environments without interruption, they will still sometimes fail. When a PACS does go down, the PACS administrator faces an immediate hurricane of angry physicians from the ED and Surgery while watching the radiology department grind to a halt. It's not a fun time for anyone involved, and we all try to prevent these episodes as much as possible. Drawing on the collective wisdom of veteran PACS administrators in ClubPACS, we have come up with a list of the most frequent failure modes to watch out for and, if at all possible, prevent.
10. Dongle keys are the Devil:
Dongle keys are hardware license keys that some vendors use to unlock or control usage of their software. These hardware keys usually take the form of a USB device. If a dongle fails, or more likely is removed, the system will lock up and refuse to let users view images. Oftentimes the keys are only good for a fixed period of time, expiring unexpectedly and requiring a new key to be sent in the mail. We feel that PACS systems have enough potential failure modes without vendors introducing new single points of failure that can take out the system. (There are far better ways to ensure your users are abiding by their service agreement.) This technique is a throwback to the 1980s and has absolutely no place in any modern IT system.
9. Maintenance crews unplugging things:
This happens more often than you might expect. Maintenance crews may come into your datacenter to clean floors or perform other maintenance and have been known to bump power switches or kick out cables. This happens most frequently when your 'data center' also doubles as a maintenance closet. We have learned that being obsessive-compulsive about cabling is not a bad thing. Keeping your cables neatly tucked in and attached to cable-management arms lets you pull out a server for maintenance without pulling out its cables. Label your cables accurately with hostname and port (including the ports on the switch at the other end) so that you know where to plug things back in if they do come apart.
8. Stuffing your servers in a broom closet.
Servers need air to breathe and can generate more heat than you think. Without air circulation you will be amazed and dismayed at how short the lifetime of your servers can be. Usually the first things to go in an overheated space are your hard drives. If you lose more than a couple of drives over a short period, you should look suspiciously at environmental factors inducing those failures. Data centers also provide very nice amenities like uninterruptible power, constant humidity, controlled access, and halon fire suppression. A water sprinkler can ruin your day in a hurry.
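If you want a quick way to spot heat stress before it kills drives, a small script can do the spot-checking for you. The sketch below is a minimal example assuming smartmontools is installed and the drives report the standard Temperature_Celsius SMART attribute; the device names and warning threshold are placeholders, and some drives report temperature under a different attribute name.

```python
#!/usr/bin/env python3
"""Minimal sketch: spot-check drive temperatures, since heat-stressed drives are
usually the first casualties of a bad server room. Assumes smartmontools is
installed; device names and the threshold are placeholders. Some drives report
Airflow_Temperature_Cel instead of Temperature_Celsius -- adjust as needed."""

import subprocess

DRIVES = ["/dev/sda", "/dev/sdb"]   # placeholder device names
WARN_C = 50                         # placeholder threshold in degrees Celsius

for dev in DRIVES:
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Temperature_Celsius" in line:
            # Column 10 of the SMART attribute table is the raw value (current temp).
            temp = int(line.split()[9])
            flag = "  <-- too hot" if temp >= WARN_C else ""
            print(f"{dev}: {temp} C{flag}")
            break
    else:
        print(f"{dev}: no temperature attribute reported")
```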
7. Not watching the shop.
Are you notified when a disk drive fails? Hard drive failures are the most common because drives have moving parts. These failures should be expected and accounted for by using a RAID (Redundant Array of Inexpensive Disks) to prevent loss of data. A RAID 5 array stripes your data plus parity information across a set of drives (often 8-12 in a PACS archive), so that if any single drive fails, no data is lost, because the missing data can be rebuilt from the surviving drives. A common problem we see is that having a redundant configuration makes people overconfident, and they assume they will be protected in the event of multiple failures. Many times they neglect to notice that one of the drives has already failed. If you don't have a hot spare that rebuilds the RAID automatically, you are running in a very vulnerable state where the loss of any one additional drive means the complete loss of the entire array. We've heard of situations where a drive loss in a RAID went undetected for six months and was then followed by a second failed drive some time later. Make sure you have a hot spare drive configured and that you at least get an automatic system email when you do lose a drive, so you can get a replacement in ASAP.
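If your array doesn't already email you on a failure, even a small home-grown watchdog is better than nothing. The sketch below is a minimal example that assumes Linux software RAID (md); hardware RAID controllers have their own command-line tools you would query instead, and the mail server and addresses are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: email an alert when a software-RAID array is running degraded.
Assumes a Linux host using md RAID (status exposed in /proc/mdstat); hardware RAID
controllers need their vendor's CLI instead. Mail settings are placeholders."""

import re
import smtplib
from email.message import EmailMessage

MAIL_SERVER = "smtp.example.org"            # placeholder SMTP relay
ALERT_FROM = "pacs-monitor@example.org"     # placeholder sender
ALERT_TO = "pacs-admins@example.org"        # placeholder distribution list

with open("/proc/mdstat") as f:
    mdstat = f.read()

# A failed member is marked "(F)"; a degraded array shows a "_" in its status
# brackets, e.g. [UU_] instead of [UUU].
degraded = "(F)" in mdstat or re.search(r"\[[U_]*_[U_]*\]", mdstat)

if degraded:
    msg = EmailMessage()
    msg["Subject"] = "PACS storage alert: RAID array degraded"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(mdstat)
    with smtplib.SMTP(MAIL_SERVER) as smtp:
        smtp.send_message(msg)
```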
6. Running out of disk space for images.
Images suck up space on your storage system better than a Hoover. Most people subscribe to the just-in-time strategy of buying storage each year to enjoy the dramatic reduction in price; however, you should keep a watchful eye on your available capacity. Adding new or upgraded modalities (e.g. a new 64-channel CT scanner) will make your old burn-rate estimates inaccurate. Be ever-vigilant about how much disk space you have left so that you don't misjudge it and run out of gas.
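A few lines of scripting can keep the burn rate in front of you. The sketch below is a minimal example; the archive path, daily burn rate, and warning threshold are placeholder assumptions you would replace with your own measured numbers, and the burn rate should be re-measured whenever a new modality comes online.

```python
#!/usr/bin/env python3
"""Minimal sketch: warn when the image archive is projected to fill up.
The archive path, burn rate, and threshold are placeholder assumptions."""

import shutil

ARCHIVE_PATH = "/pacs/archive"      # placeholder mount point for image storage
DAILY_BURN_GB = 40                  # placeholder: average GB of new images per day
WARN_DAYS = 90                      # warn when roughly 3 months of headroom remains

usage = shutil.disk_usage(ARCHIVE_PATH)
free_gb = usage.free / 1024**3
days_left = free_gb / DAILY_BURN_GB

print(f"{free_gb:,.0f} GB free, roughly {days_left:,.0f} days at {DAILY_BURN_GB} GB/day")
if days_left < WARN_DAYS:
    print("WARNING: order more storage now -- procurement takes longer than you think.")
```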
5. System is too complex.
If your PACS uses multiple databases, like Oracle and SQL Server, and/or multiple operating systems, like Windows and Linux/Unix, you are asking for sleepless nights. These complex architectures are precarious and often have many single points of failure. These patchwork systems are often the result of a haphazard technology strategy that plugged together components from acquired companies. Keeping such a system running smoothly requires too much expertise across all the operating systems and databases involved. Keep it simple -- complexity is the enemy of reliability.
4. Upgrades can turn off your lights.
Did you ever notice just how often those scheduled downtimes boil over into unscheduled downtimes? This failure mode has several causes:
- An unforeseen complication with software or hardware during the upgrade causes the software to crash. This might come from DLL versioning conflicts or device drivers.
- Often the new software requires more hardware resources (the technical term is bloat) and the computers you have can't handle the load.
- In the vendor’s exuberance to upgrade you, they blew away your configuration data. Laugh now, but it happens.
Word to the wise: always ask for a back-out plan -- or Plan B -- before an upgrade: what to do when things go south, and the last possible abort time, also known as the Point of No Return. Those plans may need to be invoked to make sure you don't disrupt operations at peak times like Monday morning. If you are really good, you will have a test system on which to rehearse what will happen in your production environment, and always take a full backup BEFORE you upgrade.
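Part of that Plan B can be automated. The sketch below is a minimal example of snapshotting configuration files before the vendor touches anything; the paths are placeholders, and the database itself should still be backed up with its own native tools.

```python
#!/usr/bin/env python3
"""Minimal sketch: snapshot configuration before an upgrade so Plan B has something
to roll back to. Paths are placeholders; directories that don't exist will raise an
error. The database itself should be backed up with the vendor's own tools."""

import tarfile
import time
from pathlib import Path

CONFIG_DIRS = ["/opt/pacs/etc", "/opt/pacs/licenses"]   # placeholder config locations
BACKUP_DIR = Path("/pacs/backups")                      # placeholder backup target

stamp = time.strftime("%Y%m%d-%H%M%S")
archive = BACKUP_DIR / f"pacs-config-{stamp}.tar.gz"

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive, "w:gz") as tar:
    for d in CONFIG_DIRS:
        tar.add(d)                  # recursive by default
print(f"Configuration snapshot written to {archive}")
```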
3. Network outages.
Network outages are frequently out of our control but can still cause us a great deal of grief. CIOs have woken up quickly to the fact that their networks are no longer just for billing and are now part of the delivery of care. A faulty network is probably the most frequent cause of a CIO's early dismissal. There are things you can do to add network redundancy to your system at very little cost. Most data centers will have two separate network trunks available. For an extra $100 or so, make sure you have either two network cards (NICs) or a dual-port NIC in your servers so you can connect to both trunks at the same time. If one of the trunks becomes unavailable, traffic will automatically route through the other, and your PACS users will never know about the failure. A side benefit, depending on how the links are configured, is that you may get extra network throughput out of your servers, which might make things faster.
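The failover itself is handled by the operating system's NIC bonding/teaming (or by the switch), but it is worth checking that both links are actually live; otherwise you can quietly run on one path for months. The sketch below is a minimal example assuming a Linux host, with placeholder interface names.

```python
#!/usr/bin/env python3
"""Minimal sketch: verify that both NICs in a redundant pair actually have link.
Assumes a Linux host exposing /sys/class/net; interface names are placeholders.
This does not perform failover -- it only tells you when redundancy is gone."""

from pathlib import Path

NICS = ["eth0", "eth1"]   # placeholder: the two interfaces cabled to separate trunks

down = []
for nic in NICS:
    state_file = Path(f"/sys/class/net/{nic}/operstate")
    state = state_file.read_text().strip() if state_file.exists() else "missing"
    print(f"{nic}: {state}")
    if state != "up":
        down.append(nic)

if down:
    print(f"WARNING: redundancy lost -- {', '.join(down)} not up. Fix before the other trunk fails.")
```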
2. Lack of database space.
The database is the central controller of your workflow and operations. Whenever you ask for a worklist of cases that haven't been read, or search for a case, the database is the one answering your questions. Databases store lots of data (although not the actual images) and need space, too. If a database runs out of space, your system will stop dead in its tracks. This is readily solved with automated, active monitoring, but it still seems to be a frequent failure that bites the unwary.
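Active monitoring here can be as simple as a scheduled query against the database's own catalog views. The sketch below assumes a SQL Server back end reachable over ODBC; Oracle and other engines have equivalent dictionary views, and the connection details are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: report database file sizes and growth caps so you see trouble
coming before the database stops dead. Assumes SQL Server reachable via an ODBC
DSN; the DSN is a placeholder. sys.master_files reports size in 8 KB pages."""

import pyodbc  # third-party: pip install pyodbc

conn = pyodbc.connect("DSN=PACSDB;Trusted_Connection=yes")  # placeholder connection
cur = conn.cursor()

# max_size = -1 means the file grows until the underlying disk is full.
cur.execute("""
    SELECT DB_NAME(database_id), name, size, max_size
    FROM sys.master_files
""")
for db, fname, size_pages, max_pages in cur.fetchall():
    size_mb = size_pages * 8 / 1024
    if max_pages == -1:
        print(f"{db}/{fname}: {size_mb:,.0f} MB, unlimited growth (watch the underlying disk)")
    else:
        max_mb = max_pages * 8 / 1024
        pct = 100 * size_mb / max_mb
        print(f"{db}/{fname}: {size_mb:,.0f} MB of {max_mb:,.0f} MB cap ({pct:.0f}% used)")
```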
1. Hardware failures.
In a recent survey of database professionals regarding failures in the server room, hardware failures led the list, accounting for 49% of unplanned outages over the past year. (See http://www.dmreview.com/editorial/newsletter_article.cfm?articleId=1061….) The cost of redundant servers has become very low, especially compared to the damage done to your department's credibility during a downtime. A loaded enterprise server with dual power supplies and a dual-core processor costs only $3,000 - $5,000. Compared to the cost of the whole system, the cost of redundancy is minimal. Vendors still don't take full advantage of modern IT principles of fault tolerance and would do well to sell redundancy as part of their standard configuration. The simple fact is that the vast majority of failures can be attributed to poor systems management, which is avoidable and doesn't have to cost much to fix, sometimes nothing at all. The best thing you can do to prevent a failure is to keep a watchful eye on your system and keep open lines of communication with both your users and your IT department. Hardware failures are almost always preceded by many error messages on the system. The question is, are you listening? It's far easier to recover from the warning tremors of a failure than from a complete failure.
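Listening can be as simple as scanning the system log for the kinds of messages that precede a hardware failure. The sketch below is a minimal example assuming a Linux host logging to /var/log/syslog (Windows admins would look at the event log instead); the keyword list is only a starting point, not a definitive set.

```python
#!/usr/bin/env python3
"""Minimal sketch: scan the system log for messages that typically precede a
hardware failure. Assumes a Linux host logging to /var/log/syslog; some distros
use /var/log/messages instead. The keyword list is a starting point only."""

from pathlib import Path

LOG = Path("/var/log/syslog")            # placeholder log location
KEYWORDS = ("I/O error", "ata error", "ECC", "Machine Check", "over-temperature")

hits = [line for line in LOG.read_text(errors="replace").splitlines()
        if any(k.lower() in line.lower() for k in KEYWORDS)]

print(f"{len(hits)} suspicious log lines found")
for line in hits[-20:]:                  # show the most recent ones
    print(line)
```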