Goose

Preventive Maintenance Checklist


Ok so I'm coming up with a preventative maintenance checklist as a reference for keeping our server up and running.

 

Our CCTV server consists of a rack with 8 DVRs (plus an archive DVR), set up in RAID 6, with two workstations.

 

With a server like that, what would you guys do in the way of preventative maintenance? Either daily or weekly.


I think daily/weekly might be a little excessive for PM. Monthly, at most...

 

I assume these are PC-based DVRs? Are the drive arrays internal in each one, or are you running separate NAS arrays?

 

For me, with a PC-based DVR, most of the PM I'd do would require taking the system offline - can your archive system be used as a temporary failover machine to cover for the one that's being worked on?

 

First thing I'd do is open the thing up and check fan operations, and blow out any collected dust... in most environments I see, semi-yearly is sufficient for this; if you're running in a room with filtered air, even yearly you probably won't see much dust buildup.

 

Next thing I do is run Memtest86+, and if I have the time, let it run a couple passes.

 

From there, I'll run manufacturer drive diagnostics (PowerMax, DFT, SeaTools, etc.), ideally from a DOS boot disk; most now have Windows-based utilities I can run without taking the system down, but most will also be happier if nothing is accessing the drive being tested, so shutting down the DVR software is probably advisable.

 

All the NAS arrays I've used have their own disk-test diagnostics as well, so running those every 4 or 6 months is advisable... but most will also monitor SMART status and send an email if issues are found. The newer QNAP firmwares allow you to set scheduled automatic disk tests.

 

The other thing to watch for is capacitor plague - visually inspect the electrolytics on the motherboard for bulging tops, and if you see any, plan to replace the MOBO ASAP.

 

Beyond that, I suggest just good monitoring... something like Speedfan can watch temperatures, fan speeds, and drive SMART statistics, and send an email if anything goes over user-defined thresholds. There are other, fancier commercial products as well, of course.
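
As a rough illustration of the "alert when a reading goes over a user-defined threshold" idea, here is a minimal Python sketch. It does not talk to SpeedFan or to any real sensor; the reading names (`cpu_temp_c`, `reallocated_sectors`, and so on) and the `Threshold` type are made up for the example, and the readings are assumed to come from whatever monitoring tool you already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A hypothetical upper limit for one monitored value."""
    name: str
    limit: float


def find_alerts(readings: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one message per reading that is missing or above its limit."""
    alerts = []
    for t in thresholds:
        value = readings.get(t.name)
        if value is None:
            # A sensor that stops reporting is worth knowing about too.
            alerts.append(f"{t.name}: no reading available")
        elif value > t.limit:
            alerts.append(f"{t.name}: {value} exceeds limit {t.limit}")
    return alerts


if __name__ == "__main__":
    limits = [Threshold("cpu_temp_c", 65.0), Threshold("reallocated_sectors", 0.0)]
    sample = {"cpu_temp_c": 71.0, "reallocated_sectors": 0.0}
    for message in find_alerts(sample, limits):
        print(message)
```

This only covers "too high" conditions. A fan that has slowed down or stopped would need a minimum limit as well, which is left out here to keep the sketch short.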


About forty years ago I read an interesting analysis of potentiometer failures. Long story short, the users would use a certain range of the potentiometer day in and day out. Over time, dust and grime would begin to collect at the two end points of the typical daily range. Then, at a time of crisis when they tried to use a wider range, the potentiometer would fail, just when they needed it most.

 

Recently one of my neighbors had an attempted theft. Their video system had been working flawlessly for many months, but when the police came to their house to get a copy of the evidence, they had no idea how to create a CD. They had never tried that before.

 

Same thing with backups. I've known companies that had an elaborate backup scheme, but they had not tried to recover a file for a very long time. One day when an important file was actually lost and they tried to recover it, that was when they realized the backup scheme had not been working for many months.

 

So my suggestion regarding PM is to add crisis drills once in a while. For example, create an evidence CD every so often from some random date in the past, just to make sure all the critical pieces are working.

 

Best,

Christopher
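
For anyone who wants to put the drill idea on a schedule, a minimal Python sketch of two pieces follows: picking a random past date to export from, and checking that a restored file is byte-for-byte identical to the original. The function names and the `retention_days` parameter are assumptions for illustration; how you actually export footage or restore a file depends entirely on your DVR and backup software.

```python
import hashlib
import random
from datetime import date, timedelta
from pathlib import Path


def pick_drill_date(today: date, retention_days: int, rng: random.Random) -> date:
    """Pick a random day between 1 and retention_days days before today."""
    return today - timedelta(days=rng.randint(1, retention_days))


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large video files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """True if the restored copy has the same SHA-256 hash as the original."""
    return sha256_of(original) == sha256_of(restored)


if __name__ == "__main__":
    # Assumes 30 days of footage are retained; adjust to your own setup.
    print(pick_drill_date(date.today(), 30, random.Random()))
```

A matching hash only proves the file came back intact. The drill should still end with someone actually playing the exported clip on a different machine.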


Yeah, it's a PC-based DVR system and the RAID arrays are internal. And it sounds like the stuff I'm already doing is sufficient.

 

And yeah Christopher, good point about doing drills to make sure things are working as they should. Duly noted.


One other thing you could do, maybe semi-annually, is run a series of stress tests and benchmarks, such as Prime95, on the other hardware... things that will test I/O performance, run the CPU hard to see if the fan controls maintain temperatures properly, and so on. Prime95 is widely used by overclockers to test the stability of their mods.

 

Maybe keep a log of the tests so you can compare them from one to the next and watch for any decreases in performance that could indicate pending failures.
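
To make the "compare from one run to the next" step concrete, here is a small Python sketch that flags any metric that has dropped by more than a set fraction since a baseline run. The metric names and the 10% tolerance are assumptions for the example, and it treats every metric as "higher is better" (throughput, benchmark scores), so temperatures or run times would need the comparison reversed.

```python
def flag_regressions(
    baseline: dict[str, float],
    latest: dict[str, float],
    tolerance: float = 0.10,
) -> list[str]:
    """Return names of metrics that fell more than `tolerance` below baseline.

    Assumes higher values are better. Metrics missing from the latest run,
    or with a non-positive baseline, are skipped rather than guessed at.
    """
    flagged = []
    for name, base in baseline.items():
        current = latest.get(name)
        if current is None or base <= 0:
            continue
        drop = (base - current) / base
        if drop > tolerance:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical numbers copied by hand from two test logs.
    first_run = {"disk_read_mb_s": 120.0, "cpu_score": 1000.0}
    this_run = {"disk_read_mb_s": 95.0, "cpu_score": 990.0}
    print(flag_regressions(first_run, this_run))
```

Comparing against the first logged run rather than only the previous one helps catch a slow decline that never drops much between any two consecutive tests.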


Agree with the above: go over your restore procedure. My IT department had been backing up to a tape drive for years. When I brought in our security system contractor, they said they wouldn't have any temp servers with tape drives, so obviously we changed our backup method.

 

If you have A/C cooling your equipment, check on its regular maintenance and also test any high-temperature alarms if you have them. It would not be good to do regular maintenance on the DVRs and other equipment, only to lose all of it because your A/C unit failed.

 

Other lessons learned:

-Check any UPS's if you have any!

-Make sure your electrical outlets are labeled with location of circuit breaker and the circuit breaker in the electrical panel is labeled indicating what equipment is on that breaker.

-Check any UPS's if you have any!

 

I print the date of purchase on a Dymo label for all my lead-acid batteries. If it's a critical system, I recommend changing lead-acid batteries every three years. I've had a few lead-acid batteries last much longer on non-critical systems, but more than three years is pushing it for critical systems.

 

Best,

Christopher
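
If the purchase dates from those labels also go into a spreadsheet or script, the three-year rule is easy to check automatically. A minimal Python sketch, with the function name and the `max_age_years` default chosen for illustration:

```python
from datetime import date


def replacement_due(purchased: date, today: date, max_age_years: int = 3) -> bool:
    """True once the battery has reached max_age_years since purchase."""
    try:
        due = purchased.replace(year=purchased.year + max_age_years)
    except ValueError:
        # Purchased on Feb 29 and the due year is not a leap year: use Feb 28.
        due = purchased.replace(year=purchased.year + max_age_years, day=28)
    return today >= due


if __name__ == "__main__":
    # Hypothetical purchase date read off a battery label.
    print(replacement_due(date(2021, 3, 10), date.today()))
```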


Our maintenance department does monthly checkups on all the A/C and UPS equipment in the casino. I hadn't given that any thought, though, so I'll at least be checking their logs every once in a while.

 

Thanks for the suggestions everyone.

