Windows Server 2003 Maintenance Made Easy
Following a management and maintenance regimen reduces administration, maintenance, and business expenses while increasing reliability, stability, and security.
Maintaining Windows Server 2003 systems isn't always simple. Administrators facing the often-daunting task of maintaining a Windows Server 2003 environment must do so in the midst of daily administration and firefighting. Administrators have little time to identify, prioritize, and organize maintenance processes and procedures until after a preventable catastrophe occurs.
When maintenance tasks are given proper priority in an enterprise, they can alleviate many of the more common firefighting tasks. To reduce administrative inefficiencies and the number of unscheduled fixes an administrator must perform, it is important to identify those tasks (e.g., service maintenance and log file review) that are important to the systems' overall health and security. After they've been identified, routines should be set to ensure that the Windows Server 2003 environment is stable and reliable.
The processes and procedures for maintaining Windows Server 2003 systems can be separated based on the needs of the systems and the appropriate time interval between procedures. Some maintenance issues require daily attention, whereas others might require only quarterly checkups. The detailed maintenance processes and procedures that an organization follows depend strictly on the particular environment; however, the concept of placing tasks into daily, weekly, monthly, and quarterly categories works with all sizes and varieties of IT infrastructures.
While these tasks might seem commonplace or mundane, they are often overlooked, yet they are critical to keeping the system environment and users working productively. As the number of servers and services within an environment increases, so does the amount of time required to perform routine management and maintenance. As a result, many organizations are beginning to realize the importance of employing systems or operational management software such as Microsoft Operations Manager, NetIQ AppManager, IBM Tivoli, and others. Many of these products can monitor machines automatically, send alerts to administrators, and execute actions when they discover a problem.
Some maintenance procedures require more attention than others. Those that require the most are categorized as daily procedures. These procedures include checking the overall server health and functionality, verifying that backups are successful, and monitoring the Event Viewer logs.
First, pay close attention to the myriad of patches and updates made available by Microsoft and other vendors to clear up performance and security issues. These service packs (SPs) and updates for both the operating system and applications can be critical components to maintaining your environment. There are several ways an administrator can update a system with the latest SP or update: CD-ROM, manually entered commands, Windows Update, Windows Update Services (WUS), or third-party products such as NetIQ Patch Manager.
No matter which method you use to pull down and apply patches, be sure to test and evaluate SPs and updates thoroughly in a lab environment before installing them on production servers and client machines. While the majority of such patches will install without incident, it takes only one serious patch problem to bring down your environment. It is a good idea to keep all system and application software consistent by installing the same patch levels on each server and client machine.
The Event Viewer is used to check the event logs on a local or remote system. These logs can be an invaluable source of information about system health. All Windows Server 2003 systems have Security, Application, and System logs. In addition, the File Replication Service, Directory Service, and DNS Server logs are present on domain controllers.
All Event Viewer events are categorized as informational, warning, or error. Checking these logs often will increase your understanding of them. Some events appear constantly but aren't significant. As events begin to look familiar, you will notice when something is new or amiss in your event logs. Some best practices for monitoring event logs include understanding the events that are being reported, setting up a database for archived event logs, and archiving event logs frequently.
To simplify monitoring hundreds or thousands of generated events each day, the administrator should use the filtering mechanism provided in the Event Viewer or use one of the operational management packages mentioned earlier. Although warnings and errors should take priority, the informational events should be reviewed to track what was happening before the problem occurred. After the administrator reviews the informational events, he or she can then easily filter out the informational events and view only the warnings and errors.
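The review order described above (context first, then warnings and errors only) can be sketched in a few lines of Python. This is only an illustration: the Event record and the split_by_priority function are hypothetical names, and real Event Viewer entries carry many more fields than shown here.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, simplified stand-in for one Event Viewer entry."""
    log: str      # e.g. "System", "Security", "Application"
    level: str    # "information", "warning", or "error"
    source: str
    message: str


# Levels that should be reviewed first.
PRIORITY_LEVELS = {"warning", "error"}


def split_by_priority(events):
    """Return (priority, informational) lists, preserving the original order."""
    priority = [e for e in events if e.level.lower() in PRIORITY_LEVELS]
    informational = [e for e in events if e.level.lower() not in PRIORITY_LEVELS]
    return priority, informational
```

Keeping the informational list, rather than discarding it, leaves the surrounding context available when a warning or error needs to be traced back.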
The three event logs on all servers and the three extra logs on a DC can be archived manually, or a script can be written to automate the task. You should archive the event logs to a central location for ease of management and retrieval.
The length of time to keep archived log files varies by organization. For example, banks or other high-security organizations might be required to keep event logs for up to a few years. As a best practice, organizations should keep event logs for at least three months.
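A retention rule like this is easy to get wrong by a few days, so a small sketch may help. The function name, the dictionary of archive dates, and the 92-day default are assumptions made for illustration; 92 days is used because it is the longest span any three consecutive calendar months can cover.

```python
from datetime import date, timedelta


def expired_archives(archives, today, retention_days=92):
    """Return the names of archives older than the retention period.

    archives maps a file name to the date it was archived. The 92-day
    default is an assumption standing in for "three months".
    """
    cutoff = timedelta(days=retention_days)
    return sorted(name for name, archived_on in archives.items()
                  if today - archived_on > cutoff)
```

Organizations with longer regulatory requirements would pass a larger retention_days value rather than change the function.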
The script logarchive.vbs (see Listing 1) can retrieve event logs and store them in a central location. The process might take a long time (up to a few hours) depending on the size of the log files as well as how many servers you're pulling from. Avoid running this script over slow WAN connections so that bandwidth is conserved.
Another file, logarchive.ini, is required when using logarchive.vbs. This file, shown next, contains a list of servers and the following archiving modes: T means purge after archiving; F means archive only.
To use logarchive.vbs, do the following:
1. Verify that the logarchive.vbs and logarchive.ini files are in a directory that is on the system path.
2. Open the logarchive.ini file in a text editor and type the list of servers on which you want to archive event logs.
3. Choose Start, Run and type cmd to open a command prompt.
4. At the command prompt, type cscript logarchive.vbs.
The command in step 4 archives all the event logs for the servers that were specified in the logarchive.ini file. The log files are stored in the directory specified in the script. Each file is named with the server name, a log-type code, and the archive date in MMDDYYYY form, separated by underscores.
For example, sfdc01_sec_02202004.log is the name for the SFDC01 server's Security log, archived on February 20, 2004.
It is recommended that you label log files in the following manner:
- _sec_ Security log
- _app_ Application log
- _sys_ System log
- _rep_ File Replication log
- _dns_ DNS Server log
- _dir_ Directory Service log
Note that logarchive.vbs does not purge the event logs.
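The naming convention above can be expressed as a pair of small Python helpers. This is a sketch rather than part of logarchive.vbs: the function names are hypothetical, and the MMDDYYYY date layout is inferred from the sfdc01_sec_02202004.log example.

```python
from datetime import date, datetime

# Log-type codes taken from the list above.
LOG_CODES = {
    "Security": "sec",
    "Application": "app",
    "System": "sys",
    "File Replication": "rep",
    "DNS Server": "dns",
    "Directory Service": "dir",
}


def archive_name(server, log, archived_on):
    """Build a name such as sfdc01_sec_02202004.log."""
    return f"{server.lower()}_{LOG_CODES[log]}_{archived_on:%m%d%Y}.log"


def parse_archive_name(name):
    """Split an archive name back into (server, code, date)."""
    stem, _extension = name.rsplit(".", 1)
    # Split from the right so server names containing underscores survive.
    server, code, stamp = stem.rsplit("_", 2)
    return server, code, datetime.strptime(stamp, "%m%d%Y").date()
```

For example, archive_name("SFDC01", "Security", date(2004, 2, 20)) produces the file name used in the example above, and parse_archive_name reverses it.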
Hardware components supported by Windows Server 2003 are reliable, but this doesn't mean that they'll always run continuously without failure. Hardware availability is measured in terms of mean time between failures (MTBF) and mean time to repair (MTTR). This includes downtime for both planned and unplanned events. These measurements provided by the manufacturer are good guidelines to follow; however, mechanical parts are bound to fail at one time or another. As a result, you need to monitor hardware weekly to ensure efficient operation.
You can monitor hardware in many different ways. For example, server systems might have internal checks and logging functionality to warn against possible failure, Windows Server 2003's System Monitor might bring a hardware failure to light, and a physical hardware check can help to determine whether the system is about to experience a hardware problem.
If a failure has occurred or is about to occur, having an inventory of spare hardware can significantly improve the odds and speed of recovery. Checking system hardware on a weekly basis provides the opportunity to correct an issue before it becomes a problem.
Disk space is a precious commodity. Although the disk capacity of a Windows Server 2003 system can be virtually endless, the amount of free space on all drives should be checked daily. Serious problems can occur if there isn't enough disk space, including application failures, system crashes, unsuccessful backup jobs, and service failures. To prevent these problems, administrators should keep at least 25 percent of each drive free.
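The 25 percent guideline can be checked with a short script. The sketch below assumes a Python 3 interpreter is available on the machine being checked, which the article does not state; the function names are hypothetical.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_volume(path, minimum_percent=25.0):
    """Return (percent_free, meets_guideline) for the volume holding path."""
    usage = shutil.disk_usage(path)
    percent = free_percent(usage.total, usage.free)
    return percent, percent >= minimum_percent
```

Calling check_volume on each drive root and alerting when the second value is False turns the daily check into something that can be scheduled.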
Another major disk issue is caused by drive fragmentation. Whenever files are created, deleted, or modified, Windows Server 2003 assigns a group of disk clusters depending on the size of the file. As file size requirements fluctuate over time, so does the number of groups of clusters assigned to the file. Even though this process is more efficient when using NTFS, the files and volumes eventually become fragmented as the files do not reside in a contiguous location on the disk.
As fragmentation levels increase, disk access slows (see Figure 1). The system must use additional resources and time to find all the cluster groups in order to use the file. To minimize the amount of fragmentation and give performance a boost, the administrator should use the built-in Disk Defragmenter or a third-party tool such as Diskeeper to defragment all volumes.
The Domain Controller Diagnostic (DCDIAG) utility provided in the Windows Server 2003 Support Tools is used to analyze the state of a domain controller (DC). It runs a series of tests, analyzes the state of the DC, and verifies different areas of the system, such as
- Topology integrity
- Security descriptors
- Netlogon rights
- Intersite health
- Trust verification
DCDIAG should be run on each DC on a weekly basis or more often if problems arise.
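A weekly DCDIAG run is easier to keep up if the results are summarized automatically. The following sketch rests on two assumptions: that DCDIAG reports each result on a line containing "passed test" or "failed test" after the server name, and that the /s: switch selects the target DC. The function names are hypothetical, and dcdiag.exe must be on the path of the machine that runs the script.

```python
import re
import subprocess

# Assumed result line shape: "......... DC01 failed test Replications"
FAILED_LINE = re.compile(r"(\S+) failed test (\S+)")


def failed_tests(output):
    """Return (server, test) pairs for every failed test in DCDIAG output."""
    return FAILED_LINE.findall(output)


def run_dcdiag(server):
    """Run DCDIAG against one DC and return its failed tests."""
    result = subprocess.run(
        ["dcdiag", f"/s:{server}"],
        capture_output=True,
        text=True,
    )
    return failed_tests(result.stdout)
```

An empty list from run_dcdiag means no line matched the assumed failure pattern; it does not replace reading the full report when problems are suspected.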
WINS, DHCP, and DNS are three low-maintenance network services prevalent in most environments. Each relies on a database, however, and as with any other database, it's important to check these regularly to keep them running as efficiently as possible.
WINS uses the Extensible Storage Engine (ESE) to store system entries, and these entries are being updated, added, or deleted continuously. Over time, the WINS database can contain a lot of unused space due to the abundance of changes. As a result, the database should be compacted to regain the unused space and to enable the system to service the environment faster and more efficiently. Windows Server 2003 dynamically compacts the database, but offline compaction is required periodically as well.
The first step in WINS database compaction is to make sure that WINS has been backed up successfully. With the exception of the first backup, WINS backups happen in Windows Server 2003 automatically. To back up the WINS database, select Mappings, Back Up Database within the WINS Manager and then specify a location for the backup files. Click OK and WINS will then back up its database automatically every 24 hours.
If the WINS database ever becomes corrupted, simply stop and restart the WINS service. If WINS detects corruption, it will restore the most recent backup automatically. If WINS does not detect the corruption, the administrator can force a restore by selecting Mappings, Restore Database from the WINS Manager.
WINS is also designed to compact its databases automatically when they become too large. However, the administrator should periodically compact them manually. For large Windows Server 2003 environments with more than 1000 systems, Microsoft recommends compacting the database manually once a month. To compact the WINS database manually, open a command prompt window, change to the %SystemRoot%\System32\wins directory, and type NET STOP WINS. Next, type JETPACK WINS.MDB TEMP.MDB, and then type NET START WINS.
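The manual compaction sequence can be wrapped in a script so that the three commands always run in the same order and from the correct directory. This is a sketch under stated assumptions: the function names are hypothetical, a Python 3 interpreter is assumed to be available on the WINS server, and dry_run defaults to True so that nothing is executed until the output has been reviewed.

```python
import ntpath
import subprocess


def wins_directory(system_root):
    """Return the WINS database directory under the given %SystemRoot%."""
    return ntpath.join(system_root, "System32", "wins")


def compaction_commands():
    """Return the stop, compact, and start commands in order."""
    return [
        ["net", "stop", "wins"],
        ["jetpack", "wins.mdb", "temp.mdb"],
        ["net", "start", "wins"],
    ]


def compact_wins(system_root, dry_run=True):
    """List the commands and, unless dry_run is set, run them in order."""
    workdir = wins_directory(system_root)
    planned = []
    for command in compaction_commands():
        planned.append(" ".join(command))
        if not dry_run:
            # check=True stops the sequence as soon as a step fails.
            subprocess.run(command, cwd=workdir, check=True)
    return workdir, planned
```

Because check=True halts on the first failure, a failed JETPACK run leaves the WINS service stopped; the administrator would need to restart it by hand after investigating.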
DHCP maintenance is less complex than WINS maintenance. The DHCP database files (DHCP.mdb, DHCP.tmp, J50.log, J50#####.log, and J50.chk) and related Registry entries are backed up automatically every 60 minutes by default. The DHCP database is also compacted automatically at specific intervals. If manual compaction is needed, you can use the same JETPACK procedure with the DHCP service and its database file.
Similar to WINS and DHCP, DNS operates efficiently on its own and requires very little intervention or maintenance. However, one way to maintain DNS in medium to large environments is to set aging and scavenging. Depending on the number of updates and the number of records, DNS can potentially experience problems with removing stale records. Although this doesn't necessarily cause performance degradation or resolution problems in smaller networks, it might affect larger ones. As such, it's important to periodically scavenge this database. Aging and scavenging are not enabled by default, so they must manually be enabled by selecting Action, Set Aging/Scavenging within the DNS snap-in. (Be sure the appropriate server is highlighted first.) Then check the box within the Server Aging/Scavenging Properties window and set the appropriate intervals, as shown in Figure 2.
Another often-overlooked monthly task is testing uninterruptible power supplies (UPSs). These devices protect the system or group of systems from power problems (such as spikes and surges) and keep the system running long enough after a power outage for an administrator to shut it down gracefully. Administrators should follow the manufacturer's test guidelines at least once a month.
Once a month, an administrator should validate backups by restoring the backups to a server located in a lab. This is in addition to verifying that backups were successful from log files or the backup program's management interface. A restore gives the administrator the opportunity to verify the backups and to practice the restore procedures that would be used when recovering the server during a real disaster. In addition, this procedure tests the state of the backup media to ensure that they are in working order and builds administrator confidence for recovering from a true disaster. This is one of the most commonly skipped tasks of small to medium-sized IT departments. It is typically implemented after a major disaster in which the tape backup system failed to restore a server or data properly.
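One way to verify a test restore is to compare checksums of the restored files against a reference copy. The sketch below is an illustration only: the function names are hypothetical, SHA-256 is an arbitrary choice of hash, and the reference copy is assumed to be unchanged since the backup was taken (live data that has changed since then will naturally show up as different).

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(reference_root, restored_root):
    """Return (missing, changed) relative paths found in the restore."""
    reference_root = Path(reference_root)
    restored_root = Path(restored_root)
    missing, changed = [], []
    files = sorted(p for p in reference_root.rglob("*") if p.is_file())
    for reference in files:
        relative = reference.relative_to(reference_root)
        restored = restored_root / relative
        if not restored.is_file():
            missing.append(str(relative))
        elif file_digest(reference) != file_digest(restored):
            changed.append(str(relative))
    return missing, changed
```

Two empty lists indicate that every reference file was restored with identical content, which is stronger evidence than a "backup completed" entry in a log.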
An integral part of managing and maintaining any IT environment is to document the network infrastructure and procedures. The following are just a few of the documents you should consider having on hand:
- server build guides
- disaster recovery guides and procedures
- configuration settings
- configuration change logs
- historical performance data
- special user rights assignments
- special application settings
As systems and services are built and procedures are ascertained, document these facts to reduce learning curves, administration, and maintenance.
Not only is it important to document the IT environment adequately, but it's often even more important to keep those documents up to date. Otherwise, they quickly become outdated as the environment, processes, and procedures change along with the business.
In addition to operating system software, most of the major Intel-based server manufacturers provide frequent hardware driver and firmware updates. The updates resolve issues identified by the manufacturers that could interfere with proper function of the systems. Manufacturers generally require matched sets of firmware and driver release levels to alleviate compatibility problems. As with operating system patches, it is important to test firmware and driver updates before putting them into your production environment. Checking driver and firmware versions monthly is usually sufficient, unless a specific problem occurs.
As the name implies, you perform quarterly maintenance four times a year. Areas to maintain and manage on a quarterly basis are typically fairly self-sufficient and self-sustaining. Infrequent maintenance is required to keep the system healthy. This doesn't mean, however, that the tasks are simple or that they aren't as critical as those tasks that require more frequent maintenance.
For starters, storage capacity on all volumes should be checked to ensure that all volumes have ample free space. Keep approximately 25 percent free space on all volumes. Running low or completely out of disk space creates unnecessary risk for any system. Services can fail, applications can stop responding, and systems can even crash if there isn't adequate disk space.
Passwords should, at a minimum, be changed every quarter (90 days). This includes resetting the administrative account passwords. Changing passwords strengthens security measures so that systems can't be compromised easily. As an integral part of the password policy, you need to review other password requirements, including password age, history, length, and strength.
Active Directory (AD) is the heart of the Windows Server 2003 environment. As objects are added, modified, or deleted from the AD database or the schema is modified, these interactions with the database can cause fragmentation. Windows Server 2003 performs online defragmentation nightly to reclaim space in the AD database. However, the database size doesn't shrink unless you perform offline defragmentation. Figure 3 shows the differences between fragmented and defragmented AD databases.
Performing offline defragmentation and compaction of the AD database requires the domain controller to be rebooted. As such, this maintenance routine can be run on a less frequent basis. Ntdsutil is the tool for maintaining AD databases. It defragments the AD database and also performs other routines such as cleaning up metadata left behind by abandoned domain controllers and managing Flexible Single Master Operations (FSMO) roles.
To use Ntdsutil to defragment the AD database, do the following:
- Restart the DC and when the initial screen appears, press the F8 key.
- From the Windows Advanced Options menu, select Directory Services Restore Mode.
- In the next screen, select the Windows Server 2003 operating system being used and then log on to the system.
- Click OK when the informational message appears.
- At a command prompt, type ntdsutil files.
- At the File Maintenance prompt, type compact to %s, where %s identifies an empty target directory. This invokes esentutl.exe to compact the existing database and write to the specified directory. Figure 4 illustrates the compaction process.
- If compaction was successful, copy the new ntds.dit file to %systemroot%\NTDS and delete the old log files found in %systemroot%\NTDS.
- Type quit twice to exit the utility and then restart the computer.
Other uses for Ntdsutil include:
: Analyzes and reports the free space, reads the registry, and then reports the sizes of the database and log files.
Integrity: Performs an integrity check on the database, which detects any kind of low-level database corruption. This can take a long time to process if the AD database is large. It's important to note that you should always run Recover prior to running an integrity check.
Recover: Attempts to perform a soft recovery of the database. This task scans the log files and ensures all committed transactions therein are also reflected in the data file.
Table 1 summarizes some of the maintenance tasks and recommendations discussed.
It is easy for administrators to get caught up in firefighting and the daunting tasks of Windows Server 2003 administration. However, it is important to structure and prioritize system management and maintenance to help prevent unnecessary downtime and other such problems. Following a management and maintenance regimen reduces administration, maintenance, and business expenses while at the same time increasing reliability, stability, and security. These tasks appear simple. However, if they are not done regularly, administrators will find themselves with a lot of explaining to do.