MustGather information for web server hangs

The documentation required to diagnose web server hangs includes

  • the results of attempting different types of requests at the time of the hang
  • web server backtraces at the time of the hang
  • web server and plug-in configuration files
  • web server and plug-in log files

    The ServerDoc tool provided with ihsdiag automates much of the work of gathering this information. The user runs ServerDoc and provides the IHS installation directory and other information; ServerDoc creates a new directory to hold the required documentation, and stores information in that new directory.

    Once the ServerDoc tool has completed, the user should copy any remaining log files and configuration files used by the web server and the plug-in into the new directory, and send the directory to IBM support.

    what we expect to learn from this information

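    Common web server hang conditions can be categorized as follows:

  • IHS is waiting on an external application
  • IHS has a problem with processing new client connections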

    As discussed in the following sections, the root cause of the problem may not reside in IHS, so analysis of the IHS hang documentation may indicate that a different type of information is necessary.

    IHS is waiting on an external application

    A primary use of IHS is as a front-end to the WebSphere Application Server. Applications running in WebSphere may respond slowly, or not at all, so that all IHS threads are waiting for an application server response and no free IHS threads are available to handle new client connections.

    Some authentication mechanisms for IHS, such as LDAP authentication capability provided with IHS or by a third party vendor, must contact a server over the network as part of IHS request processing. If that communication stalls, it is possible that after some time all IHS threads are waiting on an authentication response and no free IHS threads are available to handle new client connections.

    In any situation where all IHS threads are waiting on an external application, the IHS hang documentation will show which component is waiting, but it cannot show why the application is not responding.

    Note: If the IHS hang documentation shows that IHS is waiting for a WebSphere response, related documentation for WebSphere will need to be gathered. Instructions for gathering this WebSphere documentation can be found at http://www-1.ibm.com/support/search.wss?rs=180&tc=SSEQTP&tc1=SSCMPB9&q=mustgather. This documentation can be collected at the same time as the IHS hang documentation, so that the required WebSphere information is available to IBM support if it is needed.

    Vendors of third-party components which run inside IHS may provide similar information for gathering documentation on problems that can cause the component to hang or stall; contact the vendor for more information.

    IHS has a problem with processing new client connections

    For this type of problem, IBM support anticipates being able to determine the failing component, as well as whether or not this is a known problem. Occasionally there are operating system issues which prevent IHS from finding out about new client connections. If analysis of the IHS hang documentation shows such a problem, network traces may be necessary and operating system support may suggest further diagnostic information.

    making sure required support programs are available

    Please refer to these instructions for verifying that required support programs are installed.

    special AIX considerations for IHS 1.3

    In some levels of AIX, backtraces cannot be obtained if IHS 1.3 is using the pthread accept mutex mechanism. The backtraces are critical for hang diagnosis. If the backtraces could not be collected, the cause of the hang cannot be diagnosed. Submit the hang documentation just in case, but check for the release-specific issues listed below to prepare for possible future occurrences of the hang.

    Other releases of IHS are not affected.

    running the tool

    Run the tool as root to avoid any permissions problems with obtaining backtraces or reading files, such as log files and configuration files. (More information about the requirement to run this tool as root is available here.)

    ServerDoc takes four parameters when gathering hang documentation:

    1. GatherHangDoc
    2. the name of the IHS installation directory (e.g., /usr/HTTPServer)
    3. the web server parent process id, or "auto" if the parent process has exited and left stranded child processes
    4. the address of a non-SSL port handled by the web server (e.g., 127.0.0.1:80), or "-" if there is no non-SSL port

    # java -jar ServerDoc.jar GatherHangDoc /path/to/IHS 1398 127.0.0.1:80
    

    The tool creates a new directory which contains a timestamp in the name, and the hang documentation will be saved in that directory.

    determining the value of the non-SSL address parameter

    If the IHS installation only supports SSL, then use - (hyphen) for this parameter. Otherwise, specify an IP address and port which can be used to reach the server from the local machine without using SSL.

    Use the following table to determine the value of the non-SSL address parameter based on the form of a non-SSL Listen directive used in your configuration:

    Listen directive looks like this    use this for address parameter
    (no non-SSL ports)                  -
    Listen 80                           127.0.0.1:80
    Listen port                         127.0.0.1:port
    Listen 192.168.1.15:80              192.168.1.15:80
    Listen ipaddress:port               ipaddress:port
    Listen myhostname:80                myhostname:80
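
    The mapping in the table can be sketched as a small shell function. This is only an illustration: the function name listen_to_address is hypothetical and is not part of ihsdiag, and the function expects just the argument of the Listen directive (the text after the word "Listen"), or an empty string when there is no non-SSL port.

```shell
# Hypothetical helper, not part of ihsdiag: map the argument of a non-SSL
# Listen directive to the address parameter expected by ServerDoc.
listen_to_address() {
    case "$1" in
        "")  echo "-" ;;              # no non-SSL ports
        *:*) echo "$1" ;;             # ipaddress:port or hostname:port, used unchanged
        *)   echo "127.0.0.1:$1" ;;   # bare port, reached through the loopback address
    esac
}
```

    For example, listen_to_address 8080 prints 127.0.0.1:8080, and listen_to_address 192.168.1.15:80 prints the value unchanged.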

    a sample run

    For this example, IHS is installed in /scratch/IHS, the parent process id is stored in file /scratch/IHS/logs/httpd.pid, the non-SSL port can be reached from the web server machine on address 127.0.0.1:8080, and ihsdiag was unpacked into directory /root/ihsdiag-1.3.0.

    # cd /tmp
    # java -jar /root/ihsdiag-1.3.0/ServerDoc.jar GatherHangDoc \
    /scratch/IHS `cat /scratch/IHS/logs/httpd.pid` 127.0.0.1:8080
    Gathering doc on 4 web server processes...
    5985  5986  5988  5984
    
    Seconds remaining before gathering information again:
    60...54...48...42...36...30...24...18...12...6...
    
    Gathering doc on 4 web server processes...
    5985  5986  5988  5984
    
    Seconds remaining before gathering information again:
    30...27...24...21...18...15...12...9...6...3...
    
    Gathering doc on 4 web server processes...
    5985  5986  5988  5984
    
    Reports, log files, and configuration files have been saved to
    directory
      HangDoc.200408310607
    If you have additional log files or configuration files, copy them
    there
    before packing up the directory.
    Web server log and conf files other than the default will have to be
    copied manually.
    WebSphere plug-in conf and log files will have to be copied manually.
    
    Hint for packing up the directory:
      tar -cf HangDoc.200408310607.tar HangDoc.200408310607
      gzip HangDoc.200408310607.tar
    # ls -l HangDoc.200408310607/
    total 772
    -rw-rw-r--    1 trawick  trawick         0 Aug 31 06:07 access_log
    -rw-rw-r--    1 trawick  trawick      5358 Aug 31 06:07 apachectl
    -rw-rw-r--    1 trawick  trawick       118 Aug 31 06:07 error_log
    -rw-rw-r--    1 trawick  trawick    462978 Aug 31 06:07 httpd
    -rw-rw-r--    1 trawick  trawick     28790 Aug 31 06:07 httpd.conf
    -rw-rw-r--    1 trawick  trawick    255056 Aug 31 06:08 log
    -rw-rw-r--    1 trawick  trawick        56 Aug 31 06:07 redhat-release
    -rw-rw-r--    1 trawick  trawick      5453 Aug 31 06:08 report
    

    what if the HangDoc tool is taking a very long time?

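    There are two normal situations where the tool can take a long time to gather data:

  • the display is being updated on a regular basis but there are so many httpd processes that it will take forever
  • the display is not being updated after several minutes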

    A less frequent cause is that there is a problem in the tool which causes it to hang.

    the display is being updated on a regular basis but there are so many httpd processes that it will take forever

    Two conditions will cause the display to be updated:

    If you need to interrupt the tool so the web server can be restarted (to try to resolve the hang condition), the best place to interrupt it is when it is counting down the number of seconds until it checks the web server state again. The last lines of output on the display will look like this:

    Seconds remaining before gathering information again:
    60...54...48...42...36...30.
    

    If the tool is interrupted at a different time, incomplete information will be gathered on the state of the web server. This will introduce some risk into our analysis of the problem, but as long as a meaningful percentage of the web server processes have been examined (>30%), it is usually possible to find a probable cause of the hang.

    the display is not being updated after several minutes

    If the IHS child processes have a very large number of threads (e.g., ThreadsPerChild is higher than 200), the expected cause is that the system debugger performs poorly when analyzing such processes.

    It is also possible that the HangDoc tool has a problem interacting with the system debugger, and it will never finish.

    To find out more information about the cause of the delay, take these steps:

    1. Make sure you've waited at least four minutes from the time that the display was last updated.
    2. From another terminal window, save the output of ps -ef to a file. This must be done before interrupting the HangDoc tool.
    3. Interrupt the HangDoc tool and find the most recent HangDoc.xxxx directory, which is what it was using when it stalled.
    4. Cut and paste the HangDoc display to a file.
    5. Send in the ps listing, the HangDoc display, and the HangDoc.xxxx directory to IHS support for analysis.
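
    Steps 2 and 3 above can be sketched with two small shell functions. The function names and the default file name are assumptions made for this illustration and are not part of ihsdiag; the second function relies on the directory modification time to pick the newest HangDoc directory.

```shell
# Step 2 (hypothetical helper): save the process listing to a file.
# Run this from another terminal before interrupting the tool.
save_ps_listing() {
    ps -ef > "${1:-ps-ef.txt}"
}

# Step 3 (hypothetical helper): print the most recently modified HangDoc.*
# directory found under the directory given as $1 (default: current directory).
latest_hangdoc_dir() {
    ls -dt "${1:-.}"/HangDoc.* 2>/dev/null | head -n 1
}
```

    A possible use is to run save_ps_listing /tmp/ps-ef.txt in the second terminal and, after interrupting the tool, run latest_hangdoc_dir from the directory in which the tool was started.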

    copying other web server and plug-in files

    The next step is to copy any other web server or plug-in configuration files and logs into the new HangDoc directory. Here is a list of files to copy if they are being used:

  • any IHS configuration file other than httpd.conf in the IHS install directory
  • any additional web server error or access log files, such as log files specific to each virtual host or log files created by rotatelogs
  • the WebSphere plug-in configuration file
  • the WebSphere plug-in log file
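
    Copying these files can be sketched as follows. The function name copy_extra_docs is hypothetical, and the actual locations of the additional configuration and log files depend on the installation, so the caller has to supply the paths.

```shell
# Hypothetical helper, not part of ihsdiag.
# Usage: copy_extra_docs HANGDOC_DIR FILE...
# Copies each existing FILE into HANGDOC_DIR and reports files that are missing.
copy_extra_docs() {
    dest=$1
    shift
    if [ ! -d "$dest" ]; then
        echo "no such directory: $dest" >&2
        return 1
    fi
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -p "$f" "$dest/" && echo "copied $f"
        else
            echo "skipped $f (not found)" >&2
        fi
    done
}
```

    Files are copied under their base names, so two files with the same name from different directories (for example, per-virtual-host error logs) would overwrite each other; rename such files or copy them into subdirectories by hand.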

    saving the documentation directory

    The last step is to pack up and compress the documentation directory using tar and gzip. The easiest way is to copy the tar and gzip commands that ServerDoc displayed at the end of its run.

    a sample run

    # tar -cf HangDoc.200408310607.tar HangDoc.200408310607
    # gzip HangDoc.200408310607.tar
    

    understanding the root requirement

    When gathering information on web server hangs, the tool must attach to live web server processes to obtain information about the state of those processes.

    If the web server is started as root, then at least one of these processes will be owned by root and other processes will be owned by the web server user id (e.g., nobody or www). Only root has the authority to attach to all of the processes, so the tool itself must be run as root. If the web server administrator does not have authority to log in or switch user to root, a simple script can be created to gather the hang documentation, and the system administrator can give the web server administrator sudo access to that script. sudo is a third-party tool available without cost for all appropriate platforms.
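
    A minimal sketch of such a script is shown here. The script location, the installation paths, the address, and the user name in the sudoers comment are all assumptions and must be adjusted for the actual system; the location of httpd.pid follows the sample run shown earlier.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. /usr/local/sbin/gather-hangdoc.sh, owned by root
# and writable only by root. A sudoers entry along these lines (user name
# "webadmin" is an assumption) would let the web server administrator run it:
#   webadmin ALL=(root) NOPASSWD: /usr/local/sbin/gather-hangdoc.sh run
IHS_ROOT=/usr/HTTPServer                          # assumed IHS installation directory
SERVERDOC_JAR=/root/ihsdiag-1.3.0/ServerDoc.jar   # assumed ihsdiag location
LISTEN_ADDR=127.0.0.1:80                          # assumed non-SSL address
OUTPUT_PARENT=/tmp                                # HangDoc.* directory is created here

# Print the ServerDoc command line for the parent process id given as $1.
hangdoc_command() {
    echo "java -jar $SERVERDOC_JAR GatherHangDoc $IHS_ROOT $1 $LISTEN_ADDR"
}

# Gather the documentation only when called with the argument "run".
if [ "${1:-}" = "run" ]; then
    pid=$(cat "$IHS_ROOT/logs/httpd.pid") || exit 1
    cd "$OUTPUT_PARENT" || exit 1
    exec java -jar "$SERVERDOC_JAR" GatherHangDoc "$IHS_ROOT" "$pid" "$LISTEN_ADDR"
fi
```

    The values are fixed inside the script rather than taken from the caller, so that sudo access to the script does not grant the ability to run arbitrary commands as root.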

    If the web server is not started as root, there are no such concerns, and the hang documentation tool may be run by the user id which starts the web server.

    If the tool is run as non-root and it is unable to gather the required information, the problem will have to be recreated. It may not be possible to determine if this problem occurred until the documentation has been analyzed by IBM HTTP Server support.