The documentation required to diagnose web server hangs includes
GatherHangDoc tool
The ServerDoc tool provided with ihsdiag automates much of the work of gathering this information. The user runs ServerDoc during the hang and provides the IHS installation directory and other information; ServerDoc creates a new directory to hold the required documentation, and stores information in that new directory. This collector also supports IBM Edge Caching Proxy (ibmproxy, WTE) since version 1.4.19.
Don't collect the ServerDoc-based mustgather on a system that isn't currently experiencing a hang. It cannot be analyzed by support.
Obtaining and installing the collector, ihsdiag, is documented here
Once the ServerDoc tool has completed, the user should copy any remaining log files and configuration files used by the web server and the plug-in into the new directory, and send in the directory to IBM support.
This tool uses native tools such as strace and truss to obtain system call traces, which include the contents of buffers used to read and write data from the network.
See gather_highcpu_doc.html#GSKITICC_HIGHCPU
If the server stops accepting new connections on z/OS, make sure LE APAR PM90528 is present.
In many cases, the ServerDoc tool will simply show that the WebSphere Plugin is waiting for a response from a backend Application Server that is not responding in a timely fashion. For this reason, we encourage customers to proactively gather the WebSphere Application Server MustGather for Hangs and submit it along with the IBM HTTP Server MustGather.
For WebSphere Application Server MustGathers for your platform, see here
Startup hangs with mod_proxy_balancer
mod_proxy_balancer requires 2 bytes of random data during start/restart which will result in a blocking read of /dev/random. On some virtualized systems, or systems where some other process is exhausting /dev/random, startup may hang until enough entropy can be gathered by the system.
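On Linux, the kernel's estimate of available entropy can be read from /proc/sys/kernel/random/entropy_avail. The sketch below is one way to check whether the pool looks low before a start or restart. The function name entropy_low and the threshold of 200 bits are assumptions chosen for illustration, not IBM guidance, and other platforms expose this information differently, if at all.

```shell
# Hypothetical helper: succeed (exit 0) if the entropy estimate is below a threshold.
# $1: file holding the estimate (defaults to the Linux procfs path)
# $2: threshold in bits (200 is an arbitrary illustrative value)
entropy_low() {
    file=${1:-/proc/sys/kernel/random/entropy_avail}
    threshold=${2:-200}
    [ -r "$file" ] || return 2        # estimate not available on this system
    avail=$(cat "$file")
    [ "$avail" -lt "$threshold" ]
}
```

For example, `entropy_low && echo "entropy pool is low; startup may block"` could be run just before restarting the server.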
For web server hangs with IHS 2.0 on Solaris, please see this document first.
For web server hangs with any release of IHS on AIX 5.3:
$ /usr/sbin/instfix -vik IY58143
IY58143 Abstract: Required fixes for AIX 5.3
Fileset X11.Dt.lib:5.3.0.1 is applied on the system.
Fileset X11.Dt.rte:5.3.0.1 is applied on the system.
Fileset X11.base.rte:5.3.0.1 is applied on the system.
Fileset X11.fnt.ucs.ttf is not applied on the system.
Fileset X11.fnt.ucs.ttf_extb:5.3.0.1 is applied on the system.
...
Fileset devices.vdevice.hvterm1.rte:5.3.0.1 is applied on the system.
Fileset devices.vtdev.scsi.rte:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.apps:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.framework:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.rte:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.webaccess:5.3.0.1 is applied on the system.
All filesets for IY58143 were found.
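The check above can be scripted by looking for the summary line in the instfix output. This is a minimal sketch: the function name apar_installed is hypothetical, and it relies only on the "All filesets for ... were found." line shown in the sample output.

```shell
# Hypothetical helper: read instfix output on stdin and succeed only if the
# summary line reports that all filesets for the APAR were found.
# $1: APAR number, e.g. IY58143
apar_installed() {
    grep -q "^[[:space:]]*All filesets for $1 were found\."
}
```

A possible use is `/usr/sbin/instfix -vik IY58143 | apar_installed IY58143 && echo "IY58143 is installed"`.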
Common web server hang conditions can be categorized as follows:
As discussed in the following sections, the root cause of the problem may not reside in IHS, so analysis of the IHS hang documentation may indicate that a different type of information is necessary.
A primary use of IHS is as a front-end to the WebSphere Application Server. It is possible for applications running in WebSphere to have delayed response, or no response at all, so that all IHS threads are waiting for an application server response and no free IHS threads are available to handle new client connections.
Some authentication mechanisms for IHS, such as LDAP authentication capability provided with IHS or by a third party vendor, must contact a server over the network as part of IHS request processing. If that communication stalls, it is possible that after some time all IHS threads are waiting on an authentication response and no free IHS threads are available to handle new client connections.
In any situation where all IHS threads are waiting on an external application, the IHS hang documentation will show which component is waiting but it cannot determine the root cause for why the application is not responding.
Note: If the IHS hang documentation shows that IHS is waiting for a WebSphere response, related documentation for WebSphere will need to be gathered. Instructions for this WebSphere documentation can be found at http://www-1.ibm.com/support/search.wss?rs=180&tc=SSEQTP&tc1=SSCMPB9&q=mustgather. It is possible to collect this documentation at the same time as the IHS hang documentation, so that the required WebSphere information is available to IBM support if it is necessary.
Vendors of third-party components which run inside IHS may provide similar information for gathering documentation on problems that can cause the component to hang or stall; contact the vendor for more information.
For this type of problem, IBM support anticipates being able to determine the failing component, as well as whether or not this is a known problem. Occasionally there are operating system issues which prevent IHS from finding out about new client connections. If analysis of the IHS hang documentation shows such a problem, network traces may be necessary and operating system support may suggest further diagnostic information.
Please refer to these instructions for verifying that required support programs are installed.
Note: This executable mustgather is not used on Windows or on z/OS.
Run the tool as root to avoid any permissions problems with obtaining backtraces or reading files, such as log files and configuration files. (More information about the requirement to run this tool as root is available here.)
ServerDoc is passed in four parameters for gathering hang documentation:
GatherHangDoc
# java -jar ServerDoc.jar GatherHangDoc /path/to/IHS 1398 127.0.0.1:80
The tool creates a new directory which contains a timestamp in the name, and the hang documentation will be saved in that directory.
If the IHS installation only supports SSL, then use - (hyphen) for this parameter. Otherwise, specify an IP address and port which can be used to reach the server from the local machine without using SSL.
Use the following table to determine the value of the non-SSL address parameter based on the form of a non-SSL Listen directive used in your configuration:
Listen directive looks like this | use this for address parameter |
(no non-SSL ports) | - |
Listen 80 | 127.0.0.1:80 |
Listen port | 127.0.0.1:port |
Listen 192.168.1.15:80 | 192.168.1.15:80 |
Listen ipaddress:port | ipaddress:port |
Listen myhostname:80 | myhostname:80 |
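The mapping in the table can be expressed as a small shell function. This is only a sketch: the name listen_to_addr is hypothetical, and it takes the argument of a single non-SSL Listen directive (or an empty string when there are no non-SSL ports).

```shell
# Hypothetical helper: turn the argument of a non-SSL Listen directive into
# the address parameter for ServerDoc, following the table above.
# $1: Listen argument, e.g. "80" or "192.168.1.15:80"; empty if SSL only
listen_to_addr() {
    case "$1" in
        "")  echo "-" ;;              # no non-SSL ports
        *:*) echo "$1" ;;             # address (or host name) and port given
        *)   echo "127.0.0.1:$1" ;;   # port only: reach it over loopback
    esac
}
```

For example, `listen_to_addr 80` prints 127.0.0.1:80, and `listen_to_addr myhostname:80` prints myhostname:80 unchanged.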
For this example, IHS is installed in /scratch/IHS, the parent process id is stored in file /scratch/IHS/logs/httpd.pid, the non-SSL port can be reached from the web server machine on address 127.0.0.1:8080, and ihsdiag was unpacked into directory /root/ihsdiag-1.3.0.
# cd /tmp
# java -jar /root/ihsdiag-1.3.0/ServerDoc.jar GatherHangDoc \
    /scratch/IHS `cat /scratch/IHS/logs/httpd.pid` 127.0.0.1:8080
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Seconds remaining before gathering information again: 60...54...48...42...36...30...24...18...12...6...
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Seconds remaining before gathering information again: 30...27...24...21...18...15...12...9...6...3...
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Reports, log files, and configuration files have been saved to directory HangDoc.200408310607
If you have additional log files or configuration files, copy them there before packing up the directory.
Web server log and conf files other than the default will have to be copied manually.
WebSphere plug-in conf and log files will have to be copied manually.
Hint for packing up the directory:
tar -cf HangDoc.200408310607.tar HangDoc.200408310607
gzip HangDoc.200408310607.tar
# ls -l HangDoc.200408310607/
total 772
-rw-rw-r-- 1 trawick trawick      0 Aug 31 06:07 access_log
-rw-rw-r-- 1 trawick trawick   5358 Aug 31 06:07 apachectl
-rw-rw-r-- 1 trawick trawick    118 Aug 31 06:07 error_log
-rw-rw-r-- 1 trawick trawick 462978 Aug 31 06:07 httpd
-rw-rw-r-- 1 trawick trawick  28790 Aug 31 06:07 httpd.conf
-rw-rw-r-- 1 trawick trawick 255056 Aug 31 06:08 log
-rw-rw-r-- 1 trawick trawick     56 Aug 31 06:07 redhat-release
-rw-rw-r-- 1 trawick trawick   5453 Aug 31 06:08 report
There are two normal situations where the tool can take a long time to gather data:
A less frequent cause is that there is a problem in the tool which causes it to hang.
Two conditions will cause the display to be updated:
If you need to interrupt the tool so the web server can be restarted (to try to resolve the hang condition), the best place to interrupt it is when it is counting down the number of seconds until it checks the web server state again. The last lines of output on the display will look like this:
Seconds remaining before gathering information again: 60...54...48...42...36...30.
If the tool is interrupted at a different time, incomplete information will be gathered on the state of the web server. This will introduce some risk into our analysis of the problem, but as long as a meaningful percentage of the web server processes have been examined (>30%), it is usually possible to find a probable cause of the hang.
If the IHS child processes have a very large number of threads (e.g., ThreadsPerChild is higher than 200), the likely cause is that the system debugger slows down considerably when analyzing such processes.
It is also possible that the HangDoc tool has a problem interacting with the system debugger, and it will never finish.
To find out more information about the cause of the delay, take these steps:
Save the output of ps -ef to a file. This must be done before interrupting the HangDoc tool.
Copy that file into the HangDoc.xxxx directory, which is what the tool was using when it stalled.
Send the HangDoc.xxxx directory to IHS support for analysis.
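The process listing mentioned in these steps can be captured from a second terminal while the tool is still running. A minimal sketch follows; the function name save_ps and the output file name ps-ef.txt are hypothetical, and writing the listing straight into the HangDoc.xxxx directory is an assumption about where it is most convenient to keep it.

```shell
# Hypothetical helper: save a full process listing into a directory and
# print the path of the file that was written.
# $1: destination directory (for example the HangDoc.xxxx directory)
save_ps() {
    out="$1/ps-ef.txt"
    ps -ef > "$out" && echo "$out"
}
```

Run it as `save_ps HangDoc.xxxx`, substituting the actual directory name, before interrupting the tool.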
The next step is to copy any other web server or plug-in configuration files and logs into the new HangDoc directory. Here is a list of files to copy if they are being used:
The last step is to pack up and compress the documentation directory using zip, tar followed by gzip, or pax followed by compress. The easiest way is to cut and paste the messages displayed by ServerDoc previously which showed the commands to use. The suggested commands will vary by platform. On z/OS, for example, pax and compress will be suggested instead of tar and gzip.
Don't forget to collect the corresponding WebSphere Application Server MustGathers and include them in your submission.
# tar -cf HangDoc.200408310607.tar HangDoc.200408310607
# gzip HangDoc.200408310607.tar
The resulting compressed file is the file to send to IBM support.
root requirement
When gathering information on web server hangs, the tool must attach to live web server processes to obtain information about the state of those processes.
If the web server is started as root, then at least one of these processes will be owned by root and other processes will be owned by the web server user id (e.g., nobody or www). Only root has the authority to attach to all of the processes, so the tool itself must be run as root. If the web server administrator does not have authority to log in or switch user to root, a simple script can be created to gather the hang documentation, and the system administrator can give the web server administrator sudo access to that script. sudo is a third-party tool available without cost for all appropriate platforms.
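A wrapper script of the kind described above might look like the following sketch. The function name, the variable names, and the default paths (borrowed from the example invocation in this document) are assumptions to adapt locally; the ServerDoc arguments follow the GatherHangDoc invocation shown earlier.

```shell
# Hypothetical wrapper for use with sudo; adjust the defaults for your system.
IHS_ROOT=${IHS_ROOT:-/scratch/IHS}                                  # IHS installation directory
SERVERDOC_JAR=${SERVERDOC_JAR:-/root/ihsdiag-1.3.0/ServerDoc.jar}   # where ihsdiag was unpacked
NONSSL_ADDR=${NONSSL_ADDR:-127.0.0.1:8080}                          # use - if the server is SSL only

# Read the parent process id from the pid file and run the collector.
gather_hang_doc() {
    pid=$(cat "$IHS_ROOT/logs/httpd.pid") || return 1
    java -jar "$SERVERDOC_JAR" GatherHangDoc "$IHS_ROOT" "$pid" "$NONSSL_ADDR"
}
```

Saved as a script whose last line calls gather_hang_doc, this gives the system administrator a single, fixed command to grant through sudo rather than general root access.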
If the web server is not started as root, there are no such concerns, and the hang documentation tool may be run by the user id which starts the web server.
If the tool is run as non-root and it is unable to gather the required information, the problem will have to be recreated. It may not be possible to determine if this problem occurred until the documentation has been analyzed by IBM HTTP Server support.
gsk_get_last_validation_error in the call stack in unexpected places?
Sometimes a series of GSKit internal functions show up as gsk_get_last_validation_error in the backtraces but they are not a cause of concern. Usually the lowest call in the stack is a properly displayed GSKit function (such as gsk_secure_soc_init) and higher in the stack will be the IHS or WebServer Plugin I/O callbacks (secure_read or plugin_ssl_read).
ThreadsPerChild?
This is normal for some utility processes created by IHS. Processes such as the IHS parent process, sidd, or the CGI daemon will be decorated as such.
If one of the threads is decorated as the "IHS main thread waiting for process to exit...", then the remaining threads are likely to be hung or stuck in a loop (causing high CPU). We'd typically look for a blocking system call (read, poll, select, mutex/lock related) in the first few frames for a hung thread, then look to identify the module owning the hung code by looking farther down in the stack to see where control was handed off from the core of Apache.