IBM HTTP Server - Diagnosing problems with sidd

During the initial SSL handshake between browser and web server, an SSL session is established, and characteristics such as client authentication and allowable ciphers are determined. This initial handshake is computationally intensive.

For subsequent TCP connections, the browser normally attempts to resume the prior SSL session instead of establishing a new SSL session, in order to avoid the expense of a full handshake. In order to support this resumption of SSL sessions, IBM HTTP Server maintains a cache of SSL sessions which can be resumed.

For the Windows platform, only one IBM HTTP Server child process is used to handle client connections, so an in-process cache maintained by the security library is sufficient. This document about sidd does not apply to the Windows platform.

For platforms other than Windows, multiple IBM HTTP Server child processes are normally used to handle client connections, so the cache of sessions must be accessible to all of those child processes. A session id cache daemon is provided (IHSROOT/bin/sidd), and it is started automatically when SSL support is enabled. It runs as a separate process.

Disabling sidd

If problems are experienced with sidd, there are certain circumstances where it can be safely disabled. Otherwise, most problems can be resolved with a configuration change.

AIX, HP-UX, Linux, Solaris

If a single long-lived child process is used to serve requests, sidd can be disabled and the internal security library cache used instead.

Disable the IBM HTTP Server sidd with the SSLCacheDisable directive and remove any existing SSLCacheEnable directives in httpd.conf.
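
Before editing, it can help to see which session cache directives are currently active. The following is a minimal sketch, not an IBM-provided tool: the function name list_sslcache_directives is hypothetical, and it assumes the directives live directly in the file given (it does not follow Include directives).

```shell
#!/bin/sh
# list_sslcache_directives CONF_FILE
# Print active (non-commented) SSLCache* directives with their line numbers.
# Hypothetical helper for illustration; it does not follow Include files.
list_sslcache_directives() {
    conf_file=$1
    # -i: match regardless of letter case.
    # The pattern only matches lines whose first non-blank text is SSLCache...,
    # so lines that start with '#' are skipped.
    grep -i -n -E '^[[:space:]]*SSLCache' "$conf_file"
}
```

For example, list_sslcache_directives conf/httpd.conf would show any SSLCacheEnable lines that still need to be removed once SSLCacheDisable has been added.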

z/OS

The IBM HTTP Server InfoCenter explains how to use the native z/OS equivalent of sidd.

Diagnosing sidd connect failures

For every SSL handshake, the httpd process handling the connection will communicate with the session id daemon. The communication takes place over a Unix (AF_UNIX) socket.

Certain types of problems can result in a connect failure, and one of the following messages may be seen:

Failure reason: ECONNREFUSED
Example message: [crit] (nnn)Connection refused: SSL0600S: Unable to connect to session ID cache
Description: The session id cache is not running or is temporarily overloaded, or an operating system-specific issue has been encountered.

Failure reason: EPERM
Example message: [crit] (13)Permission denied: SSL0600S: Unable to connect to session ID cache
Description: The filesystem permissions in the path to the session id cache socket do not permit the web server user id to access it.

Failure reason: EMFILE
Example message: [crit] (24)Too many open files: SSL0600S: Unable to connect to session ID cache
Description: The per-process file descriptor limit is too low, or the system-wide file descriptor limit is too low. This typically affects other web server and plug-in operations as well.

Failure reason: generic
Example message: [crit] SSL0600S: Unable to connect to session ID cache
Description: The customer is using an older level of IBM HTTP Server which does not log the exact failure reason.

IBM HTTP Server 1.3.26.x and 1.3.28.x users can upgrade to cumulative fix PK05084 or later to get the more descriptive message and save time diagnosing the problem.

IBM HTTP Server 2.0.x users can upgrade to cumulative fix PQ94389 or later to get the more descriptive message and save time diagnosing the problem.

If the customer cannot upgrade, work through the diagnosis steps for all of the other variations of this message.

All service levels of IBM HTTP Server 6.0 and later releases will log the specific failure reason.

The wording of "Connection refused" or "Permission denied" can vary from one platform to another.

Communication between httpd processes and sidd is required to reuse SSL sessions (i.e., avoid the expensive handshake on every TCP connection). Thus, if the connect error is occurring very frequently, it will result in a substantial increase in CPU utilization because SSL sessions will be reused infrequently.

General diagnosis steps

These steps do not apply to the EMFILE error.

  1. Make sure that the Unix socket used by sidd resides on a normal, local filesystem, as some network or other filesystems don't support Unix sockets. The default location for the Unix sockets is

    IHSROOT/logs/siddport.

    If that does not reside on a normal filesystem, use the SSLCachePortFilename directive to place the Unix socket in a directory which resides on a local filesystem.

    Example:

    SSLCachePortFilename /var/run/siddport                         
    
  2. If more than one IBM HTTP Server instance is used on this machine, make sure that each instance has been configured with its own Unix socket. Socket conflicts usually arise when two instances share the same server root or install location.

    Example:

    httpd-app1.conf
    SSLCachePortFilename /var/run/app1-siddport                         
    
    httpd-app2.conf
    SSLCachePortFilename /var/run/app2-siddport                         
    
  3. Make sure that one sidd process is running for every IBM HTTP Server instance. The parent process of sidd will be the parent httpd process of that web server instance.

    If sidd is not running for one or more instances, use the SSLCacheErrorLog directive in the conf file to specify the name of a sidd error log. Restart the web server. Once sidd exits or fails to start up, check the sidd error log.
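
The checks in steps 1 and 3 can be sketched in shell as follows. This is an illustrative sketch under stated assumptions: the function names check_sidd_socket and list_sidd_processes are hypothetical, the socket path is whatever SSLCachePortFilename (or the default IHSROOT/logs/siddport) resolves to on your system, and the columns printed by df and ps vary by platform.

```shell
#!/bin/sh
# check_sidd_socket SOCKET_PATH
# Report whether SOCKET_PATH exists and is a Unix socket, then show the
# filesystem that holds it so you can confirm it is a local filesystem.
check_sidd_socket() {
    sock=$1
    if [ -S "$sock" ]; then
        echo "socket: $sock"
    elif [ -e "$sock" ]; then
        echo "not-a-socket: $sock"
        return 1
    else
        echo "missing: $sock"
        return 1
    fi
    # df -P prints the filesystem backing the socket's directory.
    df -P "$(dirname "$sock")"
}

# list_sidd_processes
# Show running sidd processes; ps -ef output includes the PID and the
# parent PID, which should be the parent httpd of the same instance.
# The [s] in the pattern keeps grep from matching its own command line.
list_sidd_processes() {
    ps -ef | grep '[s]idd'
}
```

For example, check_sidd_socket /var/run/siddport followed by list_sidd_processes gives a quick view of whether the socket exists, where it resides, and whether each instance has its own sidd.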

Specific diagnosis steps for the EPERM failure

This problem is caused by the web server user id (e.g., "www" or "nobody") not having permission to read the Unix socket used by sidd (the session id cache daemon). The following example shows how the error arises.

Consider /opt/IBMIHS as the example IBM HTTP Server install directory, and assume that the customer did not use the SSLCachePortFilename directive to specify the location of the sidd socket, and that www is the web server user id (value of the User directive).

When IBM HTTP Server starts up, sidd will create the file /opt/IBMIHS/logs/siddport. When a new client SSL connection is received, mod_ibm_ssl will be running as user "www" and will try to connect to the sidd socket. So user "www" must have read and execute permissions to these directories:

        
    /opt
    /opt/IBMIHS
    /opt/IBMIHS/logs

And user "www" must have read permission to this "file":

    /opt/IBMIHS/logs/siddport

Normally, when IBM HTTP Server is installed the directories will be world readable and executable. If the customer changes those permissions (on /opt, /opt/IBMIHS, or /opt/IBMIHS/logs) then permission errors will be received when new SSL connections are being established and mod_ibm_ssl tries to connect to the sidd socket. The SSLCachePortFilename directive can be used to place the sidd socket somewhere else.

Example:

SSLCachePortFilename /var/run/siddport                         

The actual file needs to be in a directory structure which, on your system, the web server user id can access.

If you have two instances of IBM HTTP Server that share an installation directory, each instance should specify a different argument to the SSLCachePortFilename directive.
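
To review the permissions described in this section, every component of the socket path can be listed in one pass. The sketch below is illustrative only: the function name show_path_permissions is hypothetical, and the path you pass is an assumption about where the socket lives on your system.

```shell
#!/bin/sh
# show_path_permissions PATH
# Print "ls -ld" output for PATH and for each parent directory above it,
# starting at PATH itself and stopping before the root directory.
show_path_permissions() {
    p=$1
    while [ "$p" != "/" ] && [ "$p" != "." ]; do
        ls -ld "$p"
        p=$(dirname "$p")
    done
}
```

With the example install directory, show_path_permissions /opt/IBMIHS/logs/siddport lists the socket first, then /opt/IBMIHS/logs, /opt/IBMIHS, and /opt, so a directory that the web server user id cannot access stands out.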

Specific diagnosis steps for the ECONNREFUSED failure

There are several classes of this error:

For solid failures, follow the general diagnosis steps above.

For a single failure that occurs immediately following an IHS restart, a problem was identified and a fix provided by APAR PK78007.
A fix for this APAR is provided in fixpacks 6.0.2.35, 6.1.0.25, and 7.0.0.5. Refer to the APAR for additional details.
The problem can be safely ignored as there are no ill effects on the server itself, but you can apply the appropriate fixpack as desired.
The fix is pertinent only for this 'Connection Refused' error message and not for the other errors such as 'Permission denied' error.
If you are getting multiples of this error message, then the problem is likely to be some other error or misconfiguration that is not addressed by this APAR.

For intermittent failures, find how many handshakes are impacted by comparing the number of failures to the number of total handshakes.

Set LogLevel info in the web server configuration file, rename error_log so that a new one is created, and restart. After sufficient data has been gathered:

  1. Find the total number of SSL handshakes
  2. Find the number of sidd connect errors
    $ grep "SSL0600" logs/error_log | wc -l
    49
    
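  3. Find the percentage of failures by dividing the number of sidd connect errors by the total number of SSL handshakes. With 5073 total handshakes in this example:
    49 / 5073 is a little less than 1%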
    

If the percentage of failure is less than 10%, it should have only a small impact on CPU usage.

If the percentage of failure is higher, check the operating system-specific notes below for known issues.
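
The arithmetic in step 3 can be scripted. This is a minimal sketch, assuming the total handshake count has already been obtained in step 1 and is supplied by hand; the function names count_sidd_failures and failure_percentage are hypothetical.

```shell
#!/bin/sh
# count_sidd_failures LOGFILE
# Count log lines that contain the SSL0600 message id.
# Note: grep -c prints 0 but returns a non-zero status when nothing matches.
count_sidd_failures() {
    grep -c "SSL0600" "$1"
}

# failure_percentage FAILURES TOTAL
# Print the failure rate as a percentage with two decimal places,
# or "n/a" when TOTAL is zero or negative.
failure_percentage() {
    awk -v f="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * f / t
    }'
}
```

With the numbers from the example, failure_percentage 49 5073 prints 0.97, which is well under the 10% threshold.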

Specific diagnosis steps for the EMFILE failure
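
Because this failure points to a file descriptor limit that is too low, a reasonable starting point is to inspect the limits in effect. The sketch below is an assumption-laden illustration, not an IBM procedure: the function name show_fd_limits is hypothetical, and the values are only meaningful when it is run as the same user and from the same environment that starts the web server.

```shell
#!/bin/sh
# show_fd_limits
# Print the soft and hard per-process limits on open file descriptors
# for the current shell; processes started from it inherit these limits.
show_fd_limits() {
    echo "soft limit: $(ulimit -S -n)"
    echo "hard limit: $(ulimit -H -n)"
}
```

If the soft limit is small relative to the number of connections and files each httpd process handles, raising it in the environment that launches IBM HTTP Server is the usual next step; system-wide limits are configured in an operating system-specific way.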

Operating system-specific notes

Solaris

Solaris 10

Solaris 10 has an apparent problem, seen both on SPARC and x64 platforms, which results in the ECONNREFUSED failure even under relatively light loads. This issue is tracked by Sun under bug id 6460268. Customers encountering the "Connection refused: SSL0600S" message on Solaris 10 should check with Sun on the availability of a fix for this problem.

March 14, 2007 status: Sun reported that fixes for this bug are now in the development build of Solaris. The fixes have not yet been backported to Solaris 10.

June 1, 2007 status: Sun reported that a test/temporary patch is currently available for Solaris 10; it will be integrated in the next update of Solaris 10, which will probably release in July.

Solaris 8 and 9

These levels of Solaris have a hard-coded queue length for the number of connections to an AF_UNIX socket. This hard-coded queue length is 32. The ECONNREFUSED failure will occur with 33 or more simultaneous attempts to communicate with sidd. This is tracked by Sun bug id 4352289. A fix is available for Solaris 9.

Linux

On Linux systems tested (2.4 and 2.6 kernels), the ECONNREFUSED error can only occur due to a configuration problem and/or the sidd process exiting. It will not occur intermittently, because the AF_UNIX support in the kernel will block a thread waiting to connect to sidd once the connect queue becomes full.