It was a monday early morning , I got a call from my SL1 team in offshore . It was regarding a unique issue of one application which is hosted in Apache tomcat platform . After the initial investigation ,team updated the following and escalated the ticket to my queue.
1) Site was not loading / 500 Internal Server Error
2) Apache error log was throwing “ Premature End of Script “
I started to work on that issue and found that particular “php” processes for that site was hitting the Maximum allowed connections in Apache . I couldnt find anymore info from apache logs .
For example , if adminlogs.info is the site and “ admin “ is the username , 50 is the maximum allowed number of connections.
# Ps aux | grep admin | wc -l
51
I decided to kill all these processes and restart apache . As expected , the site started to work as normal . But after few minutes its again hit the limit 50 !!!! I felt something is stuck somewhere and the process cycle is not completing properly .
And finally it was time to take a deeper look on the issue !!
strace : strace is an excellent diagnostic tool for linux admins which will trace the system calls and signals.
I decided to use strace to dig the above mentioned process , for example if the pid is “12345 “
1) [email protected]:~ # strace -f -p 12345
Process 27776 attached – interrupt to quit
select (1024, [13], [], NULL, NULL
here it clearly shows the system call is stuck with the ” Select ” query . Unfortunately , in the earlier stage of investigation we were not able to find/understand the file descriptor #13.
( In the above strace output , the first argument (1024) is the max number of file descriptors in a set, the second ([13]) is the set of file descriptors polled for reading, the third ([] – empty set) is the set of file descriptors polled for writing. )
2) We can see the details of file descriptor (13) using the ” lsof ” command as follows
[email protected]:~ # lsof -p 27776
php-5.3.6 27009 dw02290c 13u IPv4 1612251401 TCP server.web-global.com:49062->ldap-global4.com:ldap(ESTABLISHED)
The above ” lsof ” output revealed that the file descriptor “13u ” is stuck with a system call to a remote/backend ldap server . And due to which the process does not get any response from the remote ldap server and hence it became stuck/hung .
3) I confirmed the above connections using netstat command also
[email protected]:~ # netstat -a | grep ldap-global4.com
tcp 0 0 server.web-global.com:49062 ldap-global4.com:ldap ESTABLISHED
4) Then we advised the client to remove the above problem ldap server “ ldap-global4.com” from the configured server pool of their application ( Issue was present in the staging site also ) . And later deployed the changes to live site after testing in staging site/ server.
Hope this would be helpful to some of my friends facing such issues in future.
Recent Comments