STrace : Third Eye of a System Admin

 

It was a monday early morning , I got a call from my SL1 team in offshore . It was regarding a unique issue of one application which is hosted in Apache tomcat platform .  After the initial investigation ,team updated the following and escalated the ticket to my queue.

1) Site was not loading / 500 Internal Server Error
2) Apache error log was throwing “ Premature End of Script “

I started to work on that issue and found that particular “php” processes for that site was hitting the Maximum allowed connections in Apache . I couldnt find anymore info from apache logs .

For example , if adminlogs.info is the site and “ admin “ is the username , 50 is the maximum allowed number of connections.

# Ps aux | grep admin | wc -l
51

I decided to kill all these processes and restart apache . As expected , the site started to work as normal . But after few minutes its again hit the limit 50 !!!!  I felt something is stuck somewhere and the process cycle is not completing properly .

And finally it was time to take a deeper look on the issue !!

strace : strace is an excellent diagnostic tool for linux admins which will trace the system calls and signals.

I decided to use strace to dig the above mentioned process , for example if the pid is “12345 “

1) [email protected]:~ # strace -f -p 12345
Process 27776 attached – interrupt to quit
select (1024, [13], [], NULL, NULL

here it clearly shows the system call is stuck with the ” Select ” query . Unfortunately , in the earlier stage of investigation we were not able to find/understand the file descriptor #13.

( In the above strace output , the first argument (1024) is the max number of file descriptors in a set, the second ([13]) is the set of file descriptors polled for reading, the third ([] – empty set) is the set of file descriptors polled for writing. )

2) We can see the details of file descriptor (13) using  the ” lsof ” command as follows

[email protected]:~ # lsof  -p 27776

php-5.3.6 27009 dw02290c   13u IPv4 1612251401                 TCP server.web-global.com:49062->ldap-global4.com:ldap(ESTABLISHED)

The above ” lsof ” output revealed that the file descriptor “13u ”  is stuck with a system call to a remote/backend ldap server . And due to which the process does not get any response from the remote ldap server and hence it became stuck/hung .

3)  I confirmed the above connections using netstat command also

[email protected]:~ # netstat -a | grep  ldap-global4.com
tcp        0      0 server.web-global.com:49062  ldap-global4.com:ldap  ESTABLISHED

4)  Then we advised the client to remove the above problem ldap server “ ldap-global4.com” from the configured server pool of their  application ( Issue was present in the staging site also ) . And later deployed the changes to live site after testing in staging site/ server.

Hope this would be helpful to some of my friends facing such issues in future.