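I had long wondered why Dgraph processes turn into zombies but never found the cause. Now I've come across a very plausible(?) explanation.
During indexing, the Dgraph goes through a stop -> start cycle. If a large job is sitting in that Dgraph's queue when it is stopped, the process may not exit within the time limit (30 seconds by default).
When that happens, the old Dgraph becomes a zombie (defunct) process and a new Dgraph is started. Endeca explains that this defunct Dgraph does not use resources such as memory, disk, or ports, so it causes no problem for the system. A zombie process still feels a little uneasy, but since it reportedly doesn't affect system operation, it doesn't seem like much to worry about.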

Reference: http://gen.endeca.com/help/index.jsp?topic=/com.endeca.IAP.Admin.doc/src/ciad_removing_defunct_eac_processes.html

On UNIX systems, the ps command may report a number of defunct EAC-originated processes. This is known and expected EAC behavior and it does not necessarily indicate a problem.

For example, you might see the following output from the ps command:
> ps -ef | grep endeca
endeca 1924 1875 0 - ? 2:00 <defunct>
[...]
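You may want to check for these from a script rather than by reading ps output by eye. Here is a small sketch that picks the <defunct> entries out of `ps -ef` text.

It assumes the standard `ps -ef` column order (UID, PID, PPID, ...) and uses the user name `endeca` from the example above as the default. The PPID column matters because it identifies the parent process that has not yet reaped the defunct entry.

```python
def defunct_processes(ps_output: str, user: str = "endeca") -> list[tuple[int, int]]:
    """Return (pid, ppid) pairs for <defunct> entries owned by `user`.

    Assumes `ps -ef` column order: UID PID PPID C STIME TTY TIME CMD.
    """
    result = []
    for line in ps_output.splitlines():
        fields = line.split()
        # skip the header line and anything not owned by the given user
        if len(fields) >= 3 and fields[0] == user and "<defunct>" in line:
            result.append((int(fields[1]), int(fields[2])))
    return result
```

To run it against a live system, pass in `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`.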
Additionally, warning messages of this form appear in the $ENDECA_CONF/logs/process.0.log file on the affected server:
Apr 17, 2009 11:24:17 AM
com.endeca.esf.delegate.procctrl.ExecutableProcessHandle
tryCleanShutdown
WARNING: Process 1924 did not shutdown cleanly after 30 seconds.
Terminating forcefully.
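To find out which PIDs the EAC force-killed, you can pull them out of process.0.log. The sketch below assumes the warning keeps exactly the wording shown above; the regex is mine, not part of any Endeca tool.

```python
import re

# Matches the EAC warning line shown above and captures the PID and the timeout.
WARNING_RE = re.compile(
    r"WARNING: Process (\d+) did not shutdown cleanly after (\d+) seconds\."
)

def forcibly_terminated_pids(log_text: str) -> list[int]:
    """Return PIDs that the EAC reported as not shutting down cleanly."""
    return [int(m.group(1)) for m in WARNING_RE.finditer(log_text)]
```

You could read the log with `open(os.path.join(os.environ["ENDECA_CONF"], "logs", "process.0.log")).read()` and compare the resulting PIDs against the <defunct> entries in the `ps -ef` output.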

The cause of these warning messages is as follows. When the EAC shuts down a child process like a Dgraph, it initially sends the correct exit command for the process (admin?op=exit in the case of the Dgraph) and waits 30 seconds for the process to exit. However, if the Dgraph is processing a long-running query, or if its request queue is long, it may not be able to shut down within 30 seconds.
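The shutdown sequence described here works like this: send the exit command, wait up to 30 seconds, then kill. Below is a rough Python sketch of that pattern. It is not EAC's actual implementation, and the `host`/`port` values for the `admin?op=exit` URL are hypothetical.

One difference is deliberate. After the forced kill, this sketch calls `wait()` again to reap the child. A parent that skips that step is what leaves a <defunct> entry behind.

```python
import subprocess
import urllib.request

DEFAULT_SHUTDOWN_TIMEOUT_SECS = 30  # the EAC default described above

def request_exit(host: str, port: int) -> None:
    """Send the Dgraph's admin?op=exit command (host and port are hypothetical)."""
    urllib.request.urlopen(f"http://{host}:{port}/admin?op=exit", timeout=5).close()

def stop_process(proc: subprocess.Popen, send_exit,
                 timeout: float = DEFAULT_SHUTDOWN_TIMEOUT_SECS) -> bool:
    """Ask proc to exit, wait up to `timeout` seconds, then kill it.

    Returns True if the process exited on its own, False if it had to be killed.
    """
    try:
        send_exit()
    except OSError:
        pass  # the exit request failed (URLError is an OSError); still wait, then kill
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        print(f"WARNING: Process {proc.pid} did not shutdown cleanly "
              f"after {timeout} seconds. Terminating forcefully.")
        proc.kill()
        proc.wait()  # reap the killed child so it does not stay <defunct>
        return False
```

For a real Dgraph you would pass something like `lambda: request_exit("localhost", 15000)`, with the port again just an example.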

If the process does not exit after 30 seconds, the EAC logs the warning message shown above and then kills the process with the operating system's kill command. When this occurs, the affected process is reported by ps as being in a <defunct> state. In this state, it does not use memory, disk space, or ports and should not be a problem for the system.
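To see what <defunct> actually means, the sketch below reproduces it on Linux (it reads /proc, so it is Linux-only). It kills a child process and deliberately does not call `wait()`.

Until the parent reaps it, the child stays in state `Z`. At that point it only occupies a process-table entry, which fits the explanation that it holds no memory, disk, or ports.

```python
import os
import signal
import subprocess
import time

def proc_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # the state field follows the ')' closing the command name, which may contain spaces
    return stat.rsplit(")", 1)[1].split()[0]

def make_zombie() -> subprocess.Popen:
    """Start a child, SIGKILL it, and intentionally skip wait() so it stays defunct."""
    child = subprocess.Popen(["sleep", "60"])
    os.kill(child.pid, signal.SIGKILL)
    for _ in range(50):  # give the kernel up to ~1 s to mark it as a zombie
        if proc_state(child.pid) == "Z":
            break
        time.sleep(0.02)
    return child
```

Calling `child.wait()` on the returned object reaps the zombie, and the entry disappears from ps.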

Alternatively, this can happen if you kill the EAC process directly rather than by using the shutdown.sh script. In this case the EAC process terminates immediately, leaving any child processes in a <defunct> state.

To avoid defunct EAC processes, consider the following recommendations:
  • For a Dgraph or Agraph, the request log shows whether queuing or long processing times are preventing the Dgraph from responding in time to the admin?op=exit command. If this is the case, spreading traffic over a larger number of MDEX Engine mirrors (for queuing) or reducing query complexity (for long processing times) should allow the Dgraph to respond more quickly to the exit command.
  • Another option may be to override the default 30-second timeout period for EAC shutdowns by modifying the value of the com.endeca.eac.process.shutdownTimeoutSecs setting in your server's $ENDECA_CONF/conf/eac.properties file.

    This value shows the length of time in seconds that the EAC Agent on that server waits for a process to exit. Specifying a higher value for this setting may help prevent creation of <defunct> EAC child processes, but may also make EAC updates slower, because the EAC Agent will wait longer for all processes to exit.

    Note: Modifications to this setting will not take effect until the Endeca HTTP service (which contains the EAC) is restarted on the server.
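If you change the timeout from a script, it can look roughly like the sketch below. It only handles simple `key=value` lines, not the full Java .properties syntax (`key: value`, line continuations, escapes). The function names are hypothetical; only the property key and the 30-second default come from the documentation above.

```python
TIMEOUT_KEY = "com.endeca.eac.process.shutdownTimeoutSecs"
DEFAULT_SHUTDOWN_TIMEOUT_SECS = 30  # used when the key is absent

def shutdown_timeout(properties_text: str) -> int:
    """Read the EAC shutdown timeout from eac.properties text (key=value lines only)."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == TIMEOUT_KEY:
            return int(value.strip())
    return DEFAULT_SHUTDOWN_TIMEOUT_SECS

def set_shutdown_timeout(properties_text: str, secs: int) -> str:
    """Return the text with the timeout set to `secs`, appending the key if missing."""
    lines = properties_text.splitlines()
    new_line = f"{TIMEOUT_KEY}={secs}"
    for i, raw in enumerate(lines):
        if raw.strip().partition("=")[0].strip() == TIMEOUT_KEY:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

Writing the result back to $ENDECA_CONF/conf/eac.properties still requires restarting the Endeca HTTP service before it takes effect.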


