Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG timeout
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-07-08 11:49:20


Several things are going on here. First, this error message:

> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal
> 6 (Aborted).
> 2 additional processes aborted (not shown)

indicates that your application procs are aborting for some reason. The
system then attempts to shut down and somehow gets itself "hung", hence
the timeout error message.
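
To see what "exited on signal 6 (Aborted)" means in isolation, here is a
minimal Python sketch (not Open MPI code) in which a child process calls
abort() and the parent inspects how it died. The helper name
run_aborting_child is made up for illustration, and the sketch assumes a
POSIX system.

```python
import signal
import subprocess
import sys


def run_aborting_child():
    """Start a child that calls abort() and return its exit status.

    Hypothetical helper for illustration only; assumes a POSIX system.
    """
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,  # hide any abort chatter from the child
    )
    return child.returncode


if __name__ == "__main__":
    rc = run_aborting_child()
    if rc < 0:
        # On POSIX, a negative return code means "killed by signal -rc".
        print(f"child exited on signal {-rc} ({signal.Signals(-rc).name})")
    else:
        print(f"child exited with status {rc}")
```

On a typical Linux box the signal reported should be SIGABRT, which is
what an application gets from a failed assert() or an explicit abort()
call. That is the place to start looking in your own code.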

I'm not sure that increasing the timeout value will help in this situation.
Unfortunately, 1.2.x has problems with this scenario (1.3 is -much- better!
;-)). If you want to try adjusting the timeout anyway, you can do so with:

mpirun -mca orte_abort_timeout x ...

where x is the desired timeout in seconds.

Hope that helps.
Ralph

On 7/8/08 8:55 AM, "Alastair Basden" <a.g.basden_at_[hidden]> wrote:

> Hi,
> I've got some code that uses Open MPI, and sometimes it crashes after
> printing something like:
>
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1166
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line
> 90
> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal
> 6 (Aborted).
> 2 additional processes aborted (not shown)
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1198
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job. Returned
> value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
> In this case, all processes were running on the same machine, so it's not a
> connection problem. Is this a bug, or something else wrong? Is there a
> way to increase the timeout time?
>
> Thanks...