Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun hang up randomly
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-07-09 05:20:29


Although your method starts mpirun itself under nohup, the MPI processes are not launched that way and therefore run in the foreground. This message indicates that at least one of those MPI processes received a hangup signal (SIGHUP, signal 1) and aborted. mpirun does not receive the signal itself, but it does detect that an MPI process terminated abnormally, and it shuts down the job.

I'm afraid you'll have to figure out why your MPI processes are getting hangup signals.
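
One way to start narrowing this down is to have a process report how it is set up to treat SIGHUP and to record when a hangup actually arrives. The following is only a minimal sketch of that idea in Python, using the standard `signal` module; the helper names (`sighup_disposition`, `install_sighup_logger`, `log_sighup`, `received`) are made up for illustration, and a real MPI application would need the equivalent calls in its own language. It does not say anything about how Open MPI itself handles signals.

```python
import os
import signal
import time


def sighup_disposition():
    """Describe how this process currently treats SIGHUP."""
    current = signal.getsignal(signal.SIGHUP)
    if current == signal.SIG_IGN:
        return "ignored"   # what a process started directly under nohup usually sees
    if current == signal.SIG_DFL:
        return "default"   # default action: the process is terminated
    if current is None:
        return "unknown"   # handler was installed outside of Python
    return "handler"


# (signal number, timestamp) pairs recorded by the handler below.
received = []


def log_sighup(signum, frame):
    # Record the event instead of dying, so the time of the hangup can be
    # compared with logins, logouts, or batch-system events on the node.
    received.append((signum, time.time()))


def install_sighup_logger():
    """Replace the current SIGHUP disposition with the logging handler."""
    return signal.signal(signal.SIGHUP, log_sighup)


if __name__ == "__main__":
    print("SIGHUP disposition at startup:", sighup_disposition())
    install_sighup_logger()
    os.kill(os.getpid(), signal.SIGHUP)  # simulate a hangup sent to this process
    time.sleep(0.1)                      # give the handler a chance to run
    print("hangups recorded:", len(received))
```

If the disposition printed at startup is "default" rather than "ignored", the process is not protected by nohup, which would be consistent with the behavior described above.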

On Jul 8, 2010, at 11:25 PM, Harichand M V wrote:

> Hi,
>
> I am randomly getting hangups in my MPI job.
>
>
> ..............
> ...........
> IT:20760 CF: 0.7743 Time: 1540.0 MaxMin:20.69/5 :20.12/12
> IT:20770 CF: 0.7734 Time: 1560.2 MaxMin:20.50/1 :19.31/5
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 9399 on node node1 exited on signal 1 (Hangup).
> --------------------------------------------------------------------------
> [node1:09356] filem:rsh: close()
> [node1:09356] mca: base: close: component rsh closed
> [node1:09356] mca: base: close: unloading component rsh
> [node1:09356] mca: base: close: component default closed
> [node1:09356] mca: base: close: unloading component default
> [node1:09356] mca: base: close: component hnp closed
> [node1:09356] mca: base: close: unloading component hnp
> [node1:09356] mca: base: close: component round_robin closed
> [node1:09356] mca: base: close: unloading component round_robin
> [node1:09356] mca: base: close: component rsh closed
> [node1:09356] mca: base: close: unloading component rsh
> [node1:09356] mca: base: close: component default closed
> [node1:09356] mca: base: close: unloading component default
> [node1:09356] mca: base: close: component bad closed
> [node1:09356] mca: base: close: unloading component bad
> [node1:09356] mca: base: close: unloading component binomial
> [node1:09356] mca: base: close: component tcp closed
> [node1:09356] mca: base: close: unloading component tcp
> [node1:09356] mca: base: close: component oob closed
> [node1:09356] mca: base: close: unloading component oob
> [node1:09356] mca: base: close: unloading component auto_detect
> [node1:09356] mca: base: close: unloading component linux
>
> I am using Open MPI version 1.2.7 over InfiniBand.
> I was running the application on 15 nodes.
>
> The job is started using nohup to run it in the background.
>
> Thanks in advance
> Harichand M V