Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job launch failures
From: Bharath Ramesh (bramesh_at_[hidden])
Date: 2013-02-14 13:58:52


After manually fixing some of the issues, I see that the failed
nodes never receive the command to launch their local processes.
I am going to ask the admins to check the logs for any dropped
connections.
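
To narrow down which nodes are affected, one option is to compare the hosts Torque allocated against the hosts that actually produced output from a hello world run. The following is only a rough sketch, not something taken from Open MPI or Torque: the function names are made up for illustration, and it assumes each MPI rank prints its hostname on its own line and that short and fully qualified names refer to the same node.

```python
def short_name(host):
    """Reduce a hostname to its lowercased short form (text before the first dot)."""
    return host.strip().split(".", 1)[0].lower()


def unique_hosts(lines):
    """Return the distinct short hostnames found in an iterable of lines, in order."""
    hosts = []
    for line in lines:
        name = short_name(line)
        # Skip blank lines and hosts already recorded (Torque lists one line per slot).
        if name and name not in hosts:
            hosts.append(name)
    return hosts


def nodes_without_ranks(nodefile_lines, rank_host_lines):
    """Return allocated hosts from which no rank reported a hostname.

    nodefile_lines:  lines of the Torque node file (one hostname per slot).
    rank_host_lines: hostnames printed by the ranks (assumed: one per line).
    """
    reported = set(unique_hosts(rank_host_lines))
    return [host for host in unique_hosts(nodefile_lines) if host not in reported]
```

Inside a job, the first argument could come from the file named by the PBS_NODEFILE environment variable and the second from the captured job output; the hosts returned are the ones to check first for blocked or dropped connections.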

On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
> Sounds like the orteds aren't reporting back to mpirun after launch. The MPI_proctable observation just means that the procs didn't launch in those cases where it is absent, which is something you already observed.
>
> Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted report back to mpirun after it launches. If not, then it is likely that something is blocking it.
>
> You could also try updating to 1.6.3/4 in case there is some race condition in 1.6.1, though we haven't heard of one to date.
>
>
> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bramesh_at_[hidden]> wrote:
>
> > On our cluster we are noticing intermittent job launch failures when using OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and it is integrated with Torque-4.1.3. It fails even for a simple MPI hello world application. The issue is that orted gets launched on all the nodes, but there are a bunch of nodes that don't launch the actual MPI application. There are no errors reported when the job gets killed because the walltime expires. Enabling --debug-daemons doesn't show any errors either. The only difference is that successful runs have MPI_proctable listed, and for failures it is absent. Any help in debugging this issue is greatly appreciated.
> >
> > --
> > Bharath
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
Bharath

