Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] intermittent node file error running with torque/maui integration
From: Gus Correa (gus_at_[hidden])
Date: 2013-09-20 11:52:11


Hi Noam

Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?

What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or maybe better, an "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
just to make sure the file is available and
filled with the node list before mpiexec takes over?
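
Something along these lines could go in the job script in place of a fixed sleep. This is only a rough sketch, assuming a POSIX shell; the function name wait_for_nodefile and its default retry count and delay are made up for illustration and are not part of Torque or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not provided by Torque or Open MPI):
# poll until the node file exists and is non-empty, or give up.
wait_for_nodefile() {
    nodefile=$1
    tries=${2:-10}   # assumed default: 10 attempts
    delay=${3:-1}    # assumed default: 1 second between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s is true only if the file exists and has a size greater than zero
        if [ -s "$nodefile" ]; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "node file '$nodefile' missing or empty after $tries tries" >&2
    return 1
}
```

In the job script you would call wait_for_nodefile "$PBS_NODEFILE" || exit 1, then the ls -l and cat, and only then mpiexec. If the failures persist even when the check passes, the delay hypothesis is probably wrong and the cause lies elsewhere.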

My two cents,
Gus Correa

On 09/20/2013 09:55 AM, Noam Bernstein wrote:
> Hi - we've been using openmpi for a while, but only for the last few months
> with torque/maui. Intermittently (maybe 1/10 jobs), we get mpi jobs that fail with the error:
>
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around. The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
> echo $PBS_NODEFILE
> cat $PBS_NODEFILE
> it appears that the file is present and correct.
>
> We're running OpenMPI 1.6.4, configured with
> ./configure \
> --prefix=${DEST} \
> --with-tm=/usr/local/torque \
> --enable-mpirun-prefix-by-default \
> --with-openib=/usr \
> --with-openib-libdir=/usr/lib64
>
> Has anyone seen anything like this before, or has any ideas of what might
> be happening? It appears to be a line where openmpi looks for
> the PBS node file, which is on a local filesystem (e.g. PBS_NODEFILE=/var/spool/torque/aux//4600.tin).
>
> thanks,
> Noam
>
>
>
> Noam Bernstein
> Center for Computational Materials Science
> NRL Code 6390
> noam.bernstein_at_[hidden]