Open MPI User's Mailing List Archives

Subject: [OMPI users] intermittent node file error running with torque/maui integration
From: Noam Bernstein (noam.bernstein_at_[hidden])
Date: 2013-09-20 09:55:44


Hi - we've been using Open MPI for a while, but only for the last few months
with torque/maui. Intermittently (roughly 1 in 10 jobs), MPI jobs fail with the error:

[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194

The failure is not reproducible: resubmitting the same job almost
always works the second time around. The failing line appears to be
associated with looking for the torque/maui generated node file,
but when I do something like
  echo $PBS_NODEFILE
  cat $PBS_NODEFILE
it appears that the file is present and correct.
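
One way to gather more evidence on the failing jobs is to check the node file from inside the job script immediately before mpirun starts, and to log what is seen when the check fails. The following is a minimal sketch, not a known fix: the helper name wait_for_nodefile, the retry count, and the delay are all hypothetical, and it assumes (without confirmation) that the file might be briefly missing or unreadable at job start. Open MPI's tm module opens the file itself, so this only helps narrow down the timing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or Open MPI): wait until the
# node file is readable and non-empty, logging each failed attempt.
#   $1 = path to node file
#   $2 = maximum number of attempts (default 5)
#   $3 = delay between attempts in seconds (default 1)
wait_for_nodefile() {
    nodefile=$1
    attempts=${2:-5}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -r: readable by this user; -s: exists and has size > 0
        if [ -r "$nodefile" ] && [ -s "$nodefile" ]; then
            return 0
        fi
        echo "node file '$nodefile' not usable (attempt $i of $attempts)" >&2
        # Record what the filesystem reports, for later diagnosis.
        ls -l "$nodefile" >&2
        # Sleep only if another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage inside a Torque job script (commented out so the block is inert):
#   wait_for_nodefile "$PBS_NODEFILE" || exit 1
#   mpirun ./my_app    # ./my_app is a placeholder program name
```

If the check never fails on a job where mpirun still reports the file open failure, that would suggest the problem lies somewhere other than the file being absent at job start.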

We're running Open MPI 1.6.4, configured with:
./configure \
        --prefix=${DEST} \
        --with-tm=/usr/local/torque \
        --enable-mpirun-prefix-by-default \
        --with-openib=/usr \
        --with-openib-libdir=/usr/lib64

Has anyone seen anything like this before, or does anyone have ideas about
what might be happening? The error seems to come from a line where Open MPI
looks for the PBS node file, which is on a local filesystem (e.g. PBS_NODEFILE=/var/spool/torque/aux//4600.tin).

                                                                        thanks,
                                                                        Noam

Noam Bernstein
Center for Computational Materials Science
NRL Code 6390
noam.bernstein_at_[hidden]