Brock and I talked about this on IM -- the preferred solution would be
to set the cluster nodes' limits.conf to allow interactive logins to
have unlimited locked memory. That would fix the OFED issue.
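A minimal sketch of that limits.conf change (assuming the standard pam_limits mechanism is applied to the login path; the `*` wildcard is illustrative and could be narrowed to a specific group):

```
# /etc/security/limits.conf -- allow unlimited locked (pinned) memory.
# Applies to new sessions that go through pam_limits; OFED needs a
# large/unlimited memlock limit to register RDMA buffers.
*   soft   memlock   unlimited
*   hard   memlock   unlimited
```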
On Aug 27, 2009, at 11:01 AM, Brock Palen wrote:
> We have a case where we need to spawn many randomly allocated MPI jobs
> within the same PBS job. (I have talked to the user about changing
> this behavior.)
> The code works if I do:
> pbsdsh -n $(($GROUP*$JOBSIZE-$JOBSIZE)) \
> mpirun \
> -wdir $PWD/$GROUP \
> --mca plm ^tm \
> --mca ras ^tm \
> --hostfile $PWD/nodefile.$GROUP \
> ./swjv_aim &
> The problem is that on our system only the pbs_mom starts with the
> correct amount of pinned/locked memory for OFED, so bypassing the tm
> ras causes OFED to fail on us.
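One quick way to see the limit in question (a hypothetical check, not part of the original thread) is to print the locked-memory ulimit in the shell:

```shell
# Print the max locked memory for this shell ("unlimited" or a size
# in KB). OFED's RDMA buffer registration fails when this is too small.
ulimit -l
```

Comparing this value in an interactive login versus under a process launched by pbs_mom (e.g. via pbsdsh inside a job) would show whether the two environments inherit different memlock limits.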
> I tried removing --mca plm ^tm, which I would think would make it use
> tm to launch processes while still reading from the nodefile (which is
> built dynamically in the PBS script from PBS_NODEFILE), but when run
> that way mpirun fails with:
> [nyx0407.engin.umich.edu:07392] plm:tm: failed to poll for a spawned
> daemon, return status = 17002
> In the pbs_mom logs I see this error:
> 08/27/2009 10:53:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
> descriptor (9) in tm_request, bad header Negative sign on an unsigned
> Is there a way to tell Open MPI to start only on these hosts from your
> PBS job, while still using tm?
> Brock Palen
> Center for Advanced Computing
> users mailing list