On 09/20/2013 12:48 PM, Noam Bernstein wrote:
> On Sep 20, 2013, at 11:52 AM, Gus Correa <gus_at_[hidden]> wrote:
>> Hi Noam
>> Could it be that Torque, or probably more likely NFS,
>> is too slow to create/make available the PBS_NODEFILE?
>> What if you insert a "sleep 2",
>> or whatever number of seconds you want,
>> before the mpiexec command line?
>> Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
>> just to make sure the file is available and
>> filled with the node list, before mpiexec takes over?
> I don't see how NFS could be involved, since it's on a local filesystem.
> As for adding a sleep, I already tried that - if the file doesn't exist, I sleep a few
> seconds and check again, and in every case if it's not there to begin with it's not
> there the second time either. And none of this explains the very
> mysterious, even less frequent situation where I can cat the file, but
> mpirun can't find it.
I only read the full email exchange after I sent my message.
Now I see it is not over NFS but local.
Still, a communication delay (which can be non-deterministic)
between pbs_server and the local pbs_mom on the node could be
behind the problem (say, if the server first authorizes the
node to start the job and only then copies the
node file over, which may take some time,
depending on the network traffic).
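
A minimal sketch of how the delay hypothesis could be checked from inside the job script: poll for the node file several times and log a timestamp for each attempt, so the job output shows whether the file appears late or never appears at all. The function name wait_for_nodefile and its retry/delay arguments are hypothetical, chosen only for illustration; they are not part of Torque or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or Open MPI): poll for a node file
# and log each attempt to stderr with a timestamp.
# Usage: wait_for_nodefile FILE [MAX_TRIES] [DELAY_SECONDS]
wait_for_nodefile() {
    file=$1
    max_tries=${2:-10}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$max_tries" ]; do
        # -s is true only if the file exists and is non-empty
        if [ -n "$file" ] && [ -s "$file" ]; then
            echo "$(date '+%H:%M:%S') try $i: found $file" >&2
            ls -l "$file" >&2
            return 0
        fi
        echo "$(date '+%H:%M:%S') try $i: '${file}' missing or empty" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Sketch of use in a job script, before the mpiexec line:
#   wait_for_nodefile "$PBS_NODEFILE" 10 2 || exit 1
```

If the log shows the file turning up after a few tries, a delay between pbs_server and pbs_mom looks plausible; if it never turns up, as in the retries already described in this thread, the cause is more likely elsewhere.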