Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI looking for PBS file?
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-14 18:54:11


Just to clarify: OMPI is launched with either the mpirun or the mpiexec command, so long as your path points to the correct OMPI installation. That looks to be the case here, since the error message comes from us.

It really, really helps if you tell us what version of OMPI you are using. Some older versions have known bugs, and the 1.3/1.4 series treats hostfiles differently than earlier series.

OMPI's Torque support will always look for the PBS_NODEFILE as given in the environment by PBS. You don't need to copy it anywhere or specify it with -machinefile. We will abort if we cannot find that file as it indicates to us that something is wrong with the PBS environment.

So the real question is: why are we not able to find the PBS_NODEFILE? Did you move it instead of copying it? Or is the envar not being set?
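
A quick check along these lines, run inside the job script before the launch, would help tell those cases apart. This is only a sketch: the function name check_pbs_nodefile is made up for illustration, and it assumes a POSIX shell on the node where the job script runs.

```shell
# Hypothetical helper (not part of OMPI or Torque): report whether the
# PBS_NODEFILE environment variable is set and points at a readable file.
check_pbs_nodefile() {
    # Case 1: the envar is missing or empty in this environment.
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set"
        return 1
    fi
    # Case 2: the envar is set, but the file is gone or unreadable
    # (for example, it was moved rather than copied).
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is set to $PBS_NODEFILE but the file is not readable"
        return 2
    fi
    # Otherwise report where the file is and how many lines it holds.
    echo "PBS_NODEFILE=$PBS_NODEFILE lists $(wc -l < "$PBS_NODEFILE") line(s)"
    return 0
}
```

Calling check_pbs_nodefile on the line just before mpiexec in the job script shows which case applies. A return status of 1 points at the environment, and 2 points at the file itself.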

On Mar 14, 2010, at 3:20 PM, Josh Bernstein wrote:

> Hi John,
>
> Mpiexec isn't needed with OMPI. In fact, if you are using the one from OSC, it only works with MPICH.
>
> Instead just build OMPI with --with-tm, and it will link against TORQUE and start up and track jobs properly.
>
> -Joshua Bernstein
> Penguin Computing
>
> On Mar 14, 2010, at 21:35, "John R. Cary" <cary_at_[hidden]> wrote:
>
>> I have a script that launches a bunch of runs on some compute nodes of
>> a cluster. Once I get through the queue, I query PBS for my machine
>> file, then I copy that to a local file 'nodes' which I use for mpiexec:
>>
>> mpiexec -machinefile /home/research/cary/projects/vpall/vptests/nodes -np 6 /home/research/cary/projects/vpall/builds/vorpal/par/vorpal/vorpal -i bathtubAntenna.in -dim 2 -o bathtubAntenna2p -n 100 -d 100
>>
>> but this fails with
>>
>> [node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 153
>> [node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 87
>> [node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 133
>> [node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 72
>> [node47:07004] [[25769,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/plm/tm/plm_tm_module.c at line 167
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> The appropriate code snippet is
>>
>>     /* setup the full path to the PBS file */
>>     filename = opal_os_path(false, mca_ras_tm_component.nodefile_dir,
>>                             pbs_jobid, NULL);
>>     fp = fopen(filename, "r");
>>     if (NULL == fp) {
>>         ORTE_ERROR_LOG(ORTE_ERR_FILE_OPEN_FAILURE);
>>         free(filename);
>>         return ORTE_ERR_FILE_OPEN_FAILURE;
>>     }
>>
>> which kind of looks like it might be trying to open my pbs file instead
>> of the file I gave on the command line? I really don't know, but does
>> anyone have any ideas here?
>>
>> Thx....John Cary
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users