
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-12-05 10:30:46

I believe the problem is that you are only applying the MCA
parameters to the first app context instead of all of them:

  shell$ mpiexec -v -d -machinefile $PBS_NODEFILE \
      -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh \
      -np 6 ./hwc.exe : -np 2 ./hwc.exe

The '-mca' option applies its MCA parameter only to the app context
in which it appears. In the example you gave (above), the first app
context (-np 6 ./hwc.exe) receives these parameters and the second
(-np 2 ./hwc.exe) does not. It is likely that you need these
parameters specified for all app contexts.

There are two main ways of doing this:
1) The most common way is to use the '-gmca' option, which marks
the MCA parameter that follows it as global across all app contexts:

  shell$ mpiexec -v -d -machinefile $PBS_NODEFILE \
      -gmca oob_tcp_if_exclude eth0 -gmca pls_rsh_agent ssh \
      -np 6 ./hwc.exe : -np 2 ./hwc.exe

2) Alternatively, you can duplicate the MCA parameters for each app
context:

  shell$ mpiexec -v -d -machinefile $PBS_NODEFILE \
      -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 6 ./hwc.exe : \
      -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 2 ./hwc.exe
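
When duplicating the parameters by hand, it is easy to let the two
copies drift apart. A minimal sketch of one way to avoid that is to
keep them in a shell variable and reuse it in every app context. The
variable names MCA_ARGS and CMD are hypothetical, chosen only for this
illustration; the MCA parameters and values are the ones from this
thread. The command is built as a string and printed so it can be
checked before launching.

```shell
# Hypothetical helper variable holding the per-app-context MCA options.
MCA_ARGS="-mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh"

# Repeat the same options on both sides of the ':' separator.
# \$PBS_NODEFILE is left unexpanded here so the printed command is readable.
CMD="mpiexec -v -d -machinefile \$PBS_NODEFILE $MCA_ARGS -np 6 ./hwc.exe : $MCA_ARGS -np 2 ./hwc.exe"

# Print the command for inspection; to launch it, run: eval "$CMD"
echo "$CMD"
```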

If these MCA parameters are required for every run of Open MPI on
your system you may consider putting them in the default MCA file,
see point 4 in the following FAQ:
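
As a rough sketch of the default-file approach: the file location is
not given in this message, so the example assumes the per-user file
$HOME/.openmpi/mca-params.conf, with one "name = value" pair per line.
MCA_PARAMS_FILE is a hypothetical variable used only so the target
path can be overridden; check the FAQ for the locations your
installation actually reads.

```shell
# Assumption: per-user MCA parameter file; override MCA_PARAMS_FILE to write elsewhere.
MCA_PARAMS_FILE="${MCA_PARAMS_FILE:-$HOME/.openmpi/mca-params.conf}"
mkdir -p "$(dirname "$MCA_PARAMS_FILE")"

# Append the two parameters from this thread, one "name = value" pair per line.
cat >> "$MCA_PARAMS_FILE" <<'EOF'
oob_tcp_if_exclude = eth0
pls_rsh_agent = ssh
EOF
```

With the parameters set this way, they no longer need to be passed on
the mpiexec command line for each app context.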

Taking a look at the FAQ, it seems that we do not discuss the
difference between the -mca and -gmca mpirun/mpiexec/orterun options.
However, if you run 'mpiexec --help' they will appear in the help
output.
Hope this helps,

On Dec 5, 2007, at 1:50 AM, Brian Dobbins wrote:

> Hi guys,
> I seem to have encountered an error while trying to run an MPMD
> executable through Open MPI's '-app' option, and I'm wondering if
> anyone else has seen this or can verify this?
> Backing up to a simple example, running a "hello world" executable
> (hwc.exe) works fine when run as: (using an interactive PBS
> session with -l nodes=2:ppn=4)
> mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude
> eth0 -mca pls_rsh_agent ssh -np 8 ./hwc.exe
> But when I run what should be the same thing via an '--app' file
> (or the implied command line), like the following, it fails:
> mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude
> eth0 -mca pls_rsh_agent ssh -np 6 ./hwc.exe : -np 2 ./hwc.exe
> My understanding is that these are equivalent, no? But the
> latter example fails with multiple "Software caused connection
> abort (103)" errors, such as the following:
> [xxx:13978] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
> to xx.x.2.81:34103 failed: Software caused connection abort (103)
> Any thoughts? I haven't dug around the source yet since this
> could be a weird problem with the system I'm using. For the
> record, this is with OpenMPI 1.2.4 compiled with PGI 7.1-2.
> As an aside, the '-app' syntax DOES work fine when all copies are
> running on the same node.. for example, having requested 4 CPUs per
> node, if I run "-np 2 ./hwc.exe : -np 2 ./hwc.exe", it works fine.
> And I did also try duplicating the mca parameters after the colon
> since I figured they might not propagate, thus perhaps it was
> trying to use the wrong interface, but that didn't help either.
> Thanks very much,
> - Brian
> Brian Dobbins
> Yale University HPC