Hi guys,
I seem to have encountered an error while trying to run an MPMD
executable through Open MPI's '-app' option, and I'm wondering if
anyone else has seen this or can verify this?
Backing up to a simple example, running a "hello world" executable
(hwc.exe) works fine when run as follows, using an interactive PBS
session requested with -l nodes=2:ppn=4:
mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 8 ./hwc.exe
But when I run what should be the same thing via an '--app' file (or the command line it implies), like the following, it fails:
mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 6 ./hwc.exe : -np 2 ./hwc.exe
My understanding is that these are equivalent, no? But the latter
example fails with multiple "Software caused connection abort (103)"
errors, such as the following:
[xxx:13978] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to xx.x.2.81:34103 failed: Software caused connection abort (103)
Any thoughts? I haven't dug around the source yet since this could
be a weird problem with the system I'm using. For the record, this is
with Open MPI 1.2.4 compiled with PGI 7.1-2.
As an aside, the '-app' syntax DOES work fine when all copies are
running on the same node. For example, having requested 4 CPUs per
node, if I run "-np 2 ./hwc.exe : -np 2 ./hwc.exe", it works fine.
I also tried duplicating the MCA parameters after the colon, since I
figured they might not propagate and the second group might then be
trying to use the wrong interface, but that didn't help either.
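For anyone who wants to try reproducing this with an actual app file rather than the colon syntax, here is a minimal sketch of the form I mean. The file name "my_appfile" is just a placeholder, and I am assuming the usual layout of one application context per line, with '#' starting a comment:

```shell
# Write a two-context app file; "my_appfile" is a placeholder name.
# Each non-comment line is one application context (assumed layout).
cat > my_appfile <<'EOF'
# 6 copies of hwc.exe, then 2 more copies
-np 6 ./hwc.exe
-np 2 ./hwc.exe
EOF
```

This would then be launched with something like "mpiexec --app my_appfile". My reading of the mpirun man page is that other command-line options are ignored when an app file is given, so the machinefile and MCA settings may need to be supplied some other way. I have not verified that behaviour on this system.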
Thanks very much,
- Brian
Brian Dobbins
Yale University HPC