OK Brian,
for the build part, my config.log is attached.
As for the stack trace, this is what gdb gives me with my original compile options:
#0 0xb7d105b9 in orte_pls_rsh_launch ()
from /home/cluster/openmpi/lib/openmpi/mca_pls_rsh.so
and after recompiling with -g:
#0 0xb7ca2599 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:716
716 if (mca_pls_rsh_component.debug) {
which means we have memory corruption somewhere else...
Investigating from the outside what may cause the problem, I found that I
can also make the job run by changing the hostname in my hostfile:
-) No localhost in hostfile -> runs
-) "wowbagger" or "localhost" in hostfile -> runs
-) FQDN wowbagger.cluster in hostfile -> SIGSEGV
I have a private network (10.2.1.0) with the cluster master (the local node)
acting as DNS server, running BIND 9.
# hostname
wowbagger
# host wowbagger
wowbagger.cluster has address 10.2.1.100
# mpirun --hostfile wrf_openmpi.mac -np 10 -bynode wrf.exe
mpirun noticed that job rank 0 with PID 0 on node "wowbagger.cluster" exited
on signal 11.
[wowbagger:20400] ERROR: A daemon on node wowbagger.cluster failed to start as
expected.
[wowbagger:20400] ERROR: There may be more information available from
[wowbagger:20400] ERROR: the remote shell (see above).
[wowbagger:20400] The daemon received a signal 11 (with core).
mpirun: killing job...
9 processes killed (possibly by Open MPI)
Replacing wowbagger.cluster with simply wowbagger does the trick. Could it be
something in host name resolution?
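One way to check that hypothesis from outside Open MPI would be to compare how the short name, the FQDN and "localhost" resolve on the master node. The following is only a minimal sketch: the helper names `resolve` and `compare_names` are made up for illustration, it uses the system resolver through Python's standard `socket` module, and it does not reproduce whatever comparison Open MPI itself performs on hostfile entries.

```python
import socket


def resolve(name, resolver=socket.gethostbyname_ex):
    """Resolve a host name with the system resolver.

    Returns (canonical_name, sorted_addresses), or None if the name
    cannot be resolved. The resolver argument is injectable so the
    helper can be exercised without touching the network.
    """
    try:
        canonical, _aliases, addresses = resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return None
    return canonical, sorted(addresses)


def compare_names(names, resolver=socket.gethostbyname_ex):
    """Map each candidate hostfile entry to its resolution result."""
    return {name: resolve(name, resolver) for name in names}
```

Calling `compare_names(["wowbagger", "wowbagger.cluster", "localhost"])` on the master and printing the result would show whether the three names lead to the same canonical name and address, or whether the FQDN takes a different path.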
Attached is my hostfile.
Graziano.
P.S.: Sorry for the delay, but yesterday we had heavy snowfall here in
Florence!
--
\ | /
(@ @)
-------------------------o00-(_)-00o -----------------------------
LaMMA - Laboratorio per la Meteorologia e la Modellistica Ambientale
Laboratory for Meteorology and Environmental Modelling
Via Madonna del Piano, 50019 Sesto Fiorentino (FI)
tel: + 39 055 4483049
fax: + 39 055 444083
web: www.lamma.rete.toscana.it
e-mail: giuliani_at_[hidden]