Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: [OMPI users] parallel AMBER & PBS issue
From: Arturas Ziemys (arturas.ziemys_at_[hidden])
Date: 2008-06-12 11:52:30


Hi,

We have Xeon dual cpu cluster on redhat. I have compiled openMPI 1.2.6
with g95 and AMBER (scientific program doing parallel molecular
simulations; Fortran 77&90). Both compilation seems to be fine. However,
AMBER runs from command prompt "mpiexec -np x <exe ...>" successfully,
but using PBS batch system fails to run in parallel and runs only using
single CPU. I get errors like:

[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x39000000
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e)
[0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
[0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
[0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8]
/home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9]
/home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10]
/home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11]
/home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4)
[0x42015574]
[Morpheus06:02155] [13]
/home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited
on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)

If I decide to supply machine file ($PBS_NODEFILE), it fails with :

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as
expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c
at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as
expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Help, please.

-- 
Arturas Ziemys, PhD
  School of Health Information Sciences
  University of Texas Health Science Center at Houston
  7000 Fannin, Suit 880
  Houston, TX 77030
  Phone: (713) 500-3975
  Fax:   (713) 500-3929