Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] parallel AMBER & PBS issue
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:14:51


Sorry for the delay in replying -- I was on vacation for a week and
all the mail piled up...

That is a very weird stack trace. Is the application finishing and
then crashing during the shutdown?

I'd be surprised if the problem is actually related to PBS (the stack
trace would be quite different). I wonder if the real problem was
that it only started one process, and Amber was unable to handle that
nicely...?

Are you sure that PBS support was compiled into your Open MPI properly?
Run "ompi_info | grep tm". You should see a line like this:

                  MCA pls: tm (MCA v1.0, API v1.0.1, Component v1.2.6)

If you don't see a "pls: tm" line, then your OMPI was not configured
with PBS support, and mpiexec may have only started one copy of
Amber...?
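
A minimal sketch of that check as a script, in case it helps to run it on each node. It assumes the v1.2-style "MCA pls: tm" line shown above; the function name has_tm_support is made up for illustration, and other Open MPI versions may name the launcher framework differently.

```shell
#!/bin/sh
# has_tm_support (hypothetical helper): reads ompi_info output on stdin
# and succeeds only if a "MCA pls: tm" component line is present.
has_tm_support() {
    grep -q 'MCA pls: tm '
}

# Only run the real check if ompi_info is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "pls: tm found; this Open MPI was built with PBS (tm) support"
    else
        echo "no pls: tm line; this Open MPI was not built with PBS (tm) support"
    fi
fi
```

If the line is missing only on some nodes, those nodes are probably picking up a different Open MPI installation than the one you built.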

As for trying to use a hostfile, I think the real errors are here:

> Host key verification failed.
> Host key verification failed.

It seems that your ssh is not set up properly...?
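
One way to narrow that down is to try a non-interactive ssh login to every node in the job. The following is only a sketch under assumptions: it must run inside a PBS job where $PBS_NODEFILE is set, and the helper names unique_hosts and check_ssh are made up for illustration. BatchMode=yes makes ssh fail instead of prompting, so a host key or password problem shows up as an error rather than a hang.

```shell
#!/bin/sh
# unique_hosts (hypothetical helper): print each host listed in a
# PBS-style node file exactly once.
unique_hosts() {
    sort -u "$1"
}

# check_ssh (hypothetical helper): attempt a non-interactive login to
# each host and report whether it worked.
check_ssh() {
    for host in $(unique_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: ssh failed"
        fi
    done
}

# Only run the real check when a readable node file is available.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    check_ssh "$PBS_NODEFILE"
fi
```

Any node reported as failed would need its host key accepted (or its key-based login fixed) for the user running the job before the rsh/ssh launcher can start daemons there.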

On Jun 12, 2008, at 11:52 AM, Arturas Ziemys wrote:

> Hi,
>
> We have a dual-CPU Xeon cluster running Red Hat. I have compiled Open MPI
> 1.2.6 and AMBER (a scientific program for parallel molecular
> simulations; Fortran 77 & 90) with g95. Both compilations seem to be fine.
> However, AMBER runs successfully from the command prompt with
> "mpiexec -np x <exe ...>", but under the PBS batch system it fails to run
> in parallel and uses only a single CPU. I get errors like:
>
> [Morpheus06:02155] *** Process received signal ***
> [Morpheus06:02155] Signal: Segmentation fault (11)
> [Morpheus06:02155] Signal code: Address not mapped (1)
> [Morpheus06:02155] Failing at address: 0x39000000
> [Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
> [Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
> [Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e)
> [0x42029eae]
> [Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
> [0x40018325]
> [Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0
> [0x400190f6]
> [Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
> [Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
> [Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI
> [0x82beb63]
> [Morpheus06:02155] [ 8]
> /home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
> [Morpheus06:02155] [ 9]
> /home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
> [Morpheus06:02155] [10]
> /home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
> [Morpheus06:02155] [11]
> /home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
> [Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4)
> [0x42015574]
> [Morpheus06:02155] [13]
> /home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
> [Morpheus06:02155] *** End of error message ***
> mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06
> exited
> on signal 11 (Segmentation fault).
> 5 additional processes aborted (not shown)
>
> If I supply a machine file ($PBS_NODEFILE), it fails with:
>
> Host key verification failed.
> Host key verification failed.
> [Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to
> start as
> expected.
> [Morpheus06:02107] ERROR: There may be more information available from
> [Morpheus06:02107] ERROR: the remote shell (see above).
> [Morpheus06:02107] ERROR: The daemon exited unexpectedly with status
> 255.
> [Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1166
> [Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> errmgr_hnp.c
> at line 90
> [Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to
> start as
> expected.
> [Morpheus06:02107] ERROR: There may be more information available from
> [Morpheus06:02107] ERROR: the remote shell (see above).
> [Morpheus06:02107] ERROR: The daemon exited unexpectedly with status
> 255.
> [Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1198
> --------------------------------------------------------------------------
> mpiexec was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
> Help, please.
>
> --
>
> Arturas Ziemys, PhD
> School of Health Information Sciences
> University of Texas Health Science Center at Houston
> 7000 Fannin, Suite 880
> Houston, TX 77030
> Phone: (713) 500-3975
> Fax: (713) 500-3929
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems