Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] CP2K mpi hang
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-19 08:29:18


fork() support in OpenFabrics has always been dicey -- it can lead to
random behavior like this. Supposedly it works in a specific set of
circumstances, but I don't have a recent enough kernel on my machines
to test.

It's best not to use calls to system() if they can be avoided.
Indeed, Open MPI v1.3.x will warn you if you create a child process
after MPI_INIT when using OpenFabrics networks.
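
For concreteness, the pattern that triggers that warning is roughly the
following minimal sketch (not code from CP2K or this thread; "./helper.sh"
is just a placeholder for whatever the child process runs):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* system() forks a child after MPI_Init.  With the openib BTL,
     * Open MPI v1.3.x prints a warning at this point because registered
     * (pinned) memory is not guaranteed to survive a fork, and on some
     * kernel/libibverbs combinations the parent can misbehave or hang. */
    int rc = system("./helper.sh");
    (void)rc;

    MPI_Finalize();
    return 0;
}

If the child process can't be eliminated, Open MPI exposes MCA parameters
along the lines of btl_openib_want_fork_support and mpi_warn_on_fork
(e.g. "mpirun --mca mpi_warn_on_fork 0 ..."), but whether fork support
actually works still depends on having a new enough kernel and libibverbs.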

On May 18, 2009, at 5:05 PM, Noam Bernstein wrote:

> Hi all - I have a bizarre OpenMPI hanging problem. I'm running an MPI
> code called CP2K (related to, but not the same as, cpmd). The
> complications of the software aside, here are the observations:
>
> At the base is a serial code that uses system() calls to repeatedly
> invoke mpirun cp2k.popt. When I run from my NFS-mounted home
> directory, everything appears to be fine. When I run from a scratch
> directory local to each node, it hangs on the _third_ invocation of
> CP2K (the 1st and 3rd invocations do computationally expensive stuff;
> the 2nd uses the code in a different mode which does a rather
> different and quicker computation). These behaviors are quite
> repeatable: run from the NFS-mounted home dir - no problem; run from
> the node-local scratch directory - hang. The hang is always in the
> same place (as far as the output of the code goes, anyway).
>
> The underlying system is Linux with a 2.6.18-128.1.6.el5 kernel
> (CentOS 5.3) on a dual single-core Opteron system with Mellanox
> Infiniband SDR cards. One note of caution is that I'm running OFED
> 1.4.1-rc4, because as far as I can tell I need 1.4.1 for
> compatibility with this kernel.
>
> The code is complicated, and the input files are big and lead to long
> computation times, so I don't think I'll be able to make a simple
> test case. Instead I attached to the hanging processes (all 8 of
> them) with gdb during the hang. The stack trace is below. Nodes seem
> to spend most of their time in btl_openib_component_progress(), and
> occasionally in mca_pml_ob1_progress() - i.e. not completely stuck,
> but not making progress.
>
> Does anyone have any ideas what could be wrong?
>
> Noam
>
> P.S. I get a similar hang with MVAPICH, in a nearby but different
> part of the code (on an MPI_Bcast, specifically), increasing my
> tendency to believe that it's OFED's fault. But maybe the stack trace
> will suggest to someone where it might be stuck, and therefore
> perhaps an mca flag to try?
>
>
> #0  0x00002ac2d19d7733 in btl_openib_component_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_btl_openib.so
> #1  0x00002ac2cdd4daea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
> #2  0x00002ac2cd887e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
> #3  0x00002ac2d2eb544f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
> #4  0x00002ac2cd89b867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
> #5  0x00002ac2cd6429b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
> #6  0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
> #7  0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
> #8  0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
> #9  0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
> #10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
> #11 0x0000000000a659fd in qs_scf_mp_scf_ ()
> #12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
> #13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
> #14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
> #15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
> #16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
> #17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
> #18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
> #19 0x0000000000487669 in MAIN__ ()
> #20 0x000000000048667c in main ()
>
>
> #0  0x00002b4d0b57bf09 in mca_pml_ob1_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_pml_ob1.so
> #1  0x00002b4d08538aea in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
> #2  0x00002b4d08072e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
> #3  0x00002b4d0d6a044f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
> #4  0x00002b4d08086867 in PMPI_Allreduce () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
> #5  0x00002b4d07e2d9b5 in pmpi_allreduce__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
> #6  0x000000000077e7db in message_passing_mp_mp_sum_r1_ ()
> #7  0x0000000000be67dd in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
> #8  0x000000000160b68c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
> #9  0x0000000000a7ec05 in qs_scf_mp_scf_env_initial_rho_setup_ ()
> #10 0x0000000000a79fca in qs_scf_mp_init_scf_run_ ()
> #11 0x0000000000a659fd in qs_scf_mp_scf_ ()
> #12 0x00000000008c5713 in qs_energy_mp_qs_energies_ ()
> #13 0x00000000008d469e in qs_force_mp_qs_forces_ ()
> #14 0x00000000005368bb in force_env_methods_mp_force_env_calc_energy_force_ ()
> #15 0x000000000053620e in force_env_methods_mp_force_env_calc_energy_force_ ()
> #16 0x0000000000742724 in md_run_mp_qs_mol_dyn_ ()
> #17 0x0000000000489c42 in cp2k_runs_mp_cp2k_run_ ()
> #18 0x000000000048878a in cp2k_runs_mp_run_input_ ()
> #19 0x0000000000487669 in MAIN__ ()
> #20 0x000000000048667c in main ()
>

-- 
Jeff Squyres
Cisco Systems