Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] CP2K mpi hang
From: Noam Bernstein (noam.bernstein_at_[hidden])
Date: 2009-05-19 11:01:37


On May 19, 2009, at 9:32 AM, Ashley Pittman wrote:
>
> Can you confirm that *all* processes are in PMPI_Allreduce at some
> point, the collectives commonly get blamed for a lot of hangs and it's
> not always the correct place to look.

For the openmpi run, every single process showed one of those
two stack traces, mostly the first one.

>
>> P.S. I get a similar hang with MVAPICH, in a nearby but different
>> part of the code (on an MPI_Bcast, specifically), increasing my
>> tendency to believe that it's OFED's fault. But maybe the stack
>> trace will suggest to someone where it might be stuck, and
>> therefore perhaps an mca flag to try?
>
> This strikes me as a filesystem problem more than MPI per se. Again
> with MVAPICH are all your processes in MPI_Bcast or just some of them?

I'd suspect the filesystem too, except that it's hung up in an MPI call.
As I said before, the whole thing is bizarre. It doesn't matter where
the executable is, just what the CWD is (i.e. I can do
mpirun /scratch/exec or mpirun /home/bernstei/exec, but if the CWD is
in /scratch it'll hang). And I've been running other codes both from
NFS and from scratch directories for months, and never had a problem.

Using MVAPICH, every process is stuck in a collective, but they're not
all in the same collective (see stack traces below). The 2 processes on
the head node are stuck on mpi_bcast, in various low-level MPI routines.
The other 6 processes are stuck on an mpi_allreduce, again in various
low-level MPI routines. I don't know enough about the code to tell
whether they're all supposed to be part of the same communicator, and
the fact that they're stuck on different collectives is suspicious. I
can look into that.
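
For anyone who wants to repeat this check on more ranks, here is a rough
sketch of how the per-process backtraces could be tallied by the MPI
call each one is blocked in. It assumes gdb-style frame lines like the
ones in the stack traces in this message; the function names
blocking_mpi_call and summarize are made up for illustration and are
not part of any MPI or gdb tooling.

```python
import re
from collections import Counter

# Matches gdb frame lines such as "#5 0x0000000001b033ad in PMPI_Bcast ()".
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\w+)\s*\(")


def blocking_mpi_call(backtrace):
    """Return the innermost PMPI_* frame in one backtrace, or None.

    Hypothetical helper: it only looks at frames whose function name is
    on the same line as the frame number.
    """
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(1).startswith("PMPI_"):
            return match.group(1)
    return None


def summarize(backtraces):
    """Count how many processes are blocked in each MPI call."""
    return Counter(blocking_mpi_call(bt) for bt in backtraces)


if __name__ == "__main__":
    # Two abbreviated traces using frames copied from this message.
    bcast_trace = (
        "#4 0x0000000001b01b16 in MPIR_Bcast ()\n"
        "#5 0x0000000001b033ad in PMPI_Bcast ()\n"
    )
    allreduce_trace = (
        "#5 0x0000000001aff15a in MPIR_Allreduce ()\n"
        "#6 0x0000000001b0036d in PMPI_Allreduce ()\n"
    )
    for call, count in summarize([bcast_trace, allreduce_trace]).items():
        print(call, count)
```

If more than one collective shows up in the tally, the next question is
whether those ranks share a communicator; the tally alone cannot tell.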

So yes, it does seem to be a problem with collective communication.
But a very weird one.

                                                                        Noam

#0 0x0000000001b2c120 in MPIDI_CH3I_read_progress ()
#1 0x0000000001b2be44 in MPIDI_CH3I_Progress ()
#2 0x0000000001b0686b in MPIC_Wait ()
#3 0x0000000001b072a6 in MPIC_Send ()
#4 0x0000000001b01b16 in MPIR_Bcast ()
#5 0x0000000001b033ad in PMPI_Bcast ()
#6 0x0000000001b1ec52 in pmpi_bcast_ ()
#7 0x00000000007098d4 in message_passing_mp_mp_bcast_rm_ ()
#8 0x000000000091f9c0 in qs_mo_types_mp_read_mos_restart_low_ ()
#9 0x0000000000922485 in qs_mo_types_mp_read_mo_set_from_restart_ ()
#10 0x000000000158b00e in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#11 0x0000000000a013c5 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#12 0x00000000009fc78a in qs_scf_mp_init_scf_run_ ()
#13 0x00000000009e81bd in qs_scf_mp_scf_ ()
#14 0x0000000000847ed3 in qs_energy_mp_qs_energies_ ()
#15 0x0000000000856e5e in qs_force_mp_qs_forces_ ()
#16 0x00000000004b904b in force_env_methods_mp_force_env_calc_energy_force_ ()
#17 0x00000000004b899e in force_env_methods_mp_force_env_calc_energy_force_ ()
#18 0x00000000006c4ee4 in md_run_mp_qs_mol_dyn_ ()
#19 0x000000000040c3d2 in cp2k_runs_mp_cp2k_run_ ()
#20 0x000000000040af1a in cp2k_runs_mp_run_input_ ()
#21 0x0000000000409df9 in MAIN__ ()
#22 0x0000000000408e0c in main ()

#0 0x0000000001b3d4e4 in MPIDI_CH3I_MRAILI_Get_next_vbuf ()
#1 0x0000000001b2c1ae in MPIDI_CH3I_read_progress ()
#2 0x0000000001b2be44 in MPIDI_CH3I_Progress ()
#3 0x0000000001b0686b in MPIC_Wait ()
#4 0x0000000001b06c60 in MPIC_Sendrecv ()
#5 0x0000000001aff15a in MPIR_Allreduce ()
#6 0x0000000001b0036d in PMPI_Allreduce ()
#7 0x0000000001b1f1da in pmpi_allreduce_ ()
#8 0x0000000000700f9b in message_passing_mp_mp_sum_r1_ ()
#9 0x0000000000b68f9d in sparse_matrix_types_mp_cp_sm_sm_trace_scalar_ ()
#10 0x000000000158de4c in qs_initial_guess_mp_calculate_first_density_matrix_ ()
#11 0x0000000000a013c5 in qs_scf_mp_scf_env_initial_rho_setup_ ()
#12 0x00000000009fc78a in qs_scf_mp_init_scf_run_ ()
#13 0x00000000009e81bd in qs_scf_mp_scf_ ()
#14 0x0000000000847ed3 in qs_energy_mp_qs_energies_ ()
#15 0x0000000000856e5e in qs_force_mp_qs_forces_ ()
#16 0x00000000004b904b in force_env_methods_mp_force_env_calc_energy_force_ ()
#17 0x00000000004b899e in force_env_methods_mp_force_env_calc_energy_force_ ()
#18 0x00000000006c4ee4 in md_run_mp_qs_mol_dyn_ ()
#19 0x000000000040c3d2 in cp2k_runs_mp_cp2k_run_ ()
#20 0x000000000040af1a in cp2k_runs_mp_run_input_ ()
#21 0x0000000000409df9 in MAIN__ ()
#22 0x0000000000408e0c in main ()
