Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Unbelievable situation BUG
From: Gleb Natapov (glebn_at_[hidden])
Date: 2008-04-27 12:33:51


On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all
>
> I faced the "Unbelievable situation"
The situation is believable. What is not believable is commit r18274,
which adds this output without accounting for sequence number wraparound.
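
To illustrate the wraparound point, here is a minimal sketch, not the code from r18274. It assumes the fragment sequence number is a 16-bit counter (inferred from the "expected 65534" in the log below), and the function names are hypothetical. A plain `seq < expected` test reports sequence number 0 as a duplicate when 65534 is expected, while a modular comparison sees it as two fragments ahead.

```python
SEQ_BITS = 16                 # assumption: 16-bit sequence counter
SEQ_MOD = 1 << SEQ_BITS       # 65536
HALF = SEQ_MOD // 2           # 32768


def naive_is_duplicate(seq, expected):
    """Hypothetical check that ignores wraparound."""
    return seq < expected


def seq_distance(seq, expected):
    """Signed distance from expected to seq, in [-32768, 32767].

    Positive means seq is ahead of expected, negative means behind.
    """
    d = (seq - expected) % SEQ_MOD
    return d - SEQ_MOD if d >= HALF else d


def is_duplicate(seq, expected):
    """Wraparound-aware check: seq is a duplicate only if it is behind."""
    return seq_distance(seq, expected) < 0


if __name__ == "__main__":
    # First message in the log: seq 0 arrives while 65534 is expected.
    print(naive_is_duplicate(0, 65534))  # True  (false alarm)
    print(is_duplicate(0, 65534))        # False (0 is 2 ahead of 65534)
```

The modular comparison is only unambiguous while fewer than 32768 fragments separate the two numbers, so it is a sketch of the idea rather than a complete fix.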

>
> while running the IMB benchmark.
>
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile
> hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing
> Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>
> #----------------------------------------------------------------
> # Benchmarking Allreduce
> # #processes = 96
> #----------------------------------------------------------------
> #Benchmarking  #procs  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> Allreduce          96       0          1000         0.02         0.03         0.02
> Allreduce          96       4          1000       297.88       298.07       297.95
> Allreduce          96       8          1000       296.15       296.32       296.24
> Allreduce          96      16          1000       297.99       298.17       298.09
> Allreduce          96      32          1000       296.97       297.20       297.04
> Allreduce          96      64          1000       298.43       298.64       298.49
> Allreduce          96     128          1000       296.86       297.07       296.93
> Allreduce          96     256          1000       298.00       298.30       298.09
> Allreduce          96     512          1000       296.79       296.96       296.85
> Allreduce          96    1024          1000       299.23       299.39       299.31
> Allreduce          96    2048          1000       295.51       295.64       295.57
> Allreduce          96    4096          1000       246.02       246.13       246.08
> Allreduce          96    8192          1000       492.52       492.74       492.63
> Allreduce          96   16384          1000      5380.59      5381.47      5381.10
> Allreduce          96   32768          1000      5372.86      5373.69      5373.36
> Allreduce          96   65536           640      5470.41      5471.88      5471.16
> Allreduce          96  131072           320      5554.52      5556.82      5555.75
>
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> [witch24:15639] *** Process received signal ***
> [witch24:15639] Signal: Segmentation fault (11)
> [witch24:15639] Signal code: Address not mapped (1)
> [witch24:15639] Failing at address: 0x632457d0
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> [witch24:15639] *** End of error message ***
> --------------------------------------------------------------------------
>
> Best Regards,
> Lenny.

--
			Gleb.