
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Unbelievable situation BUG
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-04-27 14:26:22


Mea culpa, I completely ignored the possible wrap-around of the sequence
number. I will remove the commit ASAP.

   Thanks,
     george.
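
To illustrate the wrap-around issue discussed in this thread, here is a minimal sketch of a duplicate check that tolerates sequence number wrap-around. It assumes a 16-bit sequence counter (suggested by the "expected 65534" in the quoted log) and a half-range rule for deciding whether a number is behind or ahead. The function names are hypothetical; this is not the code from r18274 or from the ob1 PML.

```python
# Assumption: sequence numbers are 16-bit unsigned and wrap from 65535 to 0.
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS   # 65536
HALF = SEQ_MOD >> 1       # 32768


def naive_is_duplicate(seq, expected):
    """Plain comparison: wrongly flags a fragment that arrives after a wrap."""
    return seq < expected


def seq_distance(seq, expected):
    """Signed modular distance from expected to seq, in [-32768, 32767]."""
    d = (seq - expected) % SEQ_MOD
    return d - SEQ_MOD if d >= HALF else d


def is_duplicate(seq, expected):
    """Treat seq as a duplicate only if it is behind expected modulo 2**16."""
    return seq_distance(seq, expected) < 0
```

With these definitions, `naive_is_duplicate(0, 65534)` is true, while `is_duplicate(0, 65534)` is false because 0 is only two steps ahead of 65534 once the counter wraps. The half-range rule is itself an assumption: it only works if fewer than 32768 fragments can be outstanding between a pair of peers at any time.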

On Apr 27, 2008, at 12:33 PM, Gleb Natapov wrote:

> On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
>> Hi, all
>>
>> I faced the "Unbelievable situation"
> The situation is believable, but commit r18274, which adds this output,
> is not, as it doesn't take sequence number wrap-around into account.
>
>>
>> while running the IMB benchmark.
>>
>> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>>
>> #----------------------------------------------------------------
>> # Benchmarking Allreduce
>> # #processes = 96
>> #----------------------------------------------------------------
>> #Benchmarking  #procs  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>> Allreduce          96       0          1000         0.02         0.03         0.02
>> Allreduce          96       4          1000       297.88       298.07       297.95
>> Allreduce          96       8          1000       296.15       296.32       296.24
>> Allreduce          96      16          1000       297.99       298.17       298.09
>> Allreduce          96      32          1000       296.97       297.20       297.04
>> Allreduce          96      64          1000       298.43       298.64       298.49
>> Allreduce          96     128          1000       296.86       297.07       296.93
>> Allreduce          96     256          1000       298.00       298.30       298.09
>> Allreduce          96     512          1000       296.79       296.96       296.85
>> Allreduce          96    1024          1000       299.23       299.39       299.31
>> Allreduce          96    2048          1000       295.51       295.64       295.57
>> Allreduce          96    4096          1000       246.02       246.13       246.08
>> Allreduce          96    8192          1000       492.52       492.74       492.63
>> Allreduce          96   16384          1000      5380.59      5381.47      5381.10
>> Allreduce          96   32768          1000      5372.86      5373.69      5373.36
>> Allreduce          96   65536           640      5470.41      5471.88      5471.16
>> Allreduce          96  131072           320      5554.52      5556.82      5555.75
>>
>> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
>> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
>>
>> [witch24:15639] *** Process received signal ***
>> [witch24:15639] Signal: Segmentation fault (11)
>> [witch24:15639] Signal code: Address not mapped (1)
>> [witch24:15639] Failing at address: 0x632457d0
>> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
>> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
>> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
>> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
>> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
>> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
>> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
>> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
>> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
>> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
>> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
>> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
>> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
>> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
>> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
>> [witch24:15639] *** End of error message ***
>> --------------------------------------------------------------------------
>>
>> Best Regards,
>>
>> Lenny.
>>
>>
>>
>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Gleb.


