Hi all,
I hit the “Unbelievable situation” error while running the IMB benchmark:
/home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi \
    -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange \
    Allreduce Reduce Reduce_scatter Bcast Barrier
#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 96
#----------------------------------------------------------------
#Benchmarking  #procs   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
Allreduce          96        0          1000         0.02         0.03         0.02
Allreduce          96        4          1000       297.88       298.07       297.95
Allreduce          96        8          1000       296.15       296.32       296.24
Allreduce          96       16          1000       297.99       298.17       298.09
Allreduce          96       32          1000       296.97       297.20       297.04
Allreduce          96       64          1000       298.43       298.64       298.49
Allreduce          96      128          1000       296.86       297.07       296.93
Allreduce          96      256          1000       298.00       298.30       298.09
Allreduce          96      512          1000       296.79       296.96       296.85
Allreduce          96     1024          1000       299.23       299.39       299.31
Allreduce          96     2048          1000       295.51       295.64       295.57
Allreduce          96     4096          1000       246.02       246.13       246.08
Allreduce          96     8192          1000       492.52       492.74       492.63
Allreduce          96    16384          1000      5380.59      5381.47      5381.10
Allreduce          96    32768          1000      5372.86      5373.69      5373.36
Allreduce          96    65536           640      5470.41      5471.88      5471.16
Allreduce          96   131072           320      5554.52      5556.82      5555.75
[witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
[witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
[witch24:15639] *** Process received signal ***
[witch24:15639] Signal: Segmentation fault (11)
[witch24:15639] Signal code: Address not mapped (1)
[witch24:15639] Failing at address: 0x632457d0
[witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
[witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
[witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
[witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
[witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
[witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
[witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
[witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
[witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
[witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
[witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
[witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
[witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
[witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
[witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
[witch24:15639] *** End of error message ***
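
A possible reading of the two warnings, stated as an assumption and not as a diagnosis: the expected sequence number 65534 is just below the 16-bit limit of 65535, so the failure may coincide with a 16-bit fragment counter wrapping around. The sketch below is not Open MPI code. The 16-bit width and all names in it (SEQ_BITS, seq_distance, and the two looks_duplicate_* helpers) are hypothetical and only illustrate how a plain "less than" test and a wrap-aware test can disagree on the values from the log.

```python
# Illustration only: assumes a 16-bit sequence counter (not confirmed by the log).
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS   # counter values live in 0 .. SEQ_MOD - 1
HALF = SEQ_MOD // 2


def seq_distance(received: int, expected: int) -> int:
    """Signed shortest distance from `expected` to `received` on the ring.

    Positive: `received` is ahead of `expected` (a future fragment).
    Negative: `received` is behind `expected` (an old or duplicated fragment).
    """
    d = (received - expected) % SEQ_MOD
    return d - SEQ_MOD if d >= HALF else d


def looks_duplicate_naive(received: int, expected: int) -> bool:
    """Plain comparison that ignores wraparound."""
    return received < expected


def looks_duplicate_wrap_aware(received: int, expected: int) -> bool:
    """Comparison that treats the counter as a ring."""
    return seq_distance(received, expected) < 0


if __name__ == "__main__":
    expected = 65534  # value reported in the log
    for received in (0, 65116):  # values reported in the log
        print(
            received,
            seq_distance(received, expected),
            looks_duplicate_naive(received, expected),
            looks_duplicate_wrap_aware(received, expected),
        )
```

Under this assumption, seq 0 would be 2 fragments ahead of 65534 (a plain comparison would flag it as a duplicate only because it ignores the wrap), while seq 65116 would be 418 behind in either reading.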
--------------------------------------------------------------------------
Best Regards,
Lenny.