Hi all,

I ran into the "Unbelievable situation" error, followed by a segmentation fault, while running the IMB benchmark with this command:

/home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier

Here is the Allreduce output up to the failure, then the error messages and the backtrace:

#----------------------------------------------------------------

# Benchmarking Allreduce

# #processes = 96

#----------------------------------------------------------------

#Benchmarking        #procs       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]

Allreduce       96                  0         1000         0.02         0.03         0.02

Allreduce       96                  4         1000       297.88       298.07       297.95

Allreduce       96                  8         1000       296.15       296.32       296.24

Allreduce       96                 16         1000       297.99       298.17       298.09

Allreduce       96                 32         1000       296.97       297.20       297.04

Allreduce       96                 64         1000       298.43       298.64       298.49

Allreduce       96                128         1000       296.86       297.07       296.93

Allreduce       96                256         1000       298.00       298.30       298.09

Allreduce       96                512         1000       296.79       296.96       296.85

Allreduce       96               1024         1000       299.23       299.39       299.31

Allreduce       96               2048         1000       295.51       295.64       295.57

Allreduce       96               4096         1000       246.02       246.13       246.08

Allreduce       96               8192         1000       492.52       492.74       492.63

Allreduce       96              16384         1000      5380.59      5381.47      5381.10

Allreduce       96              32768         1000      5372.86      5373.69      5373.36

Allreduce       96              65536          640      5470.41      5471.88      5471.16

Allreduce       96             131072          320      5554.52      5556.82      5555.75

[witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23

[witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23

[witch24:15639] *** Process received signal ***

[witch24:15639] Signal: Segmentation fault (11)

[witch24:15639] Signal code: Address not mapped (1)

[witch24:15639] Failing at address: 0x632457d0

[witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]

[witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]

[witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]

[witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]

[witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]

[witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]

[witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]

[witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]

[witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]

[witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]

[witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]

[witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]

[witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]

[witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]

[witch24:15639] [14] ./IMB-MPI1 [0x4030a9]

[witch24:15639] *** End of error message ***

--------------------------------------------------------------------------

Best Regards,

Lenny.