It seems we have two bugs here.
1. After the NUMA awareness commit, we see a segfault.
2. Before the NUMA awareness commit (r18656), we see the application hang.
3. I checked both with and without sendi; the results are the same.
4. It hangs most of the time, but sometimes large messages (>1M) work.
I will keep investigating :)
VER=TRUNK; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER} /opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ; /home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w ./mpi_p_${VER} -t bw -s 4000000
[witch17:09798] *** Process received signal ***
[witch17:09798] Signal: Segmentation fault (11)
[witch17:09798] Signal code: Address not mapped (1)
[witch17:09798] Failing at address: (nil)
[witch17:09798] [ 0] /lib64/libpthread.so.0 [0x2b1d13530c10]
[witch17:09798] [ 1] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_btl_sm.so [0x2b1d1557a68a]
[witch17:09798] [ 2] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_bml_r2.so [0x2b1d14e1b12f]
[witch17:09798] [ 3] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libopen-pal.so.0(opal_progress+0x5a) [0x2b1d12f6a6da]
[witch17:09798] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0 [0x2b1d12cafd28]
[witch17:09798] [ 5] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0(PMPI_Waitall+0x91) [0x2b1d12cd9d71]
[witch17:09798] [ 6] ./mpi_p_TRUNK(main+0xd32) [0x401ca2]
[witch17:09798] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b1d13657154]
[witch17:09798] [ 8] ./mpi_p_TRUNK [0x400ea9]
[witch17:09798] *** End of error message ***
[witch1:24955] --------------------------------------------------------------------------
mpirun noticed that process rank 62 with PID 9798 on node witch17 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
witch1:/home/USERS/lenny/TESTS/NUMA # VER=18551; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER} /opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ; /home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w ./mpi_p_${VER} -t bw -s 4000000
BW (100) (size min max avg) 4000000 654.496755 2121.899985 1156.171067
witch1:/home/USERS/lenny/TESTS/NUMA #
On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosilca@eecs.utk.edu> wrote:
Lenny,
I guess you're running the latest version. If not, please update; Galen and I corrected some bugs last week. If you're using the latest (and greatest) then ... well, I imagine there is at least one bug left.
There is a quick test you can do. In btl_sm.c, in the module structure at the beginning of the file, please replace the sendi function with NULL. If this fixes the problem, then at least we know that it's an sm send-immediate problem.
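To illustrate the kind of test suggested above, here is a minimal sketch of a module structure with an optional "send immediate" entry point. All names here (toy_btl_module, toy_send, toy_sendi, toy_start_send) are hypothetical and do not reproduce the actual contents of btl_sm.c; the sketch only assumes that a caller checks the sendi pointer and falls back to the regular send path when it is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a BTL module (not the real struct). */
typedef int (*toy_send_fn_t)(const char *buf, size_t len);

struct toy_btl_module {
    toy_send_fn_t send;   /* regular send path, always present */
    toy_send_fn_t sendi;  /* optional fast path; NULL disables it */
};

/* Dummy implementations that only report which path was taken. */
int toy_send(const char *buf, size_t len)  { (void)buf; (void)len; return 1; }
int toy_sendi(const char *buf, size_t len) { (void)buf; (void)len; return 2; }

/* The caller uses the fast path only when the module provides one. */
int toy_start_send(const struct toy_btl_module *m, const char *buf, size_t len)
{
    assert(m != NULL && m->send != NULL);
    if (m->sendi != NULL) {
        return m->sendi(buf, len);
    }
    return m->send(buf, len);
}
```

With sendi set to NULL in the initializer, every message takes the regular send path. If the failure disappears under that setting, the fast path is the likely suspect; if it persists, the problem is probably elsewhere.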
Thanks,
george.
On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
Hi, George,
I have a problem running the BW benchmark on a 100-rank cluster after r18551.
The BW benchmark is mpi_p, which runs mpi_bandwidth with a 100K message size between all pairs.
#mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
BW (100) (size min max avg) 100000 576.734030 2001.882416 1062.698408
#mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
mpirun: killing job...
(It still hangs after 10 hours.)
It doesn't happen if I run with --bynode or with btl openib,self only.
Lenny.