Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BW benchmark hangs after r 18551
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-06-17 09:12:57


It seems like we have two bugs here:
1. After committing NUMA awareness we see a segfault.
2. Before committing NUMA (r18656) we see application hangs.
3. I checked both with and without sendi (see the sketch below); same results.
4. It hangs most of the time, but sometimes large messages (>1M) do go through.

I will keep investigating :)
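For reference, "without sendi" in item 3 means following George's suggestion below and replacing the sendi entry of the sm module structure in btl_sm.c with NULL. As I understand it, a NULL sendi just makes the upper layer fall back to the regular send path, so it is a safe thing to test. A tiny self-contained sketch of that mechanism, with hypothetical names rather than the real btl_sm.c code:

/* sendi_fallback_sketch.c -- hypothetical names, not the real btl_sm.c.
 * Illustrates why replacing the sendi pointer with NULL is a safe test:
 * the caller only takes the send-immediate fast path when the pointer is
 * set, otherwise it falls back to the regular send. */
#include <stdio.h>
#include <stddef.h>

typedef int (*btl_send_fn)(const void *buf, size_t len);

struct btl_module {
    btl_send_fn btl_send;   /* regular (queued) send             */
    btl_send_fn btl_sendi;  /* optional send-immediate fast path */
};

static int sm_send(const void *buf, size_t len)
{
    (void)buf;
    printf("regular send path, %zu bytes\n", len);
    return 0;
}

static int sm_sendi(const void *buf, size_t len)
{
    (void)buf;
    printf("send-immediate path, %zu bytes\n", len);
    return 0;
}

/* The test: install NULL instead of sm_sendi in the module structure. */
static struct btl_module sm_module = { sm_send, NULL /* was sm_sendi */ };

static int bml_send(struct btl_module *m, const void *buf, size_t len)
{
    if (m->btl_sendi != NULL)           /* fast path only if provided */
        return m->btl_sendi(buf, len);
    return m->btl_send(buf, len);       /* otherwise regular path     */
}

int main(void)
{
    char payload[64] = { 0 };
    (void)sm_sendi;                     /* kept only for reference    */
    return bml_send(&sm_module, payload, sizeof(payload));
}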

VER=TRUNK; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
[witch17:09798] *** Process received signal ***
[witch17:09798] Signal: Segmentation fault (11)
[witch17:09798] Signal code: Address not mapped (1)
[witch17:09798] Failing at address: (nil)
[witch17:09798] [ 0] /lib64/libpthread.so.0 [0x2b1d13530c10]
[witch17:09798] [ 1]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_btl_sm.so [0x2b1d1557a68a]
[witch17:09798] [ 2]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_bml_r2.so [0x2b1d14e1b12f]
[witch17:09798] [ 3]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libopen-pal.so.0(opal_progress+0x5a)
[0x2b1d12f6a6da]
[witch17:09798] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0
[0x2b1d12cafd28]
[witch17:09798] [ 5]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0(PMPI_Waitall+0x91)
[0x2b1d12cd9d71]
[witch17:09798] [ 6] ./mpi_p_TRUNK(main+0xd32) [0x401ca2]
[witch17:09798] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
[0x2b1d13657154]
[witch17:09798] [ 8] ./mpi_p_TRUNK [0x400ea9]
[witch17:09798] *** End of error message ***
[witch1:24955]
--------------------------------------------------------------------------
mpirun noticed that process rank 62 with PID 9798 on node witch17 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA # VER=18551;
//home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
BW (100) (size min max avg) 4000000 654.496755 2121.899985
1156.171067
witch1:/home/USERS/lenny/TESTS/NUMA
#
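For context, the part of the benchmark that matters here is a batch of non-blocking transfers completed with MPI_Waitall, which is exactly the frame that segfaults in the trunk backtrace above. This is not the actual mpi_p.c source, just a minimal stand-alone loop of the same shape, restricted to one rank pair and the 4000000-byte size used above:

/* bw_sketch.c -- illustrative pairwise bandwidth loop, not mpi_p.c.
 * Rank 0 posts non-blocking sends to rank 1, rank 1 posts matching
 * receives, and both sides complete them with MPI_Waitall. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSG 8                          /* outstanding messages per pass */

int main(int argc, char **argv)
{
    const size_t len = 4000000;         /* message size used in the runs above */
    int rank, size, i;
    char *buf;
    MPI_Request req[NMSG];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(NMSG * len);           /* separate buffer per outstanding message */

    if (size >= 2 && rank < 2) {
        int peer = 1 - rank;

        t0 = MPI_Wtime();
        for (i = 0; i < NMSG; i++) {
            if (rank == 0)
                MPI_Isend(buf + (size_t)i * len, (int)len, MPI_BYTE,
                          peer, i, MPI_COMM_WORLD, &req[i]);
            else
                MPI_Irecv(buf + (size_t)i * len, (int)len, MPI_BYTE,
                          peer, i, MPI_COMM_WORLD, &req[i]);
        }
        MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);   /* the frame seen above */
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("BW %.3f MB/s\n",
                   (double)NMSG * (double)len / (t1 - t0) / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

It can be built with the same mpicc and launched with the same mpirun line as above, only substituting the binary name.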

On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosilca_at_[hidden]>
wrote:

> Lenny,
>
> I guess you're running the latest version. If not, please update; Galen and
> I corrected some bugs last week. If you're using the latest (and greatest)
> then ... well, I imagine there is at least one bug left.
>
> There is a quick test you can do. In btl_sm.c, in the module structure at
> the beginning of the file, please replace the sendi function with NULL. If
> this fixes the problem, then at least we know that it's an sm send-immediate
> problem.
>
> Thanks,
> george.
>
>
> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>
>> Hi, George,
>>
>> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
>> The BW test is mpi_p, which runs mpi_bandwidth with 100K messages between all pairs.
>>
>>
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>> BW (100) (size min max avg) 100000 576.734030 2001.882416
>> 1062.698408
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>> mpirun: killing job...
>> (it hangs even after 10 hours).
>>
>>
>> It doesn't happen if I run with --bynode or with only the openib,self BTLs.
>>
>>
>> Lenny.
>>
>
>