Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] BW benchmark hangs after r 18551
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-06-17 09:12:57


It seems like we have 2 bugs here.
1. After committing NUMA awareness we see a segfault.
2. Before committing NUMA (r18656) we see application hangs.
3. I checked both with and without sendi; same results.
4. It hangs most of the time, but sometimes large msgs ( >1M ) work.

I will keep investigating :)

VER=TRUNK; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
[witch17:09798] *** Process received signal ***
[witch17:09798] Signal: Segmentation fault (11)
[witch17:09798] Signal code: Address not mapped (1)
[witch17:09798] Failing at address: (nil)
[witch17:09798] [ 0] /lib64/libpthread.so.0 [0x2b1d13530c10]
[witch17:09798] [ 1]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_btl_sm.so [0x2b1d1557a68a]
[witch17:09798] [ 2]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_bml_r2.so [0x2b1d14e1b12f]
[witch17:09798] [ 3]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libopen-pal.so.0(opal_progress+0x5a)
[0x2b1d12f6a6da]
[witch17:09798] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0
[0x2b1d12cafd28]
[witch17:09798] [ 5]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0(PMPI_Waitall+0x91)
[0x2b1d12cd9d71]
[witch17:09798] [ 6] ./mpi_p_TRUNK(main+0xd32) [0x401ca2]
[witch17:09798] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
[0x2b1d13657154]
[witch17:09798] [ 8] ./mpi_p_TRUNK [0x400ea9]
[witch17:09798] *** End of error message ***
[witch1:24955]
--------------------------------------------------------------------------
mpirun noticed that process rank 62 with PID 9798 on node witch17 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA # VER=18551;
//home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
BW (100) (size min max avg) 4000000 654.496755 2121.899985 1156.171067
witch1:/home/USERS/lenny/TESTS/NUMA
#

On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosilca_at_[hidden]>
wrote:

> Lenny,
>
> I guess you're running the latest version. If not, please update; Galen and
> I corrected some bugs last week. If you're using the latest (and
> greatest) then ... well, I imagine there is at least one bug left.
>
> There is a quick test you can do. In btl_sm.c, in the module structure
> at the beginning of the file, please replace the sendi function by NULL. If
> this fixes the problem, then at least we know that it's an sm send-immediate
> problem.
>
> Thanks,
> george.
>
>
> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>
> Hi, George,
>>
>> I have a problem running BW benchmark on 100 rank cluster after r18551.
>> The BW is mpi_p that runs mpi_bandwidth with 100K between all pairs.
>>
>>
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>> BW (100) (size min max avg) 100000 576.734030 2001.882416 1062.698408
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>> mpirun: killing job...
>> ( it hangs even after 10 hours ).
>>
>>
>> It doesn't happen if I run with --bynode, or with btl openib,self only.
>>
>>
>> Lenny.
>>
>
>