Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BW benchmark hangs after r 18551
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-06-23 04:42:07


Hi,
Seqf bug fixed in r18706.

Best Regards
Lenny.
On Thu, Jun 19, 2008 at 5:37 PM, Lenny Verkhovsky <
lenny.verkhovsky_at_[hidden]> wrote:

> Sorry,
> I checked it without sm.
>
> Please ignore this mail.
>
>
>
> On Thu, Jun 19, 2008 at 4:32 PM, Lenny Verkhovsky <
> lenny.verkhovsky_at_[hidden]> wrote:
>
>> Hi,
>> I found what caused the problem in both cases.
>>
>> --- ompi/mca/btl/sm/btl_sm.c (revision 18675)
>> +++ ompi/mca/btl/sm/btl_sm.c (working copy)
>> @@ -812,7 +812,7 @@
>> */
>> MCA_BTL_SM_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
>> endpoint->peer_smp_rank, frag->hdr, false, rc);
>> - return (rc < 0 ? rc : 1);
>> + return OMPI_SUCCESS;
>> }
>> I am just not sure if it's OK.
>>
>> Lenny.
>> On Wed, Jun 18, 2008 at 3:21 PM, Lenny Verkhovsky <
>> lenny.verkhovsky_at_[hidden]> wrote:
>>
>>> Hi,
>>> I am not sure if it is related,
>>> but I applied your patch (r18667) to r18656 (one before NUMA)
>>> together with disabling sendi.
>>> The result is still the same (hanging).
>>>
>>>
>>>
>>>
>>> On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosilca_at_[hidden]>
>>> wrote:
>>>
>>>> Lenny,
>>>>
>>>> I guess you're running the latest version. If not, please update; Galen
>>>> and I corrected some bugs last week. If you're using the latest (and
>>>> greatest) then ... well, I imagine there is at least one bug left.
>>>>
>>>> There is a quick test you can do. In btl_sm.c, in the module
>>>> structure at the beginning of the file, please replace the sendi function
>>>> with NULL. If this fixes the problem, then at least we know that it's an sm
>>>> send-immediate problem.
>>>>
>>>> Thanks,
>>>> george.
>>>>
>>>>
>>>> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>>>>
>>>>> Hi, George,
>>>>>
>>>>> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
>>>>> The BW benchmark is mpi_p, which runs mpi_bandwidth with a size of 100K
>>>>> between all pairs.
>>>>>
>>>>>
>>>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>>>>> BW (100) (size min max avg) 100000 576.734030 2001.882416 1062.698408
>>>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>>>>> mpirun: killing job...
>>>>> (it still hangs after 10 hours).
>>>>>
>>>>>
>>>>> It doesn't happen if I run with --bynode or with only the openib,self BTLs.
>>>>>
>>>>>
>>>>> Lenny.
>>>>>
>>>>
>>>>
>>>
>>
>