Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.3 Infiniband Hang
From: Allen Barnett (allen_at_[hidden])
Date: 2009-08-19 15:56:55


Hi: Setting mpi_leave_pinned to 0 allows my application to run to
completion when running with openib active. I realize that it's probably
not going to help my application's performance, but since "ON" is the
default, I'd like to understand what's happening. There's definitely a
dependence on problem size: smaller problems run to completion while
larger problems hang at different points in the code. Are there buffer
sizes (or other BTL settings) I can adjust to understand my problem
better?
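
For anyone following along, the settings discussed in this thread can be sketched as below. This is only an illustration under assumptions: `./my_app` and `-np 4` are placeholders that do not come from the original post, and the `mpirun` and `ompi_info` calls are guarded so the snippet does nothing harmful on a machine without Open MPI installed.

```shell
# Sketch only: ./my_app and -np 4 are placeholders, not from the original post.

# Disable leave-pinned for every later mpirun started from this shell.
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_mpi_leave_pinned=0

if command -v mpirun >/dev/null 2>&1; then
    # Same setting, passed for a single run on the command line.
    mpirun --mca mpi_leave_pinned 0 -np 4 ./my_app

    # For comparison: exclude the openib BTL entirely.
    mpirun --mca btl '^openib' -np 4 ./my_app
fi

if command -v ompi_info >/dev/null 2>&1; then
    # List the openib BTL parameters and their current values.
    ompi_info --param btl openib
fi
```

The `ompi_info --param btl openib` listing is the natural starting point for the buffer-size question, since it shows which openib parameters the installed build actually exposes.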

Thanks,
Allen

On Thu, 2009-08-13 at 10:11 +0300, Lenny Verkhovsky wrote:
> Hi,
> 1.
> Mellanox has newer firmware for those HCAs:
> http://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx
>
> I am not sure if it will help, but newer firmware usually has some bug
> fixes.
>
> 2.
> Try disabling leave_pinned during the run. It's on by default in
> 1.3.3.
>
> Lenny.
>
> On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett
> <allen_at_[hidden]> wrote:
> Hi:
> I recently tried to build my MPI application against OpenMPI 1.3.3. It
> worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs part way
> through. It does a fair amount of comm, but eventually it stops in a
> Send/Recv point-to-point exchange. If I turn off the openib btl, it runs
> to completion. Also, I built 1.3.3 with memchecker (which is very nice;
> thanks to everyone who worked on that!) and it runs to completion, even
> with openib active.
>
> Our cluster consists of dual dual-core opteron boxes with Mellanox
> MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396 Infiniscale-III
> switch. We're running RHEL 4.8 which appears to include OFED 1.4. I've
> built everything using GCC 4.3.2. Here is the output from ibv_devinfo.
> "ompi_info --all" is attached.
>
> $ ibv_devinfo
> hca_id: mthca0
>     fw_ver:             1.1.0
>     node_guid:          0002:c902:0024:3284
>     sys_image_guid:     0002:c902:0024:3287
>     vendor_id:          0x02c9
>     vendor_part_id:     25204
>     hw_ver:             0xA0
>     board_id:           MT_03B0140002
>     phys_port_cnt:      1
>         port: 1
>             state:      active (4)
>             max_mtu:    2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid:     1
>             port_lid:   1
>             port_lmc:   0x00
>
> I'd appreciate any tips for debugging this.
> Thanks,
> Allen