Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-01 12:11:34


Ram,

What is the name and version of the kernel module for your NIC? I have
experienced something similar with my tg3 module. The error that appeared
for me was different:

[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: No route to host (113)
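
As a side note, the numbers in these messages are errno values. Below is a minimal sketch of a lookup for the three codes seen in this thread, assuming Linux on x86 (the numbers differ on other systems). The helper name errno_name is hypothetical and not part of Open MPI:

```shell
#!/bin/sh
# errno_name is a hypothetical helper, not part of Open MPI.
# The values below are the Linux errno numbers; other systems differ.
errno_name() {
    case "$1" in
        104) echo "ECONNRESET (Connection reset by peer)" ;;
        110) echo "ETIMEDOUT (Connection timed out)" ;;
        113) echo "EHOSTUNREACH (No route to host)" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

# The three codes mentioned in this thread:
errno_name 110
errno_name 104
errno_name 113
```

All three are errors at the TCP/network level rather than inside the application, which fits a problem with the NIC or its driver.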

I solved it by changing the following NIC driver setting (turning TCP
segmentation offload off):

/sbin/ethtool -K eth0 tso off
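
Before changing anything, it may help to check which driver the interface uses and what its offload settings currently are. This is a rough sketch, assuming ethtool is installed at /sbin/ethtool and the interface is eth0 as above. The helper name show_nic_info and the ETHTOOL override variable are hypothetical:

```shell
#!/bin/sh
# show_nic_info is a hypothetical helper; ETHTOOL can be overridden if the
# binary lives somewhere other than /sbin/ethtool.
ETHTOOL="${ETHTOOL:-/sbin/ethtool}"

show_nic_info() {
    iface="$1"
    "$ETHTOOL" -i "$iface"   # driver name, version and firmware
    "$ETHTOOL" -k "$iface"   # current offload settings (lowercase -k only reads)
}

# Usage (eth0 is the interface name used above):
#   show_nic_info eth0
```

Note that a setting made with "ethtool -K" normally does not survive a reboot, so it would have to be reapplied at boot if it turns out to help.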

Leonardo

Aurélien Bouteiller wrote:
> If you have several network cards in your system, it can sometimes get
> the endpoints confused. Especially if you don't have the same number
> of cards or don't use the same subnet for all "eth0, eth1". You should
> try to restrict Open MPI to use only one of the available networks by
> using the --mca btl_tcp_if_include ethx parameter to mpirun, where x
> is the network interface that is always connected to the same logical
> and physical network on your machine.
>
> Aurelien
>
> On 1 Oct 2008, at 11:47, V. Ram wrote:
>
>> I wrote earlier about one of my users running a third-party Fortran code
>> on 32-bit x86 machines, using OMPI 1.2.7, that is having some odd crash
>> behavior.
>>
>> Our cluster's nodes all have 2 single-core processors. If this code is
>> run on 2 processors on 1 node, it runs seemingly fine. However, if the
>> job runs on 1 processor on each of 2 nodes (e.g., mpirun --bynode), then
>> it crashes and gives messages like:
>>
>> [node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
>> [node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
>> mca_btl_tcp_frag_recv: readv failed with errno=110
>> mca_btl_tcp_frag_recv: readv failed with errno=104
>>
>> Essentially, if any network communication is involved, the job crashes
>> in this form.
>>
>> I do have another user that runs his own MPI code on 10+ of these
>> processors for days at a time without issue, so I don't think it's
>> hardware.
>>
>> The original code also runs fine across many networked nodes if the
>> architecture is x86-64 (also running OMPI 1.2.7).
>>
>> We have also tried different Fortran compilers (both PathScale and
>> gfortran) and keep getting these crashes.
>>
>> Are there any suggestions on how to figure out if it's a problem with
>> the code or the OMPI installation/software on the system? We have tried
>> "--debug-daemons" with no new/interesting information being revealed.
>> Is there a way to trap segfault messages or more detailed MPI
>> transaction information or anything else that could help diagnose this?
>>
>> Thanks.
>> --
>> V. Ram
>> v_r_959_at_[hidden]
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478