What is the name and version of the kernel module for your NIC? I
experienced something similar with my tg3 module. The error that appeared
for me was different:
[btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: No route to host (113)
I solved it by turning off TCP segmentation offload (TSO) for the interface on Linux:
/sbin/ethtool -K eth0 tso off
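
In case it helps, here is a rough sketch of how that change could be scripted so you can look at the current offload settings before touching them. The IFACE and DRY_RUN variables and the run/tso_off helpers are names I made up for illustration; only the two ethtool invocations are real. By default it only prints what it would do. As far as I know the setting needs root and is not kept across reboots.

```shell
# Sketch: inspect and then disable TSO on one interface.
# IFACE, DRY_RUN, run and tso_off are illustrative names, not part of ethtool.
IFACE="${IFACE:-eth0}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

tso_off() {
    # Lowercase -k queries the offload settings; uppercase -K changes them.
    run /sbin/ethtool -k "$1"
    run /sbin/ethtool -K "$1" tso off
}

tso_off "$IFACE"
```

Running it with DRY_RUN=0 as root would apply the change on the chosen interface.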
Aurélien Bouteiller wrote:
> If you have several network cards in your system, Open MPI can sometimes
> get the endpoints confused, especially if the nodes don't all have the
> same number of cards or don't use the same subnet for every eth0, eth1,
> and so on. You should try to restrict Open MPI to only one of the
> available networks by passing the --mca btl_tcp_if_include ethX parameter
> to mpirun, where X is the number of the network interface that is always
> connected to the same logical and physical network on your machines.
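
A minimal sketch of what that suggestion might look like on the command line. The interface name, process count, and application path (eth0, 2, ./my_app) are placeholders, and mpirun_cmd is a hypothetical helper that only prints the command rather than launching anything. The --bynode flag is taken from the original report.

```shell
# Sketch: build an mpirun command restricted to a single TCP interface.
# mpirun_cmd and its arguments are hypothetical placeholders.
mpirun_cmd() {
    iface="$1"; np="$2"; app="$3"
    echo "mpirun --mca btl_tcp_if_include $iface -np $np --bynode $app"
}

# List the interfaces known on this node (Linux-specific path), if present.
if [ -d /sys/class/net ]; then
    ls /sys/class/net
fi

mpirun_cmd eth0 2 ./my_app
```

The interface passed here should be the one that sits on the same network on every node, so it is worth comparing the interface listing across nodes first.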
> On 1 Oct 2008, at 11:47, V. Ram wrote:
>> I wrote earlier about one of my users running a third-party Fortran code
>> on 32-bit x86 machines, using OMPI 1.2.7, that is having some odd crashes.
>> Our cluster's nodes all have 2 single-core processors. If this code is
>> run on 2 processors on 1 node, it runs seemingly fine. However, if the
>> job runs on 1 processor on each of 2 nodes (e.g., mpirun --bynode), then
>> it crashes and gives messages like:
>> mca_btl_tcp_frag_recv: readv failed with errno=110
>> mca_btl_tcp_frag_recv: readv failed with errno=104
>> Essentially, if any network communication is involved, the job crashes
>> in this form.
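
For reference, the bare errno numbers in those messages can be turned into names with a small lookup. This is only a sketch: errno_name is a helper I made up, and the numbers are the Linux x86 values, so other systems may number these errors differently. The 113 entry is the one from the tg3 case at the top of this message.

```shell
# Sketch: translate Linux errno values from the readv failures into names.
# errno_name is an illustrative helper; the table covers only the three
# codes seen in this thread.
errno_name() {
    case "$1" in
        104) echo "ECONNRESET: Connection reset by peer" ;;
        110) echo "ETIMEDOUT: Connection timed out" ;;
        113) echo "EHOSTUNREACH: No route to host" ;;
        *)   echo "unknown: check errno.h on the affected node" ;;
    esac
}

for code in 110 104 113; do
    printf '%s -> %s\n' "$code" "$(errno_name "$code")"
done
```

All three point at the TCP connection between the nodes failing rather than at a crash inside the application itself, which fits the observation that single-node runs work.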
>> I do have another user that runs his own MPI code on 10+ of these
>> processors for days at a time without issue, so I don't think it's
>> The original code also runs fine across many networked nodes if the
>> architecture is x86-64 (also running OMPI 1.2.7).
>> We have also tried different Fortran compilers (both PathScale and
>> gfortran) and keep getting these crashes.
>> Are there any suggestions on how to figure out if it's a problem with
>> the code or the OMPI installation/software on the system? We have tried
>> "--debug-daemons" with no new/interesting information being revealed.
>> Is there a way to trap segfault messages or more detailed MPI
>> transaction information or anything else that could help diagnose this?
>> V. Ram
>> users mailing list
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088