
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-10-01 11:58:30


If you have several network cards in your system, Open MPI can
sometimes get the endpoints confused, especially if your nodes don't
all have the same number of cards or don't use the same subnet for
each of eth0, eth1, and so on. You should try to restrict Open MPI to
only one of the available networks by passing the --mca
btl_tcp_if_include ethx parameter to mpirun, where ethx is the
interface that is always connected to the same logical and physical
network on every machine.
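
For example, assuming eth0 is the interface that sits on the common
subnet on every node (eth0 and ./your_app are just placeholders;
substitute the names that apply on your cluster), the two-node run
described below would look roughly like:

  mpirun --mca btl_tcp_if_include eth0 --bynode -np 2 ./your_app

If there is an interface you know should never be used (a management
network, for instance), you can instead exclude it with the
corresponding --mca btl_tcp_if_exclude parameter.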

Aurelien

Le 1 oct. 08 à 11:47, V. Ram a écrit :

> I wrote earlier about one of my users running a third-party Fortran
> code on 32-bit x86 machines, using OMPI 1.2.7, that is having some
> odd crash behavior.
>
> Our cluster's nodes all have 2 single-core processors. If this code
> is run on 2 processors on 1 node, it runs seemingly fine. However,
> if the job runs on 1 processor on each of 2 nodes (e.g., mpirun
> --bynode), then it crashes and gives messages like:
>
> [node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> [node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed with errno=110
> mca_btl_tcp_frag_recv: readv failed with errno=104
>
> Essentially, if any network communication is involved, the job crashes
> in this form.
>
> I do have another user that runs his own MPI code on 10+ of these
> processors for days at a time without issue, so I don't think it's
> hardware.
>
> The original code also runs fine across many networked nodes if the
> architecture is x86-64 (also running OMPI 1.2.7).
>
> We have also tried different Fortran compilers (both PathScale and
> gfortran) and keep getting these crashes.
>
> Are there any suggestions on how to figure out if it's a problem with
> the code or the OMPI installation/software on the system? We have
> tried "--debug-daemons" with no new/interesting information being
> revealed.
> Is there a way to trap segfault messages or more detailed MPI
> transaction information or anything else that could help diagnose
> this?
>
> Thanks.
> --
> V. Ram
> v_r_959_at_[hidden]
>
> --
> http://www.fastmail.fm - Same, same, but different...
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users