Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory
From: V. Ram (v_r_959_at_[hidden])
Date: 2008-10-10 12:29:41


Sorry for replying to this so late, but I have been away. Reply
below...

On Wed, 1 Oct 2008 11:58:30 -0400, "Aurélien Bouteiller"
<bouteill_at_[hidden]> said:
> If you have several network cards in your system, it can sometimes get
> the endpoints confused. Especially if you don't have the same number
> of cards or don't use the same subnet for all "eth0, eth1". You should
> try to restrict Open MPI to use only one of the available networks by
> using the --mca btl_tcp_if_include ethx parameter to mpirun, where x
> is the network interface that is always connected to the same logical
> and physical network on your machine.

I was pretty sure this wasn't the problem, since basically all the nodes
have only one interface configured, but I had the user try the --mca
btl_tcp_if_include parameter anyway. The same crash occurred.
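
For anyone reading the readv failures in the original report quoted below,
the errno values can be translated into symbolic names with a few lines of
Python. This is only a sketch: the helper name describe_errno is made up for
illustration, and the mapping is platform-dependent. On Linux, 110 and 104
normally correspond to ETIMEDOUT and ECONNRESET.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'code: NAME (message)' for an errno value on this platform."""
    # errno.errorcode maps numeric values to symbolic names for the
    # platform Python is running on; unknown values fall back to UNKNOWN.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # 110 and 104 are the values reported by mca_btl_tcp_frag_recv.
    for code in (110, 104):
        print(describe_errno(code))
```

Run this on one of the affected nodes so that the numbers are interpreted
with that system's errno table.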

>
> Aurelien
>
> On 1 Oct 2008, at 11:47, V. Ram wrote:
>
> > I wrote earlier about one of my users running a third-party Fortran
> > code on 32-bit x86 machines, using OMPI 1.2.7, that is having some
> > odd crash behavior.
> >
> > Our cluster's nodes all have 2 single-core processors. If this code
> > is run on 2 processors on 1 node, it runs seemingly fine. However,
> > if the job runs on 1 processor on each of 2 nodes (e.g., mpirun
> > --bynode), then it crashes and gives messages like:
> >
> > [node4][0,1,4][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > [node3][0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed with errno=110
> > mca_btl_tcp_frag_recv: readv failed with errno=104
> >
> > Essentially, if any network communication is involved, the job crashes
> > in this form.
> >
> > I do have another user that runs his own MPI code on 10+ of these
> > processors for days at a time without issue, so I don't think it's
> > hardware.
> >
> > The original code also runs fine across many networked nodes if the
> > architecture is x86-64 (also running OMPI 1.2.7).
> >
> > We have also tried different Fortran compilers (both PathScale and
> > gfortran) and keep getting these crashes.
> >
> > Are there any suggestions on how to figure out if it's a problem with
> > the code or the OMPI installation/software on the system? We have
> > tried "--debug-daemons" with no new/interesting information being
> > revealed. Is there a way to trap segfault messages or more detailed
> > MPI transaction information or anything else that could help
> > diagnose this?
> >
> > Thanks.
> > --
> > V. Ram
> > v_r_959_at_[hidden]

-- 
  V. Ram
  v_r_959_at_[hidden]