
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-10-20 20:59:50


The error message seems to imply that mpirun itself didn't segfault, but that something else did. Is that segfault PID from mpirun?

This kind of problem is usually caused by mismatched builds - i.e., you compile against your new build, but you pick up the Myrinet build when you try to run because of PATH and LD_LIBRARY_PATH issues. You might check to ensure you are running against what you built with.
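One quick way to check for that kind of mismatch is to see which mpirun is first on your PATH, which build it belongs to, and which shared libraries it will actually load. A minimal sketch (the grep patterns and the assumption that ompi_info is on the same PATH are illustrative, not specific to this installation):

```shell
# 1. Which mpirun is first on PATH? If this points into the Myricom
#    tree instead of your new PGI build, that is the mismatch.
command -v mpirun

# 2. Which Open MPI release and compiler produced this build?
mpirun --version
ompi_info | grep -i 'C compiler'

# 3. Which shared libraries will be loaded at run time? Stale entries
#    in LD_LIBRARY_PATH show up here as paths into the wrong install.
ldd "$(command -v mpirun)" | grep -i mpi
```

If step 1 or step 3 points into the Myricom-supplied tree, prepending your new build's bin and lib directories to PATH and LD_LIBRARY_PATH (or using the full path to the intended mpirun) should confirm whether the mismatch is the cause.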

On Oct 20, 2010, at 6:41 PM, Raymond Muno wrote:

> We are doing a test build of a new cluster. We are re-using our Myrinet 10G gear from a previous cluster.
>
> I have built OpenMPI 1.4.2 with PGI 10.4. We use this regularly on our Infiniband based cluster and all the install elements were readily available.
>
> After a few go-arounds with the Myrinet MX stack, we are now running MX 1.2.12 with allowances for more than the default maximum of 16 endpoints. Each node has 24 cores.
>
> The cluster is running Rocks 5.3.
>
> As part of the initial build, I installed the Myrinet MX Rocks Roll from Myricom. With the default limitation of 16 endpoints, we could not run on all nodes. As mentioned above, the MX stack was replaced.
>
> Myricom provided a build of OpenMPI 1.4.1. That build works, but it is compiled only with gcc and gfortran, and we wanted it built with the compilers we normally use, e.g., PGI, PathScale, and Intel.
>
> We can compile with the OpenMPI 1.4.2 / PGI 10.4 build. However, we cannot launch jobs with mpirun; it segfaults.
>
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> [enet1-head2-eth1:29532] *** Process received signal ***
> [enet1-head2-eth1:29532] Signal: Segmentation fault (11)
> [enet1-head2-eth1:29532] Signal code: Address not mapped (1)
> [enet1-head2-eth1:29532] Failing at address: 0x6c
> [enet1-head2-eth1:29532] *** End of error message ***
> Segmentation fault
>
> However, if we launch the job with the Myricom-supplied mpirun in the OpenMPI tree, the job runs successfully. This works even with a test program compiled with the OpenMPI 1.4.2 / PGI 10.4 build.
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users