Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfault on any MPI communication on head node
From: Gus Correa (gus_at_[hidden])
Date: 2011-09-27 14:33:37


Any chance that the stack size limit on the head node is smaller
than on the compute nodes?
A small stack size can cause segfaults.
Check /etc/security/limits.conf (and man limits.conf).
You could set the stack size to unlimited (say, along with locked
memory, and perhaps raise the number of open files):

* - stack -1
* - memlock -1
* - nofile 4096

I hope this helps,
Gus Correa

Phillip Vassenkov wrote:
> Thanks, but my main concern is the segfault :P I changed it and, as I
> expected, it still segfaults.
>
> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>> Here is another possibly non-helpful suggestion. :) Change:
>>
>> char* name[20];
>> int maxlen = 20;
>>
>> To:
>>
>> char name[256];
>> int maxlen = 256;
>>
>> gethostname() is supposed to properly truncate the hostname it returns
>> if the actual name is longer than the length provided, but since you
>> have at least one that is longer than 20 characters, I'm curious.
>>
>> Brent
>>
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On Behalf Of Jeff Squyres
>> Sent: Tuesday, September 27, 2011 6:29 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>
>> Hmm. It's not immediately clear to me what's going wrong here.
>>
>> I hate to ask, but could you install a debugging version of Open MPI
>> and capture a proper stack trace of the segv?
>>
>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>> either the current one or the one from later today would be a good
>> datapoint)
>>
>>
>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>
>>> Yep, Fedora Core 14 and OpenMPI 1.4.3
>>>
>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>> Are you running the same OS version and Open MPI version between the
>>>> head node and regular nodes?
>>>>
>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>
>>>>> Hey all,
>>>>> I've been racking my brains over this for several days and was
>>>>> hoping someone could enlighten me. I'll describe only the relevant
>>>>> parts of the network/computer systems. There is one head node and a
>>>>> multitude of regular nodes. The regular nodes are all identical to
>>>>> each other. If I run an MPI program from one of the regular nodes
>>>>> to any other regular nodes, everything works. If I include the head
>>>>> node in the hosts file, I get segfaults, which I'll paste below
>>>>> along with sample code. The machines are all networked via
>>>>> InfiniBand and Ethernet. The issue only arises when MPI
>>>>> communication occurs. By this I mean that MPI_Init might succeed,
>>>>> but the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv.
>>>>> I found a workaround: disabling the openib BTL and forcing TCP
>>>>> communication over the InfiniBand interface (if I don't force it,
>>>>> traffic goes over Ethernet). This command works when the head node
>>>>> is included in the hosts file:
>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>
>>>>> Sample Code:
>>>>> #include "mpi.h"
>>>>> #include <stdio.h>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>     int rank, nprocs;
>>>>>     char* name[20];
>>>>>     int maxlen = 20;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>     gethostname(name, maxlen);
>>>>>     printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs, name);
>>>>>     fflush(stdout);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Segfault:
>>>>> [pastec:19917] *** Process received signal ***
>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>> [pastec:19917] Failing at address: 0x8
>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>>>> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>>>> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>>>> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>>>> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>>>> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
>>>>> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
>>>>> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>>>> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>>>> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>> [pastec:19917] *** End of error message ***
>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 1 with PID 19917 on node pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users