Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfault on any MPI communication on head node
From: Phillip Vassenkov (phillip.vassenkov_at_[hidden])
Date: 2011-10-07 12:18:09


Okay, so I finally have a matching set of debug Open MPI installs. Here is
the output:

[phillipv@pastec thomastests]$ mpicc testCode2.c -o b.out; mpirun --hostfile hostfile -np 2 ./b.out
Enter passphrase for key '/home/phillipv/.ssh/id_rsa':
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   orte_grpcomm_modex failed
   --> Returned "Data unpack would read past end of buffer" (-26)
instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[pastec.gtri.gatech.edu:31031] Abort before MPI_INIT completed
successfully; not able to guarantee that all other processes were killed!
[pastec.gtri.gatech.edu:31031] [[31908,1],1] ORTE_ERROR_LOG: Data unpack
would read past end of buffer in file grpcomm_bad_module.c at line 535
[compute-4-17.local:21269] [[31908,1],0] ORTE_ERROR_LOG: Data unpack
would read past end of buffer in file grpcomm_bad_module.c at line 535
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[compute-4-17.local:21269] Abort before MPI_INIT completed successfully;
not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 31031 on
node pastec.gtri.gatech.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[pastec.gtri.gatech.edu:31027] 1 more process has sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
[pastec.gtri.gatech.edu:31027] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages

On 10/3/11 8:28 PM, Ralph Castain wrote:
> That means you have mismatched installations around - one configured as debug, and one not. They have to match.
>
> On Oct 3, 2011, at 2:44 PM, Phillip Vassenkov <phillip.vassenkov_at_[hidden]> wrote:
>
>> I went into the directory that I used to install 1.4.3 and did the following:
>> make clean
>> ./configure --enable-debug
>> make -j8 all install
>>
>> and it hangs with the following error when I try to run my code (I commented out all the host name stuff, so it's just MPI code now):
>>
>> [hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2600
>>
>> I'm googling for more info but does anyone have any ideas?
>>
>> On 9/28/11 8:30 PM, Jeff Squyres wrote:
>>> Use --enable-debug on your configure line. This will add in some debugging code to OMPI, and it'll compile everything with -g so that you can get stack traces.
>>>
>>> Beware that the extra debugging junk makes OMPI slightly slower; don't do any benchmarking with this install, etc.
>>>
>>>
>>> On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:
>>>
>>>> I tried 1.4.4rc4, same problem. Where do I get a debugging version?
>>>>
>>>> On 9/28/11 8:32 AM, Jeff Squyres wrote:
>>>>> Agreed that the original program had the char*[20]/char[20] bug, but his segv is occurring before trying to use that array. So it's a bug - but he just hadn't hit it yet. :-)
>>>>>
>>>>> I'd still like to see a debugging version so that we can get a real stack trace, and/or try the latest 1.4.4 RC (posted yesterday).
>>>>>
>>>>>
>>>>> On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:
>>>>>
>>>>>> char* name[20]; yields 20 (uninitialized) pointers to char; I guess you mean
>>>>>> char name[20];
>>>>>>
>>>>>> So Brent's suggestion should work as well(?)
>>>>>>
>>>>>> To be safe I would also add:
>>>>>> gethostname(name,maxlen);
>>>>>> name[19] = '\0';
>>>>>> printf("Hello, world. I am %d of %d and host %s \n", rank, ...
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
>>>>>>> Thanks, but my main concern is the segfault :P I changed it and, as I
>>>>>>> expected, it still segfaults.
>>>>>>>
>>>>>>> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>>>>>>>> Here is another possibly non-helpful suggestion. :) Change:
>>>>>>>>
>>>>>>>> char* name[20];
>>>>>>>> int maxlen = 20;
>>>>>>>>
>>>>>>>> To:
>>>>>>>>
>>>>>>>> char name[256];
>>>>>>>> int maxlen = 256;
>>>>>>>>
>>>>>>>> gethostname() is supposed to properly truncate the hostname it returns
>>>>>>>> if the actual name is longer than the length provided, but since you
>>>>>>>> have at least one that is longer than 20 characters, I'm curious.
>>>>>>>>
>>>>>>>> Brent
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>>>>>> On Behalf Of Jeff Squyres
>>>>>>>> Sent: Tuesday, September 27, 2011 6:29 AM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>>>>>>>
>>>>>>>> Hmm. It's not immediately clear to me what's going wrong here.
>>>>>>>>
>>>>>>>> I hate to ask, but could you install a debugging version of Open MPI
>>>>>>>> and capture a proper stack trace of the segv?
>>>>>>>>
>>>>>>>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>>>>>>>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>>>>>>>> either the current one or the one from later today would be a good
>>>>>>>> datapoint)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>>>>>>>
>>>>>>>>> Yep, Fedora Core 14 and OpenMPI 1.4.3
>>>>>>>>>
>>>>>>>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>>>>>>>> Are you running the same OS version and Open MPI version between the
>>>>>>>>>> head node and regular nodes?
>>>>>>>>>>
>>>>>>>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>> I've been racking my brains over this for several days and was
>>>>>>>>>>> hoping someone could enlighten me. I'll describe only the relevant
>>>>>>>>>>> parts of the network/computer systems. There is one head node and a
>>>>>>>>>>> multitude of regular nodes. The regular nodes are all identical to
>>>>>>>>>>> each other. If I run an MPI program from one of the regular nodes
>>>>>>>>>>> to any other regular nodes, everything works. If I include the head
>>>>>>>>>>> node in the hosts file, I get segfaults, which I'll paste below
>>>>>>>>>>> along with sample code. The machines are all networked via
>>>>>>>>>>> InfiniBand and Ethernet. The issue only arises when MPI
>>>>>>>>>>> communication occurs. By this I mean that MPI_Init might succeed, but
>>>>>>>>>>> the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv. I
>>>>>>>>>>> found a workaround: disabling the openib BTL and forcing TCP
>>>>>>>>>>> traffic onto the InfiniBand interface, ib0 (if I don't force ib0,
>>>>>>>>>>> it'll go over Ethernet). This command works when the head node is
>>>>>>>>>>> included in the hosts file:
>>>>>>>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>>>>>>>
>>>>>>>>>>> Sample Code:
>>>>>>>>>>> #include "mpi.h"
>>>>>>>>>>> #include<stdio.h>
>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>> {
>>>>>>>>>>> int rank, nprocs;
>>>>>>>>>>> char* name[20];
>>>>>>>>>>> int maxlen = 20;
>>>>>>>>>>> MPI_Init(&argc,&argv);
>>>>>>>>>>> MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>>>>>>>>>>> MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>>>>>>>>> MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>> gethostname(name,maxlen);
>>>>>>>>>>> printf("Hello, world. I am %d of %d and host %s \n", rank,
>>>>>>>>>>> nprocs,name);
>>>>>>>>>>> fflush(stdout);
>>>>>>>>>>> MPI_Finalize();
>>>>>>>>>>> return 0;
>>>>>>>>>>>
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Segfault:
>>>>>>>>>>> [pastec:19917] *** Process received signal ***
>>>>>>>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>>>>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>>>>>>>> [pastec:19917] Failing at address: 0x8
>>>>>>>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>>>>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa)
>>>>>>>>>>> [0x7eff6430b6aa]
>>>>>>>>>>> [pastec:19917] [ 2]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9)
>>>>>>>>>>> [0x7eff66a163c9]
>>>>>>>>>>> [pastec:19917] [ 3]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70)
>>>>>>>>>>> [0x7eff66a21b70]
>>>>>>>>>>> [pastec:19917] [ 4]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89)
>>>>>>>>>>> [0x7eff66a21c89]
>>>>>>>>>>> [pastec:19917] [ 5]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d)
>>>>>>>>>>> [0x7eff66a1703d]
>>>>>>>>>>> [pastec:19917] [ 6]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6)
>>>>>>>>>>> [0x7eff676670e6]
>>>>>>>>>>> [pastec:19917] [ 7]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273)
>>>>>>>>>>> [0x7eff6765b273]
>>>>>>>>>>> [pastec:19917] [ 8]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f)
>>>>>>>>>>> [0x7eff65539b2f]
>>>>>>>>>>> [pastec:19917] [ 9]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf)
>>>>>>>>>>> [0x7eff655425cf]
>>>>>>>>>>> [pastec:19917] [10]
>>>>>>>>>>> /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>>>>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>>>>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd)
>>>>>>>>>>> [0x34a841ee5d]
>>>>>>>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>>>>>>>> [pastec:19917] *** End of error message ***
>>>>>>>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1]
>>>>>>>>>>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> mpirun noticed that process rank 1 with PID 19917 on node
>>>>>>>>>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users