Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfault on any MPI communication on head node
From: Phillip Vassenkov (phillip.vassenkov_at_[hidden])
Date: 2011-10-07 12:18:09


Okay, so I finally have a matching set of debug Open MPI installs. Here is
the output:

[phillipv_at_pastec thomastests]$mpicc testCode2.c -o b.out;mpirun
--hostfile hostfile -np 2 ./b.out
Enter passphrase for key '/home/phillipv/.ssh/id_rsa':
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   orte_grpcomm_modex failed
   --> Returned "Data unpack would read past end of buffer" (-26)
instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[pastec.gtri.gatech.edu:31031] Abort before MPI_INIT completed
successfully; not able to guarantee that all other processes were killed!
[pastec.gtri.gatech.edu:31031] [[31908,1],1] ORTE_ERROR_LOG: Data unpack
would read past end of buffer in file grpcomm_bad_module.c at line 535
[compute-4-17.local:21269] [[31908,1],0] ORTE_ERROR_LOG: Data unpack
would read past end of buffer in file grpcomm_bad_module.c at line 535
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[compute-4-17.local:21269] Abort before MPI_INIT completed successfully;
not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 31031 on
node pastec.gtri.gatech.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[pastec.gtri.gatech.edu:31027] 1 more process has sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
[pastec.gtri.gatech.edu:31027] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages

On 10/3/11 8:28 PM, Ralph Castain wrote:
> That means you have mismatched installations around - one configured as debug, and one not. They have to match.
>
> Sent from my iPad
>
> On Oct 3, 2011, at 2:44 PM, Phillip Vassenkov<phillip.vassenkov_at_[hidden]> wrote:
>
>> I went into the directory that I used to install 1.4.3, did the following:
>> make clean
>> ./configure --enable-debug
>> make -j8 all install
>>
>> and it hangs with this error when I try to run my code (I commented out all the hostname stuff, so it's just MPI code now):
>>
>> [hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2600
>>
>> I'm googling for more info but does anyone have any ideas?
>>
>> On 9/28/11 8:30 PM, Jeff Squyres wrote:
>>> Use --enable-debug on your configure line. This will add in some debugging code to OMPI, and it'll compile everything with -g so that you can get stack traces.
>>>
>>> Beware that the extra debugging junk makes OMPI slightly slower; don't do any benchmarking with this install, etc.
>>>
>>>
>>> On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:
>>>
>>>> I tried 1.4.4rc4, same problem. Where do I get a debugging version?
>>>>
>>>> On 9/28/11 8:32 AM, Jeff Squyres wrote:
>>>>> Agreed that the original program had the char*[20]/char[20] bug, but his segv is occurring before trying to use that array. So it's a bug - but he just hadn't hit it yet. :-)
>>>>>
>>>>> I'd still like to see a debugging version so that we can get a real stack trace, and/or try the latest 1.4.4 RC (posted yesterday).
>>>>>
>>>>>
>>>>> On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:
>>>>>
>>>>>> char* name[20]; yields 20 (uninitialized) pointers to char; I guess you mean
>>>>>> char name[20];
>>>>>>
>>>>>> So Brent's suggestion should work as well(?)
>>>>>>
>>>>>> To be safe I would also add:
>>>>>> gethostname(name,maxlen);
>>>>>> name[19] = '\0';
>>>>>> printf("Hello, world. I am %d of %d and host %s \n", rank, ...
>>>>>>
>>>>>> Cheers
>>>>>>
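Putting Brent's and German's suggestions together, a corrected version of
the test program might look like the sketch below. This is only an
illustration based on the fixes discussed in this thread; the 256-byte
buffer and the <unistd.h> include are assumptions, not code from the
original post.

#include "mpi.h"
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    char name[256];   /* a character buffer, not an array of pointers */
    int maxlen = 256;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    /* gethostname() truncates a too-long name but may leave the buffer
       unterminated, so terminate it explicitly to be safe. */
    gethostname(name, maxlen);
    name[maxlen - 1] = '\0';

    printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}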
>>>>>> On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
>>>>>>> Thanks, but my main concern is the segfault :P I changed it and, as I
>>>>>>> expected, it still segfaults.
>>>>>>>
>>>>>>> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>>>>>>>> Here is another possibly non-helpful suggestion. :) Change:
>>>>>>>>
>>>>>>>> char* name[20];
>>>>>>>> int maxlen = 20;
>>>>>>>>
>>>>>>>> To:
>>>>>>>>
>>>>>>>> char name[256];
>>>>>>>> int maxlen = 256;
>>>>>>>>
>>>>>>>> gethostname() is supposed to properly truncate the hostname it returns
>>>>>>>> if the actual name is longer than the length provided, but since you
>>>>>>>> have at least one that is longer than 20 characters, I'm curious.
>>>>>>>>
>>>>>>>> Brent
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>>>>>> On Behalf Of Jeff Squyres
>>>>>>>> Sent: Tuesday, September 27, 2011 6:29 AM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>>>>>>>
>>>>>>>> Hmm. It's not immediately clear to me what's going wrong here.
>>>>>>>>
>>>>>>>> I hate to ask, but could you install a debugging version of Open MPI
>>>>>>>> and capture a proper stack trace of the segv?
>>>>>>>>
>>>>>>>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>>>>>>>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>>>>>>>> either the current one or the one from later today would be a good
>>>>>>>> datapoint)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>>>>>>>
>>>>>>>>> Yep, Fedora Core 14 and OpenMPI 1.4.3
>>>>>>>>>
>>>>>>>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>>>>>>>> Are you running the same OS version and Open MPI version between the
>>>>>>>>>> head node and regular nodes?
>>>>>>>>>>
>>>>>>>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>> I've been racking my brains over this for several days and was
>>>>>>>>>>> hoping anyone could enlighten me. I'll describe only the relevant
>>>>>>>>>>> parts of the network/computer systems. There is one head node and a
>>>>>>>>>>> multitude of regular nodes. The regular nodes are all identical to
>>>>>>>>>>> each other. If I run an MPI program from one of the regular nodes
>>>>>>>>>>> to any other regular nodes, everything works. If I include the head
>>>>>>>>>>> node in the hosts file, I get segfaults, which I'll paste below
>>>>>>>>>>> along with sample code. The machines are all networked via
>>>>>>>>>>> InfiniBand and Ethernet. The issue only arises when MPI
>>>>>>>>>>> communication occurs. By this I mean, MPI_Init might succeed, but
>>>>>>>>>>> the segfault always occurs on MPI_Barrier or MPI_Send/Recv. I found
>>>>>>>>>>> a workaround by disabling the openib BTL and forcing communications
>>>>>>>>>>> to go over InfiniBand (if I don't force InfiniBand,
>>>>>>>>>>> it'll go over Ethernet). This command works when the head node is
>>>>>>>>>>> included in the hosts file:
>>>>>>>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca
>>>>>>>>>>> btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>>>>>>>
>>>>>>>>>>> Sample Code:
>>>>>>>>>>> #include "mpi.h"
>>>>>>>>>>> #include<stdio.h>
>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>> {
>>>>>>>>>>> int rank, nprocs;
>>>>>>>>>>> char* name[20];
>>>>>>>>>>> int maxlen = 20;
>>>>>>>>>>> MPI_Init(&argc,&argv);
>>>>>>>>>>> MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>>>>>>>>>>> MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>>>>>>>>> MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>> gethostname(name,maxlen);
>>>>>>>>>>> printf("Hello, world. I am %d of %d and host %s \n", rank,
>>>>>>>>>>> nprocs,name);
>>>>>>>>>>> fflush(stdout);
>>>>>>>>>>> MPI_Finalize();
>>>>>>>>>>> return 0;
>>>>>>>>>>>
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Segfault:
>>>>>>>>>>> [pastec:19917] *** Process received signal ***
>>>>>>>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>>>>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>>>>>>>> [pastec:19917] Failing at address: 0x8
>>>>>>>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>>>>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa)
>>>>>>>>>>> [0x7eff6430b6aa]
>>>>>>>>>>> [pastec:19917] [ 2]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9)
>>>>>>>>>>> [0x7eff66a163c9]
>>>>>>>>>>> [pastec:19917] [ 3]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70)
>>>>>>>>>>> [0x7eff66a21b70]
>>>>>>>>>>> [pastec:19917] [ 4]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89)
>>>>>>>>>>> [0x7eff66a21c89]
>>>>>>>>>>> [pastec:19917] [ 5]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d)
>>>>>>>>>>> [0x7eff66a1703d]
>>>>>>>>>>> [pastec:19917] [ 6]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6)
>>>>>>>>>>> [0x7eff676670e6]
>>>>>>>>>>> [pastec:19917] [ 7]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273)
>>>>>>>>>>> [0x7eff6765b273]
>>>>>>>>>>> [pastec:19917] [ 8]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f)
>>>>>>>>>>> [0x7eff65539b2f]
>>>>>>>>>>> [pastec:19917] [ 9]
>>>>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf)
>>>>>>>>>>> [0x7eff655425cf]
>>>>>>>>>>> [pastec:19917] [10]
>>>>>>>>>>> /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>>>>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>>>>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd)
>>>>>>>>>>> [0x34a841ee5d]
>>>>>>>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>>>>>>>> [pastec:19917] *** End of error message ***
>>>>>>>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1]
>>>>>>>>>>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> mpirun noticed that process rank 1 with PID 19917 on node
>>>>>>>>>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>