Open MPI User's Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-03-13 09:43:22


I was informed yesterday that we will not be doing any more bug fixes in the
1.1 series beyond what is in the soon-to-be-released 1.1.5. So I've been
asked to confine any "fix" activity to the 1.2 series about to be released.

Unfortunately, 1.1.5 won't solve the problem you noted. Tim tells me that he
has seen at least some indications of similar, but not identical, behavior
in 1.2, so I'll take a look at that and see if I can replicate the problem.
Any fixes, though, won't be available until at least a 1.2.1 update is
released (timing uncertain as 1.2 hasn't been released yet).

I'll try to post something back to the list when I dig a little further into
this.

Ralph

On 3/6/07 11:53 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:

> I believe I know what is happening here. My availability in the next week is
> pretty limited due to a family emergency, but I'll take a look when I get
> back. In brief, this is a resource starvation issue where the system thinks
> your node is unable to support any further processes and so it blocks.
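To make "resource starvation" concrete, here is a toy model of per-node slot accounting. It is only a sketch under stated assumptions: the type node_slots and the functions claim_slot, release_slot and run_cycles are hypothetical names invented for illustration, and none of this is taken from the Open MPI source. It shows how a launcher that never hands slots back after a child exits will eventually decide the node is full.

```c
#include <stdbool.h>

/* Hypothetical toy model of per-node slot accounting; not Open MPI code. */
typedef struct {
    int capacity;  /* maximum processes the node is believed to support */
    int in_use;    /* slots currently counted as occupied */
} node_slots;

/* Try to reserve one slot; returns false when the node looks full. */
static bool claim_slot(node_slots *n)
{
    if (n->in_use >= n->capacity) {
        return false;  /* a real launcher would block or fail here */
    }
    n->in_use++;
    return true;
}

/* Give a slot back when a process exits. */
static void release_slot(node_slots *n)
{
    if (n->in_use > 0) {
        n->in_use--;
    }
}

/* Run up to `rounds` spawn/exit cycles; return how many spawns succeeded. */
static int run_cycles(node_slots *n, int rounds, bool release_on_exit)
{
    int succeeded = 0;
    for (int i = 0; i < rounds; i++) {
        if (!claim_slot(n)) {
            break;  /* node considered unable to host another process */
        }
        succeeded++;
        if (release_on_exit) {
            release_slot(n);
        }
    }
    return succeeded;
}
```

With release_on_exit set to true, any number of cycles succeeds; with it set to false, the model stops after `capacity` spawns even though every earlier child has already finished.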
>
> On a separate note, I never use threaded configurations due to the lack of
> any real thread-safety review or testing of Open MPI to date (per Tim's
> earlier comment). My "standard" configuration for development and testing is
> with --disable-progress-threads --without-threads.
>
> I'll post something back to the list when I get it resolved.
>
> Thanks
> Ralph
>
>
> On 3/6/07 9:00 AM, "Rozzen.VINCONT_at_[hidden]"
> <Rozzen.VINCONT_at_[hidden]> wrote:
>
>> Hi Tim, I am getting back to you:
>>
>> "What kind of system is it?"
>> =>The system is a "Debian Sarge".
>> "How many nodes are you running on?"
>> => There is no cluster configured, so I guess I am working without any node
>> environment.
>> "Have you been able to try a more recent version of Open MPI?"
>> => Today, I tried with version 1.1.4, but the results are no better.
>> I tested 2 cases:
>> Test 1: with the same configuration options (./configure
>> --enable-mpi-threads --enable-progress-threads --with-threads=posix
>> --enable-smp-locks)
>> The program stopped in MPI_Init_thread, in __lll_mutex_lock_wait () from
>> /lib/tls/libpthread.so.0
>>
>> Test 2: with the default configuration options (./configure
>> --prefix=/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread)
>> The program stopped at the "node allocation" step after spawn number 31.
>> Maybe the problem comes from the lack of a node definition?
>> Thanks for your help.
>>
>> Below are the log files from the two tests.
>>
>> /******************************TEST 1*******************************/
>> GNU gdb 6.3-debian
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB. Type "show warranty" for details.
>> This GDB was configured as "i386-linux"...Using host libthread_db library
>> "/lib/tls/libthread_db.so.1".
>>
>> (gdb) run
>> Starting program: /home/workspace/test_spaw1/src/spawn
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1076646560 (LWP 5178)]
>> main*******************************
>> main : Lancement MPI*
>> [New Thread 1085225904 (LWP 5181)]
>> [New Thread 1094495152 (LWP 5182)]
>>
>> Program received signal SIGINT, Interrupt.
>> [Switching to Thread 1076646560 (LWP 5178)]
>> 0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> (gdb) where
>> #0 0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> #1 0x40187893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> #2 0xbffff508 in ?? ()
>> #3 0x4000bcd0 in _dl_map_object_deps () from /lib/ld-linux.so.2
>> #4 0x40b9f8cb in mca_btl_tcp_component_create_listen () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
>> #5 0x40b9f8cb in mca_btl_tcp_component_create_listen () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
>> #6 0x40b9eef4 in mca_btl_tcp_component_init () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
>> #7 0x4008c652 in mca_btl_base_select () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #8 0x40b8dd28 in mca_bml_r2_component_init () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_bml_r2.so
>> #9 0x4008bf54 in mca_bml_base_init () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #10 0x40b7e5c9 in mca_pml_ob1_component_init () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_pml_ob1.so
>> #11 0x40094192 in mca_pml_base_select () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #12 0x4005742c in ompi_mpi_init () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #13 0x4007c182 in PMPI_Init_thread () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #14 0x080489f3 in main (argc=1, argv=0xbffff8a4) at spawn6.c:33
>>
>>
>>
>> /******************************TEST 2*******************************/
>>
>> GNU gdb 6.3-debian
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB. Type "show warranty" for details.
>> This GDB was configured as "i386-linux"...Using host libthread_db library
>> "/lib/tls/libthread_db.so.1".
>>
>> (gdb) run -np 1 --host myhost spawn6
>> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np 1 --host myhost spawn6
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1076121728 (LWP 4022)]
>> main*******************************
>> main : Lancement MPI*
>> Exe : Lance
>> Exe: lRankExe = 1 lRankMain = 0
>> 1 main***MPI_Comm_spawn return : 0
>> 1 main***Rang main : 0 Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>>
>>
>> Exe: lRankExe = 1 lRankMain = 0
>> 2 main***MPI_Comm_spawn return : 0
>> 2 main***Rang main : 0 Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>>
>> ...
>>
>> Exe: lRankExe = 1 lRankMain = 0
>> 30 main***MPI_Comm_spawn return : 0
>> 30 main***Rang main : 0 Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>>
>> Exe: lRankExe = 1 lRankMain = 0
>> 31 main***MPI_Comm_spawn return : 0
>> 31 main***Rang main : 0 Rang exe : 1
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 1076121728 (LWP 4022)]
>> 0x4018833b in strlen () from /lib/tls/libc.so.6
>> (gdb) where
>> #0 0x4018833b in strlen () from /lib/tls/libc.so.6
>> #1 0x40297c5e in orte_gpr_replica_create_itag () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #2 0x4029d2df in orte_gpr_replica_put_fn () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #3 0x40297281 in orte_gpr_replica_put () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #4 0x40048287 in orte_ras_base_node_assign () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #5 0x400463e1 in orte_ras_base_allocate_nodes () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #6 0x402c2bb8 in orte_ras_hostfile_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
>> #7 0x400464e0 in orte_ras_base_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #8 0x402b063f in orte_rmgr_urm_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
>> #9 0x4004f277 in orte_rmgr_base_cmd_dispatch () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #10 0x402b10ae in orte_rmgr_urm_recv () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
>> #11 0x4004301e in mca_oob_recv_callback () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #12 0x4027a748 in mca_oob_tcp_msg_data () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
>> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
>> #14 0x400703f9 in opal_event_loop () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
>> #15 0x4006adfa in opal_progress () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
>> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at
>> condition.h:81
>> #17 0x0804a4c8 in orterun (argc=6, argv=0xbffff854) at orterun.c:427
>> #18 0x08049dd6 in main (argc=6, argv=0xbffff854) at main.c:13
>> (gdb)
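Frame #0 of the Test 2 backtrace is strlen(), called from orte_gpr_replica_create_itag(). A SIGSEGV inside strlen() usually means it was handed a NULL or otherwise invalid pointer; the backtrace alone does not show which string was bad or why. As a general defensive pattern only (safe_strlen is a hypothetical helper, not an Open MPI function, and not a claim about how the bug was fixed), a caller can guard the pointer before measuring it:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Open MPI:
 * returns the length of s, or 0 when s is NULL. */
static size_t safe_strlen(const char *s)
{
    return (s != NULL) ? strlen(s) : 0;
}
```

A guard like this only hides the symptom; the NULL string still has to be traced back to whatever failed to fill it in.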
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On behalf of
>> Tim Prins
>> Sent: Monday, March 5, 2007 22:34
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Comm_Spawn
>>
>>
>> Never mind, I was just able to replicate it. I'll look into it.
>>
>> Tim
>>
>> On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:
>>
>>> That is possible. Threading support is VERY lightly tested, but I
>>> doubt it is the problem since it always fails after 31 spawns.
>>>
>>> Again, I have tried with these configure options and the same version
>>> of Open MPI and still have not been able to replicate this (after
>>> letting it spawn over 500 times). Have you been able to try a more
>>> recent version of Open MPI? What kind of system is it? How many nodes
>>> are you running on?
>>>
>>> Tim
>>>
>>> On Mar 5, 2007, at 1:21 PM, Rozzen.VINCONT_at_[hidden] wrote:
>>>
>>>>
>>>> Maybe the problem comes from the configuration options.
>>>> The configuration options used are :
>>>> ./configure --enable-mpi-threads --enable-progress-threads
>>>> --with-threads=posix --enable-smp-locks
>>>> Could you give me your opinion on that, please?
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>>> On behalf of Ralph H Castain
>>>> Sent: Tuesday, February 27, 2007 16:26
>>>> To: Open MPI Users <users_at_[hidden]>
>>>> Subject: Re: [OMPI users] MPI_Comm_Spawn
>>>>
>>>>
>>>> Now that's interesting! There shouldn't be a limit, but to be honest, I've
>>>> never tested that mode of operation - let me look into it and see. It sounds
>>>> like there is some counter that is overflowing, but I'll look.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> On 2/27/07 8:15 AM, "Rozzen.VINCONT_at_[hidden]"
>>>> <Rozzen.VINCONT_at_[hidden]> wrote:
>>>>
>>>>> Do you know if there is a limit to the number of MPI_Comm_spawn calls we can
>>>>> use to launch a program?
>>>>> I want to start and stop a program several times (with the function
>>>>> MPI_Comm_spawn), but every time, after 31 MPI_Comm_spawn calls, I get a
>>>>> "segmentation fault".
>>>>> Could you give me your point of view on how to solve this problem?
>>>>> Thanks
>>>>>
>>>>> /* Main .c file: spawns the Exe program */
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <unistd.h>
>>>>> #include "mpi.h"
>>>>> #include <pthread.h>
>>>>> #include <signal.h>
>>>>> #include <sys/time.h>
>>>>> #include <errno.h>
>>>>> #define EXE_TEST "/home/workspace/test_spaw1/src/Exe"
>>>>>
>>>>>
>>>>>
>>>>> int main( int argc, char **argv ) {
>>>>>
>>>>> long *lpBufferMpi;
>>>>> MPI_Comm lIntercom;
>>>>> int lErrcode;
>>>>> MPI_Comm lCommunicateur;
>>>>> int lRangMain,lRangExe,lMessageEnvoi,lIter,NiveauThreadVoulu,
>>>>> NiveauThreadObtenu,lTailleBuffer;
>>>>> int *lpMessageEnvoi=&lMessageEnvoi;
>>>>> MPI_Status lStatus; /* receive status */
>>>>>
>>>>> lIter=0;
>>>>>
>>>>>
>>>>> /* MPI environment */
>>>>>
>>>>> printf("main*******************************\n");
>>>>> printf("main : Lancement MPI*\n");
>>>>>
>>>>> NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>> MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
>>>>> &NiveauThreadObtenu );
>>>>> lpBufferMpi = calloc( 10000, sizeof(long));
>>>>> MPI_Buffer_attach( (void*)lpBufferMpi, 10000 * sizeof(long) );
>>>>>
>>>>> while (lIter<1000){
>>>>>     lIter ++;
>>>>>     lIntercom = MPI_COMM_NULL;
>>>>>
>>>>>     MPI_Comm_spawn( EXE_TEST, MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>>>>>                     0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
>>>>>     printf( "%i main***MPI_Comm_spawn return : %d\n",lIter,
>>>>>             lErrcode );
>>>>>
>>>>>     if(lIntercom == MPI_COMM_NULL ){
>>>>>         printf("%i Intercom null\n",lIter);
>>>>>         return 0;
>>>>>     }
>>>>>     MPI_Intercomm_merge(lIntercom, 0,&lCommunicateur );
>>>>>     MPI_Comm_rank( lCommunicateur, &lRangMain);
>>>>>     lRangExe=1-lRangMain;
>>>>>
>>>>>     printf("%i main***Rang main : %i Rang exe : %i \n",
>>>>>            lIter,(int)lRangMain,(int)lRangExe);
>>>>>     sleep(2);
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>> /* Shut down the MPI environment */
>>>>> lTailleBuffer=10000* sizeof(long);
>>>>> MPI_Buffer_detach( &lpBufferMpi, &lTailleBuffer );
>>>>> MPI_Comm_free( &lCommunicateur );
>>>>> MPI_Finalize( );
>>>>> free( lpBufferMpi );
>>>>>
>>>>> printf( "Main = End .\n" );
>>>>> return 0;
>>>>>
>>>>> }
>>>>> /********************************************************************/
>>>>> Exe:
>>>>> #include <string.h>
>>>>> #include <stdlib.h>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h> /* for sleep() */
>>>>> #include <pthread.h>
>>>>> #include <semaphore.h>
>>>>> #include "mpi.h"
>>>>>
>>>>> int main( int argc, char **argv ) {
>>>>> /* 1) for MPI communication */
>>>>> MPI_Comm lCommunicateur; /* communicator of this process */
>>>>> MPI_Comm CommParent; /* parent communicator to retrieve */
>>>>> int lRank; /* rank of this process in the communicator */
>>>>> int lRangMain; /* rank of the sequencer when launched in normal mode */
>>>>> int lTailleCommunicateur; /* size of the communicator */
>>>>> long *lpBufferMpi = NULL; /* message buffer */
>>>>> int lBufferSize; /* buffer size */
>>>>>
>>>>> /* 2) for the threads */
>>>>> int NiveauThreadVoulu, NiveauThreadObtenu;
>>>>>
>>>>>
>>>>> lCommunicateur = MPI_COMM_NULL;
>>>>> NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>> int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
>>>>> &NiveauThreadObtenu );
>>>>>
>>>>> if (erreur != MPI_SUCCESS){
>>>>> printf("erreur\n");
>>>>> return -1; /* the buffer is not allocated yet, so there is nothing to free */
>>>>> }
>>>>>
>>>>> /* 2) Attach a buffer for the message */
>>>>> lBufferSize=10000 * sizeof(long);
>>>>> lpBufferMpi = calloc( 10000, sizeof(long));
>>>>> erreur = MPI_Buffer_attach( (void*)lpBufferMpi, lBufferSize );
>>>>>
>>>>> if (erreur!=0){
>>>>> printf("erreur\n");
>>>>> free( lpBufferMpi );
>>>>> return -1;
>>>>> }
>>>>>
>>>>> printf( "Exe : Lance \n" );
>>>>> MPI_Comm_get_parent(&CommParent);
>>>>> MPI_Intercomm_merge( CommParent, 1, &lCommunicateur );
>>>>> MPI_Comm_rank( lCommunicateur, &lRank );
>>>>> MPI_Comm_size( lCommunicateur, &lTailleCommunicateur );
>>>>> lRangMain =1-lRank;
>>>>> printf( "Exe: lRankExe = %d lRankMain = %d\n", lRank, lRangMain );
>>>>>
>>>>> sleep(1);
>>>>> MPI_Buffer_detach( &lpBufferMpi, &lBufferSize );
>>>>> MPI_Comm_free( &lCommunicateur );
>>>>> MPI_Finalize( );
>>>>> free( lpBufferMpi );
>>>>> printf( "Exe: Fin.\n\n\n" );
>>>>> return 0;
>>>>> }
>>>>>
>>>>>
>>>>> /********************************************************************/
>>>>> Result:
>>>>> main*******************************
>>>>> main : Lancement MPI*
>>>>> 1 main***MPI_Comm_spawn return : 0
>>>>> Exe : Lance
>>>>> 1 main***Rang main : 0 Rang exe : 1
>>>>> Exe: lRankExe = 1 lRankMain = 0
>>>>> Exe: Fin.
>>>>>
>>>>>
>>>>> 2 main***MPI_Comm_spawn return : 0
>>>>> Exe : Lance
>>>>> 2 main***Rang main : 0 Rang exe : 1
>>>>> Exe: lRankExe = 1 lRankMain = 0
>>>>> Exe: Fin.
>>>>>
>>>>>
>>>>> 3 main***MPI_Comm_spawn return : 0
>>>>> Exe : Lance
>>>>> 3 main***Rang main : 0 Rang exe : 1
>>>>> Exe: lRankExe = 1 lRankMain = 0
>>>>> Exe: Fin.
>>>>>
>>>>> ....
>>>>>
>>>>> 30 main***MPI_Comm_spawn return : 0
>>>>> Exe : Lance
>>>>> 30 main***Rang main : 0 Rang exe : 1
>>>>> Exe: lRankExe = 1 lRankMain = 0
>>>>> Exe: Fin.
>>>>>
>>>>>
>>>>> 31 main***MPI_Comm_spawn return : 0
>>>>> Exe : Lance
>>>>> 31 main***Rang main : 0 Rang exe : 1
>>>>> Exe: lRankExe = 1 lRankMain = 0
>>>>> Segmentation fault
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users