Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault (11)
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-03-31 11:00:21


That is interesting. I cannot think of any reason why this would be a problem only in Open MPI. popen() is similar to fork()/system(), so you have to be careful with interconnects that do not play nicely with fork(), such as openib. But since it looks like you are excluding openib, that should not be the problem.

I wonder if this has something to do with the way we use BLCR (maybe we need to pass additional parameters to cr_checkpoint()). When the process fails, are there any messages in the system logs from BLCR indicating a problem it encountered? It is common for BLCR to post a 'socket open' warning, but that is expected/normal, since we leave TCP sockets open in most cases as an optimization. I am wondering if there is a warning about the popen'ed process.

Personally, I will not have an opportunity to look into this in more detail until probably mid-April. :/

Let me know what you find, and maybe we can sort out what is happening on the list.

-- Josh

On Mar 29, 2010, at 2:28 PM, Jean Potsam wrote:

> Hi Josh/All,
> I just tested a simple c application with blcr and it worked fine.
>
> ##########################################
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <signal.h>
>
> char * getprocessid()
> {
> FILE * read_fp;
> char buffer[BUFSIZ + 1];
> int chars_read;
> char * buffer_data="12345";
> memset(buffer, '\0', sizeof(buffer));
> read_fp = popen("uname -a", "r");
> /*
> ...
> */
> return buffer_data;
> }
>
> int main(int argc, char ** argv)
> {
>
> int rank;
> int size;
> char * thedata;
> int n=0;
> thedata=getprocessid();
> printf(" the data is %s", thedata);
>
> while( n <10)
> {
> printf("value is %d\n", n);
> n++;
> sleep(1);
> }
> printf("bye\n");
>
> }
>
>
> jean_at_sun32:/tmp$ cr_run ./pipetest3 &
> [1] 31807
> jean_at_sun32:~$ the data is 12345value is 0
> value is 1
> value is 2
> ...
> value is 9
> bye
>
> jean_at_sun32:/tmp$ cr_checkpoint 31807
>
> jean_at_sun32:/tmp$ cr_restart context.31807
> value is 7
> value is 8
> value is 9
> bye
>
> ##############################################
>
>
> It looks like it's more to do with Open MPI. Any ideas from your side?
>
> Thank you.
>
> Kind regards,
>
> Jean.
>
>
>
>
>
> --- On Mon, 29/3/10, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
> From: Josh Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] Segmentation fault (11)
> To: "Open MPI Users" <users_at_[hidden]>
> Date: Monday, 29 March, 2010, 16:08
>
> I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it?
>
> If so, we can start to see what Open MPI might be doing to confuse things, but I suspect this might be a bug in BLCR. Either way, let us know what you find out.
>
> Cheers,
> Josh
>
> On Mar 27, 2010, at 6:17 AM, jody wrote:
>
> > I'm not sure if this is the cause of your problems:
> > You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
> > Jody
> >
> >
> > On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam <jeanpotsam_at_[hidden]> wrote:
> > Dear All,
> > I am having a problem with Open MPI. I have installed Open MPI 1.4 and BLCR 0.8.1.
> >
> > I have written a small MPI application, as follows:
> >
> > #######################
> > #include <unistd.h>
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <fcntl.h>
> > #include <limits.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <mpi.h>
> > #include <signal.h>
> >
> > #define BUFFER_SIZE PIPE_BUF
> >
> > char * getprocessid()
> > {
> > FILE * read_fp;
> > char buffer[BUFSIZ + 1];
> > int chars_read;
> > char * buffer_data="12345";
> > memset(buffer, '\0', sizeof(buffer));
> > read_fp = popen("uname -a", "r");
> > /*
> > ...
> > */
> > return buffer_data;
> > }
> >
> > int main(int argc, char ** argv)
> > {
> > MPI_Status status;
> > int rank;
> > int size;
> > char * thedata;
> > MPI_Init(&argc, &argv);
> > MPI_Comm_size(MPI_COMM_WORLD,&size);
> > MPI_Comm_rank(MPI_COMM_WORLD,&rank);
> > thedata=getprocessid();
> > printf(" the data is %s", thedata);
> > MPI_Finalize();
> > }
> > ############################
> >
> > I get the following result:
> >
> > #######################
> > jean_at_sunn32:~$ mpicc pipetest2.c -o pipetest2
> > jean_at_sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
> > [sun32:19211] *** Process received signal ***
> > [sun32:19211] Signal: Segmentation fault (11)
> > [sun32:19211] Signal code: Address not mapped (1)
> > [sun32:19211] Failing at address: 0x4
> > [sun32:19211] [ 0] [0xb7f3c40c]
> > [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
> > [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) [0xb7a5925a]
> > [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
> > [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
> > [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
> > [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
> > [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
> > [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
> > [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
> > [sun32:19211] [10] pipetest2 [0x8048761]
> > [sun32:19211] *** End of error message ***
> > #####################################################
> >
> >
> > However, if I compile the application using gcc, it works fine. The problem arises with:
> > read_fp = popen("uname -a", "r");
> >
> > Does anyone have an idea how to resolve this problem?
> >
> > Many thanks
> >
> > Jean
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users