Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault on master task
From: Eduardo Morras (nec556_at_[hidden])
Date: 2012-05-02 06:00:07

At 08:51 02/05/2012, you wrote:
>I am trying to execute the following code on a cluster.

>run_kernel is a cuda call with signature int run_kernel(int
>array[],int nelements, int taskid, char hostname[]);

... deleted code

>mysum = run_kernel(&onearray[20000000], chunksize, taskid, myname);

... more deleted code

>I am simply trying to calculate the sum of array elements using a kernel
>function. Each task has its own data and calculates its own sum.
>I am getting a segmentation fault on the master task, but all other tasks
>calculate the sum successfully.
>Here is the output
>MPI task 0 has started on host node4
>MPI task 1 has started on host node4
>MPI task 2 has started on host node5
>MPI task 6 has started on host node6
>MPI task 5 has started on node5
>MPI task 9 has started on host node6
>MPI task 8 has started on host node6
>MPI task 3 has started on node5
>MPI task 4 has started on hnode5
>MPI task 7 has started on node6
>[node4] *** Process received signal ***
>[node4] Signal: Segmentation fault (11)
>[node4] Signal code: Address not mapped (1)
>[node4] Failing at address: 0xb7866000
>[node4] [ 0] [0xbc040c]
>[node4] [ 1] /usr/lib/ [0x10640f6]
>[node4] [ 2] /usr/lib/ [0x1070912]
>[node4] [ 3] /usr/lib/ [0x1071231]
>[node4] [ 4] /usr/lib/ [0x1066b64]
>[node4] [ 5] /usr/lib/ [0x104263c]
>[node4] [ 6] /usr/lib/ [0x104793b]
>[node4] [ 7] /usr/lib/ [0x1037264]
>[node4] [ 8] /usr/local/cuda/lib/ [0x224336]
>[node4] [ 9] /usr/local/cuda/lib/ [0x257360]
>[node4] [10] mpi_array_distributed(run_kernel+0x9a) [0x804a482]
>[node4] [11] mpi_array_distributed(main+0x325) [0x804a139]
>[node4] [12] /lib/ [0x4dece6]
>[node4] [13] mpi_array_distributed() [0x8049d81]
>[node4] *** End of error message ***

It fails in the cuMemcpyHtoD call inside your CUDA code. Perhaps one
of these changes will fix your problem:

a) mysum = run_kernel(onearray, chunksize, taskid, myname);

b) mysum = run_kernel(&onearray[20000000-1], chunksize, taskid, myname);
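
Both suggestions assume that the host-to-device copy reads past the end of onearray when it starts at offset 20000000. The deleted code does not show how large onearray is, so this is only a guess. A minimal sketch of a host-side check you could run before calling run_kernel, assuming you know the allocated element count (called total here; total, chunk_in_bounds and checked_chunk are hypothetical names, not part of your code or of any library):

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the half-open range [offset, offset + count) fits inside
 * an array of `total` elements, 0 otherwise. The comparison is arranged so
 * that offset + count is never computed and cannot overflow. */
static int chunk_in_bounds(size_t total, size_t offset, size_t count)
{
    return offset <= total && count <= total - offset;
}

/* Hypothetical guard: returns a pointer to the first element of the chunk,
 * or NULL when the chunk would run past the end of the array. */
static const int *checked_chunk(const int *array, size_t total,
                                size_t offset, size_t count)
{
    assert(array != NULL);
    return chunk_in_bounds(total, offset, count) ? array + offset : NULL;
}
```

If checked_chunk(onearray, total, 20000000, chunksize) returns NULL on task 0, the pointer handed to run_kernel covers memory outside the array, which would be consistent with the "Address not mapped" signal code in your output.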

> --------------------------------------------------------------------------
>mpirun noticed that process rank 0 with PID 3054 on node
>exited on signal 11 (Segmentation fault).
>Sadly I can't install a memory checker such as valgrind on my machine
>due to some restrictions. I could not spot any error by looking at the code.
>Can anyone help me? What is wrong in the above code?