Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault on master task
From: Eduardo Morras (nec556_at_[hidden])
Date: 2012-05-02 06:00:07


At 08:51 02/05/2012, you wrote:
>Hi,
>
>I am trying to execute the following code on a cluster.

>run_kernel is a CUDA call with the signature int run_kernel(int
>array[], int nelements, int taskid, char hostname[]);

... deleted code

>mysum = run_kernel(&onearray[20000000], chunksize, taskid, myname);

... more deleted code

>I am simply trying to calculate the sum of the array elements using a
>kernel function. Each task has its own data and calculates its own sum.
>
>I am getting a segmentation fault on the master task, but all other
>tasks calculate their sums successfully.
>
>Here is the output
>
>
>MPI task 0 has started on host node4
>MPI task 1 has started on host node4
>MPI task 2 has started on host node5
>MPI task 6 has started on host node6
>MPI task 5 has started on node5
>MPI task 9 has started on host node6
>MPI task 8 has started on host node6
>MPI task 3 has started on node5
>MPI task 4 has started on hnode5
>MPI task 7 has started on node6
>[node4] *** Process received signal ***
>[node4] Signal: Segmentation fault (11)
>[node4] Signal code: Address not mapped (1)
>[node4] Failing at address: 0xb7866000
>[node4] [ 0] [0xbc040c]
>[node4] [ 1] /usr/lib/libcuda.so(+0x13a0f6) [0x10640f6]
>[node4] [ 2] /usr/lib/libcuda.so(+0x146912) [0x1070912]
>[node4] [ 3] /usr/lib/libcuda.so(+0x147231) [0x1071231]
>[node4] [ 4] /usr/lib/libcuda.so(+0x13cb64) [0x1066b64]
>[node4] [ 5] /usr/lib/libcuda.so(+0x11863c) [0x104263c]
>[node4] [ 6] /usr/lib/libcuda.so(+0x11d93b) [0x104793b]
>[node4] [ 7] /usr/lib/libcuda.so(cuMemcpyHtoD_v2+0x64) [0x1037264]
>[node4] [ 8] /usr/local/cuda/lib/libcudart.so.4(+0x20336) [0x224336]
>[node4] [ 9] /usr/local/cuda/lib/libcudart.so.4(cudaMemcpy+0x230) [0x257360]
>[node4] [10] mpi_array_distributed(run_kernel+0x9a) [0x804a482]
>[node4] [11] mpi_array_distributed(main+0x325) [0x804a139]
>[node4] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x4dece6]
>[node4] [13] mpi_array_distributed() [0x8049d81]
>[node4] *** End of error message ***

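The backtrace shows it failing in cuMemcpyHtoD, reached through
cudaMemcpy in run_kernel, with "Address not mapped", so the host-side
copy is probably reading outside the memory that onearray occupies.
Perhaps one of these changes will fix your problem: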

a) mysum = run_kernel(onearray, chunksize, taskid, myname);

b) mysum = run_kernel(&onearray[20000000-1], chunksize, taskid, myname);
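The deleted code does not show how large onearray is, so whether the
chunk starting at offset 20000000 really fits inside it is something to
verify. A minimal sketch of such a check in plain C follows; the helper
names chunk_in_bounds and checked_chunk are hypothetical and not part of
your program, and total stands for however many elements onearray
actually holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the element range
 * [offset, offset + count) lies inside an array of `total` elements.
 * Written as a subtraction so offset + count cannot overflow. */
int chunk_in_bounds(size_t total, size_t offset, size_t count)
{
    return offset <= total && count <= total - offset;
}

/* Hypothetical wrapper: returns a pointer to the start of the chunk,
 * or NULL (after printing a diagnostic) if the chunk would run past
 * the end of the array. */
int *checked_chunk(int *array, size_t total, size_t offset, size_t count)
{
    assert(array != NULL);
    if (!chunk_in_bounds(total, offset, count)) {
        fprintf(stderr,
                "chunk at offset %zu with %zu elements exceeds array of %zu\n",
                offset, count, total);
        return NULL;
    }
    return array + offset;
}
```

Calling checked_chunk(onearray, total, 20000000, chunksize) before
run_kernel, and only passing on a non-NULL result, would tell you on the
host side whether the range handed to cudaMemcpy is valid, without
needing valgrind.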

> --------------------------------------------------------------------------
>mpirun noticed that process rank 0 with PID 3054 on node
>ecm-c-l-207-004.uniwa.uwa.edu.au
>exited on signal 11 (Segmentation fault).
>--------------------------------------------------------------------------
>
>Sadly, I can't install a memory checker such as valgrind on my machine
>due to some restrictions. I could not spot any error by looking at the code.
>
>Can anyone help me? What is wrong in the above code?
>
>Thanks