Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fault Tolerant Features in OpenMPI
From: Edson Tavares de Camargo (etcamargo_at_[hidden])
Date: 2013-08-12 15:17:13


Hi, George!

I had studied the ULFM document before beginning the failure-detection
tests in Open MPI, and it seems like a good choice.

But I'm having trouble with the ULFM-enabled version of Open MPI
(openmpi-1.7ft_b3.tar.gz). I followed the ULFM setup instructions at
http://fault-tolerance.org/ulfm/ulfm-setup/. The program seems to compile
fine, but running it produces the error below. Now no MPI program runs
anymore (with or without ft). Could you help me?

Thanks a lot!

Edson

Linux version 3.2.0-51-generic (buildd_at_allspice) (gcc version 4.6.3
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #77-Ubuntu SMP Wed Jul 24 20:18:19 UTC
2013

----------------
edson_at_edson:~/UFPR/MPI_Fault$ mpirun -np 8 -am ft-enable-mpi ./teste1
[edson:04372] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_errmgr_default:
/usr/local/lib/openmpi/mca_errmgr_default.so: undefined symbol:
orte_errmgr_base_error_abort (ignored)
[edson:04372] mca: base: component_find: unable to open
/usr/local/lib/openmpi/mca_grpcomm_basic:
/usr/local/lib/openmpi/mca_grpcomm_basic.so: undefined symbol:
opal_profile_file (ignored)
[edson:04372] *** Process received signal ***
[edson:04372] Signal: Segmentation fault (11)
[edson:04372] Signal code: Address not mapped (1)
[edson:04372] Failing at address: 0x14
[edson:04372] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)
[0x7f5d425bdcb0]
[edson:04372] [ 1]
/usr/local/lib/openmpi/mca_rmaps_load_balance.so(+0xa88) [0x7f5d409bca88]
[edson:04372] [ 2]
/usr/local/lib/libopen-rte.so.0(orte_rmaps_base_map_job+0x112)
[0x7f5d42838132]
[edson:04372] [ 3]
/usr/local/lib/libopen-rte.so.0(orte_plm_base_setup_job+0x11c)
[0x7f5d4283362c]
[edson:04372] [ 4] /usr/local/lib/openmpi/mca_plm_rsh.so(+0x4ee7)
[0x7f5d401a9ee7]
[edson:04372] [ 5] mpirun(orterun+0xeb0) [0x404420]
[edson:04372] [ 6] mpirun(main+0x20) [0x4033c4]
[edson:04372] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
[0x7f5d4221076d]
[edson:04372] [ 8] mpirun() [0x4032e9]
[edson:04372] *** End of error message ***
Segmentation fault (core dumped)

-----------
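
For reference, the ring.c listing quoted further down picks each rank's
neighbours with a modulo for next and an if/else on rank 0 for previous.
A minimal sketch of the same arithmetic as two small helpers follows. The
names ring_next and ring_prev are hypothetical, chosen only for this
illustration; they are not part of ring.c or of the MPI API.

```c
/* Neighbour arithmetic for a ring of `size` ranks numbered 0..size-1.
 * ring_next and ring_prev are hypothetical helper names used only in
 * this sketch. */

/* Rank that follows `rank`; the last rank wraps around to 0. */
int ring_next(int rank, int size)
{
    return (rank + 1) % size;
}

/* Rank that precedes `rank`; rank 0 wraps around to size - 1.
 * Adding `size` before the modulo keeps the left operand non-negative,
 * so no special case for rank 0 is needed. */
int ring_prev(int rank, int size)
{
    return (rank + size - 1) % size;
}
```

With mpiexec -n 4, rank 0 gets next = 1 and previous = 3, the same values
the if/else in the quoted listing produces.
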

> Edson,
>
> Based on your questions I would suggest you take a look at the
> ULFM-enabled version of Open MPI. You can find it at
> http://fault-tolerance.org/.
>
> George.
>
>
> On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo
> <etcamargo_at_[hidden]> wrote:
>
>> Thanks a lot for your reply, Ralph!
>>
>> Could you tell me in what situation the error handler would be called in
>> the 1.6.5 version?
>>
>> I had thought that a failure in a process would be caught by the error
>> handler. Wouldn't killing or aborting the process lead to the same
>> behaviour?
>>
>> In the 1.7.4 release, will the error handler be called if a process is
>> killed?
>>
>> Thanks,
>>
>> Edson
>> ---------------------
>>
>>> The error handler wouldn't be called in that situation - we simply abort
>>> the job. We expect to provide that integration in something like the
>>> 1.7.4 release milestone.
>>>
>>>
>>> On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo
>>> <etcamargo_at_[hidden]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I was looking for posts about fault tolerance in MPI and I found the
>>>> post below:
>>>>
>>>> http://www.open-mpi.org/community/lists/users/2012/06/19658.php
>>>>
>>>> I am trying to understand all the work on failure detection present in
>>>> Open MPI. So I began with a simple application, a ring application
>>>> (ring.c), to understand error handlers. But it seems that it didn't
>>>> work. Why not? (The code is below.)
>>>>
>>>> The application's processes were all running on the same machine,
>>>> started with the following command line:
>>>>
>>>> $ mpiexec -n 4 ring
>>>>
>>>> While the ring application was running, one of the processes was killed.
>>>> The entire application then stopped (fine so far), but it didn't show me
>>>> the error message. Shouldn't the line if(error != MPI_SUCCESS) have
>>>> worked?
>>>>
>>>> I am using mpiexec (OpenRTE) 1.6.5.
>>>>
>>>> Thanks in advance,
>>>>
>>>> Edson
>>>>
>>>> -----------------------------------------------
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>> #include <time.h>
>>>>
>>>> int main( int argc, char *argv[] )
>>>> {
>>>>     int rank, size;
>>>>     int n = 0;
>>>>     int tag = 0;
>>>>     int error;
>>>>     int root = 0;
>>>>     int next, previous;
>>>>     double start = 0;
>>>>     double finish = 0;
>>>>
>>>>     MPI_Status status;
>>>>
>>>>     MPI_Init( &argc, &argv );
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>
>>>>     // error handler
>>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>>
>>>>     do {
>>>>         next = (rank + 1) % (size);
>>>>         n++;
>>>>
>>>>         if(rank != 0){
>>>>             previous = (rank - 1);
>>>>         }else{
>>>>             previous = size - 1;
>>>>         }
>>>>
>>>>         if (rank =
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users