Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with Filem
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-05-07 07:25:50


I'm glad that the recent commits fixed your problem.

At the moment, we do not implement a mirroring file storage mechanism
(where peers save checkpoints to each other's local disks). We have
been working toward supporting this and other techniques in some
off-trunk development, but nothing is ready to put back into the trunk
yet. Sorry :/

Best,
Josh
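
For readers unfamiliar with the term, the mirroring idea mentioned above can be pictured as a ring in which each node stores its checkpoint on its neighbor's local disk. The sketch below is illustrative only: it is not an Open MPI feature or API (the message above says no such mechanism is implemented), and the function name and host names are hypothetical.

```python
def ring_mirror_partners(nodes):
    """Map each node to the next node in the list, wrapping around at the end.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    if len(nodes) < 2:
        raise ValueError("mirroring needs at least two nodes")
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


# Hypothetical host names, used only to show the resulting mapping.
partners = ring_mirror_partners(["node-a", "node-b", "node-c"])
for source, target in partners.items():
    print(f"{source} -> checkpoint stored on {target}")
```

With such a mapping, no single storage node receives every checkpoint, which is the cost concern raised in the quoted question below.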

On May 7, 2009, at 4:07 AM, Bouguerra mohamed slim wrote:

> Hello,
> Thank you, it works with release r21172. But how can I distribute
> the checkpoints across different storage nodes? It is too costly
> for all computing nodes to write to one storage node.
>
>
>
>
> Josh Hursey wrote:
>> I just realized that not all of the FileM fixes made it to the
>> trunk in my previous commit. Sorry about that :( I just committed
>> the remainder of the changes in r21167 if you wanted to try them out.
>>
>> Cheers,
>> Josh
>>
>> On May 4, 2009, at 8:48 AM, Josh Hursey wrote:
>>
>>> The command line looks fine. Can you send the output generated by
>>> the verbose arguments (there was no file attached to the last
>>> email)?
>>>
>>> The version of the trunk that I was referring to was r21131, and
>>> can be downloaded via SVN or a nightly snapshot tarball from the
>>> links below:
>>> http://www.open-mpi.org/svn/
>>> http://www.open-mpi.org/nightly/trunk/
>>>
>>> Best,
>>> Josh
>>>
>>> On May 4, 2009, at 3:44 AM, Bouguerra mohamed slim wrote:
>>>
>>>> Hello,
>>>> this is the full command that I use to run the program.
>>>>
>>>> /home/grenoble/msbouguerra/install/ompi-1.3.2/cr/bin/mpirun
>>>> -mca orte_base_help_aggregate 0 -mca filem_rsh_rcp scp
>>>> -mca filem_rsh_verbose 99 -mca filem_base_verbose 99
>>>> -mca snapc_base_verbose 1 -mca ompi_cr_verbose 1
>>>> -mca orte_cr_verbose 1 -mca opal_cr_verbose 1
>>>> -mca snapc_base_global_snapshot_dir /tmp/stable
>>>> -mca snapc_base_store_in_place 0
>>>> -mca snapc_base_global_snapshot_ref ompi_global_snapshot_09_30
>>>> -np 20 -am ft-enable-cr -hostfile ./hostfile_04_05
>>>> --mca btl '^mx' ./nqueen 16
>>>>
>>>> I still get the same problem; the error stack is in the file.
>>>>
>>>> Finally, can you tell me exactly which version of the development
>>>> trunk you mean?
>>>>
>>>> Thank you,
>>>>
>>>>
>>>>
>>>> Josh Hursey wrote:
>>>>>
>>>>> This typically means that one or more of the rcp/scp or
>>>>> rsh/ssh commands failed. FileM should print an error
>>>>> message when one of the copy commands fails. Try turning up the
>>>>> verbose level to 10 to see if it indicates any problems:
>>>>> -mca filem_rsh_verbose 10
>>>>>
>>>>> Can you send me the MCA parameters that you are setting? That
>>>>> may help narrow down the problem as well. Also I cleaned up
>>>>> some of the filem (and snapc) error reporting in the
>>>>> development trunk if you want to give that a try.
>>>>>
>>>>> Let me know what you find out.
>>>>>
>>>>> Best,
>>>>> Josh
>>>>>
>>>>> On Apr 30, 2009, at 6:40 AM, Bouguerra mohamed slim wrote:
>>>>>
>>>>>> Hello,
>>>>>> I have a problem with the FileM module when I try to checkpoint
>>>>>> to a remote host without a shared file system.
>>>>>> I am using the new Open MPI 1.3.2, and the problem is the same as
>>>>>> in version 1.3.1. When I use an NFS file system, it works, so I
>>>>>> guess the problem is in FileM.
>>>>>>
>>>>>> [azur-6.fr:23223] filem:rsh: wait_all(): Wait failed (-1)
>>>>>> [azur-6.fr:23223] [[48784,0],0] ORTE_ERROR_LOG: Error in file
>>>>>> /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c
>>>>>> at line 1054
>>>>>>
>>>>>> --
>>>>>> Cordialement,
>>>>>> Mohamed-Slim BOUGUERRA PhD student INRIA-Grenoble / Projet
>>>>>> MOAIS
>>>>>> ENSIMAG - antenne de Montbonnot
>>>>>> ZIRST 51, avenue Jean Kuntzmann
>>>>>> 38330 MONTBONNOT SAINT MARTIN France
>>>>>> Tel :+33 (0)4 76 61 20 79
>>>>>> Fax :+33 (0)4 76 61 20 99
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>>
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd:
>>>> [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_1.ckpt)
>>>> translated to (/tmp/opal_snapshot_1.ckpt)
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd:
>>>> [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_5.ckpt)
>>>> translated to (/tmp/opal_snapshot_5.ckpt)
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd:
>>>> [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_9.ckpt)
>>>> translated to (/tmp/opal_snapshot_9.ckpt)
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd:
>>>> [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_13.ckpt)
>>>> translated to (/tmp/opal_snapshot_13.ckpt)
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd:
>>>> [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_17.ckpt)
>>>> translated to (/tmp/opal_snapshot_17.ckpt)
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> -------------------------------------------------------------------
>>>> -------
>>>> WARNING: Could not preload specified file: File already exists.
>>>>
>>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>>> Host: sol-7.sophia.grid5000.fr
>>>>
>>>> Will continue attempting to launch the process.
>>>>
>>>> -------------------------------------------------------------------
>>>> -------
>>>> [sol-7.sophia.grid5000.fr:04545] filem:rsh: wait_all(): Wait failed (-1)
>>>> [sol-7.sophia.grid5000.fr:04545] [[52993,0],0] ORTE_ERROR_LOG: Error in file
>>>> /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c
>>>> at line 1054
>>>
>>
>>
>>
>
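
Earlier in the thread, Josh asks which MCA parameters are being set. One possible way to pull them out of a long mpirun command line for a reply is sketched below. It assumes every parameter is given as `-mca NAME VALUE` or `--mca NAME VALUE`, as in the command quoted in the thread; the helper name `extract_mca_params` is hypothetical, and the sample command is an abbreviated copy of the quoted one.

```python
import shlex


def extract_mca_params(command):
    """Return {name: value} for each '-mca NAME VALUE' or '--mca NAME VALUE' pair.

    Hypothetical helper for illustration; it only tokenizes the text and does
    not call mpirun or any Open MPI tool.
    """
    tokens = shlex.split(command)  # handles quoted values such as '^mx'
    params = {}
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-mca", "--mca") and i + 2 < len(tokens):
            params[tokens[i + 1]] = tokens[i + 2]
            i += 3
        else:
            i += 1
    return params


# Abbreviated version of the command quoted in the thread.
command = (
    "mpirun -mca filem_rsh_rcp scp -mca filem_rsh_verbose 99 "
    "-mca snapc_base_global_snapshot_dir /tmp/stable "
    "-np 20 -am ft-enable-cr --mca btl '^mx' ./nqueen 16"
)
for name, value in extract_mca_params(command).items():
    print(f"{name} = {value}")
```

The sketch only covers parameters given on the command line; parameters set elsewhere (for example through an `-am` parameter file) would not appear in its output.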