Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with Filem
From: Bouguerra mohamed slim (mohamed-slim.bouguerra_at_[hidden])
Date: 2009-05-07 04:07:58


Hello,
Thank you, it works with release r21172. But how can I distribute
the checkpoints across different storage nodes? It is too costly for
all compute nodes to write to a single storage node.

Josh Hursey wrote:
> I just realized that not all of the FileM fixes made it to the trunk
> in my previous commit. Sorry about that :( I just committed the
> remainder of the changes in r21167 if you wanted to try them out.
>
> Cheers,
> Josh
>
> On May 4, 2009, at 8:48 AM, Josh Hursey wrote:
>
>> The command line looks fine. Can you send the output generated by the
>> verbose arguments (there was no file attached to the last email)?
>>
>> The version of the trunk that I was referring to was r21131, and can
>> be downloaded via SVN or a nightly snapshot tarball from the links
>> below:
>> http://www.open-mpi.org/svn/
>> http://www.open-mpi.org/nightly/trunk/
>>
>> Best,
>> Josh
>>
>> On May 4, 2009, at 3:44 AM, Bouguerra mohamed slim wrote:
>>
>>> Hello,
>>> This is the full command that I use to run the program.
>>>
>>> /home/grenoble/msbouguerra/install/ompi-1.3.2/cr/bin/mpirun -mca
>>> orte_base_help_aggregate 0 -mca filem_rsh_rcp scp -mca
>>> filem_rsh_verbose 99 -mca filem_base_verbose 99 -mca
>>> snapc_base_verbose 1 -mca ompi_cr_verbose 1 -mca orte_cr_verbose 1
>>> -mca opal_cr_verbose 1 -mca snapc_base_global_snapshot_dir
>>> /tmp/stable -mca snapc_base_store_in_place 0 -mca
>>> snapc_base_global_snapshot_ref ompi_global_snapshot_09_30 -np 20 -am
>>> ft-enable-cr -hostfile ./hostfile_04_05 --mca btl '^mx' ./nqueen 16
>>>
>>> I still get the same problem; the error stack is in the file.
>>>
>>> Finally, can you tell me the exact version of the development trunk?
>>>
>>> Thank you,
>>>
>>>
>>>
>>> Josh Hursey wrote:
>>>>
>>>> This typically means that one or more of the rcp/scp or
>>>> rsh/ssh commands failed. FileM should be printing an error message
>>>> when one of the copy commands fails. Try turning up the verbose
>>>> level to 10 to see if it indicates any problems:
>>>> -mca filem_rsh_verbose 10
>>>>
>>>> Can you send me the MCA parameters that you are setting? That may
>>>> help narrow down the problem as well. Also I cleaned up some of the
>>>> filem (and snapc) error reporting in the development trunk if you
>>>> want to give that a try.
>>>>
>>>> Let me know what you find out.
>>>>
>>>> Best,
>>>> Josh
>>>>
>>>> On Apr 30, 2009, at 6:40 AM, Bouguerra mohamed slim wrote:
>>>>
>>>>> Hello,
>>>>> I have a problem with the FileM module when I try to checkpoint to
>>>>> a remote host without a shared file system.
>>>>> I use the new Open MPI 1.3.2 and it has the same problem as
>>>>> version 1.3.1. When I use an NFS file system, it works,
>>>>> so I guess the problem is with FileM.
>>>>>
>>>>> [azur-6.fr:23223] filem:rsh: wait_all(): Wait failed (-1)
>>>>> [azur-6.fr:23223] [[48784,0],0] ORTE_ERROR_LOG: Error in file
>>>>> /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c
>>>>> at line 1054
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Mohamed-Slim BOUGUERRA PhD student INRIA-Grenoble / Projet MOAIS
>>>>> ENSIMAG - antenne de Montbonnot
>>>>> ZIRST 51, avenue Jean Kuntzmann
>>>>> 38330 MONTBONNOT SAINT MARTIN France
>>>>> Tel :+33 (0)4 76 61 20 79
>>>>> Fax :+33 (0)4 76 61 20 99
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>> --
>>> Regards,
>>> Mohamed-Slim BOUGUERRA PhD student INRIA-Grenoble / Projet MOAIS
>>> ENSIMAG - antenne de Montbonnot
>>> ZIRST 51, avenue Jean Kuntzmann
>>> 38330 MONTBONNOT SAINT MARTIN France
>>> Tel :+33 (0)4 76 61 20 79
>>> Fax :+33 (0)4 76 61 20 99
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:base:
>>> process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]:
>>> Filename Requested (/tmp/opal_snapshot_1.ckpt) translated to
>>> (/tmp/opal_snapshot_1.ckpt)
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:base:
>>> process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]:
>>> Filename Requested (/tmp/opal_snapshot_5.ckpt) translated to
>>> (/tmp/opal_snapshot_5.ckpt)
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:base:
>>> process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]:
>>> Filename Requested (/tmp/opal_snapshot_9.ckpt) translated to
>>> (/tmp/opal_snapshot_9.ckpt)
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:base:
>>> process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]:
>>> Filename Requested (/tmp/opal_snapshot_13.ckpt) translated to
>>> (/tmp/opal_snapshot_13.ckpt)
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:base:
>>> process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]:
>>> Filename Requested (/tmp/opal_snapshot_17.ckpt) translated to
>>> (/tmp/opal_snapshot_17.ckpt)
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> WARNING: Could not preload specified file: File already exists.
>>>
>>> Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
>>> Host: sol-7.sophia.grid5000.fr
>>>
>>> Will continue attempting to launch the process.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> [sol-7.sophia.grid5000.fr:04545] filem:rsh: wait_all(): Wait failed
>>> (-1)
>>> [sol-7.sophia.grid5000.fr:04545] [[52993,0],0] ORTE_ERROR_LOG: Error
>>> in file
>>> /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c
>>> at line 1054
>>
>
>
>

-- 
Regards,
Mohamed-Slim BOUGUERRA    PhD student INRIA-Grenoble / Projet MOAIS
ENSIMAG - antenne de Montbonnot
ZIRST 51, avenue Jean Kuntzmann
38330 MONTBONNOT SAINT MARTIN France
Tel :+33 (0)4 76 61 20 79
Fax :+33 (0)4 76 61 20 99
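
Following Josh's remark in the thread that "wait_all(): Wait failed (-1)" typically means one of the rcp/scp or rsh/ssh commands failed, a minimal sketch for checking non-interactive ssh access to every host in a hostfile might look like the script below. The names list_hosts, check_hosts and REMOTE_SH are hypothetical helpers, not part of Open MPI, and the hostfile layout (one host per line, optional "slots=N", "#" comments) is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Open MPI) for checking remote shell access.

# Print the unique host names found in a hostfile.
# Assumes one host per line, optionally followed by "slots=N";
# anything after "#" is treated as a comment.
list_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | sort -u
}

# Run a trivial command on every host and report which hosts fail.
# REMOTE_SH defaults to ssh in BatchMode, which fails instead of
# prompting for a password or passphrase.
check_hosts() {
    hostfile=$1
    failed=0
    for host in $(list_hosts "$hostfile"); do
        if ${REMOTE_SH:-ssh -o BatchMode=yes} "$host" true \
               </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return $failed
}
```

For example, after sourcing the script, "check_hosts ./hostfile_04_05" prints one "ok" or "FAIL" line per host. A "FAIL" line suggests that the copy commands FileM issues toward that host may fail for the same reason. The sketch only exercises the remote shell, not scp itself.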