Ralph Castain wrote:
The two files have a slightly different format
Agreed.
and completely different meaning.
Somewhat agreed. They're both related to mapping processes onto a
cluster.
The hostfile specifies how many slots are on a node. The
rankfile specifies a rank and what node/slot it is to be mapped onto.
Agreed.
Rankfiles can use relative node indexing and refer to nodes
received from a resource manager - i.e., without any hostfile.
This is the main part I'm concerned about. E.g.,
% cat rankfile
rank 0=node0 slot=0
rank 1=node1 slot=0
% mpirun -np 2 -rf rankfile ./a.out
--------------------------------------------------------------------------
Rankfile claimed host node1 that was not allocated or oversubscribed
it's slots:
--------------------------------------------------------------------------
[node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 107
[node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 86
[node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/plm_base_launch_support.c at line 86
[node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
plm_rsh_module.c at line 1016
% mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
0 on node0
1 on node1
done
It seems to me that the rankfile has sufficient information to express
what I want it to do. But mpirun won't accept this. To fix this, I
have to, e.g., supply/maintain/specify redundant information in a
hostfile or host list.
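
Until mpirun accepts a rankfile on its own, the redundant host list could be generated from the rankfile rather than maintained by hand. Below is a minimal sketch of that workaround, assuming the rankfile only contains lines of the form shown above ("rank N=host slot=S") with absolute host names. The helper names hosts_from_rankfile and mpirun_command are hypothetical and not part of Open MPI, and relative node indexing is not handled.

```python
import re

# Matches lines such as "rank 0=node0 slot=0".
# Assumption: absolute host names only; relative node indexing is not handled.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)")


def hosts_from_rankfile(text):
    """Return the distinct host names in a rankfile, in first-seen order."""
    hosts = []
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match and match.group(2) not in hosts:
            hosts.append(match.group(2))
    return hosts


def mpirun_command(rankfile_path, rankfile_text, program="./a.out"):
    """Build an mpirun argument list whose -host value comes from the rankfile."""
    hosts = hosts_from_rankfile(rankfile_text)
    # One process per "rank" line in the rankfile.
    nranks = sum(1 for line in rankfile_text.splitlines() if RANK_LINE.match(line))
    return ["mpirun", "-np", str(nranks),
            "-host", ",".join(hosts),
            "-rf", rankfile_path, program]


if __name__ == "__main__":
    example = "rank 0=node0 slot=0\nrank 1=node1 slot=0\n"
    print(" ".join(mpirun_command("rankfile", example)))
```

For the two-line rankfile above, this builds the same command line as the second, working invocation, so the host names only have to be written down once.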
So the files are intentionally quite different. Trying to
combine them would be rather ugly.
Right. And my issue is that I'm forced to use both when I only want
rankfile functionality.
On Thu, Jun 18, 2009 at 1:52 PM, Eugene Loh <Eugene.Loh@sun.com> wrote:
In order to use "mpirun --rankfile", I also need to specify
hosts/hostlist. But that information is redundant with what I provide
in the rankfile. So, from a user's point of view, this strikes me as
broken. Yes? Should I file a ticket, or am I missing something here
about this functionality?