This line contains the following code block:
for resource_pool_name, process_on_nodes in self.resource_pool_spec.items():
    # max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool.
    # For the FSDP backend, we recommend using max_colocate_count=1, which merges all WorkerGroups into one.
    # For the Megatron backend, we recommend using max_colocate_count>1,
    # which can utilize a different WorkerGroup for different models.
    resource_pool = RayResourcePool(
        process_on_nodes=process_on_nodes, use_gpu=True, max_colocate_count=1, name_prefix=resource_pool_name
    )
    self.resource_pool_dict[resource_pool_name] = resource_pool
So what is max_colocate_count? According to the comment, max_colocate_count>1 should be set for the Megatron backend; however, the value is hardcoded to 1.
Judging from the code here, it looks more like the number of CPUs reserved for each colocated process. Is the name confusing?
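To make the second reading concrete, here is a minimal sketch of what I think is happening. It is not the library's code: `sketch_bundles` is a hypothetical stand-in, and the assumption that each process gets a bundle with `CPU` equal to max_colocate_count and one GPU is only my interpretation of the linked code.

```python
from typing import Dict, List


def sketch_bundles(
    process_on_nodes: List[int],
    max_colocate_count: int,
    use_gpu: bool = True,
) -> List[List[Dict[str, int]]]:
    """Hypothetical stand-in: build one resource bundle per process on each node."""
    # Assumption (from my reading, not confirmed): max_colocate_count is used
    # as the CPU count of every bundle, not as a count of WorkerGroups.
    bundle: Dict[str, int] = {"CPU": max_colocate_count}
    if use_gpu:
        bundle["GPU"] = 1
    # One list of bundles per node, one bundle per process on that node.
    return [[dict(bundle) for _ in range(n)] for n in process_on_nodes]


# Two nodes with 2 and 1 processes, using the hardcoded value of 1.
print(sketch_bundles([2, 1], max_colocate_count=1))
```

If that reading is right, changing max_colocate_count only changes the CPU count per bundle, which is why the name looks misleading to me.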