[Learning Verl] what is max_colocate_count?

[This line](https://github.com/volcengine/verl/blob/418f964ab84d2b7c49aa4404f65774917501b092/verl/trainer/ppo/ray_trainer.py#L88) in `ray_trainer.py` is part of the following code block:
```
        for resource_pool_name, process_on_nodes in self.resource_pool_spec.items():
            # max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
            # For FSDP backend, we recommend using max_colocate_count=1 that merge all WorkerGroups into one.
            # For Megatron backend, we recommend using max_colocate_count>1
            # that can utilize different WorkerGroup for differnt models
            resource_pool = RayResourcePool(
                process_on_nodes=process_on_nodes, use_gpu=True, max_colocate_count=1, name_prefix=resource_pool_name
            )
            self.resource_pool_dict[resource_pool_name] = resource_pool
```

So what is `max_colocate_count`? According to the comment, `max_colocate_count>1` is recommended for the Megatron backend; however, the value is hardcoded to `1`.

Judging by [this code](https://github.com/volcengine/verl/blob/418f964ab84d2b7c49aa4404f65774917501b092/verl/single_controller/ray/base.py#L122), it looks more like the number of CPUs reserved for each group of colocated processes. Is the name misleading?
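
To make the two readings concrete, here is a minimal sketch in plain Python (no Ray calls) of how a single value could act both as a CPU count and as a colocation limit. It is not verl's code: the helper names `make_bundle`, `per_worker_request`, and `workers_that_fit` are hypothetical, and it assumes that each bundle reserves `max_colocate_count` CPUs plus one GPU, and that each colocated worker then requests one CPU and a `1 / max_colocate_count` share of that GPU.

```python
from fractions import Fraction


def make_bundle(max_colocate_count: int, use_gpu: bool = True) -> dict:
    """Resources reserved for one bundle (one GPU slot). Hypothetical helper."""
    bundle = {"CPU": max_colocate_count}
    if use_gpu:
        bundle["GPU"] = 1
    return bundle


def per_worker_request(max_colocate_count: int, use_gpu: bool = True) -> dict:
    """Resources one colocated worker asks for (assumed: 1 CPU, fractional GPU)."""
    request = {"CPU": 1}
    if use_gpu:
        # Fraction avoids floating-point rounding when dividing the GPU share.
        request["GPU"] = Fraction(1, max_colocate_count)
    return request


def workers_that_fit(bundle: dict, request: dict) -> int:
    """How many workers with this request fit into one bundle."""
    return min(bundle[key] // request[key] for key in request)


if __name__ == "__main__":
    for count in (1, 3):
        bundle = make_bundle(count)
        request = per_worker_request(count)
        print(count, bundle, request, workers_that_fit(bundle, request))
```

Under these assumptions, the CPU figure in the bundle and the number of workers that can share one GPU slot are the same number, which would explain why the parameter shows up as a CPU count in `base.py` while the comment describes it as a number of WorkerGroups.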
