I pulled the image rayproject/ray:2.51.0.801bd7-extra-py310 from Docker Hub with podman (Ray prefers podman internally).
I then started a head instance on AWS EC2 using the Deep Learning Base AMI with Single CUDA (Ubuntu 22.04), ami-06eff6f62c23006e9 (64-bit (x86)).
On the instance, I have:
ubuntu@ip-172-31-15-118:~$ python --version
Python 3.10.19
ubuntu@ip-172-31-15-118:~$ python3 --version
Python 3.10.19
ubuntu@ip-172-31-15-118:~$ docker --version
Docker version 28.5.1, build e180ab8
ubuntu@ip-172-31-15-118:~$ podman --version
podman version 3.4.4
python3 -m venv ~/raycli-env
source ~/raycli-env/bin/activate
pip install -U "ray[default]"
(raycli-env) ubuntu@ip-172-31-15-118:~$ ray --version
ray, version 2.51.0
I have also exported the following environment variables:
export RAY_RUNTIME_ENV_DOCKER=1
export RAY_RUNTIME_ENV_PODMAN_EXE=/usr/bin/docker
export PATH=/usr/bin:$PATH
I started the Ray cluster with:
ray start --head --port=6379 --dashboard-host=0.0.0.0
Then I created a job directory and a test script:
mkdir ray-job
cd ray-job
echo 'import ray
ray.init(address="auto")
@ray.remote
def hello():
    return "Hello World from Ray on GPU!"
print(ray.get(hello.remote()))' > hello_ray.py
Then I extended the image with the following Dockerfile:
ARG RAY_UID=1000
ARG RAY_GID=100
FROM rayproject/ray:2.51.0.801bd7-extra-py310
USER root
RUN pip install "ray[default]"
# go back to ray user for running workloads
USER ray
WORKDIR /home/ray
CMD ["python", "hello_ray.py"]
And built the image with:
podman build -t ray-image:latest
However, when I run the following command, I don't get any errors, and the job stays in pending forever.
ray job submit \
--address="http://127.0.0.1:8265" \
--runtime-env-json '{"image_uri": "localhost/ray-image:latest"}' \
-- python hello_ray.py
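For reference, the same submission can be sketched with the Ray Jobs Python SDK, which makes it easier to poll the status and print whatever message the job record carries while it sits in pending. This is a minimal sketch, not a fix: the helper names `build_runtime_env`, `runtime_env_json`, and `submit_and_watch` and the timeout values are my own assumptions, and it assumes the dashboard is reachable at the same address used above.

```python
import json
import time


def build_runtime_env(image_uri: str) -> dict:
    """Return the runtime_env dict that --runtime-env-json carries on the CLI."""
    return {"image_uri": image_uri}


def runtime_env_json(image_uri: str) -> str:
    """Serialize the runtime_env the way it is passed on the command line."""
    return json.dumps(build_runtime_env(image_uri))


def submit_and_watch(address: str, entrypoint: str, image_uri: str,
                     timeout_s: float = 120.0, poll_s: float = 2.0) -> str:
    """Submit a job, then poll until it leaves PENDING or the timeout expires."""
    # Imported lazily so the helpers above work without Ray installed.
    from ray.job_submission import JobStatus, JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env=build_runtime_env(image_uri),
    )

    deadline = time.monotonic() + timeout_s
    status = client.get_job_status(job_id)
    while status == JobStatus.PENDING and time.monotonic() < deadline:
        time.sleep(poll_s)
        status = client.get_job_status(job_id)

    # The job record has a message field that may say why it is still waiting.
    info = client.get_job_info(job_id)
    print(f"job {job_id}: status={status}, message={info.message}")
    return str(status)
```

Calling `submit_and_watch("http://127.0.0.1:8265", "python hello_ray.py", "localhost/ray-image:latest")` from the raycli-env virtual environment should mirror the CLI call above, with the status and message printed once the loop ends.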
However, when I use a python-slim image with a matching Python version instead, I am able to submit the job successfully.
I don't know what I am doing wrong. The Python and Ray versions on the EC2 instance and in the podman image are the same. Please help me. I have been stuck on this for a long time.
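To double-check the claim that the versions match, a small script like the one below can be run once on the host (inside raycli-env) and once inside the container, and the two outputs compared. It is only a sketch: `version_report` and `mismatches` are hypothetical helper names, and it reports what each interpreter sees without assuming anything about how strictly Ray itself compares versions.

```python
import importlib.metadata
import platform


def version_report() -> dict:
    """Collect the Python and Ray versions visible to this interpreter."""
    try:
        ray_version = importlib.metadata.version("ray")
    except importlib.metadata.PackageNotFoundError:
        ray_version = None  # Ray is not installed in this environment
    return {"python": platform.python_version(), "ray": ray_version}


def mismatches(host: dict, image: dict) -> list:
    """Return the keys whose values differ between the two reports."""
    return sorted(k for k in host.keys() | image.keys()
                  if host.get(k) != image.get(k))


if __name__ == "__main__":
    # Run this in both places and compare the printed dictionaries by eye,
    # or paste both into mismatches() to list the differing entries.
    print(version_report())
```

Saved under an arbitrary name such as `versions.py`, it could be run in the image with something like `podman run --rm -v "$PWD":/work localhost/ray-image:latest python /work/versions.py`; the mount path is an assumption.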