
I am currently using this piece of code:

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        with Spark() as spark:
            return cls(spark)

My object obviously depends on the Spark object (another object that I created; I can add its code if you need to see it, but I do not think it is required for my current issue).

It can be used in two different ways, both resulting in the same behavior:

fs = FileSystem.without_spark()

# OR

with Spark() as spark:
    fs = FileSystem(spark)

My problem is that, even though FileSystem is a singleton, using the class method without_spark makes me enter (__enter__) the context manager of Spark, which leads to a connection to the Spark cluster and takes a lot of time. How can I make the first execution of without_spark do the connection, while the next calls only return the already created instance?

The expected behavior would be something like this:

    @classmethod
    def without_spark(cls):
        if not cls.exists:  # I do not know how to persist this information in the class
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()
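
For context, the Singleton metaclass here is the usual instance-caching pattern, roughly along these lines (a sketch, not necessarily the exact implementation):

class Singleton(type):
    """Metaclass that caches one instance per class (sketch)."""
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]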

  • It seems that your context manager is not really needed, as you leave the context with the return statement and all subsequent usage of the returned FileSystem instance happens outside of it. Commented Nov 19, 2021 at 14:33
  • @user2390182 I know the behavior is a bit strange, but I actually need Spark to be connected when I build my FileSystem. Once disconnected, FileSystem keeps working; it uses Java objects behind the scenes. Commented Nov 19, 2021 at 14:46

2 Answers


I think you are looking for something like

import contextlib

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    spark = None  # shared Spark instance, set on the first call to without_spark

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if cls.spark is None:
            # First call: create the Spark connection and remember it on the class.
            cm = cls.spark = Spark()
        else:
            # Later calls: wrap the already created Spark in a no-op context manager.
            cm = contextlib.nullcontext(cls.spark)

        with cm as s:
            return cls(s)

The first time without_spark is called, a new instance of Spark is created and used as a context manager. Subsequent calls reuse the same Spark instance and use a null context manager.
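
A quick usage sketch (assuming your Spark connects in __enter__ and that the Singleton metaclass hands back the cached instance on later calls):

fs1 = FileSystem.without_spark()  # first call: creates Spark() and enters its context (connects)
fs2 = FileSystem.without_spark()  # later calls: nullcontext around the cached Spark, no reconnection

assert fs1 is fs2  # same singleton instance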


I believe your approach will work as well; you just need to initialize exists to be False, then set it to True the first (and every, really) time you call the class method.

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    exists = False

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if not cls.exists:
            cls.exists = True
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()
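
Note that exists must be set on the class (cls.exists = True); assigning self.exists = True inside __init__ would only create an instance attribute that shadows the class attribute. A minimal illustration (hypothetical Demo class):

class Demo:
    exists = False

d = Demo()
d.exists = True        # creates an attribute on the instance d only
print(Demo.exists)     # False -- the class attribute is unchanged
Demo.exists = True     # this is what cls.exists = True does in the classmethod
print(Demo.exists)     # True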

5 Comments

What is the purpose of the context manager when the context is closed with the return statement and all subsequent usages of cls.spark are "out of context"?
I fixed the name error. I'm almost certainly still botching the intended semantics. (I'm answering this from a strictly formal standpoint, which could be completely wrong.) It's not clear to me what the semantics of the context manager are; the OP's example suggests that fs can be used outside the context manager, but maybe that's not the case.
@chepner yes, fs can be used outside the context manager; it just needs to be inside the context manager during __init__. But your answer does not work: it still makes me go through the with block each time I call the class method. I want to do the with only once, the first time I use the method. I made an edit to my post and added a piece of what I would like (but I'm not sure it is easy to understand).
OK, I've updated the answer to use a null context manager on subsequent calls to without_spark, though I think your approach will work as well: you just need to initialize exists = False when the class is defined, then set cls.exists = True when it is first seen to be false.
Haaaa... I tried to do this, but I actually added the statement self.exists = True in __init__, which was not working because it only changed the value on the instance, not on the class. Stupid mistake. But thanks for the help.

Can't you make the constructor argument optional and initialize Spark lazily, e.g. in a property (or functools.cached_property):

from functools import cached_property

class FileSystem(metaclass=Singleton):
    def __init__(self, spark=None):
        self._spark = spark

    @cached_property
    def spark(self):
        # Create the Spark connection lazily on first access, then cache it.
        if self._spark is None:
            self._spark = Spark()
        return self._spark

    @cached_property
    def path(self):
        return self.spark._jvm.org.apache.hadoop.fs.Path

    @cached_property
    def fs(self):
        with self.spark:
            return self.spark._jvm.org.apache.hadoop.fs.FileSystem.get(
                self.spark._jsc.hadoopConfiguration()
            )
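
Usage could then look something like this (a sketch, assuming your Spark wrapper connects when its context is entered):

fs_manager = FileSystem()   # no Spark object is created here
hadoop_fs = fs_manager.fs   # first access: Spark() is created and entered, connecting once
hadoop_fs = fs_manager.fs   # cached_property returns the stored value, no reconnection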

2 Comments

I've never seen cached_property. Need to study a bit to understand.
It's just like a normal property, but only evaluated on the first call, then cached.
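
A tiny illustration of that behavior, independent of Spark (hypothetical Example class):

from functools import cached_property

class Example:
    @cached_property
    def value(self):
        print("computing...")
        return 42

e = Example()
e.value   # prints "computing..." and returns 42
e.value   # returns the cached 42, nothing is printed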
