I am currently using this piece of code:
    class FileSystem(metaclass=Singleton):
        """File System manager based on Spark"""

        def __init__(self, spark):
            self._path = spark._jvm.org.apache.hadoop.fs.Path
            self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
                spark._jsc.hadoopConfiguration()
            )

        @classmethod
        def without_spark(cls):
            with Spark() as spark:
                return cls(spark)
My object obviously depends on the Spark object (another class that I created; I can add its code if you need to see it, but I do not think it is required for this issue).
It can be used in two different ways, both resulting in the same behavior:
    fs = FileSystem.without_spark()
    # OR
    with Spark() as spark:
        fs = FileSystem(spark)
My problem is that, even though FileSystem is a singleton, calling the class method without_spark enters the Spark context manager (__enter__), which opens a connection to the Spark cluster and takes a lot of time. How can I make the first call to without_spark open the connection, while subsequent calls only return the already created instance?
The expected behavior would be something like this:
    @classmethod
    def without_spark(cls):
        if not cls.exists:  # I do not know how to persist this information in the class
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()
FileSystem instance happens outside of the context.
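For reference, here is a minimal, self-contained sketch of the behavior described above. It rests on assumptions that are not shown in the question: the Singleton metaclass is assumed to cache instances in a class-level `_instances` dict (a common pattern, but the real one may differ), and `Spark` is replaced by a stand-in that only counts how often its context is entered. The idea is that `without_spark` asks the metaclass cache whether an instance already exists before opening the context.

```python
class Singleton(type):
    """Hypothetical Singleton metaclass: caches one instance per class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance only on the first call; reuse it afterwards.
        if cls not in Singleton._instances:
            Singleton._instances[cls] = super().__call__(*args, **kwargs)
        return Singleton._instances[cls]


class Spark:
    """Stand-in for the real Spark wrapper; counts how often it connects."""

    connections = 0

    def __enter__(self):
        Spark.connections += 1  # the expensive connection in the real class
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


class FileSystem(metaclass=Singleton):
    def __init__(self, spark):
        # The real class would build self._path and self._fs from spark here.
        self._spark = spark

    @classmethod
    def without_spark(cls):
        # Return the cached instance without entering the Spark context.
        existing = Singleton._instances.get(cls)
        if existing is not None:
            return existing
        with Spark() as spark:
            return cls(spark)


first = FileSystem.without_spark()   # enters the Spark context once
second = FileSystem.without_spark()  # served from the metaclass cache
```

With this stand-in, both calls return the same object and the context is entered only once. If the real Singleton stores its instances under a different attribute, the lookup in `without_spark` would need to be adapted accordingly.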