I'm trying to reduce the memory footprint of my Python program, and I found that I have a large number of strings with the same value but different ids. I could intern the strings, but knowing how interning works, it seems messy to me to introduce these strings into a global dictionary that is used every time a new string is created anywhere in my program.

So I wrote this:

def _coalesce_str(s, candidates):
    for candidate in candidates:
        if s == candidate:
            return candidate
    return s

Example usage:

COLORS = 'red', 'green', 'blue', 'orange'

def f():
    results = []
    for doc in very_large_db_collection():
        results.append(_coalesce_str(doc['color'], COLORS))
    return results

Let's say this is the only place I create a significant number of instances of those colors. Then my manual string interning has the same effect on the memory footprint as built-in interning, but it's limited to the appropriate context.

This approach sits better with me (in this particular case, which is analogous to a real-world example I'm working with) in some ways, but obviously circumventing the built-in mechanism does not sit well at all. So I'm just curious if there is in fact a built-in (or commonly used) way to do this - I found lots of discussion about string interning out there but couldn't find this particular question anywhere.

  • If the strings are short, I thought Python automatically interns them. Commented Jan 14, 2020 at 3:23
  • 1
    @Barmar Python 3.8.1 tells me ''.join('aa') is ''.join('aa') is False. And 'aa' is ''.join('aa'), too. Commented Jan 14, 2020 at 3:29
  • Maybe it's just string literals, not strings that are created dynamically. Commented Jan 14, 2020 at 3:30
  • Yeah, that's it. medium.com/@bdov_/… Commented Jan 14, 2020 at 3:31
  • 1
    Do you know all possible values in advance, like your COLORS tuple suggests? Commented Jan 14, 2020 at 3:38

2 Answers


I ran into this when I was converting some records from one format to another. The process involved many steps of string transformation, and even after using this approach it still required 14 GB of memory to represent all the strings and dictionaries.

What I found was that sys.intern didn't work on Unicode strings, and Unicode strings weren't weakly referenceable, so I couldn't use a weak-key dictionary. So I used a regular dictionary and wrote:

_objs = {}

def intern(x):
    return _objs.setdefault(x, x)

This is better than sys.intern as it works on any hashable type, and the dictionary can be cleared when it's no longer needed. If you pre-populate the dictionary, you can use .get(x, x) instead if you would prefer it not to accumulate values.
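To illustrate both variants (a self-contained sketch; the names `canon` and `lookup` are mine, not from the answer):

```python
_objs = {}

def intern(x):
    # setdefault stores x on first sight and returns the stored
    # (canonical) object on every later call with an equal key
    return _objs.setdefault(x, x)

a = ''.join('red')
b = ''.join('red')
print(intern(a) is intern(b))  # True: both resolve to the first object seen

# Pre-populated, non-accumulating variant: unknown values pass through.
canon = {c: c for c in ('red', 'green', 'blue', 'orange')}
def lookup(x):
    return canon.get(x, x)

print(lookup('purple'))  # 'purple'
```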

If needed, one could even use a dictionary with a limited size and an eviction policy.
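One way to sketch that, assuming a least-recently-used eviction policy is acceptable (`BoundedIntern` is a hypothetical helper, not part of any library):

```python
from collections import OrderedDict

class BoundedIntern:
    """Interning dict with a size cap and least-recently-used eviction."""
    def __init__(self, maxsize=1000):
        self._objs = OrderedDict()
        self._maxsize = maxsize

    def __call__(self, x):
        canon = self._objs.setdefault(x, x)  # store x if unseen, else canonical
        self._objs.move_to_end(x)            # mark as recently used
        if len(self._objs) > self._maxsize:
            self._objs.popitem(last=False)   # evict least recently used
        return canon
```

Values evicted from the store are simply re-interned the next time they are seen.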



For the general case, I would have suggested the same as Dan. But since you clarified that you know all possible values in advance, you could also just build a dictionary mapping each value to itself:

intern = dict(zip(COLORS, COLORS))

and use it like intern[color]. Demo:

>>> COLORS = 'red', 'green', 'blue', 'orange'
>>> intern = dict(zip(COLORS, COLORS))
>>> color = ''.join('red')
>>> color, intern[color]
('red', 'red')
>>> color == intern[color], color is intern[color]
(True, False)

If for some reason you want to limit the size, like Dan mentioned at the end, an easy yet good option would be to use functools.lru_cache:

from functools import lru_cache

@lru_cache(1000)
def intern(x):
    return x

I'd say all of these (Dan's included) are better than your own version, as a dict lookup is simpler and faster than looping through candidates (at least when the number of candidates is large).
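A quick sanity check of the lru_cache variant (self-contained; the identity assertion relies on the cache returning the stored object on a hit):

```python
from functools import lru_cache

@lru_cache(1000)
def intern(x):
    return x

a = ''.join('red')
b = ''.join('red')
print(a == b, intern(a) is intern(b))  # True True: the cache hit reuses the first object
```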
