3

I have two lists and I want to find the matching elements using python difflib/sequence matcher, and it goes like this:

from difflib import SequenceMatcher
def match_seq(list1,list2):
    output=[]
    s = SequenceMatcher(None, list1, list2)
    blocks=s.get_matching_blocks()
    for bl in blocks:
        #print(bl, bl.a, bl.b, bl.size)
        for bi in range(bl.size):
            cur_a=bl.a+bi
            cur_b=bl.b+bi
            output.append((cur_a,cur_b))
    return output

so when I run it on two lists like this

list1=["orange","apple","lemons","grapes"]
list2=["pears", "orange","apple", "lemons", "cherry", "grapes"]
for a,b in match_seq(list1,list2):
    print(a,b, list1[a],list2[b])

I get this output:

(0, 1, 'orange', 'orange')
(1, 2, 'apple', 'apple')
(2, 3, 'lemons', 'lemons')
(3, 5, 'grapes', 'grapes')

but suppose that I do not want to match the identical items only, but rather use a matching function (for example, a function that can match orange to oranges or vice versa, or match the equivalent word in another language).

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

Is there any option in difflib/sequence matcher, or any other python built-in library, that can provide this, so that I can match list3 and list 4, and also list3 and list5, the same way I did for list 1 and list2?

In general, can you think of a solution for this? I thought of replacing each word in the target list with the possible equivalents that I want to match, but this can be problematic because I may need to have multiple equivalents for each word, which may disturb the sequence.

3 Answers 3

3
+100

You have basically three solutions: 1) write your own implementation of diff; 2) hack the difflib module; 3) find a workaround.

Your own implementation

In case 1), you could look at this question and read a few books like CLRS or the books of Robert Sedgewick.

Hack the difflib module

In case 2), have a look at the source code: get_matching_blocks calls find_longest_match at line 479. In the core of find_longest_match, you have the b2j dictionary that maps elements of the list a to their indexes in the list b. If you overwrite this dictionary, you can achieve what you want. Here's the standard version:

>>> import difflib
>>> from difflib import SequenceMatcher
>>> list3 = ["orange","apple","lemons","grape"]
>>> list4 = ["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> s = SequenceMatcher(None, list3, list4)
>>> s.get_matching_blocks()
[Match(a=1, b=2, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(1, 2, 'apple', 'apple')]

Here's the hacked version:

>>> s = SequenceMatcher(None, list3, list4)
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5]}
>>> s.b2j = {**s.b2j, 'orange':s.b2j['oranges'], 'lemons':s.b2j['lemon'], 'grape':s.b2j['grapes']}
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5], 'orange': [1], 'lemons': [3], 'grape': [5]}
>>> s.get_matching_blocks()
[Match(a=0, b=1, size=3), Match(a=3, b=5, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'), (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]

This is not hard to automate, but I wouldn't recommend you that solution, since there is a very simple workaround.

A workaround

The idea is to group words by families:

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes"}]

It's now easy to create a dictionary that takes maps every word of the family to one of those words (let's call it the main word):

>>> d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
>>> d
{'pears': 'peras', 'orange': 'naranjas', 'oranges': 'naranjas', 'manzana': 'apple', 'lemon': 'lemons', 'limón': 'lemons', 'cherry': 'cereza', 'grape': 'grapes'}

Note that main, *alternatives in map(list, families) unpacks the family into a main word (the first of the list) and a list of alternatives using the star operator:

>>> head, *tail = [1,2,3,4,5]
>>> head
1
>>> tail
[2, 3, 4, 5]

You can then convert the lists to use only main words:

>>> list3=["orange","apple","lemons","grape"]
>>> list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
>>> [d.get(w, w) for w in list3]
['naranjas', 'apple', 'limón', 'grapes']
>>> [d.get(w, w) for w in list4]
['peras', 'naranjas', 'apple', 'limón', 'cereza', 'grapes']
>>> [d.get(w, w) for w in list5]
['peras', 'naranjas', 'apple', 'limón', 'cereza', 'uvas']

The expression d.get(w, w) will return d[w] if w is a key, else w itself. Hence, the words that belong to a family are converted to the main word of that family and the other words are left untouched.

Those lists are easy to compare with the difflib.

Important: the time complexity of the conversion of the lists is negligible compared to the diff algorithm, hence you should not see the difference.

Full code

As a bonus, the full code:

def match_seq(list1, list2):
    """A generator that yields matches of list1 vs list2"""
    s = SequenceMatcher(None, list1, list2)
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i # you don't need to store the matches, just yields them

def create_convert(*families):
    """Return a converter function that converts a list
    to the same list with only main words"""
    d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
    return lambda L: [d.get(w, w) for w in L]

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
convert = create_convert(*families)

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

print ("list3 vs list4")
for a,b in match_seq(convert(list3), convert(list4)):
    print(a,b, list3[a],list4[b])

#  list3 vs list4
# 0 1 orange oranges
# 1 2 apple apple
# 2 3 lemons lemon
# 3 5 grape grapes

print ("list3 vs list5")
for a,b in match_seq(convert(list3), convert(list5)):
    print(a,b, list3[a],list5[b])

# list3 vs list5
# 0 1 orange naranjas
# 1 2 apple manzana
# 2 3 lemons limón
# 3 5 grape uvas
Sign up to request clarification or add additional context in comments.

1 Comment

very nice solution, I like both the hacking and the workaround!
1

Here's an approach that utilizes a class that inherits from UserString and overrides __eq__() and __hash__() such that strings deemed to be synonyms evaluate as equal:

import collections
from difflib import SequenceMatcher


class SynonymString(collections.UserString):
    def __init__(self, seq, synonyms, inverse_synonyms):
        super().__init__(seq)

        self.synonyms = synonyms
        self.inverse_synonyms = inverse_synonyms

    def __eq__(self, other):
        if self.synonyms.get(other) and self.data in self.synonyms.get(other):
            return True
        return self.data == other

    def __hash__(self):
        if str(self.data) in self.inverse_synonyms:
            return hash(self.inverse_synonyms[self.data])
        return hash(self.data)


def match_seq_syn(list1, list2, synonyms):

    inverse_synonyms = {
        string: key for key, value in synonyms.items() for string in value
    }

    list1 = [SynonymString(s, synonyms, inverse_synonyms) for s in list1]
    list2 = [SynonymString(s, synonyms, inverse_synonyms) for s in list2]

    output = []
    s = SequenceMatcher(None, list1, list2)
    blocks = s.get_matching_blocks()

    for bl in blocks:
        for bi in range(bl.size):
            cur_a = bl.a + bi
            cur_b = bl.b + bi
            output.append((cur_a, cur_b))
    return output


list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}

for a, b in match_seq_syn(list3, list5, synonyms):
    print(a, b, list3[a], list5[b])

Result (comparing lists 3 and 5):

0 1 orange naranjas
1 2 apple manzana
2 3 lemons limón
3 5 grape uvas

1 Comment

thank you so much, that's a smart idea of using userstring and ovverriding the eq
0

So let's say you want to fill lists with elements that should be matched to each others. I didn't use any library but Generators. I'm not sure of the efficiency, I tried this code once but I think it should works pretty good.

orange_list = ["orange", "oranges"] # Fill this with orange matching words
pear_list = ["pear", "pears"]
lemon_list = ["lemon", "lemons"]
apple_list = ["apple", "apples"]
grape_list = ["grape", "grapes"]

lists = [orange_list, pear_list, lemon_list, apple_list, grape_list] # Put your matching lists inside this list

def match_seq_bol(list1, list2):
    output=[]
    for x in list1:
        for lst in lists:
            matches = (y for y in list2 if (x in lst and y in lst))
            if matches:
                for i in matches:
                    output.append((list1.index(x), list2.index(i), x,i))
    return output;

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]

print(match_seq_bol(list3, list4))

match_seq_bol() means match sequences based on lists.

The output matching list3 and list4 will be:

[
    (0, 1, 'orange', 'oranges'),
    (1, 2, 'apple', 'apple'),
    (2, 3, 'lemons', 'lemon'),
    (3, 5, 'grape', 'grapes')
]

1 Comment

the number of iterations for this solution will be the size of the first list multiplied by the size of the second list, so I don't think it will be scalable.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.