algorithm - Python: data structure to index sets to find supersets? -
in python project have pool of objects, each tagged set of words. want generate sets including subsets of tags, mapped linked objects. subsets should not smaller complete tag set of project. instance, let's there these objects tags:
apple: fruit, green, nature sometree: tree, green, wood, nature banana: fruit, nature someplant: green, wood, nature otherplant: green, wood, nature the result should be:
(fruit, nature): banana, apple (fruit, green, nature): apple (green, wood, nature): sometree, someplant, otherplant (green, wood, nature, tree): sometree i don't want result include tag sets don't exists complete tag set @ least 1 objects.
to achieve this, came o(n²) algorithm, wonder if there more clever approaches e.g. tree-based index or prefix-trie? won't have objects, maybe 2000. however, should fast.
my current solution iterates objects create dictionary mapping tag sets objects. in second iteration test each tag set if subset of other tag sets , if so, remember objects of supersets.
update: in o(n²) abovel, n referrs number of objects. have many objects each has 5 tags.
solution: replies. ended using method of azorius, because both fast , robust. complete example list groups:
tagged = { 'apple': ['fruit', 'green', 'nature'], 'sometree': ['tree', 'green', 'wood', 'nature'], 'banana': ['fruit', 'nature'], 'someplant': ['green', 'wood', 'nature'], 'otherplant': ['green', 'wood', 'nature'] } tag_sets = set() tags_to_objects = {} obj, tags in tagged.items(): tag in tags: try: tags_to_objects[tag].add(obj) except keyerror: tags_to_objects[tag] = set([obj]) # record tag sets tag_set = frozenset(tags) tag_sets.add(tag_set) groups = {} tag_set in tag_sets: objects = none tag in tag_set: if objects: objects = objects.intersection(tags_to_objects[tag]) else: objects = tags_to_objects[tag] groups[tag_set] = objects k,v in groups.items(): print '(%s): %s' % (', '.join(k), ', '.join(v)) since there several solutions, did timing tests on 1035 objects 5 tags each:
- code of azorius: 127ms
- code of bruce: 115ms
- code of pillmuncher: 137ms
the code of pillmuncher in opinion elegant.
why not use sets?, according http://wiki.python.org/moin/timecomplexity sets have following average time complexity o(min(len(s), len(t)), aka o(n)!
fruit = set(("apple", "banana")) green = set(("apple", "sometree", "someplant", "otherplant")) nature = set(("apple", "banana", "sometree", "someplant", "otherplant")) tree = set (("sometree",)) wood = set (("sometree", "someplant", "otherplant")) in [90]: fruit.intersection(nature) out[90]: set(['apple', 'banana']) in [91]: fruit.intersection(nature).intersection(green) out[91]: set(['apple']) in [92]: green.intersection(wood).intersection(nature) out[92]: set(['sometree', 'someplant', 'otherplant']) in [93]: green.intersection(wood).intersection(nature).intersection(tree) out[93]: set(['sometree']) looking @ code, more elegant answer use reduce:
in [90]: reduce(set.intersection, [fruit, nature]) out[90]: set(['apple', 'banana']) in [91]: reduce(set.intersection, [fruit, nature, green]) out[91]: set(['apple']) in [92]: reduce(set.intersection, [green, wood, nature]) out[92]: set(['sometree', 'someplant', 'otherplant']) in [93]: reduce(set.intersection, [green, wood, nature, tree]) out[93]: set(['sometree'])
Comments
Post a Comment