为了账号安全,请及时绑定邮箱和手机立即绑定

构建大型词到索引到词词典的最有效数据结构是什么?

构建大型词到索引到词词典的最有效数据结构是什么?

交互式爱情 2021-11-30 10:19:29
我想索引大量的字符串(将每个字符串映射到一个数值),但也能够从其数字索引中检索每个字符串。由于内存问题,使用哈希表或 python dict 不是一种选择,所以我决定使用基数树来存储字符串,我可以非常快速地检索任何字符串的索引并处理非常大量的字符串。我的问题是我还需要从它们的数字索引中检索字符串,如果我维护一个“反向索引”列表 [string1, string2, ..., stringn],我将失去 Trie 的内存优势。我想也许“反向索引”可能是指向某种 Trie 结构的最后一个节点的指针列表,但首先,python 中没有指针,其次我不确定我是否可以拥有“节点级” " 访问我当前使用的 Trie 结构。这种数据结构是否已经存在?如果不是,你将如何在 python 中做到这一点?
查看完整描述

2 回答

?
慕无忌1623718

TA贡献1744条经验 获得超4个赞

您需要两个同步的数据结构用于键和值查找,每个都保存对另一个叶节点的引用。

ID 查找的结构可以是任何具有足够效率的结构——平衡树、哈希表、另一个树。

为了能够从叶节点引用中提取值,trie 需要允许 1) 叶节点引用本身(不一定是真正的 Python 引用,它的 API 可以使用的任何东西);2)走上trie从该引用中提取单词。

请注意,引用实际上是一个唯一的整数,因此如果您的 ID 不大于一个整数,那么将某些内容作为 ID 重用是有意义的——例如,trie 节点引用自身。然后,如果 trie API 可以验证这样的引用(即判断它是否具有具有这样引用的已用节点),这将充当 ID 查找,您根本不需要第二个结构!这样,尽管进程和运行之间的引用值(有效的内存地址)会发生变化,但 ID 将是非持久的。


查看完整回答
反对 回复 2021-11-30
?
忽然笑

TA贡献1806条经验 获得超5个赞

我在回答自己,因为我最终只使用 python3 内置函数创建了我自己的数据结构,它非常适合我遇到的单词到索引到单词的问题。


我试图让它干净高效,但显然还有改进的空间,C 绑定会更好。


所以最终的结果是一个 indexedtrie 类,它看起来像一个 python dict(或者 defaultdict,如果你用 default_factory 参数调用它)但也可以像列表一样查询,因为自动维护了一种“反向索引”。


存储在内部基数树中的键可以是任何可下标的对象(字节、字符串、元组、列表)以及您想要在其中存储任何内容的值。


indextrie 类也是可选的,您可以从 radix 尝试的关于“前缀搜索”和此类事情的优势中受益!


树中的每个键都与一个唯一的整数索引相关联,您可以检索带有索引的键或带有键的索引,整个过程快速且内存安全,所以我个人认为这是最好的数据结构之一世界,它应该被集成到 python 标准库中 :)。


话不多说,下面是代码,随意改编和使用:


"""

A Python3 indexed trie class.

An indexed trie's key can be any subscriptable object. 

Keys of the indexed trie are stored using a "radix trie", a space-optimized data-structure which has many advantages (see https://en.wikipedia.org/wiki/Radix_tree).

Also, each key in the indexed trie is associated to a unique index which is build dynamically.


Indexed trie is used like a python dictionary (and even a collections.defaultdict if you want to) but its values can also be accessed or updated (but not created) like a list!


Example:

    >>> t = indextrie()

    >>> t["abc"] = "hello"

    >>> t[0]

    'hello'

    >>> t["abc"]

    'hello'

    >>> t.index2key(0)

    'abc'

    >>> t.key2index("abc")

    0

    >>> t[:]

    [0]

    >>> print(t)

    {(0, 'abc'): hello}

"""


__author__ = "@fbparis"


_SENTINEL = object()


class _Node(object):

    """

    A single node in the trie.

    """

    __slots__ = "_children", "_parent", "_index", "_key"


    def __init__(self, key, parent, index=None):

        self._children = set()

        self._key = key

        self._parent = parent

        self._index = index

        self._parent._children.add(self)


class IndexedtrieKey(object):

    """

    A pair (index, key) acting as an indexedtrie's key

    """

    __slots__ = "index", "key"


    def __init__(self, index, key):

        self.index = index

        self.key = key


    def __repr__(self):

        return "(%d, %s)" % (self.index, self.key)


class indexedtrie(object):

    """

    The indexed trie data-structure.

    """

    __slots__ = "_children", "_indexes", "_values", "_nodescount", "_default_factory"


    def __init__(self, items=None, default_factory=_SENTINEL):

        """

        A list of items can be passed to initialize the indexed trie.

        """

        self._children = set()

        self.setdefault(default_factory)

        self._indexes = []

        self._values = []

        self._nodescount = 0 # keeping track of nodes count is purely informational

        if items is not None:

            for k, v in items:

                if isinstance(k, IndexedtrieKey):

                    self.__setitem__(k.key, v)

                else:

                    self.__setitem__(k, v)


    @classmethod

    def fromkeys(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):

        """

        Build a new indexedtrie from a list of keys.

        """

        obj = cls(default_factory=default_factory)

        for key in keys:

            if value is _SENTINEL:

                if default_factory is not _SENTINEL:

                    obj[key] = obj._default_factory()

                else:

                    obj[key] = None

            else:

                obj[key] = value

        return obj


    @classmethod

    def fromsplit(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):

        """

        Build a new indexedtrie from a splitable object.

        """

        obj = cls(default_factory=default_factory)

        for key in keys.split():

            if value is _SENTINEL:

                if default_factory is not _SENTINEL:

                    obj[key] = obj._default_factory()

                else:

                    obj[key] = None

            else:

                obj[key] = value

        return obj


    def setdefault(self, factory=_SENTINEL):

        """

        """

        if factory is not _SENTINEL:

            # indexed trie will act like a collections.defaultdict except in some cases because the __missing__

            # method is not implemented here (on purpose).

            # That means that simple lookups on a non existing key will return a default value without adding

            # the key, which is the more logical way to do.

            # Also means that if your default_factory is for example "list", you won't be able to create new

            # items with "append" or "extend" methods which are updating the list itself.

            # Instead you have to do something like trie["newkey"] += [...]

            try:

                _ = factory()

            except TypeError:

                # a default value is also accepted as default_factory, even "None"

                self._default_factory = lambda: factory

            else:

                self._default_factory = factory

        else:

            self._default_factory = _SENTINEL


    def copy(self):

        """

        Return a pseudo-shallow copy of the indexedtrie.

        Keys and nodes are deepcopied, but if you store some referenced objects in values, only the references will be copied.

        """

        return self.__class__(self.items(), default_factory=self._default_factory)


    def __len__(self):

        return len(self._indexes)


    def __repr__(self):

        if self._default_factory is not _SENTINEL:

            default = ", default_value=%s" % self._default_factory()

        else:

            default = ""

        return "<%s object at %s: %d items, %d nodes%s>" % (self.__class__.__name__, hex(id(self)), len(self), self._nodescount, default)


    def __str__(self):

        ret = ["%s: %s" % (k, v) for k, v in self.items()]

        return "{%s}" % ", ".join(ret)


    def __iter__(self):

        return self.keys()


    def __contains__(self, key_or_index):

        """

        Return True if the key or index exists in the indexed trie.

        """

        if isinstance(key_or_index, IndexedtrieKey):

            return key_or_index.index >= 0 and key_or_index.index < len(self)

        if isinstance(key_or_index, int):

            return key_or_index >= 0 and key_or_index < len(self)

        if self._seems_valid_key(key_or_index):

            try:

                node = self._get_node(key_or_index)

            except KeyError:

                return False

            else:

                return node._index is not None

        raise TypeError("invalid key type")


    def __getitem__(self, key_or_index):

        """

        """

        if isinstance(key_or_index, IndexedtrieKey):

            return self._values[key_or_index.index]

        if isinstance(key_or_index, int) or isinstance(key_or_index, slice):

            return self._values[key_or_index]

        if self._seems_valid_key(key_or_index):

            try:

                node = self._get_node(key_or_index)

            except KeyError:

                if self._default_factory is _SENTINEL:

                    raise

                else:

                    return self._default_factory()

            else:

                if node._index is None:

                    if self._default_factory is _SENTINEL:

                        raise KeyError

                    else:

                        return self._default_factory()

                else:

                    return self._values[node._index]            

        raise TypeError("invalid key type")


    def __setitem__(self, key_or_index, value):

        """

        """

        if isinstance(key_or_index, IndexedtrieKey):

            self._values[key_or_index.index] = value

        elif isinstance(key_or_index, int):

            self._values[key_or_index] = value

        elif isinstance(key_or_index, slice):

            raise NotImplementedError

        elif self._seems_valid_key(key_or_index):

            try:

                node = self._get_node(key_or_index)

            except KeyError:

                # create a new node

                self._add_node(key_or_index, value)

            else:

                if node._index is None:

                    # if node exists but not indexed, we index it and update the value

                    self._add_to_index(node, value)

                else:

                    # else we update its value

                    self._values[node._index] = value

        else:

            raise TypeError("invalid key type")


    def __delitem__(self, key_or_index):

        """

        """

        if isinstance(key_or_index, IndexedtrieKey):

            node = self._indexes[key_or_index.index]

        elif isinstance(key_or_index, int):

            node = self._indexes[key_or_index]

        elif isinstance(key_or_index, slice):

            raise NotImplementedError

        elif self._seems_valid_key(key_or_index):

            node = self._get_node(key_or_index)

            if node._index is None:

                raise KeyError

        else:

            raise TypeError("invalid key type")

        # switch last index with deleted index (except if deleted index is last index)

        last_node, last_value = self._indexes.pop(), self._values.pop()

        if node._index != last_node._index:

            last_node._index = node._index

            self._indexes[node._index] = last_node

            self._values[node._index] = last_value

        if len(node._children) > 1:

            #case 1: node has more than 1 child, only turn index off

            node._index = None

        elif len(node._children) == 1:

            # case 2: node has 1 child

            child = node._children.pop()

            child._key = node._key + child._key

            child._parent = node._parent

            node._parent._children.add(child)

            node._parent._children.remove(node)

            del(node)

            self._nodescount -= 1

        else:

            # case 3: node has no child, check the parent node

            parent = node._parent

            parent._children.remove(node)

            del(node)

            self._nodescount -= 1

            if hasattr(parent, "_index"):

                if parent._index is None and len(parent._children) == 1:

                    node = parent._children.pop()

                    node._key = parent._key + node._key

                    node._parent = parent._parent

                    parent._parent._children.add(node)

                    parent._parent._children.remove(parent)

                    del(parent)

                    self._nodescount -= 1


    @staticmethod

    def _seems_valid_key(key):

        """

        Return True if "key" can be a valid key (must be subscriptable).

        """

        try:

            _ = key[:0]

        except TypeError:

            return False

        return True


    def keys(self, prefix=None):

        """

        Yield keys stored in the indexedtrie where key is a IndexedtrieKey object.

        If prefix is given, yield only keys of items with key matching the prefix.

        """

        if prefix is None:

            for i, node in enumerate(self._indexes):

                yield IndexedtrieKey(i, self._get_key(node))

        else:

            if self._seems_valid_key(prefix):

                empty = prefix[:0]

                children = [(empty, prefix, child) for child in self._children]

                while len(children):

                    _children = []

                    for key, prefix, child in children:

                        if prefix == child._key[:len(prefix)]:

                            _key = key + child._key

                            _children.extend([(_key, empty, _child) for _child in child._children])

                            if child._index is not None:

                                yield IndexedtrieKey(child._index, _key)

                        elif prefix[:len(child._key)] == child._key:

                            _prefix = prefix[len(child._key):]

                            _key = key + prefix[:len(child._key)]

                            _children.extend([(_key, _prefix, _child) for _child in child._children])

                    children = _children

            else:

                raise ValueError("invalid prefix type")


    def values(self, prefix=None):

        """

        Yield values stored in the indexedtrie.

        If prefix is given, yield only values of items with key matching the prefix.

        """

        if prefix is None:

            for value in self._values:

                yield value

        else:

            for key in self.keys(prefix):

                yield self._values[key.index]


    def items(self, prefix=None):

        """

        Yield (key, value) pairs stored in the indexedtrie where key is a IndexedtrieKey object.

        If prefix is given, yield only (key, value) pairs of items with key matching the prefix.

        """

        for key in self.keys(prefix):

            yield key, self._values[key.index]


    def show_tree(self, node=None, level=0):

        """

        Pretty print the internal trie (recursive function).

        """

        if node is None:

            node = self

        for child in node._children:

            print("-" * level + "<key=%s, index=%s>" % (child._key, child._index))

            if len(child._children):

                self.show_tree(child, level + 1)


    def _get_node(self, key):

        """

        Return the node associated to key or raise a KeyError.

        """

        children = self._children

        while len(children):

            notfound = True

            for child in children:

                if key == child._key:

                    return child

                if child._key == key[:len(child._key)]:

                    children = child._children

                    key = key[len(child._key):]

                    notfound = False

                    break

            if notfound:

                break

        raise KeyError


    def _add_node(self, key, value):

        """

        Add a new key in the trie and updates indexes and values.

        """

        children = self._children

        parent = self

        moved = None

        done = len(children) == 0

        # we want to insert key="abc"

        while not done:

            done = True

            for child in children:

                # assert child._key != key # uncomment if you don't trust me

                if child._key == key[:len(child._key)]:

                    # case 1: child's key is "ab", insert "c" in child's children 

                    parent = child

                    children = child._children

                    key = key[len(child._key):]

                    done = len(children) == 0

                    break

                elif key == child._key[:len(key)]:

                    # case 2: child's key is "abcd", we insert "abc" in place of the child

                    # child's parent will be the inserted node and child's key is now "d"

                    parent = child._parent

                    moved = child

                    parent._children.remove(moved)

                    moved._key = moved._key[len(key):]

                    break

                elif type(key) is type(child._key): # don't mess it up

                    # find longest common prefix

                    prefix = key[:0]

                    for i, c in enumerate(key):

                        if child._key[i] != c:

                            prefix = key[:i]

                            break

                    if prefix:

                        # case 3: child's key is abd, we spawn a new node with key "ab"

                        # to replace child ; child's key is now "d" and child's parent is 

                        # the new created node.

                        # the new node will also be inserted as a child of this node

                        # with key "c"

                        node = _Node(prefix, child._parent)

                        self._nodescount += 1

                        child._parent._children.remove(child)

                        child._key = child._key[len(prefix):]

                        child._parent = node

                        node._children.add(child)

                        key = key[len(prefix):]

                        parent = node

                        break

        # create the new node

        node = _Node(key, parent)

        self._nodescount += 1

        if moved is not None:

            # if we have moved an existing node, update it

            moved._parent = node

            node._children.add(moved)

        self._add_to_index(node, value)


    def _get_key(self, node):

        """

        Rebuild key from a terminal node.

        """

        key = node._key

        while node._parent is not self:

            node = node._parent

            key = node._key + key

        return key


    def _add_to_index(self, node, value):

        """

        Add a new node to the index.

        Also record its value.

        """

        node._index = len(self)

        self._indexes.append(node)

        self._values.append(value)


    def key2index(self, key):

        """

        key -> index

        """

        if self._seems_valid_key(key):

            node = self._get_node(key)

            if node._index is not None:

                return node._index

            raise KeyError

        raise TypeError("invalid key type")


    def index2key(self, index):

        """

        index or IndexedtrieKey -> key.

        """

        if isinstance(index, IndexedtrieKey):

            index = index.index

        elif not isinstance(index, int):

            raise TypeError("index must be an int")

        if index < 0 or index > len(self._indexes):

            raise IndexError

        return self._get_key(self._indexes[index])

分享

编辑

跟随


查看完整回答
反对 回复 2021-11-30
  • 2 回答
  • 0 关注
  • 152 浏览
慕课专栏
更多

添加回答

举报

0/150
提交
取消
意见反馈 帮助中心 APP下载
官方微信