Python 容器原理-工具盒子

Python 提供了多种容器类型，用于存储和组织数据。以下是 Python 中常用的容器类型的简介：

列表（List）：列表是最常用的容器类型之一。它是有序、可变的，可以包含任意类型的元素。列表使用方括号 [] 表示，元素之间用逗号分隔。可以通过索引访问和修改列表中的元素。
元组（Tuple）：元组是有序、不可变的容器类型。它使用圆括号 () 表示，元素之间用逗号分隔。与列表不同，元组的元素不能被修改。元组通常用于存储不可变的数据，例如函数返回多个值。
字典（Dictionary）：字典是一种键值对（key-value）的数据结构。它是无序的，可变的，键和值可以是任意类型的对象。字典使用花括号 {} 表示，键和值之间使用冒号 : 分隔，键值对之间用逗号分隔。
集合（Set）：集合是无序的、可变的容器类型，用于存储唯一的元素。集合不允许重复的元素。集合可以执行交集、并集、差集等常见的集合操作。集合使用花括号 {} 表示，元素之间用逗号分隔。
List 列表 {#title-0} =====================

列表是一个可变长的线性容器，其内存结构如下：

通过内存结构，我们可以得到列表类型的特点如下：

支持存储不同的数据类型
列表的指针数组需要动态扩容
每次增删元素都需要动态内存操作 (通过容量来减少操作次数)
扩容操作步骤，按倍申请内存、元素复制、释放原内存
在尾部添加元素，不需要移动元素，效率比头插入、中间位置插入效率高

Python 列表结构 PyListObject 定义如下：

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;
/* ob_item contains space for 'allocated' elements.  The number
 * currently in use is ob_size.
 * Invariants:
 *     0 &lt;= ob_size &lt;= allocated
 *     len(list) == ob_size
 *     ob_item == NULL implies ob_size == allocated == 0
 * list.sort() temporarily sets allocated to -1 to detect mutations.
 *
 * Items must normally not be NULL, except during construction when
 * the list is not yet visible outside the function that builds it.
 */
Py_ssize_t allocated;

} PyListObject;

PyObject_VAR_HEAD 用于类型信息存储、支持引用计数
ob_item 指针数组，每一个元素指向真正存储数据的空间
allocated 列表中已存储（分配内存）的元素数量

源码中关于列表类型的主要文件如下：

cpython/Include/cpython/listobject.h 主要定义 PyListObject 结构和相关函数的声
cpython/Include/listobject.h 主要定义宏定义或其他实现细节
cpython/Objects/listobject.c 列表具体函数实现
Tuple 元组 {#title-1} ======================

元组则是一个定长的 List 容器，它不允许使用过程中增删元素，其内存结构如下：

Tuple 和 List 不同之处在于，Tuple 一旦初始化之后，长度就不会发生改变。Tuple 对应的数据结构如下：

typedef struct {
    PyObject_VAR_HEAD
    /* ob_item contains space for 'ob_size' elements.
       Items must normally not be NULL, except during construction when
       the tuple is not yet visible outside the function that builds it. */
    PyObject *ob_item[1];
} PyTupleObject;

Tuple 数据结构中缺少了容量变量 allocated，这表名 Tuple 不需要进行扩容。Tuple 实现可定义不同长度的容器，就是通过 ob_item 来实现，这是 C99 中特性，能够实现在结构体中实现数组成员的可伸缩特性。进而能够定义出不同元素数量的定长数组。

源码中关于列表类型的主要文件如下：

cpython/Include/cpython/tupleobject.h 主要定义 PyTupleObject 结构和相关函数的声
cpython/Include/tupleobject.h 主要定义宏定义或其他实现细节
cpython/Objects/tupleobject.c 元组具体函数实现
Set 集合 {#title-2} ====================

Python 的 Set 集合容器通过哈希表来实现的，如下图所示：

输入的元素经过哈希函数得到 hash 编码，然后再将 hash 编码通过与 mask 进行位与运算映射到数组中。这样的容器优点是查询元素速度非常快，几乎是常数时间。也会有以下的不足：

不同的输入可能会被 hash 到相同的数组位置中，出现散列冲突的问题，当然这个也可以通过拉链法解决
当实际存储元素的数量达到一定的比例，会增大散列冲突的可能性，此时会对数组进行扩容，并重新计算所有元素的哈希值，这会带来性能上的开销

Set 容器的结构定义如下：

#define PySet_MINSIZE 8
typedef struct {
PyObject key;
Py_hash_t hash;             / Cached hash code of the key */
} setentry;
typedef struct {
PyObject_HEAD
Py_ssize_t fill;            /* Number active and dummy entries*/
Py_ssize_t used;            /* Number active entries */

/* The table contains mask + 1 slots, and that's a power of 2.
 * We store the mask instead of the size because the mask is more
 * frequently needed.
 */
Py_ssize_t mask;

/* The table points to a fixed-size smalltable for small tables
 * or to additional malloc'ed memory for bigger tables.
 * The table pointer is never NULL which saves us from repeated
 * runtime null-tests.
 */
setentry *table;
Py_hash_t hash;             /* Only used by frozenset objects */
Py_ssize_t finger;          /* Search finger for pop() */

setentry smalltable[PySet_MINSIZE];
PyObject *weakreflist;      /* List of weak references */

} PySetObject;

fill 表示整个数组的长度
used 表示已存储元素的数量
mask 用于与输入元素的哈希值进行位与运算，以便能够与数组的下标进行映射
table 指向 setentry 类型的数组空间，实际也是存储元素的空间
finger 一种为搜素进行优化而使用的字段
smalltable 如果集合中存储的元素较少，则直接使用该空间，不需要额外动态开辟其他空间
weakreflist 与引用计数有关的字段

源码中关于列表类型的主要文件如下：

cpython/Include/cpython/setobject.h 主要定义 PySetObject 结构和相关函数的声
cpython/Include/setobject.h 主要定义宏定义或其他实现细节
cpython/Objects/setobject.c 元组具体函数实现
Dict 字典容器 {#title-3} =======================

Python 中 Dict 字典容器也是通过哈希表来实现，不同于 Set 容器的是，Dict 会存储 key 和 value ，而对于 Set 而言 key = value。

typedef struct {
    PyObject_HEAD
/* Number of items in the dictionary */
Py_ssize_t ma_used;

/* Dictionary version: globally unique, value change each time
   the dictionary is modified */

#ifdef Py_BUILD_CORE
uint64_t ma_version_tag;
#else
Py_DEPRECATED(3.12) uint64_t ma_version_tag;
#endif
PyDictKeysObject *ma_keys;

/* If ma_values is NULL, the table is &quot;combined&quot;: keys and values
   are stored in ma_keys.

   If ma_values is not NULL, the table is split:
   keys are stored in ma_keys and values are stored in ma_values */
PyDictValues *ma_values;

} PyDictObject;

ma_used 字典中元素的数量
ma_keys 指向存储 key 的空间
ma_values 指向存储 value 的空间

PyDictKeysObject 结构体定义如下：

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
/* Size of the hash table (dk_indices). It must be a power of 2. */
uint8_t dk_log2_size;

/* Size of the hash table (dk_indices) by bytes. */
uint8_t dk_log2_index_bytes;

/* Kind of keys */
uint8_t dk_kind;

/* Version number -- Reset to 0 by any modification to keys */
uint32_t dk_version;

/* Number of usable entries in dk_entries. */
Py_ssize_t dk_usable;

/* Number of used entries in dk_entries. */
Py_ssize_t dk_nentries;

/* Actual hash table of dk_size entries. It holds indices in dk_entries,
   or DKIX_EMPTY(-1) or DKIX_DUMMY(-2).

   Indices must be: 0 &lt;= indice &lt; USABLE_FRACTION(dk_size).

   The size in bytes of an indice depends on dk_size:

   - 1 byte if dk_size &lt;= 0xff (char*)
   - 2 bytes if dk_size &lt;= 0xffff (int16_t*)
   - 4 bytes if dk_size &lt;= 0xffffffff (int32_t*)
   - 8 bytes otherwise (int64_t*)

   Dynamically sized, SIZEOF_VOID_P is minimum. */
char dk_indices[];  /* char is required to avoid strict aliasing. */

/* &quot;PyDictKeyEntry or PyDictUnicodeEntry dk_entries[USABLE_FRACTION(DK_SIZE(dk))];&quot; array follows:
   see the DK_ENTRIES() macro */

};

PyDictValues 结构体定义如下：

struct _dictvalues {
    PyObject *values[1];
};

51工具盒子

Python 容器原理

厉飞雨

相关推荐

最新文章

猜你喜欢

快捷分类