boltdb 1.3.0实现分析（二）

2022-03-15 00:00:00 数据节点结构页面叶子

本文基于boltdb 1.3.0对其实现进行分析。boltdb是etcd系统存储数据使用的KV嵌入式DB，使用Go编码实现，内部是一个B+树结构体。

•Raft算法原理^[1]•etcd Raft库解析^[2]•Etcd存储的实现^[3]•B树、B+树索引算法原理（上）^[4]•B树、B+树索引算法原理（下）^[5]

本文的写作，主要参考了《区块的持久化之BoltDB》系列文章^[6]

在上一节^[7]里面，系统的介绍了Boltdb中几种类型页面的格式，有了这些基础，本节开始介绍boltdb中的Bucket结构。

Bucket

概述

在上一节中，Bucket类比于mysql中的table，在boltdb中，meta
页面中有一个成员bucket
，其存储了整个数据库根bucket的信息，而一个数据库中存储的其他table的信息，则作为子bucket存储到Bucket中。这几个数据结构的关系如下：

type DB struct {
   ...
    meta0    *meta
    meta1    *meta  
}


type meta struct {
   ...
    root     bucket     根bucket的信息
}


type Bucket struct {
    *bucket


   ...
  buckets  map[string]*Bucket  存储子bucket的对应关系
}


type bucket struct {
     根节点的page id
    root pgid  page id of the bucket's root-level page
     单调递增的序列号
    sequence uint64  monotonically incrementing, used by NextSequence()
}

在bucket
数据结构中，两个成员的作用是：

•root：该bucket的根节点的page id。•sequence：该bucket当前的序列号，单调递增，在函数NextSequence
中使用。

每个Bucket
数据结构，都继承自bucket
，同时其中的buckets
存储了该Bucket
中子Bucket名字的对应关系。

后，meta
页面的root
成员，存储的就是这个db的根bucket
页面信息。这几个数据结构之间的关系如下图：

在上图中：

•DB
结构体是表示整一个boltdb数据库的结构体，其中有meta0
和meta1
两个meta
类型的成员，用于保存meta
页面信息。•meta
页面中，其中的root
是一个bucket
类型成员，保存了根bucket的页面信息。•根据bucket
中的页面信息，就能找到DB的根Bucket
信息，其中的buckets
成员保存了该数据库中所有子bucket
名字与实体之间的映射关系。

Bucket结构体定义

接下来介绍Bucket
结构体成员，其定义如下：

type Bucket struct {
    *bucket
    tx       *Tx                 the associated transaction
    buckets  map[string]*Bucket  subbucket cache
    page     *page              / inline page reference
    rootNode *node              // materialized node for the root page.
    nodes    map[pgid]*node     // node cache


    // Sets the threshold for filling nodes when they split. By default,
    // the bucket will fill to 50% but it can be useful to increase this
    // amount if you know that your write workloads are mostly append-only.
    //
    // This is non-persisted across transactions so it must be set in every Tx.
    FillPercent float64
}

其中：

•bucket：存储该Bucket
所在页面ID，以及当前序列号。•tx：当前Bucket关联的事务。•buckets：前面已经介绍过，维护子bucket
的映射关系。•page：存储inline页面信息。•rootNode：该Bucket的B+树根节点指针。•nodes：缓存已经读入内存的page
对应的node
信息。•FillPercent：这是一个阈值，每个节点的数据量超过该阈值时进行分裂操作，默认值为DefaultFillPercent=0.5。至于B+树分裂操作的流程，可以参考文章开始的B+树原理链接。

子Bucket

按照上面的分析，一个bolt的DB存在一个的根Bucket，而DB中不同的table就是该根Bucket的子Bucket，其对应关系存储在Bucket.buckets
成员中。那么，子Bucket信息保存在哪里呢？

答案是保存在叶子节点，也就是leafPageElement
中，回顾一下这个数据结构的定义：

// leafPageElement represents a node on a leaf page.
type leafPageElement struct {
    flags uint32
    pos   uint32
    ksize uint32
    vsize uint32
}

其中的flags
成员，含义如下：

•0：表示就是普通的叶子节点。•1：表示是子bucket。

即在boltdb中，子Bucket的信息，是做为一种特殊的叶子节点信息存储下来的。boltdb使用了常量来表示这种类型的叶子节点标志位：

const (
    bucketLeafFlag = 0x01
)

即：

•对于一个标志位为0的叶子页面：其内容就是B+树叶子页面的内容，存储的是数据的键值，boltdb中叶子页面的格式示意图在上一节中^[8]已经给出。•对于一个标志位为1的叶子页面：其内存存储的是Bucket结构体的信息。

有了以上的介绍，理解起来返回一个子Bucket的相关代码就不难了：

func (b *Bucket) Bucket(name []byte) *Bucket {
    // Move cursor to key.
    // 否则创建一个Cursor查询
    c := b.Cursor()
    k, v, flags := c.seek(name)


    // Return nil if the key doesn't exist or it is not a bucket.
    // 查询不到，或者不是子bucket节点，都返回nil
    if !bytes.Equal(name, k) || (flags&bucketLeafFlag) == 0 {
        return nil
    }


    // Otherwise create a bucket and cache it.
    // 打开这个bucket并且cache到buckets map中
    var child = b.openBucket(v)
    if b.buckets != nil {
        b.buckets[string(name)] = child
    }


    return child
}


func (b *Bucket) openBucket(value []byte) *Bucket {
    // 创建一个bucket
    var child = newBucket(b.tx)


    // If this is a writable transaction then we need to copy the bucket entry.
    // Read-only transactions can point directly at the mmap entry.
    if b.tx.writable {
        child.bucket = &bucket{}
        *child.bucket = *(*bucket)(unsafe.Pointer(&value[0]))
    } else {
        child.bucket = (*bucket)(unsafe.Pointer(&value[0]))
    }


    // Save a reference to the inline page if the bucket is inline.
    // inline bucket
    if child.root == 0 {
        // bucket的page
        child.page = (*page)(unsafe.Pointer(&value[bucketHeaderSize]))
    }


    return &child
}

Bucket.Bucket
用于根据名字返回一个子Bucket的指针，其流程如下：

•首先根据子Bucket名字查找到叶子页面的数据、标志位，如果查找失败，说明不存在该子Bucket；或者其标志位不是bucketLeafFlag
，说明这个名字已经被普通数据占用，即：boltdb中不允许子Bucket与其父Bucket中写入的键同名。•以上查找成功，就以该叶子页面的数据为参数，调用Bucket.openBucket
函数，根据Bucket
结构体格式，反序列化出来Bucket
结构体信息返回。

inline page

从上面的分析可以看到，子Bucket的信息是独占一个叶子页面来存储的，该页面大部分的内容都是冗余的。如果子Bucket中的数据量很少，就会造成磁盘空间的浪费。为了针对这类型Bucket进行优化，boltdb提供了inline page
这个特殊的页面，将小的子Bucket数据存放在这里。

这类型的子Bucket需要满足以下两个条件：

•该子Bucket再没有嵌套的子Bucket了。•整个子Bucket的大小不能超过page size/4。

Bucket.inlineable
函数就是用来做inline
操作的判断的：

// inlineable returns true if a bucket is small enough to be written inline
// and if it contains no subbuckets. Otherwise returns false.
// 返回这个bucket是否能够inline操作
func (b *Bucket) inlineable() bool {
    var n = b.rootNode


    // Bucket must only contain a single leaf node.
    // 如果没有根节点，或者根节点不是叶子节点的不能进行inline操作
    if n == nil || !n.isLeaf {
        return false
    }


    // Bucket is not inlineable if it contains subbuckets or if it goes beyond
    // our threshold for inline bucket size.
    // 有子bucket，或者大小超过maxInlineBucketSize阈值的，都不能进行inline操作
    var size = pageHeaderSize
    for _, inode := range n.inodes {
        size += leafPageElementSize + len(inode.key) + len(inode.value)


        if inode.flags&bucketLeafFlag != 0 {
            return false
        } else if size > b.maxInlineBucketSize() {
            return false
        }
    }


    return true
}


// Returns the maximum total size of a bucket to make it a candidate for inlining.
func (b *Bucket) maxInlineBucketSize() int {
    return b.tx.db.pageSize / 4
}

Cursor

以上已经大体了解Bucket
的结构，在boltdb查找数据流程中，还是使用了Cursor
结构体来做为游标（iterator），保存查找流程中的中间数据，下面也来简单了解一下。

Cursor
结构体的定义如下：

type Cursor struct {
    bucket *Bucket        // 对应的bucket
    stack  []elemRef    // 存储递归遍历时中间过程的栈，由于栈是先进后出结构，所以遍历的过程中层次高的在栈的低端。
}


// 光标移动过程中，中间过程的信息
type elemRef struct {
    page  *page    // 页面
    node  *node    // 内存中的页面信息
    index int        // 保存在当前page、node遍历到了哪个节点
}

Cursor
有以下成员：

•bucket：游标操作所对应的Bucket
指针。•stack：存储递归遍历过程中的栈，由于栈是先进后出结构，所以遍历的过程中层次高的节点在栈的低端。

每个stack成员类型是elemRef
，其成员如下：

•page：页面指针。•node：内存中的页面信息。•index：保存遍历到当前页面的索引位置。

由于node
是page
在内存中的表示，所以实际上在elemRef
结构体中，page
和node
成员同时只会有一个成员不为NULL。

Cursor
结构体做为一个迭代器，对外提供的就是常规迭代器所支持的操作：

•First：返回当前Bucket的个数据。•Last：返回当前Bucket的后一个数据。•Next：返回当前游标位置的下一个数据。•Prev：返回当前游标位置的上一个数据。•Seek：查找到对应的叶子节点返回其键、值。•keyValue：返回当前游标位置的键、值、标志位。•Delete：删除当前游标位置的数据。

在这里，不打算讨论其中的所有实现，如果读者对B+树的实现并不了解，可以看开始介绍B+树原理的连接。

这里以First
函数为例简单的讲解其实现，由于B+树是中序遍历的树结构，因此First
元素一定是左边叶子节点的左边个元素。带注释的代码实现如下：

// 移动到bucket的个元素上，返回其key value数据
func (c *Cursor) First() (key []byte, value []byte) {
    _assert(c.bucket.tx.db != nil, "tx closed")
    // 修改stack只保存个stack，及栈底部
    c.stack = c.stack[:0]
    // 返回root节点的page、node
    p, n := c.bucket.pageNode(c.bucket.root)
    // 将root节点所在的page、node信息放到栈顶，index为0，表示从个子节点开始遍历
    c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})
    // 移动到当前的个元素
    // first函数做的事情：从树顶端开始，从左边一直往下遍历到叶子节点为止
    // 因为B树是中序遍历的，所以左边的节点数据小
    c.first()


    // If we land on an empty page then move to the next value.
    // https://github.com/boltdb/bolt/issues/450
    // 如果没有元素，就移动到下一个元素
    if c.stack[len(c.stack)-1].count() == 0 {
        c.next()
    }


    // 拿到k、v、flags
    k, v, flags := c.keyValue()
    // 如果是子bucket，就返回k以及nil
    if (flags & uint32(bucketLeafFlag)) != 0 {
        return k, nil
    }
    return k, v


}


// first moves the cursor to the first leaf element under the last page in the stack.
// 找到stack后一个页面中的个叶子节点
func (c *Cursor) first() {
    for {    // 找到左边个叶子节点
        // Exit when we hit a leaf page.
        // 每次循环取出后一个元素
        var ref = &c.stack[len(c.stack)-1]
        if ref.isLeaf() {    // 如果是叶子节点就退出循环，即这个循环终止的条件是向下一直找到叶子节点为止
            break
        }


        // Keep adding pages pointing to the first element to the stack.
        // 根据ref.index拿到pgid
        var pgid pgid
        if ref.node != nil {
            pgid = ref.node.inodes[ref.index].pgid
        } else {
            pgid = ref.page.branchPageElement(uint16(ref.index)).pgid
        }
        // 拿到对应的page、node
        p, n := c.bucket.pageNode(pgid)
        // 放到栈顶，注意这里的index是0
        // 即向下查找的时候取的都是左边的节点
        c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})
    }
}

相关文章