Vector types
ValueVector
A value vector is Kùzu’s column-oriented in-memory data structure for storing a chunk of data of the same data type. The size of a ValueVector is defined by DEFAULT_VECTOR_CAPACITY, which is 2048. This is an empirically selected value, chosen on the presumption that the data stored in a value vector will fit into the CPU cache.
A value vector has the following core fields:
data: Stores the actual data, a plain byte array managed by a unique pointer.
nullMask: Aligned with data; indicates whether each entry is NULL.
auxiliaryBuffer: Keeps track of additional data that does NOT fit in data.
Primitive Type ValueVector
For primitive data types, values are stored directly in the data field. For example, to represent the 100 integers 0 to 99 with the INT64 data type, data is simply an int64_t[] of size 100.
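The layout described above can be sketched in a few lines of C++. This is a minimal illustration, not Kùzu's actual class: SimpleInt64Vector, setValue, getValue, and makeZeroTo99 are hypothetical names, and the null mask is modeled as a plain std::vector<bool> for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Capacity taken from the text; everything else here is a simplified stand-in.
constexpr std::size_t DEFAULT_VECTOR_CAPACITY = 2048;

struct SimpleInt64Vector {
    // Raw byte buffer owned through a unique pointer, as described for `data`.
    std::unique_ptr<std::uint8_t[]> data;
    // One flag per slot, aligned with `data`; true means the entry is NULL.
    std::vector<bool> nullMask;

    SimpleInt64Vector()
        : data(std::make_unique<std::uint8_t[]>(DEFAULT_VECTOR_CAPACITY * sizeof(std::int64_t))),
          nullMask(DEFAULT_VECTOR_CAPACITY, false) {}

    void setValue(std::size_t pos, std::int64_t value) {
        assert(pos < DEFAULT_VECTOR_CAPACITY);
        // Copy the value into its fixed-size slot in the byte array.
        std::memcpy(data.get() + pos * sizeof(std::int64_t), &value, sizeof(value));
        nullMask[pos] = false;
    }

    void setNull(std::size_t pos) {
        assert(pos < DEFAULT_VECTOR_CAPACITY);
        nullMask[pos] = true;
    }

    bool isNull(std::size_t pos) const {
        assert(pos < DEFAULT_VECTOR_CAPACITY);
        return nullMask[pos];
    }

    std::int64_t getValue(std::size_t pos) const {
        assert(pos < DEFAULT_VECTOR_CAPACITY);
        std::int64_t value;
        std::memcpy(&value, data.get() + pos * sizeof(std::int64_t), sizeof(value));
        return value;
    }
};

// Fills the first 100 slots with the values 0 to 99.
inline SimpleInt64Vector makeZeroTo99() {
    SimpleInt64Vector v;
    for (std::int64_t i = 0; i < 100; ++i) {
        v.setValue(static_cast<std::size_t>(i), i);
    }
    return v;
}
```

Because every INT64 occupies exactly 8 bytes, the slot for position pos is found by a single multiplication, and nothing needs to go into the auxiliary buffer.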
String Type ValueVector
Earlier versions of the storage layer required each element to have a fixed size. This, however, does NOT work with the STRING type, whose size may vary. We implement STRING as a 16-byte data structure, with the first 8 bytes as a prefix and the last 8 bytes as a pointer to a memory location storing the rest of the string. The fixed-length part of STRING is stored in data and the overflow part is stored in auxiliaryBuffer.
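The split between a fixed-length entry and an overflow area might look roughly like the sketch below. It follows the 8-byte prefix plus 8-byte pointer description from the text, but the names (StringEntry, OverflowBuffer, writeString, readString) are hypothetical. It also makes two simplifying assumptions that the text does not state: pointers are 8 bytes wide, and strings contain no embedded NUL bytes, so that the length can be recovered without a separate length field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <string>

// Fixed-size part of a string; this is what would live in `data`.
struct StringEntry {
    char prefix[8];        // first 8 bytes of the string, zero-padded
    const char* overflow;  // remaining bytes, or nullptr if the prefix holds everything
};
static_assert(sizeof(StringEntry) == 16, "sketch assumes 8-byte pointers");

// Stand-in for the auxiliary buffer that owns the overflow bytes.
struct OverflowBuffer {
    // std::deque keeps existing elements in place when new ones are appended,
    // so pointers handed out earlier stay valid.
    std::deque<std::string> chunks;

    const char* store(const std::string& rest) {
        chunks.push_back(rest);
        return chunks.back().c_str();
    }
};

inline StringEntry writeString(const std::string& s, OverflowBuffer& aux) {
    StringEntry e{};  // zero-initialized: empty prefix, null overflow pointer
    const std::size_t n = s.size() < sizeof(e.prefix) ? s.size() : sizeof(e.prefix);
    std::memcpy(e.prefix, s.data(), n);
    if (s.size() > sizeof(e.prefix)) {
        e.overflow = aux.store(s.substr(sizeof(e.prefix)));
    }
    return e;
}

inline std::string readString(const StringEntry& e) {
    // The prefix is zero-padded, so stop at the first NUL or after 8 bytes.
    std::size_t n = 0;
    while (n < sizeof(e.prefix) && e.prefix[n] != '\0') {
        ++n;
    }
    std::string out(e.prefix, n);
    if (e.overflow != nullptr) {
        out += e.overflow;
    }
    return out;
}
```

Short strings never touch the overflow buffer, while every entry keeps the same 16-byte footprint regardless of the string's length.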
Nested Type ValueVector
Although nested type value vectors could be organized as overflow in the same way as STRING, we want to exploit the fact that the child type of a nested type is known, so its values can still be stored in column-oriented value vectors.
For the STRUCT type, we store the child vectors in auxiliaryBuffer.
Example:
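A minimal sketch of this idea for a hypothetical STRUCT(id INT64, name STRING) follows. All type and member names are illustrative assumptions rather than Kùzu's real classes; the point is only that each field gets its own child column and that entry i of the struct is position i in every child.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical child vectors, one per struct field.
struct Int64ChildVector {
    std::vector<std::int64_t> values;
};
struct StringChildVector {
    std::vector<std::string> values;
};

// Stand-in for the auxiliary buffer of a STRUCT vector: one column per field.
struct StructAuxiliaryBuffer {
    Int64ChildVector id;
    StringChildVector name;
};

// Sketch of a vector for STRUCT(id INT64, name STRING).
struct PersonStructVector {
    // One flag per struct entry; true means the whole struct is NULL.
    std::vector<bool> nullMask;
    StructAuxiliaryBuffer auxiliaryBuffer;

    void append(std::int64_t id, std::string name) {
        nullMask.push_back(false);
        auxiliaryBuffer.id.values.push_back(id);
        auxiliaryBuffer.name.values.push_back(std::move(name));
    }

    std::size_t size() const { return nullMask.size(); }

    // Reassembles entry `pos` by reading the same position in every child.
    std::string toString(std::size_t pos) const {
        assert(pos < size());
        return "{id: " + std::to_string(auxiliaryBuffer.id.values[pos]) +
               ", name: " + auxiliaryBuffer.name.values[pos] + "}";
    }
};
```

Keeping the fields in separate columns means an operator that only needs id can scan that child vector without touching name.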
For the VAR_LIST type, we store the size and offset of each entry in data and the elements in auxiliaryBuffer.
Example:
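The offset and size bookkeeping can be sketched as follows for a list of INT64 values. ListEntry, Int64ListVector, append, and get are hypothetical names chosen for this illustration, and both buffers are modeled as std::vector rather than raw byte arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size descriptor of one list; this is what would live in `data`.
struct ListEntry {
    std::uint64_t offset;  // index of the list's first element in the child buffer
    std::uint64_t size;    // number of elements in the list
};

struct Int64ListVector {
    std::vector<ListEntry> data;                // one (offset, size) pair per list
    std::vector<std::int64_t> auxiliaryBuffer;  // elements of all lists, back to back

    void append(const std::vector<std::int64_t>& list) {
        ListEntry entry;
        entry.offset = static_cast<std::uint64_t>(auxiliaryBuffer.size());
        entry.size = static_cast<std::uint64_t>(list.size());
        data.push_back(entry);
        auxiliaryBuffer.insert(auxiliaryBuffer.end(), list.begin(), list.end());
    }

    // Copies list `pos` out of the flat child buffer.
    std::vector<std::int64_t> get(std::size_t pos) const {
        assert(pos < data.size());
        const ListEntry& entry = data[pos];
        auto first = auxiliaryBuffer.begin() + static_cast<std::ptrdiff_t>(entry.offset);
        auto last = first + static_cast<std::ptrdiff_t>(entry.size);
        return std::vector<std::int64_t>(first, last);
    }
};
```

For the lists [1, 2, 3], [] and [4, 5], data would hold (0, 3), (3, 0) and (3, 2), while the child buffer holds the five elements 1, 2, 3, 4, 5 contiguously.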
SelectionState
A selection state keeps track of which entries are valid. It is mainly used to select a subset of the data (e.g. entries that satisfy a filter) without performing a copy.
A selection state has the following core fields:
selectedPositions: A vector of uint16_t values, each of which maps to a position in the data vector.
selectedSize: The number of selected positions.
Example:
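The sketch below shows one way a filter could record its result in a selection state instead of copying the surviving values. The two fields follow the text; the helper functions selectGreaterThan and sumSelected are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SelectionState {
    std::vector<std::uint16_t> selectedPositions;  // positions into the data vector
    std::uint16_t selectedSize = 0;                // how many positions are in use
};

// Records the positions whose value exceeds `threshold`; `data` is not modified.
inline SelectionState selectGreaterThan(const std::vector<std::int64_t>& data,
                                        std::int64_t threshold) {
    assert(data.size() <= 2048);  // positions must fit in uint16_t
    SelectionState state;
    state.selectedPositions.resize(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] > threshold) {
            state.selectedPositions[state.selectedSize++] = static_cast<std::uint16_t>(i);
        }
    }
    return state;
}

// A downstream operator reads only the selected entries.
inline std::int64_t sumSelected(const std::vector<std::int64_t>& data,
                                const SelectionState& state) {
    std::int64_t sum = 0;
    for (std::uint16_t i = 0; i < state.selectedSize; ++i) {
        sum += data[state.selectedPositions[i]];
    }
    return sum;
}
```

With data = [5, 12, 7, 30, 1] and a threshold of 6, the selected positions are 1, 2 and 3, and the data vector itself stays untouched.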
DataChunk
A data chunk is a collection of value vectors that share the same state.
Example:
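One way to picture the shared state is a chunk whose vectors all hold a pointer to the same selection state, so that a filter evaluated on one column automatically restricts every other column. The sketch assumes std::shared_ptr for the sharing and uses hypothetical names (Int64Vector, addVector, selectGreaterThan, readSelected); it is not a description of Kùzu's actual ownership model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct SelectionState {
    std::vector<std::uint16_t> selectedPositions;
    std::uint16_t selectedSize = 0;
};

struct Int64Vector {
    std::vector<std::int64_t> data;
    std::shared_ptr<SelectionState> state;  // shared with the owning chunk
};

struct DataChunk {
    // In this sketch the state starts empty; a filter populates it.
    std::shared_ptr<SelectionState> state = std::make_shared<SelectionState>();
    std::vector<Int64Vector> valueVectors;

    void addVector(std::vector<std::int64_t> values) {
        assert(values.size() <= 2048);
        valueVectors.push_back(Int64Vector{std::move(values), state});
    }
};

// Evaluates `value > threshold` on one column and stores the result in the shared state.
inline void selectGreaterThan(DataChunk& chunk, std::size_t column, std::int64_t threshold) {
    assert(column < chunk.valueVectors.size());
    const std::vector<std::int64_t>& values = chunk.valueVectors[column].data;
    SelectionState& s = *chunk.state;
    s.selectedPositions.clear();
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] > threshold) {
            s.selectedPositions.push_back(static_cast<std::uint16_t>(i));
        }
    }
    s.selectedSize = static_cast<std::uint16_t>(s.selectedPositions.size());
}

// Reads any vector of the chunk through the shared state.
inline std::vector<std::int64_t> readSelected(const Int64Vector& v) {
    std::vector<std::int64_t> out;
    for (std::uint16_t i = 0; i < v.state->selectedSize; ++i) {
        out.push_back(v.data[v.state->selectedPositions[i]]);
    }
    return out;
}
```

Filtering an age column [25, 41, 33] for values above 30 and then reading an id column [100, 101, 102] from the same chunk yields 101 and 102, even though the filter never looked at the ids.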
ResultSet
A result set is a collection of data chunks that together form a Cartesian product.
Example:
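The Cartesian-product reading can be illustrated with the sketch below: the result set stores each chunk once, and a logical tuple is obtained by picking one row from every chunk. The simplified DataChunk (plain INT64 columns, no selection state) and the helpers addColumn, numTuples, and tuple are hypothetical and serve only to make the counting concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified chunk: a set of equally long INT64 columns.
struct DataChunk {
    std::vector<std::vector<std::int64_t>> columns;

    void addColumn(std::vector<std::int64_t> values) {
        assert(columns.empty() || values.size() == columns[0].size());
        columns.push_back(std::move(values));
    }

    std::size_t numRows() const { return columns.empty() ? 0 : columns[0].size(); }
};

struct ResultSet {
    std::vector<DataChunk> dataChunks;

    // The number of logical tuples is the product of the chunk row counts.
    std::size_t numTuples() const {
        std::size_t n = 1;
        for (const DataChunk& chunk : dataChunks) {
            n *= chunk.numRows();
        }
        return n;
    }

    // Materializes logical tuple `idx`; the last chunk's row index varies fastest.
    std::vector<std::int64_t> tuple(std::size_t idx) const {
        assert(idx < numTuples());
        std::vector<std::size_t> rows(dataChunks.size());
        for (std::size_t c = dataChunks.size(); c-- > 0;) {
            const std::size_t n = dataChunks[c].numRows();
            rows[c] = idx % n;
            idx /= n;
        }
        std::vector<std::int64_t> out;
        for (std::size_t c = 0; c < dataChunks.size(); ++c) {
            for (const std::vector<std::int64_t>& column : dataChunks[c].columns) {
                out.push_back(column[rows[c]]);
            }
        }
        return out;
    }
};
```

A chunk with 2 rows combined with a chunk with 3 rows represents 6 tuples while storing only 5 rows, which is the benefit of keeping the product in factorized form.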