18671 - Group Project: Indexing

Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due

18671 - Group Project: Indexing

A. Introduction

Your team needs to design and implement a new layer that utilizes the B+Tree for efficient data indexing and retrieval. This includes considering which modifications and logic need to be done in the current implementation of CapybaraDB to support B+Tree indexing.

Your implementation should provide support for insertion (INSERT), deletion (DELETE), search operations (SELECT) by guaranteeing that the B+Tree new layer interacts with the overall query processing and storage management.

Objective: Students need to modify CapybaraDB to replace its existing B-Tree implementation with B+Tree indexing, optimize performance, ensure correctness, and demonstrate their implementation at the end of the semester. This requires integrating all layers of the database system to work with the new indexing structure.

Key Tasks:
● Integrate all layers: Ensure the B+Tree layer works within the existing CapybaraDB architecture, including query processing, transaction management, and storage (Command Processor layer, VDBE layer, Pager layer and OS layer).● Replace B-Tree with B+Tree: Modify B-Tree layer to rely on a B+Tree indexing mechanism, handling node splitting, merging, and balanced structure maintenance.
● Be able to create an index: Implement the functionality to create an index on the primary key (default).
● Be able to create an index on any attribute: Implement a function that allows users to create an index on any attribute (e.g., indexing a string column like 'name').

Your team is expected to create CapybaraDB 2.0; a new version that supports B+Tree indexing.

B. Project Description: Phases & Tasks

Each phase consists of design and implementation tasks leading to a final live demo, where students will showcase their work: CapybaraDB 2.0.

1. Design a B+Tree Layer in the Database

Your team needs to design a layer that utilizes the B+Tree for efficient data indexing and retrieval. This includes considering which modifications and logic need to be adjusted in the B-Tree layer of CapybaraDB to turn it into a B+Tree. You should analyze aspects such as insertion, deletion, search operations, and how the B+Tree interacts with the overall query processing and storage management.

The goal of this first part of the assignment is to understand how a Btree-inspired data storage engine works under the hood and how to turn it into a B+Tree design. In this task, you need

● Identify where and how the B-Tree data structure is implemented in CapybaraDB.
● Understand how queries are handled by the B-Tree.
● Design a B+Tree indexing structure and algorithm (i.e., the logic).

You can create the design of this layer in any way your team likes and believes it is more suitable to address the B+Tree indexing mechanism. For example:

● Design 1: store the index information (metadata for the B+Tree) together with the users data, i.e., in the same database file.

● Design 2: store the index information in a different file; similar approach employed in the Journaling mechanism.

Your team needs to explain your design in the demo at the end of the semester.

2. Implement the B+ Tree Data Structure and Logic (B+Tree Layer)

In this task, you need to implement your team's design. Therefore, you need to modify the implementation of CapybaraDB's B-Tree Layer to support a B+Tree indexing data structure and algorithm.

Your design needs to contain a B+ Tree node structure 1 with:

● Internal nodes: Store only keys and pointers to child nodes.
● Leaf nodes: Store keys, values, and a pointer to the next leaf (for fast range queries).

It should support the following operations:

● insert(key, value) → Insert a new record and rebalance the B+Tree if necessary.
● delete(key) → Remove a record while maintaining structure.
● search(key) → Retrieve a record by its key.
● range_search(start_key, end_key) → Retrieve all values within a range.
● create_index(index_name, table_name, colum_name) → Create an index for the column_name on table_name.
● drop_index(index_name, table_name) → Delete the index for the table_name.

3. Create Test Cases for the B+Tree Layer

You can design and implement the B+Tree layer in any way your team deems more relevant to create an index mechanism; for example, linking the internal nodes as well.You need to write automated unit tests for all B+Tree operations. You need to handle these cases only:
● Duplicate keys
● Insertions, deletions and Balance
● Create and search: Index for NUMERIC attributes
● Create and search: Index for VARCHAR (string) attributes

Ensure common queries (e.g., SELECT) work correctly with the new indexing.

C. CapybaraDB 2.0: Integration

As part of the group project, students need to demo CapybaraDB 2.0: a full implementation that leverages the B+Tree indexing.

Your implementation needs to support two new SQL 2 operations:

1. EXPLAIN Statement for Query Analysis
2. CREATE INDEX for Indexing with B+ Trees

These functions are explained below.

1. Implement EXPLAIN Statement for Query Analysis

The EXPLAIN statement is used to analyze how a SQL query is executed. It helps determine whether the query is using the B+ Tree index efficiently.

Implementation Tasks:

● Parse the SQL query string to extract the operation (e.g., SELECT, WHERE condition).

Identify whether the query can leverage indexing.

Simulate the query execution without actually running it, showing:

○ Index used (if any)
○ Estimated cost (query execution time, disk reads)
○ Number of rows scanned
Example Usage of a query that benefits from index:
EXPLAIN SELECT * FROM students WHERE ID = 1023;
Expected Output (if B+ Tree is used):
Query Plan:
- Operation: Index Lookup
- Index Used: students_ID_index (B+ Tree)
- Estimated Cost: 5ms
- Rows Scanned: 1
Example Usage of a query that does not benefit from index:
EXPLAIN SELECT * FROM students;
Expected Output (if Full Table Scan is used):
Query Plan:
- Operation: Full Table Scan
- Index Used: None
- Estimated Cost: 150ms
- Rows Scanned: 10,000
2. CREATE INDEX to Enable B+ Tree Indexing
The CREATE INDEX command enables indexing on a specific column using a B+ Tree, improving query performance.
Implementation Tasks:
● Parse the SQL command to extract the table name and column name.Unset
Unset
● Modify the database schema (MASTER TABLE) to store a B+ Tree index for that column.
● Build the B+ Tree by inserting existing table records into the index.
● Automatically create a B+ Tree index for the primary key of each table.
● Store the index persistently (either in memory or on disk).
Example Usage:
CREATE INDEX students_ID_index ON students(ID);
Steps to Implement:
1. Extract Table and Column Name
Identify that an index should be created on students.ID.
2. Construct a B+ Tree for the Column
 Create a B+ Tree object.
 Populate it with all existing ID values from the students table.
3. Link the Index to Query Execution
 Modify SELECT queries to use the B+ Tree instead of scanning the entire table.
4. Store the Index Persistently
 Save the B+ Tree in memory and write it to disk for persistence.
The CREATE INDEX command should be executed automatically when a table is created with a
primary key.
Example Usage:
CREATE TABLE user (
UserID INTEGER PRIMARY KEY,
name VARCHAR(255),
email VARCHAR(255));

The index is created for the primary key UserID using the same steps defined before.C/C++ 

C/C++

3. Connecting Layers via Github Repositories
In order to support the two new SQL operations (EXPLAIN and CREATE INDEX) and have a running system, your team needs to integrate all CapybaraDB's layers. That means the three layers’ repositories need to point to your team repositories.
A. Connect the Command Processor Layer to the VDBE Layer

To integrate your custom Command processor Layer and VDBE Layer, you need to update the external dependency declaration in the root external_dependencies.cmake file of the Command Processor layer to point to your team VDBE repository:

Original code:
FetchContent_Declare(
VDBE_LAYER
GIT_REPOSITORY [email protected]:CMUSV-FDD/VirtualMachineLayer.git
GIT_TAG integration
)
Replace it with your team-specific VDBE Layer repository, using the format below:
FetchContent_Declare(
VDBE_LAYER
GIT_REPOSITORY
[email protected]:CMUSV-FDD/T<Team_Number>_S25_Virtual_Machine_Layer.git
GIT_TAG integration
)
Make sure to substitute <Team_Number> with your actual team number.
B. Connect the VDBE Layer to the Storage Layer
To integrate your Storage Layer into the VDBE Layer, you need to modify the
CMakeLists.txt file at the root of your VDBE Layer project to point to your team Storage
Layer's repository:C/C++
C/C++
Original code:
FetchContent_Declare(
StorageEngine
GIT_REPOSITORY [email protected]:CMUSV-FDD/StorageEngine.git
GIT_TAG integration
)
Replace it with your team-specific Storage Layer repository, following this format:
FetchContent_Declare(
StorageEngine
GIT_REPOSITORY
[email protected]:CMUSV-FDD/T<Team_Number>_S25_SQL_Storage_Engine.git
GIT_TAG integration
)
Again, replace <Team_Number> with your actual assigned team number.
D. Benchmark Performance
To objectively evaluate each group’s implementation, we will employ a profiling framework that measures key performance metrics of SQL query execution in a benchmark.
1. Group Project Evaluation
The profiling process will focus on:
● Execution Time: Measuring the time taken to complete various SQL operations.
● Disk Space Usage: Measuring the disk size consumed by indexes
● Disk I/O Operations: Tracking number and size of read/write operations for queries
● Memory Usage: Monitoring memory consumption during query executionUnset
Unset
How the Profiling Works
● Initialization:
A test database is set up, and a table is created.
A predefined dataset is inserted into the table.
● Query Execution & Profiling:
Various SQL queries are executed.
Each query’s metrics are recorded.
Test Cases and Query Set
Our benchmark tool will evaluate the following types of SQL queries:
CREATE TABLE
CREATE INDEX
INSERT INTO
SELECT:
Full Table Scan
Point Lookups (WHERE condition on primary key)
Range-based Queries (WHERE condition with comparisons)
UPDATE
DELETE
How to run Profiling tool
1.
Pull Google benchmark framework
Run the following command to initialize and update the submodule:
git submodule update --init --recursive
2.
Clion Run Configuration
a. Open the configuration selection dropdown.
b. Select GroupProjectBenchmark from the available configurations.
c. If GroupProjectBenchmark is not in the configuration list, create a new one with target/executable set to GroupProjectBenchmarkUnset
Unset
3.
Run the Benchmark:
Click the green Run button in the IDE or execute the binary from the command line.
Sample Output 3
DB disk usage after table initialization with INSERT: 100 KB
DB disk usage after CREATE INDEX: 110KB
Index disk usage: 10KB
-------------------------------------------------------------------------------
Benchmark Time (ms) Read Ops Write Ops Memory
-------------------------------------------------------------------------------
InitTable 3.38 ms 2084 263 1.2 MB
CreateIndex 3.39 ms 256 536 2.5 MB
Full Table Scan 3.45 ms 1500 0 12.5 MB
Point Query (WHERE = value) 0.56 ms 300 0 5.2 MB
Range Query (WHERE > value) 1.12 ms 500 0 6.8 MB
Range Query (WHERE BETWEEN) 1.85 ms 690 0 7.3 MB
-------------------------------------------------------------------------------
(These numbers are illustrative; actual results may vary.)
2. Benchmark Indexing Queries
To help you with the implementation, we will provide you with a initial benchmark to measure B+ Tree performance (i.e., the group project implementation) using two types of queries:
1. Point Queries: A point query retrieves a single specific record (or a very small set) based on an exact match of a key or indexed column.
Example:
SELECT * FROM students WHERE ID = 1000;
This query fetches only one record (assuming ID is unique).

● It is highly efficient, especially when ID is indexed (e.g., using a B+tree or Hash index).

● Index usage: If there’s a primary key or unique index, databases can use an index lookup (O(1) for hash indexes, O(log N) for B+-trees) to quickly locate the record.
2. Range Queries: A range query retrieves multiple records by selecting values within a specified range.
Example:
Unset
SELECT * FROM students WHERE ID BETWEEN 1000 AND 2000;

This query fetches all records where ID falls between 1000 and 2000.

● Performance:
● If ID is indexed, databases perform an index range scan (O(log N) + O(K), where K is the number of matching rows).
● If ID is not indexed, the database performs a full table scan, making it slower for large datasets.

E. Team Demo and Presentation

Your team needs to demo your implementation at the end of the semester. You need to

Explain the Design and Code Walkthrough

1. Group Project Evaluation

Students must explain their B+ Tree implementation and show how it replaces the B-Tree.

Demonstrate key operations:

● Inserting and deleting records
● Performing single-key searches

● Executing range queries and showing performance benefits

2. Performance Demonstration

Show benchmark results:

● B+ Tree performance
● Query execution time
● Disk I/O operations

3. Run Real-time Queries

Run real-time SQL queries to showcase the B+ Tree’s improved efficiency in range queries.

发表评论

电子邮件地址不会被公开。 必填项已用*标注