背景
DuckDB内置了parquet, csv, iceberg, arrow等数据存储格式的支持. 针对高速向量检索, 则可以通过lance外部存储来实现. 上一篇信息简单介绍过其原理
- 《DuckDB 存储生态: lance(向量存储引擎): Modern columnar data format for ML/超越parquet》
- 《ChatGPT背后的数据库技术体验 - 向量近似搜索之 lance》
本文做个部署实验, 测试过程基于以下docker镜像.
《2023-PostgreSQL/DuckDB/MySQL/PolarDB-X Docker镜像学习环境 ARM64版, 已集成热门插件和工具》
《2023-PostgreSQL/DuckDB/MySQL/PolarDB-X Docker镜像学习环境 AMD64版, 已集成热门插件和工具》
x86
# 拉取镜像, 第一次拉取一次即可. 或者需要的时候执行, 将更新到最新镜像版本.
docker pull registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts
# 启动容器
docker run --platform linux/amd64 -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts
##### 如果你想学习备份恢复、修改参数等需要重启数据库实例的case, 换个启动参数, 使用参数--entrypoint将容器根进程换成bash更好. 如下:
docker run -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g --entrypoint /bin/bash registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts
##### 以上启动方式需要进入容器后手工启动数据库实例: su - postgres; pg_ctl start;
# 进入容器
docker exec -ti pg bash
# 连接数据库
psql
部署lance extension for DuckDB
cd /tmp
git clone --depth 1 -b v0.9.1 https://github.com/lancedb/lance.git
cd /tmp/lance/integration/duckdb_lance/
git clone --depth 1 -b v0.8.1 https://github.com/duckdb/duckdb.git
cp -r duckdb duckdb-ext/
RUST_BACKTRACE=1 cargo build -r
cd /tmp/lance/integration/duckdb_lance/duckdb-ext/
RUST_BACKTRACE=1 cargo build -r
cd /tmp/lance/integration/duckdb_lance/
CMAKE_BUILD_PARALLEL_LEVEL=4 make release
cd /tmp/lance/integration/duckdb_lance/build/release
以非签名模式启动duckdb
root@56000550f873:/tmp/lance/integration/duckdb_lance/build/release# ./duckdb -unsigned
v0.8.1 6536a77
Enter ".help" for usage hints.
安装非签名第三方插件lance
D install '/tmp/lance/integration/duckdb_lance/build/release/extension/duckdb_lance/lance.duckdb_extension';
D load 'lance';
D .mode table
D select * from duckdb_extensions();
┌──────────────────┬─────────┬───────────┬───────────────────────────┬────────────────────────────────────────────────────────────────────────────────────┬───────────────────┐
│ extension_name │ loaded │ installed │ install_path │ description │ aliases │
│ varchar │ boolean │ boolean │ varchar │ varchar │ varchar[] │
├──────────────────┼─────────┼───────────┼───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────────────┤
│ autocomplete │ true │ │ │ Add supports for autocomplete in the shell │ [] │
│ fts │ false │ false │ │ Adds support for Full-Text Search Indexes │ [] │
│ httpfs │ false │ false │ │ Adds support for reading and writing files over a HTTP(S) connection │ [http, https, s3] │
│ icu │ false │ false │ │ Adds support for time zones and collations using the ICU library │ [] │
│ inet │ false │ false │ │ Adds support for IP-related data types and functions │ [] │
│ jemalloc │ true │ true │ (BUILT-IN) │ Overwrites system allocator with JEMalloc │ [] │
│ json │ false │ false │ │ Adds support for JSON operations │ [] │
│ lance │ true │ │ /root/.duckdb/extension… │ │ [] │
│ libduckdb_lance │ false │ false │ /root/.duckdb/extension… │ │ [] │
│ motherduck │ false │ false │ │ Enables motherduck integration with the system │ [md] │
│ parquet │ false │ false │ │ Adds support for reading and writing parquet files │ [] │
│ postgres_scanner │ false │ false │ │ Adds support for reading from a Postgres database │ [postgres] │
│ spatial │ false │ false │ │ Geospatial extension that adds support for working with spatial data and functions │ [] │
│ sqlite_scanner │ false │ false │ │ Adds support for reading SQLite database files │ [sqlite, sqlite3] │
│ tpcds │ false │ false │ │ Adds TPC-DS data generation and query support │ [] │
│ tpch │ false │ false │ │ Adds TPC-H data generation and query support │ [] │
├──────────────────┴─────────┴───────────┴───────────────────────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────────────┤
│ 16 rows 6 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
lance测试对比parquet, 参考文档如下
https://github.com/lancedb/lance/blob/v0.9.1/benchmarks/tpch/README.md
目前lance仅支持tpch q1,q6. 所以对比如下.
Compare lance vs parquet for TPCH Q1 and Q6 using SF1 dataset
Steps to run the benchmark:
cd lance/benchmarks/tpch
mkdir dataset && cd dataset
- download parquet file lineitem from : “https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet"; then rename it to “lineitem_sf1.parquet”
- generate lance file from the parquet file in the same directory
将parquet文件转换为lance文件, 可使用如下项目:
- https://github.com/haoxins/parquet2lance
cd ..
python3 benchmark.py q1
benchmark.py 脚本如下:
# Benchmark performance Lance vs Parquet w/ Tpch Q1 and Q6
import lance
import pandas as pd
import pyarrow as pa
import duckdb
import sys
import time
Q1 = """
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
"""
Q6 = """
SELECT
sum(l_extendedprice * l_discount) as revenue
FROM
lineitem
WHERE
l_shipdate >= date '1994-01-01'
AND l_shipdate < date '1994-01-01' + interval '1' year
AND l_discount between 0.06 - 0.01 AND 0.06 + 0.01
AND l_quantity < 24;
"""
num_args = len(sys.argv)
assert num_args == 2
query = ''
if sys.argv[1] == 'q1':
query = Q1
elif sys.argv[1] == 'q6':
query = Q6
else:
sys.exit("We only support Q1 and Q6 for now")
print("------------------BENCHMARK TPCH " + sys.argv[1] + "-------------------\n")
##### Lance #####
start1 = time.time()
# read from lance and create a relation from it
lineitem = lance.dataset("./dataset/lineitem.lance")
res1 = duckdb.sql(query).df()
end1 = time.time()
print("Lance Latency: ",str(round(end1 - start1, 3)) + 's')
print(res1)
##### Parquet #####
lineitem = None
start2 = time.time()
# read from parquet and create a view instead of table from it
duckdb.sql("CREATE VIEW lineitem AS SELECT * FROM read_parquet('./dataset/lineitem_sf1.parquet');")
res2 = duckdb.sql(query).df()
end2 = time.time()
print("Parquet Latency: ",str(round(end2 - start2, 3)) + 's')
print(res2)