DuckDB 存储生态: lance(向量存储引擎)的安装使用

Sat Dec 23, 2023

900 Words|Read in about 4 Min
Tags:

背景

DuckDB内置了parquet, csv, iceberg, arrow等数据存储格式的支持. 针对高速向量检索, 则可以通过lance外部存储来实现. 上一篇信息简单介绍过其原理
- 《DuckDB 存储生态: lance(向量存储引擎): Modern columnar data format for ML/超越parquet》
- 《ChatGPT背后的数据库技术体验 - 向量近似搜索之 lance》

本文做个部署实验, 测试过程基于以下docker镜像.

《2023-PostgreSQL/DuckDB/MySQL/PolarDB-X Docker镜像学习环境 ARM64版, 已集成热门插件和工具》

《2023-PostgreSQL/DuckDB/MySQL/PolarDB-X Docker镜像学习环境 AMD64版, 已集成热门插件和工具》

x86

# 拉取镜像, 第一次拉取一次即可. 或者需要的时候执行, 将更新到最新镜像版本.    
docker pull registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts    
    
# 启动容器    
docker run --platform linux/amd64 -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts  
  
##### 如果你想学习备份恢复、修改参数等需要重启数据库实例的case, 换个启动参数, 使用参数--entrypoint将容器根进程换成bash更好. 如下:   
docker run -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g --entrypoint /bin/bash registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts  
##### 以上启动方式需要进入容器后手工启动数据库实例: su - postgres; pg_ctl start;    
    
# 进入容器    
docker exec -ti pg bash    
    
# 连接数据库    
psql    

部署lance extension for DuckDB

cd /tmp  
  
git clone --depth 1 -b v0.9.1 https://github.com/lancedb/lance.git  
  
cd /tmp/lance/integration/duckdb_lance/  
  
git clone --depth 1 -b v0.8.1 https://github.com/duckdb/duckdb.git  
  
cp -r duckdb duckdb-ext/  
RUST_BACKTRACE=1 cargo build -r  
  
cd /tmp/lance/integration/duckdb_lance/duckdb-ext/  
RUST_BACKTRACE=1 cargo build -r  
  
cd /tmp/lance/integration/duckdb_lance/  
CMAKE_BUILD_PARALLEL_LEVEL=4 make release    
  
cd /tmp/lance/integration/duckdb_lance/build/release  

以非签名模式启动duckdb

root@56000550f873:/tmp/lance/integration/duckdb_lance/build/release# ./duckdb -unsigned  
v0.8.1 6536a77  
Enter ".help" for usage hints.  

安装非签名第三方插件lance

D install '/tmp/lance/integration/duckdb_lance/build/release/extension/duckdb_lance/lance.duckdb_extension';  
D load 'lance';  
D .mode table  
D select * from duckdb_extensions();  
┌──────────────────┬─────────┬───────────┬───────────────────────────┬────────────────────────────────────────────────────────────────────────────────────┬───────────────────┐  
│  extension_name  │ loaded  │ installed │       install_path        │                                    description                                     │      aliases      │  
│     varchar      │ boolean │  boolean  │          varchar          │                                      varchar                                       │     varchar[]     │  
├──────────────────┼─────────┼───────────┼───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────────────┤  
│ autocomplete     │ true    │           │                           │ Add supports for autocomplete in the shell                                         │ []                │  
│ fts              │ false   │ false     │                           │ Adds support for Full-Text Search Indexes                                          │ []                │  
│ httpfs           │ false   │ false     │                           │ Adds support for reading and writing files over a HTTP(S) connection               │ [http, https, s3] │  
│ icu              │ false   │ false     │                           │ Adds support for time zones and collations using the ICU library                   │ []                │  
│ inet             │ false   │ false     │                           │ Adds support for IP-related data types and functions                               │ []                │  
│ jemalloc         │ true    │ true      │ (BUILT-IN)                │ Overwrites system allocator with JEMalloc                                          │ []                │  
│ json             │ false   │ false     │                           │ Adds support for JSON operations                                                   │ []                │  
│ lance            │ true    │           │ /root/.duckdb/extension…  │                                                                                    │ []                │  
│ libduckdb_lance  │ false   │ false     │ /root/.duckdb/extension…  │                                                                                    │ []                │  
│ motherduck       │ false   │ false     │                           │ Enables motherduck integration with the system                                     │ [md]              │  
│ parquet          │ false   │ false     │                           │ Adds support for reading and writing parquet files                                 │ []                │  
│ postgres_scanner │ false   │ false     │                           │ Adds support for reading from a Postgres database                                  │ [postgres]        │  
│ spatial          │ false   │ false     │                           │ Geospatial extension that adds support for working with spatial data and functions │ []                │  
│ sqlite_scanner   │ false   │ false     │                           │ Adds support for reading SQLite database files                                     │ [sqlite, sqlite3] │  
│ tpcds            │ false   │ false     │                           │ Adds TPC-DS data generation and query support                                      │ []                │  
│ tpch             │ false   │ false     │                           │ Adds TPC-H data generation and query support                                       │ []                │  
├──────────────────┴─────────┴───────────┴───────────────────────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────────────┤  
│ 16 rows                                                                                                                                                           6 columns │  
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘  

lance测试对比parquet, 参考文档如下

https://github.com/lancedb/lance/blob/v0.9.1/benchmarks/tpch/README.md

目前lance仅支持tpch q1,q6. 所以对比如下.

Compare lance vs parquet for TPCH Q1 and Q6 using SF1 dataset

Steps to run the benchmark:

  1. cd lance/benchmarks/tpch
  2. mkdir dataset && cd dataset
  3. download parquet file lineitem from : “https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet"; then rename it to “lineitem_sf1.parquet”
  4. generate lance file from the parquet file in the same directory

将parquet文件转换为lance文件, 可使用如下项目:
- https://github.com/haoxins/parquet2lance

  1. cd ..
  2. python3 benchmark.py q1


benchmark.py 脚本如下:

# Benchmark performance Lance vs Parquet w/ Tpch Q1 and Q6  
import lance  
import pandas as pd  
import pyarrow as pa  
import duckdb  
  
import sys  
import time  
  
Q1 = """  
SELECT  
    l_returnflag,  
    l_linestatus,  
    sum(l_quantity) as sum_qty,  
    sum(l_extendedprice) as sum_base_price,  
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,  
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,  
    avg(l_quantity) as avg_qty,  
    avg(l_extendedprice) as avg_price,  
    avg(l_discount) as avg_disc,  
    count(*) as count_order  
FROM  
    lineitem  
WHERE  
    l_shipdate <= date '1998-12-01' - interval '90' day  
GROUP BY  
    l_returnflag,  
    l_linestatus  
ORDER BY  
    l_returnflag,  
    l_linestatus;  
"""  
  
Q6 = """  
SELECT  
    sum(l_extendedprice * l_discount) as revenue  
FROM  
    lineitem  
WHERE  
    l_shipdate >= date '1994-01-01'  
    AND l_shipdate < date '1994-01-01' + interval '1' year  
    AND l_discount between 0.06 - 0.01 AND 0.06 + 0.01  
    AND l_quantity < 24;  
"""  
  
num_args = len(sys.argv)  
assert num_args == 2  
  
query = ''  
if sys.argv[1] == 'q1':  
    query = Q1  
elif sys.argv[1] == 'q6':  
    query = Q6  
else:  
    sys.exit("We only support Q1 and Q6 for now")  
  
print("------------------BENCHMARK TPCH " + sys.argv[1] + "-------------------\n")  
##### Lance #####  
start1 = time.time()  
# read from lance and create a relation from it  
lineitem = lance.dataset("./dataset/lineitem.lance")  
res1 = duckdb.sql(query).df()  
end1 = time.time()  
  
print("Lance Latency: ",str(round(end1 - start1, 3)) + 's')  
print(res1)  
  
##### Parquet #####  
lineitem = None  
start2 = time.time()  
# read from parquet and create a view instead of table from it  
duckdb.sql("CREATE VIEW lineitem AS SELECT * FROM read_parquet('./dataset/lineitem_sf1.parquet');")  
res2 = duckdb.sql(query).df()  
end2 = time.time()  
  
print("Parquet Latency: ",str(round(end2 - start2, 3)) + 's')  
print(res2)  

更多信息参考:
- https://lancedb.github.io/lancedb/integrations/

See Also

Sat Dec 23, 2023

900 Words|Read in about 4 Min
Tags: