There are a number of additional configuration parameters available in vectorize
that determine the extension’s behavior.
Embedding locations
Embeddings can be created either on the same table as the source data, or on a separate table in the vectorize
schema.
The table_method
parameter determines where the embeddings are stored, and the default is table_method => 'join'
, which creates a table in the vectorize
schema named _embeddings_<project_name>
for each vectorize
job.
For example, if you create a job named my_search_project
, the embeddings will be stored in a table named vectorize._embedding_my_search_project
.
Alternatively, pg_vectorize
can be configured to create the embeddings on the same table as the source data.
By setting the table_method => 'append'
, pg_vectorize will create two columns on the source table: one for the embedding, and one for the updated-at timestamp.
Embeddings on a Separate table
The default behavior is table_method => 'join'
, and a new table is created in the vectorize
schema to store the embeddings.
SELECT vectorize.table(
job_name => 'my_search_project',
"table" => 'products',
primary_key => 'product_id',
columns => ARRAY['product_name', 'description'],
transformer => 'sentence-transformers/all-MiniLM-L6-v2',
table_method => 'join'
);
postgres=# \d vectorize._embeddings_my_search_project;
Table "vectorize._embeddings_my_search_project"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
product_id | integer | | not null |
embeddings | vector(384) | | not null |
updated_at | timestamp with time zone | | not null | now()
Indexes:
"_embeddings_my_search_project_product_id_key" UNIQUE CONSTRAINT, btree (product_id)
"my_search_project_idx" hnsw (embeddings vector_cosine_ops)
Foreign-key constraints:
"_embeddings_my_search_project_product_id_fkey" FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE CASCADE
Embeddings in a new column, same table
To create the embeddings on the same table as the source data, set the table_method
parameter to append
.
SELECT vectorize.table(
job_name => 'my_search_project',
"table" => 'products',
primary_key => 'product_id',
columns => ARRAY['product_name', 'description'],
transformer => 'sentence-transformers/all-MiniLM-L6-v2',
table_method => 'append'
);
Note two new columns: my_search_project_embeddings
and my_search_project_updated_at
have been added to the table.
postgres=# \d products
Table "public.products"
Column | Type | Collation | Nullable | Default
------------------------------+--------------------------+-----------+----------+----------------------------------------------------------------
product_id | integer | | not null | nextval('vectorize.example_products_product_id_seq'::regclass)
product_name | text | | not null |
description | text | | |
last_updated_at | timestamp with time zone | | | CURRENT_TIMESTAMP
my_search_project_embeddings | vector(384) | | |
my_search_project_updated_at | timestamp with time zone | | |
Indexes:
"products_pkey" PRIMARY KEY, btree (product_id)
"my_search_project_idx" hnsw (my_search_project_embeddings vector_cosine_ops)
Ad-hoc embedding requests
Any text can be transformed into an embedding using vectorize.encode()
.
This works with any of the sentence-transformers:
SELECT vectorize.encode(
input => 'the quick brown fox jumped over the lazy dogs',
model => 'sentence-transformers/all-miniLM-L6-v2'
);
{-0.2556323707103729,-0.3213586211204529 ..., -0.0951206386089325}
Privately hosted models on Hugging Face can be referenced like so:
SELECT vectorize.encode(
input => 'the quick brown fox jumped over the lazy dogs',
model => 'my-private-org/my-private-model',
api_key => 'your Hugging Face key'
)
For OpenAI requests, you can either set the API key as a Postgres configuration parameter or pass it in as an argument.
Passing it as an argument will override the configuration parameter.
As an argument:
SELECT vectorize.encode(
input => 'the quick brown fox jumped over the lazy dogs',
model => 'openai/text-embedding-ada-002',
api_key => 'your OpenAI API key'
)
You do not need to provide the API key as an argument if it already been set via ALTER SYSTEM SET vectorize.openai_key
.
SELECT vectorize.encode(
input => 'the quick brown fox jumped over the lazy dogs',
model => 'openai/text-embedding-ada-002'
)
Filtering Results
vectorize.search()
results can be filtered by supplying a where clause to the where_sql
parameter in vectorize.search()
.
The filter operation happens after the embeddings are searched. To pre-filter the search of embeddings, you will need to separate
embeddings into multiple tables or deploy a partitioning strategy.
In the example below, we will filter results to only return the product with product_id = 3
.
Note that product_id is unique, so only one result will be returned.
SELECT * FROM vectorize.search(
job_name => 'product_search_hf',
query => 'accessories for mobile devices',
return_columns => ARRAY['product_id', 'product_name'],
num_results => 3,
where_sql => $$product_id = 3$$
);
search_results
----------------------------------------------------------------------------------------
{"product_id": 3, "product_name": "Desk Lamp", "similarity_score": 0.6498761419705363}
Manually searching embeddings
vectorize.encode()
can be useful when you want to manually query your embeddings.
To do this, place the vectorize.encode()::vector
call into your query and manually compute the distance using pgvector’s distance operators. Note that you must select the same transformer model that was used to generate the embeddings.
The example below assumes embeddings are in a column named my_search_project_embeddings
on the products
table.
SELECT
product_name,
description,
1 - (
my_search_project_embeddings <=>
vectorize.encode('mobile electronic devices', 'sentence-transformers/all-MiniLM-L6-v2')::vector
) as similarity
FROM products
ORDER by similarity DESC
LIMIT 3;
product_name | description | similarity
-------------------+------------------------------------------------------------+---------------------
Phone Charger | Device to replenish the battery of mobile phones | 0.5351522883863631
Bluetooth Speaker | Portable audio device with wireless connectivity | 0.38232471837548787
Wireless Mouse | Pointing device without the need for a physical connection | 0.35592426991011383
Changing the configured database
By default, vectorize
is configured to run on the postgres
database, but that can be changed to any database in Postgres.
Update the following configuration parameters so that the corresponding background workers connect to the correct database.
Both pg_vectorize
and pg_cron
will need their configuration updated.
This can be done by running the following SQL commands.
ALTER SYSTEM SET cron.database_name TO 'my_new_db';
ALTER SYSTEM SET vectorize.database_name TO 'my_new_db';
Then, restart Postgres to apply the changes and, if you haven’t already, enable the extension in your new database.
CREATE EXTENSION vectorize CASCADE;