250_002 dimensional space.
In these sparse vectors, the value for each token is a weight computed
from the input text, which adds contextual information to traditional
sparse vectors.
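As a rough illustration of what such a vector looks like, the sketch below stores a sparse vector as a mapping from token id to weight and scores two vectors with a dot product over their shared dimensions. The token ids, the weights, and the `sparse_dot` helper are made up for this example; they are not output from any real model or part of any client API.

```python
from typing import Dict

# Hypothetical representation: token id (dimension index) -> weight.
SparseVector = Dict[int, float]


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the dimensions that both vectors share."""
    # Iterate over the smaller vector so the work is proportional
    # to the shorter of the two inputs.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


# Made-up weights; only dimension 2057 is present in both vectors.
doc = {101: 0.42, 2057: 0.18, 30522: 0.07}
query = {2057: 0.5, 9999: 0.3}
score = sparse_dot(doc, query)
```

Only non-zero dimensions are stored, so the cost of scoring depends on the number of tokens present, not on the size of the full vocabulary.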
f(qᵢ, D) is the frequency of term qᵢ in document D.
|D| is the length of document D.
avg(|D|) is the average document length in the collection.
k₁ is the term frequency saturation parameter.
b is the length normalization parameter.
IDF(qᵢ) is the inverse document frequency of term qᵢ.
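For reference, these symbols are the ones used in the standard Okapi BM25 scoring function, which is usually written as follows. This is the textbook form, given here as an assumption about the formula the definitions refer to; Q denotes the query and the sum runs over its terms.

```latex
\[
\mathrm{BM25}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1 \left(1 - b + b \cdot \dfrac{|D|}{\mathrm{avg}(|D|)}\right)}
\]
```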
k₁ = 1.2, a widely used value in the absence of advanced optimizations.
b = 0.75, a widely used value in the absence of advanced optimizations.
avg(|D|) = 32, which was chosen by tokenizing the MSMARCO dataset, averaging the lengths of the resulting vectors, and rounding to the nearest power of two.
For IDF(qᵢ), we maintain that information per token in the vector database itself. You can use it by selecting it as the weighting strategy for your queries, so that you don't have to apply the weighting yourself.
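To see how these parameter values behave, here is a minimal sketch of the per-term part of the standard BM25 formula with the IDF factor left out, on the assumption that IDF is applied separately at query time. The function name and constants are hypothetical names for this example, not the database's actual implementation.

```python
K1 = 1.2          # term frequency saturation parameter
B = 0.75          # length normalization parameter
AVG_DOC_LEN = 32  # assumed average document length, in tokens


def bm25_term_weight(tf: float, doc_len: float,
                     k1: float = K1, b: float = B,
                     avg_len: float = AVG_DOC_LEN) -> float:
    """BM25 weight of one term in one document, without the IDF factor."""
    # Documents longer than average get norm > 1, which lowers the weight.
    norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# A term appearing once in an average-length document.
single = bm25_term_weight(tf=1, doc_len=32)
# The same term repeated many times: the weight saturates below k1 + 1.
repeated = bm25_term_weight(tf=50, doc_len=32)
```

With these defaults the weight can never exceed k₁ + 1 = 2.2, no matter how often a term repeats, and a document longer than 32 tokens scores lower than a shorter one with the same term frequency.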
1_000 non-zero valued dimensions.
N is the total number of documents in the collection.
n(qᵢ) is the number of documents containing term qᵢ.
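As a sketch of how N and n(qᵢ) combine, the snippet below uses one common BM25 IDF variant, the form with 1 added inside the logarithm so the result stays positive. Treat the exact formula as an assumption: the definitions above do not pin down which variant is used, and the `idf` function name is hypothetical.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: ln((N - n + 0.5) / (n + 0.5) + 1)."""
    ratio = (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
    return math.log(ratio + 1.0)


# Rare terms receive a higher weight than common ones.
rare = idf(total_docs=1000, docs_with_term=3)
common = idf(total_docs=1000, docs_with_term=800)
```

Because this weight depends on collection-wide counts, it changes as documents are added or removed, which is why it is convenient to have it tracked alongside the index instead of computing it by hand.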