Exploring Cosine Similarity with Bit Values: A Deep Dive
Understanding Binary Representation of Vectors
In this discussion, I aim to demonstrate how floating point vectors can be represented effectively with binary values by collapsing each floating point number into a single bit. This transformation lets one integer operation handle 32 dimensions (4 bytes) at once. Despite a slight increase in calculation complexity, that 32-fold gain is amplified further by the fact that integer operations are cheaper than floating point ones.
You can find my code on GitHub. In a basic test I conducted, my method was nearly three times faster than manual comparisons. While I've only explored this in C code and haven't ventured into CUDA or other GPU optimizations, nor integrated this technique into a quantized LLM, I believe this approach shows potential, even for simpler platforms like Arduino. There’s much more to vector spaces than just LLMs and Retrieval-Augmented Generation (RAG), and I'm always interested in projects suitable for embedded systems.
Having worked with vector spaces for over 15 years in various document retrieval contexts, I recently published an article titled "VSDB: My Experiments with Vectorspaces," which outlines my insights on the topic.
Sample Vector Analysis
Let's examine some sample vectors that represent specific phrases: "My Dog has Fleas," "My Cat has Fleas," "My [-Dog] has [-fleas]," and "My Hamster has Hiccups." These give us two closely related vectors, one that diverges sharply by negating terms, and one that keeps the same structure but swaps in different content. Each word is treated as a dimension, with the value recording the count of that term (e.g., Dog=1), while negative values (e.g., "-Dog") represent the opposite concept (Dog=-1), which matters for the math later.
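Written out under one possible dimension ordering (My, Dog, Cat, has, Fleas, Hamster, Hiccups; the ordering itself is my choice and arbitrary), the four test vectors look like this:
- MyDogHasFleas = (1, 1, 0, 1, 1, 0, 0)
- MyCatHasFleas = (1, 0, 1, 1, 1, 0, 0)
- MyNotDogHasNotFleas = (1, -1, 0, 1, -1, 0, 0)
- MyHamsterHasHiccups = (1, 0, 0, 1, 0, 1, 1)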
Now, let's assess the Cosine Similarity of our test phrases:
Calculating Cosine Similarity
The similarity scores for our phrases are as follows:
- MyDogHasFleas * MyCatHasFleas: cosine=0.750000
- MyDogHasFleas * MyNotDogHasNotFleas: cosine=0.000000
- MyDogHasFleas * MyHamsterHasHiccups: cosine=0.500000
From this analysis, we observe that "My Dog Has Fleas" and "My Cat Has Fleas" differ in only one dimension, giving a 75% similarity; the 25% gap comes from that single word. "My Hamster has Hiccups" shares only two matching dimensions (My and has), resulting in a 50% score. The -1 values in the "Not" dimensions cancel the matches entirely, so that pair scores 0, as expected.
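Working the first pair by hand with the vectors above: the dot product is 1*1 (My) + 1*1 (has) + 1*1 (Fleas) = 3, each magnitude is sqrt(1+1+1+1) = 2, so cosine = 3 / (2 * 2) = 0.75. In the negated pair, the Dog and Fleas terms each contribute 1*(-1), cancelling the My and has matches, so the dot product, and with it the cosine, is 0.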
Computational Efficiency with Integer Operations
To compute these similarity values, every dimension of one vector must be processed against the corresponding dimension of the other using floating point operations. For our four-word phrases, that means at least four multiplications for the dot product, eight more for the two magnitudes, and a few further operations to derive the final score, totaling roughly 13 to 15 floating point operations per pair.
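For reference, a plain per-dimension implementation of that floating point formula might look like the sketch below; the function name and signature are mine for illustration, not taken from the GitHub code:

#include <math.h>

/* Conventional cosine similarity over float vectors of length n:
   one multiply per dimension for the dot product, one per dimension
   for each magnitude, then square roots and a division. */
float cosine_similarity_float(const float *a, const float *b, int n) {
    float dot = 0.0f, ma = 0.0f, mb = 0.0f;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];   /* dot product term */
        ma  += a[i] * a[i];   /* squared magnitude of a */
        mb  += b[i] * b[i];   /* squared magnitude of b */
    }
    if (ma == 0.0f || mb == 0.0f)
        return 0.0f;          /* avoid dividing by zero for empty vectors */
    return dot / (sqrtf(ma) * sqrtf(mb));
}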
Quantized models convert floating point numbers into integers because 15 integer operations execute significantly faster than 15 floating point operations, but the count remains 15.
Imagine utilizing the individual bits within an integer. For instance, the number 5 is 0101 in binary and 6 is 0110. Each bit can be viewed as a dimension: in 0101, the first dimension (the lowest bit) is 1, the second is 0, the third is 1, and the fourth is 0. An unsigned integer provides 32 dimensions to compare in a single operation! Adding just one more operation for the actual similarity dramatically boosts speed.
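As a minimal sketch of that packing step (the positive-value threshold and the bit assignment are my own illustration, not necessarily how any particular library does it):

/* Pack up to 32 dimensions into a single unsigned int:
   bit i is set when dimension i holds a positive value. */
unsigned int pack_bits(const float *dims, int n) {
    unsigned int bits = 0;
    for (int i = 0; i < n && i < 32; i++) {
        if (dims[i] > 0.0f)
            bits |= (1u << i);
    }
    return bits;
}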
Challenges in Bit Vector Representation
However, this method presents some challenges:
- The ability to represent "Negative Vectors" is lost.
- Only two states are available per dimension, which complicates things. I recommend using additional dimensions for metadata, such as parts of speech, to add expressiveness. In my tests I also created a "Mask" to limit consideration to dimensions that actually hold values, since a comparison otherwise spans all 32 bits regardless of how many dimensions are in use.
The MD5 hashes used in VSDB provide a convenient "built-in key" for mapping these values back to their original forms. A positional map is necessary for each vector so that its dimensions carry meaning; this matters less for LLM embeddings, whose individual dimensions have no intrinsic meaning.
Magnitude Calculation in Bit Vectors
In a conventional vector space, the magnitude is calculated by multiplying each dimension by itself, summing the results, and taking the square root. For bit vectors, every dimension is 0 or 1, so that sum is simply the count of 1 bits, known as the population count or Hamming weight. Counting bits efficiently is a surprisingly awkward task for computers, but elegant solutions exist, such as the following function:
unsigned int magnitude(unsigned int u) {
    unsigned int uCount;
    /* Count the 1 bits (population count / Hamming weight):
       tally bits in octal groups, then fold the groups together. */
    uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
    return ((uCount + (uCount >> 3)) & 030707070707) % 63;
}
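This is the classic HAKMEM-style bit counting trick, which needs no loop at all. On compilers such as GCC and Clang, the built-in below is an equivalent drop-in and typically maps to a single hardware popcount instruction where one exists:

unsigned int magnitude_builtin(unsigned int u) {
    return (unsigned int) __builtin_popcount(u);   /* same result as magnitude() */
}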
Dot Product Calculation with Bit Vectors
I found the dot product calculation to be quite engaging. Reducing the math to just the signs made it look like a truth table, which lends itself to an XNOR approach: a single operation checks every dimension for equality at once, with no need to iterate through them individually.
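For a single pair of bits, the "do these dimensions agree?" question is exactly XNOR:
- a=0, b=0 -> 1 (agree)
- a=0, b=1 -> 0
- a=1, b=0 -> 0
- a=1, b=1 -> 1 (agree)
The 0/0 case is why the mask matters: without it, dimensions that neither vector actually uses would still be counted as matches.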
unsigned int dotproduct(unsigned int a, unsigned int b) {
    unsigned int product;
    unsigned int Mask = (a | b);    /* dimensions used by either vector */
    product = ~(a ^ b);             /* XNOR! 1 wherever the bits agree */
    product = (product & Mask);     /* masking out unused fields */
    return magnitude(product);      /* summing the 1s */
}
Calculating Cosine Similarity in Bit Vectors
With these calculations established, we can determine the similarity. I utilize the same formula as with floating point numbers, but require a "Mask" since not all bits in the integer are utilized. A simple average of the magnitudes yields satisfactory results, particularly when not all bits are engaged.
float cosine_similarity(unsigned int a, unsigned int b) {
    unsigned int ma, mb, dp, sz;
    float mt, result;

    ma = magnitude(a);              /* bit count of a */
    mb = magnitude(b);              /* bit count of b */
    mt = (float) ma * (float) mb;
    sz = (ma + mb) / 2;             /* simple average of the two magnitudes */
    if (mt != 0) {
        dp = dotproduct(a, b);      /* matching dimensions, masked */
        result = sz * (float) dp / mt;
        return result;
    } else {
        return 0;                   /* an empty vector has no similarity */
    }
}
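Putting the pieces together, a minimal driver might look like this; it uses the same hypothetical ordering as the earlier example (bit 0 = My, bit 1 = Dog, bit 2 = Cat, bit 3 = has, bit 4 = Fleas, bit 5 = Hamster, bit 6 = Hiccups), and the variable names are mine:

#include <stdio.h>

int main(void) {
    unsigned int myDogHasFleas    = (1u << 0) | (1u << 1) | (1u << 3) | (1u << 4);
    unsigned int myCatHasFleas    = (1u << 0) | (1u << 2) | (1u << 3) | (1u << 4);
    unsigned int myHamsterHiccups = (1u << 0) | (1u << 3) | (1u << 5) | (1u << 6);

    printf("Dog * Cat:     %f\n", cosine_similarity(myDogHasFleas, myCatHasFleas));     /* 0.750000 */
    printf("Dog * Hamster: %f\n", cosine_similarity(myDogHasFleas, myHamsterHiccups));  /* 0.500000 */
    return 0;
}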
Testing the Implementation
Float Vectors:
- MyDogHasFleas * MyCatHasFleas: cosine=0.750000
- MyDogHasFleas * MyNotDogHasNotFleas: cosine=0.000000
- MyDogHasFleas * MyHamsterHasHiccups: cosine=0.500000
Bit Vectors:
- MyDogHasFleas * MyCatHasFleas: cosine=0.750000
- MyDogHasFleas * MyHamsterHasHiccups: cosine=0.500000
While we lose the capacity for negative dimensions, the performance remains impressive. Computing 30 million cosines took about 0.60425 seconds with bit vectors, nearly a threefold speedup over the straightforward floating point version, with no optimization beyond these basic implementations.
Potential Applications for Arduino
Given that my demo uses basic C and simple logical operations, it could even be implemented on an Arduino. While it may not be suitable for complex document processing, it can serve in state-machine logic applications. It’s unlikely to run an LLM, but it can respond to embeddings from external sources. For example, it could implement a query-response mechanism for robotic control, such as assessing current conditions and determining appropriate actions.
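As a rough illustration of what that query-response loop could look like on such a device (the state table, field names, and default action here are entirely hypothetical), the matching itself needs nothing beyond the functions above:

/* Hypothetical condition vectors, one bit per sensor or state flag,
   and the action each one maps to. */
struct situation {
    unsigned int conditions;   /* bit vector of observed conditions */
    const char  *action;       /* what to do when this situation matches */
};

const char *choose_action(unsigned int current, const struct situation *table, int n) {
    float best = 0.0f;
    const char *action = "idle";   /* default when nothing matches well */
    for (int i = 0; i < n; i++) {
        float score = cosine_similarity(current, table[i].conditions);
        if (score > best) {
            best = score;
            action = table[i].action;
        }
    }
    return action;
}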
Introducing Dreamcoat
I am developing a variant of VSDB for RAG that utilizes this bit vector technique with LLM embeddings. This straightforward implementation could function via command line or API, allowing users to input phrases and retrieve documents for display in a browser. It promises to be an exciting project, potentially resembling an Apache-like server for RAGs, which I whimsically plan to name Dreamcoat, reminiscent of Joseph's story.
The command line interface would allow queries such as:
$> dream "A short story about a princess.."
In server mode, Dreamcoat would accept text queries like:
FIND RAG/0.1
A short story written in the early 1900s about the collapse of an automated internet-based world.
This setup may also support POST and UPDATE requests to enable document uploads, served over secure connections so that JavaScript clients in the browser can reach it.