Unlocking the Power of Google Cloud's ML Generate Embedding
Introduction to Google Cloud's ML Functions
Recently, Google has significantly expanded its machine learning (ML) capabilities, particularly for handling text and unstructured data. Among these advancements is the integration of vector database features within BigQuery. As AI and chatbots gain prominence, major cloud and analytics providers are embedding such functionalities into their platforms. In this context, Google has incorporated its most robust and promising model, Gemini, into BigQuery.
Shortly after unveiling these features, Google announced the general availability of the ML.GENERATE_EMBEDDING function, which lets users embed text stored in BigQuery via a remote model. This matters because embeddings are the foundation of the vector search capabilities that effective analytics and model deployment increasingly rely on.
Understanding Text Embedding
Text embedding is the transformation of a piece of text into a dense vector representation. When two pieces of text are semantically similar, their embeddings sit close together in the embedding vector space. This proximity enables a range of tasks, such as:
- Semantic Search: Ranking text based on semantic similarity.
- Recommendation Systems: Returning items with text attributes akin to a given input.
- Classification: Identifying the category of items with text attributes similar to a specified text.
- Clustering: Grouping items with similar text attributes.
- Outlier Detection: Identifying items whose text attributes are least related to a specific text.
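All of the tasks above reduce to nearest-neighbor lookups in the embedding space. A minimal sketch in Python of that core idea, using invented 3-dimensional toy vectors in place of real model output (real models such as textembedding-gecko return vectors with hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- the texts and numbers are invented for illustration.
corpus = {
    "cheap flight deals": [0.9, 0.1, 0.0],
    "discount airline tickets": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a query like "low-cost air travel".
query = [0.85, 0.15, 0.05]

# Semantic search in its simplest form: rank by cosine similarity
# and take the nearest neighbor.
best = max(corpus, key=lambda text: cosine_similarity(query, corpus[text]))
print(best)  # the flight-related texts score far above the recipe
```

Exact nearest-neighbor search like this is fine for small tables; at scale, vector databases switch to approximate methods, which is exactly what BigQuery's vector index provides.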
To embed text with one of these models, you construct a query in BigQuery as follows (ML.GENERATE_EMBEDDING is a table-valued function, so it is called in a FROM clause):
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `your_project_id.your_dataset.model_name`,
  { TABLE table_name | (query_statement) },
  STRUCT([flatten_json_output AS flatten_json_output][, task_type AS task_type])
)
Beyond the project ID and dataset, you specify model_name (a remote model backed by one of the textembedding-gecko* models) and table_name (a BigQuery table whose STRING column, named content, holds the text to embed). For the complete list of arguments, refer to the official documentation.
This command sends a request to a BigQuery ML remote model that corresponds with one of the Vertex AI textembedding-gecko* foundational models.
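For illustration, a concrete call might look like the following. All identifiers here (project, dataset, model, table, and column names) are placeholders invented for this sketch, not names from the article:

```sql
-- Sketch: embed patent abstracts with a remote embedding model.
-- The input column must be named `content`, hence the alias.
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_project.my_dataset.embedding_model`,
  (SELECT abstract AS content FROM `my_project.my_dataset.patents`),
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);
```

With flatten_json_output set to TRUE, the embedding is returned as an ARRAY&lt;FLOAT64&gt; column alongside the input columns, ready to be stored or searched.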
Results and Practical Applications
The response from the embedding model will resemble the following:
Result of the Query — Screenshot by Author
In this instance, I used an open dataset of patent data to run a semantic search: finding the nearest neighbor of a given embedding in the embedding_v1 column of the patents2 table. The query relied on a vector index, using the Approximate Nearest Neighbor (ANN) method to identify the closest embedding.
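A nearest-neighbor lookup of this kind can be expressed with BigQuery's VECTOR_SEARCH function. A sketch under assumptions: the dataset path and the publication_number column are illustrative placeholders; only patents2 and embedding_v1 come from the example above.

```sql
-- Sketch: find the 5 nearest neighbors of one row's embedding.
-- When a vector index exists on embedding_v1, VECTOR_SEARCH can use
-- it for approximate (ANN) rather than brute-force search.
SELECT base.publication_number, distance
FROM VECTOR_SEARCH(
  TABLE `my_project.my_dataset.patents2`, 'embedding_v1',
  (SELECT embedding_v1 FROM `my_project.my_dataset.patents2` LIMIT 1),
  top_k => 5,
  distance_type => 'COSINE')
ORDER BY distance;
```

The function returns the matched base-table rows together with a distance column, so downstream queries can filter or rank on similarity directly.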
To explore the full tutorial, refer to the linked article.
Chapter 1: Exploring Machine Learning Applications
The first video, "Generative AI with Google Cloud: Embeddings for Custom Applications," shows how Google Cloud's generative AI can be applied to embedding tasks.
Chapter 2: Building AI from Scratch
The second video, "Let's Build GPT: From Scratch, in Code, Spelled Out," walks through the process of constructing a GPT model from the ground up, offering a comprehensive understanding of the underlying principles.