This example shows how to pull data from the HuggingFace datasets hub, create embeddings for text passages using Cohere / OpenAI / SentenceTransformer, and store them in a vector database using LanceDB / Weaviate / Pinecone / Marqo.
DAG for OpenAI embeddings and Weaviate vector database
In addition, you'll see how Hamilton can help you create replaceable components. This flexibility makes it easier to assess service providers and refactor code to fit your needs. The above and below DAGs were generated simply by changing a string value and a module import; try to spot the differences! A minimal sketch of this swap follows the DAG captions below.
DAG for SentenceTransformers embeddings and Pinecone vector database
DAG for Cohere embeddings and Lancedb vector database
DAG for Marqo vector database; note that Marqo processes embeddings itself.
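To make the swap concrete, here is a minimal sketch of how a Hamilton driver could be built for this example. It assumes the module and config names used in this repository; the real wiring lives in `run.py`:

```python
from hamilton import driver

import data_module
import embedding_module
import weaviate_module  # swap for lancedb_module, pinecone_module, or marqo_module

# "openai" is the string value to change; try "cohere" or "sentence_transformer".
config = {"embedding_service": "openai"}

dr = driver.Driver(config, data_module, embedding_module, weaviate_module)
dr.display_all_functions("dag.png")  # renders a DAG like the ones pictured above
```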
The example is organized as follows:

- `run.py` contains the code to test the example. It uses `click` to provide a simple command-line interface.
- `data_module.py` contains the code to pull data from HuggingFace. The code lives in a separate Python module since it doesn't depend on the other functionality and could include more involved preprocessing.
- `embedding_module.py` contains the code to embed text using either the Cohere API, the OpenAI API, or the SentenceTransformer library. The use of `@config.when` allows all options to live in the same Python module, so you can rerun your Hamilton DAG simply by changing your config. You'll see that the functions share similar signatures to make them interchangeable (a sketch of this pattern follows this list).
- `lancedb_module.py`, `weaviate_module.py`, `marqo_module.py`, and `pinecone_module.py` implement the same functionality for each vector database. Having the same function names lets Hamilton abstract away the implementation details and reinforces the notion that these modules shouldn't be loaded simultaneously.
- `docker-compose.yml` allows you to start a local instance of Weaviate.
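Below is a rough sketch of the `@config.when` pattern described above. The actual functions in `embedding_module.py` differ; the parameter names and function bodies here are illustrative assumptions. The point is that both implementations resolve to the same `embeddings` node, selected by the `embedding_service` config value:

```python
from hamilton.function_modifiers import config


@config.when(embedding_service="openai")
def embeddings__openai(text_contents: list[str], model_name: str) -> list[list[float]]:
    """Becomes the `embeddings` node when embedding_service == "openai"."""
    from openai import OpenAI  # assumes the v1+ openai client

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model_name, input=text_contents)
    return [item.embedding for item in response.data]


@config.when(embedding_service="sentence_transformer")
def embeddings__sentence_transformer(
    text_contents: list[str], model_name: str
) -> list[list[float]]:
    """Becomes the `embeddings` node when embedding_service == "sentence_transformer"."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    return model.encode(text_contents).tolist()
```

Because the two functions share the same signature and output name (Hamilton strips the `__suffix`), swapping services is just a config change.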
Prerequisites:

- Remove from `requirements.txt` the libraries you don't want before doing `pip install -r requirements.txt`.
- To run Weaviate locally: `docker compose up -d`.
- To run Marqo locally: `docker pull marqoai/marqo:latest`, followed by `docker run --name marqo -it --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:latest`. Note: this will require 10GB+ of disk space since the image is large.

To run the example:

- Run `python run.py --help` to learn about the options; you will see options to change the embedding service and the vector database.
- Run `python run.py` to execute the code with lancedb and sentence_transformer.

To change the embedding service, you can use the following:
- `--embedding_service=sentence_transformer --model_name=MODEL_NAME`
- `--embedding_service=openai --embedding_api_key=API_KEY`
- `--embedding_service=cohere --embedding_api_key=API_KEY`
- `--embedding_service=marqo` (note: there is no embedding function in `embedding_module.py` for Marqo, since Marqo computes embeddings itself)

To change the vector database, you need to pass a JSON config argument (a sketch of how these flags can be parsed follows the list):
- `--vector_db=lancedb --vector_db_config='{"uri": "data/lancedb"}'`
- `--vector_db=weaviate --vector_db_config='{"url": "http://localhost:8080/"}'`
- `--vector_db=pinecone --vector_db_config='{"environment": "ENVIRONMENT", "api_key": "API_KEY"}'`
- `--vector_db=marqo --vector_db_config='{"url": "http://localhost:8882"}' --other_input_kwargs '{"index_name": "hamilton"}'`
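For reference, here is a hedged sketch of how `run.py` might accept these flags with click; the option names mirror the flags above, but the real script has more options and validation:

```python
import json

import click


@click.command()
@click.option("--vector_db", default="lancedb", help="Which vector database module to use.")
@click.option(
    "--vector_db_config",
    default='{"uri": "data/lancedb"}',
    help="JSON configuration for the chosen vector database.",
)
def main(vector_db: str, vector_db_config: str) -> None:
    config = json.loads(vector_db_config)  # e.g. {"url": "http://localhost:8080/"}
    click.echo(f"Using {vector_db} with config {config}")
    # ...import the matching *_module and build the Hamilton driver here...


if __name__ == "__main__":
    main()
```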
If you run into `lzma` not being installed and you're using pyenv on a Mac, you can try the following:

```bash
brew install xz
CFLAGS="-I$(brew --prefix xz)/include" LDFLAGS="-L$(brew --prefix xz)/lib" pyenv install 3.X.XX
```