# Running Haystack Pipelines in Asynchronous Environments

_Notebook by [Madeeswaran Kannan](https://www.linkedin.com/in/m-kannan)_

In this notebook, you'll learn how to use the `AsyncPipeline` and async-enabled components from the [haystack-experimental](https://github.com/deepset-ai/haystack-experimental) repository to build and execute a Haystack pipeline in an asynchronous environment. It's based on [this short Haystack tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline), so it would be a good idea to familiarize yourself with it before we begin. A further prerequisite is working knowledge of cooperative scheduling and [async programming in Python](https://docs.python.org/3/library/asyncio.html).

## Motivation

By default, the `Pipeline` class in `haystack` is a regular Python class that exposes non-`async` methods to add and connect components and to execute the pipeline logic. It *can* be used in async environments, but doing so is not optimal because it executes its logic in a '[blocking](https://en.wikipedia.org/wiki/Blocking_(computing))' fashion: once the `Pipeline.run` method is invoked, it must run to completion and return the outputs before the next statement can be executed<sup>1</sup>. In a typical async environment, this prevents the active event loop from scheduling other coroutines, thereby reducing throughput. Similarly, Haystack components currently only provide a non-`async` `run` method for their execution. To mitigate this bottleneck, we introduce the concept of async-enabled Haystack components and an `AsyncPipeline` class that cooperatively schedules the execution of both async and non-async components.
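To make the cost of blocking concrete, here is a minimal sketch that uses only the standard library and no Haystack code. The `heartbeat`, `blocking_work`, `cooperative_work` and `count_ticks` names are hypothetical stand-ins: the heartbeat plays the role of "other coroutines" on the event loop, and `time.sleep` plays the role of a synchronous `run` call. It is written as a standalone script; inside a notebook, where an event loop is already running, you would `await count_ticks(...)` directly instead of calling `asyncio.run`.

```python
import asyncio
import time


async def heartbeat(ticks: list, interval: float = 0.01) -> None:
    # Stand-in for other coroutines that the event loop should keep servicing.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_work(duration: float) -> None:
    # time.sleep() never yields to the event loop, much like a synchronous run() call.
    time.sleep(duration)


async def cooperative_work(duration: float) -> None:
    # asyncio.sleep() suspends this coroutine so that other tasks can run.
    await asyncio.sleep(duration)


async def count_ticks(work, duration: float = 0.2) -> int:
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # Give the heartbeat a chance to start.
    start = len(ticks)
    await work(duration)
    task.cancel()
    # Number of heartbeats that ran while the work was in progress.
    return len(ticks) - start


blocked = asyncio.run(count_ticks(blocking_work))
cooperative = asyncio.run(count_ticks(cooperative_work))
print(f"heartbeats while blocked: {blocked}, while cooperating: {cooperative}")
```

The heartbeat makes no progress while the blocking call holds the thread, whereas it keeps ticking during the cooperative wait. The exact number of cooperative ticks depends on the timer resolution of the machine.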

### Goals
- Allow individual components to opt into `async` support.
    - Not all components benefit from being async-enabled - I/O-bound components are the most suitable candidates.
- Provide a backward-compatible way to execute Haystack pipelines containing both async and non-async components.

### Non-goals
- Add async support to all existing components.
- Execute components concurrently.
    - While async support opens the door for concurrent execution, we're currently only focusing on providing basic `async` utility.

<sup>1</sup> - This is a simplification as the Python runtime can potentially schedule another thread, but it's a detail that we can ignore in this case.

Let's now go ahead and see what it takes to add async support to the original tutorial, starting with installing Haystack, the experimental package and the requisite dependencies.


In [None]:
%%bash

pip install -U haystack-ai
pip install -U haystack-experimental
pip install datasets
pip install sentence-transformers
pip install -q --upgrade openai 

Provide an [OpenAI API key](https://platform.openai.com/api-keys) to ensure that the LLM generator can query the OpenAI API.

In [3]:
import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

# If you're running this notebook on Google Colab, you might need to use the following instead:
#
# from google.colab import userdata
# if "OPENAI_API_KEY" not in os.environ:
#  os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

Enter OpenAI API key:··········


Initialize a `DocumentStore` to index your documents. We use the `InMemoryDocumentStore` from the `haystack-experimental` package since it has support for `async`.

In [4]:
from haystack_experimental.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

Fetch the data and convert it into Haystack `Document`s.

In [5]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]


To store your data in the `DocumentStore` with embeddings, initialize a `SentenceTransformersDocumentEmbedder` with the model name and call `warm_up()` to download the embedding model.

Then, we calculate the embeddings of the docs with the newly warmed-up embedder and write the documents to the document store. Notice that we call the `write_documents_async` method and use the `await` keyword with it. The `DocumentStore` protocol in `haystack-experimental` exposes `async` variants of common methods such as `count_documents`, `write_documents`, etc. These [coroutines](https://docs.python.org/3/library/asyncio-task.html#coroutines) are awaitable when invoked inside an async event loop (the notebook/Google Colab kernel automatically starts an event loop).

In [6]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()

docs_with_embeddings = doc_embedder.run(docs)
n_docs_written = await document_store.write_documents_async(docs_with_embeddings["documents"])
print(f"Indexed {n_docs_written} documents")

Indexed 151 documents
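The top-level `await` in the cell above works because the notebook kernel already runs an event loop. In a plain Python script there is no running loop, so the same call has to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that pattern with the standard library only; `write_documents_async` here is a hypothetical stand-in function, not the document store method itself.

```python
import asyncio
from typing import List


async def write_documents_async(documents: List[str]) -> int:
    # Hypothetical stand-in for an async document store write.
    await asyncio.sleep(0)  # Simulate awaiting I/O.
    return len(documents)


async def main() -> int:
    # Outside a notebook, 'await' is only valid inside a coroutine.
    n_docs_written = await write_documents_async(["doc one", "doc two"])
    print(f"Indexed {n_docs_written} documents")
    return n_docs_written


# asyncio.run() creates an event loop, runs main() to completion and closes the loop.
n_written = asyncio.run(main())
```

Note that `asyncio.run` raises a `RuntimeError` when called from a thread that already has a running event loop, which is why the notebook cells use `await` directly.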


The next step is to build the RAG pipeline to generate answers for a user query.

Initialize a text embedder to create an embedding for the user query and an `InMemoryEmbeddingRetriever` to use with the `InMemoryDocumentStore` you initialized earlier. As with the latter, the async-enabled embedding retriever class stems from the `haystack-experimental` package.

In [7]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_experimental.components.retrievers.in_memory import InMemoryEmbeddingRetriever

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = InMemoryEmbeddingRetriever(document_store)


Create a custom prompt to use with the `ChatPromptBuilder` and initialize an `OpenAIChatGenerator` to consume the output of the former.

In [8]:
from haystack_experimental.components.builders import ChatPromptBuilder
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.dataclasses import ChatMessage

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)])
generator = OpenAIChatGenerator(model="gpt-4o-mini")

We finally get to the creation of the pipeline instance. Instead of using the `Pipeline` class, we use the `AsyncPipeline` class from the `haystack-experimental` package.

The rest of the process, i.e., adding components and connecting them with each other, remains the same as with the original `Pipeline` class.

In [9]:
from haystack_experimental.core import AsyncPipeline

async_rag_pipeline = AsyncPipeline()
# Add components to your pipeline
async_rag_pipeline.add_component("text_embedder", text_embedder)
async_rag_pipeline.add_component("retriever", retriever)
async_rag_pipeline.add_component("prompt_builder", prompt_builder)
async_rag_pipeline.add_component("llm", generator)

# Now, connect the components to each other
async_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
async_rag_pipeline.connect("retriever", "prompt_builder.documents")
async_rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

<haystack_experimental.core.pipeline.async_pipeline.AsyncPipeline object at 0x7eda5d4aace0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])

Now, we create a coroutine that queries the pipeline with a question.

The key differences between the `AsyncPipeline.run` and `Pipeline.run` methods have to do with their parameters and return values.

Both `Pipeline.run` and `AsyncPipeline.run` share the `data` parameter that encapsulates the initial inputs for the pipeline's components.

While `Pipeline.run` accepts an additional `include_outputs_from` parameter to return the outputs of intermediate, non-leaf components in the pipeline graph, `AsyncPipeline.run` does not. This is because the latter is implemented as an `async` generator that yields the output of **each component** as soon as it executes successfully. This has the following implications:

- The output of `AsyncPipeline.run` must be consumed in an `async for` loop for the pipeline execution to make progress.
- By providing the intermediate results as they are computed, it allows for a tighter feedback loop between the backend and the user. For example, the results of the retriever can be displayed to the user before the LLM's response is generated.
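The consumption pattern described above can be illustrated with a plain async generator. This is a minimal sketch under stated assumptions, not the `AsyncPipeline` implementation: `run_stages` is a hypothetical two-stage generator whose stages stand in for a retriever and an LLM, and `asyncio.sleep(0)` stands in for their I/O. It is written as a standalone script; in a notebook, `await consume(...)` directly instead of calling `asyncio.run`.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def run_stages(question: str) -> AsyncIterator[Dict[str, Any]]:
    # Hypothetical two-stage "pipeline": each stage's output is yielded as soon as it is ready.
    await asyncio.sleep(0)  # Stand-in for retrieval I/O.
    yield {"retriever": {"documents": [f"doc about {question}"]}}
    await asyncio.sleep(0)  # Stand-in for the LLM call.
    yield {"llm": {"replies": [f"answer to {question}"]}}


async def consume(question: str) -> List[Dict[str, Any]]:
    seen = []
    # The generator body only advances while it is being iterated.
    async for output in run_stages(question):
        # Intermediate outputs could be shown to the user here, before the reply arrives.
        seen.append(output)
    return seen


outputs = asyncio.run(consume("Rhodes"))
print([list(output) for output in outputs])
```

Calling `run_stages(...)` without iterating over it executes none of its body, which is why the output has to be consumed in an `async for` loop for execution to make progress.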

Whenever a component needs to be executed, the logic of `AsyncPipeline.run` will determine if it supports async execution.
- If the component has opted into async support, the pipeline will schedule its execution as a coroutine on the event loop and yield control back to the async scheduler until the component's outputs are returned.
- If the component has not opted into async support, the pipeline will launch its execution in a separate thread and schedule it on the event loop.

In both cases, only one component of a given `AsyncPipeline` can be executing at any given time. However, this does not prevent multiple, different `AsyncPipeline` instances from executing concurrently.
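The dispatch rule and the concurrency note can be sketched with the standard library alone. Everything in this example is an assumption made for illustration: `SyncComponent`, `AsyncComponent`, `execute` and `mini_pipeline` are hypothetical names, the opt-in check via `hasattr(component, "run_async")` is a guess at how such a check could look, and `run_in_executor` is one possible way to move blocking work to a thread. It does not reproduce the actual `AsyncPipeline` code.

```python
import asyncio
import time
from typing import Any, Dict


class SyncComponent:
    def run(self, text: str) -> Dict[str, Any]:
        time.sleep(0.05)  # Blocking work, executed in a worker thread below.
        return {"text": text.upper()}


class AsyncComponent:
    def run(self, text: str) -> Dict[str, Any]:
        return {"text": text + "!"}

    async def run_async(self, text: str) -> Dict[str, Any]:
        await asyncio.sleep(0.05)  # Non-blocking wait, e.g. a network call.
        return {"text": text + "!"}


async def execute(component: Any, **inputs: Any) -> Dict[str, Any]:
    # Assumed opt-in check: the presence of a run_async coroutine method.
    if hasattr(component, "run_async"):
        return await component.run_async(**inputs)
    # Otherwise run the blocking method in a thread so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: component.run(**inputs))


async def mini_pipeline(text: str) -> Dict[str, Any]:
    # Within one pipeline, components run strictly one after another.
    out = await execute(SyncComponent(), text=text)
    return await execute(AsyncComponent(), **out)


async def main():
    start = time.monotonic()
    # Two independent pipelines can overlap because each one awaits instead of blocking.
    results = await asyncio.gather(mini_pipeline("a"), mini_pipeline("b"))
    return results, time.monotonic() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Each `mini_pipeline` still runs its two components sequentially, but the two instances overlap in time, so the total elapsed time is typically closer to that of one pipeline than to the sum of both.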

The execution of an `AsyncPipeline` is deemed to be complete once program flow exits the `async for` loop. At this point, the final results of the pipeline (the outputs of the leaf nodes in the pipeline graph) can be accessed with the loop variable.

In [10]:
from typing import Dict, Any


async def query_pipeline(question: str) -> Dict[str, Dict[str, Any]]:
    # Named 'data' to avoid shadowing the built-in input() function.
    data = {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }

    result_idx = 0
    # Default value in case the pipeline yields nothing.
    pipeline_output: Dict[str, Dict[str, Any]] = {}
    # The AsyncPipeline.run() method is an async generator that yields the output of each component.
    async for pipeline_output in async_rag_pipeline.run(data):
        print(f"Pipeline result '{result_idx}' = {pipeline_output}")
        result_idx += 1

    # The last yielded value is the final pipeline output.
    return pipeline_output

We can now execute the pipeline with some examples.

In [11]:
examples = [
    "Where is Gardens of Babylon?",
    "Why did people build Great Pyramid of Giza?",
    "What does Rhodes Statue look like?",
]

async def run_query_pipeline():
    for question in examples:
        print(f"Querying pipeline with question: '{question}'")
        response = await query_pipeline(question)
        print(f'\tOutput: {response["llm"]["replies"][0]}\n')

    print("Done!")


await run_query_pipeline()

Querying pipeline with question: 'Where is Gardens of Babylon?'



Pipeline result '0' = {'text_embedder': {'embedding': [0.06369034945964813, 0.0629514828324318, 0.0017213376704603434, -0.028603026643395424, -0.015677345916628838, -0.0549798384308815, -0.011346851475536823, -0.06269995123147964, 0.029381077736616135, 0.015217330306768417, -0.03562495484948158, -0.018175331875681877, -0.06586390733718872, -0.02516329102218151, -0.050018150359392166, -0.036443475633859634, 0.021266600117087364, 0.026529664173722267, 0.006462667603045702, -0.051425740122795105, 0.013276797719299793, 0.018056493252515793, 0.09243573248386383, -0.021958094090223312, 0.06896712630987167, 0.01664097234606743, -0.0357690192759037, 0.0948345810174942, -0.01992974802851677, -0.015581508167088032, 0.019395316019654274, 0.0671977624297142, -0.00363683863542974, 0.049664705991744995, -0.0619245208799839, 0.14623546600341797, 0.05274955555796623, -0.030615637078881264, 0.10668489336967468, 0.008385417982935905, -0.04064483568072319, -0.052015066146850586, 0.0004349834634922445, 0.


Pipeline result '0' = {'text_embedder': {'embedding': [-0.05154181271791458, 0.10799205303192139, -0.005220727063715458, 0.05285795405507088, -0.11021178960800171, -0.10543033480644226, -0.011153997853398323, 0.03415222465991974, -0.07646123319864273, 0.04985126480460167, -0.011636505834758282, 0.024333039298653603, -0.0006020390428602695, 0.025633342564105988, -0.031033437699079514, -0.03249059617519379, 0.04111979156732559, -0.025796059519052505, 0.0027208959218114614, -0.04670005664229393, 0.028048884123563766, -0.0655210018157959, 0.0016593938926234841, 0.05139290913939476, 0.0308478195220232, 0.009952496737241745, -0.08076996356248856, 0.05317305773496628, 0.09119061380624771, -0.03456465154886246, -0.009730314835906029, -0.00017468922305852175, 0.06056418642401695, 0.02022671513259411, -0.02049245871603489, 0.05862937867641449, 0.09268374741077423, -0.03691697493195534, 0.00972905382514, -0.058969538658857346, 0.0074142394587397575, 0.0445503331720829, 0.05215753614902496, 0.0346


Pipeline result '0' = {'text_embedder': {'embedding': [0.027440747246146202, 0.08271684497594833, -0.02266058325767517, 0.030886027961969376, 0.01796676032245159, -0.0462363101541996, 0.01911354809999466, -0.011979070492088795, -0.028748994693160057, -0.0068742819130420685, -0.012549979612231255, -0.03317374363541603, -0.03443441167473793, 0.027602892369031906, -0.030476635321974754, -0.01894759014248848, 0.038222525268793106, 0.03223677724599838, -0.020494744181632996, 0.04793812334537506, 0.06141790375113487, -0.0447174534201622, -0.001174690667539835, 0.07369746267795563, -0.00013805070193484426, 0.06160816550254822, -0.051456600427627563, 0.01998910680413246, 0.030330441892147064, -0.08683960884809494, -0.04924558475613594, -0.07623918354511261, 0.0013470710255205631, -0.015688665211200714, 0.052812620997428894, 0.041563086211681366, 0.10684378445148468, 0.07633957266807556, 0.006814741063863039, -0.03774801269173622, -0.07733006030321121, -0.05712191015481949, 0.055234503000974655

You can alternatively use the `run_async_pipeline` helper function to execute an `AsyncPipeline` in the same manner as a regular `Pipeline` while retaining the benefits of cooperative scheduling.

In [12]:
from haystack_experimental.core import run_async_pipeline

question = examples[0]
outputs = await run_async_pipeline(
    async_rag_pipeline,
    {"text_embedder": {"text": question}, "prompt_builder": {"question": question}},
    include_outputs_from={"retriever"},
)

print(outputs)


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='The Hanging Gardens of Babylon are said to have been located in the ancient city of Babylon, which is near present-day Hillah in Babil province, Iraq. However, the exact location of the gardens has not been definitively established, and there is ongoing debate regarding their historical existence and location. Some theories suggest they may actually refer to gardens built in Nineveh by the Assyrian King Sennacherib.')], _meta={'model': 'gpt-4o-mini-2024-07-18', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 82, 'prompt_tokens': 2627, 'total_tokens': 2709, 'completion_tokens_details': CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), 'prompt_tokens_details': PromptTokensDetails(audio_tokens=0, cached_tokens=2432)}})]}, 'retriever': {'documents': [Document(id=1d3b00c0c761487040b62edb06fdcd47b84a7e

## Custom Asynchronous Components

Individual components can opt into async support by implementing a `run_async` coroutine that has the same signature, i.e., the same input parameters and outputs, as the `run` method. This constraint ensures that pipeline connections are the same irrespective of whether a component supports async execution, allowing for plug-and-play backward compatibility with existing pipelines.


In [None]:
from typing import Dict, Any
from haystack import component

@component
class MyCustomComponent:
    def __init__(self, my_custom_param: str):
        self.my_custom_param = my_custom_param

    @component.output_types(original=str, concatenated=str)
    def run(self, input: str) -> Dict[str, Any]:
        return {
            "original": input,
            "concatenated": input + self.my_custom_param
        }

    async def do_io_bound_op(self, input: str) -> str:
        # Do some IO-bound operation here
        return input + self.my_custom_param

    @component.output_types(original=str, concatenated=str)
    async def run_async(self, input: str) -> Dict[str, Any]:
        return {
            "original": input,
            "concatenated": await self.do_io_bound_op(input)
        }