What is Haystack for Neural Question Answering
Haystack is a Python framework for developing end-to-end question answering systems. It provides a flexible way to use the latest NLP models to solve several QA tasks in real-world settings with large document collections. Haystack is useful for diverse QA applications such as financial governance, knowledge base search, and competitive intelligence. It is available as an open-source library, with a paid enterprise version that adds extra features.
Core Idea
Large neural networks, especially those with transformer-based architectures, perform extremely well not only on extractive question answering but also on generative question answering (QA). However, these models are computationally expensive and slow, which makes them unusable as-is in latency-sensitive applications. Haystack solves this problem by pre-filtering the documents with faster but less powerful methods, so the neural model only has to run inference on a small candidate set.
The Retriever is a lightweight filter that scans through all the documents in the document store and identifies a small, relevant candidate set. Retrievers can use either sparse or dense methods. One very efficient sparse option is Elasticsearch, a full-fledged search engine built on sparse (inverted) indexing.
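The idea behind a sparse retriever can be sketched in plain Python. The snippet below is a minimal, hypothetical TF-IDF-style scorer (not Haystack's actual implementation): it ranks documents by weighted term overlap with the query and returns only a small candidate set for the reader to examine.

```python
import math
from collections import Counter

def retrieve(query, documents, top_k=2):
    """Rank documents by a simple TF-IDF-style score and return the top_k candidates."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(documents)
    # Inverse document frequency: rare terms carry more weight
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in df}
    scores = []
    for idx, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split())
        scores.append((score, idx))
    scores.sort(reverse=True)
    return [documents[idx] for _, idx in scores[:top_k]]

docs = [
    "Jon Snow is the bastard son of Ned Stark.",
    "Cersei Lannister is the queen of the Seven Kingdoms.",
    "Daenerys Targaryen is the mother of dragons.",
]
print(retrieve("who is the mother of dragons", docs, top_k=1))
# → ['Daenerys Targaryen is the mother of dragons.']
```

Common words like "is" and "the" appear in every document, so their IDF weight is zero; only the distinctive query terms influence the ranking.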
The Reader is a powerful model that closely examines each candidate document and extracts the answer to the question. Using the latest transformer models gives the Reader the ability to extract information from documents semantically, instead of relying on plain lexical search. Reader models are generally built by adding a question answering head on top of a language model such as BERT.
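Conceptually, a QA head scores every token in the document as a possible start or end of the answer span, and the reader returns the best-scoring span. The snippet below is a toy illustration of that span-selection step with hand-made scores (a real model like BERT would produce these logits from the question and document):

```python
def best_span(start_scores, end_scores, max_len=5):
    """Pick the answer span maximizing start_score + end_score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans of at most max_len tokens starting at s
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["Tywin", "is", "the", "father", "of", "Tyrion"]
# Hypothetical logits a QA head might produce for these tokens
start_scores = [4.0, 0.1, 0.2, 0.3, 0.1, 1.5]
end_scores   = [3.0, 0.2, 0.1, 0.4, 0.2, 0.9]
s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # → Tywin
```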
To do this, deepset provides a framework called FARM (Framework for Adapting Representation Models), which facilitates transfer learning on representation models.
Let’s see a basic example of a QA solution using Haystack.
Usage
The following code is based on the official examples/tutorials provided by Haystack.
Setup
Install Haystack and its dependencies using pip.
```python
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4
```
For the retriever, we can use many options, ranging from a simple TF-IDF retriever to Elasticsearch. Let’s see how to set up Elasticsearch.
```python
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # run as the daemon user
                  )
# Wait until Elasticsearch has started
! sleep 30
```
Import the required functions from Haystack.
```python
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
```
Loading the Data
```python
# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts containing documents that can be indexed to our datastore.
# You can optionally supply a cleaning function that is applied to each doc
# (e.g. to remove footers). It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can skip
# convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}
# Let's have a look at the first 3 entries:
print(dicts[:3])

# Connect to the Elasticsearch instance started above and use it as our document store
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

# Now, let's write the docs to our DB.
document_store.write_documents(dicts)
```
Model Pipeline
Haystack provides a flexible way to build search pipelines. Each pipeline is a directed acyclic graph (DAG) in which each node is a reader, retriever, generator, etc. Let’s build a simple pipeline.
```python
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)

from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
```
You can visualize the pipeline built with Haystack.
```python
pipe.draw()
```
Although Elasticsearch is a decent and fast retriever based on the BM25 algorithm, it does not match the accuracy of dense methods. Dense methods are parameter-dependent and require training to learn those parameters. We can build more complex pipelines that combine multiple retrievers in the following way.
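Dense retrieval works by embedding both queries and passages into the same vector space with trained encoders (such as DPR's question and context encoders) and ranking passages by vector similarity. The snippet below is a rough sketch of that ranking step using hand-made 3-dimensional vectors; real systems use learned embeddings with hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" for three passages; in practice these come from a
# trained passage encoder such as DPR's context encoder.
passages = {
    "Jon Snow was raised at Winterfell.":          [0.9, 0.1, 0.0],
    "Dragons breathe fire.":                       [0.0, 0.2, 0.9],
    "The Starks rule the North from Winterfell.":  [0.7, 0.4, 0.2],
}
query_vec = [0.85, 0.2, 0.05]  # toy embedding of "Where was Jon Snow raised?"

# Rank passages by similarity to the query embedding
ranked = sorted(passages, key=lambda p: cosine(query_vec, passages[p]), reverse=True)
print(ranked[0])  # → Jon Snow was raised at Winterfell.
```

Because similarity is computed in embedding space rather than over exact terms, a dense retriever can match passages that share no keywords with the query, which is what the trained DPR encoders above provide.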
```python
from haystack.retriever.dense import DensePassageRetriever
dense_retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)

from haystack.pipeline import Pipeline, JoinDocuments

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dense_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
           inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
```
Inference
Question answering is simple with Haystack: we just run the pipeline to get the answers.
```python
prediction = pipe.run(query="Who is the father of Joffrey?",
                      top_k_retriever=20, top_k_reader=5)
```
Here top_k_retriever and top_k_reader are the numbers of results to keep at each stage. They should be chosen by weighing the number of results needed in the output against latency constraints. The following prints the answers to our question.
```python
print_answers(prediction, details="minimal")
```
We can use Haystack for several other use cases as well, such as semantic search and generative question answering.
References
- Haystack Website
- Haystack GitHub Repository
- Haystack Tutorial (most of the code used is from here)
- Colab Notebook
The post What is Haystack for Neural Question Answering appeared first on Analytics India Magazine.