INFOSCOUT: Revolutionizing Financial Data Analysis

Unleash the power of Large Language Models and Retrieval-Augmented Generation to elevate your data-driven decision-making and uncover actionable insights like never before. Supercharge your business with InfoScout’s cutting-edge LLM technology, turning raw data into gold-standard insights that drive success and outpace the competition.

In the fast-paced world of digital data, finding the right information amidst a sea of documents can be overwhelming. That’s where InfoScout steps in. Designed to revolutionize how you interact with PDF documents, InfoScout specifically targets the 10-Q filings of S&P 500 companies, ensuring you get the information you need, precisely when you need it. Say goodbye to tedious searches and hello to streamlined, efficient, and incredibly accurate data retrieval.

InfoScout offers a cost-effective alternative to traditional data retrieval methods and other AI solutions. It delivers high-quality results at a fraction of the cost, making it a smart investment for any business looking to maximize ROI.

Objective: Enhancing Data Retrieval with AI

The primary goal of InfoScout is to implement a Retrieval-Augmented Generation (RAG) system that efficiently searches and retrieves relevant information from a collection of 1400 PDF documents. These documents contain vital financial data from S&P 500 companies, and InfoScout aims to provide accurate and comprehensive answers to user queries. By leveraging the combined power of Milvus for document storage and retrieval, NLP models for data cleaning and processing, and a transformer-based language model for generating detailed responses, InfoScout sets a new standard in data retrieval.

The Challenge: Navigating Complex Financial Documents

With the exponential growth of digital data, finding relevant information in large document collections has become increasingly challenging. S&P 500 companies regularly file 10-Q documents that contain critical financial information, often buried in lengthy and complex texts. InfoScout addresses this challenge head-on, providing a robust framework for efficient document retrieval and enhancing the user experience in accessing relevant financial data.

InfoScout not only enhances the accuracy and speed of information retrieval but also significantly reduces operational costs. Organizations can now achieve more with less expenditure, making InfoScout a cost-effective choice for businesses aiming to optimize their data handling processes. Leveraging open-source solutions, InfoScout offers substantial cost reductions compared to traditional methods without compromising on quality. Experience efficient data handling at a fraction of the cost.

How InfoScout Works: A Deep Dive into the Methodology

1. Data Generation and Storage:

PDF Extraction: The 10-Q documents are downloaded, renamed, and stored in an Amazon S3 bucket by an automation script. Text is extracted from each PDF with pdfplumber, segmented by page, and stored with metadata.
Text Preprocessing: Text is tokenized, converted to lowercase, and cleaned by removing stop words using the NLTK library. Financial terms, years, quarters, and company names are identified and extracted as keywords.
Embedding Generation: The cleaned text and keywords are encoded into high-dimensional vectors using the thenlper/gte-large embedding model. The resulting embeddings are stored in Milvus, a vector database optimized for similarity search.
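
The preprocessing and keyword extraction steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not InfoScout's actual code: the small stop-word set stands in for NLTK's English stop-word list, and the FINANCIAL_TERMS and COMPANY_NAMES vocabularies, the function names, and the sample sentence are all hypothetical.

```python
import re

# Tiny illustrative stop-word set; the pipeline described above uses NLTK's list.
STOP_WORDS = {"the", "a", "an", "of", "for", "in", "and", "to", "was", "is", "on", "what"}

# Hypothetical vocabularies; a real system would load far larger ones.
FINANCIAL_TERMS = {"revenue", "income", "debt", "cash", "eps"}
COMPANY_NAMES = {"apple", "microsoft", "nvidia"}


def clean_tokens(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def categorize(tokens):
    """Sort tokens into years, quarters, company names, and financial terms."""
    groups = {"companies": [], "years": [], "quarters": [], "financial_terms": []}
    for tok in tokens:
        if re.fullmatch(r"(19|20)\d{2}", tok):
            groups["years"].append(tok)
        elif re.fullmatch(r"q[1-4]", tok):
            groups["quarters"].append(tok)
        elif tok in COMPANY_NAMES:
            groups["companies"].append(tok)
        elif tok in FINANCIAL_TERMS:
            groups["financial_terms"].append(tok)
    return groups


page_text = "Apple reported revenue of $90 billion in Q2 2024."
tokens = clean_tokens(page_text)
keywords = categorize(tokens)
print(tokens)
print(keywords)
```

Tokens that fit none of the four categories (here "reported", "90", and "billion") stay in the cleaned text but are not promoted to keywords.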

2. Query Processing and Document Retrieval:

Query Input: Users enter their queries via a Streamlit UI. Queries are tokenized and cleaned, and essential keywords are extracted using NLTK.
Keyword Categorization: Keywords are categorized into years, quarters, company names, and financial terms. Various keyword combinations are generated to cover different search scenarios.
Query Encoding: Each keyword combination is encoded into an embedding. The query embeddings are used to search Milvus for the most relevant document embeddings, filtered based on company symbols and relevance scores.
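
To make the retrieval step more concrete, here is a small sketch of keyword combination and filtered similarity search. It rests on several assumptions: the article does not say how combinations are formed, so this version simply takes one keyword from each non-empty category; the brute-force cosine search stands in for Milvus; and the record layout, the min_score threshold, and the three-dimensional toy vectors are hypothetical.

```python
import math
from itertools import product


def keyword_combinations(groups):
    """Build one query string per combination, one keyword from each non-empty category."""
    pools = [words for words in groups.values() if words]
    if not pools:
        return []
    return [" ".join(combo) for combo in product(*pools)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, symbol, min_score=0.5, top_k=3):
    """Brute-force stand-in for a vector search, filtered by company symbol and score."""
    hits = []
    for record in records:
        if record["symbol"] != symbol:
            continue
        score = cosine(query_vec, record["vector"])
        if score >= min_score:
            hits.append((score, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [record for _, record in hits[:top_k]]


groups = {
    "companies": ["apple"],
    "years": ["2023", "2024"],
    "quarters": ["q2"],
    "financial_terms": ["revenue"],
}
print(keyword_combinations(groups))

# Toy 3-dimensional vectors; real gte-large embeddings are much larger.
records = [
    {"symbol": "AAPL", "page": 12, "vector": [1.0, 0.0, 0.0]},
    {"symbol": "AAPL", "page": 40, "vector": [0.0, 1.0, 0.0]},
    {"symbol": "MSFT", "page": 7, "vector": [1.0, 0.0, 0.0]},
]
print([r["page"] for r in search([0.9, 0.1, 0.0], records, "AAPL")])
```

In a deployment like the one described, each combination string would be embedded and sent to the vector database, with the symbol filter and score threshold applied to the returned hits.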

3. Answer Generation via Large Language Models:

Context Preparation: The retrieved context and the user query are combined into a structured prompt, which is sent to the Mixtral 8 x 7B model via the Groq API.
Answer Generation: The language model processes the prompt to generate a detailed and contextually relevant answer. The response is displayed to the user in the Streamlit UI along with source files for verification.
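
A structured prompt of this kind might be assembled as in the sketch below. The prompt wording, the chunk fields, and the file names are assumptions made for illustration, and the call to the Groq API itself is deliberately left out.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one structured prompt."""
    context = "\n\n".join(
        f"[{i}] {c['source']} (page {c['page']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def source_files(chunks):
    """Unique source files in retrieval order, for display next to the answer."""
    return list(dict.fromkeys(c["source"] for c in chunks))


# Hypothetical retrieved chunks; file names and text are made up for the example.
chunks = [
    {"source": "AAPL_10Q_Q2_2024.pdf", "page": 12, "text": "Total net sales were ..."},
    {"source": "AAPL_10Q_Q2_2024.pdf", "page": 13, "text": "Services revenue was ..."},
    {"source": "AAPL_10Q_Q1_2024.pdf", "page": 11, "text": "Total net sales were ..."},
]
prompt = build_prompt("How did Apple's revenue change in Q2 2024?", chunks)
print(prompt)
print(source_files(chunks))
```

Numbering the chunks in the prompt makes it easier to map a generated answer back to the source files shown to the user for verification.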

Architecture

Deployment Architecture

Infrastructure: Robust and Scalable

InfoScout’s infrastructure is meticulously designed to achieve optimal performance and cost-efficiency. We guarantee robust scalability to efficiently handle extensive document collections.

In addition, InfoScout integrates a suite of leading open-source software solutions. This combination of hardware excellence and open-source innovation ensures InfoScout delivers superior performance and scalability while optimizing operational costs, making it an ideal solution for organizations handling complex financial data.

On-Prem Hardware:

CPU: AMD Ryzen™ 9 7950X3D | 4.2 GHz | 5nm Processor
RAM: 128 GB
Storage: 2 TB
Cores / Threads: 16 cores | 32 threads

On-Prem Software:

Milvus Vector Database (for vector storage and semantic search) *
Groq – Mixtral 8 X 7B (for inference) *
Hugging Face – thenlper/gte-large (for embedding) *
Streamlit (for user interface) *
NLTK Python library (for document chunking) *

Note: * indicates open source

Cost comparison with OpenAI

OpenAI
Model	Tokens	Tokens / Query	Standard Cost	Cost / Query	Total Cost
gpt-4o-2024-05-13	Input Tokens	5000 / query	US $5.00 / 1M tokens	0.025 USD / query	0.0325 USD / query
gpt-4o-2024-05-13	Output Tokens	500 / query	US $15.00 / 1M tokens	0.0075 USD / query	0.0325 USD / query

InfoScout
Model	Tokens	Tokens / Query	Groq Concurrent Requests / minute	Query / Hour	AWS Instance Type	Hosting Cost	Total Cost
mixtral-8x7b-32768	Input Tokens	5000 / query	30	1800	ml.m5.8xlarge	1.843 USD / Hour	0.001024 USD / query
	Output Tokens	500 / query

Therefore, at 0.001024 USD per query versus 0.0325 USD per query, we offer a roughly 30x cheaper solution with an open-source stack.
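
The per-query figures in the two tables can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers from the tables; the function names are ours, and the InfoScout figure assumes the instance is fully utilized at 30 requests per minute (1800 queries per hour).

```python
def token_priced_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost per query when billed per million input and output tokens."""
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m


def hosted_cost(hourly_cost, requests_per_minute):
    """Cost per query when paying a flat hourly rate at full utilization."""
    return hourly_cost / (requests_per_minute * 60)


openai_cost = token_priced_cost(5000, 500, 5.00, 15.00)
infoscout_cost = hosted_cost(1.843, 30)
ratio = openai_cost / infoscout_cost

print(f"OpenAI:    {openai_cost:.4f} USD / query")     # 0.0325
print(f"InfoScout: {infoscout_cost:.6f} USD / query")  # 0.001024
print(f"Ratio:     {ratio:.1f}x")                      # 31.7x
```

If the instance serves fewer than 1800 queries per hour, the hosted cost per query rises proportionally, so the ratio depends on utilization.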

Benefits: Why Choose InfoScout?

Efficiency: The RAG system significantly reduces the time required to find relevant information, delivering answers quickly without manual sifting through lengthy documents.

Accuracy: Leveraging state-of-the-art NLP models, the system ensures high accuracy in retrieving relevant information.

User Experience: The simple Streamlit UI offers an intuitive interface, making it accessible even to those with limited technical expertise.

Scalability: The use of Milvus and distributed processing techniques allows the system to scale efficiently with increasing data volumes.

Cost Efficiency: By leveraging open-source technologies and optimized hardware solutions, InfoScout effectively cuts operational costs while maintaining superior performance and scalability.

Contextual Understanding: The integration of Mixtral ensures answers are not only relevant but also contextually comprehensive, enhancing overall information retrieval quality.

Results: Proven Efficiency and Accuracy

Our tests with various user queries related to financial information from S&P 500 companies’ 10-Q filings demonstrated InfoScout’s efficiency, accuracy, and overall effectiveness. Here are some key observations:

Query Processing and Response Time:

Average response time: 12–15 seconds
- Note: Response time may increase for resource-intensive queries
Consistently quick processing across various queries

Relevance and Accuracy:

High relevance of retrieved documents to user queries
Detailed and contextually appropriate answers generated by the Mixtral model

Scalability:

Efficient handling of 1400 PDF documents
Responsive performance even with large datasets

Cost Reduction:

A significant cost reduction compared to other solutions like OpenAI without sacrificing result quality

Product Demo: Visual Guide and Explanation

Conclusion:

InfoScout represents a breakthrough in the realm of financial data retrieval from extensive document repositories. By integrating Milvus for streamlined vector storage and retrieval and harnessing the power of a large language model for nuanced answer generation, InfoScout sets a new standard for efficiency and accuracy in data handling.

This innovative system not only enhances the speed and precision of information retrieval but also significantly reduces operational costs through its use of open-source technologies and optimized hardware solutions. By leveraging open-source tools like Milvus, Groq – Mixtral 8 X 7B, and Hugging Face – thenlper/gte-large, InfoScout ensures cost-effective scalability without compromising on performance.

InfoScout’s ability to deliver contextually comprehensive answers, supported by Mixtral’s integration, further enhances its utility for financial analysts, researchers, and professionals needing rapid and precise insights from S&P 500 companies’ 10-Q filings. This comprehensive approach not only meets but exceeds the demands of modern data-intensive applications, making InfoScout an invaluable asset for enhancing decision-making processes in today’s competitive landscape.