But, But, You Were Supposed to Be a GPT Wrapper?!
The technical details behind our AI Financial Agent: Fintool, Warren Buffett as a service.
My team and I are building Fintool, Warren Buffett as a service. It's a set of AI agents that analyze massive amounts of financial data and documents to assist institutional investors in making investment decisions. To simplify for customers, we explain Fintool as a sort of ChatGPT on SEC filings and earnings calls.
We got our fair share of "yOU aRe JuST a GPT wRapPer" from people who had no clue what they were talking about but wanted to sound smart and provocative. Anyway! For more serious readers, I thought it would be nice to disclose our infrastructure and the unique challenges behind it.
Terabytes of Real-time Data Ingested with Spark
Our goal is to ingest as much financial data as possible—ranging from news, management presentations, internal notes, broker research, market data, rating agency reports, alternative data, internal data and much more. We started with SEC filings, but our infrastructure is designed to scale and adapt, with no limit to the types of data sources it can handle.
Our data ingestion pipeline uses Apache Spark to efficiently process vast amounts of structured and unstructured data. The primary data source is the SEC database, which provides, on average, around 3,000 filings daily. We've built a custom Spark job to pull data from the SEC, process HTML files, and distribute the workload across our Spark cluster for real-time ingestion. With SEC filings and earnings calls alone, we manage 70 million chunks, 2 million documents, and around 5 TB of data in Databricks for every ten years of data.
Many documents are unstructured and often exceed 200 pages in length. Each data source has a dedicated Spark streaming job, ensuring a continuous flow of data into our system, making Fintool one of the very few real-time systems in production in our market. We outperform nearly all incumbents in processing time, often being hours faster.
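To make this concrete, here is a minimal sketch of what one of these streaming jobs can look like. The paths, the schema, and the parsing helper are illustrative rather than our production code:

```python
# Illustrative sketch: a Spark Structured Streaming job that watches a landing
# directory of downloaded SEC filings and continuously parses them.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name, udf
from pyspark.sql.types import StringType
from bs4 import BeautifulSoup

spark = SparkSession.builder.appName("sec-filings-ingest").getOrCreate()

def parse_filing_html(raw_html: str) -> str:
    """Strip markup and return plain text for downstream chunking."""
    return BeautifulSoup(raw_html, "html.parser").get_text(separator="\n")

parse_udf = udf(parse_filing_html, StringType())

# Each file in the (hypothetical) landing zone is one filing fetched from EDGAR.
filings = (
    spark.readStream
    .format("text")
    .option("wholetext", "true")          # one row per filing
    .load("/mnt/landing/sec_filings/")
)

parsed = (
    filings
    .withColumn("source_path", input_file_name())
    .withColumn("body_text", parse_udf(col("value")))
    .drop("value")
)

# Continuously append parsed filings to a Delta table for chunking and embedding.
query = (
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/sec_filings/")
    .outputMode("append")
    .start("/mnt/tables/parsed_filings")
)
```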
Keeping all of these pipelines at 100% uptime and catching errors early is a significant challenge. Any failure could lead to incomplete or delayed data, undermining the reliability of Fintool. Our customers can't miss a company's earnings release or an 8-K filing announcing that an executive is departing the company.
To address this, we have built robust monitoring tools that help us detect and resolve issues swiftly, ensuring the system remains operational and dependable.
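As a simplified illustration, a freshness check like the one below flags a pipeline that has stopped producing filings. The helper supplying the latest ingestion timestamp is hypothetical, and the real system tracks many more signals than this:

```python
# Minimal sketch of a pipeline freshness check (the timestamp comes from a
# hypothetical helper backed by our ingestion tables).
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(minutes=30)  # alert if nothing has landed recently

def check_pipeline_freshness(latest_ingested_filing_time: datetime) -> None:
    staleness = datetime.now(timezone.utc) - latest_ingested_filing_time
    if staleness > MAX_STALENESS:
        # In production this would page the on-call engineer instead of raising.
        raise RuntimeError(f"SEC ingestion stale for {staleness}; investigate.")
```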
50 Billion Tokens per Week? Parsing Complex Financial Data
To make sense of the different formats, we've developed a custom parser that can handle both structured and unstructured data. This parser extracts millions of data points using a combination of unsupervised machine learning models, all optimized for financial documents. For instance, extracting tables with numerical data and footnotes accurately presents unique challenges, as it requires ensuring the numbers are correctly linked to their respective headers and that important context from footnotes is preserved. Imagine a company reports non-GAAP earnings with a footnote clarifying that $2 billion in employee stock-based compensation isn’t included; without accounting for that $2 billion, the earnings figures could be misleading!
One of our goals is to handle as many complex operations offline as possible. By doing this, we save on costs and improve quality, as it allows us to thoroughly analyze the output—something that is not feasible during real-time user queries.
We have recently partnered with OpenAI on a research project to use LLMs to extract every data point in SEC filings. Every week, we process 50 billion tokens, equivalent to 468,750 books of 200 pages each, or 12 times the size of Wikipedia.
Accounting is exceptionally complex. SEC filings often use different terminologies or formats for similar items—terms like “Revenue,” “Net Sales,” or “Turnover” vary by company or industry—making consistent data extraction a challenge. Key figures like "Net Income" may come with footnotes detailing adjustments (e.g., “excluding litigation costs”), and companies frequently report figures for different time periods, such as quarterly versus year-to-date, within the same filing. Some companies don’t report in USD, and others occasionally change accounting methods (e.g., revenue recognition policies), noted in footnotes, which requires careful adjustments to make financials comparable over time.
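Conceptually, part of the fix is mapping company-specific labels onto canonical concepts. The toy normalizer below shows the idea with a tiny hand-written subset; in practice the mapping is far larger and model-assisted rather than a static dictionary:

```python
# Toy normalizer mapping reported line-item labels to canonical concepts.
# The mapping is an illustrative subset, not our real taxonomy.
CANONICAL_CONCEPTS = {
    "revenue": "revenue",
    "net sales": "revenue",
    "turnover": "revenue",
    "net income": "net_income",
    "net earnings": "net_income",
}

def normalize_line_item(label: str) -> str | None:
    """Return the canonical concept for a reported line item, if known."""
    return CANONICAL_CONCEPTS.get(label.strip().lower())

assert normalize_line_item("Net Sales") == "revenue"
assert normalize_line_item("Turnover") == "revenue"
```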
It’s complex, but Fintool is bringing order to it all. Our advanced data pipelines are engineered to locate, verify, deduplicate, and cross-compare every data point, ensuring unmatched accuracy and insight. This is how we've built the most reliable financial fundamentals database on the market!
Smart Chunking for Context-Aware Document Segmentation
Next, we break down these documents into manageable, meaningful segments while preserving context—crucial for downstream tasks like search and question answering.
We use a sliding window approach with a variable-sized window (typically 400 tokens) to ensure coherence between segments. We also employ hierarchical chunking to create a tree-like structure of document sections, capturing everything from top-level sections like "Financial Statements" to specific sub-sections. Our system treats tables as atomic units, keeping table headers and data cells intact for accuracy.
To maintain context, each chunk is enriched with metadata (e.g., document title, section headers), and we use an overlap strategy where consecutive chunks share a small overlap (about 10%) to ensure continuity. This allows us to accurately capture the narrative even in long documents; a 10-K annual report runs between 150 and 200 pages.
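Here is a minimal sketch of the sliding-window part of that chunker, using naive whitespace tokenization for illustration:

```python
# Sliding-window chunking sketch: ~400-token windows, ~10% overlap, and
# per-chunk metadata. Whitespace splitting stands in for real tokenization.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    document_title: str
    section_header: str

def chunk_section(text: str, document_title: str, section_header: str,
                  window: int = 400, overlap_ratio: float = 0.10) -> list[Chunk]:
    tokens = text.split()
    step = max(1, int(window * (1 - overlap_ratio)))  # consecutive chunks share ~10%
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + window]
        if not piece:
            break
        chunks.append(Chunk(" ".join(piece), document_title, section_header))
        if start + window >= len(tokens):
            break
    return chunks
```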
Those docs are then ready to be embedded!
Custom Embeddings for Semantic Representation
We compute embeddings for each document chunk using a fine-tuned open-source model running on our GPUs. This model was fine-tuned on hundreds of real-life examples from expert financial questions. These embeddings allow us to represent complex financial data in a way that captures semantic meaning. For example, if a document mentions 'net income growth' alongside 'operating cash flow trends,' the embeddings capture the relationship between these terms, allowing the system to understand the context and link related financial concepts effectively.
The embedding computation pipeline processes data in batches and stores the results in Elasticsearch, which supports vector storage and search through its dense_vector field type. Elasticsearch enables k-nearest neighbor (kNN) search using similarity metrics such as cosine similarity and dot product. Since we normalize our embeddings to unit length, cosine similarity and dot product yield equivalent results, allowing us to use either for efficient similarity search. We chose not to use a dedicated vector database, as it would add complexity and reduce performance, particularly when merging results from both keyword and vector searches. Managing this combination effectively without compromising speed and accuracy is challenging, which is why we opted for this more streamlined approach.
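A minimal sketch of that setup with the Elasticsearch Python client follows; the index name, field names, host, and vector dimensionality are assumptions for illustration:

```python
# Illustrative Elasticsearch setup for storing and searching chunk embeddings
# via the dense_vector field type.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="filing_chunks",
    mappings={
        "properties": {
            "text": {"type": "text"},
            "document_title": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": 768,                 # depends on the embedding model
                "index": True,
                "similarity": "cosine",      # embeddings are unit-normalized
            },
        }
    },
)

# Index a chunk with its precomputed embedding (placeholder vector here).
es.index(index="filing_chunks", document={
    "text": "Net income grew 12% year over year...",
    "document_title": "ACME Corp 10-K 2023",
    "embedding": [0.01] * 768,
})

# k-nearest-neighbor search against the stored vectors.
hits = es.search(
    index="filing_chunks",
    knn={"field": "embedding", "query_vector": [0.01] * 768,
         "k": 10, "num_candidates": 100},
)
```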
To speed up our embeddings search, we quantize the embeddings, compressing them to significantly reduce memory usage—by as much as 75%. This reduction means we can access and process data faster, allowing for quicker responses while maintaining effective search performance. Quantization not only optimizes memory but also boosts efficiency across the entire search process.
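For intuition, scalar quantization to int8 is one way to get a reduction of that size: each float32 dimension is stored as a single byte plus a scale factor. A minimal sketch:

```python
# Scalar (int8) quantization sketch: float32 vectors compressed to one byte
# per dimension, roughly a 75% memory reduction.
import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Map a float32 vector to int8 plus a scale factor for dequantization."""
    scale = float(np.abs(vec).max() / 127.0) or 1.0
    return np.round(vec / scale).astype(np.int8), scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vec = np.random.randn(768).astype(np.float32)
q, scale = quantize_int8(vec)
print(vec.nbytes, q.nbytes)   # 3072 bytes vs 768 bytes -> 75% smaller
```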
Search Infra: Combining Keywords and Semantic Search
Our search infrastructure integrates both keyword-based and semantic search methods to deliver accurate and comprehensive answers. For keyword search, we use an enhanced BM25 algorithm, which helps us find relevant information based on traditional keyword matching. On the semantic side, we leverage vector-based similarity search using ElasticSearch to locate information based on meaning rather than just keywords.
Despite all the buzz around vector search, our evaluations revealed that relying on vector search alone falls short of expectations. While many startups offer vector databases combined with vector search as a service, we have more confidence in Elastic's technology. Through extensive optimizations, we've achieved a streamlined Elastic index of approximately 500 GB, containing about 2 million documents for every 10 years of data.
This combination of keyword and semantic search allows us to achieve hybrid retrieval, which significantly enhances search relevance and accuracy. For example, keyword search is ideal for finding specific financial terms like 'net income,' which require precise matching. Meanwhile, vector search helps understand broader questions, such as "companies showing signs of liquidity stress," which involves context and relationships between multiple financial metrics.
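Building on the earlier indexing sketch, a hybrid request can run both branches in a single Elasticsearch query; the embed() helper is hypothetical and the parameters are illustrative:

```python
# Hybrid retrieval sketch: BM25 keyword matching and kNN vector search in one
# Elasticsearch request; the combined candidates are then reranked downstream.
query_text = "companies showing signs of liquidity stress"
query_vector = embed(query_text)          # hypothetical embedding helper

hits = es.search(
    index="filing_chunks",
    query={"match": {"text": query_text}},                  # BM25 branch
    knn={"field": "embedding", "query_vector": query_vector,
         "k": 50, "num_candidates": 500},                    # semantic branch
    size=50,
)
candidates = [h["_source"]["text"] for h in hits["hits"]["hits"]]
```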
We then use reranking techniques to improve retrieval performance. Our re-ranker takes a list of candidate chunks and uses a cross-encoder model to assign a relevance score, ensuring the most relevant chunks are prioritized. This cross-encoder model allows for a deeper and more precise evaluation of the relationship between the query and each document, resulting in significantly more accurate final rankings. Re-ranking can add hundreds of milliseconds of latency but, in our experience, is worth it.
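A minimal reranking sketch using an off-the-shelf cross-encoder from sentence-transformers follows; the model name is an example, not necessarily the one we run:

```python
# Cross-encoder reranking sketch: score each (query, chunk) pair jointly and
# keep the highest-scoring chunks.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```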
Knowledge Graph, the Next Step to Connect the Dots
Talking about improving the search, we are currently exploring knowledge graphs since the publication of the GraphRAG framework by Microsoft. It uses an LLM to automatically extract data points to create a rich graph from a collection of text documents. This graph represents entities as nodes, relationships as edges, and claims as covariates on edges. An example of a node in the knowledge graph could be 'Apple Inc. (AAPL)' as an entity, representing the company. Relationships (edges) might include connections like 'has CEO' linked to 'Tim Cook' or 'sold shares on [date].' These nodes and relationships help institutional investors quickly identify key details about companies, such as executive leadership changes, important filings, or financial events. GraphRAG automatically generates summaries for these entities.
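In simplified form, the structures look something like this; the field names are illustrative rather than GraphRAG's internal schema:

```python
# Stripped-down view of the graph: entities as nodes, relationships as edges,
# and claims attached to edges.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                      # e.g. "Apple Inc. (AAPL)"
    entity_type: str               # e.g. "Company", "Person"
    summary: str = ""              # LLM-generated entity summary

@dataclass
class Relationship:
    source: str                    # "Apple Inc. (AAPL)"
    target: str                    # "Tim Cook"
    relation: str                  # "has CEO"
    claims: list[str] = field(default_factory=list)

apple = Entity("Apple Inc. (AAPL)", "Company")
edge = Relationship("Apple Inc. (AAPL)", "Tim Cook", "has CEO",
                    claims=["Sold shares on [date] (from a Form 4 filing)."])
```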
When a user asks a query, we will leverage the knowledge graph and community summaries to provide more structured and contextually relevant information compared to traditional retrieval-augmented generation approaches. For example, an institutional investor might ask, "Which companies in the S&P 500 are experiencing liquidity stress and have recently made executive changes?" GraphRAG supports both global search to reason about the holistic context (e.g., liquidity stress across the market) and local search for specific entities (e.g., identifying companies with recent executive changes). This hybrid approach helps connect disparate pieces of information, providing more comprehensive and insightful answers.
The challenge with GraphRAG search lies in the high cost of both building and querying the graph, as well as managing query-time latency and integrating it with our keyword + vector search. A potential solution could be an efficient, fast classifier to reserve GraphSearch for only the most complex queries.
LLM Benchmarking: Routing to the Best Model
We use LLMs for a variety of tasks such as understanding the query, expanding it, and classifying its type. For each user query, we trigger multiple classifiers that help determine whether the question requires searching specific filings, calculating numerical values, or taking other specific actions.
To handle these tasks, we utilize a variety of LLMs—from proprietary models to open-source Llama models, with different sizes and providers to balance speed and cost. For instance, we might use OpenAI's GPT-4o for complex tasks and Llama 3 8B on Groq, a specialized provider for fast inference, for simpler tasks.
We created an LLM Benchmarking Service that continuously evaluates the performance of these models across numerous tasks. This service helps us dynamically route each query to the best-performing model.
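The routing logic itself can be quite simple once the benchmark scores exist. The sketch below is illustrative: the benchmark table, cost table, and model names are examples, not our live configuration:

```python
# Benchmark-driven routing sketch: pick the cheapest model that clears the
# quality bar for a task, otherwise fall back to the best-scoring model.
BENCHMARK_SCORES = {
    "query_classification": {"llama-3-8b@groq": 0.93, "gpt-4o": 0.95},
    "complex_reasoning":    {"llama-3-8b@groq": 0.71, "gpt-4o": 0.90},
}

COST_PER_1K_TOKENS = {"llama-3-8b@groq": 0.0001, "gpt-4o": 0.01}

def route(task: str, min_score: float = 0.9) -> str:
    scores = BENCHMARK_SCORES[task]
    good_enough = [m for m, s in scores.items() if s >= min_score]
    if good_enough:
        return min(good_enough, key=COST_PER_1K_TOKENS.__getitem__)
    return max(scores, key=scores.__getitem__)

assert route("query_classification") == "llama-3-8b@groq"
assert route("complex_reasoning") == "gpt-4o"
```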
Having a model-agnostic interface is crucial to ensure we are not constrained by any particular model, especially with new models emerging every six months with enhanced capabilities. This flexibility allows us to always leverage the best available tools for optimal performance. We don't spend any resources training or fine-tuning our own models - we wrote about this strategy in Burning Billions: The Gamble Behind Training LLM Models.
As you can see, answering a user's question is not trivial. It relies on a massive infrastructure, dozens of classifiers, and a hybrid retrieval pipeline. Additionally, we use a specialized LLM pipeline to generate accurate citations for every piece of information in the response, which also serves as a way to fact-check everything the LLM outputs. For example, if the answer references a specific SEC filing, the LLM provides an exact citation, guiding the user directly to the original document.
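One piece of that pipeline can be sketched as a post-hoc check: every citation in the answer must resolve to a chunk that was actually retrieved. The [chunk:...] citation format below is purely illustrative:

```python
# Post-hoc citation verification sketch: flag any cited chunk ID that does not
# appear in the retrieved set.
import re

def verify_citations(answer: str, retrieved_chunk_ids: set[str]) -> list[str]:
    cited = re.findall(r"\[chunk:([A-Za-z0-9_-]+)\]", answer)
    return [cid for cid in cited if cid not in retrieved_chunk_ids]

answer = "Revenue grew 12% in FY2023 [chunk:acme-10k-2023-p45]."
missing = verify_citations(answer, {"acme-10k-2023-p45"})
assert missing == []   # every citation resolves to a retrieved source
```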
LLM Evaluation and Monitoring
Evaluating and monitoring an LLM-based Retrieval Augmented Generation system presents its own challenges. Any problem could originate from various components—such as data pipelines, machine learning models for structuring data, the retrieval search and vector representation, the reranker, or the LLM itself. Identifying the root cause of an issue requires a comprehensive understanding of each part of the infrastructure and its interactions, ensuring that every step contributes effectively to the overall accuracy and reliability of the system.
To address these challenges, we have developed specialized monitoring tools that help us catch potential errors across the entire pipeline. We also store extensive logs in Datadog so we can quickly identify and fix production issues. We want to catch regressions early, so we continuously benchmark our product against finance-specific benchmarks. The catch is that a change can improve our embeddings in isolation yet degrade the overall performance of the product. As you can see, it's very complex!
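One of the simpler checks in that toolbox looks roughly like this: run a finance QA set through retrieval and verify the gold document appears in the top-k results. The dataset format and the search() helper are assumptions:

```python
# End-to-end regression sketch: retrieval recall@k over a finance QA set.
def retrieval_recall_at_k(eval_set, search, k: int = 10) -> float:
    hits = 0
    for example in eval_set:            # {"question": ..., "gold_doc_id": ...}
        results = search(example["question"], top_k=k)
        hits += example["gold_doc_id"] in {r["doc_id"] for r in results}
    return hits / len(eval_set)
```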
There is so much more we could talk about, and I hope this provides a broad overview of our approach. Each of these sections could easily be expanded into a dedicated blog post!
In short, I believe that making LLMs work in finance is both highly challenging and immensely rewarding. We're steadily building our infrastructure piece by piece, productizing and delivering each advancement along the way. Our ultimate goal is to create an autonomous "Warren Buffett as a Service" that can handle the workload of dozens of analysts, transforming the financial research landscape.
Let me finish by sharing some of the things I'm most excited about for the future.
Faster inference
Many companies are working on specialized chips designed to deliver extremely low-latency, high-throughput processing with high parallelism. Today, we are using Groq, a provider capable of streaming at 800 tokens per second, and they now claim they can reach 2,000 tokens per second. To put this into context, processing at several thousand tokens per second means that complex responses will be delivered almost instantaneously.
I'm more excited by faster inference than by smaller models like Llama 8B or Mistral 3B. While smaller LLMs are useful because they are faster, if larger models become extremely efficient and deliver superior intelligence, there may be no need for smaller models. The power of large, smart models would make them the optimal choice for most tasks.
Why does this matter? With such speed, an advanced AI agent can take control of Fintool to analyze thousands of companies simultaneously, performing billions of searches on company data in a fraction of the time. Imagine if Warren Buffett could read all filings, compute numbers, and analyze management teams instantly for thousands of companies.
Cheaper cost per token
I'm excited by the price of superintelligence getting closer to zero. The cost per GPT token has already dropped by 99%, and I'm confident it will continue to drop due to intense competition between major players like Microsoft and Meta, as well as innovations in semiconductors and economies of scale with large data centers. With costs continuing to decrease, we are approaching a future where large-scale AI computations are affordable, enabling widespread adoption and insane innovations.
Autonomous AI Agents
Multi-Agent Systems consist of AI agents that can work independently or collaborate with other agents to perform complex tasks. For example, these agents could autonomously collaborate in stress-testing scenarios or optimize complex investment strategies. Additionally, Self-Healing Systems, capable of real-time monitoring, debugging, and repairing themselves, could detect and correct discrepancies in market data or errors in algorithms, enhancing reliability and resilience.
Onwards!
P.S.: Fintool is hiring Data Engineers, ML Engineers, and Product Engineers in San Francisco.