DoctorGPT: Harnessing the Power of Semantic Knowledge Graphs for Unstructured Data

In the era of big data, companies have access to more information than ever before. However, a significant portion of this data is unstructured, which poses a challenge for traditional data processing techniques. According to Gartner and other analysts, up to 80% of data in companies' data lakes may be unstructured - an untapped gold mine of potential insights.

Enter Semantic Knowledge Graphs (SKGs), a powerful tool that can transform this wealth of unstructured data into structured, actionable intelligence.


What are Semantic Knowledge Graphs?

Semantic Knowledge Graphs are a type of knowledge representation used to organize and interpret complex networks of information. They structure information as directed, labeled graphs, with nodes representing entities and edges representing relationships between these entities. In the context of unstructured data in data lakes, nodes could represent key terms or sections of text, while edges could represent the relationships between these elements. By applying SKGs to unstructured data, we can label and structure the data, facilitating more sophisticated analysis and insight extraction.
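The directed, labeled graph described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not code from any particular SKG library; the class name `LabeledGraph`, its methods, and the edge labels are all made up for this example.

```python
from collections import defaultdict


class LabeledGraph:
    """A tiny directed, labeled graph: nodes are entities, edges carry a relationship label."""

    def __init__(self):
        # Maps a source node to a list of (label, target node) pairs.
        self._out = defaultdict(list)

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def neighbors(self, source, label=None):
        """Return targets reachable from `source`, optionally restricted to one edge label."""
        return [t for (l, t) in self._out.get(source, []) if label is None or l == label]


graph = LabeledGraph()
graph.add_edge("carbon dioxide", "is_a", "greenhouse gas")
graph.add_edge("greenhouse gas", "contributes_to", "global warming")
graph.add_edge("carbon dioxide", "mentioned_in", "section 2")

print(graph.neighbors("carbon dioxide"))
print(graph.neighbors("carbon dioxide", label="mentioned_in"))
```

Storing the label on each edge is what lets a query ask "where is this term mentioned?" separately from "what kind of thing is this term?".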

The DoctorGPT Project: An Exploratory Study

The DoctorGPT project, hosted on GitHub, serves as a practical example of SKGs applied to unstructured data. DoctorGPT uses large language model (LLM) prompting to organize, index, and discuss PDFs and webpages. It uses PyPDF2 and pdf2image for PDF processing, Google Vision for text extraction from images, nltk for text fragment extraction, Weaviate for dense vector search and embedding handling, and FeatureBase Cloud for back-of-the-book indexing and graph traversal of terms and questions.

In the context of the DoctorGPT project, the semantic knowledge graph is used to build connections between key terms, questions about the text, and the text in the document itself. This is a more specialized application of the general concept of semantic knowledge graphs, tailored for prompt building, natural language understanding and information retrieval:

Key terms: These are significant or meaningful words or phrases within the document. For instance, in a document about global warming, key terms might include "greenhouse gases", "carbon dioxide", "climate change", etc.
Questions about the text: These are queries or prompts that can be answered using the information in the document. For instance, a question could be "What are the main causes of global warming?" or "How does carbon dioxide contribute to global warming?"
Text in the document: This is the actual content of the document. It contains all the information and context needed to answer the questions and define the key terms.

In the semantic knowledge graph:

Nodes could represent key terms, sections of text in the document, or even specific questions about the text.
Edges would then represent the relationships between these nodes. For instance, an edge might connect a key term to a section of the document where it's defined, or connect a question to the section of text that answers it.

By structuring the information in this way, the DoctorGPT system can quickly find answers to queries, provide definitions for key terms, or identify the relevant sections of a document based on a given question or term. It provides a structured, searchable, and interactive way to navigate and understand the information in a document.
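To make the term, question, and text relationships concrete, here is a minimal in-memory sketch. It is not DoctorGPT's actual implementation, which stores this data in FeatureBase Cloud and Weaviate; the `DocumentGraph` class, its method names, the fragment ids, and the sample fragment texts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentGraph:
    fragments: dict = field(default_factory=dict)              # fragment id -> text
    term_to_fragments: dict = field(default_factory=dict)      # key term -> set of fragment ids
    question_to_fragments: dict = field(default_factory=dict)  # question -> set of fragment ids

    def add_fragment(self, frag_id, text, terms=(), questions=()):
        """Store a text fragment and link it to its key terms and questions."""
        self.fragments[frag_id] = text
        for term in terms:
            self.term_to_fragments.setdefault(term.lower(), set()).add(frag_id)
        for question in questions:
            self.question_to_fragments.setdefault(question, set()).add(frag_id)

    def fragments_for_term(self, term):
        """Follow term -> fragment edges and return the fragment texts."""
        ids = self.term_to_fragments.get(term.lower(), set())
        return [self.fragments[i] for i in sorted(ids)]

    def questions_for_term(self, term):
        """Traverse term -> fragments -> questions answered by those fragments."""
        ids = self.term_to_fragments.get(term.lower(), set())
        return sorted(
            q for q, frag_ids in self.question_to_fragments.items() if frag_ids & ids
        )


doc = DocumentGraph()
doc.add_fragment(
    "f1",
    "Carbon dioxide traps heat in the atmosphere.",
    terms=["carbon dioxide", "greenhouse gases"],
    questions=["How does carbon dioxide contribute to global warming?"],
)
doc.add_fragment(
    "f2",
    "Burning fossil fuels releases greenhouse gases.",
    terms=["greenhouse gases"],
    questions=["What are the main causes of global warming?"],
)

print(doc.fragments_for_term("greenhouse gases"))
print(doc.questions_for_term("carbon dioxide"))
```

The second lookup is a two-hop traversal: it starts at a key term, follows edges to the fragments that mention it, and then follows edges from those fragments to the questions they answer.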

Install and Example Output

Here's a simplified version of the installation process for DoctorGPT. Please note that the actual installation may be more involved, depending on your system configuration and the services used to process and store data for the project.


# Clone the repository
git clone https://github.com/FeatureBaseDB/DoctorGPT.git

# Change into the project directory
cd DoctorGPT

# Install the required Python packages
pip install -r requirements.txt

# Run the index step.
python index_pdf.py

# Run the keyterm and question extraction.
python index_tandqs.py

# Run the question answerer.
python doc_questions.py

# Run the chat interface.
python doc_chat.py

Now, let's consider an example chat with the DoctorGPT system:


Entering conversation with DoctorGPT.pdf. Use ctrl-C to end interaction.
user-P61W[DoctorGPT.pdf]> Briefly introduce yourself, DoctorGPT. It's OK 
if you pretend to be Doc Brown from Back to the Future.
bot> Querying GPT...
bot> My name is DoctorGPT and I'm an AI agent designed to help you organize 
and manage PDF documents. Just like Doc Brown, I'm here to help you navigate 
the future of PDFs.

In this example, the user is interacting with the DoctorGPT system, asking it to introduce itself. The system processes the request, consults the document, and provides a response. This is a simple example of the kind of interactive dialogue that DoctorGPT can provide.

Leveraging FeatureBase's Cloud Solution

A standout feature of FeatureBase's cloud solution is its innovative use of B-tree indexing to enhance the search and retrieval of vectors. This is particularly impactful when dealing with the vast amounts of unstructured data in data lakes.

The B-tree index is a time-tested data structure that excels in efficiently storing, retrieving, and managing large amounts of data. In the context of FeatureBase's solution, the B-tree index can be used to accelerate the search of vectors, enabling quick and efficient access to relevant data.

The real power of the B-tree index emerges when it's used in conjunction with SKGs to filter the vector space by features extracted from the unstructured data. By doing this, FeatureBase is able to narrow down the search space and significantly reduce the time it takes to find the most relevant vectors. This makes the process of extracting valuable insights from unstructured data faster and more efficient.
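The idea of narrowing the search space before comparing vectors can be sketched as follows. This is not FeatureBase's implementation: a plain Python dictionary of sets stands in for the index, cosine similarity is an assumed ranking measure, and the function names, fragment ids, and vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, required_term, vectors, term_index, top_k=2):
    """Rank only the vectors whose fragment is tagged with `required_term`."""
    # Step 1: the index lookup shrinks the candidate set without touching any vectors.
    candidates = term_index.get(required_term, set())
    # Step 2: similarity is computed only for the surviving candidates.
    scored = [(cosine(query_vec, vectors[fid]), fid) for fid in candidates]
    scored.sort(reverse=True)
    return [fid for _, fid in scored[:top_k]]


# Hypothetical fragment embeddings and a term -> fragment ids index.
vectors = {
    "f1": [0.9, 0.1, 0.0],
    "f2": [0.7, 0.3, 0.1],
    "f3": [0.0, 0.2, 0.9],
    "f4": [0.95, 0.05, 0.0],
}
term_index = {
    "carbon dioxide": {"f1", "f2"},
    "sea level": {"f3", "f4"},
}

print(filtered_search([1.0, 0.0, 0.0], "carbon dioxide", vectors, term_index))
```

In this toy data, fragment `f4` is the closest vector to the query overall, but it is never scored because it is not tagged with the required term. The filter both reduces the work and keeps the results on topic.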

FeatureBase's use of B-tree indexing sets it apart from other vector engines. Those engines use vector-based approaches for data retrieval and support only limited filtering on the vectors searched, whereas FeatureBase's integration of B-tree indexing with vector search offers a level of efficiency that is hard to match. The combination allows FeatureBase to provide a unique, optimized solution for managing and extracting insights from unstructured data.

Facilitating Sophisticated Analysis and Insight Extraction

In summary, Semantic Knowledge Graphs represent a powerful tool for dealing with the challenge of unstructured data in data lakes. By structuring this data and facilitating sophisticated analysis, SKGs unlock the potential of unstructured data and pave the way for a new era of data-driven decision making.

Having explored the potential of Semantic Knowledge Graphs for structuring unstructured data and facilitating sophisticated analysis, we're excited to share that FeatureBase is extending these capabilities by introducing vector support in its OLAP analytic engine. Vector support matters when dealing with large amounts of unstructured data, because vector representations can capture the nuances and complexities of that data in a compact form.

Furthermore, FeatureBase is introducing embedding functions via User Defined Functions (UDFs) in Python. This allows users to apply their own functions to the data, increasing the flexibility and adaptability of the system to cater to specific needs. It essentially allows users to customize their data processing and analysis pipeline, which is a significant advantage when dealing with diverse and complex data.
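As a rough idea of what a user-supplied embedding function might look like, here is a toy example in plain Python. How such a function would be registered as a FeatureBase UDF is not described here, so only the function itself is shown; `hash_embed` and its `dims` parameter are hypothetical, and a real pipeline would more likely call a trained embedding model than use feature hashing.

```python
import hashlib
import math


def hash_embed(text, dims=8):
    """Map text to a fixed-length vector using feature hashing (toy example)."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = digest[0] % dims                        # which dimension this token updates
        sign = 1.0 if digest[1] % 2 == 0 else -1.0       # signed hashing reduces collision bias
        vec[bucket] += sign
    # Normalize to unit length so vectors are comparable; an all-zero vector is returned as-is.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


embedding = hash_embed("carbon dioxide contributes to global warming")
print(len(embedding))
```

The function is deterministic, so the same text always maps to the same vector, which is the minimum property an embedding step needs before its output can be indexed and searched.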

FeatureBase's upcoming releases represent a significant step forward in the company's mission to help businesses unlock the potential of their data. By integrating Semantic Knowledge Graphs, vector support, and customizable UDFs, FeatureBase is not only enabling more sophisticated data analysis but also empowering businesses to extract meaningful insights from their data lakes.

Keep an eye on FeatureBase's updates to see when these exciting features become available!
