
Structured data extraction with LLM using Ollama, locally


Summary

In this blog post, we will show you how to use Ollama to extract structured data with an LLM that you can run locally and deploy on your own cloud or server.

We will use PDF files of Python documentation as an example. You can find the full code here. It's only ~100 lines of Python code, check it out 🤗!

If you like our work, please give CocoIndex on GitHub a star to support us. Thank you so much with a warm coconut hug 🥥🤗.

Ollama installation

Ollama allows you to run LLM models on your local device easily. To get started:

Download and install Ollama. Pull your favorite LLM models with `ollama pull`, for example:

ollama pull llama3.2

1. Define the output

We will extract the following information from the Python documentation as structured data.

So we will define the output data classes as follows. The goal is to extract and populate ModuleInfo.

import dataclasses

import cocoindex

@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    """Information about a method."""
    name: str
    args: cocoindex.typing.List[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    """Information about a class."""
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]
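To make the target shape concrete, here is a standalone sketch of what a filled-in ModuleInfo could look like. It uses plain `dataclasses` and `typing.List` in place of `cocoindex.typing.List` so it runs without CocoIndex installed, and the module contents are made up for illustration:

```python
import dataclasses
from typing import List

@dataclasses.dataclass
class ArgInfo:
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    name: str
    args: List[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: List[ClassInfo]
    methods: List[MethodInfo]

# A hand-written example of the kind of value the LLM is expected to produce:
module = ModuleInfo(
    title="array",
    description="Efficient arrays of numeric values.",
    classes=[ClassInfo(
        name="array",
        description="A sequence of fixed-type numeric values.",
        methods=[MethodInfo(
            name="append",
            args=[ArgInfo(name="x", description="Value to append.")],
            description="Append a new item to the end of the array.",
        )],
    )],
    methods=[],  # no module-level methods in this made-up example
)
```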

2. Define the CocoIndex flow

Let's define a CocoIndex flow to extract the structured data from the markdown files, which is very simple.

First, let's add the Python documents in markdown as a source. We will explain how to convert PDFs to markdown a few sections below.

@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files")) 

    modules_index = data_scope.add_collector()

flow_builder.add_source will create a table with the following sub-fields; see the documentation here.

  • filename (key, type: str): the filename of the file, e.g. dir1/file1.md
  • content (type: str if binary is False, otherwise bytes): the content of the file

Then, let's extract the structured data from the markdown files. It is very easy: you just need to provide an LLM spec, and pass in the desired output type.

CocoIndex provides a built-in function ExtractByLlm that processes data using an LLM. It has built-in support for Ollama, which allows you to run LLM models on your local device easily. You can find the full list of models here. We also support the OpenAI API. You can find the full documentation and instructions here.

    # ...
    with data_scope["documents"].row() as doc:
        doc["module_info"] = doc["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                     api_type=cocoindex.LlmApiType.OLLAMA,
                     # See the full list of models: https://ollama.com/library
                     model="llama3.2"
                ),
                output_type=ModuleInfo,
                instruction="Please extract Python module information from the manual."))
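Structured extraction works by steering the LLM toward a schema derived from your dataclass. As an illustration of that idea (this is not CocoIndex's actual implementation, and all names here are hypothetical), a minimal helper that derives a JSON-schema-like dict from a dataclass could look like:

```python
import dataclasses
import typing

def dataclass_to_schema(cls) -> dict:
    """Derive a minimal JSON-schema-like dict from a dataclass (illustrative only)."""
    props = {}
    for field in dataclasses.fields(cls):
        t = field.type
        if typing.get_origin(t) is list:
            (item,) = typing.get_args(t)
            props[field.name] = {"type": "array", "items": dataclass_to_schema(item)}
        elif dataclasses.is_dataclass(t):
            props[field.name] = dataclass_to_schema(t)
        else:
            # Everything else is treated as a string for this sketch.
            props[field.name] = {"type": "string"}
    return {"type": "object", "properties": props, "required": list(props)}

@dataclasses.dataclass
class ArgInfo:
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    name: str
    args: typing.List[ArgInfo]
    description: str

schema = dataclass_to_schema(MethodInfo)
```

A schema like this, together with the instruction text, is what lets the LLM return output that parses back into the dataclass.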

After the extraction, we just need to cherry-pick what we want and collect it with the collector defined above.

    modules_index.collect(
        filename=doc["filename"],
        module_info=doc["module_info"],
    )

Finally, let's export the extracted data to a table.

    modules_index.export(
        "modules",
        cocoindex.storages.Postgres(table_name="modules_info"),
        primary_key_fields=["filename"],
    )

3. Query and test your index

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

You will see the status of the index updates in the terminal.

Terminal screenshot of the index update

After the index is built, you have a table with the name modules_info. You can query it at any time. For example, connect to Postgres:

psql postgres://cocoindex:cocoindex@localhost/cocoindex

And run a SQL query:

SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
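The -> operator in the query extracts a field from the stored JSON value. If you are not familiar with it, the equivalent access in plain Python over a JSON payload looks like this (the sample row is made up):

```python
import json

# Hypothetical row as stored in the modules_info table.
row = {
    "filename": "array.md",
    "module_info": json.dumps({
        "title": "array",
        "description": "Efficient arrays of numeric values.",
    }),
}

# Equivalent of: SELECT module_info->'title' FROM modules_info
title = json.loads(row["module_info"])["title"]
```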

You can see the structured data extracted from the documents. Below is a screenshot of the extracted module information:

SQL query on the structured data

CocoInsight

CocoInsight is a tool to help you understand your data pipeline and data index. CocoInsight is in early access now (free) 😊, you found us early! A quick 3-minute tutorial about CocoInsight: watch it on YouTube.

1. Run the CocoIndex server

python main.py cocoindex server -c https://cocoindex.io

Then open CocoInsight at https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.

There are two parts in the CocoInsight dashboard:

  • Flows: you can see the flows you defined, and the data they collect.
  • Data: you can see the data in the data index.

On the data side, you can click on any data and scroll down to see the details. In this data extraction example, you can see the data extracted from the markdown files and the resulting structured data presented side by side.

CocoInsight data panel

For example, for a particular module, you can preview its data by clicking on it.

Python module data

Lots of great updates are coming to CocoInsight soon. Stay tuned!

Add summaries to the data

Using CocoIndex as a framework, you can easily add any data transformation (including LLM summarization) and collect it as part of the data index. For example, let's add some simple summaries to each module, such as the number of classes and methods, using a simple Python function.

We will add an LLM-based example later.

1. Define the output

First, let's add the structure we want as part of the output definition.

@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int

2. Define the CocoIndex flow

Next, let's define a custom function to summarize the data. You can see the detailed documentation here.

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )
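Since the custom function is plain Python, you can sanity-check it outside the flow. Here is a standalone sketch with minimal stand-in dataclasses and made-up contents; the @cocoindex.op.function() decorator is dropped so it runs without CocoIndex:

```python
import dataclasses
from typing import List

@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int

# Minimal stand-ins for ClassInfo / MethodInfo / ModuleInfo.
@dataclasses.dataclass
class MethodInfo:
    name: str

@dataclasses.dataclass
class ClassInfo:
    name: str

@dataclasses.dataclass
class ModuleInfo:
    classes: List[ClassInfo]
    methods: List[MethodInfo]

def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

summary = summarize_module(ModuleInfo(
    classes=[ClassInfo("array")],
    methods=[MethodInfo("append"), MethodInfo("pop")],
))
```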

3. Plug the function into the flow

    # ...
    with data_scope["documents"].row() as doc:
      # ... after the extraction
      doc["module_summary"] = doc["module_info"].transform(summarize_module)

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Convert PDF to markdown

Ollama does not support PDF files directly as inputs, so we need to convert them to markdown first.

To do this, we can add a custom function to convert PDFs to markdown. See the full documentation here.

1. Define a function spec

The function spec of a function configures the behavior of a specific kind of function.

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

2. Define an executor class

The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.

This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting the PDF into markdown format.

It is associated with the function spec by spec: PdfToMarkdown.

import tempfile

import cocoindex
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    """Executor for PdfToMarkdown."""

    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text

You may wonder why we want to separate the spec and the executor (instead of using a standalone function) here. The main reason is that there is some heavy preparation work (initializing the converter) that needs to be done before the executor is ready to process real data.
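The same pattern in miniature: expensive setup happens once in prepare(), and __call__ reuses it for every row. This is a toy sketch of the shape of the idea, not the actual CocoIndex API, and all names in it are made up:

```python
class UpperCaseSpec:
    """Spec: configuration for the function (empty in this toy example)."""

class UpperCaseExecutor:
    """Executor: heavy setup once, cheap per-row calls afterwards."""

    def __init__(self, spec: UpperCaseSpec):
        self.spec = spec
        self._table = None

    def prepare(self):
        # Imagine this is expensive (e.g. loading a model); it runs exactly once.
        self._table = str.maketrans("abc", "ABC")

    def __call__(self, content: str) -> str:
        # Per-row work reuses the prepared state.
        return content.translate(self._table)

executor = UpperCaseExecutor(UpperCaseSpec())
executor.prepare()
result = executor("abcabc")
```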

3. Plug it into the flow

    # Note the binary = True for PDF
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True))
    modules_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # plug in your custom function here
        doc["markdown"] = doc["content"].transform(PdfToMarkdown())

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Thank you so much for reading! You can 🌟 star us on GitHub or 👋 join our Discord to get the latest updates!
