Extract structured data locally with LLM using Ollama

Summary
In this blog, we will show you how to use Ollama to extract structured data, running locally, and load it into your own database/server.
We will use Python manual PDFs as an example. You can find the full code here. Only ~100 lines of Python code, check it out!
Please give CocoIndex on GitHub a star to support us if you like our work. Thank you so much with a warm coconut hug.
Ollama installation
Ollama allows you to run LLM models on your local machine easily. To get started:
Download and install Ollama. Pull your favorite LLM models with ollama pull, for example:
ollama pull llama3.2
1. Define the output
We will extract the following information from the Python manual as structured data.
So we will define the output data types as follows. The goal is to extract and populate ModuleInfo.
@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    """Information about a method."""
    name: str
    args: cocoindex.typing.List[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    """Information about a class."""
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]
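To make the target shape concrete, here is a standalone sketch of what a filled-in ModuleInfo could look like. It uses plain dataclasses and typing.List instead of cocoindex.typing.List, omits ClassInfo for brevity, and the example values are made up for illustration:

```python
import dataclasses
from typing import List

@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    """Information about a method."""
    name: str
    args: List[ArgInfo]
    description: str

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module (classes omitted for brevity)."""
    title: str
    description: str
    methods: List[MethodInfo]

# A hand-filled example of the kind of record the LLM is asked to produce:
module = ModuleInfo(
    title="json",
    description="JSON encoder and decoder.",
    methods=[
        MethodInfo(
            name="loads",
            args=[ArgInfo(name="s", description="A JSON string to decode.")],
            description="Deserialize a JSON string to a Python object.",
        )
    ],
)
print(module.methods[0].name)  # loads
```

The extraction step later in the flow fills exactly this structure, one record per document.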
2. Define the CocoIndex flow
Let’s define the CocoIndex flow to extract the structured data from the markdowns, which is fairly simple.
First, let’s add the Python docs in markdown as a source. We will explain how to load PDFs a few sections below.
@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))
    modules_index = data_scope.add_collector()
flow_builder.add_source will create a table with the following sub-fields, see the documentation here.
- filename (key, type: str): the filename of the file, e.g. dir1/file1.md
- content (type: str if binary is False, otherwise bytes): the content of the file
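To see what these source rows look like in practice, here is a minimal, hypothetical stand-in for the LocalFile source (local_file_rows is an illustrative name, not a CocoIndex API) that scans a directory and yields the same filename/content sub-fields:

```python
import os

def local_file_rows(path: str, binary: bool = False):
    """Yield {filename, content} rows, mimicking the LocalFile source fields."""
    for root, _, files in os.walk(path):
        for name in sorted(files):
            full = os.path.join(root, name)
            # filename is relative to the source root, e.g. "dir1/file1.md"
            rel = os.path.relpath(full, path)
            mode = "rb" if binary else "r"
            with open(full, mode) as f:
                yield {"filename": rel, "content": f.read()}
```

Each downstream transformation in the flow operates on rows of exactly this shape.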
Then, let’s extract the structured data from the markdown files. It is super easy: you just need to provide the LLM spec, and pass down the defined output type.
CocoIndex provides ExtractByLlm, which processes the data using LLM. We have native support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models here. We also support the OpenAI API. You can find the full documentation and instructions here.
# ...
with data_scope["documents"].row() as doc:
    doc["module_info"] = doc["content"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OLLAMA,
                # See the full list of models: https://ollama.com/library
                model="llama3.2"
            ),
            output_type=ModuleInfo,
            instruction="Please extract Python module information from the manual."))
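Under the hood, structured extraction with an LLM boils down to prompting the model with the target schema and parsing its reply. The sketch below is a hypothetical simplification (not CocoIndex's actual implementation) showing how an instruction plus a dataclass can be turned into such a prompt:

```python
import dataclasses

@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

def schema_lines(cls):
    """Render a dataclass's fields as a simple indented schema description."""
    lines = [f"{cls.__name__}:"]
    for field in dataclasses.fields(cls):
        tname = field.type.__name__ if hasattr(field.type, "__name__") else str(field.type)
        lines.append(f"  {field.name}: {tname}")
    return lines

def build_extraction_prompt(cls, instruction: str) -> str:
    """Combine the instruction and the target schema into one prompt."""
    schema = "\n".join(schema_lines(cls))
    return f"{instruction}\nReturn JSON matching this schema:\n{schema}"

prompt = build_extraction_prompt(
    ArgInfo, "Please extract Python module information from the manual.")
print(prompt)
```

ExtractByLlm handles the prompting, parsing, and validation against output_type for you; this sketch only illustrates the idea.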
After the extraction, we just need to cherry-pick the fields we like and collect them with the collector defined above.
modules_index.collect(
    filename=doc["filename"],
    module_info=doc["module_info"],
)
Finally, let’s export the extracted data to a table.
modules_index.export(
    "modules",
    cocoindex.storages.Postgres(table_name="modules_info"),
    primary_key_fields=["filename"],
)
3. Query and test your index
Now you are all set!
Run the following commands to set up and update the index.
python main.py cocoindex setup
python main.py cocoindex update
You will see the index update states in the terminal.
After the index is built, you have a table with the name modules_info. You can query it at any time, e.g., start a Postgres shell:
psql postgres://cocoindex:cocoindex@localhost/cocoindex
And run SQL query:
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
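If you don't have the Postgres instance handy, you can sanity-check the JSON access pattern with SQLite's json_extract, which plays the same role as Postgres's -> operator here. This is purely an illustration with made-up data; the real table lives in Postgres:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE modules_info (filename TEXT PRIMARY KEY, module_info TEXT)")
conn.execute(
    "INSERT INTO modules_info VALUES (?, ?)",
    ("json.md", json.dumps({"title": "json", "description": "JSON encoder and decoder."})),
)
# json_extract(module_info, '$.title') ~ module_info->'title' in Postgres
row = conn.execute(
    "SELECT filename, json_extract(module_info, '$.title') FROM modules_info"
).fetchone()
print(row)  # ('json.md', 'json')
```

In Postgres, note that -> returns a JSON value while ->> returns plain text; pick whichever fits your query.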
You can see the structured data extracted from the documents. Below is a screenshot of the extracted module information:
CocoInsight
CocoInsight is a tool to help you understand your data pipeline and data index. CocoInsight is in Early Access now (free). A quick 3-minute tutorial about CocoInsight: watch on YouTube.
1. Run the CocoIndex server
python main.py cocoindex server -c https://cocoindex.io
Then open CocoInsight at https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server, with zero data retention.
There are two parts of the CocoInsight dashboard:
- Flows: you can see the flows you have defined, and the data they collect.
- Data: you can see the data in the data index.
On the data side, you can click on any data and scroll down to see the details. In this data extraction example, you can see the data extracted from the markdown files and the structured data presented in a canonical format.
For example, for a module entry, you can preview the detailed data by clicking on it.
Lots of exciting updates are coming to CocoInsight soon, stay tuned!
Add summarization to the data
Using CocoIndex as a framework, you can easily add any data transformation (including LLM summarization), and collect it as part of the data index. For example, let’s add a simple summary to each module, such as the number of classes and methods, using a simple Python function.
We will add LLM-based examples later.
1. Define the output
First, let’s add the structure we want as part of the output definition.
@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int
2. Define the CocoIndex flow
Next, let’s define a custom function to summarize the data. You can see the detailed documentation here.
@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )
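As a quick sanity check, here is how the summary logic behaves on a minimal input. This is a standalone sketch without the @cocoindex.op.function() decorator, and the list fields are simplified to strings (the real fields hold ClassInfo/MethodInfo records):

```python
import dataclasses
from typing import List

@dataclasses.dataclass
class ModuleInfo:
    """Simplified: classes/methods are plain strings for illustration."""
    classes: List[str]
    methods: List[str]

@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int

def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Count the classes and methods of a module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

summary = summarize_module(ModuleInfo(classes=["Foo"], methods=["bar", "baz"]))
print(summary)  # ModuleSummary(num_classes=1, num_methods=2)
```

Since the function is pure Python, you can unit-test it like this before plugging it into the flow.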
3. Plug in the function into the flow
# ...
with data_scope["documents"].row() as doc:
    # ... after the extraction
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
Now you are all set!
Run the following commands to set up and update the index.
python main.py cocoindex setup
python main.py cocoindex update
Extract from PDF files
Ollama does not support PDF files directly as inputs, so we need to convert them to markdown first.
To do this, we can plug in a custom function to convert PDF to markdown. See the full documentation here.
1. Define the function spec
The function spec configures the behavior of the function.
class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""
2. Define the executor class
The executor class implements the function spec. It is responsible for the actual execution of the function.
This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting the PDF into markdown format.
It is associated with the function spec by spec: PdfToMarkdown.
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    """Executor for PdfToMarkdown."""
    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text
You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is that there is some heavy preparation work (initializing the converter) that needs to be done before the executor is ready to process real data.
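The prepare-once, call-many pattern that the spec + executor split enables can be sketched in plain Python (hypothetical names, no CocoIndex machinery; the "model" here is a trivial stand-in for a heavyweight converter):

```python
class ExpensiveExecutor:
    """Executor that does heavy one-time setup before handling data."""

    def __init__(self):
        self._model = None
        self.prepare_calls = 0

    def prepare(self):
        # Imagine loading a large model or parser here; done exactly once.
        self.prepare_calls += 1
        self._model = lambda text: text.upper()

    def __call__(self, content: str) -> str:
        # Per-item work reuses the prepared state instead of rebuilding it.
        return self._model(content)

executor = ExpensiveExecutor()
executor.prepare()                        # heavy setup, once
results = [executor(c) for c in ["a", "b"]]
print(results)  # ['A', 'B']
```

With a standalone function, the setup cost would be paid on every call (or hidden in module-level globals); the executor class makes the lifecycle explicit.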
3. Plug in the function into the flow
# Note the binary=True for PDF
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="manuals", binary=True))
modules_index = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # plug in your custom function here
    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
Now you are all set!
Run the following commands to set up and update the index.
python main.py cocoindex setup
python main.py cocoindex update
Thank you very much for reading! You can