
Turbocharging Sentiment Analysis: How We Hit 50K RPS with GPU-Powered Microservices

I remember the day our sentiment analysis pipeline finally buckled under a surge of requests. The failure was dramatic: thread pools spiraled, batch jobs stalled, and memory usage climbed out of control. That was when we decided to break free of our monolithic design and rebuild everything from scratch. In this post, I'll walk through how we moved to GPU-powered microservices on Kubernetes, with GPU autoscaling and a real-time ETL pipeline, to handle huge volumes of social data as it arrives.


The Monolith Was Fine, Until It Wasn't

Originally, our sentiment analysis stack was one large codebase that handled data ingestion, tokenization, model inference, logging, and storage. It worked wonderfully, until a traffic spike forced us to over-provision every component. Updates were even worse: the entire application had to be redeployed just to fix a bug in the inference code.

By switching to microservices, we isolated each responsibility:

  1. API gateway: routes requests and handles authentication.
  2. Text cleaning and tokenization: CPU-friendly preprocessing of incoming text.
  3. GPU inference service: runs the actual model inference.
  4. Data storage and logging: persists final sentiment results and errors.
  5. Monitoring: tracks performance with minimal overhead.

We can now scale each piece independently, which lets us attack specific bottlenecks instead of over-provisioning the whole stack; a rough sketch of the gateway's role follows below.
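
To make the split concrete, here is a minimal sketch of what the API gateway does: validate the request, then hand it off to the preprocessing and inference services in turn. This is an illustration rather than our production code, and the service URLs are placeholders.

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

# Hypothetical in-cluster service addresses; substitute your own
PREPROCESS_URL = "http://text-preprocess:8000/clean"
INFERENCE_URL = "http://sentiment-inference:5000/predict"

@app.post("/analyze")
async def analyze(req: Request):
    body = await req.json()
    if "text" not in body:
        raise HTTPException(status_code=400, detail="missing 'text' field")

    async with httpx.AsyncClient(timeout=5.0) as client:
        # Step 1: CPU-bound cleaning and tokenization service
        cleaned = (await client.post(PREPROCESS_URL, json=body)).json()
        # Step 2: GPU-backed inference service
        result = (await client.post(INFERENCE_URL, json=cleaned)).json()

    return result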


Containerizing GPU Inference

Our first big step was containerization. Here's the Dockerfile for the GPU-enabled inference service:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04

WORKDIR /app

# Install Python and system dependencies
RUN apt-get update && \
    apt-get install -y python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --upgrade pip

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

EXPOSE 5000
CMD ["python3", "sentiment_inference.py"]

The base image ships with the CUDA and cuDNN libraries needed for GPU acceleration. Once we build the container and push it to the registry, it's ready for orchestration.
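
Before wiring the image into the cluster, it's worth a quick sanity check that the container actually sees the GPU. A minimal check, assuming PyTorch is among the installed requirements:

# check_gpu.py - run inside the container to confirm GPU visibility
import torch

if torch.cuda.is_available():
    print(f"CUDA OK: {torch.cuda.get_device_name(0)}")
else:
    raise SystemExit("No GPU visible in the container - check the NVIDIA container runtime")

If this fails when run locally, the usual culprits are a missing --gpus all flag or an absent NVIDIA container runtime on the host.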

Kubernetes: GPU Autoscaling in Action

With Kubernetes (K8s), we deploy and scale each microservice. We pin the inference pods to GPU-backed node types and autoscale based on GPU utilization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-inference-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-inference-gpu
  template:
    metadata:
      labels:
        app: sentiment-inference-gpu
    spec:
      nodeSelector:
        kubernetes.io/instance-type: "g4dn.xlarge"
      containers:
      - name: inference-container
        image: myrepo/sentiment-inference:gpu-latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "1"

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-inference-gpu
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"

When average GPU utilization crosses the 70% threshold, Kubernetes scales the deployment out. This mechanism keeps the system responsive under heavy load while avoiding unnecessary costs during quiet periods.
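
One detail the manifest glosses over: nvidia_gpu_utilization is a custom metric, so something has to publish it, typically a GPU metrics exporter scraped by Prometheus and surfaced to the HPA through a metrics adapter. Under the hood the number is simply the device utilization NVML reports. The sketch below shows the idea using pynvml and prometheus_client; the metric name and port are illustrative, not a description of our exact setup.

import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("nvidia_gpu_utilization", "GPU utilization percent", ["gpu"])

def main():
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    start_http_server(9400)  # Prometheus scrapes this port
    while True:
        for i, handle in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            GPU_UTIL.labels(gpu=str(i)).set(util)
        time.sleep(5)

if __name__ == "__main__":
    main()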

Achieving 50K RPS with Batched Inference and Async I/O

Running inference request by request kills performance. We batch multiple requests together to get the most out of the GPU:

import asyncio
from fastapi import FastAPI, Request
from threading import Thread
from queue import Queue
import torch
import tensorrt as trt

app = FastAPI()

REQUEST_QUEUE = Queue(maxsize=10000)
BATCH_SIZE = 32

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
engine_path = "models/sentiment_model.trt"

def load_trt_engine():
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_trt_engine()

def inference_worker():
    while True:
        # Block for the first request, then drain up to BATCH_SIZE without busy-waiting
        batch = [REQUEST_QUEUE.get()]
        while len(batch) < BATCH_SIZE and not REQUEST_QUEUE.empty():
            batch.append(REQUEST_QUEUE.get())

        texts = [item["text"] for item in batch]
        scores = run_tensorrt_inference(engine, texts)  # Runs the whole batch in one GPU pass
        for item, score in zip(batch, scores):
            # Futures belong to the event loop, so hand results back thread-safely
            item["loop"].call_soon_threadsafe(item["future"].set_result, score)

Thread(target=inference_worker, daemon=True).start()

@app.post("/predict")
async def predict(req: Request):
    body = await req.json()
    text = body.get("text", "")
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    REQUEST_QUEUE.put({"text": text, "future": future, "loop": loop})
    result = await future

    return {"sentiment": "positive" if result > 0.5 else "negative"}

This strategy keeps the GPU saturated, leading to dramatic throughput gains.
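
A quick way to see the effect is to fire many concurrent requests rather than one at a time, since sequential calls never fill a batch. A toy load script along these lines (the URL and sample texts are placeholders):

import asyncio
import time

import httpx

URL = "http://localhost:5000/predict"  # placeholder; point at your deployment

async def main(n_requests: int = 1000):
    async with httpx.AsyncClient(timeout=10.0) as client:
        start = time.perf_counter()
        tasks = [client.post(URL, json={"text": f"sample tweet {i}"})
                 for i in range(n_requests)]
        responses = await asyncio.gather(*tasks)
        elapsed = time.perf_counter() - start
    ok = sum(1 for r in responses if r.status_code == 200)
    print(f"{ok}/{n_requests} succeeded in {elapsed:.2f}s (~{n_requests / elapsed:.0f} req/s)")

if __name__ == "__main__":
    asyncio.run(main())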

Real-Time ETL: Kafka, Spark, and Cloud Storage

We also needed to handle high-volume social data. The pipeline uses Kafka for streaming ingestion, Spark for real-time transformation, and Redshift for storage.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("TwitterETLPipeline").getOrCreate()

schema = StructType([
    StructField("tweet_id", StringType()),
    StructField("text", StringType()),
    StructField("user", StringType())
])

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
    .option("subscribe", "tweets") \
    .option("startingOffsets", "latest") \
    .load()

parsed_df = df.select(from_json(col("value").cast("string"), schema).alias("tweet"))

def custom_preprocess(txt):
    return txt.replace("#", "").lower()

udf_preprocess = udf(custom_preprocess, StringType())

clean_df = parsed_df.select(
    col("tweet.tweet_id").alias("id"),
    udf_preprocess(col("tweet.text")).alias("clean_text"),
    col("tweet.user").alias("username")
)

query = clean_df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Spark picks up raw tweets from Kafka, cleans them, and sends them on for storage and logging. Both Kafka and Spark scale out to accommodate millions of tweets per hour.
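
The console sink above is only for demonstration. To land the cleaned stream in cloud storage for the warehouse load, the console output can be swapped for a Parquet sink; the bucket and checkpoint paths below are placeholders, and a separate job (for example a Redshift COPY) would pick up the staged files:

# Replace the console sink with a Parquet sink on object storage (paths are placeholders)
query = clean_df \
    .writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "s3a://my-bucket/clean_tweets/") \
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/clean_tweets/") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()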

Pain Points and Lessons Learned

Early on, we hit confusing out-of-memory problems because our GPU limits didn't match the physical hardware; pods were randomly OOM-killed under load. We also realized that batch size is a balancing act: for internal analytics we want larger batches, but for end-user requests we keep them modest to reduce latency.
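
One common way to square that circle is to flush a batch either when it is full or when the oldest request has waited long enough, whichever comes first. Below is a minimal sketch of such a collection loop; the size and timeout values are illustrative, not our tuned numbers.

import time
from queue import Empty, Queue

def collect_batch(request_queue: Queue, batch_size: int = 32, max_wait_s: float = 0.01):
    # Block until at least one request arrives, then wait at most max_wait_s
    # for the rest of the batch. Small max_wait_s favors user-facing latency;
    # larger values favor throughput for internal bulk analytics.
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch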

Conclusion

By weaving together microservices, GPU acceleration, and streaming-first ETL, we turned our aging monolith into a high-octane pipeline that handles 50K RPS. The new architecture scales with demand while keeping resource waste to a minimum, and the flexible ETL pipeline lets us adapt to growing data volumes in real time. Gone are the days of over-provisioning everything or redeploying the whole application to fix a single inference bug. With a solid container-based approach, each service evolves on its own terms while the stack as a whole stays reliable and ready for the next traffic spike. If your own monolith is starting to strain, now is the time to rebuild the engine around microservices and real-time data.
