
Taming the Titan: Interactive ClickHouse Analytics with DuckDB

20 min read · Oct 31, 2025

Introduction: Taming Big Data in the Notebook

In the landscape of modern data science, a fundamental challenge persists: the chasm between where data lives and where it is analyzed. Production data, often measured in terabytes or petabytes, resides in powerful, distributed data warehouses architected for scale and performance. Conversely, the data scientist’s primary workspace remains the interactive, memory-constrained environment of a Jupyter Notebook. This disconnect creates a significant bottleneck, forcing practitioners into suboptimal workflows. They must either work with small, often stale, data samples that may not represent the full dataset, or they must engage in complex, time-consuming ETL (Extract, Transform, Load) processes to downsize the data before analysis can even begin. This friction stifles creativity, slows down the pace of discovery, and can lead to models built on incomplete or biased information.

This article presents a powerful and elegant architectural pattern designed to bridge this gap. The proposed solution leverages a combination of best-in-class open-source tools to create a seamless, memory-efficient workflow for analyzing massive datasets interactively. At the core of this architecture is the synergy between ClickHouse, a high-performance OLAP (Online Analytical Processing) data warehouse, and DuckDB, an embedded, in-process OLAP database. ClickHouse serves as the robust, scalable repository for the entire dataset. DuckDB, running directly within the Jupyter environment, acts as a high-performance “analytical bridge,” capable of querying subsets of data directly from the ClickHouse server without needing to load the entire dataset into local memory. This pattern combines the raw power of a distributed data warehouse with the interactive, fluid experience of local data analysis.

The journey through this article will be a comprehensive, hands-on tutorial, guiding the reader through the construction of a complete, end-to-end data science project. The roadmap is as follows:

  1. First, a reproducible data science environment will be built using Docker and Docker Compose, containerizing the ClickHouse server and a JupyterLab instance.
  2. Next, a standalone Python script will be developed to generate a massive, realistic e-commerce dataset, which will then be ingested into ClickHouse.
  3. The core technique of using DuckDB to query the remote ClickHouse instance will be detailed, including solving practical challenges like authentication.
  4. Finally, to demonstrate the real-world applicability of this workflow, three distinct machine learning models will be developed to solve practical business problems: a sales prediction model to optimize discount strategies, a sentiment analysis model for customer feedback, and an anomaly detection model to prevent stockouts.

This article is more than a tutorial on individual tools; it is a blueprint for a modern, scalable, and efficient data science workflow. By mastering this pattern, data professionals can unlock the ability to interact with massive datasets fluidly, accelerating the cycle from data to insight and from insight to business value.

Part 1: Building a Reproducible Data Science Environment with Docker

A robust and reproducible environment is the bedrock of any serious data project. Using containerization technology, specifically Docker, ensures that the entire development stack — from the database to the analytical tools — is consistent, portable, and isolated from the host system. This section details the system architecture and provides a complete guide to setting it up using Docker Compose.

The System Architecture: A Tale of Two OLAP Engines

The architecture for this project is built upon the complementary strengths of two distinct OLAP databases: ClickHouse and DuckDB. Understanding their respective roles is key to appreciating the efficiency of the overall workflow.

An architectural diagram would show the host machine running the Docker daemon. Inside Docker, two containers operate on a shared virtual network: a clickhouse-server container and a jupyter-lab container. The data scientist interacts with the JupyterLab container via a web browser. From within a Jupyter Notebook, Python code using the DuckDB library establishes an HTTP connection to the ClickHouse container to retrieve data for analysis.

ClickHouse: The Heavy Lifter

ClickHouse is a distributed, columnar database management system designed for large-scale analytical workloads across multiple servers. Its column-oriented storage model allows for highly efficient data compression and extremely fast execution of aggregation queries, making it an ideal choice for the primary data warehouse. In this project, ClickHouse acts as the “heavy lifter” — the central, persistent repository for our entire multi-million-row e-commerce dataset. Its server-based architecture is designed to handle high-concurrency queries and massive data volumes that far exceed the capacity of a single machine.

DuckDB: The Interactive Specialist

DuckDB, in contrast, is an in-process, embedded OLAP database. It runs as a library within an application process (in this case, the Python kernel of our Jupyter Notebook) and requires no separate server installation or management. Its primary strength lies in providing extremely fast analytical query performance on datasets that can fit on a single machine. It can directly query data from various sources, including local files (like CSV or Parquet) and in-memory objects such as Pandas DataFrames, using a rich SQL dialect. In our architecture, DuckDB serves as the “interactive specialist” or “analytical scalpel,” empowering the data scientist to perform complex, ad-hoc analysis on targeted subsets of data retrieved from ClickHouse.
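As a quick illustration of this in-process model (a minimal, self-contained sketch that is not part of the project code; the DataFrame here is made up), DuckDB can run SQL directly against a Pandas DataFrame that already lives in the Python session:

import duckdb
import pandas as pd

# A tiny in-memory DataFrame standing in for a retrieved subset of data
orders = pd.DataFrame({
    'product_id': [1, 2, 1, 3],
    'quantity': [2, 1, 5, 3],
})

# DuckDB's replacement scan lets SQL reference the DataFrame by its Python variable name
result = duckdb.sql("""
    SELECT product_id, SUM(quantity) AS total_quantity
    FROM orders
    GROUP BY product_id
    ORDER BY total_quantity DESC
""").df()
print(result)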

The synergy between these two systems is the cornerstone of this workflow. It is not a matter of choosing one over the other, but of leveraging each for what it does best. ClickHouse provides the industrial-strength storage and scalability for the entire dataset, while DuckDB delivers the nimble, high-performance, interactive query layer needed for exploratory data analysis and model development. This avoids the common anti-pattern of attempting to use a massive, distributed system for small, interactive tasks or trying to shoehorn an entire enterprise dataset into an embedded, single-node tool.

Orchestration with docker-compose.yml

Docker Compose is a tool for defining and running multi-container Docker applications. A single docker-compose.yml file is used to configure all of the application's services, networks, and volumes, allowing the entire stack to be brought up or down with a single command.

The following is the complete, annotated docker-compose.yml file for this project.

services:
  clickhouse-server:
    image: clickhouse/clickhouse-server:latest
    container_name: clickhouse-server
    ports:
      - "8123:8123" # HTTP interface
      - "9000:9000" # Native TCP interface
    environment:
      - CLICKHOUSE_DB=ecommerce
      - CLICKHOUSE_USER=datasci
      - CLICKHOUSE_PASSWORD=password
    volumes:
      - clickhouse_data:/var/lib/clickhouse/
      - clickhouse_logs:/var/log/clickhouse-server/
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8123/ping"]
      interval: 30s
      timeout: 10s
      retries: 3

  jupyter-lab:
    image: jupyter/scipy-notebook:latest
    container_name: jupyter-lab
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    depends_on:
      clickhouse-server:
        condition: service_healthy

volumes:
  clickhouse_data:
  clickhouse_logs:

ClickHouse Service Definition:

  • image: clickhouse/clickhouse-server:latest: This line pulls the latest official ClickHouse server image from Docker Hub, ensuring a stable and well-maintained base.
  • ports: Two ports are exposed. Port 8123 is the HTTP interface, which is essential for DuckDB's chsql extension to communicate with ClickHouse (in practice, that extension ran into serious issues here because of the non-default database name). Port 9000 is the native TCP interface, useful for connecting with other clients such as the clickhouse-client CLI or the official Python driver.
  • environment: These variables are used by the entrypoint script within the Docker image to initialize the database. On the first run, it will create a database named ecommerce and a user named datasci with the specified password. This is a clean and effective way to bootstrap the database configuration.
  • volumes: Two named volumes, clickhouse_data and clickhouse_logs, are defined. These instruct Docker to create persistent storage locations on the host machine for the ClickHouse data and log directories. This is critical because it ensures that all ingested data and server logs are preserved even if the container is stopped, removed, and recreated.
  • ulimits: The nofile limit (number of open file descriptors) is increased as recommended by the official ClickHouse documentation for optimal performance under heavy load.
  • healthcheck: This configuration tells Docker to periodically check if the ClickHouse server is healthy by pinging its HTTP endpoint. This is used in conjunction with the depends_on clause in the Jupyter service.

JupyterLab Service Definition:

  • image: jupyter/scipy-notebook:latest: This uses a popular base image from the Jupyter project that comes pre-installed with common data science libraries like NumPy, SciPy, Pandas, and Scikit-learn.
  • volumes: This line mounts the local ./notebooks directory on the host machine to the /home/jovyan/work directory inside the container. This allows the data scientist to create, edit, and save notebooks and other files, with the work being persisted directly on their local filesystem.
  • ports: The container's port 8888 (the default for Jupyter) is mapped to the host's port 8888, making the JupyterLab interface accessible in a web browser at http://localhost:8888.
  • depends_on: This clause ensures that the jupyter-lab container will only start after the clickhouse-server container is running and has passed its health check. This prevents connection errors that could occur if the notebook environment came up before the database was ready to accept connections.

A crucial aspect of this setup is the networking managed by Docker Compose. When these services are launched, Docker Compose automatically creates a private virtual network and connects both containers to it. Within this network, containers can communicate with each other using their service names as hostnames. Therefore, when connecting from the Jupyter notebook to ClickHouse, the correct hostname to use is clickhouse-server, not localhost. This concept of service discovery is fundamental to multi-service containerized applications and is a key piece of knowledge for avoiding common connectivity issues.

To launch the entire environment, one simply navigates to the directory containing the docker-compose.yml file and runs the command docker-compose up -d.
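Once the stack is running, a quick sanity check from a notebook cell confirms both that ClickHouse is up and that service-name resolution works. This is a minimal sketch, assuming the requests library is available in the jupyter/scipy-notebook image:

import requests

# Inside the Compose network, ClickHouse is reachable by its service name, not localhost.
# The /ping endpoint of the HTTP interface returns "Ok." when the server is healthy.
response = requests.get("http://clickhouse-server:8123/ping", timeout=5)
print(response.status_code, response.text.strip())  # Expected: 200 Ok.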

Part 2: Forging the Data: A Large-Scale E-commerce Dataset

With the infrastructure in place, the next step is to populate it with data. A realistic, large-scale dataset is essential for demonstrating the power of the analytical workflow and for training meaningful machine learning models. This section covers the design of the database schema in ClickHouse and provides a detailed walkthrough of a Python script designed to generate and ingest millions of relational records.

Database Schema Design in ClickHouse

The schema is designed to model a typical e-commerce business with sales transactions, recurring subscriptions, and customer feedback from social media. The tables are created using ClickHouse’s native DDL syntax, specifying the powerful MergeTree engine, which is the standard choice for most analytical tasks due to its performance characteristics, data sorting capabilities, and support for primary keys.

The following SQL statements define the three core tables in the ecommerce database.

-- sales_schema.sql
-- Table for individual sales transactions
CREATE TABLE ecommerce.sales (
    sale_id UUID,
    product_id UInt32,
    customer_id UInt32,
    quantity UInt16,
    price_per_unit Float64,
    discount Float32,
    sale_timestamp DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(sale_timestamp)
ORDER BY (product_id, sale_timestamp);

-- Table for customer subscriptions
CREATE TABLE ecommerce.subscriptions (
    subscription_id UUID,
    customer_id UInt32,
    plan_type String,
    start_date Date,
    end_date Date,
    monthly_price Float64
) ENGINE = MergeTree()
ORDER BY (customer_id, start_date);

-- Table for social media feedback
CREATE TABLE ecommerce.social_media_feedback (
    feedback_id UUID,
    customer_id UInt32,
    product_id Nullable(UInt32),
    feedback_text String,
    platform String,
    feedback_timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (feedback_timestamp);

For the sales table, partitioning by month (toYYYYMM(sale_timestamp)) is a common optimization in ClickHouse. It physically groups data on disk by month, which can dramatically speed up queries that filter on a specific time range. The ORDER BY clause defines the primary key, which physically sorts the data within each partition. A well-chosen primary key can act as a "super-index," enabling extremely fast data retrieval for queries that filter or aggregate on those key columns.
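To see the partition pruning in action, a time-bounded aggregation such as the one below only needs to touch the monthly partitions inside the requested range. This is a hedged sketch: it assumes a clickhouse_connect client configured like the one used later in this article, and the date range is purely illustrative.

# Assumes ch_client is a clickhouse_connect client pointed at the ecommerce database
monthly_query = """
SELECT toYYYYMM(sale_timestamp) AS month, SUM(quantity) AS units
FROM sales
WHERE sale_timestamp >= '2024-01-01' AND sale_timestamp < '2024-04-01'
GROUP BY month
ORDER BY month
"""
# ClickHouse can skip every partition outside 2024-01..2024-03 thanks to PARTITION BY toYYYYMM
print(ch_client.query(monthly_query).result_rows)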

The Data Generation Engine: generate_data.py

This standalone Python script is responsible for creating a large volume of synthetic but realistic data and inserting it into the ClickHouse tables. It uses the Faker library to generate plausible data points and the clickhouse-connect library for efficient communication with the database.

The script’s logic is carefully designed to maintain relational integrity. A naive approach of generating random IDs for each transaction would result in a disconnected dataset. Instead, this script first generates a fixed pool of primary entities (customers and products) and then samples from these pools when creating transactional records (sales and feedback). This ensures that every transaction is linked to a valid customer and product.

A critical aspect of the script is its focus on performance. Ingesting millions of rows into a database one by one is extremely inefficient due to the high overhead of network latency and individual transaction processing. The correct approach is to perform bulk insertions. The script accumulates generated records into large batches (e.g., 100,000 rows at a time) in memory. It then uses the client.insert() method from the clickhouse-connect library, which is highly optimized for sending large amounts of data to ClickHouse in a single, efficient operation. This batching strategy is a fundamental data engineering principle that reduces database load and dramatically decreases the total time required for data ingestion.

Below is the complete, runnable generate_data.py script.

# generate_data.py
import os
import uuid
import random
import pandas as pd
from faker import Faker
from datetime import datetime, timedelta
import clickhouse_connect

# --- Configuration ---
CLICKHOUSE_HOST = os.getenv('CLICKHOUSE_HOST', 'localhost')
CLICKHOUSE_PORT = int(os.getenv('CLICKHOUSE_PORT', 8123))
CLICKHOUSE_USER = os.getenv('CLICKHOUSE_USER', 'datasci')
CLICKHOUSE_PASSWORD = os.getenv('CLICKHOUSE_PASSWORD', 'password')
CLICKHOUSE_DB = os.getenv('CLICKHOUSE_DB', 'ecommerce')
NUM_CUSTOMERS = 10_000
NUM_PRODUCTS = 1_000
NUM_SALES = 5_000_000
NUM_SUBSCRIPTIONS = 15_000
NUM_FEEDBACKS = 50_000
BATCH_SIZE = 100_000

# --- Initialize Faker and Database Connection ---
fake = Faker()
client = clickhouse_connect.get_client(
    host=CLICKHOUSE_HOST,
    port=CLICKHOUSE_PORT,
    user=CLICKHOUSE_USER,
    password=CLICKHOUSE_PASSWORD,
    database=CLICKHOUSE_DB
)
print("Successfully connected to ClickHouse.")


# --- Helper Functions ---
def create_tables():
    with open('sales_schema.sql', 'r') as f:
        schema_sql = f.read()
    for statement in schema_sql.split(';'):
        if statement.strip():
            client.command(statement)
    print("Tables created or already exist.")


def generate_customers(n):
    return [{'customer_id': i} for i in range(1, n + 1)]


def generate_products(n):
    return [{'product_id': i, 'base_price': round(random.uniform(5.0, 500.0), 2)} for i in range(1, n + 1)]


# --- Main Data Generation Logic ---
def generate_and_insert_data():
    print("Generating primary entities...")
    customers = generate_customers(NUM_CUSTOMERS)
    products = generate_products(NUM_PRODUCTS)
    customer_ids = [c['customer_id'] for c in customers]
    product_info = {p['product_id']: p['base_price'] for p in products}
    product_ids = list(product_info.keys())

    # Generate Sales Data in Batches
    print(f"Generating and inserting {NUM_SALES} sales records...")
    sales_data = []
    # Skew purchases towards popular (lower ID) products
    sales_weights = [1 / pid for pid in product_ids]
    for i in range(NUM_SALES):
        product_id = random.choices(product_ids, weights=sales_weights, k=1)[0]
        base_price = product_info[product_id]
        quantity = random.randint(1, 5)
        discount = random.choices([0.0, 0.1, 0.15, 0.2], weights=[0.7, 0.15, 0.1, 0.05], k=1)[0]

        sales_data.append([
            str(uuid.uuid4()),
            product_id,
            random.choice(customer_ids),
            quantity,
            round(base_price * (1 - discount), 2),
            discount,
            fake.date_time_between(start_date='-2y', end_date='now')
        ])
        if (i + 1) % BATCH_SIZE == 0:
            client.insert('sales', sales_data, column_names=[
                'sale_id', 'product_id', 'customer_id', 'quantity',
                'price_per_unit', 'discount', 'sale_timestamp'
            ])
            sales_data = []
            print(f"  Inserted batch {(i + 1) // BATCH_SIZE} of sales.")

    if sales_data:
        client.insert('sales', sales_data, column_names=[
            'sale_id', 'product_id', 'customer_id', 'quantity',
            'price_per_unit', 'discount', 'sale_timestamp'
        ])
    print("Sales data insertion complete.")

    # Generate Subscriptions Data
    print(f"Generating and inserting {NUM_SUBSCRIPTIONS} subscription records...")
    subscriptions_data = []
    for _ in range(NUM_SUBSCRIPTIONS):
        start_date = fake.date_between(start_date='-3y', end_date='-1m')
        subscriptions_data.append([
            str(uuid.uuid4()),
            random.choice(customer_ids),
            random.choice(['Basic', 'Standard', 'Premium']),
            start_date,
            start_date + timedelta(days=365),
            round(random.uniform(9.99, 99.99), 2)
        ])
    client.insert('subscriptions', subscriptions_data, column_names=[
        'subscription_id', 'customer_id', 'plan_type', 'start_date', 'end_date', 'monthly_price'
    ])
    print("Subscriptions data insertion complete.")

    # Generate Social Media Feedback
    print(f"Generating and inserting {NUM_FEEDBACKS} feedback records...")
    feedback_data = []
    platforms = ['Twitter', 'Facebook', 'Instagram', 'Reddit']
    for i in range(NUM_FEEDBACKS):
        feedback_data.append([
            str(uuid.uuid4()),
            random.choice(customer_ids),
            random.choice(product_ids) if random.random() < 0.8 else None,
            fake.sentence(nb_words=15),
            random.choice(platforms),
            fake.date_time_between(start_date='-2y', end_date='now')
        ])
        if (i + 1) % BATCH_SIZE == 0:
            client.insert('social_media_feedback', feedback_data, column_names=[
                'feedback_id', 'customer_id', 'product_id', 'feedback_text', 'platform', 'feedback_timestamp'
            ])
            feedback_data = []
            print(f"  Inserted batch {(i + 1) // BATCH_SIZE} of feedback.")
    if feedback_data:
        client.insert('social_media_feedback', feedback_data, column_names=[
            'feedback_id', 'customer_id', 'product_id', 'feedback_text', 'platform', 'feedback_timestamp'
        ])
    print("Feedback data insertion complete.")


if __name__ == "__main__":
    create_tables()
    generate_and_insert_data()
    print("\nData generation and ingestion finished successfully!")

Part 3: The Core Pattern: The “Extract, Store, Analyze” Workflow

This section details the central and most innovative component of the workflow: using the ClickHouse Python connector to extract aggregated data and DuckDB as a high-performance local engine for analysis.

This pattern avoids common pitfalls of trying to connect an in-memory tool directly to a remote petabyte-scale warehouse, which can lead to connection unreliability and authentication challenges. Instead, we use a robust, explicit three-step process: Extract, Store, and Analyze.

  1. Extract: Run a heavy aggregation query directly in ClickHouse using the official clickhouse-connect Python library. This leverages ClickHouse’s power to process billions of rows and return only the small, aggregated result set.
  2. Store: The JSON/dictionary result from ClickHouse is loaded into a local DuckDB database. This step is “schema-on-read” and extremely fast, and it persists the intermediate result. (It would be faster still if DuckDB could connect to ClickHouse directly, but the connector-based route proved more reliable in this setup.)
  3. Analyze: All further analysis (joins, filtering, feature engineering) is performed on the local DuckDB instance, which provides a full-featured SQL environment with near-instantaneous performance on the pre-aggregated data.

This approach gives the data scientist the best of both worlds: the massive scale of ClickHouse and the interactive, in-memory speed of DuckDB.

The “Extract, Store, Analyze” Pattern in Practice

Here is the code demonstrating this pattern. We first connect to both services: ClickHouse (running on its Docker service name clickhouse-server) and a local DuckDB database file.

import duckdb
import pandas as pd
import clickhouse_connect
import json

# 1. Connect to ClickHouse (the "Extractor")
ch_client = clickhouse_connect.get_client(
    host='clickhouse-server',
    port=8123,
    user='datasci',
    password='password',
    database='ecommerce'
)

# 2. Connect to DuckDB (the local "Analyzer")
con = duckdb.connect('local_analytics.db')

# Example: Aggregate sales by product
# Step 1: EXTRACT - Run heavy query on ClickHouse
clickhouse_query = "SELECT product_id, SUM(quantity) AS total_quantity FROM sales GROUP BY product_id"
result = ch_client.query(clickhouse_query).result_rows
json_data = [dict(zip(['product_id', 'total_quantity'], row)) for row in result]
df_json = pd.DataFrame(json_data)

# Step 2: STORE - Load aggregated data into DuckDB
con.execute("CREATE OR REPLACE TABLE sales_agg AS SELECT * FROM df_json")

# Step 3: ANALYZE - Query the local table in DuckDB
df = con.execute("SELECT * FROM sales_agg WHERE total_quantity > 100").df()
print(df.head())

Advantages of this pattern:

  • Reliability: It uses the official, stable ClickHouse Python connector, avoiding any third-party extension issues.
  • Performance: Heavy lifting (billions of rows) is done by ClickHouse. Light, interactive analysis (thousands of rows) is done by DuckDB.
  • Reproducibility: The local_analytics.db file can be saved, versioned, and shared, making the analysis fully reproducible without re-querying the production warehouse (see the sketch below).
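As a small illustration of that reproducibility point, the saved DuckDB file can be reopened in a completely fresh session without touching ClickHouse at all. A minimal sketch, assuming local_analytics.db and the sales_agg table created above already exist:

import duckdb

# Reopen the persisted analytical database from an earlier session (read-only access is enough here)
con = duckdb.connect('local_analytics.db', read_only=True)
print(con.execute("SELECT COUNT(*) FROM sales_agg").fetchone())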

Part 4: From Data to Decisions: Three Applied Machine Learning Models

With the data pipeline established, this section demonstrates its practical utility by building three distinct machine learning models to solve real-world business problems. Each use case follows a complete workflow: defining the business objective, retrieving the necessary data using the ClickHouse-DuckDB pattern, performing feature engineering, training a model, and, most importantly, interpreting the results to derive actionable business decisions.

Important: because the dataset is synthetically generated, your results will differ depending on how much data you generate. You can generate more data or pick different IDs if you want different outcomes (for example, more anomalies).

Use Case 1: Optimizing Discount Strategy with Sales Prediction (Regression)

Business Goal: To predict daily sales for a top-selling product. The objective is to identify future periods of low demand, which represent opportunities to apply a targeted discount to stimulate sales and maximize overall revenue, rather than applying broad, inefficient discounts.

Data Retrieval: The first step is to retrieve the necessary historical data. A query is constructed to aggregate the total quantity sold per day for a specific product (product_id = 42) over the past two years. This aggregation is performed efficiently by ClickHouse, and only the resulting time series is pulled into the notebook and stored in the local DuckDB database.

# Retrieve daily sales data for a single product
product_id_to_forecast = 42 # You can change to any existing ID
daily_sales_query = f"""SELECT toDate(sale_timestamp) AS sale_date, SUM(quantity) AS daily_quantity FROM sales WHERE product_id = {product_id_to_forecast} GROUP BY sale_date ORDER BY sale_date"""
result = ch_client.query(daily_sales_query).result_rows
json_data = [dict(zip(['sale_date', 'daily_quantity'], row)) for row in result]
df_json_2 = pd.DataFrame(json_data)
con.execute("CREATE TABLE IF NOT EXISTS daily_sales AS SELECT * FROM df_json_2")
daily_sales_df = con.execute("SELECT * FROM daily_sales").df()
daily_sales_df['sale_date'] = pd.to_datetime(daily_sales_df['sale_date'])
daily_sales_df.set_index('sale_date', inplace=True)
daily_sales_df = daily_sales_df.asfreq('D', fill_value=0)

Feature Engineering: Raw time-series data is often insufficient for predictive models. New features are engineered from the date index to capture temporal patterns like seasonality and weekly cycles. Lag features (sales from previous days) are also created, as past sales are often a strong predictor of future sales.

# Create time-based features
df = daily_sales_df.copy()
df['day_of_week'] = df.index.dayofweek
df['day_of_year'] = df.index.dayofyear
df['month'] = df.index.month
df['year'] = df.index.year
df['week_of_year'] = df.index.isocalendar().week.astype(int)

# Create lag features
for i in range(1, 8):
    df[f'lag_{i}'] = df['daily_quantity'].shift(i)

df.dropna(inplace=True)

Model Selection: A RandomForestRegressor from the scikit-learn library is chosen for this task. Random Forests are robust, handle non-linear relationships well, and are less prone to overfitting than single decision trees, making them a strong choice for many tabular regression problems.

Training and Evaluation: The data is split into a training set (older data) and a testing set (more recent data) to simulate a real-world forecasting scenario. The model is trained on the training set and evaluated on its ability to predict sales on the unseen test set.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

X = df.drop('daily_quantity', axis=1)
y = df['daily_quantity']

# Split data chronologically
split_point = int(len(df) * 0.8)
X_train, X_test = X[:split_point], X[split_point:]
y_train, y_test = y[:split_point], y[split_point:]

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42, min_samples_leaf=5)
model.fit(X_train, y_train)

# Make predictions and evaluate
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R2): {r2:.2f}")

The true value of this model lies not in its accuracy metrics but in its application to business strategy. A raw prediction is just a number; an actionable insight is the goal. The model’s predictions can be used to create a “Discount Opportunity” report. By comparing predicted sales against a historical average, the business can identify days where demand is expected to be low and proactively offer a discount. This transforms the machine learning output into a concrete, revenue-generating action. For example, a final report might state:

“For Product 42, predicted sales on upcoming Tuesdays are 40% below the weekly average. A targeted 10% discount on these days is projected to increase sales volume by 25%, leading to a net revenue increase of 12.5%.”
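A minimal sketch of how such a report could be assembled from the model output. It only assumes the X_test, y_train, and predictions objects from the training cell above; the 75% threshold and the quoted uplift figures are illustrative choices, not outputs of this code.

# Flag predicted low-demand days: forecasts well below the historical (training-period) average
report = X_test.copy()
report['predicted_quantity'] = predictions
historical_avg = y_train.mean()
threshold = 0.75 * historical_avg  # arbitrary cut-off for "low demand"

discount_opportunities = report[report['predicted_quantity'] < threshold]
print(f"Historical daily average: {historical_avg:.1f} units")
print(f"Days flagged as discount opportunities: {len(discount_opportunities)}")
print(discount_opportunities[['day_of_week', 'predicted_quantity']].head())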

Use Case 2: Gauging Public Opinion with Feedback Classification (NLP)

Business Goal: To automatically classify unstructured customer feedback from social media into ‘Positive’, ‘Negative’, and ‘Neutral’ sentiment categories. This allows the business to monitor brand perception in near real-time, quickly identify widespread product issues, and celebrate areas of customer satisfaction without manually reading thousands of comments.

Data Retrieval: For this Natural Language Processing (NLP) task, a sample of 50,000 feedback comments is retrieved from the social_media_feedback table. Since the raw text is needed, no server-side aggregation is performed.

# Retrieve 50,000 feedback comments
feedback_query = "SELECT feedback_text FROM social_media_feedback LIMIT 50000"
result = ch_client.query(feedback_query).result_rows
json_data = [dict(zip(['feedback_text'], row)) for row in result]
df_json_3 = pd.DataFrame(json_data)
con.execute("CREATE TABLE IF NOT EXISTS feedback AS SELECT * FROM df_json_3")
feedback_df = con.execute("SELECT * FROM feedback").df()

# In a real-world project, this step would involve expensive, time-consuming manual labeling
# to create a "ground truth" dataset.
# To simulate this for the tutorial, we will use the TextBlob library to generate
# "pseudo-labels" for our feedback.
# The goal is to then train our own, more efficient LogisticRegression model
# on these labels, creating a custom classifier tailored to our data.
from textblob import TextBlob

def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return 'Positive'
    elif polarity < -0.1:
        return 'Negative'
    else:
        return 'Neutral'

feedback_df['sentiment'] = feedback_df['feedback_text'].apply(get_sentiment)

Text Preprocessing and Feature Extraction: Text data must be cleaned and converted into a numerical format that machine learning models can understand. A standard NLP pipeline is applied: converting text to lowercase, removing punctuation and common “stop words” (like ‘the’, ‘a’, ‘is’), and then using a TfidfVectorizer. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It effectively highlights words that are important for distinguishing between different categories of text.

from sklearn.feature_extraction.text import TfidfVectorizer
import re

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

feedback_df['clean_text'] = feedback_df['feedback_text'].apply(preprocess_text)
tfidf = TfidfVectorizer(max_features=2000, stop_words='english')
X_text = tfidf.fit_transform(feedback_df['clean_text'])
y_sentiment = feedback_df['sentiment']

Model Selection: LogisticRegression is an excellent choice for a baseline text classification model. It is computationally efficient, performs well, and, crucially, its results are highly interpretable.

Training and Evaluation: The model is trained on the TF-IDF matrix and evaluated using a classification report, which provides key metrics like precision, recall, and F1-score for each sentiment class.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X_text, y_sentiment, test_size=0.2, random_state=42)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
print(classification_report(y_test, predictions))

The power of using an interpretable model like Logistic Regression is the ability to go beyond simple classification and understand why a piece of text was classified a certain way. By examining the model’s coefficients, one can identify the words that are most strongly associated with each sentiment category. This provides direct, actionable feedback to product and marketing teams. For instance, an analysis might reveal: “Negative feedback is strongly driven by terms like ‘battery’, ‘slow’, ‘crashes’, and ‘update’. Positive feedback is correlated with ‘camera’, ‘design’, ‘speed’, and ‘screen’.” This insight is far more valuable than a simple sentiment score.
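A sketch of that coefficient inspection, assuming the tfidf and log_reg objects trained above and a scikit-learn version recent enough to expose get_feature_names_out (with three sentiment classes, coef_ holds one row of weights per class):

import numpy as np

feature_names = np.array(tfidf.get_feature_names_out())
# The largest positive weights in a class's row are the words that push a comment toward that class
for class_index, class_label in enumerate(log_reg.classes_):
    weights = log_reg.coef_[class_index]
    top_words = feature_names[np.argsort(weights)[-10:][::-1]]
    print(f"{class_label}: {', '.join(top_words)}")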

Use Case 3: Preventing Stockouts with Anomaly Detection

Business Goal: To proactively identify products that are experiencing an unusually high sales velocity. Such anomalies can signal a viral trend, a successful marketing campaign, or a competitor’s failure, and they require an immediate response from the supply chain to increase stock and prevent a costly out-of-stock event.

Data Retrieval & Reshaping: The process begins by retrieving daily sales data for all products, similar to the regression use case. This time, however, the data is reshaped using a Pandas pivot_table to create a time series of daily sales quantities for each individual product.

# Retrieve daily sales for all products
all_sales_query = "SELECT toDate(sale_timestamp) AS sale_date, product_id, SUM(quantity) AS daily_quantity FROM sales GROUP BY sale_date, product_id ORDER BY sale_date"
result = ch_client.query(all_sales_query).result_rows
json_data = [dict(zip(['sale_date', 'product_id', 'daily_quantity'], row)) for row in result]
df_json_4 = pd.DataFrame(json_data)
con.execute("CREATE TABLE IF NOT EXISTS all_sales AS SELECT * FROM df_json_4")
all_sales_df = con.execute("SELECT * FROM all_sales").df()
all_sales_df['sale_date'] = pd.to_datetime(all_sales_df['sale_date'])
sales_pivot = all_sales_df.pivot_table(index='sale_date', columns='product_id', values='daily_quantity').fillna(0)

Model Selection: For this task, an unsupervised anomaly detection algorithm, IsolationForest, is employed. This model works by building a forest of random trees. The core idea is that anomalies are "few and different," which means they should be easier to "isolate" from the rest of the data points. Therefore, anomalous points will have a shorter average path length in the trees of the forest. It is an effective method for identifying outliers in a dataset without requiring pre-labeled examples of what constitutes an anomaly.

Training and Prediction: An IsolationForest model is trained on the time series of a single product's sales. The model learns the profile of "normal" sales fluctuations for that specific product. It is then used to predict which days represent anomalies. The model assigns a score to each data point and classifies them as either an inlier (1) or an outlier/anomaly (-1).

from sklearn.ensemble import IsolationForest

# Focus on a single product's time series
product_id = 42 # Example product ID
product_ts = sales_pivot[[product_id]].copy()
product_ts = product_ts.rename(columns={product_id: 'daily_quantity'})
# Train the anomaly detection model
iso_forest = IsolationForest(contamination=0.01, random_state=42) # Assume 1% of data are anomalies
iso_forest.fit(product_ts[['daily_quantity']])
# Predict anomalies
product_ts['anomaly'] = iso_forest.predict(product_ts[['daily_quantity']])
anomalies = product_ts[product_ts['anomaly'] == -1]
print("Detected Anomalous Sales Days:")
print(anomalies)

The final step is to translate these detected anomalies into a clear, actionable format. A simple list of dates is not enough. A powerful approach is to visualize the time series, highlighting the anomalous points in a different color. This provides immediate visual confirmation of the sales spikes. Furthermore, the output can be formatted into an “Alerts Log” DataFrame. This log would contain the product ID, the date of the anomaly, the sales quantity on that day, the product’s historical average sales, and the anomaly score. This simulates a real-world monitoring system and provides the supply chain team with a clear, prioritized list of products that require immediate attention to prevent stockouts.
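One possible shape for that output, as a hedged sketch: it reuses the product_ts, anomalies, and iso_forest objects from the cell above, and the columns of the alerts log are an illustrative choice rather than a fixed format.

import matplotlib.pyplot as plt

# Visual confirmation: plot the full series and highlight the anomalous days in red
plt.figure(figsize=(12, 4))
plt.plot(product_ts.index, product_ts['daily_quantity'], label='Daily quantity')
plt.scatter(anomalies.index, anomalies['daily_quantity'], color='red', label='Anomaly')
plt.title(f'Daily sales for product {product_id} with detected anomalies')
plt.legend()
plt.show()

# A simple "Alerts Log" with context for the supply chain team
alerts_log = anomalies.copy()
alerts_log['product_id'] = product_id
alerts_log['historical_avg'] = product_ts['daily_quantity'].mean()
alerts_log['anomaly_score'] = iso_forest.decision_function(anomalies[['daily_quantity']])
print(alerts_log[['product_id', 'daily_quantity', 'historical_avg', 'anomaly_score']])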

Conclusion: A Blueprint for Scalable, Interactive Analytics

This article has detailed a comprehensive, end-to-end workflow that effectively addresses one of the most persistent challenges in modern data science: the interactive analysis of massive datasets. By architecting a solution that leverages the distinct strengths of ClickHouse and DuckDB, a powerful and efficient pattern emerges. This workflow is not merely a technical curiosity; it represents a paradigm shift in how data scientists can interact with large-scale data, moving away from cumbersome, batch-oriented processes toward a more fluid, interactive, and productive mode of operation.

The core of this workflow is the synergistic relationship between a centralized, high-performance data warehouse and a nimble, in-process analytical engine. ClickHouse serves as the scalable, persistent source of truth, capable of storing and managing petabytes of data while handling complex aggregations with exceptional speed. DuckDB, operating as a library within the data scientist’s local Jupyter environment, provides the crucial “last-mile” analytical layer. It empowers the user to query the massive backend data store intelligently, performing heavy computations on the server and pulling only the necessary, manageable results into local memory for modeling and visualization. This approach provides the best of both worlds: the scale of a distributed system and the interactivity of a local tool.

The key advantages of this architectural pattern can be summarized as follows:

  • Scalability: The system’s ability to handle data growth is virtually unlimited, as ClickHouse is designed to scale horizontally across clusters.
  • Interactivity and Memory Efficiency: By pushing down computations to ClickHouse and retrieving only aggregated or filtered results, the data scientist’s local machine is never overwhelmed, enabling a smooth, interactive analytical experience regardless of the backend data size.
  • Reproducibility and Portability: The use of Docker and Docker Compose encapsulates the entire environment, from the database to the analytical libraries. This ensures that the project is easy to set up, share, and deploy consistently across different machines, eliminating “it works on my machine” problems.
  • Flexibility: The combination of SQL (via DuckDB) and Python (via Pandas and Scikit-learn) provides a rich, flexible toolset for tackling a wide array of data science problems, from regression and classification to anomaly detection.

The project detailed here serves as a robust blueprint that can be adapted and extended in numerous ways. Future directions could include deploying the trained machine learning models as production-ready REST APIs using a framework like FastAPI, creating real-time monitoring dashboards with tools like Streamlit or Dash built on top of the anomaly detection alerts, or scaling the backend to a multi-node ClickHouse cluster to handle true petabyte-scale analytical workloads. By adopting the principles and patterns outlined in this article, data teams can empower their scientists to work more effectively, accelerate the pace of innovation, and unlock deeper insights from their largest and most valuable data assets.


Written by Jason Jimenez Cruz

Data Architect, Fullstack Developer and GameDev. I mainly work with Python & Javascript but I like to build tools in Golang, C# and Scala
