MACHINELEARNING_INTERVIEW Telegram 1778
🔥Продвинутая SQL‑задача для аналитиков данных и ML‑специалистов
*(PostgreSQL ≥ 13; приёмы также работают в BigQuery, Snowflake, ClickHouse, Databricks SQL и др.)*

⭐️⭐️⭐️

💼 Сценарий

Вы работаете в потоковом видеосервисе StreamHub и располагаете двумя таблицами:


-- История всех заказов (подписок, апгрейдов, «pay‑per‑view» и т.п.)
CREATE TABLE fact_orders (
order_id BIGINT PRIMARY KEY,
user_id BIGINT,
order_ts TIMESTAMPTZ,
gross_amount NUMERIC(12,2)
);

-- Признаки пользователя
CREATE TABLE dim_users (
user_id BIGINT PRIMARY KEY,
signup_ts TIMESTAMPTZ, -- момент регистрации
country_code VARCHAR(2),
marketing_src TEXT -- utm‑source, рекламный канал и т.д.
);


⭐️⭐️⭐️

📝 Постановка задачи

1. Когорта — месяц первой покупки (а не регистрации).
2. В ячейке ( *Cohort = MM‑YYYY*, Age = *N* месяцев ) должна стоять доля пользователей когорты, совершивших ≥ 1 покупку в N‑м календарном месяце после первой.
3. Диапазон Age — 0 … 12 месяцев.
4. Рассмотреть только когорты *2023‑01 … 2025‑03*.
5. Реализовать одним SQL‑запросом (*single‑shot*) без материализации промежуточных таблиц.
6. Затем вывести TOP‑5 когорт с наилучшим удержанием на 12‑м месяце.

⭐️⭐️⭐️

🛠️ Решение — единый запрос


WITH first_purchase AS ( -- 1. первая покупка каждого пользователя
SELECT
o.user_id,
date_trunc('month', MIN(o.order_ts)) AS cohort_month
FROM fact_orders o
GROUP BY o.user_id
),

activity AS ( -- 2. месяц каждой покупки
SELECT
o.user_id,
date_trunc('month', o.order_ts) AS activity_month
FROM fact_orders o
),

calendar AS ( -- 3. календарь месяцев 2023‑01 … 2025‑03
SELECT generate_series(
date '2023-01-01',
date '2025-03-01',
interval '1 month'
)::date AS month_start
),

cohort_grid AS ( -- 4. «кохорта × возраст» (0–12) сетка
SELECT
fp.user_id,
fp.cohort_month,
c.month_start AS activity_month,
EXTRACT(epoch FROM (c.month_start - fp.cohort_month))
/ (30*24*3600) / 30 AS age_months -- округлится до int
FROM first_purchase fp
JOIN calendar c
ON c.month_start BETWEEN fp.cohort_month
AND fp.cohort_month + interval '12 months'
),

retention AS ( -- 5. факт активности
SELECT
cohort_month,
age_months::int AS age_m,
COUNT(DISTINCT user_id)
FILTER (WHERE a.user_id IS NOT NULL) AS active_users,
COUNT(DISTINCT user_id) AS cohort_size
FROM cohort_grid cg
LEFT JOIN activity a
ON a.user_id = cg.user_id
AND a.activity_month = cg.activity_month
GROUP BY cohort_month, age_months
)

SELECT
to_char(cohort_month, 'YYYY-MM') AS cohort,
MAX(active_users) FILTER (WHERE age_m = 0)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m0,
MAX(active_users) FILTER (WHERE age_m = 1)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m1,
MAX(active_users) FILTER (WHERE age_m = 3)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m3,
MAX(active_users) FILTER (WHERE age_m = 6)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m6,
MAX(active_users) FILTER (WHERE age_m = 12)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m12
FROM retention
WHERE cohort_month BETWEEN date '2023-01-01' AND date '2025-03-01'
GROUP BY cohort_month
ORDER BY cohort_month;


⭐️⭐️⭐️

🔍 Как работает запрос в комментариях
👍176🔥6



tgoop.com/machinelearning_interview/1778
Create:
Last Update:

🔥Продвинутая SQL‑задача для аналитиков данных и ML‑специалистов
*(PostgreSQL ≥ 13; приёмы также работают в BigQuery, Snowflake, ClickHouse, Databricks SQL и др.)*

⭐️⭐️⭐️

💼 Сценарий

Вы работаете в потоковом видеосервисе StreamHub и располагаете двумя таблицами:


-- История всех заказов (подписок, апгрейдов, «pay‑per‑view» и т.п.)
CREATE TABLE fact_orders (
order_id BIGINT PRIMARY KEY,
user_id BIGINT,
order_ts TIMESTAMPTZ,
gross_amount NUMERIC(12,2)
);

-- Признаки пользователя
CREATE TABLE dim_users (
user_id BIGINT PRIMARY KEY,
signup_ts TIMESTAMPTZ, -- момент регистрации
country_code VARCHAR(2),
marketing_src TEXT -- utm‑source, рекламный канал и т.д.
);


⭐️⭐️⭐️

📝 Постановка задачи

1. Когорта — месяц первой покупки (а не регистрации).
2. В ячейке ( *Cohort = MM‑YYYY*, Age = *N* месяцев ) должна стоять доля пользователей когорты, совершивших ≥ 1 покупку в N‑м календарном месяце после первой.
3. Диапазон Age — 0 … 12 месяцев.
4. Рассмотреть только когорты *2023‑01 … 2025‑03*.
5. Реализовать одним SQL‑запросом (*single‑shot*) без материализации промежуточных таблиц.
6. Затем вывести TOP‑5 когорт с наилучшим удержанием на 12‑м месяце.

⭐️⭐️⭐️

🛠️ Решение — единый запрос


WITH first_purchase AS ( -- 1. первая покупка каждого пользователя
SELECT
o.user_id,
date_trunc('month', MIN(o.order_ts)) AS cohort_month
FROM fact_orders o
GROUP BY o.user_id
),

activity AS ( -- 2. месяц каждой покупки
SELECT
o.user_id,
date_trunc('month', o.order_ts) AS activity_month
FROM fact_orders o
),

calendar AS ( -- 3. календарь месяцев 2023‑01 … 2025‑03
SELECT generate_series(
date '2023-01-01',
date '2025-03-01',
interval '1 month'
)::date AS month_start
),

cohort_grid AS ( -- 4. «кохорта × возраст» (0–12) сетка
SELECT
fp.user_id,
fp.cohort_month,
c.month_start AS activity_month,
EXTRACT(epoch FROM (c.month_start - fp.cohort_month))
/ (30*24*3600) / 30 AS age_months -- округлится до int
FROM first_purchase fp
JOIN calendar c
ON c.month_start BETWEEN fp.cohort_month
AND fp.cohort_month + interval '12 months'
),

retention AS ( -- 5. факт активности
SELECT
cohort_month,
age_months::int AS age_m,
COUNT(DISTINCT user_id)
FILTER (WHERE a.user_id IS NOT NULL) AS active_users,
COUNT(DISTINCT user_id) AS cohort_size
FROM cohort_grid cg
LEFT JOIN activity a
ON a.user_id = cg.user_id
AND a.activity_month = cg.activity_month
GROUP BY cohort_month, age_months
)

SELECT
to_char(cohort_month, 'YYYY-MM') AS cohort,
MAX(active_users) FILTER (WHERE age_m = 0)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m0,
MAX(active_users) FILTER (WHERE age_m = 1)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m1,
MAX(active_users) FILTER (WHERE age_m = 3)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m3,
MAX(active_users) FILTER (WHERE age_m = 6)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m6,
MAX(active_users) FILTER (WHERE age_m = 12)
/ MAX(cohort_size) FILTER (WHERE age_m = 0) AS retention_m12
FROM retention
WHERE cohort_month BETWEEN date '2023-01-01' AND date '2025-03-01'
GROUP BY cohort_month
ORDER BY cohort_month;


⭐️⭐️⭐️

🔍 Как работает запрос в комментариях

BY Machine learning Interview


Share with your friend now:
tgoop.com/machinelearning_interview/1778

View MORE
Open in Telegram


Telegram News

Date: |

With Bitcoin down 30% in the past week, some crypto traders have taken to Telegram to “voice” their feelings. How to create a business channel on Telegram? (Tutorial) The initiatives announced by Perekopsky include monitoring the content in groups. According to the executive, posts identified as lacking context or as containing false information will be flagged as a potential source of disinformation. The content is then forwarded to Telegram's fact-checking channels for analysis and subsequent publication of verified information. Avoid compound hashtags that consist of several words. If you have a hashtag like #marketingnewsinusa, split it into smaller hashtags: “#marketing, #news, #usa. In 2018, Telegram’s audience reached 200 million people, with 500,000 new users joining the messenger every day. It was launched for iOS on 14 August 2013 and Android on 20 October 2013.
from us


Telegram Machine learning Interview
FROM American