Neural Deep@neuraldeep P.1376

Neural Deep

Как заставить Qwen2.5-VL-72B-Instruct 8FP dynamic работать идеально с документами?
И еще извлекать bbox

Недавно Илья победитель ERC обратился ко мне с проблемой: ему нужно было обрабатывать 44-страничное письмо,
получая не только координаты текстовых блоков (bbox), но и полностью извлекать текст из каждого распознанного блока

Он уже пробовал Qwen2.5-VL-72B-Instruct через OpenRouter, но результаты были неудовлетворительными:
"Qwen 2.5 VL просто генерит полную дичь!"

Интересное наблюдение по провайдерам:
1. Parasail: $0.7 за 1M токенов (FP8) — лучший результат (после того как я показал правильную схему и промпт)
2. NovitaAI: $0.8 за 1M токенов — плохие результаты
3. Together: $8 за 1M токенов — худшие результаты

Удивительно, что самый дешевый провайдер давал значительно лучшие результаты!

Моё решение:

Я предложил протестировать модель на моей A100 с правильным промптом и JSON-схемой:

{
    "type": "object",
    "properties": {
        "objects": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "bbox_2d": {
                        "type": "array",
                        "description": "Coordinates of the object bounding box [x1, y1, x2, y2]",
                        "items": {
                            "type": "integer"
                        }
                    },
                    "label": {
                        "type": "string",
                        "description": "Document element label"
                    },
                    "text": {
                        "type": "string",
                        "description": "Extracted text content from the detected area"
                    },
                    "confidence": {
                        "type": "number",
                        "description": "Confidence score for the detection (0.0 to 1.0)"
                    }
                },
                "required": ["bbox_2d", "label"]
            }
        }
    },
    "required": ["objects"]
}

Ключевые факторы успеха:

1. Предобработка изображений: уменьшение размера до 2000 пикселей по широкой стороне для
баланса между качеством и контекстом (8K токенов)

2. Детальный промпт:

Detect all distinct text blocks and key visual elements in the document image. 
Group text lines that logically, semantically, and visually belong together into single elements cluster.
For each detected element, provide:
1. A concise and descriptive label (e.g., 'heading', 'paragraph', 'list', 'table', 'section', etc.)
2. A bounding box [x1, y1, x2, y2] that encompasses the entire grouped element.
3. The complete text content of the cluster, adjusted to the Markdown format.
Ignore "manifest immigration" header and "Manifest Law PLLC." with page number footers.

3. Структурированный вывод через guided_json vLLM:

extra_body = {
    "guided_json": json.dumps(DOCUMENT_JSON_SCHEMA),
    "guided_decoding_backend": "xgrammar"
}

Выводы:

1. Не все провайдеры одинаково полезны, даже с одной и той же моделью
2. Цена не всегда коррелирует с качеством
3. Правильный промпт критически важен
4. JSON-схема значительно повышает качество и стабильность результатов
5. FP8-квантизация вполне может обеспечивать высокое качество
6. Собственный хостинг даёт больше контроля и стабильности даже проверить стартовый результат

В комментариях пришлем что было до как показывали другие API провайдеры и что вышло после

В итоге Илья реализовал полный пайплайн обработки документов с точностью распознавания 100% на все документы

👍29🔥106❤2

www.tgoop.com/neuraldeep/1376

1.68K viewsedited Apr 8 at 10:21

tgoop.com/neuraldeep/1376

Create: 2025-04-08
Last Update: 2025-07-29 13:31:39

{
    "type": "object",
    "properties": {
        "objects": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "bbox_2d": {
                        "type": "array",
                        "description": "Coordinates of the object bounding box [x1, y1, x2, y2]",
                        "items": {
                            "type": "integer"
                        }
                    },
                    "label": {
                        "type": "string",
                        "description": "Document element label"
                    },
                    "text": {
                        "type": "string",
                        "description": "Extracted text content from the detected area"
                    },
                    "confidence": {
                        "type": "number",
                        "description": "Confidence score for the detection (0.0 to 1.0)"
                    }
                },
                "required": ["bbox_2d", "label"]
            }
        }
    },
    "required": ["objects"]
}

Detect all distinct text blocks and key visual elements in the document image. 
Group text lines that logically, semantically, and visually belong together into single elements cluster.
For each detected element, provide:
1. A concise and descriptive label (e.g., 'heading', 'paragraph', 'list', 'table', 'section', etc.)
2. A bounding box [x1, y1, x2, y2] that encompasses the entire grouped element.
3. The complete text content of the cluster, adjusted to the Markdown format.
Ignore "manifest immigration" header and "Manifest Law PLLC." with page number footers.

3. Структурированный вывод через guided_json vLLM:

extra_body = {
    "guided_json": json.dumps(DOCUMENT_JSON_SCHEMA),
    "guided_decoding_backend": "xgrammar"
}

BY Neural Deep

Share with your friend now:
tgoop.com/neuraldeep/1376

Telegram News

Как заставить Qwen2.5-VL-72B-Instruct 8FP dynamic работать идеально с документами?