llm security и каланы (@llmsecurity), P.16

The core problem with all text-based adversarial attacks is that we work with discrete tokens. Ideally, we would change the tokens in the suffix so that the loss on the target prefix is minimal. Doing this greedily is not feasible because of the computational cost, so we choose tokens based on the gradient instead. Recalling that embeddings can be obtained by multiplying a one-hot matrix by the embedding matrix, we compute the gradient of the loss with respect to the one-hot entries and, for each position in the suffix, pick the top-k tokens with the largest negative gradient:

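    import torch
    import torch.nn as nn

    # model, input_ids, embed_weights, get_embeddings and the input/target/loss
    # slices are assumed to be defined by the surrounding code.

    # One-hot matrix over the vocabulary for the suffix tokens: (suffix_len, vocab_size)
    one_hot = torch.zeros(
        input_ids[input_slice].shape[0],
        embed_weights.shape[0],
        device=model.device,
        dtype=embed_weights.dtype
    )
    one_hot.scatter_(
        1,
        input_ids[input_slice].unsqueeze(1),
        torch.ones(one_hot.shape[0], 1, device=model.device, dtype=embed_weights.dtype)
    )
    # Track gradients with respect to the one-hot entries
    one_hot.requires_grad_()
    # Suffix embeddings as a differentiable product: one-hot matrix @ embedding matrix
    input_embeds = (one_hot @ embed_weights).unsqueeze(0)
    # now stitch it together with the rest of the embeddings
    embeds = get_embeddings(model, input_ids.unsqueeze(0)).detach()
    full_embeds = torch.cat(
        [
            embeds[:,:input_slice.start,:],
            input_embeds,
            embeds[:,input_slice.stop:,:]
        ],
        dim=1)
    logits = model(inputs_embeds=full_embeds).logits
    targets = input_ids[target_slice]
    # Cross-entropy between the logits at the loss positions and the target tokens
    loss = nn.CrossEntropyLoss()(logits[0,loss_slice,:], targets)
    # After this call, one_hot.grad holds one gradient value per (position, token)
    loss.backward()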
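
The snippet above stops at `loss.backward()`. The selection step described in the text (top-k tokens with the largest negative gradient per position) could be sketched roughly as follows. This is a toy illustration, not the original implementation: the helper `top_k_candidates`, the tensor sizes, and the linear stand-in loss that replaces the language model are all assumptions made to keep the example self-contained.

```python
import torch


def top_k_candidates(one_hot_grad: torch.Tensor, k: int) -> torch.Tensor:
    """Hypothetical helper: per suffix position, the k token ids whose
    one-hot entries have the most negative gradient.

    one_hot_grad has shape (suffix_len, vocab_size), e.g. one_hot.grad
    after loss.backward().
    """
    # "Most negative gradient" == largest value of the negated gradient.
    return (-one_hot_grad).topk(k, dim=1).indices


# Toy setup: sizes and the stand-in loss are assumptions for illustration only.
torch.manual_seed(0)
vocab_size, dim, suffix_len = 10, 4, 3
embed_weights = torch.randn(vocab_size, dim)
suffix_ids = torch.tensor([1, 5, 7])

# Same one-hot construction as in the post, on a tiny vocabulary.
one_hot = torch.zeros(suffix_len, vocab_size)
one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
one_hot.requires_grad_()

# Stand-in for the model loss: a linear function of the suffix embeddings.
w = torch.randn(dim)
loss = ((one_hot @ embed_weights) @ w).sum()
loss.backward()

# Shape (suffix_len, k): candidate replacement tokens for each position.
candidates = top_k_candidates(one_hot.grad, k=4)
```

The gradient is only a first-order estimate of how the loss would change if a token were swapped, so the selected tokens are best treated as candidates that still need to be re-scored with a real forward pass before any of them replaces a suffix token.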

www.tgoop.com/llmsecurity/16

195 views, Jan 22, 2024 at 19:56

