Aren't you running a tech blog?

[Paper Review] CPR: Retrieval Augmented Generation for Copyright Protection

quasar529 — Sun, 2 Jun 2024 22:40:03 +0900

This paper applies Copyright Protection, the topic of the previous post, to RAG.


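RAG lets a model use private user data without training on it, but the model risks copying the retrieved samples verbatim.

To address this, the paper proposes Copy-Protected generation with Retrieval (CPR).

At inference time, CPR samples by mixing the diffusion scores of the public and private distributions,

and the resulting procedure satisfies NAF (Near Access-Freeness).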

Mixed-Privacy RAG

Because $D_{private}$ is never used for parameter updates, the approach applies immediately to privacy settings,

but the samples retrieved at inference time can still leak information.

Public diffusion model trained on $D$ (the safe dataset) = $s_\theta (x_t, t, c)$

$c$ is the output of the CLIP encoder ($c = \mathrm{CLIP}(\langle \text{prompt} \rangle)$)

$D_{private}$ = the dataset that needs protection; in this paper, the data used for retrieval

$D_{retr} = \{(x_i,\phi(c_i, c_{test})) \}^m_{i=1}$, a subset of $D_{private}$, is used to improve generation

$score = \| c_{test} - c_i\| + \|c_{test} - \mathrm{CLIP}(x_i)\|$

The $m$ samples with the lowest score (the nearest ones) are retrieved.

$\phi(c_i, c_{test}) = c_i + c_{test}$
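
The retrieval step above can be sketched roughly as follows. This is a minimal sketch, not the authors' code: it assumes the CLIP embeddings are already computed and passed in as plain NumPy arrays, and the names `retrieve`, `captions`, and `images` are made up for illustration.

```python
import numpy as np

def retrieve(c_test, captions, images, m):
    """Return indices of the m nearest private samples and their phi conditioning.

    c_test   : (d,)   CLIP embedding of the test prompt (assumed precomputed)
    captions : (N, d) CLIP embeddings c_i of the private captions
    images   : (N, d) CLIP embeddings CLIP(x_i) of the private images
    """
    # score_i = ||c_test - c_i|| + ||c_test - CLIP(x_i)||, lower means closer
    score = (np.linalg.norm(captions - c_test, axis=1)
             + np.linalg.norm(images - c_test, axis=1))
    idx = np.argsort(score)[:m]
    # phi(c_i, c_test) = c_i + c_test
    phi = captions[idx] + c_test
    return idx, phi, score
```

Calling `retrieve(c_test, captions, images, m=4)` would return the four closest samples together with the modified conditioning vectors that stand in for $D_{retr}$.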

Mixture-of-Distribution

The public and private distributions are mixed.

$p(x|c) = w_0 \, p_D(x|c) + w_1 \, p_{D_{retr}}(x|c)$

$w_0 = \lambda$

$w_1 = 1-\lambda$

Mixture-of-Score

To sample from the mixture distribution, its score function is computed,

and sampling then proceeds from the mixed distribution.
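
To make "the score of a mixture" concrete, here is a toy 1-D sketch with two Gaussians standing in for the public and retrieval distributions. It uses the standard identity that the score of $w_0 p_0 + w_1 p_1$ is the responsibility-weighted average of the component scores. The Gaussian components and all function names are my own assumptions; as noted under Proposition 1, the paper works with fixed weights $\hat{w}$, which this sketch does not model.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gauss_score(x, mu, sigma):
    # d/dx log N(x; mu, sigma^2)
    return -(x - mu) / sigma ** 2

def mixture_score(x, lam, comp0, comp1):
    """Score of p = lam * p0 + (1 - lam) * p1 for two 1-D Gaussians.

    comp0 and comp1 are (mu, sigma) tuples (toy stand-ins, not real models).
    """
    w = np.array([lam, 1.0 - lam])
    dens = np.array([gauss_pdf(x, *comp0), gauss_pdf(x, *comp1)])
    scores = np.array([gauss_score(x, *comp0), gauss_score(x, *comp1)])
    resp = w * dens / np.sum(w * dens)   # posterior weight of each component
    return float(np.sum(resp * scores))
```

With `lam = 1` the result reduces to the public score alone, which is a quick sanity check on the weighting.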

Proposition 1.

$\hat{w}$ : fixed hyper-parameters

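$\nabla_{x_t} \log p_D(x_t|c)$ → can be approximated by $s_\theta$ (the diffusion model)

However, no model has been trained on $D_{retr}$ → call the model that would be needed $s_{\theta_1}$

Computing $s_{\theta_1}$ is too expensive, so it is replaced with CLIP.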

Proposition 2.

modifying the user prompt $c_{test}$ using the CLIP embeddings of the retrieved samples

∵ to avoid fine-tuning

The optimal diffusion model trained on the retrieved data can be approximated using CLIP embeddings.

Retrieval-Mixture-Score

expression for the score function of retrieval-augmented mixture of distributions

Copy-Protected Generation

CPR-KL

There is no direct access to $q^{(1)}$ and $q^{(2)}$
➡️ instead use $\nabla_{x_t} \log \int q_t(x_t|x_0)\,q^{(1)}(x_0|c)\,dx_0$ and $\nabla_{x_t} \log \int q_t(x_t|x_0)\,q^{(2)}(x_0|c)\,dx_0$

average the two scores at every step during backward diffusion using Langevin Dynamics

Algorithm 1 → guarantees k-NAF

The optimal scores are unknown → approximate them with a DNN

The computation cost at inference time doubles
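
As a toy illustration of "average the two scores at every step", the sketch below runs plain unadjusted Langevin dynamics in 1-D with two Gaussians standing in for $q^{(1)}$ and $q^{(2)}$. It is not the paper's Algorithm 1: there is no diffusion noise schedule, the scores are analytic rather than learned, and the function names and default step size are assumptions.

```python
import numpy as np

def score_gauss(x, mu, sigma):
    # d/dx log N(x; mu, sigma^2)
    return -(x - mu) / sigma ** 2

def langevin_avg(mu1, mu2, sigma, steps=2000, eps=1e-2, n=5000, seed=0):
    """Unadjusted Langevin dynamics driven by the average of two scores."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)            # start all chains from N(0, 1)
    for _ in range(steps):
        # evaluate both scores and average them: two score calls per step
        s = 0.5 * (score_gauss(x, mu1, sigma) + score_gauss(x, mu2, sigma))
        x = x + eps * s + np.sqrt(2 * eps) * rng.standard_normal(n)
    return x
```

For two Gaussians with equal variance, the averaged score is the score of a Gaussian centred at the midpoint of the two means, so the chains should settle around that midpoint. The two score evaluations per step are also where the doubled inference cost comes from.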

...smoothly interpolates between N(0,I) at t = T...
...Langevin dynamics converge exponentially fast to the distribution estimated by the gradients...

Experiments

As $w_1$ increases, $k$ decreases → generation becomes safer

That is, similarity to the textual prompts (from CLIP) increases,

while similarity to the retrieved images decreases.

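The papers that currently apply Copyright Protection

do not add to the idea or develop it further;

they do not go beyond claiming that the authors' method satisfies the previously proposed NAF.

This is probably because the research is still at an early stage.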

[Paper Review] On Provable Copyright Protection for Generative Models

quasar529 — Sun, 26 May 2024 23:28:38 +0900

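Differential Privacy (DP) is the tool most often used to protect Privacy.

However, its clear performance limitations make it hard to adopt widely.

As an alternative, this paper introduces a notion of Copyright protection,

which is less strict than Privacy but can still protect information sufficiently.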

Leftmost = p

Middle two = q1, q2 (q1 was trained without q2's image, and vice versa)

Last = $p_k$, built from p, q1, q2 (contains neither image)

Dataset : CIFAR-10 (along with horizontal flips) augmented with multiple copies of two images taken from the CIFAR-10 test set

Two images are taken from the test set and treated as copyrighted works
- 2% of the whole dataset

Model p

Trained on the full dataset
Generates the two copyrighted works

Algorithm

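Split the data into two datasets so that the copyrighted images are separated
CP-k using a threshold of k = 500 : $p_k$

$\max_{i \in \{1,2\}} \log\left(p(y)/q_i(y)\right)$

The distribution is bimodal
The first mode consists of ordinary images
The second mode consists entirely of copyrighted images

In the end, the goal is to make $p_k$ follow the distribution shown by the blue and green lines.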
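
A small sketch of how thresholding this statistic could work, assuming CP-k is the rejection-sampling procedure that draws $y \sim p$ and keeps it only when $\max_i \log(p(y)/q_i(y)) \le k$. The discrete toy distributions and the name `cp_k_sample` are made up for illustration; the real models are image generators, not lookup tables.

```python
import numpy as np

def cp_k_sample(p, q1, q2, k, rng, max_tries=10_000):
    """Draw one index from p, rejecting outcomes whose max log-ratio exceeds k."""
    # rho(y) = max_i log(p(y) / q_i(y)); large when p is far more likely
    # than one of the safe models to emit y
    rho = np.maximum(np.log(p / q1), np.log(p / q2))
    for _ in range(max_tries):
        y = rng.choice(len(p), p=p)
        if rho[y] <= k:
            return int(y)
    raise RuntimeError("no sample accepted; k may be too small")

# Toy setup: outcomes 0 and 1 are ordinary images, 2 and 3 are the copyrighted works.
p  = np.array([0.40, 0.40, 0.1000, 0.1000])   # trained on everything
q1 = np.array([0.45, 0.45, 0.0001, 0.0999])   # never saw work 2
q2 = np.array([0.45, 0.45, 0.0999, 0.0001])   # never saw work 3
```

In this toy setup the two copyrighted outcomes form the high-ratio mode, so a small threshold filters them out while ordinary outcomes pass.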

This looks similar to DP, but it is in fact very different.

Copyright imposes a much looser criterion, so it is easier to achieve.

The paper also states this explicitly.

Comparison with Differentially Private Prediction

Privacy

Privacy is focused on an individual and the attributes of that individual
if any particular generative output leaks even a few bits about a training sample, this could still be a significant privacy violation
privacy requires that the output of a mechanism does not reveal whether or not an individual’s data was in the database

Copyright

copyright protection is only for a specific piece of work
a few bits of leakage are unlikely to constitute a copyright violation since copyright requires a minimum amount of information content
we only need to ensure that no particular output is substantially similar to a copyrighted work

[Paper Review] FLORA: Low-Rank Adapters Are Secretly Gradient Compressors

quasar529 — Sun, 14 Apr 2024 20:12:03 +0900

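This post is about a paper that left me in deep despair.

It is FLORA, the paper that all but half-wrecked the research we had been doing.

I could write about it endlessly, but it hurts, so I will keep this post short.

Yongchang Hao... I will not forget you.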

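This paper interprets how LoRA works and proposes a way to apply that insight in a Memory-Efficient manner.

To state the conclusion first:

LoRA can be viewed as a process that uses A to repeatedly Down-Project and Up-Project the Gradient of W.

In other words, it Compresses the Gradient with a Random Projection and then Decompresses it again.

Several assumptions have to hold, but there is no doubt that this explains the training dynamics of LoRA clearly.

Let's walk through the proof in detail.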

With LoRA, the following matrices are involved.

$W$ : Pre-trained Weight Matrix (n x m)

$B$ : LoRA B, initialized to zero (n x r)

$A$ : LoRA A, initialized from a normal distribution (r x m)

The Forward Pass is then:

$y = (W+BA)x = Wx+BAx$

($BA$ is added on top of $W$; $W$ itself is not modified.)

In Back-Propagation, the Gradient of W is:

$\nabla_W L = \frac{\partial L}{\partial y}x^\top$

The Gradients of $B$ and $A$ are then:

$\frac{\partial L}{\partial A} = B^\top \frac{\partial L}{\partial y}x^\top = B^\top (\nabla_W L)$

$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial y}x^\top A^\top = (\nabla_W L)A^\top$
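
These three gradient formulas are easy to check numerically. The sketch below is my own and handles a single input vector; `lora_grads` is a hypothetical helper name.

```python
import numpy as np

def lora_grads(W, B, A, x, dL_dy):
    """Analytic gradients for y = (W + B A) x, given dL/dy.

    Shapes follow the post: W (n, m), B (n, r), A (r, m), x (m,), dL_dy (n,).
    W is accepted only to mirror the notation; the formulas do not need its value.
    """
    grad_W = np.outer(dL_dy, x)     # (n, m): dL/dy x^T
    grad_A = B.T @ grad_W           # (r, m): B^T (grad_W L)
    grad_B = grad_W @ A.T           # (n, r): (grad_W L) A^T
    return grad_W, grad_A, grad_B
```

Comparing `grad_A` and `grad_B` against finite differences of a simple loss such as $\frac{1}{2}\|y\|^2$ (where $\partial L / \partial y = y$) is a quick way to confirm that both are just projections of $\nabla_W L$.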

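The SGD updates, with $\nabla_W L_t$ the Gradient of W at step $t$, are:

$A_{t+1} \leftarrow A_t - \eta B_t^\top (\nabla_W L_t)$

$B_{t+1} \leftarrow B_t - \eta (\nabla_W L_t) A_t^\top$

Here we assume that the Frobenius norm of the Gradient of W, $\nabla_W L_t$, is at most a constant $L$.

That is, we assume the Model stays within a Finite Euclidean ball.

The Dynamics of $A_t$ and $B_t$ then become:

$A_T = A_0 + \eta A_0 f_A(T)$

$B_T = \eta f_B(T) A_0^\top$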

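Now let's look at the Dynamics of the LoRA Update.

$W + (B_0 + \Delta B)(A_0 + \Delta A)$

$= W + B_0 A_0 + B_0 \Delta A + \Delta B A_0 + \Delta B \Delta A$

$= W + \Delta B A_0 + \Delta B \Delta A$

Since $B_0$ is initialized to 0, the terms containing $B_0$ vanish, and if the Learning Rate $\eta$ is small enough the second-order term $\Delta B \Delta A$ is negligible, which gives:

$W + (B_0 + \Delta B)(A_0 + \Delta A) \approx W + \Delta B A_0$

➡ $W + \eta \hat{f_B}(T) A_0^\top A_0$

Here $\hat{f_B}$ is Updated as follows.

$\hat{f_B}(t+1) := \hat{f_B}(t) - \nabla_W L_t$

The final result is:

$W - \eta \sum^T_{t=0} [(\nabla_W L_t) A_0^\top A_0]$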

In other words, LoRA training

Compresses the Gradient of W, $\nabla_W L_t$, by Down-projecting it with $A_0^\top$,

and Decompresses it by Up-projecting it with $A_0$.
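
The compress/decompress reading can be checked in a few lines. This is a sketch under simplifying assumptions of my own: A is kept frozen at $A_0$ (the small-learning-rate approximation above), $B_0 = 0$, and the gradients are fixed random matrices rather than ones that depend on the weights. Both function names are hypothetical.

```python
import numpy as np

def lora_b_only_update(grads, A0, eta):
    """Run SGD on B (A frozen at A0, B0 = 0) and return the product B_T A0."""
    n = grads[0].shape[0]
    B = np.zeros((n, A0.shape[0]))
    for G in grads:                       # G plays the role of grad_W L_t
        B -= eta * G @ A0.T               # dL/dB = (grad_W L) A^T
    return B @ A0

def compressed_update(grads, A0, eta):
    """-eta * sum_t (G_t A0^T) A0: down-project, accumulate, up-project."""
    acc = sum(G @ A0.T for G in grads)    # (n, r) compressed accumulator
    return -eta * acc @ A0
```

The two functions return the same matrix, which is the final expression above. If the entries of $A_0$ are drawn from $\mathcal{N}(0, 1/r)$, then $\mathbb{E}[A_0^\top A_0] = I$, so in expectation the decompressed update matches plain accumulated SGD on $W$.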

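When I first came across LoRA in early 2023, it had about 500 citations,

and as of April 2024 it has about 3,400.

That shows how innovative and effective LoRA was, and how much there was to dig into.

I have a lot of regrets.

Later on I may upload the research process we went through, including the hypotheses and experiments.

Research... is hard...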

[Paper Briefing] 2403. LoRA-SP / AutoLoRA

quasar529 — Sun, 24 Mar 2024 22:51:18 +0900

Let's look at two LoRA-related papers posted to arXiv in March 2024.

LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

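LoRA-FA, a paper published in 2023
- Claims that Freezing A and training only B during LoRA training gives performance Comparable to training both A and B
- However, it was Rejected from ICLR for weak Contribution (e.g., the memory saving is not large)
  ➡️ In other words, LoRA is already small, yet there is still Redundancy in it
LoRA-SP proposes a similar method that Freezes part of the parameters
It introduces a Binary Matrix S so that only half of both A and B is Updated
- S is Random
$\Delta W = (A \odot S)(B \odot S)^\top$
For additional efficiency it uses Quantization and Selective Activation Recomputation
- Selective Activation Recomputation is a technique that optimizes memory utilization by computing only the activations needed during the backward pass

The number of Parameters is exactly half that of LoRA, but performance is similar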

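In my experience, on easy tasks LoRA often produces much the same good results no matter how you modify it,
and I suspect this paper, too, stumbled onto its result after running many experiments
and was then uploaded to arXiv without much thought.

S is not chosen according to any criterion,
and Quantization and Selective Activation Recomputation, which do not fit the line of argument,

are brought in out of nowhere, which reinforces that impression.

Also, what the paper emphasizes most is enhancing computational efficiency and reducing memory usage,
yet the decisive problem is that it never gives concrete numbers on memory usage.

Of course, it may be that I failed to see some invisible something...
I'm not sure.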

AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning

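AdaLoRA
- The importance of each Layer varies with the Task, so giving every Layer the same rank is unreasonable
- Proposes assigning different LoRA Ranks under a fixed Budget according to an Importance Score
AutoLoRA finds the optimal LoRA Rank per Layer through Meta Learning
- It is described as Meta Learning because it includes a step that learns Selection variables for finding a suitable Rank
The Train Dataset is split into Train/Valid
U and V are optimized on the train dataset, and the selection variables are optimized on the valid dataset
How this paper differs from AdaLoRA
- It argues that AdaLoRA can Overfit because the Importance score and the Update Matrices are both learned on the same dataset

Claims better performance than AdaLoRA with the same number of Params

However, it incurs a higher Cost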

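By finding a way to determine the Optimal Rank per Layer dynamically, AdaLoRA
proposed and demonstrated the hypothesis that LoRA can be made more efficient.

Since then, several methods have appeared that allocate Rank Dynamically rather than uniformly (e.g., DyLoRA),
and AutoLoRA is one of them.

However, although it is said to Outperform AdaLoRA (honestly, they are about even...), its Cost is nearly double, so I doubt it is worthwhile,

and while the paper argues that splitting the Dataset into train/valid when searching for the Optimal Rank is more Resilient,

I actually think searching on the same Dataset is more appropriate.

Compared with AdaLoRA, this paper does not change much or propose anything substantially new either,

so I do not think it is very significant.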

BERT : Bidirectional Encoder Representations from Transformers

quasar529 — Sun, 10 Mar 2024 19:27:39 +0900

Input Representation

Token Embeddings

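Sentences are split into tokens with the WordPiece tokenizer
- An algorithm similar to Byte Pair Encoding (BPE)
- Common words are kept whole, and uncommon words are split into subwords
The first token of every sequence is always [CLS] (special classification token)
- Attaching a simple classifier to it enables classification of a single sentence or of consecutive sentences
- Ignored when the task is not classification
A [SEP] token is placed at the end of each sentence to separate sentences

Segment Embedding

Distinguishes sentence A from sentence B and marks where each sentence starts and ends
- Every token of the first sentence gets the 'A' embedding, and every token of the second sentence gets the 'B' embedding
- A [SEP] token is inserted where the first sentence ends and the second begins → BERT recognizes the start and end of sentences and the boundary between them

Position Embedding

Encodes the order of the tokens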
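
Here is a small sketch of how the three embeddings fit together: the sequence is laid out as [CLS] A [SEP] B [SEP], and the input representation is the sum of the token, segment, and position embeddings. The helper names and the random embedding tables are placeholders of mine, and the tokens are assumed to be already split by a tokenizer.

```python
import numpy as np

def build_inputs(tokens_a, tokens_b):
    """Lay out [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 ('A') covers [CLS], sentence A and its [SEP]; segment 1 ('B') the rest
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions

def embed(token_ids, segments, positions, tok_emb, seg_emb, pos_emb):
    # the input representation is the element-wise sum of the three embeddings
    return tok_emb[token_ids] + seg_emb[segments] + pos_emb[positions]
```

`build_inputs(["my", "dog"], ["he", "barks"])` gives seven tokens, with the first four in segment A and the last three in segment B.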

Masked Language Model (MLM)

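Works by masking some of the words and having the model predict the masked words → this lets the model take context from both directions into account
First, some of the tokens are selected for masking
- 15% of the tokens are selected
  - 80% : replace the token with the [MASK] token. ex) my dog is hairy -> my dog is [MASK]
  - 10% : replace the token with a random word. ex) my dog is hairy -> my dog is apple
    - This is only 1.5% of all tokens (15% × 10%), so it has little effect on model performance
  - 10% : keep the original token. ex) my dog is hairy -> my dog is hairy
    - Biases the representation toward the actually observed word
Unlike a left-to-right (or r2l) LM that predicts the whole sentence, only the [MASK] tokens are predicted
- The original token at each [MASK] position is predicted with a cross entropy loss
- The [MASK] token is used only during pre-training and never appears during fine-tuning → this mismatch is why not every selected token is replaced with [MASK]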
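
The 15% / 80-10-10 rule can be sketched as below. This is a simplified sketch of my own: each token is selected independently with probability 0.15 instead of selecting an exact 15% of positions, the input is assumed to contain no special tokens, and `mask_tokens` and `vocab` are illustrative names.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 15% selection and the 80/10/10 replacement rule.

    Returns the corrupted tokens and a label list holding the original token
    at selected positions and None elsewhere (only those positions are predicted).
    """
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # not selected: left untouched
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token
    return masked, labels
```

The cross entropy loss would then be computed only at positions where `labels` is not `None`.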

Next Sentence Prediction (NSP)

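Takes two sentences as input and learns a Binary Classification of whether the second sentence actually follows the first
- Because understanding the relationship between two sentences matters in tasks such as QA and Natural Language Inference (NLI)
How it works
- There are two kinds of sentences, 'A' and 'B'
- Sentence 'A' is taken from the original text; sentence 'B' is either the sentence that follows 'A' or an unrelated random sentence
- If 'B' is the sentence right after 'A' : 'IsNext'
- If 'B' is a random sentence unrelated to 'A' : 'NotNext'
Predicting the relationship between the two sentences improves the model's ability to understand inter-sentence relations, sentence order, and context
- 50% : sentences A and B are an actual consecutive pair
- 50% : sentences A and B are two (unrelated) sentences drawn at random from the corpus
  - Examples
  - Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
    - Label = IsNext
  - Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
    - Label = NotNext

$C$ is the final hidden vector of the [CLS] token and is used for next sentence prediction (NSP)
Using $C$, the model is trained to decide whether the two input sentences were adjacent in the original corpus (IsNext) or not (NotNext)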
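
Building the NSP training pairs described above could look roughly like this. It is a minimal sketch: `doc` is assumed to be a list of consecutive sentences from one document, `corpus` a pool of sentences from other documents, and `make_nsp_pair` is a made-up name.

```python
import random

def make_nsp_pair(doc, corpus, rng):
    """Return (sentence_a, sentence_b, label) for one NSP training example."""
    i = rng.randrange(len(doc) - 1)       # pick A so that a next sentence exists
    a = doc[i]
    if rng.random() < 0.5:
        return a, doc[i + 1], "IsNext"    # 50%: the actual next sentence
    return a, rng.choice(corpus), "NotNext"  # 50%: a random sentence
```

Each pair would then be laid out as [CLS] A [SEP] B [SEP], and the label predicted from $C$.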

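✔ Masked LM predicts hidden words → focuses on words

✔ Next Sentence Prediction has to capture the relationship between two sentences → focuses on sentences and demands a broader scope of understanding

⇒ Using two complementary Pretraining objectives together yields good performance across a wider range of tasks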

Fine-tuning Procedure

sequence-level classification tasks
- A fixed-dimensional representation of the input sequence is obtained → the Transformer output of the [CLS] token is used
  - $C \in \mathbb{R}^H$
- A classification layer is attached according to the number of classes (K)
  - $W \in \mathbb{R}^{K \times H}$
span-level, token-level prediction tasks
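
For the sequence-level case, the classification layer amounts to a softmax over $C W^\top$. A minimal NumPy sketch, with `classify` as an illustrative name and random values standing in for a trained $W$ and a real [CLS] output:

```python
import numpy as np

def classify(C, W):
    """Class probabilities softmax(C W^T) for a [CLS] vector C and weights W.

    C : (H,)   final hidden vector of the [CLS] token
    W : (K, H) classification layer added for fine-tuning
    """
    logits = W @ C                      # (K,)
    logits = logits - logits.max()      # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()
```

During fine-tuning, $W$ and the BERT parameters are trained together on the log-probability of the correct label.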

HiFi: High-Information Attention Heads Hold for Parameter-Efficient Model Adaptation

quasar529 — Sun, 10 Mar 2024 19:14:57 +0900

LLMs (PLMs in this paper) have a large scale of parameters
➡ Inefficient in Data-Scarce & Resource-Limited settings

Catastrophic forgetting issues
Limited storage infrastructure

PEFT emerges
Only fine-tunes the minority of the original parameters

Effectively decrease parameters
BUT also lead concerns
- Breaks the model structure
- Inference delays

Two types of Methods

Structured Methods
- Extra introduced blocks : LoRA, Prompt-Tuning
- Internal original blocks : BitFit
Non-structured Methods

HiFi

Fine-tuning the relatively significant heads in MHA(multi-head attention module)
- = Highly informative and Strongly correlated attention heads
- Reason : most LLMs are Transformer-based & MHA plays a crucial role
Two Big Challenges
- How to measure the individual importance of a head?
- How to measure the relative importance between heads?

Information Richness

Approximate $W_h$ ➡️ by $O_h$
SVD of $O_h(x)$
- Singular values $\{\sigma_t\}$ decays slower == informative & contains more meaningful principal components
Information richness of an attention head as $I_h(W_h \mid x)$
Monte-Carlo
- Stable results can be obtained using a small n (e.g., 300)

Correlation

Correlation between Weights ➡️ approximated by Correlation between Outputs
$O'_h$ = average O over the sequence axis
Correlation between two heads $(h, h')$ is computed by the covariance
- strong positive and negative should be considered equally ➡️ take the absolute value
Unbiased estimation of covariance
Monte-Carlo

Joint Optimization

Heads into directed fully-connected graph
$p_h^{(0)}$ = Initial probability per node
$m_{h,h'}$ = Probability of moving from node h to another node h′
$P^{(0)} = [p_1^{(0)}, p_2^{(0)}, \cdots, p_H^{(0)}]^\top$ : Probability vector
State transition probability matrix $M = [m_{h,h'}]_{H \times H}$
PageRank
- d = damping factor
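
For reference, a generic PageRank power iteration over such a graph might look like the sketch below. It uses the textbook update $P \leftarrow d\,M^\top P + (1-d)/H$ and assumes $M$ is row-stochastic. How HiFi actually builds $p_h^{(0)}$ and $m_{h,h'}$ from the information richness and correlation is not shown in these notes, so the inputs here are placeholders.

```python
import numpy as np

def pagerank(M, p0, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration P <- d * M^T P + (1 - d) / H, with M row-stochastic.

    M[h, h2] is the probability of moving from node h to node h2,
    p0 is the initial probability vector, d is the damping factor.
    """
    H = len(p0)
    P = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        P_next = d * M.T @ P + (1.0 - d) / H
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P
```

The heads with the highest stationary probability would then be the ones picked for fine-tuning.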

Ablation

Q1: Does the correlation $r_{h,h'}$ between heads really matter?
Q2: Are the higher information richness ($I_h$) of heads more important for the model?
Q3: Is it enough to only take the correlation ($r_{h,h'}$) into consideration, while ignoring the information richness ($I_h$)?
Q4: Does PageRank algorithm really work?

Getting Started

quasar529 — Sun, 10 Mar 2024 18:57:21 +0900

A tech blog.

This time it's for real.