Transformer Head Pruner by xiaowu0162 · Pull Request #3884 · microsoft/nni

xiaowu0162 · 2021-06-29T09:48:21Z

This PR adds a pruner for pruning attention heads in transformers.
To-do:

basic: module matching, name-based weight grouping, group-aware maskers using weight norm as criteria
graph-based weight grouping
maskers relying on activation
maskers relying on gradient
global sort in maskers
support iterative pruning from scratch ~~/ integration with pruning scheduler for iterative features~~
example
docs

zheng-ningxin · 2021-07-26T08:04:56Z

+                    break
+                except:
+                    continue
+            if layer_idx is not None:


Is there a better way to get the index of the attention head? Taking the first integer in the name may not be robust.

This layer_idx is the layer index within the BERT encoder. I include these lines only to show users how they can use the pruned_heads dict inside the pruner to get the pruned heads for each group, match each group to its original layer, and finally call the built-in transformers _prune_heads() function to speed up the model. This is meant as a temporary workaround until we can properly handle speedup for transformers.

If our speedup code can handle transformers after the refactor, I will replace these lines with our speedup methods (maybe in a separate PR).

Since users know the naming of the model they are pruning, I think they can also use their own rules to match layers to groups.
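As a rough illustration of the matching step discussed above, here is a minimal sketch that maps each weight group to an encoder layer and collects its pruned heads. It assumes BERT-style module names containing `layer.<idx>`, and it takes plain Python data as input: `name_groups` (a list of name lists, one per group) and `pruned_heads` (group index to head indices) are stand-ins for whatever the pruner exposes, and both helper names are hypothetical. Matching on `layer.<idx>` rather than on the first integer in the name is one possible answer to the robustness concern.

```python
import re
from typing import Dict, Iterable, List, Mapping, Optional, Sequence

# Assumption: layers are named "...layer.<idx>..." as in BERT-style encoders.
_LAYER_RE = re.compile(r"(?:^|\.)layer\.(\d+)(?:\.|$)")


def layer_index_from_name(module_name: str) -> Optional[int]:
    """Return the encoder layer index found in a module name, or None."""
    match = _LAYER_RE.search(module_name)
    return int(match.group(1)) if match else None


def build_heads_to_prune(
    name_groups: Sequence[Sequence[str]],
    pruned_heads: Mapping[int, Iterable[int]],
) -> Dict[int, List[int]]:
    """Map layer index -> sorted list of pruned head indices.

    Groups with no recognizable layer index or no pruned heads are skipped.
    """
    heads_to_prune: Dict[int, List[int]] = {}
    for group_idx, names in enumerate(name_groups):
        if not names:
            continue
        layer_idx = layer_index_from_name(names[0])
        if layer_idx is None:
            continue
        heads = sorted(pruned_heads.get(group_idx, ()))
        if heads:
            heads_to_prune[layer_idx] = heads
    return heads_to_prune
```

The resulting dict (layer index to list of head indices) is intended to have the shape that a head-pruning call such as the _prune_heads() function mentioned above would take; users with different naming schemes would swap in their own pattern.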

zheng-ningxin · 2021-07-26T08:23:20Z

+        and include `model, optimizer, criterion, epoch` as function arguments.
+    criterion: function
+        Function used to calculate the loss between the target and the output.
+        For example, you can use ``torch.nn.CrossEntropyLoss()`` as input.


I feel that the TransformerHeadPruner is too heavy. I would prefer to position this pruner as a one-shot pruner, which means it would not need to handle num_iteration, optimizer, trainer, criterion, and so on. That would be much clearer. All of the finetuning-related logic can be offloaded to the outer search algorithms. We can discuss this with Quanlu @QuanluZhang.

Yes, we can discuss that further. One challenge is that this does not fit well into our current compression V1 framework (the current iterative and dependency-aware pruners are limited to convolutions), and compression V2 is not ready yet. My initial thought was to first integrate all of this logic in one pruner (because it empirically performs better than one-shot pruning), and then factor it out when compression V2 is ready.
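To make the trade-off in this exchange concrete, the following is a small sketch of how a one-shot pruning step could be wrapped by an outer iterative loop that owns the finetuning. It is not NNI code: `one_shot_prune`, `iterative_prune`, `score_fn`, and `finetune_fn` are hypothetical names, heads are identified by plain integers, and lower scores are assumed to mean less important heads.

```python
from typing import Callable, Dict, Set


def one_shot_prune(
    scores: Dict[int, float], already_pruned: Set[int], n_to_prune: int
) -> Set[int]:
    """Return already_pruned plus the n_to_prune lowest-scoring remaining heads."""
    remaining = sorted(
        (h for h in scores if h not in already_pruned),
        key=lambda h: (scores[h], h),  # tie-break on head id for determinism
    )
    return already_pruned | set(remaining[:max(n_to_prune, 0)])


def iterative_prune(
    score_fn: Callable[[Set[int]], Dict[int, float]],
    finetune_fn: Callable[[Set[int]], None],
    total_to_prune: int,
    num_iterations: int,
) -> Set[int]:
    """Outer loop: prune a share of the heads, finetune, then re-score."""
    pruned: Set[int] = set()
    for it in range(num_iterations):
        # Cumulative target after this iteration, spread evenly.
        target = total_to_prune * (it + 1) // num_iterations
        scores = score_fn(pruned)          # importance given the current masks
        pruned = one_shot_prune(scores, pruned, target - len(pruned))
        finetune_fn(pruned)                # training stays outside the pruner
    return pruned
```

With `num_iterations=1` the loop reduces to a single one-shot call followed by one finetuning pass, which is roughly the split the reviewer proposes.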

xiaowu0162 added 16 commits June 29, 2021 18:45

local code sync

1a1172c

graph-based weight grouping

5f27c35

fix for pipeline

d960426

pipeline related

faedb0f

add activation-based maskers; refactor example

c62b9a1

minor fix

6877b64

change graph-based grouping logic

595864e

remove redundant code

bd7ff9f

Add taylor masker

b28725f

debug

80bdf06

debug

d5582dd

Add global sorting

0715a70

debug

d1e5d8d

debug

aece26a

Add iterative pruning

9cf94fe

debug

7c73fc8

xiaowu0162 and others added 11 commits July 7, 2021 11:25

Simplify API; add doc strings

79186e2

debug

690969a

docstring

9d34493

example v1

1e2329e

Merge branch 'microsoft:master' into bertpruner

7121051

doc skeleton

94e4804

doc update

a5a92d9

doc update

5399dd8

doc update

ec5cdf2

doc update

f747a9a

update

f70343c

xiaowu0162 marked this pull request as ready for review July 9, 2021 08:16

QuanluZhang mentioned this pull request Jul 9, 2021

NNI 2021 June~July Iteration Planning #3724

Closed

update ut

4d91a1e

xiaowu0162 requested a review from zheng-ningxin July 22, 2021 05:29