Init ratio for arch param

Code

# lenna_net_super.py
def init_arch_params(self, init_type='normal', init_ratio=1e-3):
    for param in self.architecture_parameters():
        if init_type == 'normal':
            param.data.normal_(0, init_ratio)
        elif init_type == 'uniform':
            param.data.uniform_(-init_ratio, init_ratio)
        else:
            raise NotImplementedError

# mix_op.py
@property
def probs_over_ops(self):
    probs = F.softmax(self.AP_path_alpha, dim=0)  # softmax to probability
    return probs

Arch param probabilities by init ratio

init_ratio=1e-3 (default)

logging

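→ The smaller the init ratio, the smaller the spread of the arch params, so after softmax most ops end up with nearly the same probability. In the example above (logging), every value was 0.9x, which means it is still unknown which binary gate will open, so latency can only converge slowly.

Had the default init ratio been used as-is, the arch params would all have taken nearly the same values. In other words, training would only learn latency changes caused by b_type or in_ch.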

init_ratio=100

logging

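→ The larger the init ratio, the larger the spread of the arch param values, and after softmax the spread of the probabilities grows as well. The binary gate that will open is therefore mostly determined already, and latency can be seen to converge quickly.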
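The contrast between the two settings can be reproduced without the supernet. The following is a minimal stdlib-only sketch, not the project code: `softmax` and `sample_probs` are hypothetical helpers, and the number of candidate ops (6) and the seed are assumptions chosen for illustration. It draws arch params from N(0, init_ratio), as the `'normal'` branch of `init_arch_params` does, and applies a softmax.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_probs(init_ratio, num_ops=6, seed=0):
    """Draw arch params from N(0, init_ratio) and return their softmax.

    num_ops and seed are illustrative assumptions, not values from the notes.
    """
    rng = random.Random(seed)
    alphas = [rng.gauss(0.0, init_ratio) for _ in range(num_ops)]
    return softmax(alphas)


for ratio in (1e-3, 100.0):
    probs = sample_probs(ratio)
    # Report how far apart the largest and smallest probabilities are.
    print(f"init_ratio={ratio:g}: max={max(probs):.4f}, min={min(probs):.4f}")
```

With a small init ratio every probability stays close to 1/num_ops, while a large init ratio tends to put most of the mass on a single op.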

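<aside> 💡 In other words, the init ratio can be used to reflect how far training has progressed. (A larger init ratio corresponds to a more trained state in which the choice is almost decided.)

So is it important to choose a suitably balanced init ratio?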

</aside>

Find proper range of init ratio

code
init_ratio 50

Indices 0-14 refer to the mixed edges; for each one, the distribution of the arch params of the ops inside that mixed edge can be inspected.

count is the number of candidate ops in the mixed edge.
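A per-edge summary of this kind could be produced roughly as follows. This is a hedged sketch rather than the code used for the logs: `summarize_edges` is a hypothetical helper, the arch params are simulated with `random.gauss` instead of being read from the network, and the number of candidate ops per edge (6) is an assumption.

```python
import random
import statistics


def summarize_edges(edge_alphas):
    """Return one summary row (count, mean, std, min, max) per mixed edge."""
    rows = []
    for idx, alphas in enumerate(edge_alphas):
        rows.append({
            "edge": idx,
            "count": len(alphas),               # number of candidate ops
            "mean": statistics.fmean(alphas),
            "std": statistics.stdev(alphas),    # sample standard deviation
            "min": min(alphas),
            "max": max(alphas),
        })
    return rows


rng = random.Random(0)
init_ratio = 50.0
num_edges, num_ops = 15, 6  # 15 mixed edges (0-14); num_ops is an assumption

# Simulated arch params: one list of values per mixed edge.
edge_alphas = [
    [rng.gauss(0.0, init_ratio) for _ in range(num_ops)]
    for _ in range(num_edges)
]

summary = summarize_edges(edge_alphas)
for row in summary:
    print(f"{row['edge']:>2} count={row['count']} mean={row['mean']:8.2f} "
          f"std={row['std']:8.2f} min={row['min']:8.2f} max={row['max']:8.2f}")
```

Rerunning the sketch with `init_ratio` set to 20 or 10 gives a quick sense of how the per-edge spread scales with the setting.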

init_ratio 20

init_ratio 10