17.3. Position-wise Feed-Forward Networks (FFN)
Section 5.3 of the survey paper highlights several lines of research aimed at improving the FFN's performance:
- Replacing the activation function (see the first sketch below):
  - Ramachandran et al., 2018 proposed replacing ReLU with the Swish activation, $x \cdot \text{sigmoid}(\beta x)$.
  - GPT (Radford et al., 2018) opted for the Gaussian Error Linear Unit (GELU) instead of ReLU.
- Replacing the FFN entirely (see the second sketch below):
  - Lample et al., 2019 introduced product-key memory layers as an alternative to the FFN.
  - GShard (Lepikhin et al., 2020) explored a sparsely-gated Mixture-of-Experts (MoE) layer in place of the FFN.
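
The first sketch below illustrates the activation-swapping idea: a plain position-wise FFN where ReLU, Swish ($x \cdot \text{sigmoid}(\beta x)$), and a GELU approximation are interchangeable. This is a minimal NumPy illustration with made-up dimensions and function names, not the implementation used in any of the cited papers; the GELU shown is the common tanh approximation rather than the exact Gaussian CDF form.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def swish(x, beta=1.0):
    # Swish (Ramachandran et al.): x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

def gelu(x):
    # Tanh approximation of GELU, as commonly used in GPT-style models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, b1, w2, b2, activation=relu):
    # Position-wise FFN: the same two-layer MLP is applied to every position (row) of x.
    # x: (seq_len, d_model), w1: (d_model, d_ff), w2: (d_ff, d_model).
    return activation(x @ w1 + b1) @ w2 + b2

# Swapping the activation leaves the rest of the layer untouched.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.standard_normal((seq_len, d_model))
w1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
for act in (relu, swish, gelu):
    print(act.__name__, ffn(x, w1, b1, w2, b2, activation=act).shape)
```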
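The second sketch shows the "replace the FFN" idea with a toy sparsely-gated MoE layer using top-2 routing, in the spirit of GShard. It is a simplified assumption-laden illustration: the real GShard adds expert capacity limits, an auxiliary load-balancing loss, and sharding of experts across devices, none of which is modeled here, and all names and dimensions are invented for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_ffn(x, gate_w, experts, top_k=2):
    # x: (tokens, d_model); gate_w: (d_model, n_experts);
    # experts: list of (w1, b1, w2, b2) tuples, each a small ReLU FFN.
    scores = softmax(x @ gate_w)                      # (tokens, n_experts) routing probabilities
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(scores[t])[-top_k:]          # indices of the top-k experts for this token
        weights = scores[t, top] / scores[t, top].sum()
        for w, e in zip(weights, top):
            w1, b1, w2, b2 = experts[e]
            h = np.maximum(x[t] @ w1 + b1, 0.0)       # expert forward pass (a small FFN)
            out[t] += w * (h @ w2 + b2)               # combine expert outputs by gate weight
    return out

# Usage with 4 experts; each token is routed to its 2 highest-scoring experts.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts, tokens = 8, 16, 4, 5
x = rng.standard_normal((tokens, d_model))
gate_w = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)), np.zeros(d_ff),
            rng.standard_normal((d_ff, d_model)), np.zeros(d_model))
           for _ in range(n_experts)]
print(moe_ffn(x, gate_w, experts).shape)  # (5, 8)
```

Because only top_k experts run per token, total parameters grow with the number of experts while per-token compute stays roughly constant, which is the motivation for conditional computation in GShard.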
References
- Searching for Activation Functions (v1: 6.Oct.2017, v2: 27.Oct.2017)
- GPT: Improving Language Understanding by Generative Pre-Training
- Large Memory Layers with Product Keys (v1: 10.Jul.2019, v2: 16.Dec.2019)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (30.Jun.2020)