
Source: Toutiao. Author: 無遠不往

"Time-series analysis problems are everywhere in daily life. Time is continuous, and every second brings new changes."

01 — AAAI 2021: Informer, a new model for time-series forecasting

A recent paper accepted at AAAI 2021, "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", brings new light to time-series forecasting. A model for long sequence time-series forecasting (LSTF) has to meet two requirements: a strong ability to capture alignments across very long sequences, and the ability to handle long sequence inputs and outputs efficiently.

02 — Background and related problems

In recent years, the Transformer proposed by Google in 2017 has far outperformed traditional RNN models such as GRU and LSTM on long-sequence problems. Its advantage lies in the short signal-propagation path, since it avoids the recurrent structure of RNN-family networks. However, the model consumes a great deal of GPU compute and memory, and training it requires a large hardware investment, which makes it impractical for many real-world long-sequence forecasting applications. This efficiency problem is the bottleneck when applying the Transformer to LSTF, and the paper's research question is: can Transformer models be improved to be computation, memory, and architecture efficient, as well as maintain higher prediction capacity?

03 — Current challenges and solutions

The abstract first notes that long-sequence forecasting is an important task whose difficulty grows with the prediction horizon, i.e., accuracy gradually degrades as the horizon lengthens, so an effective forecaster would be a major step forward for current research. For the currently popular Transformer model, the challenges can be summarized in three points:

1. The quadratic computation of self-attention. The atom operation of the self-attention mechanism, namely the canonical dot-product, causes the time complexity and memory usage per layer to be O(L^2).
2. The memory bottleneck in stacking layers for long inputs. The stack of J encoder/decoder layers makes the total memory usage O(J·L^2), which limits the model's scalability on long sequence inputs.
3. The speed plunge in predicting long outputs. The dynamic decoding of the vanilla Transformer makes step-by-step inference as slow as an RNN-based model.

Most recent work focuses on the first problem, i.e., the computational complexity and memory usage of self-attention. To address all three points at once, the paper proposes a new forecasting model, Informer, with the following contributions:

1. We propose Informer to successfully enhance the prediction capacity in the LSTF problem, which validates the Transformer-like model's potential value to capture individual long-range dependency between long sequence time-series outputs and inputs.
2. We propose the ProbSparse self-attention mechanism to efficiently replace canonical self-attention; it achieves O(L log L) time complexity and O(L log L) memory usage.
3. We propose the self-attention distilling operation, which privileges dominating attention scores in J-stacking layers and sharply reduces the total space complexity to O((2 − ε) L log L).
4. We propose the generative-style decoder to acquire long sequence outputs with only one forward step, simultaneously avoiding cumulative error spreading during the inference phase.

The proposed framework is shown in Figure 1 (overall architecture of Informer). On the left, the encoder receives massive long sequence inputs (the green sequence); the canonical self-attention has been replaced by the proposed ProbSparse self-attention. The blue trapezoid is the self-attention distilling operation, which extracts the dominant attention and sharply reduces the network size; the stacked-layer replicas improve robustness. On the right, the decoder receives a long sequence input, pads the target elements with zeros, measures the weighted attention composition of the feature map, and instantly predicts the output elements (the orange sequence) in a generative style.

04 — Method and model architecture

The canonical way to compute self-attention starts from the input triple (query, key, value) and computes a weighted value for each query. The weighted value of the i-th query is

A(q_i, K, V) = Σ_j ( k(q_i, k_j) / Σ_l k(q_i, k_l) ) · v_j,  with k(q_i, k_j) = exp(q_i·k_j^T / √d),

which requires O(L_Q·L_K) memory and a quadratic number of dot-product computations; this is the main drawback of the vanilla Transformer. The paper then evaluates this mechanism empirically: the self-attention scores follow a long-tail distribution, i.e., only a few dot-products contribute most of the attention, while the contribution of the rest is small enough to be ignored. The key question is therefore how to distinguish the "sparse" queries from the rest. Following the KL divergence between a query's attention distribution and the uniform distribution, the sparsity of the i-th query can be evaluated as

M(q_i, K) = ln Σ_j exp(q_i·k_j^T / √d) − (1/L_K) Σ_j q_i·k_j^T / √d,

where the first term is the Log-Sum-Exp (LSE) of q_i over all keys and the second term is their arithmetic mean. If the i-th query obtains a larger M(q_i, K), its attention probability p is more "diverse" and is likely to contain the dominant dot-product pairs in the head of the long-tail self-attention distribution. However, traversing all queries with this measure requires computing every dot-product pair, i.e., a quadratic O(L_Q·L_K) cost, and the LSE operation has potential numerical-stability issues. Motivated by this, the paper proposes an approximate query sparsity measurement, the max-mean measurement:

M̄(q_i, K) = max_j { q_i·k_j^T / √d } − (1/L_K) Σ_j q_i·k_j^T / √d.

The ProbSparse self-attention is then A(Q, K, V) = Softmax(Q̄·K^T / √d)·V, where Q̄ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement. Controlled by a sampling factor c, we set u = c·ln L_Q, so that self-attention only needs to calculate O(ln L_Q) dot-products for each query-key lookup, and the layer memory usage stays at O(L_K·ln L_Q). In practice, M̄ is evaluated on a randomly sampled subset of the dot-product pairs, which keeps the whole mechanism within O(L log L).
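To make the max-mean measurement and the Top-u selection concrete, here is a minimal NumPy sketch (not the authors' implementation). The function name and the sampling_factor default are illustrative assumptions, and for simplicity the sketch scores every query against every key instead of using the random subsampling of dot-product pairs that keeps the real mechanism within O(L log L).

```python
import numpy as np

def probsparse_query_selection(queries, keys, sampling_factor=5):
    """Score queries with the max-mean measurement M_bar and keep the Top-u.

    queries: (L_Q, d) array, keys: (L_K, d) array.
    NOTE: this toy sketch computes the full (L_Q, L_K) score matrix;
    the paper instead samples only U = L_K * ln(L_Q) dot-product pairs
    so that the measurement itself stays sub-quadratic.
    """
    L_Q, d = queries.shape

    # Scaled dot-product scores q_i . k_j^T / sqrt(d), shape (L_Q, L_K).
    scores = queries @ keys.T / np.sqrt(d)

    # Max-mean sparsity measurement: M_bar(q_i, K) = max_j(score) - mean_j(score).
    m_bar = scores.max(axis=1) - scores.mean(axis=1)

    # Keep the Top-u "active" queries, u = c * ln(L_Q).
    u = min(L_Q, int(sampling_factor * np.ceil(np.log(L_Q))))
    top_u = np.argsort(m_bar)[-u:]  # indices of the most "diverse" queries

    # Attention is then computed only for these queries; in the paper the
    # remaining (lazy) queries fall back to the mean of the values V.
    return top_u, m_bar

# Tiny usage example with random data.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(96, 64)), rng.normal(size=(96, 64))
active, scores = probsparse_query_selection(Q, K)
print(len(active), "active queries out of", Q.shape[0])
```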
The encoder architecture is shown in Figure 2: (1) each horizontal stack represents one replica of the encoder; (2) the upper stack is the main stack, which receives the whole input sequence, while the second stack takes a half-length slice of the input; (3) the red layers are the dot-product matrices of the self-attention mechanism, and they shrink in a cascade through the self-attention distilling applied at each layer; (4) the feature maps of the two stacks are concatenated as the encoder's output.

4.1 Model input

Figure 3 shows Informer's input representation. The input embedding consists of three separate parts: a scalar projection, the local timestamp (positional) embedding, and the global timestamp embeddings (Minutes, Hours, Week, Month, Holiday, etc.).

4.2 Encoder

The encoder is designed to extract a robust long-range dependency representation from long sequential inputs. Self-attention distilling: as a natural consequence of the ProbSparse self-attention mechanism, the encoder's feature map has redundant combinations of the value V. The distilling operation privileges the superior, dominating features and builds a focused self-attention feature map in the next layer. As seen from the n-heads weight matrices of the attention blocks in Figure 2 (the overlapping red squares), it sharply trims the input's time dimension. The "distilling" procedure forwards from the j-th layer into the (j+1)-th layer as

X_{j+1} = MaxPool( ELU( Conv1d( [X_j]_AB ) ) ),

where [·]_AB contains the multi-head ProbSparse self-attention and the essential operations in the attention block, and Conv1d(·) performs a 1-D convolutional filter (kernel width = 3) on the time dimension with the ELU(·) activation function. A max-pooling layer with stride 2 down-samples X into its half slice after stacking a layer, which reduces the whole memory usage to O((2 − ε) L log L), where ε is a small number. To enhance the robustness of the distilling operation, halving replicas of the main stack are built, and the number of self-attention distilling layers is progressively decreased by dropping one layer at a time, like a pyramid in Figure 2, so that their output dimensions are aligned. (A minimal code sketch of one distilling block is given after Section 4.4.)

4.3 Decoder

The decoder in Figure 1 uses a standard structure, composed of a stack of two identical multi-head attention layers. For long-horizon prediction, however, generative inference is employed to alleviate the speed plunge. The decoder is fed with the following vector:

X_de = Concat(X_token, X_0) ∈ R^{(L_token + L_y) × d_model},

where X_token ∈ R^{L_token × d_model} is the start token and X_0 ∈ R^{L_y × d_model} is a placeholder for the target sequence (its scalar values are set to 0). Masked multi-head attention is applied in the ProbSparse self-attention computation by setting the masked dot-products to −∞.

4.4 Generative Inference

The start token is an efficient technique in NLP's "dynamic decoding", and the paper extends it in a generative way. Instead of choosing a specific flag as the token, an L_token-long sequence is sampled from the input, namely an earlier slice right before the output sequence. Taking the prediction of 168 points as an example (7-day temperature forecasting, Figure 1(b)), the known 5 days before the target sequence are taken as the start token, and the generative-style inference decoder is fed with X_de = {X_5d, X_0}. Here X_0 contains the target sequence's time stamps, i.e., the context of the target week.
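As promised in Section 4.2, below is a minimal PyTorch-style sketch of one distilling block: a Conv1d with kernel width 3 on the time dimension, an ELU activation, and a stride-2 max pooling that halves the sequence length. The tensor shapes and the class name are assumptions for illustration; the attention block [·]_AB that precedes it is omitted.

```python
import torch
import torch.nn as nn

class DistillingBlock(nn.Module):
    """One self-attention distilling step: Conv1d(k=3) -> ELU -> MaxPool(stride=2).

    Input/output shape: (batch, seq_len, d_model) -> (batch, seq_len // 2, d_model).
    """
    def __init__(self, d_model: int):
        super().__init__()
        # Conv1d works on (batch, channels, time), so we transpose around it.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)            # (batch, d_model, seq_len)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)         # (batch, seq_len // 2, d_model)

# Halving a length-96 feature map, as happens down the encoder stack.
x = torch.randn(8, 96, 512)
block = DistillingBlock(d_model=512)
print(block(x).shape)                    # torch.Size([8, 48, 512])
```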
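Staying with the generative inference just described, here is a small sketch of how the decoder input X_de = Concat(X_token, X_0) could be assembled: the last L_token steps of the known history become the start token, and L_y zero rows stand in for the points to be predicted (in the full model the placeholder still carries the known timestamp embeddings of the target window). The shapes and the build_decoder_input helper are illustrative assumptions, not the authors' code.

```python
import torch

def build_decoder_input(history: torch.Tensor, L_token: int, L_y: int) -> torch.Tensor:
    """Assemble X_de = Concat(X_token, X_0) along the time axis.

    history: (batch, L_history, d_model) sequence ending right before the
             forecast window; the last L_token steps become the start token,
             and L_y zero rows stand in for the unknown targets.
    """
    batch, _, d_model = history.shape
    x_token = history[:, -L_token:, :]                 # known earlier slice
    x_zero = torch.zeros(batch, L_y, d_model)          # placeholder targets
    return torch.cat([x_token, x_zero], dim=1)         # (batch, L_token + L_y, d_model)

# 7-day example from Fig. 1(b): feed the 5 known days, predict 168 hourly points.
history = torch.randn(4, 30 * 24, 512)                 # a month of hourly features
x_de = build_decoder_input(history, L_token=5 * 24, L_y=168)
print(x_de.shape)                                      # torch.Size([4, 288, 512])
```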
Note that the proposed decoder predicts all the outputs in one forward procedure and is free from the time-consuming "dynamic decoding" of the trivial encoder-decoder architecture. A detailed performance comparison is given in the paper's computation-efficiency section.

This concludes this paper walkthrough, thank you for reading! In the next issue we will continue to share academic papers and knowledge about intelligent algorithms and the law!