dyn-ticks ってどうなってるの - big-eyed-hamster’s diary

目標

いまいち dyn-ticks(Tickless kernel) についてよくわかっていなかったので、コードを読みながらまとめる. カーネルのバージョンは 2.6.27.8(古くてごめんなさい) で、CPU は i686(SMP環境)としておく. 話が行ったり来たりしますが、ご了承を.

ハードウェアタイマ

まず、すべてのハードウェアタイマに2つの動作モードがある:

周期モード
ワンショットモード

この2つをうまいこと使い分けるのがOSプログラマの腕の見せ所...のはずだったのだが、1世代前に流行っていたハードウェアタイマであるPIT(Programmable Interval Timer)はワンショットモードの精度が悪くて使い物にならなかったようだ.そのせいか、ちょっと古いOSになると、周期モードにしか対応していないOSがあるとか、ないとか. 周期モードを使うと、比較的OSの実装はシンプルになるだろうしね. 最近のイケてるハードウェア(後で出てくるHPETやLocal APIC Timer)を使うと、ワンショットモードも普通に(実用の範囲内で)動作するようだ.

HPET, RTC, ACPI Timer, PIT など

基本的に、ハードウェアタイマとCPUは割り込み線とIO-APICを介してつながっている.

IO-APIC は、割り込みを各CPUに配分する門番のような役割がある. タイマ割り込みに限らず、外部の割り込みは、外部ハードウェア→IO-APIC→Local APIC→CPUという順に伝わる. この割り込みは、ブロードキャストする場合と、特定のCPUにだけ伝える場合の2種類がある. x86 の場合、下位互換性の問題から、 PIT は IRQ 0 番がハードウェアタイマを結線されており、その他のハードウェアタイマが別の割り込み線を使うことが多い.

Local APIC Timer

ハードウェアタイマとCPUは、割り込み線を介してつながっていると書いた. しかし1つだけ例外がある. CPU に搭載されているLoal APIC Timer だ. Local APIC というのは、Local Advanced Programmable Interrupt Controller の略で、CPUのコア1つ1つにくっついている割り込みコントローラのことだ. こいつにタイマがくっついているのだ. 当然、周期モードとワンショットモードが存在する. CPU にくっついているだけあって、Local APIC Timer と CPU は物理メモリアドレス空間を通じてやりとりを行う.

clockevents フレームワーク

clockevents フレームワークは、アーキテクチャ依存になりがちがタイマのコードの重複をなくそうということで実装されたハードウェアタイマドライバを実装するためのフレームワークである. 特定の関数を実装して、構造体に入れておくと、そのハードウェアタイマのドライバが読み込まれた際に関数をコールバックし、タイマデバイスリストに登録する. それより上のレイヤの面倒は clockevents フレームワーク側が見てくれるという代物だ. ハードウェアタイマの節でも書いたが、ハードウェアタイマにはワンショットモードと周期モードがある. dyn-ticks kernel の実装にあたっては、前者が大きな役割を果たしてくる. clockevents フレームワークからデバイスドライバのワンショットモードを利用するには、clockevents_program_event 関数を呼ぶ.

Linux カーネルのメインループ、cpu_idle()

Linux カーネルは、ブート後すべての初期化が終わると、最終的には cpu_idle というアーキテクチャ固有のメインループに処理がおちつく. 今回想定している環境の場合、arch/x86/kernel/process_32.c で定義されている.

118 /*      
119  * The idle thread. There's no useful work to be
120  * done, so just try to conserve power and have a
121  * low exit latency (ie sit in a loop waiting for
122  * somebody to say that they'd like to reschedule)
123  */
124 void cpu_idle(void)
125 {
126     int cpu = smp_processor_id();
127         
128     current_thread_info()->status |= TS_POLLING;
129 
130     /* endless idle loop with no priority at all */
131     while (1) {
132         tick_nohz_stop_sched_tick(1);
133         while (!need_resched()) { 
                /* sleep ! */
144             local_irq_disable();
145             __get_cpu_var(irq_stat).idle_timestamp = jiffies;
146             /* Don't trace irqs off for idle */
147             stop_critical_timings();
148             pm_idle(); // -> acpi_processor_idle()
149             start_critical_timings();
150         }
151         tick_nohz_restart_sched_tick();
152         preempt_enable_no_resched();
153         schedule();
154         preempt_disable();
155     }
156 }

cpu_idle()の中では、

tick_nohz_stop_sched_tickを呼んで、指定した時間までtick(タイマ割り込みを発生させないようにする.
need_reschedマクロ(include/linux/sched.h)を使って、新たなタスクがないかチェック. なければCPUステートを休止にもっていく.
tick_nohz_restart_sched_tickを呼んで、tickを再開する.

といった処理を行っている. tick_nohz_stop_sched_tick()とtick_nohz_restart_sched_tick()が今回の記事の主役だが、これについては次節で詳しく説明する.

ここでのポイントは、休止中は割り込み全般を禁止している点. どうやって復帰するんだ！と思いきや、システムコールや割り込みハンドラの処理が終わったとき、もしくはwake_up_idle_cpu()呼び出し時に更新される need_resched フラグをチェックしてループから脱出するようだ.

dyn-ticksを追う

ここまでで、dyn-ticksを追うための前提知識について述べた. ここから、 dyn-ticks の実装について述べていく.
dyn-ticks の本体・tick_nohz_stop_sched_tick()とtick_nohz_restart_sched_tick()は、kernel/time/tick-sched.c で定義される.

197 /**
198  * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
199  *
200  * When the next event is more than a tick into the future, stop the idle tick
201  * Called either from the idle loop or from irq_exit() when an idle period was
202  * just interrupted by an interrupt which did not cause a reschedule.
203  */
204 void tick_nohz_stop_sched_tick(int inidle)
205 {
206     unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
207     struct tick_sched *ts;
208     ktime_t last_update, expires, now;
209     struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
210     int cpu;
211 
212     local_irq_save(flags);
213 
217 
218     /*
219      * If this cpu is offline and it is the one which updates
220      * jiffies, then give up the assignment and let it be taken by
221      * the cpu which runs the tick timer next. If we don't drop
222      * this here the jiffies might be stale and do_timer() never
223      * invoked.
224      */
225     if (unlikely(!cpu_online(cpu))) {
226         if (cpu == tick_do_timer_cpu)
227             tick_do_timer_cpu = TICK_DO_TIMER_NONE;
228     }
229 
230     if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
231         goto end;
235 
236     ts->inidle = 1;
237 
253     /* Read jiffies and the time when jiffies were updated last */
254     do {
255         seq = read_seqbegin(&xtime_lock);
256         last_update = last_jiffies_update;
257         last_jiffies = jiffies;
258     } while (read_seqretry(&xtime_lock, seq));
259 
260     /* Get the next timer wheel timer */
261     next_jiffies = get_next_timer_interrupt(last_jiffies);
262     delta_jiffies = next_jiffies - last_jiffies;
263 
264     if (rcu_needs_cpu(cpu))
265         delta_jiffies = 1;
266     /*
267      * Do not stop the tick, if we are only one off
268      * or if the cpu is required for rcu
269      */
270     if (!ts->tick_stopped && delta_jiffies == 1)
271         goto out;
272 
273     /* Schedule the tick, if we are at least one jiffie off */
274     if ((long)delta_jiffies >= 1) {
275 
276         if (delta_jiffies > 1)
277             cpu_set(cpu, nohz_cpu_mask);
278         /*
279          * nohz_stop_sched_tick can be called several times before
280          * the nohz_restart_sched_tick is called. This happens when
281          * interrupts arrive which do not cause a reschedule. In the
282          * first call we save the current tick time, so we can restart
283          * the scheduler tick in nohz_restart_sched_tick.
284          */
285         if (!ts->tick_stopped) {
286             if (select_nohz_load_balancer(1)) {
287                 /*
288                  * sched tick not stopped!
289                  */
290                 cpu_clear(cpu, nohz_cpu_mask);
291                 goto out;
292             }
293 
294             ts->idle_tick = ts->sched_timer.expires;
295             ts->tick_stopped = 1;
296             ts->idle_jiffies = last_jiffies;
297             rcu_enter_nohz();
298         }
299 
300         /*
301          * If this cpu is the one which updates jiffies, then
302          * give up the assignment and let it be taken by the
303          * cpu which runs the tick timer next, which might be
304          * this cpu as well. If we don't drop this here the
305          * jiffies might be stale and do_timer() never
306          * invoked.
307          */
308         if (cpu == tick_do_timer_cpu)
309             tick_do_timer_cpu = TICK_DO_TIMER_NONE;
310 
311         ts->idle_sleeps++;
312 
313         /*
314          * delta_jiffies >= NEXT_TIMER_MAX_DELTA signals that
315          * there is no timer pending or at least extremly far
316          * into the future (12 days for HZ=1000). In this case
317          * we simply stop the tick timer:
318          */
319         if (unlikely(delta_jiffies >= NEXT_TIMER_MAX_DELTA)) {
320             ts->idle_expires.tv64 = KTIME_MAX;
321             if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
322                 hrtimer_cancel(&ts->sched_timer);
323             goto out;
324         }
325 
326         /*
327          * calculate the expiry time for the next timer wheel
328          * timer
329          */
330         expires = ktime_add_ns(last_update, tick_period.tv64 *
331                        delta_jiffies);
332         ts->idle_expires = expires;
333 
334         if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
335             hrtimer_start(&ts->sched_timer, expires,
336                       HRTIMER_MODE_ABS);
337             /* Check, if the timer was already in the past */
338             if (hrtimer_active(&ts->sched_timer))
339                 goto out;
340         } else if (!tick_program_event(expires, 0))
341                 goto out;
342         /*
343          * We are past the event already. So we crossed a
344          * jiffie boundary. Update jiffies and raise the
345          * softirq.
346          */
347         tick_do_update_jiffies64(ktime_get());
348         cpu_clear(cpu, nohz_cpu_mask);
349     }
350     raise_softirq_irqoff(TIMER_SOFTIRQ);
351 out:
352     ts->next_jiffies = next_jiffies;
353     ts->last_jiffies = last_jiffies;
354     ts->sleep_length = ktime_sub(dev->next_event, now);
355 end:
356     local_irq_restore(flags);
357 }

ポイントは、get_next_timer_event()で得た次のタイマイベントを、hrtimer_start(), もしくは tick_program_event()へ渡しているところである.

get_next_timer_event()関数は、「次のタイマイベント=現在リストの中に存在するタイマの中で、最も近い未来に点火するタイマは、いつ点火するか」をjiffiesの値で手にいれてくるという見ていて大変楽しげな関数である(kernel/timer.c). Linuxで実装されているカーネルタイマはいずれもリストで走査できるようになっているので、割と簡単にこの値を手に入れてくることができる. ちなみに、ここで手に入る next_jiffies の値は、カーネルコンフィグで設定したHZの値よりも大きくなり得ることに注意. これによって、消費電力を抑えることができる. 利用するときになったらアップデートしよう、という遅延評価のアプローチである.

さて、ここで手にいれた値を hrtimer_start(),もしくはtick_program_event() に渡す. 後者は、内部で clockevents_program_event 関数を呼ぶ. つまり、次のタイマ割り込みがnext_jiffiesになるまで発生しないようにする. 前者はどうだろう? 実は、hrtimer_startの中では結局 tick_program_event を呼ぶ. 今回入れようとしているタイマイベントは最も小さい値であることは分かっているから、次のタイマ割り込みが発生する直前にsched_timer を点火しよう、という作戦だ.

sched_timer というのは何だろう? これは、CPU毎のワンショットタイマでグローバル割り込みをエミュレーションするためのコードだ. dyn-ticks では、グローバルタイマ割り込みが働かないため、どこかで明示的にグローバルタイマ割り込みのハンドラを呼ばないと、jiffiesやxtimeがアップデートできない. それを行うのが sched_timer で、コールバックされる関数はtick_sched_timer()だ. tick_do_update_jiffies64 は、前回sched_timerが点火した時間からの差がtick_period(カーネルコンフィグで設定したHZの値)よりも大きければ、jiffiesの更新を行う.

最後に、tick_program_event の中身を見てみよう. この関数は kernel/time/tick-oneshot.c で定義されている.
これ自体はtick_dev_program_eventを呼び出しているだけなので、そちらを見てみる.

 65 /**                                                                                           
 66  * tick_program_event                                                                         
 67  */                                                                                           
 68 int tick_program_event(ktime_t expires, int force)                                            
 69 {                                                                                             
 70     struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;                   
 71                                                                                               
 72     return tick_dev_program_event(dev, expires, force);                                       
 73 }

でた！clockevents_program_event!

 25 /**
 26  * tick_program_event internal worker function
 27  */
 28 int tick_dev_program_event(struct clock_event_device *dev, ktime_t expires,
 29                int force)
 30 {                                                                                             
 31     ktime_t now = ktime_get();                                                                
 32     int i;                                                                                    
 33                                                                                               
 34     for (i = 0;;) {                                                                           
 35         int ret = clockevents_program_event(dev, expires, now);                               
 36                                                                                               
 37         if (!ret || !force)                                                                   
 38             return ret;                                                                       
 39                                                                                               
 40         /*                                                                                    
 41          * We tried 2 times to program the device with the given                              
 42          * min_delta_ns. If that's not working then we double it                              
 43          * and emit a warning.                                                                
 44          */                                                                                   
 45         if (++i > 2) {                                                                        
 46             /* Increase the min. delta and try again */                                       
 47             if (!dev->min_delta_ns)                                                           
 48                 dev->min_delta_ns = 5000;                                                     
 49             else                                                                              
 50                 dev->min_delta_ns += dev->min_delta_ns >> 1;                                  
 51                                                                                               
 52             printk(KERN_WARNING                                                               
 53                    "CE: %s increasing min_delta_ns to %lu nsec\n",                            
 54                    dev->name ? dev->name : "?",                                               
 55                    dev->min_delta_ns << 1);                                                   
 56                                                                                               
 57             i = 0;                                                                            
 58         }                                                                                     
 59                                                                                               
 60         now = ktime_get();                                                                    
 61         expires = ktime_add_ns(now, dev->min_delta_ns);                                       
 62     }                                                                                         
 63 }                                                                                             
 64

まとめ

ハードウェアの話から入って、dyn-ticks の話をまとめてみました. 結構な長文になってしまいましたね :P

さて、分かってるかのように書いてますが、1つ疑問があります. cpu_idle でチェックしている、need_resched() が更新されるタイミングです. 全部のCPUがこのループに入って、誰も起こさない状態になってしまう(=デッドロックしてしまう)ことはないんだろうか...?どなたか、ご存知の方がいれば教えて頂ければと思います. ツッコミお待ちしております.

目標