

## Speculative Execution HW BUGs, Virtualization & Other Things...

Dario Faggioli <<u>dfaggioli@suse.com</u>>, Software Engineer - Virtualization Specialist, **SUSE**

GPG: 4B9B 2C3A 3DD5 86BD 163E 738B 1642 7889 A5B8 73EE

<u>https://about.me/dario.faggioli</u>, <u>https://www.linkedin.com/in/dfaggioli/</u>, <u>https://twitter.com/DarioFaggioli</u> (@DarioFaggioli)

Myself, my Company, what we'll cover today...

#### About myself: Work

- Computer Engineering (Ing. Inf.) @ UniPI
  - B.Sc. (2004) "Realizzazione di primitive e processi esterni per la gestione della memoria di massa" [Implementation of primitives and external processes for mass storage management] (Adv.s: Prof. G. Frosini, Prof. G. Lettieri)
  - M.Sc. (2007) "Implementation and Study of the BandWidth Inheritance protocol in the Linux kernel" (Adv.s: Prof. P. Ancilotti, Prof. G. Lipari)
- Ph.D. on Real-Time Scheduling @ <u>ReTiS Lab</u>, <u>SSSUP</u>, Pisa; co-authored SCHED\_DEADLINE, now in mainline Linux
- Senior Software Engineer @ <u>Citrix</u>, 2011; contributor to the Xen Project, maintainer of Xen's scheduler
- Virtualization Software Engineer @ <u>SUSE</u>, 2018; still Xen, but also KVM, QEMU, Libvirt. Focus on performance evaluation & improvement
- <u>https://about.me/dario.faggioli</u>, <u>https://dariofaggioli.wordpress.com/about/</u>

#### About my Company: SUSE

- We're one of the oldest Linux companies (1992!)
- We're the "open, Open Source company"
- We like changing our name: S.u.S.E.  $\rightarrow$  SuSE  $\rightarrow$  SUSE
- We make <u>music parodies</u>
- Our motto is: "Have a lot of fun!"

Academic program: <u>suse.com/academic/</u>

We're (~always) hiring: <u>suse.com/company/careers</u>



#### Spectre, Meltdown & Friends

- **Spectre v1** Bounds Check Bypass
- Spectre v2 Branch Target Injection
- **Meltdown** Rogue Data Cache Load (a.k.a. Spectre v3)
- Spectre v3a Rogue System Register Read
- **Spectre v4** Speculative Store Bypass
- LazyFPU Lazy Floating Point State Restore
- L1TF L1 Terminal Fault (a.k.a. Foreshadow)
- MDS Microarch. Data Sampling (a.k.a. Fallout, ZombieLoad, ...)

Will cover: Meltdown. *Maybe* Spectre. *Maybe* L1TF.

Stop me and ask (or ask at the end, or ask offline).

Spotted a mistake? Do not hesitate to point it out... Thanks! ;-)

## CPU, Memory, Caches, Pipelines, Speculative Execution...

#### **CPU**, Memory

CPUs are fast, memory is slow



#### CPU, Memory, Cache(s)

CPUs are fast, memory is slow

- Cache == fast memory
- But we can't use it as main memory:
  - takes a lot of space on a chip
  - costs a lot of money
  - consumes a lot of power
  - ...
- On the CPU chip
  - takes most of the space in today's CPUs, actually
- It's not cache, it's cache**s** 
  - there's more than 1;
  - organized hierarchically

[Figure: machine topology diagram (sockets, NUMA nodes, L3/L2/L1 caches, cores), as rendered by hwloc]

Portable Hardware Locality (hwloc)



#### CPU, Memory, Cache(s)

Portable Hardware Locality (hwloc)

Machine (32GB)

| Socket P#0 (16GB)               |                                   |                                   |                                   |
|---------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| NUMANode P#0 (8192MB)           |                                   |                                   |                                   |
| L3 (8192KB)                     |                                   |                                   |                                   |
| L2 (2048KB)                     | L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       |
| L1i (64KB)                      | L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        |
| L1d (16KB) L1d (16KB)           | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             |
| Core P#0 Core P#1 PU P#0 PU P#1 | Core P#2 Core P#3 PU P#2 PU P#3   | Core P#4 Core P#5 PU P#4 PU P#5   | Core P#6 Core P#7 PU P#6 PU P#7   |
| NUMANode P#1 (8192MB)           |                                   |                                   |                                   |
| L3 (8192KB)                     |                                   |                                   |                                   |
| L2 (2048KB)                     | L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       |
| L1i (64KB)                      | L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        |
| L1d (16KB) L1d (16KB)           | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             |
| Core P#0 Core P#1 PU P#8 PU P#9 | Core P#2 Core P#3 PU P#10 PU P#11 | Core P#4 Core P#5 PU P#12 PU P#13 | Core P#6 Core P#7 PU P#14 PU P#15 |

| Socket P#1 (16GB)                 |                                   |                                   |                                   |
|-----------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| NUMANode P#2 (8192MB)             |                                   |                                   |                                   |
| L3 (8192KB)                       |                                   |                                   |                                   |
| L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       |
| L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        |
| L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             |
| Core P#0 Core P#1 PU P#16 PU P#17 | Core P#2 Core P#3 PU P#18 PU P#19 | Core P#4 Core P#5 PU P#20 PU P#21 | Core P#6 Core P#7 PU P#22 PU P#23 |
| NUMANode P#3 (8192MB)             |                                   |                                   |                                   |
| L3 (8192KB)                       |                                   |                                   |                                   |
| L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       | L2 (2048KB)                       |
| L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        | L1i (64KB)                        |
| L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             | L1d (16KB) L1d (16KB)             |
| Core P#0 Core P#1 PU P#24 PU P#25 | Core P#2 Core P#3 PU P#26 PU P#27 | Core P#4 Core P#5 PU P#28 PU P#29 | Core P#6 Core P#7 PU P#30 PU P#31 |

## Cache(s): how much *faster* are we talking about?

i7-6700 Skylake 4.0 GHz access latencies:

- L1 cache: **4 cycles** (direct pointer)
- L1 cache: 5 cycles (complex addr. calculations)
- L2 cache: 11 cycles
- L3 cache: 39 cycles
- Memory: 100 ~ 200 cycles
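To relate the cache cycle counts above to wall-clock time, divide by the clock frequency. A minimal sketch, assuming the 4.0 GHz figure quoted for the i7-6700 (the helper name `cycles_to_ns` is made up for illustration):

```python
# Convert a latency expressed in CPU cycles into nanoseconds.
CLOCK_GHZ = 4.0  # clock frequency quoted on the slide

def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    # One cycle lasts 1/f seconds; with f in GHz that is 1/f nanoseconds.
    return cycles / clock_ghz

cache_latencies_cycles = {"L1": 4, "L2": 11, "L3": 39}
for name, cycles in cache_latencies_cycles.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
```

At 4 GHz, a 4-cycle L1 hit takes about 1 ns, which is in line with the rounded figures in the next list.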



Latency Numbers Every Programmer Should Know (do check this website!):

- L1 cache: 1 ns
- L2 cache: 4 ns
- Memory: 100 ns
- Tx 2 KB, 1 Gbps network: 20,000 ns (20 µs)
- SSD random read: 150,000 ns (150 µs)
- Rotational disk seek: 10,000,000 ns (10 ms)

#### Cache(s): how much faster are we talking about?

Real-life parallel (if 1 CPU cycle took 1 second):

| Event               | Latency   | Scaled          |
|---------------------|-----------|-----------------|
| 1 CPU cycle         | 0.3 ns    | 1 s             |
| Level 1 cache       | 0.9 ns    | **3 s**         |
| Level 2 cache       | 2.8 ns    | 9 s             |
| Level 3 cache       | 12.9 ns   | 43 s            |
| Memory              | 120 ns    | **> 6 min**     |
| SSD I/O             | 50-150 µs | 2-6 days        |
| Rotational disk I/O | 1-10 ms   | 1-12 months     |
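The scaled figures above come from stretching time so that one 0.3 ns CPU cycle lasts one second. A rough sketch of that arithmetic follows; the function names and the 30-day month are assumptions made here for illustration, so results only approximately match the rounded values on the slide.

```python
# Scale real latencies so that one CPU cycle (0.3 ns) lasts one second.
CYCLE_NS = 0.3

def scaled_seconds(latency_ns, cycle_ns=CYCLE_NS):
    # Seconds the latency would take if one cycle took 1 s.
    return latency_ns / cycle_ns

def humanize(seconds):
    # Pick a readable unit; a month is taken as 30 days (an assumption).
    for unit, size in (("months", 30 * 86400), ("days", 86400), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} s"

latencies_ns = {
    "1 CPU cycle": 0.3,
    "Level 1 cache": 0.9,
    "Level 2 cache": 2.8,
    "Level 3 cache": 12.9,
    "Memory": 120,
    "SSD I/O (150 us)": 150_000,
    "Rotational disk I/O (10 ms)": 10_000_000,
}
for name, ns in latencies_ns.items():
    print(f"{name}: {humanize(scaled_seconds(ns))}")
```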



Oh, and do check out this video too!

#### Caches: how do they work

- Address: split into [Index, Tag]
- Lookup Index: gives you one or more tags ⇒ match your Tag








| Set # | LRU | V<sub>0</sub> | D<sub>0</sub> | Tag<sub>0</sub> | Data<sub>0</sub> | V<sub>1</sub> | D<sub>1</sub> | Tag<sub>1</sub> | Data<sub>1</sub> |
|-------|-----|---------------|---------------|-----------------|------------------|---------------|---------------|-----------------|------------------|
| 0:    | 0   | 0             | 0             |                 |                  | 0             | 0             |                 |                  |
| → 1:  | 1   | 1             | 1             | 11110           |                  | 1             | 0             | 00110           |                  |
| 2:    | 0   | 0             | 0             |                 |                  | 0             | 0             |                 |                  |
| 3:    | 0   | 0             | 0             |                 |                  | 0             | 0             |                 |                  |
| 4:    | 0   | 0             | 0             |                 |                  | 0             | 0             |                 |                  |
| ...   |     |               |               |                 |                  |               |               |                 |                  |
| 63:   | 0   | 0             | 0             |                 |                  | 0             | 0             |                 |                  |
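The lookup described above (use the index to pick a set, then compare tags) can be sketched as a toy 2-way, 64-set cache like the one in the table. This is only an illustration: the 64-byte line size, the class and function names are assumptions, and valid/dirty bits and the data itself are left out.

```python
# Toy 2-way set-associative cache with 64 sets and LRU replacement.
# The line size (64 bytes) is an assumption; the table only fixes sets and ways.
NUM_SETS = 64
NUM_WAYS = 2
LINE_BYTES = 64

def split_address(addr):
    # Drop the offset within the line, then split the rest into (tag, index).
    line = addr // LINE_BYTES
    return line // NUM_SETS, line % NUM_SETS

class Cache:
    def __init__(self):
        # Each set holds up to NUM_WAYS tags, least recently used first.
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        tag, index = split_address(addr)
        ways = self.sets[index]
        hit = tag in ways
        if hit:
            ways.remove(tag)   # re-appended below as most recently used
        elif len(ways) == NUM_WAYS:
            ways.pop(0)        # set is full: evict the least recently used tag
        ways.append(tag)
        return hit

cache = Cache()
stride = NUM_SETS * LINE_BYTES  # addresses this far apart map to the same set
results = [cache.access(a) for a in (0, 0, stride, 2 * stride, 0)]
print(results)
```

The last access to address 0 misses again because two other lines mapping to the same set pushed it out, which is exactly the effect cache side channels rely on.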

#### CPU, Memory, Cache(s), TLB(s)

CPUs are fast, memory is slow

- Even with caches

## CPU, Memory, Cache(s), TLB(s)

Virtual Memory

- Address: virtual ⇒ physical, translated <del>via a table</del>
- ... via a set of tables (we want it sparse!)
- Address split:
   [L4off,L3off,L2off,L1off,off]
- Page Table:
  - Set up by CPU within MMU
  - Translation done by MMU, walking the page table
  - A walk for each memory reference?
     No! No! No!
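As a rough sketch of the address split above, here is how a virtual address could be cut into four table indexes plus a page offset. The field widths (9 bits per level, 12-bit offset) are those of x86-64 4-level paging with 4 KiB pages, which is an assumption here; the slide only names the fields, and `split_virtual_address` is a made-up helper.

```python
# Split a 48-bit virtual address into [L4off, L3off, L2off, L1off, off].
# Widths assume x86-64 4-level paging with 4 KiB pages (9+9+9+9+12 bits).
OFFSET_BITS = 12
INDEX_BITS = 9
LEVELS = 4

def split_virtual_address(vaddr):
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indexes = []
    for level in range(LEVELS):  # level 0 is L1 (lowest), level 3 is L4 (top)
        shift = OFFSET_BITS + level * INDEX_BITS
        indexes.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    l1, l2, l3, l4 = indexes
    return l4, l3, l2, l1, offset

# Build an address from known fields and split it back.
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(vaddr))
```

Each index selects one entry in the table at that level, so a full walk costs one memory access per level, which is why a walk on every memory reference would be far too slow.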



## CPU, Memory, Cache(s), TLB(s)

Hierarchy of TLBs

- Instruction L1 TLB
- Data L1 TLB
- I+D L2 TLB (called STLB)

Translation Lookaside Buffer (TLB)

- A cache for virtual address translations
- On memory reference, check TLB:
  - Hit: we saved a page table walk!
  - Miss: page table walk needed...

Latency:

- TLB hit: ~ cache hit, 4 cycles / 4 ns
- Page Table Walk: 4~5 memory accesses, 100 cycles / 100 ns each!
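To get a feeling for why the TLB hit rate matters, the latency figures above can be combined into an average translation cost. This is a back-of-the-envelope sketch: it assumes 4 memory accesses per walk (the slide says 4~5), charges a miss only for the walk itself, and the function name is hypothetical.

```python
# Average address-translation cost, in cycles, for a given TLB hit rate.
TLB_HIT_CYCLES = 4       # from the slide
WALK_ACCESSES = 4        # the slide says 4~5; 4 is assumed here
MEM_ACCESS_CYCLES = 100  # from the slide

def avg_translation_cycles(hit_rate):
    # Weighted average of a TLB hit and a full page table walk.
    walk_cycles = WALK_ACCESSES * MEM_ACCESS_CYCLES
    return hit_rate * TLB_HIT_CYCLES + (1.0 - hit_rate) * walk_cycles

for rate in (1.0, 0.99, 0.90):
    print(f"hit rate {rate:.2f}: {avg_translation_cycles(rate):.2f} cycles")
```

Under these assumptions, even a 1% miss rate roughly doubles the average cost of a translation.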



#### **Superscalar Architectures**

CPUs are fast, memory is slow

- Even with caches
- Even with TLBs

#### **Superscalar Architectures**

CPU executing an instruction:



- **F:** fetch the instruction from memory/cache
- **D:** decode instruction. E.g., 01101101b == ADD %eax,\*(%ebx)
- **E:** execute instruction: do it. E.g., do the add, in the CPU's ALU, etc.
- **W:** write result back: update actual registers & caches/memory locations

#### **Superscalar Architectures**

CPU executing multiple instructions:



One after the other... ... slow! :-/

## Superscalar Architectures: pipelining

#### **In-order** execution, pipelined

- 0: Four instructions are waiting to be executed
- 1: green enters pipeline (e.g., IF)
- ...
- 4: pipeline full
   4 stages ⇒ 4 instructions in flight
- 5: green completed
- ...
- 8: all completed



Wikipedia: Instruction Pipelining
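The clock numbers in the walkthrough above can be reproduced with a few lines of arithmetic. A minimal sketch, using the slide's numbering (the first instruction enters at clock 1 and is shown as completed on the clock after its last stage); the function names are made up for illustration, and stalls or hazards are ignored.

```python
# Completion clocks for a 4-stage pipeline, using the slide's clock numbering.
STAGES = 4

def pipelined_completion_clock(i, stages=STAGES):
    # Instruction i (0-based) enters the first stage at clock i + 1, moves one
    # stage per clock, and is shown as completed on the following clock.
    return (i + 1) + stages

def sequential_completion_clock(i, stages=STAGES):
    # Without pipelining, instruction i starts only after the i previous
    # instructions have gone through all the stages.
    return i * stages + stages + 1

for i, colour in enumerate(("green", "second", "third", "fourth")):
    print(colour, pipelined_completion_clock(i), sequential_completion_clock(i))
```

With four instructions, the pipelined version has everything completed at clock 8, while running them one after the other would take until clock 17.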




#### Superscalar Architectures: n-th issue<sup>\*</sup>

Double the game, <del>double ILP</del>, increase ILP



4 pipeline stages, 8 instructions in flight:

1. Instr. A, instr. B: write-back
2. E, F: execute
3. I, L: decode
4. **F**, **P**: fetch

(... ... theoretically)

#### Superscalar Architectures: n-th issue

Double the game, <del>double ILP</del>, increase ILP



4 pipeline stages, 8 instructions in flight:

1. Instr. A, instr. B: write-back
2. E, F: execute
3. I, L: decode
4. **F**, **P**: fetch

But don't go too far... or you'll get <u>Itanium</u>! (<u>Explicitly</u> parallel instruction computing / <u>Very long instruction word</u>)
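The "8 instructions in flight" figure, and the best-case speed-up from issuing more than one instruction per clock, follow from simple arithmetic. The sketch below is an idealized upper bound only: it assumes instructions never depend on each other and never stall, and the function names are hypothetical.

```python
import math

def max_in_flight(stages=4, issue_width=2):
    # Every stage can hold issue_width instructions at once.
    return stages * issue_width

def ideal_clocks(n_instructions, stages=4, issue_width=2):
    # Instructions enter in groups of issue_width, one group per clock;
    # the last group needs stages clocks to drain through the pipeline.
    groups = math.ceil(n_instructions / issue_width)
    return stages + groups - 1

print(max_in_flight())                  # 2-wide, 4 stages
print(ideal_clocks(8, issue_width=1))   # plain pipeline
print(ideal_clocks(8, issue_width=2))   # dual issue
```

Real code rarely reaches this bound, since dependent instructions cannot be issued together; that is the gap between doubling the width and merely increasing ILP.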

#### Superscalar Architectures: deeper pipes

#### The *smaller* the **stage**, the *faster* the **clock** can run

- 486 (1989), 3 stages, 100 MHz
- P5 [Pentium] (1993), 5 stages, 300 MHz
- P6 [Pentium Pro, Pentium II, Pentium III] (1995-1999), 12-14 stages, 450 MHz-1.4 GHz
- NetBurst, Prescott [Pentium4] (2000-2004), 20-31 stages, 2.0-3.8 GHz
- Core (2006), 12 stages, 3.0 GHz
- Nehalem (2008), 20 stages, 3.6 GHz
- Sandy Bridge, Ivy Bridge (2011-2012), 16 stages, 4 GHz
- Skylake (2015), 16 stages, 4.2 GHz
- Kaby Lake, Coffee Lake (2016-2017), 16 stages, 4.5 GHz
- Cannon Lake (2018), 16 stages, 4.2 GHz

https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

[Figure: Basic Pentium III Processor Misprediction Pipeline (10 stages): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec]

[Figure: Basic Pentium 4 Processor Misprediction Pipeline (20 stages): TC Nxt IP (1-2), TC Fetch (3-4), Drive (5), Alloc (6), Rename (7-8), Que (9), Sch (10-12), Disp (13-14), RF (15-16), Ex (17), Flgs (18), Br Ck (19), Drive (20)]

### Superscalar Architectures: deeper pipes

#### The *smaller* the **stage**, the *faster* the **clock** can run

- 486 (1989), 3 stages, 100 MHz
- P5 [Pentium] (1993), 5 stages, 300 MHz
- P6 [Pentium Pro, Pentium II, Pentium III] (1995-1999), 12-14 stages, 450 MHz-1.4 GHz
- NetBurst, Prescott [Pentium4] (2000-2004), 20-31 stages, 2.0-3.8 GHz
- Core (2006), 12 stages, 3.0 GHz
- Nehalem (2008), 20 stages, 3.6 GHz
- Sandy Bridge, Ivy Bridge (2011-2012), 16 stages, 4 GHz
- Skylake (2015), 16 stages, 4.2 GHz
- Kaby Lake, Coffee Lake (2016-2017), **16 stages**, 4.5 GHz
- Cannon Lake (2018), **16 stages**, 4.2 GHz

Current processors: **16 stages**

https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

But don't go too far... or you'll get Pentium4!

[Figure: basic misprediction pipelines. Pentium III, 10 stages: Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec. Pentium 4, 20 stages: TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, RF, Ex, Flgs, Br Ck, Drive]

CPUs are fast, memory is slow

- Even with caches
- Even with TLBs
- Even with pipelining

In-order execution, pipelined

- Instructions take a variable amount of time
- What if (a phase of) an instruction takes a lot of time?
  - Stalls / bubbles
- Could I have done something else while waiting? **YES!**



Pipeline and out-of-order instruction execution optimize performance

But **not** <u>delay slots</u>! From (old) RISCs, right now only popular in some DSPs (probably)

- 1. Fetch a bunch of instructions; stash them in a queue (Reservation Station); fetch operands, e.g., from memory
- 2. Execute instructions from the queue with operands ready ⇒ issued to the appropriate stage
- 3. Instructions leave the queue (might be before "earlier" instructions) ⇒ results queued (Reorder Buffer, ROB)
- 4. Instruction completes (retires) **after** all earlier instructions have also completed

<u>Tomasulo algorithm</u>, IBM, 1967 ⇒ adopted by Pentium Pro (P6 family), 1995

in parallel!



#### In Order



#### **Speculative Execution**

CPUs are fast, memory is slow

- Even with caches
- Even with TLBs
- Even with pipeline
- Even with out-of-order execution

How come we're still in trouble?

- Branches: if, loops, function calls/returns ...
- Out of Order Exec. works **great** for *data-dependencies*
- Branches are "control flow-dependencies"
  - If I don't know what I'll execute next, I can't reorder instructions!

#### **Branches**

- Unconditional branches (func. call, func. ret, jmp, ...)
- Conditional branches (if, loops, ...)



#### **Out-of-Order + Speculative Execution**

Yeah, whatever!! Reorder buffer is there, let's use it...

- Ignore control-flow dependencies: execute instructions anyway
- We "occupy" stalls, so we're not any slower!



Whatever instructions they are, I'm not any slower. If they could be the **right** ones, I'll be faster!!!

But:

**Q:** How do I tell which are the right ones, and where are they?

**A:** I guess... Even better: I try to predict!

**Q:** Ok, cool! But wait, what if you guess wrong (mispredict)? :-O

- Don't ignore, **guess** control-flow dependencies: <u>execute</u> instructions anyway
- We "occupy" stalls, so we're not any slower!

Execute them *speculatively*:

- Execute them but defer (some of their) effects
- Until we know whether they'll run "for real"
  - If yes, apply the effects (memory/register writes, exceptions, ...)
  - If no, throw everything away


How do we guess:

- 1. Whether a *conditional branch* (if, loops) <u>will be</u> taken or not taken
- 2. Where an *unconditional direct* branch (function call, function return) or an *unconditional indirect* branch (function pointer) <u>will</u> jump to

We can't know. We can predict (e.g., based on previous history):

- 1. Branch predictors
- 2. Branch Target Buffer, Return Stack Buffer

#### How do we guess direct branches (if, loops)

- Static prediction: no runtime knowledge
  - Always taken (loops! + loops execute multiple times... by definition!): 70% correct
  - Backward taken, forward not taken (BTFNT), (loops again! + compiler help), PPC 601 (1993): 80% correct
- Dynamic prediction: look at history, at runtime
  - 1 bit history, predict based on last occurrence, DEC/MIPS (1992/1994): 85% correct
  - 2 bit history, "often taken" == always taken, Pentium (1993): 90% correct
  - Store history, '100100' == taken once every 3 times, Pentium II (1997): 93% correct
  - Multilevel "agree" predictor, PA-RISC (2001): 95% correct
  - Neural networks, AMD Zen/Bulldozer (2011)
  - Geometric predictor, predictor chaining, Intel (2006)

<u>Kernel Recipes 2018 - Meltdown and Spectre: seeing through the magician's tricks</u>

<u>http://danluu.com/branch-prediction/</u>

#### How do we guess direct branches (if, loops)

```
0x002 if (A)
0x003   do_A()
0x004 do_notA()
...
```

Predictor: ... | pred. for branch at 0x002 | ...

**CPU:** branch taken, no prediction

```
begin[if (A)]
...
end[if (A)] == true, branch!
begin[do_A()]
...
end[do_A()]
```

**CPU:** branch taken, ok prediction

```
begin[if (A)]
check_pred[0x002] ← taken
spec_begin[do_A()]
...
end[if (A)] == true, branch!
...
spec_commit[do_A()] == end[do_A()]
```

Finished do\_A() earlier!

**CPU:** branch taken, misprediction

```
begin[if (A)]
check_pred[0x002] ← not taken
spec_begin[do_notA()]
...
end[if (A)] == true, branch!
begin[do_A()]
spec_undo[do_notA()]
...
end[do_A()]
```

Finished do\_A() **no later**!

#### How do we guess **direct branches** (if, loops)

Predictor entry for the branch at 0x002: 1 = taken

- (1) Check predictor
- (2) taken  $\Rightarrow$  speculatively execute do\_A()
- (3) Update predictor (with what really happened)
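
The three steps above map naturally onto a table of 2-bit saturating counters (the "2 bit history" scheme from the earlier list). What follows is only a minimal sketch: the table size `PRED_ENTRIES`, the indexing by branch address and the function names are assumptions made for illustration, not a description of any real CPU.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 64   /* hypothetical table size */

/* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken */
static uint8_t counters[PRED_ENTRIES];

/* (1) Check predictor */
static bool predict(uint64_t branch_addr)
{
    return counters[branch_addr % PRED_ENTRIES] >= 2;
}

/* (3) Update predictor with what really happened */
static void update(uint64_t branch_addr, bool taken)
{
    uint8_t *c = &counters[branch_addr % PRED_ENTRIES];

    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}

/* One branch through the three steps; returns true if the guess was right */
static bool run_branch(uint64_t branch_addr, bool actually_taken)
{
    bool guess = predict(branch_addr);

    /* (2) a real CPU would now speculatively execute the guessed path */
    update(branch_addr, actually_taken);
    return guess == actually_taken;
}
```

Starting from all-zero counters, a branch needs to be taken twice before the predictor starts guessing "taken", and a single not-taken outcome afterwards does not flip the guess back: that is the "often taken == always taken" behavior.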

#### How do we guess **indirect branches** (func. pointers/returns)

1. direct / indirect calls: Branch Target Buffer (BTB) ⇒ a branch cache
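
To make "a branch cache" concrete, here is a toy, direct-mapped BTB. It is a sketch under simplifying assumptions: `BTB_ENTRIES`, the modulo indexing and the absence of tags are invented for illustration, and real BTBs are larger and organized differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16   /* hypothetical size; real BTBs are far larger */

struct btb_entry {
    bool     valid;
    uint64_t target;     /* where this branch jumped to last time */
};

static struct btb_entry btb[BTB_ENTRIES];

/* Only some bits of the branch address select the entry (no tag in this toy) */
static unsigned btb_index(uint64_t branch_addr)
{
    return (unsigned)(branch_addr % BTB_ENTRIES);
}

/* Predict: returns true and fills *target if there is history for this slot */
static bool btb_predict(uint64_t branch_addr, uint64_t *target)
{
    const struct btb_entry *e = &btb[btb_index(branch_addr)];

    if (!e->valid)
        return false;
    *target = e->target;
    return true;
}

/* Update: remember the target the branch really went to */
static void btb_update(uint64_t branch_addr, uint64_t actual_target)
{
    struct btb_entry *e = &btb[btb_index(branch_addr)];

    e->valid = true;
    e->target = actual_target;
}
```

Note that in this toy two different branch addresses that map to the same slot share one prediction. That kind of aliasing is what makes "poisoning" a predictor conceivable.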











#### How do we guess **indirect branches** (func. pointers/returns)

2. returns: Return Stack Buffer (RSB) ⇒ a stack


On some CPUs, we just use what we find there...
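
One way to picture "we just use what we find there" is a small circular stack: a call pushes the return address, a return pops the prediction, and popping from an empty buffer simply wraps around onto whatever an earlier user left behind. This is a toy model, not a statement about a specific CPU; `RSB_DEPTH` and the function names are assumptions.

```c
#include <stdint.h>

#define RSB_DEPTH 16   /* hypothetical depth; it depends on the CPU */

static uint64_t rsb[RSB_DEPTH];
static unsigned rsb_top;          /* next free slot */

/* call: push the return address */
static void rsb_push(uint64_t return_addr)
{
    rsb[rsb_top] = return_addr;
    rsb_top = (rsb_top + 1) % RSB_DEPTH;
}

/* ret: pop the predicted return address.
 * There is no emptiness check: on underflow we wrap around and
 * return a stale entry, i.e. "what we find there". */
static uint64_t rsb_pop(void)
{
    rsb_top = (rsb_top + RSB_DEPTH - 1) % RSB_DEPTH;
    return rsb[rsb_top];
}
```

If one task fills the buffer and a second task then executes more returns than calls, the second task's extra return is predicted with an address chosen by the first task.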




# **Alternative Universes**

# **Speculative execution**

speculate = to guess, execution = to do something
 speculative execution = do something based on a guess

#### **Speculative Execution:**

if `<A>` is true do `<x>` ⇒ do `<x>`, while waiting to be able to check `<A>`

- IRL:
  - You to a friend: "hey, do you want a cup of coffee?"
  - While talking/waiting for answer: turn on machine, prep. cups, ...
- In CPUs:
  - Memory is slow. While waiting for data, do something
  - instruction reordering, superscalar pipelines, branch prediction, ...
  - Modern CPUs speculate a lot! (~= 200 entries reorder buffers)

<u>Kernel Recipes '18: Paolo Bonzini - "Meltdown and Spectre: seeing through the magician's tricks"</u>

<u>NYLUG: Andrea Arcangeli, Jon Masters - "Speculation out of control, taming CPU bugs"</u>

- I can create an alternate universe
- everything the same, **I have superpowers**:
  - I can do whatever I want, I always succeed (it's *my* alternate universe! :-D )
- After, say, 30 seconds:
  - alternate universe disappears
  - in the original universe, I remember nothing :-(
  - **good** things I've done  $\Rightarrow$  "copied" back to original universe
  - **bad** things I've done  $\Rightarrow$  *never happened* in original universe

### *What if* alteration of the **heat** of objects, happening in the alternate universe, **leaks** to the original universe?

(\*) Analogy stolen from ...

# **Side Channels**

# Side Channels (Covert Channels)

Gaining information on a system by **observing** its behavior

- Not: reading otherwise inaccessible memory via a buffer overflow
- But: measuring microarchitectural properties
- ⇒ does not interact with nor influence the execution of a program
- ⇒ does not let one modify/delete/... any data

Caches as side channels:

- Accessing memory is fast, if data is in cache
- Accessing memory is slow, if data is **not** in cache
- ⇒ measuring data access time == *cache side-channel*




# Cache as a Side Channel

Execution time of an **instruction** depends on **data** being in caches

Example (Prime and Probe):

- I fill the cache (big array)
- **Call** target\_func(int idx)
  - I control value of idx
- target\_func() brings its data in cache
- I measure access time to all array elements
- The slowest one tells me something about what target\_func() has done
  - (remember, I control idx)

# Cache as a Side Channel (The Other Way Round)

Execution time of an **instruction** depends on **data** being in caches

Example (Flush and Reload):

- I empty the cache (big array)
- **Call** target\_func(int idx)
  - I control value of idx
- target\_func() brings its data in cache
- I measure access time to all array elements
- The fastest one tells me something about what target\_func() has done
  - (remember, I control idx)
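
The flush/call/measure sequence can be tried out on a toy cache model. Everything here is simulated so that the example is deterministic: `cached[]`, the made-up latencies in `access_time()`, the `secret` constant and the way `target_func()` mixes it with `idx` are all assumptions for illustration. A real attack would flush real cache lines and time real loads.

```c
#include <stdbool.h>
#include <stddef.h>

#define NLINES 16u                  /* hypothetical number of probe lines */

static bool cached[NLINES];         /* toy cache: true == line is cached */
static const unsigned secret = 11;  /* known to the victim, not to us */

/* Flush: evict every probe line */
static void flush_all(void)
{
    for (size_t i = 0; i < NLINES; i++)
        cached[i] = false;
}

/* Victim: its memory access depends on the secret and on our idx */
static void target_func(unsigned idx)
{
    cached[(secret + idx) % NLINES] = true;
}

/* Simulated load latency in made-up cycles: hits are fast, misses slow */
static unsigned access_time(size_t line)
{
    bool hit = cached[line];

    cached[line] = true;            /* a load always brings the line in */
    return hit ? 4u : 200u;
}

/* Flush, let the victim run, reload everything, keep the fastest line */
static unsigned recover_secret(unsigned idx)
{
    size_t fastest = 0;
    unsigned best = ~0u;

    flush_all();
    target_func(idx);
    for (size_t i = 0; i < NLINES; i++) {
        unsigned t = access_time(i);

        if (t < best) {
            best = t;
            fastest = i;
        }
    }
    /* undo the idx offset, which we chose ourselves */
    return (unsigned)((fastest + NLINES - idx % NLINES) % NLINES);
}
```

Whatever `idx` the attacker picks, the fastest line minus that offset gives back the same value, which is why controlling `idx` matters.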

# **Attacking Speculative Execution**

### **Speculative Execution Attack**

```
//goal: read the 5th bit of what's at an address
//that I normally wouldn't be able to read!
result_bit = 0;
bit = 4;
flush_cacheline(L);
if (fork_alt_univ()) { //returns 1 in alternate, 0 in original universe :-)
  //in the alternate universe now (remember alternate universes...)
  if ( *target_address & (1 << bit) )
    load_cacheline(L);
}
//"back" in the original universe
if (is_cacheline_loaded(L))
  result_bit = 1;
```

do it in a loop, use a bitmask and shift (<<)

- `fork_alt_univ()`: this is how we "trick" the CPU to execute code "in speculation" (e.g., "poison" branch prediction)
- `*target_address`: the CPU is executing this "in speculation" ⇒ **no fault!**
- `load_cacheline(L)` / `is_cacheline_loaded(L)`: cache used as a **side-channel**: extract information from behavior. E.g., our looking-at-Facebook "heated" spoon, a stethoscope for hearing locks' clicks, ...
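
The "do it in a loop, use a bitmask and shift" remark can be sketched as follows. The one-bit leak itself is abstracted behind a function pointer: `leak_bit_fn`, `leak_byte()` and `demo_leak_bit()` are hypothetical names, and the demo oracle simply reads the byte directly (which a real attacker could not do), so that the loop can be exercised on its own.

```c
#include <stdint.h>

/* Hypothetical one-bit leak primitive: returns 0 or 1 for bit `bit` of the
 * byte at `addr`. It stands in for the flush + speculate + probe sequence. */
typedef int (*leak_bit_fn)(const uint8_t *addr, unsigned bit);

/* Rebuild a whole byte, one leaked bit at a time */
static uint8_t leak_byte(leak_bit_fn leak_bit, const uint8_t *addr)
{
    uint8_t value = 0;

    for (unsigned bit = 0; bit < 8; bit++)
        if (leak_bit(addr, bit))
            value |= (uint8_t)(1u << bit);
    return value;
}

/* Stand-in oracle for demonstration only: it just reads the byte */
static int demo_leak_bit(const uint8_t *addr, unsigned bit)
{
    return (*addr >> bit) & 1;
}
```

For example, `leak_byte(demo_leak_bit, &some_byte)` returns `some_byte`. Repeating this for consecutive addresses would dump a whole memory range, eight probes per byte.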

# **BTB Poisoning Attack**

### Conditional branch predictor:




### **RSB Underflow "Attack"**

Task A

Task B




# Speculative Execution: Fundamental Assumptions

- Spec. Execution ~= out-of-order execution + branch prediction
- Safe *iff*:
  - a. **Rollback works**: not retired (== executed speculatively, but rolled back) instructions have *no side effects* and leave no trace
    - Architectural registers, flags, ...: ok, no side effects
    - Caches, TLBs, ...: not ok, side effects!
  - b. **No messing with guesses**: it is impossible to *reliably* tell whether or not a particular block of code will be executed speculatively
    - Predictions based on history (branch having previously been taken/!taken) can be "poisoned", and hence *controlled*

# **Threat Model / Attack Scenarios**

## Meltdown / Spectre / others: TL;DR

Data Exfiltration "only":

- An unprivileged application can read (but not write) others' memory, irrespective of the isolation technique (virtualization, container, namespace...) or the OS (Linux, Windows, MacOS...)
- Does not provide privilege escalation per se, although it can help

Is my credit card data at risk?

- Don't know / don't care
- We'll talk about technical aspects



### Security, isolation ...

**Attack Scenarios:** 

#### 1. User App to Other User App(s)

- Damage contained within App(s) data
- Might be different apps of same user / different apps of different users
- User Apps must protect themselves

#### 2. User App to Kernel

- Implies nr. 1
- Kernel must protect itself



(\*) slightly different between Xen and KVM


#### **Attack Scenarios:**

- 1. Host User App to Other Host User App(s)
- 2. Guest User App to Other Guest User App(s)
  - Damage contained within App(s) data inside a VM
  - VM user must protect his/her apps
- 3. Host User to Host Kernel
- 4. Guest User to Guest Kernel
  - Implies nr. 2
  - Damage contained within VM/customer
  - Guest kernel must protect itself (mitigations ~= Host User to Host Kernel case)
- 5. Guest to Other Guest(s) (\*)
  - VM 3 can steal secrets from VM 4
  - Hypervisor must isolate VMs
- 6. Guest to Hypervisor (Bad! Bad! Bad! Bad!) (\*)
  - Damage: implies nr. 5 "on steroids"!
  - Hypervisor must protect itself

(\*) we don't really care if "Guest User to ..." or "Guest Kernel to ...", as one should never trust anything running in a VM, whether it is the VM kernel or userspace. Either one (or both!) may have been compromised, and become malicious!

- Most critical
- Most important, for someone working on OSes & hypervisors (that would be me ;-P)
- Most interesting (personal opinion)

... We'll focus on these

# Meltdown

# Meltdown ("Spectre v3")

Rogue Data Cache Load (CVE-2017-5754)

- Virtual Memory, paging, system/user (s/u) bit:
  - Kernel: ring0, can access all memory pages
  - User Apps: ring3, can't access kernel's (ring0) pages
- While in speculation:
  - Everyone can access everything!
    - Kernel can read kernel addresses
    - Kernel can read user addresses
    - User can read user addresses
    - User can read kernel addresses...
- **No** leaky gadget needed in kernel/hypervisor. Attacker can use her own **in user code** (much,much worse than Spectre!)
- Affected **CPUs**: Intel, one ARM CPU, PPC (to some extent... only data in L1, ...)










Yes, virtual memory map **is identical** for User App A, when running in both **user and kernel mode**! Why?

- User apps switch from user to kernel mode: e.g., syscall, interrupts, ...
- Changing the virtual memory map comes at a high price: TLB flush
- Kernel is the same for everyone, so, why bother?



#### **Meltdown**

[Figure: virtual memory of App A running in User Mode (ring3) and in Kernel Mode (ring0), both mapped through the page tables (MMU) onto host physical memory holding Kernel, App A, App B, App C]

Can't App A, running in User Mode, access Kernel memory then?

- **Normally**: s/u bit in page tables:
  - No, it can't, when in user mode
  - Yes it can, when in kernel mode
- **Speculatively**: s/u bit in page tables ignored
  - Yes it can, *all the time!*

### Meltdown

User space code:

```
int w, x, xx, array[];
//array[0] and array[1] assumed flushed and on different cache lines
if ( <false but predicted as true> ) {
  w = *((int*) kernel_memory_address);
  x = array[(w & 0x001)];
}
t0 = rdtsc(); xx = array[0]; t0 = rdtsc() - t0;
t1 = rdtsc(); xx = array[1]; t1 = rdtsc() - t1;
if (t0 < t1)
  //access to array[0] faster → (*kernel_memory_address)&1 = 0
else
  //access to array[1] faster → (*kernel_memory_address)&1 = 1
```

### **Meltdown: Impact**



- **Guest User to Guest Kernel (Guest User App to Guest User App(s)):** 
  - KVM: yes (User to User goes via kernel mappings in User Apps)
  - Xen HVM, PVH, PV-32bit[1]: yes (User to User goes via kernel mappings in User Apps)
  - Xen PV-64bit: no [2]
- Guest to Hypervisor (Guest to Other Guest(s)):
  - **KVM:** no
  - Xen HVM, PVH, PV-32bit: no
  - Xen PV-64bit: yes :-(( [2]
- **Containers:** affected :-((
- Rather easy to exploit !

[1] Address space is too small

[2] Looong story... ask offline ;-P





<sup>(\*)</sup> Only "trampolines" for syscalls, IRQs, ...




### Meltdown: PCID



- User Mode ⇒ Kernel Mode (and vice-versa)
  - syscalls, IRQs, ...
  - Change virtual memory layout (CR3 register)
  - Flush **all** TLB (~ page tables cache). *It really hurts performance*
- PCID (Process-Context IDentifier):
  - Tag TLB entries ⇒ no need to flush **all** the TLB, flush selectively
  - In Intel CPUs since 2010 !!! (PCID in Westmere, INVPCID in Haswell)
- Until now...
  - complicated to use, and we map everything anyway, why bother?
- Now (i.e., after Meltdown):
  - Let's bother!
  - Used in both <u>Xen</u> and <u>Linux</u>
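
The idea of tagging can be illustrated with a toy TLB in which a lookup hits only when both the virtual page number and the current tag match. This is a sketch under heavy simplification: `TLB_ENTRIES`, the direct-mapped layout and all function names are assumptions, not how any real MMU or kernel implements it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 8   /* hypothetical, tiny on purpose */

struct tlb_entry {
    bool     valid;
    uint16_t pcid;      /* tag: which address space owns this entry */
    uint64_t vpn;       /* virtual page number */
    uint64_t pfn;       /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_pcid;

static void tlb_insert(uint64_t vpn, uint64_t pfn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    e->valid = true;
    e->pcid = current_pcid;
    e->vpn = vpn;
    e->pfn = pfn;
}

/* Hit only if the entry belongs to the address space currently running */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    const struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (!e->valid || e->pcid != current_pcid || e->vpn != vpn)
        return false;
    *pfn = e->pfn;
    return true;
}

/* Without tags: every address space switch throws the whole TLB away */
static void switch_address_space_flush(void)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++)
        tlb[i].valid = false;
}

/* With tags: just change the tag, old entries stay but no longer match */
static void switch_address_space_pcid(uint16_t pcid)
{
    current_pcid = pcid;
}
```

With tags, entries inserted under one tag are invisible while another tag is active and become usable again after switching back, so a user ⇒ kernel ⇒ user round trip does not have to start from an empty TLB.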

# **Meltdown: Mitigation**



- KVM:
  - Enable KPTI on *host* (protects host kernel from Host User Apps!)
  - Enable KPTI inside guests
- Xen:
  - Enable XPTI, to protect *PV-64bit* guests (including Dom0!)
  - Enable KPTI inside HVM, PVH and PV-32bit guests
- Containers:
  - Enable KPTI on host

### Meltdown: Performance Impact



Expected: from -5% to -30% performance impact

- Workload dependent: worse if I/O and syscall intensive
- Slowdowns of more than -20% reached only on synthetic benchmarks (e.g., doing lots of tiny I/O)
- For "typical" workloads, we're usually well within -10% ...
- ... with PCID support!
  - LKML posts: postgres -5%, haproxy -17%
  - <u>Brendan Gregg - KPTI/KAISER Meltdown Initial Performance Regressions</u>
  - <u>Gil Tene - PCID is now a critical performance/security feature on x86</u>

# Spectre

### Spectre v1



Bounds-Check Bypass (CVE-2017-5753)

- Attacks conditional branch prediction
- Vulnerable code (**leaky gadget**) must be present in target, or JIT (\*)
- Affected **CPUs**: everyone (Intel, AMD, ARM)

```
uint8_t arr_size, arr[];   //arr_size not in cache
uint8_t arr2[];            //elements 1 and 2 not in cache
//untrusted_index_from_attacker = <out of arr[] boundaries>
if ( untrusted_index_from_attacker < arr_size ) {
  val = arr[untrusted_index_from_attacker];
  idx2 = (val&1) + 1;
  val2 = arr2[idx2];       //arr2[1] in cache ⇒ (arr[untrusted_index]&1) = 0
}                          //arr2[2] in cache ⇒ (arr[untrusted_index]&1) = 1
```

(\*) Just in Time code generators

- Target == kernel/hypervisor (Linux if KVM, Xen if Xen). **Not really** common!
- JIT: Xen: no JIT; KVM: eBPF
- Make sure the branch is ...
- Leaky gadget:
  - Load a secret
  - Leak it, by loading something else, offset by that secret

### Spectre v1: Impact, mitigations, performance



- Impact:
  - Guest User App to Guest User App(s): yes (JIT, e.g., JavaScript in browsers)
  - Guest User to Guest Kernel, Guest to Hypervisor, Containers: well, theoretically (leaky gadgets or JIT in kernel/hypervisor)
- Extremely hard to exploit
- Mitigation:
  - none... wait, what?
  - Manual code sanitization
    - (a.k.a. playing the whack-a-mole game!)
  - array\_index\_mask\_nospec(), in Xen & Linux, to stop speculation
- Performance Implications: **none** (clever tricks to avoid "fencing" ...)
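The "stop speculation without fencing" idea can be sketched in plain C. This is only a minimal illustration modelled on the idea behind array\_index\_mask\_nospec(): the names `index_mask_nospec` and `load_clamped` are hypothetical, and real kernels add compiler barriers or architecture-specific assembly so that the compiler cannot turn the mask back into a branch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns all ones when index < size and zero otherwise, computed with
 * arithmetic only (no conditional branch that could be mispredicted).
 * Assumptions: size <= INTPTR_MAX, size_t and intptr_t have the same width,
 * and right-shifting a negative value is an arithmetic shift (true on
 * mainstream compilers, but implementation-defined in ISO C).
 */
static inline size_t index_mask_nospec(size_t index, size_t size)
{
    intptr_t v = (intptr_t)(index | (size - 1 - index));
    return (size_t)(~v >> (sizeof(intptr_t) * 8 - 1));
}

/* Hypothetical caller: the index is clamped even on a mispredicted path. */
static inline uint8_t load_clamped(const uint8_t *arr, size_t size, size_t index)
{
    if (index >= size)
        return 0;
    /* In speculation past the check, an out-of-bounds index becomes 0. */
    index &= index_mask_nospec(index, size);
    return arr[index];
}
```

With this pattern, a speculatively executed out-of-bounds access reads `arr[0]` instead of attacker-chosen memory, which is why no serializing fence is needed.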



## Spectre v2



Branch Target Injection (CVE-2017-5715)

- Attacks indirect branch prediction: *function pointers* / jmp \*(%r11)
- Attacker *might* be able to provide his own **leaky gadget**
- Affected **CPUs**: everyone (Intel, AMD, ARM)

Predictors of indirect branch targets:

- Are based on previous history (BTB); can be "poisoned"
- Branches done in userspace influence predictions in kernel space
- Branches done in one SMT thread influence predictions on its sibling

Attack:

- Same leaky gadget based strategy (PoC for KVM via eBPF)
- Attacker provided leaky gadget if !SMEP on the CPU (on x86)

Marc Zyngier - KVM/arm Meets the Villain: Mitigating Spectre

Very good talk about ARM-specific challenges

### Spectre v2

Indirect jump:

```
     Address    Instruction
(1)  0x001123   jmp *(%r11)         //r11 = 0xddeeff
     ...        ...
(1s) 0xaabbcc   <my leaky gadget>   //either target's
     ...        ...                 //or attacker's code
     ...        ...
(2)  0xddeeff   <xxx>
     ...        <yyy>
```



#### **Regular Execution:**

- We are at (1)
- We jump at **(2)**

#### **Speculative Execution (Attack):**

- We poison the BTB so that it predicts 0xaabbcc as the jump target
- We are at (1)
- We enter speculation at **(1s)**, where the leaky gadget is

### Spectre v2 (& v1 !): Branch Predictor Poisoning

Guest User App "produce" Poison  $\Rightarrow$  BTB in the CPU



### Spectre v2: Impact



- Guest User to Guest Kernel, (Guest User App to Guest User App(s)): yes (JIT, e.g., JavaScript in browsers)
- **Guest to Other Guest(s):** yes (via Guest to Hypervisor)
- Guest to Hypervisor: yes (*existing* leaky gadget if SMEP, or via JIT)



- Containers: affected
- **Reasonably hard** to exploit, esp. for virtualization

SMEP: Supervisor Mode Exec. Protection (Fischer, Stephen (2011-09-21))

- Kernel won't execute User App code
- We can't make kernel speculatively jump to a User App provided leaky gadget

### Spectre v2: retpoline



... by Google

Let's set up a trap for speculation:

jmp \*%r11 becomes:

- call set\_up\_target;
- capture\_spec: pause; lfence; jmp capture\_spec;
- set\_up\_target: mov %r11, (%rsp); ret;

Replacement can happen:

- kernel/hypervisor & userspace: compiler support (<<yay, let's recompile everything!>> :-/)
- kernel/hypervisor: binary patching (e.g., Linux's <u>alternatives</u>)



- Skylake+: ret target might be predicted with BTB <u>lwn.net/Articles/745111/</u>
- "RSB Stuffing" <u>Retpoline: A Branch Target Injection Mitigation</u>

### Spectre v2: IBPB, STIBP, IBRS



Firmware/Microcode update (e.g., from Intel). Gross hacks... ahem.. New "instructions":

- **IBPB:** flush all branch info learned so far
- **STIBP:** ignore info of branches done on sibling hyperthread
- **IBRS:** ignore info of branches done in a less-privileged mode (before it was most recently set)

Intended usage:

- **IBPB:** on context and/or vCPU switch. Prevents App/VM A from influencing (poisoning?) branch predictions of App/VM B
- **STIBP:** when running with HT. Prevents an App/VM running on one thread from influencing (poisoning?) branch predictions of the App/VM on the sibling thread
- **IBRS:** when entering kernel/hypervisor. Prevents Apps/VMs from influencing (poisoning?) branch predictions in kernel/hypervisor
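On x86 these "instructions" are bits in two model-specific registers. The sketch below only computes which MSR and value would be written for each of the usages listed above; the event names and `barrier_for` are hypothetical, and the privileged MSR write itself is deliberately left out. Real kernels also preserve the other bits of the control register instead of writing a bare value.

```c
#include <stdint.h>

/* Architectural MSR numbers and bit positions used by the mitigations. */
#define MSR_IA32_SPEC_CTRL 0x48u
#define MSR_IA32_PRED_CMD  0x49u
#define SPEC_CTRL_IBRS     (1ull << 0)
#define SPEC_CTRL_STIBP    (1ull << 1)
#define PRED_CMD_IBPB      (1ull << 0)

/* Hypothetical events, mirroring the "intended usage" list. */
enum spec_event {
    EV_CONTEXT_OR_VCPU_SWITCH,   /* IBPB  */
    EV_RUNNING_WITH_HT,          /* STIBP */
    EV_ENTER_KERNEL_OR_HV        /* IBRS  */
};

struct msr_write {
    uint32_t msr;
    uint64_t value;
};

/* Returns the MSR write that implements the barrier for a given event. */
static struct msr_write barrier_for(enum spec_event ev)
{
    struct msr_write w = { MSR_IA32_SPEC_CTRL, 0 };

    switch (ev) {
    case EV_CONTEXT_OR_VCPU_SWITCH:
        w.msr = MSR_IA32_PRED_CMD;   /* write-only command register */
        w.value = PRED_CMD_IBPB;
        break;
    case EV_RUNNING_WITH_HT:
        w.value = SPEC_CTRL_STIBP;
        break;
    case EV_ENTER_KERNEL_OR_HV:
        w.value = SPEC_CTRL_IBRS;
        break;
    }
    return w;
}
```

Every barrier being an MSR write on a hot path (context switch, kernel entry) is a large part of why its cost matters.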

### Spectre v2: IBPB, STIBP, IBRS



**IBPB** neutralizes BTB poison "horizontally"

(e.g., between processes)



IBRS neutralizes BTB poison "vertically"

### Spectre v2: Mitigation(s)



- User Apps:
  - retpoline
  - Make timer less precise  $\Rightarrow$  harder to measure side effects!
  - IBPB & STIBP (<u>Spectre v2 app2app</u>, in these days)
- Xen: tries to pick best combo at boot
  - retpoline, when safe. IBRS, when retpoline-unsafe
  - IBPB at VM switch
  - Clear RSB on VM switch
- KVM:
  - retpoline + some IBRS (e.g., when calling into firmware)
  - IBPB at VM switch (heuristics for IBPB at context switch)
  - Clear RSB on context/VM switch
- Both Xen, KVM: IBRS, IBPB, STIBP available/virtualized for VMs too

### Spectre v2: Performance Impact



It's complicated!

- **retpoline:** good performance... is it enough <del>paranoia</del> protection?
- **IB\*** barriers:
  - **IBPB:** *moderate* impact
  - **IBRS:** impact *varies a lot*, depending on hardware. E.g., Intel:
    - pre-Skylake: super-bad
    - post-Skylake: not-too-bad
  - **STIBP:** (these days) *huge* impact ⇒ making it per-app opt-in
  - $\Rightarrow$  it's not only the flushing
    - x86: these are, for now, MSR writes (**slooow**!)
    - ARM: on one CPU, disable/re-enable the MMU! :-O

# Spectre (Again!)

### Spectre v3a (Spectre-NG)



Rogue System Register Read (CVE-2018-3640)

- Speculative reads of system registers may leak info about system status (e.g., flags)

### Spectre v4 (Spectre-NG)



Speculative Store Bypass (SSB) (CVE-2018-3639)

- Affected CPUs: everyone (Intel, AMD, ARM)
- In speculation, a load from an address can observe the result of a store which is **not the latest** store to that address
- Similar to Spectre v1: needs leaky gadget or JIT
- New "instruction" SSBD (Speculative Store Bypass Disable) $\Rightarrow$ no use for Xen/KVM themselves, useful for User Apps in guests
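For user applications on Linux, the per-task control is reachable through prctl(). The following is a minimal sketch, assuming a Linux system whose kernel offers the speculation control interface; the helper name `ssb_disable_for_this_thread` is hypothetical, and the fallback constants are only used if the system headers are too old to define them.

```c
#include <sys/prctl.h>

/* Fallback definitions for old headers (values from the Linux UAPI). */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL   (1UL << 0)
#define PR_SPEC_ENABLE  (1UL << 1)
#define PR_SPEC_DISABLE (1UL << 2)
#endif

/*
 * Tries to disable speculative store bypass for the calling thread.
 * Returns 1 if the mitigation is now active for this thread, 0 if the
 * kernel does not offer per-task control (or the request failed).
 */
static int ssb_disable_for_this_thread(void)
{
    int st = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

    if (st < 0 || !(st & PR_SPEC_PRCTL))
        return 0;   /* unsupported, or the mode is not controllable per task */

    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
              PR_SPEC_DISABLE, 0, 0) != 0)
        return 0;

    st = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
    return st >= 0 && (st & PR_SPEC_DISABLE) != 0;
}
```

A JIT or sandbox would typically call this once, before running untrusted code in the thread.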



### LazyFP (Spectre v5)

Lazy FPU State Leak (CVE-2018-3665)

- Affected CPUs: *Intel*
- FPU context is **large** 
  - let's *ignore* it at context switches
  - Mark it as invalid
  - If new context (process/VM) needs it: save it, switch it and mark as valid again
- Speculative execution:
  - New context needs it ⇒ uses it **right away**, in speculation, with old context's values in it!
  - "old context's values": how about *keys* or *crypto stuff*?!?!
- XSAVEOPT ...
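The lazy scheme above can be modelled in a few lines of C. This is a toy model, not kernel code: all types and function names are hypothetical, and "registers" are just an array. It shows why the old context's values are still physically present after a lazy switch, and what an eager switch (cf. the eager-fpu tunable) does differently.

```c
#define NREGS 4

struct fpu_regs { double r[NREGS]; };
struct task     { struct fpu_regs saved; };

struct cpu {
    struct fpu_regs hw;        /* what is physically in the registers */
    struct task *fpu_owner;    /* whose values those are */
    int fpu_valid;             /* may the current task use them? */
};

/* Lazy switch: do not touch the registers, only mark them invalid. */
static void switch_lazy(struct cpu *c, struct task *next)
{
    c->fpu_valid = (c->fpu_owner == next);
}

/* Slow path, architecturally taken on first FPU use after a lazy switch. */
static void fpu_first_use(struct cpu *c, struct task *cur)
{
    if (c->fpu_valid)
        return;
    if (c->fpu_owner)
        c->fpu_owner->saved = c->hw;   /* save the old context */
    c->hw = cur->saved;                /* load the new one */
    c->fpu_owner = cur;
    c->fpu_valid = 1;
}

/* Eager switch: always save and restore, so no stale values remain. */
static void switch_eager(struct cpu *c, struct task *next)
{
    if (c->fpu_owner)
        c->fpu_owner->saved = c->hw;
    c->hw = next->saved;
    c->fpu_owner = next;
    c->fpu_valid = 1;
}
```

Between `switch_lazy()` and `fpu_first_use()`, `hw` still holds the previous task's data; the bug is that speculation can use it before the slow path is taken.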

### L1TF (Foreshadow / ForeshadowNG)



## L1TF - Baremetal (Foreshadow)

L1TF / Foreshadow (<u>CVE-2018-3620</u>)

- Similar to Meltdown, potentially
- Meltdown: user space can read kernel pages, *if they're mapped in its address space* 
  - s/u bit in page table entries, ignored, in speculation
  - User space manages to maliciously read (in speculation) all its *virtual addresses*
- L1TF: user space can kind of read *physical memory directly*!
  - present bit in page table entries ignored, in speculation
  - That means it can maliciously read (in speculation) all RAM ⇒ PTI is useless
- Affects **only** Intel (~= Meltdown)





#### **Regular execution**

App accesses data in present page:

1. Page tables
2. Check L1 cache
   - **Hit!** Load data in CPU
   - **Miss!** Fetch from L2/L3/RAM, load in L3, L2, L1, then load in CPU

App accesses data in non present page:

1. Page tables: page !present
2. Page fault

Potentially Malicious App A: stopped!





#### Speculative execution

App (speculatively) accesses data in non present page:

1. Page tables: page !present (present bit ignored, in speculation)
2. <del>Page fault</del> Check L1 cache
3. **Hit!** Load data in CPU

Wait... What?!?!

## L1TF - Baremetal (Foreshadow)



Problems:

- What does the !present page table entry (PTE) contain?
  - Can be anything. Intel manual explicitly says the content will be ignored
  - OS is free to use it at will
  - Linux, Windows, etc.: offset of the page in swap space
- Can an attacker process control its own (!present) PTEs?
  - The kernel is in charge of PTEs, ... ...
  - ... ... yeah, but, e.g., <u>mprotect()</u> (Linux syscall)
  - So, **yes**, it's possible!

# L1TF - Virtualization (Foreshadow-NG)



L1TF / Foreshadow-NG (CVE-2018-3646)

- Like Meltdown. But scarier. And almost harder to fix (for virt)!
- Meltdown: user space can read kernel pages, *if they're mapped in its address space* 
  - s/u bit in page table entries, ignored, in speculation
  - User space manages to maliciously read (in speculation) all its *virtual addresses*
- L1TF: guests can kind of read *physical memory directly*!
  - present bit in page table entries ignored, in speculation
  - Guest manages to maliciously read (in speculation) all RAM  $\Rightarrow$  PTI is useless
  - ... ... and, believe me, **it gets worse** !!!
- Affects **only** Intel (~= Meltdown)





#### **Regular execution**

App accesses data in present page:

1. Guest page tables
2. Host page tables
3. Check L1 cache
   - **Hit!** Load data in CPU
   - **Miss!** Fetch from L2/L3/RAM, load in L3, L2, L1, then load in CPU

#### Regular execution

App accesses data in non present page:

1. Guest page tables: page !present
2. Guest page fault

Potentially <u>Malicious</u> <u>App A</u> (e.g., trying to steal data within VM 1): **stopped!**

#### **Regular execution**

Guest accesses data in non present page:

1. Guest page tables
2. Host page tables: page !present
3. Host page fault

Potentially <u>malicious</u> <u>App A, or VM 1</u> (or both), trying to steal from host or other VMs: **stopped!**





#### Speculative execution

App (speculatively) accesses data in non present page:

1. Guest page tables: page !present
2. <del>Host page tables</del> Check L1 cache
3. **Hit!** Load data in CPU

Wait... What?!?!

Use already described techniques (i.e., using cache as a side-channel, as in Meltdown) to actually read it, out of speculation: *VM can read arbitrary host data!!*





## L1TF: HyperThreading

#### Without Hyperthreading: mitigation

1. VM 1 runs on CPU
2. VM 1 puts secrets in L1 cache
3. VM 1 leaves CPU (context switch)
4. Hypervisor: flush L1 cache
5. VM 2 runs on CPU
6. VM 2 reads VM 1's secrets! (prevented by the flush at step 4)

#### With Hyperthreading: err... mitigation?

1. VM 1 runs on Thread A
2. VM 2 runs on Thread B
3. VM 1 puts secrets in L1 cache (Hypervisor: THERE'S NOTHING I CAN DO !!!)
4. VM 2 reads VM 1's secret from L1 cache

### L1TF: hyperthreading

#### Without Hyperthreading: mitigation

1. Hypervisor runs on CPU
2. Hypervisor puts secrets in L1
3. Hypervisor leaves CPU (VMEntry)
4. Hypervisor: flush L1 cache
5. VM 2 runs on CPU
6. VM 2 reads hypervisor's secrets! (prevented by the flush at step 4)

#### With Hyperthreading: err... mitigation?

1. Hypervisor runs on Thread A
2. VM 2 runs on Thread B
3. Hypervisor puts secrets in L1 (Hypervisor: THERE'S NOTHING I CAN DO !!!)
4. VM 2 reads Hypervisor's secret from L1 cache (no VMEntry needed...)

Guest Kernel to Other Guest(s) attack

### L1TF: Impact



- Host User to Host Kernel (Host User to Other Host User(s), Containers):
  - yes (but easy to mitigate, **zero** perf. cost)
- Guest Kernel to Hypervisor (Guest to Other Guest(s)):
  - Xen PV: yes (but easy to mitigate, ~= zero perf. cost)
  - Xen HVM, PVH: yes
  - **KVM:** yes
- Not that hard to exploit !

### **L1TF: Mitigation**

- Host, Containers:
  - Flip address bits in page tables when present bit is 0
  - Resulting address will never be in L1 cache
    - Unless you have terabytes of swap space
    - → swap size limited on vulnerable CPUs (x86/speculation/l1tf: Limit swap file size to MAX\_PA/2)
- Xen PV:
  - Xen intercepts PV guests' page table updates: sanitize/crash malicious guests
- Xen HVM, Xen PVH, KVM:
  - Flush L1 cache on VMEntry
  - Disable hyperthreading
  - If not wanting to disable hyperthreading... *disable hyperthreading!*
  - ... ... Did I say disable hyperthreading?
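The "flip address bits" mitigation for hosts and containers can be illustrated with a small C sketch. It is a simplified model under stated assumptions: a 46-bit physical address width, a 4 KiB page size, and hypothetical helper names. The real Linux implementation differs in detail.

```c
#include <stdint.h>

#define PTE_PRESENT   0x1ull
#define PAGE_SHIFT    12
#define MAX_PA_BITS   46   /* assumption: CPU with 46 physical address bits */
#define PTE_ADDR_MASK (((1ull << MAX_PA_BITS) - 1) & ~((1ull << PAGE_SHIFT) - 1))

/*
 * Inverts the address bits of a non-present PTE before it is stored.
 * Applying the function twice gives back the original value, so the OS
 * can still recover, e.g., the swap offset it put there.
 */
static uint64_t pte_invert_if_not_present(uint64_t pte)
{
    if (pte & PTE_PRESENT)
        return pte;
    return pte ^ PTE_ADDR_MASK;
}

/* Physical address the CPU would speculatively use for an L1 lookup. */
static uint64_t pte_speculative_paddr(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}
```

As long as the original offset stays below MAX\_PA/2, its top address bit is 0, so the inverted entry always points into the upper half of the physical address space, where there is (normally) no RAM and hence nothing cacheable in L1. That is where the swap size limit comes from.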





### L1TF: Performance Impact

- Host, Containers, Xen PV:
  - Negligible
- Xen HVM, Xen PVH, KVM:
  - L1 cache: limited (so small and so fast!)
  - Disable hyperthreading: depends
    - Varies with workloads: realistically, -15% in some of the common cases.
       Not more than -20%, or -30%, in most
    - **-50%** claimed, but only seen in specific microbenchmarks

### L1TF: Performance Impact



- **Alternative ideas?** (to disabling HT)
  - Shadow Page Tables: we'd detect attacks  $\Rightarrow$  slow
  - Core-scheduling: only vCPUs of same VM on SMT-siblings
    - In the works, for both Xen and Linux: complex
    - ok for Guest to Other Guests, not ok for Guest to Hypervisor
  - Core-scheduling + "Coordinated VMExits": complex
  - Secret hiding:
    - Hyper-V ~done
    - Xen maybe doable
    - KVM really hard
  - Shadow Page Table, make it fast by "abusing" Intel CPU feats:
    - CR3-whitelisting, PML (was for live-migration), ...
    - $\Rightarrow$  in the <del>works</del> brains...

KVM Forum '18: Alexander Graf - L1TF and KVM ( has a demo!!! :-D )

## Your Current Protection Status + Tunables

### **Your Current Situation**

#### On a Linux host/guest. PTI, IBRS, IBPB, STIBP:

\$ grep -E 'pti|ibrs|ibpb|stibp' -m1 /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant\_tsc art arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm abm 3dnowprefetch cpuid\_fault epb cat\_l3 cdp\_l3 invpcid\_single pti intel\_ppin ssbd mba ibrs ibpb stibp tpr\_shadow vnmi flexpriority ept vpid ept\_ad fsgsbase tsc\_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt\_a avx512f avx512dq rdseed adx smap clflushopt clwb intel\_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm\_llc cqm\_occup\_llc cqm\_mbm\_total cqm\_mbm\_local dtherm ida arat pln pts hwp hwp\_act\_window hwp\_epp hwp\_pkg\_req flush\_l1d

### **Your Current Situation**

#### On a Linux host/guest. PCID:

\$ grep -E 'pcid' -m1 /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant\_tsc art arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm abm 3dnowprefetch cpuid\_fault epb cat\_l3 cdp\_l3 invpcid\_single pti intel\_ppin ssbd mba ibrs ibpb stibp tpr\_shadow vnmi flexpriority ept vpid ept\_ad fsgsbase tsc\_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt\_a avx512f avx512dq rdseed adx smap clflushopt clwb intel\_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm\_llc cqm\_occup\_llc cqm\_mbm\_total cqm\_mbm\_local dtherm ida arat pln pts hwp hwp\_act\_window hwp\_epp hwp\_pkg\_req flush\_l1d

### **Your Current Situation**

#### On a Linux host/guest:

\$ ls /sys/devices/system/cpu/vulnerabilities/

l1tf meltdown spec\_store\_bypass spectre\_v1 spectre\_v2

\$ grep -H . /sys/devices/system/cpu/vulnerabilities/\*
/sys/devices/system/cpu/vulnerabilities/l1tf:
 Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/meltdown:
 Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spec\_store\_bypass:
 Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/spectre\_v1:
 Mitigation: \_\_user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/spectre\_v2:
 Mitigation: Indirect Branch Restricted Speculation, IBPB: conditional, IBRS\_FW,
 STIBP: conditional, RSB filling
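The same check can be done programmatically, e.g., from a monitoring agent. A minimal sketch in C follows; the function names are hypothetical, and the directory is passed as a parameter so that the reader logic can also be pointed at a test directory. The status strings are assumed to start with "Not affected", "Vulnerable" or "Mitigation:", as in the output above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Reads the first line of <dir>/<name> into buf, without the trailing
 * newline. Returns 0 on success, -1 on error. For the real files, dir
 * would be "/sys/devices/system/cpu/vulnerabilities".
 */
static int read_vuln_status(const char *dir, const char *name,
                            char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the status line reports a mitigation or no exposure. */
static int is_mitigated(const char *status)
{
    return strncmp(status, "Mitigation:", 11) == 0 ||
           strncmp(status, "Not affected", 12) == 0;
}
```

For example, `read_vuln_status("/sys/devices/system/cpu/vulnerabilities", "l1tf", buf, sizeof(buf))` followed by `is_mitigated(buf)`. Note that "Mitigation:" lines can still carry caveats, such as "SMT vulnerable" in the l1tf entry above.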

### **Your Current Situation TODO**

#### On a Xen host:

\$ ls /sys/devices/system/cpu/vulnerabilities/

l1tf meltdown spec\_store\_bypass spectre\_v1 spectre\_v2

```
$ grep -H . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/l1tf:
    Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/meltdown:
    Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:
    Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/spectre_v1:
    Mitigation: __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/spectre_v2:
    Mitigation: Indirect Branch Restricted Speculation, IBPB: conditional, IBRS_FW,
    STIBP: conditional, RSB filling
```

### Tunables

<<Greetings, how slow do you want to go today?>>

<<Greetings, how secure do you want to be today?>>

- KVM:
  - pti = on | off | auto
  - spectre\_v2 = on|off|auto|retpoline,generic|retpoline,amd
  - spec\_store\_bypass\_disable = on|off|auto|prctl|seccomp
  - l1tf = full|flush|flush,nosmt
  - kvm-intel.vmentry\_l1d\_flush = always|cond|never
- XEN:
  - xpti = [ dom0 = TRUE/FALSE , domu = TRUE/FALSE ]
  - bti-thunk = retpoline | lfence | jmp
  - {ibrs,ibpb,ssbd,eager-fpu,l1d-flush} = TRUE/FALSE
  - {smt,pv-l1tf} = TRUE/FALSE

### Conclusions

- "Hardware bugs" are **difficult** 
  - Not only to <del>fix</del> mitigate
  - But also to work on, collaboratively (NDAs, etc)
  - Getting better
- Issues like these will **really** haunt us for quite some time...
- Speculative Execution has **shaped** Computing World
- We focused on **performance first**, now we deal with consequences. As grandma used to say: <<*L'hai voluta la bicicletta, oh pedala!!!*>> (roughly: "You wanted the bicycle? Now pedal!")
- Do **update** your firmware/microcode; do **update** your kernel
- Threats are real **but** don't panic: analyze your system, assess risks
- Performance impact may be really high but don't panic: **benchmark** your own workload, look for tunables

### Some Examples / Anecdotes / Curiosities

### NoTimers, NoFlush? Still party!

Cache as a side channel:

- Some control cache content (flush, place own array)
- Accurately time array elements accesses

So... Is forbidding user-space code to flush cache a mitigation?

- No! User code can still cause cache flushes, via memory allocation
- No! User code can "displace" array elements

So... Is reducing timers' resolution for user-space code a mitigation?

- No! If I have shared memory (& multi-core/multi-thread) I can set up a counter thread == a timer
- (Actually done, e.g., in Android and in some browsers...)
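The "counter thread == a timer" trick is simple enough to sketch with POSIX threads and C11 atomics. This is an illustration only: the names are hypothetical, the tick rate is whatever the counting core manages, and a real attack would pin the threads and calibrate the counter.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t ticks;   /* shared memory used as a clock */
static atomic_int stop_flag;

/* Runs on its own core/hyperthread and just keeps counting. */
static void *counter_thread(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed))
        atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
    return NULL;
}

/* "Reading the timer" is a plain load from shared memory. */
static uint64_t now_ticks(void)
{
    return atomic_load_explicit(&ticks, memory_order_relaxed);
}

/* Times one call of fn(arg), e.g., one access to an array element. */
static uint64_t time_in_ticks(void (*fn)(void *), void *arg)
{
    uint64_t t0 = now_ticks();

    fn(arg);
    return now_ticks() - t0;
}
```

The counter is started with `pthread_create(&t, NULL, counter_thread, NULL)`; no system timer API is involved at any point, so making those less precise does not help.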

### "Microcode" What ?

Hardware bugs (yes, we've had those before!)

- Cyrix Coma, Pentium FDIV or Pentium F00F
- $\Rightarrow$  Hardware replacement!

We don't want that:

- CPUs execute "micro-operations" (µops), not real x86 opcodes
- Translation between opcodes and µops: microcode, inside CPUs
- Can be changed/updated (distributed only in binary form)
- Change CPU behavior "in the field"
- Well, up to a certain extent!
- (NB updates are not persistent, reload at boot)

### **Chicken bits**

#### "Chicken bits"

- A control bit stored in a register, used in ASICs and other integrated circuits to disable or enable features within a chip. <u>https://www.urbandictionary.com/define.php?term=Chicken%20Bit</u>
- (electronics) A bit on a chip that can be used to disable one of the features of the chip if it proves faulty or negatively impacts performance. <u>https://en.wiktionary.org/wiki/chicken\_bit</u>

2010, Ilya Wagner & Valeria Bertacco, *Post-Silicon and Runtime Verification for Modern Processors*, <u>Springer</u>, page 165: <<As an example, modules such as branch predictors and speculative execution units can be turned off with a variant of the "chicken bits", control bits common to many design developments to control the activation of specific features.>>

Retpoline sequence replacing call \*%rax:

```
        jmp label2
label0:
        call label1
capture_ret_spec:
        pause ; lfence
        jmp capture_ret_spec
label1:
        mov %rax, (%rsp)
        ret
label2:
        call label0
        ... continue execution
```

Speculation (while waiting for the mov to memory). Where? At the "trap" (capture\_ret\_spec)







#### Talking about Spectre-v2, IBRS vs. retpoline

/sys/devices/system/cpu/vulnerabilities/spectre\_v2: Mitigation: full generic retpoline, IBPB: conditional, IBRS\_FW, STIBP: conditional, RSB filling

- Task A, userspace: compiled with retpoline enabled compiler: safe
- Kernel: retpoline enabled as in-kernel mitigation: safe
- Firmware: is it using IBRS? Is it compiled with retpoline? We can't know: unsafe!
- $\Rightarrow$ Wrap firmware calls/services around IBRS (the IBRS\_FW above)

http://www.eventhelix.com/realtimemantra/Basics/CToAssemblyTranslation3.htm https://en.wikipedia.org/wiki/Branch\_table

### Compiling switch() {...}

```
int global;

int foo3 (int x)
{
  switch (x) {
  case 0:
    return 11;
  case 1:
    return 123;
  case 2:
    global += 1;
    return 3;
  case 3:
    return 44;
  case 4:
    return 444;
  default:
    return 0;
  }
}
```

```
$ gcc jt.c -O2 -S -o/dev/stdout
        .file   "jt.c"
        .text
        .p2align 4,,15
        .globl  foo3
        .type   foo3, @function
foo3:
.LFB0:
        .cfi_startproc
        cmpl    $4, %edi
        ja      .L2
        movl    %edi, %edi
        jmp     *.L4(,%rdi,8)
        .section .rodata
        .align 8
        .align 4
.L4:                            # Jump Table
        .quad   .L9
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3
        .text
        .p2align 4,,10
        .p2align 3
.L9:
        movl    $11, %eax
        ret
        .p2align 4,,10
        .p2align 3
.L3:
        movl    $444, %eax
        ret
        .p2align 4,,10
        .p2align 3
.L6:
        addl    $1, global(%rip)
        movl    $3, %eax
        ret
        .p2align 4,,10
        .p2align 3
.L5:
        movl    $44, %eax
        ret
        .p2align 4,,10
        .p2align 3
.L7:
        movl    $123, %eax
        ret
        .p2align 4,,10
        .p2align 3
.L2:
        xorl    %eax, %eax
        ret
        .cfi_endproc
```

Jump table:

- Faster (alternative: ~if/else)
- Better fits in cache
- Performance independent of the number of cases

### Compiling switch() {...}

| cases | normal      | retpoline   | retpo+no-JT | retpo+JT=20 | retpo+JT=40 |
|-------|-------------|-------------|-------------|-------------|-------------|
| 8     | 0.70 (100%) | 2.98 (425%) | 0.75 (107%) | 0.75 (107%) | 0.75 (107%) |
| 16    | 0.70 (100%) | 2.98 (425%) | 0.82 (117%) | 0.82 (117%) | 0.82 (117%) |
| 32    | 0.70 (100%) | 3.01 (430%) | 0.87 (124%) | 2.98 (426%) | 0.87 (124%) |
| 64    | 0.70 (100%) | 3.52 (501%) | 0.94 (134%) | 3.52 (501%) | 3.52 (501%) |
| 128   | 0.71 (100%) | 3.51 (495%) | 1.07 (151%) | 3.50 (495%) | 3.50 (494%) |
| 256   | 0.76 (100%) | 3.14 (414%) | 1.27 (167%) | 3.14 (414%) | 3.14 (414%) |
| 1024  | 1.46 (100%) | 3.36 (230%) | 1.49 (102%) | 3.36 (230%) | 3.36 (230%) |
| 2048  | 2.25 (100%) | 3.19 (142%) | 2.70 (120%) | 3.19 (142%) | 3.19 (142%) |
| 4096  | 2.90 (100%) | 3.74 (129%) | 4.48 (155%) | 3.73 (129%) | 3.72 (129%) |

"I'm going to prepare a patch that will disable JTs for retpolines." <u>https://gcc.gnu.org/bugzilla/show\_bug.cgi?id=86952</u> (<u>https://github.com/marxin/microbenchmark-1</u>)

# **Thanks Everyone!**

### **Questions?**




https://xkcd.com/1938/