Hack in the box & stuff
Skip to the Introduction if you are here for the tech stuff. I submitted a CFP to “Hack In The Box” in Amsterdam at the end of May and the talk has been accepted. So I’ll be spending part of my vacation this year presenting some of my research on cache side channel attacks from the last year or so. I’m really excited because I think I actually have some really nice stuff to present: not only is it technically interesting, but I think it’s far more relevant than the common perception in the infosec community suggests. If you have a chance, I think it’ll be worthwhile going. Among those presenting is Jacob Torrey, whose material I’ve covered twice on my blog and who in my opinion really is one of those pushing the boundaries of infosec. Felix Wilhelm is also going; I currently have his master’s thesis on my night table and thus far it is indeed very cool reading.
Preparing a talk obviously takes a lot of time, and since I’m an amateur this is going to cut down on the time I can spend on this blog. I have two more posts in the pipeline (including part 2 of this post), but no real idea when I’ll get them done.
Finally, posting this blog post has been a big dilemma for me. I had the basic idea almost a year ago and I’ve been putting it off again and again because it’s offensive, and posting it is a moral borderline case for me. I decided to go ahead, but with a heavy heart. My heart is also heavy because when I started this blog I wanted to target the ambitious newbie. However, my work has tended to become progressively less accessible, and this particular post is probably the least accessible thus far.
Introduction
KASLR means
“kernel address space layout randomization”. It is a feature of most modern
operating systems meant to protect against privilege escalation attacks that utilize
code reuse techniques such as ROP (return oriented programming). The idea is that
if an attacker doesn’t know where the code is, he cannot set up a ROP chain to
reuse that code. Therefore a modern operating system loads drivers at different
addresses on each boot, and this leaves an attacker in the dark about where the
code is located. Code reuse attacks have grown in importance for privilege
escalation with the requirements for driver signing and the widespread use of
“no-execute” settings on the stack and other data in kernel mode, forcing the
attacker to either bypass driver signing to load a driver, attack the page tables,
or reuse existing code.
The
attacker has typically relied on information leaks in known drivers that give
away addresses that can be used to locate code. However, last year I had an idea
how we might use the CPU design itself for our information leak, and this will be the
topic of this blog post. That we can use the CPU design against KASLR is
particularly bad news because CPUs are not subject to updates the way software is,
and in the past Intel has not been forthcoming with security fixes in this
respect. Also, relying on the CPU for an attack means that it generalizes
across operating systems: many of the findings I’ve made here for Windows
7 will lend themselves to other platforms. On the other hand, these kinds of
attacks may be applicable only on a subset of CPUs. I think that all modern
Intel CPUs with a 3rd level cache (i3, i5, i7 and Xeons from Sandy
Bridge and up) are likely to be affected, and possibly other CPUs as well.
Finally, I’d
like to say that these kinds of attacks have been made before, but I’m not aware of any
with the same methodology as mine.
Finding drivers
The idea
revolved around the “prefetch” instructions. These instructions let software
give the prefetcher a hint to load a specific address into a certain cache
level because the software will need it shortly. This can
significantly improve performance, as the prefetcher runs independently of the CPU core
and thus memory is fetched before it’ll be used. There is however an interesting
thing about the prefetch instructions that caught my eye: they don’t cause
exceptions (except that they are incompatible with a LOCK prefix) [2]. Yet
they take a virtual address as input and need a physical address for
loading memory and placing it in the cache. Thus they may entirely bypass
memory protection mechanisms. That seems relatively plausible because,
since they aren’t actually reading data, they should not in any way or fashion
allow an attacker to read memory he isn’t supposed to. Consequently, Intel might
have decided that there wasn’t a security issue.
My first
idea was to use the prefetch instruction in a modified prime+probe or
evict+time attack. That is: make a guess about an address in a driver, then use a
prefetch instruction to load the cache entry, and use either evict+time or prime+probe
to tell if you actually hit the address. This of course is pretty slow, and there
is a lot of memory you’d need to run this on before you’d actually find your
way around the kernel address space. I’ll get back to this in another post.
The second
idea is: what if the prefetch instructions leak information through their
execution time? I thought of 3 scenarios:
1) Prefetch does not leak information
(e.g. the DS descriptor has a limit, the prefetcher doesn’t do anything on kernel memory
when called from user mode, it is designed as
constant time, etc.)
2) Prefetch is slower on virtual addresses
which actually map to a page. One reason could be that telling the prefetcher
to load the cache takes time.
3) Prefetch is slower on addresses
which do not map to a page, because searching all page table entries
for an address is slower than just searching until you hit something.
Well. First
I dug into the descriptor issue mentioned in 1), because that would be a no-go, and
I recalled that 32-bit XP actually has a limit on its DS descriptor, making
it useless for accessing kernel mode memory. It seems though that this is not the
case in 64-bit mode, as section 5.3.1 of [1] clearly says that limits are not
checked at run time (I think there are some exceptions for FS/GS). So, time to load
up a debugger and write some code. First I loaded LiveKD and dumped a map of
the drivers using the “lm n t” command. Then I wrote some user mode code that
times the prefetcht2 instruction. I picked the prefetcht2 instruction because I
assumed that loading the largest-latency cache was most likely to actually be
carried out by the prefetcher and thus most likely to be successful. I then
wrote up code to time prefetching an address 20 times and pick the lowest
timing found. My argument here is that there is noise in the memory subsystem
(hence the multiple accesses, to try to cancel it out), and I take the lowest value because
searching for the lower bound tells me more than an average. Then I picked
two addresses to time in this way: one a few thousand pages below the first
loaded driver, and one in the first loaded driver. Finally, I repeated the
experiment 1000 times. And voilà: the prefetcht2 instruction is faster on a
virtual address that is actually mapped. I speculate that this is because
searching through the PTEs is aborted early and thus faster. I think I know how to
figure out if this is true, and I may get back to it in another blog post. So I
moved on and wrote up a small program that would scan a small portion of the kernel
memory address space, tell me if a page was mapped or not, and compare it to
my module list. It worked beyond any expectation.
So how fast
can I scan? It takes about 112 CLKs for a miss and 85 CLKs for a hit on my wife’s
Sandy Bridge computer. An approximate value (with 20 timings per address),
including code overhead, is that I run around 500k probes per second. With a
64-bit address space, that is too slow to scan the entire address space.
However, first we should notice that modern operating systems use paging,
and thus we only need to probe once per page; a page is usually 4096 bytes
large. Also, on Windows 7 it appears that the upper 16 bits are all set for
drivers, further reducing the problem. Finally, Windows 7 loads drivers contiguously
(on my computer, around 100 megabytes of them), and thus I only have to run
through a 48-bit address space in 100-megabyte steps to find where the drivers
are hidden. This reduces the problem to a practical 6-second problem. Now, I do
get a few false positives on this scan: other stuff than drivers may be located
in this space, and I may have some tuning problems that additional probing could
further reduce. But I consistently find the drivers, with only a few false
positives.
What was a
blessing above (that the drivers are clustered into one big chunk of pages)
is a curse going forward. The problem is that we do not get the size of the individual
drivers and thus are unable to tell which is which for a gadget search. However,
I believe we can solve this problem with modified versions of evict+time and
prime+probe attacks. This, however, will have to wait for part 2 of this blog
post.
Thanks
I need to
express my gratitude to Nishat Herath of Qualys for the discussions I had with
him on this topic.
Literature
[1] Intel: “Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 3A: System Programming Guide, Part 1”. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[2] Intel: “Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z”.