Thread (14 messages), 4 authors, 2017-11-09

Re: Huge mapping secondary process linux

From: Chao Zhu <hidden>
Date: 2017-11-09 03:08:47

 

 

From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
Sent: 7 November 2017 18:16
To: Chao Zhu <redacted>
Cc: 'Burakov, Anatoly' <redacted>; bruce.richardson@intel.com; dev@dpdk.org
Subject: RE: [dpdk-dev] Huge mapping secondary process linux

"Chao Zhu" <chaozhu@linux.vnet.ibm.com> wrote on 11/07/2017 09:25:26 AM:
From: "Chao Zhu" <chaozhu@linux.vnet.ibm.com>
To: "'Jonas Pfefferle1'" <JPF@zurich.ibm.com>, "'Burakov, Anatoly'" <anatoly.burakov@intel.com>
Cc: <bruce.richardson@intel.com>, <dev@dpdk.org>
Date: 11/07/2017 11:00 AM
Subject: RE: [dpdk-dev] Huge mapping secondary process linux

 
 
From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
Sent: 28 October 2017 3:23
To: Burakov, Anatoly <anatoly.burakov@intel.com>
Cc: bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; dev@dpdk.org
Subject: Re: [dpdk-dev] Huge mapping secondary process linux

"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 27/10/2017 18:00:27:
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: Jonas Pfefferle1 <JPF@zurich.ibm.com>
Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
Date: 27/10/2017 18:00
Subject: Re: [dpdk-dev] Huge mapping secondary process linux

On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote:
"dev" <dev-bounces@dpdk.org> wrote on 10/27/2017 04:58:01 PM:

 > From: "Jonas Pfefferle1" <JPF@zurich.ibm.com>
 > To: "Burakov, Anatoly" <anatoly.burakov@intel.com>
 > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
 > Date: 10/27/2017 04:58 PM
 > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
 > Sent by: "dev" <dev-bounces@dpdk.org>
 >
 >
 > "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 10/27/2017 04:44:52 PM:
 >
 > > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
 > > To: Jonas Pfefferle1 <JPF@zurich.ibm.com>
 > > Cc: bruce.richardson@intel.com, chaozhu@linux.vnet.ibm.com, dev@dpdk.org
 > > Date: 10/27/2017 04:45 PM
 > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
 > >
 > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote:
 > > > "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 10/27/2017 04:06:44 PM:
 > > >
 > > > > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
 > > > > To: Jonas Pfefferle1 <JPF@zurich.ibm.com>, dev@dpdk.org
 > > > > Cc: chaozhu@linux.vnet.ibm.com, bruce.richardson@intel.com
 > > > > Date: 10/27/2017 04:06 PM
 > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux
 > > > >
 > > > > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote:
 > > > > >
 > > > > > Hi @all,
 > > > > >
 > > > > > I'm trying to make sense of the hugepage memory mappings in
 > > > > > librte_eal/linuxapp/eal/eal_memory.c:
 > > > > > * In rte_eal_hugepage_attach (line 1347) when we try to do a private
 > > > > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we need the
 > > > > > addresses to be identical with the primary process?
 > > > > > * On POWER we have this weird business going on where we use MAP_HUGETLB
 > > > > > because according to this commit:
 > > > > >
 > > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440
 > > > > > Author: Chao Zhu <chaozhu@linux.vnet.ibm.com>
 > > > > > Date:   Thu Apr 6 15:36:09 2017 +0530
 > > > > >
 > > > > >     eal/ppc: fix mmap for memory initialization
 > > > > >
 > > > > >     On IBM POWER platform, when mapping /dev/zero file to hugepage memory
 > > > > >     space, mmap will not respect the requested address hint. This will cause
 > > > > >     the memory initialization for the second process fails. This patch adds
 > > > > >     the required mmap flags to make it work. Beside this, users need to set
 > > > > >     the nr_overcommit_hugepages to expand the VA range. When
 > > > > >     doing the initialization, users need to set both nr_hugepages and
 > > > > >     nr_overcommit_hugepages to the same value, like 64, 128, etc.
 > > > > >
 > > > > > mmap address hints are not respected. Looking at the mmap code in the
 > > > > > kernel this is not true entirely however under some circumstances the hint
 > > > > > can be ignored (
 > > > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_powerpc_mm_mmap.c-23L103&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN-pXjigIjRW0&m=cttQcHlAYixhsYS3lz-BAdEeg4dpbwGdPnj2R3I8Do0&s=Gp0TIjUtIed05Jgb7XnlocpCYZdFXZXiH0LqIWiNMhA&e=
 > > > > > ). However I believe we can remove the extra case for PPC if we use
 > > > > > MAP_FIXED when doing the secondary process mappings because we need them to
 > > > > > be identical anyway. We could also use MAP_FIXED when doing the primary
 > > > > > process mappings resp. get_virtual_area if we want to have any guarantees
 > > > > > when specifying a base address. Any thoughts?
 > > > > >
 > > > > > Thanks,
 > > > > > Jonas
 > > > > >
 > > > > hi Jonas,
 > > > >
 > > > > MAP_FIXED is not used because it's dangerous, it unmaps anything that is
 > > > > already mapped into that space. We would rather know that we can't map
 > > > > something than unwittingly unmap something that was mapped before.
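A minimal sketch of the MAP_FIXED hazard described in the reply above. It is not DPDK code: the function name `map_fixed_clobbers` is hypothetical, and it uses a one-page anonymous mapping instead of hugepages or /dev/zero. It fills a page with a marker byte, maps a fresh page over the same address with MAP_FIXED, and reports what is readable there afterwards.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns the first byte readable at the address after a MAP_FIXED
 * mapping has been placed on top of an existing mapping, or -1 on error.
 */
int map_fixed_clobbers(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* First mapping: let the kernel choose the address, then mark it. */
    unsigned char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;
    memset(old, 0x42, len);

    /* Second mapping: same address, forced with MAP_FIXED.
     * The kernel replaces the old mapping without any error or warning. */
    unsigned char *fresh = mmap(old, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                                -1, 0);
    if (fresh == MAP_FAILED) {
        munmap(old, len);
        return -1;
    }

    /* The marker is gone: a new anonymous mapping reads as zero. */
    int first = (fresh == old) ? fresh[0] : -1;
    munmap(fresh, len);
    return first;
}
```

The call succeeds and the 0x42 marker is no longer visible, which is the silent unmapping the reply warns about.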
 > > >
 > > > Ok, I see. Maybe we can add a check to the primary process's memory
 > > > mappings whether the hint has been respected or not? At least warn if it
 > > > hasn't.
 > >
 > > Hi Jonas,
 > >
 > > I'm unfamiliar with POWER platform, so i'm afraid you'd have to explain
 > > a bit more what you mean by "hint has been respected" :)
 >
 > Hi Anatoly,
 >
 > What I meant was the mmap address hint:
 >
 > "If addr is not NULL, then the kernel takes it as a hint
 >  about where to place the mapping; on Linux, the mapping will be
 >  created at a nearby page boundary."
 >
 > This is actually not true on POWER. It can happen that the address hint is
 > ignored and you get any address back that fits your mapping.
 >
 > Thanks,
 > Jonas

Actually looking through the kernel code this is also not guaranteed on x86.
(https://urldefense.proofpoint.com/v2/url?u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_x86_kernel_sys-5Fx86-5F64.c-23L165&d=DwID-g&c=jf_iaSHvJObTbx-siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN-pXjigIjRW0&m=iqakzG7nSXLfvDHyS9IV5E9DWPnNcv19zcsl3MKMdvI&s=VqzZpcTaCUMmNieZ3WyUw-jsnNP-hAcW487Mumv6xPw&e=)
So in any case the address hint can be ignored by the kernel and you get
any address that fits your mapping.
My suggestion is to check when we do the initial mapping in
get_virtual_area if the hint was respected or not, i.e. if the returned
address == PAGE_ALIGN(address_hint).

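The suggested check could look roughly like the sketch below. This is not the actual get_virtual_area code: `page_align` and `map_with_hint` are hypothetical names, the mapping is a plain anonymous one instead of a hugepage or /dev/zero mapping, and the alignment uses the base page size from sysconf, where the real code would have to use the hugepage size.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr up to the next page boundary (stand-in for PAGE_ALIGN). */
static uintptr_t page_align(uintptr_t addr, size_t page)
{
    return (addr + page - 1) & ~((uintptr_t)page - 1);
}

/*
 * Hypothetical helper: map len bytes using hint as a plain address hint
 * (no MAP_FIXED). On success the mapping is stored in *out.
 * Returns 1 if the kernel placed the mapping at the page-aligned hint,
 * 0 if it chose another address (always the case for a NULL hint),
 * and -1 if mmap failed.
 */
int map_with_hint(void *hint, size_t len, void **out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    void *addr = mmap(hint, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    *out = addr;
    return (uintptr_t)addr == page_align((uintptr_t)hint, page);
}
```

A caller that was given a base address could log a warning when this returns 0, since the mapping then exists but not where it was requested.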
I'm not sure i see the issue here. So, just to make sure i understand 
things correctly:

Whenever we don't request a specific base address through base_address 
EAL parameter, none of this matters - we always ask for memory in 
arbitrary memory locations, correct?

It's also not an issue with secondary processes because we do check 
returned mmap address to see whether it's the same as we requested, correct?

It's only whenever we *do* specify a base_address, we provide an address 
hint to mmap to, but we don't check if the address we got from mmap is 
one in the vicinity of our requested base address, correct? We don't 
check, and the kernel can ignore address hint, so we're not guaranteed 
to respect the base_address flag.

I'm not sure this is a serious issue, because as far as i'm concerned, 
this flag is advisory - we only promise to *attempt* to map things at 
that particular address, not that it will succeed. If the kernel simply 
cannot find an address to satisfy our address hint, or ignores it for 
other reasons - well, tough, nothing we can do about that. I'm not sure 
putting a check like this, where we can't even predict an "expected" 
address is a good idea.

Am i getting this right?

The problem is that when we specify a base address, we want it to be used.
If it is not respected, we basically end up as if we had never specified it.
This very likely leads to not being able to run a secondary process, because
we will not be able to map the addresses from our primary process, and that
is why we introduced the base address parameter in the first place.

-- 
Thanks,
Anatoly

The reason why I put the patch there is that when mapping hugepages
on POWER, the kernel will never respect the address hints when doing
mmap unless we expand the address space or unmap all the hugepages.
This is a big difference when compared with x86. And it affects the
mapping of the secondary process. I agree that the hint is
advisory. Just want to see if there are better solutions.

This is not true. I looked through the kernel code and the address
hint is treated almost the same on both platforms:

PPC: https://elixir.free-electrons.com/linux/latest/source/arch/powerpc/mm/mmap.c#L143
Line 169/170

x86: https://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/sys_x86_64.c#L165
Line 189/190

The only thing that might differ is the virtual address layout
(e.g. due to different page sizes), which might lead to the same
value for base-virtaddr not working on both x86 and POWER.
However, I tested with different address hints and you can easily
find addresses where the address hint is indeed respected.
That is also why I sent in a patch to remove the HUGETLB flags on
the mmap.

Thanks,
Jonas

You can take a look at this. https://bugzilla.linux.ibm.com/show_bug.cgi?id=141628

It’s quite interesting.