from dhelios.blogspot.com
SYMPTOMS
Messages like the following are logged to the /var/adm/messages file on Solaris[TM] 10:
May 23 17:48:32 g4as7030 genunix: [ID 883052 kern.notice] basic rctl process.max-stack-size (value 8388608) exceeded by process 20431
May 23 17:48:32 g4as7030 genunix: [ID 883052 kern.notice] basic rctl process.max-stack-size (value 8388608) exceeded by process 20431
CHANGES
Logging of these messages is not enabled by default; enabling it with rctladm(1M) will make them visible but the results cannot be correctly interpreted without knowing something of how the internal implementation works.
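For reference, the converse of the rctladm(1M) command shown under SOLUTION below enables the global syslog action and makes these messages appear:

# rctladm -e syslog process.max-stack-size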
CAUSE
The system allocates pages whenever another page is needed so the process can make progress. This is one of the main responsibilities of the virtual-memory subsystem of any modern operating system, and it happens hundreds of times every minute; but that ordinary demand paging has little to do with stack-growing. For those very rare processes which do need more than a few KB of stack, it is beneficial to map large pages into their stack segments when possible, because these processes will then run more efficiently.
Growing a stack happens when a process is exec(2)'d (bringing the first 8K page of its stack into existence) and then whenever the process needs more room on its stack than it currently has. The logged messages are an artifact of an internal implementation detail, in this case a side effect of variable page sizes. The vast majority of processes need only 8K and can get along for their entire lifetime with a single stack page of this smallest size. A minority of processes needs 16K or 24K of stack, and their stacks get grown by one or two pages as needed. A vanishingly small minority of processes needs more than 24K; these have their stacks grown several times during the process lifetime, switching to larger pages when appropriate (which is when these rctl checks happen and when these messages may get logged, if logging has been enabled).

The max-stack-size rctl is also used to keep other address-space mappings outside the range reserved for this growth, so that we don't bump into a shared library or anonymous mapping later when we do try to grow the stack. Growth is triggered simply by memory accesses into the "yellowzone" below the already existing stack pages: the pagefault trap handler recognizes them as accesses to memory destined to become part of the stack, and calls a kernel function named grow().
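To see the stack-size values in effect for a particular process, the rctl can be queried with prctl(1); querying the current shell, for example:

# prctl -n process.max-stack-size $$

The basic value reported should correspond to the 8388608 bytes (8MB) cited in the logged messages.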
In the past, when all sun4u pages were 8K in size, things were simple: one check against the stack-size limit sufficed. If the check passed, the stack would be grown by enough zero-initialized pages to contain the desired address. If the check failed, no growth was possible and the process would be in trouble, typically receiving a signal (indicating that the pagefault could not be satisfied, rather than arising from a resource-control action) and almost always dumping core, unless the process was programmed to catch and handle the signal on an alternate stack.

But nowadays we have large pages, and processes benefit in performance from using them even on the stack segment, once it has grown large enough to make this worthwhile. So the grow() algorithm has become more complex. It now tries to convert the existing stack segment to a larger pagesize when the segment is large enough to benefit and large pages are available, and it also tries to use larger pages for the piece to be grafted on, when possible. Depending on how large the segment already is, on its current pagesize, on which page sizes the platform supports, on whether large pages are ready to be used, and on where the requested address falls in the yellowzone, this may require a few successive attempts before something fits. So a single stack-growing memory access can now result in several (up to four on these platforms) checks against max-stack-size: one check for each page size being tried.
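Which page sizes grow() can try depends on the processor; pagesize(1) lists them. On an UltraSPARC T1 (T2000) system, for example, the list would include the 4M and 256M sizes mentioned below, along these lines:

# pagesize -a
8192
65536
4194304
268435456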
The messages which are displayed come from such attempts to fit a large-page piece onto the growing stack. An attempt can bump into the default 8MB ceiling (or any other configured ceiling), especially when a 4M page on the V240, or a 4M or 256M page on the T2000, is being tried. The failing attempts result in the messages being logged. They are indeed denials of the operation attempted by the page-fault handler at that point, but they are not fatal to the process: the handler goes on to try the next smaller page size. If the 8MB ceiling has not yet been exhausted, growth will eventually succeed, using one or more 8K pages if nothing larger fits. In no case will stack growth ever go beyond the ceiling, nor can the ceiling be raised to make more room inside a running process (though it can be raised to take effect on a future exec(2) of the process itself or of its future children), since ld.so.1 and shared-library mappings are going to be sitting in the way. (This applies to the SPARC and amd64 address-space layouts. On 32-bit x86, the stack is at the bottom of the address space and a perpetually unmapped redzone page at virtual address 0x0 is in the way, for catching null-pointer abuses.)
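As a sketch of raising the ceiling for future children: in a shell with sufficient privilege, ulimit -s (which takes kilobytes and on Solaris maps to the basic process.max-stack-size value) can be raised before launching the application, so that everything exec(2)'d from that shell starts with the larger ceiling:

# ulimit -s 16384
# ulimit -s
16384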
Running
# /bin/pmap -sx PID[,...]
against a process or processes of interest will display, among other things, the page sizes currently in use for each mapped segment. You may find a few processes on your system which do use larger pages for their stack segments (at the top of the address space and thus near the end of the pmap output, and marked as "[ stack ]"), which will confirm that this mechanism is indeed getting triggered under the workloads of this system and that it is indeed operating successfully.
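For example, a stack segment that has been converted to 64K pages would appear near the end of the listing along these lines (illustrative output; the Pgsz column shows the page size currently in use):

 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
FFB80000     512     512     512       -  64K rwx--    [ stack ]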
SOLUTION
This is NOT a "problem". The messages merely reflect an aspect of normal system operation.
The messages may be quieted in syslog by using:
# rctladm -d syslog process.max-stack-size
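The change can be verified by querying the global action state afterwards; the syslog action should then show as off (illustrative output):

# rctladm process.max-stack-size
process.max-stack-size      syslog=off     [ lowerable deny no-signal bytes ]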