Is your Azure web role throwing 503 Service Unavailable? It could be because of a poor default Pagefile size.
LeanSentry itself runs entirely on Windows Azure, including our www.leansentry.com web front-ends that are hosted on Large (A3) instances. A couple weeks ago, we had to scramble to address this problem when our site started returning intermittent 503 Service Unavailable errors.
Troubleshooting low virtual memory on Azure
The first sign of trouble was LeanSentry alerting us to 503s. This was also being reported by Pingdom and seen by our CDN provider.
On the server, the application pool's worker process was failing to commit the memory it needed to dequeue and process requests. As a result, requests sitting in the HTTP.sys application pool queue were being abandoned, which is what produced the 503s.
A typical signature for this is QueueFull errors, logged as the worker process falls behind in dequeueing requests and the queue exceeds its limit, as well as Connection_Abandoned_By_ReqQueue errors, logged when the worker process fails to process a request it had already dequeued.
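These rejections show up in the HTTP.sys error log rather than the IIS site logs, so a quick way to confirm them is to search that log directly. For example, assuming the default HTTPERR log location:

REM Search the HTTP.sys error logs for queue-related rejections
findstr /i /c:"QueueFull" /c:"Connection_Abandoned_By_ReqQueue" %SystemRoot%\System32\LogFiles\HTTPERR\httperr*.log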
The nasty thing about running out of virtual memory (or in this case, really hitting the Commit limit) is that applications don't just take longer to execute because of paging. They fail, often in unpredictable ways. In our case, WAS was failing to ping the worker process and was terminating it. LeanSentry now detects worker process failures, which helped here, but without it you would typically be left digging through the WAS ping failure and worker process termination events in the System event log.
Windows also detected the low virtual memory condition, which triggers Resource-Exhaustion-Detector event 2004: “Windows successfully diagnosed a low virtual memory condition”.
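You can check for these yourself by querying the System event log for event 2004; here's a quick sketch, assuming the standard Resource-Exhaustion-Detector provider name:

REM Show the 5 most recent low virtual memory diagnoses (event 2004) in the System log
wevtutil qe System /q:"*[System[Provider[@Name='Microsoft-Windows-Resource-Exhaustion-Detector'] and (EventID=2004)]]" /c:5 /rd:true /f:text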
The default Pagefile size is too low
The A3 (Large) instance has 7 GB of memory. A typical IT recommendation is to allow the Pagefile to grow to at least 2x the physical RAM.
However, on these instances the Pagefile size is set to a fixed 4 GB, barely more than half of the RAM.
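You can check the configured and actual Pagefile sizes on a running instance with wmic, for example:

REM Show the Pagefile that Azure configured (sizes are in MB)
wmic pagefileset list full

REM Show current and peak Pagefile usage
wmic pagefile list full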
In reality, the Pagefile must be able to grow large enough to accommodate the Committed memory on the machine: the virtual memory that applications have requested for use, which must be backed by either physical RAM or the Pagefile.
As long as the sum of physical RAM + Pagefile is greater than the Committed memory, applications will continue to function (although they will get much slower the more the Pagefile is used).
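To see how close you are to that limit on a live instance, compare the commit limit (RAM + Pagefile) with the commit that is still free; for example, via WMI (values are in KB):

REM TotalVirtualMemorySize = commit limit (RAM + Pagefile); FreeVirtualMemory = commit still available
wmic OS get TotalVirtualMemorySize,FreeVirtualMemory /format:list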
Mark Russinovich has a great article explaining Windows memory usage here. In a nutshell, his recommendation for the Pagefile size is:
- MIN: Peak commit – Physical RAM
- MAX: 2x (Peak commit – Physical RAM)
In other words, if your applications end up asking for 10 GB of memory and you have 7 GB of RAM on the VM, you should set the Pagefile to 3 GB MIN and 6 GB MAX.
That said, the default 4 GB Pagefile on a 7 GB machine allows for at most 11 GB of commit. If your applications use memory aggressively (for example, for in-process caching), they'll begin to fail as soon as the total commit grows beyond 11 GB.
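To apply the formula above you need a rough idea of your peak commit. One low-tech way to measure it is to sample the Memory performance counters while the site is under its typical load, and note the highest Committed Bytes value you see; a quick sketch:

REM Sample commit charge and commit limit every 30 seconds, 120 samples (about an hour)
typeperf "\Memory\Committed Bytes" "\Memory\Commit Limit" -si 30 -sc 120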
Why is the Pagefile set so low in Azure?
It’s hard to say. My theory is that it’s for one of these reasons:
- Azure wants you to hit the memory limit sooner, so you upgrade to a larger instance that costs more money.
- Azure wants to limit the IO your VM causes against their shared disks.
These may be somewhat reasonable, because if your applications require a lot more memory than the VM provides, you’ll experience significantly better performance by moving to a larger VM with more physical RAM.
That said, if you experience peak memory rarely and do not want your application to completely fail, increasing the Pagefile will allow it more breathing room when it most needs it.
Increasing Pagefile size on Azure Web roles
To protect your apps from failing when the memory Commit demand grows beyond what the physical RAM and the default Pagefile can back, you can increase the default Pagefile size when you deploy your web role.
We do it in our startup script, which is bootstrapped as a startup task:
ServiceDefinition.csdef:
<WebRole name="LeanServer.Sentinel.Dashboard.Web" vmsize="Large">
  <Startup>
    ...
    <Task commandLine="Startup\Startup.cmd >> c:\Startup.cmd.log 2>>&amp;1" executionContext="elevated" taskType="simple" />
  </Startup>
</WebRole>
startup.cmd:
REM Configure the Pagefile, to increase it from the default custom limit set by Azure
REM (too small, e.g. 4 GB on a 7 GB machine)
call %~dp0pagefile.cmd
pagefile.cmd:
echo Setting Pagefile size ...

REM Set a fixed Pagefile on the system drive: 4 GB initial, growing up to 10 GB (sizes are in MB)
wmic pagefileset where name="%systemdrive%\\pagefile.sys" set InitialSize=4096,MaximumSize=10240
if ERRORLEVEL 1 goto FAIL

echo Done!
exit /B 0

:FAIL
echo Failed to set the Pagefile size!
REM Exit with 0 anyway, so a failure here does not block the role from starting
exit /B 0
Notes:
- Because Azure sets a specific Pagefile size, you can increase it without restarting.
- If you want to move the Pagefile to another drive (we do this for some of our services), you’ll need to reboot. I’ll blog about this in a future post.
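One easy way to confirm that the startup task took effect is to check the commit limit on a running instance after deployment; it should now reflect the larger Pagefile:

REM "Virtual Memory: Max Size" is the commit limit: physical RAM + the current Pagefile size
systeminfo | findstr /C:"Virtual Memory"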
Should you change the Pagefile or just upgrade the VM size?
The full answer to this question is a bit more complex, but the simple answer is: YES, increase the Pagefile, as long as you are able to keep your regular memory usage close to the physical memory limit.
Upgrading to a larger VM just to meet peak memory demand locks you into doubling or quadrupling your cloud costs long-term, while increasing the Pagefile keeps your costs lower and still protects you during the rare times of high memory demand.
I’ll blog more about optimizing memory usage to keep Cloud costs manageable in an upcoming post. That’s it for now.
Phil0001
Hello,
We were getting 503 errors from a single server out of 30. One thing that threw us initially is that the 503 errors no longer appear to be logged in the IIS request log on 2012R2 (as they were with IIS 7/7.5). After searching the IIS request logs for ages and finding nothing, we eventually found the 503s in FREB and the http.sys log. Any idea where this change is documented? Is it your understanding that these 503s should or should not now be in the IIS request log? Any idea why this changed?
Thanks.
mvolo
Hi Phil,
The 503s are usually not logged in the IIS log/FREB, because they represent errors that occur before the request is received by the IIS worker process. They will typically end up in the HTTPERR log, and if they are due to an application pool failure, there may also be some eventlog events in the Application log.
There are a couple exceptions, particularly 503s due to ASP.NET and IIS concurrency limits. Those will show up in the IIS logs because they are worker-process level failures.
FWIW, LeanSentry automatically tracks these, as well as other causes of application pool failure, so give it a try if you are having trouble.
Hope this helps!
Mike
jason
Does this work with the current Azure WebSites?
Mayur Jain
Hello Mike,
Is there a way to get the IIS site name in the IIS connector configured with Tomcat?