Hauling Out the Big RAM

Amazon released a handful of new announcements: new EC2 pricing, a Relational Database Service, and new high memory instances.

“Make that a Quadruple Extra Large with room for a Planet OSM”

Fig 1 – Big Foot Memory

1. New prices for EC2 instances

             US                              EU
             Linux     Windows   SQL         Linux     Windows   SQL
m1.small     $0.085    $0.12     –           $0.095    $0.13     –
m1.large     $0.34     $0.48     $1.08       $0.38     $0.52     $1.12
m1.xlarge    $0.68     $0.96     $1.56       $0.76     $1.04     $1.64
c1.medium    $0.17     $0.29     –           $0.19     $0.31     –
c1.xlarge    $0.68     $1.16     $2.36       $0.76     $1.24     $2.44

Notice that the small Windows instance, now $0.12/hr, matches Azure pricing:

Compute = $0.12 / hour

This is not really apples to apples, since the Amazon price buys a virtual instance while Azure charges per deployed application. A virtual instance can have multiple service/web apps deployed.

2. Amazon announces a Relational Database Service (RDS)
Based on MySQL 5.1, this doesn’t appear to add a whole lot, since you could always start an instance with any database you wanted. MySQL isn’t exactly known for geospatial work, even though it has some spatial capabilities. You can see a small comparison of PostGIS vs MySQL by Paul Ramsey. I don’t know whether this comparison is still valid, but I haven’t seen much use of MySQL for spatial backends.

This is similar to Azure SQL Server, which is also a convenience deployment that lets you run SQL Server as an Azure service without all the headaches of administration and maintenance tasks. Neither of these options is cloud scaled, meaning they are still single instance versions, not cross partition capable. The SQL Azure CTP has an upper limit of 10Gb, as in hard drive not RAM.

3. Amazon adds new High-Memory instances

  • High-Memory Double Extra Large Instance 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform $1.20-$1.44/hr
  • High-Memory Quadruple Extra Large Instance 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform $2.40-$2.88/hr

These are new virtual instance AMIs that scale up as opposed to scaling out. Scaled out options use clusters of instances in the Grid Computing/Hadoop type of architecture. There is nothing to prohibit using clusters of scaled up instances in a hybridized architecture, other than cost. However, the premise of Hadoop arrays is “divide and conquer,” so it makes less sense to have massive nodes in the array. Since scaling out moves the problem to a whole new parallel programming paradigm, with all of its consequent complexity, it also means owning the code. In contrast, scaling up is generally very simple: you don’t have to own the code or even recompile, just install on more capable hardware.

Returning to the Amazon RDS: Amazon has presumably taken an optimized, compiled route and offers prepackaged MySQL 5.1 instances ready to use:

  • db.m1.small (1.7 GB of RAM, $0.11 per hour).
  • db.m1.large (7.5 GB of RAM, $0.44 per hour)
  • db.m1.xlarge (15 GB of RAM, $0.88 per hour).
  • db.m2.2xlarge (34 GB of RAM, $1.55 per hour).
  • db.m2.4xlarge (68 GB of RAM, $3.10 per hour).

Of course the higher spatial functionality of PostgreSQL/PostGIS can be installed on any of these high memory instances as well; it is just not prepackaged by Amazon. The important thing to note is that memory now reaches 68.4Gb per instance! What does one do with all that memory?

Here is one use:

“Google query results are now served in under an astonishingly fast 200ms, down from 1000ms in the olden days. The vast majority of this great performance improvement is due to holding indexes completely in memory. Thousands of machines process each query in order to make search results appear nearly instantaneously.”
Google Fellow Jeff Dean keynote speech at WSDM 2009.

Having a very large memory footprint makes sense for increasing performance on a DB application. Even fairly large data tables can reside entirely in memory for optimum performance. Whether a database makes use of the best optimized compiler for Amazon’s 64bit instances would need to be explored. Open source options like PostgreSQL/PostGIS would let you experiment with compiling under your choice of compilers, though not necessarily with success.

Todd Hoff has some insightful analysis in his post, “Are Cloud-Based Memory Architectures the Next Big Thing?”

Here is Todd Hoff’s point about having your DB run inside of RAM – remember that 68Gb Quadruple Extra Large memory:

“Why are Memory Based Architectures so attractive? Compared to disk, RAM is a high bandwidth and low latency storage medium. Depending on who you ask the bandwidth of RAM is 5 GB/s. The bandwidth of disk is about 100 MB/s. RAM bandwidth is many hundreds of times faster. RAM wins. Modern hard drives have latencies under 13 milliseconds. When many applications are queued for disk reads latencies can easily be in the many second range. Memory latency is in the 5 nanosecond range. Memory latency is 2,000 times faster. RAM wins again.”

Wow! Can that be right? “Memory latency is 2,000 times faster.”

(Hmm… 13 milliseconds = 13,000,000 nanoseconds, so 13,000,000 ns / 5 ns = 2,600,000x? And 5 GB/s / 100 MB/s = 50x? Am I doing the math right?)

The real question, of course, is what actual benchmarks will reveal. Presumably optimized memory caching narrows the gap between disk storage and RAM, which brings up the problem of configuring a database to use large RAM pools. PostgreSQL has a variety of configuration settings, but to date RDBMS software doesn’t really have a configuration switch that simply caches the whole enchilada.

Here is some discussion of MySQL front-ending the database with In-Memory-Data-Grid (IMDG).

Here is an article on a PostgreSQL configuration to use a RAM disk.

Here is a walk through on configuring PostgreSQL caching and some PostgreSQL doc pages.

Tuning for large memory is not exactly straightforward. There is no “one size fits all.” You can quickly get into Managing Kernel Resources. The two most important parameters are:

  • shared_buffers
  • sort_mem
“As a start for tuning, use 25% of RAM for cache size, and 2-4% for sort size. Increase if no swapping, and decrease to prevent swapping. Of course, if the frequently accessed tables already fit in the cache, continuing to increase the cache size no longer dramatically improves performance.”

OK, given this rough guideline on a Quadruple Extra Large Instance 68Gb:

  • shared_buffers = 17Gb (25%)
  • sort_mem = 2.72Gb (4%)

This still leaves plenty of room, 48.28Gb, to avoid the dreaded swap paging by the OS. Let’s assume a more generous 8Gb for the OS; we still have roughly 40Gb to play with. Looking at sort requirements in detail might justify adding some more sort_mem, maybe bumping it to 5Gb. That still leaves an additional 38Gb to drop into shared_buffers, for a grand total of 55Gb. Of course you have to have a pretty hefty set of spatial tables to use up this kind of space.
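Translated into postgresql.conf, those numbers might look something like the sketch below. This is only a sketch: parameter names and unit syntax depend on the PostgreSQL version (sort_mem was renamed work_mem in 8.0, and memory units like GB are accepted from 8.2 on), and effective_cache_size is an additional planner hint worth setting on a big-RAM box.

  # postgresql.conf sketch for a 68Gb Quadruple Extra Large instance
  shared_buffers = 17GB          # ~25% of RAM for the shared buffer cache
  work_mem = 2700MB              # ~4% of RAM; the old sort_mem, applied per sort operation
  effective_cache_size = 40GB    # rough hint of how much OS file cache is left over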

Here is a list of PostgreSQL limitations. As you can see it is technically possible to run out of even 68Gb.


Limit                        Value
Maximum Database Size        Unlimited
Maximum Table Size           32 TB
Maximum Row Size             1.6 TB
Maximum Field Size           1 GB
Maximum Rows per Table       Unlimited
Maximum Columns per Table    250 – 1600 depending on column types
Maximum Indexes per Table    Unlimited

Naturally the Obe duo has a useful posting on determining PostGIS sizes: Determining size of database, schema, tables, and geometry

To get some perspective on size, an Open Street Map dump of the whole world fits into a 90Gb EBS Amazon Public Data Set configured for PostGIS with pg_createcluster. It looks like this just happened a couple of weeks ago. Although 90Gb is a little out of reach for even a Quadruple Extra Large, I gather the current size of planet osm is still in the 60Gb range, so you might just fit it into 55Gb of RAM. It would be a tad tight. Well, maybe an Octuple Extra Large Instance with 136Gb is not too far off. Of course who knows how big Planet OSM will ultimately end up being.
See planet.openstreetmap.org

Another point to notice is the 8 virtual cores in a Quadruple Extra Large Instance. Unfortunately

“PostgreSQL uses a multi-process model, meaning each database connection has its own Unix process. Because of this, all multi-cpu operating systems can spread multiple database connections among the available CPUs. However, if only a single database connection is active, it can only use one CPU. PostgreSQL does not use multi-threading to allow a single process to use multiple CPUs.”

Running a single connection query apparently won’t benefit from the multiple CPUs of a virtual system, although multiple concurrent connections will be spread across the available cores.

I look forward to someone actually running benchmarks since that would be the genuine reality check.

Summary

Scaling up is the least complex way to boost performance on a lagging application. The Cloud offers lots of choices suitable to a range of budgets and problems. If you want to optimize personnel and adopt a decoupled SOA architecture, you’ll want to look at Azure + SQL Azure. If you want the adventure of large scale research problems, you’ll want to look at instance arrays and Hadoop clusters available in Amazon AWS.

However, if you just want a quick fix, maybe not 2000x but at least some x, take a look at Big RAM. If you do, please let us know the benchmarks!


Azure and GeoWebCache tile pyramids


Fig 1 – Azure Blob Storage tile pyramid for citylimits

Azure Overview

Shared resources continue to grow as essential building blocks of modern life, key to connecting communities and businesses of all types and sizes. As a result a product like SharePoint is a very hot item in the enterprise world. You can possibly view Azure as a very big, very public, SharePoint platform that is still being constructed. Microsoft and 3rd party services will eventually populate the service bus of this Cloud version with lots and lots of service hooks. In the meantime, even early stage Azure with Web Hosting, Blob storage, and Azure SQL Server makes for some interesting experimental R&D.

Azure is similar to Amazon’s AWS cloud services, and Azure’s pricing follows Amazon’s lead with the familiar “pay as you go, buy what you use” model. Azure offers web services, storage, and queues, but instead of giving access to an actual virtual instance, Azure provides services maintained in the Microsoft Cloud infrastructure. Blob storage, Azure SQL Server, and IIS allow developers to host web applications and data in the Azure Cloud, but only with the provided services. The virtual machine is entirely hidden inside Microsoft’s Cloud.

The folks at Microsoft are probably well aware that most development scenarios have some basic web application and storage component, but don’t really need all the capabilities, and headaches, offered by controlling their own server. In return for giving up some freedom you get the security of automatic replication, scalability, and maintenance along with the API tools to connect into the services. In essence this is a Microsoft only Cloud since no other services can be installed. Unfortunately, as a GIS developer this makes Azure a bit less useful. After all, Microsoft doesn’t yet offer GIS APIs, OGC compliant service platforms, or translation tools. On the other hand, high availability with automatic replication and scalability for little effort are nice features for lots of GIS scenarios.

The current Azure CTP lets developers experiment for free with these minor restrictions:

  • Total compute usage: 2000 VM hours
  • Cloud storage capacity: 50GB
  • Total storage bandwidth: 20GB/day


To keep things simple, since this is my first introduction to Azure, I looked at just using Blob Storage to host a tile pyramid. The Silverlight MapControl CTP makes it very easy to add tile sources as layers so my project is simply to create a tile pyramid and store this in Azure Blob storage where I can access it from a Silverlight MapControl.

In order to create a tile pyramid, I also decided to dig into the GeoWebCache standalone beta 1.2. This is a beta and offers some new undocumented features. It is also my first attempt at using geowebcache standalone; generally I just use the version conveniently built into Geoserver. However, since I was only building a tile pyramid rather than serving it, the standalone version made more sense. Geowebcache also provides caching for public WMS services; in cases where a useful WMS is available but not very efficient, it would be nice to cache tiles for at least the subsets useful to my applications.

Azure Blob Storage

Azure CTP has three main components:

  1. Windows Azure – includes the storage services for blobs, queues, and cloud tables as well as hosting web applications
  2. SQL Azure – SQL Server in the Cloud
  3. .NET Services – Service Bus, Access Control Service, Work Flow …

There are lots of walk throughs for getting started in Azure. It all boils down to getting the credentials to use the service.

Once a CTP project is available the next step is to create a “Storage Account” which will be used to store the tile pyramid directory. From your account page you can also create a “Hosted Service” within your Windows Azure project. This is where web applications are deployed. If you want to use “SQL Azure” you must request a second SQL Azure token and create a SQL Service. The .NET Service doesn’t require a token for a subscription as long as you have a Windows Live account.

After creating a Windows Azure storage account you will get three endpoints and a couple of keys.

Endpoints:
http://sampleaccount.blob.core.windows.net/

http://sampleaccount.queue.core.windows.net/

http://sampleaccount.table.core.windows.net/

Primary Access Key: ************************************
Secondary Access Key: *********************************

Now we can start using our brand new Azure storage account. But to make life much simpler, first download the Azure SDK.

The Azure SDK includes some sample code (HelloWorld, HelloFabric, etc.) to get started using the REST interface. I reviewed some of the samples and started down the path of creating the necessary REST calls for recursively loading a tile pyramid from my local system into an Azure blob storage nomenclature. I was just getting started when I happened to take a look at the CloudDrive sample. This saved me a lot of time and trouble.

CloudDrive lets you treat the Azure service as a drive inside PowerShell. The venerable MSDOS cd, dir, mkdir, copy, del etc commands are all ready to go. Wince, I know, I know, MSDOS? I’m sure, if not now, then soon there will be dozens of tools to do the same thing with nice drag and drop UIs. But this works and I’m old enough to actually remember DOS commands.

First, using the elevated Windows Azure SDK command prompt you can compile and run the CloudDrive with a couple of commands:

C:\AzureTools\samples\CloudDrive\buildme.cmd
C:\AzureTools\samples\CloudDrive\runme.cmd

Now open Windows PowerShell and execute the MountDrive.ps1 script. This allows you to treat the local Azure service as a drive mount and start copying files into storage blobs.


Fig 2 – Azure sample CloudDrive PowerShell

Creating a connection to the real production Azure service simply means making a copy of MountDrive.ps1 and changing credentials and endpoint to the ones obtained previously.

function MountDrive {
Param (
 $Account = "sampleaccount",
 $Key = "***************************************",
 $ServiceUrl="http://sampleaccount.blob.core.windows.net/",
 $DriveName="Blob",
 $ProviderName="BlobDrive")

# Power Shell Snapin setup
 add-pssnapin CloudDriveSnapin -ErrorAction SilentlyContinue

# Create the credentials
 $password = ConvertTo-SecureString -AsPlainText -Force $Key
 $cred = New-Object -TypeName Management.Automation.PSCredential -ArgumentList $Account, $password

# Mount storage service as a drive
 new-psdrive -psprovider $ProviderName -root $ServiceUrl -name $DriveName -cred $cred -scope global
}

MountDrive -ServiceUrl "http://sampleaccount.blob.core.windows.net/" -DriveName "Blob" -ProviderName "BlobDrive"

The new-item command lets you create a new container, with the -Public flag ensuring that files will be accessible publicly. Then the Blob: drive copy-cd command will copy files and subdirectories from the local file system to Azure Blob storage. For example:

PS Blob:\> new-item imagecontainer -Public
Parent: CloudDriveSnapin\BlobDrive::http:\\127.0.0.1:10000\devstoreaccount1

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Container 10/16/2009 9:02:22 PM imagecontainer

PS Blob:\> dir

Parent: CloudDriveSnapin\BlobDrive::http:\\127.0.0.1:10000\

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Container 10/16/2009 9:02:22 PM imagecontainer
Container 10/8/2009 9:22:22 PM northmetro
Container 10/8/2009 5:54:16 PM storagesamplecontainer
Container 10/8/2009 7:32:16 PM testcontainer

PS Blob:\> copy-cd c:\temp\image001.png imagecontainer\test.png
PS Blob:\> dir imagecontainer

Parent: CloudDriveSnapin\BlobDrive::http:\\127.0.0.1:10000\imagecontainer

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Blob 1674374 10/16/2009 9:02:57 PM test.png

Because imagecontainer is public the test.png image can be accessed in the browser from the local development storage with:
http://127.0.0.1:10000/devstoreaccount1/imagecontainer/test.png
or if the image was similarly loaded in a production Azure storage account:
http://sampleaccount.blob.core.windows.net/imagecontainer/test.png

It is worth noting that Azure storage consists of endpoints, containers, and blobs. There are some further subtleties for large blobs such as blocks and blocklists as well as metadata, but there is not really anything like a subdirectory. Subdirectories are emulated using slashes in the blob name.
i.e. northmetro/citylimits/BingMercator_12/006_019/000851_002543.png is a container, “northmetro”, followed by a blob name,
“citylimits/BingMercator_12/006_019/000851_002543.png”.

The browser can show this image using the local development storage:
http://127.0.0.1:10000/devstoreaccount1/northmetro/citylimits/BingMercator_12
/006_019/000851_002543.png

Changing to production Azure means substituting a valid endpoint for “127.0.0.1:10000/devstoreaccount1” like this:
http://sampleaccount.blob.core.windows.net/northmetro/citylimits/BingMercator_12
/006_019/000851_002543.png

With CloudDrive getting my tile pyramid into the cloud is straightforward and it saved writing custom code.

The tile pyramid – Geowebcache 1.2 beta

Geowebcache is written in Java and synchronizes very well with the GeoServer OGC service engine. The new 1.2 beta version is available as a .war that is loaded into the webapp directory of Tomcat. It is a fairly simple matter to configure geowebcache to create a tile pyramid of a particular Geoserver WMS layer. (Unfortunately, it took me almost two days to work out a conflict with an existing Geoserver gwc.) The two main files for configuration are:


C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\
                     geowebcache1.2\WEB-INF\geowebcache-servlet.xml
C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\
                    geowebcache1.2\WEB-INF\classes\geowebcache.xml

geowebcache-servlet.xml customizes the service bean parameters and geowebcache.xml provides setup parameters for tile pyramids of layers. Leaving the geowebcache-servlet.xml at default will work fine when no other Geoserver or geowebcache is around. It can get more complicated if you have several that need to be kept separate. More configuration info.

Here is an example geowebcache.xml that uses some of the newer gridSet definition capabilities. It took me a long while to find the schema for geowebcache.xml:
http://geowebcache.org/schema/docs/1.2.0/
The documentation is still thin for this beta release project.

<?xml version="1.0" encoding="utf-8"?>
<gwcConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="http://geowebcache.org/schema/1.2.0/geowebcache.xsd"
  xmlns="http://geowebcache.org/schema/1.2.0">
<version>1.2.0</version>
<backendTimeout>120</backendTimeout>
<gridSets>
  <gridSet>
  <name>BingMercator</name>
  <srs><number>900913</number></srs>
  <extent>
  <coords>
  <double>-11706995</double>
  <double>4839671</double>
  <double>-11687135</double>
  <double>4861458</double>
  </coords>
  </extent>
  <alignTopLeft>true</alignTopLeft>
  <levels>15</levels>
  </gridSet>
</gridSets>
<layers>
  <wmsLayer>
  <name>citylimits</name>
  <gridSubsets>
  <gridSubset>
  <gridSetName>BingMercator</gridSetName>
  <zoomStart>0</zoomStart>
  <zoomStop>10</zoomStop>
  </gridSubset>
  <gridSubset>
  <gridSetName>GoogleMapsCompatible</gridSetName>
  </gridSubset>
  </gridSubsets>
  <wmsUrl><string>http://localhost:80/geoserver/wms</string></wmsUrl>
  <wmsLayers>northmetro:citylimits</wmsLayers>
  <wmsStyles>citylimits</wmsStyles>
  </wmsLayer>
</layers>
</gwcConfiguration>

After editing the configuration files, building the pyramid is a matter of pointing your browser at the local webapp and seeding the tiles down to the level you choose with the gridSet you want. The GoogleMapsCompatible gridSet is built into geowebcache and the BingMercator is a custom gridSet that I’ve added with extent limits defined.
http://localhost/geowebcache1.2/rest/seed/citylimits

This can take a few hours/days depending on the extent and zoom level you need. Once completed I use the CloudDrive PowerShell to copy all of the tiles into Azure blob storage:

PS Blob:\> copy-cd "C:\Program Files\Apache Software Foundation\Tomcat 6.0\temp\geowebcache\citylimits"

This also takes some time; the result is 243,648 files totaling about 1Gb.

Silverlight MapControl

The final piece in the project is adding the MapControl viewer layer. First I add a new tile source layer in the Map Control of the MainPage.xaml

  <m:Map
      Name="MainMap"
      NavigationVisibility="Visible"
      Grid.Column="0" Grid.Row="1" Grid.RowSpan="1" Padding="5"
      Mode="Road">
    <m:Map.Children>
       <!-- Azure tile source -->
       <m:MapTileLayer x:Name="citylimitsAzureLayer" Opacity="0.5" Visibility="Collapsed">
         <m:MapTileLayer.TileSources>
             <local:CityLimitsAzureTileSource></local:CityLimitsAzureTileSource>
         </m:MapTileLayer.TileSources>
      </m:MapTileLayer>
             .
             .

The tile naming scheme is described here:
http://geowebcache.org/trac/wiki/filestorage2
The important point is:

“Most filesystems use btree’s to store the files in directories, so layername/projection_z/[x/(2(z/2))]_[y/(2(z/2))]/x_y.extension seems reasonable, since it works sort of like a quadtree. The idea is that half the precision is in the directory name, the full precision in the filename to make it easy to locate problematic tiles. This will also make cache purges a lot faster for specific regions, since fewer directories have to be traversed and unlinked. “

An ordinary tile source class looks just like this:

  public class CityLimitsTileSource : Microsoft.VirtualEarth.MapControl.TileSource
  {
        public CityLimitsTileSource() : base(App.Current.Host.InitParams["src"] +
          "/geoserver/gwc/service/gmaps?layers=northmetro:citylimits&zoom={2}&x={0}&y={1}")
        {
        }

        public override Uri GetUri(int x, int y, int zoomLevel)
        {
           return new Uri(String.Format(this.UriFormat, x, y, zoomLevel));
        }
  }

However, now I need to reproduce the tile name as it is in the Azure storage container rather than letting gwc/service/gmaps mediate the nomenclature for me. This took a little digging. The two files I needed to look at turned out to be FilePathGenerator.java and GMapsConverter.java (both referenced in the code comments below):

  • http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/storage/blobstore/file/FilePathGenerator.java
  • http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/service/gmaps/GMapsConverter.java

GMapsConverter works because Bing Maps follows the same upper left origin convention and spherical Mercator projection as Google Maps. Here is the final approach using the naming system in GeoWebCache 1.2.

public class CityLimitsAzureTileSource : Microsoft.VirtualEarth.MapControl.TileSource
{
  public CityLimitsAzureTileSource()
  : base(App.Current.Host.InitParams["azure"] + "citylimits/GoogleMapsCompatible_{0}/{1}/{2}.png")
  {
  }

  public override Uri GetUri(int x, int y, int zoomLevel)
  {
   /*
   * From geowebcache
   * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/storage/blobstore/file/FilePathGenerator.java
   * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/service/gmaps/GMapsConverter.java
   * must convert zoom, x, y, and z into tilepyramid subdirectory structure used by geowebcache
  */
  int extent = (int)Math.Pow(2, zoomLevel);
  if (x < 0 || x > extent - 1)
  {
     MessageBox.Show("The X coordinate is not sane: " + x);
  }

  if (y < 0 || y > extent - 1)
  {
     MessageBox.Show("The Y coordinate is not sane: " + y);
  }
  // xPos and yPos correspond to the top left hand corner
  y = extent - y - 1;
  long shift = zoomLevel / 2;
  long half = 2 << (int)shift;
  int digits = 1;
  if (half > 10)
  {
     digits = (int)(Math.Log10(half)) + 1;
  }
  long halfx = x / half;
  long halfy = y / half;
  string halfsubdir = zeroPadder(halfx, digits) + "_" + zeroPadder(halfy, digits);
  string img = zeroPadder(x, 2 * digits) + "_" + zeroPadder(y, 2 * digits);
  string zoom = zeroPadder(zoomLevel, 2);
  string test = String.Format(this.UriFormat, zoom, halfsubdir, img );

  return new Uri(String.Format(this.UriFormat, zoom, halfsubdir, img));
  }

/**
  * From geowebcache
  * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/storage/blobstore/file/FilePathGenerator.java
  * a way to pad numbers with leading zeros, since I don't know a fast
  * way of doing this in Java.
  *
  * @param number
  * @param order
  * @return
  */
  public static String zeroPadder(long number, int order) {
  int numberOrder = 1;

  if (number > 9) {
    if(number > 11) {
      numberOrder = (int) Math.Ceiling(Math.Log10(number) - 0.001);
    } else {
      numberOrder = 2;
    }
  }

  int diffOrder = order - numberOrder;

    if(diffOrder > 0) {
      //System.out.println("number: " + number + " order: " + order + " diff: " + diffOrder);
      StringBuilder padding = new StringBuilder(diffOrder);

      while (diffOrder > 0) {
        padding.Append("0");
        diffOrder--;
       }
       return padding.ToString() + string.Format("{0}", number);
    } else {
      return string.Format("{0}", number);
    }
  }
}
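
As a sanity check against the blob names shown earlier: at zoomLevel 12 the GetUri code above computes shift = 6, half = 2 << 6 = 128, and digits = 3, so after the y flip a tile with x = 851 and y = 2543 lands in subdirectory 006_019 (851/128 = 6, 2543/128 = 19) with file name 000851_002543.png, matching the structure of northmetro/citylimits/BingMercator_12/006_019/000851_002543.png from the examples above.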

I didn’t attempt to change the zeroPadder. Doubtless there is a simple C# String.Format that would replace the zeroPadder from Geowebcache.
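For what it’s worth, plain .NET padding should do the same job. A quick sketch, not exhaustively tested against GeoWebCache’s version:

  // Possible replacement for zeroPadder: PadLeft pads with leading zeros up to
  // the requested width and, like zeroPadder, leaves longer numbers untouched.
  public static string ZeroPad(long number, int order)
  {
    return number.ToString().PadLeft(order, '0');
  }

  // usage: string halfsubdir = ZeroPad(halfx, digits) + "_" + ZeroPad(halfy, digits);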

This works and provides access to tile png images stored in Azure blob storage, as you can see from the sample demo.

Summary

Tile pyramids enhance user experience, matching the performance users have come to expect from Bing, Google, Yahoo, and OSM. However, it is resource intensive to make tile pyramids with large worldwide extents and deep zoom levels; in fact it is not something most services can, or need to, provide except for limited areas. Tile pyramids in the Cloud also require relatively static layers with infrequent updates.

Although using Azure this way is possible and provides performance, scalability, and reliability, I’m not sure it always makes sense. The costs are difficult to predict for a high volume site, since they are based on bandwidth usage as well as storage. Tile pyramid performance is a wonderful thing, but it chews up a ton of storage, much of which is seldom if ever used, so you may be paying fees for many tiles that are rarely needed.

For a stable low to medium volume application it makes more sense to host a tile pyramid on your own server. For high volume sites where reliability is the deciding factor, moving to Cloud storage services may be the right choice, especially where traffic patterns swing wildly or grow rapidly and robust scaling is an ongoing battle.

The Azure CTP is of course not as mature as AWS, but it obviously has the edge in the developer community, and like many Microsoft technologies it has staying power to spare. Leveraging its developer community makes sense for Microsoft, and with easy to use tools built into Visual Studio I can see Azure growing quickly. In time it will just be part of the development fabric, with most Visual Studio deployment choices seamlessly migrating out to the Azure Cloud.

Azure release is slated for Nov 2009.


Storm Runoff modelling and MRLC

The Multi-Resolution Land Characteristics Consortium, MRLC, is a consortium of federal agencies that produces the National Land Cover Database, NLCD 2001. The dataset was developed using a national coverage set of 8 band Landsat-7 imagery along with 30m DEM. The imagery is processed from three separate dates to give a seasonal average land cover classification. The resolution is a bit coarse at 30m square, but it is a valuable resource because of its consistent national coverage.

More detailed information on MRLC:

In addition to the NLCD coverage there are two derivative layers:

  • NLCD 2001 impervious surface: The impervious surface data classifies each pixel into 101 possible values (0% – 100%). The data show the detailed urban fabric in which many of us reside. Low percentage impervious is shown in light gray, with increasing values depicted in darker gray and the highest values in pink and red. White areas have no impervious surface.
  • NLCD 2001 canopy density: Like the impervious surface data, the canopy density database element classifies each pixel into 101 possible values (0% – 100%). The canopy density estimate applies to forest areas only. These data can be combined with the land cover to estimate canopy density by forest type (deciduous, evergreen, mixed, woody wetland).

The data is available for public download. There is also one of those vintage ESRI viewers that qualifies for James Fee’s “Cash for Geo Clunkers” proposal. These Ancien Régime viewers are littered all over the federal landscape. It will take years for newer technology to replace this legacy of ArcIMS. Fortunately there is an exposed WMS service (see GetCapabilities), which permits access to MRLC layers without going through the “Viewer” nonsense. This WMS service proved very useful on a recent web project for Storm Runoff Mitigation.

I am no hydrologist, but once I was provided with the appropriate calculation approach the Silverlight UI was fairly straightforward. Basically there are a number of potential mitigation practices that play into a runoff calculation. Two fairly significant factors are Impervious Surface Area and Canopy Area, which are both available through MRLC’s WMS service. One simplified calculation model in use is called the TR-55 method.

By making use of the MRLC derived layers for Impervious Surface and Canopy at least approximations for these two factors can be derived for any given area of the Continental US. The method I used was to provide a GetMap request to the WMS service which then returned a pixel shaded image of the impervious surface density. Most of the hard work has already been done by MRLC. All I need to do is extract the value density for the returned png image.
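
For reference, a GetMap request is just a parameterized URL, so composing one per viewport is straightforward. Here is a minimal C# sketch; the endpoint and LAYERS value are placeholders, since the real names come from the MRLC GetCapabilities response:

  // Sketch of composing a WMS 1.1.1 GetMap URL for the impervious surface layer.
  // The base URL and layer name below are hypothetical placeholders.
  string baseUrl = "http://www.example.gov/mrlc/wms";
  string layer = "NLCD_2001_Impervious_Surface";        // placeholder layer name
  string bbox = "-105.10,39.70,-105.00,39.80";          // viewport minx,miny,maxx,maxy
  string getlayerurl = baseUrl +
      "?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap" +
      "&LAYERS=" + layer + "&STYLES=" +
      "&SRS=EPSG:4326&BBOX=" + bbox +
      "&WIDTH=512&HEIGHT=512&FORMAT=image/png";

The resulting getlayerurl string is what feeds the HttpWebRequest in the extraction code below.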

Fig 3 – Impervious Surface shaded density red
Fig 4 – Canopy shaded density green

The density values are relative to gray. At first I tried a simple density calculation from the color encoded pixels by subtracting the base gray from the variable green: Green – Red = factor. The sum of these factors, divided by the total pixel area of the image times the maximum byte value of 255, gives a rough percentage of canopy over the viewport. However, after pursuing the USGS for a few days I managed to get the actual percentile RGB tables and improve the density calculation accuracy. This average density percentile is then used in TR-55 as An × CNn, with the Canopy CN value of 70.

The process of extracting density from pixels looks like this:

  HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(new Uri(getlayerurl));
  using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
  {
    if (response.StatusDescription.Equals("OK"))
    {
      using (Stream stream = response.GetResponseStream())
      {
        byte[] data = ReadFully(stream, response.ContentLength);
        Bitmap bmp = (Bitmap)Bitmap.FromStream(new MemoryStream(data), false);
        stream.Close();

        UnsafeBitmap fast_bitmap = new UnsafeBitmap(bmp);
        fast_bitmap.LockBitmap();
        PixelData pixel;
        string key = "";
        double value = 0;
        for (int x = 0; x < bmp.Width; x++)
        {
          for (int y = 0; y < bmp.Height; y++)
          {
            pixel = fast_bitmap.GetPixel(x, y);
            key = pixel.red + " " + pixel.green + " " + pixel.blue;
            if (imperviousRGB.Contains(key))
            {
              value += Array.IndexOf(imperviousRGB, key) * 0.01;
            }
          }

        }
        fast_bitmap.UnlockBitmap();
        double total = (bmp.Height * bmp.Width);
        double ratio = value / total;
        return ratio.ToString();
                   .
                   .
                   .

C#, unlike Java, allows pointer arithmetic in compilation marked unsafe. The advantage of using this approach here is a tremendous speed increase. The array of imperviousRGB strings to percentiles was supplied by the USGS. This process is applied in a WCF service to both the Canopy and the Impervious Surface layers and the result passed back to the TR-55 calculations.

Possible Extensions:

There are several extensions beyond the scope of this project that could prove interesting.

  1. First, the NLCD uses a color classification scheme. A similar color processing algorithm could be used to provide rough percentages of each of these classifications for a viewport area. These could be helpful for various research and reporting requirements.
  2. Beyond simple rectangular viewports, a nice extension would be the ability to supply arbitrary polygonal areas of interest. This is fairly easy to do in Silverlight: the draw command is just a series of point clicks that are added to a Path geometry as line segments. The resulting polygon is then used as a clip mask when passing through the GetMap image. A very simple point in polygon check, either coded manually or using one of the C# ports of JTS, would probably provide reasonable performance (see the short sketch below).
Fig 5 – MRLC NLCD 2001 Colour Classification
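
Here is what the manually coded route might look like: a minimal ray-casting point in polygon sketch in C#. A JTS port such as NetTopologySuite would handle degenerate cases (points exactly on edges, self-intersecting polygons) more robustly.

  using System.Collections.Generic;
  using System.Windows;   // Silverlight Point

  public static class GeometryUtil
  {
    // Classic ray-casting test: count how many polygon edges a horizontal ray
    // from the point crosses; an odd count means the point is inside.
    public static bool PointInPolygon(Point p, IList<Point> polygon)
    {
      bool inside = false;
      for (int i = 0, j = polygon.Count - 1; i < polygon.Count; j = i++)
      {
        Point a = polygon[i];
        Point b = polygon[j];
        if (((a.Y > p.Y) != (b.Y > p.Y)) &&
            (p.X < (b.X - a.X) * (p.Y - a.Y) / (b.Y - a.Y) + a.X))
        {
          inside = !inside;
        }
      }
      return inside;
    }
  }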

What about resolution?

It is tempting to think a little bit about resolution. Looking at the MRLC image results, especially over a map base, it is obvious that at 100 ft resolution even the best of calculations are far from the fine grained detail necessary for accurate neighborhood calculations.

It is also obvious that Impervious Surface can be enhanced directly by applying some additional lookup from a road database. Using pavement estimates from a road network could improve resolution quality quite a bit in urbanized areas. But even that can be improved if there is some way to collect other common urban impervious surfaces such as rooftops, walkways, driveways, and parking areas.

NAIP 1m GSD 4 band imagery has fewer bands but higher resolution. NAIP is a resource that has been applied to unsupervised impervious surface extraction. However, the 4 band acquisition is still not consistently available for the entire US.

Now that more LiDAR data is coming on line at higher resolutions, why not use LiDAR classifications to enhance impervious surface?

LidarServer WMS styles: Lidar All Elevation, Lidar All Classification, Lidar All Intensity, Lidar All Return

Just looking at the different style choices on the LidarServer WMS for instance, it appears that there are ways to get roof top and canopy out of LiDAR data. LiDAR at 1m resolution for metro areas could increase resolution for Canopy as well as rooftop contribution to Impervious Surface estimates.

In fact the folks at QCoherent have developed tools for classification and extraction of features like roof polygons. This extraction tool applied over a metro area could result in a useful rooftop polygon set. Once available in a spatial database, these polygons can be furnished as an additional color filled tile pyramid layer. Availability of this layer would also let the Runoff calculation apply rooftop area estimates to roof drain disconnect factors.

Additionally, improved accuracy of impervious surface calculations can be achieved with a merging version of the simple color scan process. In a merging process the scan loop over the MRLC image does a lookup in a corresponding rooftop image, and any pixel positive for rooftop is promoted to the highest impervious surface classification. This estimate only holds as long as rooftop green gardens remain insignificant.
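
A sketch of that merge pass is below, using the slower System.Drawing GetPixel for brevity (a production version would reuse the UnsafeBitmap approach above). The rooftop test is a placeholder assumption, treating any non-transparent pixel in a hypothetical same-sized rooftop mask image as rooftop.

  // Hypothetical merge of an MRLC impervious image with a same-sized rooftop mask.
  // Rooftop pixels are promoted to 100% impervious; everything else falls back to
  // the MRLC percentile lookup. Requires: using System; using System.Drawing;
  double MergedImperviousRatio(Bitmap mrlcBmp, Bitmap rooftopBmp, string[] imperviousRGB)
  {
    double value = 0;
    for (int x = 0; x < mrlcBmp.Width; x++)
    {
      for (int y = 0; y < mrlcBmp.Height; y++)
      {
        Color roof = rooftopBmp.GetPixel(x, y);
        if (roof.A > 0)                        // assumed rooftop encoding
        {
          value += 1.0;                        // promote to the highest classification
        }
        else
        {
          Color c = mrlcBmp.GetPixel(x, y);
          string key = c.R + " " + c.G + " " + c.B;
          int idx = Array.IndexOf(imperviousRGB, key);
          if (idx >= 0)
          {
            value += idx * 0.01;               // same percentile lookup as before
          }
        }
      }
    }
    return value / (mrlcBmp.Width * mrlcBmp.Height);
  }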

Ultimately the MRLC will be looking at 1m GSD collection for NLCD with some combination of imagery like NAIP and DEM from LiDAR. However, it could be a few years before these high resolution resources are available consistently across the US.

Summary

The utility of WMS resources continues to grow as services become better known and tools for web applications improve. Other OWS services like WFS and WCS are following along behind, but show significant promise as well. The exposure of public data resources in some kind of OGC service should be mandatory at all government levels. The cost is not that significant compared to the cost effectiveness of cross agency, even cross domain, access to valuable data resources.

By using appropriate WMS tools like Geoserver and Geowebcache, vastly more efficient tile pyramids can become a part of any published WMS service layer. It takes a lot more storage, so the improved performance may not be feasible for larger national and worldwide extents. However, in this Runoff Mitigation project, where view performance is less important, the OGC standard WMS GetMap requests proved quite useful for the TR-55 calculations, and performance was adequate.
