This is how the same server looks after the following changes.
- Built a dedicated graphite instance, and set collectd to forward to it
- Created a RAID0 array of both ephemeral disks to store collectd and graphite data
- Set noatime on mounts on both instances
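For readers who want to reproduce the RAID0 and noatime changes, here is a minimal sketch of how they might be scripted. The device names (/dev/xvdb, /dev/xvdc), the array name (/dev/md0), the ext4 filesystem and the /mnt/data mount point are all assumptions for illustration; none of them come from the original setup. The script only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: stripe two ephemeral disks into RAID0 and mount the array with noatime.
# Device names, mount point and filesystem are assumptions, not from the post.
DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_raid0() {
  disk_a=$1; disk_b=$2; mount_point=$3
  # Create a two-disk striped array (no redundancy: losing one disk loses all data).
  run mdadm --create /dev/md0 --level=0 --raid-devices=2 "$disk_a" "$disk_b"
  run mkfs.ext4 /dev/md0
  run mkdir -p "$mount_point"
  # noatime avoids a metadata write every time a data file is read.
  run mount -o noatime /dev/md0 "$mount_point"
}

build_raid0 /dev/xvdb /dev/xvdc /mnt/data
```

If an ephemeral disk is already mounted by default, it would need to be unmounted before creating the array. Ephemeral storage does not survive a stop/start, so anything kept there should be data you can afford to lose.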
collectd, nagios & graphite on EC2: time to scale
Sometimes knowing when and what to scale is at best an educated guess, like throwing a dart and hoping to land somewhere near the bullseye. Other times it’s just obvious.
As an experienced IT admin, I naturally took the laziest approach up front when implementing a new monitoring infrastructure. I installed Nagios and Graphite on an m1.large instance, then added a collectd listener. I hooked them all up with collectd-nagios and collectd-graphite, then added R.I.Pienaar’s fun tool gdash for faster access to handy graphs. Voila! New instances in my VPC (puppetized, of course) instantly start sending stats to the monitoring system, no configuration required.
This approach was simple to bring online, but I knew it was bound to be I/O limited as my EC2 footprint increased. After all, collectd and carbon are writing duplicate data, and unfortunately I started with everything on an EBS volume! The system is instance-backed, so I can simply remount the ephemeral volume.
The collectd/graphite server has around 7000 counters and is processing 300 IOPS on average. As a result, the system profile now looks like the image above. The average load is continually increasing due to I/O wait, and the reason is readily apparent in the ever-increasing network transfer rate. Sadly I don’t have IOPS in graphable form (yet). Nonetheless, this one is a no-brainer; time to scale.
So what are the next steps?
- The obvious: split graphite and the collectd listener on separate instances
- Another obvious one: ensure both collectd and carbon are writing data to ephemeral disk.
- Set noatime to the data directory mount
- Improve RRDtool write performance via RRDtool plugin config options
- Similar tricks with Graphite’s storage backend, Whisper.
I’ll post the results when I have a chance to get this done.
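As a rough illustration of the kind of tuning the last two items refer to, the sketch below writes two config fragments to a scratch directory for review. The option names are real collectd rrdtool plugin and carbon.conf settings, but every value and path here is an assumption chosen for illustration, not a measured or recommended setting.

```shell
# Sketch: write tuning fragments to a scratch directory for review.
# All values and paths are illustrative assumptions, not tested settings.
out_dir=${OUT_DIR:-$(mktemp -d)}
mkdir -p "$out_dir"

# collectd rrdtool plugin: cache updates in memory and throttle disk writes.
cat > "$out_dir/rrdtool.conf" <<'EOF'
LoadPlugin rrdtool
<Plugin rrdtool>
  DataDir "/mnt/data/collectd/rrd"
  CacheTimeout 120
  CacheFlush 900
  WritesPerSecond 50
</Plugin>
EOF

# carbon-cache: cap Whisper update and file-creation rates.
cat > "$out_dir/carbon-cache.conf" <<'EOF'
[cache]
LOCAL_DATA_DIR = /mnt/data/graphite/whisper/
MAX_UPDATES_PER_SECOND = 200
MAX_CREATES_PER_MINUTE = 50
EOF

echo "Wrote fragments to $out_dir"
```

Caching trades fewer, larger writes for more memory use and a longer delay before new points reach disk, so graphs read straight from the files may lag by up to the cache timeout.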
Connecting Amazon VPCs using OpenVPN
At Apigee we’re moving to Amazon VPCs to logically organize the various elements of our API management engine by subnet. Part of this change includes geographically distributed availability for our customers that are multiregional or multinational. For instance, we want replicated Cassandra cluster nodes in multiple regions. Amazon does not offer a native technology to connect VPCs so we are considering our options.
I’ve seen a few forum and mailing list discussions on using OpenVPN with Amazon, but none covered what performance could be expected. I did a few simple tests to determine if it would meet our needs.
First I set up a very basic openvpn configuration file.
Client in region us-ea:
remote <elastic.ip.addr.ess>
dev tun
ifconfig 10.8.0.2 10.8.0.1
secret static.key
Server in region us-west-1:
dev tun
ifconfig 10.8.0.1 10.8.0.2
secret static.key
I used this configuration to connect instances in a public subnet in either region, then used iperf to test performance between them. The instances I tested were small, large, and xlarge with high CPU, always on the Amazon Linux AMI. I used openvpn’s --config flag to point at these files for basic configuration, then tried various tuning parameters on the command line.
I generally followed the advice in this OpenVPN community article. The best performance I found was on a large 64-bit instance using the following OpenVPN configuration:
--tun-mtu 48000 --fragment 0 --mssfix 0 --engine aesni
iperf showed an average of 85Mbps with this configuration. Using --cipher none --auth none the speed was around 130Mbps. For reference, iperf averaged 200Mbps when testing between elastic IPs on the Internet without OpenVPN. Strangely, the performance was significantly lower on xl instances with high CPU than on a large instance.
Private-to-private subnet routing
As an aside, I also needed to test routing between private subnets on the OpenVPN-connected VPCs. The CIDR block you choose when creating a new VPC is important: the address spaces of the two regions must not overlap, or instances cannot route properly. The steps to route between instances in private subnets are as follows:
- Create a VPC in one region using a specific CIDR block (I used 192.168.0.0/23 in us-east)
- Repeat step one in another region using another address space (I used 192.168.10.0/23 in us-west)
- Create an instance to run OpenVPN in a public subnet in both regions.
- Add the routes to the other region in the OpenVPN configuration. For example, I added route 192.168.0.0 255.255.254.0 for the VPN instance in my us-west VPC.
- Disable Source/Dest Check on the openvpn host by right-clicking the instance in the AWS management console. The host will not forward packets if the source/dest check is enabled!
- Enable IP forwarding on the openvpn instances (echo 1 > /proc/sys/net/ipv4/ip_forward)
- Add a route to the AWS route table via the AWS management interface. For example, I added a route for 192.168.10.0/23 to the openvpn instance in the public network on the us-east side. You cannot choose the openvpn instance as a target until step 5 above is complete.
- Create an instance in the private subnet in both regions.
If everything is set up correctly (and your security groups permit it), you should be able to ping hosts in the private subnet on both sides of the VPN tunnel.
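Two details in the steps above are easy to get wrong: checking that the two VPC CIDR blocks do not overlap, and converting a /23 prefix into the dotted netmask that OpenVPN’s route directive expects. The following is a small sketch of both checks in plain shell arithmetic; the function names are my own and purely illustrative.

```shell
# Convert a prefix length (e.g. 23) to a dotted netmask (e.g. 255.255.254.0).
prefix_to_netmask() {
  mask=$(( (0xFFFFFFFF << (32 - $1)) & 0xFFFFFFFF ))
  printf '%d.%d.%d.%d\n' \
    $(( (mask >> 24) & 255 )) $(( (mask >> 16) & 255 )) \
    $(( (mask >> 8) & 255 ))  $(( mask & 255 ))
}

# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) if two CIDR blocks overlap, fail (exit 1) otherwise.
# Two blocks overlap exactly when they match on the shorter of the two prefixes.
cidrs_overlap() {
  net1=${1%/*}; len1=${1#*/}
  net2=${2%/*}; len2=${2#*/}
  len=$len1
  if [ "$len2" -lt "$len" ]; then len=$len2; fi
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$net1") & mask )) -eq $(( $(ip_to_int "$net2") & mask )) ]
}

prefix_to_netmask 23
if cidrs_overlap 192.168.0.0/23 192.168.10.0/23; then
  echo "overlap"
else
  echo "no overlap"
fi
```

With the address spaces used here, the first call prints 255.255.254.0 and the check reports no overlap, so the two VPCs can route to each other.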

