AWS continuously improves its cloud services and introduces new hardware with more processing power, but customers usually do not rush to move to newer instance generations. AWS documentation states that newer generations are more powerful and cheaper, but what is the difference in numbers? In this post, I compare four generations of the M (general purpose) instance family to show the difference in performance and price.
Comparing M4, M5, M6g and M7g instances
I compare four generations of the same instance family at the same size: m4.large, m5.large, m6g.large, and m7g.large. All have 2 vCPUs and 8 GiB of RAM.
I first checked the price (in the us-east-1 region) and measured network performance via Speedtest:
```
### Test for m6g.large
# curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
Retrieving speedtest.net configuration...
Testing from Amazon.com (52.205.53.191)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by eero (Ashburn, VA) [0.81 km]: 1.447 ms
Testing download speed................................................................................
Download: 3633.89 Mbit/s
Testing upload speed......................................................................................................
Upload: 3298.03 Mbit/s
```
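When the same measurement is repeated on several instances, it can help to pull just the two throughput figures out of the Speedtest output. The following is a minimal sketch, assuming the text output format shown above; `parse_speedtest` is a hypothetical helper name, not part of speedtest-cli.

```shell
# Hypothetical helper: read speedtest-cli text output on stdin and
# print "<download> <upload>" in Mbit/s.
parse_speedtest() {
    awk '/^Download:/ { down = $2 }
         /^Upload:/   { up = $2 }
         END          { print down, up }'
}

# Example usage (needs network access, so it is left commented out):
# curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python - | parse_speedtest
```

A single Speedtest run is noisy, so averaging a few runs per instance is probably a fairer basis for comparison.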
The table below combines AWS-provided data with my first findings:
Instance Size / Gen | vCPU | Memory (GiB) | Instance Storage | Network Bandwidth (Gbps) | Speedtest (approx. Mbit/s) | EBS Bandwidth | Hourly Price, USD (us-east-1) | AWS Declares |
---|---|---|---|---|---|---|---|---|
m4.large | 2 | 8 | EBS-only | Moderate | 500 | 450 Mbps | 0.10 | - |
m5.large | 2 | 8 | EBS-only | Up to 10 | 3000 | Up to 4,750 Mbps | 0.096 | up to 20% improvement in price/performance compared to M4 instances |
m6g.large | 2 | 8 | EBS-only | Up to 10 | 3500 | Up to 4,750 Mbps | 0.077 | up to 40% better price performance over M5 instances |
m7g.large | 2 | 8 | EBS-only | Up to 12.5 | 5000 | Up to 10 Gbps | 0.0816 | up to 25% better performance over the sixth-generation AWS Graviton2-based M6g instances; DDR5 memory, which provides 50% higher memory bandwidth compared to DDR4 memory; 20% higher enhanced networking bandwidth compared to M6g instances |
Price difference
m7g.large is about 18% cheaper than m4.large ($0.0816 vs. $0.10 per hour).
M7g is slightly more expensive than M6g (about 6%) because M7g uses newer RAM (DDR5) instead of the DDR4 in M6g.
M7g instances feature Double Data Rate 5 (DDR5) memory, which provides 50% higher memory bandwidth compared to DDR4 memory to enable high-speed access to data in memory.
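The price comparison is simple arithmetic on the hourly prices from the table. Here is a small sketch of it; `pct_cheaper` is a hypothetical helper name, and the prices are the us-east-1 figures quoted above.

```shell
# Hypothetical helper: percentage by which the new price is lower than
# the old one (a negative result means the new price is higher).
pct_cheaper() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

pct_cheaper 0.10 0.0816    # m4.large  -> m7g.large
pct_cheaper 0.077 0.0816   # m6g.large -> m7g.large
```

On-demand prices change over time and differ by region, so the inputs should be refreshed before drawing conclusions.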
Network performance
AWS categorizes network performance for some instances with qualitative descriptors like "Low," "Moderate," "High," etc., rather than specifying exact numerical bandwidth values. For "Moderate" network performance, AWS does not publicly disclose precise bandwidth figures, as the actual throughput can vary based on multiple factors, including network congestion and the instance’s physical location.
I used the Speedtest utility to get actual numbers. Measured throughput increased significantly from generation to generation, from about 500 Mbit/s on m4.large to about 5000 Mbit/s on m7g.large.
CPU performance check
Sysbench was used to test the CPU and memory performance.
Sysbench is a scriptable multi-threaded benchmark tool based on LuaJIT. It is most frequently used for database benchmarks but can also create arbitrarily complex workloads that do not involve a database server.
Sysbench comes with the following bundled benchmarks:
- oltp_*.lua: a collection of OLTP-like database benchmarks
- fileio: a filesystem-level benchmark
- cpu: a simple CPU benchmark
- memory: a memory access benchmark
- threads: a thread-based scheduler benchmark
- mutex: a POSIX mutex benchmark
How to install the tool on Amazon Linux 2023:
```
# Build dependencies (run as root or prefix with sudo)
yum -y install make automake libtool pkgconfig libaio-devel
yum -y install openssl-devel

# MySQL community repository, client, and development package
sudo wget https://dev.mysql.com/get/mysql80-community-release-el9-1.noarch.rpm
sudo dnf install mysql80-community-release-el9-1.noarch.rpm -y
sudo rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2023
sudo dnf install mysql-community-client -y
sudo dnf install mysql-devel -y

# Build and install sysbench from source
git clone https://github.com/akopytov/sysbench.git
cd sysbench
./autogen.sh
./configure
make -j
make install
```
M4 instance CPU / Memory test
Info about the CPU:
```
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 1
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
Virtualization features:
  Hypervisor vendor: Xen
  Virtualization type: full
Caches (sum of all):
  L1d: 32 KiB (1 instance)
  L1i: 32 KiB (1 instance)
  L2: 256 KiB (1 instance)
  L3: 45 MiB (1 instance)
```
This will run a single-threaded CPU benchmark.
```
$ sysbench cpu run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Prime numbers limit: 10000

Initializing worker threads...
Threads started!

CPU speed:
    events per second: 757.73

Throughput:
    events/s (eps): 757.7278
    time elapsed: 10.0010s
    total number of events: 7578

Latency (ms):
    min: 1.30
    avg: 1.32
    max: 1.68
    95th percentile: 1.34
    sum: 9987.17

Threads fairness:
    events (avg/stddev): 7578.0000/0.00
    execution time (avg/stddev): 9.9872/0.00
```
One more test with 16 threads:
```
# sysbench --threads=16 cpu run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Initializing random number generator from current time

Prime numbers limit: 10000

Initializing worker threads...
Threads started!

CPU speed:
    events per second: 1270.80

Throughput:
    events/s (eps): 1270.7967
    time elapsed: 10.0055s
    total number of events: 12715

Latency (ms):
    min: 1.55
    avg: 12.52
    max: 141.00
    95th percentile: 71.83
    sum: 159180.16

Threads fairness:
    events (avg/stddev): 794.6875/7.86
    execution time (avg/stddev): 9.9488/0.04
```
Test memory (single thread):
```
$ sysbench memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
    block size: 1KiB
    total size: 102400MiB
    operation: write
    scope: global

Initializing worker threads...
Threads started!

Total operations: 4285343 (428530.63 per second)
4184.91 MiB transferred (418.49 MiB/sec)

Throughput:
    events/s (eps): 428530.6314
    time elapsed: 10.0001s
    total number of events: 4285343

Latency (ms):
    min: 0.00
    avg: 0.00
    max: 0.15
    95th percentile: 0.00
    sum: 3419.16

Threads fairness:
    events (avg/stddev): 4285343.0000/0.00
    execution time (avg/stddev): 3.4192/0.00
```
Test memory (16 threads):
```
$ sysbench --threads=16 memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Initializing random number generator from current time

Running memory speed test with the following options:
    block size: 1KiB
    total size: 102400MiB
    operation: write
    scope: global

Initializing worker threads...
Threads started!

Total operations: 5716923 (571674.03 per second)
5582.93 MiB transferred (558.28 MiB/sec)

Throughput:
    events/s (eps): 571674.0298
    time elapsed: 10.0003s
    total number of events: 5716923

Latency (ms):
    min: 0.00
    avg: 0.01
    max: 140.03
    95th percentile: 0.00
    sum: 54925.26

Threads fairness:
    events (avg/stddev): 357307.6875/2433.99
    execution time (avg/stddev): 3.4328/0.25
```
MUTEX benchmark
A mutex benchmark evaluates mutex implementations’ performance, scalability, and overhead in a multi-threaded environment. The primary goal is to measure how efficiently a mutex can manage access to shared resources by multiple threads, especially under heavy concurrency.
Throughput refers to the number of operations (or events) completed within a given time frame when the mutex synchronizes access to shared resources. Higher throughput indicates better performance under concurrent access.
```
$ sysbench mutex run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...
Threads started!

Throughput:
    events/s (eps): 4.4504
    time elapsed: 0.2247s
    total number of events: 1

Latency (ms):
    min: 224.58
    avg: 224.58
    max: 224.58
    95th percentile: 223.34
    sum: 224.58

Threads fairness:
    events (avg/stddev): 1.0000/0.00
    execution time (avg/stddev): 0.2246/0.00
```
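The same set of runs (cpu, memory, and mutex, each with 1 and 16 threads) has to be repeated on every instance, so it may be convenient to wrap them in a loop. This is only a sketch: `run_suite` and the `SYSBENCH` override variable are hypothetical names introduced here, and the loop uses the same sysbench defaults as the commands above.

```shell
# Hypothetical wrapper: run the cpu, memory, and mutex tests with 1 and
# 16 threads. SYSBENCH can point at a different binary; it defaults to
# "sysbench" found in PATH.
run_suite() {
    for threads in 1 16; do
        for name in cpu memory mutex; do
            echo "== $name, threads=$threads =="
            "${SYSBENCH:-sysbench}" --threads="$threads" "$name" run
        done
    done
}

# Example usage on an instance with sysbench installed:
# run_suite | tee m4.large.txt
```

Saving the raw output per instance keeps the full details available even when only the headline numbers end up in a table.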
M5 instance CPU / Memory test
Info about the CPU:
```
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 4
BogoMIPS: 4999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
  Hypervisor vendor: KVM
  Virtualization type: full
Caches (sum of all):
  L1d: 32 KiB (1 instance)
  L1i: 32 KiB (1 instance)
  L2: 1 MiB (1 instance)
  L3: 33 MiB (1 instance)
```
The full sysbench output is omitted because all details will be provided in a table and graphs later:
```
$ sysbench cpu run
CPU speed:
    events per second: 1064.75

$ sysbench --threads=16 cpu run
CPU speed:
    events per second: 1671.36
```
M6g instance CPU / Memory test
Info about the CPU:
```
# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
BIOS Vendor ID: AWS
Model name: Neoverse-N1
BIOS Model name: AWS Graviton2
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d: 128 KiB (2 instances)
  L1i: 128 KiB (2 instances)
  L2: 2 MiB (2 instances)
  L3: 32 MiB (1 instance)
```
The full sysbench output is omitted because all details will be provided in a table and graphs later:
```
$ sysbench cpu run
CPU speed:
    events per second: 2853.55

$ sysbench --threads=16 cpu run
CPU speed:
    events per second: 5696.65
```
M7g instance CPU / Memory test
Info about the CPU:
```
# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
BIOS Vendor ID: AWS
BIOS Model name: AWS Graviton3
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
  L1d: 128 KiB (2 instances)
  L1i: 128 KiB (2 instances)
  L2: 2 MiB (2 instances)
  L3: 32 MiB (1 instance)
```
The full sysbench output is omitted because all details will be provided in a table and graphs later:
```
$ sysbench cpu run
CPU speed:
    events per second: 3024.28

$ sysbench --threads=16 cpu run
CPU speed:
    events per second: 6044.47
```
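Copying figures by hand from each run is error-prone, so the relevant lines can also be extracted with standard text tools. The sketch below assumes the sysbench 1.1.0 output format shown in the M4 section; `parse_eps` and `parse_mib_per_sec` are hypothetical helper names, not part of sysbench.

```shell
# Hypothetical helper: print the "events/s (eps)" figure from sysbench
# output read on stdin.
parse_eps() {
    awk '/events\/s \(eps\):/ { print $NF }'
}

# Hypothetical helper: print the MiB/sec figure from a sysbench memory
# run read on stdin.
parse_mib_per_sec() {
    sed -n 's/.*transferred (\([0-9.]*\) MiB\/sec).*/\1/p'
}

# Example usage on an instance with sysbench installed:
# sysbench cpu run | parse_eps
# sysbench memory run | parse_mib_per_sec
```

The eps line carries four decimal places, so values extracted this way can differ in the last digits from the rounded "per second" figures printed elsewhere in the same output.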
Benchmark results
The following table collects all results from the two experiments (single thread and 16 threads) for the four instance generations (M4, M5, M6g, and M7g):
Instance Family | Instance Size | CPU, 1 thread (events/s) | Memory, 1 thread (events/s) | Memory, 1 thread (MiB/sec) | Mutex, 1 thread (events/s) | CPU, 16 threads (events/s) | Memory, 16 threads (events/s) | Memory, 16 threads (MiB/sec) | Mutex, 16 threads (events/s) |
---|---|---|---|---|---|---|---|---|---|
M4 | m4.large | 757.73 | 428530.63 | 418.49 | 4.45 | 1270.80 | 571674.03 | 558.28 | 4.51 |
M5 | m5.large | 1064.75 | 5774973.91 | 5639.62 | 6.07 | 1671.36 | 9205780.94 | 8990.02 | 6.12 |
M6g | m6g.large | 2853.55 | 5020851.87 | 4903.18 | 4.28 | 5696.65 | 3973599.35 | 3880.47 | 8.34 |
M7g | m7g.large | 3024.28 | 5570464.39 | 5439.91 | 5.13 | 6044.47 | 5794674.12 | 5658.86 | 9.88 |
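CPU results show a significant performance increase with each generation, but the memory test shows a curious result: m5.large has the highest memory throughput in both the single-thread and 16-thread tests, and m6g.large is slower with 16 threads than with one.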
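Combining the single-thread CPU figures with the hourly prices from the first table gives a rough price-performance view. The sketch below uses CPU events per second per dollar of hourly price, which is only one possible metric and an assumption on my part; `per_dollar` is a hypothetical helper name.

```shell
# Hypothetical helper: sysbench CPU events per second divided by the
# hourly on-demand price in USD, rounded to a whole number.
per_dollar() {
    awk -v eps="$1" -v price="$2" 'BEGIN { printf "%.0f\n", eps / price }'
}

per_dollar 757.73 0.10      # m4.large
per_dollar 1064.75 0.096    # m5.large
per_dollar 2853.55 0.077    # m6g.large
per_dollar 3024.28 0.0816   # m7g.large
```

By this metric, m6g.large and m7g.large come out nearly identical (roughly 37,000 events/s per dollar-hour), well ahead of m5.large (about 11,000) and m4.large (about 7,600). A synthetic prime-number benchmark will not necessarily reflect the behavior of a real workload.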
Consideration for migration to Graviton
The tests showed a significant increase in CPU and network performance on Graviton instances, along with some cost savings.
AWS Graviton is a family of processors designed to deliver the best price performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2).
AWS Graviton-based instances cost up to 20% less than comparable x86-based Amazon EC2 instances.
AWS Graviton-based instances use up to 60% less energy than comparable EC2 instances.
Is your application ready to run on ARM?
The "Porting Advisor for Graviton" tool analyzes source code for known code patterns and dependency libraries. It then generates a report listing any incompatibilities with Graviton processors. The tool also suggests the minimal required and/or recommended versions of language runtimes and dependency libraries for running on Graviton instances.
Currently, the tool supports the following languages/dependencies:
- Python 3+
- Java 8+
- Go 1.11+
- C, C++, Fortran
You can run it as a Docker container. This option eliminates the need to worry about Python or Java versions or any other dependency that the tool needs, and it is the quickest way to get started:
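```
# Run from the root of the cloned Porting Advisor repository.
docker build -t porting-advisor .
# my/repo/path and my/output are placeholders; Docker bind mounts need absolute host paths.
docker run --rm -v my/repo/path:/repo -v my/output:/output porting-advisor /repo --output /output/report.html
```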
[Screenshots: scan results for sample Python, Java, and Go code]
PLEASE NOTE: Even though the tool does its best to find known incompatibilities, it’s still recommended that you perform the appropriate tests on your application on a Graviton instance before going to Production.
Conclusion
Graviton instances look great. They are much more powerful and a bit cheaper than previous generations. In this post, I tested CPU, Memory, and Network performance for M4, M5, M6g, and M7g instances, compared costs, built graphs for visibility, and demonstrated a tool that can help you with the preliminary assessment of how ready your applications are for running on ARM instances.