Sunday, December 20, 2015

Tactfully Handling Common Java Complaints

Java Sucks! ... Or Not?

So, your friends tell you Java blows chunks. They've either heard/read it elsewhere, had a bad history with slow-loading applets in 1996, or have personally worked with the language and loathe it. I'd like to dispel some of the arguments against Java in the modern age since I think it is a decent language. Let's start with some common things we hear or see about Java...

Java is Slow!

I remember loading applets on websites in Windows 95 and absolutely detesting the experience. Of course, this was on a 486 SX operating at 25 MHz with 4M of RAM and a 14.4Kbps modem. Java was also still in its infancy. Many people had similar experiences and the rancor spread. Unfortunately, such sentiment still exists today -- 20 years later. I am even guilty. A coworker in 2006 had an O'Reilly Java book he offered to lend me. I declined the offer and poked fun at him for even suggesting such a thing. Fast forward two years and I had become a Java fan. It was not until I was forced to learn the language in college that I developed an appreciation for it. Let's review some facts:
  1. Hardware has improved since the 1990s. Processors are faster and have more cache and registers. Bus speeds, disks, memory, and network access are also faster, which improves program load times.
  2. The JVM has improved since the 1990s. HotSpot/JIT (just-in-time compilation), JNI (Java Native Interface), and other features have been added. There is also an array of garbage collection algorithms to choose from depending on your application.
With the convergence of #1 and #2 above, Java performance has come a long way. However, in many (not all) cases, C and C++ programs still handily beat Java in the performance department. Have a look at the array addition performance comparison below. The code is functionally equivalent in both the C and Java programs. Two one-dimensional arrays are filled with random numbers ranging from 1 to 65,535. The sum of the elements at the current index in both arrays is stored at the current position in the first array.
#include <stdlib.h> /* rand() and srand() */
#include <time.h>   /* time_t and time() */

void fill_rand(int *arr, int length)
{
  int i;
  for(i = 0; i < length; i++)
    arr[i] = rand() % 65535 + 1;
}

void add_arr(int *arr1, int *arr2, int length)
{
  int i;
  for(i = 0; i < length; i++)
    arr1[i] += arr2[i];
}

int main()
{
  int arr1[5000];
  int arr2[5000];
  time_t t;
  srand((unsigned int)time(&t)); /* seed the random number generator */
  fill_rand(arr1, 5000);
  fill_rand(arr2, 5000);
  add_arr(arr1, arr2, 5000);
  return 0;
}

import java.util.Random;

class Arr
{
  static void fillRand(int[] arr)
  {
    Random rand = new Random();
    for(int i = 0; i < arr.length; i++)
      arr[i] = rand.nextInt(65535) + 1;
  }

  static void addArr(int[] arr1, int[] arr2)
  {
    for(int i = 0; i < arr1.length; i++)
      arr1[i] += arr2[i];
  }

  public static void main(String[] args)
  {
    int[] arr1 = new int[5000];
    int[] arr2 = new int[5000];
    fillRand(arr1);
    fillRand(arr2);
    addArr(arr1, arr2);
  }
}

The C code executes in approximately 4ms. In comparison, the Java equivalent takes about 207ms. That is over fifty times longer than the C program. Why? Well, the JVM start-up and "warm-up" time needs to be considered. If we make the following changes to the main() method so that only the work itself is timed, we get a more reasonable execution time of about 6ms:
    long startTime = System.nanoTime();
    int[] arr1 = new int[5000];
    int[] arr2 = new int[5000];
    fillRand(arr1);
    fillRand(arr2);
    addArr(arr1, arr2);
    long endTime = System.nanoTime();
    System.out.println((endTime - startTime) / 1000000); // get ms

That isn't so bad. In fact, that is where JIT shines for long-running and commonly-executed code. If we arbitrarily loop over the C program 100 times and sleep 1 second between iterations, the execution time will be similar for each iteration whereas the Java program should become faster (until a peak is achieved.) Modifying main() demonstrates this:
  public static void main(String[] args) throws InterruptedException
  {
    for(int i = 0; i < 100; i++)
    {
      long startTime = System.nanoTime();
      int[] arr1 = new int[5000];
      int[] arr2 = new int[5000];
      fillRand(arr1);
      fillRand(arr2);
      addArr(arr1, arr2);
      long endTime = System.nanoTime();
      System.out.println("Iteration " + i + ": " + (endTime - startTime) / 1000000 + "ms");
      Thread.sleep(1000); // pause 1 second between iterations
    }
  }

[lloyd@lindev ~]$ java Arr
Iteration 0: 6ms
Iteration 1: 3ms
Iteration 2: 3ms
Iteration 3: 3ms
Iteration 4: 3ms
Iteration 5: 3ms
Iteration 6: 3ms
Iteration 7: 2ms
Iteration 8: 2ms
Iteration 9: 2ms
Iteration 10: 2ms
Iteration 11: 2ms
Iteration 12: 1ms
Iteration 13: 1ms
Iteration 14: 1ms
Iteration 15: 1ms
Iteration 16: 1ms
Iteration 17: 1ms
Iteration 18: 1ms
Iteration 19: 0ms
Iteration 20: 0ms
Iteration 21: 0ms
Iteration 22: 0ms
Iteration 23: 0ms
Iteration 24: 0ms
Iteration 25: 0ms
Iteration 26: 0ms
Iteration 27: 0ms
Iteration 28: 0ms
Iteration 29: 0ms
Iteration 30: 0ms
Iteration 31: 0ms

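The code within the loop body eventually executes faster than the 4ms measured for the equivalent C program thanks to JVM runtime optimizations. Keep in mind that the C figure covers the whole process (including start-up) rather than just the loop body, so the two numbers are not directly comparable. Even though the program reports a time of 0ms, the work obviously still takes microseconds or nanoseconds; the integer division by 1,000,000 simply truncates them.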
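To see the sub-millisecond figures that the division hides, the same loop can report microseconds instead. This is only a minimal sketch: the class name ArrTiming and the helper timePassMicros() are made up for illustration, and the absolute numbers will vary by machine and JVM.

```java
import java.util.Random;

class ArrTiming {
  // Fill the array with pseudo-random values from 1 to 65,535.
  static void fillRand(int[] arr, Random rand) {
    for (int i = 0; i < arr.length; i++)
      arr[i] = rand.nextInt(65535) + 1;
  }

  // Add arr2 into arr1 element by element.
  static void addArr(int[] arr1, int[] arr2) {
    for (int i = 0; i < arr1.length; i++)
      arr1[i] += arr2[i];
  }

  // Time one fill-and-add pass and return the elapsed time in microseconds.
  static long timePassMicros(Random rand) {
    long start = System.nanoTime();
    int[] arr1 = new int[5000];
    int[] arr2 = new int[5000];
    fillRand(arr1, rand);
    fillRand(arr2, rand);
    addArr(arr1, arr2);
    return (System.nanoTime() - start) / 1000;
  }

  public static void main(String[] args) {
    Random rand = new Random();
    for (int i = 0; i < 30; i++)
      System.out.println("Iteration " + i + ": " + timePassMicros(rand) + "us");
  }
}
```

The later iterations should generally report smaller values than the first few as the JIT compiler kicks in, though a single run like this is not a rigorous benchmark.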

Java is Insecure!

As with most programs written in C/C++ (Apache HTTPD, ISC BIND, OpenSSL, etc.), vulnerabilities are detected periodically in Java. These are primarily due to the potential dangers of inappropriate pointer use or undersized buffers which allow overflows. The Java platform itself features policies (via the security manager) you can manipulate to effectively sandbox an application. This isolation limits what the program can do with resources such as disks and network access. Another thing to consider is that the JVM is an entire platform and is fairly sizable. The JVM needs to define types, abstract networking and GUI elements, and more. There is obviously an increased risk of bugs the more lines of code a program contains. For most Windows installations, Java even nags you when updates are available or takes care of updating itself automagically. In summary for this section, Java has admittedly had a large number of vulnerabilities over the years. Battling exploits is a part of life that IT folks must deal with. On the bright side, operating systems, web browsers, and Adobe Flash seem to have more vulnerabilities, and keeping Java up-to-date is relatively easy.

Java is Bloated!

Java can handily consume a substantial amount of memory. I had 4M of RAM in 1994 or so. It is fairly common for users to have 8G or 16G these days. Of course, that is no reason to be wasteful. But, one must again consider that the JVM is an entire platform with a large set of features. The memory overhead that accompanies this should be expected. There are many things you can do to tune JVM memory usage. To put things into perspective, I am running a Tomcat instance, a Jetty instance, and one more JVM instance for a JRuby daemon on a virtual private server with 2G of memory (which is also running a plethora of other services) without breaking a sweat. Java also runs on many less-powerful mobile and embedded devices. To recap: Memory is plentiful and fairly cheap these days, the JVM can be tuned to use less memory, and don't be such a tightwad!

Why Java is Annoying

  • Java language lawyers who believe the JLS (Java Language Specification) is the only thing that matters. To them, memory addresses do not exist...
  • Unlike Ruby, not everything is an object (primitives like byte, short, int, long, etc.)
  • No explicit pointers
  • The library is too big
  • Calls to System.gc() are only suggestions that can be ignored
  • Cannot explicitly call destructors
  • Forced to put main() method in a class
  • Syntax can be very verbose/repetitive
  • No operator overloading
  • No multiple inheritance
  • It can be a hog unless you cap the heap
  • No native way to become a daemon or service
  • Others?

Why Java is Awesome

  • It picks up after you
  • Not having to deal with explicit pointers
  • Huge library
  • No operator overloading
  • No multiple inheritance
  • Portable networking and GUI code
  • Largely portable for most other things/compile once run anywhere
  • No need for sizeof
  • It's ubiquitous
  • Multithreading
  • Others?

Thursday, December 10, 2015

64-bit Linux Assembly Tutorial Part II


Welcome to the second installment of the 64-bit Linux assembly tutorial. If you have not yet read part one of this tutorial, you can do so here. If you have read it, I hope that you enjoyed it. We will be covering networking, debugging, optimizing, endianness, and analogous C code in this tutorial. Let us get to it!

The C Way

We are going to start by looking at how you create a network program in C. See Beej's Guide to Network Programming for more information. I am illustrating socket programming in a higher-level language to give you a better idea of the sequence of events that occur. In order to accept network connections in a C program (or assembly), you must take the following steps:
  1. Call socket() to obtain a file descriptor to be used for communication. We used file descriptors in the first tutorial (stdin/0 and stdout/1 specifically.)
  2. Call bind() to associate (or bind) the IP address of a network interface with the file descriptor returned by socket().
  3. Call listen() to make the file descriptor be receptive to incoming network connections.
  4. Call accept() to handle incoming network connections.
accept() returns a file descriptor for the client which you can use to send and receive data to and from the remote end. You can also call close() on the client file descriptor once you are done receiving or transmitting data. After putting it all together, it would look something like:
#include <stdio.h>        /* for perror() and puts() */
#include <stdlib.h>       /* for exit() */
#include <string.h>       /* for memset() and strlen() */
#include <unistd.h>       /* for close() */
#include <sys/socket.h>   /* for AF_INET, SOCK_STREAM, and socklen_t */
#include <netinet/in.h>   /* for INADDR_ANY, sockaddr_in, and htons() */

#define PORT 9990         /* TCP port number to accept connections on */
#define BACKLOG 10        /* connection queue limit */

int main()
{
  /* server and connecting client file descriptors */
  int server_fd, client_fd;

  /* size of sockaddr_in structure */
  socklen_t addrlen;

  /* includes information for the server socket */
  struct sockaddr_in server_address;

  /* message we send to connecting clients */
  char *message = "Greetings!\n";

  /* socket() - returns a file descriptor we can use for our server
   * or -1 if there was a problem
   * Arguments:
   * AF_INET = address family Internet (for Internet addressing)
   * SOCK_STREAM = TCP (Transmission Control Protocol)
   * 0 = default protocol for this type of socket */
  server_fd = socket(AF_INET, SOCK_STREAM, 0);

  /* Check for an error */
  if(server_fd == -1)
  {
    perror("Unable to obtain a file descriptor for the server");
    exit(1);
  }

  /* zero the structure so unused fields do not hold garbage */
  memset(&server_address, 0, sizeof(server_address));
  server_address.sin_family = AF_INET;

  /* set the listen address to any/all available */
  server_address.sin_addr.s_addr = INADDR_ANY;

  /* The htons() function below deals with endian conversion which
   * we'll discuss later. This assignment sets the port number to
   * accept connections on. */
  server_address.sin_port = htons(PORT);

  /* bind() - binds the IP address to the server's file descriptor or
   * returns -1 if there was a problem */
  if(bind(server_fd, (struct sockaddr *)&server_address,
          sizeof(server_address)) == -1)
  {
    perror("Unable to bind");
    exit(1);
  }

  /* listen() - listen for incoming connections */
  if(listen(server_fd, BACKLOG) == -1)
  {
    puts("Failed to listen on server socket!");
    exit(1);
  }

  puts("Waiting for connections...");

  /* Infinite loop to accept connections forever */
  while(1)
  {
    /* accept() - handle new client connections; addrlen is both an
     * input and an output, so reset it before every call */
    addrlen = sizeof(server_address);
    client_fd = accept(server_fd, (struct sockaddr *)&server_address,
                       &addrlen);
    if(client_fd == -1)
    {
      perror("Unable to accept client connection");
      continue;
    }

    /* Send greeting to client and then disconnect them */
    send(client_fd, message, strlen(message), 0);
    close(client_fd);
  }

  return 0;
}

You should be able to copy and paste the above code into a text file.
Compile it with: gcc <file>.c -o network_example
After compiling the program, execute it with: ./network_example
If all went well, you should see something similar to below:
[lloyd@lindev ~]$ ./network_example
Waiting for connections...

Open another terminal and issue: telnet localhost 9990
You should see something like the following:
[lloyd@lindev ~]$ telnet localhost 9990
Trying ::1...
telnet: connect to address ::1: Connection refused
Connected to localhost.
Escape character is '^]'.
Greetings!
Connection closed by foreign host.

You can read more about bind(), listen(), and accept() if you're interested. Next up, we're going to replicate the above C program in x86-64 assembly. Let's see how it looks...

The Assembly Way

[BITS 64]

; Description: 64-bit Linux TCP server
; Author: Lloyd Dilley
; Date: 04/02/2014

struc sockaddr_in
  .sin_family resw 1
  .sin_port resw 1
  .sin_address resd 1
  .sin_zero resq 1
endstruc

section .bss
  peeraddr:
    istruc sockaddr_in
      at sockaddr_in.sin_family, resw 1
      at sockaddr_in.sin_port, resw 1
      at sockaddr_in.sin_address, resd 1
      at sockaddr_in.sin_zero, resq 1
    iend

section .data
  waiting:      db 'Waiting for connections...',0x0A
  waiting_len:  equ $-waiting
  greeting:     db 'Greetings!',0x0A
  greeting_len: equ $-greeting
  error:        db 'An error was encountered!',0x0A
  error_len:    equ $-error
  addr_len:     dq 16
  sockaddr:
    istruc sockaddr_in
      ; AF_INET
      at sockaddr_in.sin_family, dw 2
      ; TCP port 9990 (network byte order)
      at sockaddr_in.sin_port, dw 0x0627
      ; 127.0.0.1 (network byte order)
      at sockaddr_in.sin_address, dd 0x0100007F
      at sockaddr_in.sin_zero, dq 0
    iend

section .text
global _start
_start:
  ; Get a file descriptor for sys_bind
  mov rax, 41           ; sys_socket
  mov rdi, 2            ; AF_INET
  mov rsi, 1            ; SOCK_STREAM
  mov rdx, 0            ; protocol
  syscall
  mov r13, rax
  push rax              ; store return value (fd)
  test rax, rax         ; check if a negative value (error) was returned
  js exit_error

  ; Bind to a socket
  mov rax, 49           ; sys_bind
  pop rdi               ; file descriptor from sys_socket
  mov rbx, rdi          ; preserve server fd (rbx is saved across calls)
  mov rsi, sockaddr
  mov rdx, 16           ; size of the sockaddr_in structure is 16 bytes
  syscall
  push rax
  test rax, rax
  js exit_error

  ; Listen for connections
  mov rax, 50           ; sys_listen
  mov rdi, rbx          ; fd
  mov rsi, 10           ; backlog
  syscall
  push rax
  test rax, rax
  js exit_error
  ; Notify user that we're ready to listen for incoming connections
  mov rax, 1            ; sys_write
  mov rdi, 1            ; file descriptor (1 is stdout)
  mov rsi, waiting
  mov rdx, waiting_len
  syscall
  jmp accept            ; we never return here, so jump rather than call

accept:
  ; Accept connections
  mov rax, 43           ; sys_accept
  mov rdi, rbx          ; fd
  mov rsi, peeraddr
  lea rdx, [addr_len]
  syscall
  push rax
  test rax, rax
  js exit_error

  ; Send data
  mov rax, 1            ; sys_write
  pop rdi               ; peer fd
  mov r15, rdi          ; preserve peer fd (r15 is saved across calls)
  mov rsi, greeting
  mov rdx, greeting_len
  syscall
  push rax
  test rax, rax
  js exit_error
  pop rcx               ; discard saved return value so the stack does not grow

  ; Close peer socket
  mov rax, 3            ; sys_close
  mov rdi, r15          ; fd
  syscall
  push rax
  test rax, rax
  js exit_error
  pop rcx               ; discard saved return value (pop leaves the flags intact)
  ;jz shutdown
  jmp accept            ; loop forever if preceding line is commented out

shutdown:
  ; Close server socket
  mov rax, 3
  mov rdi, rbx
  syscall
  push rax
  test rax, rax
  js exit_error

  ; Exit normally
  mov rax, 60           ; sys_exit
  xor rdi, rdi          ; return code 0
  syscall

exit_error:
  mov rax, 1            ; sys_write
  mov rdi, 1            ; file descriptor (1 is stdout)
  mov rsi, error
  mov rdx, error_len
  syscall

  mov rax, 60           ; sys_exit
  pop rdi               ; stored error code
  syscall

Thank goodness for high-level languages, eh?
You can assemble and link just like you did from the first tutorial:
nasm -f elf64 -o network_example.o network_example.asm
ld -o network_example network_example.o

You can then execute the program and test it with telnet the same way you did with the C version. The functionality should be very similar.

Dissecting the Beast

NASM allows programmers to use structs, so we take advantage of this for better data organization. Just like in the C program, a sockaddr_in structure is defined. This is essentially a template which holds various data members. For review, the BSS section contains memory set aside for variable data during runtime. This makes sense considering it is not known what our connecting client source addresses and ports will be. And since we know what address and port to use on the server side, the information can be set in the data section as literals. I also touched on data types some in the first tutorial. The table below contains the types used in this program along with their sizes and examples.

Type    | Size                                             | Example
resb/db | 1 byte (8 bits)                                  | A keyboard character such as the letter 'c'
resw/dw | 2 bytes (16 bits) -- also called a "word"        | A network port with a maximum value of 65,535
resd/dd | 4 bytes (32 bits) -- also called a "double word" | An IPv4 address, such as the loopback address used in this program
resq/dq | 8 bytes (64 bits) -- also called a "quad word"   | A "long long" in C/C++ or a double-precision floating-point number (double)

An "octa word" (128 bits) is also worth mentioning, but is not used in this program. These are used for scientific calculations, graphics, IPv6 addresses, globally unique IDs (GUIDs), etc. The dX variety are initialized and the 'd' stands for "data". So, db is "data byte" and dw is "data word". The resX assortment is used for reserving space for uninitialized data. resb would be "reserve byte" and resq is "reserve quad" for example. The "at" macro gets at each field and sets it with the specified data. "struc" and "endstruc" define a structure. "istruc" and "iend" declare an instance of a structure. You can see in the code how to refer to an instance by using a label (peeraddr for example.)

In the text section (code), you should be able to get an idea of what is going on with the comments. The format is the same as the program from the first tutorial. It is all a matter of putting bullets (data) in certain positions (registers) of a revolver and then pulling the trigger with syscall. That is an analogy I like to use anyway. Again, you can refer to Ryan A. Chapman's 64-bit Linux system call table for reference. sys_bind, sys_listen, sys_accept, and other calls are all present there.

Ten Little Endians

Endianness (name originates from Gulliver's Travels) refers to the way data is arranged in memory in the context of hardware architectures. I bring this up because we needed to call htons() (short data from host to network order) in our C program on the network port. We also needed to convert the loopback IP address and TCP port number to network byte order in the assembly program.

x86/x86-64 are considered little-endian architectures whereas SPARC is big endian. Some processors, such as PPC, can handle both modes and are referred to as bi-endian. What does this mean exactly? Well, on little-endian machines, the most-significant byte (MSB) of a multi-byte value is stored at the highest memory address. The least-significant byte (LSB) is stored at the lowest address. Big endian is the reverse of this. An example would be storing the four bytes that make up the word "BEEF" as a single 32-bit value. Using the ASCII values for each letter in hexadecimal: 'B' is 0x42, 'E' is 0x45, and 'F' is 0x46, which gives 0x42454546. On a big-endian system, the arrangement of bytes in memory (from lowest address to highest) would appear as: 42 45 45 46. However, on a little-endian system, they would appear as: 46 45 45 42. Obviously, debugging is easier on a big-endian system since data is still easily readable by humans. Meanwhile, little endian has the advantage that the LSB always sits at the lowest address, so a program can determine if a number is even or odd by looking at its first byte regardless of the number's width.
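The two layouts are easy to reproduce in a higher-level language. The sketch below uses Java's standard ByteBuffer and ByteOrder classes to print the "BEEF" value in both byte orders; the class name EndianDemo and the helper layout() are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
  // Return the bytes of 'value', lowest address first, for the given byte order.
  static String layout(int value, ByteOrder order) {
    byte[] bytes = ByteBuffer.allocate(4).order(order).putInt(value).array();
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02X", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    int beef = 0x42454546; // ASCII 'B' 'E' 'E' 'F' packed into one 32-bit value
    System.out.println("big endian:    " + layout(beef, ByteOrder.BIG_ENDIAN));    // 42 45 45 46
    System.out.println("little endian: " + layout(beef, ByteOrder.LITTLE_ENDIAN)); // 46 45 45 42
    System.out.println("this machine:  " + ByteOrder.nativeOrder());
  }
}
```

The last line reports the byte order of the hardware the JVM is running on, which should be LITTLE_ENDIAN on x86/x86-64.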

Due to these differences, the need for a common format for data being transmitted over a network was clear. Big endian or network byte order was decided on for this purpose. How can we convert? The easiest method is to use a calculator in programmer mode. Windows calculator supports this mode. The TCP port number 9990 in decimal is 0x2706 in hex. Since 0x27 is the most significant part, it goes in the right-most slot. 0x06 goes on the left resulting in 0x0627. This is similar for the IP address. Each octet of 127.0.0.1 must be converted to hex. This yields 7F 00 00 01. Again, 127 or 0x7F is the most significant part, so it goes on the far right (lowest memory address.) You end up with 0x0100007F.
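If no calculator is handy, the same arithmetic can be checked with a few lines of code. This is a rough sketch rather than a replacement for htons(): the class name NetOrder and both helper methods are made up for illustration, and they assume the result will be stored by a little-endian machine as in the assembly program.

```java
class NetOrder {
  // Swap the two bytes of a 16-bit port number, which is what htons()
  // effectively does on a little-endian host.
  static int swapPort(int port) {
    return Short.reverseBytes((short) port) & 0xFFFF;
  }

  // Pack four IPv4 octets so that the first octet ends up at the lowest
  // address when the result is stored as a little-endian 32-bit value.
  static int packIpv4(int a, int b, int c, int d) {
    return (d << 24) | (c << 16) | (b << 8) | a;
  }

  public static void main(String[] args) {
    System.out.printf("port 9990 = 0x%04X, swapped = 0x%04X%n", 9990, swapPort(9990));
    System.out.printf("127.0.0.1 = 0x%08X%n", packIpv4(127, 0, 0, 1));
  }
}
```

The two printed constants match the dw 0x0627 and dd 0x0100007F literals used in the data section of the assembly program.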

A Closer Look

You can use gdb or valgrind to debug of course, but this section is more about tracing program execution to demonstrate what is going on from an OS perspective with system calls. If you have strace installed, issue:
strace -f ./network_example

You can actually see each system call from the assembly program and what arguments populate each function such as source port for peer address. Try connecting with telnet with the trace still running and you can see write() and close() being called. Have a look:
[lloyd@lindev ~]$ strace -f ./network_example
execve("./network_example", ["./network_example"], [/* 26 vars */]) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(9990), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(3, 10)                           = 0
write(1, "Waiting for connections...\n", 27Waiting for connections...
) = 27
accept(3, {sa_family=AF_INET, sin_port=htons(47944), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
write(4, "Greetings!\n", 11)            = 11
close(4)                                = 0
accept(3, {sa_family=AF_INET, sin_port=htons(47946), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
write(4, "Greetings!\n", 11)            = 11
close(4)                                = 0
accept(3, ^CProcess 27238 detached

You can see from above that the server is assigned a file descriptor of 3 and the client is 4. 11 is the length of the greeting sent to the client. sin_port and sin_addr from accept() contain the connecting client's source IP address and port. Pretty slick, huh?

Compacting A Compact Program

As you can see, the size difference between the assembly program and the C program is significant. The functionally-equivalent C program is over 4 times as large:
[lloyd@lindev ~]$ ls -lah network_example_*
-rwxr-xr-x. 1 lloyd linux_users 2.1K Dec 10 03:51 network_example_asm
-rwxr-xr-x. 1 lloyd linux_users 8.9K Dec 10 04:06 network_example_c

Let's see if we can squeeze both of these binaries a bit more...
[lloyd@lindev ~]$ strip -s network_example_*
[lloyd@lindev ~]$ ls -lah network_example_*
-rwxr-xr-x. 1 lloyd linux_users  888 Dec 10 04:28 network_example_asm
-rwxr-xr-x. 1 lloyd linux_users 6.2K Dec 10 04:28 network_example_c

Even after shaving off symbols from both binaries, the C program is now over 6 times larger than the assembly program. The assembly program isn't even 1K. This is a testament to assembly's efficiency. Yay for assembly!


I apologize for the delay between the first tutorial and this one. Better late than never, right? I hope people still find this information useful. If you have any questions or feedback, please drop me a line in the comments and I would be happy to reply.

Tuesday, September 2, 2014

JVM Memory Tuning and Analysis

The Java Virtual Machine (JVM) can be tuned to meet your program's memory needs. I will be discussing the various memory options along with how you can monitor the status of your Java programs.

Two Mounds

Let's start by looking at the various types of memory that Java uses. As you may already know, Java manages memory for you automatically (much like C# and Ruby do) via a process called garbage collection. There are two areas of memory in Java just like in C and C++. These areas are referred to as the stack and the heap. The stack is where local primitive variables, local object references (just the memory address/location of object data and not the actual object contents itself), and sometimes objects go (thanks to a JIT compiler optimization called escape analysis which is available as of Java 6.) Also note that method parameters are considered local to the method where they are declared. When a code block or method goes out of scope, the local variables go away. Any local reference variables pointing to objects are also gone. When there are no more references pointing to a particular object, the object becomes eligible for garbage collection and, once collected, that location can be used again by another object. Since Java emphasizes concurrent programming, it is also worth mentioning that each thread has its own stack.

The heap typically contains the bulk of data for a Java program. Objects and their static and instance members reside on the heap. Have a look at the program below to help illustrate this information.

class Example
{
  public static int total = 0; // heap
  protected int a;             // heap
  protected int b;             // heap

  public static String sumOf(int a, int b) // primitives 'a' and 'b' live on the stack
  {
    String c = "The sum is: " + (a + b); // reference 'c' lives on the stack
                                         // but, "The sum is: ..." lives on the heap
    total += a + b;
    return c;
  }

  public static void main(String[] args)
  {
    Example example = new Example();

    example.a = 5;  // example reference is on the stack and value '5' is on the heap
    example.b = 10; // example reference is on the stack and value '10' is on the heap

    System.out.println(sumOf(example.a, example.b));
  }
}

Remember though that objects _may_ be allocated on the stack at times depending on what the JIT compiler decides at runtime.

Analyzing Java Processes

You can get an idea of how much total memory (stack + heap) your program is consuming using Task Manager, ps, /proc, pmap, top, Activity Monitor, and the like. The Java Development Kit (JDK) actually contains some useful utilities for looking at running Java programs. These utilities live under the $JAVA_HOME/bin directory. First off, there's jps. This command displays running Java processes. It provides the PID (process ID) and name of the program. Below is an example of its output.

ldilley@develop:~$ jps
9515 Jps
9464 LinPot
ldilley@develop:~$ ps -ef | grep -v grep | grep 9464
ldilley   9464  9217  0 03:46 pts/0    00:00:00 java me.dilley.linpot.LinPot

Now that we have the PID of a running Java process, let's investigate it some more using the jmap utility. To use jmap, grab the PID of a running Java process you want to attach to. You can easily find running Java processes with jps as shown above.

ldilley@develop:~$ jmap -heap 9464
Attaching to process ID 9464, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.65-b04

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 698351616 (666.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 174063616 (166.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 8912896 (8.5MB)
   used     = 1434848 (1.368377685546875MB)
   free     = 7478048 (7.131622314453125MB)
   16.098561006433822% used
From Space:
   capacity = 1048576 (1.0MB)
   used     = 0 (0.0MB)
   free     = 1048576 (1.0MB)
   0.0% used
To Space:
   capacity = 1048576 (1.0MB)
   used     = 0 (0.0MB)
   free     = 1048576 (1.0MB)
   0.0% used
PS Old Generation
   capacity = 21495808 (20.5MB)
   used     = 0 (0.0MB)
   free     = 21495808 (20.5MB)
   0.0% used
PS Perm Generation
   capacity = 22020096 (21.0MB)
   used     = 4385528 (4.182365417480469MB)
   free     = 17634568 (16.81763458251953MB)
   19.916025797526043% used

1131 interned Strings occupying 76888 bytes.

jmap is handy since it provides a detailed view of the heap. Don't worry if this output looks unintelligible. We shall decipher it together. We'll skip the thread-local object allocation setting since it is outside the scope of this post. It is enabled by default and can be toggled with the -XX:+UseTLAB and -XX:-UseTLAB options. You can read more about it here if you are interested.

Taking Out the Trash

Next comes the garbage collection method. My Java 7 JVM is using parallel garbage collection by default. This method uses multiple threads to perform collection. It can be explicitly enabled with the -XX:+UseParallelGC option. The JVM also supports "Mark Sweep Compact Garbage Collection" which is a serial collection method recommended for systems with a single processor or for applications with small amounts of data. -XX:+UseSerialGC is used to enable it. Lastly, -XX:+UseConcMarkSweepGC enables the concurrent collector. This collection method aims to improve application responsiveness by minimizing pausing. If concurrent collection incurs significant pausing, you can enable incremental mode with -Xincgc which should shorten the pauses.

To summarize:

Serial Garbage Collection
  • Enabled with -XX:+UseSerialGC
  • Also called "Mark Sweep Compact Garbage Collection"
  • Useful on systems with a single processor
  • Useful when object data does not exceed 100MB
Parallel Garbage Collection
  • Enabled with -XX:+UseParallelGC
  • Is similar to the serial method, but uses multiple threads for collection
  • Useful on systems with multiple processors
  • Number of collector threads can be controlled with -XX:ParallelGCThreads=<number>
  • The number of collector threads should be equal to the number of system cores
Concurrent Garbage Collection
  • Enabled with -XX:+UseConcMarkSweepGC
  • Useful on systems with multiple processors
  • Useful for low-latency applications such as 3D games
  • Can be told to collect incrementally with -Xincgc which lessens pausing that may occur during collection
  • -Xincgc is the equivalent of specifying -XX:+UseConcMarkSweepGC and -XX:+CMSIncrementalMode
  • Is a good default to use
You can monitor garbage collection statistics to see which algorithm works best for your program by enabling the following JVM options:
  1. -verbose:gc
  2. -XX:+PrintGCDetails
  3. -XX:+PrintGCTimeStamps
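The same kind of statistics can also be read from inside a running program through the standard java.lang.management API. The sketch below is only an illustration (the class name GcStats and the method summary() are made up); the collector names it prints depend on the JVM and on which collection algorithm is enabled.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

class GcStats {
  // Summarize each collector the running JVM exposes: name, run count, total time.
  static String summary() {
    StringBuilder sb = new StringBuilder();
    List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
    for (GarbageCollectorMXBean gc : beans) {
      sb.append(gc.getName())
        .append(": collections=").append(gc.getCollectionCount())
        .append(", time=").append(gc.getCollectionTime()).append("ms\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Create some short-lived garbage so the collectors have work to do.
    for (int i = 0; i < 100000; i++) {
      byte[] junk = new byte[1024];
      junk[0] = (byte) i;
    }
    System.out.print(summary());
  }
}
```

Running it once with each of the collector options above gives a quick way to compare how often each collector ran and how much time it spent.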
I'd also like to bring up compaction. Just like filesystems on a hard disk can become fragmented, so too can the heap. You may already know that some data structures such as arrays are allocated in memory sequentially. Let's say for the sake of simplicity that there is 1 simple 1-byte object on the heap followed by a sequential array of 1,000 1-kilobyte elements and then 1 other simple 1-byte object. What happens when the array is garbage collected and a new larger array needs room on the heap? It is not a very efficient use of space to have many gaps. Compaction mitigates this issue by squeezing data closer together.

Digging Through the Heap

As if things weren't complicated enough, Java breaks up the heap into other segments where objects reside. These segments are called generations. Different implementations of the JVM may use different terms to describe these areas. However, they are all generational. This means that objects essentially age. When an object is created, it is spawned in the young/new/Eden space. One exception is that if the object is rather large, it will go directly to the tenured/old generation which typically is larger in size. If all references to the object go away, it is marked for collection. If the object survives a collection, it grows older and is moved to a survivor space. And as you guessed, if an object is geriatric, it will live in the retirement housing area known as the tenured or old generation space. "PS" in this context means "parallel scavenge" and simply reflects the garbage collection algorithm employed. If you change from parallel collection to serial or concurrent, the number of areas and names may change.

I have not yet touched on the permanent generation area, but I will do so in the summary of heap areas below.

Eden (or New Space)
  • Part of the young generation
  • Where new objects are freshly allocated (unless they are too large -- then they go directly to the tenured/old generation)
  • Used for quick allocation
Survivor (or From Space and To Space)
  • Also part of the young generation
  • Objects that survive Eden/New space collection become survivors
  • The "From" and "To" space roles are swapped once one area has been collected and cleared (one is usually empty)
Tenured (or Old Generation)
  • Objects that make it past the survivor phase retire to the tenured/old space to live out their days (or minutes/hours)
  • This area is not as frequently collected as the Eden/New space
  • Objects that are deemed too large skip the young generation altogether and go here
Perm
  • The permanent generation
  • Loaded class information lives here such as reflective information about classes (metadata) and interned strings (prior to Java 7, which moved interned strings to the main heap)
The bit about interned strings provides details of how many string objects have been added to the string intern pool. This is an optimization that stores unique instances of string literals thus reducing memory usage.
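Interning is easy to observe from code. The sketch below (the class name InternDemo is made up for this post) compares object identity rather than contents: a string literal is interned automatically, a string built with new is a separate heap object, and calling intern() on it hands back the pooled instance.

```java
class InternDemo {
    // True when both references point at the very same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "minecraft";           // literals are interned automatically
        String copy = new String("minecraft");  // forces a distinct object on the heap
        System.out.println(sameObject(literal, copy));           // false
        System.out.println(sameObject(literal, copy.intern()));  // true
        System.out.println(literal.equals(copy));                // true
    }
}
```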

Bending the JVM to Your Will

You might be wondering why we did not cover the heap configuration section of the jmap output yet. You may also be pondering when and why you would want to change the way the JVM handles memory and how. Wonder and ponder no more. Read on for the respective JVM options and the situations in which you would modify them.

-Xss<number>[M/G]
Sets the thread stack size. Even if you have not created threads explicitly, your program still runs in a main thread. If you have a deep call stack or are making heavy use of recursion, you may want to increase this value. You'll know when you start receiving StackOverflowError messages.
Example which sets the stack size to 2 megabytes: java -Xss2M someClass
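A quick way to get a feel for the stack size is to recurse until the stack gives out and count the frames. This is only a rough sketch (the class StackDepth is hypothetical, and the count varies with the JVM, the frame size, and JIT behavior), but running it with different -Xss values should show the number moving.

```java
class StackDepth {
    private static int depth;

    private static void recurse() {
        depth++;
        recurse();
    }

    // Recurses until the thread stack is exhausted and reports how deep it got.
    static int measure() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // Expected: the stack ran out of room.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("Frames before StackOverflowError: " + measure());
    }
}
```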

-Xms<number>[M/G] and -Xmx<number>[M/G]
Sets the initial (minimum) and maximum heap size. If you are loading a significant number of objects into memory and see "java.lang.OutOfMemoryError: Java heap space", you'll want to increase these values.
Example which sets the initial heap to 512M and maximum heap size to 4G:
java -Xms512M -Xmx4G someClass
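To confirm that the options took effect, the program can report its own limits through java.lang.Runtime. A minimal sketch, with HeapLimits being a name I picked for illustration:

```java
class HeapLimits {
    // Converts a byte count to whole mebibytes.
    static long mib(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max   (ceiling set by -Xmx): " + mib(rt.maxMemory()) + " MiB");
        System.out.println("total (currently reserved):  " + mib(rt.totalMemory()) + " MiB");
        System.out.println("free  (unused part of total): " + mib(rt.freeMemory()) + " MiB");
    }
}
```

Try it with and without -Xms512M -Xmx4G and compare the first two lines.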

-Xmn<number>[M/G]
Sets the total size of the young generation. ⅓ the size of -Xmx is usually a decent value. If you have frequent allocation of a large number of objects, you should consider increasing this value.
Example: java -Xmx3G -Xmn1G someClass

-XX:PermSize=<number>[M/G] and -XX:MaxPermSize=<number>[M/G]
Sets the initial and maximum permanent generation sizes. If you are loading a large number of classes, you may need to bump these values up. If you see "java.lang.OutOfMemoryError: PermGen space", it is an indication that you need to raise at least MaxPermSize. These options apply to Java 7 and earlier; Java 8 replaced the permanent generation with Metaspace and ignores them.
Example: java -XX:PermSize=64M -XX:MaxPermSize=256M someClass

-XX:NewSize=<number>[M/G] and -XX:MaxNewSize=<number>[M/G]
Sets the initial and maximum sizes for the young generation. Again, MaxNewSize should be ⅓ the size of -Xmx. By default, MaxNewSize is extremely large. This is where new objects go.
Example: java -XX:NewSize=512M -XX:MaxNewSize=1G someClass

-XX:OldSize=<number>[M/G]
Explicitly sets the old generation size.
Example: java -XX:OldSize=1G someClass

-XX:NewRatio=<number>
Sets the ratio of space allocated to the young and old generations. For example, setting this option to a value of 2 would mean that the maximum old generation space would be twice as large as the maximum young generation space. So, if MaxNewSize was 1G, the old generation space (OldSize) would be 2G.
Example of the old generation set to three times the size of the young generation:
java -XX:NewRatio=3 someClass
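The arithmetic behind NewRatio can be sketched in a few lines. This is a simplification under my own assumptions: it ignores alignment and any rounding the JVM performs, and the class and method names are hypothetical.

```java
class NewRatioMath {
    // With -XX:NewRatio=r the old generation is r times the young generation,
    // so the young generation gets 1 / (r + 1) of the heap.
    static long youngBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    static long oldBytes(long heapBytes, int newRatio) {
        return heapBytes - youngBytes(heapBytes, newRatio);
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024; // pretend -Xmx4G
        long mib = 1024 * 1024;
        System.out.println("young: " + youngBytes(heap, 3) / mib + " MiB"); // young: 1024 MiB
        System.out.println("old:   " + oldBytes(heap, 3) / mib + " MiB");   // old:   3072 MiB
    }
}
```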

-XX:MinHeapFreeRatio=<number> and -XX:MaxHeapFreeRatio=<number>
Number is a value from 0 to 100 which represents a percentage. When less than MinHeapFreeRatio percent of the heap is free after a collection, the heap grows, up to the -Xmx value. When more than MaxHeapFreeRatio percent of the heap is free, the heap is shrunk, down to the -Xms value. In other words, if MaxHeapFreeRatio is 90, the heap shrinks once occupancy falls below 10%. When -Xms == -Xmx, neither option has any effect. Many recommend setting MinHeapFreeRatio to 40 and MaxHeapFreeRatio to 70.
Example: java -XX:MinHeapFreeRatio=40 -XX:MaxHeapFreeRatio=70 someClass

-XX:SurvivorRatio=<number>
This option sets the ratio between the From/To spaces (which are always equal in size) and Eden. So, if SurvivorRatio is equal to 5, Eden is 5 times the size of each From/To space. The total parts would be 5 + 1 (From) + 1 (To) = 7.
Example: java -XX:SurvivorRatio=10 someClass
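The "parts" math for SurvivorRatio works out as follows. Again, this is a back-of-the-envelope sketch with hypothetical names; the JVM rounds and aligns the real sizes.

```java
class SurvivorRatioMath {
    // Eden : From : To = ratio : 1 : 1, so the young generation has ratio + 2 parts.
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces are carved out.
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mib = 1024 * 1024;
        long young = 700 * mib; // pretend the young generation is 700 MiB
        System.out.println("each survivor space: " + survivorBytes(young, 5) / mib + " MiB"); // 100 MiB
        System.out.println("eden: " + edenBytes(young, 5) / mib + " MiB");                    // 500 MiB
    }
}
```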


I hope this journey has been beneficial to you. Feel free to leave any feedback or questions. Happy JVM tweaking!

Saturday, April 19, 2014

Introducing MineStat - A Minecraft Server Status Checker

My daughter and I have been playing Minecraft for the past year or so. After configuring a server of our own, I wanted a way to check its status. I initially wrote a Ruby class thinking our website would be a Rails application. I decided to port the code to Java and use JSP instead since I already had a Tomcat server running. I cleaned up the code this evening and figured I would release the source in case it could be of use to anyone else.

The project can be found on GitHub. You can use the software to whip up a monitoring script that periodically polls multiple Minecraft servers or allow visitors of your website to see the status of your Minecraft server. Examples for C#, Java, PHP, and Ruby are below for demonstration purposes.

C# Example:

using System;

class Example
{
  public static void Main()
  {
    MineStat ms = new MineStat("", 25565);
    Console.WriteLine("Minecraft server status of {0} on port {1}:", ms.GetAddress(), ms.GetPort());
    if(ms.IsServerUp()) // assumes MineStat exposes IsServerUp()
    {
      Console.WriteLine("Server is online running version {0} with {1} out of {2} players.", ms.GetVersion(), ms.GetCurrentPlayers(), ms.GetMaximumPlayers());
      Console.WriteLine("Message of the day: {0}", ms.GetMotd());
    }
    else
      Console.WriteLine("Server is offline!");
  }
}

Java Example:

import org.devux.MineStat;

class Example
{
  public static void main(String[] args)
  {
    MineStat ms = new MineStat("", 25565);
    System.out.println("Minecraft server status of " + ms.getAddress() + " on port " + ms.getPort() + ":");
    if(ms.isServerUp()) // assumes MineStat exposes isServerUp()
    {
      System.out.println("Server is online running version " + ms.getVersion() + " with " + ms.getCurrentPlayers() + " out of " + ms.getMaximumPlayers() + " players.");
      System.out.println("Message of the day: " + ms.getMotd());
    }
    else
      System.out.println("Server is offline!");
  }
}

PHP Example:


<?php
require_once('minestat.php'); // assumed file name, mirroring the Ruby example
$ms = new MineStat("", 25565);
printf("Minecraft server status of %s on port %s:<br>", $ms->get_address(), $ms->get_port());
if($ms->is_online()) // assumes MineStat exposes is_online()
{
  printf("Server is online running version %s with %s out of %s players.<br>", $ms->get_version(), $ms->get_current_players(), $ms->get_max_players());
  printf("Message of the day: %s<br>", $ms->get_motd());
}
else
  printf("Server is offline!<br>");

Ruby Example:

require_relative 'minestat'

ms = MineStat.new("", 25565)
puts "Minecraft server status of #{ms.address} on port #{ms.port}:"
if ms.online # assumes MineStat exposes an online attribute
  puts "Server is online running version #{ms.version} with #{ms.current_players} out of #{ms.max_players} players."
  puts "Message of the day: #{ms.motd}"
else
  puts "Server is offline!"
end


Saturday, April 12, 2014

64-bit Linux Assembly Tutorial Part I


After finally getting around to playing with 64-bit assembly recently, I wanted to share the knowledge since, surprisingly, there do not seem to be many tutorials out there. The available information is mostly scattered between AMD64 programming manuals, assembler documentation, code snippets on the Web, and system call tables or Linux source code. I figured that I would consolidate all of that information here to make things easier. If you just want to see the code, feel free to skip ahead.

Useful Links

I recommend reading through the three AMD programming manuals below located at the AMD Developer Guides & Manuals site if you're serious about getting into this stuff:

  1. AMD64 Architecture Programmer's Manual Volume 2: System Programming
  2. AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions
  3. AMD64 Architecture Programmer's Manual Volume 1: Application Programming

These will provide you with a lot of in-depth information about the hardware architecture.

Documentation for NASM can be found here.

I also recommend Jeff Duntemann's Assembly Language Step by Step book. You can find it on Amazon here.

Finally, Mr. Ryan A. Chapman's x86_64 Linux System Call Table is a good reference.

Getting Started

Before making nifty assembly programs, you will need an assembler. An assembler is a program that assembles or translates your source code into machine code. We will use the Netwide Assembler (NASM) for this purpose. You can install nasm via apt-get or yum.

You will also need a text editor so you can write your code. Any editor will suffice so long as it saves files in plain text format without special formatting. In other words: do not use something like OpenOffice Writer. Use nano or vim instead. ;)

Why Learn Assembly?

I know what some of you are probably thinking... "Isn't assembly too old to be useful, dude?" On the contrary, my friend. There are a number of reasons to learn assembly. I've listed some fine examples below.
  • Learn how things work on a low level (work with the CPU directly and manage stacks)
  • Analyze malware
  • Write efficient programs for embedded systems
  • Write boot loaders, operating system kernels, and device drivers
  • Reverse engineer software to patch it or re-create an application in a higher-level language
  • Write performant code for parts of a program that require speed
  • It's fun!
Pretty compelling, huh? Not so fast though...

What Can't You Do?

You must realize a few things about assembly before moving forward. Due to calling conventions and variances with system calls between operating systems, you cannot assemble the same program you wrote for a 64-bit Linux system and expect it to execute on a machine running Windows. FreeBSD and other operating systems actually have the same calling convention as Linux in long mode thankfully. However, it still can take a considerable amount of work to port a program due to the variation of syscalls.

This is even truer when it comes to other hardware architectures such as ARM, PPC, and SPARC which have their own registers and instruction sets. As a result, you cannot take your shiny new assembly program and run it on your Android phone since these devices do not make use of AMD64 or EM64T (Intel's implementation of AMD64) processors.

If it is portability you crave, your princess is in another castle. Use a higher-level language such as C, Java, or Ruby. Another point to note is that most modern compilers do a darn good job of optimizing code. The optimization may be so good that it renders any assembly code your wetware could whip up futile. So don't try it, human.

Things to Consider

There are actually two dialects of the assembly language for the x86 processor. These are referred to as AT&T and Intel syntax. There are several popular assemblers. Aside from NASM, there's also GNU's GAS, Microsoft's MASM, and Borland's TASM to name a few. NASM uses the Intel vernacular, as do MASM and TASM, while GNU's assembler initially used and still primarily uses AT&T syntax. This is not surprising considering Linux's history with Unix and AT&T. It is worth mentioning that GAS can optionally handle Intel syntax now.

Some of the differences between both dialects are in the table below.

Feature | AT&T | Intel
Required prefixes | $ for immediate values and % for registers | None
Operand order | source, destination (similar to Linux/Unix commands): mov $16, %rax | destination, source: mov rax, 16
Addressing memory | () | []

Do not worry if you fail to understand some of this stuff. We'll touch on mov, rax, and friends shortly.


Registers

You can think of registers as little cubby holes within the CPU. If you've ever taken a course in college on C++, your professor may have provided a similar analogy for addressing memory. Unfortunately, there are nowhere near as many registers as there are memory addresses in modern computers. AMD64 has increased the number of registers though, so we'll take what we can get. Registers provide the fastest form of access that a computer has. Let me go ahead and introduce you to the family.

Registers | Size
AL, AH, BL, BH, CL, CH, DL, DH, R8B, R9B, R10B, R11B, R12B, R13B, R14B, R15B | 8 bits or 1 byte
AX, BX, CX, DX, R8W, R9W, R10W, R11W, R12W, R13W, R14W, R15W, BP, SP, DI, SI, CS, DS, ES, FS, GS, SS, IP, FLAGS | 16 bits or 2 bytes
EAX, EBX, ECX, EDX, R8D, R9D, R10D, R11D, R12D, R13D, R14D, R15D, EBP, ESP, EDI, ESI, EIP, EFLAGS | 32 bits or 4 bytes
RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15, RBP, RSP, RDI, RSI, RIP, RFLAGS | 64 bits or 8 bytes

That was a lot to digest. To simplify matters, the majority of the smaller registers above are actually just the same register trimmed down. For example, R8B, R8W, and R8D are just smaller portions of the 64-bit register R8. The actual registers that exist on an x86-64 CPU (excluding floating point/MMX/3DNow!/SSE/AVX) are: RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15, RBP, RSP, RDI, RSI, RIP, RFLAGS, CS, DS, ES, FS, GS, and SS. The segment registers (CS, DS, ES, FS, GS, and SS) were never extended beyond 16 bits and exist mostly for backward compatibility. Furthermore, the lower 8 bits of BP, SP, DI, and SI can be accessed with REX prefixing as BPL, SPL, DIL, and SIL. However, I won't be covering that topic in this tutorial.

The 'L' and 'H' in AL and AH represent the "low" and "high" bytes of register AX. AX, BX, CX, and DX all have these. When the x86-64 architecture was created, 8 more general purpose registers (GPRs) were added. They are R8-R15. These can also store more compact values by appending a 'B' for byte, 'W' for word, or 'D' for double (double word). There is also a name for a 64-bit value: the quadword. This means 16 (word length) * 4 (quad) = 64 bits.
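If the idea of one register wearing several names is still fuzzy, it may help to model it with plain bit masks. The sketch below is Java rather than assembly, and the class and method names are made up; it treats a long as RAX and carves out the pieces that EAX, AX, AH, and AL would expose.

```java
class SubRegisters {
    static long eax(long rax) { return rax & 0xFFFFFFFFL; }    // low 32 bits
    static long ax(long rax)  { return rax & 0xFFFFL; }        // low 16 bits
    static long al(long rax)  { return rax & 0xFFL; }          // low byte of AX
    static long ah(long rax)  { return (rax >>> 8) & 0xFFL; }  // high byte of AX

    public static void main(String[] args) {
        long rax = 0x1122334455667788L;
        System.out.printf("RAX = %016X%n", rax);      // RAX = 1122334455667788
        System.out.printf("EAX = %08X%n", eax(rax));  // EAX = 55667788
        System.out.printf("AX  = %04X%n", ax(rax));   // AX  = 7788
        System.out.printf("AH  = %02X%n", ah(rax));   // AH  = 77
        System.out.printf("AL  = %02X%n", al(rax));   // AL  = 88
    }
}
```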

There are rules and caveats to using these registers which I'll discuss in a section below. Just try and remember RAX, RBX, RCX, RDX, R8-R15, RBP, RSP, RDI, and RSI. You'll be using those the most. I am not going to cover the MMX/3DNow!, SSE, and AVX registers and their mnemonics since they are outside the scope of this tutorial (as if all the registers above were not enough!)

Instruction Set

Now that you are familiar with the registers, let's move on to the AMD64 instruction set. Like the IA32 instruction set before it, the 64-bit instructions consist of mnemonics which correspond to opcodes or low-level instructions. These mnemonics enable us to use friendly symbols to call instructions as opposed to referencing difficult-to-remember numeric IDs. I have listed below some of the many instructions that we will be using for this tutorial, each with a description and an example. I am also including other common instructions to pique your interest in the hopes that you will play around with them on your own.

ADD
Adds two values and stores the result in the destination (first operand). The sum below will be '8' and it will be stored in RAX.
Example:
  mov rax, 3
  mov rbx, 5
  add rax, rbx

CALL
Pushes the return address onto the stack and performs an unconditional jump to a label in the code. This is not to be confused with JMP (which does not save the return address). This instruction is paired with RET.
Example:
  call foo_label

DEC
Decrements a value. We initially copy a value of '5' into RAX below. The first DEC instruction causes the value in RAX to change to '4'. The last DEC instruction yields '3'.
Example:
  mov rax, 5
  dec rax
  dec rax

INC
Increments a value. We initially copy a value of '1' into RAX below. The first INC instruction results in a value of '2'. The last INC operation yields '3'.
Example:
  mov rax, 1
  inc rax
  inc rax

JMP
Jumps to a location in the code unconditionally.
Example:
  jmp foo_label

JNZ, JS, JZ, and friends
Conditional jumps that depend on the state of various bits in RFLAGS. JNZ jumps if not zero, JZ jumps if zero, and JS jumps if signed for example. In this example, the jump to "foo_label" would occur since 1 - 1 = 0. The zero flag would be set in RFLAGS as a result of SUB.
Example:
  mov rax, 1
  sub rax, 1
  test rax, rax
  jz foo_label

LEA
Load effective address. This stores a memory location and is relevant to pointers from languages like C and C++. Remember that [] references a memory location in NASM.
Example:
  lea rax, [foo]

MOV
Copies a value into a register. The operation below causes RAX to contain '32'.
Example:
  mov rax, 32

POP
Removes the top value from the stack. In the example below, '5' is pushed onto the stack and then popped off and stored in RBX.
Example:
  mov rax, 5
  push rax
  pop rbx

PUSH
Pushes a value onto the stack. '64' is placed onto the stack after the operation below completes.
Example:
  mov rax, 64
  push rax

RET
Returns to the instruction following the CALL that got us here. This is paired with CALL.
Example:
  call foo_label
  mov r10, 1       ; execution resumes here after RET
  ; ...
  foo_label:
  mov rax, 1
  ret

SUB
Subtract. It functions like ADD, only in reverse. The result is stored in the destination (first operand). The difference will be '4' in the example below.
Example:
  mov rax, 8
  mov rbx, 4
  sub rax, rbx

SYSCALL
Call a system procedure in long mode. This instruction replaces the int 0x80 instruction which interrupted the kernel in 32-bit Linux.
Example:
  mov rax, some_call
  syscall

TEST
Tests values and sets bits in the RFLAGS register accordingly. This instruction is frequently used with conditional jumps. In the example below, the value of RAX is tested. If the sign flag in RFLAGS is set, then jump to another location in the code.
Example:
  test rax, rax
  js foo_label

XOR
Perform an exclusive or. This is often useful for resetting a register to zero and is more efficient than MOV REG, 0.
Example:
  xor rax, rax

Hopefully that was not too much to absorb. There are a plethora of other instructions such as CMP, DIV/IDIV, and MUL/IMUL, but I would like to keep this tutorial limited.

Memory Models and Operating Modes

I would like to briefly touch on memory models and operating modes just so you are in the know. If you do not want the extra information, feel free to skip ahead as this is mostly historical in nature. There are several memory models that exist. These various models handle the way that memory is addressed. I am sure many of you remember in recent history when 32-bit operating systems were limited to around 4G of RAM. There was something called PAE or Physical Address Extension that enabled you to address >4G of memory. This is actually one model called the paged memory model.

There is also the similar segmented model where memory is referred to by a segment and offset. This allowed older x86 architectures like the 8086 to access larger areas of memory (more than the 64K a single 16-bit register can address) much like how PAE increases memory accessibility. Using segment and offset is more tedious than simply using a single real address.
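For the curious, here is how a segment and offset pair turns into a single address on the 8086 in real mode: the segment is shifted left by four bits (multiplied by 16) and the offset is added. The sketch is Java purely for illustration, and RealModeAddress is a hypothetical name.

```java
class RealModeAddress {
    // 8086 real mode: physical = segment * 16 + offset.
    static int physical(int segment, int offset) {
        return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF);
    }

    public static void main(String[] args) {
        System.out.printf("1234:0010 -> %05X%n", physical(0x1234, 0x0010)); // 1234:0010 -> 12350
        System.out.printf("B800:0000 -> %05X%n", physical(0xB800, 0x0000)); // B800:0000 -> B8000
    }
}
```

Notice that many different segment and offset pairs land on the same physical address, which is part of what makes this model tedious.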

That brings us to the flat model. With this model, you simply refer to an address in its entirety. This is less of a hassle than dealing with memory segmentation. Fortunately, 64-bit Linux operates in long mode (which increases the address width for a ton of memory) and uses the flat addressing method. There is no need for memory segmentation (yet) with such a large address space. The masses are happy for now...

And finally, aside from long mode, there are the three legacy modes that x86-64 supports: protected, real, and virtual. Without going into too much detail, protected is the mode that 32-bit Linux, FreeBSD, and Windows ran in. Real mode is what DOS operated in and virtual is a mode that allowed real mode programs to execute in a protected environment.

Calling Convention

This section is important, so pay attention! If you are familiar with C, C++, Java, Ruby, and others, you will know that functions/methods take parameters in a certain order.

void do_something(int foo, float bar, char *baz)
{
  /* ... */
}

We can see that do_something() takes 3 arguments. In order, they are an integer, a float, and a pointer to char. When programming in assembly for 64-bit Linux, you need to obey the same rules. Registers are designated to accept values in a specific order when making system calls. The calls you make look for certain data inside these aforementioned registers. If you fail to feed in the proper data, you may end up with unexpected inputs and/or outputs. Your program may also segfault.

Now don't let this scare you off, but 64-bit Linux actually has two calling conventions. Thankfully, it's easy to know when to use either. Whenever you are making system calls, you shove stuff into these registers in this particular order:

RAX = system_call_id
RDI = parameter #1
RSI = parameter #2
RDX = parameter #3
R10 = parameter #4
R8 = parameter #5
R9 = parameter #6

The return value almost always ends up in RAX. You do not have to worry about system calls accepting more than 6 parameters. Otherwise, use the following SysV ABI convention (which happens to be the same for FreeBSD, OS X, and Solaris on x64):

RDI = parameter #1
RSI = parameter #2
RDX = parameter #3
RCX = parameter #4
R8 = parameter #5
R9 = parameter #6
XMM0-7 (these are the SSE registers that accept floating-point values)

If you run out of registers to use for passing values to functions in this convention, then you will use the stack which we'll cover soon. The two conventions are similar; they differ only in the floating-point registers and in the use of RCX versus R10 for the fourth parameter. Microsoft uses a different calling convention in x64 land (fun fact!)
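To make the function-call convention concrete, here is a toy model of how arguments get handed out. It is a deliberately simplified sketch: it only knows about integer/pointer and floating-point arguments, ignores structs and variadic functions, and the class SysVArgs and its 'i'/'f' encoding are my own invention. Integer and pointer arguments take the next free register from RDI, RSI, RDX, RCX, R8, R9; floating-point arguments take the next free XMM register; anything left over goes on the stack.

```java
import java.util.ArrayList;
import java.util.List;

class SysVArgs {
    private static final String[] INT_REGS = {"RDI", "RSI", "RDX", "RCX", "R8", "R9"};
    private static final int FLOAT_REG_COUNT = 8; // XMM0-XMM7

    // kinds: 'i' for an integer or pointer argument, 'f' for a floating-point argument.
    // Anything that does not fit in a register (or is not 'i'/'f') is reported as "stack".
    static List<String> assign(String kinds) {
        List<String> where = new ArrayList<>();
        int nextInt = 0;
        int nextFloat = 0;
        for (char kind : kinds.toCharArray()) {
            if (kind == 'f' && nextFloat < FLOAT_REG_COUNT) {
                where.add("XMM" + nextFloat++);
            } else if (kind == 'i' && nextInt < INT_REGS.length) {
                where.add(INT_REGS[nextInt++]);
            } else {
                where.add("stack");
            }
        }
        return where;
    }

    public static void main(String[] args) {
        // do_something(int foo, float bar, char *baz)
        System.out.println(assign("ifi"));     // [RDI, XMM0, RSI]
        // Seven integer arguments: the seventh spills to the stack.
        System.out.println(assign("iiiiiii")); // [RDI, RSI, RDX, RCX, R8, R9, stack]
    }
}
```

Note how the float does not consume an integer register, so baz lands in RSI rather than RDX.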


Segments

Not to be confused with the segmented memory model, segments here are simply sections of a program. The three you will commonly find throughout your assembly adventures are the BSS, data, and text segments. The BSS or Block Started by Symbol (also jokingly referred to as "Better Save Space") section is an area where you specify uninitialized variables. For example, I could allocate space for a struct whose members could be filled by a callee as we will see in part II of this tutorial.

Next comes the data segment. This is the area where you define initialized data. One fine example could be a string constant such as a program name and version. Last but not least, is the text segment. This section usually makes up the bulk of the code and is where you perform instructions.

The Stack

The stack is simply a construct in memory that can hold data for us. It is common for this data structure to be likened to a stack of spring-loaded plates at a buffet. Let's say there are no plates at first. We have an empty stack. An employee comes along and begins adding plates to the stack. This is just like what the PUSH instruction would be doing. You PUSH a value onto the stack. As the employee adds more plates, the first plate that was added gets buried deeper and deeper. The same occurs as you PUSH more data onto the stack. Now, some hungry fellow comes along and removes a plate. You guessed it... POP! You can either opt to keep or throw away upper data in the stack to get at your first "plate".

Stack management is important. Many programming languages make use of a stack to hold variables local to a function. It is also a common source for trouble such as stack smashing attacks/buffer overflows. You can store whatever you'd like in it. The stack is also used when you perform the CALL instruction. The return address is pushed onto the stack and then popped off when you use RET. This happens automagically.

Let's illustrate to help solidify your understanding:

mov rax, 128
push rax     ; the stack now has '128' on it
mov rax, 256
push rax     ; now the stack has '256' at the top and '128' buried under that
pop rax      ; '256' was removed off the stack and placed into RAX
pop rax      ; RAX now contains '128' and '256' is gone forever!

It's not so bad, eh?
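If you think better in a higher-level language, the same last-in, first-out behavior can be mirrored with java.util.ArrayDeque. This is just an analogy (the class StackDemo is hypothetical); the real stack lives in memory and is tracked by RSP, not by a library class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackDemo {
    // Mirrors the PUSH/POP sequence above and returns the two popped values in order.
    static long[] run() {
        Deque<Long> stack = new ArrayDeque<>();
        long rax;
        rax = 128;
        stack.push(rax);           // the stack now has 128 on it
        rax = 256;
        stack.push(rax);           // 256 is on top, 128 is buried under it
        long first = stack.pop();  // 256 comes off first
        long second = stack.pop(); // then 128
        return new long[] {first, second};
    }

    public static void main(String[] args) {
        long[] popped = run();
        System.out.println(popped[0]); // 256
        System.out.println(popped[1]); // 128
    }
}
```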

File Formats

This is merely a morsel about file formats. There are a fair number of them in existence. a.out (old Unix), COFF (Windows PE), ELF (FreeBSD, HP-UX, Linux, and Solaris), and Mach-O (OS X) are prominent formats. We will be using ELF64 for the programs we create. This can be done with NASM using the -f (format) option: nasm -f elf64 -o <output_object_file>.o <input_source_file>.asm

Linking is also required to make the executable. We can use ld for that. The syntax is: ld -o <executable_name> <object_file>.o

Our First Program

This is the moment you have been waiting for. Look no further. Our first 64-bit assembly program for Linux is below.

[BITS 64]

section .bss
  ; nada

section .data
  prompt:          db 'What is your name? '
  prompt_length:   equ $-prompt
  name:            times 30 db 0
  name_length:     equ $-name
  greeting:        db 'Greetings, '
  greeting_length: equ $-greeting

section .text
global _start
_start:
  ; Prompt the user for their name
  mov rax, 1                ; '1' is the ID for the sys_write call
  mov rdi, 1                ; '1' is the file descriptor for stdout and our first argument
  mov rsi, prompt           ; prompt string is the second argument
  mov rdx, prompt_length    ; prompt string length is the third argument
  syscall                   ; make the call with all arguments passed

  ; Get the user's name
  mov rax, 0                ; 0 = sys_read
  mov rdi, 0                ; 0 = stdin
  mov rsi, name
  mov rdx, name_length
  syscall
  push rax                  ; store return value (size of name) on the stack... we'll need this for later

  ; Print our greeting string
  mov rax, 1                ; 1 = sys_write
  mov rdi, 1                ; 1 = stdout
  mov rsi, greeting
  mov rdx, greeting_length
  syscall

  ; Print the user's name
  mov rax, 1                ; 1 = sys_write
  mov rdi, 1                ; 1 = stdout
  mov rsi, name
  pop rdx                   ; length previously returned by sys_read and stored on the stack
  syscall

  ; Exit the program normally
  mov rax, 60               ; 60 = sys_exit
  xor rdi, rdi              ; return code 0
  syscall

All those lines to do something trivial make you appreciate writing in higher-level languages, don't they? ;)

Anyway, let's assemble and link this bad boy...

ldilley@develop:~/asm$ nasm -f elf64 -o greet.o greet.asm
ldilley@develop:~/asm$ ld -o greet greet.o
ldilley@develop:~/asm$ ./greet
What is your name? Lloyd
Greetings, Lloyd

Try it yourself!

Breaking it Down

You should understand the segments, instructions, registers, calling convention, and stack. The comments should also help, but let's discuss the stuff you haven't seen yet...

[BITS 64] at the top of our source code specifies the target processor mode. Per the NASM documentation, you do not always need this, but it certainly helps document the type of code you are writing. A semicolon allows for a single-line comment. The assembler will ignore any text that comes after it. The "section" keyword denotes the segment. Segment names begin with a period in NASM. We have .bss, .data, and .text segments in our code. Variables and constants are defined by labels. The syntax for a label should be a meaningful name followed by a colon. Labels are also used as jump points. You can JMP and CALL to them.

"db" stands for "define byte". There is also dw (define word), dd (define double(word)), and dq (define quad(word)). This specifies the storage requirements of the data. In the BSS segment, there are similar keywords to reserve data of the same types. Those keywords are: resb (reserve byte), resw (reserve word), resd (reserve double(word)), and resq (reserve quad(word)).

"equ $-" is a slick way to calculate the length of a string. In NASM, $ is the current assembly position, so $-prompt is the number of bytes emitted since the prompt label. "times 30 db 0" sets aside 30 bytes of storage space for a string and zeroes it out. A word of caution: sys_read stops at the 30-byte buffer size, but anything typed beyond that is left unread on stdin and will be interpreted by the shell once the program exits. To remedy this bug, you can either increase the buffer size (change 30 to 255 bytes for example) or sanitize the string/flush anything beyond the length. I am not including that functionality since it would increase code complexity for the reader. If you _really_ want to prevent spillage, see Gunner's NASM I/O Tutorial. The code is in 32-bit mode, but can be easily adapted by replacing the interrupt instruction with SYSCALL and modifying the registers to meet the 64-bit Linux calling convention.

Finally, "global _start" with a label of "_start:" signifies the main entry point of the program much like main() for C, C++, and Java.
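The spillage caveat for long names is easier to see in a simulation. The sketch below stands in for stdin with a ByteArrayInputStream and performs one bounded read, the way our sys_read call does with name_length. The class BoundedRead and its method are hypothetical, and a real terminal adds a trailing newline that I am leaving out.

```java
import java.io.ByteArrayInputStream;

class BoundedRead {
    // Performs one read of at most bufferSize bytes and reports how many
    // typed bytes are still waiting on the (simulated) stdin afterward.
    static int leftover(byte[] typed, int bufferSize) {
        ByteArrayInputStream stdin = new ByteArrayInputStream(typed);
        byte[] name = new byte[bufferSize];
        stdin.read(name, 0, name.length);
        return stdin.available();
    }

    public static void main(String[] args) {
        System.out.println(leftover(new byte[5], 30));  // 0  -> short name, nothing left behind
        System.out.println(leftover(new byte[40], 30)); // 10 -> these bytes would reach the shell
    }
}
```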

Where Do I Go from Here?

I highly recommend using Ryan Chapman's x86_64 Linux System Call Table to play around with various system calls. Practice loading the appropriate registers with the proper data. Be aware of what the system calls are doing however. You do not want to inadvertently overwrite or remove files. I also recommend using the various instructions to see how they can affect data stored within registers. Try incrementing and decrementing values, perform some jumps, and push/pop some stuff onto/off the stack. Feel free to use my code as a starting point or the template below.

[BITS 64]

section .bss

section .data

section .text
global _start
_start:
  ; Your code here

  ; Exit the program normally
  mov rax, 60      ; 60 = sys_exit
  xor rdi, rdi     ; return code 0
  syscall

I will also be releasing part II of this tutorial in a week or two. I will discuss endianness, demonstrate analogous C code, and we will create a 64-bit assembly program for Linux that listens on a TCP port and accepts network connections. How cool is that?

I hope you have enjoyed this tutorial and that it has helped you in some way. Feel free to link to it. If you have any questions or suggestions, do not hesitate to comment.