Taeksang Peter Kim — GeistHaus

I came back to school to study hardware after 5 years of doing ML

Dec 15, 2024 Updated Dec 15, 2024

Show full content

.center-image { margin: 0 auto; display: block; } Background

After 5 years of doing ML engineering, I decided to pivot into ML systems.

Building video recommendation systems was fun, and it was always interesting to see real-world metrics like playback time go up as a result of a better predictive model. Especially when I could see my own video recommendations improve.

However, after getting my hands dirty with some CUDA side projects, I was drawn to something more lower-level, and frankly, more concrete than ML:

Stirring the pile became boring after 5 years.

I started to self-teach myself more CUDA, Triton, Pytorch Internals, … basically anything related to ML systems and compilers. One thing became crystal clear after doing so: all roads lead to hardware.

To squeeze the most performance out of my AI accelerator, I needed to understand my hardware’s microarchitecture.

So I came back to school to study computer engineering and build up my intuition from the hardware level.

I just finished my first semester as a Master’s student in the ECE department at UIUC. And boy did I get close to the hardware.

This blog is about the jarring experience of jumping from Python to SystemVerilog, and general things I’ve felt being a student again.

So what does school feel like the second time around?

How I feel at times.

Lessons Learned Pain is good

In line with my reason returning back to school, I enrolled in “Computer Organization and Design”, numbered as ECE 411. Little did I know, I had bit off more than I could chew.

Over the semester, we built a cache, a pipelined processor (w/ forwarding), and an out-of-order RISC-V processor, using a HDL known as SystemVerilog.

“Pipelining, ey? That doesn’t sound too bad. It’s just this, right?”

The laundry analogy. If only it was this simple.

It is conceptually. But when you’re writing the code and debugging it, it looks more like this:

You haven't felt pain until you've stared at waveforms in Verdi.

For someone who’d been writing Python for the past 5 years, this was damn near impossible. However, through sheer will and the help from excellent peers and TAs, I survived. It was excruciating though. I would spend literal weekends doing the assignments, only to make marginal progress. But the truth is: I learnt as much as I suffered. I can now confidently say that I know how out-of-order processors, caches, memory, (and more!) all actually work at the bare hardware level. I would not have been able to get to this point without going through the gauntlet of these brutal assignments.

My POV at the end of the semester after taking ECE 411. It really is all zeros and ones.

This coincided with my experience in the industry too. The most stressful moments, whether it was a challenging project or a heated argument with a colleage, were the ones that made me into a better engineer and communicator. So yeah, pain is good. And I hope to feel more of it during my time here as a student.

Group projects where you don’t like your group - that’s life

Even as an undergrad, I always knew that the point of these group projects were to prepare us for the industry, where we would be working as a team with others. After being in the industry, I can confirm this is true. However, I was wrong about what kind of a group I would be working in. Putting it bluntly, the experiences of working in a group where you don’t like your group is the most accurate reflection of working in the industry.

You’re not going to dislike everyone, of course. But there will always be a few people who are going to be tough to work with due to their working and communication style or just plain vibes (engineers are humans too).

In the industry, it actually gets worse. Imagine if one of those few people you don’t work well with is actually your manager. And at least right now, when the semester’s done, the group disbands and you never have to see them again. But imagine if the group didn’t. Imagine if you have to work with them for years. Finally, in the setting of a school, you can “report” them to your TA or Professor. In the real world, no one will be there help you (they will all be busy dealing with their own dysfunctional group).

This semester, whenever I faced difficult-to-work-with team members, I was surprised how unphased I was compared to how I would’ve felt as an undergrad. Primarily because I’ve faced worse in the industry for years. Now before you go off thinking that the industry is a place where everyone hates each other and never get any work done, it’s not. It’s a place where everyone hates each other but still get amazing work done. My time in the industry has taught me this skill and it’s the point of these groups projects: not just to work well with people but to work well with people you don’t like.

If you know, you know.

Education should take priority over grades

In class, your success depends on a predefined set of checkboxes. As long as you tick those boxes of doing well in assignments and exams, you are deterministically guaranteed to succeed. But what is success here? Is it getting a good grade? Or is it actually learning the material?

When I was being bombarded with assignments and exams from all three of my classes this semester, I realized how quickly I started to think success was about just completing these tasks. I had to constantly remind myself that doing the task meant nothing if I hadn’t learned something out of it. Ironically, education’s fundamental point gets drowned out in the sea of deadlines that colleges set for their students.

For one of my notoriously hard assignments this semester, I was struggling to pass all test cases. Through my own effort, I had just passed 70% of them. I should’ve have been more proud that I achieved that number on my own merit. But all I really felt was an overwhelming sense of doom that I would probably not have enough time to get the rest of the test cases, considering that the deadline was in less than 24 hours.

Like the devil, a peer approached me that he was willing to share his test code that helped him get 100%. It was tempting. But eventually, I realized that using that code to find my bugs would deprive me of the learning opportunity for me to struggle and discover the solution on my own. I politely declined. I chose my education over getting a good grade. I did ask him for high-level hints though :)

As long as you're learning, it really is fine. Even if your grades don't say so.

Looking forward

Reflecting on this past semester, I do not regret coming back to school to study hardware. I’ve learned so much about computer architecture and now I’m excited to build up from here, developing a first-principles understanding of how to make GPUs go brrr.

Interestingly, I think I will get the most out of school the second time around because of my experience in the industry. I know I shouldn’t be stressed about meeting deadlines and getting good grades as long as I’m continously learning. The only true stress comes from working with other people in group projects and that’s natural. The industry has well prepared me for this.

Special thanks to my wife

Thank you for always being there to support me through the many stressful nights I went through this semester. Because of you, I’m always reminded that I am really doing this for creating a better life with you, and making as many happy memories along the way.

/posts/becoming-a-student-again

Mixed Precision Training from Scratch

May 11, 2024 Updated May 11, 2024

Show full content

.center-image { margin: 0 auto; display: block; }

To really understand the whole picture of Mixed Precision Training, you need to go down to the CUDA level. Staying in Pytorch land can only get you so far:

# Just because you import, doesn't mean you understand.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for input, target in data:

    # What is happening here?
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)

    # ...and here?
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

While torch.cuda.amp offers a seamless way to apply mixed precision training, it also hides away the most important details. Especially how it makes your model run faster. The answer, as the library’s name suggests, lies in CUDA.

I decided to rewrite mixed precision training from scratch, going down to the CUDA level, and write this guide. I hope this will help you really understand mixed precision training.

What is Mixed Precision Training?

Mixed precision training is training neural nets with half-precision floats (FP16). This means all of our model weights, activations, and gradients during our forward and backward pass are now FP16, instead of single-precision floats (FP32). Now our memory requirement is halved! Unfortunately, just doing this causes issues with our gradient updates:

for p, grad in zip(parameters, grads): # {p,grad}.dtype = FP16
    update = -lr * grad # update.dtype = FP16
    p.data += update # FP16 += FP16

Since grad is now FP16, it can become too small to be represented in FP16 (any value smaller than 2^-24). When the gradient is less than 2^-24, it underflows to 0 and makes update = 0, causing our model to stop learning.

Remember, we also made parameters FP16 too. So even if we somehow preserved update to express the too-small-for-FP16 values, adding it to an FP16 tensor like p will cause no change in our weight values.

FP16 sacrifices numerical stability (fancy word for over/underflowing easily) for reduced memory footprint. So how does Mixed Precision Training overcome this trade-off? As the original paper outlines, it offers three solutions.

Let’s go through them one by one, grounding each concept by using it train a simple 2-layer MLP on MNIST.

1. Loss Scaling

Before revealing the answer, it’s often more useful to provide the intuition first. What exactly are our gradient values, and how many of them are outside of this FP16-representable range?

We can visualize this by training our neural net using good ol’ FP32, and plotting a histogram of gradients by their exponent values. Here is ours, modified to demonstrate this concept:

When we convert our gradients to FP16, all the gradients left of the red line will underflow. How can we fix this?

Mixed precision training does it in the simplest way possible: by shifting the gradients to the right. We do this through multiplying the gradients by a constant scale factor, resulting in the orange scaled gradients.

You might ask: why is this called loss scaling and not gradient scaling? The reason is that mixed precision training achieves the same effect by scaling just the loss, before starting backprop. Then the chain rule takes over, ensuring all gradients with respect to the loss are scaled too. It’s also more efficient to scale the loss (a scalar), than to scale all gradients (tensors).

Before moving on, it’s important to note that not all neural nets require loss scaling. Our 2-layer MLP’s actual gradient histogram proves this:

I still implemented loss scaling for pedagogical reasons, and because it was dead simple to do so:

loss = F.cross_entropy(logits, y) # end of forward pass

# Start manual backward pass. 
# Cast logits to fp32 before softmax to prevent underflow
dlogits = F.softmax(logits, 1, dtype=torch.float32)
dlogits[range(n), y] -= 1
dlogits /= n

# loss scaling
# note: we multiply dlogits, instead of loss, by scale. This is because
#   we are doing backprop manually. The first gradient is dlogits, and the
#   scale will propogate through all gradients.
dlogits *= scale
dlogits = dlogits.to(torch.float16) # no underflow occurs!

The gradient update adds an unscaling step:

for p, grad in zip(parameters, grads):
    update = -lr * (grad.to(torch.float32) / scale) # fp32 upcast, then unscale
    p.data += update # update is fp32, but p is still fp16. Hmm...

It’s important that we convert the gradients to FP32 before we unscale, can you guess why? :)

2. FP32 Master Copy of Weights

Updates no longer underflow in FP32. But when adding them to our FP16 weights, updates will underflow again because they are downcasted to FP16 before the addition. All our loss scaling work would have been for nothing!

Mixed precision training keeps an FP32 master copy of weights to solve this. We still use FP16 weights for our forward and backward pass, but at the optimizer step, we update the FP32 master weights:

# Define master weights. Notice the fp32 dtypes.
# linear layer 1
m_W1 = torch.rand((n_embd, n_hidden), dtype=torch.float32)
m_b1 = torch.rand(n_hidden, dtype=torch.float32)

# linear layer 2
m_W2 = torch.rand((n_hidden, num_classes), dtype=torch.float32)
m_b2 = torch.rand(num_classes, dtype=torch.float32)

parameters = [m_W1, m_b1, m_W2, m_b2]

for i, (x,y) in enumerate(dloader):

    # Convert to fp16 for forward and backward pass
    W1, b1, W2, b2 = m_W1.half(), m_b1.half(), m_W2.half(), m_b2.half()

    # (omitted forward/backward pass code that uses the fp16 weights)

    grads = [dW1, db1, dW2, db2]
    for p, grad in zip(parameters, grads): # updating fp32 master weights
        p.data += -lr * (grad.to(torch.float32) / scale)

Because we are keeping an additional FP32 copy of weights + the FP16 weights, our model memory requirement actually increases by 50% compared to single precision training (only FP32 weights required). For training however, memory consumption is dominated by activations and gradients, especially for large neural nets with many layers and huge batch sizes.

I replicated this phenomenon for our tiny 2-layer MLP:

$ python train.py false
device: cuda, mixed precision training: False (torch.float32)
model memory: 26.05 MB
act/grad memory: 1100.45 MB
total memory: 1126.50 MB
...
$ python train.py true
device: cuda, mixed precision training: True (torch.float16)
model memory: 39.08 MB
act/grad memory: 563.25 MB
total memory: 602.33 MB

While model memory does increase by 50%, our activations decrease by half. Overall, total memory is nearly halved.

3. Mixed Precision Arithmetic

Our gradient updates are now numerically stable. But there still remains one more source of numerical instability.

Technically, it’s not one. Pretty much every other operation we do in our model (matmult, softmax, relu, etc.) can be unstable in FP16. It’s so unstable that even PyTorch refuses to do them:

>>> F.softmax(a,1)
tensor([[0.2598, 0.7402],
        [0.8864, 0.1136]])
>>> F.softmax(a.half(),1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/peter/miniconda3/envs/torch/lib/python3.10/site-packages/torch/nn/functional.py", line 1843, in softmax
    ret = input.softmax(dim)
RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'

So how does my code run? CUDA saves the day here:

>>> F.softmax(a.half().cuda(),1)
tensor([[0.2598, 0.7402],
        [0.8864, 0.1136]], device='cuda:0', dtype=torch.float16)

Just like softmax, CUDA offers numerically stable versions of popular deep learning ops for FP16. It does this through mixing arithmetic precisions. E.g. For matmults, or vector dot-products, the multiplication is done in FP16 but the partial products are accumulated as FP32.

Let’s go down to CUDA land and see this in action. I wanted to demonstrate this as simply as possible, so I reimplemented the matmult operator with cuBLAS:

// C = alpha AB + beta C
void matmult(torch::Tensor A, torch::Tensor B, torch::Tensor C,
    bool transpose_A, bool transpose_B, float alpha, float beta) {

    // Additional code omitted for clarity.
    
    if (A.dtype() == torch::kFloat32) {
        cublasGemmEx(get_handle(), op_B, op_A, n, m, k, &alpha,
                     B.data_ptr<float>(), CUDA_R_32F, B.size(1), 
                     A.data_ptr<float>(), CUDA_R_32F, A.size(1), 
                     &beta, C.data_ptr<float>(), CUDA_R_32F, C.size(1),
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    } else if (A.dtype() == torch::kFloat16) {
        cublasGemmEx(get_handle(), op_B, op_A, n, m, k, &alpha,
                     B.data_ptr<at::Half>(), CUDA_R_16F, B.size(1), 
                     A.data_ptr<at::Half>(), CUDA_R_16F, A.size(1), 
                     &beta, C.data_ptr<at::Half>(), CUDA_R_16F, C.size(1),
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
}

First, we call cublasGemmEx with different parameters based on the precision type (this is how PyTorch does it too in essence, through dispatching). For FP16 tensors, we set the cudaDataType as CUDA_R_16F, which will read and write FP16 results. However, by specifying the cublasComputeType as CUBLAS_COMPUTE_32F, the dot-products are accumulated as FP32. Doing this mixed precision arithmetic allows us to retain numerical stability in our FP16 matmults.

In my final code, I load and call cuBLAS through a PyTorch extension:

# load matmult CUDA call as extension
mpt = load(name='mixed_precision_training', sources=['main.cpp', 'matmult.cu'],
           extra_cuda_cflags=['-O2', '-lcublas'])

# (omitted...)

mpt.matmult(x, W1, a1, False, False, 1.0, 1.0) # a1 = x @ W1 + b1
# Sanity check - exact match with PyTorch's FP16, numerically stable matmults
cmp('a1', a1, x @ W1 + b1)

The Secret Sauce: Tensor Cores

Now that we’ve made it down to CUDA land, I can finally show you why using FP16 for our weights and activations magically makes our model train faster. It’s because calling cublasGemmEx with FP16 datatypes (+ few other rules) activates Tensor Cores:

Starting with Volta GPUs (like the V100 that I’m using), FP16 matmults are accelerated from a hardware level using Tensor Cores. They make them about an order of magnitude faster than FP32 matmults (taken from the V100 microbench report):

With the secret sauce that is Tensor Cores, mixed precision training achieves higher training throughput. Thanks to them, my custom matmults are faster during both the forward and backward pass, doubling the training speed for our 2-layer MLP:

$ python train.py false
device: cuda, mixed precision training: False (torch.float32)
model memory: 26.05 MB
act/grad memory: 1100.45 MB
total memory: 1126.50 MB
1: loss 2.327, time: 139.196ms
2: loss 2.237, time: 16.598ms
3: loss 2.175, time: 16.179ms
4: loss 2.117, time: 16.206ms
5: loss 2.058, time: 16.187ms
6: loss 2.006, time: 16.207ms
7: loss 1.948, time: 16.304ms
avg: 16.280ms
$ python train.py true
device: cuda, mixed precision training: True (torch.float16)
model memory: 39.08 MB
act/grad memory: 563.25 MB
total memory: 602.33 MB
1: loss 2.328, time: 170.039ms
2: loss 2.236, time: 8.513ms
3: loss 2.176, time: 8.440ms
4: loss 2.117, time: 8.356ms
5: loss 2.059, time: 8.133ms
6: loss 2.006, time: 8.370ms
7: loss 1.948, time: 8.402ms
avg: 8.369ms

Wow, mixed precision training is 2x faster and uses 2x less memory!

Wrap-up

If you made it this far, you’ve now seen everything that mixed precision training has to offer. It can be reduced to training with lower precision tensors to decrease memory usage, while compensating for the numerical instability through three approaches - loss scaling, fp32 master weights, and mixed precision arithmetic.

Most importantly, you saw the CUDA code that activates hardware-accelerated matmults through Tensor Cores. This is what makes GPUs go brrr. Now you understand why mixed precision training is basically a necessity for training larger and larger neural nets: it is because we want to exploit these Tensor Cores.

Thanks for reading, and I hope now you really understand mixed precision training.

You can check out my full implementation on Github.

/posts/mixed-precision-from-scratch

How to set up Nsight Compute Locally to profile Remote GPUs

Apr 22, 2024 Updated Apr 22, 2024

Show full content

.center-image { margin: 0 auto; display: block; }

Have a remote GPU instance? Want to see some rooflines with Nsight Compute? This tutorial is for you.

0. Set up a remote GPU instance.

Step 0 because I assume you’ve already done this part (using any cloud provider - I’m going with AWS). You just need an SSH-able machine with CUDA installed.

If you are using AWS EC2 like me:

I’m using an Ubuntu image like this on a g4dn.xlarge:

Personally recommend using Ubuntu over Amazon Linux. I couldn’t get nvcc to work on Amazon Linux. Plus when you run into issues, you’ll find more Ubuntu-related stack overflow posts.

1. Download Nsight Compute on your local machine.

Download link. Nvidia developer account required.

2. Set up remote process.

Click Connect and add a new remote connection. Use your remote machine’s config.
Select the CUDA application you want to profile (the binary executable created with nvcc e.g. bench in nvcc bench.cu -o bench).
Select the output file on your local machine you want to dump the profiler’s result to (I had to manually create a dummy file called out).
Click Launch

If this works for you, then you’re done. Enjoy your rooflines!

If not, and you get a permissions error like this, go to Step 3.

Error: ERR_NVGPUCTRPERM - The user does not have permission to access 
NVIDIA GPU Performance Counters on the target device 0. 
For instructions on enabling permissions and 
to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

3. Give profiling permission to all users on your remote instance.

On your remote instance, run sudo vim /etc/modprobe.d/nvidia.conf. Add the following line:

options nvidia NVreg_RestrictProfilingToAdminUsers=0

Reboot your instance.
Run sudo update-initramfs -u -k all

Check out the official guide for more context.

NOTE for EC2 users: When you reboot your instance, AWS will assign a different public IP to your instance. Make sure to update it in your remote connection setting from Step 2.

4. Launch again from Nsight Compute.

Cross your fingers.

5. Enjoy your Rooflines.

/posts/nsight-setup-on-ec2

The One Billion Row Challenge in CUDA: from 17m to 17s

Apr 10, 2024 Updated Apr 10, 2024

Show full content

.center-image { margin: 0 auto; display: block; }

[Hacker News discussion]

On my journey to learn CUDA, I decided to tackle the One Billion Row Challenge with it.

The challenge is simple, but implementing it in CUDA was not. Here I will share my solution that runs in 16.8 seconds on a V100. It’s certainly not the fastest solution, but it is the first one of its kind (no cudf, hand-written kernels only). I challenge other CUDA enthusiasts to make it faster.

Baseline in pure C++

You can’t improve what you don’t measure. Since I’m going to be writing C++ anyways for CUDA, let’s use a pure C++ baseline. My CUDA code should be faster than this.

The approach is straight-forward: read the file line by line, parse it to get the city and temperature, and accumulate them in a STL map.

while (getline(file, line)) {
    istringstream iss(line);
    string station;
    float temp;
    getline(iss, station, ';');
    iss >> temp;

    auto it = stationStats.find(station);
    if (it == stationStats.end()) {
        stationStats[station] = {temp, temp, temp, 1};
    } else {
        Stat& s = it->second;
        s.min = min(s.min, temp);
        s.max = max(s.max, temp);
        s.sum += temp;
        s.count++;
    }
}

ofstream measurements("measurements.out");
for (auto& pair : stationStats) {
    const Stat& s = pair.second;
    float mean = s.sum / s.count;
    measurements << pair.first << "=" << s.min << "/";
    measurements << fixed << setprecision(1) << mean << "/";
    measurements << s.max << endl;
}

This runs in 16.5 minutes. Let’s improve this with CUDA.

Work Partitioning Approach One Billion Threads?

The whole promise of CUDA and other parallel programming APIs is that you can parallelize your workload across many processes. For CUDA, it’s a SIMT model - a single instruction is executed across multiple threads in parallel.

Great, so let’s just use one billion threads to process one billion lines concurrently!

Unfortunately, we can’t just launch one billion threads. We first need to prepare each line buffer for each thread to process. However, preparing these one billion line buffers requires reading the entire file, line by line (unless the lines were already placed in one billion files, but that would make this the One Billion Files Challenge).

The effort involved in setting up these buffers would essentially replicate the baseline workload, making this approach counterproductive.

Use Byte Offsets

The solution is to prepare file offsets instead of line buffers. These offsets are obtained iteratively, stepping through the entire file buffer by the desired split size (= total file size / desired number of parts), and marking the position of a new line character:

long long split_size = size / num_parts;
long long offset = 0;
std::vector<Part> parts;
while (offset < size) {
    long long seek_offset = std::max(offset + split_size - MAX_CITY_BYTE, 0LL);
    if (seek_offset > size) {
        parts.back().length += size-offset;
        break;
    }
    file.seekg(seek_offset, std::ios::beg);
    char buf[MAX_CITY_BYTE];
    file.read(buf, MAX_CITY_BYTE);

    // Find the new line character in the vicinity. 
    // That will be the boundary between this and the next offset.
    std::streamsize n = file.gcount();
    std::streamsize newline = -1;
    for (int i = n - 1; i >= 0; --i) {
        if (buf[i] == '\n') {
            newline = i;
            break;
        }
    }
    int remaining = n - newline - 1;
    long long next_offset = seek_offset + n - remaining;
    parts.push_back({offset, next_offset-offset});
    offset = next_offset;
}

This is much faster than reading the entire file because we are working with integer byte values. For example, say we want to partition the 14GB input file for two threads. The while loop iterates twice (offset = 0GB -> 7GB -> 14GB). Contrast this to a line-based approach. We would need to iterate 500M (= 1B / 2) times to load 500M lines into our first partition.

In reality, we need more than just two threads. But one billion is too much. Not because our GPU can’t handle one billion threads, but because finding one billion offsets becomes the bottleneck in our overall runtime. We haven’t even gotten to launching our CUDA kernels. We need to minimize our preparation time as much as possible.

For my solution, I create 1M partitions that takes 2.6 seconds out of the entire 16.8 seconds. (In comparison, creating 100M partitions alone takes over 3 minutes.)

CUDA Kernel

The rest of the time is spent in the CUDA kernel (finally!). The idea behind it is simple. Each thread indexes into a different part of the file buffer, parses it to get the cities and temperatures, and updates the min/max/avg statistics.

The implementation, however, is not trivial due to the following reasons (in order of annoyance):

CUDA’s AtomicMin and AtomicMax only work with int values. We’ll make our own float-value-accepting variants.
No std::string in CUDA. Time to make our own atof, strcmp, getline. Get ready for null terminators '\0'.
No std::map either. How can we pass in the city string to array index lookup table into our CUDA kernel?

Let’s go through these one by one.

AtomicMin & AtomicMax for Floats

Any atomic operation in CUDA can be written with atomicCAS. Let’s adapt the example from the official programming guide to write atomicMin and atomicMax for float values. Here’s AtomicMin:

__device__ static float atomicMin(float* address, float val) {
    int* address_as_i = (int*) address;
    int old = *address_as_i, assumed;
    do {
        assumed = old;
        old = ::atomicCAS(address_as_i, assumed,
        // Use `fmaxf` for atomicMax.
        __float_as_int(::fminf(val, __int_as_float(assumed))));
    } while (assumed != old);
    return __int_as_float(old);
}

Now we can atomically update min and max temperature floats.

C Strings

While we don’t have std::string, we have char*. A string is just an array of 8-bit characters, and the raw file buffer (e.g. "Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8...") is no different.

Each thread iterates through this char array, at different offsets and for different lengths (computed at our work partitioning step):

char city[MAX_CITY_BYTE];
char floatstr[5];  // longest temperature float str is -99.9 i.e. 5 bytes

for (int i = 0; i < parts[bx].length; i++) {   // bx is the global thread index
    char c = buffer[parts[bx].offset-buffer_offset + i];
    if (parsing_city) {  // City characters
        if (c == ';') {
            city[index] = '\0';
            index = 0;
            parsing_city = false;
        } else {
            city[index] = c;
            index++;
        }
    } else {  // Float characters
        if (c == '\n') {  // Reached end of line
            floatstr[index] = '\0';

            int stat_index = get_index(cities, city, n_city);
            float temp = cuda_atof(floatstr);

            // The heart of the CUDA kernel.
            // Update (atomically) the temperature statistics.
            // Identical in spirit to the simple C++ version.
            atomicMin(&stats[stat_index].min, temp);
            atomicMax(&stats[stat_index].max, temp);
            atomicAdd(&stats[stat_index].sum, temp);
            atomicAdd(&stats[stat_index].count, 1);

            // reset for next line read
            parsing_city = true;
            index = 0;
            floatstr[0] = '\0'; city[0] = '\0';
        } else {
            floatstr[index] = c;
            index++;
        }
    }
}

It’s not pretty. But it’s necessary as we don’t have the luxury of a getline. After parsing each line, we now have a pair of strings; a city string char city[] and a temperature string char floatstr[]. (The latter requires a conversion from string to float and since we don’t have atof in CUDA, I made my own again)

City String to Index Lookup GPU Hash Table?

How do we store temperature statistics for each city? In C++ land, we relied on a hash table - using the city string as key, and temperature statistic floats as value. In CUDA, we don’t have such a convenient std::map.

Okay, let’s write our own. How hard can it be? Turns out, damn near impossible because I have 100-byte city strings as keys.

While I found some implementations online, these approaches are limited to 32-bit keys, due to atomic operations being bounded to limited bits (even on the CPU).

To be clear, you don’t need atomic operations to deal with hash table collisions, but you do need them when the collisions can happen across multiple threads i.e. a parallel setting with concurrent inserts into the hash table.

Ask for forgiveness, not permission

So this is where I bend the rules of the original challenge a bit. I assume that a list of all cities is given along with the input file. This list is the data/weather_stations.csv file, which is actually used to generate the one billion rows for the official challenge.

Hacky Solution: Sorted Cities + Binary Search

Knowing the list of all possible cities, I avoid using a hash table. I just sort the list of all cities, and pass it to my CUDA kernel. I use the sorted indices as my lookup table. E.g. Say sorted cities list was ["A","B","C"]. Given city "B", the lookup index is its position, 1.

Sorting is important because we can do binary search to find the index in logarithmic time. While still slower than a hash table’s constant time lookup, it’s much faster than linearly searching 40k+ city entries.

// Never thought that I would be writing binary search in CUDA, but here we are.
// Thanks LeetCode!
__device__ int get_index(char* cities, char* city_target, int n_city) {
    int left = 0;
    int right = n_city - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        const char* city_query = cities + mid * MAX_CITY_BYTE;

        int cmp = cuda_strcmp(city_query, city_target);
        if (cmp == 0)
            return mid;
        else if (cmp < 0)
            left = mid + 1;
        else
            right = mid - 1;
    }
    return -1;
}

Now we’re finally done! The heart of the kernel boils down to:

int stat_index = get_index(cities, city, n_city);
float temp = cuda_atof(floatstr);

atomicMin(&stats[stat_index].min, temp);
atomicMax(&stats[stat_index].max, temp);
atomicAdd(&stats[stat_index].sum, temp);
atomicAdd(&stats[stat_index].count, 1);

All this just to do 4 atomic operations.

Profiling

On a V100, the CUDA solution runs in 16.8 seconds. It’s a 60X improvement compared to the 16.5 minutes for our C++ baseline. Here is the script to reproduce this.

Interestingly, the kernel is ~1.5X slower on a T4, and so I used ncu to profile on both devices. One thing that caught my eye was the difference in control divergence.

(v100) ~ ncu --section SourceCounters fast data/measurements.txt 1000000 600000
...
Section: Source Counters
    ------------------------- ----------- --------------
    Metric Name               Metric Unit   Metric Value
    ------------------------- ----------- --------------
    Branch Instructions Ratio           %           0.18
    Branch Instructions              inst 53,736,337,800
    Branch Efficiency                   %          92.73
    Avg. Divergent Branches                10,041,990.22
    ------------------------- ----------- --------------

(t4) ~ ncu --section SourceCounters fast data/measurements.txt 1000000 600000
...
Section: Source Counters
    ------------------------- ----------- --------------
    Metric Name               Metric Unit   Metric Value
    ------------------------- ----------- --------------
    Branch Instructions Ratio           %           0.17
    Branch Instructions              inst 53,870,435,921
    Branch Efficiency                   %          92.73
    Avg. Divergent Branches                20,156,806.42
    ------------------------- ----------- --------------

I expected there to be a lot of control divergence (since I’m using ifs and loops everywhere in my kernel), but I did not expect the divergence to be worse on the T4. Compared to the V100, it has twice as many avg. divergent branches. Could this be a reason why the kernel runs more slowly on a T4? Also, why is the branch efficiency so high if there are so many divergent branches?

Clearly, I’m still learning about the ncu profiler. I welcome any guidance on this area.

Possible Optimization - Privatization using Shared Memory

As I finish writing this blog, I realize that my struct Stat doesn’t need to hold the city char array. In which case, each struct will be 16 bytes (min, max, sum, count), and the entire array of statistics will be 16N bytes, where N is the # of unique cities.

Here, N=41k (based on data/weather_stations.csv), and so the entire array will be 66KB. This should be small enough to fit in shared memory (96KB per SM for Volta). Then, each block can update a private version of stats to reduce contention of the atomic operations. This should lead to a faster solution.

Takeaway

The more and more I did this challenge, the more and more I realized that not all parallel workloads are meant for CUDA. Especially those involving strings and dynamic hash tables. I eventually had to cheat a bit and make a static lookup table. I am genuinely curious: is there a way to not require a list of all cities? (and still be faster than the baseline, of course)

Still, no regrets taking on the challenge. I got to experience CUDA’s limitations first-hand, and that was worthwhile. Now I know I should just stick to Tensor matmults like I usually do.

/posts/cuda-1brc

Growing up in six different countries

Feb 17, 2019 Updated Feb 17, 2019

Show full content

.hexa{ border: 0px; float: left; text-align: center; height: 35px; width: 60px; font-size: 22px; background: #f0f0f0; color: #3c3c3c; position: relative; margin-top: 15px; } .hexa:before{ content: ""; position: absolute; left: 0; width: 0; height: 0; border-bottom: 15px solid #f0f0f0; border-left: 30px solid transparent; border-right: 30px solid transparent; top: -15px; } .hexa:after{ content: ""; position: absolute; left: 0; width: 0; height: 0; border-left: 30px solid transparent; border-right: 30px solid transparent; border-top: 15px solid #f0f0f0; bottom: -15px; } .timeline { position: relative; padding: 0; width: 100%; margin-top: 20px; list-style-type: none; } .timeline:before { position: absolute; left: 50%; top: 0; content: ' '; display: block; width: 2px; height: 100%; margin-left: -1px; background: rgb(213,213,213); background: -moz-linear-gradient(top, rgba(213,213,213,0) 0%, rgb(213,213,213) 8%, rgb(213,213,213) 92%, rgba(213,213,213,0) 100%); background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,rgba(30,87,153,1)), color-stop(100%,rgba(125,185,232,1))); background: -webkit-linear-gradient(top, rgba(213,213,213,0) 0%, rgb(213,213,213) 8%, rgb(213,213,213) 92%, rgba(213,213,213,0) 100%); background: -o-linear-gradient(top, rgba(213,213,213,0) 0%, rgb(213,213,213) 8%, rgb(213,213,213) 92%, rgba(213,213,213,0) 100%); background: -ms-linear-gradient(top, rgba(213,213,213,0) 0%, rgb(213,213,213) 8%, rgb(213,213,213) 92%, rgba(213,213,213,0) 100%); background: linear-gradient(to bottom, rgba(213,213,213,0) 0%, rgb(213,213,213) 8%, rgb(213,213,213) 92%, rgba(213,213,213,0) 100%); z-index: 5; } .timeline li { padding: 2em 0; } .timeline .hexa{ width: 16px; height: 10px; position: absolute; background: #00c4f3; z-index: 5; left: 0; right: 0; margin-left:auto; margin-right:auto; top: -30px; margin-top: 0; } .timeline .hexa:before { border-bottom: 4px solid #00c4f3; border-left-width: 8px; border-right-width: 8px; top: -4px; } .timeline .hexa:after { border-left-width: 8px; border-right-width: 8px; border-top: 4px solid #00c4f3; bottom: -4px; } .direction-l, .direction-r { float: none; width: 100%; text-align: center; } .flag-wrapper { text-align: center; position: relative; } .flag { position: relative; display: inline; background: rgb(255,255,255); font-weight: 600; z-index: 15; padding: 6px 10px; text-align: left; border-radius: 5px; } .direction-l .flag:after, .direction-r .flag:after { content: ""; position: absolute; left: 50%; top: -15px; height: 0; width: 0; margin-left: -8px; border: solid transparent; border-bottom-color: rgb(255,255,255); border-width: 8px; pointer-events: none; } .direction-l .flag { -webkit-box-shadow: -1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); -moz-box-shadow: -1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); box-shadow: -1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); } .direction-r .flag { -webkit-box-shadow: 1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); -moz-box-shadow: 1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); box-shadow: 1px 1px 1px rgba(0,0,0,0.15), 0 0 1px rgba(0,0,0,0.15); } .time-wrapper { display: block; position: relative; margin: 4px 0 0 0; z-index: 14; line-height: 1em; vertical-align: middle; color: #fff; } .direction-l .time-wrapper { float: none; } .direction-r .time-wrapper { float: none; } .time { background: #00c4f3; display: inline-block; padding: 8px; } .desc { position: relative; margin: 1em 0 0 0; padding: 1em; background: rgb(254,254,254); -webkit-box-shadow: 0 0 1px rgba(0,0,0,0.20); -moz-box-shadow: 0 0 1px rgba(0,0,0,0.20); box-shadow: 0 0 1px rgba(0,0,0,0.20); z-index: 15; } .direction-l .desc, .direction-r .desc { position: relative; margin: 1em 1em 0 1em; padding: 1em; z-index: 15; } @media(min-width: 768px){ .timeline { width: 660px; margin: 0 auto; margin-top: 20px; } .timeline li:after { content: ""; display: block; height: 0; clear: both; visibility: hidden; } .timeline .hexa { left: -28px; right: auto; top: 8px; } .timeline .direction-l .hexa { left: auto; right: -28px; } .direction-l { position: relative; width: 310px; float: left; text-align: right; } .direction-r { position: relative; width: 310px; float: right; text-align: left; } .flag-wrapper { display: inline-block; } .flag { font-size: 18px; } .direction-l .flag:after { left: auto; right: -16px; top: 50%; margin-top: -8px; border: solid transparent; border-left-color: rgb(254,254,254); border-width: 8px; } .direction-r .flag:after { top: 50%; margin-top: -8px; border: solid transparent; border-right-color: rgb(254,254,254); border-width: 8px; left: -8px; } .time-wrapper { display: inline; vertical-align: middle; margin: 0; } .direction-l .time-wrapper { float: left; } .direction-r .time-wrapper { float: right; } .time { padding: 5px 10px; } .direction-r .desc { margin: 1em 0 0 0.75em; } } @media(min-width: 992px){ .timeline { width: 800px; margin: 0 auto; margin-top: 20px; } .direction-l { position: relative; width: 380px; float: left; text-align: right; } .direction-r { position: relative; width: 380px; float: right; text-align: left; } } .center-image { margin: 0 auto; display: block; }

It’s not a trick question. It’s something that I’ve always been confused about since I was a child. To this day, I don’t really know the right answer. But I’ve thought that drawing out a timeline might make it easier to grasp at this usually innocent question. Here’s my definitive (but not really) answer:

Hello World! Dec 1995 Born in Gangnam (Yes, from the song), Seoul - South Korea.
Paris, France Sep 1996 Celebrated my first birthday here. My earliest memory is casually drinking milk while my dad carried me in his arms, as we both took in the Paris night skyline from the balcony of our apartment.
Tehran, Iran Aug 1999 Yes, very random indeed. Although my parents would eventually return many years after. I lived here until a month before the tragic 9/11 incident, so I remember it as a very peaceful place. Another super random fact - I went to an Italian kindergarten during this time.
Seoul, South Korea Aug 2001 My first real experience of living in the motherland. Finished my Kindergarten and my first year of grade school here.
Brussels, Belgium Jul 2003 Attended the International School of Brussels (ISB). Didn't know how to speak a word of English when I first came but kids pick up languages insanely fast and so did I.
Geneva, Switzerland Mar 2005 Also went to an International School here (You guessed it - International School of Geneva). Lived so close to the border of France that my mom would grocery shop in France and come back. Here I really started to take on English as my most comfortable language. My Korean was also fine since I (had to) spoke Korean with my parents, even to this day.
Seoul, South Korea Mar 2007 Finished grade school and the first 2 years of my middle school at an all-boys middle school.
Brussels, Belgium Feb 2010 Round 2! Came back full circle attending the same International School again, this time finishing high-school. My passion for building things and computer science began when I coded up a blackjack game in Visual Basic for an elective.
Los Angeles, U.S. Sep 2013 Worked hard (and played even harder) to get my undergradute and graduate degrees in Computer Science at UCLA. The longest I've ever stayed in one place without moving (almost 6 years!) - fortunate to have made most of my lifelong friends here, and LA will always hold a special place in my heart.
Seoul, South Korea Feb 2019 Military Service. Fortunately, I don't actually have to go to the military since, with my masters, I could qualify for a program to just work as a software engineer for 3 years. I will be working at Buzzvil as a Machine Learning engineer.
??? ??? Stay tuned!

Pretty long answer, right? To get a better sense of everything time-wise, let me use some graphs (I never thought that bar graphs and piecharts would become this useful to describe my personal identity):

my-crazy-timeline
My crazy timeline. You tell me where I’m from.

in-out-korea
I am “Korean” - # of years I lived in each continent

As you can tell, even though I was born here and am ethnically Korean, I am culturally American (There’s a reason why I am writing my blog posts in English). However, that’s not even entirely true. I say American because it’s the closest thing that I can identify to. I only recently came and left the U.S. for college and graduate school and, looking at the piechart above, it would seem like I would be more culturally European (whatever that means). Yet, I went to an International school when I lived abroad in these European nations, speaking English and befriending mostly American friends.

I think the most apparent reason why I feel culturally American is simply because I stayed there the longest compared to anywhere else. Looking at the timeline, you can see the longest I’ve ever stayed in one place was 5.5 years in LA - this is 2 years longer than the runner-up of 3.5 years in Brussels during my high-school years. Considering that my average occupation time in each of these countries is 2.6 years, living in LA for that long has certainly been an outlier (to be precise, at least 2 std dev. away). And consistency goes a long way - coupled with the prior expectation that I would stay in LA for at least 4 long years to finish my undergrad, I settled into the American ways of life much more comfortably and easily than anywhere I’ve lived before.

To be honest, I’m not trying to come to any grand conclusion about myself. Like many things in life, this question about my identity raises more questions than clear answers (so ask away down below in the comments).

/posts/where-am-i-from

A Quick Summary of “A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task”

Nov 13, 2017 Updated Nov 13, 2017

Teaching a machine how to read and comprehend has remained an elusive task in Natural Language Processing (NLP). The majority of information and knowledge that humanity has gathered thus far is stored in the form of plain text, and thus, the task of Reading Comprehension (RC) and Question Answering (QA) based on those text is a crucial component in not only NLP, but General Artificial Intelligence as a whole. Historically, large and realistic datasets have played a critical role for driving fields forward, such as ImageNet for object recognition. So when DeepMind researchers released the first large scale dataset for reading comprehension, it was certainly a historic moment. However, we may need to curb our enthusiasm - a more thorough analysis of their dataset, dubbed CNN/Daily Mail, reveals that: i) this dataset is easier than previously realized, ii) there exists significant noise in the dataset, and iii) current neural networks, like the one presented here, have essentially reached the performance ceiling on this dataset.

Show full content

.center-image { margin: 0 auto; display: block; }

The Question Answering Task

DeepMind’s RC datasets are made by exploiting online news articles from CNN and Daily Mail, and their matching summaries. Let’s look at an example:

fig1

Figure 1: A single training example from the CNN dataset.

A single example is tuple of three elements: the passage p, the question q, and the answer a. The passage is the news article, and the question is formed in Cloze style, where a single entity in the bullet summaries is replaced with a placeholder (@placeholder). The replaced entity becomes the correct answer. The inference goal here is to predict the answer entity (Y) from all the appearing entities in the passage, given the passage and question (X).

As an important preprocessing step, the text is lowercased, tokenized, and named entity recognition and coreference resolution have been applied. This means that all entities are replaced with abstract entity markers (@entityn) according to the constructed coreference chain. The folks at DeepMind argue that such a process ensures that their models are understanding the given passage, as opposed to applying world knowledge or co-occurrence (and not “reading” the passage).

The two datasets are summarized in the following table:

table1

Table 1: Data statistics of the CNN and Daily Mail

There are around 380k and 880k training examples (passage, question, answer) for the CNN and Daily Mail dataset, respectively. The average number of tokens, or words, for each passage and question is also shown. Finally, the average number of entities that appear in a single passage is also given.

The Two Models

In this section, two models are presented and evaluated to set a lower bound of the performance of current NLP systems. First, a conventional feature-based classifier is built by designing a feature vector f_p,q(e) for each candidate entity e, and learning a weight vector theta such that the correct answer a is ranked higher than all other candidate entities:

eqn1

More interestingly, let’s examine a neural network model that is a modified version of the Attentive Reader model proposed by DeepMind. To make every step clearer, I have accompanied code snippets from my TensorFlow-based model here. The system can be described by the three following steps:

Encoding

Every token is mapped to a d-dimensional vector via an embedding matrix that was obtained from pretrained GloVe embeddings. For the passage, we use a shallow bidirectional recurrent neural network (RNN) using Gated Recurrent Unit (GRU) cells. A GRU cell performs similarly to a Long Short Term Memory (LSTM) cell but is computationally cheaper. The outputs from the forward and backward RNN are concatenated together to form the final passage encoding.

with tf.variable_scope('d_encoder'): # Encoding Step for Passage (d_ for document)
  # Apply embeddings: [batch, max passage length in batch, GloVe Dim]
  d_embed = tf.nn.embedding_lookup(embeddings, d_input)
  d_cell_fw = rnn.GRUCell(args.hidden_size)
  d_cell_bw = rnn.GRUCell(args.hidden_size)
  d_outputs, _ = tf.nn.bidirectional_dynamic_rnn(d_cell_fw, d_cell_bw,
    d_embed, dtype=tf.float32) # Create bidirectional rnn
  d_output = tf.concat(d_outputs, axis=-1)
  # [batch, len, h], len is the max passage length, and h is the hidden size

We do the exact same to encode the question, but instead of concatenating the hidden states for every single word, only the final outputs are concatenated.

with tf.variable_scope('q_encoder'): # Encoding Step for Question
  # Apply Embeddings: batch, max passage length in batch, GloVe Dim]
  q_embed = tf.nn.embedding_lookup(embeddings, q_input)
  q_cell_fw = rnn.GRUCell(args.hidden_size)
  q_cell_bw = rnn.GRUCell(args.hidden_size)
  q_outputs, q_laststates = tf.nn.bidirectional_dynamic_rnn(q_cell_fw, q_cell_bw,
    q_embed, dtype=tf.float32) # Create bidirectional rnn
  q_output = tf.concat(q_laststates, axis=-1) # [batch, h]

Attention

Here, we want to compare the question embedding and the passage embedding, and narrow down the entities that are relevant to the question.

with tf.variable_scope('bilinear'): # Bilinear Layer (Attention Step)
  # M computes the similarity between each passage word and the entire question encoding
  M = d_output * tf.expand_dims(tf.matmul(q_output, W_bilinear), axis=1) # [batch, h] -> [batch, 1, h]
  # alpha represents the normalized weights representing how relevant the passage word is to the question
  alpha = tf.nn.softmax(tf.reduce_sum(M, axis=2)) # [batch, len]
  # this output contains the weighted combination of all contextual embeddings
  bilinear_output = tf.reduce_sum(d_output * tf.expand_dims(alpha, axis=2), axis=1) # [batch, h]

Prediction

Finally, the output from the previous step is fed into a dense layer and normalized with a softmax function.

with tf.variable_scope('dense'): # Prediction Step
  # the final output has dimension [batch, entity#], giving the probabilities of an entity being the answer for examples
  final_prob = tf.layers.dense(bilinear_output, units=args.num_labels,  
    activation=tf.nn.softmax, kernel_initializer=tf.random_uniform_initializer(minval=-0.01, maxval=0.01)) # [batch, entity#]

Results

The final results are presented in the table below:

table2

Table 2: Performance of models on the CNN and Daily Mail datasets.

The conventional feature-based classifier obtains a 67.9% accuracy on the CNN test set, which actually outperforms the best neural network model from DeepMind. This implies that the learning task at hand might not be as difficult as suggested, as a simple feature set can cover many of the examples. In addition, our modified attentive reader surpasses the DeepMind’s previous results by a large margin (over 5%).

In-Depth Analysis

So now that we have achieved good results through both of our models, let us ask the following questions: i) Since the dataset was created synthetically, what proportion of questions are trivial to answer, and how many are noisy and not answerable? ii) What have these models learned? iii) What are the prospects of improving them? To answer these, we randomly sample 100 examples from the CNN dev dataset, to perform a breakdown of the examples.

After careful analysis, the 100 examples can be roughly classified into the following categories:

Exact Match - The nearest words around the placeholder in the question also appear identically in the passage, in which case, the answer is self-evident.
Sentence-level paraphrase - The question is a paraphrasing of exactly one sentence in the passage, and the answer can definitely be identified in the sentence.
Partial Clue - No semantic match between the question and document sentences exist but the answer can be easily inferred through partial clues such as word and concept overlaps.
Multiple sentences - Multiple sentences in the passage must be examined to determine the answer.
Coreference errors - This category refers to examples with critical coreference errors for the answer entity or other key entities in the question. Not answerable.
Ambiguous / Very Hard - This category includes examples for which even humans cannot answer correctly (confidently). Not answerable.

The following table presents the distribution of these examples based on their respective categories:

table3

Table 3: An estimate of the breakdown of the sampled 100 examples from the CNN dev dataset.

As one can clearly see, the last two “not answerable” categories account for 25% of this sample, thus setting the barrier for training models with an accuracy match above 75%. Further, only two examples require examination of multiple sentences for inference, suggesting that there is a lower rate of challenging questions than DeepMind originally showcased. Thus, it is reasonable to assume that for most of the “answerable” cases, the inference is based upon identifying the most relevant sentence.

Consequently, we can further analyze the predictions of our two systems, based on the same 100 examples and categorization. The results are as follows:

table3

Table 4: Per-category performance of our two systems on 100 examples from CNN dev dataset

We can observe the following:

The exact match cases are very simple to learn and both systems get 100% correct.
Both systems perform poorly for the coreference errors and ambiguous/hard cases.
The two systems mainly differ in paraphrasing and partial clue cases. This clearly demonstrates that neural networks are better capable of learning semantic matches involving paraphrasing or lexical variations between sentences.
The neural net model seems to achieve near-optimal performance on all of the answerable cases.

Hence, due to the harsh upper bound on accuracy, set by the “not answerable” cases, it will be an unfruitful endeavor to explore more sophisticated NLP systems for this dataset.

Conclusion

Overall, this post was about the careful examination of the CNN/Daily Mail reading comprehension task. The two systems presented here demonstrates state-of-the-art results and an analysis by hand is carried out on a random sample of the CNN dataset. Even though the datasets are certainly valuable for their sheer size, it is evident that i) there exists significant noise due to the datasets’ method of data creation and coreference errors, ii) current neural network models have basically reached the performance ceiling iii) the reasoning required for inference is still quite simple.

< References >
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS), pages 1684–1692.

Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the cnn/dailymail reading comprehension task. In Association for Computational Linguistics (ACL), 2016.

/posts/a-quick-summary-of-cnn-dm-dataset

The Introspectiveness of Neural Networks

Aug 29, 2017 Updated Aug 29, 2017

Show full content

.center-image { margin: 0 auto; display: block; }

Neural networks are wonderful things, to say the least. They are undoubtedly the state-of-the-art techniques for many applications (if not all) ranging from language translations all the way to self-driving cars. Modern day neural networks, combined with other machine learning concepts such as reinforcement learning, have been able to crush the best human Go players in the world. Go is a game that has more possible board configurations (10^170 to be exact) than the number of atoms in the universe! In fact, just a little over two weeks ago, the Elon Musk-backed research lab OpenAI revealed an AI bot in the popular game of Dota 2 that just manhandled professional e-sport players in 1-vs-1 matches.

uber-web A web-based tool that the Data Visualization team uses over at Uber to explore self-driving car research data

OpenAI’s video showcasing all the different tactics/behaviors that its bot learned on its own. Mind you, there were no hundreds of if statements to program these behaviors like one would usually do in a traditional programming setting. The bot just played a lifetime’s worth 1-vs-1 matches against itself with minimal supervision from the researchers. And it just learns.

To be honest, I can dedicate this entire post to the remarkable accomplishments of neural networks. But now, I want to talk about neural networks from a different perspective, mainly it’s origin story: how it came to be and why it’s our best shot at truly intelligent machines and why it has more far-reaching consequences than we might have imagined.

Let’s go back to what originally inspired neural networks in the first place, our brains. We are all born with these insanely powerful and versatile computers in our heads that allow us to learn truly anything. We immediately learn to see and hear without any effort, and we can pretty complex math and physics, and finally, we can learn to create some of the most impressive engineering feats all the way from smartphones to reusable rockets.

So the brain can do all these different and amazing things. The million dollar question is, how? A reasonable guess would be that the brain works like a computer with a thousand different programs to do a thousand different things. A crude example would be that if I learnt how to speak French, the brain would generate some sort of French speaking module and just piece it in with the thousand different other modules that I already knew how to do.

On the other hand, what if the exact opposite is true? Instead, the brain only knows one learning algorithm that is capable of learning, well, anything. Although this is still a hypothesis, it is one that has some really mind-blowing evidences. Let me share with you my favorite:

In a 1992 neuroscience paper by Roe et al., they rewired a ferret’s retinal input into the auditory cortex [1]. Your auditory cortex is responsible for processing the sound signals you hear, allowing you to understand basic words and speech. The result of this neural-rewiring is nothing short of jaw-dropping: the auditory cortex learned how to process visual information coming from the ferret’s eyes, and in every sense, learned how to see.

roe-experiment-result A neural-rewiring of the brain. This is the biological equivalent of plugging your webcam input into your sound card of your computer. [2]

In another earlier neural-rewiring experiment by Metin and Frost in 1989, they rewired the same retinal input of hamsters (instead of ferrets) into the somatosensory cortex, which is responsible for your sense of touch [3]. As you might have guessed, these hamsters learned how to see without any issues. Overall, if the same section of the physical brain tissue can learn how to see, hear, and touch, surely, there must be a general learning algorithm that can do all these things, and more. At least, it seems very unlikely that the brain has three different learning algorithms for your sight, sound, and touch.

This brings me to the ultimate quest of AI and truly intelligent machines. If we can figure out and understand the brain’s learning algorithm and develop some close approximation to it for a computer, then this might be the best shot we have towards Artificial General Intelligence, one that can help us someday solve humanity’s problems in ways such as preventing global warming or outbreaks of diseases.

It’s hard not to get carried away here because even the slightest progress towards such general intelligence can tremendously benefit our world. (As a positive person, I assume here that AI will be used mainly for the good of mankind. I understand that this might not be the case but that topic is best saved for another blog post). However, the sad truth is that we are nowhere near achieving the ultimate dream of general intelligence, even with the ground-breaking research in neural networks today. This is mainly because, we still are nowhere near understanding the full complexities of how our brain learns, other than the fact that there’s this general learning algorithm. In fact, do a google search right now for “how does the brain work”. The results can go into very good detail about the structure and components of the brain but there’s definitely something left to be desired in how it all comes together. Just take a look at this short summary from the U.S. National Library of Medicine - PubMed Health Journal [4]:

The brain works like a big computer. It processes information that it receives from the senses and body, and sends messages back to the body. But the brain can do much more than a machine can: humans think and experience emotions with their brain, and it is the root of human intelligence.

It certainly is the root of human intelligence, which is why I am so insistent that unlocking the secrets of the brain can allow us to design better neural networks. I’ll give you another favorite example to prove this.

In 1959, Hubel and Wiesel were studying the visual cortex of a cat, by observing the early visual area of the cat brain, as the cat was looking at specific patterns on the screen [5].

hubel-cat
High-level setup of the Hubel and Wiesel’s experiment.

They realized that a neuron/cell would only turn on for edges in a particular orientation and that orientation only. Through this, they devised a model of how the visual cortex processes information. They found that nearby cells in the cortex represented and processed nearby regions in the visual field, preserving the spatial information of whatever the cat was seeing. Further, they found that were was an entire hierarchy of rows of cells. The most low level cells were named simple cells as they only reacted to a particular orientation of an edge. Through all these experiments, Hubel and Wiesel hypothesized that the visual cortex can be described by a hierarchical organization of simple cells that fed into complex cells which have more complicated activations and can form higher level representations. These simple cells would have translationally invariant and locally receptive fields.

hubel-hierarchy
Feature hierarchy of cells that processes visual information.

(At this point, if you are aware of how Convolutional Neural Networks (CNNs) work, you should already see the uncanny resemblance) Hubel and Wiesel eventually won a nobel prize for their amazing discoveries, and it inspired the computer vision community to reproduce this result. In the 1980s, Fukushima came up with the Neocognitron, which was basically a multilayered neural network for pattern recognition tasks. Eventually, Yann LeCun et al. built the first backpropogatable CNN, famously dubbed LeNet, in 1998. Fast-forwarding to today, CNNs not only dominate the computer vision world in terms of raw benchmark performance whether it’s MNIST or ImageNet, but is also deployed in tasks such as language translation (check out https://www.deepl.com/translator - they use a CNN over the more traditional RNN, and supposedly get better results).

The previous example was quite dense so let me summarize: the understanding of how the visual cortex works (not simply where it is inside the brain) allowed us to essentially come up with CNNs, a far more versatile version of a neural network for computer vision.

So far we’ve taken a look at the origin story of neural networks. The single learning algorithm hypothesis and its mind-blowing evidence motivates neural networks as not only a powerful learning algorithm, but perhaps the algorithm that general intelligence can be built upon. Plus, we examined a Nobel-prize worthy discovery into how the visual cortex processes images inside the brain. More importantly, we saw how this inspired the CNN that dominates today’s world of computer vision.

These two things constantly remind me of the introspectiveness of neural networks: their advancement relies on not only progress in machine learning, but also progress in neuroscience. The latter is undoubtedly the harder one to achieve, but also the more likely to fundamentally impact deep learning, and therefore should never be lapsed, especially in the minds of individuals interested in deep learning research and careers.

< References >

[1] Roe, A. W., Pallas, S. L., Kwon, Y. H. & Sur, M. Visual projections routed to the auditory pathway in ferrets: receptive felds of visual neurons in the primary auditory cortex. J. Neurosci. 12, 3651±3664 (1992).
[2] A. Ng, Lecture 4.2-Neurons and the brain: The “one learning algorithm” hypothesis., (2012)
[3] Metin, C. & Frost, D. Visual responses of neurons in somatosensory cortex of hamsters with experimentally induced retinal projections to somatosensory thalamus. Proc. Natl Acad. Sci. USA 86, 357–361 (1989).
[4] “How Does the Brain Work?” National Center for Biotechnology Information, U.S. National Library of Medicine, 29 Dec. 2016, www.ncbi.nlm.nih.gov/pubmedhealth/PMH0072486/.
[5] Hubel, D. H. & Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s striate cortex. J. Physiol. 148, 574-591.

/posts/introspectiveness-of-nn