Friday, June 17, 2016

TensorFlow 0.8 on Jetson TK1

This post gives updated instructions on how to build TensorFlow 0.8 on the Jetson TK1, now that NVIDIA has released a new compiler that can handle the variadic templates without internal compiler errors.

If you just want to install the whl file, here is a direct link: tensorflow-0.8.0-cp27-none-linux_armv7l.whl
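
Once downloaded, the wheel can be installed with pip (this assumes the CUDA 6.5 runtime and cuDNN v2 described below are already set up on the board):

$ sudo pip install tensorflow-0.8.0-cp27-none-linux_armv7l.whl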

I am going to use the same approach highlighted in the previous post: use the CUDA 6.5 runtime and cuDNN v2, but compile the code with the newer 7.0 compiler.


Install the 7.0.76 compiler:

Before starting, you will need to download the new compiler. NVIDIA does not make it easy to find the link (they would like you to use JetPack, but I don't like to reformat a working system unless absolutely needed), but you can download the .deb package directly on your Jetson with:



wget http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb

Now we can install it as usual:

sudo dpkg -i cuda-repo-l4t-7-0-local_7.0-76_armhf.deb 
sudo apt-get update
sudo apt-get install cuda-toolkit-7-0

At this point we need to restore the standard 6.5 toolchain as the default one (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1 will only work with the 6.5 runtime. Go to the /usr/local directory, remove the cuda symlink pointing to cuda-7.0, and create a new one for 6.5:

ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda
ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda

You should see this output:

ubuntu@tegra-ubuntu:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Fri_Dec_12_11:12:07_CST_2014
Cuda compilation tools, release 6.5, V6.5.35

ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_22_15:38:26_CST_2016
Cuda compilation tools, release 7.0, V7.0.74

Install protobuf and Bazel:
For protobuf you can follow the instructions from the previous blog post (the only change is the location of protobuf-java-3.0.0-beta-x.jar, now in the java/core/target subdirectory; see the sketch after the checkout commands below).
The procedure for Bazel is also similar; the only change required is the version. TF 0.8 requires Bazel 0.1.4, so after cloning bazel you will need to check out the proper tag:

$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$ git checkout tags/0.1.4
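
As in the previous post, before running compile.sh you need to copy the protoc binary and the protobuf jar into Bazel's third_party/protobuf directory. This is only a sketch assuming the same paths as the earlier build; the jar name that Bazel 0.1.4 expects may differ, so check the files already present in third_party/protobuf first:

$ cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe
# adjust the destination jar name to whatever Bazel 0.1.4 expects in third_party/protobuf
$ cp ~/protobuf/java/core/target/protobuf-java-3.0.0-beta-x.jar third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar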

Install TensorFlow 0.8:
The first thing to do is to check out the source code and select the proper version:

$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow
$ cd tensorflow
$ git checkout r0.8


TensorFlow expects a 64-bit system, so we will need to change all the references from lib64 to lib. We can find all the files containing the string and apply the change with these commands:

$ cd tensorflow
$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'

TensorFlow officially supports CUDA devices with 3.5 and 5.2 compute capability. We want to target a GPU with compute capability 3.2.
This can be done through TensorFlow's unofficial settings, by running "configure" with the TF_UNOFFICIAL_SETTING variable.
When prompted, specify that you only want a 3.2 compute capability device.


ubuntu@tegra-ubuntu:~/tensorflow$ TF_UNOFFICIAL_SETTING=1 ./configure
Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 
Please specify the location where CUDA  toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 
Please specify the location where cuDNN  library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.2
Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished

Now that the initial set up is done, it is time to change the compiler used by Bazel.

ubuntu@tegra-ubuntu:~/tensorflow$ cd third_party/gpus/cuda/
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ rm -fr bin nvvm
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/bin/ bin
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/nvvm/ nvvm

Before starting the build (which is going to take a very long time), we will need to modify a few files.

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:
 To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::InflatePadAndShuffle<float, 4, Eigen::DenseIndex>;
#endif
tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:
 To avoid double instantiation, guard the second functor  for ShuffleAndReverse with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::ShuffleAndReverse<float, 4, Eigen::DenseIndex>;
#endif

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:
ARMv7 has no numa_node file. The function should return 0, not -1, otherwise TensorFlow will crash at runtime. You can use the modification from the previous post or the following code:

static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
#ifdef __arm__
  LOG(INFO) << "ARMV7 does not support NUMA - returning NUMA node zero";
  return 0;
#else
 ........
  return kUnknownNumaNode;
#endif
}

tensorflow/core/common_runtime/gpu/process_state.cc:
This file contains a new memory allocator that is going to cause a floating point exception unless you change the following code:

if (kCudaHostMemoryUseBFC) {
      allocator =
#ifdef __arm__
          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 31,
                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#else
          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 36 /*64GB max*/,
                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#endif
    } else {


We are now ready to build. The only thing left to do is to remove the check that disables the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bazel skills may have a better idea). My solution is to start the build and then wait for the first failure:

$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures -s --config=cuda //tensorflow/cc:tutorials_example_trainer

If on your first compile of TensorFlow you get the following error:

ERROR: /home/ubuntu/tensorflow/tensorflow/cc/BUILD:61:1: error loading package 'tensorflow/core': Extension file not found. Unable to load package for '//google/protobuf:protobuf.bzl': BUILD file not found on package path and referenced by '//tensorflow/cc:tutorials_example_trainer'.

You need to initialize and update the submodules in the tensorflow repository to get the google/protobuf clone:

git submodule update --init 

At this point, I can edit the file Macros.h in Eigen.
This file is located in the .cache directory:

ubuntu@tegra-ubuntu:~/.cache$ find . -name Macros.h -print
./bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/eigen_archive/eigen-eigen-3f653ace7d28/Eigen/src/Core/util/Macros.h

The nvcc check needs to be eliminated: delete the two lines marked with a minus sign below (the #if guard and its matching #endif), keeping the #define:
-#if !defined(__NVCC__) || !defined(EIGEN_ARCH_ARM_OR_ARM64)
 #define EIGEN_HAS_VARIADIC_TEMPLATES 1
 #endif
-#endif


We can now restart the build and it will go through. 
After you are done, you can test it with:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see a similar output:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of
# a 2x2 matrix, on GPU. The last few lines look like this.
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

We are now ready to create the pip package and install it:
# To build with GPU support:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl

Congratulations, TensorFlow is now installed on your system.


Most of the tests are passing, but the image classification example is giving the wrong results. Now that the community can build it and play with it, hopefully someone will find the source of the error(s).

I downloaded the Python files from TensorFlow-Tutorials and they seem to work:

git clone https://github.com/nlintz/TensorFlow-Tutorials.git


Friday, November 27, 2015

Building TensorFlow for Jetson TK1

Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs. 

TensorFlow has a GPU backend built on CUDA, so I wanted to install it on a Jetson TK1. Even though the system does not meet the requirements (CUDA 7.0 is not available and the GPU has compute capability 3.2), I decided to give it a try anyway. This post reports all the steps required to build TensorFlow from source; it is quite challenging, but it can be done. Including all the prerequisites, the whole build will take several hours (if you just want to try TensorFlow, you can download the wheel file I generated and do a pip install; the file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing ).

TensorFlow is under active development and the code uses a lot of advanced C++ features that really push the compiler. These instructions worked with the version available on 11/26, but newer versions may require additional changes.

The first challenge is to build Bazel, another piece of software developed at Google, used as the build system for TensorFlow. Bazel requires a protobuf version newer than the one present in the Ubuntu 14.04 repos, so the first step will be to install protobuf 3 from source, since there are no prebuilt binaries for ARM32.

Java 8:
The first step is to install Java 8, which is quite simple since Oracle provides a package:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Protobuf:
In order to build protobuf and Bazel, we will need several other packages. The exact list will depend on the state of your Jetson, but you will need at least these:

$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev  

After downloading the latest source from GitHub:

$ git clone https://github.com/google/protobuf.git

you need to first generate the configuration file and then run make:

$ cd protobuf
$ ./autogen.sh 
$ ./configure --prefix=/usr
$ make -j 4
$ sudo make install

Protoc will be installed in /usr/bin and its libraries in /usr/lib; this will be important when we run Bazel, since it uses a sandbox and will not find libraries in /usr/local/lib.

You should see this output, if you have followed all the steps:

ubuntu@tegra-ubuntu:~/protobuf$ protoc --version
libprotoc 3.0.0

We also need to build the Java interface for protobuf, which requires Maven.
Luckily, Maven is available from the repos, so we can just issue:

$ sudo apt-get install maven

Go to the subdirectory java inside protobuf and type:
$ mvn package

Once the build completes, there will be a protobuf-java-3.0.0-beta-1.jar inside the target subdirectory.
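
You can double-check that the jar is there before moving on:

$ ls ~/protobuf/java/target/protobuf-java-*.jar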

Bazel:

We are now ready to tackle Bazel.
The first step is to download the source code for Bazel (using version 0.1.0, which is known to work with TensorFlow).

$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$ git checkout tags/0.1.0

Before compiling, we need to copy the protoc binary we just built to third_party/protobuf/protoc-linux-arm32.exe.
We also need to copy the jar file from protobuf into the same directory. Bazel is expecting an alpha-3 version, but we have built a beta-1.
There is probably a better way of doing this, but simply copying the file and renaming it did the trick for me.


$ cp /usr/bin/protoc   third_party/protobuf/protoc-linux-arm32.exe
$ cp ~/protobuf/java/target/protobuf-java-3.0.0-beta-1.jar  third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar

We are now ready to compile bazel. 

$ ./compile.sh

At the end of the compilation, the bazel binary will be in the output directory. You can add this directory
to your path or copy the binary to /usr/local/bin.
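
For example, a minimal way to do the latter:

$ sudo cp output/bazel /usr/local/bin/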

TensorFlow

We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and cuDNN 6.5 installed on your Jetson TK1.
You will also need some files from the CUDA 7.0 package (cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb) that you can download from
the NVIDIA web site (it is the one for the Jetson TX1).
While the Jetson TK1 cannot run the 7.0 runtime, since the driver shipped with the system does not support it, it is still possible to use the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files generate an internal compiler error with the 6.5 nvcc.
All the libraries and the runtime will be the standard 6.5 ones.


On my system I have also enabled some swap space. You can plug in a USB memory stick and turn it into a swap area with:
$ sudo mkswap /dev/sda
$ sudo swapon /dev/sda
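
If you prefer to keep a filesystem on the stick and use a swap file instead of dedicating the whole device, here is a sketch (it assumes the stick is formatted with a Linux filesystem such as ext4 and mounted at /media/usb, which will differ on your system):

$ sudo dd if=/dev/zero of=/media/usb/swapfile bs=1M count=4096
$ sudo mkswap /media/usb/swapfile
$ sudo swapon /media/usb/swapfile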

The first step to build TensorFlow is to clone the GitHub repository:
$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow

and install other dependencies:
$ sudo apt-get install python-numpy swig python-dev

TensorFlow expects a 64-bit system and has a number of library paths and library names hard-coded in its files.
Before starting the installation, we will need to modify several files: we need to change all the references from lib64 to lib, and change the 7.0 libraries to 6.5. We can find all the files containing the strings and apply the changes with these commands:

$ cd tensorflow
$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'
$ grep -Rl "so.7.0"| xargs sed -i 's/so\.7\.0/so\.6\.5/g'


TensorFlow officially supports CUDA devices with 3.5 and 5.2 compute capability. We want to target a GPU with compute capability 3.2.
This can be done through TensorFlow's unofficial settings, by running "configure" with the TF_UNOFFICIAL_SETTING variable.
When prompted, specify that you only want a 3.2 compute capability device.

$ TF_UNOFFICIAL_SETTING=1 ./configure

# Same as the official settings above

WARNING: You are configuring unofficial settings in TensorFlow. Because some
external libraries are not backward compatible, these settings are largely
untested and unsupported.

Please specify a list of comma-separated Cuda compute capabilities you want to
build with. You can find the compute capability of your device at:
https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases
your build time and binary size. [Default is: "3.5,5.2"]: 3.2

Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished


After the configure step, bazel has copied or symlinked all the binaries and libraries needed for the build into the third_party/gpus/cuda subdirectory.
It is now time to replace the CUDA compiler with the one from the 7.0 toolchain.

We want to extract (not install) the files from the cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb package with the following commands:

$ dpkg -x cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb /tmp/cuda_repo
$ cd /tmp/cuda_repo/var/cuda-repo-7-0-local
$ dpkg -x cuda-core-7-0_7.0-71_armhf.deb /tmp/cuda7.0
$ rm -fr /tmp/cuda_repo

$ cd ~/tensorflow/third_party/gpus/cuda
$ rm -fr bin nvvm
$ cp -R  /tmp/cuda7.0/usr/local/cuda-7.0/bin bin
$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/nvvm nvvm
$ rm -fr /tmp/cuda7.0

At this point, bazel is ready to use the 7.0 toolchain to compile TensorFlow.

We still need to add the ARM target to the build.
This can be done by adding the following lines to the third_party/gpus/crosstool/CROSSTOOL file:

default_toolchain {
  cpu: "arm"
  toolchain_identifier: "local_linux"
}

Before starting the build, we need to edit a few files to avoid compiler crashes and double instantiations
(on ARMv7, Eigen::DenseIndex is typedefed to int):

third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h
tensorflow/core/kernels/conv_ops_gpu_2.cu.cc
tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
tensorflow/core/kernels/adjust_contrast_op.h


third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h:  
The compiler crashes when evaluating the code inside the ifdef at line 312. We can just take the alternative path by changing line 312 to something like:
#ifdef EIGEN_HAS_VARIADIC_TEMPLATES_TK1

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:
To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::InflatePadAndShuffle<float, 4, Eigen::DenseIndex>;
#endif
We also need to comment out the tensor.h include (it will crash the compiler):
//#include "tensorflow/core/public/tensor.h"

tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:
To avoid double instantiation, guard the second functor  for ShuffleAndReverse with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::ShuffleAndReverse<float, 4, Eigen::DenseIndex>;
#endif

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:
ARMv7 has no numa_node file. The function should return 0, not -1, otherwise TensorFlow will crash at runtime:
FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename;
#ifdef __arm__
    // There is no numa_node on Jetson TK1
    return 0;
#else
    return kUnknownNumaNode;
#endif


tensorflow/core/kernels/adjust_contrast_op.h:
The compiler crashes on some of the initializations, so we need to rewrite them in a simpler way:

//MF Eigen::array<int, 4> scalar_broadcast{{batch, height, width, channels}};
    Eigen::array<int, 4> scalar_broadcast;
    scalar_broadcast[0] = batch;
    scalar_broadcast[1] = height;
    scalar_broadcast[2] = width;
    scalar_broadcast[3] = channels;
#if !defined(EIGEN_HAS_INDEX_LIST)
//MF Eigen::array<int, 2> reduction_axis{{1, 2}};
//MF Eigen::array<int, 4> scalar{{1, 1, 1, 1}};
//MF Eigen::array<int, 4> broadcast_dims{{1, height, width, 1}};
//MF Eigen::Tensor<int, 4>::Dimensions reshape_dims{{batch, 1, 1, channels}};
     Eigen::array<int, 2> reduction_axis;
      reduction_axis[0]=1;
      reduction_axis[1]=2;
     Eigen::array<int, 4> scalar;
      scalar[0]=1;
      scalar[1]=1;
      scalar[2]=1;
      scalar[3]=1;
     Eigen::array<int, 4> broadcast_dims;
      broadcast_dims[0]=1;
      broadcast_dims[1]=height;
      broadcast_dims[2]=width;
      broadcast_dims[3]=1;
     Eigen::Tensor<int, 4>::Dimensions reshape_dims;
      reshape_dims[0]=batch;
      reshape_dims[1]=1;
      reshape_dims[2]=1;
      reshape_dims[3]=channels;
#else

The source code is now ready. The Jetson TK1 has only 2GB of memory and bazel will try to compile several files at the same time.
We want to avoid this, so we will pass a --local_resources flag limiting the build to 2GB and half a core (don't ask: if you specify one core, it will still try
to compile two files at the same time). This build will take a long time:

$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer

If you get some failures during the build, keep trying: bazel scheduling seems to be non-deterministic and the TensorFlow code really stresses the
compiler. A crude way to automate the retries is sketched below.
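
This retry loop is only a sketch; interrupt it if the same error keeps repeating, since that usually indicates a real problem rather than a scheduling hiccup:

$ until bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer; do echo "retrying"; done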

Once the build is completed, we can test the code:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see a similar output:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of
# a 2x2 matrix, on GPU. The last few lines look like this.
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]


We are now ready to create the pip package and install it:
# To build with GPU support:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.5.0-cp27-none-linux_armv7l.whl

Congratulations, TensorFlow is now installed on your system.

We can also try a more interesting example of image classification:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/examples/label_image/...

$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip -O tensorflow/examples/label_image/data/inception5h.zip
$ unzip tensorflow/examples/label_image/data/inception5h.zip -d tensorflow/examples/label_image/data/
$ mv tensorflow/examples/label_image/data/tensorflow_inception_graph.pb tensorflow/examples/label_image/data/googlenet_graph.pb
$ mv tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt tensorflow/examples/label_image/data/googlenet_labels.txt 

And run it with:
$ bazel-bin/tensorflow/examples/label_image/label_image
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 1
E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:890] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: GK20A
major: 3 minor: 2 memoryClockRate (GHz) 0.852
pciBusID 0000:00:00.0
Total memory: 1.85GiB
Free memory: 218.46MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 18.46MiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0xa45ea000 extends to 0xa585f000
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/examples/label_image/main.cc:221] military uniform (866): 0.902268
I tensorflow/examples/label_image/main.cc:221] bow tie (817): 0.05407
I tensorflow/examples/label_image/main.cc:221] suit (794): 0.0113196
I tensorflow/examples/label_image/main.cc:221] bulletproof vest (833): 0.0100269
I tensorflow/examples/label_image/main.cc:221] bearskin (849): 0.00649747




Saturday, October 12, 2013

CUDA 5.5 and Xcode 5

The latest Xcode 5 update seems to have broken nvcc.
If you try to compile a CUDA program, you will see an error similar to this:

%nvcc -c qr.cu
clang: error: unsupported option '-dumpspecs'
clang: error: no input files

There is a simple workaround.

%nvcc -ccbin=/usr/bin/clang -c qr.cu

A more convenient way is to define an alias for nvcc.
You can add this line to your .bash_profile:

alias nvcc='nvcc -ccbin=/usr/bin/clang'

or just define it in your shell,

alias 'nvcc=nvcc -ccbin=/usr/bin/clang'



Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

The latest MATLAB versions, starting from 2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C kernel, add.cu, that adds a scalar to a vector:


__global__ void add(double * in, double a, int N)  {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) {
       in[idx] += a;
   }
}

The full documentation is available at
http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
I am just going to summarize the required steps:

  • Generate a PTX  file from the kernel source
    • nvcc -ptx -arch sm_20 add.cu
  • Construct the kernel object from the PTX file
    • k=parallel.gpu.CUDAKernel('add.ptx','add.cu');
  • Set up the block and grid configuration, for example 28 blocks of 256 threads each:
    • k.ThreadBlockSize=[256 1 1]
    • k.GridSize=[28 1 1]
  • Execute the kernel.
    • o = feval(k,rand(10,1),2.,10)
    • The gpu array o contains the output of the kernel

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug: if you want
to learn more about CUDA Fortran there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code:

attributes(global) subroutine add(a, b, N)
    implicit none
    double precision, intent(inout) :: a(*)
    double precision, value :: b
    integer , value :: N
    integer :: i

    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if ( i <=N) a(i) = a(i)+b

 end subroutine add

To generate the PTX file, instead of invoking nvcc we will call pgf90 with the right
flags:

               pgf90 -c -Mcuda=keepptx,cc20  addf.cuf
The keepptx flag will generate the PTX file for compute capability 2.0, addf.n001.ptx.
If the compute capability is omitted, or if you specify multiple targets, the PGI compiler will generate several PTX files; the numbering is just an enumeration, so you will need to inspect the PTX files to check which compute capability each one targets. We can perform this step from an OS shell or from inside MATLAB.
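
From the shell, one quick way to check is to look at the .target directive that each PTX file contains:

$ grep "\.target" addf.*.ptx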
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables by issuing the command:

               setenv('BASH_ENV','~/.bash_profile');

and then invoke pgf90 preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.

               !pgf90 -c -Mcuda=keepptx,cc20  addf.cuf


In order to load the PTX file in MATLAB, we need to slightly change the syntax.
When loading the PTX file generated from CUDA C, we were passing both the PTX file name and
the original CUDA C file; in this way, MATLAB automatically discovers the prototype of the function. There is another way, in which we explicitly pass the prototype signature to parallel.gpu.CUDAKernel.

This is what we need to load the PTX file generated from CUDA Fortran.

       kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

    • kf.ThreadBlockSize=[256 1 1]
    • kf.GridSize=[28 1 1]
and execute the kernel.
    • of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
i1=gpuArray(rand(10,1))
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel
of=feval(kf,i1,10.,10)


An important point for CUDA Fortran kernels is that you cannot use Fortran assumed-shape array arguments, since they require the compiler to build and pass a descriptor as an extra argument.


Now that we understand all the steps, let's move to something more complex and discuss a few more points.
We are going to implement a kernel to compute the sum of an array in a single pass using an atomic lock
(the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book).
The kernel is embedded in a module, since we are using a global variable for the lock.
There is no limitation on the number of elements the routine can handle, aside from the fact that we are using 32-bit integers
for the addressing; each thread will process multiple elements if needed.

This is the code:

module sumgpu

  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device::  lock=0

contains

  attributes(global) subroutine sum(input,totalsum,N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer,value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i,index, inext
    real(fp_kind) :: lsum

    index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x

    lsum = 0._fp_kind
    do i=index,N,BlockDim%x*GridDim%x
       lsum = lsum+ input(i)
    end do

    ! Local reduction per block
    index=threadIdx%x

    psum(index)=lsum
    call syncthreads()

    inext=blockDim%x/2
    do while ( inext >=1 )
       if (index <=inext) psum(index)=psum(index)+psum(index+inext)
       inext = inext /2
       call syncthreads()
    end do

    ! Final reduction among block with atomic lock
    if (index == 1) then
       do while ( atomiccas(lock,0,1) == 1)
       end do
       totalsum(1)=totalsum(1)+psum(1)
       call threadfence()
       lock =0
    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}


The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of being embedded in a module:
when the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name, and a trailing underscore.
While this is not important when the module contains a single subroutine, it is crucial when multiple entry points are defined. If the module had multiple subroutines, we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code.  Possible names are:
sumgpu_sum_
sumgpu_sum2_
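
You can also inspect the mangled entry names directly from the shell, by looking at the .entry directives in the PTX file:

$ grep "\.entry" sumSingleBlock.n001.ptx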


In this case, we would have to modify the command syntax and add an extra argument at the end of the list that specifies the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')
kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was
declared with intent(in) since it is only an input to the subroutine, is now marked as 'inout double vector'. This is not a major problem, but we would
need to remember, when using the object in MATLAB, to specify two vectors as output on the left-hand side.
We can fix the problem by changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced 'double *' with 'const double *' to reflect that the array is read-only.
We are now ready to run the code:

%Generate an array of 1024 elements on the CPU
a=rand(1024,1);
% Copy the array to a GPU array ag
ag=gpuArray(a);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel
disp('CUDA Fortran kernel:')
sumg=feval(kf,ag,sumg,1024)
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
sum_matlab=sum(ag)
%Check the result
disp('Difference:')
sumg-sum_matlab

obtaining the following output:
CUDA Fortran kernel:

sumg =

  509.2181

Intrinsic MATLAB sum on GPU:

sum_matlab =

  509.2181

Difference:

ans =

     0

Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.
We will generate 50 million random numbers directly on the GPU and then compute their sum.

%Set up random number generation on the GPU
seed=0;
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
parallel.gpu.RandStream.setGlobalStream(gpu_stream);
N=50000000;
%Generate the random numbers directly on the GPU
ag=gpuArray.randn(N,1);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel and time the execution
tic;sumg=feval(kf,ag,sumg,N);toc
% Invoke the intrinsic sum and time the execution
tic;sum(ag);toc

The output indicates that this version is slightly faster than the native sum, which is, however, more convenient to use.
Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to implement new capabilities or simply to reuse
standalone code that was already written, all from a very productive environment such as MATLAB.