Memory Mapping, Clojure, And Apache Arrow
Mmap, Memory, and Operating Systems
This post is about a special facility for dealing with memory and files on disk called memory mapping. Memory mapping blurs the line between filesystem access and access to physical RAM in a way that benefits managing and processing large volumes of data on modern machines. We have personally had a lot of success with it over the years. Ignoring it is a big mistake, akin to ignoring the tremendous research that has gone into high-performance data storage and access techniques.
In this post, and in our work, we apply memory mapping to loading Apache Arrow 1.0 data. So we will also cover the Apache Arrow binary format, its overall architecture and base implementation, and why it is such a great candidate for exactly this process.
Memory mapping, available in all modern operating systems, allows programs to work with datasets larger than physical RAM while also allowing multiple programs running concurrently to believe they have access to the entire memory space (or more) of the machine.
First, we'll outline "paging", a slightly simplified model of how RAM is accessed in general. The operating system manages memory this way whether that access comes from a C malloc call or from a Java allocation of an array or an object. For the purpose of this post, all access to RAM, regardless of whether that access is via the JVM or a native pathway, happens via this mechanism. There are specialized ways to avoid this, but we can safely ignore those for now as they do not apply to your average Java, Python, R, or C program.
The main idea of paging is to create a level of indirection between your program and physical RAM. When your program allocates memory, the operating system allocates pages of memory. These pages have a fixed size, such as 4KB, 8KB, or 16KB. The OS keeps track of metadata about these pages in an associative data structure called a page table. When you dereference a pointer into that memory, the hardware and operating system consult the page table to find out where the physical address of the requested memory lies. This implements what is called virtual memory - a mapping between the memory that you see as a programmer and the physical RAM in the system. This mapping allows, among other things, the OS to load data from disk and place it in physical memory before your program is allowed to fully dereference the pointer. This enables both swapping (with which you are probably familiar) and memory mapping.
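As a tiny illustration of the bookkeeping involved (assuming the common 4KB page size), a virtual address splits into a page number, used to find the page table entry, and an offset within that page:
;; Simplified sketch: splitting a virtual address with 4KB (4096 byte) pages.
(def page-size 4096)

(defn page-split [^long address]
  {:page-number (quot address page-size)
   :page-offset (rem address page-size)})

(page-split 8195)
;; => {:page-number 2, :page-offset 3}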
Mapping RAM to filesystem storage has some interesting aspects that we can take advantage of to expand our possibilities when working with data:
- We can map a file much larger than physical RAM. If we have 16GB of RAM on a machine we can ask the OS to memory map a file that is 100GB and it will happily do so, paging the data we read into physical memory as necessary. This can often be both simpler and faster than managing intermediate buffers of data in your own program.
- The same physical page of memory can be mapped into several processes simultaneously.
- When we map a file we get back a simple integer pointer; not an OS stream abstraction.
These possibilities enable the two main features of memory mapping promised in the opening of this post, and they further suggest a very efficient form of Inter-Process Communication (IPC). Two processes can map the same file, and the operating system can potentially map pages of that file to the same locations in physical RAM, saving overall IO bandwidth. If one process writes to that memory, it can update a potentially large file seen by another program instantly. We see this commonly with shared libraries: on many operating systems they are mapped into a program's address space via a shared, read-only mapping that is also marked executable. This allows many programs to share a subset of the memory required for a given shared library.
There are still other options! For the truly curious, please refer to the POSIX mmap man page and potentially cross-reference it with the Windows File Mapping documentation.
Memory Mapping On The JVM
Memory mapping is possible on the JVM, but the standard library implementation is needlessly hamstrung. When using the standard java.nio.channels.FileChannel system to memory map files you are limited to 2GB, because Java NIO buffers are addressed with signed 32-bit integers. TechAscent exists (at least in part) to remove short-sighted language-specific shortcomings. We see no reason for a 2GB limit on memory mapped files!
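To see the limitation concretely, here is a rough sketch of the standard-library route (big-file.bin is a hypothetical file): FileChannel's map method accepts a long size, but it rejects anything over Integer/MAX_VALUE because the resulting MappedByteBuffer is indexed with a signed 32-bit int.
(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode])

(with-open [raf (RandomAccessFile. "big-file.bin" "r")]
  ;; Fine up to ~2GB; throws IllegalArgumentException for anything larger.
  (.map (.getChannel raf)
        FileChannel$MapMode/READ_ONLY
        0
        (min (.length raf) Integer/MAX_VALUE)))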
Memory mapping fits in naturally with the architecture of tech.datatype, and we have implemented support for it there.
Memory mapping a file is a single function call:
user> (require '[tech.v2.datatype :as dtype])
nil
user> (require '[tech.v2.datatype.mmap :as mmap])
nil
user> (doc mmap/mmap-file)
-------------------------
tech.v2.datatype.mmap/mmap-file
([fpath {:keys [resource-type mmap-mode], :or {resource-type :stack, mmap-mode :read-only}}] [fpath])
Memory map a file returning a native buffer. fpath must resolve to a valid
java.io.File.
Options
* :resource-type - Chose the type of resource management to use with the returned
value:
* `:stack` - default - mmap-file call must be wrapped in a call to
tech.resource/stack-resource-context and will be cleaned up when control leaves
the form.
* `:gc` - The mmaped file will be cleaned up when the garbage collection system
decides to collect the returned native buffer.
* `nil` - The mmaped file will never be cleaned up.
* :mmap-mode
* :read-only - default - map the data as shared read-only.
* :read-write - map the data as shared read-write.
* :private - map a private copy of the data and do not share.
nil
user> (mmap/mmap-file "/home/chrisn/Downloads/Rich Hickey on Datomic Ions, September 12, 2018.mp4" {:resource-type :gc})
{:address 139898477133824, :n-elems 284636892, :datatype :int8}
user> (def mmap-data *1)
#'user/mmap-data
user> (type mmap-data)
tech.v2.datatype.mmap.NativeBuffer
This is all it takes, and you get back a typed pointer (an address, a length, and a datatype) allowing safe element-wise access to any location in the buffer. This interface supports files larger than 2GB and so immediately removes the restriction we saw above with java.nio.channels.FileChannel/map. In tech.datatype, lsize, read, and write all take/return longs. The underlying operations are built on sun.misc.Unsafe, which also takes long arguments.
In order to control how and when the file is unmapped, we use tech.resource to do so automatically. For more about resource types, please refer to our blog post on tech.resource.
The returned native buffer can be read manually via the read-* methods in the mmap namespace, and it can also be read via the normal tech.datatype reader pathway. The datatype library has optimized copy pathways between native buffers and Java arrays.
user> ;;read some bytes
user> (take 10 (dtype/->reader mmap-data))
(0 0 0 24 102 116 121 112 109 112)
user> ;;copy to a byte array
user> (dtype/make-container :java-array :int8
(dtype/sub-buffer mmap-data 0 25))
[0, 0, 0, 24, 102, 116, 121, 112, 109, 112, 52, 50, 0, 0, 0, 0, 105, 115, 111, 109,
109, 112, 52, 50, 0]
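Because the buffer is just typed, random-access memory, decoding a slice is a one-liner. For example, bytes 4 through 7 of the MP4 file mapped above are the ASCII "ftyp" box marker already visible in the reader output. A small sketch, reusing the mmap-data binding:
;; Bytes 102 116 121 112 are the ASCII characters of the MP4 "ftyp" marker.
(String. (dtype/->array-copy (dtype/sub-buffer mmap-data 4 4)))
;; => "ftyp"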
Apache Arrow - What Is It?
Recently the Apache Arrow project reached a major milestone - version 1.0. Here is the official definition from their site:
Apache Arrow is a cross-language development platform for in-memory analytics
That is quite grand and broad, and the whole point of this exercise is to blur the line between what is in memory and what is not. Luckily for us they have an ABI, something we wish more systems had. Not only is the ABI nicely defined in words, they also used an interesting language-agnostic project, Google FlatBuffers, to generate low-level code that translates between an opaque array of bytes and somewhat less opaque data structures.
These extension points give us a way in, and for our purposes we can instead think of Arrow as:
A typed columnar data storage format with a clear language-agnostic binary spec
This means that we can in theory share data between different languages or processes, and we can potentially share that data without ever copying it (read: instantly). We can learn the format of Apache Arrow streaming files by breaking down a simple file using our new memory mapping facility and some low-level helpers from tech.ml.dataset. We will start with our favorite stocks.csv test case.
The outermost layer of an Arrow streaming file consists of a sequence of messages. A Message is a generated class that derives from a flatbuffer Table. Each message is framed by a 32-bit metadata length, and the metadata in turn records a 64-bit body length. A single function call takes a native buffer and returns a lazy sequence of Messages.
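Before reaching for the library helper, it is worth a peek at what the generated flatbuffer classes do with one message's metadata. This is only a sketch; metadata-bytes is an assumed byte array holding a single message's flatbuffer metadata, not something defined above:
(import '[org.apache.arrow.flatbuf Message MessageHeader])

;; Wrap the raw metadata bytes and let the generated accessors do the work.
(let [msg (Message/getRootAsMessage (java.nio.ByteBuffer/wrap metadata-bytes))]
  {:header-type (.headerType msg)    ;; MessageHeader/Schema, DictionaryBatch, RecordBatch, ...
   :body-length (.bodyLength msg)})  ;; 64-bit length of the body that follows the metadata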
user> (require '[tech.ml.dataset :as ds])
nil
user> (require '[tech.libs.arrow.in-place :as arrow-in-place])
nil
user> (require '[tech.libs.arrow :as arrow])
nil
user> (def stocks (ds/->dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------|------------|-------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (arrow/write-dataset-to-stream! stocks "stocks.arrow")
nil
user> (require '[tech.v2.datatype.mmap :as mmap])
nil
user> (def file-data (mmap/mmap-file "stocks.arrow" {:resource-type :gc}))
#'user/file-data
user> file-data
{:address 140332842586112, :n-elems 11128, :datatype :int8}
user> (def messages (arrow-in-place/message-seq file-data))
#'user/messages
user> messages
({:next-data {:address 140332842587000, :n-elems 10240, :datatype :int8},
:message
#object[org.apache.arrow.flatbuf.Message 0x6341aee5 "org.apache.arrow.flatbuf.Message@6341aee5"],
:message-type :schema}
{:next-data {:address 140332842587256, :n-elems 9984, :datatype :int8},
:message
#object[org.apache.arrow.flatbuf.Message 0x566e4d7a "org.apache.arrow.flatbuf.Message@566e4d7a"],
:message-type :dictionary-batch,
:body {:address 140332842587192, :n-elems 64, :datatype :int8}}
{:next-data {:address 140332842597232, :n-elems 8, :datatype :int8},
:message
#object[org.apache.arrow.flatbuf.Message 0x11838ad4 "org.apache.arrow.flatbuf.Message@11838ad4"],
:message-type :record-batch,
:body {:address 140332842587496, :n-elems 9736, :datatype :int8}})
This code parsed the raw file data pointer into three messages: a schema, a dictionary-batch, and a record-batch. The Arrow documentation explains their significance.
As a side note, look at how nice it is to see all of this information immediately in the REPL without writing more code. This is one strong advantage of data-oriented Clojure programs: if data is EDN, you can always simply print it. This niceness continues as we dive in and see more of the internal structure of this Arrow data.
Another single function call further expands the Arrow messages into more or less pure EDN.
user> (def parsed-messages (mapv arrow-in-place/parse-message messages))
#'user/parsed-messages
user> parsed-messages
({:fields
[{:name "symbol",
:nullable? true,
:field-type {:datatype :string, :encoding :utf-8},
:metadata
{":name" "\"symbol\"", ":size" "560", ":datatype" ":string", ":categorical?" "true"},
:dictionary-encoding
{:id -887523944, :ordered? false, :index-type {:datatype :int8}}}
{:name "date",
:nullable? false,
:field-type {:datatype :epoch-milliseconds, :timezone "UTC"},
:metadata
{":name" "\"date\"", ":timezone" "\"UTC\"", ":source-datatype" ":packed-local-date", ":size" "560", ":datatype" ":epoch-milliseconds"}}
{:name "price",
:nullable? false,
:field-type {:datatype :float64},
:metadata {":name" "\"price\"", ":size" "560", ":datatype" ":float64"}}],
:encodings
{-887523944 {:id -887523944, :ordered? false, :index-type {:datatype :int8}}},
:metadata {},
:message-type :schema}
{:id -887523944,
:delta? false,
:records
{:nodes [{:n-elems 6, :n-null-entries 0}],
:buffers
[{:address 139690179728440, :n-elems 1, :datatype :int8}
{:address 139690179728448, :n-elems 28, :datatype :int8}
{:address 139690179728480, :n-elems 19, :datatype :int8}]},
:message-type :dictionary-batch}
{:nodes
[{:n-elems 560, :n-null-entries 0}
{:n-elems 560, :n-null-entries 0}
{:n-elems 560, :n-null-entries 0}],
:buffers
[{:address 139690179728744, :n-elems 70, :datatype :int8}
{:address 139690179728816, :n-elems 560, :datatype :int8}
{:address 139690179729376, :n-elems 70, :datatype :int8}
{:address 139690179729448, :n-elems 4480, :datatype :int8}
{:address 139690179733928, :n-elems 70, :datatype :int8}
{:address 139690179734000, :n-elems 4480, :datatype :int8}],
:message-type :record-batch})
That is pretty clear - imagine doing this in C++. We can now both see the entire skeleton of the file and we can start to interactively explore the data itself. The schema is already fully parsed, but the dictionary-batch and the record-batch are worth explaining. Let's start with the dictionary-batch:
user> (def dictionary-batch (second parsed-messages))
#'user/dictionary-batch
user> dictionary-batch
{:id -887523944,
:delta? false,
:records
{:nodes [{:n-elems 6, :n-null-entries 0}],
:buffers
[{:address 139690179728440, :n-elems 1, :datatype :int8}
{:address 139690179728448, :n-elems 28, :datatype :int8}
{:address 139690179728480, :n-elems 19, :datatype :int8}]},
:message-type :dictionary-batch}
Dictionary batches allow repeated elements to be represented by indexes, so they are commonly used as a simple form of compression for string columns with high levels of repetition. We can see above that the dictionary has id -887523944, which matches the id in the symbol column's :dictionary-encoding entry. This sort of arrangement is sometimes called a "string table".
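In miniature, the idea is just an indexed lookup table:
;; String-table idea in miniature: store each distinct string once and
;; represent the column as small integer indexes into that table.
(def string-table ["" "MSFT" "AMZN" "IBM" "GOOG" "AAPL"])
(def indexes [1 1 1 2 2 3])
(mapv string-table indexes)
;; => ["MSFT" "MSFT" "MSFT" "AMZN" "AMZN" "IBM"]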
The dictionary batch starts with a validity buffer, a bitmask that tells us which indexes are valid. We can see from the nodes entry above that in this case there are no invalid indexes, so we will ignore this buffer. That leaves us with two more buffers. According to the specification, one buffer is a buffer of 32-bit offsets while the other is a UTF-8 encoded byte buffer delineated by those offsets (the actual strings).
The mmap namespace contains a method to reinterpret a chunk of data with a different datatype. We can then read from this data and see what we get.
user> (def offsets (get-in dictionary-batch [:records :buffers 1]))
#'user/offsets
user> offsets
{:address 139690179728448, :n-elems 28, :datatype :int8}
user> (def int32-offsets (mmap/set-native-datatype offsets :int32))
#'user/int32-offsets
user> (require '[tech.v2.datatype :as dtype])
nil
user> (dtype/->reader int32-offsets)
[0 0 4 8 11 15 19]
It seems the byte buffer contains six strings: the repeated 0 at the start of the offset buffer delimits an empty string, and the five nonzero offsets that follow delimit the five stock symbols. We can test this theory easily:
user> (def data (get-in dictionary-batch [:records :buffers 2]))
#'user/data
user> data
{:address 139690179728480, :n-elems 19, :datatype :int8}
user> (def offsets-reader (dtype/->reader int32-offsets))
#'user/offsets-reader
user> (def n-elems (dtype/ecount offsets-reader))
#'user/n-elems
user> (for [idx (range (dec n-elems))]
(String. (dtype/->array-copy
(dtype/sub-buffer
data
(offsets-reader idx)
(- (offsets-reader (inc idx))
(offsets-reader idx))))))
("" "MSFT" "AMZN" "IBM" "GOOG" "AAPL")
user> (def dictionary (vec *1))
#'user/dictionary
That's the dictionary. But what about the actual columns of data? For this we need to interpret the single record batch of buffers through the schema.
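The schema and record-batch bindings used below can be pulled straight out of parsed-messages, for example:
(def schema (first parsed-messages))
(def record-batch (last parsed-messages))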
Looking at our schema and our record batch:
user> (:fields schema)
[{:name "symbol",
:nullable? true,
:field-type {:datatype :string, :encoding :utf-8},
:metadata
{":name" "\"symbol\"", ":size" "560", ":datatype" ":string", ":categorical?" "true"},
:dictionary-encoding
{:id -887523944, :ordered? false, :index-type {:datatype :int8}}}
{:name "date",
:nullable? false,
:field-type {:datatype :epoch-milliseconds, :timezone "UTC"},
:metadata
{":name" "\"date\"", ":timezone" "\"UTC\"", ":source-datatype" ":packed-local-date", ":size" "560", ":datatype" ":epoch-milliseconds"}}
{:name "price",
:nullable? false,
:field-type {:datatype :float64},
:metadata {":name" "\"price\"", ":size" "560", ":datatype" ":float64"}}]
user> record-batch
{:nodes
[{:n-elems 560, :n-null-entries 0}
{:n-elems 560, :n-null-entries 0}
{:n-elems 560, :n-null-entries 0}],
:buffers
[{:address 139764597052776, :n-elems 70, :datatype :int8}
{:address 139764597052848, :n-elems 560, :datatype :int8}
{:address 139764597053408, :n-elems 70, :datatype :int8}
{:address 139764597053480, :n-elems 4480, :datatype :int8}
{:address 139764597057960, :n-elems 70, :datatype :int8}
{:address 139764597058032, :n-elems 4480, :datatype :int8}],
:message-type :record-batch}
There are 3 schema fields, 3 nodes in the record batch, and 6 buffers. In this case, each column of data gets 2 buffers; in our case the symbol column stores byte indexes into the dictionary. A column of raw text (one without a string table) would instead require 3 buffers, the same layout as the dictionary-batch above. Since we have no missing values we can drop the initial buffer of each pair (the bitwise validity buffer) and then group together the related items.
user> (def columns (map (fn [field node buffer-pair]
{:field field
:node node
:buffer (second buffer-pair)})
(:fields schema)
(:nodes record-batch)
(partition 2 (:buffers record-batch))))
#'user/columns
user> columns
({:field
{:name "symbol",
:nullable? true,
:field-type {:datatype :string, :encoding :utf-8},
:metadata
{":name" "\"symbol\"", ":size" "560", ":datatype" ":string", ":categorical?" "true"},
:dictionary-encoding
{:id -887523944, :ordered? false, :index-type {:datatype :int8}}},
:node {:n-elems 560, :n-null-entries 0},
:buffer {:address 139764597052848, :n-elems 560, :datatype :int8}}
{:field
{:name "date",
:nullable? false,
:field-type {:datatype :epoch-milliseconds, :timezone "UTC"},
:metadata
{":name" "\"date\"", ":timezone" "\"UTC\"", ":source-datatype" ":packed-local-date", ":size" "560", ":datatype" ":epoch-milliseconds"}},
:node {:n-elems 560, :n-null-entries 0},
:buffer {:address 139764597053480, :n-elems 4480, :datatype :int8}}
{:field
{:name "price",
:nullable? false,
:field-type {:datatype :float64},
:metadata {":name" "\"price\"", ":size" "560", ":datatype" ":float64"}},
:node {:n-elems 560, :n-null-entries 0},
:buffer {:address 139764597058032, :n-elems 4480, :datatype :int8}})
Now getting the actual column data is mainly a matter of setting the datatype on the buffer. This is how the in-place loading mechanism works in general.
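Here is a sketch of that last step, assuming the columns and dictionary bindings from above are still in scope and that set-native-datatype handles the wider numeric types the same way it handled :int32:
;; price: 4480 bytes reinterpreted as 560 64-bit floats
(def price-col (mmap/set-native-datatype (:buffer (nth columns 2)) :float64))
(take 5 (dtype/->reader price-col))
;; => should match the head of the dataset: (39.81 36.35 43.22 28.37 25.45)

;; date: epoch milliseconds stored as 64-bit integers
(def date-col (mmap/set-native-datatype (:buffer (nth columns 1)) :int64))

;; symbol: int8 indexes into the dictionary decoded earlier
(def symbol-col (map (comp dictionary long)
                     (dtype/->reader (:buffer (first columns)))))
(take 3 symbol-col)
;; => ("MSFT" "MSFT" "MSFT")
That, plus validity-buffer and dictionary bookkeeping, is essentially all the in-place loader does.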
Of course, consuming Arrow data with our tools is much simpler than this (this is how they are implemented!). In general, Arrow data can be treated simply as a dataset, enabling powerful data science workflows and interoperability.
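A minimal sketch of that higher-level path, reusing the stocks.arrow file written earlier (the stack resource context releases the mapping when control leaves the form):
(require '[tech.resource :as resource])

(resource/stack-resource-context
  (-> (arrow-in-place/read-stream-dataset-inplace "stocks.arrow")
      (ds/mapseq-reader)
      (first)))
;; => the first row as a map, e.g. {"date" 946684800000, "symbol" "MSFT", "price" 39.81}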
The whole idea here is that mmap is supposed to be fast. So, we measure.
Performance of the Mmap Pathway
Memory mapping is generally quite fast. To prove it, we can do a two-way comparison between the dataset library's in-place Arrow API and the official Java Arrow SDK. First we make a large file and save it in the Arrow streaming format (ten thousand copies of the stocks dataset results in 5M+ rows):
user> (def big-stocks (apply ds/concat-copying (repeat 10000 stocks)))
#'user/big-stocks
user> (dtype/shape big-stocks)
[3 5600000]
user> (arrow/write-dataset-to-stream! big-stocks "big-stocks.arrow")
nil
We time three things:
- Time till dataset is accessible.
- Time to read the first row.
- Time to sum the price column.
We will use the pure Java Arrow bindings for comparison and time them against the in-place mmap bindings. For our timings, we will use the excellent criterium project.
First, time till dataset is accessible.
user> (require '[tech.resource :as resource])
nil
user> (require '[criterium.core :as crit])
nil
user> (require '[tech.libs.arrow.copying :as copying])
nil
user> (import '[org.apache.arrow.vector.ipc ArrowStreamReader])
org.apache.arrow.vector.ipc.ArrowStreamReader
user> (require '[tech.io :as io])
nil
user> (defn arrow-stream-load
[]
(with-open [istream (io/input-stream "big-stocks.arrow")
reader (ArrowStreamReader. istream (copying/allocator))]
(.loadNextBatch reader)))
#'user/arrow-stream-load
user> (defn inplace-load
[]
(resource/stack-resource-context
(arrow-in-place/read-stream-dataset-inplace "big-stocks.arrow")
:ok))
#'user/inplace-load
user> (crit/quick-bench (arrow-stream-load))
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 90.585413 ms
Execution time std-deviation : 495.117602 µs
Execution time lower quantile : 89.880458 ms ( 2.5%)
Execution time upper quantile : 91.077616 ms (97.5%)
Overhead used : 2.363001 ns
nil
user> (crit/quick-bench (inplace-load))
Evaluation count : 6054 in 6 samples of 1009 calls.
Execution time mean : 102.176576 µs
Execution time std-deviation : 1.494221 µs
Execution time lower quantile : 100.669201 µs ( 2.5%)
Execution time upper quantile : 103.988031 µs (97.5%)
Overhead used : 2.363001 ns
The copying load process starts any further benchmark with a roughly 90ms penalty. On the other hand, it has loaded all of the data into RAM, while the in-place method has loaded almost none of it, so things may even out a bit depending on the next operation.
Next, time to read the first row:
user> (defn arrow-stream-first-row
[]
(with-open [istream (io/input-stream "big-stocks.arrow")
reader (ArrowStreamReader. istream (copying/allocator))]
(.loadNextBatch reader)
(let [field-vecs (.getFieldVectors (.getVectorSchemaRoot reader))]
(mapv (fn [^org.apache.arrow.vector.FieldVector fv]
(.getObject fv 0))
field-vecs))))
#'user/arrow-stream-first-row
user> (arrow-stream-first-row)
[0 946684800000 39.81]
user> (defn inplace-first-row
[]
(resource/stack-resource-context
(-> (arrow-in-place/read-stream-dataset-inplace "big-stocks.arrow")
(ds/mapseq-reader)
(first))))
#'user/inplace-first-row
user> (inplace-first-row)
{"date" 946684800000, "symbol" "MSFT", "price" 39.81}
user> (crit/quick-bench (arrow-stream-first-row))
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 90.654766 ms
Execution time std-deviation : 562.213037 µs
Execution time lower quantile : 89.932865 ms ( 2.5%)
Execution time upper quantile : 91.290891 ms (97.5%)
Overhead used : 2.363001 ns
nil
user> (crit/quick-bench (inplace-first-row))
Evaluation count : 5196 in 6 samples of 866 calls.
Execution time mean : 117.908280 µs
Execution time std-deviation : 1.616664 µs
Execution time lower quantile : 116.432225 µs ( 2.5%)
Execution time upper quantile : 119.640781 µs (97.5%)
Overhead used : 2.363001 ns
nil
Reading the first row is pretty simple, and it is rarely all you need to do, but the timings are already starting to show that the memory-mapped pathway is fast.
Next up is a parallel summation across the entire price column. For this we introduce an important new primitive - indexed-map-reduce.
user> (require '[tech.v2.datatype.typecast :as typecast])
nil
user> (require '[tech.parallel.for :as parallel-for])
nil
user> (defn fast-reduce-plus
^double [rdr]
(let [rdr (typecast/datatype->reader :float64 rdr)]
(parallel-for/indexed-map-reduce
(.lsize rdr)
(fn [^long start-idx ^long group-len]
(let [end-idx (long (+ start-idx group-len))]
(loop [idx (long start-idx)
result 0.0]
(if (< idx end-idx)
(recur (unchecked-inc idx)
(+ result (.read rdr idx)))
result))))
(partial reduce +))))
#'user/fast-reduce-plus
user> (defn copying-sum
[]
(with-open [istream (io/input-stream "big-stocks.arrow")
reader (ArrowStreamReader. istream (copying/allocator))]
(.loadNextBatch reader)
(fast-reduce-plus (.get (.getFieldVectors (.getVectorSchemaRoot reader))
2))))
#'user/copying-sum
user> (copying-sum)
5.64112000000166E8
user> (defn inplace-sum
[]
(resource/stack-resource-context
(let [ds (arrow-in-place/read-stream-dataset-inplace "big-stocks.arrow")]
(fast-reduce-plus (ds "price")))))
#'user/inplace-sum
user> (inplace-sum)
5.64112000000166E8
user> (crit/quick-bench (copying-sum))
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 101.428983 ms
Execution time std-deviation : 4.261582 ms
Execution time lower quantile : 98.672708 ms ( 2.5%)
Execution time upper quantile : 108.666779 ms (97.5%)
Overhead used : 2.363001 ns
nil
user> (crit/quick-bench (inplace-sum))
Evaluation count : 114 in 6 samples of 19 calls.
Execution time mean : 7.444195 ms
Execution time std-deviation : 1.984742 ms
Execution time lower quantile : 5.674894 ms ( 2.5%)
Execution time upper quantile : 9.874004 ms (97.5%)
Overhead used : 2.363001 ns
nil
More than 10x faster.
There are certainly other situations, such as small files, where the copying pathway is indeed faster, but for these pathways it is not even close. And, along with this raw speed boost, comes the fact that the memory-mapped pathway is more flexible in that you can load files larger than your available RAM. So, for moderate to large files on disk, memory mapping is probably both faster and more flexible than copying the entire file into your process memory.
And beyond that, mmap also opens the door to shared-memory-style IPC with other processes running on your machine, such as Python or R programs. So there are options with memory-mapped data that simply do not exist when each process copies the file into its own private heap. When working with data-intensive applications you really want all the options you can get.
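To make that concrete, here is a minimal sketch: a second process (another JVM here, though it could just as well be a Python or R process using its own mmap facility) maps the very same stocks.arrow file, and the operating system is free to back both mappings with the same physical pages.
;; Run in a second process while the first still has the file mapped.
(require '[tech.v2.datatype.mmap :as mmap]
         '[tech.v2.datatype :as dtype])

(def shared (mmap/mmap-file "stocks.arrow" {:resource-type :gc}))

;; Both processes see exactly the same bytes; nothing was copied to get here.
(take 8 (dtype/->reader shared))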
We hope that this article helps you understand both memory mapping and the Arrow streaming format. It has been a pleasure to write, as working with the Arrow binary format is quite enjoyable, and we believe this important format is here to stay and has great potential for the future. As an example of a forward-looking system built on top of Arrow, have a look at the Gandiva project :-).
TechAscent: Mapping out clear pathways through technical swamps.