Dataframes are an increasingly common data structure. Dataframe implementations provide an API to access and manipulate two-dimensional “tabular” data. In general, the rows represent individual “cases”, each of which consists of a number of observations or measurements (columns). The columns can be of differing data types. A special case of a dataframe is a time series, in which each row represents a set of observations with a specific time attached. Time series data is extremely common in financial analysis: prices, quotes, volumes, exchange rates, etc.

There are mature dataframe implementations in many languages. In R, dataframes are built in; Python has the extensive pandas library; and Julia also has an implementation.

Common features across these include:

  • Indexing - rows and columns can be easily referenced by name or label with various indexing methods.
  • Filtering - selecting subsets of the data.
  • Numerical Analysis - mean, standard deviation etc.
  • Reshaping - groupby, pivot etc.
  • Handling Missing Data - gracefully handling gaps in data.
  • Joining - combining multiple dataframes.
  • Reporting/Plotting - outputting textual and graphical views of the data.
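
To make the list above concrete, here is a minimal sketch of a few of these operations in plain JavaScript, using an array of objects as a stand-in for a dataframe. The sample data and the helper names (`column`, `mean`, `groupBy`) are hypothetical and do not come from any of the libraries discussed in this post.

```javascript
// Hypothetical sample data: each object is one row, each key is a column.
const rows = [
  { symbol: 'AAA', day: 1, price: 10.0 },
  { symbol: 'AAA', day: 2, price: 12.0 },
  { symbol: 'BBB', day: 1, price: 20.0 },
  { symbol: 'BBB', day: 2, price: null }, // missing observation
];

// Indexing: select a column by name.
const column = (data, name) => data.map((row) => row[name]);

// Numerical analysis with missing data: a mean that skips null/undefined/NaN.
const mean = (values) => {
  const present = values.filter(
    (v) => v !== null && v !== undefined && !Number.isNaN(v)
  );
  if (present.length === 0) return NaN;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
};

// Reshaping: group rows by the value of one column.
const groupBy = (data, name) => {
  const groups = new Map();
  for (const row of data) {
    const key = row[name];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return groups;
};

// Filtering: keep only the rows for day 1.
const dayOne = rows.filter((row) => row.day === 1);

// Groupby plus aggregation: mean price per symbol, ignoring the gap.
const meanPriceBySymbol = {};
for (const [symbol, group] of groupBy(rows, 'symbol')) {
  meanPriceBySymbol[symbol] = mean(column(group, 'price'));
}
```

Even this small amount of hand-rolled code has to make decisions about missing values and grouping keys, which is exactly the work a dataframe library is expected to take on.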

Let’s look at some options for dataframe-like operations in JavaScript.

pandas-js

[https://github.com/StratoDem/pandas-js]

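import Immutable from 'immutable';
import { DataFrame, Series } from 'pandas-js';

const df = new DataFrame(Immutable.Map({x: new Series([1, 2]), y: new Series([2, 3])}));

// Filter with a boolean Series
// Returns DataFrame(Immutable.Map({x: Series([2]), y: Series([3])}));
df.filter(df.get('x').gt(1));

// Filter with a plain boolean array
// Returns DataFrame(Immutable.Map({x: Series([2]), y: Series([3])}));
df.filter([false, true]);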

// Returns DataFrame(Immutable.Map({x: Series([2]), y: Series([3])}));
df.filter(Immutable.List([false, true]));
 

pandas-js is an experimental library mimicking the Python pandas API in JavaScript. The Python pandas library is built on top of NumPy for its data storage. pandas-js mirrors this structure by building on top of immutable.js. It is well documented and aims to implement a large subset of pandas functionality. The documentation even references the equivalent pandas functions, and the internals also resemble the internal structure of pandas, e.g. a DataFrame is a subclass of an NDFrame (n-dimensional data structure). It has been open source since late 2016 and is being actively committed to. The code quality of the project looks good and it includes an extensive test suite. It has just over 150 stars (as of May 2018), so it has yet to gain widespread popularity.

It implements many of the pandas numerical analysis methods and reshaping operations. The library’s joining and index matching abilities are currently limited, and it has only limited support for missing values.

Attempting to reproduce the pandas API is both a pro and a con in my opinion. Leveraging the pandas API and name gives the project name recognition and a clear focus. However, JavaScript is not Python, and following the Python API may end up being a limiting factor. It could also be argued that the pandas dataframe API is somewhat bloated and should not be repeated.

Ubique

[https://github.com/maxto/ubique]

// set variables
var x = [0.003,0.026,0.015,-0.009,0.014,0.024,0.015,0.066,-0.014,0.039];
var y = [-0.005,0.081,0.04,-0.037,-0.061,0.058,-0.049,-0.021,0.062,0.058];
var z = [0.04,-0.022,0.043,0.028,-0.078,-0.011,0.033,-0.049,0.09,0.087];

// Concatenate X,Y and Z along columns, returns a matrix W with size 10x3
var W = ubique.cat(1,x,y,z);
// [ [ 0.003, -0.005, 0.04 ],
//   [ 0.026, 0.081, -0.022 ],
//   [ 0.015, 0.04, 0.043 ],
//   [ -0.009, -0.037, 0.028 ],
//   [ 0.014, -0.061, -0.078 ],
//   [ 0.024, 0.058, -0.011 ],
//   [ 0.015, -0.049, 0.033 ],
//   [ 0.066, -0.021, -0.049 ],
//   [ -0.014, 0.062, 0.09 ],
//   [ 0.039, 0.058, 0.087 ] ]

// Get statistics for matrix W
ubique.size(W) // size of the matrix
ubique.nrows(W) // number of rows
ubique.ncols(W) // number of columns
ubique.mean(W) // average value for columns
ubique.std(W) // standard deviation (sample)
 

Ubique is closer to a NumPy implementation than to a full dataframe implementation. It supports vectors and matrices but, crucially, does not support named columns. Ubique’s real strength is its implementation of many useful numeric and financial functions. Everything from kurtosis to the Sortino ratio is included. It is well tested but unfortunately no longer developed. One interesting design decision is that functions are not methods on the Matrix object itself, which allows a Matrix to have a simpler API. This may also lend itself to picking and choosing only the functions you need, to keep the code size down.
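
The design decision described above, standalone functions rather than methods on a Matrix object, can be illustrated with a short sketch. This is not Ubique's source code; `ncols`, `colMeans` and `colStds` are hypothetical helpers that operate on a plain array of rows.

```javascript
// Hypothetical helpers, not Ubique's implementation.
// A "matrix" here is just an array of equal-length row arrays.

// Number of columns (0 for an empty matrix).
const ncols = (matrix) => (matrix.length === 0 ? 0 : matrix[0].length);

// Mean of each column.
const colMeans = (matrix) => {
  const sums = new Array(ncols(matrix)).fill(0);
  for (const row of matrix) {
    row.forEach((value, j) => { sums[j] += value; });
  }
  return sums.map((sum) => sum / matrix.length);
};

// Sample standard deviation (n - 1 denominator) of each column.
const colStds = (matrix) => {
  const means = colMeans(matrix);
  const squares = new Array(ncols(matrix)).fill(0);
  for (const row of matrix) {
    row.forEach((value, j) => { squares[j] += (value - means[j]) ** 2; });
  }
  return squares.map((sq) => Math.sqrt(sq / (matrix.length - 1)));
};

// The data stays a plain array; behaviour lives in separate functions.
const W = [
  [1, 10],
  [2, 20],
  [3, 30],
];
const means = colMeans(W); // [2, 20]
const stds = colStds(W);   // [1, 10]
```

Because each function is independent of the data structure, a bundler could include only the functions a page actually calls, which is the code size benefit suggested above.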

Gauss

[https://github.com/fredrick/gauss]

var Collection = gauss.Collection;
var things = new Collection(
    { type: 1, age: 1 },
    { type: 2, age: 2 },
    { type: 1, age: 3 },
    { type: 2, age: 4 });
things
    .find({ type: 2 })
    .map(function(thing) { return thing.age; })
    .toVector() // Scope chained converter, converting mapped collection of ages to Vector
    .sum();

var numbers = new gauss.Vector([8, 6, 7, 5, 3, 0, 9]);
numbers.min();
// As above, but with a callback that receives the result
numbers.min(function(result) {
    result / 2;
    /* Do more things with the minimum */
});
 

Gauss has over 400 stars on GitHub but is no longer under active development. Its main utility is the Vector, a kind of enhanced JavaScript Array with added numeric methods. The Vector object provides a set of basic statistical methods that can be run on its elements. Vector instances can be passed to various binary operations to allow multiplication, addition etc. One feature worth noting is that each function accepts a callback for handling the result, which is a nice adaptation to JavaScript norms. The API also lends itself particularly well to method chaining, which can lead to nice, terse, readable code.
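
The combination of method chaining and optional callbacks described above could be sketched roughly as follows. This is a hypothetical `Vector` class written for illustration, not Gauss's implementation; in particular, returning the callback's result is an assumption of this sketch.

```javascript
// Hypothetical sketch, not Gauss's source code.
// Extending Array means map/filter return Vector instances, so calls chain.
class Vector extends Array {
  // Sum of all elements; the result goes to `callback` if one is supplied.
  sum(callback) {
    const total = this.reduce((acc, value) => acc + value, 0);
    return callback ? callback(total) : total;
  }

  // Smallest element; the result goes to `callback` if one is supplied.
  min(callback) {
    const smallest = this.reduce(
      (acc, value) => (value < acc ? value : acc),
      Infinity
    );
    return callback ? callback(smallest) : smallest;
  }
}

const numbers = Vector.from([8, 6, 7, 5, 3, 0, 9]);

// Chaining: filter returns a Vector, so sum() is available on the result.
const evenSum = numbers.filter((n) => n % 2 === 0).sum(); // 8 + 6 + 0 = 14

// Callback style: the minimum of the shifted values (2) is halved.
const halfMin = numbers.map((n) => n + 2).min((result) => result / 2); // 1
```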

Others worth mentioning

  • MathJS implements Matrix operations but has more of a linear algebra focus.
  • Crossfilter is not really a dataframe implementation but offers very fast filtering and reduction functionality on datasets. This can give a really nice interactive experience. Crossfilter uses sorted indexes to make filtering, histograms and top N lists fast.
  • D3.js and Vega datalib. These might seem odd inclusions here but, despite their focus on visualising data, they can perform efficient filtering, reshaping and reduction operations. They are not dataframe implementations but have some overlapping use cases. The fact that these visualisation projects have needed to create their own data wrangling implementations shows the clear requirement for a solution in this space.
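
To show why a sorted index helps with range filters and top N lists, here is a minimal sketch of the general idea. It is not Crossfilter's implementation or API; `buildIndex`, `lowerBound`, `filterRange` and `topN` are hypothetical names, and the sketch assumes a numeric column.

```javascript
// Hypothetical sketch of a sorted index; not Crossfilter's implementation.

// Row positions ordered by the (numeric) value of `key`. Built once.
function buildIndex(rows, key) {
  return rows.map((_, i) => i).sort((a, b) => rows[a][key] - rows[b][key]);
}

// Binary search: first position in `index` whose value is >= target.
function lowerBound(rows, index, key, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (rows[index[mid]][key] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Rows with min <= value < max, found with two binary searches and a slice.
function filterRange(rows, index, key, min, max) {
  const start = lowerBound(rows, index, key, min);
  const end = lowerBound(rows, index, key, max);
  return index.slice(start, end).map((i) => rows[i]);
}

// The n rows with the largest values, read from the end of the index.
function topN(rows, index, n) {
  const start = Math.max(index.length - n, 0);
  return index.slice(start).reverse().map((i) => rows[i]);
}

const trades = [
  { id: 'a', quantity: 50 },
  { id: 'b', quantity: 5 },
  { id: 'c', quantity: 20 },
  { id: 'd', quantity: 35 },
];
const byQuantity = buildIndex(trades, 'quantity');
const midSized = filterRange(trades, byQuantity, 'quantity', 10, 40); // c, d
const largest = topN(trades, byQuantity, 2); // a, d
```

The sort is paid for once when the data loads; after that, each change to the filter range costs two binary searches rather than a scan of every row.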

What next?

Clearly the implementation of dataframes in JavaScript is a relatively immature space. Given this immaturity and by contrast the maturity of dataframes in other languages, you may ask why even bother with JavaScript dataframes?

In my opinion, the most persuasive use case is a browser-based JavaScript dataframe API. With a JavaScript dataframe the browser can load the data once and then reshape and analyse it on demand. This removes the need for superfluous round trips to the backend, allowing low-latency interactive GUIs to be built. JavaScript performance, even in mobile browsers, is now fast enough to support this kind of analysis on moderately sized datasets.

For modern component-based front-end frameworks, a commonly understood browser-based JavaScript dataframe would be a huge boon. We could create web component libraries to display, plot and explore tabular data without having to reinvent the data manipulation wheel each time. In Python an ecosystem of libraries has grown up around the pandas and NumPy libraries. An ecosystem of display components could grow around a JavaScript dataframe API in a similar way.

For example, component-based JavaScript libraries such as Vue.js, React and Polymer could implement components with dataframe support:

<rangefilter data="myDataFrame" column="quantity"></rangefilter>
<datatable data="myDataFrame"></datatable>
<histogram data="myDataFrame" show-quartiles></histogram>
 

Given the component examples above, and taking the best bits from the libraries discussed, we can collect some features that an ideal JavaScript dataframe would provide.

  • It would provide reshaping and groupby abilities, like pandas and pandas-js.
  • The ability to do fast filtering and indexing would be very useful, perhaps using the clever Crossfilter indexing.
  • It would provide well-tested numeric method implementations with support for missing data. This is a must-have. Ubique provides a leading implementation at the moment.
  • The ideal solution would have a simple, small API. Most methods should not be implemented as part of the dataframe itself, but rather as separate optional modules to help keep the required download size small. This also helps avoid the API explosion that pandas suffers from. Here it should take the best bits from Ubique and Gauss.
  • A JavaScript dataframe should have a JavaScript-first API. It would use the best of JavaScript: a functional style, callbacks, promises, the option to use web workers, typed binary arrays etc.
  • The data needs to get to the browser quickly. Fast from_json and to_json serialisation methods would help with this. In addition, the ability to deserialise binary data, avoiding the overhead of JSON encoding and decoding, would be beneficial.
  • Native Arrow support would be a possibility, as would support for the format used by our open source tool Arctic.
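
A few of the wishlist items above (typed binary arrays, binary deserialisation, missing data support and functions kept outside the dataframe object) can be sketched together. Everything here is an assumption made for illustration: the wire format (equal-length float64 columns packed back to back, with NaN marking a missing value) and the names `encodeColumns`, `decodeColumns` and `nanMean` are hypothetical, and this is not Arrow or the Arctic format.

```javascript
// Hypothetical wire format: every column is numeric, all columns have the
// same length, and they are packed back to back as float64 values.
// NaN marks a missing value.

// Pack named columns into one ArrayBuffer (the "server side" of the sketch).
function encodeColumns(columns, names) {
  const nrows = columns[names[0]].length;
  const packed = new Float64Array(names.length * nrows);
  names.forEach((name, i) => packed.set(columns[name], i * nrows));
  return packed.buffer;
}

// Rebuild a column-oriented frame from the buffer without copying:
// each column is a Float64Array view onto the same underlying bytes.
function decodeColumns(buffer, names, nrows) {
  const frame = {};
  names.forEach((name, i) => {
    const byteOffset = i * nrows * Float64Array.BYTES_PER_ELEMENT;
    frame[name] = new Float64Array(buffer, byteOffset, nrows);
  });
  return frame;
}

// A standalone numeric function (not a method on the frame) that skips
// missing values.
function nanMean(values) {
  let sum = 0;
  let count = 0;
  for (const value of values) {
    if (!Number.isNaN(value)) {
      sum += value;
      count += 1;
    }
  }
  return count === 0 ? NaN : sum / count;
}

const names = ['quantity', 'price'];
const buffer = encodeColumns(
  { quantity: [10, 20, NaN, 30], price: [1.5, 2.5, 3.5, 4.5] },
  names
);
const frame = decodeColumns(buffer, names, 4);
const meanQuantity = nanMean(frame.quantity); // 20, the NaN is skipped
const meanPrice = nanMean(frame.price);       // 3
```

In a browser the buffer would typically come from a fetch response's arrayBuffer() rather than from `encodeColumns`, and the column names and row count would need to travel in a small header or a separate metadata request.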

This is quite a wishlist! I think pandas-js with a slight change in focus might get there. But a clean slate project using the best of each existing solution might be a better bet in the long run.