TechAscent - Delightful Software Solutions
2022-03-15

Java Data Science: Getting Started

Our library, tech.ml.dataset, can be used from Java to simplify common tasks when writing data-intensive programs. In particular, the following are all made much easier:

  • Loading datasets from (and saving datasets to) a variety of formats
  • Manipulating in-memory (and out of core) datasets in a database-like way
  • Computing summary statistics

What follows is a minimal tutorial.

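This post covers some of the highlights from a complete working example, which is itself only a 39-line Java program (counting comments). The example uses Maven to manage dependencies and shows how to get started using the library with minimal ceremony. The snippets below call library functions by their bare names, which in Java implies that the complete example imports them statically.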

Loading a dataset is a one-liner (Javadoc makeDataset):

Map ds = makeDataset("https://github.com/scicloj/metamorph-examples/raw/main/data/titanic/train.csv");

In this case a URL string is used to obtain the dataset from the internet, but the input could also be a local file. Supported formats include csv, tsv, and json, as well as arrow, parquet, and xls with additional library support. Compressed files are also supported (e.g., .tsv.gz).

Loading a dataset never requires more than one line of code, any factory object instances, or the incidental complexity of a builder pattern.
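
For contrast, here is a rough sketch of reading a gzip-compressed, tab-separated input by hand using only the JDK. It is not how tech.ml.dataset works internally. The class and method names (ManualGzipTsv, readTsvGz, gzip) are hypothetical, the data is made up, and the parsing is deliberately naive (no quoting, no type inference, no missing-value handling).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ManualGzipTsv {
    // Decompresses the stream and splits each line on tabs.
    // Every row, including the header, is returned as a list of raw strings.
    static List<List<String>> readTsvGz(InputStream compressed) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(Arrays.asList(line.split("\t", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Helper that gzips a string in memory so the sketch needs no files.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content standing in for a real .tsv.gz file.
        byte[] data = gzip("id\tname\n1\talpha\n2\tbeta\n");
        List<List<String>> rows = readTsvGz(new ByteArrayInputStream(data));
        System.out.println(rows.size() + " rows, header = " + rows.get(0));
    }
}
```

Even this stripped-down version leaves every value as a string, so column typing and missing values would still need to be handled separately.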


Printing a sample of the dataset is also a one-liner, and the printed output is formatted nicely:

System.out.println(sample(ds, 10));

Producing output (of the ten randomly sampled rows) like:

https://github.com/scicloj/metamorph-examples/raw/main/data/titanic/train.csv [10 12]:

| PassengerId | Survived | Pclass |                            Name |    Sex |  Age | SibSp | Parch |    Ticket |    Fare | Cabin | Embarked |
|------------:|---------:|-------:|---------------------------------|--------|-----:|------:|------:|-----------|--------:|-------|----------|
|         453 |        0 |      1 | Foreman, Mr. Benjamin Laventall |   male | 30.0 |     0 |     0 |    113051 | 27.7500 |  C111 |        C |
|         272 |        1 |      3 |    Tornquist, Mr. William Henry |   male | 25.0 |     0 |     0 |      LINE |  0.0000 |       |        S |
|         576 |        0 |      3 |            Patchett, Mr. George |   male | 19.0 |     0 |     0 |    358585 | 14.5000 |       |        S |
|         122 |        0 |      3 |      Moore, Mr. Leonard Charles |   male |      |     0 |     0 | A4. 54510 |  8.0500 |       |        S |
|         159 |        0 |      3 |             Smiljanic, Mr. Mile |   male |      |     0 |     0 |    315037 |  8.6625 |       |        S |
|         644 |        1 |      3 |                 Foo, Mr. Choong |   male |      |     0 |     0 |      1601 | 56.4958 |       |        S |
|         813 |        0 |      2 |       Slemen, Mr. Richard James |   male | 35.0 |     0 |     0 |     28206 | 10.5000 |       |        S |
|         408 |        1 |      2 |  Richards, Master. William Rowe |   male |  3.0 |     1 |     1 |     29106 | 18.7500 |       |        S |
|         597 |        1 |      2 |      Leitch, Miss. Jessie Wills | female |      |     0 |     0 |    248727 | 33.0000 |       |        S |
|         744 |        0 |      3 |               McNamee, Mr. Neal |   male | 24.0 |     1 |     0 |    376566 | 16.1000 |       |        S |

This is a toy dataset, popularized by Kaggle, collecting information about passengers on the Titanic.

Notably, the print format of these tables is compatible with some extended markdown processors. If the one you use is sufficiently advanced, then you may be able to render these tables to (for example) HTML.


One of the simplest questions to ask about this dataset is: what fraction of passengers survived?

Computing and printing this kind of summary is a two-liner (Javadoc sum):

double survivors = sum(column(ds, "Survived"));
System.out.println(String.format("%s out of %s passengers survived.", survivors, rowCount(ds)));

Producing output like:

342.0 out of 891 passengers survived.

Fewer than half: truly a disaster.
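
The question asked for a fraction, while the output above reports raw counts. A minimal plain-Java sketch of the remaining arithmetic follows, with the printed counts hardcoded as inputs. The class and helper names (SurvivalFraction, percent) are hypothetical and not part of tech.ml.dataset.

```java
import java.util.Locale;

public class SurvivalFraction {
    // Hypothetical helper: formats part/total as a percentage with one decimal place.
    static String percent(double part, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * part / total);
    }

    public static void main(String[] args) {
        double survivors = 342.0; // the sum of the "Survived" column printed above
        long passengers = 891;    // the row count printed above
        // prints "38.4% of passengers survived."
        System.out.println(percent(survivors, passengers) + " of passengers survived.");
    }
}
```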


Operating on subsets of a dataset is a common task. This dataset records a "Sex" value for each row. The library function groupByColumn groups a dataset by the values in a column, producing a separate dataset for each distinct value:

Map groups = groupByColumn(ds, "Sex");
Map males = (Map)groups.get("male");
Map females = (Map)groups.get("female");

And then, those new datasets can be operated on similarly:

double maleSurvivors = sum(column(males, "Survived"));
double femaleSurvivors = sum(column(females, "Survived"));
System.out.println(String.format("%s out of %s males survived.", maleSurvivors, rowCount(males)));
System.out.println(String.format("%s out of %s females survived.", femaleSurvivors, rowCount(females)));

Producing output like:

109.0 out of 577 males survived.
233.0 out of 314 females survived.

Women and children in the lifeboats first.
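
To make the idea of grouping concrete, here is a conceptual analogue in plain Java streams: rows are split into one list per distinct value, and each group is then summarized on its own. This is only a sketch of the concept, not the library's implementation. The Passenger type, the method names, and the rows are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupBySketch {
    // Hypothetical row type holding only the two columns used in this sketch.
    static final class Passenger {
        final String sex;
        final int survived; // 1 if the passenger survived, 0 otherwise

        Passenger(String sex, int survived) {
            this.sex = sex;
            this.survived = survived;
        }
    }

    // Splits the rows into one list per distinct "sex" value, keyed by that value.
    static Map<String, List<Passenger>> groupBySex(List<Passenger> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(p -> p.sex, TreeMap::new, Collectors.toList()));
    }

    // Mean of the 0/1 "survived" values, i.e. the fraction that survived (NaN if empty).
    static double survivalRate(List<Passenger> group) {
        return group.stream().mapToInt(p -> p.survived).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only; these are not taken from the Titanic file.
        List<Passenger> rows = Arrays.asList(
                new Passenger("male", 0),
                new Passenger("female", 1),
                new Passenger("male", 1),
                new Passenger("female", 1),
                new Passenger("male", 0));

        for (Map.Entry<String, List<Passenger>> entry : groupBySex(rows).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().size()
                    + " rows, survival rate " + survivalRate(entry.getValue()));
        }
    }
}
```

The same ratio applied to the counts printed above gives 109 / 577 (about 18.9%) for males and 233 / 314 (about 74.2%) for females.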


As a final step in this demonstration, the following two lines write out the grouped datasets in JSON format. Being able to fluently communicate in different formats is a huge boon when doing data-intensive work.

writeDataset(males, "males.json");
writeDataset(females, "females.json");

And a preview of the resulting .json files, viewed in the console:

$ cat males.json | jq | head -n 20
[
  {
    "Embarked": "S",
    "Survived": 0,
    "Pclass": 3,
    "Ticket": "A/5 21171",
    "PassengerId": 1,
    "Parch": 0,
    "Cabin": null,
    "Sex": "male",
    "SibSp": 1,
    "Age": 22,
    "Name": "Braund, Mr. Owen Harris",
    "Fare": 7.25
  },
  {
    "Embarked": "S",
    "Survived": 0,
    "Pclass": 3,
    "Ticket": "373450",
...

$ cat females.json | jq | head -n 20
[
  {
    "Embarked": "C",
    "Survived": 1,
    "Pclass": 1,
    "Ticket": "PC 17599",
    "PassengerId": 2,
    "Parch": 0,
    "Cabin": "C85",
    "Sex": "female",
    "SibSp": 1,
    "Age": 38,
    "Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Fare": 71.2833
  },
  {
    "Embarked": "S",
    "Survived": 1,
    "Pclass": 3,
    "Ticket": "STON/O2. 3101282",
...

Being able to export JSON without ever explicitly defining any classes, types, or serialization logic is nice. The library enables transparently operating on data stored in different formats, while maintaining a strongly typed representation of each column.

Learn More

This post only scratches the surface of what's possible with tech.ml.dataset from Java.

If you are interested, there is much more to see:


