Kotlin Dataframe in 100 Queries¶

  • This is an attempt to create a tutorial for the Kotlin dataframe
  • It is a clone of "SQL for Data Scientists in 100 Queries" by Greg Wilson, with each SQL query example re-written with the dataframe syntax/API
    • edit: the authors have evolved the document, changing the title, removing some of the examples, and eliminating the numbering -- a copy of the original document upon which this tutorial is based can be found here
  • The original caught my eye when it was the top story on Hacker News (Feb 6, 2024) and seemed to be a very good basis for a SQL → dataframe tutorial
  • Huge thank you to Greg and his collaborators for a great source document!
  • This is a static HTML document; however, this repo contains the Jupyter/Kotlin notebook file (.ipynb) if you'd like to run the code yourself and edit it.

Who wrote this?

Andy Goldberg

  • I'm an independent full-stack developer/consultant/database marketer/financial analyst/game inventor
  • My firm's name is Centerfield Nine - a reference to the casino gaming industry, the primary focus of my consulting work
  • https://www.cfnine.com
  • github: @2x2xplz
  • X/twitter: @cfnine

initially published February 2024

  • my motivation for creating this document is described in detail in this essay
  • video of my KotlinConf talk (April 2023, Amsterdam) about dataframe is on YouTube

I am always open to project proposals (and to comments about this site) - please reach out: andy [at] cfnine [dot] com

what is dataframe and when would you use it?

dataframe is a programmatic, in-memory representation of tabular data. It is Kotlin's parallel to Python's pandas or R's dplyr. Another way to think about it is Excel, but using a programming language rather than individual cell formulas. It is an easy-to-use interface for manipulating structured data, much as Excel does -- sorting, filtering, updating, calculating, and much more. While dataframe has its own specific API, almost any Kotlin function can be incorporated into data processing routines. Most dataframe projects start by reading data from a relational database, a CSV file, a JSON data structure, or an Excel sheet.
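
As a taste, a typical pipeline looks something like this (a minimal sketch, run in a Kotlin notebook after `%use dataframe`; the sales.csv file and its region/amount columns are hypothetical):

DataFrame.readCSV("sales.csv")                       // read tabular data into memory
    .filter { "amount"<Double>() > 100.0 }           // keep only rows above a threshold
    .groupBy("region")                               // group, like SQL GROUP BY
    .aggregate { sum { "amount"<Double>() } into "totalAmount" }  // one aggregate row per group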

  • dataframe documentation: https://kotlin.github.io/dataframe/overview.html
  • github repository: https://github.com/Kotlin/dataframe
  • other great tutorials in https://github.com/Kotlin/dataframe/tree/master/examples/notebooks (especially "40 puzzles")

with all of that out of the way... let's go!

In [1]:
// %useLatestDescriptors 
%use dataframe(0.12.1)
%use kandy(0.5.0)
@file:DependsOn("org.xerial:sqlite-jdbc:3.45.1.0")  // SQLite database driver JDBC
@file:DependsOn("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.2")  // JSON serialization
@file:DependsOn("org.ktorm:ktorm-core:3.6.0")  // database query tool & ORM
@file:DependsOn("org.ktorm:ktorm-support-sqlite:3.6.0")  // database query tool & ORM
@file:DependsOn("com.github.doyaaaaaken:kotlin-csv-jvm:1.9.3")  // better CSV reader than DataFrame built-in

1: select constant¶

In [2]:
listOf<Long>(1).toDataFrame().single()
Out[2]:

DataRow: index = 0, columnsCount = 1

value
1

2: reading table from database¶

In [3]:
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.SQLException
import java.sql.Statement
import org.jetbrains.kotlinx.dataframe.io.DatabaseConfiguration

val penguinsConn : Connection = DriverManager.getConnection("jdbc:sqlite:penguins.db")

DataFrame.readSqlTable(penguinsConn, "little_penguins")
Out[3]:

DataFrame: rowsCount = 10, columnsCount = 7

speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieDream37.218.11783900MALE
AdelieDream37.619.31813300FEMALE
GentooBiscoe5015.32205550MALE
AdelieTorgersen37.320.51993775MALE
AdelieBiscoe39.617.71863500FEMALE
GentooBiscoe47.7152164750FEMALE
AdelieDream36.5181823150FEMALE
GentooBiscoe4213.52104150FEMALE
AdelieTorgersen42.119.11954000MALE
GentooBiscoe54.315.72315650MALE
In [4]:
// we can run queries, rather than pulling an entire table
// when dealing with large tables, you'll almost always want to do some filtering or grouping in the database, rather than pulling the entire table
// you can also build views in your database to manipulate data, 
//   and you can also use external DB/ORM tools (ktorm, jOOQ, Exposed, etc) to build more complex queries and pull the data
// personal suggestion is to use the built-in `readSQL` functions only for very basic queries 
DataFrame.readSqlQuery(penguinsConn, "select species, island, sex from little_penguins")
Out[4]:

DataFrame: rowsCount = 10, columnsCount = 3

speciesislandsex
AdelieDreamMALE
AdelieDreamFEMALE
GentooBiscoeMALE
AdelieTorgersenMALE
AdelieBiscoeFEMALE
GentooBiscoeFEMALE
AdelieDreamFEMALE
GentooBiscoeFEMALE
AdelieTorgersenMALE
GentooBiscoeMALE

2a: reading a table from a database and assigning it to a df variable¶

In [5]:
// in this initial pull, we don't know the schema and the result is untyped (DataFrame<*>)
val dfPenguinsNaive : DataFrame<*> = DataFrame.readSqlTable(penguinsConn, "little_penguins")

3: specifying columns with select¶

In [6]:
dfPenguinsNaive.select { cols("species","island","sex") }
Out[6]:

DataFrame: rowsCount = 10, columnsCount = 3

speciesislandsex
AdelieDreamMALE
AdelieDreamFEMALE
GentooBiscoeMALE
AdelieTorgersenMALE
AdelieBiscoeFEMALE
GentooBiscoeFEMALE
AdelieDreamFEMALE
GentooBiscoeFEMALE
AdelieTorgersenMALE
GentooBiscoeMALE

3a: defining a typed schema¶

dataframe API has some functions to generate a schema automatically from a database table:
https://kotlin.github.io/dataframe/readsqldatabases.html#schema-retrieval-for-specific-sql-table
getSchemaForSqlTable()

it can also generate schemas during a Gradle build:
https://kotlin.github.io/dataframe/schemasimportsqlgradle.html
@file:ImportDataSchema
this also supports OpenAPI schemas from a JSON definition
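
for example, something along these lines (a rough sketch based on the docs linked above -- check that page for the exact signature), using the `penguinsConn` connection opened in #2:

DataFrame.getSchemaForSqlTable(penguinsConn, "penguins")  // returns the inferred column names and types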

In [7]:
// here, we define a data class manually and annotate with `@DataSchema` which allows DataFrame to use it as a typed schema

@DataSchema
data class Penguin (   // can also be an interface
    val species	: String,
    val island : String,
    val bill_length_mm : Double?,
    val bill_depth_mm : Double?,
    val flipper_length_mm : Int?,
    val body_mass_g	: Int?,
    val sex : String?
)
In [8]:
val dfPenguins : DataFrame<Penguin> = DataFrame.readSqlTable(penguinsConn, "penguins") as DataFrame<Penguin>
dfPenguins.describe()
Out[8]:

DataFrame: rowsCount = 7, columnsCount = 10

nametypecountuniquenullstopfreqminmedianmax
speciesString?34430Adelie152AdelieChinstrapGentoo
islandString?34430Biscoe168BiscoeDreamTorgersen
bill_length_mmString?344165241.1732.144.459.6
bill_depth_mmString?344812171213.117.321.5
flipper_length_mmString?34456219022172197231
body_mass_gString?344952380012270040506300
sexString?344311MALE168FEMALEMALEMALE
In [9]:
// hey wait a second, this schema doesn't match what we defined! the source SQLite database defines every column as TEXT (no numeric columns)
//   I would have expected `as DataFrame<Penguin>` to throw an error, but it's silent
// however, we can use `convertTo()` to get our schema corrected -- note the `type` column

val dfPenguins : DataFrame<Penguin> = DataFrame.readSqlTable(penguinsConn, "penguins").convertTo<Penguin>()
dfPenguins.describe()
Out[9]:

DataFrame: rowsCount = 7, columnsCount = 12

nametypecountuniquenullstopfreqmeanstdminmedianmax
speciesString34430Adelie152nullnullAdelieChinstrapGentoo
islandString34430Biscoe168nullnullBiscoeDreamTorgersen
bill_length_mmDouble?344165241.100000743.9219305.45958432.10000044.45000059.600000
bill_depth_mmDouble?34481217.0000001217.1511701.97479313.10000017.30000021.500000
flipper_length_mmInt?34456219022200.91520514.061714172197231
body_mass_gInt?3449523800124201.754386801.954536270040506300
sexString?344311MALE168nullnullFEMALEMALEMALE

4: sort with sortBy¶

In [10]:
val dfLittlePenguins : DataFrame<Penguin> = DataFrame.readSqlTable(penguinsConn, "little_penguins") as DataFrame<Penguin>
In [11]:
dfLittlePenguins
    .select { species and sex and island }
    .sortBy { island and sex.desc() }
Out[11]:

DataFrame: rowsCount = 10, columnsCount = 3

speciessexisland
GentooMALEBiscoe
GentooMALEBiscoe
AdelieFEMALEBiscoe
GentooFEMALEBiscoe
GentooFEMALEBiscoe
AdelieMALEDream
AdelieFEMALEDream
AdelieFEMALEDream
AdelieMALETorgersen
AdelieMALETorgersen

5: limit output with take¶

In [12]:
dfPenguins
    .select { species and sex and island }
    .sortBy { species and sex and island }
    .take(10)
Out[12]:

DataFrame: rowsCount = 10, columnsCount = 3

speciessexisland
AdelienullDream
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe

6: page output with drop and take¶

In [13]:
dfPenguins
    .select { species and sex and island }
    .sortBy { species and sex and island }
    .drop(3)
    .take(10)

// `take` is the equivalent of SQL limit -- the result set will not exceed the designated number of rows

// head() will also present a preview of a designated number of rows
Out[13]:

DataFrame: rowsCount = 10, columnsCount = 3

speciessexisland
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe
AdelieFEMALEBiscoe

7: remove duplicates with distinct¶

In [14]:
dfPenguins
    .distinct { species and sex and island }
Out[14]:

DataFrame: rowsCount = 13, columnsCount = 3

speciessexisland
AdelieMALETorgersen
AdelieFEMALETorgersen
AdelienullTorgersen
AdelieFEMALEBiscoe
AdelieMALEBiscoe
AdelieFEMALEDream
AdelieMALEDream
AdelienullDream
ChinstrapFEMALEDream
ChinstrapMALEDream
GentooFEMALEBiscoe
GentooMALEBiscoe
GentoonullBiscoe

8. filter results with filter¶

In [15]:
dfPenguins
    .distinct { species and sex and island }
    .filter { island == "Biscoe" }
Out[15]:

DataFrame: rowsCount = 5, columnsCount = 3

speciessexisland
AdelieFEMALEBiscoe
AdelieMALEBiscoe
GentooFEMALEBiscoe
GentooMALEBiscoe
GentoonullBiscoe

9. more complex filters¶

In [16]:
dfPenguins
    .distinct { species and sex and island }
    .filter { island == "Biscoe" && sex != "MALE" } // note: this includes sex=NULL, SQL does not
Out[16]:

DataFrame: rowsCount = 3, columnsCount = 3

speciessexisland
AdelieFEMALEBiscoe
GentooFEMALEBiscoe
GentoonullBiscoe
In [17]:
// the given example is not very complex, here's another

dfPenguins
    .filter { (species == "Adelie" && (bill_depth_mm ?: 0.0) >= 21.0) || 
                (island == "Biscoe" && sex == "FEMALE" && ((body_mass_g ?: 0) > 5000 || (bill_length_mm ?: 0.0) > 48.0)) }
Out[17]:

DataFrame: rowsCount = 17, columnsCount = 7

speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieTorgersen38.60000021.2000001913800MALE
AdelieTorgersen34.60000021.1000001984400MALE
AdelieTorgersen46.00000021.5000001944200MALE
AdelieDream39.20000021.1000001964150MALE
AdelieDream42.30000021.2000001914150MALE
AdelieBiscoe41.30000021.1000001954400MALE
GentooBiscoe48.70000014.1000002104450FEMALE
GentooBiscoe48.20000014.3000002104600FEMALE
GentooBiscoe45.10000014.5000002075050FEMALE
GentooBiscoe49.10000014.8000002205150FEMALE
GentooBiscoe44.90000013.3000002135100FEMALE
GentooBiscoe46.50000014.8000002175200FEMALE
GentooBiscoe49.10000014.5000002124625FEMALE
GentooBiscoe48.40000014.4000002034625FEMALE
GentooBiscoe48.50000015.0000002194850FEMALE
GentooBiscoe50.50000015.2000002165000FEMALE
GentooBiscoe45.20000014.8000002125200FEMALE
In [18]:
// we can incorporate any Kotlin features/functions

// this would likely be represented in SQL with a series of CASE...WHEN...THEN statements

dfPenguins
    .filter { when(species) {
        "Adelie" -> (body_mass_g ?: 0) > 4500 
        "Chinstrap" -> (body_mass_g ?: 0) > 4600 
        "Gentoo" -> (body_mass_g ?: 0) > 5750 
        else -> true
        }
    }
Out[18]:

DataFrame: rowsCount = 19, columnsCount = 7

speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieTorgersen39.20000019.6000001954675MALE
AdelieDream39.80000019.1000001844650MALE
AdelieDream39.60000018.8000001904600MALE
AdelieTorgersen42.90000017.6000001964700MALE
AdelieBiscoe41.00000020.0000002034725MALE
AdelieBiscoe43.20000019.0000001974775MALE
AdelieBiscoe45.60000020.3000001914600MALE
ChinstrapDream52.00000020.7000002104800MALE
GentooBiscoe48.40000014.6000002135850MALE
GentooBiscoe49.30000015.7000002175850MALE
GentooBiscoe49.20000015.2000002216300MALE
GentooBiscoe59.60000017.0000002306050MALE
GentooBiscoe49.50000016.2000002295800MALE
GentooBiscoe48.60000016.0000002305800MALE
GentooBiscoe51.10000016.3000002206000MALE
GentooBiscoe45.20000016.4000002235950MALE
GentooBiscoe49.80000015.9000002295950MALE
GentooBiscoe55.10000016.0000002305850MALE
GentooBiscoe48.80000016.2000002226000MALE

10. perform some math calculations¶

In [19]:
// instead of dealing with NULLs inline, we could 
//   filter out all rows with null values, or use fillNA() to substitute zeroes for nulls 

dfPenguins
    .add("flipper_length_cm") { flipper_length_mm?.div(10.0) }  // `add` is to add a column (not adding numbers)
    .add("body_mass_kg") { body_mass_g?.div(1000.0) }
    .select { "flipper_length_cm" and "body_mass_kg" }
    .take(3)
Out[19]:

DataFrame: rowsCount = 3, columnsCount = 2

flipper_length_cmbody_mass_kg
18.1000003.750000
18.6000003.800000
19.5000003.250000

11. rename columns¶

In [20]:
// dataframe requires all new columns to have a name (see above)
// alternatively, we can define the column type ahead of time, then work with typed column objects
//   strings are fine when accessing new columns once, but defined columns are preferred when referencing them often 

val flipperLengthCM by column<Double?>()
val bodyMassKG by column<Double?>()
val whereFound by column<String>()

dfPenguins
    .add(flipperLengthCM) { flipper_length_mm?.div(10.0) }
    .add(bodyMassKG) { body_mass_g?.div(1000.0) }
    .rename { island }.into(whereFound)
    .select { flipperLengthCM and bodyMassKG and whereFound }
    .dropNulls { bodyMassKG }
    .filter { bodyMassKG()!! < 4.0 }  // note the `()` -- short for `invoke()`

// the `invoke()` is usually needed in filters and when looking at the value of a column, 
//   not needed in select, rename, move, etc, when operating on the DataColumn itself
Out[20]:

... showing only top 20 of 165 rows

DataFrame: rowsCount = 165, columnsCount = 3

flipperLengthCMbodyMassKGwhereFound
18.1000003.750000Torgersen
18.6000003.800000Torgersen
19.5000003.250000Torgersen
19.3000003.450000Torgersen
19.0000003.650000Torgersen
18.1000003.625000Torgersen
19.3000003.475000Torgersen
18.6000003.300000Torgersen
18.0000003.700000Torgersen
18.2000003.200000Torgersen
19.1000003.800000Torgersen
18.5000003.700000Torgersen
19.5000003.450000Torgersen
18.4000003.325000Torgersen
17.4000003.400000Biscoe
18.0000003.600000Biscoe
18.9000003.800000Biscoe
18.5000003.950000Biscoe
18.0000003.800000Biscoe
18.7000003.800000Biscoe

12. calculate with missing values¶

In [21]:
// since Kotlin enforces nullability, there is no way to NOT deal with nulls -- this is identical to #11 above, without the filter

// the columns were defined in the above cell

dfPenguins
    .add(flipperLengthCM) { flipper_length_mm?.div(10.0) }
    .add(bodyMassKG) { body_mass_g?.div(1000.0) }
    .rename { island }.into(whereFound)
    .select { flipperLengthCM and bodyMassKG and whereFound }
    .take(5)
Out[21]:

DataFrame: rowsCount = 5, columnsCount = 3

flipperLengthCMbodyMassKGwhereFound
18.1000003.750000Torgersen
18.6000003.800000Torgersen
19.5000003.250000Torgersen
nullnullTorgersen
19.3000003.450000Torgersen
In [22]:
// one option is to use fillNulls() or fillNA() to replace NULL with 0

// BE CAREFUL! null is not zero, and you may get results that shouldn't be part of the data set

dfPenguins
    .fillNA { flipper_length_mm and body_mass_g }.with { 0 }
    .add(flipperLengthCM) { flipper_length_mm?.toDouble()?.div(10.0) }
    .add(bodyMassKG) { body_mass_g?.toDouble()?.div(1000.0) }
    .rename { island }.into(whereFound)
    .select { flipperLengthCM and bodyMassKG and whereFound }
    .filter { bodyMassKG()!! < 4.0 }  // this will not filter the rows which were originally NULL, because they are now 0.0
    //.take(5)
Out[22]:

... showing only top 20 of 167 rows

DataFrame: rowsCount = 167, columnsCount = 3

flipperLengthCMbodyMassKGwhereFound
18.1000003.750000Torgersen
18.6000003.800000Torgersen
19.5000003.250000Torgersen
0.0000000.000000Torgersen
19.3000003.450000Torgersen
19.0000003.650000Torgersen
18.1000003.625000Torgersen
19.3000003.475000Torgersen
18.6000003.300000Torgersen
18.0000003.700000Torgersen
18.2000003.200000Torgersen
19.1000003.800000Torgersen
18.5000003.700000Torgersen
19.5000003.450000Torgersen
18.4000003.325000Torgersen
17.4000003.400000Biscoe
18.0000003.600000Biscoe
18.9000003.800000Biscoe
18.5000003.950000Biscoe
18.0000003.800000Biscoe

13. null equality¶

In [23]:
dfPenguins
    .distinct { species and sex and island }
    .filter { island == "Biscoe" }
Out[23]:

DataFrame: rowsCount = 5, columnsCount = 3

speciessexisland
AdelieFEMALEBiscoe
AdelieMALEBiscoe
GentooFEMALEBiscoe
GentooMALEBiscoe
GentoonullBiscoe
In [24]:
dfPenguins
    .distinct { species and sex and island }
    .filter { island == "Biscoe" && sex == "FEMALE" }
Out[24]:

DataFrame: rowsCount = 2, columnsCount = 3

speciessexisland
AdelieFEMALEBiscoe
GentooFEMALEBiscoe

14. null inequality (SQL != Kotlin)¶

In [25]:
// be careful, null equality works differently in Kotlin than in SQL

dfPenguins
    .distinct { species and sex and island }
    .filter { island == "Biscoe" && sex != "FEMALE" }   // the SQL example does NOT include the null sex record
Out[25]:

DataFrame: rowsCount = 3, columnsCount = 3

speciessexisland
AdelieMALEBiscoe
GentooMALEBiscoe
GentoonullBiscoe

15. ternary logic (SQL != Kotlin)¶

In [26]:
null == null  // this is null, not true, in SQL ... anything compared to null returns null
Out[26]:
true

16. null conditions¶

In [27]:
dfPenguins
    .select { species and sex and island }
    .filter { sex == null }
Out[27]:

DataFrame: rowsCount = 11, columnsCount = 3

speciessexisland
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelienullTorgersen
AdelienullDream
GentoonullBiscoe
GentoonullBiscoe
GentoonullBiscoe
GentoonullBiscoe
GentoonullBiscoe

17. aggregations¶

In [28]:
dfPenguins.sumFor { body_mass_g }
Out[28]:

DataRow: index = 0, columnsCount = 1

body_mass_g
1437000

18. common aggregation functions¶

In [29]:
dfPenguins.aggregate { 
    max { bill_length_mm } into "longestBill"
    min { flipper_length_mm } into "shortestFlipper"
    (mean { bill_length_mm } / mean { bill_depth_mm }) into "weirdRatio"
}
Out[29]:

DataRow: index = 0, columnsCount = 3

longestBillshortestFlipperweirdRatio
59.6000001722.560871
In [30]:
// if you just want the aggregated values, you may prefer a different collection type, rather than a dataframe

mapOf<String, Double> (
    "longestBill" to dfPenguins.max { bill_length_mm },
    "shortestFlipper" to dfPenguins.min { flipper_length_mm }.toDouble(),
    "weirdRatio" to dfPenguins.mean { bill_length_mm } / dfPenguins.mean { bill_depth_mm }
)
Out[30]:
{longestBill=59.6, shortestFlipper=172.0, weirdRatio=2.5608708253064427}

19. counting¶

In [31]:
dfPenguins.aggregate { 
    count() into "countStar"
    count { sex() != null } into "countSpecific"
    valueCounts { sex }.count() into "distinctCount"
}
Out[31]:

DataRow: index = 0, columnsCount = 3

countStarcountSpecificdistinctCount
3443332
In [32]:
// as a Map

mapOf<String, Int> (
    "countStar" to dfPenguins.count(),
    "countSpecific" to dfPenguins.count { sex != null },
    "distinctCount" to dfPenguins.valueCounts { sex }.count()
)
Out[32]:
{countStar=344, countSpecific=333, distinctCount=2}
In [33]:
// valueCounts can auto-group then count

dfPenguins.valueCounts { species and sex }

/*

select species, sex, count(*)
from penguins
group by species, sex

*/
Out[33]:

DataFrame: rowsCount = 6, columnsCount = 3

speciessexcount
AdelieMALE73
AdelieFEMALE73
GentooMALE61
GentooFEMALE58
ChinstrapFEMALE34
ChinstrapMALE34

20. group, 21. behavior of unaggregated columns¶

In [34]:
// dataframe automatically returns the grouped column (SQL does not, unless specified)

dfPenguins
    .groupBy { sex }
    .mean { body_mass_g }
Out[34]:

DataFrame: rowsCount = 3, columnsCount = 2

sexbody_mass_g
MALE4545.684524
FEMALE3862.272727
null4005.555556

22. arbitrary choice in aggregation¶

In [35]:
// this is a mistake, don't do it

23. filter aggregated values¶

In [36]:
// SQL has different keywords -- WHERE and HAVING -- for filtering before, or after, the aggregation
// Kotlin dataframe just has filter() for both, which can be confusing

// dataframe processes statements in order, so a filter() before grouping is like WHERE
//   and a filter() after grouping is like HAVING

// note: SQL does NOT process things top-to-bottom -- more like FROM (including JOINs), WHERE, GROUP BY, HAVING, SELECT, ORDER BY

dfPenguins
    .groupBy { sex }
    .mean { body_mass_g }
    .filter { "body_mass_g"<Double>() > 4000.0 } // referring to a new, aggregated column, so we must refer to it by string

// important to recognize that after the grouping, even though the mean column is titled "body_mass_g" we cannot refer to it directly
// body_mass_g still refers to the underlying individual data points
// however referring to the new column by its string title will work
Out[36]:

DataFrame: rowsCount = 2, columnsCount = 2

sexbody_mass_g
MALE4545.684524
null4005.555556

24. readable output¶

In [37]:
// we can define the aggregated column ahead of time, ensuring the new column is renamed (and not deal with string column names)

// note that the order of the output is arbitrary unless specified with sortBy() (ORDER BY in SQL)

val avgBodyMass by column<Double>()

dfPenguins
    .groupBy { sex }
    .aggregate { 
        mean { body_mass_g } into (avgBodyMass)
    }
    .filter { avgBodyMass > 4000.0 }
Out[37]:

DataFrame: rowsCount = 2, columnsCount = 2

sexavgBodyMass
MALE4545.684524
null4005.555556

25. filter aggregate inputs¶

In [38]:
// the "100 SQL Queries" source uses a SQLite-only syntax, `filter` keyword within in the SELECT statement

val avgBodyMass by column<Double>()

dfPenguins
    .dropNulls { body_mass_g }
    .filter { body_mass_g!! < 4000.0 }
    .groupBy { sex }
    .aggregate { 
        mean { body_mass_g } into (avgBodyMass)
    }
Out[38]:

DataFrame: rowsCount = 3, columnsCount = 2

sexavgBodyMass
MALE3729.629630
FEMALE3417.289720
null3362.500000
In [39]:
// wait a second, our result includes MALE = 3729.6, the source has 3752.5 -- why the difference?

// remember when we noted that the source data had all its data as TEXT, not numeric fields?

// when comparing text, "4000" < "4000.0"

// therefore, the source result set is actually including all the rows where body_mass_g is exactly 4000

dfPenguins
    .dropNulls { body_mass_g }
    .filter { body_mass_g!! <= 4000 } // gte result = 3752.54 (59 records), gt only result = 3729.62 (54 records)
    .filter { sex == "MALE" }
    .describe()

// when we use `body_mass_g!! <= 4000` the mean for body_mass_g matches the source result
// when we use `body_mass_g!! < 4000` the mean for body_mass_g matches our result
Out[39]:

DataFrame: rowsCount = 7, columnsCount = 12

nametypecountuniquenullstopfreqmeanstdminmedianmax
speciesString5920Adelie39nullnullAdelieAdelieChinstrap
islandString5930Dream35nullnullBiscoeDreamTorgersen
bill_length_mmDouble5946040.600000443.5796615.60427136.30000041.10000055.800000
bill_depth_mmDouble5925018.700000418.9694920.87538317.00000018.90000021.200000
flipper_length_mmInt592501907192.7966107.441217178193210
body_mass_gInt59210395083752.542373198.740371325038004000
sexString5910MALE59nullnullMALEMALEMALE
In [40]:
// since the source rounds the result to 1 decimal place, we'll do that here

// rounding to an arbitrary number of digits is not natively supported in Kotlin; however, we can write an extension function
// (`round` below is kotlin.math.round, which the Kotlin notebook kernel imports by default -- add `import kotlin.math.round` in plain Kotlin code)

fun Double.round(decimals: Int): Double {
    var multiplier = 1.0
    repeat(decimals) { multiplier *= 10 }
    return round(this * multiplier) / multiplier
}

dfPenguins
    .dropNulls { body_mass_g }
    .filter { body_mass_g!! < 4000.0 }
    .groupBy { sex }
    .aggregate { 
        mean { body_mass_g } into (avgBodyMass)
    }
    .update(avgBodyMass).with { it.round(1) }
Out[40]:

DataFrame: rowsCount = 3, columnsCount = 2

sexavgBodyMass
MALE3729.600000
FEMALE3417.300000
null3362.500000

26. create tables / build a dataframe, 27. insert data¶

In [41]:
// build df directly from data

val dfJob = dataFrameOf("name", "billable") (
    "calibrate", 1.5,
    "clean", 0.5
)
dfJob
Out[41]:

DataFrame: rowsCount = 2, columnsCount = 2

namebillable
calibrate1.500000
clean0.500000
In [42]:
// create columns, then assemble into df

val person by columnOf<String>("mik","mik","mik","po","po","tay")
val job by columnOf<String>("calibrate","clean","complain","clean","complain","complain")

val dfWork = listOf(person, job).toDataFrame()
dfWork
Out[42]:

DataFrame: rowsCount = 6, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
taycomplain

28. update rows with, yes, update¶

In [43]:
dfWork
    .update { person }
    .where{ it == "tay" }
    .with { "tae" }
Out[43]:

DataFrame: rowsCount = 6, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
taecomplain
In [44]:
// note one important difference between SQL and Kotlin dataframes 
// SQL is an interface to a persistent data store -- every INSERT/UPDATE/DELETE/DROP/CREATE/etc operation makes persistent changes

// in the source SQL, after performing the UPDATE query changing "tay" to "tae", the new name is persistent
//   there is nobody named "tay" anymore
// with dataframe, we are dealing with immutable objects -- we can perform some manipulation in a chained statement, 
//   but unless we assign the output to a new variable, those changes do not persist at all

dfWork  // does not reflect the changes we just made above
Out[44]:

DataFrame: rowsCount = 6, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
taycomplain
In [45]:
// only if we assign the result to a new variable will the changes persist

val dfWorkUpdated : DataFrame<*> = dfWork.update { person }.where{ it == "tay" }.with { "tae" }
dfWorkUpdated
Out[45]:

DataFrame: rowsCount = 6, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
taecomplain

29. delete rows with drop¶

In [46]:
// note we are dropping the original "tay" (not "tae") because the `update` above did not persist (and this `drop` will not persist either)
dfWork.drop { person == "tay" }
Out[46]:

DataFrame: rowsCount = 5, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain

30. backing up¶

dataframe can read from a database; however, it cannot write back to it

it can output to CSV, JSON, Excel or Apache Arrow: see https://kotlin.github.io/dataframe/write.html

it can also output to a standard Kotlin list or a map

you could use var rather than val and re-assign variables based on the output of a function (as pandas and R allow),
but this can be dangerous, especially while iterating during development
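
for example (a small sketch; `writeJson` is covered in the write docs linked above, and `toListOf` converts rows back to plain Kotlin objects):

dfWork.writeJson("./dataframeWorkExample.json")                   // JSON instead of CSV
val penguinList: List<Penguin> = dfPenguins.toListOf<Penguin>()   // back to plain Kotlin objects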

In [47]:
// these new variables are not truly "persistent" (not saved on disk anywhere) but can be referenced

val dfWorkTayBackup = dfWork.filter { person == "tay" }
val dfWorkExcludeTay = dfWork.drop { person == "tay" }
In [48]:
dfWorkTayBackup
Out[48]:

DataFrame: rowsCount = 1, columnsCount = 2

personjob
taycomplain
In [49]:
dfWorkExcludeTay
Out[49]:

DataFrame: rowsCount = 5, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
In [50]:
dfWorkExcludeTay.writeCSV("./dataframeWorkExample")  // save to disk as CSV
In [51]:
DataFrame.read("./dataframeWorkExample")  // read from CSV
Out[51]:

DataFrame: rowsCount = 5, columnsCount = 2

personjob
mikcalibrate
mikclean
mikcomplain
poclean
pocomplain
In [52]:
val dfWorkSchema = listOf(person, job).toDataFrame().take(0) // copy the schema but not the data
dfWorkSchema
Out[52]:

DataFrame: rowsCount = 0, columnsCount = 2

personjob
In [53]:
dfWork.xs("tay") { person }
Out[53]:

DataFrame: rowsCount = 1, columnsCount = 1

job
complain
In [54]:
// note: this cell fails -- `dfWorkBackup` was never declared above (we created `dfWorkTayBackup` and `dfWorkSchema` instead);
//   declaring it first, e.g. `var dfWorkBackup = dfWorkSchema`, would let this concat work
dfWorkBackup = dfWorkBackup.concat(
    dfWork.filter { person == "tay" }
)
dfWorkBackup
Line_78.jupyter.kts (1:1 - 13) Unresolved reference: dfWorkBackup
Line_78.jupyter.kts (1:16 - 28) Unresolved reference: dfWorkBackup
Line_78.jupyter.kts (4:1 - 13) Unresolved reference: dfWorkBackup
In [55]:
// using a `var` allows you to replace the dataframe with a modified one after an operation

var dfWorkVar = dfWork.filter { true }
dfWorkVar = dfWorkVar.update(person).with { it.uppercase() }.sortBy { job and person }  // re-assigning the variable
dfWorkVar
Out[55]:

DataFrame: rowsCount = 6, columnsCount = 2

personjob
MIKcalibrate
MIKclean
POclean
MIKcomplain
POcomplain
TAYcomplain

Side note

Just about everything up to this point consists of very basic database operations. Filtering, sorting, aggregating, grouping, and updating will almost always be faster directly in the database. It is rarely optimal to transfer an entire table over the wire when the operation is going to reduce the number of rows significantly. In these cases, executing basic SQL queries is often preferable to using a dataframe. Of course, anytime you execute SQL you need to sanitize it (to avoid SQL injection, even accidentally), but this is relatively easy with almost any query tool or ORM.
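
For instance, a parameterized query with plain JDBC keeps user input out of the SQL text itself (a minimal sketch; the same idea applies through ktorm or any other query tool):

val stmt = penguinsConn.prepareStatement("select species, island, sex from penguins where island = ?")
stmt.setString(1, "Biscoe")      // the driver handles escaping, preventing injection
val rs = stmt.executeQuery()     // this ResultSet can then be mapped to rows (or the whole query handed to an ORM)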

Another consideration is whether you need a full dataframe, as opposed to existing Kotlin collections. Collections like Lists can easily filter, sort, calculate aggregates, etc. Dataframe gets more compelling when the complexity increases.
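
For instance, the same kind of filter/group/aggregate can be done on a plain Kotlin collection (a minimal sketch, assuming `toList()` materializes the typed `Penguin` rows from #3a):

dfPenguins.toList()
    .filter { it.island == "Biscoe" && (it.body_mass_g ?: 0) > 4000 }        // plain stdlib filter
    .groupBy { it.species }                                                   // Map<String, List<Penguin>>
    .mapValues { (_, rows) -> rows.mapNotNull { it.body_mass_g }.average() }  // mean mass per species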

Joins (the next group of queries) are a little more difficult to write in SQL but, again, are usually faster directly in the database -- especially inner joins, which restrict the number of rows returned. Besides, the conceptual set-theory design -- which join type, which columns, whether to filter before or after the join, etc. -- is the same in SQL or dataframe. However, sometimes you need to join datasets that do not reside in the same database. That's a great use case for dataframe!

31. join tables¶

In [56]:
dfWork.fullJoin(dfJob)
Out[56]:

DataFrame: rowsCount = 12, columnsCount = 4

personjobnamebillable
mikcalibratecalibrate1.500000
mikcalibrateclean0.500000
mikcleancalibrate1.500000
mikcleanclean0.500000
mikcomplaincalibrate1.500000
mikcomplainclean0.500000
pocleancalibrate1.500000
pocleanclean0.500000
pocomplaincalibrate1.500000
pocomplainclean0.500000
taycomplaincalibrate1.500000
taycomplainclean0.500000

32. inner join¶

In [57]:
// simple syntax

dfWork.innerJoin(dfJob) { job match dfJob.name }

// note that the dataframe result only returns one of the matched columns (job, but not name) 
//   because the join condition ensures they are identical
Out[57]:

DataFrame: rowsCount = 3, columnsCount = 3

personjobbillable
mikcalibrate1.500000
mikclean0.500000
poclean0.500000
In [58]:
// alternative syntax, this allows for more complex join conditions
// `it` refers to initial df and `right` refers to the joined df

dfWork.innerJoinWith(dfJob) { it.job == right.name }
Out[58]:

DataFrame: rowsCount = 3, columnsCount = 4

personjobnamebillable
mikcalibratecalibrate1.500000
mikcleanclean0.500000
pocleanclean0.500000

33. aggregate joined data¶

In [59]:
dfWork
    .innerJoin(dfJob) { job match dfJob.name }
    .groupBy { person }
    .aggregate {
        sum { "billable"<Double>() } into ("pay")
    }
Out[59]:

DataFrame: rowsCount = 2, columnsCount = 2

personpay
mik2.000000
po0.500000

34. left join¶

both SQL and dataframe support multiple JOIN types

there are many tutorials online explaining the different types, such as:
https://www.atlassian.com/data/sql/sql-join-types-explained-visually
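
dataframe mirrors these with dedicated functions -- `innerJoin`, `leftJoin` and `fullJoin` appear in this notebook, and the library also offers `rightJoin` plus `filterJoin`/`excludeJoin` for semi/anti joins -- a quick sketch:

dfWork.rightJoin(dfJob) { job match dfJob.name }     // every dfJob row, nulls where dfWork has no match
dfWork.excludeJoin(dfJob) { job match dfJob.name }   // only the dfWork rows with NO match in dfJob (anti join)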

In [60]:
dfWork
    .leftJoin(dfJob) { job match dfJob.name }
Out[60]:

DataFrame: rowsCount = 6, columnsCount = 3

personjobbillable
mikcalibrate1.500000
mikclean0.500000
mikcomplainnull
poclean0.500000
pocomplainnull
taycomplainnull

35. aggregate left joins¶

In [61]:
dfWork
    .leftJoin(dfJob) { job match dfJob.name }
    .groupBy { person }
    .aggregate {
        sum("billable") into ("pay")
        mean("billable") into ("pay_avg")
    }    

// note that in this case, dataframe auto-coalesces null values to 0 in the SUM aggregation, but not the MEAN
//   you will probably run into issues unless you specifically decide what to do with null values (filter them out, assign a value, etc)
Out[61]:

DataFrame: rowsCount = 3, columnsCount = 3

personpaypay_avg
mik2.0000001.000000
po0.5000000.500000
tay0.000000NaN

36. coalesce values¶

In [62]:
// here, we at least specify the default value in case of a null

dfWork
    .leftJoin(dfJob) { job match dfJob.name }
    .groupBy { person }
    .aggregate {
        sumOf { "billable"<Double?>() ?: 0.80 } into ("pay")
        meanOf { "billable"<Double?>() ?: 0.80 } into ("pay_avg")
    }
Out[62]:

DataFrame: rowsCount = 3, columnsCount = 3

personpaypay_avg
mik2.8000000.933333
po1.3000000.650000
tay0.8000000.800000

37. negate incorrectly¶

In [63]:
// in SQL, all columns, including un-`select`-ed ones, are available for joins, where, etc
//   dataframe processes instructions sequentially, so if you select columns initially, the excluded ones aren't available for `filter`

/*
error:

dfWork
    .distinct { person }
    .filter { it.job != "calibrate" }
    
*/


// as the source says, this is not the result we wanted, because there is a row where 'mik' does 'calibrate'

dfWork
    .filter { job != "calibrate" }
    .distinct { person }
Out[63]:

DataFrame: rowsCount = 3, columnsCount = 1

person
mik
po
tay

38. set membership¶

In [64]:
dfWork
    .filter { ! listOf("mik", "tay").contains(person) }
Out[64]:

DataFrame: rowsCount = 2, columnsCount = 2

personjob
poclean
pocomplain
In [65]:
// using `in` infix

dfWork
    .filter { person !in listOf("mik", "tay") }
Out[65]:

DataFrame: rowsCount = 2, columnsCount = 2

personjob
poclean
pocomplain

39. subqueries¶

In [66]:
// similar to SQL, nested subqueries can sometimes be tricky to read

dfWork
    .filter { person !in 
                dfWork.filter { job == "calibrate" }.getColumn { person } }
    .distinct { person }
Out[66]:

DataFrame: rowsCount = 2, columnsCount = 1

person
po
tay
In [67]:
// generally, it makes more logical sense to write sub-queries individually (similar to SQL CTEs), 
//   especially given a programming language's natural ability to maintain variables, rather than nesting subqueries inside-out, as is common in SQL

val colPerson = dfWork    
    .filter { job == "calibrate" }
    .getColumn { person }

dfWork
    .filter { person !in colPerson }
    .distinct { person }
Out[67]:

DataFrame: rowsCount = 2, columnsCount = 1

person
po
tay
In [68]:
// using `let` to avoid creating a new variable

dfWork    
    .filter { job == "calibrate" }
    .getColumn { person }.let { calibraters ->

        dfWork
            .filter { person !in calibraters }
            .distinct { person }
            
}
Out[68]:

DataFrame: rowsCount = 2, columnsCount = 1

person
po
tay

40. autoincrement and primary key¶

In [69]:
// dataframe does not have a concept of primary key, nor will it enforce uniqueness

// it also does not have built-in sequences or auto-increment fields

val colName by columnOf<String>("mik","po","tay")
val ident by column<Long>()

// here's an emulation
val dfPerson = listOf(colName)
                .toDataFrame()
                .insert(ident) { it.index() + 1 }.at(0)
                // .addId("id")  is an alternative; however, it is 0-based (you can't add 1 to make a 1-based index) and it can't use a column variable
dfPerson
Out[69]:

DataFrame: rowsCount = 3, columnsCount = 2

identcolName
1mik
2po
3tay
In [70]:
// .addId() will automatically create a first column with indexed values (0-based)

listOf(colName)
    .toDataFrame()
    .addId("ident")
Out[70]:

DataFrame: rowsCount = 3, columnsCount = 2

identcolName
0mik
1po
2tay
In [71]:
// dataframe does not enforce any uniqueness constraint

dfPerson.append(2, "frank")
Out[71]:

DataFrame: rowsCount = 4, columnsCount = 2

identcolName
1mik
2po
3tay
2frank

41. alter dataframe schema¶

In [72]:
dfJob  // original
Out[72]:

DataFrame: rowsCount = 2, columnsCount = 2

namebillable
calibrate1.500000
clean0.500000
In [73]:
dfJob.add(ident) { if (name == "calibrate") 1 else if (name == "clean") 2 else 0 }
Out[73]:

DataFrame: rowsCount = 2, columnsCount = 3

namebillableident
calibrate1.5000001
clean0.5000002
In [74]:
// if you had a lot of possible values:

val identMap = mapOf(
    "calibrate" to 1,
    "clean" to 2
    // ... lots more values
)

dfJob.add(ident) { identMap[name] }

// `when` is another option
Out[74]:

DataFrame: rowsCount = 2, columnsCount = 3

namebillableident
calibrate1.5000001
clean0.5000002

42. create new dataframes from old¶

In [75]:
val dfNewWork = dfPerson
    .innerJoin(dfWork) { colName match dfWork.person }
    .innerJoin(dfJob.add(ident) { identMap[name] }) { job match dfJob.name }
    .rename(ident).into("person_id")
    .rename("ident1").into("job_id")
    .select { "person_id" and "job_id" }

dfNewWork
Out[75]:

DataFrame: rowsCount = 3, columnsCount = 2

person_idjob_id
11
12
22

43. remove tables¶

In [76]:
// DROP TABLE doesn't really have an equivalent

44. compare individual values to aggregates¶

In [77]:
dfPenguins.mean { body_mass_g }.let { bodyMassAvg ->
    dfPenguins
        .dropNulls { body_mass_g }
        .filter { body_mass_g!! > bodyMassAvg }
}.select { body_mass_g!! }
.take(5)
Out[77]:

DataFrame: rowsCount = 5, columnsCount = 1

body_mass_g
4675
4250
4400
4500
4650

45. compare individual values to aggregates within groups¶

In [78]:
val bodyMassGroupAvg by column<Double>()

dfPenguins.innerJoin(
    dfPenguins
        .groupBy { species }
        .aggregate { mean { body_mass_g } into (bodyMassGroupAvg) }
).dropNulls { body_mass_g }
 .filter { body_mass_g!! > bodyMassGroupAvg() }
 .select { species and body_mass_g!! and bodyMassGroupAvg }
 .update { bodyMassGroupAvg }.with { it.round(1) }
 .take(5)
Out[78]:

DataFrame: rowsCount = 5, columnsCount = 3

speciesbody_mass_gbodyMassGroupAvg
Adelie37503700.700000
Adelie38003700.700000
Adelie46753700.700000
Adelie42503700.700000
Adelie38003700.700000

46. common table expressions (CTEs)¶

CTEs are a SQL concept where instead of nesting subqueries, we define subqueries ahead of time then reference them in the primary query

they are essentially "syntactic sugar," making a query easier to write and read

more info on CTEs can be found online

In [79]:
// CTEs are essentially the same as pre-defining variables with sub-selects

val dfGroupedPenguins = dfPenguins
        .groupBy { species }
        .aggregate { mean { body_mass_g } into ("bodyMassGroupAvg") }

dfPenguins
    .innerJoin(dfGroupedPenguins)
    .dropNulls { body_mass_g }
    .filter { body_mass_g!! > "bodyMassGroupAvg"<Double>() }
    .select { species and body_mass_g!! and "bodyMassGroupAvg" }
    .take(5)
Out[79]:

DataFrame: rowsCount = 5, columnsCount = 3

speciesbody_mass_gbodyMassGroupAvg
Adelie37503700.662252
Adelie38003700.662252
Adelie46753700.662252
Adelie42503700.662252
Adelie38003700.662252

47. enumerate rows¶

In [80]:
dfPenguins
    .insert("rowID") { index() + 1 }.at(0)
    .select { "rowID" and species and island }
    .take(5)
Out[80]:

DataFrame: rowsCount = 5, columnsCount = 3

rowIDspeciesisland
1AdelieTorgersen
2AdelieTorgersen
3AdelieTorgersen
4AdelieTorgersen
5AdelieTorgersen
In [81]:
dfPenguins
    .select { species and island }
    .addId("rowID")
    .take(5)
Out[81]:

DataFrame: rowsCount = 5, columnsCount = 3

rowIDspeciesisland
0AdelieTorgersen
1AdelieTorgersen
2AdelieTorgersen
3AdelieTorgersen
4AdelieTorgersen

48. if-else function¶

In [82]:
// dataframe doesn't need dedicated keywords, just standard Kotlin

dfPenguins
    .dropNulls { body_mass_g }
    .convert { body_mass_g.castToNotNullable() }.with { if (it < 3500) "small" else "large"}
    // .convert { body_mass_g }.with { if ((it ?: 9999) < 3500) "small" else "large"}
    .rename { body_mass_g }.into("size")
    .groupBy { species and "size"}
    .count()
    .sortBy { species and "count" }

// counts below are slightly different than source SQL because we filtered the nulls
// the source SQL is classifying the NULL values as "large" (because NULL is not < 3500, so it falls to the `else`)

// just another reason to decide how to handle nulls ahead of time
Out[82]:

DataFrame: rowsCount = 5, columnsCount = 3

speciessizecount
Adeliesmall54
Adelielarge97
Chinstrapsmall17
Chinstraplarge51
Gentoolarge123

49. select case with when¶

In [83]:
dfPenguins
    .dropNulls { body_mass_g }
    .convert { body_mass_g.castToNotNullable() }.with { 
        when {
            it < 3500 -> "small"
            it < 5000 -> "medium"
            // ... especially useful when there are lots of cases
            // a map also works, as we demonstrated in #41
            else -> "large"
        }
    }
    .rename { body_mass_g }.into("size")
    .groupBy { species and "size"}
    .count()
    .sortBy { species and "count" }
Out[83]:

DataFrame: rowsCount = 6, columnsCount = 3

speciessizecount
Adeliesmall54
Adeliemedium97
Chinstrapsmall17
Chinstrapmedium51
Gentoomedium56
Gentoolarge67

50. check range¶

In [84]:
dfPenguins
    .dropNulls { it.body_mass_g }
    .convert { body_mass_g.castToNotNullable() }.with { 
        when {
            it in (3500..5000) -> "normal"  // `it` refers to body_mass_g
            else -> "abnormal"
        }
    }
    .rename { body_mass_g }.into("size")
    .groupBy { species and "size"}
    .count()
    .sortBy { species and "count" }
Out[84]:

DataFrame: rowsCount = 6, columnsCount = 3

speciessizecount
Adelieabnormal54
Adelienormal97
Chinstrapabnormal17
Chinstrapnormal51
Gentooabnormal61
Gentoonormal62

51. pattern matching¶

In [85]:
val assaysConn : Connection = DriverManager.getConnection("jdbc:sqlite:assays.db")
val dfStaff = DataFrame.readSqlTable(assaysConn, "staff")
dfStaff
Out[85]:

DataFrame: rowsCount = 10, columnsCount = 5

identpersonalfamilydeptage
1KartikGuptanull46
2DivitDhaliwalhist34
3IndransSridharmb47
4PranayKhannamb51
5RiaanDuanull23
6VedikaRouthist45
7AbramChokshigen23
8RomilKapoorhist38
9IshaanRamaswamymb35
10NityaLalgen52
In [86]:
// Kotlin doesn't have keywords `LIKE` or `GLOB` (although GLOB works in SQLite, it is not standard SQL)
//   but with Regex we can create much more complex matchers if necessary
// We also have every other Kotlin string function (and numeric functions for numeric columns)
//   and, we can define our own functions

dfStaff
    .filter { Regex("ya").containsMatchIn(personal) || Regex("De").containsMatchIn(family) }
    .select { personal and family }
Out[86]:

DataFrame: rowsCount = 1, columnsCount = 2

personalfamily
NityaLal
In [87]:
// defining a custom function
fun startsWithK(test : String) = test.startsWith("K")

dfStaff
    .filter { startsWithK(family) }
    .select { personal and family }
Out[87]:

DataFrame: rowsCount = 2, columnsCount = 2

personalfamily
PranayKhanna
RomilKapoor

52. select first and last rows with head and tail¶

In [88]:
val dfExperiment = DataFrame.readSqlTable(assaysConn, "experiment")
In [89]:
// this also uses `concat` like SQL UNION (strictly UNION ALL, since concat does not de-duplicate)

dfExperiment.sortBy { started }.head(5)
    .concat(dfExperiment.sortBy { started }.tail(5))
    .sortBy { started }
Out[89]:

DataFrame: rowsCount = 10, columnsCount = 4

identkindstartedended
17trial2023-01-292023-01-30
35calibration2023-01-302023-01-30
36trial2023-02-022023-02-03
25trial2023-02-122023-02-14
2calibration2023-02-142023-02-14
40calibration2024-01-212024-01-21
12trial2024-01-262024-01-28
44trial2024-01-272024-01-29
34trial2024-02-012024-02-02
14calibration2024-02-032024-02-03

53. intersection¶

In [90]:
// intersect isn't part of the dataframe API, but it exists in Kotlin; we just need to convert each df to an Iterable with `rows()`
//   then back to a df post-intersect

// note this (either SQL or df) is only usable when the two data sources have the exact same schema

dfStaff.filter { age < 50 }.rows()
    .intersect(dfStaff.filter { dept == "mb" }.rows())
    .toDataFrame()
    .select { personal and family and dept and age }
Out[90]:

DataFrame: rowsCount = 2, columnsCount = 4

personalfamilydeptage
IndransSridharmb47
IshaanRamaswamymb35

54. exclusion¶

In [91]:
// `except` can very often be replaced more effectively by combining filters

dfStaff
    .filter { it.dept == "mb" }
    .filter { ! (it.age < 50) }
    .select { personal and family and dept and age }
Out[91]:

DataFrame: rowsCount = 1, columnsCount = 4

personalfamilydeptage
PranayKhannamb51
In [92]:
// Kotlin's `minus` works like SQL except

// just like `intersect`, we are comparing collections of rows, rather than directly in the dataframe itself

dfStaff.filter { dept == "mb" }.rows()
    .minus(dfStaff.filter { age < 50 }.rows())
    .toDataFrame()
    .select { personal and family and dept and age }
Out[92]:

DataFrame: rowsCount = 1, columnsCount = 4

personalfamilydeptage
PranayKhannamb51

55. random numbers¶

In [93]:
// just using Kotlin standard Random functions

import kotlin.random.Random

dfStaff
    .add {
        "rand" from { Random.nextLong() }
        "selector" from { Random.nextInt(from = 0, until = 9) }
        "name" from { "${it.personal} ${it.family}" }
    }
    .filter { "selector"<Int>() < 5 }
    .select { "rand" and "selector" and "name" }
Out[93]:

DataFrame: rowsCount = 5, columnsCount = 3

randselectorname
-89128891880142211591Kartik Gupta
59769751649773450101Divit Dhaliwal
-27899840308990216340Indrans Sridhar
-92145006215024085584Pranay Khanna
-31805290015184037644Romil Kapoor

56. creating an index¶

neither dataframe nor Kotlin collections support something equivalent to a database index, which is typically implemented via a hash or b-tree (although Kotlin does support binary search if the collection is sorted)

especially when dealing with larger tables, utilizing existing indices to speed up data retrieval is a very good reason to prefer filtering (WHERE) in SQL, before the data reaches the dataframe
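
for example (a tiny sketch of Kotlin's built-in binary search -- it needs a sorted plain list, and is unrelated to the dataframe API):

val sortedMasses = dfPenguins.body_mass_g.toList().filterNotNull().sorted()
sortedMasses.binarySearch(4000)   // index >= 0 when found, a negative insertion point otherwise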

57. generate sequence¶

In [94]:
val mySeries by IntRange(1, 5).toList().toColumn()
mySeries
Out[94]:

DataColumn: name = "mySeries", type = Int, size = 5

mySeries
1
2
3
4
5
In [95]:
IntRange(0, 20).step(2).reversed().toList().toColumn().named("evenNumbers")
Out[95]:

DataColumn: name = "evenNumbers", type = Int, size = 11

evenNumbers
20
18
16
14
12
10
8
6
4
2
0

58. generate sequence based on data¶

In [96]:
// creating a simple df with min value = 1, max value = 15

val dfTemp = dataFrameOf(
    "num" to listOf(1, 15)
)
dfTemp
Out[96]:

DataFrame: rowsCount = 2, columnsCount = 1

num
1
15
In [97]:
// use the above df to create a sequence based on min & max

val series by IntRange(dfTemp.minOf{ "num"<Int>() }, dfTemp.maxOf{ "num"<Int>() }).toList().toColumn()
series
Out[97]:

DataColumn: name = "series", type = Int, size = 15

series
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

59. generate sequence of dates¶

In [98]:
import java.time.LocalDate
import java.time.temporal.ChronoUnit


dfExperiment.minOf{ started }
Out[98]:
2023-01-29
In [99]:
IntRange(0, dfExperiment.count())
    .map { LocalDate.parse(dfExperiment.minOf{ started }).plusDays(it.toLong()) }
    .toColumn("someDays")
    .take(5)
Out[99]:

DataColumn: name = "someDays", type = java.time.LocalDate, size = 5

someDays
2023-01-29
2023-01-30
2023-01-31
2023-02-01
2023-02-02

60. count experiments started per day without gaps¶

the source SQL is not correct (https://gvwilson.github.io/sql-tutorial/, #60)

the limit 5 hides the error... without the limit, the query returns only 50 rows (corresponding to 50 total rows in the table) but the full date range is 2023-01-29 to 2024-02-03

but the point of this query is to return every day without gaps

In [100]:
// create a range of every day (all 371) between the column's min and max values

val colAllDays : DataColumn<LocalDate> = 
    LongRange(0L, ChronoUnit.DAYS.between(
                LocalDate.parse(dfExperiment.minOf{ started }), 
                LocalDate.parse(dfExperiment.maxOf{ started })
        ))
        .map { LocalDate.parse(dfExperiment.minOf{ started }).plusDays(it) }
        .toColumn("day")

colAllDays  // 371 rows
Out[100]:

... showing only top 20 of 371 rows

DataColumn: name = "day", type = java.time.LocalDate, size = 371

day
2023-01-29
2023-01-30
2023-01-31
2023-02-01
2023-02-02
2023-02-03
2023-02-04
2023-02-05
2023-02-06
2023-02-07
2023-02-08
2023-02-09
2023-02-10
2023-02-11
2023-02-12
2023-02-13
2023-02-14
2023-02-15
2023-02-16
2023-02-17
In [101]:
colAllDays.toDataFrame().leftJoinWith(
    dfExperiment
        .groupBy { started }
        .aggregate {
            count() into "experimentCount"
        }
        .convert { started }.with { LocalDate.parse(it) }
    ) { it["day"] == right["started"] } // "it" refers to first/left df, "right" refers to second/right
.select { "day" and "experimentCount" }
.fillNulls("experimentCount").with { 0 }

// 2 days with >1 experiment  -- 2023-09-07 and 2024-01-19
Out[101]:

... showing only top 20 of 371 rows

DataFrame: rowsCount = 371, columnsCount = 2

dayexperimentCount
2023-01-291
2023-01-301
2023-01-310
2023-02-010
2023-02-021
2023-02-030
2023-02-040
2023-02-050
2023-02-060
2023-02-070
2023-02-080
2023-02-090
2023-02-100
2023-02-110
2023-02-121
2023-02-130
2023-02-141
2023-02-151
2023-02-160
2023-02-170

61. self join¶

In [102]:
// simulating the first CTE 'person'

val dfStaffPersons = dfStaff
    .add("name") { "${it.personal} ${it.family}" }
    .select { ident and "name" }

dfStaffPersons
Out[102]:

DataFrame: rowsCount = 10, columnsCount = 2

identname
1Kartik Gupta
2Divit Dhaliwal
3Indrans Sridhar
4Pranay Khanna
5Riaan Dua
6Vedika Rout
7Abram Chokshi
8Romil Kapoor
9Ishaan Ramaswamy
10Nitya Lal
In [103]:
dfStaffPersons
    .joinWith(dfStaffPersons) { true }  // cross join, every row on left joined with every row on right, no match necessary
    .select { "name" and "name1" }
    .head(20)
Out[103]:

DataFrame: rowsCount = 20, columnsCount = 2

namename1
Kartik GuptaKartik Gupta
Kartik GuptaDivit Dhaliwal
Kartik GuptaIndrans Sridhar
Kartik GuptaPranay Khanna
Kartik GuptaRiaan Dua
Kartik GuptaVedika Rout
Kartik GuptaAbram Chokshi
Kartik GuptaRomil Kapoor
Kartik GuptaIshaan Ramaswamy
Kartik GuptaNitya Lal
Divit DhaliwalKartik Gupta
Divit DhaliwalDivit Dhaliwal
Divit DhaliwalIndrans Sridhar
Divit DhaliwalPranay Khanna
Divit DhaliwalRiaan Dua
Divit DhaliwalVedika Rout
Divit DhaliwalAbram Chokshi
Divit DhaliwalRomil Kapoor
Divit DhaliwalIshaan Ramaswamy
Divit DhaliwalNitya Lal

62. generate unique pairs¶

In [104]:
// note the syntax here -- 
// in the filter, the original `ident` column is still accessible, but the new `ident1` (created as a result of the join) must be referred to by String
// same is true in the final select with `name` and `name1`

dfStaffPersons
    .joinWith(dfStaffPersons) { it.ident!! < right.ident!! } // this cross join has conditions, the left ident must be less than the right ident
    .filter { ident() <= 4 && "ident1"<Int>() <= 4 }  
    .select { name and "name1" }
Out[104]:

DataFrame: rowsCount = 6, columnsCount = 2

namename1
Kartik GuptaDivit Dhaliwal
Kartik GuptaIndrans Sridhar
Kartik GuptaPranay Khanna
Divit DhaliwalIndrans Sridhar
Divit DhaliwalPranay Khanna
Indrans SridharPranay Khanna

63. filter pairs¶

In [105]:
// it appears each experiment had up to 2 people work on it
// this subquery matches up each experiment with its two staffers

val dfPerformed = DataFrame.readSqlTable(assaysConn, "performed")

val dfTogether = dfPerformed
    .innerJoin(dfPerformed) { "experiment"<Int>() }
    .filter { "staff"<Int>() < "staff1"<Int>() }
    .move("experiment").toLeft()

dfTogether
Out[105]:

DataFrame: rowsCount = 15, columnsCount = 3

experimentstaffstaff1
316
846
1238
1379
1746
2017
2578
2912
3327
3449
3638
3819
43110
4717
5048
In [106]:
// based on the subquery above, we replace the id's with the staffer's name to get a list of all staff pairs who worked on an experiment together

dfTogether
    .innerJoinWith(dfStaffPersons) { it["staff"] == right["ident"] }
    .innerJoinWith(dfStaffPersons) { it["staff1"] == right["ident"] }
    .select { "name" and "name1" }
    .rename("name").into("person_1")
    .rename("name1").into("person_2")
Out[106]:

DataFrame: rowsCount = 15, columnsCount = 2

person_1person_2
Kartik GuptaVedika Rout
Pranay KhannaVedika Rout
Indrans SridharRomil Kapoor
Abram ChokshiIshaan Ramaswamy
Pranay KhannaVedika Rout
Kartik GuptaAbram Chokshi
Abram ChokshiRomil Kapoor
Kartik GuptaDivit Dhaliwal
Divit DhaliwalAbram Chokshi
Pranay KhannaIshaan Ramaswamy
Indrans SridharRomil Kapoor
Kartik GuptaIshaan Ramaswamy
Kartik GuptaNitya Lal
Kartik GuptaAbram Chokshi
Pranay KhannaRomil Kapoor

64. existence and correlated subqueries¶

In [107]:
val dfDepartment = DataFrame.readSqlTable(assaysConn, "department")
dfDepartment
Out[107]:

DataFrame: rowsCount = 4, columnsCount = 3

identnamebuilding
genGeneticsChesson
histHistologyFashet Extension
mbMolecular BiologyChesson
endEndocrinologyTGVH
In [108]:
dfStaff
Out[108]:

DataFrame: rowsCount = 10, columnsCount = 5

identpersonalfamilydeptage
1KartikGuptanull46
2DivitDhaliwalhist34
3IndransSridharmb47
4PranayKhannamb51
5RiaanDuanull23
6VedikaRouthist45
7AbramChokshigen23
8RomilKapoorhist38
9IshaanRamaswamymb35
10NityaLalgen52
In [109]:
// some of the source examples use rather convoluted syntax -- often, standard joins are clearer and work better
// (see note after #65 in source, "avoiding correlated subqueries")

dfDepartment
    .filter { ident in dfStaff.dept }
    .select { name and building }
Out[109]:

DataFrame: rowsCount = 3, columnsCount = 2

namebuilding
GeneticsChesson
HistologyFashet Extension
Molecular BiologyChesson
In [110]:
// same result with an inner join

dfDepartment.innerJoin(dfStaff) { ident match right.dept }
    .distinct { name and building }
Out[110]:

DataFrame: rowsCount = 3, columnsCount = 2

namebuilding
GeneticsChesson
HistologyFashet Extension
Molecular BiologyChesson

65. non-existence¶

In [111]:
dfDepartment
    .filter { ! (ident in dfStaff.dept) }
    .select { name and building }
Out[111]:

DataFrame: rowsCount = 1, columnsCount = 2

namebuilding
EndocrinologyTGVH

66. lead and lag¶

In [112]:
// lead and lag, or shift and offset, are not part of the dataframe API yet, although an issue has been raised
dfExperiment
Out[112]:

... showing only top 20 of 50 rows

DataFrame: rowsCount = 50, columnsCount = 4

identkindstartedended
1calibration2023-08-252023-08-25
2calibration2023-02-142023-02-14
3trial2023-02-222023-02-24
4trial2023-10-102023-10-11
5calibration2024-01-192024-01-19
6trial2023-02-152023-02-16
7trial2023-06-222023-06-23
8trial2023-05-032023-05-04
9calibration2024-01-192024-01-19
10trial2023-05-232023-05-24
11trial2023-03-172023-03-18
12trial2024-01-262024-01-28
13trial2023-05-252023-05-26
14calibration2024-02-032024-02-03
15trial2023-03-092023-03-11
16calibration2023-05-222023-05-22
17trial2023-01-292023-01-30
18calibration2023-03-152023-03-15
19calibration2023-09-212023-09-21
20trial2023-12-162023-12-17
In [113]:
// first, we group each experiment into the month it was started, and generate a count for each month

val startMonth by column<LocalDate>()

val dfMonthlyCount = dfExperiment
    .add(startMonth) { LocalDate.parse(started).withDayOfMonth(1) } // converting the start date to its month (by setting the day of month to 1)
    .groupBy { startMonth }
    .count()
    .sortBy { startMonth }
    .rename("count").into("experiments")
    .addId()

dfMonthlyCount
Out[113]:

DataFrame: rowsCount = 13, columnsCount = 3

idstartMonthexperiments
02023-01-012
12023-02-015
22023-03-015
32023-04-011
42023-05-016
52023-06-015
62023-07-013
72023-08-012
82023-09-014
92023-10-016
102023-12-014
112024-01-015
122024-02-012
In [114]:
// from df above, we can emulate lead/lag by left-joining on itself, by adding/subtracting 1 from the right-side id

dfMonthlyCount
    .leftJoinWith(dfMonthlyCount) { id == right.id + 1 }
    .leftJoinWith(dfMonthlyCount) { id == right.id - 1 }
    .select { startMonth and "experiments1" and experiments and "experiments2" }
    .rename("experiments1").into("prevMonth")
    .rename("experiments").into("currentMonth")
    .rename("experiments2").into("nextMonth")
Out[114]:

DataFrame: rowsCount = 13, columnsCount = 4

startMonthprevMonthcurrentMonthnextMonth
2023-01-01null25
2023-02-01255
2023-03-01551
2023-04-01516
2023-05-01165
2023-06-01653
2023-07-01532
2023-08-01324
2023-09-01246
2023-10-01464
2023-12-01645
2024-01-01452
2024-02-0152null

67. window functions¶

query #67 is another case where the source document contains a bug

its progress column is linear, rather than being based on the num_done column as intended

in our result, the progress column will be based on the running total

In [115]:
// the df API includes a cumulative sum function; however, we need to calculate it independently and then add it to our dataframe

// create the cumulative sum column
val cumulativeCount : DataColumn<Int> = dfMonthlyCount.experiments.cumSum().named("runningTotal")

dfMonthlyCount
    .add(cumulativeCount)  // add the column
    .add("progress") { "runningTotal"<Int>() / dfMonthlyCount.sumOf{experiments}.toDouble() } // calculate the progress %age inline

// probably better to calculate the sumOf once and assign it to a variable, rather than repeating the calculation for every row (see the sketch after this cell's output)

// note that the `progress` below is based on a %age of the total (50 experiments), and is not linear
Out[115]:

DataFrame: rowsCount = 13, columnsCount = 5

idstartMonthexperimentsrunningTotalprogress
02023-01-01220.040000
12023-02-01570.140000
22023-03-015120.240000
32023-04-011130.260000
42023-05-016190.380000
52023-06-015240.480000
62023-07-013270.540000
72023-08-012290.580000
82023-09-014330.660000
92023-10-016390.780000
102023-12-014430.860000
112024-01-015480.960000
122024-02-012501.000000
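as noted in the comments above, a minimal variation of the same cell that computes the grand total once and reuses it for every row (using the `dfMonthlyCount` and `cumulativeCount` values defined earlier):

val totalExperiments = dfMonthlyCount.sumOf { experiments }.toDouble()  // computed once, reused for every row

dfMonthlyCount
    .add(cumulativeCount)  // the running-total column from above
    .add("progress") { "runningTotal"<Int>() / totalExperiments }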

68. partitioned windows¶

window functions are not directly supported in the dataframe API; however, most can be emulated, often by partitioning the data manually, performing the calculation on each partition, then concat()ing the partitions back together

In [116]:
// in the above query, the calendar month was represented by a single date column
// on this one, we need a column for year and a separate column for month number

val theYear by column<Int>()
val theMonth by column<Int>()

val dfMonthlyCount2 = dfExperiment
                        .add(theYear) { LocalDate.parse(started).getYear() }
                        .add(theMonth) { LocalDate.parse(started).getMonthValue() }
                        .sortBy { theYear and theMonth }
                        .groupBy { theYear and theMonth }
                        .count()
                        .rename("count").into("experiments")

// val annualTotalsMap = dfMonthlyCount2
//                         .groupBy { theYear }
//                         .sumOf { "experiments"<Int>() }
//                         .associate { it.getValue<Int>("theYear") to it.getValue<Int>("sum") }

dfMonthlyCount2  // same data as dfMonthlyCount above
Out[116]:

DataFrame: rowsCount = 13, columnsCount = 3

theYeartheMonthexperiments
202312
202325
202335
202341
202356
202365
202373
202382
202394
2023106
2023124
202415
202422
In [117]:
// this is where we get the key value of each partition (every distinct `year` value)

val years : Set<Int> = dfMonthlyCount2.distinct { theYear }.getColumn { theYear }.toSet() 
years
Out[117]:
[2023, 2024]
In [118]:
// now we take each year, generate a data partition, do the cumulative count calculation on that partition
// and finally re-join all the partitions back into a single dataframe

years
    .map { year -> 
        val dfSubset = dfMonthlyCount2.filter { theYear == year } // creates a partition of data only for the current `year` value
        val cumulativeCount : DataColumn<Int> = dfSubset.experiments.cumSum().named("runningTotal")
        dfSubset.add(cumulativeCount)
    }.concat()
Out[118]:

DataFrame: rowsCount = 13, columnsCount = 4

theYeartheMonthexperimentsrunningTotal
2023122
2023257
20233512
20234113
20235619
20236524
20237327
20238229
20239433
202310639
202312443
2024155
2024227

69. blobs¶

dataframes can contain virtually any type of object in a column... for example File objects
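for instance, a minimal sketch (not from the source) that puts java.io.File objects into a column and derives their sizes:

import java.io.File  // may already be available in the notebook

// hypothetical example: a column of File objects plus a derived size column
val dfFiles = dataFrameOf("file")(File("penguins.db"), File("lab_log.db"))
    .add("sizeBytes") { "file"<File>().length() }

dfFiles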

70. store JSON¶

many relational databases today support a JSON data type which can contain arbitrary, schema-less JSON objects

most of them also provide a syntax for querying that JSON data

Kotlin generally doesn't work too well with untyped data, although kotlinx.serialization provides a JsonObject data type

when working with JSON data in Kotlin, you have many libraries to work with (moshi, jackson, gson, klaxon...) but here we will use kotlinx.serialization

note that the dataframe API can import a JSON document and build a dataframe from it -- here, however, we are handling a different case: reading a JSON column that is part of a relational database table

In [119]:
val labLogConn : Connection = DriverManager.getConnection("jdbc:sqlite:lab_log.db")
val dfMachine = DataFrame.readSqlTable(labLogConn, "machine")
dfMachine
Out[119]:

DataFrame: rowsCount = 3, columnsCount = 3

identnamedetails
1WY401{"acquired": "2023-05-01"}
2Inphormex{"acquired": "2021-07-15", "refurbish...
3AutoPlate 9000{"note": "needs software update"}
In [120]:
// when importing relational data, Kotlin will treat a JSON data column as just a String

dfMachine.describe()
Out[120]:

DataFrame: rowsCount = 3, columnsCount = 12

nametypecountuniquenullstopfreqmeanstdminmedianmax
identInt?330112.0000001.000000123
nameString330WY4011nullnullAutoPlate 9000InphormexWY401
detailsString330{"acquired": "2023-05-01"}1nullnull{"acquired": "2021-07-15", "refurbish...{"acquired": "2023-05-01"}{"note": "needs software update"}

71. select field from JSON¶

dataframe has multiple options for reading/parsing JSON, some of which are undocumented at the time of this writing:

  • parse()
  • readJsonStr()
  • full deserialization strategy with kotlinx.serialization (or alternative library)
In [121]:
// parse is easiest: each JSON field generates a nested column

// note that none of these will automatically convert a source String to a LocalDate - that must be done manually

dfMachine
    .parse { details }
    .convert { details["acquired"] and details["refurbished"] }.with { it?.let { LocalDate.parse(it.toString()) } }
Out[121]:

DataFrame: rowsCount = 3, columnsCount = 3

identnamedetails
1WY401{ acquired:2023-05-01 }
2Inphormex{ acquired:2021-07-15, refurbished:20...
3AutoPlate 9000{ note:needs software update }
In [122]:
// converting with DataRow.readJsonStr() gives the same hierarchical result

dfMachine
    .convert { details }.with { DataRow.readJsonStr(it) }
    .convert { details["acquired"] and details["refurbished"] }.with { it?.let { LocalDate.parse(it.toString()) } }
Out[122]:

DataFrame: rowsCount = 3, columnsCount = 3

identnamedetails
1WY401{ acquired:2023-05-01 }
2Inphormex{ acquired:2021-07-15, refurbished:20...
3AutoPlate 9000{ note:needs software update }
In [123]:
// with kotlinx.serialization, instead of creating individual columns for every field, we can create a single column with a `details` class

// every row's JSON has some null/unspecified values, so the data is messy and we need to be extra careful

import kotlinx.serialization.*
import kotlinx.serialization.json.*
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder


// normally we don't need this much setup, but kotlinx.serialization does not have a built-in LocalDate serializer
object LocalDateSerializer: KSerializer<LocalDate> {

    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("LocalDateSerializer", PrimitiveKind.STRING)

    override fun serialize(encoder: Encoder, value: LocalDate) {
        encoder.encodeString(value.toString())
    }

    override fun deserialize(decoder: Decoder): LocalDate {
        return LocalDate.parse(decoder.decodeString())
    }

}

// this is the schema based on the JSON data in the "details" column of the "machine" source table
@Serializable
@DataSchema
data class MachineDetails (
    @Serializable(with= LocalDateSerializer::class) val acquired : LocalDate? = null,  // nullable because this field is sometimes missing
    @Serializable(with= LocalDateSerializer::class) val refurbished : LocalDate? = null,
    val note : String = ""
)
In [124]:
// our goal is to convert the imported JSON document to a native object

// `decodeFromString()` will create MachineDetails objects from the JSON string

// this will create just one `details` column, not 3 like the above examples

dfMachine
    .convert { details }.with { Json.decodeFromString<MachineDetails>(it) }
Out[124]:

DataFrame: rowsCount = 3, columnsCount = 3

identnamedetails
1WY401MachineDetails(acquired=2023-05-01, r...
2InphormexMachineDetails(acquired=2021-07-15, r...
3AutoPlate 9000MachineDetails(acquired=null, refurbi...
In [125]:
// once we have class instances in a column, we can access their native properties

dfMachine
    .convert { details }.with { Json.decodeFromString<MachineDetails>(it) }
    .add("acquiredYear") { "details"<MachineDetails>().acquired?.year }
    .select("acquiredYear")
Out[125]:

DataFrame: rowsCount = 3, columnsCount = 1

acquiredYear
2023
2021
null

72. JSON array access¶

In [126]:
val dfUsage = DataFrame.readSqlTable(labLogConn, "usage")
dfUsage
Out[126]:

DataFrame: rowsCount = 8, columnsCount = 2

identlog
1[{"machine": "Inphormex", "person": [...
2[{"machine": "Inphormex", "person": [...
3[{"machine": "sterilizer", "person": ...
4[{"machine": "sterilizer", "person": ...
5[{"machine": "AutoPlate 9000", "perso...
6[{"machine": "sterilizer", "person": ...
7[{"machine": "WY401", "person": ["Mau...
8[{"machine": "AutoPlate 9000"}]
In [127]:
// `parse` is simplest, but you have less control over the output, and likely need some column variables

val mylog by frameColumn()
val firstObj by frameColumn()

dfUsage
    .parse { log named "mylog" }
    .add("length") { mylog().count() }
    .add(firstObj) { mylog().take(1) }
    .split(firstObj)
Out[127]:

Split

identmyloglengthsplit1
1[4 x 2]4{ machine:Inphormex, person:[Gabriell...
2[5 x 2]5{ machine:Inphormex, person:[Marianne...
3[2 x 2]2{ machine:sterilizer, person:[Josette...
4[1 x 2] { machine:sterilizer, person:...1{ machine:sterilizer, person:[Maude, ...
5[2 x 2]2{ machine:AutoPlate 9000, person:[Bri...
6[1 x 2] { machine:sterilizer, person:...1{ machine:sterilizer, person:[Mariann...
7[3 x 2]3{ machine:WY401, person:[Maude, Goule...
8[1 x 1] { machine:AutoPlate 9000 }1{ machine:AutoPlate 9000 }
In [128]:
// a defined deserialization strategy (using kotlinx.serialization, or another library) provides native object creation

@Serializable
@DataSchema
data class UsageDetail (
    val machine: String,
    val person: List<String> = listOf()
)

dfUsage.add {
        "length" from { Json.decodeFromString<List<UsageDetail>>(log).size }
        "first" from { Json.decodeFromString<List<UsageDetail>>(log).first() }
    }.select { ident and "length" and "first" }
Out[128]:

DataFrame: rowsCount = 8, columnsCount = 3

identlengthfirst
14UsageDetail(machine=Inphormex, person...
25UsageDetail(machine=Inphormex, person...
32UsageDetail(machine=sterilizer, perso...
41UsageDetail(machine=sterilizer, perso...
52UsageDetail(machine=AutoPlate 9000, p...
61UsageDetail(machine=sterilizer, perso...
73UsageDetail(machine=WY401, person=[Ma...
81UsageDetail(machine=AutoPlate 9000, p...
In [129]:
// instead of .decodeFromString() multiple times (above), you can perform it once and reference the result

dfUsage
    .convert { log named "udList" }.with { Json.decodeFromString<List<UsageDetail>>(it) } 
    .add {
        "length" from { "udList"<List<UsageDetail>>().size }
        "first" from { "udList"<List<UsageDetail>>().first() }
    }.select { ident and "length" and "first" }
Out[129]:

DataFrame: rowsCount = 8, columnsCount = 3

identlengthfirst
14UsageDetail(machine=Inphormex, person...
25UsageDetail(machine=Inphormex, person...
32UsageDetail(machine=sterilizer, perso...
41UsageDetail(machine=sterilizer, perso...
52UsageDetail(machine=AutoPlate 9000, p...
61UsageDetail(machine=sterilizer, perso...
73UsageDetail(machine=WY401, person=[Ma...
81UsageDetail(machine=AutoPlate 9000, p...

73. unpack a JSON array¶

json_each is not standard SQL; it is supported in PostgreSQL and SQLite, and perhaps other databases

it replaces a JSON array with a new row for each element of the array
(i.e. a row with a JSON array of 4 elements is replaced with 4 rows with a single element each)

dataframe enables similar operations, and they are not restricted to JSON -- later we'll look at pivot, gather, flatten, explode, etc.

In [130]:
dfUsage
    .convert { log named "udList" }.with { Json.decodeFromString<List<UsageDetail>>(it) }
    .split { "udList"<List<UsageDetail>>() }.intoRows()
    //.head(10)
Out[130]:

DataFrame: rowsCount = 19, columnsCount = 2

identudList
1UsageDetail(machine=Inphormex, person...
1UsageDetail(machine=Inphormex, person...
1UsageDetail(machine=WY401, person=[Ga...
1UsageDetail(machine=Inphormex, person...
2UsageDetail(machine=Inphormex, person...
2UsageDetail(machine=AutoPlate 9000, p...
2UsageDetail(machine=sterilizer, perso...
2UsageDetail(machine=AutoPlate 9000, p...
2UsageDetail(machine=sterilizer, perso...
3UsageDetail(machine=sterilizer, perso...
3UsageDetail(machine=Inphormex, person...
4UsageDetail(machine=sterilizer, perso...
5UsageDetail(machine=AutoPlate 9000, p...
5UsageDetail(machine=sterilizer, perso...
6UsageDetail(machine=sterilizer, perso...
7UsageDetail(machine=WY401, person=[Ma...
7UsageDetail(machine=Inphormex, person...
7UsageDetail(machine=AutoPlate 9000, p...
8UsageDetail(machine=AutoPlate 9000, p...

74. last element of array¶

In [131]:
dfUsage
    .add("lastMachine") { Json.decodeFromString<List<UsageDetail>>(log).last().machine }
    .select { ident and "lastMachine" }
Out[131]:

DataFrame: rowsCount = 8, columnsCount = 2

identlastMachine
1Inphormex
2sterilizer
3Inphormex
4sterilizer
5sterilizer
6sterilizer
7AutoPlate 9000
8AutoPlate 9000

75. modify JSON¶

In [132]:
// in Kotlin we use immutable objects so we can't just add arbitrary fields
// a convoluted workaround is to deal with JsonObjects rather than actual defined classes (like MachineDetails above)
//   even here, the JsonObject from kotlinx.serialization is immutable, so we must convert to a MutableMap in order to add a new field
// a different JSON library (jackson, gson, moshi) is probably less strict about immutability

dfMachine
    .convert { details }.with{ (Json.parseToJsonElement(details) as JsonObject).let { originalJson ->
            JsonObject(originalJson.toMutableMap().apply {
                this["sold"] = JsonPrimitive("2024-01-25")
            })
        } 
    }

// every value in the `details` column now has a "sold" key attached to it
Out[132]:

DataFrame: rowsCount = 3, columnsCount = 3

identnamedetails
1WY401{"acquired":"2023-05-01","sold":"2024...
2Inphormex{"acquired":"2021-07-15","refurbished...
3AutoPlate 9000{"note":"needs software update","sold...

76. tombstones¶

In [133]:
// this concept is more suited to a persistent data store, to permanently mark certain data as "inactive"
// (but now the user needs to filter on "active" on every query going forward)

// much better (less error-prone) to create a view that filters out the inactive records

dfPenguins
    .add("active") { species != "Adelie" }  // prefer to use a Boolean rather than 0/1 (SQL bit)
    .filter { "active"<Boolean>() }
    .groupBy { species }
    .count()
Out[133]:

DataFrame: rowsCount = 2, columnsCount = 2

speciescount
Chinstrap68
Gentoo124

77. views¶

VIEWs are a database construct -- a pre-saved query that is exposed like a table and is itself query-able

  • dataframe (and virtually all database clients) treats views exactly the same as tables
  • views can be as complex as any query, with joins, case/whens, calculations, etc
  • views execute their query every time they are called, so they are potentially slower than tables

note: a materialized view already has the query results pre-calculated (like a cached query result) and is typically faster
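a minimal sketch, assuming a view named `adelie_only` had been created in penguins.db (it is not part of the sample database) -- readSqlTable reads a view just like a table:

// hypothetical: CREATE VIEW adelie_only AS SELECT * FROM penguins WHERE species = 'Adelie'
DataFrame.readSqlTable(penguinsConn, "adelie_only")
    .take(5)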

78. add check¶

the dataframe API has no validation or constraints (other than enforcing column types), but if a class uses Kotlin's require, that will act as validation inside the df

In [134]:
data class Job (
    val name: String,
    val billable: Double
) {
    init { require(billable > 0.0) { "billable must be greater than 0" } }
}
In [135]:
// will fail validation and raise error

listOf<Job>(
    Job("calibrate", 1.5), 
    Job("reset", -0.5)
)
billable must be greater than 0
java.lang.IllegalArgumentException: billable must be greater than 0
	at Line_188_jupyter$Job.<init>(Line_188.jupyter.kts:5)
	at Line_189_jupyter.<init>(Line_189.jupyter.kts:5)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:105)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:47)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
	at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:107)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:107)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:73)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:646)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:13)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:480)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:125)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:378)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:359)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:124)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:123)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:26)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:21)
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
In [136]:
@DataSchema
data class ValidatingPenguin(
    val species	: String,
    val island : String,
    val bill_length_mm : Double,
    val bill_depth_mm : Double,
    val flipper_length_mm : Int,
    val body_mass_g	: Int,
    val sex : String
) {
    init { require(body_mass_g > 4000) { "body_mass must be greater than 4000" } }
}
In [137]:
// will fail validation and raise error
// eliminate the validation in the data class above and it will work fine

DataFrame
    .readSqlTable(penguinsConn, "little_penguins")
    .convertTo<ValidatingPenguin>()
    .mapToColumn("validatingPenguin") { ValidatingPenguin(species, 
                                                          island, 
                                                          bill_length_mm, 
                                                          bill_depth_mm, 
                                                          flipper_length_mm,
                                                          body_mass_g,
                                                          sex
                                                          ) }
body_mass must be greater than 4000
java.lang.IllegalArgumentException: body_mass must be greater than 4000
	at Line_190_jupyter$ValidatingPenguin.<init>(Line_190.jupyter.kts:11)
	at Line_192_jupyter$res192$1.invoke(Line_192.jupyter.kts:7)
	at Line_192_jupyter$res192$1.invoke(Line_192.jupyter.kts:7)
	at org.jetbrains.kotlinx.dataframe.impl.columns.ConstructorsKt.computeValues(constructors.kt:97)
	at org.jetbrains.kotlinx.dataframe.impl.columns.ConstructorsKt.newColumn(constructors.kt:59)
	at org.jetbrains.kotlinx.dataframe.api.MapKt.mapToColumn(map.kt:95)
	at Line_192_jupyter.<init>(Line_192.jupyter.kts:23)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:105)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:47)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
	at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:107)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:107)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:73)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:646)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:13)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:480)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:125)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:378)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:359)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:124)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:123)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:26)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:21)
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

79. transactions¶

80. rollback in constraint¶

81. rollback in statement¶

transactions are another database construct, not applicable directly to dataframe
most db clients/ORMs support transactions

dataframes are not persistent data structures, so the concept of rollbacks doesn't really apply

definitely worth learning how your db client handles transactions, rollbacks, updates, etc
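for comparison, a minimal JDBC sketch of a transaction (the `accounts` table is hypothetical, not part of the sample databases) -- either both updates commit together or neither does:

penguinsConn.autoCommit = false
try {
    penguinsConn.createStatement().use { st ->
        st.executeUpdate("UPDATE accounts SET balance = balance - 200 WHERE name = 'Chandler'")
        st.executeUpdate("UPDATE accounts SET balance = balance + 200 WHERE name = 'Joey'")
    }
    penguinsConn.commit()
} catch (e: Exception) {
    penguinsConn.rollback()  // undo both updates on any failure
    throw e
} finally {
    penguinsConn.autoCommit = true
}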

In [138]:
// typical example use case for transactions is a bank transfer
// need to ensure that the debit and credit both occur, otherwise someone will be receiving free money (or sending money to a black hole)

// let's set up an example bank with customers
@DataSchema
data class Friend (
    val name: String,
    val bankBalance: Int
)

val friends : List<Friend> = listOf(Friend("Monica", 200), Friend("Ross", 600), Friend("Phoebe", 1000), 
                                    Friend("Joey", 100), Friend("Rachel", 800), Friend("Chandler", 750))

val dfFriends : DataFrame<Friend> = friends.toDataFrame()
In [139]:
// now let's create a transfer function

// we have two validation checks -- that both parties are known customers, and that the sender has sufficient funds

// our function performs the validation before executing any of the changes, ensuring both updates will succeed
//   this is essentially the same as a transaction -- either everything succeeds or nothing changes

// of course, unanticipated errors that aren't caught by the validation would still cause a problem
//   as would lengthy operations where the data changes after validation but before/during the updates

fun transferMoney(from: String, to: String, amount: Int) : DataFrame<*> {
    if (from !in dfFriends.name || to !in dfFriends.name) throw Exception("unknown customer")
    if (amount > dfFriends.single { name == from}.bankBalance) throw Exception("insufficient funds")
    return dfFriends
        .update { bankBalance }.where { name == from } .with { bankBalance - amount }
        .update { bankBalance }.where { name == to } .with { bankBalance + amount }
}

transferMoney("Chandler", "Joey", 200)  // works fine
Out[139]:

DataFrame: rowsCount = 6, columnsCount = 2

namebankBalance
Monica200
Ross600
Phoebe1000
Joey300
Rachel800
Chandler550
In [140]:
transferMoney("Chandler", "Brad", 200) // who's Brad?
unknown customer
java.lang.Exception: unknown customer
	at Line_197_jupyter.transferMoney(Line_197.jupyter.kts:12)
	at Line_198_jupyter.<init>(Line_198.jupyter.kts:1)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:105)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:47)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
	at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:107)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:107)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:73)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:646)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:13)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:480)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:125)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:378)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:359)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:124)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:123)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:26)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:21)
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
In [141]:
transferMoney("Monica", "Chandler", 500) // too much
insufficient funds
java.lang.Exception: insufficient funds
	at Line_197_jupyter.transferMoney(Line_197.jupyter.kts:13)
	at Line_199_jupyter.<init>(Line_199.jupyter.kts:1)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:105)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:47)
	at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
	at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:107)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:107)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:73)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:646)
	at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:71)
	at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:13)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:480)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:125)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:378)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:359)
	at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:375)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:124)
	at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:123)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:26)
	at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$runExecution$execThread$1.invoke(JupyterExecutorImpl.kt:21)
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

82. upsert¶

dataframe allows for adding new rows (SQL INSERT) via append(), but an UPSERT is not directly supported

(UPSERT is not an official standard SQL keyword either, although many databases support it)

In [142]:
fun upsertFriends(friend : Friend) : DataFrame<*> {
    return if (friend.name in dfFriends.name)
            dfFriends
                .update { bankBalance }.where { name == friend.name } .with { friend.bankBalance }
        else 
            dfFriends.append(friend.name, friend.bankBalance)
}

upsertFriends(Friend("Rachel", 888)) // updates
Out[142]:

DataFrame: rowsCount = 6, columnsCount = 2

namebankBalance
Monica200
Ross600
Phoebe1000
Joey100
Rachel888
Chandler750
In [143]:
upsertFriends(Friend("Gunther", 125)) // inserts
Out[143]:

DataFrame: rowsCount = 7, columnsCount = 2

namebankBalance
Monica200
Ross600
Phoebe1000
Joey100
Rachel800
Chandler750
Gunther125

83. create trigger¶

In [144]:
// an equivalent to a database trigger would be a Kotlin function 
//   that performs a database operation and also some other procedure, potentially some side effect -- which can be tricky to trace/debug

@DataSchema
data class LabJob (
    val person: String,
    val reported: Double
) {
    init { require(reported > 0.0) { "reported must be greater than 0" } }
}

@DataSchema
data class LabTotals (
    val person: String,  // dataframe has no unique constraint
    val hours: Double
)
In [145]:
val dfLabJobs : DataFrame<LabJob> = listOf<LabJob>().toDataFrame<LabJob>()

val dfLabTotals : DataFrame<LabTotals> = 
    listOf<LabTotals>(LabTotals("gene", 0.0), LabTotals("august", 0.0)).toDataFrame<LabTotals>()

// theoretically, we can use something like this function to simulate a database trigger
// if we assign the result to a new variable (after chaining) then the totals will persist (see the sketch after the next cell)
// however it's far from perfect since the individual jobs (dfLabJobs) would not persist
fun <T> DataFrame<T>.addJob(job : LabJob) : DataFrame<T> {
    if (! this.getColumn(person).contains(job.person) ) throw Exception("unknown person")
        else {
            dfLabJobs.concat( listOf<LabJob>(job).toDataFrame<LabJob>() )
            return this.update { "hours"<Double>() }.where { "person"<String>() == job.person }.with { it + job.reported }
        }
}
In [146]:
dfLabTotals
    .addJob(LabJob("gene", 1.5))
    .addJob(LabJob("august", 0.5))
    .addJob(LabJob("gene", 1.0))
    // .addJob(LabJob("fred", 4.0)) // will cause an error since "fred" is unknown
Out[146]:

DataFrame: rowsCount = 2, columnsCount = 2

personhours
gene2.500000
august0.500000
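as noted in the comments above, the updated totals only persist if we keep the result -- a minimal sketch that reassigns a variable after each call (the individual dfLabJobs rows still do not persist):

var dfCurrentTotals = dfLabTotals
dfCurrentTotals = dfCurrentTotals.addJob(LabJob("gene", 1.5))
dfCurrentTotals = dfCurrentTotals.addJob(LabJob("august", 0.5))
// dfCurrentTotals now holds the accumulated hours; dfLabTotals itself is unchanged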

84. recursive query¶

In [147]:
// seeding a dataframe just like the source `lineage` table

@DataSchema
data class Lineage (
    val parent : String,
    val child : String
)

val dfLineage = listOf<Lineage>(
    Lineage("Arturo", "Clemente"),
    Lineage("Darío", "Clemente"),
    Lineage("Clemente", "Homero"),
    Lineage("Clemente", "Ivonne"),
    Lineage("Ivonne", "Lourdes"),
    Lineage("Soledad", "Lourdes"),
    Lineage("Lourdes", "Santiago")
).toDataFrame()

@DataSchema
data class Descendant (
    val person : String,
    val generation : Int = 0
)

dfLineage
Out[147]:

DataFrame: rowsCount = 7, columnsCount = 2

parentchild
ArturoClemente
DaríoClemente
ClementeHomero
ClementeIvonne
IvonneLourdes
SoledadLourdes
LourdesSantiago
In [148]:
// we can use `fold` to emulate a recursive query.
// the number of generations cannot exceed the number of people so that is our range to iterate, seeding with an empty list

// for the zero generation, we find the people who are never listed as a child

// for future generations, we filter on the current generation and find the children of those individuals, and add them to the list

(0..dfLineage.count()).fold( listOf<Descendant>() ) { descendants, i ->
    if (i == 0) {
        dfLineage.filter { parent !in dfLineage.child }.map { Descendant(it.parent, i) }
    } else {
        val parents : List<String> = descendants.filter { it.generation == i - 1 }.map { it.person }
        val children = dfLineage.filter { parent in parents } //.getColumn{ child }.toList()
        descendants + children.distinct { child }.map { Descendant(it.child, i) }
    }
}.toDataFrame<Descendant>()
.groupBy { person }.minOf { generation }
Out[148]:

DataFrame: rowsCount = 8, columnsCount = 2

personmin
Arturo0
Darío0
Soledad0
Clemente1
Lourdes1
Homero2
Ivonne2
Santiago2

Run the above with and without the groupBy at the end... each child can have two parents, so they can potentially belong to different generations

For example, Lourdes is the child of Ivonne and Soledad. Soledad is zero-generation (no parent listed), but Ivonne is a child of Clemente, who is a child of both Arturo and Darío. Therefore Lourdes is Gen-1 and also Gen-3. The final groupBy keeps only the minimum generation for each person, but that is an arbitrary decision.

85. bidirectional contacts¶

In [149]:
val contactTracingConn : Connection = DriverManager.getConnection("jdbc:sqlite:contact_tracing.db")
val dfPerson = DataFrame.readSqlTable(contactTracingConn, "person")
val dfContact = DataFrame.readSqlTable(contactTracingConn, "contact")
    .rename("left").into("leftCol")
    .rename("right").into("rightCol")

dfContact
Out[149]:

DataFrame: rowsCount = 8, columnsCount = 2

leftColrightCol
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera
In [150]:
// dataframe matches up by column name, not simply position
// even if we switch the order of the columns, when we .concat() the dataframe re-orders by column name

// here, the first 8 rows are the same as the bottom 8 rows even though we switched the order of the columns
//   (when performing a UNION in SQL, the order of the select takes priority over matching column names)
dfContact
    .concat(dfContact.move { rightCol }.toLeft())
Out[150]:

DataFrame: rowsCount = 16, columnsCount = 2

leftColrightCol
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera
In [151]:
// same thing with this syntax

dfContact.select { cols(0,1) }
    .concat(dfContact.select { cols(1,0) })
Out[151]:

DataFrame: rowsCount = 16, columnsCount = 2

leftColrightCol
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera
In [152]:
// we have to rename the columns or rebuild a df with matching names in order to successfully concat()

// this is a bit convoluted but also a pretty unusual operation

val dfBiContact = listOf(
    dfContact.getColumn(1).named("leftCol"),
    dfContact.getColumn(0).named("rightCol")
).toDataFrame()
    .concat(dfContact)

dfBiContact
Out[152]:

DataFrame: rowsCount = 16, columnsCount = 2

leftColrightCol
Ariadna CaraballoAgustín Rodríquez
Verónica AltamiranoAgustín Rodríquez
Verónica AltamiranoJuana Baeza
Micaela LaboyJuana Baeza
Reina RiveroPilar Alarcón
Marco Antonio BarreraCristal Soliz
Daniela MenéndezCristal Soliz
Marco Antonio BarreraDaniela Menéndez
Agustín RodríquezAriadna Caraballo
Agustín RodríquezVerónica Altamirano
Juana BaezaVerónica Altamirano
Juana BaezaMicaela Laboy
Pilar AlarcónReina Rivero
Cristal SolizMarco Antonio Barrera
Cristal SolizDaniela Menéndez
Daniela MenéndezMarco Antonio Barrera

86. update group identifiers¶

In [153]:
dfPerson
    .innerJoinWith(dfBiContact) { it.name == right.leftCol }
    .innerJoinWith(dfPerson) { "rightCol"<String>() == right.name }
    .select { "leftCol" and "ident" and "rightCol" and "ident1" }
    .add("new_ident") { minOf("ident"<Int>(), "ident1"<Int>()) }
Out[153]:

DataFrame: rowsCount = 16, columnsCount = 5

leftColidentrightColident1new_ident
Juana Baeza1Verónica Altamirano51
Juana Baeza1Micaela Laboy41
Agustín Rodríquez2Ariadna Caraballo32
Agustín Rodríquez2Verónica Altamirano52
Ariadna Caraballo3Agustín Rodríquez22
Micaela Laboy4Juana Baeza11
Verónica Altamirano5Agustín Rodríquez22
Verónica Altamirano5Juana Baeza11
Reina Rivero6Pilar Alarcón106
Pilar Alarcón10Reina Rivero66
Daniela Menéndez11Cristal Soliz1311
Daniela Menéndez11Marco Antonio Barrera1211
Marco Antonio Barrera12Cristal Soliz1312
Marco Antonio Barrera12Daniela Menéndez1111
Cristal Soliz13Marco Antonio Barrera1212
Cristal Soliz13Daniela Menéndez1111

87. recursive labeling¶

sorry but this query makes little sense to me... I'm trying to get this completed within a brief amount of time... you're on your own for this

88. query from Python ORM¶

In [154]:
// we're not going to use Python; however, we can use this opportunity to demonstrate using an ORM
// these examples use ktorm (https://www.ktorm.org/), but there are many other options: jOOQ, komapper, Exposed, SQLDelight, kotliquery...

import org.ktorm.database.Database
import org.ktorm.database.use
import org.ktorm.entity.Entity
import org.ktorm.entity.sequenceOf
import org.ktorm.schema.*
import org.ktorm.dsl.*

// ktorm requires us to build an Entity
interface Penguin : Entity<Penguin> {
    val species : String
    val island : String
    val bill_length_mm : Double
    val bill_depth_mm : Double
    val flipper_length_mm : Int
    val body_mass_g : Int
    val sex : String
}

// and also a mapper from database columns to the Entity
object Penguins : Table<Penguin>("penguins") {
    val species = varchar("species").bindTo { it.species }
    val island = varchar("island").bindTo { it.island }
    val bill_length_mm = double("bill_length_mm").bindTo { it.bill_length_mm }
    val bill_depth_mm = double("bill_depth_mm").bindTo { it.bill_depth_mm }
    val flipper_length_mm = int("flipper_length_mm").bindTo { it.flipper_length_mm }
    val body_mass_g = int("body_mass_g").bindTo { it.body_mass_g }
    val sex = varchar("sex").bindTo { it.sex }
}
In [155]:
import org.ktorm.support.sqlite.SQLiteDialect

val dbPenguins = Database.connect(
    url = "jdbc:sqlite:penguins.db",
    dialect = SQLiteDialect()
)
In [156]:
// the benefit of defining the Entity is now we can write type-safe SQL using ktorm's DSL, 
//   and auto-map the query result to our Penguin class

dbPenguins
    .from(Penguins)
    .select()
    .where { Penguins.sex.eq("MALE") and Penguins.body_mass_g.gt(5000) }
    .orderBy(Penguins.flipper_length_mm.desc())
    .map { row -> 
        Penguins.createEntity(row)
    }.toDataFrame()
Out[156]:

... showing only top 20 of 56 rows

DataFrame: rowsCount = 56, columnsCount = 9

bill_depth_mmbill_length_mmbody_mass_gflipper_length_mmislandsexspeciesentityClassproperties
15.70000054.3000005650231BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.30000050.0000005700230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
17.00000059.6000006050230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.80000049.8000005700230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.00000048.6000005800230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
17.00000052.1000005550230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.30000051.5000005500230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.00000055.1000005850230BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.20000049.5000005800229BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.90000049.8000005950229BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
17.30000050.8000005600228BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
17.10000052.2000005400228BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
17.00000055.9000005600228BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.00000049.1000005500228BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.70000050.8000005200226BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.00000049.6000005700225BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.90000050.5000005400225BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
16.50000051.1000005250225BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.30000050.4000005550224BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...
15.90000050.0000005350224BiscoeMALEGentooclass Line_226_jupyter$Penguin{species=Gentoo, island=Biscoe, bill_...

89. DataFrame accessors¶

in the source material, queries 89 to 100 demonstrate the use of Python, pandas, polars and ORMs, which aren't relevant to dataframe

so let's use these last 12 slots to break from the source and demonstrate some cool dataframe features

some have equivalents in SQL but others, like hierarchical data, do not

DataFrame has 3 APIs for referencing columns. Within a Jupyter notebook, schemas are rebuilt after every executed cell, so it is generally easier to refer to columns directly, even without a defined schema. Gradle projects are stricter: there is no "intermediate compilation" as in Jupyter, so every schema must be explicitly defined or referenced.

In [157]:
// String API - the compiler does not know anything about the column, so we must specify the data type
// the only time this is necessary in Jupyter is when we create new columns on the fly and then reference them within the same cell
dfPenguins
    .filter { "bill_length_mm"<Double>() != null && "bill_depth_mm"<Double>() != null }
    .add("bill_area_mm2") { "bill_length_mm"<Double>() * "bill_depth_mm"<Double>() }
    .mean("bill_area_mm2"<Double>())

// https://kotlin.github.io/dataframe/stringapi.html
Out[157]:
750.7856432748542
In [158]:
// Column Accessors - we define the column once and can then reference it
// in some contexts we need to use `invoke()` (shortened to just `()`)
val bill_length_mm by column<Double>()
val bill_depth_mm by column<Double>()
val bill_area_mm2 by column<Double>()

dfPenguins
    .filter { bill_length_mm != null && bill_depth_mm != null }
    .add( bill_area_mm2 ) { bill_length_mm() * bill_depth_mm() }
    .mean( bill_area_mm2 )

// https://kotlin.github.io/dataframe/columnaccessorsapi.html
Out[158]:
750.7856432748542
In [159]:
// note this data class is not annotated with @DataSchema

// not sure when this is preferable to annotating it and having the compiler ensure the references are type-safe

data class PenguinClass(
    val species	: String,
    val island : String,
    val bill_length_mm : Double,
    val bill_depth_mm : Double,
    val flipper_length_mm : Int?,
    val body_mass_g	: Int?,
    val sex : String,
    val bill_area : Double
)

dfPenguins
    .dropNulls(PenguinClass::bill_length_mm)
    .dropNulls(PenguinClass::bill_depth_mm)
    .add( PenguinClass::bill_area ) { it.get(PenguinClass::bill_length_mm) *  it.get(PenguinClass::bill_depth_mm) }
    .mean( PenguinClass::bill_area )

// https://kotlin.github.io/dataframe/kpropertiesapi.html
Out[159]:
750.7856432748542

90. importing a hierarchical data set¶

for the next few examples, we'll operate on a data set of Tom Brady's career statistics

the source is pro-football-reference.com: https://www.pro-football-reference.com/players/B/BradTo00/gamelog/

we've taken the regular season gamelog and saved it as a CSV

the first thing to note in the source: the stats columns are grouped by category: Passing, Rushing, Receiving, etc.

we want our data organized the same way. fortunately, dataframe supports hierarchical data, although it cannot natively read it from a CSV

therefore, we're going to have to generate column names that encode the hierarchy

In [160]:
// we're going to use a separate CSV reader to do our first pass
import com.github.doyaaaaaken.kotlincsv.dsl.csvReader

val csvFile : File = File("E:\\Kotlin\\HundredSQLQueries\\src\\main\\TomBrady.csv")
val csvRaw : List<List<String>> = csvReader().readAll(csvFile)
// the headers are on the first 2 lines, so let's combine the first and second row into a single string
val hierarchalHeaders : List<String> = (csvRaw[0] zip csvRaw[1]).map { "${it.first};;${it.second}" }
In [161]:
val dfBradyRaw = DataFrame.readCSV(
    file = csvFile,
    skipLines = 2,
    header = hierarchalHeaders
).filter { ";;GS"<String>() == null || ";;GS"<String>() == "*" } // GS = game started - we eliminate any row where Brady was "inactive", "suspended", etc.
.parse() // the existence of "inactive", etc. forced some numeric columns to look like Strings; now that those rows are gone, parse() restores the correct data types

dfBradyRaw  // this is a flat table; the next step is to reshape it to be hierarchical
Out[161]:

... showing only top 20 of 335 rows

DataFrame: rowsCount = 335, columnsCount = 65

;;Rk;;Year;;Date;;G#;;Week;;Age;;Tm;;;;Opp;;Result;;GSPassing;;CmpPassing;;AttPassing;;Cmp%Passing;;YdsPassing;;TDPassing;;IntPassing;;RatePassing;;SkPassing;;Yds1Passing;;Y/APassing;;AY/ARushing;;AttRushing;;YdsRushing;;Y/ARushing;;TDReceiving;;TgtReceiving;;RecReceiving;;YdsReceiving;;Y/RReceiving;;TDReceiving;;Ctch%Receiving;;Y/TgtScoring;;TDScoring;;Pts;;SkTackles;;SoloTackles;;AstTackles;;CombTackles;;TFLTackles;;QBHitsFumbles;;FmbFumbles;;FLFumbles;;FFFumbles;;FRFumbles;;YdsFumbles;;TDPunting;;PntPunting;;YdsPunting;;Y/PPunting;;RetYdsPunting;;NetPunting;;NY/PPunting;;TBPunting;;TB%Punting;;In20Punting;;In20%Punting;;BlckOff. Snaps;;NumOff. Snaps;;PctDef. Snaps;;NumDef. Snaps;;PctST Snaps;;NumST Snaps;;Pct;;Status
(first 20 raw game-log rows, 2000-2002 seasons; the many ungrouped columns run together in this static rendering -- the same data appears in readable, grouped form in the next output)
In [162]:
// grouping columns together creates a hierarchical data set
// we use the header strings we created in the first step to easily identify the columns to be grouped

val dfBradyHierarchal = dfBradyRaw
    .group { cols { it.name.startsWith(";;") } }.into { "Meta" }
    .group { cols { it.name.startsWith("Passing;;") } }.into { "Passing" }
    .group { cols { it.name.startsWith("Rushing;;") } }.into { "Rushing" }
    .group { cols { it.name.startsWith("Receiving;;") } }.into { "Receiving" }
    .group { cols { it.name.startsWith("Scoring;;") } }.into { "Scoring" }
    .remove { cols { !it.isColumnGroup() } }  // we are going to ignore defense, kicking, punting, etc -- so anything that isn't in a group yet will be removed
    .rename { all().recursively() }.into { it.name.replace(";;", "") }

dfBradyHierarchal
Out[162]:

... showing only top 20 of 335 rows

DataFrame: rowsCount = 335, columnsCount = 5

Meta | Passing | Rushing | Receiving | Scoring
{ Rk:12, Year:2000, Date:2000-11-23, ...{ PassingCmp:1, PassingAtt:3, Passing...{ RushingAtt:0, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:18, Year:2001, Date:2001-09-23, ...{ PassingCmp:5, PassingAtt:10, Passin...{ RushingAtt:1, RushingYds:9, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:19, Year:2001, Date:2001-09-30, ...{ PassingCmp:13, PassingAtt:23, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:20, Year:2001, Date:2001-10-07, ...{ PassingCmp:12, PassingAtt:24, Passi...{ RushingAtt:2, RushingYds:9, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:21, Year:2001, Date:2001-10-14, ...{ PassingCmp:33, PassingAtt:54, Passi...{ RushingAtt:1, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:22, Year:2001, Date:2001-10-21, ...{ PassingCmp:16, PassingAtt:20, Passi...{ RushingAtt:2, RushingYds:-2, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:23, Year:2001, Date:2001-10-28, ...{ PassingCmp:25, PassingAtt:38, Passi...{ RushingAtt:2, RushingYds:-1, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:24, Year:2001, Date:2001-11-04, ...{ PassingCmp:21, PassingAtt:31, Passi...{ RushingAtt:3, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:25, Year:2001, Date:2001-11-11, ...{ PassingCmp:15, PassingAtt:21, Passi...{ RushingAtt:4, RushingYds:-1, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:26, Year:2001, Date:2001-11-18, ...{ PassingCmp:19, PassingAtt:27, Passi...{ RushingAtt:2, RushingYds:6, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:27, Year:2001, Date:2001-11-25, ...{ PassingCmp:19, PassingAtt:26, Passi...{ RushingAtt:4, RushingYds:14, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:28, Year:2001, Date:2001-12-02, ...{ PassingCmp:20, PassingAtt:28, Passi...{ RushingAtt:3, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:29, Year:2001, Date:2001-12-09, ...{ PassingCmp:19, PassingAtt:28, Passi...{ RushingAtt:4, RushingYds:-3, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:30, Year:2001, Date:2001-12-16, ...{ PassingCmp:19, PassingAtt:35, Passi...{ RushingAtt:3, RushingYds:13, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:31, Year:2001, Date:2001-12-22, ...{ PassingCmp:11, PassingAtt:19, Passi...{ RushingAtt:3, RushingYds:-5, Rushin...{ ReceivingTgt:1, ReceivingRec:1, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:32, Year:2001, Date:2002-01-06, ...{ PassingCmp:17, PassingAtt:29, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:33, Year:2002, Date:2002-09-09, ...{ PassingCmp:29, PassingAtt:43, Passi...{ RushingAtt:0, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:34, Year:2002, Date:2002-09-15, ...{ PassingCmp:25, PassingAtt:35, Passi...{ RushingAtt:2, RushingYds:5, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:35, Year:2002, Date:2002-09-22, ...{ PassingCmp:39, PassingAtt:54, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:36, Year:2002, Date:2002-09-29, ...{ PassingCmp:36, PassingAtt:53, Passi...{ RushingAtt:1, RushingYds:3, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
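Once columns are grouped, nested columns can be addressed through their group. A minimal sketch (the specific lookups below are illustrative, not from the original notebook):

dfBradyHierarchal.getColumnGroup("Meta").columnNames()        // which columns ended up under Meta
dfBradyHierarchal.getColumnGroup("Passing")["PassingYds"]     // a single nested column inside the Passing group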
In [163]:
// we now have a good hierarchical data set; however, there are a few items we'd like to clean up

val dfBrady = dfBradyHierarchal
    .convert { "Meta"["untitled"] named "location" }.with { if (it == null) "home" else "away" }
    .convert { "Meta"["GS"] named "started" }.with { it == null } // started? true/false
    .add {  // break up "Result" into 3 fields -- win/loss, team score and opponent score
        "WinLoss" from { "Meta"["Result"]<String>().let { if (it.first() == 'W') "win" else if (it.first() == 'L') "loss" else "tie" } }
        "OurScore" from { "Meta"["Result"]<String>().let { it.drop(1).substringBefore("-").trim().toInt() } }
        "TheirScore" from { "Meta"["Result"]<String>().let { it.substringAfter("-").trim().toInt() } }
    }.move { "WinLoss" and "OurScore" and "TheirScore" }.under("Meta")
    .parse()

dfBrady
Out[163]:

... showing only top 20 of 335 rows

DataFrame: rowsCount = 335, columnsCount = 5

Meta | Passing | Rushing | Receiving | Scoring
{ Rk:12, Year:2000, Date:2000-11-23, ...{ PassingCmp:1, PassingAtt:3, Passing...{ RushingAtt:0, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:18, Year:2001, Date:2001-09-23, ...{ PassingCmp:5, PassingAtt:10, Passin...{ RushingAtt:1, RushingYds:9, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:19, Year:2001, Date:2001-09-30, ...{ PassingCmp:13, PassingAtt:23, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:20, Year:2001, Date:2001-10-07, ...{ PassingCmp:12, PassingAtt:24, Passi...{ RushingAtt:2, RushingYds:9, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:21, Year:2001, Date:2001-10-14, ...{ PassingCmp:33, PassingAtt:54, Passi...{ RushingAtt:1, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:22, Year:2001, Date:2001-10-21, ...{ PassingCmp:16, PassingAtt:20, Passi...{ RushingAtt:2, RushingYds:-2, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:23, Year:2001, Date:2001-10-28, ...{ PassingCmp:25, PassingAtt:38, Passi...{ RushingAtt:2, RushingYds:-1, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:24, Year:2001, Date:2001-11-04, ...{ PassingCmp:21, PassingAtt:31, Passi...{ RushingAtt:3, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:25, Year:2001, Date:2001-11-11, ...{ PassingCmp:15, PassingAtt:21, Passi...{ RushingAtt:4, RushingYds:-1, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:26, Year:2001, Date:2001-11-18, ...{ PassingCmp:19, PassingAtt:27, Passi...{ RushingAtt:2, RushingYds:6, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:27, Year:2001, Date:2001-11-25, ...{ PassingCmp:19, PassingAtt:26, Passi...{ RushingAtt:4, RushingYds:14, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:28, Year:2001, Date:2001-12-02, ...{ PassingCmp:20, PassingAtt:28, Passi...{ RushingAtt:3, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:29, Year:2001, Date:2001-12-09, ...{ PassingCmp:19, PassingAtt:28, Passi...{ RushingAtt:4, RushingYds:-3, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:30, Year:2001, Date:2001-12-16, ...{ PassingCmp:19, PassingAtt:35, Passi...{ RushingAtt:3, RushingYds:13, Rushin...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:31, Year:2001, Date:2001-12-22, ...{ PassingCmp:11, PassingAtt:19, Passi...{ RushingAtt:3, RushingYds:-5, Rushin...{ ReceivingTgt:1, ReceivingRec:1, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:32, Year:2001, Date:2002-01-06, ...{ PassingCmp:17, PassingAtt:29, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:33, Year:2002, Date:2002-09-09, ...{ PassingCmp:29, PassingAtt:43, Passi...{ RushingAtt:0, RushingYds:0, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:34, Year:2002, Date:2002-09-15, ...{ PassingCmp:25, PassingAtt:35, Passi...{ RushingAtt:2, RushingYds:5, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:35, Year:2002, Date:2002-09-22, ...{ PassingCmp:39, PassingAtt:54, Passi...{ RushingAtt:1, RushingYds:2, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
{ Rk:36, Year:2002, Date:2002-09-29, ...{ PassingCmp:36, PassingAtt:53, Passi...{ RushingAtt:1, RushingYds:3, Rushing...{ ReceivingTgt:0, ReceivingRec:0, Rec...{ ScoringTD:0, ScoringPts:0 }
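Before moving on, a quick sanity check of the newly derived columns can help; a hedged sketch (assuming the generated column accessors are in scope, as they are after the cell above):

dfBrady.Meta.WinLoss.valueCounts()    // counts of "win" / "loss" (and "tie", if any appear)
dfBrady.Meta.location.valueCounts()   // counts of "home" / "away"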

91. pivoting¶

In [164]:
// let's get a summary of Brady's win/loss record against every team

// pivot will create a new column from every distinct value in the "WinLoss" column (there are only 2)
//   a pivot typically needs 3 inputs or dimensions -- the field to pivot (horizontal), the field to group by (vertical),
//     and usually an aggregation calculation for the inner cells

// some RDBMSs support PIVOT, however it is usually less flexible --
//   for example, SQL Server's PIVOT requires the user to specify the name of every horizontal column
//     easy here (just "win" and "loss"), but sometimes all the possible values are not known in advance

dfBrady
    .pivot(inward=false) { Meta.WinLoss }
        .groupBy { Meta.Opp named "opponent" } 
        .count()
    .sortBy { "opponent"<String>() }
Out[164]:

... showing only top 20 of 35 rows

DataFrame: rowsCount = 35, columnsCount = 3

opponent | loss | win
ARI | 1 | 2
ATL | 1 | 10
BAL | 3 | 6
BUF | 3 | 33
CAR | 4 | 7
CHI | 1 | 6
CIN | 2 | 7
CLE | 2 | 7
DAL | 0 | 7
DEN | 6 | 8
DET | 2 | 5
GNB | 3 | 4
HOU | 2 | 7
IND | 3 | 12
JAX | 1 | 5
KAN | 6 | 5
LAC | 0 | 2
LAR | 2 | 2
LVR | 0 | 1
MIA | 12 | 24
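count() is just one possible aggregation for the inner cells; the same pivot/groupBy skeleton accepts others. A hedged sketch (not in the original) of averaging passing yards per opponent, split by win/loss:

dfBrady
    .pivot(inward=false) { Meta.WinLoss }
        .groupBy { Meta.Opp named "opponent" }
        .aggregate { mean { Passing.PassingYds } }
    .sortBy { "opponent"<String>() }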

92. pivoting with more detail¶

In [165]:
// ok, so we know Brady's record against every team

// let's see how his record changes based on yardage thresholds -- what does it look like when he throws for 300+ or 400+ yards, versus other games?

val yardsThresholds : List<Int> = listOf(0, 200, 300, 350, 400, 450, 500)

dfBrady
    .add("YardsThreshold") { yardsThresholds.filter { Passing.PassingYds > it } }//.under("Passing")
    .explode { "YardsThreshold"<List<Int>>() }
    .pivot(inward=false) { Meta.WinLoss }
        .groupBy { Meta.Opp and "YardsThreshold"<Int>() } 
        .count()
    .sortBy { "Opp" and "YardsThreshold" }
    .pivot { "YardsThreshold"<Int>() }
        .groupBy { "Opp"<String>() }    
        .values { "loss" and "win" }
Out[165]:

... showing only top 20 of 35 rows

DataFrame: rowsCount = 35, columnsCount = 2

Opp | YardsThreshold
ARI{ 0:{ loss:1, win:2 }, 200:{ loss:1, ...
ATL{ 0:{ loss:1, win:10 }, 200:{ loss:0,...
BAL{ 0:{ loss:3, win:6 }, 200:{ loss:3, ...
BUF{ 0:{ loss:3, win:33 }, 200:{ loss:1,...
CAR{ 0:{ loss:4, win:7 }, 200:{ loss:4, ...
CHI{ 0:{ loss:1, win:6 }, 200:{ loss:1, ...
CIN{ 0:{ loss:2, win:7 }, 200:{ loss:1, ...
CLE{ 0:{ loss:2, win:7 }, 200:{ loss:2, ...
DAL{ 0:{ loss:0, win:7 }, 200:{ loss:0, ...
DEN{ 0:{ loss:6, win:8 }, 200:{ loss:5, ...
DET{ 0:{ loss:2, win:5 }, 200:{ loss:0, ...
GNB{ 0:{ loss:3, win:4 }, 200:{ loss:2, ...
HOU{ 0:{ loss:2, win:7 }, 200:{ loss:1, ...
IND{ 0:{ loss:3, win:12 }, 200:{ loss:3,...
JAX{ 0:{ loss:1, win:5 }, 200:{ loss:1, ...
KAN{ 0:{ loss:6, win:5 }, 200:{ loss:4, ...
LAC{ 0:{ loss:0, win:2 }, 200:{ loss:0, ...
LAR{ 0:{ loss:2, win:2 }, 200:{ loss:2, ...
LVR{ 0:{ loss:0, win:1 }, 200:{ loss:0, ...
MIA{ 0:{ loss:12, win:24 }, 200:{ loss:7...

93. using a class's native properties/methods¶

In [166]:
// let's define a class with some member calculations

data class WinLossRecord(
    val wins: Int,
    val losses: Int
) {
    val winPct: Double get() = wins.toDouble() / (wins + losses)
    val undefeated: Boolean get() = wins > 0 && losses == 0
}
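A quick standalone check of the class, outside of any DataFrame (the numbers are made up for illustration):

val sample = WinLossRecord(wins = 10, losses = 2)
sample.winPct        // 0.8333...
sample.undefeated    // false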
In [167]:
// start with the same query as #91,
// but we will add a column of the WinLossRecord class, and then call its methods directly

// in a real-world scenario, it doesn't make much sense to create a class just to do simple calculations
//   but if you already have a class in use, you can easily create instances from the data set (or import instances directly)

val WL by column<WinLossRecord>() 

dfBrady
    .pivot(inward=false) { Meta.WinLoss }
    .groupBy { Meta.Opp named "opponent" } 
    .count()
    .add(WL) { WinLossRecord("win"<Int>(), "loss"<Int>()) }
    .add {
        "WinPct" from { WL().winPct }            // note the (), we are working on the class instance in each row, not the DataColumn
        "Undefeated" from { WL().undefeated }
    }
    .sortBy { "WinPct"<Double>().desc() }
Out[167]:

... showing only top 20 of 35 rows

DataFrame: rowsCount = 35, columnsCount = 6

opponent | loss | win | WL | WinPct | Undefeated
MIN | 0 | 6 | WinLossRecord(wins=6, losses=0) | 1.000000 | true
DAL | 0 | 7 | WinLossRecord(wins=7, losses=0) | 1.000000 | true
TAM | 0 | 4 | WinLossRecord(wins=4, losses=0) | 1.000000 | true
LAC | 0 | 2 | WinLossRecord(wins=2, losses=0) | 1.000000 | true
LVR | 0 | 1 | WinLossRecord(wins=1, losses=0) | 1.000000 | true
NWE | 0 | 1 | WinLossRecord(wins=1, losses=0) | 1.000000 | true
BUF | 3 | 33 | WinLossRecord(wins=33, losses=3) | 0.916667 | false
ATL | 1 | 10 | WinLossRecord(wins=10, losses=1) | 0.909091 | false
CHI | 1 | 6 | WinLossRecord(wins=6, losses=1) | 0.857143 | false
NYG | 1 | 6 | WinLossRecord(wins=6, losses=1) | 0.857143 | false
PHI | 1 | 5 | WinLossRecord(wins=5, losses=1) | 0.833333 | false
JAX | 1 | 5 | WinLossRecord(wins=5, losses=1) | 0.833333 | false
NYJ | 7 | 30 | WinLossRecord(wins=30, losses=7) | 0.810811 | false
IND | 3 | 12 | WinLossRecord(wins=12, losses=3) | 0.800000 | false
OAK | 1 | 4 | WinLossRecord(wins=4, losses=1) | 0.800000 | false
CLE | 2 | 7 | WinLossRecord(wins=7, losses=2) | 0.777778 | false
HOU | 2 | 7 | WinLossRecord(wins=7, losses=2) | 0.777778 | false
CIN | 2 | 7 | WinLossRecord(wins=7, losses=2) | 0.777778 | false
DET | 2 | 5 | WinLossRecord(wins=5, losses=2) | 0.714286 | false
SDG | 2 | 5 | WinLossRecord(wins=5, losses=2) | 0.714286 | false
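If the result above were captured in a variable, the class instances could also be pulled back out as an ordinary Kotlin list; a hedged sketch (the variable name byOpponent is hypothetical):

val byOpponent = dfBrady
    .pivot(inward=false) { Meta.WinLoss }
    .groupBy { Meta.Opp named "opponent" }
    .count()
    .add(WL) { WinLossRecord("win"<Int>(), "loss"<Int>()) }

val records: List<WinLossRecord> = byOpponent[WL].toList()   // plain Kotlin list of WinLossRecord instances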

94. rolling totals¶

In [168]:
// what is Brady's highest passing yards total over 3 consecutive games?

dfBrady
    .select { Meta.Rk and Meta.Date and Passing.PassingYds }
    .add("3gamePassYds") { (relative(-2..0)["PassingYds"] as DataColumn<Int>).sum() } // relative allows us to lookback (or forward) to other rows
    .remove("PassingYds")
    .sortBy { "3gamePassYds"<Int>().desc() }
    .head(10)

// highest 3-game passing yards total was 1327 yards, ending 2011-09-25  
//   (517, 423, 387 to start the 2011 season, according to pro-football-reference)
Out[168]:

DataFrame: rowsCount = 10, columnsCount = 3

Rk | Date | 3gamePassYds
179 | 2011-09-25 | 1327
222 | 2013-12-15 | 1153
178 | 2011-09-18 | 1139
336 | 2021-01-03 | 1137
221 | 2013-12-08 | 1133
276 | 2017-10-01 | 1132
337 | 2021-09-09 | 1126
243 | 2015-09-27 | 1112
341 | 2021-10-10 | 1112
244 | 2015-10-11 | 1099
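The same relative() trick works for other window sizes and aggregations; a hedged sketch (not in the original) of a 5-game rolling average of passing yards:

dfBrady
    .select { Meta.Rk and Meta.Date and Passing.PassingYds }
    .add("5gameAvgYds") { (relative(-4..0)["PassingYds"] as DataColumn<Int>).mean() }   // average over the current row and the 4 prior rows
    .sortBy { "5gameAvgYds"<Double>().desc() }
    .head(10)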

95. flattening hierarchies¶

In [169]:
// where did Brady have a bigger home-field advantage, New England or Tampa?

dfBrady
    .pivot { Meta.WinLoss } 
        .groupBy { Meta.Tm and Meta.location } 
        .count()
    .flatten()
    .add("winPct") { WinLossRecord("win"<Int>(), "loss"<Int>()).winPct }
    .pivot() { "location"<String>() }
        .groupBy { "Tm"<String>() }
        .values { "winPct"<Double>() }
    .flatten()
    .add("homeAdv") { "home"<Double>() - "away"<Double>() }
    .sortBy { "homeAdv"<Double>().desc() }
    
// win percentage at home in New England was about 17 points higher than when New England was the away team (85% vs 69%)
// in Tampa, the difference was 8 points (68% vs 60%)
Out[169]:

DataFrame: rowsCount = 2, columnsCount = 4

Tm | away | home | homeAdv
NWE | 0.685315 | 0.852113 | 0.166798
TAM | 0.600000 | 0.680000 | 0.080000

96. longest consecutive streak¶

In [170]:
// what is the longest consecutive streak of games where Brady has at least 1 touchdown and zero interceptions (min. 20 pass attempts)?

dfBrady
    .select { Meta.Rk and Meta.Date and Passing.PassingAtt and Passing.PassingTD and Passing.PassingInt }
    .filter { "PassingAtt"<Int>() >= 20 }  // filter out games without 20 attempts
    .add("qualifies") { "PassingTD"<Int>() >= 1 && "PassingInt"<Int>() == 0 } // determine whether a game qualifies
      // looking back 50 games is a "cheat", just a guess that no streak is this long
      // so we get the last 50 games, convert it to a list of booleans, reverse it (so most recent games are first), 
      //    "take" until we hit a false, then measure the list size
    .add("streak") { (relative(-50..0)["qualifies"] as DataColumn<Boolean>).toList().reversed().takeWhile { it == true }.size }
    .filter { "streak"<Int>() > 0 }
    .sortBy { "streak"<Int>().desc() }

// note that we simply filter out games with fewer than 20 pass attempts -- they don't break the streak, we just ignore them

// longest streak = 10 games, ending 2010-12-26
Out[170]:

... showing only top 20 of 160 rows

DataFrame: rowsCount = 160, columnsCount = 7

Rk | Date | PassingAtt | PassingTD | PassingInt | qualifies | streak
175 | 2010-12-26 | 27 | 3 | 0 | true | 10
174 | 2010-12-19 | 24 | 2 | 0 | true | 9
173 | 2010-12-12 | 40 | 2 | 0 | true | 8
172 | 2010-12-06 | 29 | 4 | 0 | true | 7
171 | 2010-11-25 | 27 | 4 | 0 | true | 6
170 | 2010-11-21 | 25 | 2 | 0 | true | 5
203 | 2012-11-22 | 27 | 3 | 0 | true | 5
359 | 2022-10-16 | 40 | 1 | 0 | true | 5
113 | 2007-09-09 | 28 | 3 | 0 | true | 4
120 | 2007-10-28 | 38 | 3 | 0 | true | 4
150 | 2009-10-18 | 34 | 6 | 0 | true | 4
169 | 2010-11-14 | 43 | 3 | 0 | true | 4
188 | 2011-12-04 | 38 | 2 | 0 | true | 4
202 | 2012-11-18 | 35 | 3 | 0 | true | 4
232 | 2014-10-26 | 35 | 5 | 0 | true | 4
244 | 2015-10-11 | 27 | 2 | 0 | true | 4
264 | 2016-10-30 | 33 | 4 | 0 | true | 4
282 | 2017-11-19 | 37 | 3 | 0 | true | 4
307 | 2019-09-22 | 42 | 2 | 0 | true | 4
328 | 2020-11-02 | 40 | 2 | 0 | true | 4
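As an aside, the same streak question can be answered with plain Kotlin collections instead of the relative() lookback, by folding over the qualifying flags in game order; a hedged sketch (assuming the generated row accessors):

val longestStreak = dfBrady
    .sortBy { Meta.Date }
    .filter { Passing.PassingAtt >= 20 }
    .map { Passing.PassingTD >= 1 && Passing.PassingInt == 0 }   // List<Boolean>, one flag per qualifying game
    .fold(0 to 0) { (best, run), qualifies ->
        val next = if (qualifies) run + 1 else 0
        maxOf(best, next) to next
    }.first

longestStreak   // should agree with the 10-game streak found above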

97. correlation¶

In [171]:
// is there a correlation between sacks and interceptions?

dfBrady
    .corr { Passing.PassingInt }.with { Passing.PassingSk }

// doesn't look like it
Out[171]:

DataFrame: rowsCount = 1, columnsCount = 2

column | PassingSk
PassingInt | 0.011552
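corr can also produce a full pairwise correlation matrix rather than a single pair; a hedged sketch (not in the original), run over the numeric columns of the Passing group:

dfBrady.Passing.corr()   // pairwise correlations between all numeric columns in the Passing group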
In [172]:
// let's look by year

dfBrady
    .distinct { Meta.Year }
    .map { it["Year"] }
    .associateWith { yr ->
        dfBrady
            .filter { Meta.Year == yr }
            .corr { Passing.PassingInt }.with { Passing.PassingSk }
            .values { "PassingSk"<Double>() }.first()
    }.let {
        dataFrameOf(
            "year" to it.keys.toList(),
            "corr" to it.values.toList()
        )
    }

// a few years have a significant positive correlation, but plenty of years have a negative one
Out[172]:

... showing only top 20 of 23 rows

DataFrame: rowsCount = 23, columnsCount = 2

year | corr
2000 | NaN
2001 | -0.025321
2002 | -0.150002
2003 | -0.039223
2004 | 0.400064
2005 | 0.280399
2006 | -0.271364
2007 | 0.419982
2008 | NaN
2009 | -0.276026
2010 | 0.125140
2011 | -0.173422
2012 | 0.479397
2013 | 0.344656
2014 | -0.132886
2015 | -0.057716
2016 | 0.090660
2017 | -0.225762
2018 | -0.571649
2019 | 0.247841
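The same per-year calculation can also stay inside the DataFrame API by grouping on Year and aggregating; a hedged sketch (not in the original):

dfBrady
    .groupBy { Meta.Year }
    .aggregate {
        corr { Passing.PassingInt }.with { Passing.PassingSk }
            .values { "PassingSk"<Double>() }.first() into "corr"
    }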

98. plotting¶

In [173]:
// we thought there might be a correlation between sacks and interceptions (i.e. good defenses might generate both)
//   but the coefficient was very close to zero

// we can plot the count of games and get a visual look

// JetBrains recently introduced the `kandy` plotting library (https://kotlin.github.io/kandy/overview.html)
// some other libraries include lets-plot, plotly and kravis

dfBrady.plot {
    statCount2D(Passing.PassingInt, Passing.PassingSk) {
        points {
            x(Stat.x) { 
                axis {
                    name = "Interceptions" 
                    breaks((0..4).toList(), format = "d")
                }
            }
            y(Stat.y) { 
                axis { 
                    name = "Sacks" 
                    breaks((0..7).toList())
                }
            }
            size(Stat.count) {
                scale = continuous(2.0..20.0) //continuous(1..10)
            }
            color = Color.BLUE
        }
    }
    layout {
        size = 720 to 480
    }
}
Out[173]:
[bubble chart: games grouped by Interceptions (x axis, 0-4) and Sacks (y axis, 0-7); point size encodes the count of games]

99. facet plots¶

In [174]:
import java.time.Month

// kandy is still early in development; some advanced features, such as faceted grids, are incomplete and subject to API changes

dfBrady
    .add("month") { if (Meta.Date.month == Month.JANUARY) Month.DECEMBER else Meta.Date.month }
    .add("marginOfVictory") { Meta.OurScore - Meta.TheirScore }
    .select { "month" and "marginOfVictory" and Passing.PassingAtt and Meta.location and Meta.WinLoss and Meta.Opp and Meta.Result and Meta.Year and Meta.Week}
    .plot {
        points {
            x("PassingAtt") { axis.name = "Passing Attempts" }
            y("marginOfVictory"<Int>())
            symbol = Symbol.CIRCLE
            size = 3.5
            color("WinLoss") {
                scale = categorical(listOf(Color.RED, Color.GREEN))
            }
            tooltips("Year", "Week", "Opp", "Result")
        }

        facetGrid("month"<Month>(), "location"<String>(), 
                  scalesSharing = ScalesSharing.FIXED, 
                  xOrder = OrderDirection.DESCENDING,
                  yOrder = OrderDirection.DESCENDING
                 )
        layout {
            theme(Theme.Grey)
            size = 1080 to 540
        }
    }

// do more passing attempts hint at a loss? does it matter what month the game is?
Out[174]:
[faceted scatter grid: marginOfVictory vs Passing Attempts, panel columns by month (September through December), panel rows by location (home/away), points colored by WinLoss (loss/win)]

100. writing to Excel and to HTML¶

In [175]:
// hierarchical data sets encode their sub-columns in JSON objects, 
//   which isn't very useful since Excel doesn't natively read or convert JSON
dfBrady.writeExcel("Brady.xlsx")
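One hedged workaround (not shown in the original) is to flatten the column groups first, so that Excel receives plain columns; the output file name here is just an example:

dfBrady.flatten().writeExcel("BradyFlat.xlsx")   // one flat column per leaf column, no JSON-encoded groups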
In [176]:
// however flat data works as expected
dfBrady.Passing.writeExcel("BradyPassing.xlsx")
In [177]:
// just like with CSV and databases, you can choose a separate library instead of the built-in DataFrame support for reading/writing Excel
// Apache POI is very good, and a number of other libraries use it under the hood
In [178]:
dfBrady.toStandaloneHTML(DisplayConfiguration(rowsLimit = null)).openInBrowser()
In [179]:
// can also write to a file -- documentation is limited but the API is available at:
//   https://github.com/Kotlin/dataframe/blob/master/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/format.kt

Thank you for following along! As you may have seen, Kotlin Dataframe often requires a little trial and error to master its API, especially for more complex operations like grouping, pivoting, hierarchies, and aggregations.

Dataframe is a terrific, powerful tool, and a necessary library if Kotlin is to gain traction among data scientists. It's one additional tool in the toolbox, but certainly not the only one. Dataframe does not make SQL obsolete by any means; databases have very intelligent query engines and can often perform operations much faster (and send far less data over the wire) than pulling all the data down and doing every operation in dataframe. Standard Kotlin collections are also powerful, and many calculations can be done easily with a fold or reduce.

Hope you found this helpful and maybe even learned something! Please feel free to send a comment, submit a PR, or ask a question.

andy [at] cfnine [dot] com

...open to consulting or employment opportunities as well, thanks!