GraphQL API
Types
A typed schema is automatically generated from the Arrow table and its columns. However, advanced usage of tables often creates new columns - or changes the type of existing ones - and therefore falls outside the schema. Fields which create columns also allow aliasing; without an alias, the existing column is replaced.
Output
A column within the schema can be accessed by Table.columns.
{
  columns {
    <name> { ... }
  }
}
Any column can be accessed by name using Dataset.column and inline fragments.
{
  column(name: "...") {
    ... on <Type>Column { ... }
  }
}
Input
Input types don't have the equivalent of inline fragments, but GraphQL is converging on the OneOf input pattern. Effectively the type of the field becomes the name of the field.
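The OneOf idea can be sketched in plain Python: an input object is valid only when exactly one of its fields is set, and the name of that field tells the resolver how to interpret the payload. This is a minimal illustration, not Graphique's implementation; the function resolve_one_of and the field names name and alias are hypothetical.

```python
def resolve_one_of(value: dict) -> tuple:
    """Return the single (field name, payload) pair of a OneOf-style input.

    Hypothetical helper: rejects inputs that set zero fields, several
    fields, or a null payload.
    """
    present = [(name, payload) for name, payload in value.items() if payload is not None]
    if len(present) != 1 or len(value) != 1:
        raise ValueError("a OneOf input must set exactly one non-null field")
    return present[0]


# The chosen key acts as the type tag for the payload.
print(resolve_one_of({"name": "state"}))
print(resolve_one_of({"alias": "total"}))
```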
Dataset.scan has flexible selection and projection.
{
  scan(filter: { ... }, columns: [{ ... }, ...]) { ... }
}
Table.filter provides a friendlier interface for simple queries on columns within the schema.
{
  filter(<name>: { ... }, ...) { ... }
}
Note that list inputs also accept a single value, which is coerced to a list of one item.
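The coercion rule for list inputs comes from GraphQL itself and is handled by the server library, so clients never need to implement it. The sketch below only mirrors the behavior in plain Python for illustration; coerce_list is a hypothetical name.

```python
def coerce_list(value):
    """Mimic GraphQL list input coercion: wrap a lone value in a list.

    Null stays null, and an existing list is passed through unchanged.
    """
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


print(coerce_list("CA"))
print(coerce_list(["CA", "NY"]))
```

In practice this means an argument declared as a list can be written either as a single value or as a bracketed list in the query.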
Batches
Datasets and scanners are processed in batches when possible, instead of loading the table into memory.
- group, scan, and filter - native parallel batch processing
- sort with length
- apply with list functions
- rank
- flatten
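To illustrate why batch processing avoids loading the whole table, here is a rough stand-in using plain Python generators. It is sequential rather than parallel and does not use Arrow; the helpers batches and filter_batches and the sample rows are assumptions made for the example.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def filter_batches(rows, size, predicate):
    """Filter one batch at a time, so only `size` rows are held at once."""
    for batch in batches(rows, size):
        yield [row for row in batch if predicate(row)]


# The source is a generator, so the full table never exists in memory.
rows = ({"id": i, "value": i * i} for i in range(10))
matched = [
    row["id"]
    for batch in filter_batches(rows, 4, lambda row: row["value"] > 20)
    for row in batch
]
print(matched)
```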
Partitions
Partitioned datasets use fragment keys when possible.
- group on fragment keys with counts
- rank and sort with length on fragment keys
Column selection
Each field resolver transforms a table or array as needed. When working with an embedded library like pandas, it's common to select a working set of columns for efficiency. GraphQL, by contrast, has the advantage of knowing the entire query up front, so there is no select field: column selection is done automatically at every level of resolvers.
List Arrays
Arrow ListArrays are supported as ListColumns. group: {aggregate: {list: ...}} and runs leverage that feature to transform columns into ListColumns, which can be accessed via inline fragments and further aggregated. That said, group hash aggregate functions are more efficient than creating lists.
- tables returns a list of tables based on the list scalars.
- flatten flattens the list columns and repeats the scalar columns as needed.
- apply(list: {filter: ..., sort: ..., rank: ...}) applies vector functions to the list scalars.
The lists in use must all have the same value lengths, which is naturally the case when they are the result of grouping. Iterating scalars (in Python) is not ideal, but it can be faster than re-aggregating, depending on the average list size.
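The flatten behavior and the equal-length requirement can be sketched with plain Python lists standing in for Arrow arrays. This is only an illustration of the semantics described above, not the library's code; the flatten function signature and the sample columns are hypothetical.

```python
def flatten(table: dict, list_columns: list) -> dict:
    """Flatten the list columns and repeat scalar columns to match.

    `table` maps column names to lists of equal length; each entry of a
    list column is itself a list.
    """
    lengths = [len(values) for values in table[list_columns[0]]]
    for name in list_columns:
        if [len(values) for values in table[name]] != lengths:
            raise ValueError(f"list column {name!r} has mismatched value lengths")
    result = {}
    for name, column in table.items():
        if name in list_columns:
            result[name] = [item for values in column for item in values]
        else:
            # Repeat each scalar once per item in its row's lists.
            result[name] = [value for value, n in zip(column, lengths) for _ in range(n)]
    return result


# Shaped like the result of grouping by state with list aggregates.
grouped = {
    "state": ["CA", "NY"],
    "city": [["LA", "SF"], ["NYC"]],
    "zip": [[90001, 94101], [10001]],
}
flat = flatten(grouped, ["city", "zip"])
print(flat["state"])
print(flat["city"])
```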
Dictionary Arrays
Arrow has dictionary-encoded arrays as a space optimization, but doesn't natively support some builtin functions on them. Graphique extends support for dictionaries, and is often faster because functions only have to be applied to the unique values.
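The speedup from working on unique values can be shown with a small plain-Python model of a dictionary-encoded array: a list of unique values plus a list of integer indices. The helper names are hypothetical, and the approach as sketched only holds for element-wise functions, where each output depends on a single input value.

```python
def apply_to_dictionary(dictionary, indices, func):
    """Apply func once per unique value and reuse the indices as-is."""
    return [func(value) for value in dictionary], indices


def decode(dictionary, indices):
    """Expand a dictionary-encoded column back into plain values."""
    return [dictionary[i] for i in indices]


calls = []


def upper(value):
    calls.append(value)  # record each invocation to count the work done
    return value.upper()


dictionary, indices = ["ca", "ny"], [0, 1, 0, 0, 1]
new_dictionary, new_indices = apply_to_dictionary(dictionary, indices, upper)
print(decode(new_dictionary, new_indices))
print(len(calls))
```

Here the function runs twice, once per unique value, even though the column holds five rows.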
Nulls
GraphQL continues the long tradition of confusing "optional" with "nullable". Graphique strives to be explicit regarding what may be omitted versus what may be null.
Output
Arrow has first-class support for nulls, so array scalars are nullable. Non-null scalars are used where relevant.
Columns and rows are nullable to allow partial query results. Dataset.optional enables client-controlled nullability.
Input
Default values and non-null types are used wherever possible. When an input is optional and has no natural default, there are two cases to distinguish:
- if null is expected and semantically different, the input's description explains null behavior
- otherwise the input has an @optional directive, and explicit null behavior is undefined
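The difference between an omitted input and an explicit null can be modeled with a sentinel value, a common Python idiom. The sketch below is an assumption about how such a distinction might be expressed, not a description of Graphique's internals; UNSET and input_state are hypothetical names.

```python
UNSET = object()  # sentinel meaning "the client did not supply this input"


def input_state(arguments: dict, name: str) -> str:
    """Classify an input as omitted, explicitly null, or carrying a value."""
    value = arguments.get(name, UNSET)
    if value is UNSET:
        return "omitted"
    if value is None:
        return "null"
    return "value"


print(input_state({}, "state"))
print(input_state({"state": None}, "state"))
print(input_state({"state": "CA"}, "state"))
```

Only the first case is what "optional" means; the second is what "nullable" means, and a resolver that treats them the same cannot honor a semantically meaningful null.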