Introduction
GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.
Usage
% env PARQUET_PATH=... uvicorn graphique.service:app
Open http://localhost:8000/ to try out the API in GraphiQL. There is a test fixture at ./tests/fixtures/zipcodes.parquet.
% env PARQUET_PATH=... strawberry export-schema graphique.service:app.schema
This outputs the GraphQL schema for a parquet data set.
Configuration
Graphique uses Starlette's config: in environment variables or a .env file. Config variables are used as input to a parquet dataset.
- PARQUET_PATH: path to the parquet directory or file
- FEDERATED = '': field name to extend type Query with a federated Table
- DEBUG = False: run service in debug mode, which includes metrics
- COLUMNS = None: list of names, or mapping of aliases, of columns to select
- FILTERS = None: json filter query for which rows to read at startup
For more options create a custom ASGI app. Call graphique's GraphQL on an arrow Dataset, Scanner, or Table. The GraphQL Table type will be the root Query type.
Supply a mapping of names to datasets for multiple roots, and to enable federation.
import pyarrow.dataset as ds
from graphique import GraphQL
source = ds.dataset(...)
app = GraphQL(source) # Table is root query type
app = GraphQL.federated({<name>: source, ...}, keys={<name>: [], ...}) # Tables on federated fields
Start like any ASGI app.
uvicorn <module>:app
Configuration options exist to provide a convenient no-code solution, but are subject to change in the future. Using a custom app is recommended for production usage.
API
types
- Dataset: interface for an arrow dataset, scanner, or table.
- Table: implements the Dataset interface. Adds typed row, columns, and filter fields from introspecting the schema.
- Column: interface for an arrow column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, Datetime, Time, Duration, Base64, String, List, Struct. All columns have a values field for their list of scalars. Additional fields vary by type.
- Row: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single row field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
selection
- slice: contiguous selection of rows
- filter: select rows with simple predicates
- scan: select rows and project columns with expressions
projection
- columns: provides a field for every Column in the schema
- column: access a column of any type by name
- row: provides a field for each scalar of a single row
- apply: transform columns by applying a function
- join: join tables by key columns
aggregation
- group: group by given columns, and aggregate the others
- runs: partition on adjacent values in given columns, transforming the others into list columns
- tables: return a list of tables by splitting on the scalars in list columns
- flatten: flatten list columns with repeated scalars
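The difference between group and runs is easiest to see on a small example. The following is a plain-Python sketch of the two partitioning strategies, not graphique's implementation; the function names, the sample columns, and the choice to collect values into lists in both cases are assumptions made for illustration.

```python
from itertools import groupby


def runs(keys, values):
    """Partition on adjacent equal keys, collecting the other column into lists."""
    pairs = zip(keys, values)
    return [
        (key, [value for _, value in items])
        for key, items in groupby(pairs, key=lambda pair: pair[0])
    ]


def group(keys, values):
    """Group on equal keys regardless of position, collecting the other column into lists."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    return list(groups.items())


# Hypothetical columns: the key "CA" appears in two separate runs.
states = ["CA", "CA", "NY", "CA"]
zipcodes = [90001, 94105, 10001, 95814]

run_result = runs(states, zipcodes)
group_result = group(states, zipcodes)
```

Here runs yields three partitions because the final "CA" is not adjacent to the first two, while group yields two. On data already sorted by the key columns the two coincide, and partitioning on adjacent values needs only a single pass.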
ordering
- sort: sort table by given columns
- rank: select rows with smallest or largest values
Performance
Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy or custom optimizations.
By default, datasets are read on-demand, with only the necessary rows and columns scanned. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify FILTERS in the json filter format to read a subset of rows at startup, trading off memory for latency. An empty filter ({}) will read the whole table.
Specifying COLUMNS will limit memory usage when reading at startup (FILTERS). There is little speed difference as unused columns are inherently ignored. Optional aliasing can also be used for camel casing.
If index columns are detected in the schema metadata, then an initial filter will also attempt a binary search on tables.
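The benefit of a sorted index column can be sketched with the standard library's bisect module: an equality filter becomes a contiguous slice found in logarithmic time instead of a full scan. This illustrates the general technique only, under the assumption that the column is already sorted; the function name and sample data are hypothetical, and graphique's own implementation may differ.

```python
from bisect import bisect_left, bisect_right


def equal_range(sorted_column, value):
    """Return (start, stop) such that sorted_column[start:stop] holds every match."""
    start = bisect_left(sorted_column, value)
    stop = bisect_right(sorted_column, value)
    return start, stop


# Hypothetical index column, already sorted.
zipcodes = [10001, 10001, 60601, 90001, 94105]

start, stop = equal_range(zipcodes, 10001)
matches = zipcodes[start:stop]
```

A value that is absent produces an empty range (start equals stop), so the same slice logic covers both cases.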
Installation
% pip install graphique[server]
Dependencies
- pyarrow
- strawberry-graphql[asgi,cli]
- numpy
- isodate
- uvicorn (or other ASGI server)
Tests
100% branch coverage.
% pytest [--cov]