<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# TypeScript Beam SDK
This is the start of a fully functioning JavaScript (actually, TypeScript) SDK.
There are two distinct aims with this SDK:
1. Tap into the large (and relatively underserved, by existing data processing
frameworks) community of JavaScript developers with a native SDK targeting this language.
1. Develop a new SDK which can serve both as a proof of concept and reference
that highlights the (relative) ease of porting Beam to new languages,
a differentiating feature of Beam and Dataflow.
To accomplish this, we lean heavily on the portability framework.
For example, we make heavy use of cross-language transforms,
in particular for IOs.
In addition, the direct runner is simply an extension of the worker that is
suitable for running on portable runners such as the ULR, so the same worker
code transfers directly to production runners such as Dataflow and Flink.
The target audience should hopefully not be put off by running other-language
code encapsulated in Docker images.
## Getting started
To install and test the TypeScript SDK from source, you will need `npm` and
`python`. Other requirements can be installed by `npm` later on.
(**Note** that Python is a requirement as it is used to orchestrate Beam
functionality.)
1. First you must clone the Beam repository and go to the `typescript` directory.
```
git clone https://github.com/apache/beam
cd beam/sdks/typescript/
```
2. Execute a local install of the necessary packages:
```
npm install
```
3. Then run `npm run build` to transpile TypeScript files into JS files.
### Development workflows
All of the development workflows (build, test, lint, clean, etc.) are defined in
`package.json` and can be run with `npm` commands (e.g. `npm run build`).
### Running a pipeline
The `wordcount.ts` file defines a parameterizable pipeline that can be run
against different runners. You can run it from the transpiled `.js` file
like so:
```
node dist/src/apache_beam/examples/wordcount.js ${PARAMETERS}
```
To run locally:
```
node dist/src/apache_beam/examples/wordcount.js --runner=direct
```
To run against Flink, where the local infrastructure is automatically
downloaded and set up:
```
node dist/src/apache_beam/examples/wordcount.js --runner=flink
```
To run on Dataflow:
```
node dist/src/apache_beam/examples/wordcount.js \
--runner=dataflow \
--project=${PROJECT_ID} \
--tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp --region=${REGION}
```
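The commands above all pass parameters in `--name=value` form. How the SDK itself parses them is not covered here; the following is only a minimal sketch, assuming every flag uses that form, of how such parameters could map onto an options object. `parseFlags` is a hypothetical helper (not part of the SDK), and the project, bucket, and region values are placeholders.

```typescript
// Hypothetical helper (not part of the Beam SDK): parse `--name=value`
// command-line flags into a plain options object.
function parseFlags(argv: string[]): Record<string, string> {
  const options: Record<string, string> = {};
  for (const arg of argv) {
    // Only the `--name=value` form is handled; anything else is ignored.
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (match) {
      options[match[1]] = match[2];
    }
  }
  return options;
}

// Placeholder values standing in for ${PROJECT_ID}, ${GCS_BUCKET}, ${REGION}.
const exampleOptions = parseFlags([
  "--runner=dataflow",
  "--project=my-project",
  "--tempLocation=gs://my-bucket/wordcount-js/temp",
  "--region=us-central1",
]);
console.log(exampleOptions.runner); // prints "dataflow"
```

In a real script the input would typically be `process.argv.slice(2)` rather than a hard-coded array.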
## TODO
This SDK is a work in progress. In January 2022 we developed the ability to
construct and run basic pipelines (including external transforms and running
on a portable runner), but the following big-ticket items remain:
* Containerization
  * Actually use worker threads for multiple bundles
    (unsure if this is a large benefit, mitigated using sibling workers).
* API
  * There are several TODOs of minor features or design decisions to finalize.
    * Consider using (or supporting) 2-arrays rather than {key, value} objects
      for KVs.
    * Force the second argument of map/flatMap to be an Object, which would lead
      to a less confusing API (vs. Array.map) and clean up the implementation.
      Also add a [do]Filter, and possibly a [do]Reduce?
    * Move away from using classes.
  * Advanced features like state, timers, and SDF.
* Other
  * Relative vs. absolute imports, possibly via setting a base URL with a
    `jsconfig.json`.
  * More/better tests, including tests of illegal/unsupported use.
  * Set channel options like `grpc.max_{send,receive}_message_length` as we
    do in other SDKs.
  * Reduce use of `any`.
    * Could use `unknown` in its place where the type is truly unknown.
    * It would be nice to enforce this, perhaps by re-enabling
      `noImplicitAny: true` in tsconfig if we can get the generated proto
      files to be ignored.
  * Enable a linter like eslint and fix at least the low-hanging fruit.
There is probably more; there are many TODOs littered throughout the code.
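Two of the items above, supporting 2-arrays alongside {key, value} objects for KVs and preferring `unknown` over `any`, can be illustrated together. This is a minimal sketch under stated assumptions, not the SDK's implementation: the `KVObject` and `KVPair` types and the `toPair` function are hypothetical names introduced for illustration only.

```typescript
// Hypothetical types for illustration; the SDK's own KV type may differ.
type KVObject<K, V> = { key: K; value: V };
type KVPair<K, V> = [K, V];

// Type guard: narrows an `unknown` value to the {key, value} object form.
function isKVObject(x: unknown): x is KVObject<unknown, unknown> {
  return (
    typeof x === "object" &&
    x !== null &&
    !Array.isArray(x) &&
    "key" in x &&
    "value" in x
  );
}

// Type guard: narrows an `unknown` value to the 2-array form.
function isKVPair(x: unknown): x is KVPair<unknown, unknown> {
  return Array.isArray(x) && x.length === 2;
}

// Accept either representation and normalize it to a 2-array.
// Because the parameter is `unknown` rather than `any`, the compiler
// rejects any property access until one of the guards has narrowed it.
function toPair(x: unknown): KVPair<unknown, unknown> {
  if (isKVPair(x)) return x;
  if (isKVObject(x)) return [x.key, x.value];
  throw new TypeError("expected a [key, value] array or a {key, value} object");
}
```

With `any` in place of `unknown`, a mistaken access such as `x.kee` would compile silently; with `unknown`, the guards are required before `x.key` can be read.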
## Development
### Getting started
Install Node.js, and then from within `sdks/typescript` run:
```
npm install
```
### Running tests
```
npm test
```
### Style
We have adopted Prettier, which can be run with:
```
npx prettier --write .
```