This is the start of a fully functioning JavaScript (actually, TypeScript) SDK. There are two distinct aims with this SDK:
Tap into the large (and relatively underserved, by existing data processing frameworks) community of JavaScript developers with a native SDK targeting this language.
Develop a new SDK which can serve both as a proof of concept and reference that highlights the (relative) ease of porting Beam to new languages, a differentiating feature of Beam and Dataflow.
To accomplish this, we lean heavily on the portability framework. For example, we make heavy use of cross-language transforms, in particular for IOs. In addition, the direct runner is simply an extension of the worker that is suitable for running on portable runners such as the ULR, which will directly transfer to running on production runners such as Dataflow and Flink. The target audience should hopefully not be put off by running other-language code encapsulated in Docker images.
To install and test the TypeScript SDK from source, you will need npm and python. Other requirements can be installed by npm later on.
(Note that Python is a requirement as it is used to orchestrate Beam functionality.)
Clone the Beam repository and change into the typescript directory.
git clone https://github.com/apache/beam
cd beam/sdks/typescript/
Install the necessary packages.
npm install
Run npm run build to transpile the TypeScript files into JS files.
All of the development workflows (build, test, lint, clean, etc.) are defined in package.json and can be run with npm commands (e.g. npm run build).
The wordcount.ts file defines a parameterizable pipeline that can be run against different runners. You can run it from the transpiled .js file like so:
node dist/src/apache_beam/examples/wordcount.js ${PARAMETERS}
To run locally:
node dist/src/apache_beam/examples/wordcount.js --runner=direct
To run against Flink, where the local infrastructure is automatically downloaded and set up:
node dist/src/apache_beam/examples/wordcount.js --runner=flink
To run on Dataflow:
node dist/src/apache_beam/examples/wordcount.js \
--runner=dataflow \
--project=${PROJECT_ID} \
--tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp \
--region=${REGION}
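The commands above all pass pipeline options as --name=value flags. As a rough illustration of how such flags could be collected into an options object, here is a minimal sketch. Note that parseFlags is a hypothetical helper written for this example, not part of the SDK, and the SDK's own option handling may well differ; the project, bucket, and region values are placeholders.

```typescript
// Hypothetical helper: NOT part of the Beam SDK. It only illustrates how
// --name=value flags like those above could be collected into an object.
type Options = Record<string, string>;

function parseFlags(args: string[]): Options {
  const options: Options = {};
  for (const arg of args) {
    // Only arguments of the form --name=value are recognized.
    if (arg.slice(0, 2) !== "--") continue;
    const eq = arg.indexOf("=");
    // Skip bare flags (no "=") and flags with an empty name ("--=x").
    if (eq <= 2) continue;
    // Everything after the first "=" is the value, so values may contain "=".
    options[arg.slice(2, eq)] = arg.slice(eq + 1);
  }
  return options;
}

// The arguments from the Dataflow invocation, with placeholder values.
const parsed = parseFlags([
  "--runner=dataflow",
  "--project=my-project",
  "--tempLocation=gs://my-bucket/wordcount-js/temp",
  "--region=us-central1",
]);
console.log(parsed["runner"]); // prints "dataflow"
```

In a real script the argument list would typically come from the command line rather than a literal array.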
This SDK is a work in progress. In January 2022 we developed the ability to construct and run basic pipelines (including external transforms and running on a portable runner), but the following big-ticket items remain.
Containerization
API
There are several TODOs of minor features or design decisions to finalize.
Consider using (or supporting) 2-arrays rather than {key, value} objects for KVs.
Force the second argument of map/flatMap to be an Object, which would lead to a less confusing API (vs. Array.map) and clean up the implementation. Also add a [do]Filter, and possibly a [do]Reduce?
Move away from using classes.
Advanced features like state, timers, and SDF.
Other
Relative vs. absolute imports, possibly via setting a base URL with a jsconfig.json.
More/better tests, including tests of illegal/unsupported use.
Set channel options like grpc.max_{send,receive}_message_length as we do in other SDKs.
Reduce use of any.
Could use unknown in its place where the type is truly unknown.
It would be nice to enforce this, perhaps by re-enabling noImplicitAny: true in tsconfig, if we can get the generated proto files to be ignored.
Enable a linter like eslint and fix at least the low hanging fruit.
There is probably more; there are many TODOs littered throughout the code.
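Two of the TODO items above, the choice between 2-arrays and {key, value} objects for KVs and the use of unknown in place of any, can be illustrated together. The following is only a minimal sketch: the KV interface and the toKV helper are hypothetical names made up for this example and do not reflect the SDK's actual definitions.

```typescript
// Hypothetical types and helper for illustration; NOT the SDK's definitions.
interface KV<K, V> {
  key: K;
  value: V;
}

// Accepts a value of unknown type and normalizes either candidate shape
// (a 2-array or a {key, value} object) into a KV. Unlike `any`, `unknown`
// forces the shape to be checked before the value can be used.
function toKV(input: unknown): KV<unknown, unknown> {
  if (Array.isArray(input)) {
    if (input.length !== 2) {
      throw new Error("Expected a 2-array [key, value]");
    }
    return { key: input[0], value: input[1] };
  }
  if (
    typeof input === "object" &&
    input !== null &&
    "key" in input &&
    "value" in input
  ) {
    const obj = input as { key: unknown; value: unknown };
    return { key: obj.key, value: obj.value };
  }
  throw new Error("Expected a 2-array or a {key, value} object");
}

const fromArray = toKV(["a", 1]);
const fromObject = toKV({ key: "a", value: 1 });
console.log(
  fromArray.key === fromObject.key,
  fromArray.value === fromObject.value
); // prints "true true"
```

Had the parameter been typed as any, the property accesses would compile without any of the checks, and malformed input would only surface later at runtime.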
Install Node.js, and then run the following from within sdks/typescript:
npm install
npm test
We have adopted prettier, which can be run with:
npx prettier --write .