# (Yet another) LLM PDF Summarizer 📝
Here's an extensible and production-ready PDF summarizer that you can run anywhere! The frontend uses Streamlit, which communicates with a FastAPI backend powered by Hamilton. You give it a PDF file via the browser app and it returns a text summary generated using the OpenAI API. If you prefer, you can skip the browser interface and access the `/summarize` endpoint directly with your document! Everything is containerized using Docker, so you should be able to run it wherever you please 🏃.
## Why build this project?
This project shows how easy it is to productionize Hamilton. Its function-centric declarative approach makes the code easy to read and extend. We invite you to clone the repo and customize it to your needs! We are happy to help you via [Slack](https://hamilton-opensource.slack.com/join/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) and are excited to see what you build 😁
Here are a few ideas:
- Modify the Streamlit `file_uploader` to allow sending batches of files through the UI
- Add PDF parsing and preprocessing to reduce the number of tokens sent to OpenAI
- Add Hamilton functions to gather metadata (file length, number of tokens, language, etc.) and return it via `SummaryResponse`
- Support other file formats; use the `@config.when()` decorator to add alternatives to the `raw_text()` function for PDFs
- Extract structured data from PDFs using open source models from the HuggingFace Hub
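As an illustration of the second idea above, here is a minimal sketch of a preprocessing step that splits extracted text into bounded chunks before sending it to OpenAI. The whitespace-based word count is a rough stand-in for real token counting (a library like `tiktoken` would be more accurate), and `chunk_text` is a hypothetical helper, not part of this repo:

```python
def chunk_text(text: str, max_words: int = 500) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    A crude approximation of token-based chunking: words are a rough proxy
    for tokens, which is enough to illustrate the preprocessing idea.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk could then be summarized independently (or map-reduced into a final summary), keeping every request under the model's context limit.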
![](./backend/summarization_module.png)
*The Hamilton execution DAG powering the backend*
# Setup
1. Clone this repository `git clone https://github.com/dagworks-inc/hamilton.git`
2. Move to the directory `cd hamilton/examples/LLM_Workflows/pdf_summarizer`
3. Create a `.env` file (next to `README.md` and `docker-compose.yaml`) and add your OpenAI API key to it, e.g. `OPENAI_API_KEY=YOUR_API_KEY`
4. Build the Docker images: `docker compose build`
5. Start the Docker containers: `docker compose up -d`
6. Go to [http://localhost:8080/docs](http://localhost:8080/docs) to see if the FastAPI server is running
7. Go to [http://localhost:8081/](http://localhost:8081/) to view the Streamlit app
8. If you make changes, rebuild the Docker images with `docker compose up -d --build`.
9. To stop the containers, run `docker compose down`.
10. To view the logs, use your Docker desktop application, or run `docker compose logs -f` to tail them (Ctrl+C to stop).
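Once the containers are up, you can also hit the `/summarize` endpoint programmatically instead of going through the Streamlit UI. Below is a stdlib-only sketch; the multipart field name `pdf_file` is an assumption on our part, so check the request schema at [http://localhost:8080/docs](http://localhost:8080/docs) for the exact name:

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body and content type for one file upload."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def summarize(path: str, url: str = "http://localhost:8080/summarize") -> str:
    """POST a PDF to the summarizer and return the raw JSON response body."""
    with open(path, "rb") as f:
        # "pdf_file" is an assumed field name -- verify it against /docs.
        body, content_type = build_multipart("pdf_file", path, f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

With the server running, `summarize("my_document.pdf")` should return the JSON summary response.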
## Connecting to DAGWorks
1. Create a DAGWorks account at www.dagworks.io - follow the instructions to set up a project.
2. Add your DAGWorks API Key to the `.env` file. E.g. `DAGWORKS_API_KEY=YOUR_API_KEY`
3. Uncomment `dagworks-sdk` in `requirements.txt`.
4. Uncomment the lines in `server.py` to replace `sync_dr` with the DAGWorks Driver.
5. Rebuild the docker images `docker compose up -d --build`.
# Running on Spark!
Yes, that's right, you can also run the exact same code on Spark! It's just a one-line
code change. See the [run_on_spark README](run_on_spark/README.md) for more details.