60 changes: 35 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@

# doccano

[![Codacy Badge](https://app.codacy.com/project/badge/Grade/35ac8625a2bc4eddbff23dbc61bc6abb)](https://www.codacy.com/gh/doccano/doccano/dashboard?utm_source=github.com&utm_medium=referral&utm_content=doccano/doccano&utm_campaign=Badge_Grade)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/35ac8625a2bc4eddbff23dbc61bc6abb)](https://www.codacy.com/gh/doccano/doccano/dashboard?utm_source=github.com&utm_medium=referral&utm_content=doccano/doccano&utm_campaign=Badge_Grade)
[![doccano CI](https://github.com/doccano/doccano/actions/workflows/ci.yml/badge.svg)](https://github.com/doccano/doccano/actions/workflows/ci.yml)

doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence to sequence tasks. You can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.
@@ -149,8 +149,8 @@ docker-compose -f docker/docker-compose.prod.yml --env-file .env up

## Prerequisites

* Docker Desktop (or Docker Engine)
* Bash (for `tools/local.sh`)
- Docker Desktop (or Docker Engine)
- Bash (for `tools/local.sh`)

## 1) Configure environment

@@ -168,9 +168,9 @@ ADMIN_EMAIL=admin@example.com
tools/local.sh full
```

* App: [http://127.0.0.1/](http://127.0.0.1/)
* Uses `docker/docker-compose.local.yml` with your local `backend/` + `frontend/` sources.
* An admin user and roles from the env above will be created.
- App: [http://127.0.0.1/auth](http://127.0.0.1/auth)
- Uses `docker/docker-compose.local.yml` with your local `backend/` + `frontend/` sources.
- An admin user and roles from the env above will be created.

## 3) Common dev loops

@@ -203,26 +203,26 @@ See notes in tools/local.sh for full documentation.

## Notes

* `tools/local.sh` auto-detects `docker compose` vs `docker-compose`.
* After `clean`/`purge`/`purge-all`, run `tools/local.sh full` to recreate the DB and admin.
* Troubleshooting:
- `tools/local.sh` auto-detects `docker compose` vs `docker-compose`.
- After `clean`/`purge`/`purge-all`, run `tools/local.sh full` to recreate the DB and admin.
- Troubleshooting:

```bash
tools/local.sh ps
tools/local.sh logs-backend
tools/local.sh logs-nginx
```


### One-click Deployment

| Service | Button |
|---------|---|
| AWS[^1] | [![AWS CloudFormation Launch Stack SVG Button](https://cdn.rawgit.com/buildkite/cloudformation-launch-stack-button-svg/master/launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=doccano&templateURL=https://doccano.s3.amazonaws.com/public/cloudformation/template.aws.yaml) |
| Heroku | [![Deploy](https://www.herokucdn.com/deploy/button.svg)](https://dashboard.heroku.com/new?template=https%3A%2F%2Fgithub.com%2Fdoccano%2Fdoccano) |
<!-- | GCP[^2] | [![GCP Cloud Run PNG Button](https://storage.googleapis.com/gweb-cloudblog-publish/images/run_on_google_cloud.max-300x300.png)](https://console.cloud.google.com/cloudshell/editor?shellonly=true&cloudshell_image=gcr.io/cloudrun/button&cloudshell_git_repo=https://github.com/doccano/doccano.git&cloudshell_git_branch=CloudRunButton) | -->
| Service | Button |
| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- |
| AWS[^1] | [![AWS CloudFormation Launch Stack SVG Button](https://cdn.rawgit.com/buildkite/cloudformation-launch-stack-button-svg/master/launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=doccano&templateURL=https://doccano.s3.amazonaws.com/public/cloudformation/template.aws.yaml) |
| Heroku | [![Deploy](https://www.herokucdn.com/deploy/button.svg)](https://dashboard.heroku.com/new?template=https%3A%2F%2Fgithub.com%2Fdoccano%2Fdoccano) |
| <!-- | GCP[^2] | [![GCP Cloud Run PNG Button](https://storage.googleapis.com/gweb-cloudblog-publish/images/run_on_google_cloud.max-300x300.png)](https://console.cloud.google.com/cloudshell/editor?shellonly=true&cloudshell_image=gcr.io/cloudrun/button&cloudshell_git_repo=https://github.com/doccano/doccano.git&cloudshell_git_branch=CloudRunButton) | --> |

> [^1]: (1) EC2 KeyPair cannot be created automatically, so make sure you have an existing EC2 KeyPair in one region. Or [create one yourself](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). (2) If you want to access doccano via HTTPS in AWS, here is an [instruction](https://github.com/doccano/doccano/wiki/HTTPS-setting-for-doccano-in-AWS).

<!-- > [^2]: Although this is a very cheap option, it is only suitable for very small teams (up to 80 concurrent requests). Read more on [Cloud Run docs](https://cloud.google.com/run/docs/concepts). -->

## FAQ
@@ -260,25 +260,26 @@ Here are some tips might be helpful. [How to Contribute to Doccano Project](http

For help and feedback, feel free to contact [the author](https://github.com/Hironsan).


## Build your own container
from root dir `doccano/` run

from root dir `doccano/` run

- `docker build --no-cache --progress=plain --file ./docker/Dockerfile.prod --platform=linux/amd64 -t doccano:be_20240813 ./`
- `docker build --no-cache --progress=plain --file ./docker/Dockerfile.nginx --platform=linux/amd64 -t doccano:fe_20240813 ./`


test:
`docker build --no-cache --progress=plain -t doccano:20230911 ./docker/docker-frontend/ &> build.log`


## Run in Docker compose

from the `/` root folder:

- sudo docker-compose -f docker/docker-compose.prod.yml ps
- sudo docker-compose -f docker/docker-compose.prod.yml up -d
- docker-compose -f docker/docker-compose.prod.yml --env-file .env up (not tried yet)


## CREATE AWS ENVIRONMENT:

This was done in us-east-1 (N. Virginia) with the base name `doccano`, so for instance `doccano-vpc`, `doccano-sg`, etc.

- Create Secrets
@@ -293,6 +294,7 @@ Doing this in us-east-1 - Virginia and used the base name `doccano`, so for inst
- Update SSL Cert and listeners

### Populate Secrets needed by the EC2

- add useful secrets to secrets manager:
- quay_io_creds (quay.io login creds)
```
@@ -322,24 +324,29 @@ Doing this in us-east-1 - Virginia and used the base name `doccano`, so for inst
```

### Create VPC

Select the following options:

- VPC and more
- pick your CIDR block and name (`doccano-vpc`)
- 2 availability zones, 2 private, 2 public subnets
- only 1 NAT gateway (in 1 availability zone). We deploy only in that one zone, since this application doesn't need to be fault tolerant and can have some downtime. We still create 2 zones so that it will be easier to add a second NAT gateway later if we decide to.
- Leave the other options as they are

### Create Security Group for ALB

- `doccano-alb-sg`
- select `doccano-vpc`
- create a security group for the ALB that allows inbound traffic from anywhere on ports 80 and 443
- Maybe restrict to UChicago IPs for now?

### Create Target Group
- create target group for ALB (type instances, name `doccano-target-group`, protocol 80, http1, health check: /)

- create target group for ALB (type instances, name `doccano-target-group`, protocol 80, http1, health check: /)
- Create button, add instances later

### Create ALB

- name (`doccano-alb`)
- internet facing
- ipv4
@@ -350,6 +357,7 @@ Select the following options:
- click on create

### Create RDS psql instance

- psql
- 13.13
- production
@@ -362,15 +370,16 @@ Select the following options:
- doccano vpc
- force to create a new DB subnet group
- doccano vpc default sec group
-
-

### Create EC2 instance

- name= doccano-ec2-20240813-1, Amazon Linux 2023 AMI, t3.medium
- select key pair
- select doccano VPC, private subnet
- No public IP, existing security group 'default' for the doccano VPC
- 40gb gp3
- previously created role ec2_secrets_manager_role ( or create it if not created previously. Give EC2 read permission on the secrets manager with policy SecretsManagerReadWrite and trust relationship to EC2, as well as CloudWatchLogsFullAccess to push the docker logs to cloudwatch) as instance profile.
- previously created role ec2_secrets_manager_role ( or create it if not created previously. Give EC2 read permission on the secrets manager with policy SecretsManagerReadWrite and trust relationship to EC2, as well as CloudWatchLogsFullAccess to push the docker logs to cloudwatch) as instance profile.
- update ec2_user_data.sh script
- click on create

@@ -380,6 +389,7 @@ Select the following options:
- wait for it to be healthy.

### Update SSL Cert and listeners

- get a cert from certificate manager for the ALB
- update the DNS provider with the CNAME of the ALB
- add certificate to the HTTPS 443 listener in the ALB and point to the target group
@@ -392,8 +402,8 @@ Select the following options:
- Remove old instance from the target group
- Terminate old instance


# Debug psql: install the client on the machine when connecting over SSH from the bastion host

- sudo dnf search postgresql
- sudo dnf install -y postgresql15
-

4 changes: 4 additions & 0 deletions annotations/.gitignore
@@ -0,0 +1,4 @@
# Ignore every file in this directory
*
# Except for this .gitignore file
!.gitignore
9 changes: 8 additions & 1 deletion backend/config/urls.py
@@ -25,6 +25,7 @@
from django.views.static import serve
from drf_yasg import openapi
from drf_yasg.views import get_schema_view
from rest_framework import permissions

schema_view = get_schema_view(
    openapi.Info(
@@ -34,9 +35,11 @@
        license=openapi.License(name="MIT License"),
    ),
    public=True,
    permission_classes=(permissions.IsAuthenticated,),
)

urlpatterns = []

if settings.DEBUG or os.environ.get("STANDALONE", False):
    static_dir = Path(__file__).resolve().parent.parent / "client" / "dist"
    # For showing images and audios in the case of pip and Docker.
@@ -67,6 +70,10 @@
    path("v1/projects/<int:project_id>/", include("examples.urls")),
    path("v1/projects/<int:project_id>/", include("labels.urls")),
    path("v1/projects/<int:project_id>/", include("label_types.urls")),
    path("v1/projects/", include("projects.urls")),
    path("swagger/", schema_view.with_ui("swagger", cache_timeout=0), name="schema-swagger-ui"),
    re_path("", TemplateView.as_view(template_name="index.html")),
    re_path(
        r"^(?!media/)(?!static/).*",
        TemplateView.as_view(template_name="index.html"),
    ),
]
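
The new catch-all pattern relies on negative lookaheads so that paths under `media/` and `static/` are no longer answered with `index.html`. A minimal sketch of how the pattern behaves, using only the standard `re` module. The helper name `falls_back_to_index` and the sample paths are made up for illustration; the sketch assumes, as Django does, that the path is matched without its leading slash.

```python
import re

# Same pattern as in the diff: match any path that does not start with
# "media/" or "static/".
SPA_FALLBACK = re.compile(r"^(?!media/)(?!static/).*")


def falls_back_to_index(path: str) -> bool:
    """Return True if the catch-all route would match this path."""
    return SPA_FALLBACK.search(path) is not None


# Hypothetical sample paths, written without the leading slash.
for sample in ["projects/1/dataset", "", "media/logo.png", "static/app.js", "mediafiles/x"]:
    print(f"{sample!r}: {falls_back_to_index(sample)}")
```

Note that only the exact prefixes `media/` and `static/` are excluded; a path such as `mediafiles/x` still falls through to the template view.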
21 changes: 10 additions & 11 deletions backend/data_import/celery_tasks.py
@@ -231,8 +231,6 @@ def check_uploaded_files(upload_ids: List[str], file_format: Format):


@shared_task(
    # Retries only for likely-transient infra issues.
    # We intentionally DO NOT retry for data/validation/constraint errors.
    autoretry_for=(OperationalError, ConnectionError, TimeoutError, SoftTimeLimitExceeded),
    retry_backoff=2,
    retry_jitter=True,
@@ -258,17 +256,15 @@ def import_dataset(user_id, project_id, file_format: str, upload_ids: List[str],
    project = get_object_or_404(Project, pk=project_id)
    user = get_object_or_404(get_user_model(), pk=user_id)

    # Discover max label length constraint for this project's label model (if any).
    label_max_len = get_label_text_max_length_for_project(project)

    try:
        # Build format adapter (e.g., JSONL, CSV, etc.).
        fmt = create_file_format(file_format)

        # Validate file size/MIME and clean the upload IDs.
        upload_ids, errors = check_uploaded_files(upload_ids, fmt)
        upload_ids, upload_errors = check_uploaded_files(upload_ids, fmt)
        upload_errors = [e.dict() if hasattr(e, "dict") else e for e in upload_errors]

        # Convert remaining uploads into pipeline FileName objects.
        temporary_uploads = TemporaryUpload.objects.filter(upload_id__in=upload_ids)
        filenames = [
            FileName(
@@ -281,9 +277,12 @@ def import_dataset(user_id, project_id, file_format: str, upload_ids: List[str],

        # Pre-flight
        preflight_errors = _preflight_files(fmt, filenames, project, kwargs, label_max_len)
        if preflight_errors:

        # FIX: combine both error sources instead of discarding upload_errors.
        combined_errors = upload_errors + preflight_errors
        if combined_errors:
            # Stop early; nothing written to DB yet.
            return {"error": preflight_errors}
            return {"error": combined_errors}

        # Build dataset object using the pipeline.
        dataset = load_dataset(task, fmt, filenames, project, **kwargs)
@@ -292,12 +291,12 @@ def import_dataset(user_id, project_id, file_format: str, upload_ids: List[str],
        with transaction.atomic():
            dataset.save(user, batch_size=settings.IMPORT_BATCH_SIZE)

        # Move upload after succes
        # Move upload after success
        upload_to_store(temporary_uploads)

        # Normalize and return any non-fatal dataset errors (usually empty).
        errors.extend(getattr(dataset, "errors", []))
        return {"error": [e.dict() if hasattr(e, "dict") else e for e in errors]}
        dataset_errors = [e.dict() if hasattr(e, "dict") else e for e in getattr(dataset, "errors", [])]
        return {"error": dataset_errors}

    except FileImportException as e:
        # Known import error type from the pipeline: return in the same shape.
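
The change to `import_dataset` normalizes the upload errors up front and merges them with the pre-flight errors before deciding whether to stop. A minimal, self-contained sketch of that pattern follows. `FakeError`, `normalize`, and `collect_blocking_errors` are hypothetical stand-ins and not part of doccano's code, and unlike the diff the sketch normalizes both lists for symmetry.

```python
from typing import Any, Dict, List


class FakeError:
    """Hypothetical stand-in for an error object that exposes .dict()."""

    def __init__(self, filename: str, message: str) -> None:
        self.filename = filename
        self.message = message

    def dict(self) -> Dict[str, str]:
        return {"filename": self.filename, "message": self.message}


def normalize(errors: List[Any]) -> List[Any]:
    """Convert objects with a .dict() method to plain dicts; pass others through."""
    return [e.dict() if hasattr(e, "dict") else e for e in errors]


def collect_blocking_errors(upload_errors: List[Any], preflight_errors: List[Any]) -> List[Any]:
    """Merge both error sources so that neither is silently dropped."""
    return normalize(upload_errors) + normalize(preflight_errors)


combined = collect_blocking_errors(
    [FakeError("a.jsonl", "file too large")],
    [{"filename": "b.jsonl", "message": "label too long"}],
)
if combined:
    # Mirrors the early return in the task: stop before anything is written.
    print({"error": combined})
```

Keeping the merge in one place makes it harder to report only one of the two lists, which is the situation the `FIX` comment in the diff describes.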
22 changes: 11 additions & 11 deletions backend/poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions backend/pyproject.toml
@@ -64,6 +64,7 @@ filetype = "^1.0.10"
flower = "^1.2.0"
django-allauth = "^0.52.0"
pydantic = "^2.0.3"
requests = ">=2.28,<3.0"
environs = "^14.5.0"
django-polymorphic = "^4.9.0"
drf-yasg = "^1.21.11"
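
The new `requests` entry uses an explicit range, while the neighbouring entries use Poetry's caret form. A rough sketch of how a caret constraint expands into a range, covering only plain `MAJOR.MINOR.PATCH` versions; `caret_range` is a hypothetical helper written for this illustration, not a Poetry API.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def caret_range(spec: str) -> Tuple[Version, Version]:
    """Expand '^X.Y.Z' into (inclusive lower bound, exclusive upper bound)."""
    major, minor, patch = (int(part) for part in spec.lstrip("^").split("."))
    lower = (major, minor, patch)
    # The caret allows changes that keep the left-most non-zero component fixed.
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return lower, upper


for name, spec in [("flower", "^1.2.0"), ("django-allauth", "^0.52.0"), ("pydantic", "^2.0.3")]:
    lower, upper = caret_range(spec)
    print(f"{name}: >={'.'.join(map(str, lower))},<{'.'.join(map(str, upper))}")
```

Under this reading, `^0.52.0` is much narrower than `^1.2.0`, whereas the explicit `>=2.28,<3.0` spells out both bounds directly.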
3 changes: 2 additions & 1 deletion docker/docker-compose.local.yml
@@ -1,6 +1,5 @@
version: "3.7"
services:

  backend:
    build:
      context: ../backend
@@ -25,6 +24,8 @@ services:
      DJANGO_SETTINGS_MODULE: "config.settings.production"
    depends_on:
      - postgres
    ports:
      - "8000:8000"
    networks:
      - network-backend
      - network-frontend