Publishing Transcripts
Overview
In this article we’ll cover recommended ways to publish transcript databases for use by others. When publishing transcripts, you should do everything you can to prevent them from entering the training data of models (as this may “leak” benchmark datasets). The main mitigations available are:

- Making access to the transcripts authenticated (e.g. S3 or Hugging Face); and
- Encrypting the transcript database files so that if they are republished in an unauthenticated context, crawlers won’t be able to read them.
We’ll cover both of these scenarios in detail below.
Hugging Face
Publishing transcript databases as a Hugging Face Dataset is useful when you want to share with a broader audience. Benefits of using Hugging Face include:
- You can make access private to only your account or organization.
- You can create a Gated Dataset that requires users to provide contact information and, optionally, to abide by a usage agreement and share other information to obtain access.
See the Hugging Face documentation on uploading datasets for details on how to create datasets. For transcript databases, you can just upload the parquet file(s) into the root of the dataset repository.
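If you prefer to upload programmatically, the `huggingface_hub` library provides `HfApi.upload_file`. A minimal sketch (the `publish_transcripts` helper and its argument names are hypothetical illustrations, not part of Scout):

```python
from huggingface_hub import HfApi

def publish_transcripts(parquet_path: str, repo_id: str) -> None:
    """Upload a transcript database parquet file to the root of a dataset repo."""
    api = HfApi()  # authenticates via HF_TOKEN or a cached login
    api.upload_file(
        path_or_fileobj=parquet_path,
        path_in_repo=parquet_path.rsplit("/", 1)[-1],  # keep file in repo root
        repo_id=repo_id,
        repo_type="dataset",
    )

# e.g. publish_transcripts("transcripts.parquet", "account-name/dataset-name")
```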
To access a dataset on Hugging Face:
1. Install the `huggingface_hub` Python package:

   ```bash
   pip install huggingface_hub
   ```

2. Configure credentials, either by setting the `HF_TOKEN` environment variable or via login:

   ```bash
   hf auth login
   ```

3. Refer to your dataset in a scout scan using the `hf://` protocol. For example:

   ```bash
   scout scan scanner.py -T hf://datasets/account-name/dataset-name
   ```
See Encryption below for details on adding encryption to database files as an additional measure of protection from crawlers.
S3
Publishing transcript databases to AWS S3 enables you to configure authenticated access using S3 credentials. S3 buckets support a wide variety of authorization options (see the documentation for further details).
After you have uploaded the parquet file(s) for your transcript database to an S3 bucket, you can refer to it in a scout scan using the s3:// protocol. For example:
```bash
scout scan scanner.py -T s3://my-transcript-databases/database-name
```

See Encryption below for details on adding encryption to database files as an additional measure of protection from crawlers.
Encryption
You can optionally use encryption to provide further protection for transcript databases. To encrypt a database, use the scout db encrypt command, passing it a valid AES encryption key (16, 24, or 32 bytes). For example:
```bash
scout db encrypt /path/to/my/database \
  --output-dir /path/to/my/database-enc \
  --key 0123456789abcdef
```

If you don’t want to include the key in a script, you can pass it via stdin (`--key -`) or via the `SCOUT_DB_ENCRYPTION_KEY` environment variable.
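Since the key is passed as text, its length is simply the character count (the 16-character key above is a 16-byte AES-128 key). A quick way to generate a fresh key, assuming Scout accepts any 16-, 24-, or 32-byte string:

```python
import secrets

# Hex-encode 16 random bytes: the resulting 32-character string
# is 32 bytes when passed as text, i.e. a valid AES-256 key.
key = secrets.token_hex(16)
print(len(key))  # 32
```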
Reading Encrypted Databases
When using an encrypted database during a scan, you should set the SCOUT_DB_ENCRYPTION_KEY environment variable to the appropriate key. For example:
```bash
export SCOUT_DB_ENCRYPTION_KEY=0123456789abcdef
scout scan scanner.py -T /path/to/my/database-enc
```

You can also decrypt the database using the `scout db decrypt` command:
```bash
scout db decrypt /path/to/my/database-enc \
  --output-dir /path/to/my/database \
  --key 0123456789abcdef
```

Limitations
Scout uses DuckDB Parquet Encryption to implement encryption. While this will provide additional protection for data, there are some drawbacks:
- It is not currently compatible with the Parquet encryption implemented by other libraries (e.g. PyArrow), so encrypted Parquet files are currently only readable with DuckDB.
- Compression ratios for encrypted Parquet files are much lower than for unencrypted ones (e.g. database files might be 5–8 times larger).
- Read performance may be slightly slower due to decryption (though this is unlikely to matter, as most scanning time is spent on inference rather than reading).