Participating as a Data Preparer

Slingshot v3 aims to engage individuals and teams of Data Preparers (DPs) who own the process of downloading datasets, processing them, and sharing the generated CAR files with eligible SPs for deal making. If you participated in past versions of Slingshot and want to learn more about this iteration, start with the Program Details.

In order to participate as a DP:

  1. Register (top right) on this website
  2. Finish the sign-up flow by updating the required Account Details

Selecting datasets to prepare

Once you're signed up, you should be able to browse available datasets on the website. Pick up to 3 datasets that you are interested in preparing for Slingshot SPs to store. The metadata shared for each dataset should include the region in which it is currently hosted, its size, how many files it contains, and how to obtain it (e.g., an AWS S3 bucket). Use these details to choose datasets that you can obtain and process most efficiently.

Note that once you claim a particular dataset, no other DP can claim it. If you release your claimed dataset, you will not be able to claim it again unless an admin overrides this. You can request an admin to override this by creating an issue.

Each dataset is associated with a slug name, e.g. coco-common-objects-in-context-fastai-datasets. You can check the slug name of your claimed datasets under My Claimed Datasets. You'll see how this is used below during dataset preparation.

Preparation process

As you go through the preparation process for each dataset, please remember to update its Preparation Progress in the website. The Slingshot team will be keeping track of progress across datasets and may remove you from a claimed dataset if there is no update on it for > 4 weeks. The preparation states tracked in the website are:

  • Not started (default)
  • Downloading dataset
  • CAR generation in progress
  • CAR generation complete

Preparation tool

The tool used for dataset preparation is singularity. It will properly split the dataset (or the files within it), construct the IPLD structure for the CAR files, and generate the corresponding manifest files, which need to be uploaded later.

Daemon or Standalone

There are two ways to use the tool:

  • The Daemon version comes with a better management experience: you can manage all your dataset preparation requests, including pausing, resuming, and retrying. This is the recommended option if you are preparing a dataset larger than 10TiB, or if you're relying on an unreliable data source, such as an S3 FUSE mount.
  • The Standalone version is easier to set up and is more suitable for those who want to prepare smaller datasets or datasets that have already been downloaded.

Quick Start

# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
nvm install 16
# Install Singularity
npm i -g @techgreedy/singularity
# Use standalone version
singularity-prepare -h
# Use daemon version
singularity init
singularity daemon
singularity prep -h

Hardware requirements

Each job will take 1-2 CPU cores and 50-100MB/s of disk I/O. To control parallelism:

  • For the standalone version, use the -j flag
  • For the daemon version, set deal_preparation_worker.num_workers in the config file (~/.singularity/default.toml); see the sketch below
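
A minimal sketch of both options, purely for illustration (the positional arguments for singularity-prepare and the exact TOML layout are assumptions; confirm them with singularity-prepare -h and the generated config file):

# Standalone: limit parallelism with -j (here, 4 concurrent jobs; the dataset name and paths are placeholders)
singularity-prepare -j 4 slug-name ./slug-name ./car_dir
# Daemon: set the worker count in ~/.singularity/default.toml before starting the daemon, e.g.
#   [deal_preparation_worker]
#   num_workers = 8
singularity daemon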

RAM usage is negligible. However, if you use the daemon version, the built-in database engine may take up to 80% of your system memory. To change this behavior, you'll need to point the daemon service to your own database engine.

Some datasets include lots of small files, in which case you may need to increase the Linux open file limit, e.g. ulimit -Hn 100000
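
For example, a quick way to check and raise the limit for the current shell session (persistent limits are usually configured in /etc/security/limits.conf):

# Show the current soft and hard open-file limits
ulimit -Sn
ulimit -Hn
# Raise the soft limit for this shell session (it cannot exceed the hard limit)
ulimit -n 100000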

Prepare and download at the same time

You can use geesefs or goofys to mount a remote S3 bucket to a local filesystem. geesefs is slightly slower when scanning the directory but is more stable when dealing with a large number of directory entries.

This will allow you to prepare the dataset without downloading it beforehand; however, the speed is limited by your Internet connection. This is the recommended way to prepare datasets larger than 10TiB. Below are a few tips to get the best experience and performance, followed by a short example sketch.

  1. Always use the daemon version so any network I/O error can be retried with singularity prep retry
  2. Always use the --tmp-dir flag when creating the dataset preparation request; this downloads the files to a temporary path first. This can boost speed by 2x if the local drive is significantly faster than your Internet connection. This flag is unnecessary if the dataset is already downloaded.
  3. Use a much higher number of workers, such as 32 to 128, to saturate your Internet bandwidth. With geesefs, you'll need to tune --memory-limit to a higher value to allow more readahead for all workers; otherwise you'll see a cannot allocate memory error.
  4. Use the forked repo from the link above. The patch in the fork handles some special cases for public datasets.
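
A short sketch putting these tips together, purely as an illustration (the bucket name, mount point, temporary directory, and --memory-limit value are placeholders to adapt to your setup):

# Mount the bucket with a larger readahead budget to support many workers
geesefs --memory-limit 16000 --endpoint https://s3.amazonaws.com example-bucket ~/slug-name/example-bucket
# Create the preparation request, downloading to a temporary directory on a fast local drive
singularity prep create -t /mnt/fast-ssd/tmp slug-name ~/slug-name ~/car_dir
# If generations fail with transient network I/O errors, retry them (see available options)
singularity prep retry -h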

Preparation Requirements

To ensure the data prepared is consumable by future applications such as compute over data, the data preparer needs to follow the practices below to ensure data model consistency:

  1. Always target a 32GiB sector size, which is the default
  2. Avoid manually slicing the dataset into multiple subfolders - this is prone to errors such as having different root paths for different slices
  3. Cover all files of the dataset before claiming completion - we may check your manifest file against our scanned records, and we may also check the CIDs or perform a retrieval to validate integrity
  4. All failed CAR generations need to be retried until they are complete
  5. If you plan to download the dataset first or already have it downloaded, make sure you verify the completeness of the download by running aws s3 sync --no-sign-request --delete s3://<bucket> <destination>. This is because aws s3 creates temporary files like data.gz.6CCeb4D4 during download and you don't want to include them in the dataset. It may also skip failed downloads silently.
  6. Always use the parent of the mounted dataset as the root. This allows us to recognize the name of the S3 bucket for the files, which is especially important when a single dataset contains multiple S3 buckets. Use the following as an example: assume the dataset contains two buckets, s3://dataset-example-a and s3://dataset-example-b/prefix-b. You would create the folder structure shown below:
    mkdir -p slug-name/dataset-example-a
    # Mount the S3 bucket to the folder of the same name under the root
    geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com dataset-example-a slug-name/dataset-example-a
    mkdir -p slug-name/dataset-example-b/prefix-b
    # Similar to above, but with an S3 bucket that comes with a prefix
    geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com dataset-example-b:prefix-b slug-name/dataset-example-b/prefix-b
    # Create a preparation request using <tmp-dir> as the temporary folder, <slug-name> as displayed in your Slingshot V3 portal, the root of the dataset, and <out-dir> for saving CAR files
    singularity prep create -t <tmp-dir> <slug-name> ./slug-name <out-dir>
    
    If you already have the dataset downloaded, you'll need to manually create a directory and symlink the downloaded folders to construct the same directory hierarchy
    mkdir -p ./slug-name
    # Assume the bucket has been downloaded to ./downloaded/dataset_a; link it to a folder with the same name as the S3 bucket under slug-name
    ln -s ./downloaded/dataset_a ./slug-name/dataset-example-a
    mkdir -p ./slug-name/dataset-example-b
    # Similar to above, but with a prefix
    ln -s ./downloaded/dataset_b/prefix-b ./slug-name/dataset-example-b/prefix-b
    # Note the <tmp-dir> is not needed for an already-downloaded dataset
    singularity prep create <slug-name> ./slug-name <out-dir>
    

A typical flow for preparing a dataset with a single S3 bucket is shown below:

# Use daemon version for better management
singularity init
# Edit ~/.singularity/default.toml and set deal_preparation_worker.num_workers to >=64
singularity daemon
# Mount the dataset to the local filesystem to avoid downloading it beforehand
mkdir -p ~/slug-name/s3-bucket-name
geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com s3-bucket-name ~/slug-name/s3-bucket-name
# Prepare a temporary directory for downloading the dataset; it won't consume more than 32GiB x num_workers
mkdir -p ~/tmp_dir
# Prepare a directory to store the CAR files
mkdir -p ~/car_dir
# Make a preparation request; note we are using the parent of the mounted bucket (~/slug-name) as the dataset root
singularity prep create -t ~/tmp_dir slug-name ~/slug-name ~/car_dir
# Check the status of preparation request. The daemon will start scanning the folder and generating CAR files at the same time
# scanningStatus - indicates whether the initial folder scan has completed. This may take minutes to days depending on the complexity of the dataset
# generationTotal - the total number of CAR files that will be generated
# generationActive - the number of CAR files waiting to be generated
# generationPaused - the number of generations paused by using singularity prep pause
# generationCompleted - the number of CAR files that have been generated
# generationError - the number of generation requests that encountered an error
singularity prep list
# To check the error message of failed generations
singularity prep status slug-name
# If the error seems retryable, such as i/o error caused by network mount, you may retry using
singularity prep retry -h
# If you mess up and would like to start from scratch
singularity prep remove --purge slug-name

Common issues, questions and bug reports

Check the singularity repo for known common issues.

If you encounter any other issues using the tools or need help troubleshooting, feel free to ask in the #large-clients-tooling Filecoin Slack channel or report a bug

Uploading CAR file metadata

After a dataset has been prepared, its manifest files need to be uploaded to Web3.Storage.

  • For datasets prepared with the daemon version, use upload-manifest-daemon.sh to upload the manifest
    export WEB3_STORAGE_TOKEN=eyJ...
    ./upload-manifest-daemon.sh slug-name slug-name
    
  • For datasets prepared with the standalone version, use upload-manifest-standalone.sh to upload the manifest
    export WEB3_STORAGE_TOKEN=eyJ...
    ./upload-manifest-standalone.sh out-dir slug-name
    

Getting deals on-chain

In order for your Slingshot data preparation to be eligible for rewards, corresponding deals must be made by the Slingshot Deal Engine with participating SPs. SPs who are able to obtain your CAR files can initiate dealmaking directly with the Slingshot Deal Engine. Your prepared CAR files can be obtained by SPs in two ways:

  • you can distribute your CAR files to SPs
  • SPs can retrieve pieces from other SPs (after initial replicas have been stored)

Distributing CAR files

For at least the first few copies, DPs are recommended to:

  • host CAR files somewhere where SPs can download them
  • send SPs CAR files directly over-the-wire (see the sketch below)
  • send SPs CAR files directly offline, e.g., via shipping drives
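
For the over-the-wire option, one possible approach (purely illustrative, assuming the SP gives you SSH access to a receiving host; the host and paths are placeholders) is rsync:

# Copy the generated CAR files to an SP-provided host over SSH, resuming partial transfers
rsync -av --partial --progress ~/car_dir/*.car sp-user@sp-host.example.com:/data/incoming/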

DPs choosing to host CAR files for SPs to download have several options. One simple path is to run a basic HTTP server:

sudo apt install nginx

Modify /etc/nginx/sites-available/default and add the lines below:

server {
  ...
  location / {
    root /home/user/car_dir;
  }
  ...
}
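
After editing the configuration, you can validate and apply it (assuming a systemd-based installation):

# Check the configuration for syntax errors, then reload nginx
sudo nginx -t
sudo systemctl reload nginx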

This will allow storage providers to download files at http://<your_site_ip>/<piece_cid>.car

To improve download speed for your storage providers, we recommend signing up with Cloudflare, which protects your service and improves throughput and latency. We also recommend that your storage providers use multithreaded download software such as aria2 or axel; an example follows.
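
As an illustration, an SP could fetch a hosted CAR file over multiple connections with aria2 (the host and piece CID are placeholders):

# Download one CAR file using 16 parallel connections
aria2c -x 16 -s 16 -o bagaPieceCid.car http://<your_site_ip>/bagaPieceCid.car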

Finding SPs to work with

You are free to partner with specific SPs to onboard your prepared data on-chain. This may help prioritize your CAR files and ensure you can build the maximum number of replicas in the allotted time. Options for identifying SPs to work with include:

  • advertise available CAR files to SPs in #slingshot or #fil-deal-market
  • use market-making tools like https://www.bigd.exchange/

Participating as a Storage Provider (SP)

Storage Providers interested in serving deals for prepared Slingshot data should sign up to participate here.

Details on SP participation requirements can be found in the Program Details.

Storage Requirements

  • Participating SPs commit to serving fast retrievals for this data throughout the duration of the deal, for 0 FIL. SPs with retrieval success rates below 95% may be temporarily suspended from participating in the program
  • Suspended SPs will be re-activated through successfully serving ongoing retrieval checks

How it works

The process for participating in deal making with the Slingshot Deal Engine is the same as that for the Evergreen program. The engine is the same, but it hosts Slingshot as a separate tenant, so you still need to register and be approved to receive deals from Slingshot.

  1. Ensure your minerID gets on the list of eligible SPs by going through the application process.
  2. Use the authenticator tool to validate that your requests are coming from the right SP ID. You will need access to your current SP worker key (the same one you use for ProveCommits). This is required in order to use the API.
    • Download the authenticator using curl -OL https://raw.githubusercontent.com/filecoin-project/evergreen-dealer/master/misc/fil-spid.bash
    • Run chmod 755 fil-spid.bash
    • Run curl -sLH "Authorization: $( ./fil-spid.bash f0XXXX )" https://api.evergreen.filecoin.io/pending_proposals
  3. Use the deal engine to examine the list of CIDs you can store
    • Get a list of all pieces eligible for storage using curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere
  4. Get the piece(s) in order to be able to store them. You can do this in two ways:
    • Coordinate directly with the DP that created the CAR files. DPs can host the files for you to obtain or find an alternative way of transferring them to you. You can coordinate with DPs directly or in the #slingshot channel in Filecoin Slack.
    • Retrieve it from SPs currently hosting the data. For each piece CID, there may be one or more SPs that currently have it in a deal; make sure to search through the table/API results before attempting a retrieval (ideally from an SP geographically close to you).
  5. Once you are ready to request deal proposals for the pieces you would like to store,
    • For each deal, curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/request_piece/bagaChosenPieceCid
    • Note that from the moment of invoking this method your SP system will receive a deal proposal within ~5 minutes with a deal-start-time about 3 days (~72 hours) in the future.
    • These will be verified deals with DataCap
    • Deals will be made for maximum practical duration (~532 days)
  6. You can view the set of outstanding deals against your SP at any time by invoking curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/pending_proposals
    • Note that in order to prevent abuse you can have at most 10TiB (320 x 32GiB sectors) outstanding against your SP at any time.
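
Putting the API calls above together, a minimal sketch of the flow (assuming your miner ID is f0XXXX, fil-spid.bash is in the current directory, and bagaChosenPieceCid is a placeholder for a piece CID chosen from the eligible list):

SP_ID=f0XXXX
# List pieces eligible for storage
curl -sLH "Authorization: $( ./fil-spid.bash $SP_ID )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere
# Request a deal proposal for a chosen piece
curl -sLH "Authorization: $( ./fil-spid.bash $SP_ID )" https://api.evergreen.filecoin.io/request_piece/bagaChosenPieceCid
# Check proposals currently outstanding against your SP
curl -sLH "Authorization: $( ./fil-spid.bash $SP_ID )" https://api.evergreen.filecoin.io/pending_proposals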