Slingshot v3 aims to engage individuals and teams of Data Preparers (DPs) who own the process of downloading datasets, processing them, and sharing generated CAR files with eligible SPs for deal making. If you participated in past versions of Slingshot and want to learn more about this iteration, start with the Program Details.
In order to participate as a DP:
Once you're signed up, you should be able to browse available datasets on the website. Pick up to 3 datasets that you are interested in preparing for Slingshot SPs to store. The dataset metadata shared should include the region in which the dataset is currently hosted, its size, how many files it has, and how you can obtain it (e.g., an AWS bucket). Use these to choose datasets that you can obtain and process most efficiently.
Note that once you claim a particular dataset, no other DP can claim it. If you release your claimed dataset, you will not be able to claim it again unless an admin overrides this. You can request an admin to override this by creating an issue.
Each dataset is associated with a slug name, e.g. coco-common-objects-in-context-fastai-datasets. You can check the slug name of your claimed datasets under My Claimed Datasets. You'll see how this is used below during dataset preparation.
As you go through the preparation process for each dataset, please remember to update its Preparation Progress on the website. The Slingshot team will be keeping track of progress across datasets and may remove you from a claimed dataset if there is no update on it for more than 4 weeks. The preparation states tracked on the website are:
The tool used for dataset preparation is singularity. It helps properly split the dataset (or the files within it), construct the IPLD structure for the CAR files, and generate the corresponding manifest file, which needs to be uploaded later.
There are two ways to use the tool, standalone or as a daemon:
# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
nvm install 16
# Install Singularity
npm i -g @techgreedy/singularity
# Use standalone version
singularity-prepare -h
# Use daemon version
singularity init
singularity daemon
singularity prep -h
Each job will take 1-2 CPU cores and 50-100 MB/s of disk I/O. To control parallelism:
- Standalone version: use the -j flag.
- Daemon version: set deal_preparation_worker.num_workers in the config file (~/.singularity/default.toml).

RAM usage is negligible; however, if using the daemon version, the built-in database engine may take up to 80% of your system memory. To change this behavior, you'll need to point the daemon service to your own database engine.
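If you use the daemon, you can double-check the current worker setting before starting it. A minimal sketch, assuming the setting lives under a deal_preparation_worker section of the TOML file:
# Show the worker-related settings in the daemon config
grep -n -A 3 'deal_preparation_worker' ~/.singularity/default.toml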
Some datasets include lots of small files, in which case you may need to increase the Linux open file limit, e.g. ulimit -Hn 100000.
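For example, to inspect the limits and raise the soft limit for the current shell (the value is illustrative):
# Check the current soft and hard open-file limits
ulimit -Sn
ulimit -Hn
# Raise the soft limit for this session (cannot exceed the hard limit for non-root users)
ulimit -Sn 100000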
You can use geesefs or goofys to mount a remote S3 bucket to a local filesystem. geesefs is slightly slower when scanning the directory but is more stable when dealing with a large number of directory entries.
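As a sketch, a goofys mount looks similar to the geesefs commands shown later; the bucket name and mount point below are placeholders, and in practice you would mount into the dataset folder structure described further down (goofys also expects AWS credentials to be configured):
# Mount an S3 bucket with goofys
mkdir -p /mnt/dataset-example
goofys --endpoint https://s3.amazonaws.com <bucket-name> /mnt/dataset-example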
Mounting the bucket this way allows you to prepare the dataset without downloading it beforehand; however, throughput is limited by your Internet connection. This is the recommended way to prepare datasets larger than 10 TiB, and below are a few tricks to get the best experience and performance.
- If a generation fails with a retryable error (e.g. an i/o error from the network mount), retry it with singularity prep retry.
- Use the --tmp-dir flag when creating the dataset preparation request; this will download the files to a temporary path first. This will boost the speed by 2x if the local drive is significantly faster than your Internet connection. The flag is unnecessary if the dataset is already downloaded.
- Set geesefs --memory-limit to a higher amount to allow more readahead for all workers, otherwise you'll see a cannot allocate memory error.

To ensure the prepared data is consumable by future applications such as compute over data, the data preparer needs to follow the practices below to ensure data model consistency:
- If you download the dataset rather than mount it, use aws s3 sync --no-sign-request --delete s3://<bucket> <destination>. This is because aws s3 creates temporary files like data.gz.6CCeb4D4 during download and you don't want to include them in the dataset. It may also skip failed downloads silently.
- Keep the local folder layout consistent with the source buckets. For example, assume the dataset comes from s3://dataset-example-a and s3://dataset-example-b/prefix-b. You would like to create the folder structure as below:

mkdir -p slug-name/dataset-example-a
# Mount the S3 bucket to the folder of same name under root
geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com dataset-example-a slug-name/dataset-example-a
mkdir -p slug-name/dataset-example-b/prefix-b
# Similar as above but with S3 bucket that comes with a prefix
geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com dataset-example-b:prefix-b slug-name/dataset-example-b/prefix-b
# Create a preparation request using <tmp-dir> as the temporary folder, <slug-name> as displayed in your Slingshot V3 portal, the root of the dataset, and <out-dir> for saving CAR files
singularity prep create -t <tmp-dir> <slug-name> ./slug-name <out-dir>
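To sanity-check that the mounts were set up correctly (an optional step; paths follow the example above):
# List a few entries from each mount to confirm they resolve
ls slug-name/dataset-example-a | head
ls slug-name/dataset-example-b/prefix-b | head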
If you already have the dataset downloaded, you'll need to manually create a directory and symlinks to construct the same directory hierarchy:
mkdir -p ./slug-name
# Assume the bucket has been downloaded to ./downloaded/dataset_a; link it to a folder with the same name as the S3 bucket under slug-name
ln -s ./downloaded/dataset_a ./slug-name/dataset-example-a
mkdir -p ./slug-name/dataset-example-b
# Similar to above, but for the bucket with a prefix
ln -s ./downloaded/dataset_b/prefix-b ./slug-name/dataset-example-b/prefix-b
# Note that <tmp-dir> is not needed for an already-downloaded dataset
singularity prep create <slug-name> ./slug-name <out-dir>
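You can verify that the constructed hierarchy resolves to the downloaded data before kicking off the preparation (an optional check; paths follow the example above):
# Confirm the symlinks point at the downloaded data
ls -l ./slug-name
ls ./slug-name/dataset-example-a | head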
A typical flow to prepare a dataset with a single S3 bucket is shown below:
# Use daemon version for better management
singularity init
# Edit ~/.singularity/default.toml and set deal_preparation_worker.num_workers to >=64
singularity daemon
# Mount dataset to local file system to avoid downloading them beforehand
mkdir -p ~/slug-name/s3-bucket-name
geesefs --memory-limit 8000 --endpoint https://s3.amazonaws.com s3-bucket-name ~/slug-name/s3-bucket-name
# Prepare a temporary directory for downloading the dataset; it won't consume more than 32G x num_workers
mkdir -p ~/tmp_dir
# Prepare some directory to store the CAR files
mkdir -p ~/car_dir
# Make a preparation request. Note we are using the root folder of the dataset (~/slug-name)
singularity prep create -t ~/tmp_dir slug-name ~/slug-name ~/car_dir
# Check the status of preparation request. The daemon will start scanning the folder and generating CAR files at the same time
# scanningStatus - indicates whether the initial folder scan has completed. This may take minutes to days depending on the complexity of the dataset
# generationTotal - the total number of CAR files that will be generated
# generationActive - the number of CAR files waiting to be generated
# generationPaused - the number of generation requests paused using singularity prep pause
# generationCompleted - the number of CAR files that have been generated
# generationError - the number of generation requests that encountered an error
singularity prep list
# To check the error message of failed generations
singularity prep status slug-name
# If the error seems retryable, such as i/o error caused by network mount, you may retry using
singularity prep retry -h
# If you mess up and would like to start from scratch
singularity prep remove --purge slug-name
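Once generation completes, the CAR files land in the output directory you passed to prep create (~/car_dir in this flow), named by piece CID as referenced in the hosting section below. A quick look:
# Inspect the generated CAR files
ls -lh ~/car_dir | head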
Check the singularity repo for known common issues.
If you encounter any other issues using the tools or need help troubleshooting, feel free to ask in the #large-clients-tooling Filecoin Slack channel or report a bug.
After the dataset has been prepared, the manifest files need to be uploaded to Web3.Storage.
# Daemon version
export WEB3_STORAGE_TOKEN=eyJ...
./upload-manifest-daemon.sh slug-name slug-name
# Standalone version
export WEB3_STORAGE_TOKEN=eyJ...
./upload-manifest-standalone.sh out-dir slug-name
In order for your Slingshot data preparation to be eligible for rewards, corresponding deals must be made by the Slingshot Deal Engine with participating SPs. SPs who are able to obtain your CAR files can initiate dealmaking directly with the Slingshot Deal Engine. Your prepared CAR files can be obtained by SPs in two ways:
For at least the first few copies, DPs are recommended to:
DPs choosing to host CAR files for SPs to download have several options. One simple path is to host a basic HTTP server:
sudo apt install nginx
Modify /etc/nginx/sites-available/default and add the lines below:
server {
...
location / {
root /home/user/car_dir;
}
...
}
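After saving the configuration, test it and reload nginx so the change takes effect:
# Validate the nginx configuration and reload the service
sudo nginx -t
sudo systemctl reload nginx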
Storage providers will then be able to download files at URLs of the form
http://<your-site-ip>/<piece_cid>.car
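A quick way to verify the server is reachable (the piece CID below is a placeholder for one of your generated CAR files):
# Issue a HEAD request to confirm the CAR is being served
curl -I http://<your-site-ip>/<piece_cid>.car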
To improve download speed for your storage providers, we recommend signing up with Cloudflare, which protects your service as well as improving throughput and latency. We also recommend that your storage providers use multithreaded download software such as aria2 or axel.
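For example, an SP can pull a CAR over multiple connections with aria2 (URL and piece CID are placeholders):
# Download with up to 8 connections to the server
aria2c -x 8 -s 8 "http://<your-site-ip>/<piece_cid>.car"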
You are free to partner with specific SPs to onboard your prepared data on-chain. This may help in prioritizing your CAR files to ensure you can build the maximum number of replicas in the allotted time. Options for identifying SPs to work with include:
Storage Providers interested in serving deals for prepared Slingshot data should sign up to participate here.
Details on SP participation requirements can be found in the Program Details.
The process for participating in deal making from the Slingshot Deal Engine is the same as that for the Evergreen program. The engine is the same, but is hosting Slingshot as a separate tenant, so you still need to register and be approved for deals from Slingshot.
# Download the authentication helper script used to sign API requests with your SP ID
curl -OL https://raw.githubusercontent.com/filecoin-project/evergreen-dealer/master/misc/fil-spid.bash
chmod 755 fil-spid.bash
# Check deal proposals currently pending for your SP (replace f0XXXX with your SP ID)
curl -sLH "Authorization: $( ./fil-spid.bash f0XXXX )" https://api.evergreen.filecoin.io/pending_proposals
# List pieces eligible for dealmaking
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere
# Request a deal proposal for a chosen piece CID
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/request_piece/bagaChosenPieceCid
# View the resulting pending proposal(s)
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/pending_proposals
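The API responses are JSON, so piping through jq makes it easier to pick a piece CID to request (a minimal sketch; assumes jq is installed):
# Pretty-print the list of eligible pieces
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere | jq .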