Storing files#
There are two types of storage on the cloud hub:

- user directories, under `/home`, for notebooks, scripts, and very small data files
- shared cloud buckets: storage on our S3 buckets associated with the hub
User Directories#
When you open the hub, you can browse this storage like a normal file browser. It behaves as a standard UNIX file system: from a terminal on the hub, you can navigate directories and move files around just as on a local machine.
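For example, a short illustrative terminal session (the file and folder names here are made up):

```shell
mkdir -p analysis        # create a project folder
touch notes.txt          # an empty example file
mv notes.txt analysis/   # move it into the folder
ls analysis              # list the folder's contents
```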
Cloud buckets#
For medium-sized data used as part of ongoing projects, two S3 buckets can be used:

- The scratch bucket, at `s3://reflective-scratch-prod/<username>` (also saved in the convenience variable `$SCRATCH_BUCKET`)
- The persistent bucket, at `s3://reflective-persistent-prod/<username>` (also saved in the convenience variable `$PERSISTENT_BUCKET`)
Data saved in the scratch bucket is deleted every 7 days, so the scratch bucket should be used only for intermediate data produced temporarily during analysis, or as a staging location.
Data produced as part of analyses and in active use, or data which you need to upload for use in a project, should typically be stored in the persistent bucket.
Please only ever write to the folder under your own username on the persistent bucket, unless agreed beforehand with Reflective (as will be the case for some community datasets which we want to store in a more visible location).
The environment variables `SCRATCH_BUCKET` and `PERSISTENT_BUCKET` come preloaded with your username, e.g. `s3://reflective-persistent-prod/alistairduffey`. This is deliberate: it tracks file ownership and prevents overwriting of other users' data. Using these environment variables rather than hard-coded file paths is therefore preferable for safety!
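For instance, a small helper like the one below (a hypothetical convenience, not part of the hub setup) builds full object paths from the environment variable, so your username never needs to be hard coded:

```python
import os

def bucket_path(filename, bucket=None):
    """Join a filename onto a per-user bucket prefix.

    Defaults to the value of $SCRATCH_BUCKET; pass the value of
    $PERSISTENT_BUCKET (or any other prefix) explicitly to override.
    """
    bucket = bucket or os.environ["SCRATCH_BUCKET"]
    return f"{bucket.rstrip('/')}/{filename}"

# e.g. bucket_path("results.nc", "s3://reflective-scratch-prod/alistairduffey")
# -> "s3://reflective-scratch-prod/alistairduffey/results.nc"
```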
Do not ever put sensitive data (e.g. passwords) on the hub!#
Data under user directories can be accessed by hub admins.
Data on S3 buckets is freely accessible by any user of the hub, even if under your username.
Uploading data to the hub#
- for small-to-medium files, you can upload via the JupyterLab interface (the GUI), or run wget scripts on the hub to download from the web, and then move files to buckets using `aws s3 mv <source> <destination>` or via the Python commands below.
- for larger datasets (>10s of GB), get in touch with our team and we will advise on whether we can store the data and, if we can, share guidance for uploading to the buckets via the command line.
Example - writing a netcdf file to the scratch bucket#
The code below shows an example of writing a netcdf file from the hub. It uses two steps because netcdf files (unlike zarr) can't be written directly onto an S3 bucket.
For more examples and information, see 2i2c docs here, and the NASA Earthdata Cloud Cookbook and CryoCloud docs.
```python
import os
import tempfile

import s3fs

# Assumes `ds` is an existing xarray Dataset
s3 = s3fs.S3FileSystem()
scratch = os.environ['SCRATCH_BUCKET']
out_path_on_scratch = f"{scratch}/test_loc.nc"  # Where we want to store it

# Create a temporary intermediate netcdf in the user directory, then move it to the bucket
with tempfile.NamedTemporaryFile(suffix=".nc") as tmp:
    ds.to_netcdf(tmp.name)  # save to a temporary file
    s3.put(tmp.name, out_path_on_scratch)  # copy that file to the scratch bucket
```