Expected speed for batch uploads. Advice on better approaches for large numbers of files.

Comments

4 comments

  • Erwin Lukas

    Hi Ben,

    Do you know if you are uploading those 96 assets in parallel or serially? Uploading them in parallel with 3-4 threads might help performance.

    And if you are using eager transformations, please make sure to also use `eager_async: true`. This ensures the eager transformation is done in the background rather than making your Upload API call wait until the derived asset is generated. Please refer to our documentation for more information.
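
    For example, in the PHP SDK the upload options would look something along these lines (just a sketch - the eager transformation here is a placeholder, use whichever derived version you actually need):

    use Cloudinary\Api\Upload\UploadApi;

    (new UploadApi())->upload('sample.jpg', [
        'eager'       => [['width' => 400, 'height' => 400, 'crop' => 'fill']],
        'eager_async' => true, // generate the derived asset in the background
    ]);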

    As you said, AWS Lambda should be fast since it's within AWS infrastructure, but passing S3 links directly means fewer hops and fewer points of failure, so a direct S3 link should be faster regardless.
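
    For reference, once the bucket has been whitelisted for your cloud, the S3 URL can be passed straight into the upload call. A sketch, with a made-up bucket and key:

    (new UploadApi())->upload('s3://my-bucket/textures/example.jpg', $options);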

    Regarding uploading the file without eager transformations, I think `eager_async: true` should address this issue for you.

    Regarding a soft limit on uploads, we are not throttling or rate-limiting uploads, although there is a concurrency limit in place to prevent abuse.

    If you can share examples where you see slow uploads, I can take a look from my end. I would recommend opening a support ticket with the details.

    Regarding the startup program, do you have a ticket # associated with that request so I can take a look and see where it stands?

    Thanks,
    Erwin Lukas

  • Ben Ferns

    Hi Erwin, thanks for the quick & detailed response.

    I'm on Laravel/PHP (unfortunately not my strongest language!) so the async answer comes with some qualifiers. I'm attempting to make the uploads async using Guzzle promises, with unwrap() called at the end, but I'm not entirely sure that matters if this is the only process running on a Lambda instance. Here's some stripped-down code of the section in question:

    $uploadPromises = [];
    $uploads->each(function ($upload) use (&$uploadPromises) {
        // $localPath and $options are derived from $upload in the full code
        $upload->uploadPromise = $this->cloudinary->uploadApi()
            ->uploadAsync(Storage::disk('local')->path($localPath), $options);
        $uploadPromises[$upload->texturePath] = $upload->uploadPromise;
    });

    // Wait for every promise to resolve (throws if any upload fails)
    $results = Utils::unwrap($uploadPromises);

    foreach ($uploads as $upload) {
        $upload->setApiResponse($results[$upload->texturePath]);
    }

    Including a logging step via ->then() on the promise directly shows them executing in order, at a steady rate of 2-3 seconds between each, even when the image in question is a <1 KB JPG.
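
    (The logging step is just something like this, attached when each promise is created:)

    $upload->uploadPromise->then(function ($response) use ($upload) {
        Log::debug('finished upload for: ' . $upload->texturePath);
    });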

    Hearing that this isn't an API throttle is enough for me to start solving it though - I'll swap to S3 paths, try with/without Guzzle promises, and try with/without eager transforms vs. building a URL later, until I find the version that seems fastest.

    I didn't receive a ticket or any response at all after filling in the form, which suggests it didn't go through! I've resent it and have ticket #148429.

    Thanks Again,

    Ben

  • Ben Ferns

    I've switched to using Guzzle pools with async eager transforms and tested further, and I now have concurrency on download (which verifies there isn't a problem with my curl config - afaik a problem there would manifest as never having parallel requests). However, I'm still struggling to get concurrency on upload.

    Here's the current pseudo-code:

    $callables = [];
    $uploads->each(function ($upload) use (&$callables) {
        // $file and $options are derived from $upload in the full code
        $callables[] = function () use ($file, $options) {
            Log::debug('starting async call for: ' . $options['public_id']);
            return $this->cloudinary->uploadApi()->uploadAsync($file, $options);
        };
    });

    $pool = new Pool($this->client, $callables, [
        'concurrency' => 20,
        'fulfilled' => function ($response, $index) {
            Log::debug('completed async call for: ' . $response['public_id']);
        },
        'rejected' => function (ApiError $reason, $index) {},
    ]);

    Log::debug('Pool created. Initiating batch upload');
    $pool->promise()->wait();
    Log::debug('All cloudinary uploads completed');

    The pool seems to start the intended number of items (e.g. if I set concurrency to 3, I receive 3x 'starting async call for...' logs in a row before any 'completed' logs), but the timing still reflects them happening in serial, which is confirmed by the 'completed' logs arriving in the same order as the input array every time.

    I think next I will try POSTing the files manually via Guzzle, since I have working concurrency using GET pools with Guzzle. It will take me some time to figure out signing and params vs. options etc., however.
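
    For anyone following along, my rough plan is something like the sketch below (untested - the signing follows Cloudinary's documented scheme of SHA-1 over the alphabetically sorted params with the API secret appended, and $upload->path / $upload->publicId are stand-ins for my own fields):

    use GuzzleHttp\Client;
    use GuzzleHttp\Pool;

    // $cloudName / $apiKey / $apiSecret come from my Cloudinary config
    $client = new Client(['base_uri' => "https://api.cloudinary.com/v1_1/{$cloudName}/"]);

    // Sign params Cloudinary-style: sha1("k1=v1&k2=v2..." . $apiSecret), sorted keys;
    // api_key is added after signing since it isn't part of the signature
    $sign = function (array $params) use ($apiKey, $apiSecret) {
        $params['timestamp'] = time();
        ksort($params);
        $pairs = [];
        foreach ($params as $k => $v) {
            $pairs[] = $k . '=' . $v;
        }
        $params['signature'] = sha1(implode('&', $pairs) . $apiSecret);
        $params['api_key'] = $apiKey;
        return $params;
    };

    // Generator of callables so the Pool controls how many requests are in flight
    $requests = function () use ($client, $sign, $uploads) {
        foreach ($uploads as $upload) {
            yield function () use ($client, $sign, $upload) {
                $multipart = [['name' => 'file', 'contents' => fopen($upload->path, 'r')]];
                foreach ($sign(['public_id' => $upload->publicId]) as $k => $v) {
                    $multipart[] = ['name' => $k, 'contents' => (string) $v];
                }
                return $client->postAsync('image/upload', ['multipart' => $multipart]);
            };
        }
    };

    $pool = new Pool($client, $requests(), ['concurrency' => 10]);
    $pool->promise()->wait();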

    Apologies in advance if this turns out to just be my own code or loose understanding of PHP async!

  • Stephen Doyle

    Hi Ben,

    I'm afraid I'm not familiar enough with Guzzle or with using promises in PHP to give much specific advice on your code above, but one thing I see that may affect the timing is that I think $uploads->each() is going to run sequentially, even if the action taken after it happens in the background.

    If I understand correctly, you need to change the structure so that you have multiple threads, each of which takes one item from a shared list of uploads, sends it to the API, then takes the next upload, and so on until the list is empty. In big systems, message queue libraries are used for that, but I'm not sure your case would justify something like Gearman (http://gearman.org/).
    I think in PHP 8 the parallel extension can do this on a smaller scale: https://github.com/krakjoe/parallel
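
    As a very rough sketch of that worker pattern (untested, and assuming the parallel extension is installed and each worker can build its own Cloudinary client from your CLOUDINARY_URL), it might look something like:

    use parallel\{Runtime, Channel};

    $queue = Channel::make('uploads', Channel::Infinite);

    // Each worker pulls jobs from the shared channel until it receives null
    $worker = function (Channel $queue) {
        $cloudinary = new \Cloudinary\Cloudinary(getenv('CLOUDINARY_URL'));
        while (($job = $queue->recv()) !== null) {
            [$path, $options] = $job;
            $cloudinary->uploadApi()->upload($path, $options);
        }
    };

    $futures = [];
    for ($i = 0; $i < 4; $i++) {
        $runtime = new Runtime('vendor/autoload.php'); // bootstrap Composer in each thread
        $futures[] = $runtime->run($worker, [$queue]);
    }

    foreach ($uploads as $upload) {
        $queue->send([$upload->path, $upload->options]); // hypothetical fields
    }
    for ($i = 0; $i < 4; $i++) {
        $queue->send(null); // one stop signal per worker
    }
    foreach ($futures as $future) {
        $future->value(); // block until every worker has drained the queue
    }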

    Overall, the problem with making sequential calls is that there are overheads associated with every upload call, regardless of how small the file is or how long the data transfer takes. Some of those are on your side (DNS lookup, local code running, taking the file from disk and preparing it for the network request, etc.) and some are on ours (saving the file to your account's storage and backups, opening the file to check its validity and save the metadata, updating your account's database and search index, running any analysis or transformations you requested, etc.).

    Assuming there's no bottleneck in the network layer / available bandwidth, you may find that the time taken to upload a single image is the same regardless of whether it's 10 KB or 1 MB, so making the calls in parallel should definitely help with that - I'm just not 100% sure how to modify your existing code to use that method.
    Regards,

    Stephen

