Expected speed for batch uploads. Advice on better approaches for large numbers of files.
Hi - I've got a few questions about the speed of the API to help me dial in my expectations. I'm seeing some pretty slow performance (15 mins to upload 96 ~200KB files from an AWS Lambda instance with an eager transform), but I'm not 100% sure yet whether it's Cloudinary or my misuse of Guzzle promises that's causing the slowdown (it's all happening via a job queue on AWS Lambda, so debugging is a bit of a pain).
Out of curiosity, could you let me know the expected processing & response time of the upload API for an image upload with eager transform (e.g. just resizing from 4096px square to 2048px)?
Is passing S3 paths faster in any way than direct uploading? (As I mentioned, I'm on AWS Lambda, so it's all via the AWS network regardless.) When I transfer the same files directly to/from S3, they take <1s.
Would it be advisable to upload the file without an eager transform, then grab it w/transform later? I assume there might be some copying etc happening in the background that the client also ends up waiting on when doing eager transforms.
I was also thinking maybe I'm hitting some kind of soft-API limit (like only processing one image every 5 seconds or similar). Are there any limits beyond the 500 images per hour I should be aware of?
(Also, on an unrelated note, I'd like to join your startup programme, but couldn't get a response via the form. I'm building a SaaS from scratch to generate 3D models for AR e-commerce, and I'd like to fully integrate Cloudinary, especially for your USDZ/GLTF support down the line, but as you'd expect the budget's quite tight while just starting out!)
-
Hi Ben,
Do you know whether you are uploading those 96 assets in parallel or serially? If you can upload them in parallel with 3-4 threads, that might help performance.
And if you are using eager transformations, please make sure to also use `eager_async: true`. This makes sure the eager transformation is done in the background, rather than the Upload API call waiting until the derived asset is generated. Please refer to the following documentation for more information.
As you said, uploading from AWS Lambda should be fast since it stays within AWS infrastructure, but passing S3 links directly means one less hop and one less point of failure. So either way, a direct S3 link should be faster.
Regarding uploading the file without eager transformations, I think `eager_async: true` should address this issue for you.
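As a rough illustration of the `eager_async` suggestion above, the options for one upload might be built like this. It is only a sketch: `buildUploadOptions` and the `textures/example` public ID are hypothetical names, and the 2048px scale is taken from the resize mentioned in the question.

```php
<?php
// Hypothetical helper (not part of the Cloudinary SDK): builds the options
// array for a single upload with one eager derivative.
function buildUploadOptions(string $publicId, int $size = 2048): array
{
    return [
        'public_id'   => $publicId,
        // One eager derivative: scale the original down to $size x $size.
        'eager'       => [['width' => $size, 'height' => $size, 'crop' => 'scale']],
        // Ask for the derivative to be generated in the background, so the
        // upload call does not wait for it.
        'eager_async' => true,
    ];
}

$options = buildUploadOptions('textures/example');
// Assumed usage with the SDK client shown elsewhere in this thread:
// $result = $cloudinary->uploadApi()->upload($localPath, $options);
```

With `eager_async` set, the derived image may not exist yet when the upload response arrives, so any code that needs it immediately would have to allow for that.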
Regarding a soft limit on uploads: we do not throttle or rate-limit uploads, although there is a concurrency limit in place to prevent abuse.
If you can share examples where the uploads are slow, I can take a look from my end. I would recommend opening a support ticket with the details.
Regarding the startup program, do you have a ticket number associated with that request, so I can take a look and see where it stands?
Thanks,
Erwin Lukas
-
Hi Erwin, thanks for the quick & detailed response.
I'm on Laravel/PHP (unfortunately not my strongest language!) so the async answer has some qualifiers. I am attempting to make it async by using Guzzle Promises with unwrap() called at the end, but I'm not entirely sure it will matter if this is the only process running on a Lambda instance. Here's some stripped-down code of the section in question:

    use GuzzleHttp\Promise\Utils;

    // $localPath and $options are set per upload in code stripped from this snippet.
    $uploadPromises = [];
    $uploads->each(function ($upload) use (&$uploadPromises) {
        // uploadAsync() returns a promise; nothing is awaited here.
        $upload->uploadPromise = $this->cloudinary->uploadApi()->uploadAsync(Storage::disk('local')->path($localPath), $options);
        $uploadPromises[$upload->texturePath] = $upload->uploadPromise;
    });
    // Wait for every promise; throws if any of them is rejected.
    $results = Utils::unwrap($uploadPromises);
    foreach ($uploads as $upload) {
        $upload->setApiResponse($results[$upload->texturePath]);
    }

Including a logging step via ->then() on the promise directly shows them executing in order, at a steady rate of 2-3 seconds between each, even when the image in question is a <1KB JPG.
Hearing that this isn't an API throttle is enough for me to start solving it though - I'll swap to S3 paths, try with/without Guzzle promises, and try eager transforms vs building a URL later, until I find the version that seems fastest.
I didn't receive a ticket or any response at all after filling in the form, so that suggests it didn't go through! I've resent and have the ticket #148429.
Thanks Again,
Ben
-
I've switched to using Guzzle Pools and async eager transforms and tested further, and now have concurrency on download (which verifies there isn't a problem with my curl config, which afaik would manifest as never having parallel requests). However, I'm still struggling to get concurrency on upload.
Here's the current pseudo-code:

    use GuzzleHttp\Pool;

    // $localPath and $options are set per upload in code omitted from this pseudo-code.
    $callables = [];
    $uploads->each(function ($upload) use (&$callables) {
        // Each callable starts one upload and returns its promise when the pool invokes it.
        $callable = function () use ($localPath, $options) {
            Log::debug('starting async call for: ' . $options['public_id']);
            return $this->cloudinary->uploadApi()->uploadAsync($localPath, $options);
        };
        $callables[] = $callable;
    });
    $pool = new Pool($this->client, $callables, [
        'concurrency' => 20,
        'fulfilled' => function ($response, $index) {
            \Log::debug('completed async call for: ' . $response['public_id']);
        },
        'rejected' => function (ApiError $reason, $index) {},
    ]);
    \Log::debug('Pool created. Initiating batch upload');
    $pool->promise()->wait();
    \Log::debug('All cloudinary uploads completed');

The pool seems to start the intended number of items (e.g. if I set concurrency to 3, I receive 3x 'starting async call for...' logs in a row before any 'completed async' logs), but the timing still reflects them happening in serial order, which is confirmed by the logs reporting 'completed' in the same order as the input array every time.
I think next I will try POSTing the files manually via Guzzle, since I have working concurrency using GET pools and Guzzle. It will take me some time to figure out signing and params vs options etc however.
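For the signing part of a manual POST, here is a minimal sketch of how I understand Cloudinary's documented scheme: sort the signed parameters by name, join them as `key=value` pairs with `&`, append the API secret, and take the SHA-1 hex digest. The function name `signUploadParams` is hypothetical, it only handles plain string values (not array parameters such as `eager`), and the details should be checked against the official docs before relying on it.

```php
<?php
// Hypothetical helper: computes an upload signature from scalar parameters.
// Assumes the SHA-1 scheme described in the lead-in; verify against the docs.
function signUploadParams(array $params, string $apiSecret): string
{
    // These fields are sent with the request but are not part of the signed string.
    unset($params['file'], $params['api_key'], $params['cloud_name'], $params['resource_type']);

    // Sort by parameter name so the signed string is deterministic.
    ksort($params, SORT_STRING);

    $pairs = [];
    foreach ($params as $key => $value) {
        $pairs[] = $key . '=' . $value;
    }

    // Append the secret directly after the last pair, then hash.
    return sha1(implode('&', $pairs) . $apiSecret);
}

// Example with made-up values; the real secret comes from the account settings.
$signature = signUploadParams(
    ['timestamp' => '1315060510', 'public_id' => 'sample_image', 'api_key' => '1234'],
    'abcd'
);
```

The resulting value would go in the `signature` field of the POST, alongside `api_key`, `timestamp`, and the file itself.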
Apologies in advance if this turns out to just be my own code or loose understanding of PHP async!
-
Hi Ben,
I'm afraid I'm not familiar enough with Guzzle or with promises in PHP to give much specific advice on your code above, but one thing that may affect the timing is that I think $uploads->each() is going to run sequentially, even if the action it triggers happens in the background.
If I understand correctly, you need to change the structure of this so that you have multiple threads, each of which takes one item from a shared list of uploads, sends it to the API, then gets the next upload, and so on until the list is empty. In big systems there are message queue libraries used for that, but I'm not sure if your case would justify something like Gearman ( http://gearman.org/ ).
I think in PHP 8, Parallel can do this on a smaller scale: https://github.com/krakjoe/parallel
Overall, the problem with making sequential calls is that there are overheads associated with every upload call, regardless of how small the file is or how long the data transfer takes. Some of those are on your side (DNS request, local code running, taking the file from disk and preparing it for the network request, etc.) and some are on ours (saving the file to your account's storage and backups, opening the file to check its validity and save the metadata, updating your account database and search index, running any analysis or transformations you requested, etc.).
Assuming there's no bottleneck in the network layer or the bandwidth available, you may find that the time taken to upload a single image is the same regardless of whether it's 10 KB or 1 MB, so making the calls in parallel should definitely help with that. I'm just not 100% sure how to modify your existing code to use that method.
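To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every call costs the same fixed time regardless of file size and that calls in the same round overlap perfectly; the 2.5 seconds per call is just the midpoint of the 2-3 seconds reported earlier in the thread, and `estimateBatchSeconds` is a made-up helper name.

```php
<?php
// Hypothetical estimate: total wall-clock time for a batch when each call has
// a fixed cost and up to $concurrency calls run at once.
function estimateBatchSeconds(int $files, float $secondsPerCall, int $concurrency): float
{
    if ($files < 0 || $secondsPerCall < 0 || $concurrency < 1) {
        throw new InvalidArgumentException('files and secondsPerCall must be >= 0, concurrency >= 1');
    }
    // Number of "rounds" needed if $concurrency uploads finish together each round.
    $rounds = (int) ceil($files / $concurrency);
    return $rounds * $secondsPerCall;
}

$serial   = estimateBatchSeconds(96, 2.5, 1); // 240.0 seconds (4 minutes)
$parallel = estimateBatchSeconds(96, 2.5, 4); // 60.0 seconds
```

Real uploads will not line up this neatly, but it shows why the per-call overhead, rather than the file size, dominates a serial batch.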
Regards,
Stephen